Administrivia

First off, be safe/healthy!
Last miles:
1. Virtual final project demo/presentation: 7 to 10-min video (DUE: Apr 13)
2. Final project peer-evaluation (DUE: Apr 14/Apr 16)
3. Final write-up/code (DUE: May 1)
4. Quiz #2 will be distributed at 9am on Apr 21 (take-home/open-ended, 24h)
5. Lab4 (DUE: Apr 20 w/ 2-week late days)

Virtual final project demo/presentation

7-10 min video (depending on #members in the team)
Upload the video to Youtube
Submit the link to us!

Quiz2

P1. TAF2: In-memory Filesystem for Tiny Allocation
P2. VM Application: Demand Paging / Lazy Loading
P3. Blocking I/O and Scheduling
P4. Designing Signal Handler Interface

Ref. Please check the review material for Quiz 2!

Grading policy (new)

Preparation: 10%
Quiz: 30%
- 20%: max(Q1, Q2)
- 10%: min(Q1, Q2)
Lab: 40%: 8% + 8% + 8% + 8% + 8% (bonus 8% for lab5)
Final project: 20%
- Demo & presentation: 5% (peer evaluation)
- Write-up/code: 15% (by TAs)

Q&A

What exactly is meant by the size of a virtual address space? For example, if you have two 64 bit computers with 16GB of physical RAM but one has a 2^40 address space whereas the other has 2^64 address space what’s different because won’t the size of each memory address in either system be 8 bytes since it is a 64 bit architecture? Why don’t they always max out the virtual address space? ie for 64 bit architecture give 2^64 address space?

Q&A (Applications in AArch64)

TBI (Top-Byte Ignore) for sw-based tagged address/memory
MTE (Memory Tagging Extension) for hw-based tagged address/memory

Ref. AArch64: Memory Tagging Extension

Today: Multicore (and locking)

The increasing number of core counts per CPU instead of the clock speed
Multicore and locking: how to correctly utilize available cores?

Example: counting

DEMO: count.c

$ ./count 1 10
cpu = 1, count = 10

$ ./count 1 100
cpu = 1, count = 100

$ ./count 2 10
cpu = 2, count = 20

Example: correctness

Q: What if we increase the count?
Q: can the global count be lower than say, 100000 in below?

$ ./count 2 100000
cpu = 2, count = 110047

Example: measuring execution time

Q: Execution time reduces by half (x2 utilization)?
Q: Problem?

$ time ./count 1 1000000000
cpu = 1, count = 1000000000
./count 1 1000000000  1.48s user 0.00s system 99% cpu 1.484 total

$ time ./count 2  500000000
cpu = 2, count = 499251437
./count 2 500000000  2.13s user 0.00s system 199% cpu 1.068 total

Example: analysis in detail

$ sudo perf stat -d ./count 1 1000000000
cpu = 1, count = 1000000000

 Performance counter stats for './count 1 1000000000':

          1,451.93 msec task-clock                #    0.999 CPUs utilized          
                 3      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                68      page-faults               #    0.047 K/sec                  
     5,523,257,353      cycles                    #    3.804 GHz                      (62.61%)
     4,025,305,227      instructions              #    0.73  insn per cycle           (75.21%)
     1,002,727,356      branches                  #  690.617 M/sec                    (75.21%)
            21,014      branch-misses             #    0.00% of all branches          (75.21%)
       994,765,813      L1-dcache-loads           #  685.133 M/sec                    (75.21%)
            52,550      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (75.21%)
             8,480      LLC-loads                 #    0.006 M/sec                    (49.59%)
             1,359      LLC-load-misses           #   16.03% of all LL-cache hits     (49.59%)

       1.453028001 seconds time elapsed

       1.451706000 seconds user
       0.000000000 seconds sys

Example: analysis in detail

$ sudo perf stat -d ./count 2 500000000
cpu = 2, count = 501651355

 Performance counter stats for './count 2 500000000':

          2,151.63 msec task-clock                #    1.995 CPUs utilized          
                29      context-switches          #    0.013 K/sec                  
                 3      cpu-migrations            #    0.001 K/sec                  
                70      page-faults               #    0.033 K/sec                  
     8,069,084,968      cycles                    #    3.750 GHz                      (61.02%)
     4,028,632,000      instructions              #    0.50  insn per cycle           (74.03%)
     1,004,762,321      branches                  #  466.978 M/sec                    (74.88%)
            30,428      branch-misses             #    0.00% of all branches          (75.75%)
       994,061,232      L1-dcache-loads           #  462.004 M/sec                    (75.84%)
        44,806,280      L1-dcache-load-misses     #    4.51% of all L1-dcache hits    (75.84%)
         9,355,986      LLC-loads                 #    4.348 M/sec                    (49.29%)
             6,268      LLC-load-misses           #    0.07% of all LL-cache hits     (48.41%)

       1.078321208 seconds time elapsed

       2.150210000 seconds user
       0.000000000 seconds sys

Q: How to fix this problem?

Q: plan?
Two goals:
- Correctness: no missing counts
- Performance: execution time

Attempt 1: use only one CPU

Fix its execution to the first CPU (id = 0)

$ taskset --help
Usage: taskset [options] [mask | cpu-list] [pid|cmd [args...]]

Show or change the CPU affinity of a process.

Options:
 -a, --all-tasks         operate on all the tasks (threads) for a given pid
 -p, --pid               operate on existing given pid
 -c, --cpu-list          display and specify cpus in list format
 -h, --help              display this help
 -V, --version           display version
...

Result

Q: correctness? performance?

$ taskset 1 time ./count 1 1000000000
cpu = 1, count = 1000000000
1.41user 0.00system 0:01.42elapsed 99%CPU (0avgtext+0avgdata 1740maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps

$ taskset 1 time ./count 2 500000000
cpu = 2, count = 1000000000
1.41user 0.00system 0:01.42elapsed 99%CPU (0avgtext+0avgdata 1656maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps

Attempt 2: use atomic operation

Add a lock prefix (all memory ops)

asm volatile("lock incl %0"
              : "+m"(count)
              : "m"(count)
              : "memory");

Result

Q: correctness? performance?

$ time ./count 1 1000000000
cpu = 1, count = 1000000000
4.77user 0.00system 0:04.78elapsed 99%CPU (0avgtext+0avgdata 1672maxresident)k
0inputs+0outputs (0major+79minor)pagefaults 0swaps

$ time ./count 2 500000000
cpu = 2, count = 1000000000
33.37user 0.00system 0:17.04elapsed 195%CPU (0avgtext+0avgdata 1668maxresident)k
0inputs+0outputs (0major+79minor)pagefaults 0swaps

Analysis (see IPS)

$ sudo perf stat -d ./count 1 1000000000
cpu = 1, count = 1000000000

 Performance counter stats for './count 1 1000000000':

          4,872.17 msec task-clock                #    1.000 CPUs utilized
                10      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                67      page-faults               #    0.014 K/sec
    18,205,583,156      cycles                    #    3.737 GHz                      (62.44%)
     4,008,942,183      instructions              #    0.22  insn per cycle           (74.96%)
     1,002,553,661      branches                  #  205.771 M/sec                    (74.96%)
            39,674      branch-misses             #    0.00% of all branches          (74.96%)
       999,210,530      L1-dcache-loads           #  205.085 M/sec                    (74.96%)
           128,514      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (75.11%)
            37,522      LLC-loads                 #    0.008 M/sec                    (50.08%)
             8,030      LLC-load-misses           #   21.40% of all LL-cache hits     (49.93%)

       4.873511073 seconds time elapsed

       4.868987000 seconds user
       0.000000000 seconds sys

Analysis (see IPS)

$ sudo perf stat -d ./count 2 500000000
cpu = 2, count = 1000000000

 Performance counter stats for './count 2 500000000':

         33,874.88 msec task-clock                #    1.953 CPUs utilized
                59      context-switches          #    0.002 K/sec
                 3      cpu-migrations            #    0.000 K/sec
                70      page-faults               #    0.002 K/sec
   121,053,878,992      cycles                    #    3.574 GHz                      (62.46%)
     4,025,737,317      instructions              #    0.03  insn per cycle           (74.97%)
     1,005,310,106      branches                  #   29.677 M/sec                    (74.97%)
           226,866      branch-misses             #    0.02% of all branches          (74.98%)
     1,005,112,429      L1-dcache-loads           #   29.671 M/sec                    (75.01%)
       554,644,791      L1-dcache-load-misses     #   55.18% of all L1-dcache hits    (75.05%)
           299,123      LLC-loads                 #    0.009 M/sec                    (50.02%)
            55,176      LLC-load-misses           #   18.45% of all LL-cache hits     (49.97%)

      17.348257458 seconds time elapsed

      33.839990000 seconds user
       0.000000000 seconds sys

Attempt 3: compute locally (per CPU)

Q: correctness? performance?
Q: how to improve perf even further?
Q: how to trigger a race?

   int local = 0;
   for (register int i = 0; i < cnt; i++)
     local ++;

   count += local;

Result

Try, while true; do ./count 2 10 | grep 10 ; done

$ time ./count 1 1000000000
cpu = 1, count = 1000000000
1.50user 0.00system 0:01.51elapsed 99%CPU (0avgtext+0avgdata 1728maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps

$ time ./count 2 500000000
cpu = 2, count = 1000000000
1.61user 0.00system 0:00.80elapsed 199%CPU (0avgtext+0avgdata 1644maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps

Attempt 4: using locks

    int local = 0;
    for (register int i = 0; i < cnt; i++)
      local ++;

    acquire(&lock);
    count += local;
    release(&lock)

Result

Try, while true; do ./count 2 10 | grep 10 ; done

$ time ./count 1 1000000000
cpu = 1, count = 1000000000
1.41user 0.00system 0:01.41elapsed 99%CPU (0avgtext+0avgdata 1584maxresident)k
0inputs+0outputs (0major+75minor)pagefaults 0swaps

$ time ./count 2 500000000
cpu = 2, count = 1000000000
1.44user 0.00system 0:00.72elapsed 200%CPU (0avgtext+0avgdata 1596maxresident)k
0inputs+0outputs (0major+78minor)pagefaults 0swaps

Locks

Mutual exclusion: only one core can hold a given lock
- concurrent access to the same memory location, at least one write
- example: acquire(l); x = x + 1; release(l);
Serialize critical section: hide intermediate state
- another example: transfer money from account A to B
- put(a + 100) and put(b - 100) must be both effective, or neither

Strawman: locking

    struct lock { int locked; };

    void acquire(struct lock *l) {
      for (;;) {
        if (l=>locked == 0) { // A: test
          l=>locked = 1;      // B: set
          return;
        }
      }
    }

    void release(struct lock *l) {
      l=>locked = 0;
    }

Relying on an atomic operation

Q: correctness? performance?

    struct lock { int locked; };

    void acquire(struct lock *l) {
      for (;;) {
        if (xchg(&l=>locked, 1) == 0)
          return;
      }
    }

    void release(struct lock *l) {
      xchg(&l=>locked, 0);
    }

Different aspects of performance?

Q: what about its performance with the increasing number of cpus?
DEMO: in 80-core machine (160!)

 ; in optimus:
   $ cd lock
   $ ./count-mp.py 160 1000000
 ; in host
   $ scp optimus:lock/plot.dat .
   $ plot.py

Results on the 80-core machine

Y-axis: sec, X-axis: #core

Results on the 80-core machine

Y-axis: sec, X-axis: #core

Results on the 80-core machine

Y-axis: sec, X-axis: #core

Results on the 80-core machine

Y-axis: sec, X-axis: #core

Lock in RustOS

impl<T> Mutex<T> {
    pub fn lock(&self) -> MutexGuard<T> {
        if !is_mmu_ready() { return MutexGuard { lock: &self }; }
        loop {
            if !self.locked.swap(true, Ordering::Acquire) {
                return MutexGuard { lock: &self };
            }
        }
    }
    fn unlock(&self) {
        if !is_mmu_ready() { return; }

        self.locked.store(false, Ordering::Release);
    }
}

Ref. Nomicon: Atomics, C++ Atomics: Order and consistency

Note on terminology

Non-blocking synchronization → spinlock
Blocking synchronization → mutex (sleeping locks)

Lock in AArch/Linux64

Ref. Spinlock in AArch64/Linux

Multicore in Rpi3 (armstub8.S)

_start:
    ...
    adr x5, spin_cpu0
secondary_spin:
    wfe
    ldr x4, [x5, x6, lsl #3]
    cbz x4, secondary_spin

...
.org 0xd8
.globl spin_cpu0
spin_cpu0:
    .quad 0

.org 0xe0
.globl spin_cpu1
spin_cpu1:
    .quad 0
...

Waking up application cores in RustOS

    pub unsafe fn initialize_app_cores(&self) {
        for core in 1..=3 {
            let v = pi::common::SPINNING_BASE.add(core);
            v.write_volatile(crate::init::start2 as usize);
            sev();

            while v.read_volatile() != 0 {
                spin_sleep(core::time::Duration::from_millis(200));
            }
        }
    }

Running ‘/fib’ in RustOS

Q: execution time on one core vs four cores! (in Lab5)

    pub fn initialize_scheduler(&self) {
        let mut inner = self.0.lock();
        if let None = *inner {
            // populate userspace processes as needed
            let mut scheduler = Scheduler::new();
            for pn in &["/fib", "/fib", "/fib", "/fib"] {
                scheduler.add(Process::load(pn).unwrap());
            }

            *inner = Some(scheduler);
        }
    }

Your next steps

Please check, if you are interested in:
1. Rust 9: Concurrency (Watch: Rust’s Journey to Async/Await)
2. Writing an OS in Rust: Async/Await
3. Send/Sync in Rust Book
4. Send/Sync in Nomicon
5. Understanding the Send trait

→ Lab5 (release soon!)

CS3210: Multicore and Synchronization

Administrivia

Virtual final project demo/presentation

Quiz2

Grading policy (new)

Q&A

Q&A (in x86_64)

Q&A (in AArch64)

Q&A (Applications in AArch64)

Today: Multicore (and locking)

Example: counting

Example: correctness

Example: measuring execution time

Example: analysis in detail

Example: analysis in detail

Q: How to fix this problem?

Attempt 1: use only one CPU

Result

Attempt 2: use atomic operation

Result

Analysis (see IPS)

Analysis (see IPS)

Attempt 3: compute locally (per CPU)

Result

Attempt 4: using locks

Result

Locks

Strawman: locking

Relying on an atomic operation

Different aspects of performance?

Results on the 80-core machine

Results on the 80-core machine

Results on the 80-core machine

Results on the 80-core machine

Lock in RustOS

Note on terminology

Lock in AArch/Linux64

Multicore in Rpi3 (armstub8.S)

Waking up application cores in RustOS

Running ‘/fib’ in RustOS

Your next steps

References