- Last Updated: [[2020-12-19]]
- [[2020-12-15]]
- Author: [[Kavya Joshi]]
- Recommended By:: [[Mihail Stoykov]]
<!--ID: 1631100130971-->
- Tags:: [[Go]] [[Principles of improving work performance]]
- Source: [GopherCon 2018: Kavya Joshi - The Scheduler Saga](https://www.youtube.com/watch?v=YHRO5WQGh0k)
<iframe width="560" height="315" src="https://www.youtube.com/embed/YHRO5WQGh0k" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- # Notes
- [1:19](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=79)
- The scheduler is what makes sure the Goroutines run concurrently.
- [1:29](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=89)
- Coordinates Network I/O
- Pauses and resumes goroutines
- [1:41](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=101)
- Executes runtime tasks like garbage collection
- [1:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=109)
- Best part about Go scheduler: it can do this for hundreds of thousands of Goroutines.
- [2:06](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=126)
- Go scheduler's decisions affect application performance greatly.
- [3:10](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=190)
- Goroutines are "user-space threads". They are conceptually similar to OS/kernel threads but they are entirely managed by the Go runtime.
- [3:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=203)
- Goroutines are lighter-weight and faster than kernel threads.
- [3:29](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=209)
- Initial goroutine stack: 2KB memory used
- Default thread stack: 8KB
- Faster creation, destruction, context switches.
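- To get a feel for how lightweight goroutines are, here's a minimal sketch (my own, not from the talk) that launches 100,000 of them and waits for them all - something one kernel thread per task could never sustain:
```go
package main

import (
	"fmt"
	"sync"
)

// Launch 100,000 goroutines; with a ~2KB starting stack each, this is
// a few hundred MB of stack at most, and in practice far less.
func main() {
	const n = 100_000
	var wg sync.WaitGroup
	results := make([]int, n)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = i * i // trivial work, just to have something to do
		}(i)
	}

	wg.Wait()
	fmt.Println("all", n, "goroutines finished; last result:", results[n-1])
}
```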
- [3:52](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=232)
- Without the Go scheduler, goroutines could not run at all: the OS only knows how to schedule kernel threads, not goroutines.
- [4:03](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=243)
- Go scheduler puts goroutines on kernel threads which the OS then puts on the CPU.
- [4:57](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=297)
- When a goroutine makes a blocking system call, the call blocks the underlying kernel thread too, not just the goroutine - and the scheduler has to deal with that.
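- A minimal illustration of the distinction (my own sketch; the file path is just an example): reading a regular file goes through blocking read calls, so the kernel thread carrying the goroutine is tied up for the duration, whereas network sockets are handled asynchronously by the runtime's netpoller.
```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// os.ReadFile on a regular file ends up in blocking read(2) calls:
	// while the kernel services them, both this goroutine and the
	// thread it is running on are blocked. (Network sockets are
	// handled differently, via the runtime's netpoller.)
	data, err := os.ReadFile("/etc/hostname") // path is only an example
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```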
- [6:01](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=361)
- Goals for a scheduler
- - Use a small number of kernel threads
- - Support high concurrency (lots of goroutines)
- - Be able to scale to N cores (parallelism)
- [7:44](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=464)
- Multiplexing:
- - when to create kernel threads
- - how to distribute goroutines across kernel threads
- [9:19](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=559)
- Possible solutions (and why they're not actually solutions)
- 1. Multiplex all goroutines on a single thread
- - No concurrency: if a goroutine blocks that thread, no other goroutine can run
- [9:30](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=570)
- - No parallelism: restricting to one thread on one core is not taking advantage of the machine, even if the machine can support more than one thread.
- [9:48](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=588)
- 2. Create and destroy one thread per goroutine.
- [10:01](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=601)
- - Kernel threads are expensive to create and destroy - this is additional overhead for us.
- [10:30](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=630)
- Idea 1: Reuse threads
- [10:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=649)
- Start a thread, but when a goroutine finishes on it, we don't destroy it; we just park it ("thread parking") and put it to sleep but track it so we can find it and reuse it later.
- [11:05](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=665)
- Have one runqueue; threads pop off the oldest goroutine first (FIFO)
- [12:44](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=764)
- Problem: multiple threads access the same runqueue concurrently, so it has to be protected by a lock - and that lock becomes a point of contention that serializes scheduling.
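- A toy version of this single shared FIFO runqueue, just to make the contention point concrete (the names `G` and `runqueue` are illustrative, not the runtime's real internals):
```go
package main

import (
	"fmt"
	"sync"
)

// G stands in for a runnable goroutine descriptor.
type G struct{ id int }

// runqueue is the single shared FIFO: every thread that wants work
// must take the same mutex, which is the contention problem above.
type runqueue struct {
	mu sync.Mutex
	gs []*G
}

func (q *runqueue) push(g *G) {
	q.mu.Lock()
	q.gs = append(q.gs, g)
	q.mu.Unlock()
}

// pop removes and returns the oldest goroutine (FIFO), or nil if empty.
func (q *runqueue) pop() *G {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.gs) == 0 {
		return nil
	}
	g := q.gs[0]
	q.gs = q.gs[1:]
	return g
}

func main() {
	q := &runqueue{}
	for i := 0; i < 3; i++ {
		q.push(&G{id: i})
	}
	for g := q.pop(); g != nil; g = q.pop() {
		fmt.Println("running G", g.id)
	}
}
```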
- [13:08](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=788)
- Problem 2: unbounded number of threads.
- We could end up creating thousands of threads that are never freed up.
- [13:17](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=797)
- An unbounded number of threads makes the lock contention even worse.
- [13:39](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=819)
- "Cost amortization" of thread: thread becomes less costly to run on a per goroutine basis.
- * A lot of this could also describe what k6 is doing with instances in AWS, reusing them.
- [13:52](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=832)
- Idea 2: Limit threads accessing runqueue
- [15:01](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=901)
- Need to get the limit right. Too high and we're not really solving the contention problem. Too low and we're underutilizing CPU cores.
- [15:16](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=916)
- What if we set the limit to the number of CPU cores? That way we'll still get parallelism.
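- In the real runtime this limit is `GOMAXPROCS`, which has defaulted to the number of logical CPUs since Go 1.5. A quick check:
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS is the limit on threads simultaneously running Go code.
	// Passing 0 reports the current value without changing it; since
	// Go 1.5 it defaults to the number of logical CPUs.
	fmt.Println("NumCPU:     ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0))
}
```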
- [15:58](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=958)
- When goroutine 1 blocks, the scheduler can reuse its thread to run goroutine 2 instead of leaving the core idle.
- [16:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1009)
- This isn't future-proof: as the number of cores (and therefore threads) grows, the contention problem on the single runqueue reappears.
- [18:07](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1087)
- With this scheduler, performance degrades non-linearly as the number of cores increases: it gets worse faster the more cores you add.
- [18:45](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1125)
- Idea 3: Distributed runqueues
- Have as many runqueues as there are cores
- [18:59](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1139)
- 1 runqueue per kernel thread.
- [20:26](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1226)
- But what if one thread is free and wants to do work, but it doesn't have anything in its runqueue?
- We could employ "work stealing"
- [22:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1343)
- Problem: What if the goroutine a thread is running blocks (say, in a system call)? The thread is blocked along with it, and the goroutines waiting in that thread's runqueue are stranded: the other threads are busy with their own runqueues, and under the thread-per-core limit there's no spare thread to pick the stranded work up.
- [22:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1343)
- Solution: need to have a thread handoff in addition to stealing.
- [22:37](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1357)
- Background thread to act as a monitor
- [22:59](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1379)
- A separate monitor thread has to do it because the thread in use doesn't know ahead of time when or for how long it will be blocked.
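- A simplified sketch of the handoff idea (illustrative only, not the runtime's mechanism): a monitor loop watches the workers, and when one has been sitting in a blocking call for too long, it takes that worker's queued tasks and gives them to someone else.
```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// worker models a thread with a local queue of tasks. blockedSince is
// set (as unix nanos) while the worker sits in a "blocking call".
type worker struct {
	mu           sync.Mutex
	queue        []func()
	blockedSince atomic.Int64
}

// takeAll empties the worker's queue and returns its contents.
func (w *worker) takeAll() []func() {
	w.mu.Lock()
	defer w.mu.Unlock()
	q := w.queue
	w.queue = nil
	return q
}

// monitor periodically scans workers; if one has been blocked for a
// while, its queued tasks are handed off (here, to fresh goroutines
// standing in for "start or unpark another thread").
func monitor(workers []*worker, done <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			for _, w := range workers {
				since := w.blockedSince.Load()
				if since != 0 && time.Since(time.Unix(0, since)) > 10*time.Millisecond {
					for _, t := range w.takeAll() {
						go t() // hand the stranded work to someone else
					}
				}
			}
		}
	}
}

func main() {
	w := &worker{}
	var wg sync.WaitGroup
	wg.Add(2)
	w.mu.Lock()
	w.queue = append(w.queue,
		func() { defer wg.Done(); fmt.Println("handed-off task ran") },
		func() { defer wg.Done(); fmt.Println("another handed-off task ran") })
	w.mu.Unlock()

	done := make(chan struct{})
	go monitor([]*worker{w}, done)

	// Simulate the worker entering a long blocking call with work queued.
	w.blockedSince.Store(time.Now().UnixNano())
	time.Sleep(50 * time.Millisecond) // "blocked" in a syscall

	wg.Wait()
	close(done)
}
```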
- [23:48](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1428)
- The limit for goroutine-running threads is not affected, because when a thread is blocked, that doesn't count towards the limit.
- [24:00](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1440)
- Thread handoff prevents goroutine starvation:
- goroutines will get to run
- [25:16](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1516)
- Key components of go scheduler:
- - Reuses threads
- - Limits the number of goroutine-running threads to the number of CPU cores
- [25:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1523)
- Uses distributed runqueues with stealing and handoff
- [26:22](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1582)
- Last problem: this is dependent on the Go program calling the scheduler. But what if it doesn't?
- What if a Go program is super CPU-heavy and just never makes the call to the scheduler?
- [26:52](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1612)
- So the Go scheduler implements PREEMPTION.
- - a background thread called `sysmon` detects long-running goroutines (running for more than ~10ms, with caveats) and forces them to yield to the scheduler
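- At the time of the talk this preemption was cooperative: a goroutine is only preempted at certain points (such as function calls), so a tight call-free loop can hog its thread; `runtime.Gosched()` is the explicit way to yield. A small illustration (fully asynchronous preemption landed later, in Go 1.14):
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Force a single goroutine-running thread so the effect is visible.
	runtime.GOMAXPROCS(1)

	done := make(chan struct{})
	go func() {
		fmt.Println("other goroutine got to run")
		close(done)
	}()

	for i := 0; i < 1_000_000; i++ {
		// A tight, call-free loop has no cooperative preemption points.
		// Before Go 1.14 this could starve the goroutine above;
		// Gosched() explicitly yields the thread back to the scheduler.
		if i%100_000 == 0 {
			runtime.Gosched()
		}
	}
	<-done
}
```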
- [27:37](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1657)
- The Go scheduler itself has a global runqueue in addition to the distributed per-thread runqueues; preempted goroutines are placed on it.
- So the scheduler ends up with a system-level runqueue and a system background thread of its own.
- [27:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1669)
- The global runqueue has lower priority: threads check it only occasionally, but it's still there.
- [29:18](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1758)
- Limitations of Go Scheduler
- 1. Runqueues are FIFO, not priority-based (unlike Linux)
- [29:40](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1780)
- 2. No strong preemption
- - Because preemption is weak and the global runqueue is only checked at lower priority, a misbehaving long-running goroutine can still grind the program to a halt.
- There is a recent proposal to fix this.
- [30:08](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1808)
- Go scheduler doesn't know about the system topology (doesn't know about the hardware). It's operating independently.
- There is also a proposal to address this; one idea is to use LIFO instead of FIFO runqueues so goroutines make better use of warm CPU caches.
- # PDF Notes
- {{pdf: assets/1622579177_14.pdf}}
- # Content Notes
- OS threads
- Usually 1 per core, but [[Multithreading]] allows multiple threads to be used by a single core.
- The Go Scheduler
- Enables hundreds of thousands of goroutines to run concurrently
- Traffic cop
- coordinates network I/O
- Assigns work to goroutines and goroutines to kernel threads; the OS then puts those threads onto the CPU.
- Handles goroutines that make blocking system calls, which block the kernel thread as well as the goroutine
- Janitor: engages garbage collection
- Why it's awesome
- Allows high concurrency (hundreds of thousands of goroutines can run at the same time)
- Minimizes the number of __kernel__ threads used to run goroutines
- Allows parallelism (if the machine has more than one core, it will take full advantage of that extra capability)
- Goroutine
- A "user-space thread", compared to an OS or kernel thread. The difference is that they are managed by the Go Scheduler, not the OS.
- Faster than kernel threads
- Take up less memory: a goroutine starts with a 2KB stack; a kernel thread's stack takes up 8KB
- Designing the Go Scheduler from scratch
- Failed Approach 1: Multiplex all goroutines on a single thread
- No concurrency: if a goroutine makes a blocking call, the whole thread is blocked
- No parallelism: Does not take advantage of a second core/thread if it's available
- Failed Approach 2: 1 Thread: 1 Goroutine
- We'd have to destroy a thread every time a goroutine finishes and create a new one every time a goroutine starts - and kernel threads are expensive to create and destroy.
- Idea 1: [[Thread reuse]]
- When a goroutine is finished with a thread, use "thread parking" to keep the thread alive, but idle, so it can still process work later (a user-space analogy with a goroutine worker pool is sketched after this block).
- Add a runqueue (FIFO) from which all threads take work
- "[Cost amortization]([[Amortizing performance cost]])": When threads are reused, each goroutine that gets executed by them costs less to create.
- Problem: Multiple threads would be accessing the runqueue at the same time, so it needs a lock, and that lock becomes a source of contention
- Problem: Number of threads is unlimited, and we could end up creating thousands of them that are never freed up
- k6 is doing this with load generators, reusing them instead of destroying them. In this case, the cost is time.
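- The same park-and-reuse pattern shows up one level up in ordinary Go code: a worker pool keeps a fixed set of goroutines alive and feeds them jobs over a channel instead of spawning one goroutine per job. A user-space analogy of thread reuse, not the runtime's mechanism:
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int)
	var wg sync.WaitGroup

	// Four long-lived workers; when the channel is empty they simply
	// sleep ("park") inside the receive until more work shows up.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := range jobs {
				fmt.Printf("worker %d handled job %d\n", id, j)
			}
		}(w)
	}

	for j := 0; j < 10; j++ {
		jobs <- j
	}
	close(jobs)
	wg.Wait()
}
```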
- Idea 2: External monitor
- A separate process that would be the only one to access the runqueue, avoiding the need for a lock; it could also assign the work to goroutines.
- Idea 3: Limit the number of threads that can access the runqueue [[Throttling access to resource]]
- Limit it to the number of cores - that way, if more cores are added, we'll be able to take advantage of them.
- Idea 4: [[Distributed runqueue]]s
- Have as many runqueues as there are cores: 1 runqueue per kernel thread
- The Go scheduler also has a global runqueue in addition to the per-thread runqueues.
- Idea 5: [[Work-stealing]]
- Allow threads to take work from another thread's runqueue if they're idle and the work is not yet being processed.
- Problem: If the goroutine a thread is running blocks, the thread is blocked with it; the goroutines waiting in that thread's runqueue are stranded, and the blocked goroutine itself can't be stolen because it's tied to that thread.
- Idea 6: [[Work hand-off]]
- The external monitor hands a blocked thread's stalled runqueue off to another thread, so the waiting work can keep making progress.
- Idea 7: [[Timeboxing]]
- Another background thread, `sysmon`, forces goroutines that have been running too long (more than ~10 ms) to yield
- Limitations of the Go Scheduler
- Runqueues are FIFO, not priority-based (unlike Linux)
- No strong preemption
- The scheduler's global runqueue has lower priority, so a long-running goroutine could still bring the app to a halt.
- Go doesn't know about the hardware of the machine it's operating in.