- Last Updated: [[2020-12-19]] - [[2020-12-15]]
- Author: [[Kavya Joshi]]
- Recommended By:: [[Mihail Stoykov]] <!--ID: 1631100130971-->
- Tags:: [[Go]] [[Principles of improving work performance]]
- Source: [GopherCon 2018: Kavya Joshi - The Scheduler Saga](https://www.youtube.com/watch?v=YHRO5WQGh0k)
  <iframe width="560" height="315" src="https://www.youtube.com/embed/YHRO5WQGh0k" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- # Notes
  - [1:19](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=79) - The scheduler is what makes sure goroutines run concurrently.
  - [1:29](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=89) - Coordinates network I/O
    - Pauses and resumes goroutines
  - [1:41](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=101) - Executes runtime tasks like garbage collection
  - [1:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=109) - Best part about the Go scheduler: it can do this for hundreds of thousands of goroutines.
  - [2:06](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=126) - The Go scheduler's decisions greatly affect application performance.
  - [3:10](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=190) - Goroutines are "user-space threads". They are conceptually similar to OS/kernel threads, but they are entirely managed by the Go runtime.
  - [3:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=203) - Goroutines are lighter-weight and faster than kernel threads (sketch below).
  - [3:29](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=209) - Initial goroutine stack: 2KB
    - Default thread stack: 8KB
    - Faster creation, destruction, and context switches.
  - [3:52](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=232) - Without the Go scheduler, the OS only knows how to run kernel threads; it cannot run goroutines directly.
  - [4:03](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=243) - The Go scheduler puts goroutines on kernel threads, which the OS then puts on the CPU.
  - [4:57](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=297) - A goroutine can also make a blocking system call, which blocks the kernel thread underneath it, not just the goroutine.
  - [6:01](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=361) - Goals for a scheduler
    - Use a small number of kernel threads
    - Support high concurrency (lots of goroutines)
    - Be able to scale to N cores (parallelism)
  - [7:44](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=464) - Multiplexing:
    - when to create kernel threads
    - how to distribute goroutines across kernel threads
  - [9:19](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=559) - Possible solutions (and why they're not actually solutions)
    - 1. Multiplex all goroutines on a single thread
      - No concurrency: if a goroutine blocks that thread, no other goroutine can run
  - [9:30](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=570) - No parallelism: restricting to one thread on one core does not take advantage of the machine, even if the machine has more cores.
  - [9:48](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=588) - 2. Create and destroy one thread per goroutine.
  - [10:01](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=601) - Kernel threads are expensive to create and destroy - this is additional overhead for us.
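  - A minimal sketch of why this matters (my own example, not from the talk): spawning 100,000 goroutines is routine precisely because each one starts on a small, runtime-managed stack, while the OS only ever sees the handful of kernel threads the scheduler multiplexes them onto.
    ```go
    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        fmt.Println("CPU cores:", runtime.NumCPU())
        // GOMAXPROCS(0) reads the current limit on goroutine-running threads
        // without changing it.
        fmt.Println("goroutine-running thread limit:", runtime.GOMAXPROCS(0))

        var wg sync.WaitGroup
        for i := 0; i < 100_000; i++ {
            wg.Add(1)
            go func() {
                // Each goroutine starts on a ~2KB runtime-managed stack;
                // 100,000 kernel threads would be far more expensive.
                defer wg.Done()
            }()
        }
        wg.Wait()
        fmt.Println("ran 100,000 goroutines")
    }
    ```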
  - [10:30](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=630) - Idea 1: Reuse threads
  - [10:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=649) - Start a thread, but when a goroutine finishes on it, don't destroy the thread; park it ("thread parking"), putting it to sleep while keeping track of it so it can be found and reused later.
  - [11:05](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=665) - Have one runqueue; threads pop goroutines off it in first-in, first-out (FIFO) order.
  - [12:44](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=764) - Problem: multiple threads access the same runqueue concurrently, so it has to be protected by a lock, and that lock becomes a point of contention.
  - [13:08](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=788) - Problem 2: unbounded number of threads.
    - We could end up creating thousands of threads that are never freed up.
  - [13:17](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=797) - An unbounded number of threads makes the lock contention even worse.
  - [13:39](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=819) - "Cost amortization" of a thread: the thread becomes less costly on a per-goroutine basis.
    - A lot of this could also describe what k6 is doing with instances in AWS, reusing them.
  - [13:52](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=832) - Idea 2: Limit the number of threads accessing the runqueue
  - [15:01](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=901) - Need to get the limit right. Too high and we're not really solving the contention problem; too low and we're underutilizing CPU cores.
  - [15:16](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=916) - What if we set the limit to the number of CPU cores? That way we still get parallelism.
  - [15:58](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=958) - The scheduler can reuse threads: while goroutine 1 is blocked, goroutine 2 gets to run on the thread.
  - [16:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1009) - Future-proofing concern: as core counts increase, the contention problem will reappear.
  - [18:07](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1087) - With this scheduler, the performance decrease as the number of cores increases is nonlinear (it gets worse faster as the core count goes up).
  - [18:45](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1125) - Idea 3: Distributed runqueues
    - Have as many runqueues as there are cores
  - [18:59](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1139) - 1 runqueue per kernel thread.
  - [20:26](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1226) - But what if one thread is free and wants to do work, but it doesn't have anything in its runqueue?
    - We could employ "work stealing" (toy sketch below)
  - [22:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1343) - Problem: what if a thread has goroutines waiting in its runqueue, but the goroutine it's currently running blocks? The thread is blocked along with that goroutine, and the work waiting in its runqueue is stuck behind the block.
  - [22:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1343) - Solution: we need thread handoff in addition to stealing.
  - [22:37](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1357) - A background thread acts as a monitor.
  - [22:59](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1379) - A separate monitor thread has to do it because the blocked thread doesn't know ahead of time when or for how long it will be blocked.
  - [23:48](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1428) - The limit on goroutine-running threads is not affected, because a blocked thread doesn't count towards the limit.
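  - A toy model (mine, not the runtime's actual code) of the distributed runqueues and work stealing described above: each worker owns a local FIFO and, when it runs dry, tries to steal from a randomly chosen victim before giving up. The deliberately unbalanced load shows why stealing keeps otherwise-idle workers busy.
    ```go
    package main

    import (
        "fmt"
        "math/rand"
        "runtime"
        "sync"
        "sync/atomic"
    )

    // localQueue plays the role of one thread's runqueue in this toy model.
    type localQueue struct {
        mu    sync.Mutex
        tasks []func()
    }

    func (q *localQueue) push(t func()) {
        q.mu.Lock()
        q.tasks = append(q.tasks, t)
        q.mu.Unlock()
    }

    func (q *localQueue) pop() (func(), bool) {
        q.mu.Lock()
        defer q.mu.Unlock()
        if len(q.tasks) == 0 {
            return nil, false
        }
        t := q.tasks[0] // FIFO: oldest task first
        q.tasks = q.tasks[1:]
        return t, true
    }

    func main() {
        n := runtime.NumCPU()
        queues := make([]*localQueue, n)
        for i := range queues {
            queues[i] = &localQueue{}
        }

        // Deliberately uneven load: every task starts on queue 0.
        var done int64
        for i := 0; i < 10_000; i++ {
            queues[0].push(func() { atomic.AddInt64(&done, 1) })
        }

        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                for {
                    if t, ok := queues[id].pop(); ok {
                        t() // run work from our own runqueue first
                        continue
                    }
                    // Local queue is empty: try to steal from a random victim.
                    if t, ok := queues[rand.Intn(n)].pop(); ok {
                        t()
                        continue
                    }
                    return // nothing found; a real scheduler keeps looking or parks
                }
            }(i)
        }
        wg.Wait()
        fmt.Println("tasks completed:", atomic.LoadInt64(&done))
    }
    ```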
  - [24:00](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1440) - Thread handoff prevents goroutine starvation: queued goroutines will still get to run.
  - [25:16](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1516) - Key components of the Go scheduler:
    - Reuses threads
    - Limits the number of goroutine-running threads to the number of CPU cores
  - [25:23](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1523) - Uses distributed runqueues with stealing and handoff
  - [26:22](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1582) - Last problem: all of this depends on the Go program calling into the scheduler. But what if it doesn't?
    - What if a Go program is super CPU-heavy and just never makes a call into the scheduler?
  - [26:52](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1612) - So the Go scheduler implements PREEMPTION.
    - A background thread called `sysmon` detects long-running goroutines (>10ms, with caveats) and forces them to yield to the scheduler (see the demo sketch below).
  - [27:37](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1657) - The Go scheduler itself has a global runqueue in addition to the distributed runqueues.
    - So the scheduler gets its own system runqueue and system thread.
  - [27:49](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1669) - The global runqueue is lower priority, so threads don't check it as often, but it's still there.
  - [29:18](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1758) - Limitations of the Go scheduler
    - 1. Runqueues are FIFO, not priority-based (unlike Linux)
  - [29:40](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1780) - 2. No strong preemption
    - A long-running goroutine could still bring things to a halt, because preempted goroutines land on the lower-priority global runqueue.
    - There is a recent proposal to fix this.
  - [30:08](https://www.youtube.com/watch?v=YHRO5WQGh0k&t=1808) - The Go scheduler doesn't know about the system topology (doesn't know about the hardware); it operates independently of it.
    - There is also a proposal to fix this: use LIFO instead of FIFO to make better use of CPU caches.
- # PDF Notes
  - {{pdf: assets/1622579177_14.pdf}}
- # Content Notes
  - OS threads
    - Usually 1 per core, but [[Multithreading]] allows multiple threads to run on a single core.
  - The Go Scheduler
    - Enables hundreds of thousands of goroutines to run concurrently
    - Traffic cop: coordinates network I/O
    - Assigns work to goroutines, then goroutines to threads. The OS then puts those threads onto the CPU.
    - Handles blocking system calls, which block both the goroutine and its thread, when necessary
    - Janitor: engages garbage collection
  - Why it's awesome
    - Allows high concurrency (hundreds of thousands of goroutines can run at the same time)
    - Minimizes the number of __kernel__ threads used by goroutines
    - Allows parallelism (if the machine has more than one core, it will take full advantage of that extra capability)
  - Goroutine
    - A "user-space thread", as opposed to an OS or kernel thread. The difference is that goroutines are managed by the Go scheduler, not the OS.
    - Faster than kernel threads
    - Takes up less memory: a goroutine stack starts at 2KB; a kernel thread stack takes up 8KB
  - Designing the Go Scheduler from scratch
    - Failed Approach 1: Multiplex all goroutines on a single thread
      - No concurrency: if a goroutine makes a blocking call, the whole thread is blocked
      - No parallelism: does not take advantage of a second core/thread if it's available
    - Failed Approach 2: 1 thread per goroutine
      - We'd have to destroy a thread every time a goroutine finishes and create a new one every time a goroutine starts - this is expensive.
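  - Tying back to the preemption notes at [26:52] above, a small demo (my own, not from the talk): with `GOMAXPROCS` set to 1, a tight CPU-bound loop with no function calls gives the scheduler nothing to hook into. On Go versions before 1.14 (when the proposal mentioned in the talk shipped as asynchronous preemption), the ticker goroutine below would be starved until the loop finished; on current Go it keeps printing while the loop runs.
    ```go
    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        runtime.GOMAXPROCS(1) // a single goroutine-running thread

        go func() {
            for {
                time.Sleep(100 * time.Millisecond)
                fmt.Println("ticker: still getting scheduled")
            }
        }()

        // A CPU-bound loop with no function calls, so the cooperative
        // scheduler has no natural point at which to step in.
        x := 0
        for i := 0; i < 2_000_000_000; i++ {
            x += i
        }
        fmt.Println("busy loop done:", x)
    }
    ```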
  - Idea 1: [[Thread reuse]]
    - When a goroutine is finished with a thread, use "thread parking" to keep the thread alive but idle, so it can still process work later.
    - Add a runqueue (FIFO) from which all threads take work (see the toy sketch at the end of these notes)
    - "[Cost amortization]([[Amortizing performance cost]])": when threads are reused, the cost of creating a thread is spread across all the goroutines it ends up running.
    - Problem: multiple threads would be accessing the runqueue at the same time, so it needs a lock, and that lock becomes a point of contention
    - Problem: the number of threads is unbounded, and we could end up creating thousands of them that are never freed up
    - k6 is doing this with load generators, reusing them instead of destroying them. In this case, the cost is time.
  - Idea 2: External monitor
    - A separate monitor (thread) that is the only one to access the runqueue, avoiding lock contention. It can also assign work (goroutines) to the threads.
  - Idea 3: Limit the number of threads that can access the runqueue [[Throttling access to resource]]
    - Limit it to the number of cores - that way we're future-proofing: if more cores are added, we'll be able to take advantage of them.
  - Idea 4: [[Distributed runqueue]]s
    - Have as many runqueues as there are cores: 1 runqueue per kernel thread
    - The Go scheduler also has a global runqueue in addition to the per-thread runqueues.
  - Idea 5: [[Work-stealing]]
    - Allow idle threads to take work from another thread's runqueue if that work is not yet being processed.
    - Problem: if the goroutine a thread is running blocks, the thread blocks with it; that goroutine can't be stolen, and the thread can't move on to other work.
  - Idea 6: [[Work hand-off]]
    - The external monitor hands stalled work (a blocked thread's runqueue) off to another thread, so that work can keep running.
  - Idea 7: [[Timeboxing]]
    - Another background thread, called `sysmon`, forces long-running goroutines (>10 ms) to yield.
  - Limitations of the Go Scheduler
    - Runqueues are FIFO, not priority-based (unlike Linux)
    - No strong preemption
      - The scheduler's global runqueue is lower priority, so a long-running goroutine could still bring the app to a halt.
    - Go doesn't know about the hardware of the machine it's running on.
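  - A toy model (my own, not the runtime's actual code) pulling together Ideas 1 and 3 above: worker "threads" are reused, their number is capped at the CPU core count, and they all pop goroutine-like tasks off one shared FIFO runqueue behind a mutex; that single lock is exactly where the contention problem comes from as core counts grow.
    ```go
    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // runqueue is the single shared FIFO from Idea 1: every worker takes
    // work from it, so it has to be guarded by one mutex.
    type runqueue struct {
        mu    sync.Mutex
        tasks []func()
    }

    func (q *runqueue) push(t func()) {
        q.mu.Lock()
        q.tasks = append(q.tasks, t)
        q.mu.Unlock()
    }

    func (q *runqueue) pop() (func(), bool) {
        q.mu.Lock()
        defer q.mu.Unlock()
        if len(q.tasks) == 0 {
            return nil, false
        }
        t := q.tasks[0] // FIFO: oldest task first
        q.tasks = q.tasks[1:]
        return t, true
    }

    func main() {
        q := &runqueue{}
        for i := 0; i < 1_000; i++ {
            q.push(func() {}) // queue up some goroutine-like work
        }

        var wg sync.WaitGroup
        // Idea 3: cap the number of reused workers at the number of CPU cores.
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for {
                    t, ok := q.pop()
                    if !ok {
                        return // queue drained; a real scheduler would park the thread
                    }
                    t()
                }
            }()
        }
        wg.Wait()
        fmt.Println("all queued work processed by", runtime.NumCPU(), "workers")
    }
    ```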