Site Reliability Engineering

---
Title: Site Reliability Engineering
Author: Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
Tags: readwise, books
date: 2024-01-30
---

# Site Reliability Engineering

![rw-book-cover](https://images-na.ssl-images-amazon.com/images/I/51XswOmuLqL._SL200_.jpg)

Author:: Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy

## Highlights

> Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.1 ([Location 88](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=88))

> The popular industry model that conceives of deployed, operational software as being “stabilized” in production, and therefore needing much less attention from software engineers, is wrong. ([Location 92](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=92))

> SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient. However, we expend effort in this direction only up to a point: when systems are “reliable enough,” we instead invest our efforts in adding features or building new products. ([Location 105](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=105))

> Margaret Hamilton, working on the Apollo program on loan from MIT, ([Location 137](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=137))

> As Margaret says, “a thorough understanding of how to operate the systems was not enough to prevent human errors,” and the change request to add error detection and recovery software to the prelaunch program P01 was approved shortly afterwards. ([Location 154](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=154))

> 1 The very fact that there is such large variance in these estimates tells you something about software engineering as a discipline, but see, e.g., [Gla02] for more details. ([Location 275](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=275))

> Ben Treynor Sloss, the senior VP overseeing technical operations at Google — and the originator of the term “Site Reliability Engineering” ([Location 288](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=288))

> Historically, companies have employed systems administrators to run complex computing systems. ([Location 300](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=300))

> As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. Because the sysadmin role requires a markedly different skill set than that required of a product’s developers, developers and sysadmins are divided into discrete teams: “development” and “operations” or “ops.” ([Location 304](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=304))

> The sysadmin approach and the accompanying development/ops split has a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs. ([Location 312](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=312))

> Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system. ([Location 313](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=313))

> The indirect costs of the development/ops split can be subtle, but are often more expensive to the organization than the direct costs. These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. ([Location 316](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=316))

> At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension. ([Location 321](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=321))

> Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins. ([Location 335](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=335))

> SRE is what happens when you ask a software engineer to design an operations team. ([Location 338](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=338))

> 50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers. The other 40–50% are candidates who were very close to the Google Software Engineering qualifications (i.e., 85–99% of the skill set required), and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek. ([Location 345](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=345))

> To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc. ([Location 361](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=361))

> over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. ([Location 363](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=363))

> Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. ([Location 366](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=366))

> [Gla02] R. Glass, Facts and Fallacies of Software Engineering, Addison-Wesley Professional, 2002. ([Location 9771](https://readwise.io/to_kindle?action=open&asin=B01DCPXKZ6&location=9771))