%% [[Reliability]] [[Site Reliability Engineer]] %%

# Chaos Engineering

## What is chaos engineering?

Chaos engineering is a new branch of testing and tuning an application that focuses on disruptive environmental issues not typically considered by standard [[Performance Testing]]. It involves running thoughtful experiments designed to replicate turbulence in a system.

The name "Chaos Engineering" comes from [[Chaos Theory]], which analyzes seemingly "chaotic" or random phenomena and finds the systematic patterns underlying them. Chaos engineering seeks to reproduce what would normally be considered "unforeseen" events, such as a server outage, in a predictable and systematic way. Because of its exploratory nature, chaos engineering often consists of "experiments" rather than "tests".

![[Chaos Engineering with Tammy Butow#^929a6d]] [^tammy]

![[sources/Article/Using Chaos Engineering to Test Distributed Systems#^9ed440]] [^simme]

<iframe width="560" height="315" src="https://www.youtube.com/embed/r4M0RM3QGxk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<iframe width="560" height="315" src="https://www.youtube.com/embed/z1-_R-h2unE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Chaos engineers typically hold the title [[Site Reliability Engineer]], but anyone can be a chaos engineer.

Activities that test an application's resilience in the realm of chaos engineering are called [[Types of chaos experiments|chaos experiments]].

## History of chaos engineering

Chaos engineering was largely coined as a term and established as a discipline by [[Netflix]] engineers who were moving from a data center to the cloud and needed a way to make their application more robust against the (then quite common) instance shutdowns and restarts.
They created a tool called [[Chaos Monkey]], which randomly shut down instances. It was meant to run during the workday, so that engineers could respond to the outage and learn to program around it.

![](assets/1621807236_22.png) [^reliabilitymatters]

## [[Chaos engineering terminology]]

## [[Principles of chaos engineering]]

## [[The process of chaos engineering]]

## [[Types of chaos experiments]]

## [[Tools for Chaos Engineering]]

## [[Chaos engineering is a testing discipline]]

## The role of [[Load Testing]] in chaos engineering

Many types of chaos experiments require load on the application, but generating that load is generally outside chaos engineering's scope. [[Load Testing]] is brought in to make an environment suitably production-like (when experiments are run in a test environment or during a low-load period in production).

## Resources

- [[Tammy Bryant]], [[Simme Aronsson]], [[Ana Medina]]
- [[sources/Article/Using Chaos Engineering to Test Distributed Systems]]
- [[Chaos Engineering with Tammy Butow]]
- [[AWS Fault Injection Simulator]]
- [Tammy Butow & Ana Medina - Next Level Chaos Engineering](https://www.youtube.com/watch?v=kAJJ0-CJOEE)
- [[In the kitchen - a sprinkle of fire and chaos]]
- [[PRINCIPLES OF CHAOS ENGINEERING - Principles of Chaos Engineering]]

[^tammy]: [[Chaos Engineering with Tammy Butow]]
[^reliabilitymatters]: Butow, T. (2020). *Reliability matters more than ever.* Failover Conf. Accessed from the [Gremlin YouTube channel](https://www.youtube.com/watch?v=VOTk-wRrdZ0).
[^simme]: Aronsson, S. (2021). *Using chaos engineering to test distributed systems*.