%%
[[Reliability]] [[Site Reliability Engineer]]
%%
# Chaos Engineering
## What is chaos engineering?
Chaos engineering is a new branch of testing and tuning applications that focuses on disruptive environmental issues not typically covered by standard [[Performance Testing]]. It involves running deliberate experiments designed to replicate turbulence in a system.
The name "Chaos Engineering" comes from [[Chaos Theory]], which analyzes seemingly "chaotic" or random phenomena and finds the systematic patterns underlying them. Chaos engineering seeks to reproduce what would normally be considered "unforeseen" events, such as a server outage, in a predictable and systematic way.
Because of its exploratory nature, chaos engineering work is usually framed as "experiments" rather than "tests".
![[Chaos Engineering with Tammy Butow#^929a6d]] [^tammy]
![[sources/Article/Using Chaos Engineering to Test Distributed Systems#^9ed440]] [^simme]
<iframe width="560" height="315" src="https://www.youtube.com/embed/r4M0RM3QGxk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/z1-_R-h2unE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Chaos engineers typically hold the job title [[Site Reliability Engineer]], but anyone can be a chaos engineer.
Activities that test an application's resilience in the realm of chaos engineering are called [[Types of chaos experiments|chaos experiments]].
## History of chaos engineering
Chaos engineering was coined as a term and established as a discipline largely by [[Netflix]] engineers, who were moving from a data center to the cloud and needed a way to make their application more robust against the (then quite common) instance shutdowns and restarts. They created a tool called [[Chaos Monkey]], which randomly shut down instances. It was meant to run during workdays, so that engineers could respond to the outage and learn to program around it.
![](assets/1621807236_22.png) [^reliabilitymatters]
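The idea behind Chaos Monkey (pick a random instance, but only while people are at work to respond) can be illustrated with a minimal sketch. This is not Netflix's implementation: the working hours, the function names, and the instance IDs are all assumptions made for illustration, and no real cloud API is called.

```python
import random
from datetime import datetime
from typing import Optional, Sequence

# Assumed working hours; the real tool's schedule is configurable and not shown here.
WORK_START_HOUR = 9
WORK_END_HOUR = 17


def is_workday_hours(now: datetime) -> bool:
    """Return True on Monday to Friday between the assumed start and end hours."""
    return now.weekday() < 5 and WORK_START_HOUR <= now.hour < WORK_END_HOUR


def pick_instance_to_terminate(
    instances: Sequence[str],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one random instance ID, or None outside working hours.

    Hypothetical helper: actually shutting the instance down is left to
    whatever platform API the environment provides.
    """
    if not instances or not is_workday_hours(now):
        return None
    rng = rng or random.Random()
    return rng.choice(list(instances))
```

Called with a Monday-morning timestamp, `pick_instance_to_terminate(["web-1", "web-2", "web-3"], datetime(2021, 5, 24, 10, 0))` returns one of the three IDs; with a weekend or evening timestamp it returns `None`, so nothing is shut down when nobody is around to respond.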
## [[Chaos engineering terminology]]
## [[Principles of chaos engineering]]
## [[The process of chaos engineering]]
## [[Types of chaos experiments]]
## [[Tools for Chaos Engineering]]
## [[Chaos engineering is a testing discipline]]
## The role of [[Load Testing]] in chaos engineering
Many types of chaos experiments require load on the application, but generating that load is generally outside chaos engineering's scope. [[Load Testing]] is brought in to make an environment suitably production-like, for example when experiments are run in a test environment or during a low-load period in production.
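How the two disciplines fit together can be sketched roughly as follows: a load generator keeps requests flowing in the background while the experiment injects a fault. Everything here is hypothetical and runs in-process. `FakeService` stands in for the application under test, and in practice the load would come from a dedicated load testing tool rather than a thread.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class FakeService:
    """Hypothetical stand-in for the application under test."""
    healthy: bool = True

    def handle(self) -> bool:
        # A request succeeds only while the service is healthy.
        return self.healthy


@dataclass
class LoadResult:
    ok: int = 0
    failed: int = 0


def generate_load(service: FakeService, requests: int, result: LoadResult,
                  pause: float = 0.002) -> None:
    """Load testing side: send a fixed number of requests and count outcomes."""
    for _ in range(requests):
        if service.handle():
            result.ok += 1
        else:
            result.failed += 1
        time.sleep(pause)


def run_experiment(requests: int = 100, fault_after: float = 0.02) -> LoadResult:
    """Chaos side: inject a fault while the background load is running."""
    service = FakeService()
    result = LoadResult()
    load = threading.Thread(target=generate_load, args=(service, requests, result))
    load.start()
    time.sleep(fault_after)   # let some traffic through first
    service.healthy = False   # the injected fault
    load.join()
    return result
```

Running `run_experiment()` returns a `LoadResult` whose `failed` count reflects requests sent after the fault was injected. Without the background load there would be nothing to observe, which is why load generation is a prerequisite for the experiment rather than a part of it.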
## Resources
- [[Tammy Bryant]], [[Simme Aronsson]], [[Ana Medina]]
- [[sources/Article/Using Chaos Engineering to Test Distributed Systems]]
- [[Chaos Engineering with Tammy Butow]]
- [[AWS Fault Injection Simulator]]
- [Tammy Butow & Ana Medina - Next Level Chaos Engineering](https://www.youtube.com/watch?v=kAJJ0-CJOEE)
- [[In the kitchen - a sprinkle of fire and chaos]]
- [[PRINCIPLES OF CHAOS ENGINEERING - Principles of Chaos Engineering]]
[^tammy]: [[Chaos Engineering with Tammy Butow]]
[^reliabilitymatters]: Butow, T. (2020). *Reliability matters more than ever.* Failover Conf. Accessed from the [Gremlin YouTube channel](https://www.youtube.com/watch?v=VOTk-wRrdZ0).
[^simme]: Aronsson, S. (2021). _Using chaos engineering to test distributed systems_.