- Author: [[gist.github.com]]
- Full Title: Using Chaos Engineering to Test Distributed Systems
- Tags:: [[Chaos Engineering]] [[Principles of chaos engineering]]
- URL: https://gist.github.com/simskij/646e4cf5f40d95a781eb10d111691f79
### Highlights first synced by [[Readwise]] [[2021-01-21]]
> Chaos engineering is the discipline of introducing turbulence into a system to build confidence around its capacity to either withstand, or gracefully recover from, it. ([View Highlight](https://instapaper.com/read/1380280591/15254359))
^9ed440
- We'll be doing a demo with chaos-toolkit. Compared to many of the first-gen tools like [[Chaos Monkey]], it has a pretty low barrier to entry and does not require any additional infrastructure like [[Spinnaker]] to function. ([View Highlight](https://instapaper.com/read/1380280591/15254398))
- The process consists of four different steps: ([View Highlight](https://instapaper.com/read/1380280591/15254408))
    1. Defining the steady state
    2. Building a hypothesis
    3. Execution
    4. Disproving the hypothesis
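- Note: the four steps can be sketched as a small experiment loop. This is a minimal illustration in plain Python, not chaos-toolkit's API; `run_experiment`, `ExperimentResult`, and the callables passed in are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    steady_before: bool    # step 1: the system was in its steady state before injection
    hypothesis_held: bool  # step 4: False means the hypothesis was disproved


def run_experiment(
    check_steady_state: Callable[[], bool],
    inject_turbulence: Callable[[], None],
    check_hypothesis: Callable[[], bool],
    rollback: Optional[Callable[[], None]] = None,
) -> ExperimentResult:
    # Step 1: only start from a known steady state.
    if not check_steady_state():
        raise RuntimeError("system is not in its steady state; aborting experiment")

    # Step 2 is the hypothesis itself, passed in as `check_hypothesis`.
    try:
        # Step 3: execution, i.e. introduce the turbulence.
        inject_turbulence()
        # Step 4: try to disprove the hypothesis while the turbulence is active.
        held = check_hypothesis()
    finally:
        # Always try to restore the system, even if a check raised.
        if rollback is not None:
            rollback()

    return ExperimentResult(steady_before=True, hypothesis_held=held)
```

A caller would wrap real probes (for example, a metrics query) in the callables; here they are left abstract so the control flow stays visible.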
- If you manage to disprove your hypothesis, you now have a goal to aim for. The harder it is for you to actually make the system drift from the steady state, the more confidence you'll gain in the reliability of your system. ([View Highlight](https://instapaper.com/read/1380280591/15254412))
- The steady state should be a known state, matching what you'd consider business as usual. ([View Highlight](https://instapaper.com/read/1380280591/15254419))
- Some examples of this could be: ([View Highlight](https://instapaper.com/read/1380280591/15254422))
    - Should have an error rate of less than one percent.
    - Should have a request duration below 200ms.
    - Should have a CPU consumption per replica of less than 75%.
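- Note: the example thresholds above could be encoded as a steady-state check along these lines. This is a sketch only; the `Metrics` type and `steady_state_violations` function are hypothetical, and how the numbers are collected (monitoring system, load balancer logs, etc.) is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    error_rate: float             # fraction of failed requests, 0.0 to 1.0
    request_duration_ms: float    # observed request duration in milliseconds
    cpu_per_replica: List[float]  # CPU utilisation per replica, 0.0 to 1.0


def steady_state_violations(m: Metrics) -> List[str]:
    """Return a description of every threshold that is not met (empty list = steady)."""
    violations = []
    if m.error_rate >= 0.01:
        violations.append(f"error rate {m.error_rate:.2%} is not below 1%")
    if m.request_duration_ms >= 200:
        violations.append(f"request duration {m.request_duration_ms:.0f}ms is not below 200ms")
    for index, cpu in enumerate(m.cpu_per_replica):
        if cpu >= 0.75:
            violations.append(f"replica {index} CPU {cpu:.0%} is not below 75%")
    return violations
```

Returning the list of violations rather than a bare boolean makes it easier to see which part of "business as usual" the system has drifted from.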
- The hypothesis should be formulated around a positive outcome for the experiment. ([View Highlight](https://instapaper.com/read/1380280591/15254428))
- Some examples of this could be: ([View Highlight](https://instapaper.com/read/1380280591/15254430))
    - With two of the web server replicas terminating unexpectedly, the error rate will remain below 10%.
    - With a sudden surge of traffic adding 2000 concurrent users, the request duration will stay below 500ms.
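- Note: a hypothesis like the ones above can be evaluated against request samples recorded while the turbulence is active. The sketch below assumes a hypothetical `RequestSample` record, and it reads "request duration will stay below 500ms" in the strictest way (every sampled request); a percentile-based reading would be an equally valid choice.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestSample:
    ok: bool            # True if the request succeeded
    duration_ms: float  # how long the request took, in milliseconds


def error_rate(samples: Sequence[RequestSample]) -> float:
    """Fraction of sampled requests that failed."""
    if not samples:
        raise ValueError("no samples recorded; the hypothesis cannot be evaluated")
    return sum(1 for s in samples if not s.ok) / len(samples)


def error_rate_hypothesis_holds(
    samples: Sequence[RequestSample], max_error_rate: float = 0.10
) -> bool:
    # "The error rate will remain below 10%."
    return error_rate(samples) < max_error_rate


def duration_hypothesis_holds(
    samples: Sequence[RequestSample], max_duration_ms: float = 500.0
) -> bool:
    # "The request duration will stay below 500ms", applied to every sample.
    if not samples:
        raise ValueError("no samples recorded; the hypothesis cannot be evaluated")
    return all(s.duration_ms < max_duration_ms for s in samples)
```

If either function returns `False`, the hypothesis has been disproved and the failing threshold becomes the goal to aim for.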