---
Title: Using Chaos Engineering to Test Distributed Systems
Author: gist.github.com
Tags: readwise, articles
date: 2024-01-30
---
# Using Chaos Engineering to Test Distributed Systems

![rw-book-cover](https://readwise-assets.s3.amazonaws.com/static/images/article3.5c705a01b476.png)

URL:: https://gist.github.com/simskij/646e4cf5f40d95a781eb10d111691f79
Author:: gist.github.com

## Highlights

> Chaos engineering is the discipline of introducing turbulence into a system to build confidence around its capacity to either withstand, or gracefully recover from it. ([View Highlight](https://instapaper.com/read/1380280591/15254359))

> we'll be doing a demo with chaos-toolkit. Compared to many of the first-gen tools like [[Chaos Monkey]] it's got a pretty low barrier of entry and does not require any additional infrastructure like [[Spinnaker]] to function. ([View Highlight](https://instapaper.com/read/1380280591/15254398))

> The process consists of four different steps.
> 1. Defining the steady state
> 2. Building a hypothesis
> 3. Execution
> 4. Disproving the hypothesis ([View Highlight](https://instapaper.com/read/1380280591/15254408))

> If you manage to disprove your hypothesis, you now have a goal to aim for. The harder it is for you to actually make the system drift from the steady state, the more confidence you'll gain in the reliability of your system. ([View Highlight](https://instapaper.com/read/1380280591/15254412))

> The steady state should be a known state, matching what you'd consider business as usual. ([View Highlight](https://instapaper.com/read/1380280591/15254419))

> Some examples of this could be:
> - Should have an error rate of less than one percent.
> - Should have a request duration below 200ms.
> - Should have a CPU consumption per replica of less than 75% ([View Highlight](https://instapaper.com/read/1380280591/15254422))

> The hypothesis should be formulated around a positive outcome for the experiment. ([View Highlight](https://instapaper.com/read/1380280591/15254428))

> Some examples of this could be:
> - With two of the web server replicas terminating unexpectedly, the error rate will remain below 10%.
> - With a sudden surge of traffic adding 2000 concurrent users the request duration will stay below 500ms. ([View Highlight](https://instapaper.com/read/1380280591/15254430))
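
The four steps in the highlights can be sketched in a few lines of plain Python. This is a minimal illustration, not how chaos-toolkit itself is implemented or configured: `Threshold`, `run_experiment`, `ToyCluster`, and the metric names are hypothetical, and the toy cluster's numbers are invented so the example runs without a real system. Only the limits (1% and 10% error rate, 200ms, 75% CPU) come from the highlighted examples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Threshold:
    """A metric that must stay strictly below an upper bound."""
    metric: str
    upper_bound: float

    def holds(self, metrics: Metrics) -> bool:
        return metrics[self.metric] < self.upper_bound


# Step 1: the steady state, using the example limits from the highlights.
STEADY_STATE = [
    Threshold("error_rate", 0.01),            # less than one percent
    Threshold("request_duration_ms", 200.0),  # below 200ms
    Threshold("cpu_per_replica", 0.75),       # less than 75% per replica
]

# Step 2: a hypothesis phrased as a positive outcome:
# "with two replicas terminated, the error rate will remain below 10%".
HYPOTHESIS = [Threshold("error_rate", 0.10)]


def violations(thresholds: List[Threshold], metrics: Metrics) -> List[str]:
    """Return the names of the metrics that break their threshold."""
    return [t.metric for t in thresholds if not t.holds(metrics)]


def run_experiment(
    measure: Callable[[], Metrics],
    inject_fault: Callable[[], None],
    steady_state: List[Threshold] = STEADY_STATE,
    hypothesis: List[Threshold] = HYPOTHESIS,
) -> str:
    # Only experiment on a system that is in its known, business-as-usual state.
    if violations(steady_state, measure()):
        return "aborted: system not in steady state"
    # Step 3: execution, i.e. introduce the turbulence.
    inject_fault()
    # Step 4: try to disprove the hypothesis with fresh measurements.
    broken = violations(hypothesis, measure())
    return f"hypothesis disproved: {broken}" if broken else "hypothesis held"


class ToyCluster:
    """Hypothetical stand-in for a real system; all numbers are invented."""

    def __init__(self, replicas: int) -> None:
        self.desired = replicas
        self.replicas = replicas

    def terminate(self, count: int) -> None:
        self.replicas = max(self.replicas - count, 0)

    def measure(self) -> Metrics:
        lost = self.desired - self.replicas
        return {
            "error_rate": 0.002 + 0.03 * lost,  # toy model: errors grow per lost replica
            "request_duration_ms": 120.0,
            "cpu_per_replica": 0.50,
        }


if __name__ == "__main__":
    cluster = ToyCluster(replicas=5)
    print(run_experiment(cluster.measure, lambda: cluster.terminate(2)))
```

In this sketch, a "hypothesis disproved" result names the metric that drifted, which matches the note that a disproved hypothesis gives you a goal to aim for. Against a real system, `measure` would read from monitoring and `inject_fault` would terminate actual replicas.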