%%
Last Updated:
- [[2021-03-10]]
%%
Lead: [[Don Darwin]] @DonDoubleD
Bootcamp managers:
[[Ana Medina]]
[[Jason Yee]]
[[Zach Bujnoch]]
[[Aniela Grych]]
[[Kyle Hultman]]
## Foundations
Chaos engineering is the practice of running thoughtful, planned experiments designed to reveal weaknesses in our systems.
Principles
- Plan an experiment
- Contain the blast radius and magnitude
- Scale or halt
Blast radius
- The number of hosts, containers, or resources targeted by the experiment
Magnitude
- The intensity of the attack you're running
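The two terms are easy to mix up, so here is a minimal Python sketch that keeps them as separate fields. `AttackScope` and its field names are hypothetical and chosen for illustration; they are not part of Gremlin's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackScope:
    """Hypothetical description of one attack's scope (not a Gremlin type)."""

    targeted_hosts: int  # blast radius: how many hosts the attack touches
    total_hosts: int     # size of the fleet the targets are drawn from
    magnitude: float     # intensity, e.g. 0.5 for 50% CPU load

    def blast_radius_fraction(self) -> float:
        """Return the share of the fleet that the attack targets (0 to 1)."""
        if self.total_hosts <= 0:
            raise ValueError("total_hosts must be positive")
        return self.targeted_hosts / self.total_hosts


# Start small: one host out of ten, at half intensity.
small = AttackScope(targeted_hosts=1, total_hosts=10, magnitude=0.5)
print(small.blast_radius_fraction())  # 0.1
```

Blast radius and magnitude can then be scaled independently between runs, for example by keeping the same single host but raising the intensity.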
Scientific method
- Form a hypothesis
- Experiment and test it
- Analyze results
- Expand scope and re-test
- Share results
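The "expand scope and re-test" step, together with the "scale or halt" principle, can be sketched as a simple loop. This is an illustration under stated assumptions: `run_in_stages` and the `hypothesis_holds` callback are hypothetical names, and the stand-in experiment below is made up rather than measured.

```python
from typing import Callable, Iterable, List, Tuple


def run_in_stages(
    stages: Iterable[int],
    hypothesis_holds: Callable[[int], bool],
) -> List[Tuple[int, bool]]:
    """Run the experiment at each blast radius, halting at the first failure."""
    results: List[Tuple[int, bool]] = []
    for hosts in stages:
        ok = hypothesis_holds(hosts)
        results.append((hosts, ok))
        if not ok:
            break  # halt: address the weakness before expanding scope again
    return results


# Stand-in experiment: pretend the system copes with up to 4 hosts under attack.
results = run_in_stages([1, 2, 5, 10], lambda hosts: hosts <= 4)
print(results)  # [(1, True), (2, True), (5, False)]
```

The returned list doubles as the record to share: it shows which scopes were tested and where the hypothesis stopped holding.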
Abort conditions
- what conditions would cause you to halt the experiment?
- examples: error rate or latency
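Abort conditions are easiest to act on when they are written down as explicit thresholds before the experiment starts. A minimal Python sketch follows; the names `AbortConditions` and `should_abort` and the threshold values are assumptions for illustration, and in practice the readings would come from your monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortConditions:
    """Hypothetical thresholds agreed on before the experiment starts."""

    max_error_rate: float      # fraction of failed requests, e.g. 0.05 = 5%
    max_p95_latency_ms: float  # 95th percentile latency in milliseconds


def should_abort(
    conditions: AbortConditions,
    error_rate: float,
    p95_latency_ms: float,
) -> bool:
    """Return True if either reading exceeds its agreed threshold."""
    return (
        error_rate > conditions.max_error_rate
        or p95_latency_ms > conditions.max_p95_latency_ms
    )


# Illustrative thresholds only; choose yours from baseline metrics.
limits = AbortConditions(max_error_rate=0.05, max_p95_latency_ms=1000.0)
print(should_abort(limits, error_rate=0.01, p95_latency_ms=350.0))  # False
print(should_abort(limits, error_rate=0.12, p95_latency_ms=350.0))  # True
```

Breaching either threshold alone is enough to halt, which keeps the decision simple during a live experiment.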
Baseline metrics
- infrastructure monitoring metrics
- alerting and on-call metrics
- high severity incident (SEV) metrics
Roles
- Chaos general: the decision maker. They call the experiment schedule, decide when abort conditions are met, and own the exercise
- Chaos commander: responsible for implementing and executing experiments
- Chaos scribe: responsible for recording the experiments and results in the Notes and Observation section in Gremlin
    - Hypothesis and results, plus links to the monitoring tool that support the results
- Chaos observer: works with the scribe to gather data and correlate the effects of experiments using monitoring, observability, and alerting tools
Resource attacks
- CPU
- Memory
- Disk
- I/O
Network attacks
- Blackhole
- Latency
- Packet loss
- DNS
State attacks
- Shutdown
- Process killer
- Time travel
![[Pasted image 20210310195131.png]]
Blackhole: cuts off all traffic to and from the targeted service
## Hands-on
Hipster Shop company - [[Online Boutique]]
![[onlineboutique.png]]
[[Mike Li]] - Chaos General
[[Omar Saenz]] - Observer
Me - Chaos Commander
[[Oleksandr Sakharchuk]]
[[Priyesh Pande]] - Chaos scribe
I ran a load test during one of the attacks: https://app.k6.io/runs/public/2089f74c2a794ae8af5845f1626886d0
![[Pasted image 20210311092844.png]]
Datadog dashboard
![[gremlinbootcamp-datadog.png]]
![[gremlinbootcamp-inducingtimeouts.png]]