%% Last Updated: - [[2021-03-10]] %%

Lead: [[Don Darwin]] @DonDoubleD

Bootcamp managers: [[Ana Medina]] [[Jason Yee]] [[Zach Bujnoch]] [[Aniela Grych]] [[Kyle Hultman]]

## Foundations

Chaos engineering is about thoughtful, planned experiments designed to reveal weaknesses in our system.

Principles
- Plan an experiment
- Contain the blast radius and magnitude
- Scale or halt

Blast radius: the number of hosts, containers, or resources targeted by the experiment.

Magnitude: the intensity of the attack you're running.

Scientific method
- Form a hypothesis
- Experiment and test it
- Analyze results
- Expand scope and re-test
- Share results

Abort conditions
- What conditions would cause you to halt the experiment?
- Examples: error rate or latency

Baseline metrics
- Infrastructure monitoring metrics
- Alerting and on-call metrics
- High severity incident (SEV) metrics

Roles
- Chaos general: decision maker. They call the experiment schedule, decide when abort conditions are met, and own the exercise
- Chaos commander: responsible for implementing and executing experiments
- Chaos scribe: responsible for recording the experiments and results in the Notes and Observation section on Gremlin
    - Hypothesis and results, links to monitoring tool to support results
- Chaos observer: works with the scribe to gather data and correlate the effects of experiments using monitoring, observability, and alerting tools

Resource attacks
- CPU
- Memory
- Disk
- I/O

Network attacks
- Blackhole
- Latency
- Packet loss
- DNS

State attacks
- Shutdown
- Process killer
- Time travel

![[Pasted image 20210310195131.png]]

Blackhole: cuts off traffic to and from that service

## Hands-on

Hipstore Shop company - [[Online Boutique]]

![[onlineboutique.png]]

- [[Mike Li]] - Chaos General
- [[Omar Saenz]] - Observer
- Me - Chaos Commander
- [[Oleksandr Sakharchuk]] [[Priyesh Pande]] - Chaos scribe

I ran a load test during one of the attacks: https://app.k6.io/runs/public/2089f74c2a794ae8af5845f1626886d0

![[Pasted image 20210311092844.png]]

Datadog dashboard

![[gremlinbootcamp-datadog.png]]

![[gremlinbootcamp-inducingtimeouts.png]]
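
Recap sketch: the "abort conditions" and "scale or halt" ideas from Foundations could be written down roughly like this. This is only an illustration, not anything Gremlin provides. The names `AbortConditions`, `should_abort`, and `next_step` are hypothetical, the thresholds are made-up values, and doubling the blast radius on each round is my own assumption about how to "expand scope".

```python
from dataclasses import dataclass


@dataclass
class AbortConditions:
    """Hypothetical thresholds; pick real values from your baseline metrics."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.05 means 5%
    max_p95_latency_ms: float   # 95th percentile latency in milliseconds


def should_abort(error_rate: float, p95_latency_ms: float,
                 limits: AbortConditions) -> bool:
    """Return True when either observed metric exceeds its abort threshold."""
    return (error_rate > limits.max_error_rate
            or p95_latency_ms > limits.max_p95_latency_ms)


def next_step(error_rate: float, p95_latency_ms: float, limits: AbortConditions,
              blast_radius: int, max_blast_radius: int) -> tuple[str, int]:
    """Scale or halt: decide what to do after one experiment round.

    blast_radius is the number of targeted hosts/containers.
    Doubling it per round is an assumption made for this sketch.
    """
    if should_abort(error_rate, p95_latency_ms, limits):
        return ("halt", blast_radius)
    if blast_radius >= max_blast_radius:
        # Nothing left to expand to; the experiment has covered its full scope.
        return ("done", blast_radius)
    return ("scale", min(blast_radius * 2, max_blast_radius))


if __name__ == "__main__":
    limits = AbortConditions(max_error_rate=0.05, max_p95_latency_ms=500.0)
    # Healthy round on 1 host: expand the blast radius.
    print(next_step(0.01, 200.0, limits, blast_radius=1, max_blast_radius=4))
    # Latency over the threshold: halt.
    print(next_step(0.01, 900.0, limits, blast_radius=2, max_blast_radius=4))
```

In a real exercise the Chaos general makes this call, using numbers the observer reads off the monitoring dashboard; the function just makes the decision rule explicit before the experiment starts.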