%%
Last Updated:
- [[2021-02-11]]
tags: []
%%

This talk was part of the [[Chaos Carnival 2021]].

Author: [[Ana Medina]] from [[Gremlin]]
- Senior Chaos Engineer
- Chaos Engineer/SRE since 2016
- Developer since 2007
- Worked at Uber in 2016
- From Costa Rica and Nicaragua
- Spanish speaker
- Diversity shoutout

## Log

How do you learn?
How do I learn my food tastes good?

- Cooking = cloud native apps
- Tasting while cooking = observing gradual deploys
- Cooking for others = experimentation
- Burning a dish = replicating incidents

> Sometimes you need to just go ahead and burn your food so you can learn what not to do next time.

Measuring the [[Cost of downtime]]

Cost = R + E + C + (B + A)

- During the outage
	- R = Revenue lost
	- E = Employee productivity
- After the outage
	- C = Customer chargebacks (SLA breaches)
- Unquantifiable
	- B = Brand Defamation
	- A = Employee Attrition

Chaos engineering - Ana's favorite ingredient for building cloud native applications

Terminology
- experiment
- hypothesis
- blast radius
- magnitude
- abort conditions

Steps
1. Observe your system
	- Architectural diagram
2. Baseline your metrics (set SLOs/SLIs per service)
3. Form a hypothesis with abort conditions
4. Define blast radius and magnitude
	- Ex: Run 1 experiment with a 10% increase in load to services
5. Run the experiment
6. Analyze results
7. Re-test
8. Share your results

Sometimes you learn things even before you finish going through all of the steps.

![[Pasted image 20210211190621.png]]

> In the cloud native world, always assume it will break.

> Always ask what happens when X fails.

- Have a [[Disaster Recovery]] plan

## Questions

### Does baselining metrics need to be done on a per-service or per-environment basis?

Yes, SLOs/SLIs should be per service or per environment (depending on the metric) because otherwise you'd have too many variables.

### Picking on Redis?

It's just really popular, and data loss is a pretty big issue. It's all about reliable user experience.
A memory increase in a Redis implementation is going to affect performance heavily.

### When should you implement chaos engineering? Design stage?

It depends. Earlier would be better, but don't let it stop you if you're already past that stage. It also depends a lot on the organization and industry.

## See also
- [[Chaos Engineering]]
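
## Sketch: cost of downtime

The formula from the Log, Cost = R + E + C + (B + A), can be written out as a small calculation. This is only a rough sketch, not something shown in the talk: the class name `DowntimeCost`, its field names, and the dollar figures are all made up for illustration. B and A default to zero because the notes list them as unquantifiable, so the code keeps the quantifiable subtotal separate from any rough estimates you might plug in.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Terms of Cost = R + E + C + (B + A). Names are hypothetical."""

    revenue_lost: float              # R, during the outage
    employee_productivity: float     # E, during the outage
    customer_chargebacks: float      # C, after the outage (SLA breaches)
    brand_defamation: float = 0.0    # B, unquantifiable; 0 means "no estimate"
    employee_attrition: float = 0.0  # A, unquantifiable; 0 means "no estimate"

    def quantifiable(self) -> float:
        """R + E + C: the part that can actually be measured."""
        return (
            self.revenue_lost
            + self.employee_productivity
            + self.customer_chargebacks
        )

    def total(self) -> float:
        """R + E + C + (B + A), using whatever estimates were supplied."""
        return self.quantifiable() + (
            self.brand_defamation + self.employee_attrition
        )


# Made-up figures for a single outage, in dollars.
outage = DowntimeCost(
    revenue_lost=50_000,
    employee_productivity=8_000,
    customer_chargebacks=12_000,
)
print("Quantifiable cost:", outage.quantifiable())
print("Total with estimates:", outage.total())
```

With no estimates for B and A, both numbers are the same, which is a reminder that the measured figure is a lower bound on the real cost.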