%%
Last Updated:
- [[2021-02-11]]
tags: []
%%
This talk was part of the [[Chaos Carnival 2021]].
Author: [[Ana Medina]] from [[Gremlin]]
Senior Chaos Engineer
Chaos Engineer/SRE since 2016
Developer since 2007
Worked at Uber in 2016
from Costa Rica and Nicaragua
Spanish speaker
Diversity shoutout
## Log
How do you learn?
How do I learn my food tastes good?
Cooking = cloud native apps
Tasting while cooking = observing gradual deploys
Cooking for others = experimentation
Burning a dish = replicating incidents
> Sometimes you need to just go ahead and burn your food so you can learn what not to do next time.
Measuring the [[Cost of downtime]]
Cost = R + E + C + (B + A)
During the outage
R = Revenue lost
E = Employee productivity
After the outage
C = Customer chargebacks (SLA breaches)
Unquantifiable
B = Brand Defamation
A = Employee Attrition
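
A minimal sketch of the cost formula above, for plugging in numbers after an incident. The class name, field names, and dollar figures are hypothetical and not from the talk. B and A default to zero because the talk lists them as unquantifiable.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    revenue_lost: float               # R, during the outage
    employee_productivity: float      # E, during the outage
    customer_chargebacks: float       # C, after the outage (SLA breaches)
    brand_defamation: float = 0.0     # B, unquantifiable; estimate only if you must
    employee_attrition: float = 0.0   # A, unquantifiable; estimate only if you must

    def quantifiable(self) -> float:
        # R + E + C
        return self.revenue_lost + self.employee_productivity + self.customer_chargebacks

    def total(self) -> float:
        # Cost = R + E + C + (B + A)
        return self.quantifiable() + (self.brand_defamation + self.employee_attrition)


# Made-up figures for illustration only.
example = OutageCost(
    revenue_lost=50_000,
    employee_productivity=8_000,
    customer_chargebacks=12_000,
)
print(example.quantifiable())
print(example.total())
```
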
Chaos engineering
- Ana's favorite ingredient for building cloud native applications
Terminology
- experiment
- hypothesis
- blast radius
- magnitude
- abort conditions
Steps
1. Observe your system
- Architectural diagram
2. Baseline your metrics (set SLOs/SLIs per service)
3. Form a hypothesis with abort conditions
4. Define blast radius and magnitude
- Ex: Run 1 experiment with a 10% increase in load to services
5. Run experiment
6. Analyze results
7. Re-test
8. Share your results
Sometimes you learn things even before you finish going through all of the steps.
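
A rough sketch of how the terminology and steps fit together in code, assuming a single error-rate metric. `Experiment`, `run`, and `fake_probe` are hypothetical names, and the thresholds are made up. A real experiment would inject the fault with a chaos tool and read the metric from monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str           # what we expect to stay true (step 3)
    blast_radius: str         # which part of the system is affected (step 4)
    magnitude: float          # intensity, e.g. 0.10 for a 10% load increase (step 4)
    slo_error_rate: float     # baseline target (step 2)
    abort_error_rate: float   # abort condition (step 3)


def run(exp: Experiment, probe: Callable[[float], float], samples: int = 5) -> dict:
    """Sample the error rate under the fault and stop early on the abort condition."""
    observed: List[float] = []
    for _ in range(samples):
        rate = probe(exp.magnitude)
        observed.append(rate)
        if rate > exp.abort_error_rate:
            return {"status": "aborted", "observed": observed}
    held = max(observed) <= exp.slo_error_rate
    status = "hypothesis held" if held else "hypothesis failed"
    return {"status": status, "observed": observed}


def fake_probe(magnitude: float) -> float:
    # Stand-in for real monitoring: pretend the error rate grows with added load.
    return 0.001 + 0.02 * magnitude


exp = Experiment(
    hypothesis="A 10% load increase keeps the error rate within the SLO",
    blast_radius="one service",
    magnitude=0.10,
    slo_error_rate=0.01,
    abort_error_rate=0.05,
)
result = run(exp, fake_probe)
print(result["status"])
```

The abort check runs on every sample, so a bad experiment stops before it collects the full set of observations.
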
![[Pasted image 20210211190621.png]]
> In the cloud native world, always assume it will break.
> Always ask what happens when X fails.
- Have a [[Disaster Recovery]] plan
## Questions
### Does baselining metrics need to be done on a per-service or per-environment basis?
Yes, SLOs/SLIs should be per service or per environment (depending on the metric) because otherwise you'd have too many variables.
### Picking on Redis?
It's just really popular, and data loss is a pretty big issue. It's all about reliable user experience.
A memory increase in a Redis deployment will heavily affect performance.
### When should you implement chaos engineering? Design stage?
Depends - earlier would be better, but don't let it stop you if you're already past that stage. It also depends a lot on the organization and industry.
## See also
- [[Chaos Engineering]]