%%
Last Updated:
- [[2021-02-13]]
- [[2021-01-25]]
%%
- from [[Podcast/Test Guild Performance (series)]]
- with [[Joe Colantonio]] and [[Tammy Bryant]]
- [[Chaos Engineering]] #[[Types of chaos experiments]] [[Principles of chaos engineering]]
- Fix failures before they reach production; how to stress test your application
- She was the 9th employee of [[Gremlin]]
- She learned about chaos engineering when working at a mortgage brokerage firm - mortgage brokering is really competitive, and performance really matters. Regulation is also a big part of it.
- Disaster recovery testing.
> [Chaos engineering] is a disciplined approach to identifying failures before they become outages.
^929a6d
- What you THINK will happen vs what actually does happen.
- It's "breaking things on purpose"
- Performance testing vs chaos engineering: chaos engineering takes it to another level and turns off machines in the cloud
- Things that chaos engineering involves
- Chaos Monkey: shutting down instances. Sometimes developers hardcode things, and they shouldn't be doing that.
- Time travel: changing the system clock (for example, to test daylight saving time changes or security certificate expiry)
- Injecting latency, packet loss, packet corruption
- Process killing
- "ALFI" (application-level fault injection) vs "ILFI" (infrastructure-level fault injection)
- supports Java apps
- You can create small blast-radius chaos experiments - it's not just disaster recovery stuff that you can do
- also for serverless applications (like AWS Lambda, Google Cloud Functions, Azure Functions)
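
To make the idea of application-level fault injection with a small blast radius more concrete, here is a minimal Python sketch. It is not Gremlin's ALFI API: `inject_latency`, `blast_radius`, and `get_quote` are hypothetical names made up for illustration, and the "fault" is just a sleep added to a fraction of calls.

```python
import random
import time
from functools import wraps


def inject_latency(delay_seconds, blast_radius, rng=None):
    """Delay a fraction of calls to the wrapped function.

    blast_radius is the fraction of calls affected (0.0 = none, 1.0 = all).
    Hypothetical helper for illustration only; not a real library API.
    """
    if not 0.0 <= blast_radius <= 1.0:
        raise ValueError("blast_radius must be between 0.0 and 1.0")
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Only a subset of calls is hit, which keeps the experiment small.
            if rng.random() < blast_radius:
                time.sleep(delay_seconds)  # the injected fault
                wrapper.faults_injected += 1
            return func(*args, **kwargs)

        wrapper.faults_injected = 0  # counter for how many calls were delayed
        return wrapper

    return decorator


@inject_latency(delay_seconds=0.01, blast_radius=0.1, rng=random.Random(42))
def get_quote(loan_amount):
    # Stand-in for a real service call (hypothetical business logic).
    return round(loan_amount * 0.035, 2)
```

Starting with a low `blast_radius` and raising it gradually mirrors the advice to begin with small experiments rather than full disaster recovery drills.
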
- Requirements: how to get started
- "It's a scientific process"
- sign up for a free Gremlin account
- monitoring and observability tools
- notifications and alerts
- a company culture where it's not just the SREs that have access to relevant metrics
- Actionable advice on chaos engineering: Focus on critical systems, not low-hanging fruit.
- Start locally or in test environments before moving into production
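
The "scientific process" framing (what you think will happen vs what actually happens) can be sketched as a tiny experiment runner. This is an illustrative assumption about how one might structure it, not a method described in the episode: `run_experiment`, `ExperimentResult`, and the toy `state` system are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline: float
    observed: float
    tolerance: float

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds if the metric stayed within tolerance of baseline.
        return abs(self.observed - self.baseline) <= self.tolerance


def run_experiment(hypothesis, measure, start_fault, stop_fault, tolerance):
    """Measure a metric, inject a fault, measure again, then roll back."""
    baseline = measure()
    start_fault()
    try:
        observed = measure()
    finally:
        stop_fault()  # always roll back, even if measuring fails
    return ExperimentResult(hypothesis, baseline, observed, tolerance)


# Toy system under test: requests fail only when no replica is healthy.
state = {"healthy_replicas": 3}


def error_rate():
    return 0.0 if state["healthy_replicas"] > 0 else 1.0


def kill_one():
    state["healthy_replicas"] -= 1


def restore_one():
    state["healthy_replicas"] += 1


result = run_experiment(
    "Losing one replica does not raise the error rate",
    error_rate, kill_one, restore_one, tolerance=0.01,
)
print(result.hypothesis, "->", result.hypothesis_held)
```

In a real setup, `measure` would read from the monitoring and observability tools listed above, and the same runner would be exercised locally or in a test environment before production.
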