%% Last Updated: [[2021-02-13]], [[2021-01-25]] %%

- from [[Podcast/Test Guild Performance (series)]]
- with [[Joe Colantonio]] and [[Tammy Bryant]]
- [[Chaos Engineering]] #[[Types of chaos experiments]] [[Principles of chaos engineering]]
- Fix failures before they go into production; how to stress test your application
- She was the 9th employee of [[Gremlin]]
- She learned about chaos engineering when working at a mortgage brokerage firm
    - Mortgage brokering is really competitive, and performance really matters. Regulation is also a big part of it.
    - Disaster recovery testing.

> [Chaos engineering] is a disciplined approach to identifying failures before they become outages. ^929a6d

- What you THINK will happen vs what actually does happen.
- It's "breaking things on purpose"
- Performance testing vs chaos engineering: chaos engineering takes it to another level and turns off machines in the cloud
- Things that chaos engineering involves
    - Chaos Monkey: shutting down instances. Sometimes developers hardcode things, and they shouldn't be doing that.
    - Time travel: changing clock time (like for daylight saving time or security certificates)
    - Injecting latency, packet loss, packet corruption
    - Process killing
    - "ALFI" (application-level fault injection) vs "ILFI" (infrastructure-level fault injection)
        - ALFI supports Java apps
        - also for serverless applications (like AWS Lambda, Google Cloud Functions, Azure Functions)
    - You can create small blast-radius chaos experiments
        - it's not just disaster recovery stuff that you can do
- Requirements: how to get started
    - "It's a scientific process"
    - sign up for a free Gremlin account
    - monitoring and observability tools
    - notifications and alerts
    - a company culture where it's not just the SREs that have access to relevant metrics
- Actionable advice on chaos engineering: focus on critical systems, not low-hanging fruit.
    - Start locally or in test environments before moving into production
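
My own sketch (not from the episode, and not Gremlin's API) of what a small blast-radius, application-level fault injection experiment could look like in Python. `Fault`, `InjectedFault`, `inject`, `run_experiment`, and `get_rate_quote` are all hypothetical names made up for illustration. It injects latency and failures into a fraction of calls, then compares what we THINK will happen (a maximum failure ratio) with what actually happens.

```python
import random
import time
from dataclasses import dataclass
from functools import wraps


@dataclass
class Fault:
    """Hypothetical fault description; not part of any real tool's API."""
    latency_s: float = 0.0     # extra delay added to each affected call
    error_rate: float = 0.0    # fraction of affected calls that raise
    blast_radius: float = 0.1  # fraction of all calls the fault may touch


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency failure."""


def inject(fault, rng=random, sleep=time.sleep):
    """Wrap a function so that a fraction of its calls are delayed or fail."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Only calls inside the blast radius are affected at all.
            if rng.random() < fault.blast_radius:
                if fault.latency_s > 0:
                    sleep(fault.latency_s)
                if rng.random() < fault.error_rate:
                    raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


def run_experiment(call, attempts, max_failure_ratio):
    """Run the call repeatedly and check the hypothesis against reality."""
    failures = 0
    for _ in range(attempts):
        try:
            call()
        except InjectedFault:
            failures += 1
    observed = failures / attempts
    return {
        "observed_failure_ratio": observed,
        "hypothesis_held": observed <= max_failure_ratio,
    }


if __name__ == "__main__":
    # Assumed numbers: 20% of calls are in the blast radius, half of those fail.
    fault = Fault(latency_s=0.0, error_rate=0.5, blast_radius=0.2)

    @inject(fault, rng=random.Random(42))
    def get_rate_quote():
        return 3.25  # stand-in for a call to a real dependency

    print(run_experiment(get_rate_quote, attempts=1000, max_failure_ratio=0.15))
```

Keeping `blast_radius` small and running this locally or in a test environment first matches the advice in the notes. A real experiment would watch monitoring and alerting rather than just counting exceptions.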