%%
Last Updated:
- [[2021-02-13]]
%%
This type of test scenario assumes the worst: that one or more of your application servers have failed and are unavailable. The point of this test is to determine how gracefully your application recovers and how resilient it is to unexpected failures.
A typical use case is two application servers that share the user load. Apply load as normal, at peak load level or lower, and take note of the number of connections on each server. Each should be shouldering around half of the load. Then shut down one of the application servers on purpose. The number of active users should halve temporarily as the system struggles to recover, and the users connected to the failed server should hopefully see a friendly error page asking them to try again (you can check for this in your scripts) rather than a raw, unfriendly one. After a few minutes, those users should be redirected to the one functioning server and should be able to carry out their tasks again. At that point, the number of connections on the surviving server should match the combined number of connections across both servers at the start of the test, before the shutdown.
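The two checks described above (an even split before the shutdown, and the surviving server absorbing the load afterwards) could be sketched roughly like this. This is a minimal illustration, not part of any load testing tool: the function names, the server names, and the 10% tolerance are all assumptions, and the connection counts are whatever your monitoring reports.

```python
def is_balanced(connections, tolerance=0.1):
    """Return True if every server carries roughly an equal share of the load.

    connections: dict mapping a (hypothetical) server name to its connection count.
    tolerance: allowed deviation from the even share, as a fraction (assumed 10%).
    """
    total = sum(connections.values())
    if total == 0:
        return False  # no load applied, so there is nothing to compare
    expected = total / len(connections)
    return all(
        abs(count - expected) <= tolerance * expected
        for count in connections.values()
    )


def has_recovered(baseline_total, survivor_connections, tolerance=0.1):
    """Return True if the surviving server now carries about the whole original load.

    baseline_total: combined connections across all servers before the shutdown.
    survivor_connections: connections on the one remaining server afterwards.
    """
    return abs(survivor_connections - baseline_total) <= tolerance * baseline_total
```

For example, `is_balanced({"app1": 510, "app2": 490})` passes the pre-shutdown check, and `has_recovered(1000, 960)` passes the post-recovery check under the assumed tolerance.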
Unlike other types of tests, the error rate is less important in the failover test: having a server shut down in the middle of a test is going to produce errors for even the most hardy application. What you are really testing is how well and how quickly your system recovers.
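Since recovery speed is the number that matters here, it helps to compute it rather than eyeball it from a graph. The sketch below assumes you can export a time-ordered list of `(timestamp_seconds, active_users)` samples from your load testing tool; the function name, the sample format, and the 10% tolerance are assumptions made for illustration.

```python
def recovery_time(samples, shutdown_at, baseline, tolerance=0.1):
    """Return the seconds between the shutdown and a sustained recovery.

    samples: list of (timestamp_seconds, active_users), ordered by time.
    shutdown_at: timestamp at which the server was shut down.
    baseline: active users before the shutdown.
    Recovery means active users climb back to within `tolerance` of the
    baseline and stay there for the rest of the samples. Returns None if
    that never happens.
    """
    threshold = baseline * (1 - tolerance)
    recovered_at = None
    for timestamp, active_users in samples:
        if timestamp < shutdown_at:
            continue  # ignore everything before the shutdown
        if active_users >= threshold:
            if recovered_at is None:
                recovered_at = timestamp
        else:
            recovered_at = None  # dipped again, so the recovery did not hold
    if recovered_at is None:
        return None
    return recovered_at - shutdown_at
```

Requiring the recovery to hold until the end of the samples is a deliberate choice: a brief spike back to the baseline followed by another drop should not count as having recovered.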
For those wanting to go even further and test other catastrophic events, I really like Netflix’s [Simian Army](https://flood.io/blog/manufactured-chaos-how-netflix-does-performance-testing/) approach, which conjures up the image of monkeys being set loose in a server room (basically a DevOps engineer’s nightmare). The Chaos Monkey, for instance, shuts down a node at random.
## See also
- [[Chaos Engineering]]
- [[Test Scenarios]]