%%
date:: [[2022-09-29]], [[2022-10-27]]
Parent:: [[Load Testing]], [[Load testing metrics]], [[Server metrics]]
%%
# [[Analyzing load testing results]]
### What are test results?
We tend to place a lot of importance on load test scripting and execution, and those things are important, but they’re all for nothing if you don’t know what the results mean. To run good load tests, you’ll need to get comfortable dealing with data. The quantitative data from a load test may include:
- [[Load testing metrics]]: Data generated by the load testing tool itself while executing the script, including response times and error rates but potentially also debugging information. You’ll get this from your [[Load Testing Tool]].
- [[Server metrics]]: The resource utilization metrics of every server involved in the test, which you’ll get from each server.
  - The resource utilization of every load generator, to rule out any execution issues.
  - The resource utilization of your application servers, to determine how your application actually performed under load.
We’ll refer to all of these collectively as “test results”. The first step is to collate them all in one place; then we can delve into actually making sense of them.
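As a rough sketch of what that collation might look like: the Python script below merges a load testing tool’s per-request export with server-level resource metrics onto a single per-second timeline. The file names and column names are hypothetical placeholders, not any particular tool’s format; adjust them to whatever your tool and monitoring agent actually export.
```python
# A minimal collation sketch: merge the load testing tool's per-request export
# and the servers' resource metrics onto one per-second timeline.
# File names and column names below are hypothetical placeholders.
import csv
from collections import defaultdict

timeline = defaultdict(dict)  # unix second -> everything we know about that second

# Load testing tool output: one row per request.
with open("load_test_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        second = int(float(row["timestamp"]))
        timeline[second].setdefault("requests", []).append({
            "name": row["name"],                                # transaction name
            "response_time_ms": float(row["response_time_ms"]),
            "status": int(row["status"]),
        })

# Server metrics: one row per host per sampling interval (load generators
# and application servers alike).
with open("server_metrics.csv", newline="") as f:
    for row in csv.DictReader(f):
        second = int(float(row["timestamp"]))
        timeline[second].setdefault("servers", {})[row["host"]] = {
            "cpu_percent": float(row["cpu_percent"]),
            "mem_percent": float(row["mem_percent"]),
        }

# Every second of the test now holds both what the tool measured and how the
# servers behaved, which makes it much easier to line up symptoms with causes.
for second in sorted(timeline)[:5]:
    print(second, timeline[second])
```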
## Analyzing results
After collating the relevant metrics, you’ll want to start making sense of them. Metrics are useless if their context is not taken into account. Your job is to use those numbers to tell the story of what happened.
It’s impossible to thoroughly explain how to analyze results in this book, but here are some considerations to keep in mind.
### First: was the load test valid?
Like any good scientist, your first duty after carrying out an experiment is to determine whether or not the conditions of the experiment accurately recreated the scenario you want to test. Here are some questions to ask yourself:
- Was the load test executed for the expected duration?
- Did the load generators display healthy resource utilization for the duration of the test? Were CPU, memory, and network metrics within tolerance?
- Did your load test hit the throughput (requests per second) that you were aiming for? Is this similar to what you would expect in production? Consider drilling down further into separate transactions: are there business processes that are more common in production than in your load test?
- Was the transaction error rate acceptable? How many of those errors were due to script errors or data in the wrong state, rather than genuine application failures? (A quick way to check duration, throughput, and error rate against your targets is sketched after this list.)
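A quick way to answer the duration, throughput, and error rate questions is to compute them straight from the per-request export. This sketch assumes the same hypothetical columns as the collation example above, plus an `error` column that the tool fills in for failed requests:
```python
# Rough validity checks computed from the per-request results.
# Column names (timestamp, error) are hypothetical placeholders.
import csv

with open("load_test_results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

timestamps = [float(r["timestamp"]) for r in rows]
duration_s = max(timestamps) - min(timestamps)
throughput = len(rows) / duration_s if duration_s else 0.0
failed = [r for r in rows if r["error"]]

print(f"Duration:   {duration_s / 60:.1f} minutes")      # does this match the planned duration?
print(f"Throughput: {throughput:.1f} requests/second")   # close to what you expect in production?
print(f"Error rate: {len(failed) / len(rows):.2%}")      # acceptable, and how much of it is script/data noise?
```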
### Next: How did the application handle the load?
Now that you’ve determined that your load test was a good replication of production load, it’s time to figure out how your application fared under that load. Your goal here is to identify whether any performance bottlenecks exist.
- What was the average transaction response time? It’s important to drill down into separate transactions here because not all transactions are alike. Which transactions performed worst? Are all the transaction response times pretty close to each other, or are there some transactions that are far and away slower than the others? Also, look at more than just the average: what are the minimum and maximum response times, and is there a large gap between those two? What is the 90th percentile response time? (A per-transaction breakdown, along with a per-minute error timeline, is sketched after this list.)
- How much of the transaction error rate was caused by legitimate application errors? Were there any HTTP 5xx responses that were returned by your server? Did these errors occur at the start of the test when the users were still ramping up, right after the full number of users was reached, or at the end of the test? Are there clumps of errors during the test, or were they spread out across the entire test? Do errors occur at regular intervals, and if so, were there any scheduled jobs going on at the same time on the application server?
- When the application failed, did it fail gracefully? Did it display a nice error page, or did it simply offer up an unfriendly error? If your application has load balancing, did the load balancer correctly redirect traffic to less utilized nodes?
- Was the resource utilization healthy on all your application servers? Were your nodes similarly utilized? Were there certain points in the test that display higher utilization than others, and what was going on at that time? Did memory utilization increase as the test went on? Did garbage collection occur?
- Do the server logs display any unusual errors? What was the disk queue depth during the test? Did the server run out of hard disk space, and do you have policies in place for backing up and deleting unnecessary data in production?
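The per-transaction breakdown and the error timeline mentioned above might look something like the sketch below, again using the hypothetical per-request columns from the earlier examples:
```python
# Per-transaction response time breakdown plus a per-minute error timeline,
# to spot slow transactions and clumps of failures. Column names are
# hypothetical placeholders from the earlier sketches.
import csv
from collections import defaultdict
from statistics import mean, quantiles

with open("load_test_results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Response time statistics per transaction: min, average, 90th percentile, max.
by_transaction = defaultdict(list)
for r in rows:
    by_transaction[r["name"]].append(float(r["response_time_ms"]))

print(f"{'transaction':<30} {'min':>8} {'avg':>8} {'p90':>8} {'max':>8}")
for name, times in sorted(by_transaction.items()):
    p90 = quantiles(times, n=10)[-1]  # 90th percentile (needs at least two samples)
    print(f"{name:<30} {min(times):>8.0f} {mean(times):>8.0f} {p90:>8.0f} {max(times):>8.0f}")

# Errors per minute of the test: clumps or regular spikes are worth correlating
# with ramp-up phases and with scheduled jobs on the application servers.
start = min(float(r["timestamp"]) for r in rows)
errors_per_minute = defaultdict(int)
for r in rows:
    if r["error"] or int(r["status"]) >= 500:
        errors_per_minute[int((float(r["timestamp"]) - start) // 60)] += 1

for minute in sorted(errors_per_minute):
    print(f"minute {minute:>3}: {errors_per_minute[minute]:>5} errors")
```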
### Finally: If there were bottlenecks, why did they occur?
This is by far the most difficult part of results analysis, and you may have to liaise between different teams in order to determine why your application didn’t behave as expected.
The key here is to go beyond symptoms and actually try to get down to the root cause. If you find yourself saying the following things, it’s a good sign that you haven’t investigated the issue enough:
- “The response time is high because CPU utilization on the server was high.”
- “The error rate was high because the server returned HTTP 500s.”
- “The application was slow because all 1000 users had ramped up.”
- “The login server restarted unexpectedly.”
- “Load was not even across all nodes, so response times on one server were higher than on the others.”
- “Memory utilization was high when the identity verification process was triggered.”
These statements, as true as they may be, don’t really leave you with actionable insights. Instead, ask yourself why several times until you get to the root of the issue.
For instance, take the first statement: “The response time is high because the CPU utilization was high.” Why was the CPU utilization high? Well, because the server was busy processing a lot of information at the time. Why was the server processing so much information? Because a request for the home page retrieves information from many application components before the page is returned. Why does a user browsing to the home page trigger so many requests? Maybe it shouldn’t.
In that case, asking why several times got to the root of the issue: a simple GET of the home page was requesting far more resources and potentially causing higher CPU utilization on the server than was actually necessary.
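One way to test a hypothesis like that is to count how many requests each page actually triggers. The sketch below assumes the load testing tool’s request log tags every request with a hypothetical `group` column naming the page or transaction that issued it; weigh the raw counts against how often each page is visited.
```python
# Count requests per page/transaction group to see whether any single page
# fires a disproportionate number of requests. The "group" column is a
# hypothetical placeholder for whatever grouping your tool records.
import csv
from collections import Counter

requests_per_group = Counter()
with open("load_test_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        requests_per_group[row["group"]] += 1

for group, count in requests_per_group.most_common(10):
    print(f"{group:<30} {count}")
```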
Figuring out the real reason behind the symptoms you’re seeing gives you tangible steps towards addressing them, and also informs management’s decision as to whether or not to proceed with a release.
## Related
- [[Principles for analyzing results]]
- [[Problems in results analysis]]
- [[Common issues in load test results]]
- [[Reporting load testing results]]
### Time-series databases
Time-series databases are well suited to storing load testing results because they situate every data point in time. A sketch of writing a data point to one of them follows this list.
- [[InfluxDB]]
- [[TimescaleDB]]
- [[Prometheus]]
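As an illustration, the sketch below writes a single summary data point to InfluxDB 2.x using its line protocol over the HTTP write API. The URL, org, bucket, token, and the measurement, tag, and field names are all placeholders for this example.
```python
# Write one summary data point to InfluxDB 2.x via the /api/v2/write endpoint.
# URL, org, bucket, token, and the measurement/tag/field names are placeholders.
import time
import urllib.request

INFLUX_URL = "http://localhost:8086/api/v2/write?org=my-org&bucket=load-tests&precision=ns"
INFLUX_TOKEN = "my-token"

# Line protocol: measurement,tag_key=tag_value field_key=field_value timestamp
line = (
    "http_req_duration,transaction=login,run=baseline "
    f"p90=412.5,error_rate=0.012 {time.time_ns()}"
)

request = urllib.request.Request(
    INFLUX_URL,
    data=line.encode("utf-8"),
    headers={"Authorization": f"Token {INFLUX_TOKEN}"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # 204 means InfluxDB accepted the point
```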
### Results visualization tools
- [[Tableau]]
- [[Grafana]]
- [[New Relic]]