%% date:: [[2021-06-02]], [[2021-05-20]], [[2022-09-29]] %%

# [[Common issues in load test results]]

On this page are patterns that commonly appear in load testing results and may indicate underlying issues or performance bottlenecks.

## Server errors

What constitutes an error depends on the protocol used by the load testing script. By default, HTTP responses from the application server with 4xx or 5xx status codes are considered HTTP errors. Failed responses are often returned much faster than successful responses, so an increased HTTP error rate may produce misleading request rate and response time metrics.

### High error rate

This issue occurs when the test results show an error rate (the ratio of errors to total requests) that is high compared to the error rate in your [[Baseline test]].

![[high-error-rate.excalidraw.png]]
%%[[high-error-rate.excalidraw|🖋 Edit in Excalidraw]], and the [[Common issues in load test results 2022-09-29 16.30.58.excalidraw.dark.png|dark exported image]]%%

%%
![[perf_alert-high_http_failure_rate.png]]
[Docs](https://k6.io/docs/cloud/analyzing-results/performance-insights/#high-http-failure-rate)
[Sample run triggering this alert](https://app.k6.io/runs/979205)
%%

#### Diagnosis

Determine the error rate for every transaction or request, and sort them from the highest error rate to the lowest. Compare these error rates to those of the same transactions in a baseline test and determine whether the difference is significant for your test scenario.

#### Recommendations

- Run a single iteration of the script locally to troubleshoot the failing requests before running a load test.
- In the script, verify that the failing requests are formulated correctly and return the expected response.
- Verify that any user accounts used have sufficient permissions to access the application.
- Verify that the application is publicly accessible. If it is behind a firewall, consider using [local execution](https://k6.io/docs/testing-guides/automated-performance-testing/#local-vs-cloud-execution) for your tests.
- Check for web server misconfiguration or internal errors caused by saturation of a resource (CPU, memory, disk I/O, or database connections).
- If you are intentionally testing HTTP error responses, you can [change which HTTP codes should be classified as errors](https://k6.io/docs/javascript-api/k6-http/setresponsecallback-callback).
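For that last recommendation, here is a minimal sketch (not tied to any particular test) that uses `http.setResponseCallback()` to treat deliberately triggered 404s as expected responses; the URL and status list are placeholders:

```javascript
import http from 'k6/http';

// Treat 2xx-3xx responses and deliberate 404s as "expected",
// so intentional error-path tests don't inflate the HTTP failure rate.
http.setResponseCallback(http.expectedStatuses({ min: 200, max: 399 }, 404));

export default function () {
  // This request is expected to return 404 and will no longer count as a failure.
  http.get('https://test.k6.io/this/path/does/not/exist');
}
```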
### Increased HTTP failure rate

This issue is similar to [[Common issues in load test results#High error rate|High error rate]], except that the error rate increases as load increases. When the error rate is high for the whole test, the chances are higher that the errors are due to configuration or test script issues. When error rates start low and then increase as more load is applied, that is a signal that there might be a performance bottleneck.

%%
![[perf_alert-increased_http_failure_rate.png]]
[Docs](https://k6.io/docs/cloud/analyzing-results/performance-insights/#increased-http-failure-rate)
[Sample run triggering this alert](https://app.k6.io/runs/978769)
%%

A significant increase in the HTTP error rate, coupled with an increase in the number of VUs or requests per second, suggests that the target system is close to its performance limit.

#### Diagnosis

Graph the request rate and the error rate over time.

![Graph of increased HTTP failure rate in k6](assets/increased_http_failure_detected.png)

#### Recommendations

- Review the web server configuration for timeouts, rate limiting, or anything else that might explain why it's returning errors.
- Check for internal errors caused by saturation of a resource (CPU, memory, disk I/O, or database connections).
- If you are intentionally testing HTTP error responses, you can [change which HTTP codes should be classified as errors](https://k6.io/docs/javascript-api/k6-http/setresponsecallback-callback).

## Response time

### The knee/hockey puck

This pattern is often called a "knee" or "hockey puck" because it represents a turning point after which the response time increases and does not recover (does not decrease to acceptable levels again).

![[response-time-hockey-puck.excalidraw.png]]
%%[[response-time-hockey-puck.excalidraw|🖋 Edit in Excalidraw]], and the [[Common issues in load test results 2022-09-29 17.09.34.excalidraw.dark.png|dark exported image]]%%

The arrow in the graph above marks the point of no return: at a certain level of load, the system stopped responding within acceptable levels and could no longer keep up with the traffic. This inflection point is important, as it can give you an indication of your system's breakpoint.

#### Recommendations

- Run the test again, with the same test parameters, and verify whether you get the same result.
- Consider using a stepped pattern that ramps up more gradually, to see if you can determine at what point the load becomes unmanageable for the system. For example, if you see the hockey puck at 600 VUs, try a stepped scenario that runs:
    - 200 VUs for 10 minutes
    - 400 VUs for 10 minutes
    - 600 VUs for 10 minutes
    - 800 VUs for 10 minutes
    - 1000 VUs for 10 minutes

This stepped pattern makes the increase of VUs more gradual, giving the system more time to adjust to each level of load and making it clearer when (or at what level of load) the system begins to struggle. A sketch of this scenario as a k6 script follows.
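The sketch below is one way to express that stepped pattern. The endpoint is a placeholder, and the one-minute ramps between steps are an assumption; adjust both to your own test:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// Each pair of stages ramps to the next step over 1 minute, then holds it for 10 minutes.
export const options = {
  stages: [
    { duration: '1m', target: 200 },  { duration: '10m', target: 200 },
    { duration: '1m', target: 400 },  { duration: '10m', target: 400 },
    { duration: '1m', target: 600 },  { duration: '10m', target: 600 },
    { duration: '1m', target: 800 },  { duration: '10m', target: 800 },
    { duration: '1m', target: 1000 }, { duration: '10m', target: 1000 },
  ],
};

export default function () {
  http.get('https://test.k6.io/'); // placeholder: replace with the system under test
  sleep(1);
}
```

Comparing the response time graph against these plateaus makes it easier to see which step first pushes the system past its knee.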
### Spike

A spike in response times is a sharp and marked increase, followed by a subsequent drop.

![[response-time-spike.excalidraw.png]]
%%[[response-time-spike.excalidraw|🖋 Edit in Excalidraw]], and the [[Common issues in load test results 2022-09-29 17.13.32.excalidraw.dark.png|dark exported image]]%%

One reason that response time spikes may occur is that many requests are queuing: the requests have been sent to the server, but the server is not responding to them. This may be easier to see when we graph the [[Raw data|raw results]], which might look something like this:

![[response-time-queueing.excalidraw.png]]
%%[[response-time-queueing.excalidraw|🖋 Edit in Excalidraw]], and the [[Common issues in load test results 2022-09-29 17.15.36.excalidraw.dark.png|dark exported image]]%%

In the graph above, we're seeing every individual data point of the response time rather than the aggregated line. Whereas the line seemed to indicate a gradual increase and then a gradual decrease, the raw data shows that, at a certain point in time, the server stopped responding to requests. This could indicate a performance bottleneck. Examine the server logs and resource utilization to determine what the server was doing at that time and why it wasn't responding to requests.

### Regular intervals

Any overly regular or "clean" pattern in your load testing results should give you pause. Real load is usually not that regular.

![[response-time-regular-intervals.excalidraw.png]]
%%[[response-time-regular-intervals.excalidraw|🖋 Edit in Excalidraw]], and the [[Common issues in load test results 2022-09-29 17.17.57.excalidraw.dark.png|dark exported image]]%%

#### Recommendations

- Run the test again to see if it's repeatable.
- Check your script to see if there's anything that might be artificially causing this pattern. If you have constant `sleep`s or think time, consider making them [[Dynamic think time and pacing|dynamic]] instead, and rerunning the test.
- Look at server/component logs for any regular jobs that match the intervals. Intensive tasks or polling might affect response times.
- Consider narrowing the scope of the test to help you determine whether the regular response time spikes occur across the system, or only when it touches certain components. Experimenting with your script's scope may help pinpoint where the problem lies.

### Gradual increase

A gradual increase in response times *when the test throughput is not increasing* may indicate longer-term issues like memory leaks.

![[response-time-gradual-increase.excalidraw.png]]
%%[[response-time-gradual-increase.excalidraw|🖋 Edit in Excalidraw]], and the [[Common issues in load test results 2022-09-29 17.19.59.excalidraw.dark.png|dark exported image]]%%

Memory leaks tend to surface only under sustained load, and they are one of the most common issues a [[Soak Test]] can reveal.

## Throughput

### Throughput limit exceeded

%%
![[perf_alert-throughput_limit.png]]
[Docs](https://k6.io/docs/cloud/analyzing-results/performance-insights/#throughput-limit)
[Sample run triggering this alert](https://app.k6.io/runs/978776)
%%

When the throughput limit is reached, the system under test is overloaded and struggling to process requests. A telltale sign is that the average response time continues to increase while the number of processed requests per second (throughput) flatlines, suggesting that requests have started to queue on the application server(s).

#### Diagnosis

![Increasing response time and flatlined throughput in k6](assets/throughput_limit_detected-graph.png)

#### Recommendations

- Investigate the cause of the bottleneck using an APM or server monitoring tool.
- After making changes, re-run this test to determine whether the issue has been resolved.

## Test validity

### Check or assertion failures

### High Load Generator CPU Usage

%%![Current implementation of this alert](images/performance-insights/perf_alert-high_CPU.png)
[Docs](https://k6.io/docs/cloud/analyzing-results/performance-insights/#high-load-generator-cpu-usage)
[Sample run triggering this alert](https://app.k6.io/runs/978917)%%

As a rule of thumb, high CPU usage is a consistent utilization of 80% (system + user) or more throughout the test. When the resources on a load generator are consistently high, test results may become erratic and inaccurate as the load generator struggles to send requests and process responses.

#### Diagnosis

Measure Processor Time or CPU utilization for both system and user processes. Graph it over time and check whether the total CPU utilization goes above 80%.

Note: It is normal for utilization to spike at the beginning of a test as a load generator performs startup processes, but it should fall after that.

#### Recommendations

- Add or increase think time using [sleep](https://k6.io/docs/javascript-api/k6/sleep-t/) to slow down the request rate (see the sketch after this list). You may want to increase the number of total VUs to compensate.
- Select more load zones to spread out the number of VUs across more regions and load generators.
- Remove or consolidate logs and custom metrics in the test script.
- [Set a threshold](https://k6.io/docs/using-k6/thresholds/) for CPU utilization to be notified when this happens again.
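To illustrate the think-time recommendations above (both the dynamic think time suggested for regular intervals and the `sleep` suggested here), a minimal sketch with a placeholder endpoint and an arbitrary 2 to 5 second range:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export default function () {
  http.get('https://test.k6.io/'); // placeholder: replace with the system under test

  // Pause each VU for a random 2-5 seconds between iterations. This slows the
  // request rate per VU and, unlike a constant sleep, avoids artificially
  // regular traffic patterns.
  sleep(2 + Math.random() * 3);
}
```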
### High Load Generator Memory Usage

%%![Current implementation of this alert](images/performance-insights/perf_alert-high_memory.png)
[Docs](https://k6.io/docs/cloud/analyzing-results/performance-insights/#high-load-generator-memory-usage)
[Sample run triggering this alert](https://app.k6.io/runs/979578)
%%

This issue occurs when a cloud load generator shows high memory utilization during the test. When the resources on a load generator are consistently high, test results may become erratic and inaccurate as the performance of the load generator itself begins to degrade. High memory utilization can also cause [high CPU utilization](https://k6.io/docs/cloud/analyzing-results/performance-insights/#high-load-generator-cpu-usage). A good rule of thumb is that memory usage should not consistently exceed 80% during a test.

Graph the memory utilization of each load generator over time and check whether it consistently exceeds that level.

#### Recommendations

- Add or increase think time using [sleep](https://k6.io/docs/javascript-api/k6/sleep-t/) to slow down the request rate. You may want to increase the number of total VUs to compensate.
- Select more load zones to spread out the number of VUs across more regions and load generators.
- Remove or consolidate logs and custom metrics in the test script.
- [Set a threshold](https://k6.io/docs/using-k6/thresholds/) for memory utilization to be notified when this happens again.
- Consider [discarding response bodies](https://k6.io/docs/using-k6/options#discard-response-bodies) where possible, and use `responseType` to capture only the response bodies you require.
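As a sketch of that last recommendation (with placeholder URLs and a trivial check), discard response bodies globally and re-enable them only for the requests whose bodies you actually read:

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  // Don't keep response bodies in memory by default.
  discardResponseBodies: true,
};

export default function () {
  // Most requests don't need their response bodies at all.
  http.get('https://test.k6.io/');

  // Request the body only for the responses you actually inspect.
  const res = http.get('https://test.k6.io/contacts.php', { responseType: 'text' });
  check(res, { 'body is not empty': (r) => r.body && r.body.length > 0 });
}
```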