Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region

- Author: [[AWS]]
- Full Title: Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region
- Tags:: [[Incident Postmortem]] [[Chaos Engineering]]
- URL: https://aws.amazon.com/message/11201/
- # Summary
- On [[2020-11-25]], [[Amazon Kinesis]] experienced an outage in the US-EAST-1 Region.
- The outage was triggered by a small capacity addition to the front-end fleet, which pushed the thread count on every front-end server past a limit set in the OS configuration.
- Amazon Kinesis is a service for real-time processing of data streams. Many other AWS services depend on it and were affected as well, including [[CloudWatch]], [[Amazon Cognito]], [[AWS Elastic Container Service]], and [[AWS Elastic Kubernetes Service]].
- Customer communication was delayed because the tool used to update the status dashboard was itself affected, and support teams had not been adequately trained on the backup tool. [[Human factors in tech]]
- AWS plans to address this by moving the Kinesis front-end fleet to servers with more CPU and memory (fewer servers, so fewer threads per server), adding fine-grained alarming on thread consumption, increasing thread count limits, and decreasing cold-start time.
- ### Highlights first synced by [[Readwise]] [[2021-01-20]]
- We wanted to provide you with some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on November 25th, 2020.
- Amazon Kinesis enables real-time processing of streaming data.
- The trigger, though not root cause, for the event was a relatively small addition of capacity
- Streams are spread across the back-end through a sharding mechanism owned by a “front-end” fleet of servers. A back-end cluster owns many shards and provides a consistent scaling unit and fault-isolation. The front-end’s job is small but important. It handles authentication, throttling, and request-routing to the correct stream-shards on the back-end clusters.
- The capacity addition was being made to the front-end fleet.
- The diagnosis work was slowed by the variety of errors observed.
- The resources within a front-end server that are used to populate the shard-map compete with the resources that are used to process incoming requests.
- the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration
- As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
- we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet.
- We are adding fine-grained alarming for thread consumption in the service. We will also finish testing an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server
- we are making a number of changes to radically improve the cold-start time for the front-end fleet
- CloudWatch uses Kinesis Data Streams for the processing of metric and log data.
- Amazon Cognito uses Kinesis Data Streams to collect and analyze API access patterns.
- Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers. As a result, Cognito customers experienced elevated API failures and increased latencies for Cognito User Pools and Identity Pools, which prevented external users from authenticating or obtaining temporary AWS credentials
- While some CloudWatch metrics continued to be processed throughout the event, the increased error rates and latencies prevented the vast majority of metrics from being successfully processed
- While CloudWatch currently relies on Kinesis for its complete metrics and logging capabilities, the CloudWatch team is making a change to persist 3-hours of metric data in the CloudWatch local metrics data store.
- reactive AutoScaling policies that rely on CloudWatch metrics experienced delays until CloudWatch metrics began to recover
- Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation
- CloudWatch Events and EventBridge experienced increased API errors and delays in event processing
- Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) both make use of EventBridge to drive internal workflows used to manage customer clusters and tasks. This impacted provisioning of new clusters, delayed scaling of existing clusters, and impacted task de-provisioning.
- we experienced some delays in communicating service status to customers during the early part of this event.
- During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event.
- Going forward, we have changed our support training to ensure that our support engineers are regularly trained on the backup tool for posting to the Service Health Dashboard.
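
A rough sketch of why a small capacity addition could push every front-end server over the thread limit at once, and why fewer, larger servers help. It assumes each server needs about one thread per peer in the fleet plus a fixed baseline, which is one reading of the "threads required by each server to communicate across the fleet" highlight. The function names, the limit, and the fleet sizes are all hypothetical, not AWS figures.

```python
"""Toy model: per-server thread use in a fully meshed front-end fleet."""


def threads_per_server(fleet_size: int, base_threads: int) -> int:
    # Assumed model: one thread per peer, plus a fixed amount for other work.
    return base_threads + (fleet_size - 1)


def max_fleet_size(thread_limit: int, base_threads: int) -> int:
    # Largest fleet for which threads_per_server() stays at or below the limit.
    return thread_limit - base_threads + 1


def should_alarm(fleet_size: int, thread_limit: int, base_threads: int,
                 threshold: float = 0.8) -> bool:
    # Fire once thread use reaches the given fraction of the limit,
    # i.e. well before the limit itself is hit.
    usage = threads_per_server(fleet_size, base_threads) / thread_limit
    return usage >= threshold


if __name__ == "__main__":
    THREAD_LIMIT = 10_000  # hypothetical OS thread limit per server
    BASE_THREADS = 500     # hypothetical threads used for non-peer work
    for fleet in (4_800, 9_400, 9_600):
        used = threads_per_server(fleet, BASE_THREADS)
        status = "OVER LIMIT" if used > THREAD_LIMIT else "ok"
        print(f"fleet={fleet:>5}  threads/server={used:>6}  {status}")
```

In this model a fleet of 9,400 servers already sits at about 99% of the limit, so adding 200 servers puts every server over it at the same moment. An alarm at 80% of the limit would have fired long before that, and halving the fleet with larger servers roughly halves the per-server thread count.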