%% date:: [[2023-04-25]], [[2023-04-26]], [[2023-05-09]], [[2023-07-03]] parent:: %%

# [[Monitoring Kubernetes]]

[[Monitoring]] is essential for keeping track of [[Performance|Application Performance]] and preventing failures, but applications deployed on [[Kubernetes]] clusters present some unique challenges that require specialized solutions.

## Difficulties in monitoring Kubernetes

- High complexity resulting from [[Microservices|microservices-based architecture]]
- It can be tricky to see what has happened inside containers because they (and the pods they reside in) are [[Key characteristics of Kubernetes#Pods are ephemeral|ephemeral]] and replaced quickly. [^sysdig]
- Clusters require both high-level (on the level of [[Service Level Objective (SLO)|SLOs]] and [[Performance Thresholds|Thresholds]]) and low-level (on the level of resource metrics) monitoring.

## What to monitor

### [[Metrics]]

Kubernetes adds another layer to be monitored. You'll want to get metrics from multiple levels:

- Network level
- Node level
- Application level
- Pod level
    - resource utilization (CPU, memory, network)
    - number of pod restarts
    - pod status

### [[Logs]]

[[Kubernetes#kubelet]] collects logs at the container level, making it easier to troubleshoot an issue.

### [[Traces]]

## Common mistakes in monitoring Kubernetes

### Not doing the right amount of monitoring

It can be easy to do too little monitoring of Kubernetes clusters. There are multiple levels that you should consider, from the network to the individual pod, and each level may require different signals.

However, doing *too much* monitoring can be a problem as well, because having too much data can have the same effect as not having it at all: you struggle to make sense of a sea of meaningless signals. Applications are complex, and if you track everything, you could very quickly end up with hundreds of metrics and dozens of labels, all of which take significant effort to comprehend.

Ideally, you would monitor *just* enough to troubleshoot issues when they occur, and not so much that it becomes overwhelming or an unnecessary resource hog.

### Not doing the right amount of alerting

If you are getting telemetry from your application but you haven't set up alerting rules, you're dependent on a human watching dashboards. Humans make mistakes. Determine thresholds for performance on multiple levels, identify key components, and set up alerts so that you're notified when things start to go south.

It's also possible to do *too much* alerting: you could end up in the situation of an application that "cried wolf". If your team is alerted too often for insignificant events, you'll become desensitized to alerts in general. If an alert's usefulness is ambiguous, reconsider whether it should be an alert at all.

### Not monitoring resource utilization

Setting resource limits when you can is a good idea, but it's not enough. You should still monitor CPU and memory utilization at multiple levels. Not doing so exposes you to underscaling or overscaling. When you underscale, you risk performance issues. When you overscale, you risk going over budget unnecessarily.
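As a rough illustration of what a pod-level resource check can look like, here's a minimal sketch using the official `kubernetes` Python client. It flags containers that declare no limits and prints live usage from the metrics.k8s.io API; it assumes a working kubeconfig and a metrics-server running in the cluster, and the `default` namespace is just a placeholder.

```python
"""Minimal sketch: spot containers with no resource limits and print live
CPU/memory usage. Assumes the official `kubernetes` Python client, a valid
kubeconfig, and metrics-server installed (for the metrics.k8s.io API)."""
from kubernetes import client, config

config.load_kube_config()   # use config.load_incluster_config() when running in a pod
core = client.CoreV1Api()
custom = client.CustomObjectsApi()
namespace = "default"       # placeholder namespace

# Containers without limits are the usual suspects for under- or overscaling.
for pod in core.list_namespaced_pod(namespace).items:
    for container in pod.spec.containers:
        limits = container.resources.limits if container.resources else None
        if not limits:
            print(f"{pod.metadata.name}/{container.name}: no resource limits set")

# Live usage as reported by metrics-server (raw quantities, e.g. "250m" CPU, "128Mi" memory).
pod_metrics = custom.list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1", namespace=namespace, plural="pods",
)
for item in pod_metrics["items"]:
    for container in item["containers"]:
        usage = container["usage"]
        print(f'{item["metadata"]["name"]}/{container["name"]}: '
              f'cpu={usage["cpu"]}, memory={usage["memory"]}')
```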
### Not monitoring network

The network isn't often considered as part of an application, yet it can drastically affect the end [[User Experience]]. It's important to also keep track of network metrics such as:

- [[Latency]]
- [[Network bandwidth]]
- [[Packet loss]]
- Network errors

These metrics, when not accounted for, may cloud other data and lead you to jump to false conclusions about your application's performance. Network metrics are easy to miss because they're recorded at the node level and may not be captured by Kubernetes metrics.

<iframe width="560" height="315" src="https://www.youtube.com/embed/DryiirlkKIo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

### Not monitoring cost

Autoscaling and undisciplined resource usage can quickly get out of hand. At a high level, make sure that you have visibility into the infrastructure expenditure of each team so that you can identify organizational bottlenecks or tool inefficiencies.

### Not doing meta-monitoring

It's time to think of your observability stack as something that itself needs to be performant, highly available, and resilient. Who watches the watchers? If your observability stack fails, are you just going to fly blind? Are there alerts to let you know it has failed? Plan for failure.

### Only monitoring in [[Production-like environment|production]]

The problem with monitoring only in production is that, in the best case, you're notified about an issue only after it has already been deployed. Production monitoring is necessary, but it is reactive. Instead, you can be *proactive* by practicing [[Shift-left observability]] and monitoring pre-production environments.

However, monitoring empty environments (environments that aren't [[Production-like environment|production-like]] in terms of data or traffic) is of limited use. Instead, you can use [[Load Testing Tool|load testing tools]] to make pre-production environments more realistic first. Using testing and observability together lets you identify issues before they are deployed.

## Tools for monitoring Kubernetes

- (Commercial) [[Grafana Cloud]] has a solution called [[Grafana Kubernetes Monitoring]]
- ([[Open-source|OSS]]) [[Prometheus Operator]]
- [[Kubernetes Dashboard]]
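If you go the [[Prometheus Operator]] route, the signals discussed above become queryable over Prometheus's HTTP API. A minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` (e.g. via `kubectl port-forward`) and that kube-state-metrics is being scraped so restart counters are available:

```python
"""Minimal sketch: query pod restart counts from Prometheus's HTTP API.
Assumes Prometheus is reachable at http://localhost:9090 and that
kube-state-metrics is scraped (it exposes the restart counter used here)."""
import requests

PROM_URL = "http://localhost:9090"  # assumed address; adjust to your setup

# Pods with at least one container restart over the last hour.
query = "increase(kube_pod_container_status_restarts_total[1h]) > 0"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    value = float(result["value"][1])
    print(f'{labels.get("namespace")}/{labels.get("pod")}: '
          f'{value:.0f} restarts in the last hour')
```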