How to avoid the most common Kubernetes monitoring mistakes

# [[How to avoid the most common Kubernetes monitoring mistakes]]

## Details

Conference:: ObservabilityCON on the Road Singapore
Date:: 2023-05-16
Length:: 30 minutes
Location:: Raffles City Convention Center

%% Use `OCR_Singapore_2023` code for Expensify/Navan %%

## Abstract

Which metrics should you collect? What dashboards are best suited to effectively monitor Kubernetes clusters? How do you measure resource utilization for capacity planning? Often, this is all a game of trial and error – and one that business-critical services cannot afford. Learn how Grafana Cloud’s K8s monitoring solution was built so you can avoid the guessing game and kickstart your K8s observability strategy in minutes. In this session, we break down the most common Kubernetes monitoring mistakes and share best practices on how to set up Kubernetes monitoring the optimal way.

## Resources

- [Speaker brief for OCR (all cities)](https://docs.google.com/document/d/1wohZNNrCiBbhxx_j718sLP5whCIGEUhaiDkPVuGzEL0/edit)
- [Presentation decks for all cities](https://drive.google.com/drive/folders/16DC5BXA-xgUf1uucjMaqBMPbj760luXR)
- [My presentation deck for OCR Singapore](https://docs.google.com/presentation/d/1OHIUw0Kxq7DVF6sKP3iUICYncuOZoxd0AgxWIKA9uEo/edit#slide=id.g152242e6d63_0_390)
- [Video of this talk in OCR Berlin](https://drive.google.com/file/d/1fum2-gvMyfH-yyoZSKTcm033GjjhqZzV/view?usp=share_link) by [[Jake Swiss]]

## The presentation

[[Presentation/Mine/How to avoid the most common Kubernetes monitoring mistakes]]

## Demo

- Log onto [the SE demo block](https://sedemo.grafana.net/) via Google SSO.
- Go to Menu > Observability > Kubernetes.
- Click on Configuration and walk them through all the config options.
    - Stress that this is still less config than they would normally have to do.
    - Agent/Agent Operator difference.
- Click on Dashboards > Kubernetes / Compute Resources / Multi-Cluster
    - Talk them through the CPU and memory utilisation that you can see for all clusters
- Click on Alerts
    - Fully-configurable to meet team's needs
- Use case: Spot errors.
    - Click on Kubernetes again to see all the namespaces.
    - Click on `broken` namespace.
    - Click on `nginx-crashing` workload.
    - No efficiency data is being populated.
    - Check number of restarts.
    - Go down to the Container section and see that the Waiting Reason is *CrashLoopBackOff*.
    - Easy to see all of this without having to `kubectl` all the things.
- Use case: pod utilization.
    - Click on Kubernetes again to see all the namespaces.
    - Click on `production` namespace.
    - Click on `grafana` workload.
    - CPU utilization > 100%
    - Click on one of the pods to drill down.
    - Scroll down to show the pod logs and events.
- Use case: Managing utilization or cloud spend budget.
    - Click on Cluster Efficiency Overview.
    - Cluster CPU efficiency, and then Node CPU Efficiency.
    - Show that some are well-utilized (green) and two are underutilized (yellow).
    - When some are over-utilized (red), that presents a reliability risk.
    - Maybe do Predict CPU usage
    - Get more information about nodes by clicking on Monitoring > Nodes.
    - You can get more information about what pods the node is running.
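
Backup talking point for the "without having to `kubectl` all the things" line: a rough Python sketch of two of the checks above done by hand. It assumes pod data shaped like the output of `kubectl get pods -A -o json`. The sample pod name, the restart count, and the 40%/90% efficiency thresholds are made-up illustrative values, not what Grafana Cloud uses.

```python
from typing import Any

# Hypothetical sample, trimmed to the fields the check reads.
SAMPLE_POD_LIST: dict[str, Any] = {
    "items": [
        {
            "metadata": {"namespace": "broken", "name": "nginx-crashing-abc123"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "nginx",
                        "restartCount": 7,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"namespace": "production", "name": "grafana-0"},
            "status": {
                "containerStatuses": [
                    {"name": "grafana", "restartCount": 0, "state": {"running": {}}}
                ]
            },
        },
    ]
}


def crashlooping_containers(pod_list: dict[str, Any]) -> list[tuple[str, str, str, int]]:
    """Return (namespace, pod, container, restarts) for containers
    whose waiting reason is CrashLoopBackOff."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append(
                    (
                        meta.get("namespace", ""),
                        meta.get("name", ""),
                        status.get("name", ""),
                        status.get("restartCount", 0),
                    )
                )
    return found


def classify_cpu_efficiency(
    used_cores: float,
    capacity_cores: float,
    low: float = 0.4,   # assumed "underutilized" cutoff
    high: float = 0.9,  # assumed "over-utilized" cutoff
) -> str:
    """Map a usage/capacity ratio onto the demo's colour bands."""
    if capacity_cores <= 0:
        return "unknown"
    ratio = used_cores / capacity_cores
    if ratio > high:
        return "red"      # over-utilized: reliability risk
    if ratio < low:
        return "yellow"   # underutilized: wasted spend
    return "green"        # well-utilized


if __name__ == "__main__":
    for namespace, pod, container, restarts in crashlooping_containers(SAMPLE_POD_LIST):
        print(f"{namespace}/{pod} [{container}] restarts={restarts}")
    print(classify_cpu_efficiency(used_cores=0.5, capacity_cores=4.0))
```

The point to make with this: every new signal means another script like this to write and maintain, which is the trial and error the prebuilt views are meant to remove.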