Sustainable observability

# [[Sustainable observability]]

![[Sustainable observability.svg]]

Sustainable observability is the practice of monitoring and minimizing the [[Carbon footprint]] of the computing resources used for [[Observability]].

## Why should we care?

Some analysts suggest that [[Cloud computing]] accounts for 7-12% of [[Global Energy Consumption]][^1]. If we take 10% as a convenient estimate, cloud computing uses about 18,638 TWh a year. While no figures exist for observability infrastructure in particular, it's reasonable to assume that its environmental cost (ultimately measured in terms of [[CO2 Emissions]]) is significant. We often don't think about what it costs just to observe our applications, but we should.

Even if you don't care about the environment, reducing environmental costs often (though not always) coincides with reducing operational dollar costs, because the biggest cost-reduction measures are simple, low-hanging fruit, like not starting computing resources you aren't going to use in the first place.

## Why it's hard

- Cloud computing often runs on [[Kubernetes]] clusters whose ephemeral resources can be difficult to measure
- Cloud resources are really easy to start up and lose track of
- The [[Noisy neighbors problem]]: On [[Cloud Provider]]s' platforms, it's common for multiple tenants to share the same physical infrastructure, which makes it hard to measure the utilization or cost of any single tenant. Even within a single company, workloads on a Kubernetes cluster often share physical nodes, which causes the same problem.
- Very large machines (4x32 cores) often run into a [[NUMA Bug]] that causes one machine out of a thousand to consume inexplicably more resources than the others.
- Cloud providers often don't calculate or release accurate data about energy consumption or CO2 emissions because doing so runs contrary to their interests.
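
To make the shared-node problem above concrete, here is a minimal Python sketch of one common workaround: splitting a node's measured energy across tenants in proportion to their CPU time. The function name, the tenant names, and the numbers are all hypothetical, and proportional-to-CPU attribution is an assumption of this sketch, not a method taken from any particular tool. It ignores idle power, memory, and I/O, which is part of why per-tenant figures are hard to trust.

```python
from typing import Dict


def attribute_node_energy(
    node_energy_kwh: float,
    cpu_seconds_by_tenant: Dict[str, float],
) -> Dict[str, float]:
    """Split a shared node's energy across tenants by share of CPU time.

    Assumption (hypothetical model): energy scales linearly with CPU time.
    Idle power, memory, and I/O are not accounted for.
    """
    total_cpu = sum(cpu_seconds_by_tenant.values())
    if total_cpu <= 0:
        # No CPU time recorded, so there is nothing to apportion.
        return {tenant: 0.0 for tenant in cpu_seconds_by_tenant}
    return {
        tenant: node_energy_kwh * cpu / total_cpu
        for tenant, cpu in cpu_seconds_by_tenant.items()
    }


if __name__ == "__main__":
    # Hypothetical example: a node that used 2.0 kWh, shared by two tenants.
    shares = attribute_node_energy(2.0, {"team-a": 750.0, "team-b": 250.0})
    print(shares)
```

Any attribution like this is only as good as its model: two tenants with the same CPU time can draw very different power, so treat the output as an estimate for tracking trends, not as an exact bill.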
## How to do it

### Avoid creating new resources in the first place

By far the best thing you can do to reduce both the operational and environmental costs of your observability stack is to avoid creating resources you won't use. This includes:

- Not emitting [[Telemetry]] for debugging (such as verbose logs)
- Not collecting telemetry that you don't need
- Not starting new compute when you can reuse other instances
- Adding [[Resource limits]] to pods in Kubernetes so that new nodes aren't created unnecessarily

### Reduce existing resources

The next best thing you can do is to reduce the resources you already have. The goal here is not to sacrifice functionality; it's just to make what you have more efficient. This includes:

- [[Sampling]] telemetry instead of storing every single data point: intelligently and intentionally drop telemetry you collect if it isn't useful
- Having your [[Telemetry collector]] aggregate information where possible (and drop the raw data points)
- Routinely shutting down compute that isn't being used and isn't getting traffic

### Make existing resources scalable (up and down)

Use [[Autoscaling]] to enable your infrastructure to automatically adapt to increasing *and decreasing* demand. You can do this at different levels.

On the pod level, you can use the Kubernetes project's own tools:

- [[Kubernetes HorizontalPodAutoscaler|Horizontal Pod Autoscaler]] (HPA) scales the number of pod replicas up or down based on metrics like CPU, memory, or custom metrics
- [[Kubernetes Vertical Pod Autoscaler|Vertical Pod Autoscaler]] (VPA) adjusts CPU and memory [[Resource limits]] and requests without adding more replicas

On the cluster level, you can use the [[Kubernetes Cluster Autoscaler]] or [[Karpenter]] by [[AWS]]. Both add and remove nodes in your cluster as needed, rescheduling pods onto existing nodes and cleaning up the now-unused nodes.
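
As a rough illustration of the pod-level scaling described above, the Kubernetes documentation gives the HPA's core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below implements only that rule. The function name and the `min_replicas`/`max_replicas` defaults are hypothetical, and the real controller also handles stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # hypothetical default for this sketch
    max_replicas: int = 10,   # hypothetical default for this sketch
    tolerance: float = 0.1,   # matches the documented HPA default
) -> int:
    """Simplified version of the HPA replica calculation."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU against a 60% target: scale up.
    print(desired_replicas(4, 90.0, 60.0))
    # 4 replicas at 15% average CPU against a 60% target: scale down.
    print(desired_replicas(4, 15.0, 60.0))
```

The scale-down case is the one that matters for sustainability: the same rule that adds replicas under load also removes them when demand drops, provided `min_replicas` is set low enough to allow it.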
Other cloud providers have provider-specific autoscalers too:

- [[GKE Autoscaler]] for [[Google Cloud Platform|GCP]]
- [[AKS Cluster Autoscaler]] for [[Azure]]
- [[DigitalOcean]] also has built-in cluster autoscaling APIs

You can also choose [[Event-driven architecture|event-driven]] tools. These respond to events like messages arriving in a queue, a new file appearing in storage, or a database entry being created.

- [[KEDA]] extends HPA with external metrics
- [[KNative Eventing]] and [[KNative Serving]]

While not strictly an autoscaling tool, a CD tool like [[Argo Rollouts]], used in conjunction with one of the tools above, also gives you a way to roll out scaling changes incrementally to reduce the [[Minimize the blast radius|blast radius]].

### Calculate and measure costs

Truly sustainable observability also requires a cultural shift, and that's harder to quantify. A practical next step in this direction is to track changes and improvements in cost. Often, just being transparent about costs (even internally) is enough to drive cultural change.

Calculation tools:

- [[Kepler]] is the most promising tool for calculating cost. It estimates the power usage of containers, pods, and nodes using [[eBPF]] and [[Machine Learning]], so that you can more easily extrapolate from that to carbon emissions. It's a new project that's still in development, though, and not yet fully ready for production.
- Cloud provider-specific carbon monitoring tools
    - GCP: [[GCP Carbon Footprint tool]] ([site](https://cloud.google.com/carbon-footprint))
    - AWS: Customer Carbon Footprint Tool (CCFT), added to AWS Data Exports ([blog](https://aws.amazon.com/blogs/aws-cloud-financial-management/updated-carbon-methodology-for-the-aws-customer-carbon-footprint-tool/0))
    - Azure: [[Azure Emissions Impact Dashboard]] ([site](https://learn.microsoft.com/en-us/power-bi/connect-data/service-connect-to-emissions-impact-dashboard))

[[Data Visualization]] of key metrics using something like [[Grafana]] is essential for providing a single source of truth for all costs associated with your observability or computing stack.

## See also

- [[sources/Podcast/Kubernetes, Kepler, and Carbon Footprints The Latest Tools and Strategies to Optimize Observability|Kubernetes, Kepler, and Carbon Footprints The Latest Tools and Strategies to Optimize Observability]] is an episode of the [[Grafana Labs]] Big Tent podcast in which [[Tom Wilkie]], [[Niki Manoledaki]], [[Brian Brazil]], and [[Thomas Dullien]] discuss how to monitor energy and carbon impact.
- [[All about CNCF Environmental Sustainability with Niki Manoledaki]] is a livestream I did with [[Marie Cruz]] where we also talk to [[Niki Manoledaki]] about her work in sustainable observability.

[^1]: O'Brien, I (2024). *Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse?* Retrieved from The Guardian in October 2025: https://www.theguardian.com/technology/2024/sep/15/data-center-gas-emissions-tech