# How Grafana Labs Switched to Karpenter to Reduce Costs and Complexities in Amazon EKS

![rw-book-cover](https://grafana.com/media/karpenter/cluster-workload-utilization-2.png)

URL:: https://grafana.com/blog/2023/11/09/how-grafana-labs-switched-to-karpenter-to-reduce-costs-and-complexities-in-amazon-eks/
Author:: Paula Julve, Logan Ballard

## AI-Generated Summary

Grafana Labs replaced Cluster Autoscaler with Karpenter to cut costs and simplify their EKS infrastructure. Karpenter provisions nodes dynamically, prefers Spot with On-Demand fallback, and avoids IP and pod-scheduling issues. The change improved developer agility, disaster recovery, and overall cluster efficiency.

## Highlights

> In this blog, we’ll dig into why CA wasn’t the long-term answer, the alternatives we considered, and why we ultimately went with [Karpenter](https://karpenter.sh/). We’ll also look at some of the associated trade-offs and lessons learned, as well as why Karpenter has helped us reduce costs and complexity, which in turn helps us provide better service to our users who are on AWS. ([View Highlight](https://read.readwise.io/read/01k8kqmwg68mpasd3hjhk38586))

> CA works by scaling Kubernetes node groups up and down. It watches for pending pods that cannot fit onto the existing nodes, and it will provision new nodes for them to land on, according to the node group definitions and their underlying [Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html). ([View Highlight](https://read.readwise.io/read/01k8kqnjt0dphem20ra3f33jrh))

> AWS allows you to define a diverse range of instance types for your node groups. However, if you list multiple types for your group, CA will only run calculations for one of them in order to determine how many nodes it needs to scale up. It will then request that many instances from AWS, but you have no control over which instance types you’re actually getting. You may end up with capacity that does not match your actual needs, leading to more readjusting. ([View Highlight](https://read.readwise.io/read/01k8kqq0zx8vemqp5rqft329gz))

> As a consequence of the previous point, we ended up with more and more specialized node group definitions in order to answer our workloads’ demands. ([View Highlight](https://read.readwise.io/read/01k8kqr0h4ktwbt487179x5ed5))

> CA doesn’t consider [“soft” constraints like `preferredDuringSchedulingIgnoredDuringExecution`](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#does-ca-respect-node-affinity-when-selecting-node-groups-to-scale-up:~:text=CA%20does%20not%20consider%20%22soft%22%20constraints%20like%20preferredDuringSchedulingIgnoredDuringExecution%20when%20selecting%20node%20groups.) when it selects node groups. If a workload prefers a spot node and one has space for it, the Kubernetes scheduler will try to land it there. But if there’s no space, CA will not take this into consideration when deciding which group to scale up. In our case, this meant the only way we could ever get spot nodes was if the workloads used a hard requirement instead, setting the [`requiredDuringSchedulingIgnoredDuringExecution`](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity) property. ([View Highlight](https://read.readwise.io/read/01k8kqs13rw03qs7vw2z91n181))
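
To make the soft-versus-hard distinction in the last highlight concrete, here is a minimal Python sketch. It builds the two `nodeAffinity` stanzas as plain dictionaries using the Kubernetes field names, then runs a toy filter that looks only at hard requirements when picking candidate node groups. The label key `example.com/capacity-type`, the node group names, and the `eligible_groups` helper are all hypothetical, and the filter is a simplified model of the behavior the highlight describes, not Cluster Autoscaler's actual code.

```python
# Hypothetical label key (an assumption for illustration); real clusters use
# whatever node label marks spot versus on-demand capacity.
CAPACITY_LABEL = "example.com/capacity-type"


def spot_expression():
    """Match expression selecting nodes labelled as spot capacity."""
    return {"key": CAPACITY_LABEL, "operator": "In", "values": ["spot"]}


# Soft constraint: the scheduler prefers spot nodes when they have room.
soft_affinity = {
    "nodeAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {"weight": 100,
             "preference": {"matchExpressions": [spot_expression()]}}
        ]
    }
}

# Hard constraint: the pod may only be scheduled onto spot nodes.
hard_affinity = {
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{"matchExpressions": [spot_expression()]}]
        }
    }
}


def term_matches(term, labels):
    """All expressions in one selector term must match (only 'In' is modelled)."""
    for expr in term.get("matchExpressions", []):
        if expr["operator"] != "In":
            raise ValueError("this sketch only models the 'In' operator")
        if labels.get(expr["key"]) not in expr["values"]:
            return False
    return True


def eligible_groups(affinity, node_groups):
    """Toy model: only hard requirements narrow the candidate node groups.

    Preferred (soft) terms are deliberately ignored, mirroring the
    limitation described in the highlight.
    """
    node_affinity = affinity.get("nodeAffinity", {})
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    if not required:
        return sorted(node_groups)
    terms = required["nodeSelectorTerms"]  # terms are ORed together
    return sorted(
        name for name, labels in node_groups.items()
        if any(term_matches(term, labels) for term in terms)
    )


# Hypothetical node groups and their labels.
NODE_GROUPS = {
    "on-demand-general": {CAPACITY_LABEL: "on-demand"},
    "spot-general": {CAPACITY_LABEL: "spot"},
}

if __name__ == "__main__":
    print(eligible_groups(soft_affinity, NODE_GROUPS))
    print(eligible_groups(hard_affinity, NODE_GROUPS))
```

With the soft preference, both groups stay in the candidate set, so nothing steers a scale-up toward spot. With the hard requirement, only `spot-general` remains, which matches the workaround the authors describe.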