# [[Deploying Karpenter on GCP]]
![[Deploy Karpenter on GCP.svg]]
I tried to deploy [[Karpenter]] on [[Google Cloud Platform|GCP]] using [the instructions here](https://www.cloudpilot.ai/en/blog/deploy-karpenter-on-google-cloud/), but I ran into issues. Karpenter could not provision or manage [[Google Kubernetes Engine|GKE]] nodes, and it still seems too tightly coupled to [[AWS]] infrastructure. I decided to pivot to [[GKE Autoscaler]]. Below is my AI's log of the incident and its reasoning.
# Karpenter on GKE: Implementation Attempt and Lessons Learned
**Date:** October 30, 2025
**Project:** Spice Runner Demo
**Objective:** Implement dynamic node autoscaling so the cluster can scale beyond its 4-pod capacity limit
**Outcome:** Switched to GKE Cluster Autoscaler after encountering critical blockers
---
## Executive Summary
We attempted to implement Karpenter on GKE to provide dynamic node provisioning for our KEDA-based pod autoscaling demo. After extensive troubleshooting, we discovered that Karpenter's GKE support is more experimental than advertised and has critical bugs preventing production use. We successfully implemented GKE Cluster Autoscaler instead, which provides the same functionality with production-ready stability.
**Time Investment:** ~3 hours
**Result:** GKE Cluster Autoscaler deployed and verified working
**Key Learning:** Evaluate the maturity of experimental software carefully before committing to it
---
## Background
### Initial Problem
- KEDA was successfully scaling pods from 1 to 10 based on traffic metrics
- Cluster was limited to 4 pods due to node capacity constraints
- Needed automatic node provisioning to scale beyond this limit
### Why We Chose Karpenter Initially
1. **Cutting-edge technology**: Showcases latest autoscaling capabilities
2. **Web research indicated GKE support**: CloudPilot AI announced GKE provider
3. **Superior to traditional autoscalers**: Faster provisioning, better cost optimization
4. **Demo appeal**: More impressive than standard GKE Cluster Autoscaler
---
## Implementation Attempts
### Attempt 1: Official AWS Karpenter Chart
**Approach:** Installed official Karpenter Helm chart from `karpenter/karpenter`
**Issues Encountered:**
- Chart is AWS EKS-specific only
- Requires AWS-specific configuration (`aws.clusterName`, `aws.clusterEndpoint`)
- No GCP provider in official chart
- Pods crashed immediately with AWS configuration errors
**Outcome:** ❌ Failed - Wrong implementation path
**Time Spent:** 30 minutes
---
### Attempt 2: CloudPilot AI GCP Provider
**Approach:** Found CloudPilot AI's experimental GKE provider implementation
**What We Did:**
1. Enabled required GCP APIs (compute, container)
2. Created GCP service account with necessary IAM roles:
   - `roles/compute.instanceAdmin.v1`
   - `roles/iam.serviceAccountUser`
   - `roles/container.developer` (later upgraded to `container.admin`)
3. Set up Workload Identity binding
4. Cloned CloudPilot AI repository: `github.com/cloudpilot-ai/karpenter-provider-gcp`
5. Installed via local Helm chart
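Steps 1 to 3 above can be sketched roughly as follows. This is a reconstruction, not the commands from the original session: the project ID, the service account names, and the `karpenter-system` namespace are placeholders I am assuming. The commands are wrapped in a function so nothing runs until it is called. Steps 4 and 5 (clone and Helm install) are left out because the log does not record the chart path or values used.

```bash
# Placeholders (assumptions, not from the original log): adjust before use.
PROJECT_ID="my-gcp-project"
GSA_NAME="karpenter-gsa"            # hypothetical Google service account
KSA_NAMESPACE="karpenter-system"    # assumed Kubernetes namespace
KSA_NAME="karpenter"                # assumed Kubernetes service account
GSA_EMAIL="${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

setup_karpenter_identity() {
  # Step 1: enable the Compute Engine and Kubernetes Engine APIs.
  gcloud services enable compute.googleapis.com container.googleapis.com \
    --project "$PROJECT_ID"

  # Step 2: create the service account and grant the three roles listed above.
  gcloud iam service-accounts create "$GSA_NAME" --project "$PROJECT_ID"
  for role in roles/compute.instanceAdmin.v1 \
              roles/iam.serviceAccountUser \
              roles/container.developer; do
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member "serviceAccount:${GSA_EMAIL}" --role "$role"
  done

  # Step 3: Workload Identity binding between the Kubernetes and Google accounts.
  gcloud iam service-accounts add-iam-policy-binding "$GSA_EMAIL" \
    --project "$PROJECT_ID" \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${KSA_NAMESPACE}/${KSA_NAME}]"
  kubectl annotate serviceaccount "$KSA_NAME" -n "$KSA_NAMESPACE" \
    "iam.gke.io/gcp-service-account=${GSA_EMAIL}"
}
```

After reviewing the placeholders, calling `setup_karpenter_identity` runs the steps in order.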
**Issues Encountered:**
#### Issue 1: Resource Constraints
- **Problem:** Karpenter pods couldn't schedule due to insufficient CPU
- **Error:** `0/3 nodes are available: 3 Insufficient cpu`
- **Cause:** Cluster was already at capacity
- **Resolution:**
  - Reduced Karpenter CPU request from 100m to 25m
  - Reduced Karpenter memory request from 128Mi to 64Mi
  - Changed replicas from 2 to 1
  - Temporarily scaled down application pods
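A minimal sketch of the resource reduction, assuming the controller runs as a Deployment named `karpenter` in a `karpenter-system` namespace (both names are assumptions). The log does not say whether the change was made through Helm values or directly on the Deployment; this version patches the live Deployment, so a later `helm upgrade` would overwrite it.

```bash
NAMESPACE="karpenter-system"   # assumed namespace
DEPLOYMENT="karpenter"         # assumed Deployment name

shrink_karpenter() {
  # Lower the requests so the pod fits on an already full cluster.
  kubectl -n "$NAMESPACE" set resources "deployment/$DEPLOYMENT" \
    --requests=cpu=25m,memory=64Mi
  # Run a single replica instead of two.
  kubectl -n "$NAMESPACE" scale "deployment/$DEPLOYMENT" --replicas=1
}
```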
#### Issue 2: Priority Class Quota
- **Problem:** Pods failed to create with priority class error
- **Error:** `insufficient quota to match these scopes: [{PriorityClass In [system-cluster-critical]}]`
- **Cause:** GKE restricts use of system-critical priority classes
- **Resolution:** Removed `priorityClassName` from deployment
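One way to drop the field from a live Deployment is a JSON patch, sketched below. The namespace and Deployment name are assumptions, and the log does not record whether the field was removed this way or by editing the chart.

```bash
NAMESPACE="karpenter-system"   # assumed namespace
DEPLOYMENT="karpenter"         # assumed Deployment name

remove_priority_class() {
  # JSON patch "remove" fails if the field is absent, which doubles as a check.
  kubectl -n "$NAMESPACE" patch "deployment/$DEPLOYMENT" --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/priorityClassName"}]'
}
```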
#### Issue 3: Missing Environment Variable
- **Problem:** Karpenter crashed on startup
- **Error:** `missing required flag location or env var LOCATION`
- **Cause:** Helm chart didn't set LOCATION environment variable
- **Resolution:** Manually added `LOCATION=us-central1-a` to deployment
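The manual fix could look something like this sketch, which sets the variable on the Deployment and waits for the rollout. The namespace and Deployment name are assumptions; only `LOCATION=us-central1-a` comes from the log.

```bash
NAMESPACE="karpenter-system"   # assumed namespace
DEPLOYMENT="karpenter"         # assumed Deployment name

set_location() {
  # Adds (or updates) the env var on every container in the pod template.
  kubectl -n "$NAMESPACE" set env "deployment/$DEPLOYMENT" LOCATION=us-central1-a
  # Block until the restarted pods are available, or give up after two minutes.
  kubectl -n "$NAMESPACE" rollout status "deployment/$DEPLOYMENT" --timeout=120s
}
```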
#### Issue 4: IAM Permissions
- **Problem:** Karpenter couldn't create node pools
- **Error:** `Required "container.clusters.update" permission(s) for "...clusters/spice-runner-cluster", forbidden`
- **Cause:** Service account only had `container.developer` role
- **Resolution:** Upgraded to `roles/container.admin`
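Before widening a role, it can help to confirm which predefined role actually carries the missing permission. The helper below is a hypothetical sketch (the function name is mine) built on `gcloud iam roles describe`; it simply searches the JSON output for the quoted permission string.

```bash
# role_has_permission ROLE PERMISSION
# Succeeds (exit 0) if the role's description lists the permission.
role_has_permission() {
  gcloud iam roles describe "$1" --format=json | grep -qF "\"$2\""
}

# Example (not executed here):
#   role_has_permission roles/container.developer container.clusters.update \
#     || echo "developer role lacks container.clusters.update"
```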
#### Issue 5: Critical Bug - "Default Node Template Not Found"
- **Problem:** Karpenter created GKE node pools but couldn't read their configuration
- **Error:** `default node template not found` (repeated continuously)
- **Cause:** CloudPilot AI provider expects GCE instance templates, but GKE exposes managed node pool configurations instead
- **Impact:** NodeClass resources remained in "Not Ready" state; no nodes could be provisioned
- **Attempted Fixes:**
  - Tried different image aliases (`ContainerOptimizedOS@latest`, versioned aliases)
  - Tried `imageFamily` instead of `imageSelectorTerms`
  - Enabled autoscaling on GKE node pools
  - Verified service account permissions
- **Outcome:** ❌ Unfixable without code changes to CloudPilot AI provider
**Final State:**
- ✅ Karpenter pods running (2/2)
- ✅ IAM permissions correct
- ✅ GKE node pools created (`karpenter-default`, `karpenter-ubuntu`)
- ❌ NodeClass unable to resolve image information
- ❌ No actual node provisioning occurring
- ❌ Pods remained unscheduled when scaled up
**Outcome:** ❌ Failed - Critical software bug in experimental provider
**Time Spent:** 2 hours
---
## Root Cause Analysis
### Technical Issue
The CloudPilot AI Karpenter GCP provider has an architectural mismatch:
- **What it expects:** GCE instance templates (like AWS Launch Templates)
- **What GKE provides:** Managed node pool configurations
- **Error in code:** `pkg/providers/imagefamily/helper.go` tries to read `defaultNodeTemplate` which doesn't exist for GKE node pools
### Software Maturity Assessment
**Initial Understanding:**
- CloudPilot AI blog post stated "preview release, fully functional for testing"
- Web search results indicated GKE support was available
**Reality:**
- GKE provider is pre-alpha quality
- Critical functionality (image resolution) is broken
- No workaround available without modifying source code
- Not suitable even for demo purposes
### Lessons Learned
1. **"Preview" can mean very different levels of maturity**
2. **Web search results may reflect marketing rather than reality**
3. **Experimental software should be tested quickly before deep investment**
4. **Blog post announcements ≠ production-ready code**
---
## Decision to Switch
### Why GKE Cluster Autoscaler
**Advantages:**
- ✅ Production-ready and Google-supported
- ✅ Works immediately without configuration complexity
- ✅ Integrates seamlessly with KEDA
- ✅ Solves the original problem (scale beyond 4 pods)
- ✅ No experimental bugs or limitations
- ✅ Well-documented and proven at scale
**Implementation:**
```bash
# Enable node autoscaling (1 to 10 nodes) on the default node pool
gcloud container clusters update spice-runner-cluster \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=1 \
  --max-nodes=10 \
  --zone=us-central1-a
```
**Time to deploy:** 30 seconds
**Verification:** ✅ Successfully scaled from 3 to 5 nodes when load increased
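To confirm the setting took effect, the node pool's autoscaling block can be read back. This is a sketch rather than part of the original session; the cluster, pool, and zone names are the ones from the command above, and the function name is hypothetical.

```bash
CLUSTER="spice-runner-cluster"
NODE_POOL="default-pool"
ZONE="us-central1-a"

show_autoscaling() {
  # Prints the enabled flag and the min/max node counts on one line.
  gcloud container node-pools describe "$NODE_POOL" \
    --cluster "$CLUSTER" --zone "$ZONE" \
    --format="value(autoscaling.enabled,autoscaling.minNodeCount,autoscaling.maxNodeCount)"
}
```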
---
## Testing Results
### GKE Cluster Autoscaler Performance
**Test Scenario:** Scale deployment from 2 to 15 replicas
**Results:**
- **Initial state:** 3 nodes, 2 pods
- **After scaling:** 5 nodes, 15 pods
- **New nodes provisioned:** 2 (within 60 seconds)
- **Node names:** `gke-spice-runner-cluster-default-pool-b16b6744-fbkp`, `gke-spice-runner-cluster-default-pool-b16b6744-nhkw`
- **Scale-down behavior:** After scaling back to 2 pods, extra nodes marked for removal (10-minute grace period)
**Verdict:** ✅ Working perfectly
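A test like this can be scripted roughly as below. The Deployment name `spice-runner` is a guess (the log does not name it), and the helpers are hypothetical. Note that if KEDA is actively managing the Deployment, a manual `kubectl scale` may be reverted.

```bash
APP_DEPLOYMENT="spice-runner"   # hypothetical Deployment name

count_ready_nodes() {
  # STATUS is the second column; count only nodes that are exactly "Ready".
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# wait_for_nodes TARGET [TRIES] [INTERVAL_SECONDS]
# Polls until at least TARGET nodes are Ready; fails after TRIES checks.
wait_for_nodes() {
  target="$1"; tries="${2:-30}"; interval="${3:-10}"
  while [ "$(count_ready_nodes)" -lt "$target" ]; do
    tries=$((tries - 1))
    [ "$tries" -le 0 ] && return 1
    sleep "$interval"
  done
}

run_scale_test() {
  kubectl scale "deployment/$APP_DEPLOYMENT" --replicas=15
  wait_for_nodes 5 && echo "Ready nodes: $(count_ready_nodes)"
}
```

Calling `run_scale_test` mirrors the scenario above: 15 replicas, then wait for 5 Ready nodes.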
---
## Cost-Benefit Analysis
### Time Investment
| Activity | Time Spent | Outcome |
|----------|------------|---------|
| Research & Planning | 30 min | Informed decision |
| AWS Karpenter Attempt | 30 min | Failed (wrong path) |
| CloudPilot AI Implementation | 2 hours | Failed (software bug) |
| GKE Cluster Autoscaler | 5 min | Success |
| **Total** | **~3 hours** | **Working solution** |
### What We Gained
1. **Deep understanding** of Karpenter architecture
2. **Experience** with experimental Kubernetes autoscaling
3. **Knowledge** of when to cut losses and use proven solutions
4. **Documentation** of what doesn't work (valuable for community)
### What We Lost
- 3 hours that could have been spent on features
- Some configuration files (easily recreated if needed)
- Initial excitement about cutting-edge tech
---
## Recommendations
### For Production Use
**Use GKE Cluster Autoscaler** - It's proven, supported, and works immediately.
### For Experimentation
If you want to try Karpenter on GKE:
1. **Wait for maturity**: Monitor CloudPilot AI's progress
2. **Test in isolated environment**: Don't attempt on production-like demos
3. **Set time limits**: 1 hour to get working, then pivot
4. **Check GitHub issues**: Look for reported bugs before starting
5. **Verify basic functionality**: Test node provisioning immediately
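Point 5 can be turned into a quick probe: create a pod that cannot fit on the current nodes and see whether it gets scheduled within a fixed budget. The sketch below is my own illustration, not something from the original session; the pod name, CPU request, and timeout are all assumptions to tune for your cluster.

```bash
PROBE_POD="provisioning-probe"   # hypothetical pod name
PROBE_CPU="2"                    # choose a request larger than the free CPU
TIME_LIMIT="300s"                # budget for this single check

probe_provisioning() {
  # A pause container with a large CPU request forces a scale-up.
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ${PROBE_POD}
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "${PROBE_CPU}"
EOF
  # PodScheduled turns True only once a node with enough room exists.
  if kubectl wait --for=condition=PodScheduled "pod/${PROBE_POD}" --timeout="${TIME_LIMIT}"; then
    result=0; echo "node provisioning works"
  else
    result=1; echo "pod not scheduled within ${TIME_LIMIT}: consider pivoting"
  fi
  kubectl delete pod "$PROBE_POD" --ignore-not-found
  return "$result"
}
```

Running `probe_provisioning` right after installing an autoscaler gives an early pass/fail signal before any deeper configuration work.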
### For AWS Users
Karpenter on AWS EKS is production-ready and works well. The issues we encountered are GKE-specific.
---
## Configuration Files Removed
The following Karpenter-related files were removed from the repository: