# [[How to install Kepler]]

![[How to install Kepler.svg]]

Kepler (Kubernetes-based Efficient Power Level Exporter) tracks energy consumption in Kubernetes clusters. This guide covers installation on cloud environments (GKE, EKS, AKS) that lack hardware power sensors, using model-based estimation instead.

## Prerequisites

- Kubernetes 1.20+
- kubectl with cluster-admin access
- Linux kernel 4.18+ with eBPF support
- Prometheus (for metrics collection)
- Grafana (optional, for visualization)
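Before deploying, it is worth confirming the cluster-side prerequisites. A quick, illustrative check using standard kubectl commands (nothing Kepler-specific is assumed here):

```bash
# Server version should be 1.20 or newer
kubectl version

# Should print "yes" if you hold cluster-admin
kubectl auth can-i '*' '*' --all-namespaces

# Node kernel versions appear in the KERNEL-VERSION column (needs 4.18+)
kubectl get nodes -o wide
```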
## Installation Steps

### 1. Create Kepler Deployment

Create a file named `kepler.yaml` with the following content:

```yaml
---
# Kepler Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: kepler
---
# Kepler ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kepler
  namespace: kepler
---
# Kepler ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kepler
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - nodes/stats
      - nodes/proxy
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - nodes/stats
    verbs: ["get"]
  - nonResourceURLs:
      - /metrics
    verbs: ["get"]
---
# Kepler ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kepler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kepler
subjects:
  - kind: ServiceAccount
    name: kepler
    namespace: kepler
---
# Kepler ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: kepler-config
  namespace: kepler
data:
  KEPLER_NAMESPACE: "kepler"
  ENABLE_GPU: "false"
  METRICS_PORT: "9102"
  BIND_ADDRESS: "0.0.0.0:9102"
  # CRITICAL: Disable RAPL for cloud environments
  ENABLE_RAPL: "false"
  ENABLE_PLATFORM_RAPL: "false"
  ENABLE_EBPF_CGROUPID: "true"
  # Enable model-based estimation
  ESTIMATOR: "true"
  MODEL_SERVER_ENABLE: "true"
  MODEL_SERVER_URL: "http://kepler-model-server.kepler.svc.cluster.local:8100"
  MODEL_CONFIG: "CONTAINER_COMPONENTS_ESTIMATOR=true,CONTAINER_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json"
---
# Kepler DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler
  namespace: kepler
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/component: exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kepler
      app.kubernetes.io/component: exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kepler
        app.kubernetes.io/component: exporter
    spec:
      serviceAccountName: kepler
      hostNetwork: true
      hostPID: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: kepler
          image: quay.io/sustainable_computing_io/kepler:release-0.7.11
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          ports:
            - name: http-metrics
              containerPort: 9102
              hostPort: 9102
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9102
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 60
            timeoutSeconds: 10
            failureThreshold: 5
          envFrom:
            - configMapRef:
                name: kepler-config
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
              readOnly: true
            - name: tracing
              mountPath: /sys
              readOnly: true
            - name: proc
              mountPath: /proc
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: Directory
        - name: tracing
          hostPath:
            path: /sys
            type: Directory
        - name: proc
          hostPath:
            path: /proc
            type: Directory
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
---
# Kepler Service
apiVersion: v1
kind: Service
metadata:
  name: kepler
  namespace: kepler
  labels:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/component: exporter
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: http-metrics
      port: 9102
      targetPort: http-metrics
      protocol: TCP
  selector:
    app.kubernetes.io/name: kepler
    app.kubernetes.io/component: exporter
---
# Kepler Model Server Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kepler-model-server
  namespace: kepler
  labels:
    app.kubernetes.io/name: kepler-model-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kepler-model-server
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kepler-model-server
    spec:
      containers:
        - name: model-server
          image: quay.io/sustainable_computing_io/kepler_model_server:v0.7
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -u
            - src/server/model_server.py
          ports:
            - name: http
              containerPort: 8100
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8100
            initialDelaySeconds: 30
            periodSeconds: 60
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
---
# Kepler Model Server Service
apiVersion: v1
kind: Service
metadata:
  name: kepler-model-server
  namespace: kepler
  labels:
    app.kubernetes.io/name: kepler-model-server
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8100
      targetPort: http
      protocol: TCP
  selector:
    app.kubernetes.io/name: kepler-model-server
```

### 2. Deploy Kepler

```bash
kubectl apply -f kepler.yaml
```

### 3. Verify Installation

```bash
# Check pods are running
kubectl get pods -n kepler
# Expected output: all pods should show 1/1 Running

# View Kepler logs
kubectl logs -n kepler -l app.kubernetes.io/name=kepler --tail=20

# Test the metrics endpoint
kubectl exec -n kepler daemonset/kepler -- curl -s localhost:9102/metrics | grep kepler_container
```

## Prometheus Configuration

Add this scrape job to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: 'kepler'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - kepler
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kepler
        action: keep
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        regex: exporter
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: ${1}:9102
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
    scrape_interval: 30s
```

Restart Prometheus to apply the configuration:

```bash
kubectl rollout restart deployment/prometheus -n <prometheus-namespace>
```

## Key Metrics

Kepler's energy counters are reported in joules, so `rate()` over them already yields joules per second, i.e. watts.

### Power Consumption

```promql
# Total cluster power (Watts)
sum(rate(kepler_node_platform_joules_total[1m]))

# Power per pod
sum(rate(kepler_container_joules_total[1m])) by (pod_name)

# Power per namespace
sum(rate(kepler_container_joules_total[1m])) by (container_namespace)

# Power per node
sum(rate(kepler_node_platform_joules_total[1m])) by (node)
```

### Energy and Cost

```promql
# Total energy (kWh)
sum(kepler_node_platform_joules_total) / 3600000

# Energy per pod (kWh)
sum(kepler_container_joules_total) by (pod_name) / 3600000

# Estimated CO2 emissions (grams, US grid average 475 g/kWh)
(sum(kepler_node_platform_joules_total) / 3600000) * 475

# Estimated cost (USD at $0.12/kWh)
(sum(kepler_node_platform_joules_total) / 3600000) * 0.12
```

### Efficiency Metrics

```promql
# Requests per Watt
sum(rate(http_requests_total[1m])) / sum(rate(kepler_container_joules_total[1m]))

# CPU efficiency (cores per Watt)
sum(rate(container_cpu_usage_seconds_total[1m])) / sum(rate(kepler_container_joules_total[1m]))
```
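These expressions can also be captured as Prometheus recording and alerting rules so dashboards and alerts share one definition (Best Practices below suggests alerting on abnormally high power draw). A minimal sketch; the group name, rule names, and 500 W threshold are illustrative, and the file must be loaded however your Prometheus loads rules (`rule_files`, or a `PrometheusRule` object when using the Prometheus Operator):

```yaml
# kepler-rules.yaml -- illustrative recording and alerting rules
groups:
  - name: kepler-power
    interval: 30s
    rules:
      # Estimated power draw per namespace, in watts
      - record: namespace:kepler_power_watts:sum
        expr: sum(rate(kepler_container_joules_total[5m])) by (container_namespace)
      # Estimated power draw per node, in watts
      - record: node:kepler_power_watts:sum
        expr: sum(rate(kepler_node_platform_joules_total[5m])) by (node)
      # Example alert on sustained high namespace power (threshold is illustrative)
      - alert: KeplerHighNamespacePower
        expr: namespace:kepler_power_watts:sum > 500
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Namespace {{ $labels.container_namespace }} is drawing an estimated {{ $value | humanize }}W"
```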
## Configuration Details

### Critical Settings for Cloud Environments

```yaml
ENABLE_RAPL: "false"           # Disable hardware sensor requirement
ENABLE_PLATFORM_RAPL: "false"  # No RAPL available on cloud VMs
ESTIMATOR: "true"              # Enable model-based estimation
MODEL_SERVER_ENABLE: "true"    # Use ML model server for estimates
```

### How Estimation Works

Without hardware power sensors (RAPL), Kepler estimates power consumption using machine learning models trained on real hardware data. The models are driven by:

- CPU utilization patterns
- Memory usage
- Network I/O activity
- Disk operations

**Accuracy:** Approximately 85-90% for relative comparisons between workloads; less accurate for absolute power values.

**Best Use Cases:**

- Comparing the efficiency of different implementations
- Identifying power-hungry workloads
- Tracking trends over time
- Cost attribution by namespace/team

## Grafana Dashboard

### Quick Import

1. In Grafana, navigate to Dashboards → Import
2. Enter dashboard ID `18681` (official Kepler dashboard)
3. Select your Prometheus datasource
4. Click Import

### Manual Dashboard Creation

Create panels with these queries:

**Cluster Power (Gauge)**

```promql
sum(rate(kepler_node_platform_joules_total[1m]))
```

**Top 10 Power Consumers (Table)**

```promql
topk(10, sum(rate(kepler_container_joules_total[5m])) by (pod_name, container_namespace))
```

**Power by Namespace (Pie Chart)**

```promql
sum(rate(kepler_container_joules_total[5m])) by (container_namespace)
```

**Power Over Time (Time Series)**

```promql
sum(rate(kepler_node_platform_joules_total[1m])) by (node)
```

**Energy Efficiency Score (Stat)**

```promql
sum(rate(http_requests_total[1m])) / sum(rate(kepler_container_joules_total[1m]))
```

## Troubleshooting

### Pods Stuck in ContainerCreating

Check for volume mount errors:

```bash
kubectl describe pod -n kepler <pod-name>
```

Common issue on GKE: a read-only filesystem prevents certain mounts. Remove unnecessary volume mounts such as kernel-src or kernel-debug.

### Pods Crash with "no RAPL zones found"

Ensure these ConfigMap settings are present:

```yaml
ENABLE_RAPL: "false"
ENABLE_PLATFORM_RAPL: "false"
```

This error occurs when Kepler tries to use hardware sensors that don't exist in cloud environments.

### No Metrics Appearing

Wait 2-3 minutes after deployment, then check:

```bash
# Verify metrics are exposed
kubectl port-forward -n kepler svc/kepler 9102:9102
curl localhost:9102/metrics | grep kepler_

# Check that Prometheus is scraping:
# in the Prometheus UI, go to Status → Targets and look for the "kepler" job
```

### Model Server Won't Start

Check the logs:

```bash
kubectl logs -n kepler deployment/kepler-model-server
```

Common fix: ensure the command and args are properly set in the Deployment spec.
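If the logs look clean but Kepler still reports estimation problems, probe the model server directly. A quick check, relying only on the Service name and the `/healthz` probe path defined in `kepler.yaml` above:

```bash
# Forward the model server port locally and hit the health endpoint
kubectl port-forward -n kepler svc/kepler-model-server 8100:8100 &
curl -s http://localhost:8100/healthz

# Confirm the in-cluster DNS name Kepler is configured with (MODEL_SERVER_URL) resolves
kubectl run -n kepler dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kepler-model-server.kepler.svc.cluster.local
```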
### Metrics Show Zero or NaN

This can happen when:

- Pods have just started (wait 2-3 minutes for data to accumulate)
- No workload is running in the cluster
- The model server is still initializing

Check model server readiness:

```bash
kubectl get pods -n kepler -l app.kubernetes.io/name=kepler-model-server
```

## Resource Usage

### Kepler DaemonSet (per node)

- CPU: ~100m (requests), up to 500m (limits)
- Memory: ~128Mi (requests), up to 512Mi (limits)
- Overall overhead: ~2-5% per node

### Model Server

- CPU: ~100m (requests), up to 500m (limits)
- Memory: ~256Mi (requests), up to 1Gi (limits)

## Security Considerations

Kepler requires:

- **Privileged containers** (for eBPF probes)
- **Host network** access
- **Host PID** namespace access
- **Read access** to `/sys`, `/proc`, `/lib/modules`

Review your organization's security policies before deploying Kepler in production.

## Best Practices

1. **Use specific image versions** (not `:latest`) for production deployments
2. **Establish baselines** by measuring idle cluster power before running workloads
3. **Focus on relative comparisons** rather than absolute values when using estimation mode
4. **Set resource limits** to prevent Kepler from consuming excessive resources
5. **Monitor estimation accuracy** by comparing similar workloads
6. **Update regularly** to get improved ML models and bug fixes
7. **Create alerts** for abnormally high power consumption

## Monitoring Kepler Itself

```bash
# Check Kepler pod health
kubectl get pods -n kepler -w

# View Kepler exporter logs
kubectl logs -n kepler -l app.kubernetes.io/name=kepler -f

# Check model server logs
kubectl logs -n kepler -l app.kubernetes.io/name=kepler-model-server -f

# Verify metrics collection
kubectl top pods -n kepler

# Test the metrics endpoint
kubectl port-forward -n kepler daemonset/kepler 9102:9102
curl http://localhost:9102/metrics | grep kepler_ | wc -l
```

## Uninstalling Kepler

```bash
# Remove all Kepler components
kubectl delete namespace kepler

# Remove cluster-level resources
kubectl delete clusterrole kepler
kubectl delete clusterrolebinding kepler

# Remove the Prometheus scrape config (manual step):
# edit your Prometheus ConfigMap and remove the kepler job
```

## Upgrading Kepler

```bash
# Update the image version in kepler.yaml
# Change: quay.io/sustainable_computing_io/kepler:release-0.7.11
# To:     quay.io/sustainable_computing_io/kepler:release-0.7.12 (example)

kubectl apply -f kepler.yaml

# Verify the rollout
kubectl rollout status daemonset/kepler -n kepler
kubectl rollout status deployment/kepler-model-server -n kepler
```

## Hardware-Based Mode (Bare Metal)

If you're running on bare metal with RAPL support, modify the ConfigMap:

```yaml
ENABLE_RAPL: "true"            # Enable hardware sensors
ENABLE_PLATFORM_RAPL: "true"   # Use platform RAPL interface
ESTIMATOR: "false"             # Disable estimation (use real measurements)
MODEL_SERVER_ENABLE: "false"   # Model server not needed
```

This provides 95%+ accurate hardware-based measurements instead of estimates.

## Additional Resources

- **Official Documentation:** https://sustainable-computing.io/
- **GitHub Repository:** https://github.com/sustainable-computing-io/kepler
- **Model Server:** https://github.com/sustainable-computing-io/kepler-model-server
- **Community:** #kepler channel in Kubernetes Slack
- **Dashboards:** https://grafana.com/grafana/dashboards/?search=kepler

## Use Cases

### 1. Cost Attribution

Track energy costs per team or project:

```promql
# Daily cost per namespace (assumes $0.12/kWh)
(sum(increase(kepler_container_joules_total[24h])) by (container_namespace) / 3600000) * 0.12
```
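To pull this figure into a recurring report, the same expression can be evaluated against the Prometheus HTTP API. A sketch, assuming Prometheus is reachable at `http://prometheus:9090` and that `jq` is available; adjust the URL and the $0.12/kWh rate to your environment:

```bash
# Evaluate the daily-cost-per-namespace query via /api/v1/query
PROM_URL="http://prometheus:9090"
QUERY='(sum(increase(kepler_container_joules_total[24h])) by (container_namespace) / 3600000) * 0.12'

curl -s "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[] | "\(.metric.container_namespace): $\(.value[1])"'
```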
### 2. Workload Optimization

Compare the power efficiency of different implementations:

```promql
# Energy per request for service-a (joules per request)
sum(rate(kepler_container_joules_total{pod_name=~"service-a.*"}[5m])) / sum(rate(http_requests_total{pod=~"service-a.*"}[5m]))

# Energy per request for service-b (joules per request)
sum(rate(kepler_container_joules_total{pod_name=~"service-b.*"}[5m])) / sum(rate(http_requests_total{pod=~"service-b.*"}[5m]))
```

### 3. Sustainability Reporting

Generate monthly carbon emission reports:

```promql
# Monthly CO2 in kilograms (US grid average 0.475 kg/kWh)
(sum(increase(kepler_node_platform_joules_total[30d])) / 3600000) * 0.475
```

### 4. Autoscaling Insights

Correlate power consumption with pod scaling:

```promql
# Compare power vs. pod count
sum(rate(kepler_container_joules_total{container_namespace="default"}[1m]))
and
count(kube_pod_status_phase{namespace="default", phase="Running"})
```

## Summary

Kepler provides valuable energy visibility for Kubernetes clusters, even without hardware power sensors. While model-based estimates aren't billing-grade accurate, they excel at:

✅ Comparing workload efficiency
✅ Identifying power-hungry applications
✅ Tracking consumption trends
✅ Cost attribution
✅ Sustainability reporting
✅ Optimizing resource allocation

For maximum accuracy, deploy on bare metal with RAPL support. For cloud environments, estimation mode provides sufficient insight for optimization and cost management.

%%
# Excalidraw Data

## Text Elements

## Drawing
```json
{
  "type": "excalidraw",
  "version": 2,
  "source": "https://github.com/zsviczian/obsidian-excalidraw-plugin/releases/tag/2.1.4",
  "elements": [
    {
      "id": "4y8R7iOA",
      "type": "text",
      "x": 118.49495565891266,
      "y": -333.44393157958984,
      "width": 3.8599853515625,
      "height": 24,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "strokeStyle": "solid",
      "roughness": 1,
      "opacity": 100,
      "groupIds": [],
      "frameId": null,
      "roundness": null,
      "seed": 967149026,
      "version": 2,
      "versionNonce": 939059582,
      "isDeleted": true,
      "boundElements": null,
      "updated": 1713723615080,
      "link": null,
      "locked": false,
      "text": "",
      "rawText": "",
      "fontSize": 20,
      "fontFamily": 4,
      "textAlign": "left",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "",
      "lineHeight": 1.2
    }
  ],
  "appState": {
    "theme": "dark",
    "viewBackgroundColor": "#ffffff",
    "currentItemStrokeColor": "#1e1e1e",
    "currentItemBackgroundColor": "transparent",
    "currentItemFillStyle": "solid",
    "currentItemStrokeWidth": 2,
    "currentItemStrokeStyle": "solid",
    "currentItemRoughness": 1,
    "currentItemOpacity": 100,
    "currentItemFontFamily": 4,
    "currentItemFontSize": 20,
    "currentItemTextAlign": "left",
    "currentItemStartArrowhead": null,
    "currentItemEndArrowhead": "arrow",
    "scrollX": 583.2388916015625,
    "scrollY": 573.6323852539062,
    "zoom": {
      "value": 1
    },
    "currentItemRoundness": "round",
    "gridSize": null,
    "gridColor": {
      "Bold": "#C9C9C9FF",
      "Regular": "#EDEDEDFF"
    },
    "currentStrokeOptions": null,
    "previousGridSize": null,
    "frameRendering": {
      "enabled": true,
      "clip": true,
      "name": true,
      "outline": true
    }
  },
  "files": {}
}
```
%%