Last month I finally looked at our GPU utilization dashboards properly. What I saw made me physically uncomfortable: 14 A100 GPUs across our cluster, average utilization hovering around 15%. We were paying for dedicated hardware that spent most of its time doing absolutely nothing.
This is embarrassingly common. Teams request a full GPU for a workload that uses it for training bursts of 20 minutes, then idles for hours. Kubernetes treats GPUs as integer resources — you either have one or you don’t. There’s no native way to share.
Here’s how I clawed back most of that waste.
The Problem: GPUs Are Not Like CPUs
With CPU and memory, Kubernetes can overcommit. Requests and limits give you flexibility. GPUs? None of that. The nvidia.com/gpu resource is an extended resource — it’s all-or-nothing:
resources:
  limits:
    nvidia.com/gpu: 1
This pod now owns that entire GPU. Even if it only uses 2GB of the 80GB VRAM and runs inference for 5 seconds every minute.
You can check current allocation vs actual usage pretty easily:
# What Kubernetes thinks is allocated
kubectl describe nodes | grep -A5 "nvidia.com/gpu"
# What's actually happening on the GPU
kubectl exec -it <gpu-pod> -- nvidia-smi
The gap between those two numbers is the money you're burning.
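To put a number on that gap, it helps to parse nvidia-smi's CSV query output instead of eyeballing it. A small sketch; the `parse_gpu_usage` helper is my own, and it assumes the field order produced by the query flags in the comment:

```python
# Parses the output of:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# and reports how much of each "allocated" GPU is actually in use.

def parse_gpu_usage(csv_text: str) -> list[dict]:
    """Return one dict per GPU: compute utilization % and memory used/total (MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        util, mem_used, mem_total = (int(x) for x in line.split(","))
        gpus.append({
            "util_pct": util,
            "mem_used_mib": mem_used,
            "mem_total_mib": mem_total,
            "mem_pct": round(100 * mem_used / mem_total, 1),
        })
    return gpus

# Illustrative sample: two 80GB cards, both nearly idle.
sample = "12, 2048, 81920\n0, 512, 81920"
for i, gpu in enumerate(parse_gpu_usage(sample)):
    print(f"GPU {i}: {gpu['util_pct']}% compute, {gpu['mem_pct']}% memory")
```

Run that against every GPU node and the waste usually jumps out immediately.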
Option 1: NVIDIA GPU Time-Slicing
The quickest win. Time-slicing lets multiple pods share a single physical GPU by time-multiplexing access. It’s not MIG (Multi-Instance GPU) — there’s no memory isolation — but for inference workloads and dev environments, it works well enough.
First, update your NVIDIA device plugin config:
# nvidia-device-plugin-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-plugin-configs
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
This tells the device plugin to advertise each physical GPU as 4 virtual slices. A node with 2 GPUs now shows 8 nvidia.com/gpu available.
Apply it and restart the device plugin:
kubectl apply -f nvidia-device-plugin-config.yaml
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system
After a minute, check your node capacity:
kubectl describe node gpu-node-01 | grep gpu
# nvidia.com/gpu: 8 (was 2)
The Catch
Time-slicing does NOT isolate GPU memory. If Pod A allocates 70GB of VRAM on an 80GB card, Pod B will OOM when it tries to allocate anything meaningful. You either trust your workloads or enforce memory limits at the application level, e.g. framework caps like PyTorch's set_per_process_memory_fraction or its max_split_size_mb allocator option. (CUDA_VISIBLE_DEVICES won't help here: it controls which GPUs a process can see, not how much memory it can allocate.)
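Since nothing enforces the split, the per-slice VRAM budget has to be computed by hand. A rough helper; the headroom convention is my own, not anything NVIDIA provides:

```python
def vram_budget_mb(total_vram_mb: int, replicas: int, headroom_pct: float = 10.0) -> int:
    """Split a card's VRAM evenly across time-slice replicas,
    keeping some headroom for CUDA context overhead."""
    usable = total_vram_mb * (1 - headroom_pct / 100)
    return int(usable // replicas)

# An 80GB card advertised as 4 slices leaves roughly 18GB per pod.
print(vram_budget_mb(81920, 4))
```

Whatever number this gives you becomes the cap you push down into each workload.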
For our inference services, I added this to every deployment:
env:
  - name: NVIDIA_MEM_LIMIT_MB
    value: "16000"
And in the Python code:
import torch
torch.cuda.set_per_process_memory_fraction(0.2) # 20% of GPU mem
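The hardcoded 0.2 and the env var can drift apart, so it's worth deriving one from the other. A sketch; the helper name and the 80GB total are my assumptions, and the torch call is guarded so the logic runs even on a machine without a GPU:

```python
import os

def mem_fraction_from_env(total_gpu_mb: int, default_mb: int = 16000) -> float:
    """Derive the per-process memory fraction from NVIDIA_MEM_LIMIT_MB
    (our own deployment convention) instead of hardcoding 0.2."""
    limit_mb = int(os.environ.get("NVIDIA_MEM_LIMIT_MB", default_mb))
    return min(limit_mb / total_gpu_mb, 1.0)

frac = mem_fraction_from_env(total_gpu_mb=80_000)  # A100 80GB

try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(frac)
except ImportError:
    pass  # torch not installed here; the computed fraction is still usable

print(frac)
```

Now changing the deployment env var changes the cap without touching application code.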
Option 2: Kubernetes Scheduler Plugins for GPU-Aware Scheduling
Time-slicing is a blunt tool. For smarter allocation, I set up a custom scheduler plugin that considers actual GPU utilization when placing pods.
The scheduler-plugins project has a Trimaran plugin family. The one I used is TargetLoadPacking — it tries to bin-pack workloads onto GPUs that are already partially used instead of spreading them out.
Here’s the scheduler config I ended up with:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-aware-scheduler
    plugins:
      score:
        enabled:
          - name: TargetLoadPacking
    pluginConfig:
      - name: TargetLoadPacking
        args:
          defaultRequests:
            nvidia.com/gpu: "500m"
          targetUtilization: 70
          metricProvider:
            type: Prometheus
            address: http://prometheus.monitoring:9090
The key insight: this pulls real utilization metrics from Prometheus (via DCGM exporter) and scores nodes based on actual GPU load, not just what Kubernetes thinks is allocated.
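To make the scoring concrete, here's a simplified illustration of the bin-packing idea. This is not the plugin's actual code, just the shape of it: nodes score highest as their measured load approaches the target, and drop off sharply past it.

```python
def pack_score(current_util: float, target: float = 70.0) -> float:
    """Illustrative TargetLoadPacking-style node score (simplified,
    not the plugin's exact formula)."""
    if current_util <= target:
        # Rises from `target` (at 0% load) to 100 (at the target).
        return target + (100 - target) * current_util / target
    # Falls from `target` back toward 0 as load approaches 100%.
    return target * (100 - current_util) / (100 - target)

# A 60%-busy node outranks both an idle node and a 90%-busy one,
# so new pods pack onto partially used GPUs first.
print({u: round(pack_score(u)) for u in (0, 60, 70, 90)})
```

The effect in practice: idle GPUs stay fully idle (and become candidates for removal) instead of every node sitting at 15%.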
Deploy the DCGM exporter if you haven’t already:
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --set serviceMonitor.enabled=true
Now you get metrics like DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED in Prometheus, and the scheduler uses them.
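Those metrics are also reachable over Prometheus's standard HTTP query API, which is handy for ad-hoc audits outside Grafana. A stdlib-only sketch; the helper names are mine, and the `Hostname` label is an assumption about your DCGM exporter's labeling:

```python
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.monitoring:9090"

def build_query_url(base: str, promql: str) -> str:
    """Prometheus instant-query endpoint: GET /api/v1/query?query=<promql>."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def gpu_util_by_node(base: str = PROM) -> dict[str, float]:
    """Fetch average DCGM GPU utilization per host (needs network access)."""
    url = build_query_url(base, "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)")
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    return {r["metric"].get("Hostname", "?"): float(r["value"][1])
            for r in body["data"]["result"]}

print(build_query_url(PROM, "DCGM_FI_DEV_GPU_UTIL"))
```

I run a variant of this in CI to fail loudly when a node has been idle all week.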
Option 3: The “Good Enough” Approach — Just Add Monitoring First
If you’re not ready for scheduler plugins (they add operational complexity), start with visibility. You can’t fix what you can’t see.
My Grafana dashboard query for GPU waste:
# Share of GPU memory sitting free (per node)
100 * sum by (node) (DCGM_FI_DEV_FB_FREE)
    / sum by (node) (DCGM_FI_DEV_FB_TOTAL)
And for utilization over time:
# Average GPU utilization over the last hour (one series per GPU/pod)
avg_over_time(DCGM_FI_DEV_GPU_UTIL{}[1h])
I set up an alert that fires when any GPU has been below 10% utilization for more than 2 hours during business hours. That alone caught 4 “forgotten” notebooks running in our dev namespace that nobody was using.
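The alert condition itself is worth spelling out before you encode it in alerting rules. A sketch over raw utilization samples; the `is_idle_alert` helper and the 9-to-18 business window are my assumptions:

```python
from datetime import datetime

def is_idle_alert(samples, threshold_pct=10.0, min_hours=2.0, business=(9, 18)):
    """Fire when every sample in the window is below the threshold,
    the window spans >= min_hours, and it falls within business hours.
    `samples` is a list of (datetime, utilization_pct), oldest first."""
    if not samples:
        return False
    start, end = samples[0][0], samples[-1][0]
    span_hours = (end - start).total_seconds() / 3600
    in_business = business[0] <= start.hour and end.hour < business[1]
    all_idle = all(util < threshold_pct for _, util in samples)
    return all_idle and span_hours >= min_hours and in_business

# 2.5 hours of near-zero utilization mid-morning: exactly the
# "forgotten notebook" pattern the alert is meant to catch.
samples = [(datetime(2024, 5, 7, 10, 0), 3.0),
           (datetime(2024, 5, 7, 11, 0), 5.0),
           (datetime(2024, 5, 7, 12, 30), 1.0)]
print(is_idle_alert(samples))
```

Restricting it to business hours keeps overnight quiet periods from paging anyone.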
What We Actually Saved
After two weeks of time-slicing on inference nodes + the monitoring alerts:
- GPU count went from 14 to 8 (same workloads, fewer nodes)
- Monthly GPU spend dropped from ~$14k to ~$8k
- Average utilization went from 15% to 55%
Not perfect, but $6k/month buys a lot of coffee.
Gotchas I Hit Along the Way
1. CUDA version mismatches with time-slicing. If your pods use different CUDA versions and share a GPU, you’ll get cryptic driver errors. Pin your CUDA version across all GPU workloads or use separate node pools per CUDA version.
2. The device plugin restart kills running GPU pods. I learned this the hard way on a Tuesday afternoon. Always drain the node first:
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data
# now update config and restart plugin
kubectl uncordon gpu-node-01
3. Prometheus scrape intervals matter. If you’re using metrics-based scheduling and Prometheus scrapes every 60s, the scheduler sees stale data. I dropped GPU metric scraping to 15s for the DCGM exporter. It’s a few more time series but worth it.
4. Don’t time-slice training workloads. Seriously. Two training jobs sharing a GPU will both run 3x slower. Time-slicing is for inference and dev notebooks. Training gets dedicated hardware, no exceptions.
What’s Next
I’m looking at NVIDIA MIG (Multi-Instance GPU) for our A100s, which gives actual hardware-level isolation — separate memory, separate compute engines. It’s more complex to set up (you partition GPUs at the driver level), but it’s the proper solution for multi-tenant clusters.
For now, time-slicing + monitoring + proper scheduling gets you 80% of the way there. Start with the monitoring. Look at your dashboards. I guarantee you’ll find waste.
The best optimization is the one you can see.