I’ve been running mixed clusters with ML training jobs and regular services for about two years. Scheduling has been the biggest headache: a distributed training run would get only some of its pods placed, GPUs would sit idle, and everyone lost time.
Kubernetes 1.35 came out last week, so I spent the weekend testing it on our staging cluster. A few of these changes are genuinely useful.
Gang Scheduling Finally Exists
The biggest addition is workload-aware scheduling with gang scheduling support. It’s still alpha, so I would not put it in production yet, but the model is exactly what we needed: a group of pods either gets scheduled together, or not at all.
Before 1.35, we had a pile of workarounds: Volcano, the coscheduling plugin, plus custom scripts that deleted partial placements. It worked, but it was fragile.
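For context, the coscheduling-plugin workaround meant maintaining a separate PodGroup object per job (this is the scheduler-plugins `scheduling.x-k8s.io` CRD; the name here is illustrative):

```yaml
# Pre-1.35 workaround: PodGroup from the scheduler-plugins coscheduling plugin.
# Each pod in the job also has to carry a label tying it back to this group.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-run-042
spec:
  minMember: 4   # don't bind any pod until 4 of them can be placed
```

It works, but the group lives outside the workload definition, which is exactly the kind of drift the new built-in API avoids.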
Here’s what the new API looks like:
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-run-042
spec:
  podSets:
  - name: workers
    count: 4
    template:
      spec:
        containers:
        - name: trainer
          image: my-registry/llm-trainer:v3.2
          resources:
            limits:
              nvidia.com/gpu: 1
  schedulingPolicy:
    gangScheduling:
      mode: Strict
Strict means all four pods are placed, or none are. No more burning GPU budget while one pod is still Pending.
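The all-or-nothing semantics boil down to a simple feasibility check: attempt to place every pod, and if any one of them has nowhere to go, discard the whole attempt. This little sketch is illustrative only (the real scheduler scores and filters nodes; this just takes the first fit):

```python
def can_place_gang(free_gpus_per_node, pods, gpus_per_pod=1):
    """Return a node assignment for every pod, or None if the full gang doesn't fit."""
    free = dict(free_gpus_per_node)  # copy, so a failed attempt doesn't mutate input
    placement = {}
    for pod in range(pods):
        # pick any node with enough free GPUs (a real scheduler scores candidates)
        node = next((n for n, g in free.items() if g >= gpus_per_pod), None)
        if node is None:
            return None  # all-or-nothing: one unplaceable pod fails the whole gang
        free[node] -= gpus_per_pod
        placement[f"worker-{pod}"] = node
    return placement
```

With two A100s on each of two nodes, a 4-pod gang fits; ask for 6 GPUs with only 5 free and the whole request stays pending rather than half-starting.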
I tested this on a 4-node GPU cluster, two A100s per node. A workload requesting four GPUs started exactly the way you’d hope: all pods together. Then I requested six GPUs when only five were free. The workload stayed in Waiting instead of half-starting. Perfect.
kubectl get workloads
NAME               STATUS    AGE
training-run-042   Running   2m
training-run-043   Waiting   45s
kubectl describe workload training-run-043
# ...
# Message: Insufficient nvidia.com/gpu: requested 6, available 5
Important detail: the feature gate is WorkloadAwareScheduling, and it’s off by default. You need it on both kube-scheduler and kube-apiserver:
# In your kubeadm config or static pod manifests:
--feature-gates=WorkloadAwareScheduling=true
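If you manage the control plane with kubeadm, the gate goes into the cluster configuration rather than raw flags. A sketch, assuming kubeadm's v1beta4 config format (adapt to however you template your control plane):

```yaml
# kubeadm ClusterConfiguration fragment: enable the gate on both components
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: WorkloadAwareScheduling=true
scheduler:
  extraArgs:
  - name: feature-gates
    value: WorkloadAwareScheduling=true
```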
In-Place Pod Resize Is Stable
This has been in progress since 1.27 and is now GA. You can change CPU and memory limits on a running pod without restarting the container.
For inference services, this is a big deal. Traffic spikes, you raise CPU, and the process keeps running. No cold start. No reconnect storm. No model reload.
kubectl patch pod inference-server-abc123 --subresource resize \
--type merge -p '{"spec":{"containers":[{"name":"server","resources":{"limits":{"cpu":"4"}}}]}}'
I resized an ONNX inference pod from 2 to 4 CPU cores. The pod stayed up and latency dropped within seconds. Scaling back down worked too. Memory resize has one catch: if current memory use is already above the new limit, the resize is rejected. The pod is not killed, which is exactly what you want.
kubectl get pod inference-server-abc123 -o jsonpath='{.status.resize}'
# "InProgress" -> "Completed"
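The memory catch is worth internalizing, so here is the behavior I observed, written out as a decision function. This models what I saw in testing, not kubelet source code:

```python
def resize_decision(resource, current_usage, new_limit):
    """Sketch of observed in-place resize behavior:
    CPU changes apply live in either direction; shrinking the memory limit
    below current usage is rejected, and the pod keeps running on its old limit."""
    if resource == "memory" and new_limit < current_usage:
        return "Rejected"   # pod survives, old limit stays in effect
    return "Completed"      # limit updated in place, no container restart
```

The key point: the failure mode is a rejected patch, not an OOM-killed container.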
One thing I did not expect: HPA can use this now. You can resize vertically first and only scale out horizontally when needed. VPA integration is reportedly landing in 1.36, but even today this is practical with custom metrics.
KYAML Is the Default kubectl Output
This one can surprise people. kubectl now emits KYAML by default instead of regular YAML. KYAML is stricter and avoids common parsing mistakes, including the classic Norway problem, where an unquoted no gets parsed as the boolean false.
If your scripts parse kubectl output, run tests before upgrading. Most scripts will be fine, but edge cases are real. You can temporarily switch back:
export KUBECTL_KYAML=false
I ran our CI pipeline with KYAML enabled and found two breakages:
- A script expecting bare yes and no values in ConfigMaps now receives quoted strings, which broke a string comparison.
- Some multi-line strings are rendered in a slightly different flow style.
Both fixes were small, but this is exactly the kind of issue that’s annoying to debug under production pressure.
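The yes/no breakage comes straight from YAML 1.1's boolean rules, which you can sketch in a few lines. This is a toy scalar resolver for illustration, not a real parser:

```python
# YAML 1.1 treats these bare scalars as booleans (case-insensitive);
# KYAML sidesteps the ambiguity by quoting strings.
YAML11_BOOLS = {
    "yes": True, "no": False,
    "true": True, "false": False,
    "on": True, "off": False,
}

def yaml11_scalar(token, quoted=False):
    """How a YAML 1.1 parser reads a scalar: quoted stays a string,
    bare yes/no/on/off/true/false become booleans."""
    if quoted:
        return token
    return YAML11_BOOLS.get(token.lower(), token)
```

A ConfigMap value of no comes back as the boolean false from a YAML 1.1 parser, but as the string "no" once KYAML quotes it, which is what broke our comparison.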
DRA Keeps Improving
Dynamic Resource Allocation is not new in 1.35, but it keeps getting better. Device claims are more predictable now, and scheduling hints seem to line up better with gang scheduling.
For our GPU workloads, DRA claim resolution feels faster. On 1.34 we sometimes saw delays of 10 to 15 seconds between pod scheduling and GPU allocation. On 1.35, it’s usually under three seconds. I have not dug into whether that comes from a specific fix or just scheduler improvements stacking up.
What I Changed in Our Clusters
After testing, this is what I’m rolling out:
- Enable gang scheduling in staging for distributed training jobs, then run it for a month before deciding on production.
- Move inference pods to in-place resize instead of deleting and recreating pods with new resource requests.
- Update CI scripts for KYAML compatibility before production upgrades.
- Keep DRA config as is, but keep watching allocation latency.
The Bigger Picture
Kubernetes is clearly becoming the default operations layer for AI infrastructure. Gang scheduling, in-place resize, and DRA together make serious ML workloads much more practical without adding extra schedulers.
A year ago I would’ve said “use Slurm for multi-node GPU training” and moved on. Now it’s not that simple. For common cases, Kubernetes 1.35 is good enough that running one platform for everything starts to win.
Gang scheduling being alpha is still the main caveat. For inference-heavy clusters, though, stable in-place resize alone is a strong reason to upgrade.
If you’re planning a rollout, start by validating CI/CD scripts against KYAML. That’s where quiet breakage tends to hide. The new scheduling features are opt-in, so they will not surprise you.