I keep seeing the same debugging rabbit hole.
A team adds CPU limits, latency gets weird, and the first question is: “Are pods getting killed?” Usually no. That’s memory behavior, not CPU behavior.
CPU limits do not kill pods. They throttle them. That one distinction explains a lot of “everything looks fine but users are complaining” incidents.
The Misunderstanding
A lot of engineers assume this mapping:
- Memory limit exceeded → pod gets killed (OOMKill) ✅
- CPU limit exceeded → pod gets killed ❌
The second one is the trap. The official Kubernetes documentation spells it out:
CPU limits are enforced by CPU throttling. When a container approaches its cpu limit, the kernel will restrict access to the CPU corresponding to the container’s limit. Thus, a cpu limit is a hard limit the kernel enforces. Containers may not use more CPU than is specified in their cpu limit.
Memory limits are enforced by the kernel with out of memory (OOM) kills.
Same resources section, totally different enforcement model.
How CPU Requests and Limits Actually Work
These two knobs are related, but they solve different problems.
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1000m"
CPU requests (500m = half a core) tell the scheduler what to reserve. Kubernetes maps this to CFS shares via cpu.shares. That is proportional sharing. Under contention, you get your guaranteed share. If the node is quiet, you can use more.
CPU limits (1000m = one core) set a hard ceiling. The kernel enforces it with CFS bandwidth control (cfs_quota_us and cfs_period_us). Even with idle CPU on the node, your container cannot cross that ceiling.
Requests define fair sharing. Limits define a hard stop.
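The millicore-to-kernel mapping can be sketched in a few lines. This mirrors the formulas the kubelet applies (simplified; the function names here are illustrative, not the kubelet's actual API):

```python
# How CPU requests/limits map to CFS knobs (simplified sketch of the
# kubelet's conversion; names are illustrative).

CFS_PERIOD_US = 100_000  # default cfs_period_us: 100ms
MIN_SHARES = 2           # kernel-enforced minimum for cpu.shares

def millicores_to_shares(millicores: int) -> int:
    """CPU request -> cpu.shares (proportional weight under contention)."""
    return max(MIN_SHARES, millicores * 1024 // 1000)

def millicores_to_quota_us(millicores: int) -> int:
    """CPU limit -> cfs_quota_us (hard ceiling per 100ms period)."""
    return millicores * CFS_PERIOD_US // 1000

print(millicores_to_shares(500))     # 500m request -> 512 shares
print(millicores_to_quota_us(1000))  # 1000m limit  -> 100000us per period
print(millicores_to_quota_us(500))   # 500m limit   -> 50000us per period
```

Note the asymmetry: shares only matter when cores are contended, while the quota is enforced every period regardless of how idle the node is.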
The CFS Scheduler and How Throttling Works
Kubernetes relies on Linux CFS bandwidth control for CPU limits. The kernel docs describe it like this:
- CFS runs in periods; the default cfs_period_us is 100,000 microseconds (100ms)
- 1000m (1 core) gives a cfs_quota_us of 100,000µs per period; 500m gives 50,000µs per period
- Quota is handed to per-CPU run queues. When quota is spent, threads are throttled until the next period
In practice: a request burst hits your service, your process burns its quota early, then it sits paused until the next period. No restart, no crash, just periodic freezes.
That is why this hurts latency so much.
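A toy single-threaded model makes the latency math concrete. The big assumption (stated in the code) is that other work in the container burns the quota from the start of each period, so a request arriving late in a period finds the budget gone:

```python
# Toy CFS bandwidth model: a request needing `work_ms` of CPU arrives at
# `arrival_ms`; `quota_ms` of CPU is available per `period_ms` period.
# Assumes the container is otherwise busy, so each period's quota is
# consumed from the start of the period. Illustrative only.

def latency_ms(work_ms, arrival_ms, quota_ms=50.0, period_ms=100.0):
    t = float(arrival_ms)
    remaining = float(work_ms)
    while remaining > 0:
        period_start = (t // period_ms) * period_ms
        # Quota left in the current period under the "busy container" assumption
        budget_left = max(0.0, period_start + quota_ms - t)
        if budget_left == 0.0:
            t = period_start + period_ms  # throttled: frozen until next period
            continue
        run = min(remaining, budget_left)
        t += run
        remaining -= run
    return t - arrival_ms

print(latency_ms(10, arrival_ms=0))   # quota fresh: finishes in 10ms
print(latency_ms(10, arrival_ms=60))  # quota burned: stalls until the next period
```

Same 10ms of CPU work, wildly different wall-clock latency depending on where in the period the request lands. That is exactly the p99-vs-p50 split throttled services show.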
The Multi-Core Gotcha (The Part That Really Bites)
This is the part many experienced teams miss.
Engineers at a major job search platform found a serious CFS throttling bug introduced in Linux v4.18 (commit 512ac999). It caused heavy throttling even when containers were not actually consuming their full budget.
High-level version: CFS keeps a global quota bucket, then hands out slices (default 5ms) to cores as threads run. Unused time should return to the global pool. But the kernel leaves 1ms behind per core to avoid lock contention.
On a small machine, this is minor. On very high core counts, that “stranded” quota gets painful fast. Their team reported worst-case latency improving from over two seconds to 30 milliseconds after fixing it.
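A back-of-envelope calculation shows why core count is the multiplier here. The numbers are illustrative, following the behavior described in the Unthrottled write-up (roughly 1ms of unreturned quota can sit stranded per per-CPU runqueue), not measurements:

```python
# Illustrative arithmetic for the pre-5.4 "stranded quota" problem:
# each per-CPU runqueue can hold back ~1ms of quota that never returns
# to the global pool within the period.

def stranded_fraction(quota_ms, cores_touched, stranded_per_core_ms=1.0):
    """Worst-case fraction of per-period quota stranded on runqueues."""
    return min(1.0, cores_touched * stranded_per_core_ms / quota_ms)

quota_ms = 100  # a 1000m (1 core) limit: 100ms quota per 100ms period

print(stranded_fraction(quota_ms, 4))   # small box: ~4% stranded, barely noticeable
print(stranded_fraction(quota_ms, 88))  # 88-core machine: up to ~88% stranded
```

On a big machine, a multithreaded process whose threads briefly touch many cores can get throttled after using only a small fraction of its nominal budget.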
Fixes landed in 5.4+ and were backported to some 4.x trees, but older kernels can still bite you.
A travel tech company reported similar behavior: random stalls, failed health checks, connection issues, all traced back to aggressive throttling.
The Sneaky Symptoms
The annoying part is where this shows up.
What dashboards often show:
- CPU at 40-50%
- No restarts
- No OOMKills
- Nothing obviously broken
What users feel:
- p99 latency spikes
- Random timeouts
- Slow endpoints
- Failed health checks
- Requests that should finish in 10ms taking 200ms
Why the mismatch? Time averaging. Your 5-minute graph looks safe while each 100ms period is stop-and-go.
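The arithmetic of that mismatch is worth seeing once. Toy numbers: the container runs flat out for 45ms of every 100ms period, then sits throttled for the remaining 55ms:

```python
# Why dashboards look calm: average CPU over a 5-minute window vs the
# per-100ms reality. Illustrative numbers only.

period_ms = 100
busy_ms_per_period = 45        # quota exhausted at 45ms, then throttled
window_ms = 5 * 60 * 1000      # a 5-minute dashboard window

periods = window_ms // period_ms
avg_cpu = periods * busy_ms_per_period / window_ms

print(f"average CPU over 5m: {avg_cpu:.0%}")                      # looks healthy
print(f"freeze per period:   {period_ms - busy_ms_per_period}ms") # users feel this
```

The 45% average is what the graph shows; the 55ms freeze every single period is what every in-flight request experiences.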
As Kubernetes issue #67577 puts it: “CFS quotas can lead to unnecessary throttling.”
I have watched teams blame networks and databases for days before anyone checks throttling metrics.
How to Detect Throttling
Start with container_cpu_cfs_throttled_periods_total. A quick ratio query:
rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
/
rate(container_cpu_cfs_periods_total{container!=""}[5m])
If it is above 0, throttling exists. If it sits above 20-25%, users are probably noticing.
You can also check cgroup stats directly. For cgroup v1:
# Check the CFS stats for a running container
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
For cgroup v2 (most modern distros):
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu.stat
Watch these fields:
- nr_periods - total periods in which any thread in the cgroup was runnable
- nr_throttled - periods in which the quota was exhausted and threads were throttled
- throttled_time - total time spent throttled (nanoseconds; on cgroup v2 the field is throttled_usec, in microseconds)
If nr_throttled keeps climbing relative to nr_periods, you are getting throttled.
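If you want to script that check, a small parser over the cpu.stat output does it. This is a sketch that handles both the v1 and v2 field names:

```python
# Parse cpu.stat output (cgroup v1 or v2 field names) and compute the
# fraction of periods that hit the quota.

def throttle_ratio(cpu_stat_text: str) -> float:
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

# Example values (made up for illustration):
sample = """\
nr_periods 18492
nr_throttled 4113
throttled_time 92834027348
"""
print(f"{throttle_ratio(sample):.1%}")  # 22.2% of periods throttled
```

Anything in the 20%+ range, like this made-up sample, is squarely in "users are noticing" territory per the rule of thumb above.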
You can also inspect what the kernel is enforcing:
# cgroup v1
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
# cgroup v2
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu.max
The Case for Removing CPU Limits
A lot of production teams now remove CPU limits and keep CPU requests.
Why this can work well:
- Requests already protect fairness during contention via cpu.shares.
- Limits block useful burst capacity when nodes have idle CPU.
- Throttling creates unpredictable latency, which is often worse than extra CPU burn.
Tim Hockin, an original Kubernetes maintainer at Google, has recommended this pattern for years in many environments.
A requests-only example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
      - name: my-app
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
          limits:
            memory: "512Mi"
            # No CPU limit - intentionally omitted
Memory limits still matter. You want a hard memory boundary to avoid OOM chaos and node instability. CPU is different: with good requests set across workloads, letting services burst can improve latency without harming fairness.
QoS tradeoff: removing CPU limits moves pods from Guaranteed to Burstable QoS. That affects eviction behavior under pressure. For strict latency-sensitive workloads, evaluate static CPU Manager policy.
Kubernetes 1.34+: PodLevelResources (beta) lets you define pod-level CPU and memory budgets, which helps sidecar-heavy pods share idle capacity more efficiently.
When You Should Still Use CPU Limits
There are cases where limits are still the right move:
- Multi-tenant clusters with low trust between teams
- Strict cost accounting and chargeback requirements
- Batch jobs that can consume every spare core and hurt interactive services
- Compliance constraints that require hard boundaries
- Strong noisy-neighbor controls on shared nodes
If you run a single-team, latency-sensitive service cluster, dropping CPU limits is often worth testing.
A Quick Note on the CFS Burst Feature
Linux 5.14+ added a CFS burst feature via cpu.cfs_burst_us. It lets a cgroup bank unused quota and spend it during spikes.
It does not eliminate the core limit model, but it can soften burst throttling for spiky workloads.
Kubernetes does not expose this directly in pod specs today. You can tune it through cgroups if you really need to.
A Real Debugging Session
Here is the practical flow I use:
# Step 1: Check current CPU usage against requests and limits
# (kubectl top shows usage, not throttling - that comes in step 3)
kubectl top pods -n my-namespace
# Step 2: Look at the actual resource specs
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}' | jq .
# Step 3: Check CFS throttling in Prometheus
# Use the PromQL query from above, or set up an alert:
# alert: CPUThrottlingHigh
# expr: rate(container_cpu_cfs_throttled_periods_total[5m])
# / rate(container_cpu_cfs_periods_total[5m]) > 0.25
# Step 4: Check the kernel version (is the CFS fix present?)
kubectl exec -it <pod-name> -- uname -r
# Kernels before 5.4 may have the quota stranding bug
# Step 5: If throttled, either raise the limit or remove it
kubectl patch deployment my-app -n my-namespace --type=json \
-p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/cpu"}]'
# Step 6: Watch latency improve almost immediately
Most of the time, the aha moment is simple: remove or raise the CPU limit, then watch p99 fall without touching application code.
A Real War Story: CoreDNS and the Silent Collapse
This one still stings a bit.
We had been postponing a Kubernetes upgrade on a production EKS cluster for weeks. When we finally did it, part of the change was switching from the AWS-managed CoreDNS addon to a self-managed Helm chart. Seemed straightforward.
The Helm chart came with default CPU limits on CoreDNS. We did not notice.
A few minutes after the switch, everything collapsed. DNS resolution ground to a halt, services could not find each other, cascading failures hit every microservice in the cluster. Full production outage.
We reverted fast. Faster than our Prometheus alerting could even fire, actually. The throttling threshold alert was configured, but the incident was so short and sharp that we reacted before it triggered.
The post-mortem told the whole story. When we pulled up the container CPU dashboard in Grafana, it was obvious. CoreDNS had been hitting its CPU limit hard, CFS throttling kicked in, and DNS latency shot through the roof. Every service that needed to resolve a hostname (which is all of them) started timing out.
The fix was simple: remove the CPU limit from CoreDNS. DNS is one of those things where even small throttling delays multiply across every single request in the cluster.
Lessons from this one:
- Always check resource defaults when switching from managed addons to self-managed Helm charts. The defaults are not always sane for your workload.
- CoreDNS is latency-critical infrastructure. CPU limits on it can take down an entire cluster.
- Alerting thresholds need to account for fast incidents. If your team can react in 2 minutes but your alert needs 5 minutes of sustained throttling to fire, you have a gap.
- Grafana container dashboards are your post-mortem best friend. The CFS throttling metrics told us exactly what happened, even though the alerts did not catch it in time.
Key Takeaways
- CPU limits cause throttling. Memory limits cause OOMKills. Different mechanisms.
- Throttling is often invisible in basic CPU charts. Track CFS throttling metrics directly.
- Pre-5.4 kernels had a well-known quota stranding problem that amplified throttling on multi-core systems.
- CPU requests drive proportional scheduling (cpu.shares). CPU limits enforce hard ceilings (CFS quota).
- Many teams run safely with requests and no CPU limits for latency-sensitive services.
- Before blaming network or database, check throttling first.
If your service is mysteriously slow while dashboards look normal, this is one of the first things to verify.
Further Reading
- Kubernetes: Resource Management for Pods and Containers - Official docs
- CFS Bandwidth Control - Linux Kernel Documentation
- Unthrottled: Fixing CPU Limits in the Cloud - The investigation that found the kernel bug
- CPU Limits and Aggressive Throttling in Kubernetes - Real-world stalls and health check failures
- Stop Using CPU Limits on Kubernetes - The case for request-only configurations
- Kubernetes Issue #67577: CFS quotas can lead to unnecessary throttling
- Kubernetes Issue #51135: Avoid setting CPU limits for Guaranteed pods