OpenTelemetry Auto-Instrumentation on Kubernetes: Zero-Code Observability That Actually Works

Last week I inherited a cluster with around 40 microservices. Observability was close to nonexistent: basic Prometheus metrics, plus a few random log lines. The team wanted distributed tracing “by next sprint.” There was no realistic way to touch app code across a dozen repos in two weeks, so I chose OpenTelemetry Operator auto-instrumentation. This is what happened in practice.

The Setup

We run Kubernetes 1.31 on EKS. The goal was simple: get traces and metrics from every service into Grafana Tempo and Mimir without changing application code. ...
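For readers who have not used the operator before, the pattern has two parts: an Instrumentation resource that tells the operator how to instrument, and a pod annotation that opts a workload in. A minimal sketch follows; the namespace, collector endpoint, and service names are hypothetical, not taken from the post.

```yaml
# Instrumentation CR consumed by the OpenTelemetry Operator.
# The collector endpoint is a hypothetical in-cluster address.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: observability
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
---
# Opt a workload in with a single annotation -- no code changes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        instrumentation.opentelemetry.io/inject-java: "observability/default-instrumentation"
    spec:
      containers:
        - name: app
          image: example-service:latest
```

The annotation key varies by runtime (`inject-java`, `inject-python`, `inject-nodejs`, and so on); the operator injects the matching agent via an init container at pod creation.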

February 27, 2026

Kubernetes 1.35: The Release That Finally Gets AI Workloads Right

I’ve been running mixed clusters with ML training jobs and regular services for about two years. Scheduling has been the biggest headache. A distributed training run would get only some pods placed, GPUs would sit there doing nothing, and everyone would lose time. Kubernetes 1.35 came out last week, so I spent the weekend testing it on our staging cluster. A few of these changes are genuinely useful.

Gang Scheduling Finally Exists

The biggest addition is workload-aware scheduling with gang scheduling support. It’s still alpha, so I would not put it in production yet, but the model is exactly what we needed: a group of pods either gets scheduled together, or not at all. ...

February 25, 2026

Cluster API v1.12: In-Place Updates Changed How I Think About Node Lifecycle

I run a few Kubernetes clusters on bare metal with Cluster API and the BYOH (Bring Your Own Host) provider. Until now, every upgrade followed the same pattern: drain nodes, delete machines, rebuild everything, then wait. Reliable, yes. Fast, not even close, especially once you are past 40 nodes with little spare capacity. Cluster API v1.12 shipped a few weeks ago, and the headline for me was in-place updates. Instead of always doing the immutable delete-and-recreate path, CAPI can now apply some changes directly on existing machines. I spent last week testing this on our staging cluster, and the result was better than I expected. ...

February 23, 2026

Kyverno 1.17: CEL Policies Hit GA, Time to Migrate

Kyverno 1.17 landed yesterday, and the big news is that CEL policy types are now GA. If you’ve been running Kyverno with JMESPath-based ClusterPolicy resources, the clock is ticking: they’re officially deprecated and scheduled for removal in v1.20 (October 2026). I spent today migrating a production cluster with about 60 policies. Here is what actually happened.

Why This Matters

Kyverno has used JMESPath expressions for years. They work, but they’re Kyverno-specific. CEL (Common Expression Language) is what Kubernetes itself uses for ValidatingAdmissionPolicy, GA since 1.30. By switching to CEL, Kyverno aligns with upstream and gets significantly better evaluation performance. ...
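The Kyverno-specific CEL policy kinds are covered in the full post; for a flavor of what CEL validation looks like, here is the upstream ValidatingAdmissionPolicy the post compares against. The policy name and rule are hypothetical examples, not from the article.

```yaml
# Kubernetes-native CEL validation (GA since 1.30). A hypothetical
# policy rejecting Deployments whose containers don't set runAsNonRoot.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-run-as-nonroot
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: >-
        object.spec.template.spec.containers.all(c,
          has(c.securityContext) && c.securityContext.runAsNonRoot == true)
      message: "All containers must set runAsNonRoot: true."
```

Note that a native ValidatingAdmissionPolicy only takes effect once paired with a ValidatingAdmissionPolicyBinding; Kyverno’s CEL types wrap that plumbing for you.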

February 19, 2026

NIS2 and Kubernetes: What You Actually Need to Do

If you run Kubernetes in the EU, NIS2 is part of your day-to-day now. The directive has applied since October 2024, and each member state has been enforcing it through local law. I have spent the last few months hardening real clusters for these requirements, so this post is the practical version of what I learned. This is not legal advice. It is the technical checklist I wish I had from day one. ...

February 16, 2026

Reclaiming Idle GPUs in Kubernetes Before They Burn Your Budget

Last month I finally looked at our GPU utilization dashboards properly. What I saw made me physically uncomfortable: 14 A100 GPUs across our cluster, average utilization hovering around 15%. We were paying for dedicated hardware that spent most of its time doing absolutely nothing. This is embarrassingly common. Teams request a full GPU for a workload that uses it for training bursts of 20 minutes, then idles for hours. Kubernetes treats GPUs as integer resources — you either have one or you don’t. There’s no native way to share. ...
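Since Kubernetes itself only hands out whole GPUs, sharing has to come from the device plugin. One common approach is time-slicing in the NVIDIA device plugin, which advertises each physical GPU as several schedulable replicas. A sketch, assuming the plugin is deployed in kube-system; the ConfigMap name and replica count are illustrative.

```yaml
# Hypothetical config enabling time-slicing in the NVIDIA device plugin:
# each physical GPU is advertised as 4 nvidia.com/gpu resources, so four
# bursty pods can share one card instead of each pinning a whole A100.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

Time-slicing provides no memory or fault isolation between the sharing pods; on A100-class hardware, MIG partitioning is the isolated alternative.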

February 15, 2026

CPU Limits Don't Kill Pods - The #1 Kubernetes Misunderstanding

I keep seeing the same debugging rabbit hole. A team adds CPU limits, latency gets weird, and the first question is: “Are pods getting killed?” Usually no. That’s memory behavior, not CPU behavior. CPU limits do not kill pods. They throttle them. That one distinction explains a lot of “everything looks fine but users are complaining” incidents.

The Misunderstanding

A lot of engineers assume this mapping:

Memory limit exceeded → pod gets killed (OOMKill) ✅
CPU limit exceeded → pod gets killed ❌

The second one is the trap. The official Kubernetes documentation spells it out: ...
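The asymmetry is visible right in a pod spec. A minimal illustration (pod name and values are made up): the CPU limit below is enforced by CFS quota throttling, while only the memory limit can trigger an OOMKill.

```yaml
# A pod with both limits. Exceeding cpu: 500m does NOT kill the
# container; the kernel's CFS quota throttles it each scheduling
# period. Exceeding memory: 128Mi is what triggers an OOMKill.
apiVersion: v1
kind: Pod
metadata:
  name: throttle-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "while true; do :; done"]  # busy loop to hit the CPU limit
      resources:
        requests:
          cpu: "250m"
          memory: "64Mi"
        limits:
          cpu: "500m"
          memory: "128Mi"
```

To see throttling rather than guess at it, watch the cAdvisor metric `container_cpu_cfs_throttled_periods_total` for the container: it climbs steadily while latency degrades and the pod stays Running.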

February 14, 2026

Kubernetes Node Readiness Controller - Finally, Proper Node Bootstrap Gates

Last week I ran into a familiar mess: pods landing on nodes before the CNI plugin was actually ready. Kubelet marks the node as Ready, the scheduler starts placing workloads, then everything sits in ContainerCreating because Calico is still coming up. I have worked around this with init containers and postStart tricks for way too long. Then I came across the Node Readiness Controller announcement on the Kubernetes blog. It is a new SIG project (v0.1.1), and it is basically what I wanted: custom readiness gates for nodes, managed through a CRD. ...
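For context, the established workaround the new controller aims to replace is a startup taint that the CNI agent removes once it is healthy. A sketch, assuming a Cilium-style taint key; the key name is the convention Cilium documents, not something from the Node Readiness Controller itself.

```yaml
# KubeletConfiguration fragment: register each node with a startup
# taint. Regular pods carry no matching toleration, so they stay
# unschedulable until the CNI agent (which does tolerate the taint)
# comes up and removes it from the node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: node.cilium.io/agent-not-ready
    value: "true"
    effect: NoSchedule
```

This works, but it is per-CNI convention and all-or-nothing; a CRD-driven readiness gate lets you compose multiple conditions (CNI, CSI, node agents) declaratively.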

February 13, 2026

Detecting Kubernetes Nodes Running Only DaemonSet Pods, A Deep Dive

A real-world story about PromQL struggles, Helm templating, alert design, and operational savings by Dedico Servers.

Executive Summary

At Dedico Servers, we specialize in building efficient, cost-optimized Kubernetes clusters. In this article, we engineer a Prometheus-based alert to detect nodes running only DaemonSet pods, an operational and financial risk. By tackling this hidden inefficiency, we help our clients save thousands of dollars annually while improving the resilience of their clusters. ...
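As a sketch of the kind of rule the article builds: using kube-state-metrics, flag nodes where no pod has a non-DaemonSet owner. The metric names (`kube_pod_info`, `kube_pod_owner`) are standard kube-state-metrics series; the alert name, duration, and exact expression here are illustrative and may differ from the article’s final version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: daemonset-only-nodes
spec:
  groups:
    - name: node-efficiency
      rules:
        - alert: NodeRunningOnlyDaemonSetPods
          # All nodes with pods, minus nodes that have at least one
          # pod owned by something other than a DaemonSet.
          expr: |
            count by (node) (kube_pod_info{node!=""})
            unless
            count by (node) (
              kube_pod_info{node!=""}
              * on (namespace, pod) group_left ()
              kube_pod_owner{owner_kind!="DaemonSet"}
            )
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} runs only DaemonSet pods"
```

The `for: 30m` hold-off avoids paging on nodes that are simply draining or freshly provisioned.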

April 10, 2025 · Dedico Servers

Scaling GitOps with ArgoCD ApplicationSets

Managing Kubernetes applications with ArgoCD is already a game-changer, but what if you need to deploy the same app across 10 clusters, or generate dynamic app configs based on Git branches or Helm values? That’s where ApplicationSets step in. 🚀

What is an ApplicationSet?

An ApplicationSet is a Kubernetes custom resource that tells ArgoCD how to automatically generate multiple Application resources from a template. It’s like templating your ArgoCD apps, letting you define how they should be generated and where they should go. ...
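A minimal sketch of the generate-from-template idea: a list generator stamping out one Application per cluster. The cluster URLs, repo, and app name are hypothetical placeholders.

```yaml
# One ApplicationSet -> one Argo CD Application per list element.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: guestbook
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: staging
            url: https://staging.example.com:6443
          - cluster: production
            url: https://prod.example.com:6443
  template:
    metadata:
      name: "guestbook-{{cluster}}"   # parameters come from the generator
    spec:
      project: default
      source:
        repoURL: https://github.com/example/apps.git
        targetRevision: HEAD
        path: guestbook
      destination:
        server: "{{url}}"
        namespace: guestbook
```

Swapping the list generator for a git or cluster generator is what makes this scale: new clusters or new directories in the repo produce new Applications with no manual YAML.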

March 21, 2025