Cluster API v1.12: In-Place Updates Changed How I Think About Node Lifecycle

I run a few Kubernetes clusters on bare metal with Cluster API and the BYOH (Bring Your Own Host) provider. Until now, every upgrade followed the same pattern: drain nodes, delete machines, rebuild everything, then wait. Reliable, yes. Fast, not even close, especially once you are past 40 nodes with little spare capacity. Cluster API v1.12 shipped a few weeks ago, and the headline for me was in-place updates. Instead of always doing the immutable delete-and-recreate path, CAPI can now apply some changes directly on existing machines. I spent last week testing this on our staging cluster, and the result was better than I expected. ...

February 23, 2026

Why I Turned Off Dependabot and What I Use Instead

Last Tuesday, one of my Go services got 14 Dependabot PRs in a single day. All of them came from one CVE, and none of them affected the way our code actually runs in production. We still had to read the alerts, review the PRs, wait for CI, and merge the changes. That was the moment I decided to stop using Dependabot for this workflow.

What finally broke it for me

The issue was CVE-2026-26958 in filippo.io/edwards25519. ...

February 21, 2026

How a Missing Close Button Saved My AI Session

Last week I lost a 45-minute Codex session because my thumb grazed the close button on a terminal tab. No warning. No confirmation. Just gone. The session context, chain of thought, and iterative refinements I’d been building all evaporated because of one bad click. If you’ve worked with AI coding agents (Codex, Claude Code, Cursor, whatever), you probably know this feeling. These tools build context over a conversation. Lose that in the middle, and you’re paying for it in both time and mental energy. ...

February 19, 2026

Kyverno 1.17: CEL Policies Hit GA, Time to Migrate

Kyverno 1.17 landed yesterday, and the big news is that CEL policy types are now GA. If you’ve been running Kyverno with JMESPath-based ClusterPolicy resources, the clock is ticking: they’re officially deprecated and scheduled for removal in v1.20 (October 2026). I spent today migrating a production cluster with about 60 policies. Here is what actually happened.

Why This Matters

Kyverno has used JMESPath expressions for years. They work, but they’re Kyverno-specific. CEL (Common Expression Language) is what Kubernetes itself has used for ValidatingAdmissionPolicy since version 1.30, when it went GA. By switching to CEL, Kyverno aligns with upstream and gets significantly better evaluation performance. ...
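To make the CEL side concrete, here is a minimal sketch of the kind of expression involved, written as a Kubernetes-native ValidatingAdmissionPolicy (the upstream API the post mentions), not Kyverno's own policy schema. The policy name, resource selection, and label rule are all hypothetical examples.

```yaml
# Hedged sketch: a plain Kubernetes ValidatingAdmissionPolicy using a CEL
# expression. Kyverno's CEL policy types build on the same expression language.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label        # hypothetical policy name
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    # CEL: evaluated in-process by the API server, no webhook round-trip
    - expression: "'team' in object.metadata.labels"
      message: "all Deployments must carry a team label"
```

The same check in JMESPath-era Kyverno would need a pattern or deny rule in Kyverno's own DSL; with CEL the expression is portable between Kyverno and the API server's built-in admission machinery.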

February 19, 2026

Open Source Contributions as Digital Citizenship

Contributing to open source is like earning citizenship in a virtual society. Here’s what that journey feels like from the inside.

February 17, 2026

We Ditched Artifactory and Built a Self-Hosted Artifact Registry Stack

Last month our Artifactory renewal came in at 40% more than last year. No new features we needed, just the usual “enterprise tier” squeeze. Security scanning? Pay more. Replication? Pay more. SSO that isn’t SAML-only? You guessed it. So I spent two weeks building a replacement. Here’s what actually worked, what didn’t, and the gotchas nobody warns you about.

What We Were Running

Our Artifactory setup handled:

- Docker images (~800 images, ~12TB total)
- npm packages (private registry, ~200 internal packages)
- Helm charts
- Generic binary artifacts (build outputs, firmware blobs)

The big requirements: vulnerability scanning on push, OIDC SSO, and cross-region replication to a DR site. ...

February 17, 2026

NIS2 and Kubernetes: What You Actually Need to Do

If you run Kubernetes in the EU, NIS2 is part of your day-to-day now. The directive has applied since October 2024, and each member state has been enforcing it through local law. I have spent the last few months hardening real clusters for these requirements, so this post is the practical version of what I learned. This is not legal advice. It is the technical checklist I wish I had from day one. ...

February 16, 2026

Reclaiming Idle GPUs in Kubernetes Before They Burn Your Budget

Last month I finally looked at our GPU utilization dashboards properly. What I saw made me physically uncomfortable: 14 A100 GPUs across our cluster, average utilization hovering around 15%. We were paying for dedicated hardware that spent most of its time doing absolutely nothing. This is embarrassingly common. Teams request a full GPU for a workload that uses it for training bursts of 20 minutes, then idles for hours. Kubernetes treats GPUs as integer resources — you either have one or you don’t. There’s no native way to share. ...
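The "integer resource" point is easiest to see in a pod spec. Below is a minimal illustrative manifest (pod name and image are placeholders) using the standard NVIDIA device plugin resource name; the scheduler can only hand out whole devices under this scheme.

```yaml
# Why idle GPUs pile up: the device plugin exposes GPUs as an opaque
# integer resource, so every workload must claim a whole device.
apiVersion: v1
kind: Pod
metadata:
  name: trainer                                # hypothetical workload
spec:
  containers:
    - name: train
      image: registry.example.com/train:latest # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # whole device only; fractions like 0.5 are invalid
```

A job that bursts for 20 minutes and then idles still pins that device for its whole lifetime, which is how a cluster ends up at 15% average utilization.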

February 15, 2026

CPU Limits Don't Kill Pods - The #1 Kubernetes Misunderstanding

I keep seeing the same debugging rabbit hole. A team adds CPU limits, latency gets weird, and the first question is: “Are pods getting killed?” Usually no. That’s memory behavior, not CPU behavior. CPU limits do not kill pods. They throttle them. That one distinction explains a lot of “everything looks fine but users are complaining” incidents.

The Misunderstanding

A lot of engineers assume this mapping:

- Memory limit exceeded → pod gets killed (OOMKill) ✅
- CPU limit exceeded → pod gets killed ❌

The second one is the trap. The official Kubernetes documentation spells it out: ...
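One way to confirm you are looking at throttling rather than kills is the container's cgroup v2 `cpu.stat` file, which counts how many scheduler periods were throttled. A minimal sketch of reading those counters (the field names are real cgroup v2 fields; the sample numbers are made up):

```python
def throttle_ratio(cpu_stat: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    if stats.get("nr_periods", 0) == 0:
        return 0.0
    return stats["nr_throttled"] / stats["nr_periods"]

# Sample cgroup v2 cpu.stat content; in a pod you would read
# /sys/fs/cgroup/cpu.stat inside the container instead.
sample = """usage_usec 4300000
user_usec 3100000
system_usec 1200000
nr_periods 1000
nr_throttled 400
throttled_usec 9500000"""

print(f"throttled in {throttle_ratio(sample):.0%} of periods")
# → throttled in 40% of periods
```

A high ratio here with zero OOMKill events in `kubectl describe pod` output is the throttling signature: the process is alive, just being paused at the end of each period.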

February 14, 2026

Kubernetes Node Readiness Controller - Finally, Proper Node Bootstrap Gates

Last week I ran into a familiar mess: pods landing on nodes before the CNI plugin was actually ready. Kubelet marks the node Ready, the scheduler starts placing workloads, then everything sits in ContainerCreating because Calico is still coming up. I have worked around this with init containers and postStart tricks for way too long. Then I came across the Node Readiness Controller announcement on the Kubernetes blog. It is a new SIG project (v0.1.1), and it is basically what I wanted: custom readiness gates for nodes, managed through a CRD. ...

February 13, 2026