Ingress NGINX Is Officially Dead. Here's How I Migrated Off It in a Weekend

I woke up on March 25th to a Slack message from our security team: “ingress-nginx is EOL as of yesterday. Timeline for migration?” I had been ignoring this for months. The retirement was announced back in November 2025, but it felt distant. Now it was real. No more CVE patches. No more bug fixes. The clock was ticking.

What Actually Happened

On March 24, 2026, Kubernetes SIG Network and the Security Response Committee officially retired ingress-nginx. The project is done. Container images and Helm charts will stay available (they’re not deleting anything), but there will be no new releases. If a critical vulnerability drops tomorrow, you’re on your own. ...
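To put a number on that migration timeline, the first step is an inventory of which Ingress objects still point at ingress-nginx. Below is a minimal sketch of that check with client-go; it assumes in-cluster credentials, and the class name and output are illustrative, not prescribed by the article:

```go
// List every Ingress in the cluster and flag the ones still bound to the
// nginx ingress class. Assumes the pod runs with RBAC that allows listing
// Ingresses cluster-wide; "nginx" as the class name is an assumption.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Empty namespace means "all namespaces".
	ingresses, err := clientset.NetworkingV1().Ingresses("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, ing := range ingresses.Items {
		class := ""
		if ing.Spec.IngressClassName != nil {
			class = *ing.Spec.IngressClassName
		}
		// Catch both the spec field and the legacy annotation.
		if class == "nginx" || ing.Annotations["kubernetes.io/ingress.class"] == "nginx" {
			fmt.Printf("%s/%s still uses ingress-nginx\n", ing.Namespace, ing.Name)
		}
	}
}
```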

April 1, 2026

I Replaced 12 Dev Clusters with vCluster and My AWS Bill Dropped 60%

Every team wanted their own cluster. QA had one, staging had one, and each developer wanted one for feature branches. We ended up with 12 EKS clusters, most of them sitting at 15% utilization, all of them costing real money. I kept hearing about vCluster from Loft Labs and finally gave it a shot three months ago. The pitch sounded too good to be true: full Kubernetes clusters running inside a single host cluster, each with its own API server, its own resources, complete isolation. No extra nodes, no extra control planes to manage. ...

March 29, 2026

I Started Verifying Every Container Image in My Clusters and Here Is What Broke

Last week I noticed that the Kubernetes project had quietly rewritten its image promoter, the tool that pushes official images to registry.k8s.io. The interesting part was not the rewrite itself. It was the fact that the new version now ships proper SLSA provenance attestations and cosign signatures across the mirrors. That was the moment I had to admit something slightly embarrassing: I had been signing my own images in CI for a while, but I was not actually enforcing verification anywhere in the cluster. The signatures existed, but nothing was checking them. So I finally sat down and fixed it. ...

March 23, 2026

Crossplane Compositions: Self-Service Infrastructure That Developers Actually Use

I spent two years being the guy who provisions databases. Every Monday morning, same Slack message: “Hey, can I get a Postgres instance for the new service?” I’d open Terraform, copy a module block, change three variables, run the plan, wait for approval, apply. Twenty minutes of my life, gone. Multiply that by four teams and it adds up fast. Then I set up Crossplane with Compositions, and now developers do it themselves with a single YAML file. Here’s how I got there and what broke along the way. ...

March 21, 2026

Securing Production Debugging in Kubernetes Without Losing Your Sanity

Last week I got paged at 2 AM for a payment service that was dropping requests. My first instinct was the same as always: grab the cluster-admin kubeconfig from the shared wiki page and start poking around. I caught the bug in ten minutes, but the next morning our security team flagged my session in the audit logs. Fair enough. That cluster-admin kubeconfig had been “temporary” for about eight months. ...

March 19, 2026

Debugging etcd in Production Kubernetes: What I Wish I Knew Earlier

Last month I got paged at 2 AM because the API server in a production cluster started timing out. Pods stopped scheduling, kubectl just hung, and the on-call Slack channel had already turned into chaos. About thirty minutes later, I traced it back to etcd. Again. etcd sits in the middle of every Kubernetes cluster, so when it starts having a bad day, the whole cluster feels it. The tricky part is that etcd failures rarely announce themselves clearly. You usually do not get a clean “etcd is broken” signal. You get fuzzy symptoms instead: slow API calls, delayed scheduling, weird timeouts. After dealing with enough of these incidents, I ended up with a playbook of checks that I run almost automatically now. Lately, a tool called etcd-diagnosis has made that process much easier. ...
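Before etcd-diagnosis even enters the picture, the first check in a playbook like that is usually a plain endpoint status call. Here is a minimal sketch with the official etcd Go client; the endpoint and kubeadm-style certificate paths are placeholders for whatever your control plane actually uses:

```go
// Quick etcd triage: dial one endpoint and print its DB size, raft term,
// and whether it thinks it is the leader. That is usually enough to tell
// "etcd is slow" apart from "etcd lost quorum".
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/client/pkg/v3/transport"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder cert paths; adjust for your cluster.
	tlsInfo := transport.TLSInfo{
		CertFile:      "/etc/kubernetes/pki/etcd/healthcheck-client.crt",
		KeyFile:       "/etc/kubernetes/pki/etcd/healthcheck-client.key",
		TrustedCAFile: "/etc/kubernetes/pki/etcd/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		panic(err)
	}

	endpoint := "https://127.0.0.1:2379"
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	status, err := cli.Status(ctx, endpoint)
	if err != nil {
		panic(err)
	}
	fmt.Printf("db size: %d bytes, raft term: %d, leader: %x, is leader: %v\n",
		status.DbSize, status.RaftTerm, status.Leader,
		status.Header.MemberId == status.Leader)
}
```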

March 17, 2026

Registry Mirror Authentication in Kubernetes Without Breaking Tenant Isolation

I spent most of last week chasing image pull failures in a multi-tenant cluster. It turned out the problem was our private registry mirror. We were using it as a pull-through cache, but the credentials lived on the nodes. One team rotated their credentials and, a few minutes later, pods in three other namespaces started failing too. That was the moment it became obvious we had a shared-credentials problem. That sent me down the rabbit hole of CRI-O’s credential provider for registry mirrors. After setting it up, I do not really want to go back. ...

March 11, 2026

Keycloak on Kubernetes: SSO for Your Internal Tools Without Losing Your Mind

I got tired of managing separate logins for Grafana, ArgoCD, Harbor, and every other internal tool we run. Every new team member meant creating five accounts. Every offboarding meant hoping I remembered to revoke all of them. So I finally sat down and deployed Keycloak on our Kubernetes cluster. This is what actually happened, not the sanitized version.

Why Keycloak

I looked at Dex, Authelia, and Keycloak. Dex is lightweight but limited if you need more than OIDC proxying. Authelia is great for simple setups but felt thin for our use case. Keycloak is heavier, but it handles OIDC, SAML, user federation, and has a proper admin UI. For a team running 8+ internal services, the weight is justified. ...

March 9, 2026

Kubernetes Gateway API: I Finally Replaced All My Ingress Resources

I kept postponing this migration for way too long. Every time Gateway API came up, I had the same answer: “yeah, I know, I should do it.” Then last week I finally stopped talking about it and migrated three production clusters from Ingress to Gateway API. After doing it end to end, I wish I had moved sooner.

Why I Finally Did It

The trigger was a multi-tenant cluster where two teams shared the same domain but needed different TLS behavior. ...

March 7, 2026

Cilium Tetragon: eBPF Runtime Security That Actually Catches Things

I’ve been running Falco for runtime security on most of my clusters for the past two years. It did the job, but the kernel module approach always felt brittle. Every kernel upgrade felt like rolling dice. When Cilium Tetragon reached 1.3 stable and went full eBPF with no kernel module, I finally gave it a real try on a production cluster. This is what happened.

Why I Switched from Falco

Falco has been solid, no question. But I kept running into the same issues: ...

March 5, 2026