Posts

DevOps: Jacks of All Trades, Masters of AI

DevOps people have always lived between worlds. One day the problem is a slow deployment. The next day it is a Kubernetes networking issue, a Terraform drift, a flaky test suite, an unexpected cloud bill, a security exception, or a production incident that refuses to fit neatly into anyone’s dashboard. That variety used to make DevOps look like a discipline of generalists. In the AI era, that is not a weakness. It is the point. ...

Amazon Linux 2 Support Ends in 2026, So I Started Moving Nodes Now

AWS sent another reminder this week: Amazon Linux 2 support ends on June 30, 2026. That still sounds far enough away to ignore, right up until you remember all the places AL2 tends to hide. EC2 launch templates, golden AMIs, EKS managed node groups, ECS hosts, Packer builds, CI runners, old Lambda assumptions, and that one admin box nobody has opened since 2021. I started treating this as an infrastructure migration, not an operating system upgrade. That framing matters. If the plan is to SSH into machines and upgrade them in place, the day is already heading in the wrong direction. The cleaner path is inventory, rebuild, roll, observe, then delete the old capacity. ...

Ingress NGINX Is Officially Dead. Here's How I Migrated Off It in a Weekend

I woke up on March 25th to a Slack message from our security team: “ingress-nginx is EOL as of yesterday. Timeline for migration?” I had been ignoring this for months. The retirement was announced back in November 2025, but it felt distant. Now it was real. No more CVE patches. No more bug fixes. The clock was ticking. What Actually Happened On March 24, 2026, Kubernetes SIG Network and the Security Response Committee officially retired ingress-nginx. The project is done. Container images and Helm charts will stay available (they’re not deleting anything), but there will be no new releases. If a critical vulnerability drops tomorrow, you’re on your own. ...

The Axios NPM Compromise Just Hit, Here Is How I Locked Down Our Pipelines in 3 Hours

I woke up this morning to a Slack message from our security lead: “axios got owned on npm.” I thought it was a joke. Axios has 60 million weekly downloads. It is one of those packages you just assume is safe because everyone uses it. It was not a joke. What Actually Happened Two malicious versions hit npm overnight: [email protected] and [email protected]. The attacker compromised a lead maintainer’s npm credentials, changed the account email to a ProtonMail address, and published manually via the npm CLI. No pull request. No CI run. No code review. Just a npm publish from a stolen account. ...

I Replaced 12 Dev Clusters with vCluster and My AWS Bill Dropped 60%

Every team wanted their own cluster. QA had one, staging had one, each developer wanted one for feature branches. We ended up with 12 EKS clusters, most of them sitting at 15% utilization, all of them costing real money. I kept hearing about vCluster from Loft Labs and finally gave it a shot three months ago. The pitch sounded too good: full Kubernetes clusters running inside a single host cluster, each with its own API server, its own resources, complete isolation. No extra nodes, no extra control planes to manage. ...

Your GitHub Actions Are a Supply Chain Attack Surface and You Probably Haven't Noticed

Last week I spent a full Saturday auditing every GitHub Actions workflow across our repos. Not because I wanted to, but because the Trivy supply chain attack made me realize how thin the ice was under my feet. If you missed it: someone managed to sneak a malicious commit into the actions/checkout action by exploiting GitHub’s fork commit reachability. They swapped a SHA pin in Trivy’s release workflow to point at an orphaned commit in a fork. The commit looked legit, the comment said # v6.0.2, the author was spoofed to look like a real maintainer. The actual payload downloaded Go files from a typosquatted domain and replaced Trivy’s source code during the build. ...

I Started Verifying Every Container Image in My Clusters and Here Is What Broke

Last week I noticed that the Kubernetes project had quietly rewritten its image promoter, the tool that pushes official images to registry.k8s.io. The interesting part was not the rewrite itself. It was the fact that the new version now ships proper SLSA provenance attestations and cosign signatures across the mirrors. That was the moment I had to admit something slightly embarrassing: I had been signing my own images in CI for a while, but I was not actually enforcing verification anywhere in the cluster. The signatures existed, but nothing was checking them. So I finally sat down and fixed it. ...

Crossplane Compositions: Self-Service Infrastructure That Developers Actually Use

I spent two years being the guy who provisions databases. Every Monday morning, same Slack message: “Hey, can I get a Postgres instance for the new service?” I’d open Terraform, copy a module block, change three variables, run the plan, wait for approval, apply. Twenty minutes of my life, gone. Multiply that by four teams and it adds up fast. Then I set up Crossplane with Compositions, and now developers do it themselves with a single YAML file. Here’s how I got there and what broke along the way. ...

Securing Production Debugging in Kubernetes Without Losing Your Sanity

Last week I got paged at 2 AM for a payment service that was dropping requests. My first instinct was the same as always: grab the cluster-admin kubeconfig from the shared wiki page and start poking around. I caught the bug in ten minutes, but the next morning our security team flagged my session in the audit logs. Fair enough. That cluster-admin kubeconfig had been “temporary” for about eight months. ...

Debugging etcd in Production Kubernetes: What I Wish I Knew Earlier

Last month I got paged at 2 AM because the API server in a production cluster started timing out. Pods stopped scheduling, kubectl just hung, and the on-call Slack channel had already turned into chaos. About thirty minutes later, I traced it back to etcd. Again. etcd sits in the middle of every Kubernetes cluster, so when it starts having a bad day, the whole cluster feels it. The tricky part is that etcd failures rarely announce themselves clearly. You usually do not get a clean “etcd is broken” signal. You get fuzzy symptoms instead: slow API calls, delayed scheduling, weird timeouts. After dealing with enough of these incidents, I ended up with a playbook of checks that I run almost automatically now. Lately, a tool called etcd-diagnosis has made that process much easier. ...