Securing Production Debugging in Kubernetes Without Losing Your Sanity

Last week I got paged at 2 AM for a payment service that was dropping requests. My first instinct was the same as always: grab the cluster-admin kubeconfig from the shared wiki page and start poking around. I caught the bug in ten minutes, but the next morning our security team flagged my session in the audit logs. Fair enough. That cluster-admin kubeconfig had been “temporary” for about eight months. ...

March 19, 2026

I Migrated 47 Terraform Modules to OpenTofu and Here's What Broke

Last month I finally pulled the trigger. After months of watching the OpenTofu project mature and HashiCorp’s licensing situation settle into something I wasn’t comfortable with for client work, I migrated 47 Terraform modules across three production environments to OpenTofu. It took about two weeks of actual work spread over a month, and most of it was smooth. Most.

Why I Switched

The BSL license change was the catalyst, but not the only reason. A few of my clients started asking uncomfortable questions about their Terraform Enterprise contracts. One of them got a letter from HashiCorp’s sales team that made the cost trajectory pretty clear. OpenTofu had reached a point where the risk of staying felt bigger than the risk of moving. ...

March 15, 2026

AWS S3 Bucketsquatting Is Dead: Account Regional Namespaces Are Here

I have deleted an S3 bucket exactly once and regretted it immediately. Back in 2022, I tore down a staging environment, and within a few hours someone else had claimed the same bucket name. A CloudFormation stack in another account kept happily writing logs to a bucket I no longer controlled. Not my favorite Friday. AWS has finally shipped a real fix: account regional namespaces for S3 general purpose buckets. It took about seven years, which feels both absurd and very on-brand. ...

March 13, 2026

Registry Mirror Authentication in Kubernetes Without Breaking Tenant Isolation

I spent most of last week chasing image pull failures in a multi-tenant cluster. It turned out the problem was our private registry mirror. We were using it as a pull-through cache, but the credentials lived on the nodes. One team rotated their credentials and, a few minutes later, pods in three other namespaces started failing too. That was the moment it became obvious we had a shared-credentials problem. That sent me down the rabbit hole of CRI-O’s credential provider for registry mirrors. After setting it up, I do not really want to go back. ...
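The per-tenant idea behind that credential provider can be sketched without any CRI-O specifics. Here is a minimal Python version of a kubelet-style image credential provider — the mirror hostname, team prefixes, and secrets are all invented, and the JSON shapes follow my reading of the kubelet credential provider API rather than anything CRI-O-specific:

```python
import json

# Hypothetical per-tenant credentials; a real provider would look these
# up in a secrets backend instead of hardcoding them.
TENANT_CREDS = {
    "mirror.internal/team-a/": ("team-a-bot", "s3cret-a"),
    "mirror.internal/team-b/": ("team-b-bot", "s3cret-b"),
}

def resolve(request_json: str) -> str:
    """Answer one CredentialProviderRequest with per-tenant credentials.

    Only an image whose prefix matches a tenant gets credentials;
    everything else falls back to anonymous pulls, so one tenant
    rotating a secret cannot break another tenant's pods.
    """
    image = json.loads(request_json)["image"]
    auth = {}
    for prefix, (user, password) in TENANT_CREDS.items():
        if image.startswith(prefix):
            auth[image] = {"username": user, "password": password}
            break
    return json.dumps({
        "kind": "CredentialProviderResponse",
        "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
        "cacheKeyType": "Image",
        "auth": auth,
    })
```

A real provider is an executable the node calls per pull, reading the request on stdin and writing the response to stdout; the point here is only the isolation boundary, not the wire protocol.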

March 11, 2026

Keycloak on Kubernetes: SSO for Your Internal Tools Without Losing Your Mind

I got tired of managing separate logins for Grafana, ArgoCD, Harbor, and every other internal tool we run. Every new team member meant creating five accounts. Every offboarding meant hoping I remembered to revoke all of them. So I finally sat down and deployed Keycloak on our Kubernetes cluster. This is what actually happened, not the sanitized version.

Why Keycloak

I looked at Dex, Authelia, and Keycloak. Dex is lightweight but limited if you need more than OIDC proxying. Authelia is great for simple setups but felt thin for our use case. Keycloak is heavier, but it handles OIDC, SAML, user federation, and has a proper admin UI. For a team running 8+ internal services, the weight is justified. ...

March 9, 2026

Kubernetes Gateway API: I Finally Replaced All My Ingress Resources

I kept postponing this migration for way too long. Every time Gateway API came up, I had the same answer: “yeah, I know, I should do it.” Then last week I finally stopped talking about it and migrated three production clusters from Ingress to Gateway API. After doing it end to end, I wish I had moved sooner.

Why I Finally Did It

The trigger was a multi-tenant cluster where two teams shared the same domain but needed different TLS behavior. ...

March 7, 2026

OpenTelemetry Auto-Instrumentation on Kubernetes: Zero-Code Observability That Actually Works

Last week I inherited a cluster with around 40 microservices. Observability was close to nonexistent: basic Prometheus metrics, plus a few random log lines. The team wanted distributed tracing “by next sprint.” There was no realistic way to touch app code across a dozen repos in two weeks. So I chose OpenTelemetry Operator auto-instrumentation. This is what happened in practice.

The Setup

We run Kubernetes 1.31 on EKS. The goal was simple: get traces and metrics from every service into Grafana Tempo and Mimir without changing application code. ...
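The zero-code part is less magic than it sounds: an injected agent wraps common library entry points at startup, so callers get traced without any application change. A toy Python sketch of that wrapping idea — the class, method, and span fields are all invented, and the in-memory span list stands in for a real exporter:

```python
import functools
import time

SPANS = []  # collected spans; a real agent would export these to a backend

def auto_instrument(cls, method_name):
    """Replace a method in place so every call is timed and recorded.

    The application code is untouched; only the attribute on the class
    is swapped for a wrapper, which is the essence of zero-code
    instrumentation.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def traced(*args, **kwargs):
        start = time.monotonic()
        try:
            return original(*args, **kwargs)
        finally:
            SPANS.append({"name": method_name,
                          "duration_s": time.monotonic() - start})

    setattr(cls, method_name, traced)
```

The real agents do this per supported library (HTTP clients, database drivers, frameworks) and propagate trace context between services, but the mechanism is the same swap-the-callable trick.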

February 27, 2026

Kubernetes 1.35: The Release That Finally Gets AI Workloads Right

I’ve been running mixed clusters with ML training jobs and regular services for about two years. Scheduling has been the biggest headache. A distributed training run would get only some pods placed, GPUs would sit there doing nothing, and everyone would lose time. Kubernetes 1.35 came out last week, so I spent the weekend testing it on our staging cluster. A few of these changes are genuinely useful.

Gang Scheduling Finally Exists

The biggest addition is workload-aware scheduling with gang scheduling support. It’s still alpha, so I would not put it in production yet, but the model is exactly what we needed: a group of pods either gets scheduled together, or not at all. ...
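That all-or-nothing property is the whole point, and it is easy to state outside any scheduler API. A toy Python sketch of the invariant — node names, GPU counts, and the most-free placement heuristic are all made up, nothing here is the actual 1.35 scheduler:

```python
def gang_schedule(gang_size, free_gpus):
    """Place a gang of identical pods, or refuse the whole group.

    Toy model: each pod needs one GPU; each pod goes to the node with
    the most free GPUs. Returns one node per pod, or None if the full
    gang does not fit — never a partial placement that strands GPUs.
    """
    free = dict(free_gpus)
    placement = []
    for _ in range(gang_size):
        node = max(free, key=free.get)
        if free[node] == 0:
            return None  # all-or-nothing: refuse rather than place some pods
        placement.append(node)
        free[node] -= 1
    return placement
```

With gang_size 5 against two nodes with two GPUs each, this returns None instead of placing four pods and leaving the job deadlocked, which is exactly the failure mode the excerpt describes.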

February 25, 2026

Cluster API v1.12: In-Place Updates Changed How I Think About Node Lifecycle

I run a few Kubernetes clusters on bare metal with Cluster API and the BYOH (Bring Your Own Host) provider. Until now, every upgrade followed the same pattern: drain nodes, delete machines, rebuild everything, then wait. Reliable, yes. Fast, not even close, especially once you are past 40 nodes with little spare capacity. Cluster API v1.12 shipped a few weeks ago, and the headline for me was in-place updates. Instead of always doing the immutable delete-and-recreate path, CAPI can now apply some changes directly on existing machines. I spent last week testing this on our staging cluster, and the result was better than I expected. ...
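Which changes qualify for in-place treatment is the interesting part. Here is a hypothetical classifier in Python — the field names and the in-place allowlist are my guesses for illustration, not CAPI v1.12's actual policy — showing the shape of the decision: diff the machine spec and fall back to recreate as soon as any immutable field changed:

```python
# Fields assumed safe to change on a live machine vs. fields that force
# a rebuild; this split is illustrative, not CAPI's real rule set.
IN_PLACE_FIELDS = {"kubeletVersion", "labels", "kernelArgs"}

def update_strategy(current_spec, desired_spec):
    """Return "in-place", "recreate", or "none" for a machine spec diff."""
    changed = {
        field
        for field in set(current_spec) | set(desired_spec)
        if current_spec.get(field) != desired_spec.get(field)
    }
    if not changed:
        return "none"
    # A single immutable field in the diff forces the old
    # drain-delete-rebuild path for the whole machine.
    if changed - IN_PLACE_FIELDS:
        return "recreate"
    return "in-place"
```

The practical consequence is the one the excerpt hints at: a kubelet bump can skip the drain entirely, while an OS image change still pays the full rebuild cost.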

February 23, 2026

Why I Turned Off Dependabot and What I Use Instead

Last Tuesday, one of my Go services got 14 Dependabot PRs in a single day. All of them came from one CVE, and none of them affected the way our code actually runs in production. We still had to read the alerts, review the PRs, wait for CI, and merge changes. That was the moment I decided to stop using Dependabot for this workflow.

What finally broke it for me

The issue was CVE-2026-26958 in filippo.io/edwards25519. ...

February 21, 2026