Keycloak on Kubernetes: SSO for Your Internal Tools Without Losing Your Mind

I got tired of managing separate logins for Grafana, ArgoCD, Harbor, and every other internal tool we run. Every new team member meant creating five accounts. Every offboarding meant hoping I remembered to revoke all of them. So I finally sat down and deployed Keycloak on our Kubernetes cluster. This is what actually happened, not the sanitized version. Why Keycloak: I looked at Dex, Authelia, and Keycloak. Dex is lightweight but limited if you need more than OIDC proxying. Authelia is great for simple setups but felt thin for our use case. Keycloak is heavier, but it handles OIDC, SAML, user federation, and has a proper admin UI. For a team running 8+ internal services, the weight is justified. ...
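For a concrete sense of what hooking one of these tools up to Keycloak looks like, here is a minimal sketch of pointing Grafana at a Keycloak realm over OIDC through the Grafana Helm chart. The realm, hostnames, and client ID are placeholders, not the setup from the post.

```yaml
# Hypothetical values.yaml fragment for the Grafana Helm chart: point Grafana's
# generic OAuth login at a Keycloak realm. Hostnames, realm name, and client ID
# are placeholders; the client secret would come from a mounted secret.
grafana.ini:
  server:
    root_url: https://grafana.internal.example.com
  auth.generic_oauth:
    enabled: true
    name: Keycloak
    client_id: grafana
    client_secret: $__file{/etc/secrets/oauth/client_secret}   # secret mount not shown
    scopes: openid profile email
    auth_url: https://keycloak.internal.example.com/realms/internal/protocol/openid-connect/auth
    token_url: https://keycloak.internal.example.com/realms/internal/protocol/openid-connect/token
    api_url: https://keycloak.internal.example.com/realms/internal/protocol/openid-connect/userinfo
    # Map a Keycloak group to a Grafana role (requires a groups claim in the token).
    role_attribute_path: contains(groups[*], 'platform-admins') && 'Admin' || 'Viewer'
```

ArgoCD and Harbor follow the same pattern: register an OIDC client in the realm, then point each tool's OIDC settings at the realm's discovery endpoints.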

March 9, 2026

Kubernetes Gateway API: I Finally Replaced All My Ingress Resources

I kept postponing this migration for way too long. Every time Gateway API came up, I had the same answer: “yeah, I know, I should do it.” Then last week I finally stopped talking about it and migrated three production clusters from Ingress to Gateway API. After doing it end to end, I wish I had moved sooner. Why I Finally Did It: The trigger was a multi-tenant cluster where two teams shared the same domain but needed different TLS behavior. ...
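As a sketch of why Gateway API fits this case better than Ingress: listeners (TLS, ports, attachment rules) live on a shared Gateway owned by the platform team, while each tenant attaches HTTPRoutes from its own namespace. A listener needs a unique hostname/port/protocol combination, so this example splits the two teams onto subdomains of the shared domain; every name here is a placeholder, not the clusters from the post.

```yaml
# One shared Gateway, two HTTPS listeners with separate certificates and
# separate route-attachment rules, each restricted to one team's namespaces.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gw
  namespace: infra
spec:
  gatewayClassName: cilium          # whichever Gateway implementation is installed
  listeners:
    - name: team-a
      hostname: team-a.apps.example.com
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: team-a-cert       # team A's own certificate secret
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              team: a
    - name: team-b
      hostname: team-b.apps.example.com
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: team-b-cert
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              team: b
---
# A team attaches its own route to its listener via sectionName.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: team-a-apps
spec:
  parentRefs:
    - name: shared-gw
      namespace: infra
      sectionName: team-a
  hostnames:
    - team-a.apps.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout
          port: 8080
```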

March 7, 2026

OpenTelemetry Auto-Instrumentation on Kubernetes: Zero-Code Observability That Actually Works

Last week I inherited a cluster with around 40 microservices. Observability was close to nonexistent: basic Prometheus metrics, plus a few random log lines. The team wanted distributed tracing “by next sprint.” There was no realistic way to touch app code across a dozen repos in two weeks. So I chose OpenTelemetry Operator auto-instrumentation. This is what happened in practice. The Setup: We run Kubernetes 1.31 on EKS. The goal was simple: get traces and metrics from every service into Grafana Tempo and Mimir without changing application code. ...
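A minimal sketch of the two pieces the operator needs for zero-code instrumentation, assuming an OTLP collector sits in front of Tempo and Mimir; the endpoint, sampler, and names are placeholders rather than the cluster from the post.

```yaml
# An Instrumentation resource tells the operator where telemetry should go and
# how to sample; a pod annotation opts the workload in, and the operator's
# webhook injects the language agent at startup.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: observability
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317   # OTLP gRPC collector
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"               # sample 25% of root traces
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
      annotations:
        # Reference the Instrumentation above (namespace/name form); inject-python,
        # inject-nodejs, etc. exist for other runtimes.
        instrumentation.opentelemetry.io/inject-java: "observability/default"
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2
```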

February 27, 2026

Kubernetes 1.35: The Release That Finally Gets AI Workloads Right

I’ve been running mixed clusters with ML training jobs and regular services for about two years. Scheduling has been the biggest headache. A distributed training run would get only some pods placed, GPUs would sit there doing nothing, and everyone would lose time. Kubernetes 1.35 came out last week, so I spent the weekend testing it on our staging cluster. A few of these changes are genuinely useful. Gang Scheduling Finally Exists: The biggest addition is workload-aware scheduling with gang scheduling support. It’s still alpha, so I would not put it in production yet, but the model is exactly what we needed: a group of pods either gets scheduled together, or not at all. ...
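The excerpt doesn't show the new alpha API's schema, so here is the same all-or-nothing idea expressed with the existing scheduler-plugins coscheduling PodGroup, purely as an illustration of the semantics; the pod-group label key and scheduler name depend on how scheduler-plugins is deployed in a given cluster.

```yaml
# Not the new 1.35 API: this is the scheduler-plugins coscheduling PodGroup,
# which captures the same gang idea — don't bind any replica until minMember
# pods can all be placed.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: trainer
  namespace: ml
spec:
  minMember: 8                    # place all 8 workers together, or none
  scheduleTimeoutSeconds: 600
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0
  namespace: ml
  labels:
    # Label key varies with the scheduler-plugins release; check your version.
    scheduling.x-k8s.io/pod-group: trainer
spec:
  schedulerName: scheduler-plugins-scheduler   # name of the secondary scheduler deployment
  containers:
    - name: worker
      image: registry.example.com/trainer:2.1
      resources:
        limits:
          nvidia.com/gpu: 1
```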

February 25, 2026

Cluster API v1.12: In-Place Updates Changed How I Think About Node Lifecycle

I run a few Kubernetes clusters on bare metal with Cluster API and the BYOH (Bring Your Own Host) provider. Until now, every upgrade followed the same pattern: drain nodes, delete machines, rebuild everything, then wait. Reliable, yes. Fast, not even close, especially once you are past 40 nodes with little spare capacity. Cluster API v1.12 shipped a few weeks ago, and the headline for me was in-place updates. Instead of always doing the immutable delete-and-recreate path, CAPI can now apply some changes directly on existing machines. I spent last week testing this on our staging cluster, and the result was better than I expected. ...

February 23, 2026

Why I Turned Off Dependabot and What I Use Instead

Last Tuesday, one of my Go services got 14 Dependabot PRs in a single day. All of them came from one CVE, and none of them affected the way our code actually runs in production. We still had to read the alerts, review the PRs, wait for CI, and merge changes. That was the moment I decided to stop using Dependabot for this workflow. What finally broke it for me: The issue was CVE-2026-26958 in filippo.io/edwards25519. ...
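The excerpt doesn't say what replaced Dependabot, but the complaint (a CVE in a dependency that production code never actually calls) is exactly what reachability-based scanners address. As one hedged example, not necessarily the post's choice, a scheduled govulncheck run only reports vulnerabilities whose affected functions are reachable from your code:

```yaml
# Assumption: the replacement tool isn't named in the excerpt. govulncheck is
# one reachability-aware option for Go: it walks the call graph and stays quiet
# when the vulnerable function is never reached.
name: vuln-scan
on:
  schedule:
    - cron: "0 6 * * 1-5"   # weekday mornings instead of a PR per dependency bump
  pull_request:
jobs:
  govulncheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.23"
      - name: Install govulncheck
        run: go install golang.org/x/vuln/cmd/govulncheck@latest
      - name: Scan with call-graph reachability
        run: govulncheck ./...
```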

February 21, 2026

How a Missing Close Button Saved My AI Session

Last week I lost a 45-minute Codex session because my thumb grazed the close button on a terminal tab. No warning. No confirmation. Just gone. The session context, chain of thought, and iterative refinements I’d been building all evaporated because of one bad click. If you’ve worked with AI coding agents (Codex, Claude Code, Cursor, whatever), you probably know this feeling. These tools build context over a conversation. Lose that in the middle, and you’re paying for it in both time and mental energy. ...

February 19, 2026

Kyverno 1.17: CEL Policies Hit GA, Time to Migrate

Kyverno 1.17 landed yesterday, and the big news is that CEL policy types are now GA. If you’ve been running Kyverno with JMESPath-based ClusterPolicy resources, the clock is ticking. They’re officially deprecated and scheduled for removal in v1.20 (October 2026). I spent today migrating a production cluster with about 60 policies. Here is what actually happened. Why This Matters: Kyverno has been using JMESPath expressions for years. They work, but they’re Kyverno-specific. CEL (Common Expression Language) is what Kubernetes itself uses for ValidatingAdmissionPolicy since 1.30. By switching to CEL, Kyverno aligns with upstream and gets significantly better evaluation performance. ...
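For a taste of what a CEL rule looks like, here is the upstream Kubernetes form the excerpt points to: a ValidatingAdmissionPolicy requiring a team label on Deployments. Kyverno's CEL policy kinds use the same expression language, but their exact schema isn't shown in the excerpt, so treat this only as an illustration of the CEL side.

```yaml
# Upstream ValidatingAdmissionPolicy (GA in admissionregistration.k8s.io/v1):
# the validation is a single CEL expression evaluated against the admitted object.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
      message: "every Deployment must carry a team label"
---
# A binding decides where the policy applies and what happens on failure.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label
spec:
  policyName: require-team-label
  validationActions: ["Deny"]
```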

February 19, 2026

We Ditched Artifactory and Built a Self-Hosted Artifact Registry Stack

Last month our Artifactory renewal came in at 40% more than last year. No new features we needed, just the usual “enterprise tier” squeeze. Security scanning? Pay more. Replication? Pay more. SSO that isn’t SAML-only? You guessed it. So I spent two weeks building a replacement. Here’s what actually worked, what didn’t, and the gotchas nobody warns you about. What We Were Running: Our Artifactory setup handled Docker images (~800 images, ~12TB total), npm packages (private registry, ~200 internal packages), Helm charts, and generic binary artifacts (build outputs, firmware blobs). The big requirements: vulnerability scanning on push, OIDC SSO, and cross-region replication to a DR site. ...
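The excerpt doesn't name the replacement stack, so treat this as an assumption: Harbor (already listed among the internal tools in the Keycloak post) with its bundled Trivy scanner is one common self-hosted answer for the container image, Helm chart, and scanning pieces. A minimal values sketch for the goharbor/harbor chart, with placeholder hostnames and sizes:

```yaml
# Hypothetical helm values for the goharbor/harbor chart. Trivy handles image
# scanning (scan-on-push is then a per-project toggle in Harbor itself); OIDC
# login and replication rules are configured through Harbor's API/UI, not here.
expose:
  type: ingress
  tls:
    certSource: secret
    secret:
      secretName: harbor-tls
  ingress:
    hosts:
      core: registry.internal.example.com
externalURL: https://registry.internal.example.com
trivy:
  enabled: true                   # deploy the bundled vulnerability scanner
persistence:
  persistentVolumeClaim:
    registry:
      size: 500Gi                 # image layer storage; size to your real footprint
```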

February 17, 2026

Reclaiming Idle GPUs in Kubernetes Before They Burn Your Budget

Last month I finally looked at our GPU utilization dashboards properly. What I saw made me physically uncomfortable: 14 A100 GPUs across our cluster, average utilization hovering around 15%. We were paying for dedicated hardware that spent most of its time doing absolutely nothing. This is embarrassingly common. Teams request a full GPU for a workload that uses it for training bursts of 20 minutes, then idles for hours. Kubernetes treats GPUs as integer resources — you either have one or you don’t. There’s no native way to share. ...
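The usual workaround for the integer-GPU problem is to make one physical GPU look like several schedulable ones. Whether the post lands on time-slicing or MIG isn't visible in the excerpt; as an illustration, here is the NVIDIA device plugin's time-slicing config with a placeholder replica count.

```yaml
# ConfigMap consumed by the NVIDIA device plugin (or referenced from the GPU
# operator's devicePlugin.config): each physical GPU is advertised as several
# allocatable nvidia.com/gpu replicas, so small bursty workloads can share it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4           # each A100 shows up as 4 schedulable GPUs
```

Time-slicing gives no memory isolation between the sharers, which is why MIG (hardware partitioning) is the other half of this conversation.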

February 15, 2026