Securing Production Debugging in Kubernetes Without Losing Your Sanity

Last week I got paged at 2 AM for a payment service that was dropping requests. My first instinct was the same as always: grab the cluster-admin kubeconfig from the shared wiki page and start poking around. I caught the bug in ten minutes, but the next morning our security team flagged my session in the audit logs. Fair enough. That cluster-admin kubeconfig had been “temporary” for about eight months. ...

March 19, 2026

Debugging etcd in Production Kubernetes: What I Wish I Knew Earlier

Last month I got paged at 2 AM because the API server in a production cluster started timing out. Pods stopped scheduling, kubectl just hung, and the on-call Slack channel had already descended into chaos. About thirty minutes later, I traced it back to etcd. Again. etcd sits at the heart of every Kubernetes cluster, so when it has a bad day, the whole cluster feels it. The tricky part is that etcd failures rarely announce themselves clearly: instead of a clean “etcd is broken” signal, you get fuzzy symptoms like slow API calls, delayed scheduling, and odd timeouts. After enough of these incidents, I ended up with a playbook of checks that I now run almost automatically. Lately, a tool called etcd-diagnosis has made that process much easier. ...

March 17, 2026