Debugging etcd in Production Kubernetes: What I Wish I Knew Earlier
Last month I got paged at 2 AM because the API server in a production cluster started timing out. Pods stopped scheduling, kubectl just hung, and the on-call Slack channel had already turned into chaos. About thirty minutes later, I traced it back to etcd. Again.

etcd sits in the middle of every Kubernetes cluster, so when it starts having a bad day, the whole cluster feels it. The tricky part is that etcd failures rarely announce themselves clearly. You usually do not get a clean “etcd is broken” signal. You get fuzzy symptoms instead: slow API calls, delayed scheduling, weird timeouts.

After dealing with enough of these incidents, I ended up with a playbook of checks that I run almost automatically now. Lately, a tool called etcd-diagnosis has made that process much easier. ...