Last month I got paged at 2 AM because the API server in a production cluster started timing out. Pods stopped scheduling, kubectl just hung, and the on-call Slack channel had already turned into chaos. About thirty minutes later, I traced it back to etcd. Again.

etcd sits in the middle of every Kubernetes cluster, so when it starts having a bad day, the whole cluster feels it. The tricky part is that etcd failures rarely announce themselves clearly. You usually do not get a clean “etcd is broken” signal. You get fuzzy symptoms instead: slow API calls, delayed scheduling, weird timeouts. After dealing with enough of these incidents, I ended up with a playbook of checks that I run almost automatically now. Lately, a tool called etcd-diagnosis has made that process much easier.

The First Five Minutes

When a cluster starts behaving like etcd might be involved, I start with the basics. These three commands tell me whether the cluster is still fundamentally healthy:

# Check if all members are healthy
etcdctl endpoint health --cluster -w table

# See member status, leader, raft index
etcdctl endpoint status --cluster -w table

# List members and their peer URLs
etcdctl member list -w table

I am looking for a few simple signals: are all members healthy, is there a leader, and are the raft indexes moving forward or stuck? If endpoint health hangs or throws errors for one member, that is usually where I start digging.
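When I want a faster read on the leader question, I pipe the status JSON through jq instead of squinting at the table. A small helper I keep around; the jq path matches etcd 3.5's endpoint status JSON, so treat it as an assumption if you run something older:

```shell
# Print the set of leader IDs the members report. Exactly one line of
# output means every member agrees on the same leader; zero or more than
# one means an election in progress or a split view.
leader_ids() {
  # expects `etcdctl endpoint status --cluster -w json` on stdin
  jq -r '.[].Status.leader' | sort -u
}

# Usage:
#   etcdctl endpoint status --cluster -w json | leader_ids
```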

One thing that tripped me up early on is that in managed environments, or in some distro-specific setups, etcdctl is not always available on the host. In that case, I just exec into the etcd pod:

kubectl exec -n kube-system etcd-controlplane-0 -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
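Retyping the three cert flags on every command gets old fast. etcdctl also reads them from environment variables, so I usually export them once per session (the paths below are the standard kubeadm locations; adjust for your distro):

```shell
# etcdctl picks these up automatically, so later commands can drop the
# --cacert/--cert/--key flags. Paths assume a kubeadm-style layout.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Now the short form just works:
#   etcdctl endpoint health --cluster
```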

The Two Killers: Disk and Space

In my experience, about 80% of etcd incidents fall into two buckets.

Slow Disks

etcd is extremely sensitive to disk latency. Every transaction goes into the WAL (write-ahead log) and is fsynced before it is acknowledged. If p99 fsync latency stays above 10ms for any sustained stretch, things usually start going sideways.

I check this with:

# From etcd metrics (if you have Prometheus)
etcd_disk_wal_fsync_duration_seconds_bucket

# Or a quick fio test on the etcd data directory
fio --name=etcd-test --filename=/var/lib/etcd/test \
  --rw=write --ioengine=sync --fdatasync=1 \
  --size=22m --bs=2300 --runtime=30

The fio test gets close enough to etcd’s write pattern to be useful. If p99 latency is above 10ms, the storage is too slow. I have seen this happen when etcd was placed on a shared EBS volume that was also serving a database. That is a bad trade. Give etcd its own fast SSD, ideally with provisioned IOPS in cloud environments.
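To pull the p99 number out without eyeballing fio's text output, I ask fio for JSON and extract it with jq. This is a sketch that assumes fio 3.x's JSON layout, where fdatasync latencies land under the job's sync section:

```shell
# Convert fio's p99 fdatasync latency from nanoseconds to milliseconds.
# The jq path is an assumption based on fio 3.x JSON output with --fdatasync=1.
p99_fsync_ms() {
  jq '.jobs[0].sync.lat_ns.percentile."99.000000" / 1000000'
}

# Usage:
#   fio --name=etcd-test --filename=/var/lib/etcd/test \
#     --rw=write --ioengine=sync --fdatasync=1 \
#     --size=22m --bs=2300 --runtime=30 --output-format=json | p99_fsync_ms
```

Anything that prints above 10 here means the volume is not fast enough for etcd.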

Database Space Exceeded

Then there is the dreaded mvcc: database space exceeded error. etcd ships with a default 2 GiB storage quota. Once you hit it, writes stop, and the cluster is effectively frozen.

The usual fix is compaction followed by defragmentation:

# Get the current revision
rev=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')

# Compact everything before current revision
etcdctl compact $rev

# Defragment all members (one at a time!)
etcdctl defrag --endpoints=https://etcd-0:2379
etcdctl defrag --endpoints=https://etcd-1:2379
etcdctl defrag --endpoints=https://etcd-2:2379

# Check whether the NOSPACE alarm fired
etcdctl alarm list

# Clear the alarm after defrag freed space
etcdctl alarm disarm

The order matters. Compact first, then defrag. Also, never defrag every member at the same time. Defrag blocks a member briefly, and that is not when you want to find out how much quorum matters.
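If the database legitimately needs more than the default 2 GiB, the quota itself can also be raised with etcd's --quota-backend-bytes flag. In a kubeadm setup that means editing the static pod manifest; upstream guidance is to stay at or below 8 GiB:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (kubeadm static pod), under the
# container's command list. 8589934592 bytes = 8 GiB, the commonly cited
# upper bound; larger databases make defrag, snapshots, and restarts slow.
    - --quota-backend-bytes=8589934592
```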

One lesson I learned the hard way: even with auto-compaction enabled, which you absolutely should have, you can still run into the space limit if defragmentation is not happening. Compaction only marks old space as reclaimable. Defrag is what actually gives it back. These days I run defrag on a schedule with a CronJob.
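To see how much space a defrag would actually hand back, I compare dbSize against dbSizeInUse, both of which etcd 3.4+ reports in the status JSON:

```shell
# Percent of the on-disk database that compaction has freed logically
# but defrag has not yet returned to the filesystem.
frag_pct() {
  # expects `etcdctl endpoint status -w json` on stdin
  jq -r '.[0].Status | ((.dbSize - .dbSizeInUse) * 100 / .dbSize) | floor'
}

# Usage:
#   etcdctl endpoint status -w json | frag_pct
```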

The etcd-diagnosis Tool

I started using etcd-diagnosis recently, and honestly, it would have saved me hours in older incidents. Instead of checking every signal by hand, I can run this:

etcd-diagnosis report --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

It generates one report that covers:

  • Cluster health and membership
  • Disk I/O latency (WAL fsync)
  • Network latency between members
  • Resource pressure (memory, disk usage)
  • Key etcd metrics

The best part is that the output is something you can hand to upstream maintainers or a vendor without spending half the incident in a loop of “can you also send us this metric?” It is already packaged as a useful diagnostic artifact.

Preventing Future Incidents

After getting burned enough times, this is what I now put on every cluster:

Monitoring alerts:

# Alert when WAL fsync latency is high
- alert: EtcdHighFsyncDuration
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 5m
  labels:
    severity: warning

# Alert before space runs out
- alert: EtcdDatabaseSizeNearQuota
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
  for: 10m
  labels:
    severity: warning

Scheduled defragmentation:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-defrag
  namespace: kube-system
spec:
  schedule: "0 3 * * 0"  # Weekly, Sunday 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure  # required for Job pod templates
          containers:
          - name: defrag
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              for ep in https://etcd-0:2379 https://etcd-1:2379 https://etcd-2:2379; do
                echo "Defragmenting $ep"
                etcdctl defrag --endpoints=$ep \
                  --cacert=/certs/ca.crt \
                  --cert=/certs/client.crt \
                  --key=/certs/client.key
                sleep 30
              done
            volumeMounts:
            - name: etcd-certs
              mountPath: /certs
              readOnly: true
          volumes:
          - name: etcd-certs
            secret:
              secretName: etcd-client-certs  # adjust to wherever your client certs live

Dedicated storage: etcd gets its own volume. No sharing.

Lessons Learned

  1. Most etcd problems are storage problems. Check disk performance first.
  2. apply request took too long almost always means fsync latency.
  3. Auto-compaction doesn’t mean auto-defrag. You need both.
  4. Keep the etcd database small. If it grows past 1 GiB, something is probably wrong, often stale resources or a controller loop creating objects.
  5. Three members is the sweet spot. Five gives you more fault tolerance but slower writes. One is asking for trouble.
  6. Back up etcd regularly: etcdctl snapshot save backup.db. Test your restores.
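
On that last point, the snapshot itself is one command, and since etcd 3.5 the offline inspection commands moved from etcdctl to etcdutl, so I verify the file with that. A minimal sketch, with a dated-path helper so backups do not overwrite each other:

```shell
# Build a dated backup path, one snapshot per day.
backup_path() {
  echo "/backup/etcd-$(date +%F).db"
}

# Usage (on a control-plane node with the etcdctl cert flags or env set):
#   etcdctl snapshot save "$(backup_path)"
#   etcdutl snapshot status "$(backup_path)" -w table   # sanity-check the file
```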

etcd is one of those components that quietly does its job until the day it really does not. When that day comes, having a short list of commands ready and knowing what signals matter can be the difference between a fifteen minute fix and a three hour scramble.