Last week I got paged at 2 AM for a payment service that was dropping requests. My first instinct was the same as always: grab the cluster-admin kubeconfig from the shared wiki page and start poking around. I caught the bug in ten minutes, but the next morning our security team flagged my session in the audit logs. Fair enough. That cluster-admin kubeconfig had been “temporary” for about eight months.

So I finally sat down and built a proper debugging workflow. One that gives on-call engineers exactly the access they need, for exactly the time they need it, and nothing more.

The Problem With “Just Use cluster-admin”

Every team I have worked with has the same story. Someone creates a high-privilege kubeconfig “just for emergencies.” Then it ends up in a password manager, shared across the team, never rotated. The audit logs show a generic service account doing things, and nobody knows who actually ran that kubectl exec at 3 AM.

The real cost is not the security risk alone. It is that when something goes wrong, you cannot reconstruct what happened. Shared credentials kill your audit trail.

Step 1: A Namespaced Debug Role

Instead of cluster-admin, I created a Role that covers what on-call actually needs:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: oncall-debug
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]

That last rule is for kubectl debug, which I will get to in a minute. The key thing is binding this to a group, not individual users:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: oncall-debug
  namespace: payments
subjects:
  - kind: Group
    name: oncall-payments
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: oncall-debug
  apiGroup: rbac.authorization.k8s.io

Your identity provider handles who is in oncall-payments. When the on-call rotation changes, nobody touches Kubernetes RBAC. The group membership updates automatically.
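A quick way to verify the binding before the next incident is impersonation (you need impersonate rights yourself; robert is a placeholder username):

```shell
# Should print "yes": group members can read pods in the namespace
kubectl auth can-i list pods -n payments \
  --as=robert --as-group=oncall-payments

# Should print "no": the role deliberately grants no delete verbs
kubectl auth can-i delete deployments -n payments \
  --as=robert --as-group=oncall-payments
```

This catches the classic mistake of binding the role to the wrong group name before anyone is paged.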

Step 2: Short-Lived Credentials

The biggest win was moving to credentials that expire. We use OIDC with our identity provider, so the kubeconfig just calls a credential helper:

users:
- name: oncall
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      command: cred-helper
      args: ["--cluster=prod", "--ttl=30m"]
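The contract between kubectl and that helper is worth knowing when you debug the debugger: the helper prints an ExecCredential object to stdout, and kubectl caches the token until the expiration passes, then re-runs the helper. A sketch of the expected output (the token value is a placeholder):

```shell
# What a credential helper must print to stdout: an ExecCredential object.
# kubectl caches the token and re-invokes the helper after expirationTimestamp.
cred_json='{
  "apiVersion": "client.authentication.k8s.io/v1",
  "kind": "ExecCredential",
  "status": {
    "token": "<OIDC ID token from your provider>",
    "expirationTimestamp": "2026-03-19T02:30:00Z"
  }
}'
echo "$cred_json"
```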

Every 30 minutes, the token expires. No more stale kubeconfigs floating around in wiki pages. If you do not have OIDC set up, you can use short-lived client certificates instead:

# Generate a key locally
openssl genpkey -algorithm Ed25519 -out oncall.key

# Create a CSR with your identity and team group
openssl req -new -key oncall.key -out oncall.csr \
  -subj "/CN=robert/O=oncall-payments"

Then submit a CertificateSigningRequest with a 30-minute TTL:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: oncall-robert-20260319
spec:
  request: <base64-encoded oncall.csr>
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 1800
  usages:
    - client auth

After approval, you get a certificate that is valid for exactly 30 minutes. (The API enforces a floor of 600 seconds on expirationSeconds, so you cannot go much shorter than this.) The /O=oncall-payments in the subject maps to the Kubernetes group, so RBAC kicks in automatically.
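The approval round-trip looks like this with plain kubectl (assuming an approver with rights on the CSR API; the CSR name matches the manifest above):

```shell
# Fill spec.request with the base64-encoded CSR (single line, GNU coreutils)
base64 -w0 oncall.csr

# An approver (ideally not the requester) signs off
kubectl certificate approve oncall-robert-20260319

# Pull the issued certificate and wire it into a kubeconfig
kubectl get csr oncall-robert-20260319 \
  -o jsonpath='{.status.certificate}' | base64 -d > oncall.crt
kubectl config set-credentials oncall \
  --client-certificate=oncall.crt \
  --client-key=oncall.key \
  --embed-certs=true
```

Keeping approval separate from requesting is the point: two people, two audit entries.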

Step 3: Ephemeral Containers Instead of SSH

The old way of debugging a misbehaving pod was to kubectl exec into it and install tools on the fly. That works until you realize your distroless images do not have curl, tcpdump, or even a shell.

Ephemeral containers solve this properly:

kubectl debug -it payment-api-7d4f8b-x2k9n \
  --image=nicolaka/netshoot \
  --target=payment-api \
  -n payments

This attaches a debug container to the running pod without restarting it. The netshoot image has all the networking tools you could want. The --target flag shares the process namespace with the app container, so you can see its processes and network connections.
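Once you are inside the netshoot shell, the shared process namespace is what makes it useful. A few commands I reach for (the port here is a placeholder for whatever your app listens on):

```shell
# The app container's processes are visible because of --target
ps aux

# Listening sockets, with the owning process names
ss -tlnp

# A bounded capture; -c 100 keeps it from running away (see the caveats below)
tcpdump -i eth0 -c 100 port 8443
```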

A few things I learned the hard way:

  • The ephemeral container stays after you detach, and there is no way to remove it: once added, it lives in the pod spec until the pod itself is deleted. I add a label and run a CronJob that garbage-collects pods with stale debug containers.
  • Resource limits matter. Ephemeral containers are not allowed to declare their own requests or limits; they share whatever the pod already has. If your pod is close to its memory limit, running tcpdump in a debug container can push it over, so keep captures short and stream them off the pod.
  • Not all runtimes support --target. It asks the runtime to join the target container's process namespace, and when that silently fails you will not see the app's processes from the debug container. Setting shareProcessNamespace: true on the pod is the heavier-weight fallback.
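For the garbage collection in the first bullet, the detection half is a one-line jq filter: any pod that has been debugged carries spec.ephemeralContainers. A sketch with sample input inlined (in the real CronJob the JSON comes from kubectl get pods -n payments -o json, and the resulting names go to kubectl delete pod):

```shell
# Sample of what the API returns: one debugged pod, one clean pod
pods_json='{"items":[
  {"metadata":{"name":"payment-api-7d4f8b-x2k9n"},
   "spec":{"ephemeralContainers":[{"name":"debugger"}]}},
  {"metadata":{"name":"payment-worker-0"},"spec":{}}]}'

# Pods with ephemeral containers are the ones to recycle
echo "$pods_json" | jq -r \
  '.items[] | select(.spec.ephemeralContainers != null) | .metadata.name'
```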

Step 4: Audit Everything

With OIDC or client certificates, every API call is tied to a real identity. But you also want to know what commands were run inside exec sessions. Kubernetes audit logging captures the API calls:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]

This logs the full request and response for exec and port-forward, and metadata for ephemeral container updates. Keep in mind the audit log records the API call that opened the session, not the keystrokes typed inside it; full session recording needs a proxy in front of the API server. Ship these to your SIEM and you have a trail of who opened which debug session, where, and when.
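Once those events land in a file, pulling out exec sessions is a small jq exercise. A sketch against a single sample event (real audit logs emit one JSON event per line, so the same filter works across the whole file):

```shell
# One audit event as the log backend would emit it (trimmed for readability)
event='{"kind":"Event","verb":"create",
  "user":{"username":"robert"},
  "objectRef":{"resource":"pods","subresource":"exec",
    "name":"payment-api-7d4f8b-x2k9n","namespace":"payments"},
  "requestReceivedTimestamp":"2026-03-19T02:14:07Z"}'

# Who opened an exec session, into which pod, and when
echo "$event" | jq -r 'select(.objectRef.subresource == "exec")
  | "\(.requestReceivedTimestamp) \(.user.username) exec -> \(.objectRef.namespace)/\(.objectRef.name)"'
```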

What Changed

After rolling this out, our incident response actually got faster. Sounds counterintuitive, but here is why: engineers stopped second-guessing whether they were “allowed” to debug something. The guardrails are clear, the access is automatic for on-call, and nobody has to hunt for a shared kubeconfig at 2 AM.

The setup took about a day. Most of that was getting our OIDC provider to include group claims correctly (every identity provider has its own quirks there). The RBAC manifests are maybe 30 lines of YAML per namespace.

If you are still using shared cluster-admin credentials for production debugging, this is a good week to stop. The Kubernetes blog just published a detailed guide on this exact topic, which covers even more patterns like access brokers and hardware-backed keys.

Start with the namespaced Role and a RoleBinding to a group. That alone eliminates most of the risk. Add short-lived credentials when you are ready, and ephemeral containers when your team is comfortable with the workflow. You do not have to do everything at once.