Last week I noticed that the Kubernetes project had quietly rewritten its image promoter, the tool that pushes official images to registry.k8s.io. The interesting part was not the rewrite itself. It was the fact that the new version now ships proper SLSA provenance attestations and cosign signatures across the mirrors.

That was the moment I had to admit something slightly embarrassing: I had been signing my own images in CI for a while, but I was not actually enforcing verification anywhere in the cluster. The signatures existed, but nothing was checking them. So I finally sat down and fixed it.

The starting point

I had cosign in my GitHub Actions pipelines for about a year. Every image got signed during the build:

cosign sign --yes \
  --oidc-issuer https://token.actions.githubusercontent.com \
  ghcr.io/myorg/myapp:${GITHUB_SHA}

I liked the setup because it was simple. Keyless signing through Sigstore’s Fulcio CA meant no long-lived keys, no secrets to rotate, and no extra key-management ceremony. The signature just lived next to the image in the registry.

The problem was that this only felt secure. In practice, my clusters would still pull unsigned images without complaining. I had done the easy half, signing, and skipped the half that actually enforces anything.

Setting up policy enforcement with Kyverno

I went with Kyverno because I already had it running for other admission policies. I could have used the Sigstore policy-controller too, but I did not feel like adding another webhook unless I had a good reason.

This was the first policy I put in place to verify cosign signatures:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: verify-ghcr-images
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "ghcr.io/myorg/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/myorg/*"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev

I applied it, felt smug for about half a minute, and then watched a good chunk of staging stop scheduling.

What broke immediately

Init containers. This was the first thing I forgot about. Pods do not just pull the main container image. They also pull init container images, and in my case one of the debug sidecars was still unsigned. Every pod depending on it got rejected. The fix was straightforward: sign that image too, or explicitly exclude it.
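If you take the exclusion route, Kyverno's verifyImages rules accept a skipImageReferences list alongside imageReferences. The sidecar name below is a placeholder for whatever unsigned helper you need to carve out, not the actual image from my cluster:

```yaml
verifyImages:
  - imageReferences:
      - "ghcr.io/myorg/*"
    # Hypothetical unsigned helper image, excluded until it gets signed too.
    # Everything else under ghcr.io/myorg/* is still verified.
    skipImageReferences:
      - "ghcr.io/myorg/debug-sidecar*"
    attestors:
      # ... same keyless attestor as in the policy above
```

Signing the image is still the better fix; the exclusion just keeps pods scheduling while the sidecar's build pipeline catches up.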

Helm chart images. Third party charts were the next problem. If a chart pulls something like Bitnami Redis, I do not control that image and I cannot retroactively sign it. That meant I had to treat those images differently:

verifyImages:
  - imageReferences:
      - "ghcr.io/myorg/*"
    attestors:
      # ...
  - imageReferences:
      - "docker.io/bitnami/*"
      - "registry.k8s.io/*"
    mutateDigest: true
    verifyDigest: true
    required: false

For third party images, required: false turned out to be a reasonable compromise. Kyverno still rewrites tags to digests, which protects against tag mutation, but it does not insist on a signature I have no way to provide. Not ideal, but realistic.

Cached images. This one was more annoying. If a node already had the image cached and the workload used imagePullPolicy: IfNotPresent, there was nothing to verify because the kubelet never pulled the image again. Kyverno validates admission requests, not runtime pulls. For the workloads I actually cared about, I switched to imagePullPolicy: Always so the policy would matter consistently.
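For reference, the change is one line per container spec. A minimal sketch, assuming a standard Deployment (names and tag are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp              # hypothetical workload name
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/myorg/myapp:1.4.2   # hypothetical tag; Kyverno rewrites it to a digest at admission
          # Force the kubelet to pull on every pod start, so admission-time
          # verification is not quietly bypassed by the node's image cache
          imagePullPolicy: Always
```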

Adding SLSA provenance verification

Signatures tell me who produced an image. Provenance tells me how it was produced. Once I saw Kubernetes publishing SLSA attestations for its own images, I wanted the same guarantee for mine.

In GitHub Actions I added the SLSA generator:

- uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0
  with:
    image: ghcr.io/myorg/myapp
    digest: ${{ steps.build.outputs.digest }}

Then I extended the Kyverno policy so it would check provenance too:

verifyImages:
  - imageReferences:
      - "ghcr.io/myorg/*"
    attestors:
      - entries:
          - keyless:
              subject: "https://github.com/myorg/*"
              issuer: "https://token.actions.githubusercontent.com"
    attestations:
      - type: https://slsa.dev/provenance/v1
        conditions:
          - all:
              - key: "{{ buildDefinition.buildType }}"
                operator: Equals
                value: "https://slsa-framework.github.io/github-actions-buildtypes/workflow/v1"

That gave me a much stronger guardrail. If someone built an image manually on a laptop and pushed it to the registry, it would not get through.

The cosign command I actually kept using

When something failed, this was the command I ended up running over and over:

cosign verify \
  --certificate-identity-regexp "https://github.com/myorg/.*" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ghcr.io/myorg/myapp@sha256:abc123...

And for attestations. Note the type: plain slsaprovenance selects the older v0.2 predicate, while slsaprovenance1 matches the SLSA v1 provenance the generator emits:

cosign verify-attestation \
  --type slsaprovenance1 \
  --certificate-identity-regexp "https://github.com/slsa-framework/.*" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ghcr.io/myorg/myapp@sha256:abc123...

The output is JSON, which is technically helpful and emotionally annoying. Piping it through jq -r .payload | base64 -d | jq . makes it readable enough to debug without squinting. The -r matters: without it, jq emits a quoted string and base64 -d chokes on the quotes.
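You can sanity-check the pipe itself without touching a registry by feeding it a synthetic envelope. The statement below is a stand-in for the in-toto payload cosign wraps, not real attestation data:

```shell
# Build a fake DSSE-style envelope: .payload holds a base64-encoded statement,
# mirroring the shape of the JSON that cosign verify-attestation prints
statement='{"predicateType":"https://slsa.dev/provenance/v1"}'
encoded=$(printf '%s' "$statement" | base64 | tr -d '\n')
envelope=$(printf '{"payload":"%s"}' "$encoded")

# The same pipe you would run on real cosign output: -r strips the JSON
# quoting so base64 -d receives raw base64 instead of a quoted string
printf '%s\n' "$envelope" | jq -r .payload | base64 -d | jq .
```

Once that prints the inner statement, the same pipe works unchanged on the real verify-attestation output.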

Performance impact

The real-world cost in my cluster was about 200 ms of extra admission latency per pod creation. That is not nothing, but it also was not catastrophic. Kyverno has to hit the registry and the Rekor transparency log for every new image, so the slowdown is real on first deploy.

After that, caching helps a lot. Since Kyverno caches image verification results, repeated deployments of the same digest skip most of the pain. The first rollout is slower. The next ones are fine.

The weak point was Rekor. If Sigstore is having a bad day, your deployment pipeline feels it immediately. I kept the Rekor check enabled, but I also made the policy explicit so I could tune the behavior later if needed:

verifyImages:
  - imageReferences:
      - "ghcr.io/myorg/*"
    attestors:
      - entries:
          - keyless:
              subject: "https://github.com/myorg/*"
              issuer: "https://token.actions.githubusercontent.com"
              rekor:
                url: https://rekor.sigstore.dev
                ignoreTlog: false
    useCache: true

Was it worth it?

Yes, easily.

About a week after I put this in place, one of our CI jobs was misconfigured and started pushing unsigned images to a new repository. Without enforcement, those images probably would have drifted into production before anyone noticed. Instead, the deployment failed immediately, the error was obvious, and the developer fixed it without turning it into a bigger incident.

The full setup took me about a day, and most of that time went into dealing with exceptions for third party images. Basic signature verification was quick. The fiddly part was making it strict enough to be useful without breaking every chart that pulls something external.

Supply chain security still has a reputation for being the kind of thing you only do if you are extremely paranoid. I do not really buy that anymore. After the xz backdoor, the polyfill.io mess, and the endless stream of compromised npm packages, this feels less like overengineering and more like basic hygiene. I would rather spend a day wiring this up than spend a week explaining why an unsigned image made it into a production cluster.