I’ve been running Falco for runtime security on most of my clusters for the past two years. It did the job, but the kernel module approach always felt brittle. Every kernel upgrade felt like rolling dice. When Cilium Tetragon reached 1.3 stable and went full eBPF with no kernel module, I finally gave it a real try on a production cluster.

This is what happened.

Why I Switched from Falco

Falco has been solid, no question. But I kept running into the same issues:

  • Kernel module rebuilds on node upgrades (even with the eBPF probe, compatibility was hit or miss)
  • High CPU usage on nodes running 80+ pods
  • Rules syntax that nobody on the team wanted to touch
  • False positives that made everyone ignore alerts

Tetragon promised lower overhead because it hooks straight into the kernel with eBPF, gives fine-grained policy control through TracingPolicy CRDs, and understands Kubernetes natively. I was skeptical at first, but the overhead numbers were hard to ignore.

Installing Tetragon

I run Cilium as my CNI already, so adding Tetragon was straightforward. If you’re not on Cilium CNI, that’s fine. Tetragon works standalone.

helm repo add cilium https://helm.cilium.io/
helm repo update

helm install tetragon cilium/tetragon \
  --namespace kube-system \
  --set tetragon.enableProcessCred=true \
  --set tetragon.enableProcessNs=true \
  --set tetragon.exportRateLimit=200

The enableProcessCred and enableProcessNs flags matter more than they might appear to. Without them, events are missing uid/gid details and namespace context. I skipped them on my first deploy and burned an hour figuring out why my policies were not matching.
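Since I manage everything else through GitOps, I prefer these settings in a values file over `--set` flags. A sketch equivalent to the install command above (key names match the `--set` paths):

```yaml
# values.yaml for the cilium/tetragon chart -- same settings as the
# --set flags above, in a form that can live in Git.
tetragon:
  enableProcessCred: true   # include uid/gid details in events
  enableProcessNs: true     # include namespace context in events
  exportRateLimit: 200      # cap exported events to protect the log pipeline
```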

Verify it’s running:

kubectl get pods -n kube-system -l app.kubernetes.io/name=tetragon

You should see a tetragon pod on every node (it’s a DaemonSet).

Your First TracingPolicy

Out of the box, Tetragon gives you process lifecycle events (exec, exit). Useful, but not the interesting part. The real power is in TracingPolicy.

Here’s the first one I wrote. It catches any container that opens /etc/shadow, /etc/passwd, or anything under /etc/kubernetes/pki (the Prefix operator matches files inside that directory, which an exact Equal match would not):

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: sensitive-file-access
spec:
  kprobes:
    - call: fd_install
      syscall: false
      args:
        - index: 0
          type: int
        - index: 1
          type: "file"
      selectors:
        - matchArgs:
            - index: 1
              operator: Prefix
              values:
                - "/etc/shadow"
                - "/etc/passwd"
                - "/etc/kubernetes/pki"
          matchNamespaces:
            - namespace: Pid
              operator: NotIn
              values:
                - "host_ns"

Apply it:

kubectl apply -f sensitive-file-access.yaml

Now test it. Exec into any pod and try to read /etc/shadow:

kubectl exec -it some-pod -- cat /etc/shadow

Check the Tetragon logs:

kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout --tail=20 | \
  jq 'select(.process_kprobe != null)'

You should see a JSON event with the full process tree, container ID, pod name, namespace, and labels. That is the nice part. No correlation work, no sidecar, no fragile log parsing. The kernel tells you what happened, and Tetragon adds K8s context.
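When the full JSON gets noisy, I flatten events to one line each with jq. The field names below are what my cluster's events contained; double-check them against your Tetragon version. Here the filter runs against a trimmed sample event so you can see the shape without a live cluster:

```shell
# Trimmed sample of a process_kprobe event (real events carry much more:
# full process tree, container ID, pod labels, arguments).
cat <<'EOF' > /tmp/sample-event.json
{"process_kprobe":{"process":{"binary":"/bin/cat","pod":{"namespace":"default","name":"some-pod"}},"function_name":"fd_install"},"time":"2024-05-01T12:00:00Z"}
EOF

# Flatten to: time, pod namespace, pod name, binary, hooked function.
jq -r 'select(.process_kprobe != null)
       | [.time,
          .process_kprobe.process.pod.namespace,
          .process_kprobe.process.pod.name,
          .process_kprobe.process.binary,
          .process_kprobe.function_name]
       | @tsv' /tmp/sample-event.json
```

Pipe the export-stdout logs through the same filter to get the live version.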

The Policy That Saved Us

Two weeks after deploying Tetragon, it caught something Falco had missed. One of our Java services was spawning a shell subprocess to run curl for health checks (yes, really). Ugly, but not malicious. Still, that exact pattern (a shell spawned from a Java process) is also what container escape behavior can look like.

I wrote a TracingPolicy that alerts whenever a container process execs a binary outside a short allowlist:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: unexpected-process-exec
spec:
  kprobes:
    - call: sys_execve
      syscall: true
      args:
        - index: 0
          type: "string"
      selectors:
        - matchArgs:
            - index: 0
              operator: NotIn
              values:
                - "/bin/sh"
                - "/usr/bin/java"
                - "/usr/bin/python3"
          matchNamespaces:
            - namespace: Pid
              operator: NotIn
              values:
                - "host_ns"
          matchBinaries:
            - operator: NotIn
              values:
                - "/pause"
                - "/usr/bin/tini"

Within a day, this caught a compromised npm dependency in a staging service trying to download and execute a binary. The process tree was: node -> sh -> curl -> suspicious-binary. We had not noticed it because health checks were still passing.

Performance: The Numbers

I measured CPU and memory overhead on a node running 120 pods with 3 TracingPolicies active:

Metric          Falco (eBPF probe)   Tetragon
CPU (avg)       180m                 45m
CPU (p99)       620m                 110m
Memory          340Mi                95Mi
Event latency   ~8ms                 ~1.2ms

The difference is not subtle. Tetragon was roughly 4x more CPU efficient and used about a quarter of the memory. Lower event latency also meant enforcement could happen in real time without adding obvious syscall delay.

Enforcement Mode: Proceed with Caution

Tetragon can kill processes that match a policy. That is powerful and risky. Add matchActions with Sigkill to a selector:

selectors:
  - matchArgs:
      - index: 0
        operator: Equal
        values:
          - "/usr/bin/wget"
    matchActions:
      - action: Sigkill

I tested this in staging for a month before enabling it in production. Start with Override (returns an error to the syscall) instead of Sigkill (terminates the process). Override is less disruptive and buys you tuning time.
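The Override variant looks like this (a sketch; argError is the errno handed back to the caller, and as I understand it Override only works on kernels built with error injection support, which is worth verifying on your nodes first):

```yaml
# Same selector shape as the Sigkill example, but the process survives:
# the matched call just fails with an error instead.
selectors:
  - matchArgs:
      - index: 0
        operator: Equal
        values:
          - "/usr/bin/wget"
    matchActions:
      - action: Override
        argError: -1   # errno returned to the caller (EPERM)
```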

My recommendation: run observe-only mode for at least 2 weeks per cluster. Export events to your SIEM, build dashboards, tune out false positives, then turn on enforcement.

Piping Events to Your Stack

Tetragon exports JSON events, and getting them into your observability stack is pretty straightforward:

# Direct to stdout (the export-stdout container)
kubectl logs -n kube-system -l app.kubernetes.io/name=tetragon -c export-stdout -f

# Or use the tetragon CLI for filtered, human-readable output
kubectl exec -n kube-system ds/tetragon -c tetragon -- \
  tetra getevents -o compact

In production, I pipe export-stdout logs through Fluent Bit into Loki. A small Fluent Bit filter parses JSON and adds severity from the policy name. From there, Grafana dashboards show process exec events by namespace, file access violations, and network connections from unexpected binaries.
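The filter chain is roughly the following (a sketch: the tag names and the severity mapping are specific to my setup, and the output assumes Fluent Bit's built-in loki plugin):

```
# Parse the JSON event out of the container log line.
[FILTER]
    Name          parser
    Match         tetragon.*
    Key_Name      log
    Parser        json
    Reserve_Data  On

# Tag events from enforcement-style policies as high severity,
# keyed off a naming convention in the TracingPolicy names.
[FILTER]
    Name          modify
    Match         tetragon.*
    Condition     Key_value_matches policy_name .*sigkill.*
    Add           severity high

[OUTPUT]
    Name          loki
    Match         tetragon.*
    Host          loki.monitoring.svc
    Port          3100
    Labels        job=tetragon
```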

The Grafana dashboard JSON is around 200 lines, so I will spare you the full dump. The key panels are:

  • Process exec heatmap by namespace and binary name
  • Policy violations over time, grouped by TracingPolicy name
  • Top talkers: pods generating the most security events

Gotchas I Hit

1. ARM64 nodes need a recent kernel. If you’re running mixed amd64/arm64 clusters (like I do with Graviton nodes on EKS), make sure your ARM nodes run kernel 5.15+. Older kernels have eBPF verifier bugs that cause Tetragon pods to CrashLoop.

2. TracingPolicy ordering matters. If two policies match the same syscall, both fire. But if one has a Sigkill action and the other has Override, the Sigkill wins. Document your policies carefully.

3. Export rate limiting is your friend. I set exportRateLimit=200 in Helm values. Without it, a noisy workload (looking at you, PHP-FPM) can produce thousands of events per second and swamp the log pipeline.

4. The tetra CLI is essential for debugging. Install it locally:

curl -LO https://github.com/cilium/tetragon/releases/latest/download/tetra-linux-amd64.tar.gz
tar xzf tetra-linux-amd64.tar.gz
sudo mv tetra /usr/local/bin/

Use tetra getevents with --namespace and --pod filters to debug policies without drowning in cluster-wide noise.

5. Don’t forget to exclude kube-system. Your first week will be full of alerts from kubelet, kube-proxy, and other system components doing legitimate things. Add namespace exclusions to your policies early.
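You can also drop system-namespace noise at the export layer instead of repeating exclusions in every policy. The Helm chart version I used exposes deny-list filters as newline-separated JSON (verify the key name and filter format against your chart version):

```yaml
# values.yaml -- drop events from system namespaces before export.
tetragon:
  exportDenyList: |-
    {"health_check":true}
    {"namespace":["", "kube-system", "cilium"]}
```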

Was It Worth It?

Yes, absolutely. Tetragon replaced Falco on three of my four clusters. The fourth still runs an older kernel that does not support every eBPF feature Tetragon needs, so Falco stays there for now.

Lower resource use, Kubernetes-native policies, and real enforcement make Tetragon the better fit for most clusters on kernel 5.10+. The TracingPolicy CRD model also means I manage security policies the same way as everything else in Kubernetes, through GitOps with ArgoCD.

If you already run Cilium as your CNI, adding Tetragon is an easy decision. If not, it is still worth evaluating in standalone mode. eBPF-based runtime security is clearly where things are heading, and Tetragon is the most production-ready implementation I have used so far.