Every team wanted their own cluster. QA had one, staging had one, each developer wanted one for feature branches. We ended up with 12 EKS clusters, most of them sitting at 15% utilization, all of them costing real money.

I kept hearing about vCluster from Loft Labs and finally gave it a shot three months ago. The pitch sounded too good to be true: full Kubernetes clusters running inside a single host cluster, each with its own API server, its own resources, complete isolation. No extra nodes, no extra control planes to manage.

Spoiler: it actually works.

The Problem

Our setup was typical mid-size platform engineering pain. Six developers, a QA team, staging, two demo environments, and a couple of sandbox clusters for experimentation. Each EKS cluster had its own node groups, its own ALB controllers, its own cert-manager installation.

The bill was around $4,200/month just for the dev/test clusters. Not catastrophic, but hard to justify when most of them were idle 80% of the time.

Namespace isolation wasn’t enough. Developers needed CRD access, they needed to install Helm charts, they needed cluster-scoped resources. Namespaces can’t give you that.
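To make that concrete: with namespace-scoped RBAC, anything cluster-scoped is off-limits, and CRDs are usually the first thing developers hit. A sketch (the CRD file is hypothetical):

```shell
# With a namespace-scoped Role on a shared cluster:
kubectl apply -f my-crd.yaml
# Error from server (Forbidden): ... cannot create resource
# "customresourcedefinitions" ... at the cluster scope

# Inside a vCluster, the developer owns their own API server,
# and the CRD exists only in the virtual cluster:
kubectl apply -f my-crd.yaml
```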

Getting Started

Install the vCluster CLI:

curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster
sudo mv vcluster /usr/local/bin/

Creating a virtual cluster takes about 30 seconds:

vcluster create dev-alice --namespace team-alice

That’s it. vCluster spins up a lightweight K3s control plane inside a pod, creates a syncer that maps resources between the virtual and host cluster, and hands you a kubeconfig.

vcluster connect dev-alice --namespace team-alice
kubectl get nodes

The virtual cluster sees its own nodes (synced from the host), its own kube-system namespace, everything. From the developer’s perspective, it’s a real cluster.

The Architecture That Clicked

Here’s what runs inside the host cluster for each vCluster:

host-cluster/
  namespace: team-alice/
    pod: dev-alice-0          # K3s control plane + syncer
    pvc: data-dev-alice-0     # control-plane datastore (SQLite by default)
    service: dev-alice        # API server endpoint

One pod. One PVC. One service. That’s the overhead per virtual cluster. Compare that to a full EKS cluster with its own VPC, node groups, and add-ons.

The syncer is the key piece. When a developer creates a Deployment in the virtual cluster, the syncer creates the corresponding resources in the host namespace. Pods run on the host nodes, but the developer only sees their own stuff.
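You can watch the translation happen. Synced resources get rewritten names in the host namespace so two vClusters can never collide; roughly (the exact suffix format can vary by version):

```shell
# Inside the virtual cluster: a perfectly normal Deployment
kubectl create deployment web --image=nginx

# On the host: the pod appears in the vCluster's namespace, with a
# rewritten name encoding the virtual namespace and cluster
kubectl get pods -n team-alice
# NAME                                         READY   STATUS
# web-7c5ddbdf54-x2k9p-x-default-x-dev-alice   1/1     Running
```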

Real Configuration

The default setup works for quick experiments, but production use needs a vcluster.yaml:

controlPlane:
  distro:
    k3s:
      enabled: true
  statefulSet:
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 1Gi

sync:
  toHost:
    ingresses:
      enabled: true
  fromHost:
    nodes:
      enabled: true
      selector:
        labels:
          nodepool: shared

policies:
  resourceQuota:
    enabled: true
    quota:
      requests.cpu: "4"
      requests.memory: 8Gi
      limits.cpu: "8"
      limits.memory: 16Gi
  limitRange:
    enabled: true
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi

Create the cluster with the config file:

vcluster create dev-alice --namespace team-alice -f vcluster.yaml

The resource quota and limit range are critical. Without them, one developer can eat all the host cluster resources and everyone else suffers.
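Note that the quota and limit range are enforced on the host namespace, where the synced pods actually live. You can inspect consumption from the host side:

```shell
# On the host cluster: per-vCluster usage against the quota
kubectl describe resourcequota -n team-alice
# shows Used vs Hard for requests.cpu, requests.memory, etc.
```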

What Broke (And How I Fixed It)

DNS Resolution Between Virtual Clusters

Services in one vCluster can’t resolve services in another by default. That’s actually the correct behavior for isolation. But our QA team needed to hit a shared database running in a separate vCluster.

The fix was mapping the database service from the host:

networking:
  replicateServices:
    fromHost:
    - from: shared-services/postgres-primary
      to: default/shared-db
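After the sync, shared-db resolves like any local service inside the QA vCluster. A quick smoke test from inside the virtual cluster:

```shell
# The mapped name resolves via normal cluster DNS
kubectl run -it --rm dbcheck --image=busybox --restart=Never -- \
  nslookup shared-db.default.svc.cluster.local
```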

Persistent Volumes

PVCs work, but they’re provisioned through the host cluster’s storage classes. If your host cluster uses gp3 EBS volumes, that’s what your vCluster gets. Predictable once you know it, but developers who expected their usual storage classes were confused at first.

I added a clear onboarding doc and mapped the storage classes explicitly:

sync:
  fromHost:
    storageClasses:
      enabled: true
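With the classes synced, a PVC inside the vCluster just names the host’s class. A minimal example, assuming the host exposes a gp3 class:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: gp3        # synced from the host; the EBS volume is provisioned there
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```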

Ingress Conflicts

Multiple vClusters can’t share the same hostname on an Ingress. The syncer rewrites Ingress names to avoid conflicts, but hostnames need to be unique. We solved this with a naming convention:

<service>.<vcluster-name>.dev.example.com

Wildcard DNS + wildcard TLS cert, done.
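Per service, the convention comes out like this; the TLS secret name is an assumption (whatever your wildcard cert is stored as):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  tls:
  - hosts: ["myapp.dev-alice.dev.example.com"]
    secretName: wildcard-dev-tls   # illustrative name for the wildcard cert
  rules:
  - host: myapp.dev-alice.dev.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 80
```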

The GitOps Integration

We manage vClusters with ArgoCD. Each developer gets a vCluster defined in Git:

# clusters/dev-alice.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vcluster-dev-alice
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://charts.loft.sh
    chart: vcluster
    targetRevision: 0.24.x
    helm:
      valuesObject:
        controlPlane:
          distro:
            k3s:
              enabled: true
        sync:
          toHost:
            ingresses:
              enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: team-alice
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

New developer joins? They open a PR adding their cluster config. Merge, and ArgoCD provisions it. Developer leaves? Delete the file, ArgoCD cleans up.
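One file per developer scales fine at our size. If the copy-paste bothers you, an ArgoCD ApplicationSet with a Git file generator can stamp out one Application per file instead; a sketch (repo URL and paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: vclusters
  namespace: argocd
spec:
  generators:
  - git:
      repoURL: https://github.com/example/platform.git
      revision: main
      files:
      - path: "clusters/*.yaml"
  template:
    metadata:
      # e.g. clusters/dev-alice.yaml -> vcluster-dev-alice
      name: "vcluster-{{path.basenameNormalized}}"
    spec:
      project: platform
      source:
        repoURL: https://charts.loft.sh
        chart: vcluster
        targetRevision: 0.24.x
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basenameNormalized}}"
```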

CI/CD Ephemeral Clusters

The real win was CI/CD. We replaced our shared staging cluster with ephemeral vClusters per pull request:

# .github/workflows/pr-env.yaml
name: PR Environment
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Create ephemeral vCluster
        run: |
          vcluster create pr-${{ github.event.number }} \
            --namespace pr-envs \
            --connect=false \
            -f .vcluster/ephemeral.yaml          

      - name: Connect and deploy
        run: |
          vcluster connect pr-${{ github.event.number }} \
            --namespace pr-envs
          helm upgrade --install myapp ./charts/myapp \
            --set image.tag=${{ github.sha }}          

      - name: Run integration tests
        run: |
          kubectl wait --for=condition=ready pod -l app=myapp --timeout=120s
          ./scripts/integration-tests.sh          

And a cleanup workflow on PR close:

vcluster delete pr-${{ github.event.number }} --namespace pr-envs
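In full, the teardown is its own tiny workflow (a sketch mirroring the deploy job’s assumptions):

```yaml
# .github/workflows/pr-cleanup.yaml
name: PR Cleanup
on:
  pull_request:
    types: [closed]    # fires on merge and on plain close

jobs:
  teardown:
    runs-on: ubuntu-latest
    steps:
      - name: Delete ephemeral vCluster
        run: |
          vcluster delete pr-${{ github.event.number }} --namespace pr-envs
```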

Each PR gets a full Kubernetes environment in 30 seconds, tests run in complete isolation, and everything gets torn down when the PR merges or closes. No more “who broke staging” conversations.

The Numbers

After three months:

Before                            After
12 EKS clusters                   1 EKS cluster (3 node groups)
~$4,200/month                     ~$1,700/month
15% avg utilization               55% avg utilization
20 min to provision a new env     30 seconds
Manual cleanup                    Automatic with TTL
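The TTL is nothing built in; open-source vCluster has no expiry, so a scheduled sweep deletes anything in pr-envs older than a cutoff. A sketch; the app=vcluster label and 48-hour window are assumptions to adjust:

```bash
#!/bin/bash
# Delete any vCluster in pr-envs older than 48 hours.
cutoff=$(date -u -d '48 hours ago' +%Y-%m-%dT%H:%M:%SZ)
kubectl get statefulsets -n pr-envs -l app=vcluster \
  -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' |
while read -r name created; do
  # ISO-8601 timestamps compare correctly as strings
  if [[ "$created" < "$cutoff" ]]; then
    vcluster delete "$name" --namespace pr-envs
  fi
done
```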

The migration paid for itself immediately. But the real value is developer velocity. Nobody waits for a cluster anymore. Nobody shares an environment that someone else can break.

What I’d Do Differently

Start with resource quotas from day one. I didn’t, and within the first week someone deployed a stress test that OOM-killed pods across three other virtual clusters. The host cluster’s resources are shared, whether you like it or not.

Use the K3s distro, not full K8s. vCluster can run vanilla Kubernetes as the virtual control plane, but K3s is lighter and boots faster. Unless you need specific K8s API features, K3s is the right call.

Set up monitoring on the host cluster, not inside vClusters. Prometheus running in each virtual cluster is wasteful. A single Prometheus on the host can scrape all the pods, and you can use labels to separate metrics per tenant.
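This works because every vCluster’s synced pods land in exactly one host namespace, so the namespace label is effectively the tenant label. A Prometheus scrape-config sketch:

```yaml
scrape_configs:
- job_name: vcluster-tenants
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # The host namespace (team-alice, team-bob, ...) identifies the tenant
  - source_labels: [__meta_kubernetes_namespace]
    target_label: tenant
```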

Who Should Use This

If you’re running more than three clusters and most of them aren’t production, vCluster is probably worth evaluating. The sweet spots I’ve seen:

  • Dev/test environments per developer or per team
  • CI/CD ephemeral environments per pull request
  • Multi-tenant SaaS where each customer needs cluster-level isolation
  • Training/demo environments that spin up and tear down frequently

If you’re running a single production cluster and don’t have multi-tenancy needs, it’s probably overkill.

The project is open source, well-maintained, and backed by Loft Labs. The community is active and the docs are solid. Three months in, I haven’t hit a showstopper, and my developers are happier than they’ve been in a while. That counts for something.