Every team wanted their own cluster. QA had one, staging had one, each developer wanted one for feature branches. We ended up with 12 EKS clusters, most of them sitting at 15% utilization, all of them costing real money.
I kept hearing about vCluster from Loft Labs and finally gave it a shot three months ago. The pitch sounded too good to be true: full Kubernetes clusters running inside a single host cluster, each with its own API server, its own resources, and complete isolation. No extra nodes, no extra control planes to manage.
Spoiler: it actually works.
The Problem
Our setup was typical mid-size platform engineering pain. Six developers, a QA team, staging, two demo environments, and a couple of sandbox clusters for experimentation. Each EKS cluster had its own node groups, its own ALB controllers, its own cert-manager installation.
The bill was around $4,200/month just for the dev/test clusters. Not catastrophic, but hard to justify when most of them were idle 80% of the time.
Namespace isolation wasn’t enough. Developers needed CRD access, they needed to install Helm charts, they needed cluster-scoped resources. Namespaces can’t give you that.
Getting Started
Install the vCluster CLI:
```bash
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster
sudo mv vcluster /usr/local/bin/
```
Creating a virtual cluster takes about 30 seconds:
```bash
vcluster create dev-alice --namespace team-alice
```
That’s it. vCluster spins up a lightweight K3s control plane inside a pod, creates a syncer that maps resources between the virtual and host cluster, and hands you a kubeconfig.
```bash
vcluster connect dev-alice --namespace team-alice
kubectl get nodes
```
The virtual cluster sees its own nodes (synced from the host), its own kube-system namespace, everything. From the developer’s perspective, it’s a real cluster.
The Architecture That Clicked
Here’s what runs inside the host cluster for each vCluster:
```
host-cluster/
  namespace: team-alice/
    pod: dev-alice-0          # K3s control plane + syncer
    pvc: data-dev-alice-0     # etcd storage (SQLite by default)
    service: dev-alice        # API server endpoint
```
One pod. One PVC. One service. That’s the overhead per virtual cluster. Compare that to a full EKS cluster with its own VPC, node groups, and add-ons.
The syncer is the key piece. When a developer creates a Deployment in the virtual cluster, the syncer creates the corresponding resources in the host namespace. Pods run on the host nodes, but the developer only sees their own stuff.
Real Configuration
The default setup works for quick experiments, but production use needs a vcluster.yaml:
```yaml
controlPlane:
  distro:
    k3s:
      enabled: true
  statefulSet:
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 1Gi
sync:
  toHost:
    ingresses:
      enabled: true
  fromHost:
    nodes:
      enabled: true
      selector:
        labels:
          nodepool: shared
policies:
  resourceQuota:
    enabled: true
    quota:
      requests.cpu: "4"
      requests.memory: 8Gi
      limits.cpu: "8"
      limits.memory: 16Gi
  limitRange:
    enabled: true
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
```
```bash
vcluster create dev-alice --namespace team-alice -f vcluster.yaml
```
The resource quota and limit range are critical. Without them, one developer can eat all the host cluster resources and everyone else suffers.
What Broke (And How I Fixed It)
DNS Resolution Between Virtual Clusters
Services in one vCluster can’t resolve services in another by default. That’s actually the correct behavior for isolation. But our QA team needed to hit a shared database running in a separate vCluster.
The fix was mapping the database service from the host:
```yaml
sync:
  fromHost:
    services:
      enabled: true
      mappings:
        - from:
            namespace: shared-services
            name: postgres-primary
          to:
            namespace: default
            name: shared-db
```
Persistent Volumes
PVCs work, but you need to understand they’re created on the host cluster’s storage class. If your host cluster uses gp3 EBS volumes, that’s what your vCluster gets. No surprises there, but developers who expected different storage classes were confused.
I added a clear onboarding doc and mapped the storage classes explicitly:
```yaml
sync:
  fromHost:
    storageClasses:
      enabled: true
```
Ingress Conflicts
Multiple vClusters can’t share the same hostname on an Ingress. The syncer rewrites Ingress names to avoid conflicts, but hostnames need to be unique. We solved this with a naming convention:
```
<service>.<vcluster-name>.dev.example.com
```
Wildcard DNS + wildcard TLS cert, done.
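For illustration, here's what an Ingress created inside the `dev-alice` vCluster might look like under that convention (the `api` service and the wildcard TLS secret name are hypothetical, not from our actual setup):

```yaml
# Created inside the dev-alice vCluster. The syncer rewrites the
# resource name on the host to avoid collisions, but the hostname
# stays unique because the vCluster name is baked into it.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  tls:
    - hosts:
        - api.dev-alice.dev.example.com
      secretName: wildcard-dev-example-com  # assumed wildcard cert secret
  rules:
    - host: api.dev-alice.dev.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```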
The GitOps Integration
We manage vClusters with ArgoCD. Each developer gets a vCluster defined in Git:
```yaml
# clusters/dev-alice.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vcluster-dev-alice
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://charts.loft.sh
    chart: vcluster
    targetRevision: 0.24.x
    helm:
      valuesObject:
        controlPlane:
          distro:
            k3s:
              enabled: true
        sync:
          toHost:
            ingresses:
              enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: team-alice
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
New developer joins? They open a PR adding their cluster config. Merge, and ArgoCD provisions it. Developer leaves? Delete the file, ArgoCD cleans up.
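If hand-writing one Application per developer gets tedious, an ArgoCD ApplicationSet with a Git file generator can stamp them out from the same repo. A sketch, assuming each `clusters/*.yaml` file carries a `name` field the template can reference (repo URL is a placeholder):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: vclusters
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/platform-config.git  # placeholder
        revision: main
        files:
          - path: "clusters/*.yaml"  # one file per developer
  template:
    metadata:
      name: "vcluster-{{name}}"  # comes from the per-developer file
    spec:
      project: platform
      source:
        repoURL: https://charts.loft.sh
        chart: vcluster
        targetRevision: 0.24.x
      destination:
        server: https://kubernetes.default.svc
        namespace: "team-{{name}}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

The PR workflow stays the same: adding or deleting a file in `clusters/` creates or removes a vCluster.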
CI/CD Ephemeral Clusters
The real win was CI/CD. We replaced our shared staging cluster with ephemeral vClusters per pull request:
```yaml
# .github/workflows/pr-env.yaml
name: PR Environment
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create ephemeral vCluster
        run: |
          vcluster create pr-${{ github.event.number }} \
            --namespace pr-envs \
            --connect=false \
            -f .vcluster/ephemeral.yaml
      - name: Connect and deploy
        run: |
          vcluster connect pr-${{ github.event.number }} \
            --namespace pr-envs
          helm upgrade --install myapp ./charts/myapp \
            --set image.tag=${{ github.sha }}
      - name: Run integration tests
        run: |
          kubectl wait --for=condition=ready pod -l app=myapp --timeout=120s
          ./scripts/integration-tests.sh
```
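The `.vcluster/ephemeral.yaml` referenced above isn't reproduced in the post; a plausible version keeps the control plane small and caps what any single PR can consume (all resource figures here are illustrative, not our actual values):

```yaml
# .vcluster/ephemeral.yaml -- a sketch for short-lived PR environments
controlPlane:
  distro:
    k3s:
      enabled: true
  statefulSet:
    resources:
      requests:
        cpu: 100m        # smaller than a long-lived dev vCluster
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
policies:
  resourceQuota:
    enabled: true
    quota:
      requests.cpu: "2"  # one PR shouldn't starve the pr-envs namespace
      requests.memory: 4Gi
```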
And a cleanup workflow on PR close:
```bash
vcluster delete pr-${{ github.event.number }} --namespace pr-envs
```
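Since `${{ github.event.number }}` only resolves inside a workflow run, that command needs its own trigger. A minimal sketch (like the deploy workflow above, it assumes the runner already has the vCluster CLI and cluster credentials):

```yaml
# .github/workflows/pr-cleanup.yaml -- hypothetical filename
name: PR Cleanup
on:
  pull_request:
    types: [closed]  # fires on both merge and close
jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete ephemeral vCluster
        run: |
          vcluster delete pr-${{ github.event.number }} --namespace pr-envs
```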
Each PR gets a full Kubernetes environment in 30 seconds, tests run in complete isolation, and everything gets torn down when the PR merges or closes. No more “who broke staging” conversations.
The Numbers
After three months:
| Before | After |
|---|---|
| 12 EKS clusters | 1 EKS cluster (3 node groups) |
| ~$4,200/month | ~$1,700/month |
| 15% avg utilization | 55% avg utilization |
| 20 min to provision new env | 30 seconds |
| Manual cleanup | Automatic with TTL |
The savings justified the migration on their own. But the real value is developer velocity. Nobody waits for a cluster anymore. Nobody shares an environment that someone else can break.
What I’d Do Differently
Start with resource quotas from day one. I didn’t, and within the first week someone deployed a stress test that OOM-killed pods across three other virtual clusters. The host cluster’s resources are shared, whether you like it or not.
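Belt and braces: beyond the quota in vcluster.yaml, a plain Kubernetes ResourceQuota on the host namespace caps everything the syncer creates there, even if the vCluster config drifts. The limits below are illustrative:

```yaml
# Applied to the host namespace, outside the vCluster's control
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alice-quota
  namespace: team-alice
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```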
Use the K3s distro, not full K8s. vCluster can run a full upstream K8s control plane in the virtual cluster, but K3s is lighter and boots faster. Unless you need specific upstream K8s API features, K3s is the right call.
Set up monitoring on the host cluster, not inside vClusters. Prometheus running in each virtual cluster is wasteful. A single Prometheus on the host can scrape all the pods, and you can use labels to separate metrics per tenant.
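One way to get the per-tenant split from a single host Prometheus is to keep the host namespace as a label during pod discovery. Since each vCluster lives in its own host namespace, the namespace doubles as a tenant identifier. A sketch of a `kubernetes_sd_configs` scrape job (the `tenant` label name is our convention, not anything vCluster requires):

```yaml
scrape_configs:
  - job_name: vcluster-tenants
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Promote the host namespace to a "tenant" label on every series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: tenant
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Dashboards and alerts can then filter on `tenant="team-alice"` without running anything inside the virtual clusters.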
Who Should Use This
If you’re running more than three clusters and most of them aren’t production, vCluster is probably worth evaluating. The sweet spots I’ve seen:
- Dev/test environments per developer or per team
- CI/CD ephemeral environments per pull request
- Multi-tenant SaaS where each customer needs cluster-level isolation
- Training/demo environments that spin up and tear down frequently
If you’re running a single production cluster and don’t have multi-tenancy needs, it’s probably overkill.
The project is open source, well-maintained, and backed by Loft Labs. The community is active and the docs are solid. Three months in, I haven’t hit a showstopper, and my developers are happier than they’ve been in a while. That counts for something.