Migrating Legacy Infrastructure to Kubernetes

Enterprise project · EKS · Terraform · Helm · Zero-downtime migration

The Challenge

An enterprise client was running their entire platform on a fleet of manually provisioned EC2 instances. What started as a “quick and simple” setup years ago had grown into an unmaintainable web of snowflake servers, hand-edited configuration files, and tribal knowledge that existed only in the heads of two senior engineers. You’ve probably seen this movie before.

The pain points:

  • Scaling was manual. Someone had to spin up new instances, install dependencies, configure load balancers, and update DNS. During peak periods, this took hours
  • No two servers were identical. Each had accumulated years of ad-hoc changes, making debugging a nightmare
  • Infrastructure costs were spiraling. Oversized instances running 24/7 because nobody was confident enough to right-size them
  • Onboarding new engineers took weeks because there was no documentation, just “ask Dave”
  • The two engineers who understood the setup were the biggest flight risk in the company. And they knew it

They needed to modernize without disrupting a platform that served paying enterprise customers.

The Approach

I planned and executed the migration in three phases over 10 weeks. The non-negotiable constraint: zero customer-facing downtime.

Phase 1: Infrastructure as Code (Weeks 1-3)

Before touching Kubernetes, I needed to codify what existed. You can’t migrate what you don’t understand.

  • Audited every running service. Mapped dependencies, ports, environment variables, storage mounts, cron jobs. I found 7 services nobody knew were still running. (They were all important, naturally)
  • Wrote Terraform for the existing infrastructure first. VPC, security groups, IAM roles, RDS instances. This gave us a safety net and documentation in one step
  • Set up Terragrunt for DRY configuration across environments. Each environment (dev, staging, production) was now a thin overlay on shared modules
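The thin-overlay layout can be sketched roughly as follows. This is an illustrative Terragrunt file, not the client's actual configuration; the module path, directory layout, and input names are all placeholders.

```hcl
# live/production/vpc/terragrunt.hcl -- illustrative layout and inputs
include "root" {
  # Pull in shared remote-state and provider config from a parent terragrunt.hcl
  path = find_in_parent_folders()
}

terraform {
  # Every environment points at the same shared module
  source = "../../../modules/vpc"
}

# The per-environment overlay is just the inputs that differ
inputs = {
  environment = "production"
  cidr_block  = "10.0.0.0/16"
}
```

With this shape, dev and staging get their own `terragrunt.hcl` files that differ only in the `inputs` block, which is what keeps the environments from drifting apart.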

Phase 2: EKS Cluster and Containerization (Weeks 3-7)

With the existing infrastructure codified, I built the target state:

  • EKS cluster provisioned via Terraform. Managed node groups with a mix of on-demand (for stateful workloads) and spot instances (for stateless services)
  • Helm charts for every service. Templated, version-controlled, with sensible defaults and per-environment overrides
  • Container images built with multi-stage Dockerfiles, optimized for size and security. Every image was scanned before being pushed to ECR
  • Secrets management migrated from .env files on servers to AWS Secrets Manager, injected via External Secrets Operator
  • Persistent storage mapped to EBS CSI driver and EFS for shared storage needs
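As an example of the secrets flow, an ExternalSecret resource tells the External Secrets Operator to mirror a Secrets Manager entry into a native Kubernetes Secret. The service name, store name, and key below are placeholders, not the client's real values.

```yaml
# Illustrative ExternalSecret -- names and keys are placeholders
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: billing-service
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # a ClusterSecretStore configured for AWS
    kind: ClusterSecretStore
  target:
    name: billing-service-env   # the Kubernetes Secret the operator creates
  dataFrom:
    - extract:
        key: production/billing-service   # the Secrets Manager entry to mirror
```

Pods then mount or reference `billing-service-env` like any other Secret, so nothing application-side needs to know about AWS at all.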

I ran the new environment in parallel, with synthetic traffic, for two weeks before moving any real workload. Patience pays off during migrations.

Phase 3: Zero-Downtime Migration (Weeks 7-10)

The migration itself was the most carefully orchestrated part:

  • Service by service, not big bang. I migrated the least critical service first, then progressively moved higher-risk services
  • Weighted DNS routing. Started by sending 5% of traffic to Kubernetes, monitored for a week, then stepped up to 25%, 50%, and 100%
  • Database connections were the trickiest part. I used connection pooling with PgBouncer and careful connection string management to point services at the same RDS instances regardless of where they ran
  • Rollback plan for every service. The old infrastructure stayed warm and ready for instant failback for 30 days after each migration
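In Route 53 terms, weighted routing looks like two record sets with the same name and different weights. The sketch below is illustrative only; the hostname, variables, and weights are placeholders, and the real cutover adjusted weights per service over several weeks.

```hcl
# Illustrative Route 53 weighted records -- names and targets are placeholders
resource "aws_route53_record" "legacy" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "legacy-ec2"
  records        = [var.legacy_lb_dns]

  weighted_routing_policy {
    weight = 95   # shifted toward 0 as confidence in the new cluster grew
  }
}

resource "aws_route53_record" "eks" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "eks"
  records        = [var.eks_ingress_dns]

  weighted_routing_policy {
    weight = 5    # start at 5%, then 25, 50, 100
  }
}
```

Because both record sets resolve the same name, shifting traffic is a one-line weight change and an apply, which is also what makes instant failback possible.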

The Results

Zero downtime

Not a single minute of customer-facing downtime during the entire migration. Weighted routing made it invisible.

40% cost reduction

Right-sized containers + spot instances + autoscaling. They were paying for capacity they weren't using.

Minutes, not hours

Scaling went from "call Dave and wait 3 hours" to HPA-driven autoscaling in minutes. Peak periods handled automatically.
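The autoscaling side of that is an ordinary HorizontalPodAutoscaler. The manifest below is a generic sketch, assuming a CPU-utilization target; the deployment name and replica bounds are placeholders, not the client's actual settings.

```yaml
# Illustrative HPA -- service name and targets are placeholders
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```

The controller adds pods during peaks and scales back down afterward, with no one on call to provision anything.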

Beyond the numbers, the migration fundamentally changed how the team operated. New engineers could spin up a complete development environment in under 10 minutes. Deployments that used to be dreaded became routine. The two senior engineers who were flight risks? They were finally free to work on product features instead of babysitting servers. Dave was thrilled.

The Terraform codebase became the single source of truth for all infrastructure. No more tribal knowledge, no more snowflake servers, no more “ask Dave.”

Ready to talk? Let's figure out what you need.

Book a Free Chat