AWS sent another reminder this week: Amazon Linux 2 support ends on June 30, 2026. That still sounds far enough away to ignore, right up until you remember all the places AL2 tends to hide. EC2 launch templates, golden AMIs, EKS managed node groups, ECS hosts, Packer builds, CI runners, old Lambda assumptions, and that one admin box nobody has opened since 2021.

I started treating this as an infrastructure migration, not an operating system upgrade. That framing matters. If the plan is to SSH into machines and upgrade them in place, the day is already heading in the wrong direction. The cleaner path is inventory, rebuild, roll, observe, then delete the old capacity.

For most of my AWS workloads, the target is Amazon Linux 2023. It is not just AL2 with newer packages. There are real behavior changes: DNF instead of YUM, cgroup v2, systemd-networkd, different SSH defaults, Python 3 only, AWS CLI v2, journal-based logging, and /tmp mounted as tmpfs by default. Most applications will not care. The little bits of operational glue around them often will.
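
Those differences are easy to confirm on a host instead of taking them from a changelog. The sketch below is a minimal preflight report, assuming GNU stat is available. The function names are made up for illustration and are not part of any AWS tooling.

```shell
# Preflight sketch: report the traits most likely to surprise bootstrap scripts.
# Function names are illustrative, not from any AWS tooling.

# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo v2 ;;
    tmpfs)     echo v1 ;;   # cgroup v1 mounts per-controller dirs under a tmpfs
    *)         echo unknown ;;
  esac
}

# Report whether a command exists on PATH.
has_cmd() {
  if command -v "$1" >/dev/null 2>&1; then echo present; else echo missing; fi
}

# Print one line per trait for the current host.
preflight() {
  echo "cgroup:  $(cgroup_version "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)")"
  echo "tmp fs:  $(stat -fc %T /tmp 2>/dev/null || echo unknown)"
  echo "dnf:     $(has_cmd dnf)"
  echo "python2: $(has_cmd python2)"
  echo "rsyslog: $(has_cmd rsyslogd)"
}
```

Running preflight on one AL2 host and one AL2023 host and diffing the output gives a short list of the lines your bootstrap glue has to handle.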

First, Find Every AL2 Machine

I do not trust tags for this. Tags tell me what somebody meant to build two years ago. SSM tells me what is actually running today.

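# SSM reports AL2 as PlatformName "Amazon Linux" with PlatformVersion "2";
# AL2023 shows up under the same name with a 2023 version, so match on both fields.
aws ssm describe-instance-information \
  --query "InstanceInformationList[?PlatformName=='Amazon Linux' && PlatformVersion=='2'].[InstanceId,ComputerName,PlatformName,PlatformVersion]" \
  --output table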

If SSM coverage is not complete, I fall back to EC2 image data:

aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query "Reservations[].Instances[].[InstanceId,ImageId,Tags[?Key=='Name']|[0].Value]" \
  --output text | while read -r id ami name; do
    image=$(aws ec2 describe-images \
      --image-ids "$ami" \
      --query "Images[0].[Name,Description]" \
      --output text)
    # printf instead of sed, so a Name tag containing "/" or "&" cannot break the output
    printf '%s %s %s %s\n' "$id" "$name" "$ami" "$image"
  done | grep -Ei 'amzn2|amazon linux 2([^0-9]|$)'  # do not match "Amazon Linux 2023"

For EKS, I check node operating systems directly:

kubectl get nodes -o custom-columns='NAME:.metadata.name,OS:.status.nodeInfo.osImage,KERNEL:.status.nodeInfo.kernelVersion'

That command is blunt in the best way. If the OS column reads exactly Amazon Linux 2 (AL2023 nodes report Amazon Linux 2023 plus a version), that node has to move.
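
Because "Amazon Linux 2023" also begins with "Amazon Linux 2", a plain substring match on that output is ambiguous. The sketch below classifies the osImage string and prints only the AL2 node names. It assumes AL2 nodes report "Amazon Linux 2" and AL2023 nodes report "Amazon Linux 2023" followed by a version. Both function names are hypothetical.

```shell
# Classify a node's osImage string.
classify_os() {
  case "$1" in
    "Amazon Linux 2023"*)                echo al2023 ;;
    "Amazon Linux 2"|"Amazon Linux 2 "*) echo al2 ;;
    *)                                   echo other ;;
  esac
}

# Read "name<TAB>osImage" lines on stdin and print the names of AL2 nodes.
al2_nodes() {
  while IFS="$(printf '\t')" read -r name os; do
    if [ "$(classify_os "$os")" = al2 ]; then
      echo "$name"
    fi
  done
  return 0
}

# Usage (needs cluster access):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' | al2_nodes
```

The list this prints can feed a drain loop later, once the replacement capacity is in place.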

What Broke First

My first test migration was a small internal service behind an Auto Scaling group. The application itself came up fine. The bootstrap script did not.

The old user data had this kind of thing in it:

yum install -y jq amazon-cloudwatch-agent python2
service rsyslog restart
pip install awscli

That is three problems in three lines on AL2023. DNF is the package manager now, Python 2 is gone so the install line fails outright, and pip install awscli drags in CLI v1 when AWS CLI v2 is already the normal path in many setups. If your logging setup assumes rsyslog, check it carefully too, because AL2023 leans much more on the systemd journal.

My cleaned-up bootstrap looked more like this:

#!/usr/bin/env bash
set -euo pipefail

dnf install -y jq amazon-cloudwatch-agent python3-pip
systemctl enable --now amazon-cloudwatch-agent
aws --version

That was the simple case. The more annoying one was a temp file workflow that wrote a large intermediate file to /tmp. On AL2023, /tmp is tmpfs, so those files live in memory rather than on disk. Great for speed, less great when an application quietly drops gigabytes there. I moved that workload to /var/tmp/myapp and added a systemd tmpfiles rule.

cat >/etc/tmpfiles.d/myapp.conf <<'EOF'
d /var/tmp/myapp 0750 app app 7d
EOF
systemd-tmpfiles --create /etc/tmpfiles.d/myapp.conf
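
Both of these breakages can be spotted with a grep before the first boot. The sketch below is a rough lint pass over a user data script. The pattern list is a starting point built from the problems described above, not an official compatibility check, and lint_bootstrap is a hypothetical name.

```shell
# Flag AL2-era assumptions in a bootstrap script.
# Prints matching lines with line numbers; exits non-zero when nothing matches.
lint_bootstrap() {
  grep -nE 'python2|amazon-linux-extras|pip install awscli|service [a-z0-9_-]+ (re)?start|rsyslog|(^|[^a-z])/tmp/' "$1"
}

# Usage:
# lint_bootstrap userdata.sh || echo "no known AL2 assumptions found"
```

A hit is only a prompt to look at the line. Writing small files to /tmp is still fine, and rsyslog can be installed on purpose.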

Terraform Rollout for EC2 and ASG

For plain EC2 and Auto Scaling groups, I prefer a new launch template version plus an instance refresh. No live-machine surgery.

data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"]
  }

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
}

resource "aws_launch_template" "app" {
  name_prefix   = "app-"
  image_id      = data.aws_ami.al2023.id
  instance_type = "t3.medium"

  user_data = base64encode(templatefile("${path.module}/userdata.sh", {}))
}

resource "aws_autoscaling_group" "app" {
  name                = "app"
  vpc_zone_identifier = var.subnet_ids
  desired_capacity    = 3
  min_size            = 3
  max_size            = 6

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
      instance_warmup        = 180
    }
  }
}

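Because the ASG pins version = "$Latest", publishing a new launch template version does not change the ASG resource, so Terraform will not kick off the instance_refresh block by itself. I start the refresh deliberately, during a window where I can watch it: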

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name app \
  --preferences MinHealthyPercentage=90,InstanceWarmup=180
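
Watching the refresh can be scripted as well. The sketch below polls the newest instance refresh until it reaches a terminal state. It assumes describe-instance-refreshes returns the most recent refresh first, and watch_refresh is a hypothetical helper name.

```shell
# Poll the newest instance refresh of an ASG until it finishes.
# usage: watch_refresh <asg-name> [poll-interval-seconds]
# Returns 0 on success, 1 on failure/cancel/rollback, 2 if the status cannot be read.
watch_refresh() {
  asg="$1"
  interval="${2:-30}"
  while :; do
    status=$(aws autoscaling describe-instance-refreshes \
      --auto-scaling-group-name "$asg" \
      --query 'InstanceRefreshes[0].Status' \
      --output text) || return 2
    echo "$(date -u +%H:%M:%S) $asg $status"
    case "$status" in
      Successful) return 0 ;;
      Failed|Cancelled|RollbackSuccessful|RollbackFailed) return 1 ;;
      None|"") return 2 ;;   # no refresh found for this group
    esac
    sleep "$interval"
  done
}

# Usage:
# watch_refresh app 30 || echo "refresh did not complete cleanly"
```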

The gotcha here is health checks. If the ASG health check says green while the service is only half initialized, you can roll bad instances through the entire group. I now wire the application readiness endpoint into the load balancer target health, and make sure the ASG health check type is ELB so it actually honors that signal, before starting the refresh.
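
One cheap complement is to have user data block until the service answers its readiness endpoint and exit non-zero if it never does, so a half-initialized instance at least leaves a clear failure in the cloud-init output log. This does not replace the load balancer health check. The sketch below is a generic retry helper. The port and /ready path in the usage line are hypothetical.

```shell
# Retry a command until it succeeds or the attempts run out.
# usage: wait_until <attempts> <sleep-seconds> <command...>
wait_until() {
  attempts="$1"
  pause="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  return 1
}

# Example gate at the end of user data (hypothetical port and path):
# wait_until 30 5 curl -fsS -o /dev/null http://127.0.0.1:8080/ready
```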

EKS Node Groups Need a Separate Plan

EKS migrations are safer when you create a parallel node group, move workloads over, and then delete the old one. I do not like mutating the existing node group and hoping the rollout behaves.

aws eks describe-nodegroup \
  --cluster-name prod \
  --nodegroup-name workers-al2 \
  --query 'nodegroup.{amiType:amiType,version:version,releaseVersion:releaseVersion}'

Create the AL2023 node group with your normal IaC. In Terraform it is usually just a second managed node group:

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  eks_managed_node_groups = {
    workers_al2023 = {
      ami_type       = "AL2023_x86_64_STANDARD"
      instance_types = ["m7i.large"]
      min_size       = 2
      desired_size   = 3
      max_size       = 6

      labels = {
        os = "al2023"
      }
    }
  }
}

After the new nodes join, I cordon and drain one old node by hand before trusting automation:

kubectl cordon ip-10-0-12-34.eu-central-1.compute.internal
kubectl drain ip-10-0-12-34.eu-central-1.compute.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60

This is where DaemonSets tell you the truth. CNI plugins, log agents, runtime security agents, node exporters, and backup agents are the usual suspects. Anything that assumes cgroup v1 needs a real test before production.
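
Once the hand-drained node looks healthy, the remaining old nodes can go through the same two commands in a loop. The sketch below stops at the first failure and supports a dry run. drain_nodes and the KUBECTL override are conventions invented for this example, and the --timeout value is an arbitrary choice.

```shell
# Cordon and drain nodes one at a time, stopping at the first failure.
# Reads node names on stdin. Set KUBECTL="echo kubectl" for a dry run.
drain_nodes() {
  kubectl_cmd="${KUBECTL:-kubectl}"
  while read -r node; do
    [ -n "$node" ] || continue
    # </dev/null keeps the command from swallowing the remaining node names
    $kubectl_cmd cordon "$node" </dev/null || return 1
    $kubectl_cmd drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --grace-period=60 \
      --timeout=10m </dev/null || return 1
  done
}

# Usage, selecting nodes by the label EKS puts on managed node group members:
# kubectl get nodes -l eks.amazonaws.com/nodegroup=workers-al2 \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | drain_nodes
```

Stopping on the first failed drain is deliberate: a PodDisruptionBudget that blocks one node will usually block the next one too.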

My Checklist Before Calling It Done

I keep this boring on purpose:

# EC2 inventory should stop showing AL2 (PlatformVersion "2"; AL2023 reports a 2023 version)
aws ssm describe-instance-information \
  --query "InstanceInformationList[?PlatformName=='Amazon Linux' && PlatformVersion=='2'].InstanceId" \
  --output text

# Kubernetes nodes should report AL2023
kubectl get nodes -o custom-columns='NAME:.metadata.name,OS:.status.nodeInfo.osImage'

# No old AMI IDs left in launch templates
aws ec2 describe-launch-template-versions \
  --launch-template-id lt-1234567890abcdef0 \
  --versions '$Latest' \
  --query 'LaunchTemplateVersions[0].LaunchTemplateData.ImageId'
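
The launch template check above covers a single template. A sketch for sweeping every template in the region follows. It classifies AMIs by Amazon's public name prefixes (amzn2-ami- and al2023-ami-), so custom golden AMIs would need your own naming rule. Both function names are hypothetical.

```shell
# Classify an AMI name by OS family using Amazon's public naming prefixes.
ami_family() {
  case "$1" in
    amzn2-ami-*)  echo al2 ;;
    al2023-ami-*) echo al2023 ;;
    *)            echo other ;;
  esac
}

# Print launch templates whose $Latest version still points at an AL2 AMI.
stale_launch_templates() {
  aws ec2 describe-launch-templates \
    --query 'LaunchTemplates[].LaunchTemplateId' --output text | tr '\t' '\n' |
  while read -r lt; do
    [ -n "$lt" ] || continue
    ami=$(aws ec2 describe-launch-template-versions \
      --launch-template-id "$lt" --versions '$Latest' \
      --query 'LaunchTemplateVersions[0].LaunchTemplateData.ImageId' \
      --output text </dev/null)
    # Templates without a plain AMI ID fail this lookup and fall through as "other"
    name=$(aws ec2 describe-images --image-ids "$ami" \
      --query 'Images[0].Name' --output text </dev/null 2>/dev/null)
    if [ "$(ami_family "$name")" = al2 ]; then
      echo "$lt $ami $name"
    fi
  done
  return 0
}
```

Empty output is the goal. Anything classified as other deserves a manual look rather than a pass.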

I also watch CloudWatch alarms for noisy disk, memory, and restart patterns for at least a day. AL2023 boots faster in my tests, but a faster boot does not make up for a broken bootstrap script.

The Part People Will Underestimate

The hard part is not replacing an AMI ID. The hard part is finding every place where the old operating system quietly became part of the contract.

For me, those contracts were small and boring: yum, Python 2, /tmp, rsyslog, custom CIS scripts, and one backup agent that cared about cgroup layout. None of them deserved a dramatic migration project on its own. Together, they were enough to make a direct cutover risky.

June 2026 is close enough that I would not wait for a big migration sprint. Pick one ASG, one EKS node group, or one CI runner pool, and move it now. The first ten percent will teach you more than any planning spreadsheet.