Kubernetes 1.35: The Release That Finally Gets AI Workloads Right

I’ve been running mixed clusters with ML training jobs and regular services for about two years, and scheduling has been the biggest headache. A distributed training run would get only some of its pods placed, GPUs would sit idle, and everyone would lose time. Kubernetes 1.35 came out last week, so I spent the weekend testing it on our staging cluster. A few of these changes are genuinely useful.

Gang Scheduling Finally Exists

The biggest addition is workload-aware scheduling with gang scheduling support. It’s still alpha, so I would not put it in production yet, but the model is exactly what we needed: a group of pods either gets scheduled together or not at all. ...
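The excerpt doesn’t show the 1.35 alpha API itself, but the same all-or-nothing semantics can be expressed today with the coscheduling plugin from kubernetes-sigs/scheduler-plugins, whose PodGroup declares a minimum member count that must be placeable before any pod binds. A minimal sketch (names and image are illustrative, and the cluster must run the plugin’s scheduler):

```yaml
# Illustrative gang scheduling with the scheduler-plugins coscheduling PodGroup.
# The Kubernetes 1.35 alpha API may differ; this shows the model, not the new API.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-run
spec:
  minMember: 8                # all 8 workers must be schedulable, or none bind
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    scheduling.x-k8s.io/pod-group: training-run   # joins the gang
spec:
  schedulerName: scheduler-plugins-scheduler      # scheduler shipped by the plugin
  containers:
  - name: trainer
    image: example.com/trainer:latest             # hypothetical training image
    resources:
      limits:
        nvidia.com/gpu: 1
```

With minMember set to the full worker count, a partially placeable job waits instead of grabbing GPUs it cannot use, which is exactly the failure mode described above.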

February 25, 2026

Cluster API v1.12: In-Place Updates Changed How I Think About Node Lifecycle

I run a few Kubernetes clusters on bare metal with Cluster API and the BYOH (Bring Your Own Host) provider. Until now, every upgrade followed the same pattern: drain nodes, delete machines, rebuild everything, then wait. Reliable, yes. Fast, not even close, especially once you are past 40 nodes with little spare capacity. Cluster API v1.12 shipped a few weeks ago, and the headline for me was in-place updates. Instead of always doing the immutable delete-and-recreate path, CAPI can now apply some changes directly on existing machines. I spent last week testing this on our staging cluster, and the result was better than I expected. ...

February 23, 2026

Kubernetes Introduction: When to Use It and When Not To

Kubernetes Is Not the Answer to Every Problem

I say this as someone who spends a significant part of their work building and operating Kubernetes clusters. Kubernetes is a fantastic tool, but it’s not for everything, and introducing it at the wrong time can cause more problems than it solves.

When to Use Kubernetes

- Many microservices (10+) that scale independently
- Variable load: autoscaling handles capacity automatically
- Multiple teams and environments: namespaces and RBAC provide clean separation
- High availability requirements (99.9%+ uptime): self-healing, health checks, rolling updates
- Multi-cloud or hybrid strategy: Kubernetes abstracts the provider

When NOT to Use Kubernetes

- One or two simple applications: use a VPS, Docker Compose, or a managed PaaS instead
- Small team with no K8s experience: the learning curve takes months
- No CI/CD pipeline yet: build that first; Kubernetes builds on top of it
- Cost-sensitive project: a minimum production EKS cluster costs $250-800/month
- Legacy stateful apps not designed for containers: significant refactoring needed

Decision Framework

Ask yourself:

- Do you have 5+ independently deployable services?
- Variable load needing autoscaling?
- K8s expertise on the team?
- Budget for minimum K8s costs?
- Containerizable services?

...
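The decision framework above is just a checklist, so it can be sketched as a small script. The question keys, scoring, and thresholds below are my own illustration, not from the post:

```python
def kubernetes_fit(answers: dict) -> str:
    """Rough decision helper mirroring the five checklist questions.

    `answers` maps each question to True/False, e.g.
    {"many_services": True, "variable_load": True,
     "k8s_expertise": False, "budget": True, "containerizable": True}
    """
    questions = ["many_services", "variable_load", "k8s_expertise",
                 "budget", "containerizable"]
    # Count how many questions get a "yes"; missing keys count as "no".
    score = sum(bool(answers.get(q)) for q in questions)
    if score == len(questions):
        return "good fit"
    if score >= 3:
        return "maybe: close the gaps first"
    return "use a simpler platform"


print(kubernetes_fit({"many_services": True, "variable_load": True,
                      "k8s_expertise": False, "budget": True,
                      "containerizable": True}))  # maybe: close the gaps first
```

The point of the sketch is the shape of the decision, not the exact thresholds: anything short of a clean sweep means there is homework to do before adopting Kubernetes.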

March 12, 2025