99.99% Uptime for a Radio Platform

Media streaming · High availability · Monitoring · Incident response

The Challenge

A media technology company was building a radio streaming platform to serve millions of listeners across multiple countries. The platform needed to handle live broadcasts, on-demand content, and real-time ad insertion, all with essentially zero tolerance for downtime.

The stakes were real:

  • Live broadcasts can’t buffer. If the stream drops during a live show, listeners switch to a competitor and may never come back
  • Regulatory requirements demanded documented uptime guarantees
  • Peak traffic was unpredictable. Breaking news events could 10x normal load in minutes
  • The existing infrastructure was a single-region setup with no redundancy and monitoring that amounted to “someone checks if it’s working.” Spoiler: they didn’t always check

They needed an architecture that could survive just about anything and keep streaming.

The Approach

I designed and implemented a geo-redundant architecture from the ground up, with monitoring and incident response built in from day one.

Geo-Redundant Architecture

I deployed across two AWS regions with active-active configuration:

  • Route 53 health checks with latency-based routing. Listeners automatically connected to the nearest healthy endpoint
  • Identical infrastructure stacks in each region, defined entirely in Terraform. No snowflake configurations
  • Cross-region data replication for content metadata and user state, with eventual consistency where acceptable and strong consistency where required
  • CDN layer with CloudFront for static assets and pre-recorded content, reducing origin load by 70%

The key design principle: either region should be able to serve 100% of traffic independently. I tested this regularly by draining one region completely. Trust, but verify.
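The routing behavior described above can be sketched as a small decision function: pick the lowest-latency endpoint among those whose health check passes, and let traffic shift entirely when a region is drained. This is an illustrative model of what Route 53 does for us, not the actual configuration; the region names and latencies are made up.

```python
# Sketch of latency-based routing with health checks: only healthy
# endpoints are considered, and the nearest one wins. Illustrative only.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    latency_ms: float   # measured latency from the listener's resolver
    healthy: bool       # result of the endpoint health check

def pick_endpoint(regions: list[Region]) -> Region:
    """Return the lowest-latency healthy region; fail loudly if none remain."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r.latency_ms)

regions = [Region("eu-west-1", 24.0, True), Region("us-east-1", 92.0, True)]
assert pick_endpoint(regions).name == "eu-west-1"

# Drain eu-west-1, as in our regular failover tests: traffic shifts entirely.
regions[0].healthy = False
assert pick_endpoint(regions).name == "us-east-1"
```

The drain test at the end is the code-level version of "either region serves 100% of traffic": marking one region unhealthy must leave a valid answer, not an error.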

Monitoring and Observability

I built a comprehensive observability stack:

  • Prometheus for metrics collection with long-term storage in Thanos
  • Grafana dashboards that were operational, not decorative. Each dashboard answered a specific question: “Is the platform healthy?” “Where is the bottleneck?” “Should I wake someone up?”
  • Custom metrics for stream health: listener count, buffer ratio, stream latency, ad insertion success rate
  • SLOs defined upfront: 99.99% availability, <200ms stream start time, <1% ad insertion failure rate
  • Multi-channel alerting. PagerDuty for critical, Slack for warning, email for informational. I spent significant time tuning thresholds because alert fatigue is the enemy of reliability
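The severity-to-channel mapping in the last bullet is simple enough to sketch. In production this lived in the alerting layer's routing configuration rather than application code, and the channel names here are illustrative; the one design choice worth keeping is that unknown severities page by default.

```python
# Sketch of severity-to-channel alert routing. Channel names are
# illustrative; the real routing ran in the alerting stack's config.

ROUTES = {
    "critical": "pagerduty",   # wake someone up
    "warning":  "slack",       # look at it during working hours
    "info":     "email",       # no action required
}

def route_alert(severity: str) -> str:
    # Unknown severities page by default: failing loud beats failing silent.
    return ROUTES.get(severity, "pagerduty")

assert route_alert("warning") == "slack"
assert route_alert("mystery") == "pagerduty"
```

Keeping the routing table this small is deliberate: every extra channel or severity level is another threshold to tune, and mis-tuned thresholds are where alert fatigue starts.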

Incident Response

The monitoring was only half the equation. I built a complete incident response framework:

  • Runbooks for every alert. Not “investigate the issue” but specific, actionable steps any on-call engineer could follow at 3 AM
  • Automated remediation for common issues: pod restarts, traffic shifting, cache clearing
  • Incident severity levels with clear escalation paths
  • Post-incident review process. Blameless, focused on system improvement
  • Chaos engineering exercises. Monthly game days where we deliberately broke things in production to validate our resilience. Breaking things on purpose is surprisingly fun when you know you can recover
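The automated-remediation pattern above can be sketched as a dispatch table: each known alert maps to a remediation action, and anything unrecognized escalates to a human with the runbook. The alert names and actions here are hypothetical stand-ins, not the production catalogue.

```python
# Minimal sketch of alert-driven remediation: known alerts run an
# automated fix, unknown ones escalate. Names are illustrative.
from typing import Callable

def restart_pod() -> str:
    return "restarted unhealthy pod"

def shift_traffic() -> str:
    return "shifted traffic to healthy region"

REMEDIATIONS: dict[str, Callable[[], str]] = {
    "PodCrashLooping": restart_pod,
    "RegionUnhealthy": shift_traffic,
}

def handle_alert(name: str) -> str:
    action = REMEDIATIONS.get(name)
    if action is None:
        # No safe automation known: page on-call with the runbook link.
        return "escalate: page on-call"
    return action()

assert handle_alert("RegionUnhealthy") == "shifted traffic to healthy region"
assert handle_alert("DiskFull").startswith("escalate")
```

The escalation default matters as much as the automation: remediation should only run when the fix is known to be safe, and everything else goes straight to a person with a runbook.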

The Results

After three months of building and hardening:

99.99% uptime

Achieved and maintained over 12 months. Total downtime: under 53 minutes for the entire year.
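The arithmetic behind that figure is worth making explicit: a 99.99% availability target leaves an error budget of roughly 52.6 minutes per year, which is where the "under 53 minutes" ceiling comes from.

```python
# Error-budget arithmetic for the 99.99% availability SLO.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(slo: float) -> float:
    """Minutes of downtime per year the SLO permits."""
    return MINUTES_PER_YEAR * (1 - slo)

print(round(downtime_budget_minutes(0.9999), 1))  # → 52.6
```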

<30s failover

Regional failover completed in under 30 seconds. Most listeners experienced zero interruption.

MTTR: 4 minutes

Mean time to recovery dropped from hours to minutes thanks to runbooks and automated remediation.

The platform successfully handled multiple breaking news events with 10x traffic spikes without any degradation. The monitoring stack caught issues before they became outages. In one case, I detected and mitigated a database connection pool exhaustion 15 minutes before it would have affected listeners.

The client’s confidence in the platform changed their business strategy. They went from cautiously onboarding new radio stations to expanding aggressively, knowing the infrastructure could handle it.

Ready to talk? Let's figure out what you need.

Book a Free Chat