The Uptime Engineer
👋 Hi, I am Yoshik
This week, you'll learn what high availability actually means numerically, why redundancy alone isn't enough, and the exact design checklist that separates systems that survive failures from ones that cause them.
🔥 Tool Spotlight
Gatus - Self-hosted health dashboard
Monitors your endpoints, evaluates response conditions, and alerts on SLA breaches. Fully config-driven in YAML - cleaner than Uptime Robot with no third-party dependency.
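For context, a minimal Gatus config looks roughly like this (endpoint name, URL, and thresholds here are illustrative, not defaults):

```yaml
endpoints:
  - name: api
    url: "https://api.example.com/health"
    interval: 30s
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 500"
```

A few lines of YAML in version control beats a dashboard you configure by clicking.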
📚 Worth Your Time
Google SRE Book: Embracing Risk - The chapter that introduced SLOs and error budgets to the industry. If you haven't read it, this is the week.
AWS Well-Architected: Reliability Pillar - The most practical HA checklist for cloud infrastructure. Bookmark it.
You've seen it in every architecture review.
"We're highly available - we have two servers."
Redundancy and availability are not the same thing. Confusing them is how you end up creating a disastrous production architecture.
Start with the numbers.
When someone says "five nines," they mean 99.999% uptime annually. That sounds like a lot. Here's what it means in real downtime:
99% → 3.65 days/year
99.9% → 8.76 hours/year
99.99% → 52.6 minutes/year
99.999% → 5.26 minutes/year

Five nines is about 5 minutes of allowed downtime across an entire year.
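If you want to compute your own budget, the arithmetic is a one-liner. A quick Python sketch:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Allowed downtime in minutes per year for a given availability %."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines):.1f} min/year")
```

Run it against your actual SLO and the number is usually smaller than anyone in the room expects.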
Most teams aren't close. Most teams don't even know their current number. And that's the first problem - you can't improve what you're not measuring.
Uptime ≠ availability.
This distinction matters more than most people realize.
Uptime is "is the process running?" Availability is "can a user successfully use the system?"
Your server can be up. Your health check can return 200. Your process can be running - and your users can still be timing out. That gap is where most availability problems live.
A database that takes 30 seconds to respond is "up." A load balancer routing 20% of traffic to a dead node is "partially available." A deploy that throws 500 errors for 4 minutes before rollback is an availability event - even if no alerts fired.
Measure from the user's perspective, not the server's.
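One way to make "measure from the user's perspective" concrete: a probe that only passes on a fast, successful response over the real user path. A sketch (the URL and latency budget are placeholders, not a prescription):

```python
import time
import urllib.request
from typing import Optional

def evaluate_probe(status: Optional[int], elapsed_s: float,
                   budget_s: float = 2.0) -> bool:
    """Available = a 2xx response *within* the latency budget.
    A slow 200 is an availability failure, not a pass."""
    if status is None:  # timeout, refused connection, DNS failure
        return False
    return 200 <= status < 300 and elapsed_s <= budget_s

def probe(url: str, budget_s: float = 2.0) -> bool:
    """Exercise the real user path with a wall-clock budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=budget_s) as resp:
            return evaluate_probe(resp.status, time.monotonic() - start, budget_s)
    except Exception:
        return evaluate_probe(None, budget_s, budget_s)
```

Note what this catches that a process check doesn't: the 30-second database from the example above fails this probe while still being "up."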
Redundancy is not enough.
Two servers is not high availability. It's a starting point.
For redundancy to translate into actual availability, you need three more things:
Health checks. Your load balancer needs to know when a node is sick and stop sending it traffic. Without this, redundancy just means two servers receiving bad requests instead of one.
Automatic failover. When the primary dies, something has to detect that and reroute - without a human logging in and editing a config. If your failover requires manual intervention, you have recovery, not availability.
Data layer redundancy. Two compute nodes sharing one database isn't redundant. Take down that database and both nodes go with it. Redundancy at the app layer without redundancy at the data layer is a false sense of safety.
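To make the health-check requirement concrete, here's a toy sketch of a pool that ejects a node after consecutive failed checks (the names and threshold are illustrative, not any real load balancer's API):

```python
class Pool:
    """Toy pool that ejects nodes failing consecutive health checks."""

    def __init__(self, nodes, unhealthy_threshold: int = 3):
        self.nodes = list(nodes)
        self.failures = {n: 0 for n in self.nodes}
        self.threshold = unhealthy_threshold

    def record_check(self, node, healthy: bool) -> None:
        # A few consecutive failures eject the node; one success resets it.
        if healthy:
            self.failures[node] = 0
        else:
            self.failures[node] += 1

    def in_rotation(self):
        return [n for n in self.nodes if self.failures[n] < self.threshold]
```

Every real load balancer (ALB, HAProxy, Envoy) implements some version of this loop; the point is that without it, "two servers" just means two places to receive failing requests.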
The four failure patterns that show up in every post-mortem.
1. Single Points of Failure
A SPOF is any component where one failure breaks everything. They're not always obvious:
A single load balancer with no standby
One database with no replica
A shared config on a single NFS mount
One DNS record pointing to one IP
Draw your architecture. Ask: what single component, if removed, takes everything down? Every answer is a SPOF.
2. Cascading failures
Service A calls Service B. Service B slows down. Service A keeps retrying. Its thread pool fills. It stops responding. Service C, which depends on A, starts failing.
The fix is a circuit breaker - a pattern where a service stops calling a failing downstream dependency, returns a fallback, and periodically retries. It breaks the cascade before it spreads.
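The core of a circuit breaker fits in a few dozen lines. A minimal sketch (thresholds and names are illustrative; in production you'd usually reach for a library or a service mesh rather than rolling your own):

```python
import time

class CircuitBreaker:
    """Minimal breaker: open after N consecutive failures, then allow a
    single trial call after a cool-down (the 'half-open' state)."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let a request probe the dependency.
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

def call_with_breaker(breaker, fn, fallback):
    if not breaker.allow():
        return fallback()  # fail fast instead of piling onto a sick service
    try:
        result = fn()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return fallback()
```

The key behavior: once the breaker opens, Service A stops burning threads on a dependency that can't answer, which is exactly what stops the cascade.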
3. The thundering herd
Your system crashes. You bring it back. Everything that was waiting hammers it simultaneously. It crashes again.
This happens after cache evictions, cold starts, and outage recoveries. The fix: gradual traffic ramp-up, request queuing, and jitter in retry logic so clients don't all retry at the exact same millisecond.
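The "full jitter" variant is the usual fix: each client sleeps a random amount up to an exponential cap, so retries spread out instead of synchronizing. A sketch (base and cap values are illustrative):

```python
import random
import time

def backoff_with_jitter(attempt: int, base_s: float = 0.5,
                        cap_s: float = 30.0) -> float:
    """'Full jitter': a random delay up to an exponential cap, so a crowd
    of recovering clients doesn't retry at the same millisecond."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))

def retry(fn, max_attempts: int = 5):
    """Retry with backoff + jitter; re-raise once the budget is exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_with_jitter(attempt))
```

Without the `random.uniform`, every client computes the identical delay and the herd re-forms on every retry wave.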
4. Failover that was never tested
This one is common and embarrassing.
You have a primary database and a replica. The primary dies during peak traffic. You try to promote the replica. It hasn't been syncing for three hours because replication silently broke last Tuesday.
If you haven't run a failover drill in 90 days, you don't have failover. You have a plan.
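A drill can start as small as a scripted check that the replica is actually current before you'd ever trust a promotion. A hypothetical sketch, assuming you can read a replication position (LSN, binlog offset, or similar) from both nodes:

```python
def replica_is_promotable(primary_pos: int, replica_pos: int,
                          max_lag: int = 1_000_000) -> bool:
    """A replica you'd promote must be within a small, known lag of the
    primary. Replication that 'silently broke last Tuesday' shows up
    here as a huge gap, not as a green dashboard."""
    return primary_pos - replica_pos <= max_lag
```

Wire a check like this into monitoring and the three-hours-stale replica pages you on Tuesday, not during the peak-traffic failover.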
The HA design checklist
When you're building or reviewing a system, run through these:
No single points of failure at compute, network, data, and DNS layers
Health checks on every node, with automatic traffic removal on failure
Automatic failover - no human in the critical path
Database replication with tested promotion runbooks
Circuit breakers on all downstream service calls
Retry logic with exponential backoff + jitter - not immediate retries that amplify the failure
Graceful degradation - when a non-critical dependency fails, the core flow still works
Load spread across failure domains - not all nodes in the same availability zone
Failover drills at least quarterly - actually kill nodes, verify recovery
Run this right now:
kubectl get pods -A | grep -v Running

If anything is CrashLooping, Pending, or showing unexpected restarts, that's not a future concern. That's a current availability gap your users might already be feeling.
High availability starts with knowing what's already failing before your users have to tell you.
Knowing your nines is table stakes. Knowing your weakest link is the job.
Join 1,000+ engineers learning DevOps the hard way
Every week, I share:
How I'd approach problems differently (real projects, real mistakes)
Career moves that actually work (not LinkedIn motivational posts)
Technical deep-dives that change how you think about infrastructure
No fluff. No roadmaps. Just what works when you're building real systems.

👋 Find me on Twitter | LinkedIn | Connect 1:1
Thank you for supporting this newsletter.
Y’all are the best.
