The Architecture Review
Episode 06

Multi-Region Failover

Active-active, active-passive, and the realities of cross-region replication. Pat Helland's framing 18 years on.

Video publishes soon


The pattern at a glance

Long-form article coming soon. The narration below is the spoken version of this episode — read it as a quick transcript while the written companion is in draft.

Transcript

The primary region goes down. Your databases, your application servers, your message queues — everything you run there is gone. Customers worldwide are hitting checkout. Every request fails.

You have a replica in another region that's healthy, but a few seconds behind. Your DNS still points to the dead region.

How fast can you fail over? And what do you lose when you do?

Three reasons to run in multiple regions.

One: latency. Users in Singapore should not pay two hundred milliseconds to reach a database in Virginia. Local replicas let local users see local response times.

Two: data residency. European user data must be processed in Europe under GDPR. Healthcare data has similar constraints in many jurisdictions. Multi-region is sometimes a regulatory requirement, not an architecture choice.

Three: disaster recovery. A single region can fail. Power, network, fiber cuts, software bugs in the cloud control plane — entire regions have gone dark for hours. Multi-region is your contingency.

Pat Helland published Life Beyond Distributed Transactions at CIDR in 2007. The paper named what was then becoming inevitable: that distributed systems would have to operate without global atomic transactions, across heterogeneous infrastructure, often spanning organizational boundaries. The vocabulary the industry now uses for multi-region — entities, references, eventual convergence — comes from that paper.

Two metrics define your failover behavior.

RTO is Recovery Time Objective: how long the system can be down. Five seconds, five minutes, five hours — whatever the business commits to. Lower RTO costs more.

RPO is Recovery Point Objective: how much data you can afford to lose. Zero data loss means synchronous replication everywhere — every write waits for the secondary region to acknowledge. Most systems accept some RPO, a few seconds of in-flight writes lost during failover, as the price for not paying cross-region latency on every commit.

These two numbers shape every other decision.
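A minimal sketch of where that trade-off lives, assuming a hypothetical commit_locally / replicate_to_secondary pair standing in for a real database's write path and replication stream:

```python
import queue
import time

CROSS_REGION_RTT = 0.080  # illustrative ~80 ms round trip between regions
local_store: list = []    # stand-in for the primary region's database
replication_queue: queue.Queue = queue.Queue()

def commit_locally(record: dict) -> None:
    local_store.append(record)

def replicate_to_secondary(record: dict) -> None:
    # Stand-in for the database's cross-region replication stream.
    time.sleep(CROSS_REGION_RTT)

def write_synchronous(record: dict) -> None:
    # RPO = 0: the commit is not acknowledged until the secondary region
    # has the write, so every commit pays the cross-region round trip.
    commit_locally(record)
    replicate_to_secondary(record)

def write_asynchronous(record: dict) -> None:
    # RPO > 0: acknowledge immediately and replicate in the background.
    # Whatever is still queued when the primary dies is lost on failover.
    commit_locally(record)
    replication_queue.put(record)

write_synchronous({"order": 1})   # safe but slow on every commit
write_asynchronous({"order": 2})  # fast, but riskier during failover
```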

Two architectures.

Active-passive: one region serves all traffic. The other region holds a replica, ready to take over but idle. Failover means promoting the standby and redirecting traffic. Simpler to reason about, simpler to operate. The trade-off: the passive region is paid-for capacity that does nothing in normal conditions, and failover always involves a discrete jump.

Active-active: both regions serve traffic simultaneously, and each replicates its writes to the other. Failover means one region absorbing the other's load. The trade-off: every write must reconcile across regions, which means either accepting eventual consistency between regions or paying cross-region latency on every commit.

Active-passive is the right answer for systems where consistency matters more than seamless failover. Active-active is the right answer when no individual user can be allowed to see downtime.
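To make the active-passive jump concrete, here is a sketch of the promotion path. Region, redirect_traffic, and the fencing step are illustrative stand-ins for your database's and router's real controls, not any particular product's API:

```python
import time

class Region:
    def __init__(self, name: str, role: str) -> None:
        self.name = name
        self.role = role  # "primary", "standby", or "fenced"

def redirect_traffic(to: Region) -> None:
    # Placeholder for the real mechanism: a DNS update, an anycast
    # withdrawal, or a load-balancer pool change (see the next section).
    print(f"routing all traffic to {to.name}")

def fail_over(primary: Region, standby: Region) -> float:
    """Promote the standby and return how long the switch took: the RTO
    you actually achieved, as opposed to the one you wrote down."""
    start = time.monotonic()
    # 1. Fence the old primary so it can never accept another write;
    #    skipping this step is how you get a split brain.
    primary.role = "fenced"
    # 2. Promote the standby. Any writes it had not yet received are the
    #    RPO cost of asynchronous replication.
    standby.role = "primary"
    # 3. Repoint traffic at the new primary.
    redirect_traffic(to=standby)
    return time.monotonic() - start

virginia = Region("us-east-1", "primary")
frankfurt = Region("eu-central-1", "standby")
print(f"failover took {fail_over(virginia, frankfurt):.3f}s")
```

A planned drill is this same path run on purpose: trigger it, compare the measured time against the RTO target, and count the lost writes against the RPO target.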

Three mechanisms route traffic away from a failed region.

DNS failover: change the DNS record from the primary's address to the secondary's. Simple but slow — DNS responses are cached at every layer of the internet, so users may keep hitting the dead region for minutes after you repoint the record.

Anycast: announce the same address from multiple regions over BGP, the internet's routing protocol. The internet picks the topologically nearest live region automatically. When a region drops out, traffic re-routes in seconds.

Global load balancers: a managed Layer-7 service that maintains a pool of regional backends and routes around failed ones. The vendor solves the DNS and BGP problems for you, at the cost of vendor dependency.

Pick based on latency budget and operational maturity.
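Whichever mechanism you pick, the per-request decision it implements is the same: send traffic to the nearest healthy region, and stop sending to a region the moment its health checks fail. A minimal sketch, with made-up region names and latencies:

```python
REGIONS = {
    "us-east-1":      {"healthy": True, "latency_ms": 12},
    "eu-central-1":   {"healthy": True, "latency_ms": 85},
    "ap-southeast-1": {"healthy": True, "latency_ms": 180},
}

def pick_backend(regions: dict) -> str:
    live = {name: r for name, r in regions.items() if r["healthy"]}
    if not live:
        raise RuntimeError("no healthy region left: shed load or serve static")
    # Lowest latency among healthy regions. A real balancer also weights
    # by capacity and flips health only after several consecutive probe
    # failures, to avoid flapping.
    return min(live, key=lambda name: live[name]["latency_ms"])

# Normal conditions: traffic goes to the nearest region.
assert pick_backend(REGIONS) == "us-east-1"

# The primary goes dark: the very next routing decision avoids it.
REGIONS["us-east-1"]["healthy"] = False
assert pick_backend(REGIONS) == "eu-central-1"
```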

Three traps every multi-region system hits.

One: untested failover. The failover that has never been exercised in production is a failover that doesn't work. Schedule planned failover drills. Teams that drill survive regional outages. Teams that don't, don't.

Two: data divergence. During failover or partition, both regions may accept writes for the same record. Reconciliation is application logic, not database magic. Decide in advance: last-write-wins, vector clocks, or domain-specific merge — and encode that decision somewhere besides a tribal-knowledge runbook. Two of these policies are sketched after the third trap.

Three: coordination fatigue. Multi-region adds complexity to schemas, deploys, observability, on-call, and security. Most teams underestimate the steady-state cost. A multi-region system needs roughly twice the operational headcount of a single-region equivalent at the same maturity.
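Back to trap two: an illustrative sketch of two reconciliation policies for a shopping-cart record. The VersionedRecord type and the merge rule are assumptions for the example, not any database's built-in behavior:

```python
from dataclasses import dataclass

@dataclass
class VersionedRecord:
    value: dict        # e.g. a shopping cart: {sku: quantity}
    timestamp: float   # wall-clock write time; beware cross-region clock skew
    region: str

def last_write_wins(a: VersionedRecord, b: VersionedRecord) -> VersionedRecord:
    # Deterministic and cheap, but it silently discards the losing write,
    # and skewed clocks can crown the wrong winner.
    return a if (a.timestamp, a.region) >= (b.timestamp, b.region) else b

def merge_carts(a: VersionedRecord, b: VersionedRecord) -> VersionedRecord:
    # Domain-specific merge: union the items and keep the larger quantity,
    # so neither region's additions are lost during the partition.
    merged = dict(a.value)
    for sku, qty in b.value.items():
        merged[sku] = max(merged.get(sku, 0), qty)
    return VersionedRecord(merged, max(a.timestamp, b.timestamp), "merged")

us = VersionedRecord({"book": 1}, 1700000000.0, "us-east-1")
eu = VersionedRecord({"book": 1, "pen": 2}, 1700000001.0, "eu-central-1")
print(last_write_wins(us, eu).value)  # {'book': 1, 'pen': 2} — us cart dropped
print(merge_carts(us, eu).value)      # {'book': 1, 'pen': 2} — nothing lost
```

Many replicated stores default to last-write-wins; a domain merge like the cart example is often what checkout actually needs.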

Single-region is often the right answer.

If your users are concentrated in one geography and your downtime tolerance is hours rather than seconds, single-region with a tested backup is cheaper, simpler, and more reliable than a multi-region system you don't operate confidently.

Multi-region is for systems where downtime is unacceptable, latency is geographically diverse, or regulation demands it. If none of those apply, the operational tax buys nothing.

Multi-region failover is not a single feature. It's a discipline that touches every layer of the stack — replication, routing, observability, application reconciliation, and the on-call rotation.

The hard part is not the architecture. The hard part is operating it. The systems that survive a regional outage are the ones whose teams have failed over before.

Next episode: RAG — retrieval-augmented generation, and why it's a distributed search problem with a large language model at the end.