The Architecture Review
Episode 03

Consensus and Leader Election (Raft)

How a cluster agrees on a single value despite failures. Raft, Paxos, and the gap between correct and understandable.

Video publishes soon

slug: 003-raft-consensus
number: 3
title: "Consensus and Leader Election (Raft)"
description: "How a cluster agrees on a single value despite failures. Raft, Paxos, and the gap between correct and understandable."
youtubeId: null
publishedAt: null
anchor:
  authors: "Diego Ongaro & John Ousterhout"
  year: 2014
  title: "In Search of an Understandable Consensus Algorithm"
  institution: "Stanford University"
  venue: "USENIX ATC"

The pattern at a glance

Long-form article coming soon. The narration below is the spoken version of this episode — read it as a quick transcript while the written companion is in draft.

Transcript

Three nodes in a database cluster. The network partitions. Two nodes still talk to each other. The third is isolated.

Each side thinks the other crashed. Both elect a leader. Now you have two leaders. Two databases accepting writes. Two divergent histories.

This is split brain. And it's why distributed databases need consensus.

Distributed consensus is this: a group of nodes must agree on a single value, even when some of them fail, and even when the network drops messages.

It sounds simple. It is not. The FLP impossibility result, proved in 1985, says that no deterministic consensus algorithm can be guaranteed to terminate in an asynchronous network with even one faulty node.

So real algorithms work around that result. They assume bounded message delays, or randomness, or majority quorums — meaning a majority of nodes must agree before anything commits. The most famous workaround is Paxos.
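Concretely, a majority quorum of n nodes means at least floor(n/2) + 1 of them. A tiny Go illustration (the function name is mine, not from the episode):

```go
package main

import "fmt"

// majority returns the minimum number of nodes that must agree
// to form a quorum in a cluster of n nodes: floor(n/2) + 1.
func majority(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d needs %d votes\n", n, majority(n))
	}
}
```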

The problem isn't new. Leslie Lamport solved consensus in 1989 with an algorithm called Paxos; the paper describing it, "The Part-Time Parliament," wasn't published until 1998. Paxos is correct. It is also notoriously hard to implement: every production system using it has its own subtle variant.

In 2014, Diego Ongaro and John Ousterhout at Stanford published Raft. Same correctness. Designed to be learnable.

Raft has one leader at any time. The leader handles every client request. It writes each command to its own log, replicates the log to followers, and tells them when entries are committed.

Three subproblems make Raft tractable. Leader election — how the cluster picks one leader. Log replication — how the leader gets entries onto a majority of nodes. Safety — rules that prevent two committed logs from diverging.

Each subproblem has clear rules. A node is a follower, a candidate, or a leader. It transitions between roles on simple triggers: timeouts, votes, heartbeats — periodic keep-alive messages from the leader.
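Those roles and that per-node state fit in a few lines. A minimal Go sketch, with field names following the Raft paper and everything else illustrative:

```go
package main

// Role is the node's current role in the cluster.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// LogEntry is one command plus the term in which the leader received it.
type LogEntry struct {
	Term    int
	Command []byte
}

// Node holds the state every Raft node keeps (roughly Figure 2 of the paper).
type Node struct {
	role        Role
	currentTerm int        // latest term this node has seen
	votedFor    int        // candidate voted for in currentTerm, -1 if none
	log         []LogEntry // commands to apply to the state machine
	commitIndex int        // highest log index known to be committed
}

// Transitions, in words:
//   Follower  -> Candidate : election timeout fires with no heartbeat
//   Candidate -> Leader    : wins votes from a majority
//   Candidate -> Follower  : sees a valid leader or a higher term
//   Leader    -> Follower  : sees a higher term
func main() {}
```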

A senior engineer can implement Raft from the paper in a weekend. That was the design goal.

Time in Raft is divided into terms — logical periods that act like Raft's clock. Each term has one leader, or none.

When a follower hears nothing from the leader for a random timeout — typically 150 to 300 milliseconds — it becomes a candidate. It increments the term and asks every other node for a vote.

Each node votes yes once per term, for the first candidate it sees that has a log at least as up-to-date as its own.

A candidate that wins a majority becomes leader. It immediately sends heartbeats. The randomized timeout prevents two candidates from splitting the vote forever.
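Here is a Go sketch of the two voting rules: vote at most once per term, and only for a candidate whose log is at least as up-to-date as yours. Names and structure are illustrative, not taken from any particular implementation:

```go
package main

import (
	"math/rand"
	"time"
)

type LogEntry struct{ Term int }

type Node struct {
	currentTerm int
	votedFor    int // -1 means no vote cast this term
	log         []LogEntry
}

// electionTimeout picks a random timeout in the 150-300 ms range the paper
// suggests, so two followers rarely become candidates at the same instant.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// lastLogInfo returns the index and term of the last entry (0, 0 for an empty log).
func (n *Node) lastLogInfo() (index, term int) {
	if len(n.log) == 0 {
		return 0, 0
	}
	return len(n.log), n.log[len(n.log)-1].Term
}

// HandleRequestVote grants a vote if the candidate's term is current, this node
// hasn't already voted for someone else this term, and the candidate's log is at
// least as up-to-date (higher last term wins; same term, longer log wins).
func (n *Node) HandleRequestVote(candidateTerm, candidateID, lastLogIndex, lastLogTerm int) bool {
	if candidateTerm < n.currentTerm {
		return false
	}
	if candidateTerm > n.currentTerm { // newer term: adopt it, forget the old vote
		n.currentTerm = candidateTerm
		n.votedFor = -1
	}
	if n.votedFor != -1 && n.votedFor != candidateID {
		return false // already voted this term
	}
	ourIndex, ourTerm := n.lastLogInfo()
	upToDate := lastLogTerm > ourTerm || (lastLogTerm == ourTerm && lastLogIndex >= ourIndex)
	if !upToDate {
		return false
	}
	n.votedFor = candidateID
	return true
}

func main() {
	_ = electionTimeout()
}
```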

The leader receives a client command. It appends the command to its own log, then sends a replication call to every follower in parallel.

A follower accepts the entry if its previous log entry matches what the leader expects. If not, the follower rejects, the leader backs up its index, and tries again. This consistency check guarantees that all logs converge to the same prefix.
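The check and the retry, sketched in Go (types and names here are hypothetical):

```go
package main

type LogEntry struct{ Term int }

type Follower struct {
	log []LogEntry // 1-indexed conceptually: log[0] is entry #1
}

// AcceptEntries succeeds only if the follower's entry at prevLogIndex carries
// prevLogTerm. On success it drops any conflicting suffix and appends the new entries.
func (f *Follower) AcceptEntries(prevLogIndex, prevLogTerm int, entries []LogEntry) bool {
	if prevLogIndex > len(f.log) {
		return false // missing entries the leader assumes we have
	}
	if prevLogIndex > 0 && f.log[prevLogIndex-1].Term != prevLogTerm {
		return false // same index, different term: the histories diverged here
	}
	f.log = append(f.log[:prevLogIndex], entries...)
	return true
}

// On a rejection the leader backs up one entry and retries, until it finds
// the point where the two logs still agree.
type Leader struct {
	nextIndex map[int]int // next log index to send to each follower
}

func (l *Leader) handleReject(followerID int) {
	if l.nextIndex[followerID] > 1 {
		l.nextIndex[followerID]--
	}
}

func main() {}
```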

Once a majority of followers have appended the entry, the leader marks it committed. It tells followers in the next heartbeat. Only then does the leader return success to the client.
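Commitment is counting. A hedged Go sketch of how a leader might advance its commit index; matchIndex follows the paper's naming, the rest is assumed:

```go
package main

type LogEntry struct{ Term int }

type Leader struct {
	currentTerm int
	log         []LogEntry
	commitIndex int
	matchIndex  map[int]int // highest entry known to be replicated on each follower
	clusterSize int
}

// advanceCommitIndex finds the highest index N above commitIndex that is stored
// on a majority of nodes. Raft only lets a leader commit entries from its own
// term by counting replicas; older entries become committed indirectly.
func (l *Leader) advanceCommitIndex() {
	for n := len(l.log); n > l.commitIndex; n-- {
		if l.log[n-1].Term != l.currentTerm {
			continue
		}
		count := 1 // the leader itself stores the entry
		for _, m := range l.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > l.clusterSize/2 {
			l.commitIndex = n // now safe to apply and reply to the client
			return
		}
	}
}

func main() {}
```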

The client never gets a success for an entry that didn't reach a majority. And a committed entry is never rolled back.

Two traps.

One: leader failure during replication. The leader sends entries, then dies before telling followers they're committed. The election rules guarantee the new leader has every committed entry, but it may have uncommitted entries that need to be cleaned up. The new leader's first job is reconciliation.

Two: network partitions. A leader on the minority side keeps trying to replicate. It cannot reach a majority, so nothing commits. From the client's perspective, every request times out. The majority side, meanwhile, elects a new leader and accepts writes. When the partition heals, the old leader steps down — its term is now stale.
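The step-down rule is mechanical: any message carrying a higher term demotes the node, whatever its role. A minimal Go sketch, with illustrative names:

```go
package main

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type Node struct {
	role        Role
	currentTerm int
	votedFor    int
}

// observeTerm runs on every RPC request or response received. If the peer's
// term is newer, this node adopts it and steps down to follower.
func (n *Node) observeTerm(peerTerm int) {
	if peerTerm > n.currentTerm {
		n.currentTerm = peerTerm
		n.votedFor = -1
		n.role = Follower
	}
}

func main() {}
```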

The hard part of Raft is not the protocol. It's everything around it: snapshots, log compaction, and cluster membership changes — all the operational concerns the paper sketches but doesn't solve.

You need consensus when you have replicated state that must agree exactly. Distributed databases. Strongly-consistent configuration stores. Distributed locks.

You do not need consensus for caches, for stateless services, for eventually-consistent reads. Adding Raft to a system that doesn't need it adds latency — every write waits for a majority round-trip — and a coordination bottleneck.

If your problem can be solved with eventual consistency, choose that. Raft is the right answer to a specific question.

The genius of Raft is not the algorithm. It's the recognition that an algorithm whose correctness depends on being implemented correctly should be implementable.

Paxos was correct. Raft is correct and implementable. That difference rewired an entire generation of infrastructure.

Next episode: sharding. How to spread data across machines without losing your mind.