The Saga Pattern

slug: 001-saga-pattern
number: 1
title: "The Saga Pattern"
description: "Distributed transactions without two-phase commit. The 1987 pattern that still defines microservice consistency."
youtubeId: null
publishedAt: null
anchor:
  authors: "Hector Garcia-Molina & Kenneth Salem"
  year: 1987
  title: "Sagas"
  institution: "Princeton University"
  venue: "ACM SIGMOD"

The pattern at a glance

Long-form article coming soon. The narration below is the spoken version of this episode — read it as a quick transcript while the written companion is in draft.

Transcript

A customer clicks "place order." The payment service charges their card. The inventory service tries to reserve the last unit. It fails. The order rolls back.

But the payment already committed. Money taken. No product to ship.

This is the distributed transaction problem at cloud scale.

Two-phase commit promises atomicity. A coordinator asks every participant: are you ready? Everyone votes. If all say yes, commit. If even one says no, everyone rolls back.
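The voting rule just described can be sketched in a few lines. This is a toy, in-process illustration only: the `two_phase_commit` function and the participant names are hypothetical, and a real protocol would make each vote a network call while participants hold their locks.

```python
from typing import Callable, Dict

# Hypothetical participant: a name mapped to a vote function.
# True means "ready to commit", False means "cannot commit".
Vote = Callable[[], bool]


def two_phase_commit(participants: Dict[str, Vote]) -> str:
    """Phase 1: collect every vote. Phase 2: commit only if all voted yes."""
    # In a real system each vote() is a remote call, and the participant
    # keeps its locks from the moment it votes until the decision arrives.
    votes = {name: vote() for name, vote in participants.items()}
    return "commit" if all(votes.values()) else "rollback"


decision_all_ready = two_phase_commit(
    {"payment": lambda: True, "inventory": lambda: True}
)
decision_one_refuses = two_phase_commit(
    {"payment": lambda: True, "inventory": lambda: False}
)
```

The sketch leaves out the part that hurts in production: a vote that never arrives. Here `vote()` always returns, so nothing ever blocks.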

Between a few databases under one team's control, this works. Across services owned by different teams, on different platforms, it doesn't. Every participant holds its locks until the slowest one votes and the coordinator decides. That participant might be in another region. It might be down. Your transaction blocks until it comes back.

So we trade. We give up atomicity. We get back the ability to keep moving.

The pattern is older than microservices.

It comes from a 1987 paper by Hector Garcia-Molina and Kenneth Salem at Princeton. They were studying long-lived transactions — database transactions that ran for hours and couldn't hold locks the whole time.

Their answer: break the transaction into a sequence of local commits, and pair each one with a compensating action.

Nearly forty years later, the constraints are different. The answer still holds.

A saga is a sequence of local transactions. Each one commits in its own service, in its own database, immediately. There is no global lock.

Every step has a partner — a compensating action that semantically reverses what its forward action did.

Three steps: T1, T2, T3. If they succeed in order, the saga commits. If T2 fails after T1 succeeded, run C1. If T3 fails after T1 and T2 succeeded, run C2, then C1.
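The three-step rule above can be sketched as a small in-process executor. Everything here is an assumption for illustration: `run_saga`, the `Step` tuple, and the no-op steps are hypothetical names, and a real saga would persist its progress so it can resume after a crash.

```python
from typing import Callable, List, Tuple

# Hypothetical step shape: (name, forward action, compensating action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run forward actions in order; on failure, compensate in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, forward, _ = step
        try:
            forward()
        except Exception:
            log.append(f"{name} failed")
            # Only steps that already committed are compensated, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return False, log
        completed.append(step)
        log.append(f"{name} committed")
    return True, log


def noop() -> None:
    pass


def out_of_stock() -> None:
    raise RuntimeError("out of stock")


# T3 fails after T1 and T2 succeeded, so C2 runs, then C1.
ok, log = run_saga([
    ("T1", noop, noop),
    ("T2", noop, noop),
    ("T3", out_of_stock, noop),
])
```

In this run `log` ends with "compensated T2" followed by "compensated T1", which is the reverse order the transcript describes.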

Not rollback. Compensate. C1 is not "undo T1." It is the business action that makes T1's effects irrelevant. Refund the charge. Issue a credit note. Cancel the reservation.

There are two ways to wire a saga.

In orchestration, a single coordinator service owns the flow. It calls T1, waits, calls T2, waits. On failure, it walks the compensations in reverse. The flow is explicit. You can read it top to bottom.

In choreography, there is no coordinator. Each service publishes events. Other services subscribe. Each participant knows only its own next step.
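A choreographed version of the opening order example might look like the sketch below. The `EventBus` class, the event names, and the handlers are all hypothetical, and the bus is a synchronous in-memory stand-in for a real message broker.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, str]) -> None:
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


bus = EventBus()
stock = {"widget": 0}  # the last unit is already gone


def payment_on_order_placed(order: Dict[str, str]) -> None:
    # Payment service: charge the card, then announce it.
    bus.publish("PaymentCharged", order)


def inventory_on_payment_charged(order: Dict[str, str]) -> None:
    # Inventory service: reserve stock, or announce that it could not.
    if stock[order["sku"]] > 0:
        stock[order["sku"]] -= 1
        bus.publish("InventoryReserved", order)
    else:
        bus.publish("InventoryRejected", order)


def payment_on_inventory_rejected(order: Dict[str, str]) -> None:
    # Payment service again: this is its compensation.
    bus.publish("PaymentRefunded", order)


bus.subscribe("OrderPlaced", payment_on_order_placed)
bus.subscribe("PaymentCharged", inventory_on_payment_charged)
bus.subscribe("InventoryRejected", payment_on_inventory_rejected)

bus.publish("OrderPlaced", {"order_id": "o-1", "sku": "widget"})
```

No function in the sketch describes the whole flow. The only end-to-end record is `bus.history`, which a real system would have to rebuild from logs or traces.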

Orchestration is easier to debug. The trade-off: the orchestrator becomes a coupling point. Every new step means changing and redeploying the orchestrator.

Choreography removes the central coordinator. Services are coupled only through the events they share. The trade-off: nobody owns the end-to-end view. You'll need an observability layer to know whether a saga is in flight, stuck, or done.

For most teams, start with orchestration. Move to choreography only when the coupling cost outweighs the observability cost.

Most saga tutorials skip the hard parts. Three of them.

One: sagas have no isolation. While a saga is in flight, other readers see its intermediate state. Someone queries inventory, sees zero, and gives up, because the in-flight saga reserved the last unit. Then the saga fails and the inventory comes back. Too late. This is a dirty read: the reader acted on state that was never final. You design around it. Semantic locks, pessimistic reservations with timeouts, or just visible pending state.
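One way to picture "visible pending state" with a timeout is the sketch below. The `Inventory` class, its method names, and the 30-second default are assumptions for illustration, not a prescribed design. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Inventory:
    on_hand: int
    # saga_id -> expiry timestamp; each entry holds one unit "pending".
    pending: Dict[str, float] = field(default_factory=dict)

    def _expire(self, now: float) -> None:
        # Drop reservations whose timeout has passed.
        self.pending = {s: t for s, t in self.pending.items() if t > now}

    def reserve(self, saga_id: str, now: float, ttl: float = 30.0) -> bool:
        self._expire(now)
        if self.on_hand - len(self.pending) <= 0:
            return False
        self.pending[saga_id] = now + ttl
        return True

    def confirm(self, saga_id: str) -> None:
        # Saga succeeded: the pending unit really leaves stock.
        # If the reservation already expired, this does nothing.
        if self.pending.pop(saga_id, None) is not None:
            self.on_hand -= 1

    def release(self, saga_id: str) -> None:
        # Compensation: the pending unit becomes available again.
        self.pending.pop(saga_id, None)

    def view(self, now: float) -> Dict[str, int]:
        # Readers see pending units separately instead of a bare zero.
        self._expire(now)
        return {
            "available": self.on_hand - len(self.pending),
            "pending": len(self.pending),
        }


inv = Inventory(on_hand=1)
inv.reserve("saga-1", now=0.0)
snapshot_during = inv.view(now=1.0)
inv.release("saga-1")
snapshot_after = inv.view(now=2.0)
```

While the saga is in flight, a reader sees zero available and one pending, so it can choose to wait instead of giving up.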

Two: compensations must be idempotent. Every step can be retried, including the compensations, because a timeout does not tell you whether the call landed. Refunding the same charge twice is worse than not refunding at all.
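A common way to make a refund safe to retry is an idempotency key. The sketch below is a minimal in-memory version: `RefundService`, the key format, and the charge ID are hypothetical, and a real service would store seen keys durably, in the same transaction as the refund.

```python
from typing import Dict, Set


class RefundService:
    """Toy refund service that ignores repeated requests with the same key."""

    def __init__(self) -> None:
        self.refunded_cents: Dict[str, int] = {}
        self.seen_keys: Set[str] = set()

    def refund(self, idempotency_key: str, charge_id: str, amount_cents: int) -> bool:
        # A retry carries the same key, so it is recognized and skipped.
        if idempotency_key in self.seen_keys:
            return False
        self.seen_keys.add(idempotency_key)
        self.refunded_cents[charge_id] = (
            self.refunded_cents.get(charge_id, 0) + amount_cents
        )
        return True


service = RefundService()
key = "saga-42:refund:ch_1"  # hypothetical: derived from saga and step, not random
first_attempt = service.refund(key, "ch_1", 1999)
retry_attempt = service.refund(key, "ch_1", 1999)
```

The key is derived from the saga and the step, so every retry of the same compensation produces the same key and the money moves once.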

Three: compensations can fail. Now your saga cannot complete and cannot roll back. The only honest answer is a dead-letter queue, an alert, and a human.
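The "dead-letter queue, an alert, and a human" answer can be sketched as a bounded retry that parks the failure. The names `compensate_with_retry` and `dead_letters` are hypothetical, the list stands in for a real queue, and the sketch omits backoff and alerting.

```python
from typing import Callable, Dict, List

# Stand-in for a real dead-letter queue that a human would monitor.
dead_letters: List[Dict[str, str]] = []


def compensate_with_retry(
    saga_id: str,
    step: str,
    compensate: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Try a compensation a few times, then park it for manual handling."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate()
            return True
        except Exception as exc:
            # A real system would wait between attempts (backoff omitted here).
            last_error = f"attempt {attempt}: {exc}"
    dead_letters.append({"saga_id": saga_id, "step": step, "error": last_error})
    return False


def refund_api_down() -> None:
    raise ConnectionError("refund API unavailable")


resolved = compensate_with_retry("saga-42", "C1", refund_api_down)
```

After the last attempt the saga is neither complete nor compensated. The dead-letter entry is the record a person needs in order to finish the job by hand.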

Sagas are wrong for short single-database transactions. Just use a transaction. They are wrong when you need strongly consistent reads across the saga's lifetime. They are wrong when the compensation cannot be expressed. You can't un-send an email. You can't un-ship a package once the truck leaves the dock.

If you cannot write the compensation, you do not have a saga. You have a one-way door.

Sagas are not a way to get atomicity back. They are a way to get throughput and availability at the cost of atomicity, and to manage the consequences explicitly.

Most distributed transaction problems are not solved by a better protocol. They are solved by a clearer view of what compensation actually means in the business domain. The pattern is the easy part.

Next episode: CQRS and event sourcing, and why they are not the same thing.