5 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Distributed Transactions

Real Incident: eBay's 2PC Meltdown, Early 2000s

eBay's payment and listing systems used distributed Two-Phase Commit transactions. During peak auction endings (Sunday nights, thousands closing per minute), the coordinator held locks across both databases for 2-3 seconds while waiting for the slowest participant. One slow query created a cascading lock pile-up: thousands of transactions blocked, both systems frozen, and the infamous "eBay is down" pages appeared. They replaced 2PC with event-driven compensating transactions — essentially SAGAs years before the term existed. Distributed locks don't just slow you down — they create correlated failures across systems that should be independent.

Why This Comes Up in Interviews

The moment you say "microservices" in a system design interview, the interviewer will probe:

"What happens if the payment succeeds but inventory fails?"
"How do you maintain consistency across services?"
"Can you use a database transaction across services?"
"What's the trade-off between 2PC and eventual consistency?"

If you can't reason about distributed transactions, your microservice design has a fatal flaw.

The Problem: Transactions Don't Cross Service Boundaries

In a monolith, you wrap five database operations in one transaction. If anything fails, everything rolls back. ACID handles it.

In microservices: Order Service has its own database. Payment Service has its own database. Inventory Service has its own database. There is no single transaction boundary that spans all three.

The uncomfortable truth: There is no way to get true ACID transactions across microservices without paying an enormous performance and availability penalty. Every solution is a trade-off.

Two-Phase Commit (2PC) — The Traditional Answer

How It Works

Phase 1 (Prepare): Coordinator sends "prepare" to every participant. Each participant executes locally, acquires locks, writes to WAL — does everything except commit. Votes YES or NO.

Phase 2 (Commit/Abort): If ALL voted YES → coordinator sends "commit." If ANY voted NO → sends "abort." Each participant commits or rolls back.

Why 2PC Fails for Microservices

Problem	Impact
Blocking protocol	Between phases, every participant holds DB locks and WAITS. If coordinator takes 5s to collect votes, locks held for 5s. Kills throughput.
Coordinator is SPOF	Coordinator crashes after Phase 1 but before Phase 2 → all participants stuck in "prepared" state FOREVER, holding locks
Latency multiplier	Total latency = sum of slowest participant in each phase. p99 = p99 of SLOWEST service, twice.
Heterogeneous services	Requires all participants to support XA protocol. Good luck getting your Payment API, email service, and third-party shipping API to implement XA.

When 2PC is acceptable: Within a single database cluster (e.g., distributed PostgreSQL). NOT across microservices.

The SAGA Pattern — Local Transactions + Compensations

Instead of one distributed transaction, break it into a sequence of local transactions. Each step commits independently. If step 3 fails, run compensating transactions for steps 2 and 1 in reverse order.

Key insight: Each compensation is a NEW forward transaction that semantically undoes the previous one — not a database rollback.

Choreography-Based SAGA (Event-Driven)

Each service knows its part. When it completes its local transaction, it publishes an event. The next service subscribes and does its work. On failure, it publishes a failure event triggering upstream compensations.

Text Only

Order Service → publishes "OrderCreated"
  → Inventory Service listens → reserves stock → publishes "InventoryReserved"
    → Payment Service listens → charges card → publishes "PaymentCharged"
      → Shipping Service listens → books courier → publishes "ShippingBooked"

Pros	Cons
Loosely coupled (services know events, not each other)	Flow is implicit — no single place shows the full SAGA
No single point of failure	Debugging requires distributed tracing across services
Natural fit with event-driven architecture	Cyclic dependencies emerge silently
Each service fully autonomous	Hard to test end-to-end

When to use: 2-4 services in the chain. Once you hit 5+, the implicit flow becomes a nightmare to debug.

Orchestration-Based SAGA (Central Coordinator)

A central SAGA orchestrator directs the entire flow via a state machine. It tells each service what to do, waits for response, decides next step, triggers compensations on failure.

Text Only

Orchestrator state machine:
  CREATED → send ReserveInventory → AWAITING_INVENTORY
  INVENTORY_RESERVED → send ChargePayment → AWAITING_PAYMENT
  PAYMENT_CHARGED → send BookShipping → AWAITING_SHIPPING
  SHIPPING_BOOKED → COMPLETED

  PAYMENT_FAILED → send ReleaseInventory → COMPENSATING
  INVENTORY_RELEASED → send CancelOrder → FAILED

Pros	Cons
Entire flow visible in one place (state machine)	Orchestrator is a bottleneck/SPOF
Easy to debug ("where is this SAGA right now?")	Risk of becoming a "god service"
Centralized error handling and retries	Tighter coupling to orchestrator
Easy to add new steps	Must persist orchestrator state for crash recovery

When to use: 4+ services, complex flows, production-critical paths where debuggability matters. Most FAANG-level systems use orchestration.

Comparison: 2PC vs Choreography vs Orchestration

Aspect	2PC	Choreography SAGA	Orchestration SAGA
Consistency	Strong (ACID)	Eventual	Eventual
Availability	Low (blocking locks)	High	High (orchestrator must be HA)
Latency	High (lock duration)	Low (async)	Medium (sequential commands)
Coupling	Tight (XA protocol)	Loose (events)	Medium (orchestrator knows all services)
Debuggability	Medium	Hard (distributed tracing)	Easy (state machine)
Rollback	Perfect (DB rollback)	Semantic (compensations)	Semantic (compensations)
Scalability	Poor	Excellent	Good
Best for	Single DB cluster	Simple 2-4 service flows	Complex multi-service workflows

Compensating Transactions — The Hard Part

A compensating transaction is a semantic undo — a new action that reverses the business effect. NOT the same as a database rollback.

Action	Compensation	Difficulty
Reserve inventory	Release inventory	Easy — clean reversal
Create database record	Delete or mark as cancelled	Easy — but others may have read it
Charge credit card	Issue a refund	Medium — refund has different fees, takes days
Send an email	Send "sorry, ignore that" email?	Hard — can't unsend
Ship a physical package	Recall? Reroute? Hope it hasn't left?	Hard — real world has no ROLLBACK
Post to social media	Delete post (people already saw it)	Impossible to fully undo

Design principle: Order SAGA steps so that irreversible actions (send email, ship package) happen LAST. Reversible actions (reserve, authorize) happen first.

The Outbox Pattern — Solving Dual Writes

Problem: Service needs to update its database AND publish an event. If DB write succeeds but event publish fails (or vice versa), you get inconsistency.

Solution: Write the event to an "outbox" table in the SAME database transaction as the business data. A separate process reads the outbox and publishes to the message broker.

Text Only

BEGIN TRANSACTION
  INSERT INTO orders (id, user_id, total) VALUES (...)
  INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '{...}')
COMMIT

-- Background process (CDC or poller):
-- Read outbox → publish to Kafka → mark as published

Aspect	Without Outbox	With Outbox
Atomicity	DB + broker not atomic (dual write)	Single DB transaction (atomic)
Failure mode	Lost events or phantom events	At-least-once delivery (idempotent consumers)
Complexity	Simple code, complex failure	More infrastructure, predictable
Used by	—	Debezium (CDC), Transactional outbox

Idempotency — Critical for SAGAs

With at-least-once delivery and compensations, messages WILL be delivered multiple times. Every service must be idempotent.

Pattern:

Every message has a unique idempotency_key
Service checks: "Have I processed this key before?"
If yes → ACK, skip. If no → process, record key, ACK.

Implementation: Idempotency table in the service's database with the message ID as primary key. Check before processing, insert after.

Real Systems

System	What It Provides	Pattern
Temporal (ex-Cadence, Uber)	Durable workflow execution, automatic retries, state persistence	Orchestration
AWS Step Functions	Serverless state machines with built-in error handling	Orchestration
Eventuate	Event sourcing + SAGA framework for JVM	Both choreography and orchestration
MassTransit	.NET SAGA framework with state machines	Orchestration
Axon Framework	Java event sourcing + SAGA support	Both
Kafka + custom	Build your own with topics and consumer groups	Choreography

Uber's choice: Built Cadence (now Temporal) specifically because debugging choreographed SAGAs across 8+ services for a single trip was impossible at scale. Engineers spent more time tracing event chains than fixing bugs. Switching to orchestration cut incident investigation time by 70%.

Back-of-Envelope: SAGA Timing

E-commerce order (5 services):

Step	Latency	Cumulative
Create order	20ms	20ms
Reserve inventory	50ms	70ms
Charge payment (external API)	500ms	570ms
Book shipping	100ms	670ms
Send confirmation	30ms	700ms

Orchestration (sequential): ~700ms total. Acceptable for user-facing operations.

Choreography (async, parallel where possible): Steps 2+3 in parallel = ~520ms. Faster but harder to coordinate.

Compensation path (worst case): Payment charged, shipping fails. Refund = 3-5 business days. Customer sees "order cancelled" immediately, money returns later. This is why SAGAs are eventually consistent, not instantly consistent.

Interview Framework

When asked "How do you maintain consistency across microservices?":

Step 1 — Reject 2PC: "Two-Phase Commit doesn't work across microservices — it's blocking, the coordinator is a SPOF, and most services don't support XA. We need eventual consistency."

Step 2 — Choose SAGA: "I'd use the SAGA pattern — each service does a local transaction and publishes an event. If a downstream step fails, compensating transactions undo prior steps."

Step 3 — Choreography vs Orchestration: "For this flow with [N] services, I'd use [orchestration because we need debuggability and the flow is complex / choreography because it's simple 2-3 steps and we want loose coupling]."

Step 4 — Compensations: "Each step has a defined compensation. I'd order steps so irreversible actions (send email, ship) happen last. Reversible actions (reserve, authorize) happen first."

Step 5 — Reliability: "I'd use the outbox pattern to ensure events are published atomically with DB writes. All consumers are idempotent using message IDs to handle duplicates."

Quick Recall

Question	Answer
Why not 2PC for microservices?	Blocking locks, coordinator SPOF, latency, services can't all implement XA
SAGA in one sentence?	Sequence of local transactions + compensating transactions on failure
Choreography vs orchestration?	Choreography = events, loose coupling, hard to debug. Orchestration = state machine, debuggable, orchestrator is SPOF.
Outbox pattern solves what?	Dual-write problem (DB + event broker atomicity)
Why idempotency?	At-least-once delivery means duplicates. Must be safe to replay.
Irreversible actions?	Put them LAST in the SAGA sequence
Real orchestration tools?	Temporal, AWS Step Functions, Eventuate
SAGA consistency model?	Eventually consistent — not ACID across services
When choreography?	2-4 simple steps, want loose coupling
When orchestration?	4+ steps, complex flow, need debuggability

Distributed Transactions

Why This Comes Up in Interviews

The Problem: Transactions Don't Cross Service Boundaries

Two-Phase Commit (2PC) — The Traditional Answer

How It Works

Why 2PC Fails for Microservices

The SAGA Pattern — Local Transactions + Compensations

Choreography-Based SAGA (Event-Driven)

Orchestration-Based SAGA (Central Coordinator)

Comparison: 2PC vs Choreography vs Orchestration

Compensating Transactions — The Hard Part

The Outbox Pattern — Solving Dual Writes

Idempotency — Critical for SAGAs

Real Systems

Back-of-Envelope: SAGA Timing

Interview Framework

Quick Recall

5-Minute System Design — Weekly