Scaling an Order Pipeline to 50k Events/Sec
Problem
The existing order pipeline was a synchronous chain of HTTP calls. At peak traffic, tail latency and failures cascaded: a slow downstream caused order timeouts, duplicate writes from naive retries, and intermittent data inconsistency. The business needed to grow 10x without growing incidents.
Constraints
- Zero tolerance for lost orders — the event log is the system of record
- Strict per-merchant ordering for order state transitions
- Team of four; no budget for a bespoke distributed database
Architecture
We moved to an event-driven pipeline on Kafka with NestJS services. Commands are written via the transactional outbox pattern so the domain DB and outbox row commit atomically; a relay publishes to Kafka exactly-once at the producer boundary. Consumers are idempotent by event id, and read models are projected asynchronously into Postgres and Elasticsearch. Partitioning is keyed by merchant id to preserve ordering without global locks.
Trade-offs
Why: Gives us exactly-once effects without the operational cost of distributed transactions.
Cost: Extra table, relay service, and careful idempotency on consumers.
Why: Preserves ordering where it matters without serializing the whole stream.
Cost: Hot-merchant skew requires partition splitting and replay tooling.
Why: Write path is fast and decoupled; read models can be rebuilt.
Cost: Eventual consistency must be explicit in the UI and APIs.
Outcome
- Sustained 50k events/sec with sub-second p99 end-to-end latency
- Zero lost events in six months of production
- Incident count dropped ~70% vs. the prior architecture