2025

Scaling an Order Pipeline to 50k Events/Sec

KafkaNestJSEvent-DrivenScaling

Problem

The existing order pipeline was a synchronous chain of HTTP calls. At peak traffic, tail latency and failures cascaded: a slow downstream caused order timeouts, duplicate writes from naive retries, and intermittent data inconsistency. The business needed to grow 10x without growing incidents.

Constraints

Zero tolerance for lost orders — the event log is the system of record
Strict per-merchant ordering for order state transitions
Team of four; no budget for a bespoke distributed database

Architecture

We moved to an event-driven pipeline on Kafka with NestJS services. Commands are written via the transactional outbox pattern so the domain DB and outbox row commit atomically; a relay publishes to Kafka exactly-once at the producer boundary. Consumers are idempotent by event id, and read models are projected asynchronously into Postgres and Elasticsearch. Partitioning is keyed by merchant id to preserve ordering without global locks.

Trade-offs

Outbox + relay instead of 2PC

Why: Gives us exactly-once effects without the operational cost of distributed transactions.

Cost: Extra table, relay service, and careful idempotency on consumers.

Per-merchant partitioning

Why: Preserves ordering where it matters without serializing the whole stream.

Cost: Hot-merchant skew requires partition splitting and replay tooling.

Async read models

Why: Write path is fast and decoupled; read models can be rebuilt.

Cost: Eventual consistency must be explicit in the UI and APIs.

Outcome

Sustained 50k events/sec with sub-second p99 end-to-end latency
Zero lost events in six months of production
Incident count dropped ~70% vs. the prior architecture