In event-driven systems, it is easy to assume that events will be processed in the same order they were created.
Sometimes that is true.
Often it is not.
A service may publish events in one order, but consumers may observe them differently because of retries, multiple partitions, multiple consumers, network delays, deployments, or slow processing.
This can create subtle bugs.
For example:
PaymentSucceeded arrives before OrderCreated
InventoryReleased arrives before InventoryReserved
SubscriptionCancelled arrives before SubscriptionActivated
PlayerBalanceUpdated version 9 arrives before version 8
In a simple demo, messages usually arrive in the order you expect.
In production, you should be more careful.
Ordering is one of those problems that looks small until it affects money, access, inventory, or user trust.
Some events are independent.
For example:
UserLoggedIn
ProductViewed
PageVisited
If those events are used for analytics, exact ordering may not matter much.
But other events represent a business process.
For example:
OrderCreated
PaymentAuthorized
InventoryReserved
OrderConfirmed
OrderCancelled
The order of those events matters.
If a consumer processes OrderConfirmed before OrderCreated, what should happen?
If a reporting projection sees OrderCancelled and later receives an old OrderConfirmed, should the order become confirmed again?
If a balance projection processes an older balance update after a newer one, should it overwrite the newer state?
These questions need explicit answers.
Imagine a subscription system.
The source service publishes these events:
SubscriptionCreated
SubscriptionActivated
SubscriptionCancelled
A consumer builds a read model for customer support.
If the events are processed in order, the final state is clear:
Cancelled
But imagine the consumer receives them like this:
SubscriptionCreated
SubscriptionCancelled
SubscriptionActivated
If the consumer blindly applies each event, the subscription may end up as:
Activated
That is wrong.
The latest business state was cancelled, but an older event arrived late and overwrote it.
This is the kind of bug ordering problems can create.
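To make that concrete, here is a minimal Python sketch of a consumer that applies events blindly. The STATUS_BY_EVENT mapping and the apply_blindly function are names I made up for illustration, and a real read model would live in a database rather than a local variable.

```python
# Hypothetical mapping from event type to the status it sets in the read model.
STATUS_BY_EVENT = {
    "SubscriptionCreated": "Created",
    "SubscriptionActivated": "Activated",
    "SubscriptionCancelled": "Cancelled",
}


def apply_blindly(event_types):
    """Apply each event in arrival order, with no ordering checks at all."""
    status = None
    for event_type in event_types:
        status = STATUS_BY_EVENT[event_type]
    return status


in_order = ["SubscriptionCreated", "SubscriptionActivated", "SubscriptionCancelled"]
reordered = ["SubscriptionCreated", "SubscriptionCancelled", "SubscriptionActivated"]

print(apply_blindly(in_order))   # Cancelled
print(apply_blindly(reordered))  # Activated
```

The same three events produce two different final states, and nothing in the consumer notices.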
Events can arrive out of order for several reasons.
One reason is partitioning.
In systems like Kafka, ordering is usually guaranteed only within a partition. If related events use different partition keys, they can land on different partitions, and consumers may then process them in a different order than they were published.
Another reason is retries.
An earlier event may fail processing and get retried later, while newer events continue successfully.
For example:
Event 1 fails and is delayed
Event 2 succeeds
Event 3 succeeds
Event 1 is retried later
From the consumer's perspective, event 1 arrived late.
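The retry case can be simulated in a few lines. This is only a sketch of the effect, not of how any particular broker schedules retries: the process_with_retry function and the fails_once set are assumptions I introduced, and a failed event is simply pushed to the back of an in-memory queue.

```python
from collections import deque


def process_with_retry(events, fails_once):
    """Process events in order, but re-queue each event in fails_once one time."""
    queue = deque(events)
    already_failed = set()
    processed = []
    while queue:
        event = queue.popleft()
        if event in fails_once and event not in already_failed:
            # First attempt fails: retry later, behind the newer events.
            already_failed.add(event)
            queue.append(event)
            continue
        processed.append(event)
    return processed


print(process_with_retry([1, 2, 3], fails_once={1}))  # [2, 3, 1]
```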
Another reason is parallel consumers.
If multiple workers process messages concurrently, one worker may finish later than another even if it started earlier.
Deployments, network delays, dead-letter queues, and manual replays can also change the order in which events are processed.
The important point is this:
Published order is not always the same as processed order.
When people say a broker preserves order, ask:
Order within what boundary?
Global ordering across all events is rare and often not desirable.
Most systems provide ordering within a narrower scope.
For example:
Within one queue
Within one topic partition
Within one aggregate ID
Within one consumer instance
This distinction matters.
If all events for order_123 go to the same partition, then those events can usually be consumed in order relative to each other.
But events for order_123 and order_456 may be processed independently.
That is usually good. You often do not need a global order for all orders in the system.
You need a meaningful order per business entity.
A common technique is to partition events by aggregate ID.
For example:
orderId
paymentId
accountId
subscriptionId
playerId
If all events for the same order use orderId as the partition key, the broker can keep those events in order within the same partition.
For example:
OrderCreated order_123
PaymentAuthorized order_123
OrderConfirmed order_123
These should go to the same partition.
That makes it easier for consumers to process the order lifecycle correctly.
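Here is a rough sketch of why a shared key keeps related events together. Real brokers and client libraries ship their own partitioners, so partition_for below is a stand-in I wrote for illustration, and the partition count of 12 is arbitrary. I use hashlib rather than Python's built-in hash() because string hashing in Python is randomized per process by default, which would make the mapping unstable between runs.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index in a stable way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


events = [
    ("OrderCreated", "order_123"),
    ("PaymentAuthorized", "order_123"),
    ("OrderConfirmed", "order_123"),
]

# Every event keyed by order_123 maps to the same partition.
partitions = {partition_for(order_id, 12) for _, order_id in events}
print(len(partitions))  # 1
```

Note that changing the number of partitions changes the mapping, so events published before and after a resize may not share a partition.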
But there is a trade-off.
If one aggregate has very high traffic, the partition that holds it can become a hot partition.
For example:
accountId for a very active account
playerId for a high-frequency game session
tenantId for a very large customer
The ordering key is also a scaling decision.
You need to choose a key that matches the consistency needs of the domain.
Broker ordering helps, but it is not enough by itself.
A consumer can still process messages incorrectly.
For example, a consumer may process messages concurrently after reading them from the broker.
Or a message may fail, be retried later, and arrive after newer messages.
Or someone may replay older events.
Or the consumer may receive events from multiple topics.
So even when the broker gives ordering guarantees, consumers should still validate the business state.
A good consumer does not only ask:
Did this message arrive in order?
It asks:
Does this event make sense for the current state?
That is a safer model.
One of the most useful tools for ordering is a version number.
For example:
{
  "eventType": "SubscriptionCancelled",
  "subscriptionId": "sub_123",
  "version": 5,
  "occurredAt": "2026-06-23T10:00:00Z"
}
The version represents the sequence of changes for that aggregate.
A consumer can store the latest version it has processed.
For example:
Current projection version: 5
Incoming event version: 4
The incoming event is older.
The consumer should not overwrite newer state with older state.
This protects projections from stale events.
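A minimal sketch of that check, assuming the event shape shown above. SubscriptionView, STATUS_FOR, and apply_if_newer are hypothetical names, and in a real store the comparison and the write would need to happen atomically, for example as a conditional update.

```python
from dataclasses import dataclass

# Hypothetical mapping from event type to projected status.
STATUS_FOR = {
    "SubscriptionActivated": "Activated",
    "SubscriptionCancelled": "Cancelled",
}


@dataclass
class SubscriptionView:
    status: str
    version: int


def apply_if_newer(view: SubscriptionView, event: dict) -> bool:
    """Apply the event only if it is newer than what the view has seen."""
    if event["version"] <= view.version:
        return False  # stale or duplicate: keep the newer state
    view.status = STATUS_FOR[event["eventType"]]
    view.version = event["version"]
    return True


view = SubscriptionView(status="Cancelled", version=5)
late = {"eventType": "SubscriptionActivated", "subscriptionId": "sub_123", "version": 4}
print(apply_if_newer(view, late), view.status)  # False Cancelled
```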
For state-changing consumers, version numbers also help detect gaps.
For example:
Current version: 3
Incoming version: 5
Version 4 is missing.
The consumer can decide what to do:
Wait and retry later
Fetch current state from source service
Move message to retry queue
Alert if the gap does not resolve
Versions make ordering problems visible.
Without versions, the consumer may not know that anything is wrong.
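One way to sketch this is a small classifier the consumer runs before applying anything. It assumes versions increase by exactly one per change to the aggregate, which the producer has to guarantee; classify_version and its three labels are my own naming.

```python
def classify_version(current_version: int, incoming_version: int) -> str:
    """Compare an incoming event version with the last version applied."""
    if incoming_version <= current_version:
        return "stale"  # already seen this version or something newer
    if incoming_version == current_version + 1:
        return "next"   # exactly the version we expect
    return "gap"        # at least one earlier version has not arrived


print(classify_version(3, 5))  # gap
print(classify_version(5, 4))  # stale
print(classify_version(3, 4))  # next
```

What the consumer does with a "gap" result is then an explicit choice from the list above rather than an accident.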
Events often include timestamps.
For example:
occurredAt
publishedAt
processedAt
These are useful for debugging and metrics.
But timestamps are not always safe for ordering.
Clocks can drift.
Events can be created at nearly the same time.
Different services may use different clocks.
A timestamp can tell you when something probably happened, but it is not always a reliable sequence number.
For business-critical ordering, I prefer explicit versions or sequence numbers per aggregate.
Use timestamps for context.
Use versions for correctness.
Another practical technique is to model valid state transitions.
For example, an order might have states like:
PendingPayment
PaymentAuthorized
Confirmed
Cancelled
Refunded
Then the consumer can enforce rules:
PendingPayment -> PaymentAuthorized is valid
PaymentAuthorized -> Confirmed is valid
Confirmed -> Cancelled may be valid
Cancelled -> Confirmed is not valid
Refunded -> Confirmed is not valid
This prevents old or unexpected events from corrupting the state.
For example, if an OrderConfirmed event arrives after the order was already cancelled, the system should not blindly confirm it.
It should treat it as a stale event, an invalid transition, or a situation that requires investigation.
State machines are useful because they turn hidden assumptions into explicit rules.
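The rules above fit in a small lookup table. This sketch contains only the transitions listed in this post, with Confirmed -> Cancelled treated as allowed; a real order domain would have more. VALID_TRANSITIONS and can_transition are illustrative names.

```python
# Allowed target states for each current state (only the rules listed above).
VALID_TRANSITIONS = {
    "PendingPayment": {"PaymentAuthorized"},
    "PaymentAuthorized": {"Confirmed"},
    "Confirmed": {"Cancelled"},  # assumption: the domain allows this
    "Cancelled": set(),
    "Refunded": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True only if the table explicitly allows current -> target."""
    return target in VALID_TRANSITIONS.get(current, set())


print(can_transition("PaymentAuthorized", "Confirmed"))  # True
print(can_transition("Cancelled", "Confirmed"))          # False
```

Anything not listed is rejected by default, which is usually the safer failure mode for a late event.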
Ordering problems often show up in projections.
A projection is a read model built from events.
For example:
Customer support view
Reporting dashboard
Search index
Account balance view
Subscription status page
Projections usually consume events asynchronously.
That means they can lag behind the source of truth.
They can also receive events late or out of order.
For simple projections, this may be acceptable.
For example, analytics being slightly delayed is usually fine.
But for projections that users or support teams rely on, correctness matters.
A projection should know how to handle:
Duplicate events
Late events
Missing events
Unsupported event versions
Invalid state transitions
Replay
The projection should not blindly trust arrival order.
Balances are a good example of where ordering and idempotency both matter.
Imagine these events:
BalanceCredited +50
BalanceDebited -20
BalanceCredited +10
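If the consumer simply applies increments, a duplicate delivery changes the final balance, and out-of-order processing can produce intermediate balances that never existed, such as a temporary overdraft.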
A safer design may use a transaction ledger:
transactionId
accountId
amount
type
createdAt
Each transaction is applied once.
The balance can be calculated from the ledger or updated with strong consistency guarantees.
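A minimal in-memory sketch of the ledger idea, keyed by transactionId so that each transaction counts once no matter how often or in what order it is delivered. The Ledger class is hypothetical, amounts are integers in minor units, and a real ledger would enforce uniqueness with a database constraint rather than a dict.

```python
class Ledger:
    """In-memory ledger that records each transaction ID at most once."""

    def __init__(self):
        self._entries = {}  # transaction_id -> (account_id, amount)

    def record(self, transaction_id: str, account_id: str, amount: int) -> bool:
        """Record a transaction; return False if it was already recorded."""
        if transaction_id in self._entries:
            return False
        self._entries[transaction_id] = (account_id, amount)
        return True

    def balance(self, account_id: str) -> int:
        """Derive the balance from the recorded transactions."""
        return sum(a for acc, a in self._entries.values() if acc == account_id)


ledger = Ledger()
# Out-of-order delivery with one duplicate.
ledger.record("tx_3", "acc_123", 10)
ledger.record("tx_1", "acc_123", 50)
ledger.record("tx_1", "acc_123", 50)  # duplicate, ignored
ledger.record("tx_2", "acc_123", -20)
print(ledger.balance("acc_123"))  # 40
```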
Another approach is to publish balance snapshots with versions:
accountId: acc_123
balance: 140
version: 8
Then the consumer applies the snapshot only if its version is newer than the version it currently holds.
The right design depends on the domain.
But the rule is the same:
Do not let an old or duplicate event corrupt financial state.
Event replay is useful.
You may replay events to rebuild a projection, fix a bug, or backfill a new system.
But replay changes processing conditions.
A consumer may receive old events that are technically valid but no longer relevant to the current state.
For example:
OrderCreated from three months ago
PaymentAuthorized from three months ago
OrderConfirmed from three months ago
If the consumer sends emails or triggers side effects during replay, that can be dangerous.
Ordering during replay also needs care.
If you replay events from multiple topics or partitions, you may not reconstruct the exact original interleaving.
For projections, that may be fine if events include versions and consumers are deterministic.
For side effects, replay should usually be disabled or carefully controlled.
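One simple way to sketch this is a replay flag that the handler checks before any side effect. The handle_order_event function, the replaying parameter, and the send_email callback are assumptions of mine; how a consumer learns that it is in replay mode depends on your tooling.

```python
def handle_order_event(event, projection, send_email, replaying=False):
    """Update the projection always; trigger side effects only for live traffic."""
    # Rebuilding the projection is safe to repeat.
    projection[event["orderId"]] = event["eventType"]

    # Emails must not be sent again for events from three months ago.
    if event["eventType"] == "OrderConfirmed" and not replaying:
        send_email(event["orderId"])


sent = []
projection = {}
confirmed = {"eventType": "OrderConfirmed", "orderId": "order_123"}

handle_order_event(confirmed, projection, sent.append, replaying=True)
print(projection, sent)  # {'order_123': 'OrderConfirmed'} []
```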
Sagas also need to think about ordering.
A saga might expect this flow:
InventoryReserved
PaymentAuthorized
OrderConfirmed
But it may receive:
PaymentAuthorized before InventoryReserved
Depending on the design, that may be valid or invalid.
If the saga is orchestrated, the orchestrator can control the sequence by sending commands step by step.
For example:
Only send AuthorizePayment after InventoryReserved
That reduces ordering complexity.
In choreography, services react to events more independently, so ordering assumptions can become harder to see.
This is one reason orchestration can be useful for complex workflows.
It gives you a central place to control the sequence and handle unexpected results.
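As a sketch, an orchestrator can be a small object that maps each incoming event to the next commands to send. AuthorizePayment comes from the example above; ReserveInventory, ConfirmOrder, and the OrderSagaOrchestrator class are names I invented, and a real saga would also persist its state and handle failures and timeouts.

```python
class OrderSagaOrchestrator:
    """Decides which command to send next based on the events seen so far."""

    def __init__(self):
        self.inventory_reserved = False

    def start(self) -> list:
        return ["ReserveInventory"]

    def on_event(self, event_type: str) -> list:
        if event_type == "InventoryReserved":
            self.inventory_reserved = True
            # Payment is only requested once inventory is reserved.
            return ["AuthorizePayment"]
        if event_type == "PaymentAuthorized":
            if not self.inventory_reserved:
                # Unexpected order: surface it instead of continuing.
                raise ValueError("PaymentAuthorized received before InventoryReserved")
            return ["ConfirmOrder"]
        return []


saga = OrderSagaOrchestrator()
print(saga.start())                        # ['ReserveInventory']
print(saga.on_event("InventoryReserved"))  # ['AuthorizePayment']
print(saga.on_event("PaymentAuthorized"))  # ['ConfirmOrder']
```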
Out-of-order events are one problem.
Missing events are another.
A consumer may receive version 5 while it only processed up to version 3.
That suggests version 4 is missing or delayed.
For example:
Processed: Order version 3
Received: Order version 5
Missing: Order version 4
The consumer should not ignore this silently.
Possible strategies:
Pause processing for that aggregate
Retry the event later
Fetch the current state from the source service
Rebuild the projection for that aggregate
Alert if the gap remains too long
The right choice depends on how important the projection is.
For a reporting dashboard, fetching current state may be enough.
For a financial ledger, you may need stricter guarantees.
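The "pause and retry later" strategy can be sketched as a per-aggregate buffer that holds early events until the gap closes. InOrderBuffer is a hypothetical helper that assumes versions increase by exactly one; it keeps everything in memory and has no timeout, so a real version would need persistence and an alert when a gap stays open.

```python
class InOrderBuffer:
    """Holds events that arrived early and releases them in version order."""

    def __init__(self, current_version: int = 0):
        self.current_version = current_version
        self.pending = {}  # version -> payload

    def accept(self, version: int, payload) -> list:
        """Return the payloads that can now be applied, in version order."""
        if version <= self.current_version:
            return []  # stale or duplicate
        self.pending[version] = payload
        ready = []
        while self.current_version + 1 in self.pending:
            self.current_version += 1
            ready.append(self.pending.pop(self.current_version))
        return ready


buffer = InOrderBuffer(current_version=3)
print(buffer.accept(5, "order v5"))  # []
print(buffer.accept(4, "order v4"))  # ['order v4', 'order v5']
```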
Not every system needs strict ordering.
Trying to force strict ordering everywhere can reduce throughput and increase complexity.
For example:
Analytics events
Search indexing
Non-critical notifications
Activity feeds
Recommendation updates
These systems often tolerate some delay, duplication, or minor reordering.
In those cases, it may be better to design consumers to be tolerant rather than forcing expensive ordering guarantees.
For example:
Use last-write-wins with version checks
Periodically rebuild projections
Deduplicate by event ID
Accept small delays
Use reconciliation jobs
Good architecture is about choosing where correctness matters most.
Ordering should be strict where business correctness requires it.
Everywhere else, tolerance may be simpler.
For each event flow, I like to ask:
Does order matter for this event type?
Does order matter globally or only per aggregate?
What key should preserve ordering?
Can events arrive late?
Can older events overwrite newer state?
Do events include a version or sequence number?
What happens if a version is missing?
Can this consumer process messages concurrently?
Is replay safe?
How do we detect stale events?
These questions reveal whether the system is relying on hidden assumptions.
Hidden ordering assumptions are dangerous.
Explicit ordering rules are much safer.
Ordering issues should be visible.
Useful metrics include:
Out-of-order event count
Stale event count
Missing version gaps
Invalid state transition count
Projection rebuild count
Retry count caused by missing earlier events
Oldest unresolved version gap
Logs should include:
eventId
eventType
aggregateId
eventVersion
currentVersion
correlationId
causationId
state before
state after
This makes debugging much easier.
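As a rough sketch, the fields above can be assembled into one structured record per handled event. build_log_record and the extra outcome field are my own additions, and the event is assumed to carry eventId, eventType, aggregateId, and version keys.

```python
import json


def build_log_record(event, current_version, state_before, state_after, outcome):
    """Collect the ordering-related fields for one handled event."""
    return {
        "eventId": event["eventId"],
        "eventType": event["eventType"],
        "aggregateId": event["aggregateId"],
        "eventVersion": event["version"],
        "currentVersion": current_version,
        "correlationId": event.get("correlationId"),
        "causationId": event.get("causationId"),
        "stateBefore": state_before,
        "stateAfter": state_after,
        "outcome": outcome,  # hypothetical field, e.g. "applied" or "stale"
    }


event = {
    "eventId": "evt_1",
    "eventType": "SubscriptionActivated",
    "aggregateId": "sub_123",
    "version": 4,
}
record = build_log_record(event, 5, "Cancelled", "Cancelled", "stale")
print(json.dumps(record, sort_keys=True))
```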
When something goes wrong, you want to know whether the problem was duplication, ordering, missing events, or invalid business logic.
If I had to explain ordering problems in an interview, I would say:
In event-driven systems, I do not assume global ordering. Many brokers only guarantee ordering within a queue or partition, and retries, parallel consumers, replays, and failures can still cause events to be processed later than expected.
If ordering matters, I would usually scope it to a business entity, such as orderId, accountId, or subscriptionId, and use that as the partition key. I would also include a version or sequence number in events so consumers can detect stale events or missing versions.
Consumers should not blindly trust arrival order. They should validate state transitions, ignore or handle stale events, and decide what to do when an expected earlier event is missing.
For some flows, strict ordering is critical, like balances, payments, inventory, and subscription status. For others, like analytics or search indexing, eventual consistency and tolerant consumers may be enough.
The main idea is to make ordering assumptions explicit instead of accidentally depending on the broker to always deliver events in the perfect order.
Ordering problems are easy to miss because most systems behave nicely during development.
There is one producer.
One consumer.
A small number of messages.
No retries.
No replays.
No partial outages.
Then production adds concurrency, failures, deployments, backpressure, and scale.
Events arrive late.
Old messages are replayed.
One consumer is slower than another.
A failed message is retried after newer messages already succeeded.
That is when ordering assumptions become bugs.
A good event-driven system does not need perfect global order.
It needs clear ordering rules where they matter.
Use the right partition key.
Include versions.
Validate state transitions.
Make stale and missing events visible.
And most importantly, decide which parts of the business actually require strict ordering.
This post is part of my Backend Architecture Notes series. In the next post, I will look at event schema versioning, and how to evolve events without breaking consumers.