Ensuring Data Consistency between Event-Driven Microservices

Sagynysh Baitursinov
9 min read · Aug 28, 2022

Asynchronous communication using Event-Driven Architecture is a popular design choice when building a system of distributed services. This design helps to achieve the choreography of (micro)services, making them less coupled, removing the need to centrally coordinate business processes, and avoiding single points of failure and scaling bottlenecks for the whole system. At the same time, utilizing events for communication makes services easily testable and helps keep their boundaries and responsibilities clear and clean. If this sounds convincing enough to follow Event-Driven Architecture for your next project, you might be thinking about which message medium to use for event propagation between (micro)services. Regardless of what you choose, whether it's Kafka, ActiveMQ, or AWS SQS, a broker cannot guarantee by itself that messages will never get lost when technical failures happen. Undelivered or falsely delivered events lead to inconsistent states in the system, which can be hard to detect and often disrupt business processes. To avoid such outcomes, it is the responsibility of developers to make services communicate with message brokers in a way that is resilient to failures. This article will explain how such fault tolerance can be achieved using two design patterns: the transactional outbox and the idempotent consumer.


Technical failures do happen, whether on the side of the message broker or on the side of the consumer/producer services. They might occur due to an unexpected software error, forced service restarts, faulty deployment configurations, a power outage, or an unfortunate natural disaster happening right above a data center running the software. The worst thing about failures is that you cannot choose the exact moment they happen, so services' threads may not have time to gracefully finish or undo message consumption or production. What you can choose, however, is to design your services in such a way that after restarting, interrupted processes resume and data does not get lost in the middle. Being able to recover from failures is the key to reaching data consistency. This article will demonstrate how this can be achieved with the following example of a simple data transfer between two services.

Example application: Online Newsletter

We have a task to implement a simple piece of software, which consists of two microservices: Subscriptions-service and Newsletter-service. Subscriptions-service faces a front-end application, takes care of customer registration, and stores customer data. Newsletter-service gets notified about new users so that it can send them newsletters on a regular basis. Both microservices need a database to perform their responsibilities: Subscriptions-service needs one because it faces the front-end and manages subscriber data, prevents users from registering twice, etc. Newsletter-service, on the other hand, is interested in storing all active subscriptions as newsletter-targets so that it can email newsletter issues to them. Note that the two do not share a database, because we want the microservices to be truly decoupled and independent of one another. Additionally, developers are then free to choose different database management systems for each of them.

The two microservices are connected via a message broker. Newsletter-service listens to events of type subscriber.created produced by Subscriptions-service, which contain information about the subscriptions created via REST calls. Let's see what the expected happy path for a successful data transfer from one service to the other would look like.

Happy path for a user subscribing to the newsletter

  1. Front-end application sends a synchronous request to sign up for the newsletter
  2. Subscriptions-service creates and commits the new subscriber to its database
  3. Subscriptions-service synchronously notifies the message broker about a subscriber.created event, and the event is propagated to consumers’ queues
  4. Front-end application receives HTTP response with status 201 (created) from Subscriptions-service
  5. Newsletter-service consumes the subscriber.created event
  6. Newsletter-service creates and commits a new newsletter-target to its database
  7. Newsletter-service acknowledges to the broker that the subscriber.created event was successfully processed, so it can be removed from the message queue

That was the happy path, which led to the subscriber's data settling in both microservices. However, as stated before, you should never design your services with only happy paths in mind. Such Optimism Driven Development may lead to situations where the system has no way to recover when a process actually gets terminated abnormally.
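To make the fragility concrete, here is a minimal sketch of the naive producer side (Step2 and Step3) in Python, using SQLite for brevity; the subscribe() function and the publish() broker client are hypothetical stand-ins for your web layer and for a real Kafka/ActiveMQ/SQS client:

```python
# A deliberately naive implementation of steps 2 and 3: commit first, publish second.
import json
import sqlite3
import uuid

db = sqlite3.connect("subscriptions.db")
db.executescript(
    "CREATE TABLE IF NOT EXISTS subscriber (id TEXT PRIMARY KEY, email TEXT UNIQUE);"
)

def publish(topic: str, payload: dict) -> None:
    """Placeholder for a real broker client (Kafka, ActiveMQ, SQS, ...)."""
    print(f"published to {topic}: {json.dumps(payload)}")

def subscribe(email: str) -> str:
    subscriber_id = str(uuid.uuid4())
    with db:  # Step2: the local transaction commits here on success
        db.execute("INSERT INTO subscriber (id, email) VALUES (?, ?)",
                   (subscriber_id, email))
    # If the process dies right here, the commit survives but the event is lost.
    publish("subscriber.created", {"id": subscriber_id, "email": email})  # Step3
    return subscriber_id  # Step4: the web layer turns this into HTTP 201
```

The gap between the database commit and the broker call is exactly where the failure scenarios below strike.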

Some failures cause only minor inconveniences: for example, if Step1 or Step2 fails, the user or the front-end application can retry the subscription request, and the attempt might succeed later. Even if it doesn't, the user will be asked to try again later; in the end, the domain state does not get corrupted. Other failures, however, may go unnoticed if not mitigated properly. Those are more dangerous for the system's state consistency, because the services lack the ability to recover to a consistent state.

“Unhappy” scenarios when the domain ends up in inconsistent states


Lost event:

  • If Step2 (database state change) succeeds and Step3 (event publishing) fails or is never reached, the subscriber data gets written in Subscriptions-service, yet no event about this state change is published. Moreover, since Step4 is not reached, the front-end might display an initial error, but after a retry, or after re-fetching the subscriber's state from Subscriptions-service, the user will be informed that the initial subscription attempt succeeded. Unfortunately, the user shouldn't get too excited: the other microservice, Newsletter-service, has no idea about the newsletter-target that should have been created but wasn't, due to the missing subscriber.created event. Consequently, the user will never receive the next issue of the newsletter, even though Subscriptions-service's database shows an active subscription.

False event:

  • Trying to mitigate the “Lost event” inconsistency scenario, one might redesign the process so that the subscriber.created event is sent first, and the subscriber entity is committed to the database only after publishing succeeds. But if the database commit fails after subscriber.created has already been sent, the system ends up in a state where Newsletter-service knows about a newsletter-target, while the end user is informed by Subscriptions-service that no new subscriber was created. The consequence: a user receives newsletter emails while believing that the subscription has failed.

A single event processed multiple times:

  • Step6 (database state change) succeeds and Step7 (event acknowledgment) fails: Newsletter-service fails to acknowledge to the broker that it has processed the message successfully. For such failures, Message Brokers have built-in message re-delivery mechanisms. So when Step7 does not run, the subscriber.created message will be re-delivered and consumed by Newsletter-service multiple times. If not prepared for such a scenario, Newsletter-service creates two (or more) newsletter-target entries for the same subscriber, which leads to the user receiving the same newsletter more than once.

How to ensure that data stays consistent?

The “Lost event” and “False event” failure types described above can occur when a service does not atomically (1) change the state of entities in its database and (2) emit the events announcing that state change. Atomic here means guaranteed indivisibility: if one action succeeds, the second has to succeed; if one action fails, the second one must be reverted. Applied to our example, the two actions that should happen in one atomic transaction are 1) committing the subscriber to the database and 2) sending the subscriber.created event to the Message Broker. Fortunately for us, database management systems provide the means to perform operations within atomic transactions. Atomic transactions in relational (SQL) databases may span several local operations changing multiple entries across different tables. NoSQL databases, on the other hand, typically only guarantee atomicity for the insertion or modification of a single document or row.
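As a refresher, here is what local transaction atomicity gives us, sketched with SQLite; the audit_log table is a made-up second table, only there to show several writes committing or rolling back together:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE subscriber (id TEXT PRIMARY KEY, email TEXT UNIQUE);
    CREATE TABLE audit_log (subscriber_id TEXT, action TEXT);
""")

try:
    with db:  # both inserts commit together, or neither does
        db.execute("INSERT INTO subscriber VALUES ('s-1', 'jane@example.com')")
        db.execute("INSERT INTO audit_log VALUES ('s-1', 'created')")
except sqlite3.Error:
    print("transaction rolled back, neither row was written")
```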

But how do we bring atomicity to operations involving both a database and a Message Broker?

There are database management systems and Message Brokers that support distributed transactions (XA, eXtended Architecture) across multiple services with the help of a technique called two-phase commit (2PC). However, distributed transactions significantly affect the performance of the system, limit developers' choice of tools, and on top of that lead developers into Optimism Driven Development. They are therefore not recommended, and another solution to the problem is required.

Transactional outbox

The solution is to create and publish events using a design pattern called the Transactional Outbox. In this pattern, events are written to the same database as the entity state change, within the scope of one local transaction. As the atomicity of these two operations is guaranteed by the database, we can be sure that subscriber.created will be known to the system whenever the actual subscriber is created. In relational (SQL) databases, a separate “events” (outbox) table is needed for storing subscriber.created and other events. In NoSQL databases, events are written to the same document that receives the state change, in the scope of a single write operation.
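Here is how the producer sketch from before might look with an outbox, again in Python with SQLite for brevity; the outbox_event table layout is illustrative, not a prescribed schema:

```python
# Producer side with a transactional outbox: the entity and the event
# are written to the same database in one local transaction.
import json
import sqlite3
import uuid

db = sqlite3.connect("subscriptions.db")
db.executescript("""
    CREATE TABLE IF NOT EXISTS subscriber (id TEXT PRIMARY KEY, email TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS outbox_event (
        id TEXT PRIMARY KEY,
        type TEXT NOT NULL,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING'  -- PENDING | SENT
    );
""")

def subscribe(email: str) -> str:
    subscriber_id = str(uuid.uuid4())
    payload = json.dumps({"id": subscriber_id, "email": email})
    with db:  # ONE local transaction: entity and event commit together or not at all
        db.execute("INSERT INTO subscriber (id, email) VALUES (?, ?)",
                   (subscriber_id, email))
        db.execute("INSERT INTO outbox_event (id, type, payload) VALUES (?, ?, ?)",
                   (str(uuid.uuid4()), "subscriber.created", payload))
    return subscriber_id
```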

To publish the events from the database to their actual destination, the Message Broker, a technical publisher process must be implemented. It has to regularly query the events that have not been published yet, publish them, and afterwards mark them in the database as published, as sketched below.
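A minimal sketch of such a polling relay, reusing the outbox_event table from the previous snippet; publish() is again a stand-in for a real broker client, and a production relay would also need batching, error handling, and a strategy for ordering:

```python
# Outbox relay: polls unpublished events, publishes them, then marks them SENT.
import sqlite3
import time

db = sqlite3.connect("subscriptions.db")

def publish(topic: str, payload: str) -> None:
    """Placeholder for a real broker client."""
    print(f"published to {topic}: {payload}")

def relay_outbox_forever(poll_interval_seconds: float = 1.0) -> None:
    while True:
        rows = db.execute(
            "SELECT id, type, payload FROM outbox_event "
            "WHERE status = 'PENDING' ORDER BY rowid"
        ).fetchall()
        for event_id, event_type, payload in rows:
            publish(event_type, payload)  # if we crash right after this line...
            with db:  # ...the row stays PENDING and is published again: at-least-once
                db.execute("UPDATE outbox_event SET status = 'SENT' WHERE id = ?",
                           (event_id,))
        time.sleep(poll_interval_seconds)
```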

Note that it is still possible for the outbox publisher to finish publishing the subscriber.created event but fail to mark the entry in the outbox table as SENT. Atomicity between these two actions cannot be guaranteed. The worst that can happen in such a case is the event being published multiple times. But at least, with the transactional outbox approach, the “Lost event” and “False event” issues are taken care of. As a next step, we need to address the third issue described before: a single event processed multiple times. It can now happen for several reasons:

  1. The transactional outbox publisher sends the same message to the Message Broker multiple times.
  2. The Message Broker delivers the same message to consumers multiple times. Note: Message Brokers usually do not guarantee “exactly once” delivery, only “at least once” delivery of a message. If you understood the just-described transactional outbox pattern, it should be obvious why “exactly once” delivery cannot be guaranteed by Message Brokers.
  3. The consumer fails to acknowledge that a message was processed successfully, so the Message Broker re-delivers it to the consumer two or more times.

As there are multiple sources of such event duplication, the issue must be taken care of on the consumers' side.

Idempotent consumer

For those who hear the term for the first time, here is the description of the Idempotent Consumer pattern from Chris Richardson (microservices.io):

the outcome of processing the same message repeatedly must be the same as processing the message once

If we apply this to our scenario, Newsletter-service should be designed in such a way that even if it consumes the same subscriber.created event two or more times, the result is the same as if it had consumed it once.

There are a few ways to make a consumer idempotent:

  1. Use an inbox table that stores incoming events' IDs as primary-key or unique column values, so that a replayed event is caught by database constraints (see the sketch after this list).
  2. Store the IDs of incoming events in the business entities' table, with a unique constraint on that column. E.g. the subscriber entity will be aware of the ID of the event that created it.
  3. Build your consumer to be naturally idempotent through the business process itself. For example, when handling the subscriber.created event, a uniqueness constraint on the newsletter-target's email column would do.
  4. Utilize the event sourcing technique for the service's repository layer. In event sourcing, event consumption is idempotent by default.
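Here is a minimal sketch of option 1, an inbox table on the consumer side, with SQLite again; on_subscriber_created() is a hypothetical handler that a broker client library would invoke once per delivery:

```python
# Idempotent consumer via an inbox table: a redelivered event hits the
# primary-key constraint on processed_event and is safely ignored.
import json
import sqlite3

db = sqlite3.connect("newsletter.db")
db.executescript("""
    CREATE TABLE IF NOT EXISTS processed_event (event_id TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS newsletter_target (email TEXT PRIMARY KEY);
""")

def on_subscriber_created(event_id: str, payload: str) -> None:
    try:
        with db:  # inbox insert and entity insert share one local transaction
            db.execute("INSERT INTO processed_event (event_id) VALUES (?)",
                       (event_id,))
            db.execute("INSERT INTO newsletter_target (email) VALUES (?)",
                       (json.loads(payload)["email"],))
    except sqlite3.IntegrityError:
        pass  # already processed (duplicate event ID or email): ignore the redelivery
    # In both cases the message is acknowledged to the broker afterwards (Step7).
```

Note that in this particular example the primary key on newsletter_target.email would catch the duplicate on its own as well, which is exactly option 3 from the list above.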

Conclusion

In this article, we discussed two techniques, the Transactional Outbox and the Idempotent Consumer, which can help you build distributed microservices that communicate asynchronously without data loss or false data creation. A beautiful thing about these techniques is that they work with different database management systems and message brokers, and, unlike distributed transactions, they have little impact on the system's performance. Besides, they are relatively easy to implement. I hope the article was helpful to you. While designing and implementing a system, keep thinking about what can go wrong in it. Do not be driven solely by optimism.

Further reading:

Chris Richardson: https://microservices.io/patterns/data/transactional-outbox.html

Chris Richardson: https://microservices.io/patterns/communication-style/idempotent-consumer.html

Dave Syer: https://www.infoworld.com/article/2077963/distributed-transactions-in-spring--with-and-without-xa.html
