Kafka: Real World Patterns and Trade-offs

Vijay Rangan
Jun 24, 2023

In this article, we will look at four real-world patterns for using Kafka and the associated trade-offs. These patterns are inspired by large scale applications that I have personally architected and contributed to. We will explore the following patterns:

  1. Mono partition topics
  2. Message ordering using partition keys
  3. Service-internal topics
  4. The outbox pattern

Introduction

If you aren’t already familiar, Apache Kafka is an open-source distributed event streaming platform used by thousands of companies around the globe for applications ranging from asynchronous service communication to data streaming pipelines for analytical workloads.
It has become a popular choice for building scalable and reliable distributed systems, especially in microservice architectures.

1. Mono Partition Topics

All messages produced to Kafka are stored in partitions. Partitions belong to a topic and, for all practical purposes, a topic is identified by its name. Therefore, you can think of a topic as a logical grouping of partitions.

Kafka ensures message ordering at the partition level and does not provide topic-wide message ordering. This means that if we have a topic with 4 partitions (like below), messages produced to this topic will be consumed in the order of insertion within each partition.

Topic with 4 partitions with messages in each ordered from left to right

Kafka affords us the power to horizontally scale consumers. This is done by increasing the number of partitions in a topic — more partitions, more parallel processing by consumer groups. So, the concept of a “mono partition topic” is quite unintuitive. In some situations, however, it can be very useful.

Consider the scenario of capturing a continuous stream of modifications in a database table. To maintain a comprehensive audit trail of all database changes, say for a “users” table, a well-suited approach is to utilize a topic named “users_table_events” with a single partition. As new records are inserted, existing entries get updated, or data gets deleted within the table, a corresponding event is published as a message to this topic, preserving the sequential order in which they were executed.
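To make this concrete, here is a minimal sketch using the Java client that creates the single-partition "users_table_events" topic and appends a change event to it. The broker address, the replication factor of 3, and the JSON payload shape are assumptions for illustration, not a prescribed schema.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Collections;
import java.util.Properties;

public class UsersTableAuditProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // A single partition gives us a total order over all change events.
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic auditTopic = new NewTopic("users_table_events", 1, (short) 3);
            admin.createTopics(Collections.singletonList(auditTopic)).all().get();
        }

        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every insert/update/delete on the "users" table becomes one event,
            // appended in the order the changes were applied.
            String event = "{\"op\":\"UPDATE\",\"table\":\"users\",\"id\":42,\"field\":\"email\"}";
            producer.send(new ProducerRecord<>("users_table_events", event)).get();
        }
    }
}
```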

PROS:

  1. Ordering Guarantee: Single partition topics provide strict message ordering since all messages are processed in a sequential manner within a single partition. This ensures that messages are consumed and processed in the exact order they were produced.
  2. Simplified Consumer Logic: With a single partition topic, the consumer logic becomes simpler as there is no need to handle partition assignment and coordination among multiple partitions.
  3. Guaranteed Message Processing: In failure scenarios, the processing of messages within a single partition is more deterministic and predictable compared to multi-partition topics. If a consumer encounters an error or needs to be restarted, it can simply resume processing from the last consumed message within the single partition, without worrying about synchronisation with other partitions.

CONS:

  1. Limited Parallelism: Single partition topics restrict the parallelism and scalability of message processing since all messages are confined to a single partition. This can become a bottleneck if the message throughput is high.
  2. Reduced Fault Tolerance: Single partition topics have limited fault tolerance capabilities compared to multi-partition topics. If the single partition becomes unavailable due to a broker failure or other issues, the entire topic becomes inaccessible. It is important to ensure appropriate monitoring and fault recovery mechanisms to mitigate the impact of potential failures.
  3. Potential Performance Challenges: Single partition topics may experience performance challenges when dealing with high message throughput. As all messages are processed within a single partition, the processing speed of the consumer becomes a critical factor. Bottlenecks can occur if the consumer cannot process messages quickly enough, leading to latency and potential message backlogs.

2. Message ordering using partition keys

Of all the patterns we’ll look at today, this is likely the most common one I’ve seen. In Kafka, partition keys are used by the producer client library to decide which partition a message is sent to.

Common partitioning strategies:
1. Round-robin
2. Key-based
3. Hash-based
4. Custom partitioner

Each partitioning strategy carries its own trade-offs — a subject that deserves a post of its own. I’ll keep that little gem for another day.

By default, when no key is provided, producers spread messages across partitions (round-robin in older clients, sticky partitioning since Kafka 2.4). If a key is set during message production, then messages with the same key are always sent to the same partition.

Messages with the same key are always produced to the same partition

Because all messages with the same partition key land in the same partition, Kafka guarantees FIFO (First-In-First-Out) ordering among them. This feature is particularly beneficial in domains such as accounting systems, as well as event sourcing. For example, processing financial transactions requires maintaining the correct account balances and order of operations. By using the account ID (say) as the partition key, all transactions related to a specific account will be processed sequentially, ensuring consistency and accurate balances.
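As a rough sketch of what this looks like with the Java producer, the snippet below keys every record by account ID so that all transactions for one account end up in the same partition. The "account_transactions" topic name, broker address, and payload format are made up for the example.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String accountId = "acct-1001"; // partition key: all events for this account land in one partition

            // The default partitioner hashes the key, so these two records are
            // guaranteed to go to the same partition and be consumed in order.
            producer.send(new ProducerRecord<>("account_transactions", accountId,
                    "{\"type\":\"DEBIT\",\"amount\":50.00}"));
            producer.send(new ProducerRecord<>("account_transactions", accountId,
                    "{\"type\":\"CREDIT\",\"amount\":25.00}"));

            producer.flush();
        }
    }
}
```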

PROS:

  1. Message Ordering: By carefully selecting a partition key, you can ensure that related messages with the same key are stored in and processed from the same partition. This guarantees the ordering of messages within a specific partition, enabling you to maintain the sequence of operations when required.
  2. Parallel Processing: Kafka allows multiple partitions to be processed concurrently, providing horizontal scalability. By choosing partition keys wisely, you can distribute the workload across partitions, enabling parallel processing of messages by multiple consumers. This enhances the overall throughput and performance of the system.
  3. Consumer Affinity: Partition keys can be used to achieve consumer affinity, meaning that messages with the same partition key are consumed by the same consumer instance, allowing for efficient state management and minimising redundant processing.

CONS:

  1. Data Skew: There aren’t obvious cons to using partition keys, but each strategy carries trade-offs. For instance, a key-based strategy can lead to “hot partitions”, where some partitions receive a significantly higher volume of messages than others; for example, a handful of account IDs may generate far more transactions than the rest.
  2. Limited Partition Count: Each Kafka topic has a fixed number of partitions, and the number of partitions determines the maximum parallelism and scalability of the topic. Choosing partition keys that result in too few partitions may limit the system’s ability to scale horizontally. On the other hand, having an excessive number of partitions may introduce unnecessary complexity and overhead. It is crucial to strike a balance by determining an appropriate partition count based on the anticipated workload and processing requirements.
  3. Key Selection Complexity: Selecting effective partition keys requires careful consideration of the data and its characteristics. Choosing an inappropriate partition key can result in imbalanced workloads, poor performance, or increased data skew. It may also require domain-specific knowledge to identify suitable attributes that can be used as partition keys.

3. Service-internal topics

This pattern is relatively uncommon but is employed at NuBank as a solution for handling time-consuming tasks within a service’s purview.

Instead of using topics to publish events for consumption by other services, this pattern leverages them as a means of offloading upfront work. By doing so, it defers potentially expensive computations in exchange for improved response times for the caller. This approach essentially functions as a queueing mechanism while taking advantage of Kafka’s scaling capabilities.

Let’s dive into how this was used at NuBank. But first, a little terminology.

Transactional environment: This term refers to the collection of microservices and their associated databases that deal with the application layer, serving real customers.

Analytical environment: This encompasses the environment and set of services dedicated to business analytics, big data, and the ETL (Extract, Transform, Load) process at NuBank.

Extraction service using internal topic to extract 300+ production databases

Within the data ingestion team at NuBank, we adopted this approach to schedule extractions of databases from the transactional environment. At regular intervals, we initiated the extraction process from over 300 microservices using the Datomic database. The process of extraction is, as you can imagine, a time-consuming operation.

Using service-internal topics, when the scheduler called the extraction service, a new batch of extractions was enqueued.

Once enqueued, the extraction service efficiently handled the time-consuming task of data extraction by processing all databases. This approach enabled us to promptly respond to the scheduler and efficiently handle large volumes of data from these databases by scaling out the consumers.
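A stripped-down sketch of the idea follows: the scheduler-facing handler only enqueues a database name onto a service-internal topic and returns immediately, while a worker loop inside the same service (scaled out via a consumer group) performs the slow extraction. The topic name, consumer group id, and extractDatabase helper are hypothetical and not NuBank’s actual implementation.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ExtractionService {
    private static final String INTERNAL_TOPIC = "extraction_service.pending_extractions";

    // Called when the scheduler hits the service: enqueue the work and respond right away.
    static void enqueueExtraction(KafkaProducer<String, String> producer, String databaseName) {
        producer.send(new ProducerRecord<>(INTERNAL_TOPIC, databaseName, databaseName));
    }

    // Runs inside the same service (possibly many instances): does the slow work off the request path.
    static void runWorker() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "extraction-workers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(INTERNAL_TOPIC));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    extractDatabase(record.value()); // the time-consuming extraction happens here
                }
            }
        }
    }

    static void extractDatabase(String databaseName) {
        // placeholder for the long-running extraction of one database
    }
}
```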

PROS:

  1. Flexibility and Agility: Private topics give services the flexibility to publish events specific to their needs without affecting other services; each service can evolve independently, adding or modifying its private topics as needed.
  2. Fine-grained Access Control: Private topics enable fine-grained access control, allowing only authorized services to produce or consume messages from the topic. This enhances security by preventing unauthorized access and ensuring that sensitive data or internal communication is restricted to the intended services.
  3. Performance Optimization: Private topics can be optimized for specific use cases or performance requirements. Since only internal services are involved, you can configure the topic and broker settings based on the needs of these services, such as retention policies, replication factors, or message compression. This optimization can improve overall system performance and efficiency.

CONS:

  1. Increased Complexity: Introducing private topics adds complexity to the overall system architecture. Managing additional topics and ensuring proper coordination between producers and consumers can become challenging. Proper documentation and communication are crucial to maintain a clear understanding of the purpose and usage of private topics across the organization.

4. Outbox pattern

The Outbox pattern is used in microservices architectures to ensure reliable and consistent event-driven communication between services.

In this pattern, each service maintains an “outbox” within its own database. The outbox serves as a buffer where events or messages to be published asynchronously are stored. Instead of directly publishing messages to Kafka, the service writes the events to its outbox as part of a database transaction. This helps mitigate the problems associated with 2PC (two-phase commit).

Service updates its data and records this change within the same DB transaction
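Below is a minimal sketch of the two halves of the pattern using JDBC and the Java Kafka producer: the service writes its own row and an outbox row in one local transaction, and a separate outbox processor polls unpublished rows and forwards them to Kafka. The table names, columns, and "order_events" topic are illustrative assumptions, not a fixed schema.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OutboxExample {

    // Step 1: the service writes its own data AND the event in one local transaction.
    static void placeOrder(Connection db, String orderId, String payload) throws Exception {
        db.setAutoCommit(false);
        try (PreparedStatement insertOrder =
                     db.prepareStatement("INSERT INTO orders (id, payload) VALUES (?, ?)");
             PreparedStatement insertOutbox =
                     db.prepareStatement("INSERT INTO outbox (aggregate_id, topic, payload, published) VALUES (?, ?, ?, false)")) {
            insertOrder.setString(1, orderId);
            insertOrder.setString(2, payload);
            insertOrder.executeUpdate();

            insertOutbox.setString(1, orderId);
            insertOutbox.setString(2, "order_events");
            insertOutbox.setString(3, payload);
            insertOutbox.executeUpdate();

            db.commit(); // both rows commit or roll back together
        } catch (Exception e) {
            db.rollback();
            throw e;
        }
    }

    // Step 2: a separate outbox processor polls unpublished rows and forwards them to Kafka.
    static void relayOnce(Connection db, KafkaProducer<String, String> producer) throws Exception {
        try (PreparedStatement select = db.prepareStatement(
                     "SELECT id, aggregate_id, topic, payload FROM outbox WHERE published = false ORDER BY id");
             ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                producer.send(new ProducerRecord<>(rs.getString("topic"),
                        rs.getString("aggregate_id"), rs.getString("payload"))).get();
                try (PreparedStatement markDone =
                             db.prepareStatement("UPDATE outbox SET published = true WHERE id = ?")) {
                    markDone.setLong(1, rs.getLong("id"));
                    markDone.executeUpdate();
                }
            }
        }
    }
}
```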

PROS:

  1. Data Consistency: By associating the outbox with the service’s own transaction, data changes and events are committed or rolled back atomically, ensuring consistency between data updates and event publication.
  2. Fault Tolerance: In case of failures or downtime, the outbox processor can resume processing from the last successfully processed event, ensuring fault tolerance and message durability.
  3. Scalability: The outbox processor can be horizontally scaled to handle high event volumes, ensuring efficient processing and distribution of events. (with the trade-off of synchronisation, detailed below)

CONS:

  1. Increased Complexity: Implementing the Outbox pattern adds complexity to the system architecture. It introduces additional components, such as the outbox tables, outbox processor, and event publishing mechanism, which require development and maintenance efforts.
  2. Latency: The Outbox pattern introduces an inherent delay in event publication. Events are not immediately published when data changes occur, as they need to be written to the outbox first and then processed by the outbox processor. Depending on the frequency of polling and the processing time of the outbox processor, there can be a slight delay between data changes and event delivery.
  3. Synchronisation Challenges: The Outbox pattern requires synchronization between the service’s data changes and the events written to the outbox. Ensuring that events are correctly associated with the corresponding data changes can be challenging, especially in scenarios where distributed transactions or multi-step workflows are involved. Care must be taken to handle scenarios like compensating transactions or rollbacks to maintain consistency.

Despite these challenges, the Outbox pattern remains a valuable approach for achieving reliable event-driven communication in a microservices world.

Conclusion

While there are many ways to use Kafka, it is important to carefully evaluate the specific requirements and trade-offs of the system you are working with before adopting any of these patterns, considering factors such as the scale, complexity, and performance needs of the application.
