What is Kafka?
Apache Kafka is messaging system built to scale for big data. Similar to Apache ActiveMQ or RabbitMq, Kafka enables applications built on different platforms to communicate via asynchronous message passing. But Kafka differs from these more traditional messaging systems in key ways:
- It’s designed to scale horizontally, by adding more commodity servers.
- It provides much higher throughput for both producer and consumer processes.
- It can be used to support both batch and real-time use cases.
- It doesn’t support JMS, Java’s message-oriented middleware API.
Before we explore Kafka’s architecture, you should know its basic terminology:
- A producer is process that can publish a message to a topic.
- a consumer is a process that can subscribe to one or more topics and consume messages published to topics.
- A topic category is the name of the feed to which messages are published.
- A broker is a process running on single machine.
- A cluster is a group of brokers working together.
Architecture of a Kafka message system
Kafka’s architecture is very simple, which can result in better performance and throughput in some systems. Every topic in Kafka is like a simple log file. When a producer publishes a message, the Kafka server appends it to the end of the log file for its given topic. The server also assigns an offset, which is a number used to permanently identify each message. As the number of messages grows, the value of each offset increases; for example, if the producer publishes three messages the first one might get an offset of 1, the second an offset of 2, and the third an offset of 3.
When the Kafka consumer first starts, it will send a pull request to the server, asking to retrieve any messages for a particular topic with an offset value higher than 0. The server will check the log file for that topic and return the three new messages. The consumer will process the messages, then send a request for messages with an offset higher than 3, and so on.
In Kafka, the client is responsible for remembering the offset count and retrieving messages. The Kafka server doesn’t track or manage message consumption. By default, a Kafka server will keep a message for seven days. A background thread in the server checks and deletes messages that are seven days or older. A consumer can access messages as long as they are on the server. It can read a message multiple times, and even read messages in reverse order of receipt. But if the consumer fails to retrieve the message before the seven days are up, it will miss that message.
Kafka: Streaming Architecture
Kafka is used most often for streaming data in real-time into other systems. Kafka is a middle layer to decouple your real-time data pipelines. Kafka core is not good for direct computations such as data aggregations or CEP. Kafka streaming, which is part of the Kafka ecosystem, provides the ability to do real-time analytics. Kafka can be used to feed fast lane systems (real-time and operational data systems) like Storm, Flink, Spark streaming, and your services and CEP systems. Kafka is also used to stream data for batch data analysis. Kafka feeds Hadoop. It streams data into your big data platform or into RDBMS, Cassandra, Spark, or even S3 for some future data analysis. These data stores often support data analysis, reporting, data science crunching, compliance auditing, and backups.
Kafka streaming architecture diagram referred from dzone.com
What Kafka doesn’t do
Kafka can be very fast because it presents the log data structure as a first-class citizen. It’s not a traditional message broker with lots of bells and whistles.
- Kafka does not have individual message IDs. Messages are simply addressed by their offset in the log.
- Kafka also does not track the consumers that a topic has or who has consumed what messages. All of that is left up to the consumers.
Because of those differences from traditional messaging brokers, Kafka can make optimizations.
- It lightens the load by not maintaining any indexes that record what messages it has. There is no random access — consumers just specify offsets and Kafka delivers the messages in order, starting with the offset.
- There are no deletes. Kafka keeps all parts of the log for the specified time.
- It can efficiently stream the messages to consumers using kernel-level IO and not buffering the messages in user space.
- It can leverage the operating system for file page caches and efficient writeback/writethrough to disk.
How Kafka supports microservices
As powerful and popular as Kafka is for big data ingestion, the “log” data structure has interesting implications for applications built around the Internet of Things, microservices, and cloud-native architectures in general. Domain-driven design concepts like CQRS and event sourcing are powerful mechanisms for implementing scalable microservices, and Kafka can provide the backing store for these concepts. Event sourcing applications that generate a lot of events can be difficult to implement with traditional databases, and an additional feature in Kafka called “log compaction” can preserve events for the lifetime of the app. Basically, with log compaction, instead of discarding the log at preconfigured time intervals (7 days, 30 days, etc.), Kafka can keep the entire set of recent events around for all the keys in the set. This helps make the application very loosely coupled, because it can lose or discard logs and just restore the domain state from a log of preserved events.
How does Kafka compare to traditional messaging competitors?
Just as the evolution of the database from RDBMS to specialized stores has led to efficient technology for the problems that need it, messaging systems have evolved from the “one size fits all” message queues to more nuanced implementations (or assumptions) for certain classes of problems. Both Kafka and traditional messaging have their place.
Traditional message brokers allow you to keep consumers fairly simple in terms of reliable messaging guarantees. The broker (JMS, AMQP, or whatever) tracks what messages have been acknowledged by the consumer and can help a lot when order processing guarantees are required and messages must not be missed. Traditional brokers typically implement multiple protocols (e.g., Apache ActiveMQ implements AMQP, MQTT, STOMP, and others) to be used as a bridge for components that use different protocols. Additional functionalities such as message TTLs, non-persistent messaging, request-response messaging, correlation ID selectors, etc. are all perfectly valid messaging use cases where Kafka would not be a good fit.
Should you use Kafka?
The answer will always depend on what your use case is. Kafka fits a class of problem that a lot of web-scale companies and enterprises have, but just as the traditional message broker is not a one size fits all, neither is Kafka. If you’re looking to build a set of resilient data services and applications, Kafka can serve as the source of truth by collecting and keeping all of the “facts” or “events” for a system.
Here is a description of a few of the popular use cases for Apache Kafka®. For an overview of a number of these areas in action, see this blog post.
Every service at LinkedIn includes at least a Kafka producer, since this is how metrics are propagated. This includes technical metrics (query throughput, latency, etc.) as well as business metrics (ad impressions, ad clicks, etc.). These events are emitted at the rate of one per metric per minute per host, and thus represent aggregated information (count, average, max, percentiles, etc.) that get loaded into a time-series database. These metrics can thus be almost immediately visualized in dashboards, and their trend over time can be examined. Threshold alerts are also fired from this data.
This is similar to business metrics, except that it is typically more complex than just a simple numerical value going into a time-series database. This includes, for example, page view events, which contain many dimensions (member ID, page key, etc.). These events are emitted at the rate of one per triggering event, and can thus be much higher volume for each given type of events than metrics. These end up being ingested into Hadoop (via Gobblin) for offline batch processing and into Samza for nearline stream processing.
Services can get their logs ingested into Kafka in order to load them into ELK, for monitoring purposes.
Besides the above use cases, which are very common across the industry, Kafka is also used as a building block in almost every other piece of infrastructure at LinkedIn:
- Espresso, a master-slave document store, uses Kafka for its internal replication (masters write into Kafka, slaves consume from it).
- Venice, a derived data key-value store, uses Kafka as the transport mechanism to produce data from Hadoop and from Samza and to consume it into Venice storage nodes. Venice leverages Kafka for cross-datacenter transfers and even as a long-term source of truth to re-bootstrap data when doing cluster expansion or rebalancing.
- Galene, a search index, uses Kafka to keep its real-time indices up to date.
- Liquid, a graph database, uses Kafka to keep its index up to date.
- Pinot, an OLAP store, uses Kafka to update its columnar formatted tables.
- Brooklin, a change capture system, listens to row changes in Espresso and Oracle, and publishes those changes into Kafka for downstream consumers. Brooklin supports both up to date consumers as well as bootstrap consumers which bulkload entire table snapshots.
- Samza leverages Kafka in so many ways, that it deserves its own section.
As mentioned above, Samza is the stream processing framework used at LinkedIn. It leverages Kafka in a variety of ways:
- Samza can consume data from raw Kafka topics as well as from Brooklin streams (which underneath use special-purpose Kafka topics, as mentioned previously).
- Samza outputs its computation results into either raw Kafka topics, or into Venice stores (which also use special-purpose Kafka topics). Samza jobs can be multi-stage jobs, and each stage propagates data to the next downstream stage via Kafka.
- Finally, another way which Samza leverages Kafka is for the durability of its local state, thus allowing stateful Samza jobs to be resilient to hardware failures, as the state can be automatically regenerated in a new instance, by bootstrapping it from Kafka.
Use cases fulfilled by Samza are very diverse: pushing notifications and emails to members, site speed analysis, ad click-through rate computation, fraud detection, standardization machine learning pipelines, and many more.
Some use cases require inter-process messaging, and Kafka can be used for this. These events are transient in nature and may not need to be persisted into a long-term storage system, contrarily to the previously mentioned use cases, which usually result in long-term storage.