---
Title: What Is Apache Kafka?
Author: Amazon Web Services, Inc.
Tags: readwise, articles
date: 2024-01-30
---

# What Is Apache Kafka?

![rw-book-cover](https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png)

URL:: https://aws.amazon.com/es/msk/what-is-kafka/
Author:: Amazon Web Services, Inc.

## AI-Generated Summary

Apache Kafka is a distributed streaming platform that is used to build real time streaming data pipelines and applications that adapt to data streams. Learn more about how Kafka works, the benefits, and how your business can begin using Kafka.

## Highlights

> Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real-time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records in simultaneously. A streaming platform needs to handle this constant influx of data, and process the data sequentially and incrementally. ([View Highlight](https://read.readwise.io/read/01gce4k8kvehnn9fs2eerqfwc7))

> Kafka provides three main functions to its users:
>
> - Publish and subscribe to streams of records
> - Effectively store streams of records in the order in which records were generated
> - Process streams of records in real time
>
> Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data. ([View Highlight](https://read.readwise.io/read/01gce4kg4bw6221zymn1pc95y6))

> Kafka is used to build real-time streaming data pipelines and real-time streaming applications. ([View Highlight](https://read.readwise.io/read/01gce54hrvrk5k94s8gkb8nb2f))

> Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications. ([View Highlight](https://read.readwise.io/read/01gce54sn13gyyh6535vy4eea0))

> Kafka combines two messaging models, queuing and publish-subscribe, to provide the key benefits of each to consumers. Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues aren’t multi-subscriber. The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes. Kafka uses a partitioned log model to stitch together these two solutions. A log is an ordered sequence of records, and these logs are broken up into segments, or partitions, that correspond to different subscribers. This means that there can be multiple subscribers to the same topic and each is assigned a partition to allow for higher scalability. Finally, Kafka’s model provides replayability, which allows multiple independent applications reading from data streams to work independently at their own rate. ([View Highlight](https://read.readwise.io/read/01gce82yk5a6n6vfxxvekpfxk4))

> Queuing
>
> ![](https://d1.awsstatic.com/product-marketing/MSK/product-page-diagram_Kafka_Queue.610d4e93fbe68690ac3202838bb833c032df9b60.png)
>
> Publish-Subscribe
>
> ![](https://d1.awsstatic.com/product-marketing/MSK/product-page-diagram_Kafka_PubSub.4c6a1384cc43bab62e45e293bc8a5d650bf2dec7.png) ([View Highlight](https://read.readwise.io/read/01gce83cys1n5mhjkhk59hckzf))

> Kafka’s partitioned log model allows data to be distributed across multiple servers, making it scalable beyond what would fit on a single server. ([View Highlight](https://read.readwise.io/read/01gce83yzdjncswhck50nctcb3))

> Kafka also acts as a very scalable and fault-tolerant storage system by writing and replicating all data to disk. ([View Highlight](https://read.readwise.io/read/01gce84grv5h8t4avk95npgspa))

> Kafka has four APIs:
>
> - Producer API: used to publish a stream of records to a Kafka topic.
> - Consumer API: used to subscribe to topics and process their streams of records.
> - Streams API: enables applications to behave as stream processors, which take in an input stream from topic(s) and transform it to an output stream which goes into different output topic(s).
> - Connector API: allows users to seamlessly automate the addition of another application or data system to their current Kafka topics.
>
> ([View Highlight](https://read.readwise.io/read/01gce84p31n8wqxaygvvkqt6nm))

> [RabbitMQ](https://www.rabbitmq.com/) is an open source message broker that uses a messaging queue approach. Queues are spread across a cluster of nodes and optionally replicated, with each message only being delivered to a single consumer. ([View Highlight](https://read.readwise.io/read/01gce85fcype5sbbmbf2dxkrr4))

> | Characteristics | Apache Kafka | RabbitMQ |
> | --- | --- | --- |
> | **Architecture** | Kafka uses a partitioned log model, which combines messaging queue and publish subscribe approaches. | RabbitMQ uses a messaging queue. |
> | **Scalability** | Kafka provides scalability by allowing partitions to be distributed across different servers. | Increase the number of consumers to the queue to scale out processing across those competing consumers. |
> | **Message retention** | Policy based, for example messages may be stored for one day. The user can configure this retention window. | Acknowledgement based, meaning messages are deleted as they are consumed. |
> | **Multiple consumers** | Multiple consumers can subscribe to the same topic, because Kafka allows the same message to be replayed for a given window of time. | Multiple consumers cannot all receive the same message, because messages are removed as they are consumed. |
> | **Replication** | Topics are automatically replicated, but the user can manually configure topics to not be replicated. | Messages are not automatically replicated, but the user can manually configure them to be replicated. |
> | **Message ordering** | Each consumer receives information in order because of the partitioned log architecture. | Messages are delivered to consumers in the order of their arrival to the queue. If there are competing consumers, each consumer will process a subset of that message. |
> | **Protocols** | Kafka uses a binary protocol over TCP. | Advanced messaging queue protocol (AMQP) with support via plugins: MQTT, STOMP. |
>
> ([View Highlight](https://read.readwise.io/read/01gce86h6kwj8s14dqce07cbt3))
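
The partitioned log model described in the highlights can be illustrated with a toy in-memory sketch. This is not the Kafka client API: `Topic`, `publish`, `poll`, `seek`, `assign`, and `read_all` are hypothetical names made up for this note, and hashing keys with `crc32` is an assumption chosen for simplicity, not Kafka's actual partitioner. The sketch only aims to show why records that stay in the log can be read by several independent groups (publish-subscribe), shared among workers by partition (queuing), and replayed.

```python
"""Toy in-memory model of a partitioned log (illustrative only, not Kafka's API)."""
from collections import defaultdict
from zlib import crc32


class Topic:
    def __init__(self, num_partitions):
        # Each partition is an append-only, ordered list of records.
        self.partitions = [[] for _ in range(num_partitions)]
        # Read position per (group, partition); every group starts at 0.
        self.offsets = defaultdict(int)

    def publish(self, key, value):
        # Assumption: pick the partition by hashing the key, so records
        # with the same key keep their relative order in one partition.
        p = crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def poll(self, group, partition):
        # Return the records this group has not read yet and advance its
        # offset. Nothing is deleted, so other groups can still read them.
        log = self.partitions[partition]
        start = self.offsets[(group, partition)]
        self.offsets[(group, partition)] = len(log)
        return log[start:]

    def seek(self, group, partition, offset):
        # Move a group's read position, e.g. back to 0 to replay records.
        self.offsets[(group, partition)] = offset


def assign(num_partitions, members):
    # Spread partitions round-robin over the members of one group, which
    # gives queue-like work sharing inside that group.
    return {
        m: [p for p in range(num_partitions) if p % len(members) == i]
        for i, m in enumerate(members)
    }


def read_all(topic, group):
    # Convenience helper: poll every partition on behalf of one group.
    records = []
    for p in range(len(topic.partitions)):
        records.extend(topic.poll(group, p))
    return records


if __name__ == "__main__":
    topic = Topic(num_partitions=2)
    for i in range(4):
        topic.publish(f"sensor-{i % 2}", i)

    # Two independent groups each see every record (publish-subscribe).
    print(len(read_all(topic, "billing")), len(read_all(topic, "analytics")))  # 4 4

    # A group that is caught up gets nothing new.
    print(read_all(topic, "billing"))  # []

    # Rewinding the offsets replays the retained records.
    for p in range(2):
        topic.seek("billing", p, 0)
    print(len(read_all(topic, "billing")))  # 4
```

Keeping offsets per group rather than deleting records on read is the design choice that separates this model from the acknowledgement-based queue in the comparison table: consumption only moves a pointer, so retention can be governed by a separate policy.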