Big Industries Academy
Exploring Messaging and Streaming Technologies Part 2: Apache Pulsar
Francine Anestis writes a series of articles about various Messaging and Streaming technologies. She examines the Key Features, Architecture, Use Cases, Strengths & Weaknesses, Cost and Maturity Level of each technology. In this second article she explores Apache Pulsar.
Apache Pulsar is a distributed open-source messaging and streaming platform that offers low latency and high throughput. Developed originally by Yahoo and now an Apache Software Foundation project, Pulsar is designed to address some of the limitations of other messaging systems by offering a multi-tenant architecture, seamless scalability, and built-in geo-replication. It is particularly suited for real-time analytics, event-driven applications, and data pipeline solutions.
Key Features
- Multi-Tenancy: Supports multiple tenants, namespaces, and topics, enabling isolation and efficient resource utilization.
- Seamless Scalability: Easily scalable from a few nodes to thousands, accommodating growing data needs without major reconfiguration.
- Geo-Replication: Built-in support for geo-replication, ensuring data is replicated across different geographical locations for disaster recovery and data locality.
- Unified Messaging Model: Supports both message queuing and publish-subscribe messaging patterns.
- Serverless Functions: Pulsar Functions allow you to process data in real-time without managing infrastructure.
- Schema Registry: Supports schema management, ensuring data integrity and compatibility across different producers and consumers.
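To make the multi-tenancy feature concrete: Pulsar addresses topics with fully qualified names of the form `persistent://tenant/namespace/topic`, so the tenant and namespace are part of every topic's identity. The sketch below only splits such a name into its parts; it is a plain-Python illustration, not part of any Pulsar client library, and the tenant, namespace and topic names used (`acme`, `billing`, `orders`) are hypothetical.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    persistence: str  # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic_name(name: str) -> TopicName:
    """Split a fully qualified topic name into its four parts."""
    scheme, sep, rest = name.partition("://")
    if not sep or scheme not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic name: {name!r}")
    # Everything after the second slash is treated as the topic's local name.
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic, got: {rest!r}")
    return TopicName(scheme, *parts)


# Hypothetical example: the "orders" topic of tenant "acme", namespace "billing".
orders = parse_topic_name("persistent://acme/billing/orders")
print(orders.tenant, orders.namespace, orders.topic)
```

Because the tenant and namespace are encoded in the name itself, policies such as quotas or retention can be attached at those levels rather than topic by topic.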
Architecture
- Producer: Clients that publish messages to Pulsar topics.
- Consumer: Clients that subscribe to topics and process incoming messages.
- Broker: Manages topics and handles message routing from producers to consumers.
- BookKeeper: Apache BookKeeper is used for durable storage, ensuring messages are reliably stored and replicated.
- ZooKeeper: Stores metadata and configuration, and coordinates the distributed components.
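The interplay between producers, the broker and consumers can be sketched with a toy in-memory model. This is not the Pulsar client API and it leaves out durable storage (BookKeeper) and metadata (ZooKeeper) entirely; the class `ToyTopic` and the subscription names are hypothetical. It assumes the behaviour of Pulsar's shared subscriptions: every subscription receives each message (publish-subscribe), while consumers attached to the same subscription split the messages between them (queuing).

```python
class ToyTopic:
    """In-memory stand-in for a topic managed by a broker (illustration only)."""

    def __init__(self):
        self._inboxes = {}  # subscription name -> list of consumer inboxes
        self._next = {}     # subscription name -> running message counter

    def subscribe(self, subscription: str) -> list:
        """Attach a new consumer to a subscription and return its inbox."""
        inbox = []
        self._inboxes.setdefault(subscription, []).append(inbox)
        self._next.setdefault(subscription, 0)
        return inbox

    def publish(self, message) -> None:
        """Deliver the message once per subscription, round-robin within it."""
        for name, inboxes in self._inboxes.items():
            index = self._next[name] % len(inboxes)
            inboxes[index].append(message)
            self._next[name] += 1


topic = ToyTopic()
audit = topic.subscribe("audit")       # a subscription with a single consumer
worker_a = topic.subscribe("workers")  # two consumers sharing one subscription
worker_b = topic.subscribe("workers")

for i in range(4):
    topic.publish(f"event-{i}")

print(audit)     # all four events
print(worker_a)  # event-0 and event-2
print(worker_b)  # event-1 and event-3
```

The same topic therefore serves both patterns at once: the "audit" subscription sees the full stream, while the "workers" subscription behaves like a work queue.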
Use Cases
- Real-Time Analytics: Processing and analyzing streaming data in real-time.
- Event-Driven Architectures: Handling events in microservices architectures.
- Log Aggregation: Collecting and aggregating logs from multiple sources for centralized analysis.
- Data Pipelines: Facilitating the movement and transformation of data across different systems.
Strengths
- Multi-Tenancy: Enables isolation and efficient resource utilization for multiple tenants.
- Low Latency and High Throughput: Optimized for real-time message processing, combining low latency with high throughput.
- Geo-Replication: Built-in support for replicating data across multiple geographic locations.
- Flexibility: Supports both queuing and pub-sub patterns, as well as multiple messaging protocols.
- Pulsar Functions: Allows for real-time stream processing without additional infrastructure.
- Protocol Interoperability: In addition to its own binary protocol over TCP, Pulsar can be accessed through other interfaces (e.g., a RESTful API).
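As an illustration of the Pulsar Functions point above, a function can be as small as the sketch below. It assumes the language-native Python style described in the Pulsar Functions documentation, where a plain module-level function named `process` receives each message payload and its return value is published to the output topic. The uppercase transformation is an arbitrary example, and deploying the function to a cluster is outside the scope of this sketch.

```python
def process(input):
    """Return the incoming message payload in upper case."""
    # The parameter name mirrors the style used in the Pulsar documentation.
    return input.upper()


# Local check without a Pulsar cluster: call the function directly.
print(process("order received"))
```

Since the function has no dependency on a client library, it can be unit-tested like any other Python function before it is handed to Pulsar.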
Weaknesses
- Limited Tooling: Compared to more mature platforms like Apache Kafka, Pulsar's ecosystem and tooling are still developing.
- Complex Setup: Setting up and managing a Pulsar cluster can be complex, particularly with geo-replication and multi-tenancy.
- Operational Overhead: The use of separate components (Brokers, Bookies, and ZooKeeper nodes) adds operational overhead. Managing these components and ensuring they work together seamlessly can be challenging.
- BookKeeper Latency: Pulsar uses Apache BookKeeper for message storage, which can introduce additional latency compared to systems that handle storage directly.
- Community and Ecosystem: Smaller community and fewer third-party integrations compared to more established systems.
- Steeper Learning Curve: Due to its architecture and the additional components it relies on (like ZooKeeper and BookKeeper), Pulsar can have a steeper learning curve for new users. Understanding how these components interact and how to configure them correctly requires more effort than with some other systems.
- Latency Concerns: Although Pulsar performs well in many scenarios, there can be higher latencies in write and read operations compared to some other systems, particularly when durability and replication are prioritized.
Cost
- Open-source (free): Pulsar is available as an open-source project under the Apache License.
- Managed Services: Some cloud providers and vendors offer managed Pulsar services with varying pricing models.
Maturity Level
- Relatively mature (open-sourced in 2016): Although newer than some alternatives, Pulsar has seen significant adoption and development since its release.
Francine Anestis
My diploma thesis and my internship both focused on ETL, analysis and forecasting of big streaming data, so I am keen to learn more and immerse myself in Data Engineering and the data space in general. Building data pipelines and working with Kafka, databases and algorithms captivated me during my studies as an Electrical and Computer Engineer, and as a result I decided to dedicate myself to Data Engineering. I am very excited to start my learning and career path at Big Industries. Regarding my skills, if I had to choose one programming language and one platform, I would say that Python and Kafka are my strongest assets, but I am looking forward to extending that list.