Traditionally, an Enterprise Service Bus (ESB) or another integration solution such as an Extract-Transform-Load (ETL) tool has been used to try to decouple systems. However, the sheer number of connectors, together with the requirement that applications publish and subscribe to the data at the same time, means that systems remain intertwined. As a result, development projects depend heavily on other systems, and nothing can be truly decoupled.
This blog post shows why so many enterprises leverage the open source ecosystem of Apache Kafka to integrate legacy and modern applications, and how it differs from, yet also complements, existing integration solutions like ESB or ETL tools.
No matter which enterprise you work in, and no matter when your company was founded, you will need to integrate your applications with each other to implement your business processes.
This includes many different factors:
Many enterprise architectures are a bit messy—something like this:
Every company needs to untangle these spaghetti architectures. Depending on the decade, you either bought something like an ETL tool to build batch pipelines or an ESB to design a SOA. Some products also changed their names. Today, you are offered things like messaging middleware, an integration platform, a microservice gateway, or API management. The branding and product name do not matter. You always see the same picture as the solution: move away from your spaghetti architecture to a central integration box in the middle, like this:
Unfortunately, this rarely worked well in practice. Most SOA projects in the last two decades failed. Instead of using an ETL tool or ESB, enterprises are now moving to a streaming platform to solve this issue. Is this the next bubble on the market? Just a new term? Or did something really change to allow successful integration across an enterprise, whether you integrate legacy mainframes, standard applications like CRM and ERP systems, modern microservices built with any programming platform, or public cloud services? Why are companies now migrating to Apache Kafka to build this streaming platform? Why is everybody happy and talking about this at conferences, in tech talks, and in blog posts? How does it compare to an ESB or ETL tool?
The next sections answer these questions and explain the differences between the open source ecosystem of Apache Kafka and existing integration solutions, as well as the reasons behind them.
A streaming platform (you can also insert another buzzword here) leverages events as a core principle. You think in data flows of events and process the data while it is in motion.
Many concepts, such as event sourcing, or design patterns such as Enterprise Integration Patterns (EIPs), are based on event-driven architecture. The following are some characteristics of a streaming platform:
A streaming platform establishes huge benefits for your enterprise architecture:
Here are some generic scenarios for how you can leverage a streaming platform with the characteristics discussed above:
Producers and consumers of different applications are truly decoupled. They scale independently, at their own speed and according to their own requirements. You can add new applications over time, both on the producer and consumer side. Often, one event has to be consumed by many independent applications to complete the business process. For example, a hotel room reservation needs immediate payment fraud detection in real time, the ability to process the booking through all backend systems in near real time, and overnight batch analytics to improve customer 360, aftersales, hotel logistics, and other business processes.
While some processes need real-time processing, you also need to be capable of supporting batch processes. You will also need to re-consume data more often than you might expect at first, for example when an application has been down for some time, when A/B testing different versions of an application, when adding a new application that needs to consume the data from scratch, or when building different analytic models via machine learning based on the same data sets.
Think about some more use cases that you can build easily with a truly decoupled system that is still a well-integrated and scalable streaming platform:
Now you understand the added value of a truly decoupled, scalable streaming platform. So, do you have to introduce it as a central data platform for all of your applications?
Caution! No mature enterprise can do a big bang successfully. Legacy applications exist everywhere. Go step by step from pre-streaming to streaming platform. If you come from the mainframe ages, then you might even have batch and non-streaming applications forever (or realistically at least for the next 20-30 years). That’s fine. You just need to bring the events from these systems into the event-driven central nervous system.
The following shows the streaming maturity model that we use to identify the current situation and planning in large enterprises:
Where are you today?
Most traditional enterprises start their journey in the pre-streaming phase. That’s totally fine. The next section explains why almost any successful transformation into a streaming platform leverages the Apache Kafka ecosystem as a key architectural component.
Many people are familiar with Apache Kafka, as it has been a hugely successful open source project, created at LinkedIn for big data log analytics. That was the beginning of Kafka, and it is just one of many use cases today. Kafka evolved from a data ingestion layer to a real-time streaming platform for all the use cases previously discussed. Many projects focus on building mission-critical applications around Kafka. In these projects, Kafka has to be up and performant 24/7. If Kafka is down, the business processes stop working.
Kafka is unique because it combines messaging, storage, and processing of events all in one platform. It does this in a distributed architecture using a distributed commit log and topics divided into multiple partitions, as seen below:
With this distributed architecture, Kafka is different from existing integration and messaging solutions. Not only is it massively scalable and built for high throughput, but different consumers can also read data independently of each other and at different speeds.
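To make the idea of topics divided into partitions more concrete, here is a minimal in-memory sketch. It is a toy model, not the Kafka API: the class `PartitionedLog`, its methods, and the use of `zlib.crc32` as the hash are assumptions made for illustration (Kafka clients use their own partitioner). It only shows that events with the same key land in the same append-only partition and therefore keep their relative order.

```python
import zlib


class PartitionedLog:
    """Toy in-memory model of a topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash of the key; crc32 is a stand-in, not Kafka's partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append an event and return its (partition, offset) position."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


log = PartitionedLog(num_partitions=3)
positions = [
    log.append("hotel-42", "reserved"),
    log.append("hotel-7", "reserved"),
    log.append("hotel-42", "paid"),
]
```

Because the partition is derived from the key alone, both `hotel-42` events end up in one partition, with `reserved` stored before `paid`. Different keys may spread across partitions, which is what lets a real cluster distribute load.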
Applications publish data as a stream of events while other applications pick up that stream and consume it when they want. Because all events are stored, applications can hook into this stream and consume as required: in batch, in real time, or in near real time. This means that you can truly decouple systems and enable proper agile development. Furthermore, a new system can subscribe to the stream and catch up on historic data up to the present before existing systems are decommissioned.
The combination of messaging, storage, and processing in one distributed, scalable, fault-tolerant, high-volume, technology-independent streaming platform is the reason for the global success of Apache Kafka in so many large companies.
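The following sketch illustrates why stored events allow consumers to read independently. It is a simplified in-memory model under stated assumptions, not Kafka client code: `EventLog`, `poll`, and the consumer names are hypothetical. The key point is that reading does not delete anything; each consumer only tracks its own offset.

```python
class EventLog:
    """Toy append-only log; events stay available after they are read."""

    def __init__(self):
        self.events = []
        self.offsets = {}  # consumer name -> next offset to read

    def publish(self, event):
        self.events.append(event)

    def poll(self, consumer, max_events=None):
        """Return unread events for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        end = len(self.events)
        if max_events is not None:
            end = min(start + max_events, end)
        self.offsets[consumer] = end
        return self.events[start:end]


log = EventLog()
for booking_id in range(1, 6):
    log.publish({"booking": booking_id})

fraud_seen = log.poll("fraud-detection")              # reads everything right away
batch_seen = log.poll("nightly-batch", max_events=2)  # reads at its own pace
log.publish({"booking": 6})
late_seen = log.poll("new-analytics-app")             # joins late, starts at offset 0
```

In this sketch the late joiner receives all six bookings, while the fraud consumer will only see booking 6 on its next poll. A slow or newly added consumer never affects what the others have read.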
So, what is Apache Kafka and its open source ecosystem? Let’s take a high-level look at its components:
All these open source components are built on top of the core messaging and storage layer of Apache Kafka, and all leverage its features of high scalability, high volume/throughput, and failover.
Younger companies like Netflix, LinkedIn, and Zalando built their whole infrastructure on Kafka. Older companies are not that fortunate because they have plenty of mainframes, monoliths, and legacy technology. However, as discussed, a big bang replacement is not the right way to be successful. It’s a lot like transforming your home. Although it might make sense in theory to rebuild it from the ground up, oftentimes it is more practical to extend the house, change certain rooms, or redecorate.
Legacy apps are typically based on complex data formats like EDIFACT, use complex interfaces like CORBA, or are built with a hard-to-maintain and inflexible programming language like COBOL. You cannot simply turn them off, or cut them out and replace them. This has to be done step by step. Legacy and modern applications co-exist to run the existing business and add new offerings to augment it.
Innovate by integrating your old systems via a streaming platform, i.e., the Apache Kafka ecosystem. Use concepts like change data capture (CDC) and integration tools, such as an ESB or ETL tool, with great graphical tooling and connectors for legacy applications. With this foundation, you can build new applications with modern technologies, big data systems, machine learning, etc., natively around Apache Kafka, and at the same time keep access to your legacy events, which you still need for added value in new projects.
Real decoupling and technology independence also mean that you have dumb pipes and smart endpoints. If you build all the integration logic (or, even worse, some business logic) into the central integration layer, then all the scalability, agility, and independence of the different systems are gone. This is a key difference from traditional integration solutions, in which you put all the logic into the ESB as the middle layer. That creates dependency on this (proprietary) technology/API, as well as inflexibility.
Smart endpoints can be anything. You can leverage the Kafka ecosystem to build applications around Kafka with Kafka Streams, KSQL, or a Kafka client for a language such as Java, .NET, Python, or Go. You can also use any other application to integrate other applications with Kafka. The secret to long-term success is that the infrastructure is open to any technology and architectural pattern.
ETL and ESB tools have excellent tooling, including graphical mappings for doing complex integration with SOAP, EDIFACT, SAP BAPI, COBOL, etc. They are already running, paid for, and integrated. Therefore, existing MQ and ESB solutions, which already integrate with your legacy world, do not compete with Apache Kafka. Rather, they are complementary! Leverage them like you did in the past to integrate with the old world. You can use the following:
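A small sketch can illustrate the "dumb pipes, smart endpoints" idea. Everything here is hypothetical and in-memory: the `pipe` list stands in for the integration layer, and the function names, the JSON format, and the fraud limit are assumptions chosen for the example. The pipe only stores opaque bytes, while the format decision and all business rules live in the endpoints.

```python
import json

# The "pipe": stores opaque bytes and knows nothing about their meaning.
pipe = []


def publish(payload: bytes) -> None:
    pipe.append(payload)


# Smart endpoint 1: the producer decides the format (JSON here, by assumption).
def reservation_service(room, amount):
    publish(json.dumps({"room": room, "amount": amount}).encode("utf-8"))


# Smart endpoint 2: the fraud rule lives in the consumer, not in the pipe.
def fraud_check(payload: bytes, limit=1000):
    event = json.loads(payload)
    return event["amount"] > limit


# Smart endpoint 3: another consumer applies its own logic to the same bytes.
def revenue_total(payloads):
    return sum(json.loads(p)["amount"] for p in payloads)


reservation_service("101", 250)
reservation_service("202", 4800)

flagged = [json.loads(p)["room"] for p in pipe if fraud_check(p)]
total = revenue_total(pipe)
```

Changing the fraud limit or adding a third consumer requires no change to the pipe, which is the property that a logic-heavy middle layer tends to lose.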
Apache Kafka and its ecosystem are designed as a distributed architecture with many smart built-in features that allow high throughput, high scalability, fault tolerance, and failover. Let the product or service teams build their applications with Kafka Streams, KSQL, or any other Kafka client API. Integrate Kafka with ESB and ETL tools if you need their features for specific legacy integration. An ESB or ETL process can be a source or sink for Apache Kafka, like any other Kafka producer or consumer. Often, the integration with legacy systems using such a tool is already built and running anyway. Today, most of these tools also offer a Kafka connector because the market drives them this way. So, you just need to combine the existing integration with the Kafka connector, and there you have it: flexible, scalable, and highly available integration between legacy and future ecosystems through Kafka.
Apache Kafka is an open source streaming platform that allows you to build a scalable, distributed infrastructure that integrates legacy and modern applications in a flexible, decoupled way. It is already battle-tested for processing trillions of messages and petabytes of data per day. Simply leverage the Apache Kafka ecosystem in your enterprise architecture to make integration of your various systems successful and dynamic. But whatever you do, do not try to build an ESB around Kafka—it is an anti-pattern that will create inflexibility and unwanted dependencies. Instead, leverage the distributed architecture of the Apache Kafka ecosystem to build a flexible, event-driven streaming infrastructure with high throughput, high scalability, fault tolerance, and failover.
Source: Kai Waehner, Technology Evangelist at Confluent