Apache Kafka And Zookeeper Now Supported On IBM i
September 9, 2020 Alex Woodie
IBM quietly added support for two open source technologies, Apache Kafka and Apache Zookeeper, on IBM i. Both projects are critical elements of emerging distributed computing frameworks in the big data space, although they have very different uses.
Apache Zookeeper is a core underlying technology for enabling distributed computing and ensuring that applications running atop Zookeeper are highly resilient. The software functions as a centralized service for keeping track of nodes in a cluster, and for synchronizing data and services among the different nodes.
All of the nodes in a Zookeeper cluster have access to a shared hierarchical namespace. If any one of the nodes in a cluster becomes unavailable, Zookeeper detects the failure and coordinates a failover, so that the services and data running atop the failed node are migrated to an available node.
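The mechanics behind that failover can be sketched in a few lines. The toy class below is illustrative only (a real client would use a library such as kazoo, and every name here is invented for the sketch): it models Zookeeper's hierarchical namespace as paths, marks some znodes "ephemeral" so they live only as long as the session that created them, and fires watches when a session dies, which is the primitive that failover and leader election are built on.

```python
# Toy model of Zookeeper's hierarchical namespace and ephemeral znodes.
# Illustrative sketch only -- real applications use a Zookeeper client
# library (e.g. kazoo); all names here are made up for this example.

class ToyZookeeper:
    def __init__(self):
        self.znodes = {}   # path -> (value, owner_session)
        self.watches = {}  # path -> list of callbacks

    def create(self, path, value, session=None):
        # A non-None session marks the znode "ephemeral": it exists
        # only as long as the session that created it stays alive.
        self.znodes[path] = (value, session)

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def close_session(self, session):
        # When a node's session dies, its ephemeral znodes vanish and
        # watchers are notified -- the basis of automated failover.
        gone = [p for p, (_, s) in self.znodes.items() if s == session]
        for path in gone:
            del self.znodes[path]
            for cb in self.watches.pop(path, []):
                cb(path)

zk = ToyZookeeper()
events = []
zk.create("/app/leader", "node-1", session="node-1")
zk.watch("/app/leader", lambda path: events.append(f"{path} gone, elect new leader"))
zk.close_session("node-1")  # simulate node-1 crashing
print(events)               # -> ['/app/leader gone, elect new leader']
```

In real Zookeeper, the surviving nodes race to recreate the vacated znode, and whichever succeeds becomes the new leader.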
Zookeeper was originally developed to help coordinate nodes in Apache Hadoop clusters. Hadoop, of course, is the distributed data storage and processing system that emerged from Yahoo in the mid-2000s to handle the huge amount of data that made up the search engine’s index of the Web.
Zookeeper began as a sub-project of Hadoop at the Apache Software Foundation, but has since become a top-level ASF project. In recent years, it has been adopted as the core underlying clustering technology by a number of distributed computing projects at the ASF, including HBase, Hive, Solr, NiFi, Druid, and Kafka, which are commonly considered part of the "Hadoop family" of products.
Apache Kafka, meanwhile, can best be thought of as a next-generation message bus for event data. While it's commonly linked to Hadoop and is included in Hadoop distributions, it really lives outside of the Hadoop family and introduces an entirely new way of storing and processing event data.
Kafka was developed about 10 years ago by engineers at LinkedIn to handle the social media company’s fast-growing data collection system. Every time a LinkedIn user does something on the website or mobile app, such as click on a post or accept an invitation to connect, it generates an event in LinkedIn’s system.
The company, which is now owned by Microsoft, had been using a traditional batch-oriented message bus to handle this data, but the engineers found that it was unable to keep up with data volumes. The LinkedIn engineers envisioned an entirely new type of platform that would treat event data as a first-class citizen, as opposed to the second-tier status that events get in a traditional relational database.
Sitting atop Zookeeper, Kafka functions as a distributed system for storing and processing event data. The event data is sourced from components called "producers," which write data into categories called "topics." Users (or applications) can subscribe to these topics and receive the stream of data through a component called a "consumer."
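The producer/topic/consumer model described above can be sketched with a minimal in-memory example. This is illustrative only, not Kafka's actual API: a real deployment would use a client library (such as kafka-python) against a running broker, and the class and method names below are invented for the sketch. The key idea it demonstrates is that a topic is an ordered, retained log, so multiple consumers can independently read or replay the same stream from any offset.

```python
# Minimal in-memory sketch of Kafka's producer/topic/consumer model.
# Illustrative only: real applications talk to a broker via a client
# library; all names here are invented for this example.

from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered log of events

    def send(self, topic, event):
        # Producers append events to a named topic (an ordered log).
        self.topics[topic].append(event)

    def consume(self, topic, offset=0):
        # Consumers read from a given offset; because the log is
        # retained, many consumers can replay the same stream.
        return self.topics[topic][offset:]

broker = ToyBroker()
# A "producer" writing LinkedIn-style activity events:
broker.send("clicks", {"user": "alice", "action": "click_post"})
broker.send("clicks", {"user": "bob", "action": "accept_invite"})

print(broker.consume("clicks"))            # both events, in order
print(broker.consume("clicks", offset=1))  # a consumer resuming mid-stream
```

Real Kafka adds partitioning, replication across brokers, and durable storage on top of this basic log abstraction.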
As a publish/subscribe system for high-volume data, Kafka clusters can be used for extract, transform, and load (ETL) workloads, as well as for real-time analytics systems. Kafka clusters are composed of multiple servers, or "brokers," and a single cluster can scale well into the petabyte range.
Kafka’s scalability was put to use at LinkedIn. By 2011, when the company first implemented Kafka, LinkedIn users were generating about 1 billion messages per day. By 2015, the company was generating 1 trillion events per day.
Most of the Silicon Valley Web giants have adopted the open source Kafka software to manage large flows of event data, including Netflix, Uber, Pinterest, and Airbnb. To service this emerging ecosystem, the original developers of Kafka at LinkedIn formed a spin-off called Confluent, which continues to lead development of Kafka and host its own cloud-based version.
While Kafka may be new to IBM i, the software has certainly seen its share of IBM i data over the years. Companies like Precisely (formerly Syncsort) and Attunity (now owned by Qlik) have developed connectors to pump data from IBM i and mainframe sources into the Kafka bus.
Because of the power of Kafka to create extensible data pipelines and to perform real-time transformations on the data flowing through those pipelines, Kafka has emerged as a key architectural element in next-generation big data analytics systems. Many organizations use Kafka to pump data from source systems into data warehouses and cloud-based data lakes, such as Google Cloud's BigQuery, Snowflake, Microsoft Azure Synapse Analytics (formerly Azure SQL Data Warehouse), and Amazon Web Services Redshift. AWS, for its part, offers its own pub/sub system, called Kinesis.
It’s unclear exactly why IBM added support for Kafka and Zookeeper, or exactly how these technologies will run on IBM i. (We hope to connect with IBM in the near future to get answers to these questions.) In any event, these are two of the most impactful projects in the open source big data community, along with Apache Hadoop and Apache Spark, so it’s good to see IBM taking steps to keep up with the wider data world, as it has recently done with open source databases like MongoDB, Apache Cassandra, and PostgreSQL.
It’s also good to see that IBM is providing professional support for Kafka and Zookeeper on IBM i. IBM recently added these projects to the list of open source projects it supports on IBM i through its Technology Support Services (TSS) program. Check out the website at www.ibm.com/support/pages/open-source-support-ibm-i to see for yourself.