Posted on March 8, 2018
In our era of big data, your IT infrastructure may be taxed by the influx of data from a wide variety of sources. On top of that, customers expect to see their data in real time, so your servers need to process and display data quickly. Apache Kafka, a relatively new technology dating from 2011, allows you to do just that.
Apache Kafka is a distributed streaming platform that enables companies to create real-time data feeds. It’s used by companies like Uber, Twitter, Airbnb, Yelp, and over 30% of today’s Fortune 500 companies. For example, by integrating diverse kinds of data such as likes, page clicks, searches, orders, shopping carts, and inventory, Apache Kafka can help feed data in real time into a predictive analytics engine to analyze customer behavior.
Now that Apache Kafka has reached a stable 1.0 version, more companies are adopting the technology as the backbone of their IT infrastructure. Increasingly, CTOs are making real-time architecture a priority and working to reduce the wait for data to become available. Apache Kafka-related questions on Google Search and on tech sites like Stack Overflow and GitHub have also skyrocketed in recent years, a sign that it is a trending topic.
So what are the benefits of Apache Kafka, why should your company adopt it, and what skills will your IT team need to successfully implement it?
As companies deliver an increasing amount of data from different sources (e.g. website, user interactions, financial transactions) to a wide range of target systems (e.g. databases, analytics, email systems), developers have to write an integration for each source-target pair. For example, if you have 4 source systems and 6 target systems, your developers would have to write code for 4 × 6 = 24 integrations. This is a cumbersome process, and a slow and error-prone way to deliver data. Here are the four key benefits of using Apache Kafka.
Previously, data transformations from external source systems were done in batches, often at night. Apache Kafka replaces this slow, multi-step process by acting as an intermediary: it receives data from source systems and then makes this data available to target systems in real time. What’s more, this extra traffic is far less likely to bring down your existing systems, because Apache Kafka runs on its own separate set of servers (called an Apache Kafka cluster).
Essentially, Apache Kafka reduces the need for multiple integrations, since all your data goes through Apache Kafka. Rather than having your developers code a separate integration for every pair of systems, you only have to create one integration with Apache Kafka for each producing system and one for each consuming system.
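To put numbers on that, here is a minimal Python sketch of the arithmetic, using the 4 source systems and 6 target systems from the example above. The function names are made up for this illustration and are not part of any Kafka library.

```python
def point_to_point_integrations(sources: int, targets: int) -> int:
    """Every source system is wired directly to every target system."""
    return sources * targets


def hub_integrations(sources: int, targets: int) -> int:
    """Each system integrates once with a central hub such as Apache Kafka."""
    return sources + targets


if __name__ == "__main__":
    sources, targets = 4, 6  # the example used in this article
    print(point_to_point_integrations(sources, targets))  # 24
    print(hub_integrations(sources, targets))             # 10
```

The gap widens as you add systems: the direct approach grows with the product of the two counts, while the hub approach grows with their sum.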
By decoupling your data streams, Apache Kafka lets you consume data when you want it. Without the need for slow integrations, Apache Kafka can bring latency (how long it takes for each data point to be delivered) down to around 10 milliseconds, a roughly 10x decrease or more compared to other integrations. This means you can deliver data quickly and in real time. Apache Kafka can also scale horizontally to hundreds of brokers (or servers) within a cluster to manage big data.
Some companies have a high load of millions of data points per second going through Kafka. For example, Uber uses Kafka to feed car position data into their surge pricing computation model in real time.
As all your data is centralized in Apache Kafka, access to data for any team becomes easier. For example, in the past, your fraud team may have had to engage with the web team to get a specific type of user data since they were run on different target systems. Now your fraud team will be able to access the user data directly via Apache Kafka, alongside other feeds such as financial data or website interactions. Simple, right?
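To make the ideas of decoupling and shared access a little more concrete, here is a toy Python model. It is only a sketch under simplifying assumptions: the `ToyLog` class, its method names, and the team names are invented for this example and are not the Kafka client API. It shows one shared feed where each team keeps its own read position, so one team reading (or lagging behind) does not affect another.

```python
from collections import defaultdict


class ToyLog:
    """An append-only list of records plus one read position per reader group.

    Teaching model only; this is not how you talk to a real Kafka cluster.
    """

    def __init__(self) -> None:
        self._records = []                 # everything ever written, in order
        self._offsets = defaultdict(int)   # next index to read, per group

    def append(self, record: str) -> None:
        self._records.append(record)

    def poll(self, group: str, max_records: int = 10) -> list:
        """Return the next unread records for this group and advance its position."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


if __name__ == "__main__":
    feed = ToyLog()
    for event in ["login:alice", "order:bob", "login:carol"]:
        feed.append(event)

    print(feed.poll("fraud-team"))               # reads all three records
    print(feed.poll("web-team", max_records=1))  # reads only the first one
    feed.append("order:dave")
    print(feed.poll("fraud-team"))               # only the new record
    print(feed.poll("web-team"))                 # the three it has not seen yet
```

Because the data stays in the log after it is read, a new team can be added later without touching the systems that produce the data.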
Once you understand the benefits and decide to adopt Apache Kafka, your IT team will need to acquire key skills to set up and manage Apache Kafka at your organization. Here are some of the critical skills your team will need.
How to learn, set up, and configure Apache Kafka. Apache Kafka is already built, open source, and free. So it’s more about first acquiring the skills, then setting up Apache Kafka and configuring it for your systems. My course Apache Kafka Series: Learn Apache Kafka for Beginners is a good place for your team to start learning the technology. I cover the Apache Kafka ecosystem, what some target architectures may look like, and fundamental Kafka concepts such as topics, partitions, replication, brokers, producers, consumer groups, ZooKeeper, delivery semantics, and more. My course also offers hands-on practice so your team can gain some practical experience using Apache Kafka.
Once you’re ready, I recommend my more advanced course that teaches Kafka Cluster Setup and Administration. In addition, I also offer consulting services to help companies design, set up, and configure Apache Kafka.
Kafka Streams and Kafka Connect. If you want to simplify integrations, your team will also need some Kafka-specific skills like Kafka Streams and Kafka Connect. These are the more advanced Kafka concepts and frameworks your team will need to build reliable, production-ready integrations over time. As a consultant, I usually show how to build one or two integrations, but your team would have to scale that to the rest of the integrations.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. You can already leverage tons of existing connectors written for you at: confluent.io/product/connectors/. My course Kafka Connect teaches you all the skills you will need to implement and leverage these connectors.
The Kafka Streams library is used to process, aggregate, and transform your data within Kafka. My course Kafka Streams for Data Processing teaches how to use this data processing library on Apache Kafka, through several examples that demonstrate the range of possibilities.
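To give a feel for what "process, aggregate, and transform" means on a stream, here is a small conceptual sketch in plain Python. Kafka Streams itself is a Java library, and this example does not use it or imitate its API; the function name and the sample events are invented for illustration. The point is only the pattern: handle each event as it arrives, clean it up, and keep a running total.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Normalize each raw event, then emit an updated count for its type."""
    counts = Counter()
    for raw in events:
        event_type = raw.strip().lower()  # transform: normalize the value
        if not event_type:                # process: skip empty events
            continue
        counts[event_type] += 1           # aggregate: running count per type
        yield event_type, counts[event_type]


if __name__ == "__main__":
    clicks = ["Click", "search", " click ", "", "ORDER", "click"]
    for update in running_counts(clicks):
        print(update)
```

Each update is emitted as soon as its event is seen, rather than after a nightly batch, which is the main difference from the older approach described earlier.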
First, don’t migrate your whole system to Apache Kafka at once. Instead, start with a small non-critical project. For example, don’t change the backbone of your financial systems, but change something less important such as your email notification system. Second, one of the biggest mistakes I see is companies spending months trying to build a reliable Apache Kafka cluster. Instead, I would recommend starting with managed services or hiring a consultant to set up a small project on Apache Kafka. This enables you to get started right away on the development side and helps make the case for why Apache Kafka is critical for your company. From there, you will be able to scale, onboard more data and projects, and enable your company to react to events in real time more effectively.