The Decision: Apache Kafka or Amazon Kinesis?

Choosing between Apache Kafka and Amazon Kinesis can be a difficult task. While some argue that they are inherently different systems, I think most of the differences are overstated. This post is highlights the differences I found most notable in the two platforms in aims to help those who are choosing between the two.

Dependencies

Kafka relies on Apache Zookeeper. Apache Zookeeper is a distributed key-value store (at least on the surface). It's used to help manage configurations and service discovery for distributed systems. It's been around for a really long time and it's used in a great number of distributed platforms like Hadoop and Solr. Zookeeper sounds great, but it does have it's criticisms and tradeoffs. In my opinion, the criticism are mostly fair but its not a deal breaker. Contingent on your use case and experience it can be a downside operating Zookeeper, but I never found it too cumbersome.

Amazon leverages some of it's existing technology to build and run Kinesis. Instead of relying on Zookeeper Kinesis uses DynamoDB. Many of the people I've talked to about this difference see this as a notably change and improvement of Kinesis over Kafka. I can see the argument, but it appears to be a matter of opinion more than any empirical truth.

SDK Support

Kinesis is lightyears ahead in this area, at least in terms of breadth. The AWS SDK supports Android, Java, Go, .NET just to name a few. Further, the AWS CLI tools are also available. Many developers I've worked with complain about either the system of the AWS SDK or the documentation. Depending on the languages and platforms you use, those criticism may be fair. Overall though, being able to use Kinesis on several platforms is a huge plus.

Kafka pretty much only supports Java. There is third party support for other languages, but the example code and most of what you see built is using the Java SDK. For Java shops, this is awesome but it might be too much to integrate Java if you're a Python shop for example. There are tons of client libraries written that you can find on GitHub but YMMV in terms of quality.

Vendor Lock-in

Perhaps one of the most attractive things about Kafka is that it's open source. Many folks don't want a black box piece of technology, and they want to dig into the internals and make upstream contributions. That being said, there are some serious costs associated with open source software that operations and developers must consider when using it. You can be completely free from Vendor lock-in by using Kafka and managing it all on your own. Kinesis offers no such option, at a minimum you'll be using a product by AWS for AWS.

Confluent sells a managed version of Kafka that I've heard great things about but have not used. You can deploy and manage Kafka using their tooling. This may be the best of both worlds as you'll get to use the Kafka technology on top of a public cloud without having to really manage it. This is pretty new according to their product page so you might be in for a ride, but I've never heard a bad thing about Confluent so your risks are probably low.

A contrived example of using Kafka (or Kinesis)

Eco-system

Kinesis is not as robust of an ecosystem as Kafka, in large part due to the proprietary nature of the product. That being said, it's not very hard to develop connectors, sources and sinks for Kinesis. If you're in the Amazon ecosystem and don't really care about other technologies, you shouldn't really look any further. All of the Amazon data stores (S3, Redshift, DynamoDB, Elastic) are supported out of the box.

Kafka supports a wide number of technologies, including traditional RMDBS systems and streaming data processing frameworks like Apache Spark and Flink. Kafka has a lot of momentum and will just keep growing a lot faster in this area. I like this about Kafka, and it's a really compelling reason to use it.

Summary

Data Engineering

Building a Real-Time Bike-Share Data Pipeline with StreamSets, Kafka and MapD

Sep 10, 2018

Data Engineering

Sep 10, 2018

Data Engineering

Nov 13, 2017

Data Engineering

Thoughts on IFTTT

Nov 13, 2017

Data Engineering

Nov 13, 2017

Data Engineering

Nov 6, 2017

Data Engineering

Thoughts on Dremio

Nov 6, 2017

Data Engineering

Nov 6, 2017

Data Engineering

Blog

The Decision: Apache Kafka or Amazon Kinesis?

Dependencies

SDK Support

Vendor Lock-in

Eco-system

Summary