
The Promise and Expense of Distributed In-Memory Data Processing

For the past few years, I’ve been an ardent believer in the promise of distributed in-memory data processing. The first time I took a Hive job, rewrote it in Spark, and saw it return 10 times faster, I was a believer. Since then, I’ve done a lot more impressive things in memory; whether it’s Alluxio, Apache Spark, or MemSQL, these systems offer a lot of good things. As I’ve explored more tools to manage and process data, I’ve found myself in the Apache Ignite ecosystem. If you’re not familiar with Apache Ignite, take some time to read about it or watch this excellent tutorial. Apache Ignite is great, and I’ll cover some of the cool things about it in this post. This post is not specifically about Ignite, though; it’s about how distributed in-memory computing is both a promise and an expense.

Distributed In-Memory Cache

The Promise

A distributed in-memory cache provides a great way to reduce latency and computational overhead in many systems. I’ve used Memcached and Couchbase as caching systems in the past, and they do a fantastic job of their intended use case. Simply put, most caches are key-value stores that allow quick lookups, typically on high-performing hardware. Apache Ignite’s “Data Grid” platform takes it a step further, offering a purely in-memory cache with a SQL API on top of it. It is pleasant to use, and it allows a wider array of applications to be developed on top of caching.
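To make that concrete, here’s a minimal sketch of the key-value-plus-SQL combination using Ignite’s Java API, assuming a single local node; the cache name and data are illustrative:

    import java.util.List;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.SqlFieldsQuery;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class CacheSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Indexing the key/value types is what turns the cache into a SQL table.
                CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("users");
                cfg.setIndexedTypes(Integer.class, String.class);

                IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);

                cache.put(1, "alice");            // plain key-value writes...
                System.out.println(cache.get(1)); // ...and fast point lookups

                // ...and ad-hoc SQL over the very same data.
                List<List<?>> rows = cache
                    .query(new SqlFieldsQuery("SELECT _key, _val FROM String"))
                    .getAll();
                System.out.println(rows);
            }
        }
    }

The same cache serves both point lookups and ad-hoc SQL, which is the real draw of the Data Grid.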

The Expense

The negative side of caching is handling the invalidation logic and keeping data in sync with the database. This process can be largely automated, or it can be a complete nightmare, depending on the system you implement. Additionally, the memory-centric hardware required to run a good in-memory cache is quite expensive, and running it all in a distributed fashion adds operational overhead. For the right use case, a cache can be a lifesaver, and for architectures that already leverage distributed systems and have the operational support, it can be a fantastic alternative.
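The sync logic I’m describing is usually some variant of the cache-aside pattern. Here’s a minimal sketch in plain Java, with hypothetical dbRead and dbWrite functions standing in for the real database:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.BiConsumer;
    import java.util.function.Function;

    // Cache-aside in miniature: the cache is checked first, the database stays the
    // source of truth, and writes invalidate the cached copy to avoid serving stale data.
    public class CacheAside<K, V> {
        private final Map<K, V> cache = new ConcurrentHashMap<>(); // stand-in for Memcached/Ignite
        private final Function<K, V> dbRead;     // hypothetical database read
        private final BiConsumer<K, V> dbWrite;  // hypothetical database write

        public CacheAside(Function<K, V> dbRead, BiConsumer<K, V> dbWrite) {
            this.dbRead = dbRead;
            this.dbWrite = dbWrite;
        }

        public V get(K key) {
            // On a miss, fall through to the database and remember the answer.
            return cache.computeIfAbsent(key, dbRead);
        }

        public void put(K key, V value) {
            dbWrite.accept(key, value); // write to the source of truth first...
            cache.remove(key);          // ...then invalidate so the next read refetches
        }
    }

Even in this toy version you can see where the pain lives: every write path has to remember to invalidate, and anything that touches the database without going through this class leaves the cache stale.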

Notable Projects: Redis, Apache Ignite Data Grid, Infinispan

In Memory Distributed File System

The Promise

Over the years, I’ve worked professionally with Hadoop, Ceph, and GlusterFS, becoming acquainted with distributed file systems. They offer a cheaper and more robust alternative to traditional databases and data warehouses. The Hadoop Distributed File System is perhaps the best known and most widely used, and a whole ecosystem has sprung up around building applications on top of Hadoop. Most of these file systems live on disk, but Alluxio and Apache Ignite’s distributed file system are both implemented in memory. These give you all the benefits of an on-disk file system with much better performance, plus the ability to create your own caching layer.
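Part of the appeal is that Alluxio speaks the Hadoop file system API, so HDFS-style code mostly just changes its URI scheme. A minimal sketch, assuming the Alluxio client jar is on the classpath; the master host and file path are hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AlluxioReadSketch {
        public static void main(String[] args) throws Exception {
            // Same Hadoop FileSystem API as HDFS; only the scheme and host change.
            // (alluxio-master and the file path are hypothetical.)
            Path path = new Path("alluxio://alluxio-master:19998/data/events.log");

            try (FileSystem fs = FileSystem.get(URI.create(path.toString()), new Configuration());
                 BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(reader.readLine()); // served from memory when the block is cached
            }
        }
    }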

The Expense

Distributed file systems are complex, and adding an in-memory layer, with its compression and limitations, adds operational overhead. You wouldn’t replace an on-disk distributed file system with an in-memory one; rather, you would extend it. One thing of note is that distributed file systems were designed to run on commodity hardware and to be a cheaper alternative to databases. This is simply not true of in-memory distributed file systems: they are expensive, because memory costs much more than disk.

Notable Projects: Alluxio, Apache Ignite In-Memory File System

Distributed In-Memory Compute

The Promise

Apache Spark is perhaps the most well-known project that performs distributed in-memory computation. With its long list of APIs and ergonomics built around the Resilient Distributed Dataset (RDD), most of the complexity of computing this way is abstracted away from the end user. This has been good for my career and happiness, as I’ve been able to use a wider array of systems and expect the same outcomes because of Spark. Once a cluster of machines is running, submitting new jobs to it is trivial. Also, notebook environments like Jupyter, Zeppelin, and Beaker are fantastic for data science and reproducibility.
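For a taste of how little of that complexity leaks through, here’s a minimal sketch using Spark’s Java RDD API; the input path is hypothetical, and cache() is the one line that pins the data in cluster memory:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WordStatsSketch {
        public static void main(String[] args) {
            // local[*] keeps the sketch self-contained; on a real cluster the
            // master would come from spark-submit instead.
            SparkConf conf = new SparkConf().setAppName("word-stats").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> words = sc.textFile("data/corpus.txt") // hypothetical input
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .cache(); // pin the RDD in cluster memory

                long total = words.count();               // first action materializes the cache
                long distinct = words.distinct().count(); // later actions read from memory
                System.out.println(total + " words, " + distinct + " distinct");
            }
        }
    }

Nothing in that code mentions partitions, shuffles, or node failures; the RDD abstraction handles all of it.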

Even better, managed platforms like Amazon Athena make this process easier still by providing a way to run distributed queries without worrying about standing up your own hardware.

The Expense

The expense of computing this way is mostly found in the hardware needed to do it at scale. Both the dollar cost and the operational overhead of setting up a cluster, running Mesos or YARN, and handling failures are non-trivial. As established already, memory-centric machines are among the most expensive to provision on many compute platforms. Becoming good at Mesos is a profession in and of itself, and babysitting long-running jobs is just part of doing this work.

Notable Projects: Apache Spark, Apache Flink, Apache Beam, Apache Ignite, PrestoDB

Distributed In-Memory Database

The Promise

We’ve already established that querying data, doing computation, and almost everything else is faster in memory than on disk, and it’s no different for a database. We could store all our data in memory and access it with SQL; MemSQL and Apache Ignite both offer this capability. The appeal should be obvious: take MySQL or Postgres, shove it all in memory, and distribute it behind the same APIs. It sounds like a good time, and it is a good time.
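Because these databases speak SQL over standard drivers, trying one is cheap. Here’s a minimal sketch against Ignite’s JDBC thin driver, with a hypothetical host; MemSQL works much the same way through a MySQL-compatible driver:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class InMemorySqlSketch {
        public static void main(String[] args) throws Exception {
            // Ignite's JDBC thin driver makes the in-memory database look like any
            // other SQL endpoint (the host is hypothetical; 10800 is the default port).
            try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://ignite-host:10800");
                 Statement stmt = conn.createStatement()) {

                stmt.executeUpdate("CREATE TABLE IF NOT EXISTS city (id INT PRIMARY KEY, name VARCHAR)");
                stmt.executeUpdate("INSERT INTO city (id, name) VALUES (1, 'Toronto')");

                try (ResultSet rs = stmt.executeQuery("SELECT name FROM city WHERE id = 1")) {
                    while (rs.next())
                        System.out.println(rs.getString("name")); // served entirely from memory
                }
            }
        }
    }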

The Expense

In-memory databases require tons of memory, since you are mapping data from disk into memory one-to-one. Additionally, they need memory just to operate, and they tend to need a lot of it. Operationally, neither MemSQL nor Apache Ignite has any glaring “gotchas”, and both are fairly beginner friendly. In my opinion, these databases are better suited to distinct use cases, like an analytics layer or real-time needs, and not to the application layer.

Notable Projects: MemSQL, Apache Ignite IMDB

Conclusions

Distributed in-memory data processing is here to stay, and we should see more projects come up that take advantage of it. Overall, the tooling is very good and relatively mature. If cost is a concern, there are often better options than these solutions, and even if cost is not a concern, there are trade-offs to weigh with memory-centric solutions in each of these domains. Consider each carefully, experiment, and test the ergonomics!