In a recent project i wanted to do text searches over a large unstructured dataset (100 GB) in memory and I was able to do it in Spark once I provisioned a machine with enough memory. I was able to do it quickly and efficiently, but I was bugged that I couldn't compress the data and had to spin up a master with that much memory. I did some googling and found myself spinning up a Solr server and a ElasticSearch server before finding succinct by AMPLab. Succinct is designed to do exactly what I wanted, take a large unstructured dataset and compress it in memory so I can query it quickly. It solves the problem of compression on unstructured data, and the ability to query from that. I was pretty enthused about using it so I decided to write this post for anyone interested in the area looking to get a started.
Compression is mostly valuable for taking up less space as you already knew. Some popular data formats that use compression in the Apache Hadoop ecosystem are Parquet, ORC and Avro. Succinct shares some similarities with these formats but it's designed to be compressed in memory versus on disk. Sparks encoders and serialization methods are designed to do something similar, but only work with specific types at the moment. Succinct provides not only the compression, but also the ability to query on that data. The compression guarantees plus the ability to query on the raw data is what makes succinct so special. You can read about the performances characteristics on their site.
The Data & API
Like I said in the introduction, I have a bunch of raw unstructured text in my file system and I want to load it into memory and search it. I'll load all those files into an RDD, compress it with succinct and then do some searches. You can find installation instructions on the projects github. I'm a fan of SBT, but many people use Maven and building with either works. I use spark 1.6.1 for this post as well.
I load the data into Spark:
From here, I do a search on the RDD and get a count of “Birthdays was the worst days." [This is a bunch of lyrics data I got my hands on a few weeks ago]
One thing to consider, is doing this is very time consuming. It takes a very long time to compress this data. Amplab says about 8GB per hour per core so consider that with the results.I pushed that 100 GB of data into 85 GB of memory and searches returned seemingly instantaneously. With plain Spark, I got worst performance (although it returned in just a few seconds) and took all 100GBs.
Since my data is just blobs of text, it's not really worth trying to serialize it down to something like a DataFrame and have to define a schema before hand. Once you go through the effort of this serialization you do have some handy facilities for doing searchs and filtering as seen in this example from the repo:
The next time you're doing raw text searches in Spark, consider using Succinct. It doesn't add very much cruft on top of the Spark API and it adds a whole lot of compression and lightning fast search. If it doesn't seem worth it so save 15GB of ram and gain some performance on these searches, then succinct is obviously not for you. With the expensive of cluster computing I'm always looking to gain efficiency and this has been a great gain.