The How and Why of Spark and Couchbase

I could spend a lot of time gushing about Couchbase and the details of its architecture and implementation. I've grown to really love Couchbase as a NoSQL store, but my love for it isn't a good reason to write a blog post. I think a great many people using Couchbase for analytical purposes can benefit from combining it with Spark. This post is a quick rundown of some of the features I often use when working with the two. It's more my own notes than any wider statement.

Housekeeping

I won't go into too much detail about Couchbase, but it's a JSON document store that is easy to distribute and has some other great features. I would recommend reading the docs for more details.

Type Safe Serialization

A big annoyance with JSON is serialization. If you have data that looks like this:
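
As a stand-in, assume a document like the one below. The field names (`name`, `age`, `city`) and the values are hypothetical, and I've written it as a Scala string so it can be pasted straight into a REPL or a test:

```scala
object SampleData {
  // Hypothetical sample document; field names and values are illustrative only.
  val personJson: String =
    """{
      |  "name": "Ada",
      |  "age": 36,
      |  "city": "London"
      |}""".stripMargin
}
```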

Using it for analytical purposes requires some kind of query language or handwritten code to loop through the objects, but neither gives you many guarantees about types. It's safer to know what you're dealing with, and that means converting the JSON into a well-defined type. In Scala we can use case classes and Spray JSON's implicit formats to accomplish this:
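
Here is a minimal sketch of that idea. The `Person` case class and its fields are my own assumptions for illustration; the Spray JSON pieces (`DefaultJsonProtocol`, `jsonFormat3`, `parseJson`, `convertTo`) are the library's standard building blocks:

```scala
import spray.json._

// Hypothetical domain type; the fields are assumptions for illustration.
case class Person(name: String, age: Int, city: String)

// Spray JSON builds a reader/writer for the case class from its three fields.
object PersonProtocol extends DefaultJsonProtocol {
  implicit val personFormat: RootJsonFormat[Person] = jsonFormat3(Person)
}

object ParsePerson {
  import PersonProtocol._

  // Parses a JSON string and converts it to a Person using the implicit format.
  def parse(raw: String): Person = raw.parseJson.convertTo[Person]

  def main(args: Array[String]): Unit = {
    val raw = """{"name": "Ada", "age": 36, "city": "London"}"""
    println(parse(raw))
  }
}
```

If a field has the wrong type, for example `"age": "36"`, `convertTo` throws a `DeserializationException` instead of letting a bad value slip through, which is the guarantee we're after.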

From here, converting JSON to a Spark dataset is fairly trivial:
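
A rough sketch of what that can look like, assuming a hypothetical `people` bucket whose documents carry `type`, `name`, `age`, and `city` fields. The `Person` type and the N1QL statement are made up for illustration, and the connector calls (`couchbaseQuery`, `N1qlQuery.simple`) are written as I understand the Spark Couchbase Connector 2.0 API:

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}
import spray.json._

// Hypothetical domain type and its Spray JSON format.
case class Person(name: String, age: Int, city: String)

object PersonProtocol extends DefaultJsonProtocol {
  implicit val personFormat: RootJsonFormat[Person] = jsonFormat3(Person)
}

object JsonToDataset {
  // Turns raw JSON strings into a typed Dataset; fails fast on a type mismatch.
  def toDataset(spark: SparkSession, raw: RDD[String]): Dataset[Person] = {
    import spark.implicits._
    import PersonProtocol._
    spark.createDataset(raw.map(_.parseJson.convertTo[Person]))
  }

  // Assumes the session was built with spark.couchbase.nodes and
  // spark.couchbase.bucket.people set; the bucket name is hypothetical.
  def fromCouchbase(spark: SparkSession): Dataset[Person] = {
    val query = N1qlQuery.simple(
      "SELECT name, age, city FROM `people` WHERE type = 'person'")
    // Each row's value is a JSON object; serialize it back to a string for Spray.
    val raw = spark.sparkContext.couchbaseQuery(query).map(_.value.toString)
    toDataset(spark, raw)
  }
}
```

Keeping `toDataset` separate from the Couchbase call means the conversion can be exercised against a local RDD of strings without a cluster.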

We can go safely from a JSON string to a dataset of rows and columns that have properly defined types with minimal effort, allowing for a natural pipeline from Couchbase to Spark.

You might say, “Well, you will have to hand-write code to do these implicit conversions,” which is true. Alternatively, you can do a loose conversion to a list of JSON documents and convert the schema afterward, or use some of Spark's built-in facilities:
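
One such facility is schema inference in `spark.read.json`. The sketch below uses local, made-up JSON strings as a stand-in for documents pulled from Couchbase, and uses the `RDD[String]` overload available in Spark 2.0.x:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object InferSchema {
  // Lets Spark scan the JSON strings and infer column names and types.
  def infer(spark: SparkSession, raw: RDD[String]): DataFrame =
    spark.read.json(raw)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("infer-schema")
      .getOrCreate()

    // Hypothetical documents standing in for data read from Couchbase.
    val raw = spark.sparkContext.parallelize(Seq(
      """{"name": "Ada", "age": 36, "city": "London"}""",
      """{"name": "Grace", "age": 45, "city": "New York"}"""
    ))

    val df = infer(spark, raw)
    df.printSchema()
    df.filter(df("age") > 40).show()
    spark.stop()
  }
}
```

The trade-off is that the types are whatever Spark infers (whole numbers come back as `long`, for instance), so you give up the explicit contract a case class provides.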

N1QL

N1QL is the query language behind Couchbase, allowing you to write SQL-like queries over JSON documents. There have been other attempts at this, but none implemented as well as N1QL, in my opinion. If I have data that looks like this:
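
For illustration, assume a flat document shaped like the airline records in Couchbase's `travel-sample` bucket. The values here are made up, and the document is written as a Scala string to keep the examples in one language:

```scala
object SampleAirline {
  // Hypothetical airline document; values are illustrative only.
  val airlineJson: String =
    """{
      |  "type": "airline",
      |  "name": "Example Air",
      |  "callsign": "EXAMPLE",
      |  "country": "France"
      |}""".stripMargin
}
```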

I could query it with N1QL in the following way:
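
A sketch of such a query run through the connector, assuming documents with `type`, `name`, and `country` fields in the `travel-sample` bucket. The statement is my own example, and `couchbaseQuery` is used as I understand the Spark Couchbase Connector 2.0 API:

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object AirlineQuery {
  // Plain N1QL: reads like SQL, but runs over JSON documents.
  val statement: String =
    "SELECT name, country FROM `travel-sample` " +
      "WHERE type = 'airline' AND country = 'France' " +
      "ORDER BY name LIMIT 10"

  // Assumes the context was configured with spark.couchbase.nodes and
  // spark.couchbase.bucket.travel-sample; returns each row as a JSON string.
  def run(sc: SparkContext): RDD[String] =
    sc.couchbaseQuery(N1qlQuery.simple(statement)).map(_.value.toString)
}
```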

That may not be very impressive because there aren't any nested structures to get through. You can also search inside arrays in N1QL, which makes for some interesting queries like the following:
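
As a sketch, assume route documents that carry a `schedule` array of objects with a `utc` departure time, as the `travel-sample` routes do. N1QL's `ANY ... IN ... SATISFIES ... END` matches a document when at least one array element passes the condition. The statement is my own example, and the connector call is written as I understand its 2.0 API:

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object EveningRoutes {
  // Matches routes with at least one scheduled departure after 19:00 UTC.
  val statement: String =
    "SELECT airline, sourceairport, destinationairport FROM `travel-sample` " +
      "WHERE type = 'route' " +
      "AND ANY s IN schedule SATISFIES s.utc > '19:00' END " +
      "LIMIT 10"

  // Assumes spark.couchbase.nodes and spark.couchbase.bucket.travel-sample are set.
  def run(sc: SparkContext): RDD[String] =
    sc.couchbaseQuery(N1qlQuery.simple(statement)).map(_.value.toString)
}
```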

I find N1QL intuitive, especially in the Spark SQL context where you're already writing SQL-like syntax. I used N1QL in the example in the previous section without explaining it. If you come from a SQL background, it is a bit more intuitive than the traditional Couchbase get by document ID.

Streaming

You can also stream data from Couchbase instead of querying it. This only makes sense if you have analytical needs driven by updates to the database.

Setting up this code is straightforward:
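
A minimal sketch, assuming a local Couchbase node and the `travel-sample` bucket. The node address, bucket, and batch interval are my assumptions, and `couchbaseStream`, `FromBeginning`, `ToInfinity`, and `Mutation` are used as I understand the connector's 2.0 streaming package, so check them against the docs for your version:

```scala
import com.couchbase.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MutationCounter {
  def main(args: Array[String]): Unit = {
    // Node address and bucket name are assumptions for a local setup.
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("couchbase-stream")
      .set("spark.couchbase.nodes", "127.0.0.1")
      .set("spark.couchbase.bucket.travel-sample", "")

    // Collect changes into 5-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc
      .couchbaseStream(from = FromBeginning, to = ToInfinity)
      .filter(_.isInstanceOf[Mutation]) // keep document writes, drop other messages
      .count()                          // number of mutations per batch
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This replays every existing document as a mutation first and then keeps running, printing how many documents changed in each batch.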

You can use this as an analytics layer to watch for abnormalities in the data or to trigger other events or pipelines. You can also use Spark Streaming to write data from a stream to Couchbase, as outlined in the docs.

Full-Text Search

You can use full-text search in Couchbase, similar to Elasticsearch. Full-text search is much less precise than a SQL query, but it's appropriate for many use cases. You can use full-text search without much effort:

From here you can use pattern matching to ensure correct serialization and so on. In addition to a simple search like this, there are many complex searches you can do in Couchbase as well.

That's All

Couchbase isn't a good fit for every application, but it's being adopted pretty rapidly, and it's good to know that it works nicely with Spark. I've been using it for a few months now and have grown to like it quite a bit.

Notes

Examples are done with Spark 2.0.1 and Spark Couchbase Connector 2.0.
