Despite having an SEO-hostile name, h2o.ai is a pretty cool company. They have developed a great open source, plug-and-play data science platform in h2o. They have some other noteworthy projects and, of course, Sparkling Water, the subject of this post. Sparkling Water is essentially the h2o APIs on top of Spark, allowing h2o to take advantage of Spark's distributed computing model. That being said, is it worth loading another dependency when Spark's MLlib is adequate for most machine learning needs? I went through this exercise a few weeks ago, and this post is mostly my notes with some added illustration and some code.
Cost vs Benefit
Using h2o isn't free in terms of complexity. I've jotted down the costs and benefits of using it.
Costs:
- A large dependency
- Added overhead of additional data types on top of Spark's data types (H2OFrames)
- Steeper learning curve
Benefits:
- Moar algorithms
- Performance (sometimes)
I'm sure a lot more can be said about the differences, but these are the things I noticed in a few days of playing around with h2o. I'm going to spend the rest of the post trying to illustrate them.
Generalized Linear Regression
We'll do a linear regression with Sparkling Water on a small dataset. I’m less concerned about the scientific rigor of doing this and more concerned with the ease of using the APIs.
Fortunately, you can include everything you need for Sparkling Water in just one dependency (although a rather large one). I use SBT, but you can build this with Maven as well.
From here you will need to create a Spark context and an H2O context that takes the Spark context as an argument:
From here it gets kind of weird. We need to use hex, the Java library that implements the algorithms in h2o:
Contrast with Spark ML's linear regression (Note: there isn't a base GLM implementation, so I used an old code sample with similar data, though the column names don't exactly match):
Neither MLlib nor Sparkling Water is what I would exactly call intuitive or beginner friendly. For example, take doing a linear regression in scikit-learn:
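Here is a minimal sketch of what that looks like, assuming scikit-learn is installed. The toy data and variable names are made up for illustration and are not the dataset used elsewhere in this post:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 2*x1 + 3*x2 + 1 with no noise
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]

# Fit an ordinary least squares model; plain Python lists are accepted as input
model = LinearRegression()
model.fit(X, y)

# The fitted parameters should be close to [2, 3] and 1 for this noiseless data
print(model.coef_, model.intercept_)

# Predict for a new, unseen row
prediction = model.predict([[6.0, 7.0]])[0]
print(prediction)
```

Three lines do the actual work: construct the model, fit it, and predict. There is no context to start and no frame type to convert into.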
There is a lot less overhead because we don't have to deal with the type system and the APIs are built for the domain (DataFrames). That being said, Spark does offer quite a bit more configuration power to the end user, and h2o even more. MLlib and h2o both allow distributed computation as well, which neither R nor pandas does in any generalizable way.
A second thought is about how complex it is to get started with Sparkling Water. I had to read a booklet before being able to get a "Hello, World" example up. Perhaps this is by design; we shouldn't have people who are unfamiliar with the details surrounding statistics running these algorithms. Now that I've invested a handful of hours, however, I feel like I can work my way through most problems. That's about the same way I felt about MLlib: on the first day of using it I was caught in a web of custom code to get from one data structure to another.
My final thought is about hex. Hex has a lot in it, and there is a lot to be said for that. There are far too many "machine learning" libraries where only one or two algorithms are implemented; Sparkling Water has a wealth of them. For this post I stuck with the basics, but you could get crazy pretty quickly if you wanted to. For the most part, you can just use H2OFrames as the input to those algorithms, which is incredibly convenient. This is simply not the case in MLlib, as I noted above and as the example I gave demonstrates.
While this is just one example, the trade-offs are pretty clear to me. You add cruft and a bit of complexity by using Sparkling Water, in exchange for a more unified way of doing data science and saving models for reuse. If you want to do deep learning on Spark, there is not really a better place to start from what I've seen so far, but if you just want to do a linear regression it's hard to make a case for h2o. I'll be using Sparkling Water some more as we're exploring some algorithms at work, but I plan on using MLlib as well and comparing and contrasting the two.
It's worth noting that Sparkling Water has some nice features out of the box, like a fast CSV parser and a nice notebook environment called h2o Flow. Those could be very nice depending on your use case.
If you’re using Sparkling Water in production I’d love to talk to you. Feel free to shoot me an email, or hit me up on Twitter. I'd really appreciate it!