Announcement: I’m Writing A Book on Apache Spark

For the last year or so I’ve been blogging regularly about the Apache Spark platform. During that time, Spark has grown from a tool used by people in data science and engineering into something that is almost ubiquitous. I’ve enjoyed working with the platform both professionally and on a number of personal projects. Over this year, I’ve spent a lot of time trying to get SBT configurations to work correctly, converting JSON to Datasets, and painstakingly trying to get missing-data imputation to work sensibly. This time has taught me that, as popular as Spark is, there is a pretty big gap in resources for it. It’s not that the docs are bad (they are actually excellent), and it’s not that it’s a super hard platform to learn; it’s just that it’s programming. Programming is tough, and digging through a huge Scaladoc is tough, but that’s what it takes to get decently proficient at Spark. This isn’t unique to Spark, but pain is pain all the same.

Spark has enabled me to think about computing and data in an entirely different way. It has taught me to be much more ambitious about data, and I think lots of people can benefit from that. I’ve spent so much time writing and debugging Spark code that I feel like I have a lot to share. My blog is evidence of this: many people have reached out to thank me for things I’ve written on Spark. What I wrote helped them think about a problem in a different way, or helped them better appreciate an overlooked aspect of Spark. I could continue to write blog posts and have a good impact, or I could write a more lasting resource in the form of a book.

My motivation is not to “make a killing” off the book or to become a “thought leader.” I hope for it to be published and to provide value, but I care more about the experience of writing it than about making tons of money. This is partly why I’m not interested in doing a course on Apache Spark. I don’t want the responsibility of ongoing membership fees or of keeping content up to date. I want to pass on principles and focus on platform-level tips, not get bogged down in API details the way a course would force me to. I also want to spend more time working on this project with my wife, Bethany. She provides all the illustrations for my blog and does a tremendous job, and I believe that together we can create a lasting resource for those new to Spark.

As for timing, we’re working on a writing calendar this week, and I’ll post updates in my newsletter for anyone interested in following the progress. The book’s working title is The Apache Spark Field Guide. I feel like “field guide” perfectly describes what I’m trying to do. It will not be your typical technical book with a lot of code samples; I’ll spend a lot more time walking through the nuances of Spark execution and helpful tips for using Spark. There are already great books out on taking Spark from nowhere to somewhere, but there isn’t a good place that quickly explains concepts in a way that’s not fact-based recitation.

Until next time.