When coming to Spark from a background in R or Python’s pandas, you’ll likely get tripped up on a few things. The most notable is the difference between the R and Python dataframe APIs and the Spark DataFrame API. Furthermore, not all models in Spark are fit with a DataFrame, and the interplay between DataFrames and RDDs (Resilient Distributed Datasets) is not so obvious.

Earlier this week I read in data from a CSV as a DataFrame and fit a linear regression model. From there I wanted to change how the model was fit and train a linear regression model with stochastic gradient descent (SGD).

Fitting the linear regression model is straightforward:
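A minimal sketch of what such a fit might look like, assuming Spark 2.x or later with the DataFrame-based spark.ml API. The file name `data.csv` and the column names `x1`, `x2`, and `y` are hypothetical placeholders, not taken from the original data set.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object FitLinearRegression {
  // Pack the numeric feature columns into a single vector column,
  // then fit an ordinary linear regression on the DataFrame.
  def fit(df: DataFrame, featureCols: Array[String], labelCol: String): LinearRegressionModel = {
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol(labelCol)
      .fit(assembler.transform(df))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("linear-regression")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input: a CSV with a header row and numeric columns x1, x2, y.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data.csv")

    val model = fit(df, Array("x1", "x2"), "y")
    println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}
```

The estimator works on the DataFrame directly; the only preparation is assembling the feature columns into one vector column.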

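Running the same regression with stochastic gradient descent requires some transformations. The SGD-based regression in Spark’s older RDD-based MLlib API takes an RDD rather than a DataFrame, so we need to go from a DataFrame to an RDD[LabeledPoint], where each LabeledPoint holds a Double label and a feature Vector.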

Here’s how it’s done:
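One way to sketch this conversion, assuming a Spark 1.x/2.x-era RDD-based MLlib where `LinearRegressionWithSGD.train` is available (it was deprecated in Spark 2.0, so newer releases may not offer it). The object and method names below, the column names passed in, and the iteration count and step size are all illustrative assumptions.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

object DataFrameToLabeledPoints {
  // Expects a Row whose first field is the label and whose remaining
  // fields are the features, all already of type Double.
  def fromRow(row: Row): LabeledPoint = {
    val features = (1 until row.length).map(i => row.getDouble(i)).toArray
    LabeledPoint(row.getDouble(0), Vectors.dense(features))
  }

  // Select label first, then features, casting everything to double so
  // getDouble is safe; rows with nulls are dropped before conversion.
  def toLabeledPoints(df: DataFrame, labelCol: String, featureCols: Seq[String]): RDD[LabeledPoint] = {
    val ordered = (labelCol +: featureCols).map(c => col(c).cast("double"))
    df.select(ordered: _*).na.drop().rdd.map(fromRow)
  }

  // Illustrative hyperparameters only: 100 iterations, step size 0.01.
  def trainSgd(df: DataFrame, labelCol: String, featureCols: Seq[String]): LinearRegressionModel = {
    val points = toLabeledPoints(df, labelCol, featureCols).cache()
    LinearRegressionWithSGD.train(points, 100, 0.01)
  }
}
```

With a DataFrame `df` in hand, `DataFrameToLabeledPoints.trainSgd(df, "y", Seq("x1", "x2"))` would return the fitted model. SGD is sensitive to feature scale and step size, so the values here would likely need tuning on real data.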

While the transformation isn’t exactly straightforward, Scala’s functional style makes it easy to reason your way to where you need to be. Here’s to more transformations.