SparkSQL provides an easy-to-use API for working with distributed datasets on the Spark platform. It’s trivial to do sums, group-bys, pivots, and other aggregations in SparkSQL. Most of its built-in functions fall into two camps: row-wise functions (like user-defined functions, UDFs) that take a single row as input and produce an output for each row, and aggregate functions that take a group of rows and produce some summary of them. Sometimes it’s useful to do both at once: calculate a cumulative sum, a moving average, a rank, or something along those lines.
Window functions provide exactly this. The examples of window functions I found made them a bit tricky to understand or get into, and many SparkSQL-related questions on Stack Overflow stem from a misunderstanding of (or plain unfamiliarity with) window functions. This blog post is a small attempt to clear up some of the confusion around these concepts.
I borrowed some data from the Databricks post covering this same topic; it looks like this:
We want to rank the products within each category by revenue. We can do that with the following code:
You’ll notice some peculiar things about this code. First, the Window object: we have to define our window, including its partition-by and order-by, before the ranking function makes any sense.
A Little More
Ranking a partition isn’t the most powerful use of window functions. In recent months I was working on a regression model with splines. I needed to calculate percentiles across a partition and assign each row one of those percentiles. I accomplished that with roughly this code (modified slightly from the version used on real data).
Window functions provide an easy-to-reason-about, easy-to-debug, and clean way to do important aggregation work in Spark. Hopefully reading this post helped some!