Ergonomics in Data Engineering

In software engineering, there are several tools that make life easier. This list includes, but is not limited to, Integrated Development Environments (IDEs), command line tools, third-party services and software, and the languages we use to write programs. Over the years, it's become increasingly obvious to me that these tools make a big difference. Being comfortable with and enjoying the tooling you use reduces the cognitive overhead of designing systems and allows you to have more fun. This post is a collection of notes I've taken on the subject over the last year. It's about tools that make the trade more comfortable, some that make it less so, and some the jury is still out on. I had a lot of fun writing this piece, and while it's light on code, it's still technical.

Languages & Types

Types are both helpful and hurtful in data engineering. As someone who works primarily in Scala, I value compiled languages and types. While they are valuable, I won't understate the cognitive overhead they add to writing programs. In general, I think they are worthwhile if they make code easier to understand and write. I'm of the opinion that one of the areas where types are most helpful is when dealing with data. If you've ever tried to run analytical queries (sums, windowing and the sort) over JSON data, you've likely been pissed off by the process. This is the nature of JSON: it's a fine format for consuming, but not a great format for transactions or analytics.

I've spent a lot of time writing code that turns JSON into DataFrames and converts DataFrames back to JSON. This is part of the job, but it's not ergonomic or enjoyable. I'm not sure what the solution is here, because JSON stores and key-value stores allow us to do awesome stuff, but dealing with so many paradigms isn't a great time.
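To make the pain concrete, here is a minimal sketch (plain Python standard library, with made-up records) of the kind of coercion work that turning loose JSON into typed rows always seems to involve: defaulting missing keys, normalizing inconsistent value types, and only then doing the analytics.

```python
import json

# A hypothetical batch of JSON events with the usual real-world mess:
# a missing key and inconsistent value types for "amount".
raw = [
    '{"user": "a", "amount": "12.50"}',
    '{"user": "b", "amount": 3}',
    '{"user": "a"}',
]

def to_row(line):
    """Coerce one JSON record into a typed (user, amount) tuple."""
    record = json.loads(line)
    amount = record.get("amount", 0)  # default a missing amount to zero
    return (str(record.get("user", "")), float(amount))

rows = [to_row(line) for line in raw]

# Only after the typed coercion is an aggregation straightforward.
totals = {}
for user, amount in rows:
    totals[user] = totals.get(user, 0.0) + amount
print(totals)  # {'a': 12.5, 'b': 3.0}
```

Every decision in `to_row` (default values, string-to-float coercion) is a small policy choice that a typed source format would have made unnecessary.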


Containers

'Containerize' is one of the more annoying phrases you'll hear these days, because it doesn't mean anything. Another personal favorite is 'we need to get everything on Docker.' In the data engineering space, Apache Mesos is the most frequent resource and container management solution I've used. Apache Mesos is a handy tool for data engineering, especially in the beginning. It helps you manage your resources and deploy and tear down different technology without much effort. Mesos is ergonomically pleasant, and DC/OS is even more so.

While using Mesos is pleasant, it's not exactly compatible with other platforms. I'm just starting to see more resources around data engineering with Kubernetes, but it has some catching up to do in this space. Nevertheless, data engineering could benefit from more tools like DC/OS and Kubernetes.


Testing

The merits of testing don't need to be rehashed here; they are, in large part, what makes any of data engineering ergonomic. It's not easy to test massive, distributed data pipelines, but any testing (unit, integration, regression) is a huge help in developing and maintaining them. One aspect of testing that's a bummer to me is that there is no single right way to do it. The mantra of "the more tests, the better" is incomplete in a distributed, multi-tenant environment. I wish there were more industry standards or tools that could help you in this arena.
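Even without a standard, the easiest wins come from unit testing pure transformation steps in isolation. A sketch, with a made-up deduplication step (the function name and event shape are illustrative, not from any particular framework):

```python
def dedupe_latest(events):
    """Keep the latest event per key -- a typical pipeline step.

    `events` is an iterable of (key, timestamp, value) tuples;
    the shape is made up for illustration.
    """
    latest = {}
    for key, ts, value in events:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

def test_latest_wins():
    events = [("a", 1, "old"), ("a", 2, "new"), ("b", 5, "only")]
    assert dedupe_latest(events) == {"a": "new", "b": "only"}

def test_empty_input():
    assert dedupe_latest([]) == {}

# Run directly, or let a test runner like pytest discover these.
test_latest_wins()
test_empty_input()
```

Because the transformation is a pure function over plain data, the same tests apply whether the step eventually runs on one node or across a cluster.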


Batch

Are batch workloads a thing of the past? Here are a few pieces that effectively argue that they are:

  1. Stuck in the Middle: The Future of Data Integration is No ETL
  2. Streaming Transformations as Alternatives to ETL
  3. ETL is Dead, Long Live Streams

Everyone I know in the space does most of their work in batch. While batch workloads are resource intensive and often a bottleneck, it's much easier to think of data this way: whether in memory on a single node, on disk or distributed, batch workloads are much easier to think through. Often, batch workflows mutate data or do large writes to file systems or databases. The tooling for batch computing is mostly very friendly. SQL-based tools like SparkSQL, Redshift and Presto are easy for anyone familiar with SQL to use.

NoSQL is still a bit of a challenge in this arena. Each NoSQL platform comes with a different paradigm: a different query language, storage model and computational engine. NoSQL solutions have a lot to offer, but each one comes with a whole world of expertise that is needed. As for NoSQL JSON stores, Couchbase seems to have struck a great balance by supporting a SQL-like syntax on top of JSON. I have a lot of hope for this space; I think it is the next frontier, but it still has a long way to go.


Streams

Streams sound like the holy grail to a data engineer. The ability to have data created once and consumed without bound, and without a necessary order, puts a ton of control back in the engineer's hands. It gives us the opportunity to control resource usage and the cost of processing data. Streams are also much more complicated than batch to think through. You must re-work your thinking from a world of batches to a world of events. It's harder than it sounds, and for a lot of teams it just makes sense to do everything in batch or micro-batch.

Streams are typically generated with a pub-sub system, and Kafka is all the rage these days. I like Kafka, and I think it's one of the more well-considered tools in this space. I consider it the only way to get started in the streaming space. It's one of the tools that makes streams a workable approach, in my opinion. I think that eventually most workloads in data engineering can be streams, but we're a ways off on tooling.
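The property that makes the model so appealing (write once, consume independently and without bound) can be sketched with a toy append-only log. This is loosely in the spirit of a Kafka topic with consumer offsets, but the class and its methods are invented for illustration:

```python
class MiniLog:
    """A toy append-only log, loosely in the spirit of a Kafka topic.

    Producers append; each consumer tracks its own offset, so events
    are written once and read independently, at each consumer's pace.
    """
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read(self, offset):
        """Return events from `offset` onward, plus the new offset."""
        return self.events[offset:], len(self.events)

log = MiniLog()
log.append({"type": "click", "user": "a"})
log.append({"type": "click", "user": "b"})

# One consumer reads everything so far...
batch1, offset1 = log.read(0)
# ...a new event arrives, and the same consumer resumes at its offset.
log.append({"type": "purchase", "user": "a"})
batch2, offset2 = log.read(offset1)
print(len(batch1), len(batch2))  # 2 1
```

The hard part that this toy hides (partitioning, retention, delivery guarantees across machines) is exactly where the re-thinking from batch to events happens.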

Workflow Software

There is a handful of workflow software out there, and I only have opinions on two of them. The first is Apache Airflow and the second is StreamSets Data Collector (I'm very interested in trying Hydrograph as well). I call this brand of software workflow software because it helps you define a workflow. It serves an important purpose in helping to operationalize and simplify ETL processes.

Not all of these programs are built on the same idea. For example, Airflow uses a DAG-based model built from Python scripts. It's not my favorite approach, as Python is not a tool I use that often. StreamSets takes a different approach to workflow management: it's built on JVM technology, and most of the work is done through a nice UI. It supports a wider array of technology and is extremely pleasant to use. I think this is an important space for the future of data engineering. These tools take the complication of communication between systems and mostly trivialize it. I hope to see more growth and buildout in this space.
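The DAG idea itself is simple: declare which tasks depend on which, and let the scheduler order the work. A minimal sketch using Python's standard-library `graphlib` (the task names and callables are made up; real Airflow adds operators, scheduling and retries on top of this core):

```python
from graphlib import TopologicalSorter

# A hypothetical three-step pipeline.
results = []
tasks = {
    "extract":   lambda: results.append("extract"),
    "transform": lambda: results.append("transform"),
    "load":      lambda: results.append("load"),
}
# Each task maps to the set of tasks it depends on.
deps = {"transform": {"extract"}, "load": {"transform"}}

# The sorter yields a valid execution order for the DAG.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(results)  # ['extract', 'transform', 'load']
```

Everything a workflow tool adds beyond this (retries, backfills, cross-system connectors) is what makes the space worth watching.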

Amazon, Google and Microsoft

Amazon, Google and Microsoft all provide hardware and software through their cloud computing platforms. Each of them has a slightly different take on how to do cloud-based data engineering, incompatible with the others. As a matter of opinion, I like to think they are all equal in what they offer. I haven't used Google Compute Cloud, but I have used both Amazon and Microsoft services, and for the most part they are both very good.

These services take away the overhead of deploying hardware, and they trivialize scaling machines up and down and managing them. From a cost perspective, they are much cheaper than standing up your own hardware and trying to do the same. Each platform has a service I'm particularly excited about:

  1. Amazon Athena
  2. Microsoft Azure Event Hubs
  3. Google Cloud Data Flow

I think each of these has a unique and engineer-friendly offering that'll change data engineering for the better. I've written a bit about Athena in the past, and I will publish on the other two in the coming months.


Conclusion

Data engineering is the white-collar version of plumbing. The work is important to the function of every application and every part of the system, but most of what happens goes unseen. The analogy extends further: while there are complexities in plumbing, each decision influences the overall performance of the system. It's important when architecting, supporting and extending these systems that we have tools that allow us to do the work well and do it right. Both wellness and rightness are contingent on the systems you're working on. Both open source and proprietary software make the job easier, but there is still a ton to be done. To all the software and hardware mentioned in this article: thanks for supporting this work. I'll do my part to help make this a fun and inviting space to be in for years to come.