Dayne Jones

The poor man's data pipeline

May 17, 2017

Tasked with building out a shiny new data pipeline in May of 2017, I devoured every bit of experience I could find from colleagues, meetups, and research. My criteria were low operational responsibility, high flexibility for future development, and that the thing be simple enough that I could implement it myself. If you know anything about the landscape of big data, you know solutions in this space aren't exactly known for their low maintenance and simplicity. Just read some of the blogs of the big players like Spotify, Netflix, and Yelp. Yes, those pipelines are complex in part because the companies are massive, but I think you'll find every modern case study looks similar. I was convinced there was an easier way to get 80% of the functionality for 20% of the work without cobbling together scripts that would leave no room for future development.

The primary issue I encountered is that many of the most highly recommended tools are also deeply complex. The nature of working with massive amounts of data means that your ingestion and processing of that data must be highly distributed so that independent pieces can scale easily. Ensuring reliable delivery of your data is a technical challenge in its own right. All of this leads to architectures that are not intuitive for the average engineer (I still don't really know what Kafka does).

How was I, a lowly software engineer, supposed to build a proof of concept that could grow into a real, useful piece of software without losing months of development time upfront and countless hours of maintenance on the back end? What I have come up with seems to give me a flexible, massively scalable, and near real-time data pipeline. Let's see how it works. What follows is an overview; a more instructional guide and the actual code can be found in the GitHub repository.

  1. Ingest (or extract)

    To get the data I want, I borrowed a page out of the standard tracking pixel playbook. We fire a tracking pixel with query string parameters (something like this: http://track.domain.com/pixel.png?user_id=123&event=click) to deliver the data we're interested in tracking; a client-side sketch of firing the pixel follows at the end of this step. The request can be fired from anywhere. Setting up a simple HTTP load balancer in Google Cloud Platform lets us do two things with no ops team and almost no configuration: first, respond quickly with a 1x1 transparent PNG, and second, gather the logs from that request with no extra work. Here's what my load balancer config looks like:

    To get to the next step, we need to use Pub/Sub, a very simple messaging service. Just create a new topic in Pub/Sub and remember its name. To begin publishing log messages to that topic, all you need is the built-in log export feature of Stackdriver (Google's logging service). Here's what that setup looks like.
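
    For reference, firing the pixel from a web page doesn't need any special tooling. Here's a minimal client-side sketch; the track.domain.com host and the user_id/event parameters are just the example values from above, not anything the pipeline prescribes.

      // Fire the tracking pixel by requesting a 1x1 image with the event
      // encoded in the query string. Hostname and parameter names are the
      // example values used in this post.
      function track(event: string, userId: string): void {
        const params = new URLSearchParams({ user_id: userId, event });
        const pixel = new Image(1, 1);
        // The load balancer answers with a transparent PNG and logs the request.
        pixel.src = `http://track.domain.com/pixel.png?${params.toString()}`;
      }

      track('click', '123');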

  2. Process (or transform)

    My data parsing is very lightweight. It consists of turning a URL like the aforementioned http://track.domain.com/pixel.png?user_id=123&event=click into an object that looks like this: {user_id: 123, event: 'click'}. Ideally, we want something that can scale up during spikes and lie dormant the rest of the time. Bonus points for speed and flexibility of development. Google Cloud Functions fit the bill very well. My Google Cloud Function is subscribed to the same Pub/Sub topic from step 1. It receives a single log message, processes it, then streams the result to BigQuery.
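
    Here's a rough sketch of what that function can look like. The dataset and table names ('pipeline' and 'events'), the function name, and the exact shape of the exported log entry are assumptions for illustration (the HTTP load balancer's request logs carry the original URL in an httpRequest.requestUrl field); the real code is in the repository.

      import * as url from 'url';
      import { BigQuery } from '@google-cloud/bigquery';

      const bigquery = new BigQuery();
      // Dataset and table names are made up for this sketch.
      const table = bigquery.dataset('pipeline').table('events');

      // Background Cloud Function triggered by the Pub/Sub topic from step 1.
      // message.data is the base64-encoded Stackdriver log entry.
      export async function processLog(message: { data: string }): Promise<void> {
        const entry = JSON.parse(Buffer.from(message.data, 'base64').toString());

        // The load balancer's request log carries the original pixel URL.
        const query = url.parse(entry.httpRequest.requestUrl, true).query;

        // Turn ?user_id=123&event=click into a row and stream it to BigQuery.
        await table.insert({
          user_id: query.user_id,
          event: query.event,
          timestamp: entry.timestamp,
        });
      }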

  3. Store (or load)

    Within a few seconds of the initial request, your event is in BigQuery and ready to be queried. Here's a visual for reference. Keep in mind, your schema can be whatever you dream up.

    A note about BigQuery: out of all the Google Cloud Platform products, BigQuery seems to have made the biggest impact on the industry. This is the one part of my pipeline with which I did have some experience, and I felt very confident in its value. The Google Cloud Function streams into tables partitioned by date and split out by user_id (since we are running a multi-tenant setup). This means that even though we're gathering a huge amount of data, any given query only touches a small subset of it, which keeps cost low (a query sketch follows at the end of this step). And since repeat queries are served from cache for free as long as the underlying data hasn't changed, the only time you pay for the same query twice is on the day the data is still streaming in.

    Even apart from its massive scalability, ease of use, and much better pricing than the alternatives on the market, BigQuery solves a big problem for me: it is widely supported by common BI products like Looker and Tableau. Google Data Studio is also worth a mention since it's already set up for you.
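
    To make the cost point above concrete, here's a sketch of the kind of query this setup leaves you running. It assumes a single events table with a user_id column (the repository may shard tables per tenant instead); filtering on _PARTITIONTIME limits the scan to a single day's partition, so you're billed for a small slice of the table rather than the whole thing.

      import { BigQuery } from '@google-cloud/bigquery';

      const bigquery = new BigQuery();

      // Count one tenant's clicks for a single day. The _PARTITIONTIME filter
      // restricts the scan to one date partition of the (hypothetical)
      // pipeline.events table.
      async function clicksForDay(userId: string, day: string): Promise<number> {
        const [rows] = await bigquery.query({
          query: `
            SELECT COUNT(*) AS clicks
            FROM \`pipeline.events\`
            WHERE _PARTITIONTIME = TIMESTAMP(@day)
              AND user_id = @user_id
              AND event = 'click'`,
          params: { day, user_id: userId },
        });
        return rows[0].clicks;
      }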

Enough rambling. Here's the repository with all the relevant code and instructions. Feedback welcome!