Up until recently at Buffer, we had a big problem: we were drowning in data.

Buffer is a data-driven company. So many of the decisions we make hinge on data that we collect and analyze. We track and measure analytics data for all kinds of things – user behavior, how our A/B experiments are performing, and system-generated data such as API logs.

As outlined in this post by Sunil, we have built a pretty robust platform for tracking and measuring data, all powered by Amazon’s SQS and MongoDB.
But there still seemed to be a missing piece: we were suffering from metrics debt.

Technical debt can inhibit developers from moving quickly and slow down the pace of innovation. It also causes organizations to lose faith and confidence in their software. In the same manner, metrics debt can slow down your team’s ability to measure and track data, ask the right kind of questions, and be confident that analyses are correct and that you’re not missing any data.

Although it was pretty easy for us to collect and store massive amounts of data, we still had a pretty hard time getting to the insights.

Our data team was always scrambling to keep up with the needs of the company. We wanted to empower anyone at Buffer to analyze and play with our data, but it was hard to do so without writing a lot of code.

We would often narrow down our scope to analyze only very recent data, but we have been tracking some data for over three years. It became increasingly difficult to visualize this information across the lifetime of Buffer.

Why was this the case?

MongoDB

We were relying on MongoDB’s aggregation framework to query and aggregate our metrics data. Many ad-hoc data requests were simple to meet by just querying the growth DB, but doing so needed a certain level of skill and knowledge of MongoDB. While it was easy enough for developers to jump in and play with the data, it became a hurdle for other team members. And because we had so much data, running queries on some collections was pretty slow.
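
To give a rough sense of what those ad-hoc queries looked like, here’s a minimal pymongo sketch. The connection string, collection and field names are illustrative, not our actual schema:

```python
# A minimal sketch (not our exact queries) of the kind of ad-hoc aggregation
# we'd run with pymongo: count recent documents per day in a collection.
# The connection string, collection and field names are illustrative.
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["metrics"]

since = datetime.utcnow() - timedelta(days=30)
pipeline = [
    {"$match": {"date": {"$gte": since}}},        # only recent documents
    {"$group": {"_id": {"$dayOfYear": "$date"},    # bucket by day of year
                "total": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
]

for row in db.actions_taken.aggregate(pipeline):
    print(row)
```

Simple enough for a developer, but it’s easy to see why this was a hurdle for anyone who didn’t live in the Mongo shell every day.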

Growth dashboard

For more complex and ongoing data needs, we also built a custom growth dashboard site. It’s written in Django and connects to the metrics database. This is where we’ve built some of the more complex daily reporting and tracking screens. Many of these were powered by ETL (Extract, Transform and Load) jobs. These run in the background and update data periodically.

Building something custom is flexible. But it sucks up a lot of developer time to design new ETL pipelines and surface them in the dashboard UIs.

We were fiddling with JavaScript charts and nightly jobs that needed a lot of maintenance and support. We realized we would much rather just get quick insights into our data.

The way out

We quickly realized that we might need to look into some tools that are better suited to the job. As we investigated various platforms and technologies, two kept popping up: Hadoop and Amazon’s Redshift.

Looking into Hadoop led us to Mortar, a platform-as-a-service company that helps you get up and running with big data technologies really quickly.

We felt we could start tackling the problems we had with the sheer size of our data using tools like Hadoop and Redshift, but that still meant that data lived in the domain of developers.

That is when we discovered Looker. Looker is a web-based BI platform that aims to make data analysis simple and accessible to everyone, from developers to data analysts and product people. It also runs really well on Redshift, which we were leaning towards using.

Now, how do we shift to Redshift?

All the pieces were starting to fall into place for us. We were pretty sure that Redshift and Looker would be a great solution for our BI needs, but we were still not sure of the best way to transform the huge amount of data into a relational format and load it into Redshift. After experimenting with MoSQL as well as some homegrown solutions, we quickly ran into issues with the size of our MongoDB collections. It would simply take too long to load the data.

Hadoop to the rescue!

We had initially thought of Hadoop and Redshift as competing technologies, but it became clear to us that they could complement each other well.


Turns out that Hadoop works really well as an ETL pipeline. Using the Mongo Hadoop connector, it’s easy to load data into a Hadoop cluster. From there we can use the Pig scripting language and the power of parallel computing to quickly perform various transforms on the data and save the output to S3. Then we can load it into Redshift using the native COPY command.
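
The final load step is just SQL run against Redshift. Here’s a rough sketch of what a COPY from S3 can look like from Python with psycopg2 – the table, bucket and credentials below are placeholders, not our actual setup:

```python
# Rough sketch of the final load step: COPY the tab-delimited files that the
# Pig jobs wrote to S3 into a Redshift table. The table, bucket and
# credentials below are placeholders, not our actual setup.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com", port=5439,
    dbname="analytics", user="etl_user", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY actions_taken
        FROM 's3://example-bucket/actions_taken/transform/'
        CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
        DELIMITER '\\t'
        TIMEFORMAT 'auto';
    """)
conn.close()
```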

Here is an example of a Pig script that extracts data from MongoDB:

https://gist.github.com/michael-erasmus/842735920578702f4b18

This script simply loads the data from MongoDB, filtering on the date field in the collection, and outputs it to flat, tab-delimited files in an S3 bucket. From there we could run the transform step. This is the Pig script for that:

https://gist.github.com/michael-erasmus/935c652b281930c07805

Here we have a few more things happening. The data is loaded from S3, transformed, and then stored in S3 again. We run the data through a few transformations to get it ready to be copied into Redshift. Dates are formatted to the default timestamp format, and the value field, which is stored as a bounded array in MongoDB, is flattened into columns.
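
To make the shape of that transform a bit more concrete, here’s a rough Python rendering of what happens to a single document. The real work happens in the Pig script linked above; the field names and the array length of 3 here are assumptions:

```python
# Illustrative only: a Python rendering of roughly what the Pig transform does
# to a single document. The real work happens in the Pig script linked above;
# the field names and the array length of 3 are assumptions.
from datetime import datetime

def transform(doc, max_values=3):
    # Format the date to the default timestamp format Redshift expects.
    ts = doc["date"].strftime("%Y-%m-%d %H:%M:%S")
    # Flatten the bounded "value" array into a fixed number of columns.
    values = list(doc.get("value", []))[:max_values]
    values += [None] * (max_values - len(values))
    return [ts] + values

row = transform({"date": datetime(2014, 10, 29, 9, 30), "value": [1, 2]})
print("\t".join("" if v is None else str(v) for v in row))
```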

Mortar and Luigi

Since we were already using Mortar, it was simple to get started building this ETL pipeline. Mortar has full integration with Luigi, a neat open source framework built by Spotify. It lets you easily build complex data pipelines using Python.

Each step in your pipeline is called a Task. Tasks are usually derived from a base class that knows how to perform its data processing task and monitor its progress. Most importantly, each Task can specify which other Tasks it requires to be run first, which helps Luigi build up a dependency tree of your Tasks and figure out how to run your pipeline.
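
Here’s a minimal sketch of what that looks like. The class names, parameters and output paths are made up for illustration, not our actual pipeline:

```python
# Minimal Luigi sketch showing how one Task declares that it requires another.
# The class names, parameters and output paths are made up for illustration.
import luigi

class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("extract-%s.tsv" % self.date)

    def run(self):
        with self.output().open("w") as out:
            out.write("raw data goes here\n")

class Transform(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Luigi uses this to build the dependency tree and run Extract first.
        return Extract(date=self.date)

    def output(self):
        return luigi.LocalTarget("transform-%s.tsv" % self.date)

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            out.write(raw.read().upper())

if __name__ == "__main__":
    luigi.run()
```

Saved as, say, pipeline_sketch.py, running `python pipeline_sketch.py Transform --date 2014-10-29 --local-scheduler` lets Luigi work out that Extract needs to run first.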

Our Luigi scripts are modeled after the typical ETL pipeline, with Extract and Transform Tasks that call the Mortar API to run our Pig scripts, as well as a Load Task that loads data from S3 into Redshift.

We have also tweaked our Luigi scripts to use a common set of base classes for each ETL, to easily share configuration and loading strategies. Here is the Luigi script for the actions_taken ETL:

https://gist.github.com/michael-erasmus/18c59e226f20bae93e51

Here we define Luigi Task classes for the Extract, Transform and Load steps. These class definitions are very declarative: each only needs to override a couple of base methods to supply configuration specific to its pipeline. Most of the work happens in the base classes, which in turn rely on the built-in Luigi tasks for running Hadoop jobs and loading data into Redshift.

A busy week

All of this fell into place right about the time we had our team retreat in New York. Buffer is a fully remote, distributed team with people in timezones across the world. Every few months, we all meet up somewhere amazing and spend the week working together and getting to know each other better. Our data team embraced this opportunity to flesh out our new data architecture and to see how far we could get in our retreat week.

By the end of the week, we had data running all the way from MongoDB to Redshift and could already start showing off Looker and Redshift to the rest of the company!

Automating the pipeline

Once we had a good way of getting data into Redshift, we needed to automate the process to make sure that our BI database is constantly loaded with the most recent data. We already use Jenkins a lot within Buffer, so it felt like a great fit for automating our ETL scripts.

We set up a periodic Jenkins task that executed a bash script to run our ETLs. This worked great, and we no longer needed to kick off ETLs from our developer machines whenever the data got stale.

Metrics Architecture

Incremental loads

Initially we were pulling in all the data from scratch every day, overwriting the existing tables in Redshift with new data from Mongo. As you can imagine, this didn’t scale too well. We had to run very big Hadoop clusters (around 100 instances), and the whole ETL pipeline would still take hours to finish. What we really wanted was a Jenkins job running every hour that only pulled in the most recent data.

This all became possible when we stumbled across a little-known parameter that can be set on the MongoDB Hadoop loader: mongo.input.query.

This allowed us to filter the input collection to only load documents after a certain date. We tweaked our Luigi scripts to first fetch the maximum timestamp of the latest record already loaded into the existing Redshift table. This value is passed as a parameter to the Pig script that loads from MongoDB, which uses it to filter the input collection.
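
Here’s a rough sketch of that incremental trick using psycopg2; the real version lives in our Luigi scripts, and the table, column and connection details below are placeholders:

```python
# Rough sketch of the incremental-load trick (not our exact Luigi code):
# ask Redshift for the newest timestamp already loaded, then hand it to the
# Pig job as the lower bound for mongo.input.query. Names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com", port=5439,
    dbname="analytics", user="etl_user", password="secret",
)
with conn.cursor() as cur:
    cur.execute("SELECT MAX(date) FROM actions_taken")
    since = cur.fetchone()[0]  # None if the table is still empty
conn.close()

# The Pig script that reads from MongoDB would receive this as a parameter
# and use it to build a mongo.input.query like {"date": {"$gt": <since>}}.
print("Loading documents newer than: %s" % since)
```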

This made a huge difference in our loading times. Once we had most of the data in a Redshift table, we could incrementally load the most recent documents in less than a minute, using a tiny, two-instance Hadoop cluster.

The result

Once we had our BI database with data loaded on the hour, ready to be consumed in Looker, we could start to play 🙂
Looker uses LookML, a declarative, YAML-based language that lets you model your underlying database in Looker. It’s pretty smart at letting you create new models with generators and the GUI, so modeling data as you need it becomes pretty easy.
Once you have your models set up, it’s really easy to explore data by joining, filtering, pivoting and aggregating. Any result set you produce can easily be turned into a variety of different graphs.

Being able to explore data like this changes the way you do analysis.

It sparks lots of fresh insights and novel ways of thinking about and modeling your business. Our product people were really quick to take up Looker and satisfy their own data needs, coming up with amazing insights really quickly:


Creating a chart like this would have taken a lot of time and effort before. In fact, going back three years in time might not have been possible at all! But with the power of Redshift and Looker, we were able to graph this in minutes and see results instantly.

Your thoughts?

Building our new data architecture has been an amazing and fun adventure. And we’re really happy with what we’ve ended up with. Having this kind of power and flexibility has really opened up a world of possibilities to explore data science in a new way. It feels like we’ve just scratched the surface of what’s possible and we know that we’ll continue building on what we have right now.

It’s great to be able to share this journey, and we would love to hear what we’ve done wrong or what could be improved in any way.

Please feel free to share your thoughts, ideas or suggestions with us! If you have any questions about any of this or would just like to know more, hit me up on Twitter at @michael_erasmus, or reach out to the other members of our data team: @sunils34 and @stevenc81.

Written by Michael Erasmus

I lead the Data Analytics team at Buffer. I love to learn by building things. I feel the most content when I’m surfing and I love hanging out with friends and family, traveling, good food, and drinking copious amounts of Kombucha.

  • AlexisTheory

    I had an idea and now I’m curious: does Buffer track customer interaction events by customer, sorta tracking the relationship with each one? For example, say a customer makes a suggestion over email, a question over live chat, and a comment on Twitter, would you guys be able to look at your data and say “Oh this is the same person” and interact with them on a “Hello again! How is that fix to X problem working out” level?

    • Michael Erasmus

      Hi! Thanks, that is a great question! We already track a lot of this stuff using the support software we’re using: Helpscout, Twitspark and Olark. But this data lives in different places, and our Happiness team has to be able to pull data from all these disparate systems.

      One of the things we are very excited about with our new architecture is being able to import data from all these sources into our shiny new data warehouse. From there we’re hoping to create customer dashboards in Looker that can serve as a single point of information about a customer.

      This would be great for our Heroes to quickly get a comprehensive view of a customer!

  • andy

    Wow great post. I’m actually working on an MVP that already requires some ETL goodness due to a dependence on a badly performing external API.

    Already I know what I’m building won’t scale, but for where I am now it’s OK. Reading this is super useful, though.

    One question though. I’m not very familiar with Redshift and Hadoop. Are we basically talking about a datastore infrastructure that lets me run SQL queries against it like any relational DB, except it’s super super fast?

    Or does it require data to be stored and indexed in a very particular way for it to work?

    Or are those completely nonsensical questions? 🙂

    In any case, good work! Looker looks super interesting.

    • Michael Erasmus

      Hi Andy! Thank you so much!

      No, those are very good questions, thanks for asking 🙂

      Yep, Redshift is like a souped-up version of PostgreSQL that is optimized for analytics on huge datasets. Unlike traditional relational database stores, it stores tables as sections of columns of data rather than as rows of data.

      But as a user of Redshift, all that is abstracted away and you don’t really need to be concerned about it; you can just use it as you would a normal SQL database!

      We do have to be a little careful about the indexes we create in Redshift, but so far it’s been pretty darn fast even when we haven’t fiddled with indexes too much.

      Hadoop, on the other hand, is an implementation of the MapReduce programming model, which lets you distribute computing work in a parallel fashion on clusters of machines.

      At the moment we’re mostly using Hadoop to get data into Redshift, by loading it from MongoDB in parallel and running transformations to get the data from a document-based JSON format to a flat file that can be imported into Redshift.

      • andy

        Hmmm super interesting, thanks Michael. Will have to read up more on Redshift. For example, what are the cons of using it? Write speeds?

        Don’t worry you don’t have to answer here, I’ll stackoverflow it! 🙂

        Thanks for the reply though, have a lot more insight now, appreciate it.

        • Michael Erasmus

          That’s a great link, thanks for sharing! Like they mention in there, we have also noticed that Redshift doesn’t enforce uniqueness on Primary Keys, so you have to be super careful not to get any duplicates into your dataset.

          Overall I still love using a NoSQL database at the application level, but for data warehousing and analysis, SQL really shines.

    • If you’re looking for an all-round decent ETL tool to move data around, definitely check out Pentaho PDI http://community.pentaho.com/projects/data-integration/

  • serge fantino

    Hi Michael, very interesting post indeed.
    Just one question: did you consider loading JSON files directly into Redshift? The COPY command provides support for mapping JSON fields to columns as well.

    • Michael Erasmus

      Good one Serge, thanks!

      We did consider just using COPY like that but with the size of our collections it just wasn’t really possible to export data out of Mongo in a reasonable time.

      I’m not sure if we missed a trick here, but if so I would love to hear it!

      One thing I do like about using Hadoop is that it gives us a lot of power and flexibility in the transformation step. Some of the transforms we do are not just a simple collection-to-table mapping, but involve extracting values from embedded arrays in documents and unwinding these into a separate collection.

  • Great post, Michael! Looker looks very interesting. We have had great success running Mondrian on top of Amazon Redshift with our own custom BI tooling surfacing dashboards. And the big bonus is you can run this on JRuby 🙂

    One question… How do you deal with making transformations done in Looker’s LookML available to other services (or is that not really required)?

    • Hey Angus,

      I’m an analyst with Looker and can speak to that. Looker allows you to download, schedule or share anything you create within the application, if required. You also have the ability to write transformations from Looker back into the DB (in the case of Redshift, a scratch schema); we call these ‘persisted derived tables’.

      Looker can also publish transformations/results via URLs, embedded iframes (entire dashboards and single visualizations), scheduled reports and Looks, etc. We also have a few developer SDKs that allow you to easily extract data from the DB through Looker’s platform. Our technology stack is pretty unique for a BI tool (a web server application, a direct connection to the DB via JDBC, and the LookML language to describe the relationships, dimensions, measures and transformations of your database). Conceptually you can think of it as an API on top of your DB with a data modeling layer; with that you have a web interface to interact, run ad-hoc reporting & analysis, build dashboards, collaborate around reports, etc.

      ok all finished unless you have more questions 🙂

  • zoltanctoth

    Great writeup Michael. We have a very similar infrastructure at Prezi, using our home-grown ETL solution (partly inspired by Luigi), EMR, and Redshift with chart.io on top of it. Can you briefly tell me how a developer/PM can create their own custom data pipeline in this setup? What are your main challenges here?

    • Michael Erasmus

      Thanks! Great validation to hear you have such a similar stack.

      We have a little mini-framework set up with base classes for each step in the ETL, so I created a little generator script that uses jinja2 templates to generate the Luigi script and Pig scripts for an ETL.

      After generating the files, we just need to tweak the Luigi script to know about the schema in Redshift and do any transforms on the Pig side.

      I have been curious about perhaps automating some, or all of this, at some point, but we’ll see if we ever get there 🙂

  • Lincoln Ritter

    Thanks a lot, Michael.

    We use Redshift and Looker at Animoto as well. We use AWS Data Pipeline for orchestrating our data movement. I’m curious if you evaluated Data Pipeline and if so, why you stuck with Jenkins to do your scheduling.

    Cheers!

    • Michael Erasmus

      Hi Lincoln, thanks for the pointer!
      We didn’t really look into Data Pipeline too much, but it does sound intriguing! Where do you load data from? Would love to know more…

      • Lincoln Ritter

        We pull data from anywhere and everywhere – MySQL, Mongo, Google Analytics, Responsys, Splunk. For MySQL there are “connectors” built in to Data Pipeline, for the other stuff we use custom scripts appropriate to each data source.

        We mainly picked data pipeline since it was a system we didn’t have to build and operate ourselves. The downside is a bit less flexibility.

        Your use of Hadoop for ETL is interesting. We hadn’t considered that. Instead, we use very basic scripts to dump JSON documents to S3 for later analysis using EMR.

        • Michael Erasmus

          Wow, thanks so much, that’s awesome to know! I think that for us the power of hadoop comes from the transformations we can do, but I’m definitely going to look into data pipeline some more.

          By the way, have you considered open sourcing some of the custom scripts you use? That would be amazing 🙂

  • Moty Michaely

    Great article Michael!

    Hadoop, Redshift and Looker seem like a great fit!

    Can you please describe the amount of raw data you guys are processing? How much time does it take for each batch to be processed? Are 500 million records being processed directly from Redshift? This is interesting, as Hadoop is probably one of the best solutions for ETL. It is scalable (especially in the cloud), enables processing big data from various data stores, and is the de-facto standard for batch processing. But the problem is that not all companies have the required skill set (or would like to acquire it) to write MapReduce code, either in Java or in higher-level languages like Pig or Hive.

    That said, Xplenty (where I am VP of R&D) lets you quickly and easily prepare your data for analytics and augment a data architecture that might currently consist of Amazon S3, RDS, Redshift, NoSQL DBs, etc., with a code-free user interface. And it is fully powered by Hadoop.

    By the way, it would be great if you could share the points which you guys considered when evaluating your Hadoop as a Service provider (and what made you choose Mortar).

    Thanks!

  • Great post Michael! It seems like you came up with an elegant solution to make sense of all your data. I’d love to start working with big data to solve some challenges at my current job but it’s honestly a bit overwhelming and I’m not sure where to start. Could you recommend some resources for a beginner to get started working with big data in a practical way?

  • Swapneel Sawant

    Great insight Michael!!!
    Thanks a lot.

  • Hi, what’s your tracking pipeline for getting the data into Mongo? I’d like to know more about that. One worker? Do you also do updates via SQS, or just inserts? How do you handle ordering in the queue?

  • Steve Lucas

    Hey @michael_erasmus:disqus thanks for the post. (I found it still quite relevant even now.)
    Would you mind giving a brief update on whether any of your architecture has changed? I.e., did you end up using AWS Data Pipeline over Hadoop for ETL?
    Are you using Athena or Redshift Spectrum now that the data storage landscape has changed (and given your massive user growth and the need to track more events for new products)?
    Has Looker been able to scale and meet your business’s BI needs even three years on?
    Thanks again for the transparency on your data stack. Super generous and very helpful!