Dirty data: dealing with high-volume external sources


At a large European publisher, we faced the challenge of receiving external data from more than 10 different sources, adding up to tens of GBs per day. On top of that, file formats sometimes change without notice, connections fail, and files go missing. All the while, we try to make sure that at least the amount of data is close to correct, in an environment where the 'correct' amount of data for a source is often a hard-to-predict number somewhere between 20M and 50M records for a particular day.

We built an extract-and-load pipeline to get data into Hadoop and expose it via Hive tables. The pipeline covers scheduling, reporting, monitoring, transforming and, above all, the ability to respond to changes very quickly. After all, responding to a file format change within the same day or adding a new source in a day are very reasonable user requests (right?). We focused on developer friendliness and relied on a fully open source stack, using Hadoop, Hive, Jenkins, various scripting languages and more. This talk covers the setup and our lessons learned.

In our quest for data quality, we also worked on predicting the expected data volumes, based on seasonality and weather information, so that we can proactively alert when a data import appears to fall short of the expected volume. I will include these results in the talk.
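
To give a feel for the kind of volume check described above, here is a minimal sketch in Python. It is not the model from the talk: it ignores weather entirely and reduces seasonality to a per-weekday median. The function names, the 0.8 tolerance and the sample counts are all made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def weekday_baselines(history):
    """Median record count per weekday (0 = Monday) from (date, count) pairs."""
    by_weekday = defaultdict(list)
    for day, count in history:
        by_weekday[day.weekday()].append(count)
    return {wd: median(counts) for wd, counts in by_weekday.items()}


def falls_short(day, actual, baselines, tolerance=0.8):
    """True when the import is below tolerance * expected volume for that weekday."""
    expected = baselines.get(day.weekday())
    if expected is None:
        return False  # no history for this weekday, so no basis for an alert
    return actual < tolerance * expected


# Made-up history for one source: two Mondays and two Tuesdays.
history = [
    (date(2013, 5, 6), 40_000_000),
    (date(2013, 5, 13), 44_000_000),
    (date(2013, 5, 7), 30_000_000),
    (date(2013, 5, 14), 32_000_000),
]
baselines = weekday_baselines(history)

# A Monday import of 30M records is checked against the Monday baseline.
if falls_short(date(2013, 5, 20), 30_000_000, baselines):
    print("ALERT: import volume is below the expected range")
```

A median per weekday is a deliberately crude stand-in: it tolerates a single bad day in the history, but it knows nothing about holidays, trends or weather.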

About the speaker: 
Friso is a developer who has lately been setting up and using Hadoop a lot for a living. He is also a trainer teaching the Cloudera Hadoop developer classes and (co-)organizer of the Dutch Hadoop community meetup (NL-HUG) and the Dutch NoSQL NL meetup.

Schedule info

Time slot: 
4 June 11:30 - 12:15