Dataiku Flow: data-driven scheduling for real-life data pipelines


The life of a data scientist is full of traps. After you've vanquished the daemon of file formats and charsets and the hydra of data enrichment, mastered Pig, Hive and other processing tools and found your way through the maze of Machine Learning, you are faced with an even greater task.

Your precious insight must now be productized: computed each day from many sources through dozens of processing steps, and integrated into the target system. You must assemble disparate systems that were never meant to communicate. Errors will happen; you'll have to handle them and make sure that no day of data is missing. And you'll suffer.

Dataiku Flow is an open source data-driven scheduling tool for complex data pipelines. Instead of scheduling tasks and actions to perform, you describe your data pipeline, no matter how large, as dependencies between datasets and tasks. You then simply request your output data: "Compute the aggregated visits info for 2013/06/03". Flow computes all required dependencies and partitions, and performs all computations in parallel. It relies on the real state of the system to know what data is already present and what must be refreshed. Flow manages a unified schema across tools.
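
The idea of requesting output data rather than scheduling actions can be illustrated with a minimal Python sketch. This is not Flow's actual API: the `Pipeline` and `Task` names, the dataset names, and the in-memory `present` set (standing in for the real state of the system) are all hypothetical, and the sketch runs tasks sequentially where Flow would run them in parallel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Task:
    """Hypothetical task: builds one partition of `output` from the same partition of each input."""
    output: str
    inputs: List[str]
    run: Callable[[str], None]


class Pipeline:
    """Toy data-driven scheduler: you ask for data, it works out which tasks to run."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}            # output dataset name -> task producing it
        self.present: Set[Tuple[str, str]] = set()  # (dataset, partition) pairs that already exist
        self.log: List[Tuple[str, str]] = []        # partitions built, in execution order

    def add_task(self, output: str, inputs: List[str],
                 run: Callable[[str], None] = lambda partition: None) -> None:
        self.tasks[output] = Task(output, inputs, run)

    def build(self, dataset: str, partition: str) -> None:
        """Make (dataset, partition) available, computing only what is missing."""
        if (dataset, partition) in self.present:
            return  # already there: nothing to refresh
        task = self.tasks.get(dataset)
        if task is None:
            # No task produces this dataset, so it is a source that should already exist.
            raise LookupError(f"source data missing: {dataset}/{partition}")
        for dependency in task.inputs:
            self.build(dependency, partition)  # resolve dependencies first
        task.run(partition)
        self.present.add((dataset, partition))
        self.log.append((dataset, partition))


# Describe the pipeline as dependencies between datasets (all names are made up).
pipeline = Pipeline()
pipeline.add_task("clean_visits", ["raw_visits"])
pipeline.add_task("enriched_visits", ["clean_visits", "geoip"])
pipeline.add_task("aggregated_visits", ["enriched_visits"])

# Pretend the source data for that day has already landed.
pipeline.present.update({("raw_visits", "2013-06-03"), ("geoip", "2013-06-03")})

# "Compute the aggregated visits info for 2013/06/03"
pipeline.build("aggregated_visits", "2013-06-03")
```

After the call, `pipeline.log` lists the three intermediate and final partitions in dependency order. Requesting the same output a second time runs nothing, because the scheduler consults the recorded state before doing any work.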

Flow has built-in support for many tasks: file syncing across many protocols (filesystem, HDFS, S3, Google Cloud Storage, FTP, ...), Pig, Hive, SQL, Python, R, ... Custom task kinds can easily be added.

The talk will describe the concepts, architecture and features of Dataiku Flow and detail its unique capabilities. We'll explain the rationale for a new tool, and how it compares to and complements existing tools like Apache Oozie, HCatalog and ETL solutions. We'll also give some roadmap information.

About the speaker: 
Clément Stenac is a passionate software engineer, currently CTO of Dataiku, a French startup that helps companies get into the world of data-driven innovation. Dataiku publishes an integrated platform dedicated to the data lab, bringing together many open source technologies into a consistent whole and adding unique tools to boost data scientists' productivity in the data enrichment phases and guide them through the data analysis and machine learning processes. Clément was previously head of development at Exalead, leading the design and implementation of large-scale search engine software. He also has extensive experience with open source software, as a former developer of the VideoLAN and Debian projects.

Schedule info

Time slot: 
3 June 12:50 - 13:15