Dataiku Flow: data-driven scheduling for real-life data pipelines
The life of a data scientist is full of traps. After you've vanquished the daemon of file formats and charsets and the hydra of data enrichment, mastered Pig, Hive, and other processing tools, and found your way through the maze of machine learning, you are faced with an even greater task.
Your precious insight must now be put into production: computed each day from many sources through dozens of processing steps, and integrated into the target system. You must stitch together disparate systems that were never meant to communicate. Errors will happen; you'll have to handle them and make sure that no day of data is missing. And you'll suffer.
Dataiku Flow is an open-source, data-driven scheduling tool for complex data pipelines. Instead of scheduling tasks and actions to perform, you describe your data pipeline, no matter how large, as dependencies between datasets and tasks. You then simply request your output data: "Compute the aggregated visits info for 2013/06/03". Flow computes all required dependencies and partitions, and runs the computations in parallel. It relies on the real state of the system to know what data is already present and what must be refreshed. Flow also manages a unified schema across tools.
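To make the data-driven idea concrete, here is a minimal Python sketch of the principle, not the actual Flow API: the Dataset, Task, and build names are invented for illustration. Requesting one output partition recursively resolves upstream partitions, and anything the "real state" already contains is skipped.

    from dataclasses import dataclass, field

    @dataclass
    class Dataset:
        name: str
        # Partitions already materialized in the real system (e.g. one day of data).
        existing: set = field(default_factory=set)

    @dataclass
    class Task:
        name: str
        inputs: list      # upstream Dataset objects
        output: Dataset   # dataset this task produces

        def run(self, partition: str) -> None:
            # A real task would launch Pig, Hive, SQL, Python, R, a file sync, ...
            print(f"[{self.name}] building {self.output.name}/{partition}")
            self.output.existing.add(partition)

    def build(producers: dict, dataset: Dataset, partition: str) -> None:
        """Materialize a partition, skipping anything already present."""
        if partition in dataset.existing:
            return  # the real state of the system says this partition is up to date
        task = producers.get(dataset.name)
        if task is None:
            raise RuntimeError(f"{dataset.name}/{partition} is missing and no task produces it")
        for upstream in task.inputs:
            build(producers, upstream, partition)  # independent branches could run in parallel
        task.run(partition)

    if __name__ == "__main__":
        raw_logs = Dataset("raw_logs", existing={"2013/06/03"})
        sessions = Dataset("sessions")
        visits = Dataset("aggregated_visits")

        sessionize = Task("sessionize", inputs=[raw_logs], output=sessions)
        aggregate = Task("aggregate_visits", inputs=[sessions], output=visits)
        producers = {t.output.name: t for t in (sessionize, aggregate)}

        # "Compute the aggregated visits info for 2013/06/03"
        build(producers, visits, "2013/06/03")

The point of the sketch is the inversion: you never schedule "run job X at 2 a.m."; you ask for a piece of output data, and the scheduler works out what must be (re)built from the declared dependencies and the data that already exists.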
Flow has built-in support for many task kinds: file syncing across many protocols (filesystem, HDFS, S3, Google Cloud Storage, FTP, ...), Pig, Hive, SQL, Python, R, ... Custom task kinds can easily be added.
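As an illustration of what adding a custom task kind could look like, here is a hypothetical registry-based sketch; TASK_KINDS, register_kind, and the "shell" kind are invented for this example and do not reflect Flow's actual extension API.

    from typing import Callable, Dict
    import subprocess

    # Hypothetical registry mapping a task-kind name to a runner function.
    TASK_KINDS: Dict[str, Callable[[dict, str], None]] = {}

    def register_kind(name: str):
        """Decorator registering a runner for a new task kind."""
        def wrap(runner: Callable[[dict, str], None]):
            TASK_KINDS[name] = runner
            return runner
        return wrap

    @register_kind("shell")
    def run_shell(config: dict, partition: str) -> None:
        # A custom kind might shell out, call a REST API, launch a Spark job, ...
        subprocess.run(config["command"].format(partition=partition), shell=True, check=True)

    # The scheduler would then look up the runner by the kind declared in the pipeline:
    TASK_KINDS["shell"]({"command": "echo syncing {partition}"}, "2013/06/03")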
This talk will describe the concepts, architecture, and features of Dataiku Flow and detail its unique capabilities. We'll explain the rationale for a new tool, and how it compares to and complements existing tools such as Apache Oozie, HCatalog, and ETL solutions. We'll also share some roadmap information.