Apache Drill Implementation Deep Dive

Track: 
scale

Apache Drill is an exciting project that aims to provide SQL query capabilities for a wide variety of data sources in an extensible way.

But the technologies underneath the implementation are also very exciting, even outside the context of Drill itself. These ideas can be repurposed for a wide variety of other uses, either by directly extracting code from Drill or by applying the philosophies and ideas in new forms.

I will talk about how Drill goes about several key tasks including:

  • forming a DAG of operations and binding these operations to real code. Drill does this using a JSON concrete syntax so that the DAG can be created or executed in a variety of implementation languages including Java and C++.

  • moving schema-free or flexible-schema nested data through an execution DAG as efficiently as data with a rigid relational schema. Drill uses a novel column-oriented format with a new batch-adaptive schema technology. This confines the inefficiencies associated with non-columnar data or with flexible schemas to the data sources, where the penalty is paid only once. From that point on, very fast execution is possible, on par with any other data format.

  • transforming DAGs from surface query syntax trees to logical query plans to execution plans. Drill supports SQL 2003 as its primary query language, but allows access to the internal logical plan language as well. The logical plan is created from the query syntax tree and then transformed into a multi-node execution plan using an advanced cost-based optimizer. This optimizer uses evolutionary algorithms to find efficient query forms and is useful in its own right.

  • executing queries expressed as DAGs both correctly and quickly. The execution plans in Drill are expressed as DAGs that are passed to remote execution engines. To execute these at super-high speed, code generation can be used. To make sure that the execution is correct, Drill provides a reference interpreter. The trick here is that the abstract plan language allows the reference interpreter to be implemented in Java while other execution engines can be implemented in C++ or other languages. The Drill reference interpreter allows other execution engines to be very aggressive in how they run, but still be testable.
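To make the first bullet concrete, here is a minimal sketch of a JSON-encoded operator DAG bound to real code. The field names ("op", "input", "id") and the operators are illustrative assumptions, not Drill's actual logical-plan syntax; the point is that any language with a JSON parser could supply its own bindings for the same plan.

```python
import json

# Hypothetical JSON plan -- field names are illustrative, not Drill syntax.
PLAN = json.loads("""
[
  {"id": 1, "op": "scan",   "source": [3, 1, 2]},
  {"id": 2, "op": "filter", "input": 1, "min": 2},
  {"id": 3, "op": "sort",   "input": 2}
]
""")

# Bind abstract operator names to concrete code; a C++ engine could bind
# the same names to its own implementations.
OPERATORS = {
    "scan":   lambda node, _: list(node["source"]),
    "filter": lambda node, rows: [r for r in rows if r >= node["min"]],
    "sort":   lambda node, rows: sorted(rows),
}

def run(plan):
    results = {}
    for node in plan:  # nodes are listed in dependency order
        rows = results.get(node.get("input"))
        results[node["id"]] = OPERATORS[node["op"]](node, rows)
    return results[plan[-1]["id"]]
```

Here `run(PLAN)` scans `[3, 1, 2]`, filters out values below 2, and sorts the remainder to `[2, 3]`.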
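The batch-adaptive schema idea from the second bullet can be sketched as follows. This is an illustrative toy, not Drill's value-vector implementation: each record batch is columnar and carries its own schema, so schema flexibility is handled once per batch boundary while column access stays tight.

```python
# Illustrative sketch of batch-adaptive columnar batches (assumed design,
# not Drill's actual format).
class RecordBatch:
    def __init__(self, schema, columns):
        self.schema = schema    # tuple of (name, type) pairs for this batch
        self.columns = columns  # dict: column name -> list of values

def project(batches, field):
    """Operator that inspects the schema only once per batch."""
    for batch in batches:
        names = [name for name, _ in batch.schema]
        if field not in names:   # flexible schema: field may be absent
            continue
        yield from batch.columns[field]  # column-at-a-time access

batches = [
    RecordBatch((("a", "int"),), {"a": [1, 2]}),
    RecordBatch((("a", "int"), ("b", "int")), {"a": [3], "b": [9]}),
]
```

Projecting `"a"` across both batches yields `[1, 2, 3]`; projecting `"b"` yields `[9]` because the first batch's schema lacks that column.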
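The third bullet's evolutionary optimizer can be sketched with a toy cost model over join orders. The tables, cardinalities, and cost function are invented for illustration; Drill's optimizer is far more sophisticated, but the shape is the same: mutate candidate plans, keep the cheapest.

```python
import random

# Assumed table cardinalities and a crude left-deep join cost model.
TABLES = {"a": 1000, "b": 10, "c": 100, "d": 50}

def cost(order):
    # Sum of intermediate result sizes, with each join shrinking the input.
    total, size = 0, TABLES[order[0]]
    for t in order[1:]:
        size = size * TABLES[t] // 100
        total += size
    return total

def mutate(order, rng):
    # Swap two tables in the join order.
    order = list(order)
    i, j = rng.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]
    return order

def evolve(generations=200, pop=20, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(list(TABLES), len(TABLES)) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop // 2]  # selection: keep cheapest half
        population = survivors + [
            mutate(rng.choice(survivors), rng)
            for _ in range(pop - len(survivors))
        ]
    return min(population, key=cost)
```

The result is a valid join order that is never worse than a naive ordering such as `["a", "b", "c", "d"]`.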
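Finally, the reference-interpreter pattern from the last bullet can be sketched as a cross-check harness. The operators here are stand-ins: a slow, obviously-correct row-at-a-time implementation validates an "aggressive" blocked one, in the same way Drill's Java reference interpreter can validate a generated or C++ engine running the same plan.

```python
def reference_sum(rows):
    # Simple row-at-a-time implementation: easy to audit for correctness.
    total = 0
    for r in rows:
        total += r
    return total

def fast_sum(rows, block=4):
    # Stand-in for an optimized engine (e.g. generated, vectorized code)
    # that processes data in blocks.
    total = 0
    for i in range(0, len(rows), block):
        total += sum(rows[i:i + block])
    return total

def check(engine, reference, cases):
    """Cross-check the fast engine against the reference on each input."""
    return all(engine(case) == reference(case) for case in cases)
```

An engine is free to be as aggressive as it likes internally, as long as `check` passes against the reference on the same inputs.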

About the speaker: 
Ted is currently Chief Application Architect for MapR Technologies. Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and at MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer and/or PMC member for Apache Mahout, Apache ZooKeeper and Apache Drill. Ted earned a BS degree in electrical engineering from the University of Colorado, an MS degree in computer science from New Mexico State University, and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at the very first Hadoop User Group meeting.

Michael works at MapR Technologies in the role of Chief Data Engineer, where he helps people to tap the potential of big data. His background is in large-scale data integration research and development, advocacy and standardisation. He has experience with NoSQL databases and the Hadoop ecosystem. Michael speaks at events, blogs about big data, and writes articles and books on the topic. Michael contributes to Apache Drill, a distributed system for interactive analysis of large-scale datasets.

Schedule info

Time slot: 
3 June 16:00 - 16:45
Room: 
Palais