Apache Drill Implementation Deep Dive
Apache Drill is an exciting project that aims to provide SQL query capabilities for a wide variety of data sources in an extensible way.
But the technologies underneath the implementation are also very exciting even outside of the context of Drill itself. These ideas can be repurposed for a wide variety of other uses either by directly extracting code from Drill, or by using the philosophies and ideas in new forms.
I will talk about how Drill goes about several key tasks including:
forming a DAG of operations and binding these operations to real code. Drill does this using a JSON concrete syntax so that the DAG can be created or executed in a variety of implementation languages including Java and C++.
moving schema-free or flexible-schema nested data through an execution DAG as efficiently as data with a rigid relational schema. Drill uses a novel column-oriented format with a new batch-adaptive schema technology. This confines the inefficiencies associated with non-columnar data or flexible schemas to the data sources, where the penalty is paid only once. From that point on, very fast execution is possible, on par with any other data format.
transforming DAGs from surface query syntax trees to logical query plans to execution plans. Drill supports SQL 2003 as a primary query language, but allows access to the internal logical plan language as well. The logical plan is created from the query syntax tree and then transformed into a multi-node execution plan using an advanced cost-based optimizer. This optimizer uses evolutionary algorithms to find efficient query forms and is useful in its own right.
executing queries expressed as DAGs both correctly and quickly. The execution plans in Drill are expressed as DAGs that are passed to remote execution engines. To execute these at very high speed, code generation can be used. To make sure that the execution is correct, Drill provides a reference interpreter. The trick here is that the abstract plan language allows the reference interpreter to be implemented in Java while other execution engines can be implemented in C++ or other languages. The Drill reference interpreter allows other execution engines to be very aggressive in how they run, but still be testable.
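To make the first and last points more concrete, here is a minimal sketch of a DAG written in JSON, bound to real code through a registry, and run by a tiny reference interpreter. It is only an illustration of the idea: the plan layout, the operator names (scan, filter, project), and the helper names REGISTRY and run_plan are all hypothetical and are not Drill's actual plan syntax or API.

```python
import json

# Hypothetical plan format, invented for this sketch (NOT Drill's plan syntax).
# Each operator has an id, an operator name, and the ids of its inputs.
PLAN_JSON = """
{
  "operators": [
    {"id": 1, "op": "scan", "inputs": [],
     "rows": [{"name": "a", "n": 1}, {"name": "b", "n": 5}, {"name": "c"}]},
    {"id": 2, "op": "filter", "inputs": [1], "field": "n", "min": 2},
    {"id": 3, "op": "project", "inputs": [2], "fields": ["name"]}
  ],
  "root": 3
}
"""


def scan(spec, inputs):
    # The data source: here the rows are simply embedded in the plan.
    return list(spec["rows"])


def filter_rows(spec, inputs):
    (rows,) = inputs
    field, minimum = spec["field"], spec["min"]
    # Rows without the field are dropped, since schemas may vary per row.
    return [r for r in rows if field in r and r[field] >= minimum]


def project(spec, inputs):
    (rows,) = inputs
    return [{f: r.get(f) for f in spec["fields"]} for r in rows]


# Binding step: operator names in the plan are mapped to real code.
REGISTRY = {"scan": scan, "filter": filter_rows, "project": project}


def run_plan(plan_text):
    """Slow but simple interpreter: evaluate the DAG from the root down."""
    plan = json.loads(plan_text)
    by_id = {op["id"]: op for op in plan["operators"]}
    cache = {}  # each operator runs once, even if several consumers share it

    def evaluate(op_id):
        if op_id not in cache:
            spec = by_id[op_id]
            inputs = [evaluate(i) for i in spec["inputs"]]
            cache[op_id] = REGISTRY[spec["op"]](spec, inputs)
        return cache[op_id]

    return evaluate(plan["root"])


result = run_plan(PLAN_JSON)
print(result)
```

Because the plan is plain JSON, a second engine written in another language could read the same text, and its output could be compared against what this simple interpreter returns. That is the testing role described above for the reference interpreter.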