ORC File - Improving Hive Data Storage


Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding -- resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Finally, ORC works together with the upcoming query vectorization work providing a high bandwidth reader/writer interface.

About the speaker: 
Owen has been contributing to Apache Hadoop since before it was called Hadoop. He was the first committer added to the project and has provided technical leadership on MapReduce, and security. Using Hadoop, in 2008 he set the world record for sorting a terabyte of data in 3.5 minutes and in 2009 he sorted a petabyte in 16.25 hours. He was also the founding chair of the Apache Hadoop Project Management Committee. For the last year, he has been working on Hive. He has a PhD in Software Engineering from the University of California, Irvine. Owen may be followed on Twitter: @owen_omalley.

Schedule info

Time slot: 
3 June 14:45 - 15:30