Bug bites Elephant: Test-driven Quality Assurance in Big Data Application Development


Around today's large piles of Big Data, quite a mixed gathering convenes: business engineers define which insights would be valuable, analysts build models, Hadoop programmers tame the flood of data, and operations people set up machines and networks. It is precisely the interplay of all these participants that is central to project success. This setup, together with the distributed nature of processing, poses new challenges to well-established models of assuring software artifact quality: How can non-programmers define acceptance criteria? How can functionality be tested that depends on cluster execution or the orchestration of, e.g., several Hadoop jobs, without delaying the development process? Which data selection is best suited for simulating the live environment? How can intermediate results in arbitrary serialization formats be inspected?

In this talk, we present experiences and best practices from approaching these problems in a large-scale log data analysis project. At 1&1, our team develops Hadoop applications that process roughly 1 billion log events (~1 TB) per day. We give an overview of the hardware and software setup of our quality assurance environment, which includes FitNesse as a wiki-style acceptance testing framework. Starting from a comparison with existing test frameworks such as MRUnit, we explain how we automate the parameterized deployment of our applications, choose test data sampling strategies, manage and orchestrate workflows of jobs and applications, and use Pig to inspect intermediate results and define final acceptance criteria. Our conclusion is that test-driven development in the field of Big Data requires adaptation of existing paradigms, but is crucial for maintaining high quality standards in the resulting applications.
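To give a flavor of the Pig-based inspection of intermediate results mentioned above, here is a minimal sketch; the path, schema, and acceptance criterion are hypothetical illustrations, not the actual 1&1 pipeline:

```pig
-- Load the output of an intermediate job stage for inspection.
-- Path and schema are hypothetical examples.
events = LOAD '/data/intermediate/sessionized' USING PigStorage('\t')
         AS (user_id:chararray, ts:long, url:chararray);

-- Illustrative acceptance criterion: no event may lack a user id.
bad = FILTER events BY (user_id IS NULL) OR (user_id == '');
bad_count = FOREACH (GROUP bad ALL) GENERATE COUNT(bad) AS n;

-- Emit the violation count; (0) means the criterion holds.
DUMP bad_count;
```

Such a script can be run ad hoc against any serialized intermediate dataset (given a suitable load function), which is what makes Pig convenient both for exploratory inspection and for encoding acceptance checks.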

About the speaker: 
Dr. Dominik Benz studied Computer Science with a minor in Psychology at the University of Freiburg, Germany. During his PhD at the Knowledge and Data Engineering Group (University of Kassel), he applied Data Mining and Knowledge Discovery methods to large datasets from Social Web systems in order to discover emergent semantic structures. Since November 2012 he has been working as a Big Data Engineer at Inovex GmbH, focusing on quality-driven development of Hadoop applications in Business Intelligence contexts.

Schedule info

Time slot: 
3 June 12:00 - 12:45