Scaling by Cheating: approximation, sampling, and fault-friendliness for scalable Big Learning

Track: store
Speaker(s): Sean Owen

Data storage and analysis technology has traditionally focused on absolute guarantees of accuracy: transactional correctness, consistency, error correction. Big Data storage paradigms like NoSQL give up some of these guarantees for scalability, since some large-scale data processing doesn't demand them. Machine learning applications differ in that there are no known correct answers or outputs to begin with. So what about Big Learning?

It turns out that in these large-scale learning applications it's often fine to sample, approximate, estimate, randomize, guess, or even accept some data loss. In fact, such "cheating" is often essential in order to scale up, and it can improve results or reduce costs even while, strangely, sacrificing correctness in the details.
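To give a flavor of this kind of cheating, here is a minimal, self-contained sketch (not code from Mahout or Myrrix; the class name and data are purely illustrative) that estimates a statistic from a small random sample instead of a full pass over the data:

```java
import java.util.Random;

/**
 * Illustrative only: estimate the mean of a large data set from a small
 * random sample. The answer is approximate, but close enough for many
 * learning tasks and far cheaper than scanning everything.
 */
public class SamplingSketch {

  public static void main(String[] args) {
    Random random = new Random(123);

    // Stand-in for a large data set we would rather not scan completely
    int n = 10_000_000;
    double[] data = new double[n];
    for (int i = 0; i < n; i++) {
      data[i] = random.nextGaussian() * 5.0 + 100.0;
    }

    // Exact mean: a full pass over all n values
    double exact = 0.0;
    for (double value : data) {
      exact += value;
    }
    exact /= n;

    // Estimated mean: only a 0.1% uniform random sample
    int sampleSize = n / 1000;
    double estimate = 0.0;
    for (int i = 0; i < sampleSize; i++) {
      estimate += data[random.nextInt(n)];
    }
    estimate /= sampleSize;

    System.out.println("exact mean:     " + exact);
    System.out.println("estimated mean: " + estimate);
  }
}
```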

This talk will survey several representative parts of Apache Mahout where these techniques are deployed successfully, including random projection, sampling, and approximations. It will also draw an example of tolerating data loss from Myrrix, a recommender product built from Mahout. These examples will be generalized to suggest and inspire applications in other Big Learning projects.
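As a second illustration, here is a minimal sketch of the random projection idea, again purely illustrative rather than Mahout's implementation: a random Gaussian matrix maps high-dimensional vectors to far fewer dimensions while roughly preserving geometry (the Johnson-Lindenstrauss lemma), trading exactness for a much smaller, cheaper representation.

```java
import java.util.Random;

/**
 * Illustrative only: reduce d-dimensional vectors to k dimensions by
 * multiplying with a random Gaussian matrix. Norms and pairwise distances
 * are approximately, not exactly, preserved -- the "cheat".
 */
public class RandomProjectionSketch {

  public static void main(String[] args) {
    int d = 1000;  // original dimensionality
    int k = 50;    // reduced dimensionality
    Random random = new Random(42);

    // Random projection matrix, entries ~ N(0, 1), scaled by 1/sqrt(k)
    double[][] projection = new double[k][d];
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < d; j++) {
        projection[i][j] = random.nextGaussian() / Math.sqrt(k);
      }
    }

    // Some arbitrary high-dimensional input vector
    double[] x = new double[d];
    for (int j = 0; j < d; j++) {
      x[j] = random.nextDouble();
    }

    // Project: y = P * x
    double[] y = new double[k];
    for (int i = 0; i < k; i++) {
      double sum = 0.0;
      for (int j = 0; j < d; j++) {
        sum += projection[i][j] * x[j];
      }
      y[i] = sum;
    }

    // The two norms should be close, though not identical
    System.out.println("original norm:  " + norm(x));
    System.out.println("projected norm: " + norm(y));
  }

  private static double norm(double[] v) {
    double sum = 0.0;
    for (double value : v) {
      sum += value * value;
    }
    return Math.sqrt(sum);
  }
}
```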

Some basic familiarity with machine learning is likely required to make the most of the discussion. No Mahout knowledge is necessary.

About the speaker: 
Sean is a committer and PMC member for Apache Mahout, author of much of its recommender / collaborative filtering implementation, and co-author of Mahout in Action. He is the founder of Myrrix (http://myrrix.com), a commercialization of recommender technology that evolved from his work on Mahout.