Scaling by Cheating: approximation, sampling, and fault-friendliness for scalable Big Learning
Data storage and analysis technology has traditionally focused on absolute guarantees of accuracy: transactional correctness, consistency, and error correction. Big Data storage paradigms like NoSQL give up some of these guarantees for scalability, since much large-scale data processing doesn't demand them. Machine Learning applications differ again: there are no known correct answers or outputs to begin with. So what about Big Learning?
It turns out that in these large-scale learning applications it's often fine to sample, approximate, estimate, randomize, guess, or even accept some data loss. In fact, doing so is often essential in order to scale up, and it can improve results or reduce costs even while, strangely, sacrificing correctness in the details.
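To give a concrete sense of the trade, here is a minimal Java sketch, not code from the talk or from Mahout, that estimates the mean of a large data set from a small random sample instead of a full pass; the class name, method name, and sample size are illustrative assumptions.

import java.util.Random;

/** Illustrative sketch: estimate the mean of a large data set from a random sample. */
public class SampledMean {

  /** Estimates the mean by averaging sampleSize elements chosen uniformly at random (with replacement). */
  static double sampledMean(double[] data, int sampleSize, long seed) {
    Random random = new Random(seed);
    double sum = 0.0;
    for (int i = 0; i < sampleSize; i++) {
      sum += data[random.nextInt(data.length)];
    }
    return sum / sampleSize;
  }

  public static void main(String[] args) {
    double[] data = new double[10_000_000];
    Random random = new Random(42);
    for (int i = 0; i < data.length; i++) {
      data[i] = random.nextGaussian() + 3.0;   // true mean is about 3.0
    }
    // A 1% sample typically lands very close to the exact mean, at a fraction of the cost.
    System.out.println("Sampled estimate: " + sampledMean(data, 100_000, 1L));
  }
}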
This talk will survey several representative parts of Apache Mahout where these techniques are deployed successfully, including random projection, sampling, and approximation. It will also draw an example of tolerating data loss from Myrrix, a recommender product built on Mahout. These examples will be generalized to suggest and inspire applications in other Big Learning projects.
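To make the random projection idea concrete, the following is an illustrative Java sketch of a Johnson-Lindenstrauss-style projection, not Mahout's implementation; the dimensions, seeds, and names are assumptions chosen for the example. Projecting through a random Gaussian matrix approximately preserves pairwise distances while drastically reducing dimensionality.

import java.util.Random;

/** Illustrative sketch of random projection: reduce d-dimensional vectors to k dimensions
 *  while approximately preserving pairwise distances. */
public class RandomProjection {

  /** Builds a k x d projection matrix with Gaussian entries scaled by 1/sqrt(k). */
  static double[][] randomMatrix(int k, int d, long seed) {
    Random random = new Random(seed);
    double scale = 1.0 / Math.sqrt(k);
    double[][] r = new double[k][d];
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < d; j++) {
        r[i][j] = scale * random.nextGaussian();
      }
    }
    return r;
  }

  /** Projects one d-dimensional vector down to k dimensions. */
  static double[] project(double[][] r, double[] x) {
    double[] y = new double[r.length];
    for (int i = 0; i < r.length; i++) {
      double dot = 0.0;
      for (int j = 0; j < x.length; j++) {
        dot += r[i][j] * x[j];
      }
      y[i] = dot;
    }
    return y;
  }

  static double distance(double[] u, double[] v) {
    double sum = 0.0;
    for (int i = 0; i < u.length; i++) {
      double diff = u[i] - v[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    int d = 10_000;   // original dimensionality
    int k = 200;      // reduced dimensionality
    double[][] r = randomMatrix(k, d, 1L);
    Random random = new Random(2L);
    double[] a = new double[d];
    double[] b = new double[d];
    for (int j = 0; j < d; j++) {
      a[j] = random.nextGaussian();
      b[j] = random.nextGaussian();
    }
    // The two distances should agree within a few percent, despite a 50x reduction in size.
    System.out.println("Original distance:  " + distance(a, b));
    System.out.println("Projected distance: " + distance(project(r, a), project(r, b)));
  }
}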
Some basic familiarity with machine learning is likely required to make the most of the discussion. No Mahout knowledge is necessary.