Clustering of Real-time Data at Scale
Last year at Buzzwords it was reported that the Apache Mahout project would soon have a new kind of clustering algorithm that promised extraordinary speed. Since then, that promise has been fulfilled. This new algorithm is extraordinarily fast, possibly the fastest production clustering algorithm available. It also has many unusual characteristics that make clustering applicable in new ways.
This talk is a report on the progress of this new kind of clustering. I will describe the theory behind how this algorithm works and how it is able to provide high quality clustering with only a single pass through the data. Mostly, however, I will focus on practical results of this algorithm.
The results cover the following areas:
Speed. We will show just how fast this algorithm is, both in single-machine threaded implementations and in a map-reduce version.
Quality. We will show how this algorithm compares with the other clustering algorithms available in Mahout.
Scalability. We will show how this algorithm can be applied to very, very large data sets on large clusters.
Integration. We will show how this algorithm can be integrated into search engines such as Solr.
We will also show how real-time on-line clustering can be done using these new algorithms with Storm. Using Storm and streaming k-means, it is possible to cluster and re-cluster data that is streaming by at very high speeds. Even without storing all of the historical data, an accurate clustering of all historical data can be produced at any point in time. One use of this capability is real-time change-point detection, which finds the points in time at which the statistical properties of a data stream change in some interesting way. Another use is real-time anomaly detection, in which anomalous data points are identified as they arrive, without any background batch processing.
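To make the single-pass idea concrete, here is a minimal sketch in Python. It is not Mahout's implementation and it does not use Storm. It assumes a simplified, deterministic rule: a point that lies farther than a fixed cutoff from every existing centroid starts a new centroid, and any other point is merged into its nearest centroid by an incremental mean update. The class name `OnlineSketch` and the parameter `distance_cutoff` are hypothetical, chosen only for illustration.

```python
import math


class OnlineSketch:
    """Single-pass clustering sketch: weighted centroids updated one point at a time."""

    def __init__(self, distance_cutoff):
        # Hypothetical tuning parameter: how far a point may be from its
        # nearest centroid before it starts a centroid of its own.
        self.distance_cutoff = distance_cutoff
        self.centroids = []  # each entry is [mean_vector, weight]

    def nearest(self, point):
        """Return (index, distance) of the closest centroid, or (None, inf) if empty."""
        best_index, best_distance = None, math.inf
        for i, (mean, _) in enumerate(self.centroids):
            d = math.dist(mean, point)
            if d < best_distance:
                best_index, best_distance = i, d
        return best_index, best_distance

    def add(self, point):
        """Fold one point into the sketch; return True if it started a new centroid."""
        index, distance = self.nearest(point)
        if index is None or distance > self.distance_cutoff:
            self.centroids.append([list(point), 1])
            return True
        mean, weight = self.centroids[index]
        new_weight = weight + 1
        # Incremental mean update, so the raw point never needs to be stored.
        self.centroids[index][0] = [m + (x - m) / new_weight for m, x in zip(mean, point)]
        self.centroids[index][1] = new_weight
        return False


sketch = OnlineSketch(distance_cutoff=1.0)
for p in [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.0, 10.2), (0.1, 0.1)]:
    started_new = sketch.add(p)
    print(p, "new centroid" if started_new else "merged")
```

Because only the centroids and their weights are kept, memory use depends on the number of centroids rather than on the length of the stream. Under the same assumptions, the return value of `add` can serve as a crude anomaly signal, since a point that starts a new centroid is far from everything seen so far.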