Sampling Within k-Means Algorithm to Cluster Large Datasets

Locations

LS 152

Description

Data collection technology has advanced to the point where our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets, even on the most powerful machines. Our new algorithm implements sampling within k-means to reduce the amount of data analyzed, thereby decreasing run time. We perform a simulation study comparing our sampling-based k-means with the standard k-means algorithm in terms of both speed and accuracy. Results show that our algorithm is significantly more efficient than the standard algorithm while achieving comparable accuracy.
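
To make the idea of sampling within k-means concrete, here is a minimal Python sketch. It is not the presenters' algorithm: the abstract does not say how or when the sample is drawn, so this sketch assumes a fresh uniform random subsample at every iteration, with centers updated from that subsample only. The names `sampled_kmeans`, `assign_labels`, `sample_size`, and `n_iter` are hypothetical and chosen for illustration.

```python
import numpy as np


def assign_labels(X, centers):
    """Return the index of the nearest center for every row of X."""
    # Squared Euclidean distance from each point to each center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def sampled_kmeans(X, k, sample_size, n_iter=20, seed=0):
    """Lloyd-style k-means that updates centers from a random subsample.

    Assumption (not from the source): a new uniform sample without
    replacement is drawn at each iteration.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = min(sample_size, n)
    # Initialize centers with k distinct data points chosen at random.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        sample = X[rng.choice(n, size=m, replace=False)]
        labels = assign_labels(sample, centers)
        for j in range(k):
            members = sample[labels == j]
            # Keep the old center if no sampled point was assigned to it.
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers


# Usage on synthetic data: two Gaussian blobs in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(5000, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(5000, 2)),
])
centers = sampled_kmeans(X, k=2, sample_size=500)
labels = assign_labels(X, centers)  # one final pass over the full data
```

In this sketch each iteration costs time proportional to `sample_size` rather than to the full dataset size, which is where a speedup would come from. Only the final labeling touches every point. How much accuracy is lost depends on the sample size and the data, which is the trade-off the simulation study examines.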

This research was completed as part of the REU Site Interdisciplinary Program in High Performance Computing at the University of Maryland, Baltimore County.
