In:
Proceedings of the VLDB Endowment, Association for Computing Machinery (ACM), Vol. 16, No. 7 ( 2023-03), p. 1740-1748
Abstract:
How can we efficiently and scalably cluster high-dimensional data? The k -means algorithm clusters data by iteratively reducing intra-cluster Euclidean distances until convergence. While it finds applications from recommendation engines to image segmentation, its application to high-dimensional data is hindered by the need to repeatedly compute Euclidean distances among points and centroids. In this paper, we propose Marigold ( k -means for high-dimensional data), a scalable algorithm for k -means clustering in high dimensions. Marigold prunes distance calculations by means of (i) a tight distance-bounding scheme; (ii) a stepwise calculation over a multiresolution transform; and (iii) exploiting the triangle inequality. To our knowledge, such an arsenal of pruning techniques has not been hitherto applied to k -means. Our work is motivated by time-critical Angle-Resolved Photoemission Spectroscopy (ARPES) experiments, where it is vital to detect clusters among high-dimensional spectra in real time. In a thorough experimental study with real-world data sets we demonstrate that Marigold efficiently clusters high-dimensional data, achieving approximately one order of magnitude improvement over prior art.
Type of Medium:
Online Resource
ISSN:
2150-8097
DOI:
10.14778/3587136.3587147
Language:
English
Publisher:
Association for Computing Machinery (ACM)
Publication Date:
2023
detail.hit.zdb_id:
2478691-3
Permalink