Fast Density-based Clustering in Parallel with Different Parallel K-d Tree Schemes

Author: Henning Meyerhenke
E-Mail address: henningm (AT) cs.uni-jena.de

Keywords: Parallel density-based clustering, Parallel k-d tree


Abstract. Since obtaining spatial data has become very easy through technologies like remote sensing or medical devices such as computer tomography, the processing of this large amount of data into information has become a computational issue. One technique of this knowledge discovery process in spatial databases is clustering - the grouping of objects depending on their spatial proximity. One of the proposed algorithms for this is FDC [11], which uses a density-based clustering approach. As there is a need for parallel processing in very large databases to distribute resource allocation, this paper introduces PFDC, a parallel version of FDC. Three approaches how to parallelize the algorithm are presented, two of them in detail. The major difference between them is the implementation of the parallel spatial access method, the data structure k-d tree. All of these approaches have been implemented with standard tools (C++ and MPI message passing) in order to work on a variety of parallel platforms. Experiments show that the two variations presented in detail achieve good speedup results.


Bibliography

  1. I. Al-furaih, S. Aluru, S. Goil, and S. Ranka: Practical Algorithms for Selection on Coarse-Grained Parallel Computers, IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 8, pp. 313-324, Aug. 1997.

  2. I. Al-furaih, S. Aluru, S. Goil, and S. Ranka: Parallel Construction of Multidimensional Binary Search Trees. In: IEEE Trans. on Parallel and Distributed Systems, vol. 11, no. 2, pp. 136-148, Feb. 2000.

  3. T. Cormen, C. E. Leiserson, R. L. Rivest: Introduction to Algorithms. MIT Press, 1990.

  4. M. Ester, H.-P. Kriegel, J. Sander, X. Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pp. 226-231.

  5. J. Jàjà: Introduction to parallel algorithms. Addison-Wesley, 1992.

  6. K. Mehlhorn, S. Näher: LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, 1999.

  7. H. Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.

  8. H. Shi, J. Schaeffer: Parallel sorting by regular Sampling. Journal of Parallel and Distributed Computing, vol. 14, no. 4, pp. 361-372, April 1992.

  9. Software library LAM/MPI available at www.lam-mpi.org.

  10. X. Xu, J. Jaeger, H.-P. Kriegel: A Fast Parallel Clustering Algorithm for Large Spatial Databases. In: High Performance Data Mining. Ed. by Y. Guo and R. Grossmann, pp. 29-56. Kluwer Academic Publishers, 1999.

  11. B. Zhou, D. W. Cheung, B. Kao: A Fast Algorithm for Density-Based Clustering in Large Database. In: Proc. of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '99), pp. 338-349. Springer-Verlag, 1999.