Revision as of 15:17, 16 April 2008 by Sraghav (Talk)

Types of Clustering Methods

  1. Partitioning Methods:
    The partitioning method divides a data set into M different clusters, where each object belongs to one of these clusters. Each cluster has a representative element, or centroid, which describes all the objects contained in that cluster. In some cases, the arithmetic mean of the attribute vectors of all the objects in the cluster is a good representative.

    Method: This method determines the clusters of a data set in one pass, and runs in O(NlogN). The disadvantage of this method is that it is highly dependent on the order in which the objects are processed.

    I. Set the 1st object as the centroid for the 1st cluster.
    II. Calculate S (similarity coefficient) for each object with each existing centroid.
    III. If the calculated S for a specific object is higher than a given threshold, add the object to the corresponding cluster, and recalculate the centroid for this cluster. If the S value is lower than the threshold, create a new cluster. If at the end some objects have not been clustered, repeat Step II.

  2. Hierarchical Agglomerative Methods:

    Method: This method determines clusters by adding the closest pair of objects to a single cluster. This algorithm run in N^3 time.

    I. Determine which 2 objects are closest to each other, and insert them into one cluster.
    II. Determine the next 2 closest objects. (Note: In this step, an object can be a single object or a cluster)
    III. If at the end, some objects have not been clustered, repeat Step II.



Methods OldKiwi.jpg
Image from: http://www.daylight.com/meetings/mug96/

Various Clustering Algorithms

Description: This is a link to the Speech and Image Processing group at Joensuu Science Park in Finland. It includes links to many programs and pseudo codes which are intended for clustering data sets with an undetermined number of clusters. These programs two problems commonly faced in clustering applications: determining the number of clusters in the data set, and finding the location of each of these clusters. Since clustering is considered an NP-complete problem, ideal solutions have exponential running time. For this reason, the algorithms presented by this group are sub-optimal, but run in polynomial time. [1]

Alumni Liaison

Basic linear algebra uncovers and clarifies very important geometry and algebra.

Dr. Paul Garrett