Latest revision as of 09:13, 24 March 2008

== Introduction ==

The k-means algorithm clusters n objects into k partitions (k < n) based on their attributes. It assumes that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance, i.e., the squared error function.


$ V=\sum_{i=1}^{k}\sum_{x_j\in S_i}{\left(x_j-\mu_i\right)}^{2} $

In other words, the algorithm chooses the clusters so that the average distance between the data vectors and the chosen means (the model parameters) is minimized; the means are then a good representation of how the data is distributed in the space.
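For concreteness, the objective V above can be evaluated directly for a given partition. The following is a minimal Python sketch (the function and variable names are our own, not from the paper):

```python
def squared_error(clusters, means):
    """Total intra-cluster variance V: the sum, over all clusters S_i,
    of the squared distances between each vector x_j in S_i and the
    cluster mean mu_i."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, mu))  # squared distance ||x - mu||^2
        for S, mu in zip(clusters, means)
        for x in S
    )
```

For example, two points (0,0) and (0,2) around mean (0,1) each contribute a squared distance of 1, giving V = 2 for that cluster.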

The k-means algorithm is usually implemented as an iterative procedure:

1) Randomly initialize k means.

2) For every data vector, determine which mean it is "closest" to and record this assignment.

3) For each mean, recompute it as the average of the data vectors assigned to it.

4) Check whether the average distance between the data vectors and their means is below some threshold. If yes, we are done; otherwise, go back to step 2.
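The four steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation; all names are our own, and the stopping rule here checks whether the average distance stops improving by more than the threshold, one common reading of step 4:

```python
import random

def kmeans(data, k, threshold=1e-6, max_iter=100):
    """Cluster a list of points (tuples of floats) into k groups.
    Returns (means, assignments)."""
    # 1) Randomly initialize k means by sampling distinct data points.
    means = random.sample(data, k)
    assignments = [0] * len(data)
    prev_avg = float("inf")
    for _ in range(max_iter):
        # 2) Assign each data vector to its "closest" mean (squared distance).
        for i, x in enumerate(data):
            assignments[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])),
            )
        # 3) Recompute each mean from the vectors assigned to it.
        for j in range(k):
            members = [x for i, x in enumerate(data) if assignments[i] == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # 4) Stop once the average distance no longer improves past the threshold.
        avg = sum(
            sum((a - b) ** 2 for a, b in zip(x, means[assignments[i]])) ** 0.5
            for i, x in enumerate(data)
        ) / len(data)
        if prev_avg - avg < threshold:
            break
        prev_avg = avg
    return means, assignments
```

On well-separated data, e.g. a few points near (0, 0) and a few near (5, 5) with k = 2, the procedure recovers the two groups after a handful of iterations.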

== Software ==

* Efficient k-Means Implementation (follow link under [[Tools_OldKiwi]])

== References ==

* T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
