Revision as of 14:15, 16 March 2008 by Svenkata (Talk)

The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k < n. It assumes that the object attributes form a vector space. The objective it tries to achieve is to minimize total intra-cluster variance, or, the squared error function.


$ V=\sum_{i=1}^{k}\sum_{xj\epsilon Sj}^{}{(xj-\mu j)}^{2} $

So basically it is trying to choose clusters so that the average distance between the data vectors and the chosen means (basically the model parameter) is a minimum. That is the means are the a good representation of the way the data is distributed in the space.

Usually when we implement the K-means algorithm we use a iterative procedure.

1) Randomly initialize a bunch of means.

2) For every data vector see which mean it is "closest" to and update this fact.

3) Using the information about each data vector being close toa particular mean for each mean and its associated data recompute the new mean and update.

4) Check whether the average distance is less thansome threshold. If yes we are done ! Else go back to step 1.

Alumni Liaison

Meet a recent graduate heading to Sweden for a Postdoctorate.

Christine Berkesch