Line 26: Line 26:
  
 
<center>[[Image:runyan2.jpg|frame|none|alt=Alt text|<font size= 4> '''Figure 2''' </font size>]] </center> <br />
 
<center>[[Image:runyan2.jpg|frame|none|alt=Alt text|<font size= 4> '''Figure 2''' </font size>]] </center> <br />
 +
 +
In some ways, each cluster can be thought of a different "class" of data points. In this way, clustering can be seen as the unsupervised analog of the classification problem.
 +
 +
There is no "best" clustering algorithm. Any algorithm imposes some sort of structure on the data. If the structure imposed by the clustering algorithm matches the true structure of the data, then the actual clusters may be discovered. As a result, there are many different clustering methods. Here I will describe a popular and intuitive method known as K-means clustering.
  
 
----
 
----
 +
<center>
 +
=K-means clustering=
 +
</center>
 +
 +
One way of thinking about a cluster is a set of points who's pairwise distances are small compared to the distance to the points of a different cluster. This is illustrated in the following figure:
 +
 +
<center>[[Image:runyan3.jpg|frame|none|alt=Alt text|<font size= 4> '''Figure 3''' </font size>]] </center> <br />
 +
In the figure above, the magenta line represents a general distance between two points of the same cluster and the red line represents a general distance between points of different clusters. Intuitively, we expect the red line to be significantly longer than the magenta line.
 +
 +
 +
 
----
 
----
 
----
 
----
  
 
[[2014_Spring_ECE_662_Boutin|Back to ECE662, Spring 2014]]
 
[[2014_Spring_ECE_662_Boutin|Back to ECE662, Spring 2014]]

Revision as of 15:26, 6 May 2014


Introduction to Clustering A slecture by CS student David Runyan


Introduction


In class, we covered the simple Bayesian classifier. This form of classification falls under a category known as supervised learning. What this means is that a set of labelled data data is used to to "train" the underlying model. However, it is not always possible to have such a data set, yet we may still wish to discover some form of underlying structure in an unlabelled data set. Such a task falls under the category of unsupervised learning.

Clustering is a form of unsupervised learning. "Clustering is the problem of identifying groups, or clusters of data points in multidimensional space". For example, consider the following data set:

Alt text
Figure 1

Although this data set is not labelled, we can still clearly see three different groups, or clusters, of data points, as shown here:

Alt text
Figure 2

In some ways, each cluster can be thought of a different "class" of data points. In this way, clustering can be seen as the unsupervised analog of the classification problem.

There is no "best" clustering algorithm. Any algorithm imposes some sort of structure on the data. If the structure imposed by the clustering algorithm matches the true structure of the data, then the actual clusters may be discovered. As a result, there are many different clustering methods. Here I will describe a popular and intuitive method known as K-means clustering.


K-means clustering

One way of thinking about a cluster is a set of points who's pairwise distances are small compared to the distance to the points of a different cluster. This is illustrated in the following figure:

Alt text
Figure 3

In the figure above, the magenta line represents a general distance between two points of the same cluster and the red line represents a general distance between points of different clusters. Intuitively, we expect the red line to be significantly longer than the magenta line.




Back to ECE662, Spring 2014

Alumni Liaison

Ph.D. on Applied Mathematics in Aug 2007. Involved on applications of image super-resolution to electron microscopy

Francisco Blanco-Silva