= Notes from class =

Instructor: Mimi Boutin

=Good/bad news=

*There will not be a final homework assignment.
*Peer Review is coming along...

=Peer Review=

Double-blind review.

What should the review contain?

*Constructive comments. Do this for each question:
# Summarize what they did.
# Say what is good.
# Say what could be improved.
*Grading. Grade the whole assignment:
# OK (B)
# Good (A)
# Excellent/Very Good (A+)
# OK- or NOT OK (B-/Fail)

Due April 25, 11:59:59 pm (it will be accepted until April 28 if you handed in hw2 on time).

=Clustering=

*There is no single best criterion for obtaining a partition of <math>\mathbb{D}</math>.
*Each criterion imposes a certain structure on the data.
*If the data conform to this structure, the true clusters can be discovered.
*There are very few criteria which can be both handled mathematically and understood intuitively. If you can develop a useful intuitive criterion and describe and optimize it mathematically, you can make big contributions to this field.
*Some criteria can be written in different ways, for example minimizing the squared error, as in the following section.

==Minimizing the squared error==

Minimizing
<center><math>J=\sum_{j=1}^{c}\sum_{x_{i}\in S_{j}}||x_{i}-\mu_{j}||^{2}=\sum_{j=1}^{c}\frac{1}{2|S_{j}|}\sum_{x_{i},x_{k}\in S_{j}}||x_{i}-x_{k}||^{2}</math></center>
is the same as minimizing the trace of the within-class variation matrix,
<center><math>J_{within}=\sum_{j=1}^{c}\sum_{x_{i}\in S_{j}}(x_{i}-\mu_{j})^{T}(x_{i}-\mu_{j})=\sum_{j=1}^{c}\sum_{x_{i}\in S_{j}}tr((x_{i}-\mu_{j})(x_{i}-\mu_{j})^{T}),</math></center>
which is in turn the same as maximizing the between-cluster variance. Since
<center><math>S_{Total}=S_{W}+S_{B},</math></center>
<center><math>tr(S_{W})=tr(S_{Total})-tr(S_{B}),</math></center>
and the total variance <math>S_{Total}</math> does not change, maximizing the between-class variance is equivalent to minimizing the within-class variance.
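
These equivalences are easy to check numerically. Below is a minimal sketch (not from the notes; the data, labels, and cluster count are made up) that computes <math>J</math> three ways: from distances to the cluster means, from the pairwise-distance form (note the factor of 1/2, since the inner sum runs over ordered pairs), and as the trace of the within-cluster scatter matrix.

<pre>
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # 100 made-up samples in 2-D
labels = rng.integers(0, 3, size=100)  # an arbitrary assignment into c = 3 clusters

J_mean = 0.0                # sum of squared distances to cluster means
J_pair = 0.0                # pairwise form: (1 / 2|S_j|) * sum over ordered pairs
S_W = np.zeros((2, 2))      # within-cluster scatter matrix
for j in range(3):
    S = X[labels == j]
    mu = S.mean(axis=0)
    J_mean += ((S - mu) ** 2).sum()
    D = ((S[:, None, :] - S[None, :, :]) ** 2).sum()   # all ordered pairs
    J_pair += D / (2 * len(S))
    S_W += (S - mu).T @ (S - mu)

print(J_mean, J_pair, np.trace(S_W))   # all three values agree
</pre>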

*[Comment about decision surfaces which I missed...]

=Statistical Clustering Methods=

Clustering by mixture decomposition works best with Gaussians. What are examples of Gaussians? Circles and ellipses. However, if you have banana-like data, that won't work so well. [Article idea, with links to earlier discussion of this concept?]

If you are in higher dimensions, do not even think of using statistical clustering methods! There is simply too much space through which the data are spread, so the samples won't be lumped together very closely.

*Assume each pattern <math>x_{i}\in D</math> was drawn from a mixture of <math>c</math> underlying populations (<math>c</math> is the number of classes).
*Assume the form of each population is known:
<center><math>p(x|\omega_{i},\theta_{i}),</math></center>
where <math>\omega_{i}</math> is the population label and <math>\theta_{i}</math> is the vector of unknown parameters.
*Define the mixture density
<center><math>p(x|\theta)=\sum_{i=1}^{c}p(x|\omega_{i},\theta_{i})P(\omega_{i}),</math></center>
<center><math>\theta=(\theta_{1},\theta_{2},\ldots,\theta_{c}).</math></center>
*Use the patterns <math>x_{i}</math> to estimate <math>\theta</math> and the priors <math>P(\omega_{i})</math>.

Then the separation between the clusters is given by the separation surface between the classes (which is defined as ...).

[Cute figure of separation between classes here...]

==Example: Model the data as a Gaussian Mixture==

<center><math>p(x|\theta)=\sum_{i=1}^{c}P(\omega_{i})\frac{e^{-\frac{1}{2}(x-\mu_{i})^{T}\Sigma_{i}^{-1}(x-\mu_{i})}}{(2\pi)^{n/2}|\Sigma_{i}|^{1/2}}</math></center>

Note: the Maximum Likelihood approach to estimating <math>\theta</math> is the same as minimizing the cost function <math>J</math> of the between-class scatter matrices.
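
As a concrete illustration, here is a minimal sketch of evaluating the Gaussian mixture density defined above. The two-component 2-D example and all of its parameters are made up, not from the notes.

<pre>
import numpy as np

def gaussian(x, mu, Sigma):
    """Multivariate normal density p(x|omega_i, theta_i) at point x."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def mixture_density(x, priors, mus, Sigmas):
    """p(x|theta) = sum_i P(omega_i) p(x|omega_i, theta_i)."""
    return sum(P * gaussian(x, mu, S)
               for P, mu, S in zip(priors, mus, Sigmas))

# made-up two-component example in 2-D
priors = [0.4, 0.6]                                  # P(omega_1), P(omega_2)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # means
Sigmas = [np.eye(2), 2 * np.eye(2)]                  # covariances
print(mixture_density(np.array([1.0, 1.0]), priors, mus, Sigmas))
</pre>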

===Sub-example===

If <math>\Sigma_{1}=\Sigma_{2}=\ldots=\Sigma_{c}</math>, then this is the same as minimizing <math>|S_{W}|</math>.

===Sub-example 2===

If <math>\Sigma_{1},\Sigma_{2},\ldots,\Sigma_{c}</math> are not all equal, this is the same as minimizing
<center><math>\prod_{i=1}^{c}|S_{W}^{(i)}|.</math></center>

==A clustering algorithm by Prof. Bouman==

The name of the program is "Cluster." It has been seen to work on as many as 25 dimensions.

www.ece.purdue.edu/~bouman

It is available both in C and Matlab.

The key to the approach is the best-fit criterion, known as the [[Rissanen criterion]] or [[MDL]] (minimum description length). The quantity minimized is the number of bits needed to encode both the samples and the parameters of the clusters.
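
The following is a hedged sketch of MDL-style model-order selection, not Prof. Bouman's actual implementation. It assumes a Gaussian mixture fitted with scikit-learn, and it uses one common form of the MDL penalty: the negative log-likelihood (bits for the samples) plus half the number of free parameters times the log of the number of data values (bits for the parameters).

<pre>
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# made-up data: two well-separated lumps in 2-D
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
N, n = X.shape

best = None
for c in range(1, 6):
    gmm = GaussianMixture(n_components=c, random_state=0).fit(X)
    # free parameters: c means, c full covariances, c - 1 mixing weights
    L = c * n + c * n * (n + 1) // 2 + (c - 1)
    # total code length: -log-likelihood + (1/2) * L * log(N * n)
    mdl = -gmm.score(X) * N + 0.5 * L * np.log(N * n)
    if best is None or mdl < best[0]:
        best = (mdl, c)

print("MDL selects c =", best[1])   # picks c = 2 for this made-up data
</pre>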

Note: there is a typo in the key figure (two of the blocks have been exchanged).

#Initialize the clusters (pick a large number of them).
#Initialize the density parameters.

[Figure: diagram from the class 28 notes]

=Clustering by finding valleys of densities=

Idea: Cluster boundaries correspond to local minima of the density function (= valleys). In 2D:

[[Cute figure of valleys being used to separate data again]]

Advantages:

*no presumptions about the shape of the data
*no presumptions about the number of clusters

==Approach 1: "bump hunting"==

CLIQUE (Agrawal ''et al.'', 1998)

*Approximate the density function <math>p(x)</math> (e.g., with a Parzen window); a sketch of this step follows below.
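
Here is a minimal sketch of that first step (not the CLIQUE algorithm itself; the 1-D data and the bandwidth are made up): a Parzen window estimate of <math>p(x)</math> whose interior local minima are the candidate cluster boundaries.

<pre>
import numpy as np

def parzen(t, data, h):
    """Parzen window density estimate with a Gaussian kernel of width h."""
    k = np.exp(-0.5 * ((t[:, None] - data[None, :]) / h) ** 2)
    return k.sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
# made-up 1-D data: two lumps separated by a valley
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

t = np.linspace(-4, 9, 500)
p = parzen(t, data, h=0.5)

# local minima of the estimated density = candidate cluster boundaries
valleys = t[1:-1][(p[1:-1] < p[:-2]) & (p[1:-1] < p[2:])]
print(valleys)   # expect a valley near t = 2.5, between the two lumps
</pre>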

== Professor Bouman's "Cluster" Software ==

The "Cluster" software, developed by Purdue University's own Professor Charles Bouman, can be found, along with all supporting documentation, here: [http://cobweb.ecn.purdue.edu/~bouman/software/cluster/ Cluster Homepage]. The algorithm takes an iterative bottom-up (or agglomerative) approach to clustering. Unlike many clustering algorithms, this one uses the so-called "Rissanen criterion" or "minimum description length" (MDL) as the best-fit criterion. In short, MDL favors density estimates whose parameters may be encoded (along with the data) with very few bits; that is, the simpler it is to represent both the density parameters and the data in some binary form, the better the estimate is considered.

Below is the block diagram from the software manual (with the misplaced blocks corrected) describing the function of the software.

[[Image:BoumanClusterBlockDiagram_OldKiwi.jpg]]
