(Clustering method, given distances between each pairs of objects in the dataset)
(Clustering method, given distances between each pairs of objects in the dataset)
Line 28: Line 28:
  
 
* A ''complete graph'' is a graph with <math>d(d-1)/2</math> edges.
 
* A ''complete graph'' is a graph with <math>d(d-1)/2</math> edges.
 
* A ''subgraph'' <math>G'</math> of a graph <math>G=(V,E,f)</math> is a graph <math>(V',E',f')</math> such that <math>V'\subset V</math> <math>E'\subset E</math> <math>f'\subset f</math> restricted to <math>E'</math>
 
  
 
Example:  
 
Example:  
Line 35: Line 33:
 
  Number of edges e = 6
 
  Number of edges e = 6
 
[[Image:ECE662_lect23_Old Kiwi.gif]]
 
[[Image:ECE662_lect23_Old Kiwi.gif]]
 +
 +
* A ''subgraph'' <math>G'</math> of a graph <math>G=(V,E,f)</math> is a graph <math>(V',E',f')</math> such that <math>V'\subset V</math> <math>E'\subset E</math> <math>f'\subset f</math> restricted to <math>E'</math>
  
 
* A ''path'' in a graph between <math>V_i,V_k \subset V_k</math> is an alternating sequence of vertices and edges containing no repeated edges and no repeated vertices and for which <math>e_i</math> is incident to <math>V_i</math> and <math>V_{i+1}</math>, for each <math>i=1,2,\dots,k-1</math>. (<math>V_1 e_1 V_2 e_2 V_3 \dots V_{k-1} e_{k-1} V_k</math>)
 
* A ''path'' in a graph between <math>V_i,V_k \subset V_k</math> is an alternating sequence of vertices and edges containing no repeated edges and no repeated vertices and for which <math>e_i</math> is incident to <math>V_i</math> and <math>V_{i+1}</math>, for each <math>i=1,2,\dots,k-1</math>. (<math>V_1 e_1 V_2 e_2 V_3 \dots V_{k-1} e_{k-1} V_k</math>)

Revision as of 11:17, 8 April 2008

Clustering method, given distances between each pairs of objects in the dataset

Let d be the number of objects. Let $ DIST_{ij} $ denote the distance between objects $ X_i $ and $ X_j $. The notion of distance here is not clear unless the application itself is considered. For example, when dealing with some psychological studies data, we may have to consult the psycholigist himself as to what is the "distance" between given two concepts. These distances form the input to our clustering algorithm.

The constraint on these distances is that they be - symmetric between any two objects ($ DIST_{ij}=DIST_{ji} \forall i\neq j $) - always positive - zero for distance of an object from itself ($ DIST_{ii}=0 $) - follow \bigtriangleup inequality

Dist table Old Kiwi.jpg


Graph Theory Clustering

dataset $ \{x_1, x_2, \dots , x_d\} $ no feature vector given.

given $ dist(x_i , x_j) $

Construct a graph:

  • node represents the objects.
  • edges are relations between objects.
  • edge weights represents distances.


Definitions:

  • A complete graph is a graph with $ d(d-1)/2 $ edges.

Example:

Number of nodes d = 4
Number of edges e = 6

ECE662 lect23 Old Kiwi.gif

  • A subgraph $ G' $ of a graph $ G=(V,E,f) $ is a graph $ (V',E',f') $ such that $ V'\subset V $ $ E'\subset E $ $ f'\subset f $ restricted to $ E' $
  • A path in a graph between $ V_i,V_k \subset V_k $ is an alternating sequence of vertices and edges containing no repeated edges and no repeated vertices and for which $ e_i $ is incident to $ V_i $ and $ V_{i+1} $, for each $ i=1,2,\dots,k-1 $. ($ V_1 e_1 V_2 e_2 V_3 \dots V_{k-1} e_{k-1} V_k $)
  • A graph is "connected" if a path exists between any two vertices in the graph
  • A component is a maximal connected graph. (i.e. includes as many nodes as possible)
  • A maximal complete subgraph of a graph $ G $ is a complete subgraph of $ G $ that is not a proper subgraph of any other complete subgraph of $ G $.
  • A cycle is a path of non-trivial length $ k $ that comes back to the node where it started
  • A tree is a connected graph with no cycles. The weight of a tree is the sum of all edge weights in the tree.
  • A spanning tree is a tree containing all vertices of a graph.
  • A minimum spanning tree (MST) of a graph G is tree having minimal weight among all spanning trees of $ G $.



Graphical Clustering Methods

  • "Similarity Graph Methods"

Choose distance threshold $ t_0 $

If $ dis(X_i,X_j)<t_0 $ draw an ege between $ X_i $ and $ X_j $


Example) $ t_0= 1.3 $

<<Picture>>

Can define clusters s the connected component of the similarity graph

=> Same result as "Single linkage algorithms"

$ X_i \sim X_j $ if there is chain

$ X_i \sim X_{i_1} \sim X_{i_2} \sim \cdots X_{i_k} \sim X_{j} $ complete

Can also define clusters as the maximal subgraphs of similarity graph

<<Picture>>

=> More compact, less enlagated clusters

Not good for, say,

<<Picture>>

Alumni Liaison

Recent Math PhD now doing a post-doctorate at UC Riverside.

Kuei-Nuan Lin