(Support Vector Machines (SVM))
 
(12 intermediate revisions by 4 users not shown)
Line 1: Line 1:
[http://balthier.ecn.purdue.edu/index.php/ECE662#Class_Lecture_Notes Class Lecture Notes]
+
<center><font size= 4>
 +
'''[[ECE662]]: Statistical Pattern Recognition and Decision Making Processes'''
 +
</font size>
  
 +
Spring 2008, [[user:mboutin|Prof. Boutin]]
  
 +
[[Slectures|Slecture]]
 +
 +
<font size= 3> Collectively created by the students in [[ECE662:BoutinSpring08_OldKiwi|the class]]</font size>
 +
</center>
 +
 +
----
 +
=Lecture 11 Lecture notes=
 +
Jump to: [[ECE662_Pattern_Recognition_Decision_Making_Processes_Spring2008_sLecture_collective|Outline]]|
 +
[[Lecture 1 - Introduction_OldKiwi|1]]|
 +
[[Lecture 2 - Decision Hypersurfaces_OldKiwi|2]]|
 +
[[Lecture 3 - Bayes classification_OldKiwi|3]]|
 +
[[Lecture 4 - Bayes Classification_OldKiwi|4]]|
 +
[[Lecture 5 - Discriminant Functions_OldKiwi|5]]|
 +
[[Lecture 6 - Discriminant Functions_OldKiwi|6]]|
 +
[[Lecture 7 - MLE and BPE_OldKiwi|7]]|
 +
[[Lecture 8 - MLE, BPE and Linear Discriminant Functions_OldKiwi|8]]|
 +
[[Lecture 9 - Linear Discriminant Functions_OldKiwi|9]]|
 +
[[Lecture 10 - Batch Perceptron and Fisher Linear Discriminant_OldKiwi|10]]|
 +
[[Lecture 11 - Fischer's Linear Discriminant again_OldKiwi|11]]|
 +
[[Lecture 12 - Support Vector Machine and Quadratic Optimization Problem_OldKiwi|12]]|
 +
[[Lecture 13 - Kernel function for SVMs and ANNs introduction_OldKiwi|13]]| 
 +
[[Lecture 14 - ANNs, Non-parametric Density Estimation (Parzen Window)_OldKiwi|14]]|
 +
[[Lecture 15 - Parzen Window Method_OldKiwi|15]]|
 +
[[Lecture 16 - Parzen Window Method and K-nearest Neighbor Density Estimate_OldKiwi|16]]|
 +
[[Lecture 17 - Nearest Neighbors Clarification Rule and Metrics_OldKiwi|17]]|
 +
[[Lecture 18 - Nearest Neighbors Clarification Rule and Metrics(Continued)_OldKiwi|18]]|
 +
[[Lecture 19 - Nearest Neighbor Error Rates_OldKiwi|19]]|
 +
[[Lecture 20 - Density Estimation using Series Expansion and Decision Trees_OldKiwi|20]]|
 +
[[Lecture 21 - Decision Trees(Continued)_OldKiwi|21]]|
 +
[[Lecture 22 - Decision Trees and Clustering_OldKiwi|22]]|
 +
[[Lecture 23 - Spanning Trees_OldKiwi|23]]|
 +
[[Lecture 24 - Clustering and Hierarchical Clustering_OldKiwi|24]]|
 +
[[Lecture 25 - Clustering Algorithms_OldKiwi|25]]|
 +
[[Lecture 26 - Statistical Clustering Methods_OldKiwi|26]]|
 +
[[Lecture 27 - Clustering by finding valleys of densities_OldKiwi|27]]|
 +
[[Lecture 28 - Final lecture_OldKiwi|28]]
 +
----
 +
----
 
== Derivation of Fischer's Linear Discriminant ==
 
== Derivation of Fischer's Linear Discriminant ==
  
Main article: [[Derivation of Fisher's Linear Discriminant_OldKiwi]]
+
Main article: [[Derivation of Fisher's Linear Discriminant_OldKiwi|Derivation of Fisher's Linear Discriminant]]
  
 
The derivation was completed in this lecture.
 
The derivation was completed in this lecture.
Line 59: Line 100:
 
===Explanation===
 
===Explanation===
  
starts with <math>\vec{\omega} \bullet y_i + \omega_0 > 0</math> for class 1 and <math>\vec{\omega}  \bullet y_i + \omega_0 < 0</math> for class 2
+
starts with <math>\vec{\omega} \cdot y_i + \omega_0 > 0</math> for class 1 and <math>\vec{\omega}  \cdot y_i + \omega_0 < 0</math> for class 2
  
 
the data points are then projected onto an axis at 1 which results in
 
the data points are then projected onto an axis at 1 which results in
  
<math>\vec{\omega} \bullet y_i  > 0</math> for class 1 and <math>\vec{\omega} \bullet y_i  < 0</math> for class 2
+
<math>\vec{\omega} \cdot y_i  > 0</math> for class 1 and <math>\vec{\omega} \cdot y_i  < 0</math> for class 2
  
 
one class is then projected onto an axis at -1 which results in
 
one class is then projected onto an axis at -1 which results in
  
<math>\vec{\omega} \bullet y_i  > 0</math> for all <math>y_i</math>
+
<math>\vec{\omega} \cdot y_i  > 0</math> for all <math>y_i</math>
  
 
== Support Vector Machines (SVM) ==
 
== Support Vector Machines (SVM) ==
Line 84: Line 125:
 
-  Support Vectors - for finding the hyperplane with the biggest margins.
 
-  Support Vectors - for finding the hyperplane with the biggest margins.
 
-  Kernel - to simplify computation (This is key for real world applications)
 
-  Kernel - to simplify computation (This is key for real world applications)
 +
----
 +
Previous: [[Lecture_10_-_Batch_Perceptron_and_Fisher_Linear_Discriminant_OldKiwi|Lecture 10]].
 +
Next: [[Lecture_12_-_Support_Vector_Machine_and_Quadratic_Optimization_Problem_OldKiwi|Lecture 12]]
  
 
+
[[ECE662:BoutinSpring08_OldKiwi|Back to ECE662 Spring 2008 Prof. Boutin]]
[http://balthier.ecn.purdue.edu/index.php/ECE662#Class_Lecture_Notes Class Lecture Notes]
+
[[Category:ECE662]]
 
+
[[Category:decision theory]]
== Lectures ==
+
[[Category:lecture notes]]
[http://balthier.ecn.purdue.edu/index.php/Lecture_1_-_Introduction 1] [http://balthier.ecn.purdue.edu/index.php/Lecture_2_-_Decision_Hypersurfaces 2] [http://balthier.ecn.purdue.edu/index.php/Lecture_3_-_Bayes_classification 3]
+
[[Category:pattern recognition]]
[http://balthier.ecn.purdue.edu/index.php/Lecture_4_-_Bayes_Classification 4] [http://balthier.ecn.purdue.edu/index.php/Lecture_5_-_Discriminant_Functions 5] [http://balthier.ecn.purdue.edu/index.php/Lecture_6_-_Discriminant_Functions 6] [http://balthier.ecn.purdue.edu/index.php/Lecture_7_-_MLE_and_BPE 7] [http://balthier.ecn.purdue.edu/index.php/Lecture_8_-_MLE%2C_BPE_and_Linear_Discriminant_Functions 8] [http://balthier.ecn.purdue.edu/index.php/Lecture_9_-_Linear_Discriminant_Functions 9] [http://balthier.ecn.purdue.edu/index.php/Lecture_10_-_Batch_Perceptron_and_Fisher_Linear_Discriminant 10] [http://balthier.ecn.purdue.edu/index.php/Lecture_11_-_Fischer%27s_Linear_Discriminant_again 11] [http://balthier.ecn.purdue.edu/index.php/Lecture_12_-_Support_Vector_Machine_and_Quadratic_Optimization_Problem 12] [http://balthier.ecn.purdue.edu/index.php/Lecture_13_-_Kernel_function_for_SVMs_and_ANNs_introduction 13] [http://balthier.ecn.purdue.edu/index.php/Lecture_14_-_ANNs%2C_Non-parametric_Density_Estimation_%28Parzen_Window%29 14] [http://balthier.ecn.purdue.edu/index.php/Lecture_15_-_Parzen_Window_Method 15] [http://balthier.ecn.purdue.edu/index.php/Lecture_16_-_Parzen_Window_Method_and_K-nearest_Neighbor_Density_Estimate 16] [http://balthier.ecn.purdue.edu/index.php/Lecture_17_-_Nearest_Neighbors_Clarification_Rule_and_Metrics 17] [http://balthier.ecn.purdue.edu/index.php/Lecture_18_-_Nearest_Neighbors_Clarification_Rule_and_Metrics%28Continued%29 18]
+
[[Category:slecture]]
[http://balthier.ecn.purdue.edu/index.php/Lecture_19_-_Nearest_Neighbor_Error_Rates 19]
+
[http://balthier.ecn.purdue.edu/index.php/Lecture_20_-_Density_Estimation_using_Series_Expansion_and_Decision_Trees 20]
+

Latest revision as of 11:18, 10 June 2013

ECE662: Statistical Pattern Recognition and Decision Making Processes

Spring 2008, Prof. Boutin

Slecture

Collectively created by the students in the class


Lecture 11 Lecture notes

Jump to: Outline| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28



Derivation of Fischer's Linear Discriminant

Main article: Derivation of Fisher's Linear Discriminant

The derivation was completed in this lecture.

Recall from last lecture

Last time, we considered

$ J(\vec{w}) = \frac{\vec{w}^t S_B \vec{w}}{\vec{w}^t S_W \vec{w}} $


which is explicit function of $ \vec{w} $

One can do this because numerator of $ J(\vec{w}) $ can be written as

$ \mid \tilde m_1 - \tilde m_2 \mid^2 = \mid w \cdot (m_1 - m_2) \mid^2 = w^t (m_1 - m_2) (m_1^t - m_2^t) w $

$ \rightarrow S_B = (m_1 - m_2) (m_1^t - m_2^t) $


In a same way, denominator can be written as

$ \tilde s_1^2 + \tilde s_2^2 = \sum_{y_i \in class \ i} (w \cdot y_i - \tilde m_1)^2 = \sum w^t (y_i - m_i)(y_i^t - m_i^t) w $

$ = w^t \left[ \sum (y_i - m_i)(y_i^t - m_i^t) \right] w $

$ \rightarrow S_W = \sum_{y_i \in class \ i} (y_i - m_i)(y_i^t - m_i^t) $


Fisher Linear Discriminant

It is a known result that J is maximum at $ \omega_0 $ such that $ S_B\omega_0=\lambda S_W\omega_0 $. This is the "Generalized eigenvalue problem.

Note that if $ |S_W|\neq 0 $, then $ {S_W}^{-1}S_B\omega_0=\lambda\omega_0 $. It can be written as the "Standard eigenvalue problem". The only difficulty (which is a big difficulty when the feature space dimension is large) is that matrix inversion is very unstable.


Observe that $ S_B\omega_0=(\vec{m_1}-\vec{m_2})(\vec{m_1}-\vec{m_2})^T\omega_0=cst.(\vec{m_1}-\vec{m_2}) $. Therefore the standard eigenvalue problem as presented above becomes $ {S_W}^{-1}cst.(\vec{m_1}-\vec{m_2})=\lambda\omega_0 $. From this equation, value of $ \omega_0 $ can easily be obtained, as $ \omega_0={S_W}^{-1}(\vec{m_1}-\vec{m_2}) $ or any constant multiple of this. Note that magnitude of $ \omega_0 $ is not important, the direction it represents is important.

Fischer's Linear Discriminant in Projected Coordinates

Claim

$ \vec{c}=\omega_0={S_W}^{-1}(\vec{m_1}-\vec{m_2}) $

is the solution to $ \mathbf{Y}\vec{c}=\vec{b} $ with $ \vec{b}=(d/d_1, \cdots, <d_1 times>, d/(d-d_1), \cdots, <(d-d_1) times>)^T $


Here is an animation of the 1D example given in class on projections

Lecture11-1 OldKiwi.gif

Explanation

starts with $ \vec{\omega} \cdot y_i + \omega_0 > 0 $ for class 1 and $ \vec{\omega} \cdot y_i + \omega_0 < 0 $ for class 2

the data points are then projected onto an axis at 1 which results in

$ \vec{\omega} \cdot y_i > 0 $ for class 1 and $ \vec{\omega} \cdot y_i < 0 $ for class 2

one class is then projected onto an axis at -1 which results in

$ \vec{\omega} \cdot y_i > 0 $ for all $ y_i $

Support Vector Machines (SVM)

A support vector for a hyperplane $ \vec{c} $ with margin $ b_i \geq b $ is a sample $ y_{io} $ such that $ c\cdot{y_{io}} = b $.

Lec11 sv pic1 OldKiwi.jpg

Lec11 sv pic2 OldKiwi.jpg

Support Vector Machines are a two step process:

1) Preprocessing - X1,...,Xd features in kth dimensional real space is mapped to features in n dimensional real space where n>>k.

2) Linear Classifier - separates classes in n dimensional real space via hyperplane. - Support Vectors - for finding the hyperplane with the biggest margins. - Kernel - to simplify computation (This is key for real world applications)


Previous: Lecture 10. Next: Lecture 12

Back to ECE662 Spring 2008 Prof. Boutin

Alumni Liaison

Basic linear algebra uncovers and clarifies very important geometry and algebra.

Dr. Paul Garrett