=Lecture 11, [[ECE662]]: Decision Theory=

Lecture notes for [[ECE662:BoutinSpring08_Old_Kiwi|ECE662 Spring 2008]], Prof. [[user:mboutin|Boutin]].

Other lectures: [[Lecture 1 - Introduction_Old Kiwi|1]],
[[Lecture 2 - Decision Hypersurfaces_Old Kiwi|2]],
[[Lecture 3 - Bayes classification_Old Kiwi|3]],
[[Lecture 4 - Bayes Classification_Old Kiwi|4]],
[[Lecture 5 - Discriminant Functions_Old Kiwi|5]],
[[Lecture 6 - Discriminant Functions_Old Kiwi|6]],
[[Lecture 7 - MLE and BPE_Old Kiwi|7]],
[[Lecture 8 - MLE, BPE and Linear Discriminant Functions_Old Kiwi|8]],
[[Lecture 9 - Linear Discriminant Functions_Old Kiwi|9]],
[[Lecture 10 - Batch Perceptron and Fisher Linear Discriminant_Old Kiwi|10]],
[[Lecture 11 - Fischer's Linear Discriminant again_Old Kiwi|11]],
[[Lecture 12 - Support Vector Machine and Quadratic Optimization Problem_Old Kiwi|12]],
[[Lecture 13 - Kernel function for SVMs and ANNs introduction_Old Kiwi|13]],
[[Lecture 14 - ANNs, Non-parametric Density Estimation (Parzen Window)_Old Kiwi|14]],
[[Lecture 15 - Parzen Window Method_Old Kiwi|15]],
[[Lecture 16 - Parzen Window Method and K-nearest Neighbor Density Estimate_Old Kiwi|16]],
[[Lecture 17 - Nearest Neighbors Clarification Rule and Metrics_Old Kiwi|17]],
[[Lecture 18 - Nearest Neighbors Clarification Rule and Metrics(Continued)_Old Kiwi|18]],
[[Lecture 19 - Nearest Neighbor Error Rates_Old Kiwi|19]],
[[Lecture 20 - Density Estimation using Series Expansion and Decision Trees_Old Kiwi|20]],
[[Lecture 21 - Decision Trees(Continued)_Old Kiwi|21]],
[[Lecture 22 - Decision Trees and Clustering_Old Kiwi|22]],
[[Lecture 23 - Spanning Trees_Old Kiwi|23]],
[[Lecture 24 - Clustering and Hierarchical Clustering_Old Kiwi|24]],
[[Lecture 25 - Clustering Algorithms_Old Kiwi|25]],
[[Lecture 26 - Statistical Clustering Methods_Old Kiwi|26]],
[[Lecture 27 - Clustering by finding valleys of densities_Old Kiwi|27]],
[[Lecture 28 - Final lecture_Old Kiwi|28]]
----
----

== Derivation of Fisher's Linear Discriminant ==

Main article: [[Derivation of Fisher's Linear Discriminant_Old Kiwi]]
  
 
The derivation was completed in this lecture.
  
== Recall from last lecture ==
  
 
Last time, we considered

<math>J(\vec{w}) = \frac{\vec{w}^t S_B \vec{w}}{\vec{w}^t S_W \vec{w}}</math>

which is an explicit function of <math>\vec{w}</math>.

One can do this because the numerator of <math>J(\vec{w})</math> can be written as
 
<math>\mid \tilde m_1 - \tilde m_2 \mid^2 = \mid w \cdot (m_1 - m_2) \mid^2 = w^t (m_1 - m_2) (m_1^t - m_2^t) w</math>

<math>\rightarrow S_B = (m_1 - m_2) (m_1^t - m_2^t)</math>
In the same way, the denominator can be written as

<math>\tilde s_1^2 + \tilde s_2^2 = \sum_{y_i \in class \ i} (w \cdot y_i - \tilde m_i)^2 = \sum w^t (y_i - m_i)(y_i^t - m_i^t) w</math>

<math> = w^t \left[ \sum (y_i - m_i)(y_i^t - m_i^t) \right] w</math>

<math>\rightarrow S_W = \sum_{y_i \in class \ i} (y_i - m_i)(y_i^t - m_i^t)</math>
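As a quick numeric illustration of these definitions, here is a minimal sketch (with invented toy data, not from the lecture) that builds <math>S_B</math> and <math>S_W</math> and evaluates the criterion <math>J(\vec{w})</math>:

```python
import numpy as np

# Toy two-class data in 2D (invented for illustration, not from the lecture)
rng = np.random.default_rng(0)
y1 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))   # class 1 samples
y2 = rng.normal([3.0, 1.0], 0.5, size=(60, 2))   # class 2 samples

m1, m2 = y1.mean(axis=0), y2.mean(axis=0)        # class means

# Between-class scatter: S_B = (m1 - m2)(m1^t - m2^t)
diff = (m1 - m2).reshape(-1, 1)
S_B = diff @ diff.T

# Within-class scatter: S_W = sum over both classes of (y_i - m_i)(y_i^t - m_i^t)
S_W = (y1 - m1).T @ (y1 - m1) + (y2 - m2).T @ (y2 - m2)

def J(w):
    """Fisher criterion J(w) = (w^t S_B w) / (w^t S_W w)."""
    w = np.asarray(w, dtype=float)
    return float(w @ S_B @ w) / float(w @ S_W @ w)

d = m1 - m2
print(J(d))             # direction along the mean difference: large
print(J([-d[1], d[0]])) # an orthogonal direction: J is zero here, since S_B is rank one
```

Note that J is invariant to rescaling of <math>\vec{w}</math>, which is why only the direction of the maximizer matters.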
  
  

== Fisher Linear Discriminant ==
  
It is a known result that J is maximized at <math>\omega_0</math> such that <math>S_B\omega_0=\lambda S_W\omega_0</math>. This is the "generalized eigenvalue problem".

Note that if <math>|S_W|\neq 0</math>, then <math>{S_W}^{-1}S_B\omega_0=\lambda\omega_0</math>, which can be written as the "standard eigenvalue problem". The only difficulty (and it is a big one when the feature space dimension is large) is that matrix inversion is very unstable.

Observe that <math>S_B\omega_0=(\vec{m_1}-\vec{m_2})(\vec{m_1}-\vec{m_2})^T\omega_0=cst.(\vec{m_1}-\vec{m_2})</math>, where cst. denotes a scalar constant. Therefore the standard eigenvalue problem above becomes <math>{S_W}^{-1}cst.(\vec{m_1}-\vec{m_2})=\lambda\omega_0</math>. From this equation, the value of <math>\omega_0</math> is easily obtained as <math>\omega_0={S_W}^{-1}(\vec{m_1}-\vec{m_2})</math>, or any constant multiple of it. Note that the magnitude of <math>\omega_0</math> is not important; only the direction it represents matters.
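The closed form can be checked numerically. The sketch below uses invented toy values for <math>S_W</math> and the means, and verifies that <math>\omega_0={S_W}^{-1}(\vec{m_1}-\vec{m_2})</math> satisfies the generalized eigenvalue equation, with <math>\lambda=(\vec{m_1}-\vec{m_2})^T{S_W}^{-1}(\vec{m_1}-\vec{m_2})</math> following from the rank-one form of <math>S_B</math>:

```python
import numpy as np

# Toy values (invented for illustration): a random symmetric positive-definite
# S_W and two class means in 3D.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
S_W = A @ A.T + 3 * np.eye(3)
m1 = np.array([1.0, 2.0, 0.0])
m2 = np.array([-1.0, 0.5, 1.0])

d = m1 - m2
S_B = np.outer(d, d)                  # rank-one between-class scatter

w0 = np.linalg.solve(S_W, d)          # w0 = S_W^{-1}(m1 - m2), no explicit inverse
lam = d @ w0                          # lambda = d^t S_W^{-1} d

# The generalized eigenvalue equation S_B w0 = lambda S_W w0 holds exactly
# (up to floating-point round-off):
print(np.allclose(S_B @ w0, lam * (S_W @ w0)))   # True
```

Using `np.linalg.solve` rather than forming <math>{S_W}^{-1}</math> explicitly sidesteps some of the numerical instability mentioned above.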
  

== Fisher's Linear Discriminant in Projected Coordinates ==

===Claim===

<math>\vec{c}=\omega_0={S_W}^{-1}(\vec{m_1}-\vec{m_2})</math>

is the solution to <math>\mathbf{Y}\vec{c}=\vec{b}</math> with <math>\vec{b}=(d/d_1, \cdots, <d_1 times>, d/(d-d_1), \cdots, <(d-d_1) times>)^T</math>
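The claim can be tested numerically. The sketch below assumes the standard setup for <math>\mathbf{Y}</math> (rows are the augmented samples <math>(1, y_i)</math>, with the class-2 rows negated), which is not spelled out above, and uses invented toy data; it checks that the least-squares solution of <math>\mathbf{Y}\vec{c}=\vec{b}</math> points in the same direction as <math>{S_W}^{-1}(\vec{m_1}-\vec{m_2})</math>:

```python
import numpy as np

# Invented toy data: d_1 = 40 class-1 samples, d - d_1 = 30 class-2 samples.
rng = np.random.default_rng(2)
y1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))
y2 = rng.normal([4.0, 1.0], 1.0, size=(30, 2))
d1, d2 = len(y1), len(y2)
d = d1 + d2

# Y: augmented samples [1, y_i], class-2 rows negated (assumed convention);
# b: d/d_1 repeated d_1 times, then d/(d - d_1) repeated (d - d_1) times.
Y = np.vstack([np.hstack([np.ones((d1, 1)), y1]),
               -np.hstack([np.ones((d2, 1)), y2])])
b = np.concatenate([np.full(d1, d / d1), np.full(d2, d / d2)])

c, *_ = np.linalg.lstsq(Y, b, rcond=None)   # least-squares solution of Yc = b
w_lstsq = c[1:]                             # drop the bias component

m1, m2 = y1.mean(axis=0), y2.mean(axis=0)
S_W = (y1 - m1).T @ (y1 - m1) + (y2 - m2).T @ (y2 - m2)
w_fisher = np.linalg.solve(S_W, m1 - m2)    # S_W^{-1}(m1 - m2)

# The two vectors differ only by a positive scale factor:
cos = w_lstsq @ w_fisher / (np.linalg.norm(w_lstsq) * np.linalg.norm(w_fisher))
print(round(cos, 6))
```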
  
  
 
Here is an animation of the 1D example given in class on projections:

[[Image:lecture11-1_Old Kiwi.gif]]
  
===Explanation===

The animation starts with <math>\vec{\omega} \cdot y_i + \omega_0 > 0</math> for class 1 and <math>\vec{\omega} \cdot y_i + \omega_0 < 0</math> for class 2.
  
 
The data points are then projected onto an axis at 1, which results in
  
<math>\vec{\omega} \cdot y_i  > 0</math> for class 1 and <math>\vec{\omega} \cdot y_i  < 0</math> for class 2.
  
 
One class is then projected onto an axis at -1, which results in
  
<math>\vec{\omega} \cdot y_i  > 0</math> for all <math>y_i</math>.
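This normalization can be sketched in a few lines; the toy separator and samples below are invented for illustration:

```python
import numpy as np

# After negating each class-2 sample (the "projection to -1" step above),
# a correct separator satisfies w . y_i > 0 for ALL samples at once.
w = np.array([1.0, -0.5])                      # invented separator direction
w0 = 0.2                                       # invented bias
class1 = np.array([[1.0, 0.5], [2.0, 1.0]])    # satisfy w.y + w0 > 0
class2 = np.array([[-1.0, 1.0], [-2.0, 0.5]])  # satisfy w.y + w0 < 0

# Augment with a constant 1 so the bias w0 becomes part of the weight vector
def aug(pts):
    return np.hstack([np.ones((len(pts), 1)), pts])

omega = np.concatenate([[w0], w])              # combined vector [w0, w]

y = np.vstack([aug(class1), -aug(class2)])     # negate the class-2 samples
print(np.all(y @ omega > 0))                   # True: one-sided condition
```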
  
== Support Vector Machines (SVM) ==

A support vector for a hyperplane <math>\vec{c}</math> with margin <math>b_i \geq b</math> is a sample <math>y_{io}</math> such that <math>c\cdot{y_{io}} = b</math>.

[[Image:lec11_sv_pic1_Old Kiwi.jpg]]

[[Image:lec11_sv_pic2_Old Kiwi.jpg]]
  
 
Support Vector Machines are a two-step process:

1)  Preprocessing - the features <math>X_1,\ldots,X_d</math> in k-dimensional real space are mapped to features in n-dimensional real space, where n >> k.

2)  Linear Classifier - separates the classes in n-dimensional real space via a hyperplane.

-  Support Vectors - for finding the hyperplane with the biggest margins.

-  Kernel - to simplify computation (this is key for real-world applications)
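Given the definition above, finding the support vectors for a fixed hyperplane direction is straightforward; the sketch below uses invented toy samples and an assumed unit normal <math>\vec{c}</math>:

```python
import numpy as np

# For a fixed direction c, the margins are b_i = c . y_i, and the support
# vectors are the samples attaining the smallest margin b. All values invented.
c = np.array([1.0, 1.0]) / np.sqrt(2)     # unit normal of the hyperplane
samples = np.array([[1.0, 1.0],
                    [2.0, 0.0],
                    [0.5, 1.5],
                    [3.0, 2.0]])

margins = samples @ c                     # b_i = c . y_i for each sample
b = margins.min()                         # the margin of the hyperplane
support = samples[np.isclose(margins, b)] # samples with c . y_io = b

print(b)        # smallest margin, sqrt(2) for these toy points
print(support)  # the three samples that attain it
```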
----
  
[[ECE662:BoutinSpring08_Old_Kiwi|Back to ECE662, Spring 2008, Prof. Boutin]]
[[Category:Lecture Notes]]

Latest revision as of 08:49, 17 January 2013