[[Category:ECE662]]
[[Category:decision theory]]
[[Category:lecture notes]]
[[Category:pattern recognition]]
[[Category:slecture]]

<center><font size= 4>
'''[[ECE662]]: Statistical Pattern Recognition and Decision Making Processes'''
</font size>

Spring 2008, [[user:mboutin|Prof. Boutin]]

[[Slectures|Slecture]]

<font size= 3> Collectively created by the students in [[ECE662:BoutinSpring08_OldKiwi|the class]]</font size>
</center>

----
=Lecture 9 Lecture notes=
Jump to: [[ECE662_Pattern_Recognition_Decision_Making_Processes_Spring2008_sLecture_collective|Outline]]|
[[Lecture 1 - Introduction_OldKiwi|1]]|
[[Lecture 2 - Decision Hypersurfaces_OldKiwi|2]]|
[[Lecture 3 - Bayes classification_OldKiwi|3]]|
[[Lecture 4 - Bayes Classification_OldKiwi|4]]|
[[Lecture 5 - Discriminant Functions_OldKiwi|5]]|
[[Lecture 6 - Discriminant Functions_OldKiwi|6]]|
[[Lecture 7 - MLE and BPE_OldKiwi|7]]|
[[Lecture 8 - MLE, BPE and Linear Discriminant Functions_OldKiwi|8]]|
[[Lecture 9 - Linear Discriminant Functions_OldKiwi|9]]|
[[Lecture 10 - Batch Perceptron and Fisher Linear Discriminant_OldKiwi|10]]|
[[Lecture 11 - Fischer's Linear Discriminant again_OldKiwi|11]]|
[[Lecture 12 - Support Vector Machine and Quadratic Optimization Problem_OldKiwi|12]]|
[[Lecture 13 - Kernel function for SVMs and ANNs introduction_OldKiwi|13]]|
[[Lecture 14 - ANNs, Non-parametric Density Estimation (Parzen Window)_OldKiwi|14]]|
[[Lecture 15 - Parzen Window Method_OldKiwi|15]]|
[[Lecture 16 - Parzen Window Method and K-nearest Neighbor Density Estimate_OldKiwi|16]]|
[[Lecture 17 - Nearest Neighbors Clarification Rule and Metrics_OldKiwi|17]]|
[[Lecture 18 - Nearest Neighbors Clarification Rule and Metrics(Continued)_OldKiwi|18]]|
[[Lecture 19 - Nearest Neighbor Error Rates_OldKiwi|19]]|
[[Lecture 20 - Density Estimation using Series Expansion and Decision Trees_OldKiwi|20]]|
[[Lecture 21 - Decision Trees(Continued)_OldKiwi|21]]|
[[Lecture 22 - Decision Trees and Clustering_OldKiwi|22]]|
[[Lecture 23 - Spanning Trees_OldKiwi|23]]|
[[Lecture 24 - Clustering and Hierarchical Clustering_OldKiwi|24]]|
[[Lecture 25 - Clustering Algorithms_OldKiwi|25]]|
[[Lecture 26 - Statistical Clustering Methods_OldKiwi|26]]|
[[Lecture 27 - Clustering by finding valleys of densities_OldKiwi|27]]|
[[Lecture 28 - Final lecture_OldKiwi|28]]
----
----
'''Parametric Methods'''

Two applications:
* Parametric Density Estimation: Using sample data, we estimate the probabilities <math>P(\omega_i)</math>, <math>p(\vec{X}|\omega_i) \forall i</math>, etc., using estimation methods such as [[MLE_OldKiwi|MLE]] and [[BPE_OldKiwi|BPE]]. Then, from these estimated probabilities (not the true probabilities, which are unknown; only their parametric form is known), we can use the Bayes classification rule to build a classifier, as sketched in the example after this list.
* [[Parametric Classifiers_OldKiwi|Parametric Classifiers]]: We find parametric decision boundaries to approximate the true decision boundaries between classes. This is a very different approach from approximating the probabilities with their estimates, as in the previous method.
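Below is a minimal sketch of approach (a), in Python, assuming Gaussian class-conditional densities and made-up data; the function names and numbers are purely illustrative. It fits each class by MLE and then classifies with the Bayes rule applied to the estimated quantities.
<pre>
# Sketch of application (a): estimate class densities by MLE (under a Gaussian
# assumption), then classify with the Bayes rule using the *estimated* quantities.
import numpy as np

def fit_gaussian_mle(samples):
    """MLE of mean and covariance for one class (Gaussian assumption)."""
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False, bias=True)  # MLE normalizes by N
    return mu, sigma

def log_gaussian(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    return -0.5 * (diff @ np.linalg.inv(sigma) @ diff
                   + np.log(np.linalg.det(sigma))
                   + d * np.log(2 * np.pi))

def bayes_classify(x, params, priors):
    """Pick the class maximizing the estimated P(w_i) p(x|w_i)."""
    scores = [np.log(priors[i]) + log_gaussian(x, *params[i])
              for i in range(len(params))]
    return int(np.argmax(scores))

# Toy two-class data in R^2
rng = np.random.default_rng(0)
class1 = rng.normal([0, 0], 1.0, size=(100, 2))
class2 = rng.normal([3, 3], 1.0, size=(100, 2))
params = [fit_gaussian_mle(class1), fit_gaussian_mle(class2)]
priors = [0.5, 0.5]
print(bayes_classify(np.array([2.5, 2.0]), params, priors))  # 1, i.e. class w2
</pre>
Note that the resulting classifier is only as good as the assumed parametric form and the estimates plugged into it.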
  
  
 
Example:

[[Image:Lecture9_parametric_decion_boundary_OldKiwi.JPG]]

True decision boundary: <math> \{\vec{X}|P(\omega_1|\vec{X})-P(\omega_2|\vec{X})=0\}</math> can be approximated by <math>\{\vec{X}|g(\vec{X})=0\}</math>. Note that 'g' is not an estimate of the difference of probabilities that defines the true decision boundary; it is just an approximate parametric form for the true decision boundary. The choice of 'g' depends on whether it is analytic, and hence easy to handle computationally.

E.g. 'g' can be a degree-2 polynomial, or it can be a degree-1 polynomial (resulting in a linear classifier).
  
If 'g' is linear, it can be written as <math>g(\vec{X})=\vec{V}\cdot \vec{X}+V_0=(\vec{V},V_0)\cdot (\vec{X},1)</math> or, in terms of the extended vector <math>(1,\vec{X})</math>, as <math>g(1,\vec{X})=\vec{c}\cdot(1,\vec{X})</math> with <math>\vec{c}=(V_0,\vec{V})</math>.

'''Extension of Feature Space:'''

This trick justifies the importance of studying linear classifiers even though they do not often occur directly in practice: many non-linear classifiers can be seen as linear classifiers in an extended feature space.
  
Example 1: Consider g to be a polynomial parametric form of a classifier in <math>\mathbb{R}^2</math>:

<math> g(x,y)=c_0+c_1x+c_2y+c_3xy+c_4{x}^2+c_5{y}^2 </math>

Here g is not a linear classifier in <math>\mathbb{R}^2</math>. Let us define <math>\tilde{g}:\mathbb{R}^5 \to \mathbb{R}</math>,

<math> \tilde{g}(u_1,u_2,u_3,u_4,u_5)=c_0+c_1u_1+c_2u_2+c_3u_3+c_4u_4+c_5u_5 </math>

where <math> u_1=x; u_2=y; u_3=xy; u_4={x}^2; u_5={y}^2 </math>. Here we have defined a parametric form of the classifier in the extended feature space. This form is linear. Hence "the decision boundary defined by 'g' is linear in the extended feature space."
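A small sketch of this change of variables in Python, with arbitrarily chosen coefficient values: the quadratic g in <math>(x,y)</math> and the affine <math>\tilde{g}</math> in the extended features agree.
<pre>
# Map (x, y) in R^2 to the extended feature vector (u1,...,u5) = (x, y, xy, x^2, y^2),
# so that the degree-2 classifier g becomes an affine function of the extended features.
import numpy as np

c0 = 1.0
c = np.array([0.5, -2.0, 0.0, 1.0, 1.0])   # (c1, ..., c5), made-up values

def extend(x, y):
    """(x, y) -> (x, y, xy, x^2, y^2)."""
    return np.array([x, y, x * y, x**2, y**2])

def g(x, y):                                # quadratic in (x, y)
    return c0 + c[0]*x + c[1]*y + c[2]*x*y + c[3]*x**2 + c[4]*y**2

def g_tilde(u):                             # affine (linear plus constant) in u
    return c0 + c @ u

x, y = 0.7, -1.3
print(np.isclose(g(x, y), g_tilde(extend(x, y))))  # True
</pre>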
Example 2: Consider another polynomial,

<math> g(x,y)={x}^2+{y}^2-2{y} = {x}^2+{(y-1)}^2 -1 </math>
  
This is a circle centred at (0,1):

[[Image:Circle_OldKiwi.jpg]]

This non-linear function can be transformed into a linear function in the extended feature space, as shown below:

<math> \{  (x, y, x^2, y^2) \mid (0, -2, 1, 1) \cdot (x, y, x^2, y^2)  = 0 \} </math>

[[Image:lec9_embedd_OldKiwi.jpg]]
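A quick numerical check of this in Python (sample points chosen for illustration): points on the circle satisfy the linear equation in the extended feature space.
<pre>
# Points on the circle x^2 + (y-1)^2 = 1 satisfy (0, -2, 1, 1) . (x, y, x^2, y^2) = 0.
import numpy as np

t = np.linspace(0, 2*np.pi, 8, endpoint=False)
x, y = np.cos(t), 1 + np.sin(t)                     # points on the circle
features = np.column_stack([x, y, x**2, y**2])      # extended feature vectors
print(np.allclose(features @ np.array([0, -2, 1, 1]), 0))  # True
</pre>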
  
 
Example 3: 1D example

[[Image:Jinha_1D_Example01_OldKiwi.jpg]]

As we can see from the figure above, the decision boundary is not linear (the decision region is not connected). From this, we can construct an extended feature vector space:

<math> x \rightarrow (x, x^2) \in \mathbb{R}^2 </math>

[[Image:Jinha_1D_Example02_OldKiwi.jpg]]

The red line (which is a hyperplane in this case) can separate these two classes.

This can be expressed in the mathematical form below:

<math>  p(x \mid w_1) \rightarrow p(x, x^2 \mid w_1) </math>
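A small sketch of this in Python, with made-up 1D data in which class <math>w_1</math> occupies two disjoint intervals: after the embedding <math>x \rightarrow (x, x^2)</math>, a single hyperplane separates the classes.
<pre>
# A disconnected 1D decision region becomes linearly separable after x -> (x, x^2).
import numpy as np

rng = np.random.default_rng(1)
x1 = np.concatenate([rng.uniform(-3, -1.2, 50), rng.uniform(1.2, 3, 50)])  # class w1
x2 = rng.uniform(-0.8, 0.8, 100)                                           # class w2

y1 = np.column_stack([x1, x1**2])   # extended feature vectors (x, x^2)
y2 = np.column_stack([x2, x2**2])

# In the extended space the line x^2 = 1, i.e. (0, 1).(x, x^2) = 1, separates the classes.
c, b = np.array([0.0, 1.0]), 1.0
print(np.all(y1 @ c > b), np.all(y2 @ c < b))  # True True
</pre>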
  
'''Taylor Series:''' If the true <math>g</math> is analytic,

e.g. <math> g(\vec{x})=\sum_{i=1}^{n}{\lambda}_{i}{e}^{{c}_{i}\vec{x}} </math>,

then <math> g(\vec{x}) </math> equals its Taylor series, an infinite polynomial in the powers of <math> \vec{x} </math>,

and we can approximate it with a Taylor polynomial of degree n (i.e., as n gets higher, the approximation gets better).

[[Image:poly2_OldKiwi.jpg]]
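A tiny 1D illustration of this remark in Python, using <math>e^x</math> simply as a convenient analytic function: truncating the Taylor series at a higher degree n gives a better polynomial approximation.
<pre>
# Truncated Taylor series of an analytic function: higher degree, smaller error.
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of e^x about 0."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 1.5
for n in (1, 3, 5, 9):
    print(n, abs(math.exp(x) - taylor_exp(x, n)))  # the error shrinks as n grows
</pre>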
  
 
However, there are issues with this approach: it gets complex as the dimension of the feature vector increases.

If <math> \vec{x} = (x_1, x_2, \cdots, x_d) </math>,

a degree 2 polynomial in <math> \vec{x} = (x_1, x_2, \cdots, x_d) </math> has <math> \frac{1}{2} (d+1) (d+2) \approx d^2 </math> monomials, and

a degree n polynomial in <math> \vec{x} = (x_1, x_2, \cdots, x_d) </math> has <math> \approx d^n </math> monomials.
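A quick check of this growth in Python: the number of monomials of degree at most n in d variables is <math>\binom{n+d}{d}</math>, which equals <math>\frac{1}{2}(d+1)(d+2)</math> for n=2 and grows on the order of <math>d^n</math> for fixed n.
<pre>
# How fast the extended feature space grows: C(n+d, d) monomials of degree <= n.
from math import comb

for d in (2, 10, 100):
    for n in (2, 3, 5):
        print(f"d={d:3d}, degree {n}: {comb(n + d, d):,} monomials")
# For degree 2 this is (d+1)(d+2)/2, matching the formula above.
</pre>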
  
'''How to find a decision hyperplane given training data'''
  
 
In the best of all cases, the data is "linearly separable", meaning there exists a vector c such that for all training samples y:

c * y > 0 if y belongs to class w1 (where * is the dot product), and

c * y < 0 if y belongs to class w2.
  
'''Trick:''' replace all y's in the training data belonging to class w2 by -y, then look for a vector c such that

c * y > 0 for all y.
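A minimal numeric sketch of this sign-flip trick in Python, with made-up 1D samples and the augmentation x → (1,x); the candidate c below was found by inspection.
<pre>
# Augment x -> (1, x), negate the class-w2 samples, and check c . y > 0 for all y.
import numpy as np

x_w1 = np.array([1.5, 2.0, 3.0])    # class w1 samples
x_w2 = np.array([-1.0, -2.5, 0.2])  # class w2 samples

y_w1 = np.column_stack([np.ones_like(x_w1), x_w1])    # (1, x)
y_w2 = -np.column_stack([np.ones_like(x_w2), x_w2])   # sign-flipped: -(1, x)
y_all = np.vstack([y_w1, y_w2])

c = np.array([-1.0, 1.5])            # a candidate weight vector
print(np.all(y_all @ c > 0))         # True: c separates the normalized samples
</pre>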
 
  
 
Example: 1D feature space

x->(1,x)=:y

[[Image:Lecture9_1D_yi_-yi_OldKiwi.JPG]]

The first component is +1 for all of the class 1 training data.

The first component is -1 for all of the class 2 training data (after the sign flip).

[[Image:Lecture9_1D_yi_-yi_2_OldKiwi.JPG]]

Observe: c is not unique.
  
To make <math> \vec{c} </math> unique, introduce a 'margin' b>0 and ask that c be the minimum-length vector such that <math> \vec{c} \cdot y_{i} \geq b </math>, for all i.

'''Geometric Interpretation of margin'''

If <math> \vec{c} \cdot y_{i} = b </math>, then <math> \vec{c} </math> belongs to the hyperplane with normal vector <math> y_{i} </math>, which is at distance <math> \displaystyle \frac{b}  {\displaystyle ||y_{i} ||} </math> from the origin.

<math> \vec{c} \cdot y_{i} \geq b </math> for all i means that <math> \vec{c} </math> is always at least a distance <math> \displaystyle \frac{b}  {\displaystyle ||y_{i} ||} </math> from the origin, for every i.
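A small numeric check of this geometric fact in Python, with made-up numbers: the hyperplane <math>\{\vec{c} \mid \vec{c}\cdot y = b\}</math> has normal vector y and lies at distance <math>b/||y||</math> from the origin.
<pre>
# Distance from the origin to the hyperplane { c : c . y = b } is b / ||y||.
import numpy as np

y, b = np.array([3.0, 4.0]), 2.0
c_closest = b * y / (y @ y)              # point of the hyperplane nearest the origin
print(np.isclose(c_closest @ y, b))      # it lies on the hyperplane: True
print(np.linalg.norm(c_closest), b / np.linalg.norm(y))  # both equal 0.4
</pre>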
  
In general (when the data are not linearly separable), it is impossible to satisfy <math>\vec{c} \cdot y_{i} \geq 0 </math> for all i.

Perhaps we could try to minimize the number of misclassified training samples.

Let <math> J(\vec{c})= \big| \{ y_{i} \mid \vec{c} \cdot y_{i} \leq 0 \} \big| </math>, the number of misclassified samples -> Too hard to minimize.

'''[[Perceptron_OldKiwi|Perceptron]] Criterion function'''

<math> J_{p}(\vec{c})=\displaystyle \sum_{y_i \text{ misclassified}} (- \vec{c} \cdot y_{i} ) </math>

[[Image:lec9_percept_OldKiwi.JPG]]
  
 
<math>J_p</math> measures how 'far away' you are from being right: the distance from <math> y_{i} </math> to the hyperplane defined by <math> \vec{c} </math> is <math> ||\vec{y_i} \cdot \frac{\vec{c}}{||\vec{c}||} ||=\frac{1}{|| \vec{c} ||} \vec{y_i} \cdot \vec{c} </math>.

Since <math> \vec{c} \cdot \vec{y_i} </math> is proportional to the distance from <math> y_{i} </math> to the hyperplane, <math> J_{p}(\vec{c}) </math> is proportional to the sum of the distances of the misclassified samples from the decision hyperplane.
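A rough sketch in Python, with made-up data and step size, of computing <math>J_p</math> on sign-normalized samples and taking a few gradient steps; the batch perceptron itself is the topic of [[Lecture 10 - Batch Perceptron and Fisher Linear Discriminant_OldKiwi|Lecture 10]].
<pre>
# J_p(c) = sum over misclassified y_i of (-c . y_i); its gradient is the sum
# of (-y_i) over misclassified samples, which gradient descent follows.
import numpy as np

def perceptron_criterion(c, Y):
    """J_p(c) over the rows of Y with c . y <= 0, plus those rows."""
    mis = Y[Y @ c <= 0]
    return np.sum(-(mis @ c)), mis

Y = np.array([[1.0, 1.5], [1.0, 2.0], [1.0, -0.5],   # class w1 as (1, x)
              [-1.0, 1.0], [-1.0, 2.5]])             # class w2 as -(1, x)
c = np.array([0.0, 0.1])
for _ in range(20):                        # a few crude fixed-step gradient steps
    Jp, mis = perceptron_criterion(c, Y)
    if len(mis) == 0:                      # all samples on the correct side
        break
    c = c + 0.1 * mis.sum(axis=0)          # c <- c - eta * grad J_p(c)
print(perceptron_criterion(c, Y)[0], c)    # 0.0 and a separating c for this data
</pre>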
----
Previous: [[Lecture_8_-_MLE%2C_BPE_and_Linear_Discriminant_Functions_OldKiwi|Lecture 8]]
Next: [[Lecture_10_-_Batch_Perceptron_and_Fisher_Linear_Discriminant_OldKiwi|Lecture 10]]

[[ECE662:BoutinSpring08_OldKiwi|Back to ECE662 Spring 2008 Prof. Boutin]]