<center><font size="4">
'''A proof of the convergence of the Perceptron'''
</font>
[[ECE662]]: Statistical decision theory and decision making processes
(Supplement to [[Lecture_10_-_Batch_Perceptron_and_Fisher_Linear_Discriminant_OldKiwi|Lecture 10]] of the [[ECE662_Pattern_Recognition_Decision_Making_Processes_Spring2008_sLecture_collective|lecture notes]])
</center>
----
  
The following theorem, due to Novikoff (1962), proves the convergence of a [[perceptron_OldKiwi]] using linearly separable samples. This proof was taken from [http://books.google.com/books?id=el_cDjHdP0cC&pg=PA51&lpg=PA51&dq=novikoff+perceptron+convergence&source=web&ots=2pWu1dQHIM&sig=K5ubpLbO4n0XjrgzhEuOD06Xfmo#PPA261,M1 Learning Kernel Classifiers, Theory and Algorithms by Ralf Herbrich].
  
Consider the following definitions:
A training set <math>z=(x,y)\in\mathbb{Z}^m</math> is a sequence of <math>m</math> labeled examples <math>(x_i,y_i)</math>.
  
"Functional margin" on example <math>(x_i,y_i)\in z</math> is defined as <math>\tilde{\gamma}_i(c) = y_i\langle x_i,c \rangle</math>;
  
"Functional margin" on training set <math>z</math> is defined as <math>\tilde{\gamma}_z(c) = \min_{(x_i,y_i)\in z} \tilde{\gamma}_i(c)</math>;
  
"Geometrical margin" on example <math>(x_i,y_i)\in z</math> is defined as <math>\gamma_i = \frac{\tilde{\gamma}_i(c)}{\|c\|}</math>;

"Geometrical margin" on training set <math>z</math> is defined as <math>\gamma_z = \frac{\tilde{\gamma}_z(c)}{\|c\|}</math>.

Let <math>\psi = \max_{x_i\in x} \|\phi(x_i)\|</math>, where <math>\phi(x_i)</math> is the feature vector for <math>x_i</math>.
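As a concrete illustration, the margins and <math>\psi</math> above can be computed for a small made-up dataset (a sketch assuming <math>\phi</math> is the identity map, so <math>\phi(x_i)=x_i</math>; the data and candidate vector <math>c</math> are purely illustrative):

```python
import numpy as np

# Hypothetical linearly separable data; labels y_i in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

c = np.array([1.0, 1.0])  # a candidate weight vector

func_margins = y * (X @ c)                    # functional margins y_i <x_i, c>
gamma_z_tilde = func_margins.min()            # functional margin of the training set
gamma_z = gamma_z_tilde / np.linalg.norm(c)   # geometrical margin of the training set
psi = np.linalg.norm(X, axis=1).max()         # psi = max_i ||phi(x_i)|| = max_i ||x_i||
```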

Now let <math>c_k</math> be a final solution vector after <math>k</math> steps. Then the last update of the Perceptron algorithm has

<math>c_k = c_{k-1} + y_ix_i</math>.
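This update rule is the whole algorithm; it can be sketched as a complete training loop (a minimal Python sketch, assuming the identity feature map, <math>c_0=0</math> as in the proof, and an illustrative `max_epochs` safeguard):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron on examples X with labels y in {-1, +1}.

    Returns the final weight vector c and the number of updates k.
    """
    c = np.zeros(X.shape[1])  # c_0 = 0, as assumed in the proof
    k = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(c, x_i) <= 0:  # misclassified (ties count as mistakes)
                c = c + y_i * x_i          # the update c_k = c_{k-1} + y_i x_i
                k += 1
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: converged
            break
    return c, k
```

On linearly separable data the outer loop exits once an entire pass makes no mistakes; the returned <math>k</math> is exactly the number of updates bounded by the theorem below.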

For any unit-length vector <math>c^*</math> that correctly classifies every training example, so that <math>y_i \langle c^*,x_i \rangle \geq \gamma_z(c^*)</math> for all <math>i</math>, we have

<math>\langle c^*,c_k \rangle = \langle c^*,c_{k-1} \rangle + y_i \langle c^*,x_i \rangle \geq \langle c^*,c_{k-1}\rangle + \gamma_z(c^*) \geq \ldots \geq k\gamma_z(c^*)</math>,

where the inequalities follow from repeated application down to step 0, where we assume <math>c_0=0</math>. Similarly, since the algorithm updates only on misclassified examples (so <math>y_i\langle c_{k-1},x_i \rangle \leq 0</math>) and <math>\|x_i\| \leq \psi</math>,

<math>\|c_k\|^2 = \|c_{k-1}\|^2 + 2y_i\langle c_{k-1},x_i \rangle + \|x_i\|^2 \leq \|c_{k-1}\|^2 + \psi^2 \leq \ldots \leq k\psi^2.</math>
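Both chains of inequalities can be checked numerically along a run of the algorithm (a hypothetical toy example with the identity feature map; <math>c^*</math> is a made-up unit-length separating vector):

```python
import numpy as np

# Hypothetical separable data with the identity feature map.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

c_star = np.array([1.0, 1.0]) / np.sqrt(2)  # a made-up unit-length separating vector
gamma_z = (y * (X @ c_star)).min()          # its geometrical margin (||c_star|| = 1)
psi = np.linalg.norm(X, axis=1).max()       # psi = max_i ||x_i||

c = np.zeros(2)  # c_0 = 0
k = 0
for _ in range(100):
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(c, x_i) <= 0:
            c = c + y_i * x_i
            k += 1
            mistakes += 1
            # the lower bound <c*, c_k> >= k * gamma_z and the upper
            # bound ||c_k||^2 <= k * psi^2 hold after every update
            assert np.dot(c_star, c) >= k * gamma_z - 1e-12
            assert np.dot(c, c) <= k * psi ** 2 + 1e-12
    if mistakes == 0:
        break
```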

Then by the Cauchy-Schwarz inequality, we have
 
<math>k\gamma_z(c^*) \leq \langle c^*,c_k \rangle \leq \|c^*\| \cdot \|c_k\| \leq \sqrt{k} \psi</math>.
 
It follows, then, that the number of required iterations of the Perceptron algorithm has a finite upper bound, i.e.
 
<math>k\leq \left( \frac{\psi}{\gamma_z(c^*)}\right)^2</math>
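On toy data the final bound can be verified empirically (again a hypothetical example; <math>c^*</math> must be unit-length and separate the data for the bound to apply):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Run the Perceptron from c_0 = 0, counting updates k.
c = np.zeros(2)
k = 0
while True:
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(c, x_i) <= 0:
            c = c + y_i * x_i
            k += 1
            mistakes += 1
    if mistakes == 0:
        break

c_star = np.array([1.0, 1.0]) / np.sqrt(2)  # unit-length separating vector
gamma_z = (y * (X @ c_star)).min()          # geometrical margin under c_star
psi = np.linalg.norm(X, axis=1).max()

print(k, (psi / gamma_z) ** 2)  # k never exceeds the bound
```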
 
----
[[Lecture_10_-_Batch_Perceptron_and_Fisher_Linear_Discriminant_OldKiwi|Back to Lecture 10]]

Latest revision as of 06:37, 21 April 2013
