ECE662 Spring 2010 Makeup Lecture Notes #1

Friday, April 9th 2010
For anyone that was unable to attend. Feel free to correct any mistakes!

(...Continuing the "Gradient Descent Procedure" discussion from last time)

Batch Method Similar idea, but uses all samples to update our $\vec{c}_{k}$ value.

Theorem: if samples are linearly separable, then the batch perceptron algorithm...
(using all samples)
(w/ constant step size $\eta(k) = const, \forall k$ )

...will terminate after a finite number of steps. (note that we don't know how many.)
(Proof: see DHS text)

"Online Method"
Uses one sample at a time
Consider the sequence: $\{\vec{y}_{j}\}_{j=1}^{\infty} = \{\vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ... , \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ... \}$ where $$ d=|D| $$
begin with our guess, $\vec{c}_{k}$ .

Let $\vec{z}_{1}$ be the first mis-classified sample in the sequence above: Say $\vec{z}_{1} = \vec{y}_{j1}$

Now, update our solution:
$\vec{c}_{2} = \vec{c}_{1} + const\cdot\vec{z}_{1}$

Again: Let $\vec{z}_{2}$ be the next mis-classified sample in the sequence above: Say $\vec{z}_{2} = \vec{y}_{j2}, j2 > j1$
Update the solution:
$\vec{c}_{3} = \vec{c}_{2} + const\cdot\vec{z}_{2}$

Or more generally, $\vec{c}_{k+1} = \vec{c}_{k} + const\cdot\vec{z}_{k}$

We hope that this gets better with time. Note that the time to compute our next step is lower, but more steps overall may be required for convergence (vs gradient descent.)

Theorem: if samples are linearly separable, then the online perceptron will converge after a finite number of steps. (Proof: see DHS text)

Does this mean it will find a good separation? No guarantee... but it will converge. Note that this algo. is only occasionally used in practice. (If you are interested in this from an algorithmic standpoint, examine the proofs in the text.)

Note: We could minimize other cost functions, e.g.
$J(\vec{c}) = \sum_{\vec{y},misclassified} (\vec{c}\cdot\vec{y}_{i})^{2}$
This one is very flat near the $\vec{c}=0$ boundary though. We can revise it:
$J(\vec{c}) = \sum_{\vec{y},misclassified} (\vec{c}\cdot\vec{y}_{i} - \vec{b})^{2}$
But this one tends to drag the decision boundary toward outliers (i.e. gives too much weight to outliers). So how about:
$J(\vec{c}) = \sum_{\vec{y},misclassified}\frac{ (\vec{c}\cdot\vec{y}_{i} - \vec{b})^{2}}{|\vec{y}_{i}|^{2}}$
etc. Sometimes a pre-processor can be used to help eliminate outliers as well.

A better method:
Now we will formulate our problem as a Least Squares Procedure.
We want: $\vec{c}$ such that: $\vec{c}\cdot\vec{y}_{i}>0, \forall\vec{y}_{i}\in D$
--> "Inequalities"

So, choose numbers b_{i}>0 (see later which ones to choose)
Consider: $\vec{c}\cdot\vec{y}_{i}=b_{i}, \foralli=1..d$
--> "Equalities"

Usually, very many equations (samples, d) and very few unknowns (#dimnensions=n+1).

if d=n+1 (#samples = number of features, plus transformation)
This is a "square problem," and can be solved by matrix inversion (if $determinant(\vec{Y})\ne0$ )
Matrix equation: $\vec{Y}\vec{c}=\vec{b}$

if d>>n+1 (many more samples and features)
This is an "over-constraint system"
Typically, there is no solution.
We seek $\vec{c}$ that is "close" to a solution by minimizing the L2 norm:
$\|\vec{Y}\vec{c}-\vec{b}\|_{L2}$

which has solution:
$\vec{c} = (Y^{T}Y)^{-1}Y^{T}\vec{b}$

Back to 2010 Spring ECE 662 mboutin

ECE662Sp10 MakeupLectureNotes01 - Rhea

ECE662 Spring 2010 Makeup Lecture Notes #1

Alumni Liaison