Revision as of 18:57, 3 May 2014


Support Vector Machine and its Applications in Classification Problems
A slecture by Xing Liu, partially based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.



Outline of the slecture

  • Background in Linear Classifiers
  • Maximum Margin Classifier
  • Support Vector Machine
  • Summary
  • References


Background in Linear Classifiers

    In this section, we will introduce the framework and basic idea of the linear classification problem.

    In a linear classification problem, the feature space can be divided into different regions by hyperplanes. In this lecture, we will use a two-category case to illustrate. Given training samples $ \textbf{y}_1,\textbf{y}_2,...\textbf{y}_n \in \mathbb{R}^p $, each $ \textbf{y}_i $ is a p-dimensional vector and belongs to either class $ w_1 $ or $ w_2 $. The goal is to find a hyperplane that separates the points in the feature space that belong to class $ w_1 $ from those that belong to class $ w_2 $. The discriminant function can be written as

$ g(\textbf{y}) = \textbf{c}\cdot\textbf{y} $

    We want to find $ \textbf{c}\in\mathbb{R}^{p} $ so that a test point $ \textbf{y}_i $ is labelled

$ {w_1} $ if $ \textbf{c}\cdot\textbf{y}_i>0 $
$ {w_2} $ if $ \textbf{c}\cdot\textbf{y}_i<0 $

    We can apply a trick here: replace every $ \textbf{y} $ in class $ w_2 $ by $ -\textbf{y} $; then the above task is equivalent to looking for $ \textbf{c} $ so that

$ \textbf{c}\cdot \textbf{y}>0, \forall \textbf{y} \in $new sample space.


    You might have already observed the ambiguity of $ \textbf{c} $ in the above discussion: if $ \textbf{c} $ separates the data, then $ \lambda\textbf{c} $ also separates the data for any $ \lambda>0 $. One solution is to require $ ||\textbf{c}||=1 $. Another is to introduce a margin $ b $ and require

$ \textbf{c}\cdot\textbf{y}\geqslant b > 0, \forall \textbf{y} $.

    In this scenario, the hyperplane is defined by $ \{\textbf{y}: f(\textbf{y})=\textbf{c}\cdot \textbf{y} - b=0\} $ and it divides the space in two: the sign of the discriminant function $ f(\textbf{y}) = \textbf{c}\cdot \textbf{y} - b $ indicates which side of the hyperplane a test point lies on. Since the decision boundary defined by this hyperplane is linear, the classifier is called a linear classifier. $ \textbf{c} $ is the normal vector of the hyperplane, pointing toward its positive side, and $ \frac{b_i}{||\textbf{c}||} $ is a lower bound on the distance from each point $ \textbf{y}_i $ to the hyperplane $ \textbf{c}\cdot\textbf{y}=0 $.
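A minimal sketch of this decision rule in Python; the vector c, the margin b, and the sample points below are hypothetical, chosen only to illustrate the sign test:

```python
import numpy as np

def classify(c, b, y):
    """Label a point by the side of the hyperplane {y : c.y - b = 0} it lies on.
    Returns +1 for class w1 (c.y - b > 0) and -1 for class w2."""
    return 1 if np.dot(c, y) - b > 0 else -1

# Hypothetical 2-D example: hyperplane y1 + y2 = 1, i.e. c = [1, 1], b = 1.
c = np.array([1.0, 1.0])
b = 1.0
print(classify(c, b, np.array([2.0, 2.0])))   # point on the positive side
print(classify(c, b, np.array([0.0, 0.0])))   # point on the negative side
```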


    The above approach is equivalent to finding a solution for

$ \textbf{Y}\textbf{c} = \begin{bmatrix} b_1\\b_2\\ \vdots \\b_n\end{bmatrix} $

where $ \textbf{Y} =\begin{bmatrix} \textbf{y}_1^T \\ \textbf{y}_2^T \\ \vdots \\ \textbf{y}_n^T \end{bmatrix} $


    When n>p, the system is overdetermined and it is generally impossible to find an exact solution for $ \textbf{c} $. An alternative approach is to find the $ \textbf{c} $ that minimizes a criterion function $ J(\textbf{c}) $. There are various forms of criterion functions. For example, we can try to minimize the error between $ \textbf{Y}\textbf{c} $ and $ \textbf{b} $, so the criterion function can be defined as

$ J(\textbf{c}) = ||\textbf{Y}\textbf{c}-\textbf{b}||^2 $

The solution to the above problem is

$ \textbf{c} = (\textbf{Y}^T\textbf{Y})^{-1}\textbf{Y}^T\textbf{b} $

if $ det(\textbf{Y}^T\textbf{Y})\neq 0 $

otherwise, the solution is defined generally by

$ \lim_{\epsilon \to 0}(\textbf{Y}^T\textbf{Y}+\epsilon\textbf{I})^{-1}\textbf{Y}^T\textbf{b}. $

This solution is the MSE solution to $ \textbf{Y}\textbf{c} = \textbf{b} $ and it always exists.
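The closed-form and pseudo-inverse solutions can be checked numerically; the matrix below is a hypothetical overdetermined system (n = 4 samples, p = 2 dimensions) chosen only for illustration:

```python
import numpy as np

# Hypothetical overdetermined system: Yc = b generally has no exact solution,
# so we take the MSE solution.
Y = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [1.0, 0.5]])
b = np.ones(4)                      # margins b_i, all set to 1 here

# Closed form c = (Y^T Y)^{-1} Y^T b, valid when det(Y^T Y) != 0.
c_closed = np.linalg.inv(Y.T @ Y) @ Y.T @ b

# The pseudo-inverse covers the singular case as well (the epsilon limit above).
c_pinv = np.linalg.pinv(Y) @ b

print(np.allclose(c_closed, c_pinv))   # the two agree when Y^T Y is invertible
```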

Support Vector Machine

Support vector machines are an example of a linear two-class classifier. For a given hyperplane we denote by $ \textbf{y}_1 $ and $ \textbf{y}_2 $ the closest points to the hyperplane among the positive and negative samples, respectively. The distances from these two points to the hyperplane are $ g(\textbf{y}_1)/||\textbf{c}|| $ and $ g(\textbf{y}_2)/||\textbf{c}|| $, and the margin is defined as the region between them. If we normalize $ \textbf{c} $ and $ b $ so that $ g(\textbf{y}_1)=1 $ and $ g(\textbf{y}_2)=-1 $, the margin has width $ 2/||\textbf{c}|| $. In SVM, the hyperplane is chosen so that the margin is maximized, i.e. we want to maximize $ 1/||\textbf{c}|| $, which is equivalent to minimizing $ ||\textbf{c}||^2 $. This leads to the following optimization problem, where $ w_i = \pm 1 $ denotes the class label of $ \textbf{y}_i $:

$ arg \min \limits_{c,b} \quad\quad \frac{1}{2}||c||^2 $
$ subject \quad to: \quad w_i(\textbf{c}\cdot\textbf{y}_i+b) \geqslant 1, \quad i = 1,...,n $

A more general formulation that allows misclassification is

$ arg \min \limits_{c,b} \quad\quad \frac{1}{2}||c||^2 +C\sum_{i=1}^{n} \xi_i $
$ subject \quad to: \quad w_i(\textbf{c}\cdot\textbf{y}_i+b) \geqslant 1-\xi_i, \quad \xi_i\geqslant 0, \quad i = 1,...,n $

where $ \xi_i \geqslant 0 $ are slack variables that allow an example to lie inside the margin ($ 0 \leqslant \xi_i \leqslant 1 $, also called a margin error) or to be misclassified ($ \xi_i > 1 $). The constant $ C > 0 $ sets the relative importance of maximizing the margin and minimizing the amount of slack. Using the method of Lagrange multipliers, we obtain the dual formulation, expressed in terms of variables $ \alpha_i $:

$ arg \max \limits_{\alpha} \quad\quad \sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_iw_j\alpha_i\alpha_j\textbf{y}_i^T\textbf{y}_j $
$ subject \quad to: \sum_{i=1}^nw_i\alpha_i = 0, C\geqslant \alpha_i \geqslant 0 $
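The dual above can be obtained by forming the Lagrangian of the hard-margin problem (the soft-margin case adds only the upper bound $ C \geqslant \alpha_i $); a brief sketch:

```latex
L(\textbf{c}, b, \alpha) = \frac{1}{2}\|\textbf{c}\|^2
    - \sum_{i=1}^{n} \alpha_i \left[ w_i(\textbf{c}\cdot\textbf{y}_i + b) - 1 \right]

% Stationarity in c and b:
\frac{\partial L}{\partial \textbf{c}} = 0
    \;\Rightarrow\; \textbf{c} = \sum_{i=1}^{n} \alpha_i w_i \textbf{y}_i
\qquad
\frac{\partial L}{\partial b} = 0
    \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i w_i = 0
```

Substituting $ \textbf{c} = \sum_i \alpha_i w_i \textbf{y}_i $ back into $ L $ eliminates $ \textbf{c} $ and $ b $ and yields exactly the dual objective above.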

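As a sanity check, the soft-margin problem corresponds (up to parametrization) to a linear-kernel SVM in scikit-learn; the toy dataset below is hypothetical, chosen to be linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data: class +1 in the upper right,
# class -1 in the lower left of the plane.
samples = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
labels = np.array([1, 1, -1, -1])

# C is the slack penalty from the primal objective.
clf = SVC(kernel="linear", C=1.0)
clf.fit(samples, labels)

# Far points on either side of the boundary should get labels +1 and -1.
print(clf.predict([[4.0, 4.0], [-4.0, -4.0]]))
```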

In many applications the data are not linearly separable; in this case the 'kernel trick' is applied to map the data into a high-dimensional feature space. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function.

A kernel function is a mapping

$ k: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} $

that satisfies

$ k(\textbf{y}_i,\textbf{y}_j) = \varphi(\textbf{y}_i)\cdot\varphi(\textbf{y}_j), $

where $ \varphi $ is the transform into the feature space. The weight vector $ \textbf{c} $ also lies in the transformed space, with $ \textbf{c} = \sum_i \alpha_i w_i \varphi(\textbf{y}_i) $. Dot products with $ \textbf{c} $ for classification can again be computed by the kernel trick, i.e. $ \textbf{c}\cdot\varphi(\textbf{y}) = \sum_i \alpha_i w_i k(\textbf{y}_i,\textbf{y}) $. However, there does not in general exist a vector $ \textbf{c}' $ such that $ \textbf{c}\cdot\varphi(\textbf{y}) = k(\textbf{c}',\textbf{y}) $.
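A small numerical check of this definition, using the homogeneous quadratic kernel $ k(\textbf{y}_i,\textbf{y}_j)=(\textbf{y}_i\cdot\textbf{y}_j)^2 $ in two dimensions as a hypothetical example; its explicit transform into a three-dimensional feature space can be written out by hand:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel in 2-D:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so phi(x).phi(z) = (x.z)^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    """Quadratic kernel: computes the dot product in the transformed
    3-D space without ever forming phi explicitly."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(np.isclose(k(x, z), np.dot(phi(x), phi(z))))   # True: both equal 25
```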
