Support Vector Machine and its Applications in Classification Problems

A slecture by Xing Liu
Partially based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.


Outline of the slecture

  • Background in Linear Classifiers
  • Support Vector Machine
  • Effect of Kernel Functions on SVM
  • Effect of Kernel Parameters on SVM
  • References




Background in Linear Classifiers 

   In this section, we introduce the framework and basic idea of the linear classification problem.

   In a linear classification problem, the feature space can be divided into different regions by hyperplanes. In this lecture, we take a two-category case to illustrate. Suppose we are given training samples

$ \textbf{y}_1,\textbf{y}_2,...\textbf{y}_n \in \mathbb{R}^p $ , where each $ \textbf{y}_i $ is a p-dimensional vector and belongs to either class w1 or w2. The goal is to find a hyperplane that separates the points in the feature space that belong to class w1 from those that belong to class w2. If the data is linearly separable, the discriminant function can be written as

$ g(\textbf{y}) = \textbf{c}\cdot\textbf{y} $


   We want to find $ \textbf{c}\in\mathbb{R}^{p} $ so that a data point $ \textbf{y} $ is labelled

$ w_1 ~~ if ~ \textbf{c}\cdot\textbf{y}>0 $ ,
$ w_2 ~~ if ~ \textbf{c}\cdot \textbf{y} < 0 $ .


   We can apply a trick here: replace every $ \textbf{y} $ in class w2 by $ -\textbf{y} $; then the above task is equivalent to looking for $ \textbf{c} $ so that 

                                                   $ \textbf{c}\cdot \textbf{y}>0, ~~ \forall \textbf{y} $ in the new sample set.


   You might have already observed the ambiguity of $ \textbf{c} $ in the above discussion: if $ \textbf{c} $ separates the data, then $ \lambda \textbf{c} $ also separates the data for any $ \lambda > 0 $. One solution is to set $ |\textbf{c}|=1 $. Another solution is to introduce a bias denoted b, and require

$ \textbf{c}\cdot\textbf{y}\ge b > 0,~~ \forall \textbf{y} $


   In this scenario, the hyperplane is defined by $ \{\textbf{y}: f(\textbf{y})=\textbf{c}\cdot \textbf{y} - \textbf{b}=0\} $ and it divides the space into two half-spaces; the sign of the discriminant function $ f(\textbf{y}) = \textbf{c}\cdot \textbf{y} - \textbf{b} $ indicates which side of the hyperplane a testing point lies on. Since the decision boundary given by this hyperplane is linear, the classifier is called a linear classifier. $ \textbf{c} $ is the normal vector of the hyperplane and points toward its positive side. If we require $ \textbf{c}\cdot\textbf{y}_i = b_i $, then $ \frac{b_i}{||c||} $ is the distance from the point $ \textbf{y}_i $ to the hyperplane $ \textbf{c}\cdot\textbf{y} = 0 $.
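
   As a concrete illustration, here is a minimal sketch (assuming NumPy; the weight vector c, bias b, and test points are made-up numbers) of classifying points by the sign of the discriminant function $ f(\textbf{y}) = \textbf{c}\cdot \textbf{y} - b $:

```python
import numpy as np

# Hypothetical weight vector and bias defining the hyperplane c.y - b = 0
c = np.array([2.0, -1.0])
b = 1.0

# Made-up test points, one on each side of the hyperplane
Y_test = np.array([[3.0, 1.0],    # f = 2*3 - 1*1 - 1 = 4  -> class w1
                   [0.0, 2.0]])   # f = 2*0 - 1*2 - 1 = -3 -> class w2

f = Y_test @ c - b                              # discriminant values
labels = np.where(f > 0, "w1", "w2")            # the sign decides the class
print(list(zip(f.tolist(), labels.tolist())))   # [(4.0, 'w1'), (-3.0, 'w2')]
```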


   Classifying the data is therefore equivalent to setting up such a hyperplane, i.e. to finding a solution of 

$ \textbf{Y}\textbf{c} = \begin{bmatrix} b_1\\b_2\\...\\b_n\end{bmatrix} $ ,

where $ \textbf{Y} =\begin{bmatrix} y_1^T \\ y_2^T \\ ... \\ y_n^T \end{bmatrix} $ .


   In most cases when n > p, it is usually impossible to find an exact solution for $ \textbf{c} $. An alternative approach is to find the $ \textbf{c} $ that minimizes a criterion function $ J(\textbf{c}) $. There are various forms of criterion function. For example, we can try to minimize the norm of the error vector between $ \textbf{Y}\textbf{c} $ and $ \textbf{b} $, hence the criterion function is defined as

$ J(\textbf{c}) = ||\textbf{Y}\textbf{c}-\textbf{b}||^2 $.


   The solution to the above problem is

$ \textbf{c} = (\textbf{Y}^T\textbf{Y})^{-1}\textbf{Y}^T\textbf{b} $     if $ det(\textbf{Y}^T\textbf{Y})\neq 0 $,

otherwise, the solution is defined more generally by


$ \lim_{\epsilon \to 0}(\textbf{Y}^T\textbf{Y}+\epsilon\textbf{I})^{-1}\textbf{Y}^T\textbf{b} $.


   This solution is an MSE solution to $ \textbf{Y}\textbf{c} = \textbf{b} $ and it always exists.
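
   A minimal numerical sketch of this MSE solution, assuming NumPy is available (the sample matrix Y and margin vector b below are made up purely for illustration):

```python
import numpy as np

# Made-up data: n = 5 samples in p = 2 dimensions (class-w2 samples already negated)
Y = np.array([[ 1.0,  2.0],
              [ 2.0,  1.5],
              [ 1.5,  3.0],
              [-1.0, -2.5],
              [-2.0, -1.0]])
b = np.ones(5)                 # target margins b_i

# MSE solution c = (Y^T Y)^{-1} Y^T b; the pseudo-inverse also covers the
# singular case, matching the epsilon-limit definition above
c = np.linalg.pinv(Y) @ b

J = np.linalg.norm(Y @ c - b) ** 2   # value of the criterion J(c)
print(c, J)
```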




Support Vector Machine


Linear SVM

   Support vector machine is an example of a linear classifier. For a given hyperplane, we denote by $ \textbf{y}_1 $ ($ \textbf{y}_2 $) the closest point to the hyperplane among the positive (negative) samples. By geometry, the distances from these two points to the hyperplane are $ g(\textbf{y}_1)/||c|| $ and $ g(\textbf{y}_2)/||c|| $. The margin is defined as the region between the two points. In SVM, the hyperplane is chosen so that the margin is maximized, i.e. we want to maximize 1/||c||, which is equivalent to minimizing $ ||c||^2 $. This leads to the following optimization problem:

$ arg \min \limits_{c,b} \quad\quad \frac{1}{2}||c||^2 $,

$ subject \quad to: \quad w_i(\textbf{c}\textbf{y}_i+b) \geqslant 1, \quad i = 1,...,n $.



   A more general classifier that allows misclassification would be

$ arg \min \limits_{c,b} \quad\quad \frac{1}{2}||c||^2 +C\sum_{i=1}^{n} \xi_i $,
$ subject \quad to: \quad w_i(\textbf{c}\textbf{y}_i+b) \geqslant 1-\xi_i, \quad \xi_i\geqslant 0, \quad i = 1,...,n $.


where $ \xi_i \ge 0 $ are slack variables that allow an example to be in the margin ($ 0 \le \xi_i \le 1 $, also called a margin error) or to be misclassified ($ \xi_i > 1 $). The constant $ C > 0 $ sets the relative importance of maximizing the margin and minimizing the amount of slack. Using the method of Lagrange multipliers, we can obtain the dual formulation, which is expressed in terms of the variables $ \alpha_i $:

$ arg \max \limits_{\alpha} \quad\quad \sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_iw_j\alpha_i\alpha_j\textbf{y}_i^T\textbf{y}_j $
$ subject \quad to: \sum_{i=1}^nw_i\alpha_i = 0, C\geqslant \alpha_i \geqslant 0 $
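
   In practice this quadratic program is solved by an off-the-shelf optimizer rather than by hand. As a hedged sketch (assuming scikit-learn is installed; the toy data and the value C = 1 are made up), the soft-margin formulation above corresponds to a support vector classifier with a linear kernel:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up two-class training data, with labels w_i encoded as +1/-1
Y = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
w = np.array([1, 1, 1, -1, -1, -1])

# C trades off a large margin against the total amount of slack
clf = SVC(kernel="linear", C=1.0)
clf.fit(Y, w)

print(clf.coef_, clf.intercept_)   # the hyperplane parameters c and b
print(clf.support_)                # indices of the support vectors
print(clf.predict([[2.0, 1.5]]))   # classify a new point
```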



Nonlinear case


   In many applications the data is not linearly separable; in this case the 'kernel trick' is applied to map the data into a high-dimensional feature space. We first map the input space to a feature space using $ \varphi: \mathbb{R}^n \rightarrow \mathbb{R}^m $. Then the discriminant function becomes

$ f(\textbf{y}) = \textbf{c}^T\varphi(\textbf{y}) + b. $

   Suppose $ \textbf{c} $ can be expressed as a linear combination of the mapped data points, $ \textbf{c} = \sum_{i=1}^{n}\alpha_i\varphi(\textbf{y}_i) $; then the discriminant function in the new feature space takes the form


$ f(\textbf{y}) = \sum_{i=1}^{n}\alpha_i\varphi(\textbf{y}_i)^T\varphi(\textbf{y})+b $.


   By defining a kernel function as a mapping

$ k: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R} $


that satisfies


$ k(\textbf{y}_i,\textbf{y}_j) = \varphi(\textbf{y}_i)^T\varphi(\textbf{y}_j) $ ,



the discriminant function can be written in terms of the kernel function

$ f(\textbf{y}) = \sum_{i=1}^n\alpha_i k(\textbf{y},\textbf{y}_i)+b. $

   The resulting algorithm is formally similar to the linear SVM, except that every dot product is replaced by a nonlinear kernel function.

   There are several kernel functions we can choose from. Some common kernels include:

Linear: $ k(\textbf{x}_i,\textbf{x}_j)=\textbf{x}_i\cdot\textbf{x}_j $
Polynomial: $ k(\textbf{x}_i,\textbf{x}_j) = (\textbf{x}_i\cdot\textbf{x}_j+1)^d $
Gaussian radial basis function: $ k(\textbf{x}_i,\textbf{x}_j) = \exp(-\gamma||\textbf{x}_i-\textbf{x}_j||^2) $
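
   These three kernels are straightforward to compute directly; the sketch below is a minimal NumPy version (the degree d = 3 and the parameter gamma = 0.5 are arbitrary placeholder values):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=3):
    return (np.dot(xi, xj) + 1) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(xi - xj) ** 2)

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(xi, xj))       # 3.0
print(polynomial_kernel(xi, xj))   # (3 + 1)^3 = 64.0
print(rbf_kernel(xi, xj))          # exp(-0.5 * 3.25) ~= 0.197
```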


   To train an SVM and find the large-margin hyperplane, one needs to set the parameters $ \alpha_i $ and b. Besides the parameters in the optimization problem, the SVM has another set of parameters called hyperparameters, including the soft-margin constant C and any parameters of the kernel function. In the following sections we will apply SVM to a classification problem using different kernel functions, then illustrate the effect of the parameters of the kernel functions on the SVM classification. 




Effect of Kernel Functions on SVM

   Different kernel functions perform differently depending on the structure of the training data. In this section we give examples of using SVM to classify the Ripley data set (a mixture of two Gaussian distributions in each class) with the three kernel functions above. The classifications are illustrated in Fig. 1~3. The parameters are tuned by cross-validation. The mis-classification rates, estimated by 10-fold cross-validation on the data set, are as follows:

Linear: 0.1488
Polynomial: 0.0744
Gaussian radial basis function: 0.0651


   We can see that in this example, where the data is not linearly separable, the polynomial kernel and the Gaussian radial basis function work much better than the linear kernel, as expected.
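
   The comparison can be reproduced in outline with scikit-learn. The sketch below is hedged: it uses a synthetic two-Gaussian-per-class mixture as a stand-in for the Ripley data (which is not bundled with scikit-learn), and the mixture means, spread, C, degree, and gamma are made-up values, so the exact error rates will differ from the table above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Each class is a mixture of two 2-D Gaussians (made-up means and spread)
def mixture(means, n_per_mean):
    return np.vstack([rng.normal(loc=m, scale=0.2, size=(n_per_mean, 2)) for m in means])

X = np.vstack([mixture([(-0.3, 0.7), (0.4, 0.7)], 62),    # class +1
               mixture([(-0.7, 0.3), (0.3, 0.3)], 62)])   # class -1
y = np.hstack([np.ones(124), -np.ones(124)])

for kernel, params in [("linear", {}),
                       ("poly",   {"degree": 3, "coef0": 1.0}),
                       ("rbf",    {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    err = 1.0 - cross_val_score(clf, X, y, cv=10).mean()   # 10-fold CV error
    print(f"{kernel}: {err:.4f}")
```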

LinearKernel.png
Fig 1. SVM classification using linear kernel function
Polynomial.png
Fig 2. SVM classification using polynomial kernel function
Rbf.png
Fig 3. SVM classification using Gaussian radial basis function




Effect of Kernel Parameters on SVM

   You might have noticed the parameters in the kernel functions (the width of a Gaussian kernel or the degree of a polynomial kernel). These parameters can be tuned using cross-validation to achieve the best performance, which is what we did in the previous section. In this section, we manually adjust the parameters to see their effect on the classification. 

   The effect of the degree of the polynomial kernel on the classification is shown in Fig. 4. Figure 5 shows the mis-classification rate as a function of the degree. We can see that higher-degree polynomial kernels allow a more flexible decision boundary. For the data set in our example, a higher degree also leads to a lower mis-classification rate.
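
   A hedged sketch of such a degree sweep (assuming scikit-learn; the moons data set below is just a convenient non-linearly-separable toy set, not the Ripley data, so only the trend, not the numbers, is comparable to Fig. 5):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# A small toy data set that is not linearly separable
X, y = make_moons(n_samples=250, noise=0.25, random_state=0)

# Cross-validated error as a function of the polynomial degree (cf. Fig. 5);
# coef0=1 gives the (x.x' + 1)^d form of the kernel from the table above
for d in [2, 3, 5, 7, 9, 12]:
    clf = SVC(kernel="poly", degree=d, coef0=1.0, C=1.0)
    err = 1.0 - cross_val_score(clf, X, y, cv=10).mean()
    print(f"degree {d}: error {err:.4f}")
```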

DegreePoly2.png DegreePoly7.pngDegreePoly12.png
Fig 4. SVM classification using polynomial kernel function of degree 2(top), 7(middle), and 12(bottom)



DegreeOfpolynomial.png
Fig 5. The effect of the degree of a polynomial kernel



   The effect of the width parameter of the Gaussian kernel (γ) is shown in Fig. 6 and the mis-classification rate is shown in Fig. 7. We can see that for large values of γ the decision boundary is nearly linear. As γ decreases, the flexibility of the decision boundary increases. Small values of γ might lead to overfitting, as shown in the first figure.
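
   As a sketch of how plots like Fig. 6 can be produced (assuming scikit-learn and matplotlib; the toy data is again a stand-in rather than the Ripley set, and the listed gamma values are placeholders, so the pictures will only qualitatively resemble the figures — note that scikit-learn's gamma multiplies the squared distance, i.e. k = exp(-γ||x - x'||²), matching the table above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=250, noise=0.25, random_state=0)

# Evaluate the decision function on a grid so the boundary f(y) = 0 can be drawn
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))

gammas = [0.01, 0.1, 1, 10, 100]
fig, axes = plt.subplots(1, len(gammas), figsize=(4 * len(gammas), 4))
for ax, gamma in zip(axes, gammas):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contour(xx, yy, zz, levels=[0.0])     # the decision boundary
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)  # the training points
    ax.set_title(f"gamma = {gamma}")
plt.show()
```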



Gamma1RBF.png Gamma2RBF.pngGamma3RBF.pngGamma4RBF.pngGamma5RBF.png
Fig 6. SVM classification using RBF kernel function with gamma = 0.001, 0.01, 0.1, 1, and 10 (from top to bottom).


GammaOfBRF.png
Fig 7. The effect of the gamma of an RBF kernel


References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.
[2]. "Pattern Classification" by Duda, Hart, and Stork

[3]. "A User’s Guide to Support Vector Machines" by Ben-Hur and Weston

[4]. LS-SVM lab: http://www.esat.kuleuven.be/sista/lssvmlab/




Questions and comments

If you have any questions, comments, etc. please post them on this page (https://kiwi.ecn.purdue.edu/rhea/index.php/SVMAndApplications).



Back to ECE662, Spring 2014
