Discriminant Functions For The Normal Density - Part 2

Continuing from where we left off in Part 1, in a problem with feature vector x and state of nature variable w, we can represent the discriminant function as:

$g_i(\mathbf{x}) = - \frac{1}{2} \left (\mathbf{x} - \boldsymbol{\mu}_i \right)^t\boldsymbol{\Sigma}_i^{-1} \left (\mathbf{x} - \boldsymbol{\mu}_i \right) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(w_i)$
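As a rough illustration, the discriminant function above can be evaluated numerically. The sketch below assumes NumPy; the function names `gaussian_discriminant` and `classify` and the two example classes are hypothetical and chosen only for this illustration.

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """Return g_i(x) for one class with mean mu, covariance sigma and prior P(w_i)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Quadratic term (x - mu)^t Sigma^-1 (x - mu), computed with a linear
    # solve instead of an explicit matrix inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    # slogdet returns (sign, ln|det|); the sign is +1 for a positive-definite covariance.
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * maha - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet + np.log(prior)

def classify(x, classes):
    """Return the index i that maximizes g_i(x); classes is a list of (mu, sigma, prior)."""
    scores = [gaussian_discriminant(x, mu, sigma, p) for mu, sigma, p in classes]
    return int(np.argmax(scores))

# Hypothetical two-class problem in d = 2 dimensions (illustrative values only).
classes = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([3.0, 3.0]), np.eye(2), 0.5),
]
print(classify([0.5, 0.2], classes))  # index of the class with the largest g_i
print(classify([2.8, 3.1], classes))
```

A point is assigned to whichever class gives the largest value of g_i(x), so only differences between the discriminants matter. This is why terms that are the same for every class can be dropped in the cases that follow.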

We will now look at the special cases that arise for a multivariate normal distribution.

Case 1: Σ_i = σ²I

This is the simplest case. It occurs when the features are statistically independent and each feature has the same variance, σ². Here the covariance matrix is diagonal, since it is simply σ² times the identity matrix I. Geometrically, the samples fall into equal-sized hyperspherical clusters, with the cluster for the ith class centered about its mean vector μ_i. The determinant and the inverse of Σ_i are easy to compute: |Σ_i| = σ^(2d) and Σ_i^-1 = (1/σ²)I. Because both |Σ_i| and the (d/2) ln 2π term in the equation above are independent of i, they are additive constants that can be ignored, and we obtain the simplified discriminant function:

$g_i(\mathbf{x}) = - \frac{||\mathbf{x} - \boldsymbol{\mu}_i||^2 }{2\boldsymbol{\sigma}^{2}} + \ln P(w_i)$

where ||.|| denotes the Euclidean norm, that is,

$||\mathbf{x} - \boldsymbol{\mu}_i||^2 = \left (\mathbf{x} - \boldsymbol{\mu}_i \right)^t (\mathbf{x} - \boldsymbol{\mu}_i)$

If the prior probabilities are not equal, then the discriminant function shows that the squared distance ||x - μ_i||² must be normalized by the variance σ² and offset by adding ln P(w_i). Therefore, if x is equally near two different mean vectors, the optimal decision will favor the category that is a priori more likely. Expansion of the quadratic form (x - μ_i)^t(x - μ_i) yields:

$g_i(\mathbf{x}) = -\frac{1}{2\boldsymbol{\sigma}^{2}}[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i] + \ln P(w_i)$

which looks like a quadratic function of x. However, the quadratic term x^tx is the same for all i, so it is just an additive constant and can be ignored. We thereby obtain the equivalent linear discriminant function:

$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$

where

$\mathbf{w}_i = \frac{1}{\boldsymbol{\sigma}^{2}}\boldsymbol{\mu}_i$

and

$w_{i0} = -\frac{1}{2\boldsymbol{\sigma}^{2}}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(w_i)$

w_i0 is the threshold or bias for the ith category.
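The following sketch checks numerically that the linear form above ranks classes exactly as the distance-based form does. It assumes NumPy; the function names and the values of σ², the means, and the priors are hypothetical.

```python
import numpy as np

def case1_quadratic(x, mu, sigma2, prior):
    """Distance form: g_i(x) = -||x - mu||^2 / (2 sigma^2) + ln P(w_i)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return -(diff @ diff) / (2.0 * sigma2) + np.log(prior)

def case1_linear_params(mu, sigma2, prior):
    """Return (w_i, w_i0) of the linear form g_i(x) = w_i^t x + w_i0."""
    mu = np.asarray(mu, dtype=float)
    w = mu / sigma2
    w0 = -(mu @ mu) / (2.0 * sigma2) + np.log(prior)
    return w, w0

# Hypothetical setup: shared variance, two classes with unequal priors.
sigma2 = 1.5
x = np.array([1.0, 0.5])
params = [(np.array([0.0, 0.0]), 0.7), (np.array([2.0, 2.0]), 0.3)]

gaps = []
for mu, prior in params:
    w, w0 = case1_linear_params(mu, sigma2, prior)
    linear = w @ x + w0
    quadratic = case1_quadratic(x, mu, sigma2, prior)
    # The two forms differ by x^t x / (2 sigma^2), which does not depend on the class.
    gaps.append(linear - quadratic)
print(gaps)
```

Both entries of `gaps` equal x^tx/(2σ²), so dropping that term shifts every discriminant by the same amount and leaves the decision unchanged.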

A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations g_i(x) = g_j(x) for the two categories with the highest posterior probabilities. In this situation, the equation can be written as

$\mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0$

where

$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$

and

$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\boldsymbol{\sigma}^{2}}{||\boldsymbol{\mu}_i - \boldsymbol{\mu}_j||^2}\ln\frac{P(w_i)}{P(w_j)}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$

These equations define a hyperplane through the point x₀ and orthogonal to the vector w. Because w = μ_i - μ_j, the hyperplane separating the decision regions R_i and R_j is orthogonal to the line linking the means. If P(w_i) = P(w_j), the point x₀ is halfway between the means and the hyperplane is the perpendicular bisector of the line between the means, as shown in Figure 1 below. If P(w_i) ≠ P(w_j), the point x₀ shifts away from the more likely mean.

Figure 1
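To make the effect of the priors on x₀ concrete, here is a minimal sketch of the Case 1 boundary formulas. It assumes NumPy; `case1_boundary` and the numeric values are hypothetical.

```python
import numpy as np

def case1_boundary(mu_i, mu_j, sigma2, p_i, p_j):
    """Return (w, x0) for the hyperplane w^t (x - x0) = 0 when Sigma_i = sigma^2 I."""
    mu_i = np.asarray(mu_i, dtype=float)
    mu_j = np.asarray(mu_j, dtype=float)
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (w @ w)) * np.log(p_i / p_j) * w
    return w, x0

# Hypothetical means four units apart along the first axis.
mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])

_, x0_equal = case1_boundary(mu_i, mu_j, 1.0, 0.5, 0.5)
_, x0_skewed = case1_boundary(mu_i, mu_j, 1.0, 0.8, 0.2)
print(x0_equal)   # midpoint of the two means
print(x0_skewed)  # moved toward mu_j, away from the more likely mean mu_i
```

With equal priors the logarithm vanishes and x₀ is the midpoint. With P(w_i) = 0.8, the point moves toward μ_j, which enlarges the region assigned to the more likely class.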

Case 2: Σ_i = Σ

Another case occurs when the covariance matrices for all the classes are identical. It corresponds to a situation where the samples fall into hyperellipsoidal clusters of equal size and shape, with the cluster of the ith class centered around the mean vector μ_i. As in Case 1, both the |Σ_i| term and the (d/2) ln 2π term can be ignored because they are independent of i. This leads to the simplified discriminant function:

$g_i(\mathbf{x}) = - \frac{1}{2} \left (\mathbf{x} - \boldsymbol{\mu}_i \right)^t\boldsymbol{\Sigma}^{-1} \left (\mathbf{x} - \boldsymbol{\mu}_i \right) + \ln P(w_i)$

If the prior probabilities P(w_i) are equal for all classes, then the ln P(w_i) term can be ignored. If they are unequal, the decision will be biased in favor of the category that is a priori more likely. Expansion of the quadratic form (x - μ_i)^tΣ^-1(x - μ_i) results in a sum involving the term x^tΣ^-1x, which is independent of i. After this term is dropped, we get the resulting linear discriminant function:

$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$

where

$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$

and

$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(w_i)$
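As a quick numerical check of this simplification, the sketch below compares the linear form with the quadratic form for a shared covariance matrix. It assumes NumPy; the function names, the covariance matrix, the means, and the priors are hypothetical.

```python
import numpy as np

def case2_quadratic(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x - mu)^t Sigma^-1 (x - mu) + ln P(w_i)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return -0.5 * (diff @ np.linalg.solve(sigma, diff)) + np.log(prior)

def case2_linear_params(mu, sigma, prior):
    """Return (w_i, w_i0) with w_i = Sigma^-1 mu_i."""
    mu = np.asarray(mu, dtype=float)
    w = np.linalg.solve(sigma, mu)  # Sigma^-1 mu without forming the inverse
    w0 = -0.5 * (mu @ w) + np.log(prior)
    return w, w0

# Hypothetical shared covariance and two classes.
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, -0.2])
mu_1, p_1 = np.array([1.0, 1.0]), 0.6
mu_2, p_2 = np.array([-1.0, 0.0]), 0.4

w_1, w_10 = case2_linear_params(mu_1, sigma, p_1)
w_2, w_20 = case2_linear_params(mu_2, sigma, p_2)

# Differences between classes agree, so both forms make the same decision.
linear_margin = (w_1 @ x + w_10) - (w_2 @ x + w_20)
quadratic_margin = case2_quadratic(x, mu_1, sigma, p_1) - case2_quadratic(x, mu_2, sigma, p_2)
print(linear_margin, quadratic_margin)
```

The two margins agree because the dropped term x^tΣ^-1x cancels in any difference g_i(x) - g_j(x).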

Because the discriminants are linear, the resulting decision boundaries are again hyperplanes. If the decision regions R_i and R_j are contiguous, the boundary between them has the equation:

$\mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0$

where

$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$

and

$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln[P(w_i)/P(w_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$

Because w = Σ^-1(μ_i - μ_j) is generally not in the direction of μ_i - μ_j, the hyperplane separating R_i and R_j is generally not orthogonal to the line between the means. It does, however, intersect that line at the point x₀, which is halfway between the means if the prior probabilities are equal. If the prior probabilities are not equal, the boundary hyperplane is shifted away from the more likely mean. Figure 2 below shows what the decision boundary looks like for this case.

Figure 2
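The tilt of the Case 2 boundary can be seen numerically. The sketch below assumes NumPy; `case2_boundary` and all numeric values are hypothetical.

```python
import numpy as np

def case2_boundary(mu_i, mu_j, sigma, p_i, p_j):
    """Return (w, x0) for the hyperplane w^t (x - x0) = 0 when Sigma_i = Sigma."""
    mu_i = np.asarray(mu_i, dtype=float)
    mu_j = np.asarray(mu_j, dtype=float)
    diff = mu_i - mu_j
    w = np.linalg.solve(sigma, diff)  # Sigma^-1 (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / (diff @ w)) * diff
    return w, x0

# Hypothetical shared covariance with correlated features.
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_i, mu_j = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
p_i, p_j = 0.6, 0.4

w, x0 = case2_boundary(mu_i, mu_j, sigma, p_i, p_j)

# Cosine of the angle between w and the line joining the means;
# a value below 1 means the hyperplane is not orthogonal to that line.
diff = mu_i - mu_j
cos_angle = (w @ diff) / (np.linalg.norm(w) * np.linalg.norm(diff))
print(cos_angle)
print(x0)
```

The point x₀ always lies on the line through the two means, since it is the midpoint plus a multiple of μ_i - μ_j. Only the orientation of the hyperplane depends on Σ.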

Case 3: Σ_i = arbitrary

In the general multivariate Gaussian case, where the covariance matrices are different for each class, the only term that can be dropped from the initial discriminant function is the (d/2) ln 2π term. The resulting discriminant function is quadratic in x:

$g_i(\mathbf{x}) = \mathbf{x}^t\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^t\mathbf{x} + w_{i0}$

where

$\mathbf{W}_i= -\frac{1}{2}\boldsymbol{\Sigma}_i^{-1}$

$\mathbf{w}_i= \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i$

and

$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^t\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i|+ \ln P(w_i)$

This leads to hyperquadric decision boundaries as seen in the figure below.

Figure 3
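A minimal sketch of the Case 3 parameters follows. It assumes NumPy; the function names and the example mean, covariance, and prior are hypothetical. The final check confirms that the quadratic form differs from the full discriminant only by the dropped (d/2) ln 2π term.

```python
import numpy as np

def case3_params(mu, sigma, prior):
    """Return (W_i, w_i, w_i0) for g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    mu = np.asarray(mu, dtype=float)
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    # slogdet returns (sign, ln|det|); the sign is +1 for a positive-definite covariance.
    _, logdet = np.linalg.slogdet(sigma)
    w0 = -0.5 * (mu @ sigma_inv @ mu) - 0.5 * logdet + np.log(prior)
    return W, w, w0

def case3_discriminant(x, W, w, w0):
    x = np.asarray(x, dtype=float)
    return x @ W @ x + w @ x + w0

# Hypothetical class with its own covariance matrix.
mu = np.array([1.0, -1.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
prior = 0.4
x = np.array([0.5, 0.5])

W, w, w0 = case3_params(mu, sigma, prior)
g_quadratic = case3_discriminant(x, W, w, w0)

# Full discriminant from the general formula, including the constant term.
d = mu.shape[0]
diff = x - mu
g_full = (-0.5 * (diff @ np.linalg.solve(sigma, diff))
          - 0.5 * d * np.log(2.0 * np.pi)
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))
print(g_quadratic - g_full)  # equals (d/2) ln 2*pi, the only term dropped
```

Unlike Cases 1 and 2, the matrix W_i differs from class to class, so the x^tW_ix term does not cancel between classes and the boundaries are no longer hyperplanes.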

EXAMPLE

Given the data below, drawn from a distribution with two classes w₁ and w₂, each with prior probability 0.5, find the discriminant functions and the decision boundary.

Sample	w₁	w₂
1	-5.01	-0.91
2	-5.43	1.30
3	1.08	-7.75
4	0.86	-5.47
5	-2.67	6.14
6	4.94	3.60
7	-2.51	5.37
8	-2.25	7.18
9	5.56	-7.39
10	1.03	-7.50

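The data are one-dimensional, so each class is described by a scalar mean and a scalar variance. Writing x_{i,k} for the kth sample of class w_i, the sample means are:

$\boldsymbol{\mu}_1 = \frac{1}{10}\sum_{k=1}^{10}x_{1,k} = -0.44$

$\boldsymbol{\mu}_2 = \frac{1}{10}\sum_{k=1}^{10}x_{2,k} = -0.543$

and the sample variances are

$\boldsymbol{\sigma}_1^2 = \frac{1}{10}\sum_{k=1}^{10}(x_{1,k} - \boldsymbol{\mu}_1)^2 = 12.94$

$\boldsymbol{\sigma}_2^2 = \frac{1}{10}\sum_{k=1}^{10}(x_{2,k} - \boldsymbol{\mu}_2)^2 = 33.15$

The two estimates differ, so the equal-variance assumption of Case 1 holds only approximately. To apply the Case 1 equations we take the common variance to be the average of the two estimates, σ² = 23.04. The linear discriminant functions g_i(x) = (μ_i/σ²)x - μ_i²/(2σ²) + ln P(w_i) are then

$g_1(x) = \frac{-0.44}{23.04}x - \frac{(-0.44)^2}{2 \cdot 23.04} + \ln(0.5) = -0.0191x - 0.697$

$g_2(x) = \frac{-0.543}{23.04}x - \frac{(-0.543)^2}{2 \cdot 23.04} + \ln(0.5) = -0.0236x - 0.700$

Because the two classes have the same prior probability, the decision boundary x₀ lies halfway between the means, at x₀ = (-0.44 - 0.543)/2 ≈ -0.492. A sample x is assigned to w₁ when x > x₀ and to w₂ otherwise.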
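The statistics for this example can be recomputed directly from the table with a short script. This is a sketch under stated assumptions: the variance is taken as the mean squared deviation from the class mean (dividing by n = 10), and the common Case 1 variance is taken as the average of the two class variances. The names `mean`, `ml_variance`, and `g` are hypothetical helpers, not part of any library.

```python
import math

# Samples copied from the table above.
w1 = [-5.01, -5.43, 1.08, 0.86, -2.67, 4.94, -2.51, -2.25, 5.56, 1.03]
w2 = [-0.91, 1.30, -7.75, -5.47, 6.14, 3.60, 5.37, 7.18, -7.39, -7.50]

def mean(xs):
    return sum(xs) / len(xs)

def ml_variance(xs):
    """Mean squared deviation from the sample mean (divides by n, not n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

mu1, mu2 = mean(w1), mean(w2)
var1, var2 = ml_variance(w1), ml_variance(w2)

# Assumption: Case 1 needs one shared variance, so average the two estimates.
sigma2 = 0.5 * (var1 + var2)
prior = 0.5

def g(x, mu):
    """Case 1 linear discriminant in one dimension."""
    return (mu / sigma2) * x - mu * mu / (2.0 * sigma2) + math.log(prior)

# With equal priors the log-ratio term vanishes and x0 is the midpoint of the means.
x0 = 0.5 * (mu1 + mu2)

print(round(mu1, 3), round(mu2, 3))
print(round(var1, 2), round(var2, 2), round(sigma2, 2))
print(round(x0, 4))
print(g(x0, mu1) - g(x0, mu2))  # approximately zero on the boundary
```

At x₀ the two discriminants are equal, which is what defines the decision boundary. On either side of it, the class with the larger g_i(x) is chosen.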
