
Discriminant Functions For The Normal Density - Part 2


      Continuing from where we left off in Part 1, recall that in a problem with feature vector x and state of nature variable w, we can represent the discriminant function as:

$ g_i(\mathbf{x}) = - \frac{1}{2} \left (\mathbf{x} - \boldsymbol{\mu}_i \right)^t\boldsymbol{\Sigma}_i^{-1} \left (\mathbf{x} - \boldsymbol{\mu}_i \right) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(w_i) $
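As a quick illustration, here is a minimal sketch (not part of the original derivation) of how this general discriminant function could be evaluated numerically; the class means, covariances, and priors below are made-up values for the example.

import numpy as np

def discriminant(x, mu_i, sigma_i, prior_i):
    # g_i(x) = -1/2 (x - mu_i)^t Sigma_i^{-1} (x - mu_i) - d/2 ln 2pi
    #          - 1/2 ln|Sigma_i| + ln P(w_i)
    d = len(mu_i)
    diff = x - mu_i
    quad = diff @ np.linalg.inv(sigma_i) @ diff
    return (-0.5 * quad
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))

# Made-up two-class example in 2-D: decide the class with the larger g_i(x).
x = np.array([1.0, 0.5])
g1 = discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = discriminant(x, np.array([2.0, 2.0]), np.eye(2), 0.5)
print("decide w1" if g1 > g2 else "decide w2")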

We will now look at the different cases that arise for a multivariate normal distribution, depending on the form of the covariance matrix Σi.


Case 1: Σi = σ²I

       This is the simplest case, and it occurs when the features are statistically independent and each feature has the same variance, σ². Here the covariance matrix is diagonal, since it is simply σ² times the identity matrix I. Geometrically, the samples fall in equal-size hyperspherical clusters centered about their respective mean vectors. The determinant and the inverse are easy to compute: $ |\boldsymbol{\Sigma}_i| = \sigma^{2d} $ and $ \boldsymbol{\Sigma}_i^{-1} = (1/\sigma^{2})\mathbf{I} $. Because both $ |\boldsymbol{\Sigma}_i| $ and the (d/2) ln 2π term in the equation above are independent of i, we can ignore them, and we obtain the simplified discriminant function:

$ g_i(\mathbf{x}) = - \frac{||\mathbf{x} - \boldsymbol{\mu}_i||^{2}}{2\sigma^{2}} + \ln P(w_i) $

where ||.|| denotes the Euclidean norm, that is,

$ ||\mathbf{x} - \boldsymbol{\mu}_i||^{2} = \left (\mathbf{x} - \boldsymbol{\mu}_i \right)^t (\mathbf{x} - \boldsymbol{\mu}_i) $
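For concreteness, here is a minimal sketch (an illustration, not from the original article) of this Case 1 discriminant; the means, variance, and priors are assumed values.

import numpy as np

def discriminant_case1(x, mu_i, sigma_sq, prior_i):
    # g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i)
    diff = x - mu_i
    return -(diff @ diff) / (2.0 * sigma_sq) + np.log(prior_i)

# With equal priors this reduces to a minimum-Euclidean-distance classifier.
x = np.array([1.0, 1.0])
print(discriminant_case1(x, np.array([0.0, 0.0]), 1.0, 0.5))  # closer mean scores higher
print(discriminant_case1(x, np.array([3.0, 0.0]), 1.0, 0.5))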


       If the prior probabilities are not equal, the discriminant function shows that the squared distance ||x - μi||² must be normalized by the variance σ² and offset by adding ln P(wi); therefore, if x is equally near two different mean vectors, the optimal decision favors the a priori more likely category. Expansion of the quadratic form $ \left (\mathbf{x} - \boldsymbol{\mu}_i \right)^t (\mathbf{x} - \boldsymbol{\mu}_i) $ yields:

$ g_i(\mathbf{x}) = -\frac{1}{2\sigma^{2}}[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i] + \ln P(w_i) $

which looks like a quadratic function of x. However, the quadratic term $ \mathbf{x}^t\mathbf{x} $ is the same for every i, so it can be dropped as an additive constant, and we obtain the equivalent linear discriminant function:

$ g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0} $

where

$ \mathbf{w}_i = \frac{1}{\sigma^{2}}\boldsymbol{\mu}_i $

and

$ w_{i0} = -\frac{1}{2\sigma^{2}}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(w_i) $

wi0 is the threshold or bias for the ith category.
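The same classifier can be written directly in this linear form. Here is a small sketch under the same illustrative assumptions (the mean, variance, and prior are made up):

import numpy as np

def linear_discriminant(x, mu_i, sigma_sq, prior_i):
    w_i = mu_i / sigma_sq                                         # weight vector
    w_i0 = -(mu_i @ mu_i) / (2.0 * sigma_sq) + np.log(prior_i)    # threshold / bias
    return w_i @ x + w_i0

# This differs from the Case 1 form above only by the common -x^t x / (2 sigma^2)
# term, so it leads to the same decision.
x = np.array([1.0, 1.0])
print(linear_discriminant(x, np.array([0.0, 0.0]), 1.0, 0.5))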

       A classifier that uses linear discriminant functions is called a linear machine. For a linear machine, the decision surfaces are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities. In this case, the equation can be written as

$ \mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0 $

where

$ \mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j $

and

$ \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^{2}}{||\boldsymbol{\mu}_i - \boldsymbol{\mu}_j||^2}\ln\frac{P(w_i)}{P(w_j)}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) $
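To make the geometry explicit, here is a short sketch (with assumed, illustrative means, variance, and priors) that computes w and x0 for a pair of classes:

import numpy as np

def decision_boundary(mu_i, mu_j, sigma_sq, prior_i, prior_j):
    w = mu_i - mu_j
    x0 = (0.5 * (mu_i + mu_j)
          - (sigma_sq / (w @ w)) * np.log(prior_i / prior_j) * w)
    return w, x0

# With unequal priors the hyperplane shifts away from the more likely mean,
# enlarging that class's decision region.
w, x0 = decision_boundary(np.array([0.0, 0.0]), np.array([2.0, 0.0]),
                          1.0, 0.7, 0.3)
print(w, x0)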


