Bayes Rule for 1-dimensional and N-dimensional feature spaces - Rhea

Classification using Bayes Rule in 1-dimensional and N-dimensional feature spaces

A slecture by Jihwan Lee

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

Introduction

In this slecture, we discuss Bayes rule that is widely used for many different kinds of applications, especially, pattern recognition. Due to its simplicity and effectiveness, we can use the method in both discrete values case and continuous values case, and it also usually works well in multi-dimensional feature space. So, we will take a look at what the definition of Bayes rule is, how it can be used for the classification task with examples, and how we can derive it in different cases.

Bayes' Theorem

Bayes theorem is a probabilistic theory that can explain a relationship between the prior probability and the posterior probability of two random variables or events. Give two events $A$ and $B$ , we may want to know $P (A | B)$ and it can be obtained if we know knowledge for other probabilities, $P (B | A)$ , $P (A)$ , and $P (B)$ . By the definition of the conditional probability, a joint probability of $A$ and $B$ , $P (A, B)$ , is the product of $P (A | B)$ $P (B)$ . We can also write $P (A, B) = P (B | A) P (A)$ . From those two equations, $P (A | B) P (B) = P (B | A) P (A)$ and we can conclude

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

The formula above is called Bayes Theorem. Let's see an example. Suppose that a man told you that he had a nice conversation with a person. Assuming the populations of male and female are the same, the probability that his conversational partner was a woman is 0.5. However, if he gives you additional information that the conversational partner had long hair, then the probability of the partner being a woman will be higher than 0.5 because women are more likely to have long hair. So, let $W$ be an event that the conversational partner was a women and $L$ be an event that a person has a long haired person. Suppose that 75% of all women have long hair, 15% of men have long hair, and both genders are equally likely. That is, $P (W | L) = 0.75, P (W) = P (M) = 0.5$ , and these are what we already know. With these information, we can say that the probability that the person who he had a conversation with was a woman is

$\begin{align} P(W|L) & = \frac{P(L|W)P(W)}{P(L|W)P(W) + P(L|M)P(M)} \\ & = \frac{0.75 * 0.5}{0.75*0.5 + 0.15*0.5} \\ & = \frac{5}{6} \simeq 0.83 \end{align}$

Here is another example. In a public university, 51% of the students are males. One adult is randomly selected for a survey. It turned out later that the selected survey subject was studying sciences. Also, 10% of students study sciences while 5% of females study sciences. What is the probability that the selected subject is a male? Let's use the following notations: $M$ = male \t $\overline{M}$ = female $S$ = study sciences \t $\overline{S}$ = study other fields Before having the information about the subject's major, the probability that the subject is a male is 51%, that is, $P (M) = 0.51$ . However, based on the additional information, we now have the following:

P (M) = 0.51

because 51% of students are males

P(\overline{M}) = 0.49

because 49% of students are females

P (S | M) = 0.1

because 10% of the male students study sciences

P(S|\overline{M}) = 0.05

because 5% of the female students study sciences

Now, in order to figure out the probability that the survey subject is a male given the subject studies sciences, $P (M | S)$ , we can apply Bayes theorem as following

\begin{align} P(M|S) & = \frac{P(S|M)P(M)}{P(S)} \\ & = \frac{P(S|M)P(M)}{P(S|M)P(M) + P(S|\overline{M})P(\overline{M})} \\ & = \frac{0.1 * 0.51}{0.1* 0.51 + 0.49 * 0.05} \\ & = \frac{0.051}{0.0755} \\ & \sim 0.675 = 67.5% \end{align}

Classification by Bayes Rule

In this section, we investigate how Bayes rule can be used for the task of classification. Suppose that there are many classes $\omega_1, \omega_2, \dots, \omega_c$ and data that should belong to one of those classes. So, what we want to do is to classify new data that do not have any class information into one of the classes, based on training data whose class membership information is already known. Using Bayes rule, the probability that a new data $x$ belongs to class $ω i$ is given by

$\begin{align} P(\omega_i | x) & = \frac{\rho(x|\omega_i)P(\omega_i)}{\rho(x)} \\ & = \frac{\rho(x|\omega_i)P(\omega_i)}{\sum_{l=1}{c}\rho(x|\omega_l)P(\omega_l)} \end{align}$

where \rho is a density function for continuous values. Now we can compare probability values of $P (ω i | x)$ for each class and make a decision by taking one with higher probability as following

$\begin{align} P(\omega_i|x) \geq P(\omega_j|x) \forall{j}=1, \dots, c \\ \iff \frac{\rho(x|\omega_i)P(\omega_i)}{\rho(x)} \geq \frac{\rho(x|\omega_j)P(\omega_j)}{\rho(x)} \forall{j} = 1, \dots, c \\ \iff \rho(x|\omega_i)P(\omega_i) \geq \rho(x|\omega_j)P(\omega_j) \forall{j} = 1, \dots, c \end{align}$

From now on, we consider only two class classification problem for simplicity. Let $g 1 (x) = ρ(x | ω 1) P (ω 1)$ and $g 2 (x) = ρ(x | ω 2) P (ω 2)$ . Then we can decide class $ω 1$ for $x$ if $g_1(x) \geq g_2(x)$ , otherwise, decide class $ω 2$ . Equivalently, we can define a discriminant function $g (x) = g 1 (x) - g 2 (x)$ and decide class $ω 1$ if $g(x) \geq 0$ , otherwise class $ω 2$ . We here need to notice followings

The discriminant function $g (x)$ is not unique. As long as the discriminant function is monotonically increasing, it will make the same decision. (e.g., $g (x) = ln g 1 (x) - ln g 2 (x)$ )
We do not need to know the actual function $g (x)$ , but need to know just its sign.

The following sections describe two cases of classification. One is where data exist in 1-dimensional feature space and the other is where data exist in N-dimensional feature space.

Case 1: 1-dimensional feature space

For simplicity, samples $x i$ of class $ω i$ are drawn from Gaussian distribution $x_i \sim N(\mu_i, \sigma_i^2)$ . Then, the class-conditional density is given by

\rho(x|\omega_i) = \frac{1}{\sqrt{2 \pi} \sigma_i} exp[- \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2]

and let's take

\begin{align} g_i(x) & = \ln(\rho(x|\omega_i)P(\omega_i)) \\ & = \ln{\rho(x|\omega_i)} + \ln{P(\omega_i)} \\ & = \ln{\frac{1}{\sqrt{2 \pi}\sigma_i}} - \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2 + \ln{P(\omega_i)} \\ & = - \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2 - \ln{\sqrt{2 \pi}\sigma_i} + \ln{P(\omega_i)} \\ & = - \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2 + C_i \end{align}

where C_i is a constant value which is independent of $x$ . Finally, we can use the following discriminant function

g(x) = - \frac{1}{2}(\frac{x-\mu_1}{\sigma_1})^2 + \frac{1}{2}(\frac{x-\mu_2}{\sigma_2})^2 + C_1 - C_2

So, given a value of $x$ , we can assign class to $x$ based on the sign of the discriminant function above.

Case 2: N-dimensional feature space

Let's assume that each data is drawn from Multivariate Gaussian distribution with mean $μ i$ and standard deviation matrix $Σ i$ . Then the class-conditional density is given by

\rho(\vec{x}|\omega_i) = \frac{1}{(2 \pi)^{\frac{n}{2}} |\Sigma_i|^{\frac{1}{2}}} exp[- \frac{1}{2}(\vec{x} - \vec{\mu_i})^T \Sigma_{i}^{-1} (\vec{x} - \vec{\mu_i})]

Again we use

\begin{align} g_i(\vec{x}) & = \ln{\rho(\vec{x}|\omega_i)P(\omega_i)} \\ & = \ln{\rho(\vec{x}|\omega_i)} + \ln{P(\omega_i)} \\ & = - \frac{1}{2} (\vec{x} - \vec{\mu_i})^T \Sigma_{i}^{-1} (\vec{x} - \vec{\mu_i}) + \ln{\frac{1}{(2 \pi)^{\frac{n}{2}} |\Sigma_i|^{\frac{1}{2}}}} + \ln{P(\omega_i)} \end{align}

Note that the first term is actually $-\frac{1}{2}$ square of Mahalanobis distance from $\vec{x}$ to $\vec{\mu_i}$ and the remaining terms are independent of $\vec{x}$ , which means that the closer $\vec{x}$ is to $\vec{\mu_i}$ , the larger $g i (x)$ is. We can also observe that if the standard deviation matrix $Σ$ is equal to $I$ (identity matrix) the Mahalanobis distance at the first term of the equation above becomes Euclidean distance, that is,

\begin{align} g_i(x) & = - \frac{1}{2} \| \vec{x} - \vec{\mu_i} \|_{2}^{2} + \ln{\frac{1}{(2 \pi)^{\frac{n}{2}}}} + \ln{\frac{1}{(\Sigma_i)^{\frac{1}{2}}}} + \ln{P(\omega_i)} \\ & = - \frac{1}{2} \| \vec{x} - \vec{\mu_i} \|_{2}^{2} + \ln{\frac{1}{(2 \pi)^{\frac{n}{2}}}} + \ln{P(\omega_i)} \end{align}

and the second term can be ignored since it is constant and independent of $\vec{x}$ In two-class classification, we should decide $ω i$ if

\begin{align} - \frac{1}{2} \| \vec{x} - \vec{\mu_1} \|_{2}^{2} + \ln{P(\omega_1)} \geq - \frac{1}{2} \| \vec{x} - \vec{\mu_2} \|_{2}^{2} + \ln{P(\omega_2)} \end{align}

otherwise, we should decide $ω 2$

Questions and comments

If you have any questions, comments, etc. please post them on this page.

Back to ECE662, Spring 2014