== Introduction ==

In this slecture, we discuss Bayes' rule, which is widely used for many different kinds of applications, especially pattern recognition. Thanks to its simplicity and effectiveness, the method can be applied to both discrete-valued and continuous-valued data, and it also usually works well in multi-dimensional feature spaces. We will look at the definition of Bayes' rule, how it can be used for classification tasks with examples, and how the resulting classifier can be derived in different cases.


== Bayes' Theorem ==

Bayes' theorem is a probabilistic result that relates the prior probability and the posterior probability of two random variables or events. Given two events <span class="texhtml">''A''</span> and <span class="texhtml">''B''</span>, we may want to know <span class="texhtml">''P''(''A'' | ''B'')</span>, and it can be obtained if we know the other probabilities <span class="texhtml">''P''(''B'' | ''A'')</span>, <span class="texhtml">''P''(''A'')</span>, and <span class="texhtml">''P''(''B'')</span>. By the definition of conditional probability, the joint probability of <span class="texhtml">''A''</span> and <span class="texhtml">''B''</span>, <span class="texhtml">''P''(''A'',''B'')</span>, is the product <span class="texhtml">''P''(''A'' | ''B'')''P''(''B'')</span>. We can also write <span class="texhtml">''P''(''A'',''B'') = ''P''(''B'' | ''A'')''P''(''A'')</span>. Equating the two expressions gives <span class="texhtml">''P''(''A'' | ''B'')''P''(''B'') = ''P''(''B'' | ''A'')''P''(''A'')</span>, and we can conclude

<center><math>P(A|B) = \frac{P(B|A)P(A)}{P(B)}</math></center>

The formula above is called Bayes' theorem. Let's see an example. Suppose that a man told you that he had a nice conversation with a person. Assuming the populations of men and women are equal, the probability that his conversational partner was a woman is 0.5. However, if he adds that the conversational partner had long hair, then the probability of the partner being a woman will be higher than 0.5, because women are more likely to have long hair. Let <span class="texhtml">''W''</span> be the event that the conversational partner was a woman, <span class="texhtml">''M''</span> the event that the partner was a man, and <span class="texhtml">''L''</span> the event that the partner has long hair. Suppose that 75% of all women have long hair, 15% of men have long hair, and both genders are equally likely. That is, we already know <span class="texhtml">''P''(''L'' | ''W'') = 0.75</span>, <span class="texhtml">''P''(''L'' | ''M'') = 0.15</span>, and <span class="texhtml">''P''(''W'') = ''P''(''M'') = 0.5</span>. With this information, and expanding <span class="texhtml">''P''(''L'')</span> by the law of total probability, the probability that the person he had a conversation with was a woman is
<center>
<math>\begin{align}
P(W|L) & = \frac{P(L|W)P(W)}{P(L|W)P(W) + P(L|M)P(M)} \\
& = \frac{0.75 * 0.5}{0.75*0.5 + 0.15*0.5} \\
& = \frac{5}{6} \simeq 0.83
\end{align}</math>
</center>
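To check the arithmetic, here is a minimal Python sketch (plain Python, no libraries) that plugs the numbers from this example into Bayes' theorem; the variable names are purely illustrative.

<pre>
# Known quantities from the example above
p_L_given_W = 0.75   # P(L | W): fraction of women with long hair
p_L_given_M = 0.15   # P(L | M): fraction of men with long hair
p_W = p_M = 0.5      # equal priors for women and men

# Evidence P(L), expanded by the law of total probability
p_L = p_L_given_W * p_W + p_L_given_M * p_M

# Bayes' theorem: P(W | L) = P(L | W) P(W) / P(L)
p_W_given_L = p_L_given_W * p_W / p_L
print(p_W_given_L)   # 0.8333... = 5/6
</pre>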
== Classification by Bayes' Rule  ==

In this section, we investigate how Bayes' rule can be used for the task of classification. Suppose that there are <math>c</math> classes <math>\omega_1, \omega_2, \dots, \omega_c</math> and data points, each of which belongs to one of those classes. What we want to do is to classify a new data point, whose class is unknown, into one of the classes, based on training data whose class membership is already known. Using Bayes' rule, the probability that a new data point <span class="texhtml">''x''</span> belongs to class <span class="texhtml">ω<sub>''i''</sub></span> is given by
<center>
<math> \begin{align}
P(\omega_i | x) & = \frac{\rho(x|\omega_i)P(\omega_i)}{\rho(x)} \\
& = \frac{\rho(x|\omega_i)P(\omega_i)}{\sum_{l=1}^{c}\rho(x|\omega_l)P(\omega_l)}
\end{align}</math>
</center>
where <math>\rho</math> denotes a probability density function, since <span class="texhtml">''x''</span> takes continuous values. Now we can compare the values of <span class="texhtml">''P''(ω<sub>''i''</sub> | ''x'')</span> across classes and decide on the class with the highest posterior probability, as follows:
<center>
<math> \begin{align}
P(\omega_i|x) \geq P(\omega_j|x) \quad \forall j = 1, \dots, c \\
\iff \frac{\rho(x|\omega_i)P(\omega_i)}{\rho(x)} \geq \frac{\rho(x|\omega_j)P(\omega_j)}{\rho(x)} \quad \forall j = 1, \dots, c \\
\iff \rho(x|\omega_i)P(\omega_i) \geq \rho(x|\omega_j)P(\omega_j) \quad \forall j = 1, \dots, c
\end{align} </math>
</center>
From now on, we consider only the two-class classification problem for simplicity. Let <span class="texhtml">''g''<sub>1</sub>(''x'') = ρ(''x'' | ω<sub>1</sub>)''P''(ω<sub>1</sub>)</span> and <span class="texhtml">''g''<sub>2</sub>(''x'') = ρ(''x'' | ω<sub>2</sub>)''P''(ω<sub>2</sub>)</span>. Then we decide class <span class="texhtml">ω<sub>1</sub></span> for <span class="texhtml">''x''</span> if <math>g_1(x) \geq g_2(x)</math>; otherwise, we decide class <span class="texhtml">ω<sub>2</sub></span>. Equivalently, we can define a discriminant function <span class="texhtml">''g''(''x'') = ''g''<sub>1</sub>(''x'') − ''g''<sub>2</sub>(''x'')</span> and decide class <span class="texhtml">ω<sub>1</sub></span> if <math>g(x) \geq 0 </math>, otherwise class <span class="texhtml">ω<sub>2</sub></span>. A small code sketch of this decision rule is given after the list below. Note the following:
#The discriminant function <span class="texhtml">''g''(''x'')</span> is not unique. Applying any monotonically increasing function to both <span class="texhtml">''g''<sub>1</sub>(''x'')</span> and <span class="texhtml">''g''<sub>2</sub>(''x'')</span> yields the same decision (e.g., <span class="texhtml">''g''(''x'') = ln ''g''<sub>1</sub>(''x'') − ln ''g''<sub>2</sub>(''x'')</span>).
#We do not need to know the actual value of <span class="texhtml">''g''(''x'')</span>; we only need its sign.
 
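Here is a minimal Python sketch of the decision rule above, written for an arbitrary number of classes. It assumes the class-conditional densities are available as a function <code>rho(x, i)</code> and the priors as a list <code>priors</code>; both names are placeholders for illustration, not something defined in this slecture.

<pre>
def classify(x, rho, priors):
    # Pick the class i that maximizes rho(x | omega_i) * P(omega_i);
    # the evidence rho(x) is the same for every class, so it can be dropped.
    scores = [rho(x, i) * p for i, p in enumerate(priors)]
    return max(range(len(priors)), key=lambda i: scores[i])
</pre>

This is exactly the comparison <math>\rho(x|\omega_i)P(\omega_i) \geq \rho(x|\omega_j)P(\omega_j)</math> written as code.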
The following sections describe two cases of classification: one where the data lie in a 1-dimensional feature space and one where the data lie in an N-dimensional feature space.
== Case 1: 1-dimensional feature space  ==
For simplicity, assume that the samples <span class="texhtml">''x''<sub>''i''</sub></span> of class <span class="texhtml">ω<sub>''i''</sub></span> are drawn from a Gaussian distribution, <math>x_i \sim N(\mu_i, \sigma_i^2)</math>. Then the class-conditional density is given by
<center><math>
\rho(x|\omega_i) = \frac{1}{\sqrt{2 \pi} \sigma_i} \exp[- \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2]
</math></center>
and let's take
<center><math> \begin{align}
g_i(x) & = \ln(\rho(x|\omega_i)P(\omega_i)) \\
& = \ln{\rho(x|\omega_i)} + \ln{P(\omega_i)} \\
& = \ln{\frac{1}{\sqrt{2 \pi}\sigma_i}} - \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2 + \ln{P(\omega_i)} \\
& = - \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2 - \ln{\sqrt{2 \pi}\sigma_i} + \ln{P(\omega_i)} \\
& = - \frac{1}{2}(\frac{x-\mu_i}{\sigma_i})^2 + C_i
\end{align}</math></center>
where <span class="texhtml">''C''<sub>''i''</sub></span> is a constant that does not depend on <span class="texhtml">''x''</span>. Finally, we can use the following discriminant function
<center><math>
g(x) = - \frac{1}{2}(\frac{x-\mu_1}{\sigma_1})^2 + \frac{1}{2}(\frac{x-\mu_2}{\sigma_2})^2 + C_1 - C_2
</math></center>
So, given a value of <span class="texhtml">''x''</span>, we can assign a class to <span class="texhtml">''x''</span> based on the sign of the discriminant function above.
 
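As an illustration, here is a minimal Python sketch of this 1-dimensional discriminant. The class parameters (means, standard deviations, priors) in the example call are made-up values for illustration only.

<pre>
import math

def g_i(x, mu, sigma, prior):
    # ln( rho(x | omega_i) P(omega_i) ), with the constant
    # C_i = -ln(sqrt(2*pi)*sigma_i) + ln(P(omega_i)) written out explicitly
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(math.sqrt(2 * math.pi) * sigma)
            + math.log(prior))

def decide(x, params1, params2):
    # Decide class 1 if g(x) = g_1(x) - g_2(x) >= 0, otherwise class 2
    return 1 if g_i(x, *params1) - g_i(x, *params2) >= 0 else 2

# Made-up example: class 1 ~ N(0, 1), class 2 ~ N(3, 1), equal priors;
# a point near 0 is assigned to class 1.
print(decide(0.2, (0.0, 1.0, 0.5), (3.0, 1.0, 0.5)))   # -> 1
</pre>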
== Case 2: N-dimensional feature space  ==
Let's assume that each data point is drawn from a multivariate Gaussian distribution with mean vector <span class="texhtml">μ<sub>''i''</sub></span> and covariance matrix <span class="texhtml">Σ<sub>''i''</sub></span>. Then the class-conditional density is given by
<center><math>
\rho(\vec{x}|\omega_i) = \frac{1}{(2 \pi)^{\frac{n}{2}} |\Sigma_i|^{\frac{1}{2}}} \exp[- \frac{1}{2}(\vec{x} - \vec{\mu_i})^T \Sigma_{i}^{-1} (\vec{x} - \vec{\mu_i})]
</math></center>
Again we use
<center><math>\begin{align}
g_i(\vec{x}) & = \ln(\rho(\vec{x}|\omega_i)P(\omega_i)) \\
& = \ln{\rho(\vec{x}|\omega_i)} + \ln{P(\omega_i)} \\
& = - \frac{1}{2} (\vec{x} - \vec{\mu_i})^T \Sigma_{i}^{-1} (\vec{x} - \vec{\mu_i}) + \ln{\frac{1}{(2 \pi)^{\frac{n}{2}} |\Sigma_i|^{\frac{1}{2}}}} + \ln{P(\omega_i)}
\end{align}</math></center>
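As in the 1-dimensional case, we compare these discriminant values and assign <math>\vec{x}</math> to the class with the largest <math>g_i(\vec{x})</math> (equivalently, for two classes, we decide by the sign of <math>g_1(\vec{x}) - g_2(\vec{x})</math>). Below is a minimal NumPy sketch of this N-dimensional discriminant; the means, covariance matrices, and priors in the example are made-up values for illustration only.

<pre>
import numpy as np

def g_i(x, mu, sigma, prior):
    # ln( rho(x | omega_i) P(omega_i) ) for a multivariate Gaussian class model
    diff = x - mu
    n = x.shape[0]
    quad = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    log_norm = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(sigma))
    return quad + log_norm + np.log(prior)

def classify(x, params):
    # params: a list of (mean vector, covariance matrix, prior), one per class;
    # pick the class with the largest discriminant value
    scores = [g_i(x, mu, sigma, prior) for mu, sigma, prior in params]
    return int(np.argmax(scores))

# Made-up example: two classes in 2-D with identity covariances and equal priors.
params = [(np.zeros(2), np.eye(2), 0.5),
          (np.array([3.0, 3.0]), np.eye(2), 0.5)]
print(classify(np.array([0.5, 0.2]), params))   # -> 0 (the first class)
</pre>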
