Revision as of 23:52, 13 May 2014 by Fang29

WARNING: THIS MATERIAL WAS PLAGIARIZED FROM DUDA AND HART!!!!!

Bayesian Parameter Estimation: Gaussian Case

A slecture by ECE student Shaobo Fang

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.



Introduction: Bayesian Estimation

Although the estimates obtained from Maximum Likelihood Estimation (MLE) and Bayesian Parameter Estimation (BPE) are often similar or even identical, the underlying ideas are quite different. In MLE the parameter is treated as a fixed but unknown quantity (or several quantities when there is more than one parameter), whereas in BPE the parameter $ \theta $ is treated as a random variable (a random vector in general) with its own distribution.

To start with, Bayes' formula is written in the following form, conditioned on the set of training samples $ \mathcal{D} $:


$ P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)} $

From this equation it follows that both the class-conditional densities and the prior probabilities can be estimated from the training data.

Now assume that we are working in the supervised case with labelled training data, so that all training samples can be separated accurately into c subsets $ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c $, one per class.

If we further assume that the samples in $ \mathcal{D}_i $ carry no information about the densities of the other classes, and that the prior probabilities are known so that $ P(w_i|D) = P(w_i) $, the above equation simplifies to:

$ P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)} $
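
As a minimal sketch of how this formula is evaluated, the Python snippet below normalizes the products $ p(x|w_i,D_i)P(w_i) $ over the c classes. The function name class_posteriors and the numbers in the example are hypothetical and chosen only for illustration; in practice the class-conditional values would come from the estimation procedure described in the rest of this page.

```python
def class_posteriors(class_densities, priors):
    """Return P(w_i | x, D) for each class i.

    class_densities[i] is p(x | w_i, D_i) evaluated at the query point x,
    priors[i] is the known prior P(w_i).
    """
    joint = [p * q for p, q in zip(class_densities, priors)]
    evidence = sum(joint)  # denominator: sum over all c classes
    return [j / evidence for j in joint]


# Hypothetical values for a two-class problem with equal priors.
posteriors = class_posteriors([0.30, 0.10], [0.5, 0.5])
print(posteriors)
```

With equal priors the posterior probabilities are simply the class-conditional densities rescaled to sum to one.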

Now, assume that a set of $ N $ independent samples $ \mathcal{D} = \{x_1, x_2, ... , x_N \} $ was drawn from a certain class, and that each sample follows a density of known parametric form $ p(x|\theta) $. To form a Bayesian estimate, we treat $ \theta $ as a random variable (a vector in general). The samples influence the density of a new point $ x $ only through the posterior density $ p(\theta|D) $ of the parameter, that is, $ p(x|\theta,D) = p(x|\theta) $.

Hence $ p(x|D) $ can be computed by integrating over $ \theta $:

$ p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta $


Bayesian Parameter Estimation: General Theory

To better understand the Bayesian Parameter Estimation (BPE) technique, we first briefly discuss the general theory. In BPE the parameter $ \theta $ is unknown and is modelled as a random variable (a vector in general). We assume that $ \theta $ has a known prior density $ p(\theta) $. To estimate $ \theta $, the information in the prior is combined with the information in the set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $. We also assume that the parametric form of the density of a sample x given $ \theta $ is known; it is denoted $ p(x|\theta) $.

From the previous section we have already obtained:

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $

Furthermore, by Bayes' theorem,

$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $

Since the samples $ x_k $ in $ \mathcal{D} $ are assumed to be drawn independently, the likelihood $ p(D|\theta) $ factors into a product over the samples:

$ p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta) $

To make the dependence on the number of samples explicit, denote the set of the first n samples by $ \mathcal{D}^n = \{x_1, x_2, ... x_n\} $.

Combining this definition with the equation above gives:


$ p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta) $

Substituting this into Bayes' theorem gives a recursive form of the Bayesian update, starting from $ p(\theta|D^0) = p(\theta) $:

$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $
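
The recursion can be illustrated numerically. The sketch below is only an illustration under stated assumptions: $ \theta $ is taken to be the mean of a Gaussian with known variance, the posterior is represented on an equally spaced grid, and the denominator integral is approximated by a Riemann sum. The helper names normal_pdf and recursive_posterior, the grid, and the sample values are all hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def recursive_posterior(samples, grid, prior, sigma2):
    """Update p(theta|D^n) from p(theta|D^(n-1)) one sample at a time.

    grid  : equally spaced candidate values of theta (here, the unknown mean)
    prior : p(theta) evaluated on the grid, used as p(theta|D^0)
    """
    step = grid[1] - grid[0]
    posterior = list(prior)
    for x_n in samples:
        # Numerator: p(x_n|theta) * p(theta|D^(n-1)) at every grid point.
        unnormalized = [normal_pdf(x_n, t, sigma2) * p for t, p in zip(grid, posterior)]
        # Denominator: Riemann-sum approximation of the integral over theta.
        evidence = sum(unnormalized) * step
        posterior = [u / evidence for u in unnormalized]
    return posterior


# Assumed setup: known variance 1 and prior N(0, 4) on the mean.
grid = [-10.0 + 0.01 * i for i in range(2001)]
step = grid[1] - grid[0]
prior = [normal_pdf(t, 0.0, 4.0) for t in grid]
samples = [1.2, 0.8, 1.5, 0.9]
posterior = recursive_posterior(samples, grid, prior, sigma2=1.0)
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(posterior_mean)
```

Each pass through the loop uses the previous posterior as the new prior, which is exactly the structure of the recursive equation. A grid is used here only because it keeps the sketch short; it does not scale to high-dimensional $ \theta $.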




Bayesian Parameter Estimation: Gaussian Case

The Univariate Case: $ p(\mu|\mathcal{D}) $

Consider the case where the mean $ \mu $ is the only unknown parameter and the variance $ \sigma^2 $ is known. We assume:

$ p(x|\mu) \sim N(\mu, \sigma^2) $
and
$ p(\mu) \sim N(\mu_0, \sigma_0^2) $

From the previous section, Bayes' formula gives:

$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right] p(\mu) $

where $ \alpha $ is a normalization factor that depends on $ \mathcal{D} $ but is independent of $ \mu $.

Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with:

$ p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] $
$ p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $

The equation has now become:

$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]\right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $
$ p(\mu|D) = \alpha \frac{1}{(2\pi\sigma^2)^{n/2}} \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2} -\frac{1}{2}\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2}] $

Absorbing all factors that do not depend on $ \mu $ into the scaling factors $ \alpha' $ and $ \alpha'' $ in turn,

$ p(\mu|D) = \alpha' \exp [-\frac{1}{2}((\frac{\mu-\mu_0}{\sigma_0})^{2} + \sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2})] $
$ p(\mu|D) = \alpha'' \exp [-\frac{1}{2}[(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu]] $

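Because the exponent is a quadratic function of $ \mu $, $ p(\mu|D) $ is again a Gaussian density, which we write as:

$ p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}] $

Equating the coefficients of $ \mu^2 $ and $ \mu $ in this Gaussian form with those in the expression above gives $ \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} $ and $ \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x_n} + \frac{\mu_0}{\sigma_0^2} $. Solving for $ \mu_n $: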

$ \mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x_n} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $

where $ \bar{x_n} = \frac{1}{n}\sum_{k=1}^n x_k $ is the sample mean and $ n $ is the sample size.

The variance $ \sigma_n^2 $ of this posterior Gaussian, which measures the remaining uncertainty about $ \mu $ after observing n samples, is:

$ \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2} $
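
The two closed-form expressions translate directly into code. The following is a minimal sketch; the function name gaussian_mean_posterior and the sample values are assumptions made for illustration and do not come from the lecture.

```python
def gaussian_mean_posterior(samples, sigma2, mu0, sigma0_2):
    """Return (mu_n, sigma_n^2) of p(mu|D).

    sigma2   : known variance of the samples
    mu0      : prior mean of mu
    sigma0_2 : prior variance of mu
    """
    n = len(samples)
    x_bar = sum(samples) / n            # sample mean
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * x_bar + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2


# Assumed example: four samples, known variance 1, prior N(0, 4).
mu_n, sigma_n2 = gaussian_mean_posterior([1.2, 0.8, 1.5, 0.9],
                                         sigma2=1.0, mu0=0.0, sigma0_2=4.0)
print(mu_n, sigma_n2)
```

The posterior mean is a weighted average of the sample mean and the prior mean, with the weight on the sample mean growing with n.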


Observation:

As $ n \to \infty $,
$ \sigma_n \to 0 $

so $ p(\mu|D) $ becomes more sharply peaked around $ \mu_n $, and $ \mu_n $ approaches the sample mean $ \bar{x_n} $ (provided $ \sigma_0 \neq 0 $).
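
This behaviour can be checked numerically. The sketch below tabulates the posterior variance and the weight placed on the sample mean for increasing n, assuming $ \sigma^2 = 1 $ and $ \sigma_0^2 = 4 $; both values and the function names are arbitrary choices made for illustration.

```python
def posterior_variance(n, sigma2, sigma0_2):
    """sigma_n^2 = sigma_0^2 * sigma^2 / (n * sigma_0^2 + sigma^2)."""
    return sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)


def sample_mean_weight(n, sigma2, sigma0_2):
    """Coefficient of the sample mean in the expression for mu_n."""
    return n * sigma0_2 / (n * sigma0_2 + sigma2)


sizes = (1, 10, 100, 1000)
variances = [posterior_variance(n, 1.0, 4.0) for n in sizes]
weights = [sample_mean_weight(n, 1.0, 4.0) for n in sizes]
for n, v, w in zip(sizes, variances, weights):
    print(n, v, w)
```

The variance column decreases towards zero while the weight column increases towards one, so the prior matters less and less as more samples are observed.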

The Univariate Case: $ p(x|\mathcal{D}) $

Having obtained the posterior density $ p(\mu|D) $ of the mean, the remaining task is to compute the "class-conditional" density $ p(x|D) $.

Following Section 3.4.2 of Duda and Hart [2] and Prof. Boutin's lecture notes [1]:


$ p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu $
$ p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu $


$ p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma,\sigma_n) $


Where $ f(\sigma, \sigma_n) $ is defined as:


$ f(\sigma,\sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu $

Since the integral $ f(\sigma,\sigma_n) $ does not depend on $ x $, $ p(x|D) $ viewed as a function of $ x $ is normally distributed:

$ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
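
As a sanity check on this result, the sketch below compares a Riemann-sum approximation of $ \int p(x|\mu)p(\mu|\mathcal{D})d\mu $ with the closed-form density $ N(\mu_n, \sigma^2 + \sigma_n^2) $ at a single point. The parameter values, the integration range of ten posterior standard deviations, and the helper names are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_numeric(x, mu_n, sigma_n2, sigma2, half_width=10.0, steps=4000):
    """Riemann-sum approximation of the integral of p(x|mu) p(mu|D) over mu.

    The integration range is mu_n +/- half_width posterior standard deviations.
    """
    sd = math.sqrt(sigma_n2)
    lo = mu_n - half_width * sd
    step = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        mu = lo + i * step
        total += normal_pdf(x, mu, sigma2) * normal_pdf(mu, mu_n, sigma_n2)
    return total * step


# Assumed values: posterior N(1.0, 0.25) on the mean, known variance 1.
mu_n, sigma_n2, sigma2 = 1.0, 0.25, 1.0
x = 1.7
numeric = predictive_numeric(x, mu_n, sigma_n2, sigma2)
closed_form = normal_pdf(x, mu_n, sigma2 + sigma_n2)
print(numeric, closed_form)
```

The two printed values should agree closely, which reflects the fact that the uncertainty about $ \mu $ simply adds $ \sigma_n^2 $ to the known variance $ \sigma^2 $.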


References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.

