Revision as of 06:40, 5 May 2014 by Mboutin


Bayesian Parameter Estimation: Gaussian Case

A slecture by ECE student Shaobo Fang

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.



Introduction: Bayesian Estimation

According to Section 3.3 of Duda's book [2], although the answers obtained by Bayesian parameter estimation (BPE) are generally nearly identical to those obtained by maximum likelihood estimation, the conceptual difference is significant. In maximum likelihood estimation, the parameter $ \theta $ is treated as a fixed but unknown quantity, while in Bayesian estimation $ \theta $ is treated as a random variable.

By definition, given the set of training samples $ \mathcal{D} $, Bayes' formula becomes:


$ P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)} $

As the above equation suggests, the information provided by the training data can help determine both the class-conditional densities and the prior probabilities.

Furthermore, since we are treating the supervised case, we can separate the training samples by class into c subsets $ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c $. Assuming that the prior probabilities are known, so that $ P(w_i|D) = P(w_i) $, and that the samples in $ \mathcal{D}_i $ carry no information about the other classes, this gives:

$ P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)} $
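The formula above amounts to normalizing the products of class-conditional density values and priors. A minimal Python sketch follows; the function name `class_posteriors` and the numeric values are illustrative assumptions, not part of the lecture.

```python
def class_posteriors(likelihoods, priors):
    """Return P(w_i | x, D) for every class i.

    likelihoods[i] is the value of p(x | w_i, D_i) at the observed x,
    priors[i] is P(w_i). Both lists are hypothetical inputs.
    """
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # the denominator of Bayes' formula
    return [j / evidence for j in joint]


# Two classes with equal priors; the density values are made up.
posteriors = class_posteriors([0.30, 0.10], [0.5, 0.5])
```

With these inputs the first class receives posterior probability 0.15 / 0.20 = 0.75.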

Now, assume $ p(x) $ has a known parametric form with parameter $ \theta $. We are given a set of $ n $ independent samples $ \mathcal{D} = \{x_1, x_2, ... , x_n \} $, and we view $ \theta $ as a random variable. In the continuous case, $ p(x|D) $ can be computed as:

$ p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta $


Bayesian Parameter Estimation: General Theory

We first start with a generalized approach which can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:

1. The form of the density $ p(x|\theta) $ is assumed to be known, but the value of the parameter vector $ \theta $ is not known exactly.

2. The initial knowledge about $ \theta $ is assumed to be contained in a known a priori density $ p(\theta) $.

3. The rest of the knowledge about $ \theta $ is contained in a set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $ drawn independently according to the unknown probability density $ p(x) $.

As shown in the introduction, we already know:

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $

and, by Bayes' theorem,

$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $


To express this in terms of the individual samples $ x_k $, we use the independence assumption, under which the likelihood factorizes:

$ p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta) $

To make the number of samples explicit, we denote a set of n samples by $ \mathcal{D}^n = \{x_1, x_2, ... , x_n\} $.

Combining this notation with the equation above:


$ p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta) $

Using this equation, the posterior density can be written in recursive form, starting from $ p(\theta|D^0) = p(\theta) $:

$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $
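The recursion can be checked numerically by discretizing $ \theta $ on a grid. The sketch below is only an illustration under assumed values: a Gaussian likelihood with known variance, a Gaussian prior, a uniform grid on [-5, 5], and four made-up samples. It compares the posterior obtained one sample at a time with the posterior obtained from the full product $ \prod_k p(x_k|\theta) $.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normalize(weights, step):
    """Scale grid values so that they integrate to 1 (Riemann sum)."""
    total = sum(weights) * step
    return [w / total for w in weights]


def recursive_posterior(samples, grid, prior, sigma_sq):
    """Apply p(theta|D^n) ∝ p(x_n|theta) p(theta|D^{n-1}) one sample at a time."""
    step = grid[1] - grid[0]
    posterior = normalize(prior, step)  # p(theta|D^0) = p(theta)
    for x_n in samples:
        updated = [gaussian_pdf(x_n, th, sigma_sq) * p for th, p in zip(grid, posterior)]
        posterior = normalize(updated, step)
    return posterior


def batch_posterior(samples, grid, prior, sigma_sq):
    """Apply p(theta|D) ∝ prod_k p(x_k|theta) p(theta) in a single step."""
    step = grid[1] - grid[0]
    weights = []
    for th, p in zip(grid, prior):
        likelihood = 1.0
        for x_k in samples:
            likelihood *= gaussian_pdf(x_k, th, sigma_sq)
        weights.append(likelihood * p)
    return normalize(weights, step)


# Assumed setup: theta on [-5, 5], prior N(0, 4), known variance 1.
grid = [-5.0 + 0.01 * i for i in range(1001)]
prior = [gaussian_pdf(th, 0.0, 4.0) for th in grid]
samples = [1.2, 0.7, 1.9, 1.1]

recursive = recursive_posterior(samples, grid, prior, sigma_sq=1.0)
batch = batch_posterior(samples, grid, prior, sigma_sq=1.0)
max_gap = max(abs(r - b) for r, b in zip(recursive, batch))
```

Up to floating-point rounding, `max_gap` is zero: normalizing after every sample only rescales the intermediate densities, so both routes give the same posterior on the grid.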




Bayesian Parameter Estimation: Gaussian Case

The Univariate Case: $ p(\mu|\mathcal{D}) $

Consider the case where the mean $ \mu $ is the only unknown parameter and the variance $ \sigma^2 $ is known. For simplicity we assume:

$ p(x|\mu) \sim N(\mu, \sigma^2) $
and
$ p(\mu) \sim N(\mu_0, \sigma_0^2) $

From the previous section, the following expression could be easily obtained using Bayes' formula:

$ p(\mu|D) = \alpha \prod_{k = 1}^n p(x_k|\mu)p(\mu) $

where $ \alpha $ is a normalization factor that depends on $ \mathcal{D} $ but is independent of $ \mu $.

Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with:

$ p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] $
$ p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $

The equation now becomes:

$ p(\mu|D) = \alpha \left[ \prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] \right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $

The prior appears only once, outside the product over the samples. Absorbing every factor that does not depend on $ \mu $ into the updated scaling factors $ \alpha' $ and $ \alpha'' $,

$ p(\mu|D) = \alpha' \exp \left[ -\frac{1}{2} \left( \sum_{k=1}^n (\frac{x_k-\mu}{\sigma})^{2} + (\frac{\mu-\mu_0}{\sigma_0})^{2} \right) \right] $
$ p(\mu|D) = \alpha'' \exp \left[ -\frac{1}{2} \left( (\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu \right) \right] $

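Since the exponent is a quadratic function of $ \mu $, $ p(\mu|D) $ is again a Gaussian density, which we write as:

$ p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}] $

Matching the coefficients of $ \mu^2 $ and $ \mu $ in the two expressions for $ p(\mu|D) $ gives $ \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} $ and $ \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x_n} + \frac{\mu_0}{\sigma_0^2} $. Solving for $ \mu_n $: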

$ \mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x_n} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $

where $ \bar{x_n} = \frac{1}{n}\sum_{k=1}^n x_k $ is the sample mean and $ n $ is the sample size.

The variance $ \sigma_n^2 $ associated with $ \mu_n $ is obtained in the same way:

$ \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2} $
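The two closed-form expressions translate directly into code. The following Python sketch uses a hypothetical helper `posterior_mean_params` and made-up numbers (four samples, prior $ N(0, 1) $, known variance $ \sigma^2 = 4 $) purely for illustration.

```python
def posterior_mean_params(samples, mu0, sigma0_sq, sigma_sq):
    """Return (mu_n, sigma_n^2) of the Gaussian posterior p(mu | D).

    samples   : observations x_1, ..., x_n
    mu0       : prior mean mu_0
    sigma0_sq : prior variance sigma_0^2
    sigma_sq  : known variance sigma^2 of p(x | mu)
    """
    n = len(samples)
    sample_mean = sum(samples) / n
    denom = n * sigma0_sq + sigma_sq
    # Weighted average of the sample mean and the prior mean.
    mu_n = (n * sigma0_sq / denom) * sample_mean + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq


# Assumed example values.
mu_n, sigma_n_sq = posterior_mean_params(
    [1.0, 2.0, 3.0, 4.0], mu0=0.0, sigma0_sq=1.0, sigma_sq=4.0
)
```

Here the sample mean is 2.5 and $ n\sigma_0^2 + \sigma^2 = 8 $, so the sample mean and the prior mean each receive weight 1/2, giving $ \mu_n = 1.25 $ and $ \sigma_n^2 = 0.5 $.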


Observation:

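As $ n \to \infty $, $ \sigma_n \to 0 $, so $ p(\mu|D) $ becomes more sharply peaked around $ \mu_n $. At the same time, provided $ \sigma_0 \neq 0 $, the weight on the sample mean tends to 1, so $ \mu_n $ approaches $ \bar{x_n} $.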
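This behaviour can be seen by evaluating $ \sigma_n^2 $ and the weight given to the sample mean for increasing sample sizes. The sketch below assumes $ \sigma_0^2 = \sigma^2 = 1 $; the helper names are illustrative.

```python
def posterior_variance(n, sigma0_sq, sigma_sq):
    """sigma_n^2 = sigma_0^2 sigma^2 / (n sigma_0^2 + sigma^2)."""
    return sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)


def data_weight(n, sigma0_sq, sigma_sq):
    """Weight of the sample mean in mu_n; the prior mean gets 1 minus this."""
    return n * sigma0_sq / (n * sigma0_sq + sigma_sq)


sizes = [1, 10, 100, 1000]
variances = [posterior_variance(n, 1.0, 1.0) for n in sizes]
weights = [data_weight(n, 1.0, 1.0) for n in sizes]

for n, var, w in zip(sizes, variances, weights):
    print(f"n={n:5d}  sigma_n^2={var:.6f}  weight on sample mean={w:.6f}")
```

With these values $ \sigma_n^2 = 1/(n+1) $ and the weight is $ n/(n+1) $, so the posterior variance shrinks roughly like $ \sigma^2/n $ while the prior mean is progressively discounted.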

The Univariate Case: $ p(x|\mathcal{D}) $

Having obtained the posterior density $ p(\mu|\mathcal{D}) $ for the mean, the remaining task is to estimate the "class-conditional" density $ p(x|D) $.

Based on Section 3.4.2 of Duda's book [2] and Prof. Boutin's lecture notes [1]:


$ p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu $
$ p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu $


$ p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma,\sigma_n) $


where $ f(\sigma, \sigma_n) $ is defined as:


$ f(\sigma,\sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu $

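The integrand of $ f(\sigma,\sigma_n) $ is an unnormalized Gaussian in $ \mu $ whose width does not depend on $ x $, so the value of the integral does not depend on $ x $ either. As a function of $ x $, $ p(x|D) $ is therefore proportional to $ \exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}] $. Hence, $ p(x|D) $ is normally distributed as: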

$ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
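As a sanity check, the integral $ \int p(x|\mu)p(\mu|\mathcal{D})d\mu $ can be evaluated numerically and compared with the density of $ N(\mu_n, \sigma^2 + \sigma_n^2) $. The sketch below uses a plain midpoint rule and assumed values ($ \mu_n = 1.25 $, $ \sigma_n^2 = 0.5 $, $ \sigma^2 = 4 $, evaluation point $ x = 2 $); it is an illustration rather than part of the original derivation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_by_integration(x, mu_n, sigma_n_sq, sigma_sq,
                              half_width=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of p(x|mu) p(mu|D) over mu.

    The integration range is mu_n +/- half_width posterior standard
    deviations, outside of which p(mu|D) is negligible.
    """
    sigma_n = math.sqrt(sigma_n_sq)
    lower = mu_n - half_width * sigma_n
    step = 2.0 * half_width * sigma_n / steps
    total = 0.0
    for i in range(steps):
        mu = lower + (i + 0.5) * step
        total += gaussian_pdf(x, mu, sigma_sq) * gaussian_pdf(mu, mu_n, sigma_n_sq)
    return total * step


# Assumed posterior parameters and evaluation point.
mu_n, sigma_n_sq, sigma_sq = 1.25, 0.5, 4.0
x = 2.0

numeric = predictive_by_integration(x, mu_n, sigma_n_sq, sigma_sq)
closed_form = gaussian_pdf(x, mu_n, sigma_sq + sigma_n_sq)
```

The two values agree to within the accuracy of the numerical integration, which illustrates that the uncertainty about $ \mu $ simply adds $ \sigma_n^2 $ to the known variance $ \sigma^2 $.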


References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, D. Stork, Pattern Classification. Wiley-Interscience. Second Edition, 2000.

