Revision as of 19:32, 28 April 2014 by Fang29


Bayesian Parameter Estimation

A slecture by ECE student Shaobo Fang

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.




Bayesian Estimation

As stated in Duda's book, the conceptual difference between maximum likelihood estimation (MLE) and Bayesian learning is that in MLE $ \theta $ is treated as a fixed but unknown vector, while in Bayesian estimation $ \theta $ is considered to be a random variable.

Given the samples $ \mathcal{D} $, Bayes' formula gives the posterior class probability as


$ P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)} $

Furthermore, $ p(x|D) $ can be computed as:

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $


Bayesian Parameter Estimation: General Theory

It is important to know that:

1. The form of the density $p(x|\theta)$ is assumed to be known, but the value of the parameter vector $ \theta $ is not known exactly.

2. The initial knowledge about $ \theta $ is assumed to be contained in a known a priori density $ p(\theta) $.

3. The rest of the knowledge about $ \theta $ is contained in a set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $ drawn independently according to the unknown probability density $ p(x) $.

Accordingly, based on:

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $

and Bayes' theorem,

$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $


Now, since we want to express this in terms of the individual samples $ x_k $, the independence assumption gives

$ p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta) $

Hence, if the set $ \mathcal{D} $ contains n samples, we can denote it as:

$ \mathcal{D}^n = \{x_1, x_2, ..., x_n\} $,

and the likelihood factors recursively as

$ p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta) $

Using this equation, we can transform the Bayesian Parameter Estimation to:

$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $
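The recursion above can be checked numerically. The following is a minimal sketch, assuming a scalar parameter $ \theta $ discretized on a grid, a Gaussian likelihood with known $ \sigma $, and a Gaussian prior; the grid limits, the seed, the true mean and the helper names are illustrative choices, not part of the lecture material. It folds in the samples one at a time and compares the result with the posterior computed from all samples at once.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def normalize(density, dtheta):
    """Rescale grid values so that they integrate to one (Riemann sum)."""
    return density / (density.sum() * dtheta)

def recursive_posterior(samples, theta_grid, prior, sigma):
    """One sample at a time: p(theta|D^n) proportional to p(x_n|theta) p(theta|D^(n-1))."""
    dtheta = theta_grid[1] - theta_grid[0]
    posterior = normalize(prior, dtheta)
    for x_n in samples:
        posterior = normalize(gaussian_pdf(x_n, theta_grid, sigma) * posterior, dtheta)
    return posterior

def batch_posterior(samples, theta_grid, prior, sigma):
    """All samples at once: p(theta|D) proportional to p(theta) * prod_k p(x_k|theta)."""
    dtheta = theta_grid[1] - theta_grid[0]
    likelihood = np.prod([gaussian_pdf(x_k, theta_grid, sigma) for x_k in samples], axis=0)
    return normalize(likelihood * prior, dtheta)

# Hypothetical setup: true mean 1.5, sigma = 1, prior N(0, 1), n = 20 samples.
rng = np.random.default_rng(0)
sigma = 1.0
samples = rng.normal(1.5, sigma, size=20)
theta_grid = np.linspace(-10.0, 10.0, 2001)
prior = gaussian_pdf(theta_grid, 0.0, 1.0)

post_seq = recursive_posterior(samples, theta_grid, prior, sigma)
post_batch = batch_posterior(samples, theta_grid, prior, sigma)
print("max difference:", np.max(np.abs(post_seq - post_batch)))
```

The two posteriors should agree up to floating-point error, which is the practical appeal of the recursive form: the posterior after $ n-1 $ samples serves as the prior for the $ n $-th sample.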


Investigation of Estimator's Accuracy of Bayesian Parameter Estimation: Gaussian Case

The Univariate Case: $ p(\mu|\mathcal{D}) $

Assumptions:

$ p(x|\mu) \sim N(\mu, \sigma^2) $
$ p(\mu) \sim N(\mu_0, \sigma_0^2) $

From the previous section, the following expression could be easily obtained:

$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right]p(\mu) $

Where $ \alpha $ is a normalization factor independent of $ \mu $.

Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with:

$ p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right] $
$ p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right] $

Hence,

$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right]\right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right] $

Absorbing every factor that does not depend on $ \mu $ into a new scaling factor $ \beta $,

$ p(\mu|D) = \beta \exp\left[-\frac{1}{2}\left(\sum_{k=1}^n\left(\frac{x_k-\mu}{\sigma}\right)^{2} + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right)\right] $

Expanding the squares and absorbing the terms that do not depend on $ \mu $ into $ \gamma $,

$ p(\mu|D) = \gamma \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 -2\left(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right] $

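Furthermore, since the exponent is quadratic in $ \mu $, $ p(\mu|D) $ is again a Gaussian density, which can be written as

$ p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right] $

Equating the coefficients of $ \mu $ and $ \mu^2 $ in the two expressions, the estimate $ \mu_n $ can be obtained: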

$ \mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x_n} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $

Where $ \bar{x_n} = \frac{1}{n}\sum_{k=1}^n x_k $ is the sample mean and $ n $ is the sample size.

The variance $ \sigma_n^2 $ of this Gaussian posterior, which measures the remaining uncertainty about $ \mu $, is:

$ \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2} $
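As a small worked example, the closed-form expressions for $ \mu_n $ and $ \sigma_n^2 $ can be coded directly. This is only a sketch; the function name and the numbers in the usage line are illustrative assumptions.

```python
def posterior_mean_params(samples, sigma, mu0, sigma0):
    """Return (mu_n, sigma_n^2) of p(mu|D) for known sigma and prior N(mu0, sigma0^2)."""
    n = len(samples)
    xbar = sum(samples) / n                      # sample mean
    denom = n * sigma0 ** 2 + sigma ** 2
    # Weighted average of the sample mean and the prior mean.
    mu_n = (n * sigma0 ** 2 / denom) * xbar + (sigma ** 2 / denom) * mu0
    # Posterior variance shrinks roughly like sigma^2 / n for large n.
    var_n = (sigma0 ** 2 * sigma ** 2) / denom
    return mu_n, var_n

# Hypothetical data: three samples, sigma = sigma0 = 1, mu0 = 0.
mu_n, var_n = posterior_mean_params([1.0, 2.0, 3.0], sigma=1.0, mu0=0.0, sigma0=1.0)
print(mu_n, var_n)
```

With these numbers the sample mean is 2 and $ n\sigma_0^2 + \sigma^2 = 4 $, so $ \mu_n = \frac{3}{4}\cdot 2 = 1.5 $ and $ \sigma_n^2 = \frac{1}{4} $: the estimate is pulled from the sample mean toward the prior mean $ \mu_0 = 0 $, and the pull weakens as $ n $ grows.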

The Univariate Case: $ p(x|\mathcal{D}) $

Having obtained the posterior density $ p(\mu|\mathcal{D}) $ of the mean, the remaining task is to estimate the "class-conditional" density $ p(x|\mathcal{D}) $.

Following the text by Duda et al. [2],


$ p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu $


$ p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}\right]f(\sigma,\sigma_n) $


Where $ f(\sigma, \sigma_n) $ is defined as:


$ f(\sigma,\sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu $

Hence, $ p(x|D) $ is normally distributed as:

$ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
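The result $ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $ can be sanity-checked by evaluating the integral $ \int p(x|\mu)p(\mu|D)d\mu $ numerically. The sketch below assumes illustrative values for $ \mu_n $, $ \sigma_n^2 $ and $ \sigma $, and uses a plain Riemann sum; the grid width and resolution are arbitrary choices.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def predictive_by_integration(x, mu_n, var_n, sigma, num=20001, width=12.0):
    """Approximate p(x|D) = integral of p(x|mu) p(mu|D) d(mu) with a Riemann sum."""
    std_n = np.sqrt(var_n)
    # Grid over mu covering +/- `width` posterior standard deviations around mu_n.
    mu = np.linspace(mu_n - width * std_n, mu_n + width * std_n, num)
    integrand = normal_pdf(x, mu, sigma ** 2) * normal_pdf(mu, mu_n, var_n)
    return integrand.sum() * (mu[1] - mu[0])

# Hypothetical posterior parameters and known sigma.
mu_n, var_n, sigma = 1.5, 0.25, 1.0
for x in (0.0, 1.5, 3.0):
    numeric = predictive_by_integration(x, mu_n, var_n, sigma)
    closed_form = normal_pdf(x, mu_n, sigma ** 2 + var_n)
    print(x, numeric, closed_form)
```

The numerical and closed-form values should match closely, which illustrates that the predictive variance is the sum of the known variance $ \sigma^2 $ and the uncertainty $ \sigma_n^2 $ about the mean.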


\subsection{Experiment of Bayesian Parameter Estimation}

\paragraph{Design}

Assume that the set $ \mathcal{D} $ consists of n samples drawn from a class with unknown mean $ \mu $ and known $ \sigma $. Assume further that

$ p(x|\mu) \sim N(\mu, \sigma^2) $

$ p(\mu) \sim N(\mu_0, \sigma_0^2) $

Here $ \sigma = \sigma_0 = constant $ and $ \mu_0 = 0 $ (the particular value assumed for $ \mu_0 $ turns out not to be critical; this is verified below). The goal is to estimate $ \mu $ from the sample data $ x_i \in \mathcal{D}, i = 1,2,3,...,n $.

The following results will be obtained:
\begin{enumerate}
\item The impact of $\mu_0$ on the estimated $\hat{\mu}$
\item The impact of the sample size $n$ on estimation accuracy
\end{enumerate}
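An experiment of this kind could be set up along the following lines. This is a sketch under stated assumptions rather than the code behind the figures: the true mean, the values of $ \sigma $ and $ \sigma_0 $, the number of trials, the seed and the function names are all hypothetical.

```python
import numpy as np

def bayes_mean_estimate(data, sigma, mu0, sigma0):
    """Posterior mean mu_n for each row of `data` (one row = one set of n samples)."""
    n = data.shape[-1]
    xbar = data.mean(axis=-1)
    denom = n * sigma0 ** 2 + sigma ** 2
    return (n * sigma0 ** 2 * xbar + sigma ** 2 * mu0) / denom

def prior_mean_sweep(true_mu, sigma, sigma0, mu0_values, n, trials, seed=0):
    """Mean and variance of the estimate over `trials` data sets, for each mu0."""
    rng = np.random.default_rng(seed)
    data = rng.normal(true_mu, sigma, size=(trials, n))  # same data for every mu0
    results = {}
    for mu0 in mu0_values:
        est = bayes_mean_estimate(data, sigma, mu0, sigma0)
        results[mu0] = (float(est.mean()), float(est.var()))
    return results

def sample_size_sweep(true_mu, sigma, sigma0, mu0, n_values, trials, seed=0):
    """Mean absolute error of the estimate for each sample size n."""
    rng = np.random.default_rng(seed)
    errors = {}
    for n in n_values:
        data = rng.normal(true_mu, sigma, size=(trials, n))
        est = bayes_mean_estimate(data, sigma, mu0, sigma0)
        errors[n] = float(np.mean(np.abs(est - true_mu)))
    return errors

# Hypothetical configuration: true mean 1, sigma = sigma0 = 1, 50 trials.
by_mu0 = prior_mean_sweep(1.0, 1.0, 1.0, [-1.0, 0.0, 1.0, 2.0], n=10, trials=50)
by_n = sample_size_sweep(1.0, 1.0, 1.0, 0.0, [2, 10, 100, 1000], trials=200)
print(by_mu0)
print(by_n)
```

Because the same data are reused for every $ \mu_0 $, the estimates for different prior means differ only by a constant shift, so their variance does not depend on $ \mu_0 $, while their average moves with it.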

\paragraph{Results}
\begin{center}
\includegraphics[scale=1]{BPE_1.png}

Figure 21. The impact of $\mu_0$ on estimated $\hat{\mu}$ averaged over 50 samples

\includegraphics[scale=1]{BPE_2.png}

Figure 22. The impact of $\mu_0$ on the variance of estimated $\hat{\mu}$ over 50 samples


\end{center}

The estimated mean shifts upward as $\mu_0$ increases. \textbf{Based on the experiment, it can be concluded that the most 'accurate' estimate is obtained when $ \mu_0 = \mu $. However, according to the plot, even if $\mu_0$ differs from $\mu$, the estimation error is still acceptable (in our case, within the region [-0.1,+0.06]).} The variance of the estimated mean, on the other hand, can be assumed to be the same as that of the \textbf{actual empirical mean}.


\begin{center}
\includegraphics[scale=0.7]{e23456.png}

Figure 23. The impact of the sample size $n$ on the accuracy of the estimated shape (sample sizes = 2,3,4,5,6)

\includegraphics[scale=0.6]{ece662_14.png}

Figure 24. The impact of the sample size $n$ on the accuracy of the estimated shape (sample sizes = 4,10,20,50,100)
\end{center}


\paragraph{Conclusion} Figures 23 and 24 demonstrate that with an insufficient sample size the estimated density predicts the distribution of the points poorly.




References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, D. Stork, Pattern Classification. Thomson Brooks/Cole. Second Edition, 2006.
