Bayesian Parameter Estimation: Gaussian Case

A slecture by ECE student Shaobo Fang

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

## Introduction: Bayesian Estimation

Although the estimates obtained from Maximum Likelihood Estimation (MLE) and Bayesian Parameter Estimation (BPE) are often similar or even identical, the underlying ideas of the two methods are completely different. In MLE, the parameter to be estimated is treated as a fixed but unknown number (or several numbers, if there is more than one parameter), whereas in BPE the parameter is treated as a random variable (or random vector).

To start with, Bayes' formula is written in the following form, given the set of training samples $\mathcal{D}$:

$P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)}$

From the above equation, it can be seen that both the class-conditional densities and the prior probabilities can be obtained from the training data.

Now assume that we are working in a supervised setting with labelled training data, so that all training samples can be separated accurately into c subsets $\mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c$, where the samples in $\mathcal{D}_i$ belong to class $w_i$.

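Hence, assuming that the samples in $\mathcal{D}_i$ have no influence on $p(x|w_j,D)$ for $i \neq j$ and that the prior probabilities are known, so that $P(w_i|D) = P(w_i)$, the above equation becomes: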

$P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)}$
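As a rough illustration of the formula above, the following Python sketch computes the class posteriors from class-conditional densities and priors. The helper names `gaussian_pdf` and `class_posteriors`, the two Gaussian densities, and all numeric values are assumptions made for this example only; they are not part of the lecture material.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def class_posteriors(x, class_densities, priors):
    """Return P(w_i | x, D) for every class i using Bayes' formula.

    class_densities[i] is a callable standing in for p(x | w_i, D_i),
    and priors[i] stands in for P(w_i).
    """
    joint = [density(x) * prior for density, prior in zip(class_densities, priors)]
    evidence = sum(joint)  # denominator: sum over all c classes
    return [j / evidence for j in joint]


# Hypothetical two-class problem (numbers chosen only for illustration).
densities = [
    lambda x: gaussian_pdf(x, 0.0, 1.0),  # stand-in for p(x | w_1, D_1)
    lambda x: gaussian_pdf(x, 2.0, 1.0),  # stand-in for p(x | w_2, D_2)
]
priors = [0.5, 0.5]

# x = 1.0 lies halfway between the two class means.
posteriors = class_posteriors(1.0, densities, priors)
```

With equal priors and equal variances, a point halfway between the two means is equally likely to belong to either class, so both posteriors come out as 0.5.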

Now, assume that a set of $N$ independent samples $\mathcal{D} = \{x_1, x_2, ... , x_N \}$ was drawn from a certain class, and that each sample follows a density of known parametric form $p(x|\theta)$. To form a Bayesian estimate, we consider $\theta$ to be a random variable (or random vector). The density of $x$ given the sample set $\mathcal{D}$ is then obtained by integrating over $\theta$:

$p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta$

## Bayesian Parameter Estimation: General Theory

To better understand the Bayesian Parameter Estimation (BPE) technique, we first briefly discuss the general theory. In BPE, the parameter $\theta$ is unknown and is treated as a random variable (or random vector). We further assume that $\theta$ has a known prior distribution $p(\theta)$. To estimate $\theta$, both the information in the prior and the information in the set $\mathcal{D}$ of n samples $x_1, x_2, ... , x_n$ are used. The form of the density of a sample x given $\theta$ is assumed to be known and is denoted $p(x|\theta)$.

From the previous section we have already obtained:

$p(x|D) = \int p(x|\theta)p(\theta|D)d\theta$

Furthermore, by Bayes' theorem,

$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta}$

We are now very close, but we still need to express the dependence on the sample set $\mathcal{D}$ in terms of the individual samples $x_k$. Since the samples are independent, the likelihood of $\mathcal{D}$ given $\theta$ factorizes as:

$p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta)$

Now, for a sample set containing n samples, $\mathcal{D}^n = \{x_1, x_2, ... x_n\}$, the product can be written recursively, where $p(x_n|\theta)$ is the density of the newest sample and $p(D^{n-1}|\theta)$ is the likelihood of the first n-1 samples:

$p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta)$

Hence, substituting this factorization into Bayes' theorem, the posterior density of $\theta$ satisfies the following recursion:

$p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta}$
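The recursion above can be checked numerically. The sketch below discretizes $\theta$ on a grid and compares the posterior obtained by updating one sample at a time with the posterior obtained from the full product $\prod_k p(x_k|\theta)p(\theta)$. The grid, the Gaussian prior and likelihood, the sample values, and all function names are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normalize(density, step):
    """Rescale grid values so that their Riemann sum equals 1."""
    total = sum(density) * step
    return [d / total for d in density]


def recursive_posterior(samples, grid, step, prior, likelihood):
    """Update p(theta | D^n) from p(theta | D^{n-1}) one sample at a time."""
    posterior = normalize(prior, step)
    for x in samples:
        unnormalized = [likelihood(x, theta) * p for theta, p in zip(grid, posterior)]
        posterior = normalize(unnormalized, step)
    return posterior


def batch_posterior(samples, grid, step, prior, likelihood):
    """Compute p(theta | D) from the full product of likelihoods times the prior."""
    unnormalized = []
    for theta, p in zip(grid, prior):
        lik = 1.0
        for x in samples:
            lik *= likelihood(x, theta)
        unnormalized.append(lik * p)
    return normalize(unnormalized, step)


# Hypothetical setup: theta is the mean of a unit-variance Gaussian.
step = 0.01
grid = [-5.0 + step * i for i in range(1001)]      # theta in [-5, 5]
prior = [gaussian_pdf(t, 0.0, 4.0) for t in grid]  # assumed prior N(0, 4)
samples = [1.2, 0.7, 1.9, 1.1]                     # made-up observations


def likelihood(x, theta):
    """Stand-in for p(x | theta): Gaussian with mean theta and variance 1."""
    return gaussian_pdf(x, theta, 1.0)


recursive = recursive_posterior(samples, grid, step, prior, likelihood)
batch = batch_posterior(samples, grid, step, prior, likelihood)
max_gap = max(abs(a - b) for a, b in zip(recursive, batch))
```

Here `max_gap` should be at the level of floating-point rounding, which reflects that the sequential and the batch form describe the same posterior.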

## The Univariate Case: $p(\mu|\mathcal{D})$

As was done for MLE, we start with the simple case in which only the mean $\mu$ is unknown. As usual, we assume that each sample $x_k$ is normally distributed:

$p(x|\mu) \sim N(\mu, \sigma^2)$

and that the parameter $\mu$ has the prior distribution:

$p(\mu) \sim N(\mu_0, \sigma_0^2),$

since in BPE the parameter $\mu$ is treated as a random variable rather than as a fixed number.

Using Bayes' formula and the derivation from the previous section, the posterior density of $\mu$ is easily obtained:

$p(\mu|D) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right] p(\mu)$

where $\alpha$ is a normalization factor introduced to simplify the derivation. Please note that $\alpha$ depends on $\mathcal{D}$ but is completely independent of $\mu$.

Since $x_k$ and $\mu$ are both normally distributed, we write out $p(x_k|\mu)$ and $p(\mu)$ explicitly:

$p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]$

$p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]$

Substituting $p(x_k|\mu)$ and $p(\mu)$ into $p(\mu|D) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right] p(\mu)$, we obtain:

$p(\mu|D) = \alpha \left[\prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]\right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]$

$p(\mu|D) = \alpha \frac{1}{(2\pi\sigma^2)^{n/2}} \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2} -\frac{1}{2}\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2}]$

Similarly, to simplify the derivation we absorb every factor that does not depend on $\mu$ into the scaling factors $\alpha'$ and $\alpha''$:

$p(\mu|D) = \alpha' \exp[-\frac{1}{2}((\frac{\mu-\mu_0}{\sigma_0})^{2} + \sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2})]$

$p(\mu|D) = \alpha'' \exp[-\frac{1}{2}[(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu]]$

Finally, compare the derived $p(\mu|D)$ to a Gaussian density in standard form:

$p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}]$

Equating the coefficients of $\mu^2$ and $\mu$ in the two exponents, $\mu_n$ and $\sigma_n^2$ are obtained:

$\mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x_n} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0$

$\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}$
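For readers who want the intermediate step, the two results can be recovered as follows. Expanding the exponent of the standard-form Gaussian and dropping the term that does not depend on $\mu$ gives

$-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2} = -\frac{1}{2}[\frac{1}{\sigma_n^2}\mu^2 - 2\frac{\mu_n}{\sigma_n^2}\mu] + \text{const}$

Comparing this with the quadratic in $\mu$ in the exponent of $p(\mu|D)$ yields two conditions:

$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$

$\frac{\mu_n}{\sigma_n^2} = \frac{1}{\sigma^2}\sum_{k=1}^n x_k + \frac{\mu_0}{\sigma_0^2} = \frac{n}{\sigma^2}\bar{x_n} + \frac{\mu_0}{\sigma_0^2}$

Solving the first condition for $\sigma_n^2$ and substituting it into the second gives the expressions for $\sigma_n^2$ and $\mu_n$ stated above.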

Both the mean $\mu_n$ and the variance $\sigma_n^2$ are emphasized because a Gaussian density is completely determined by these two parameters. Please note that $\bar{x_n}$ is the empirical mean of the training samples, $\bar{x_n} = \frac{1}{n}\sum_{k=1}^n x_k$.
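The closed-form expressions for $\mu_n$ and $\sigma_n^2$ translate directly into code. The following is a minimal sketch; the function name `posterior_mean_params`, its argument names, and the sample values are assumptions made for illustration.

```python
def posterior_mean_params(samples, sigma2, mu0, sigma0_2):
    """Return (mu_n, sigma_n^2) of the Gaussian posterior p(mu | D).

    samples  : observed values x_1, ..., x_n
    sigma2   : known variance sigma^2 of each sample
    mu0      : prior mean mu_0
    sigma0_2 : prior variance sigma_0^2
    """
    n = len(samples)
    x_bar = sum(samples) / n                      # empirical mean
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * x_bar + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2


# Made-up data: four samples, known variance 1, assumed prior N(0, 4).
samples = [1.2, 0.7, 1.9, 1.1]
mu_n, sigma_n2 = posterior_mean_params(samples, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
```

For these values the empirical mean is 1.225, and the posterior mean $\mu_n = 19.6/17 \approx 1.153$ lies between the prior mean 0 and the empirical mean, while $\sigma_n^2 = 4/17 \approx 0.235$ is smaller than both $\sigma_0^2$ and $\sigma^2$. As $n$ grows, the weight on $\bar{x_n}$ approaches 1 and $\sigma_n^2$ approaches 0.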

## The Univariate Case: $p(x|\mathcal{D})$

Now that the posterior density $p(\mu|D)$ has been derived (with mean $\mu_n$ and variance $\sigma_n^2$ now known), the final step is to compute $p(x|D)$ using the results of the previous sections.

Following Section 3.4.2 of Duda and Hart [2] and Prof. Boutin's lecture notes [1]:

$p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu$
$p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu$

Completing the square in $\mu$ inside the exponent, $p(x|\mathcal{D})$ can be written as:

$p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}] \int \exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu$

The remaining integral is that of an unnormalized Gaussian in $\mu$, so its value does not depend on $x$. Hence, $p(x|D)$ is normally distributed as:

$p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$
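This result can be sanity-checked by evaluating the integral $\int p(x|\mu)p(\mu|\mathcal{D})d\mu$ numerically and comparing it with the density of $N(\mu_n, \sigma^2 + \sigma_n^2)$. The sketch below uses a simple midpoint rule; the function names, the integration range, and the numeric values of $\sigma^2$, $\mu_n$, $\sigma_n^2$ and $x$ are all assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_by_integration(x, sigma2, mu_n, sigma_n2, half_width=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of p(x|mu) p(mu|D) over mu.

    The integration range covers mu_n +/- half_width posterior standard
    deviations, outside of which p(mu|D) is negligible.
    """
    sd = math.sqrt(sigma_n2)
    lower = mu_n - half_width * sd
    step = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps):
        mu = lower + (i + 0.5) * step
        total += gaussian_pdf(x, mu, sigma2) * gaussian_pdf(mu, mu_n, sigma_n2)
    return total * step


# Hypothetical values: known sample variance and an already computed posterior.
sigma2, mu_n, sigma_n2 = 1.0, 1.15, 0.25
x = 2.0

numeric = predictive_by_integration(x, sigma2, mu_n, sigma_n2)
closed_form = gaussian_pdf(x, mu_n, sigma2 + sigma_n2)
```

The two values should agree to many decimal places. The variance $\sigma^2 + \sigma_n^2$ combines the noise in each sample with the remaining uncertainty about $\mu$.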

## References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.
