Bayes Parameter Estimation with examples - Rhea

Bayesian Parameter Estimation with examples

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

Introduction: Bayesian Estimation

Suppose that we have an observable random variable $$ X $$ for an experiment, taking values in a set $S$. Suppose that the distribution of $$ X $$ depends on a parameter $\theta$ taking values in a parameter space $\Theta$ . We will denote the probability density function of $$ X $$ for a given value of $\theta$ by $f( \mathbf{x} \mid \theta)$ for $x \in S$ and $\theta \in \Theta$ . Of course, our data variable $$ X $$ is almost always vector-valued. The parameter $\theta$ may also be vector-valued.

In Bayesian analysis, named for the famous Thomas Bayes, we treat the parameter $\theta$ as a random variable with a given probability density function $h(\theta)$ for $\theta \in \Theta$ . The corresponding distribution is called the prior distribution of $\theta$ and is intended to reflect our knowledge (if any) of the parameter before we gather data. After observing $x \in S$ , we then use Bayes' theorem to compute the conditional probability density function of $\theta$ given $\mathbf{X}=\mathbf x$ .

First recall that the joint probability density function of $(\mathbf X,\theta)$ is the mapping on $S \times \Theta$ given by

(x, \theta) \mapsto h(\theta) f(x \mid \theta)

Next recall that the (marginal) probability density function f of $$ X $$ is given by

f(x) = \sum_{\theta \in \Theta} h(\theta) f(x | \theta), \quad x \in S

if the parameter has a discrete distribution, or

f(x) = \int_\Theta h(\theta) f(x| \theta) \, d\theta, \quad x\in S

if the parameter has a continuous distribution. Finally, the conditional probability density function of $\theta$ given $X = x$ is

h(\theta \mid x) = \frac{h(\theta) f(x \mid \theta)}{f(x)}; \quad \theta \in \Theta, \; x\in S

The conditional distribution of $\theta$ given $$ X= x $$ is called the posterior distribution, and is an updated distribution, given the information in the data. Finally, if $\theta$ is a real parameter, the conditional expected value $\mathbb{E}(\theta \mid X)$ is the Bayes estimator of $\theta$ . Recall that $\mathbb{E}(\theta \mid X)$ is a function of X and, among all functions of X, is closest to $\theta$ in the mean square sense. Thus, once we collect the data and observe $$ X= x $$ , the estimate of $\theta$ is $\mathbb{E}(\theta \mid X = x)$ .
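As a minimal sketch of the formulas above, consider a discrete parameter space. The three candidate values of $\theta$, the uniform prior, and the observation (3 heads in 4 coin flips) are all hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood):
    """Return h(theta | x) for each theta, given h(theta) and f(x | theta)."""
    joint = {t: prior[t] * likelihood[t] for t in prior}  # h(theta) f(x | theta)
    marginal = sum(joint.values())                         # f(x), discrete case
    return {t: joint[t] / marginal for t in joint}

# Hypothetical setup: theta is the probability of heads; x = 3 heads in 4 flips.
thetas = [0.25, 0.5, 0.75]
prior = {t: 1.0 / len(thetas) for t in thetas}
# The binomial coefficient is the same for every theta, so it cancels out.
likelihood = {t: t**3 * (1 - t) for t in thetas}

post = posterior(prior, likelihood)
# Bayes estimate: the posterior mean E(theta | X = x).
bayes_estimate = sum(t * p for t, p in post.items())
print(post, bayes_estimate)
```

The posterior puts most of its mass on $\theta = 0.75$, and the Bayes estimate is a weighted average of the three candidates rather than the single most likely one.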

Bayesian Parameter Estimation: General Theory

We first start with a generalized approach which can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:

1. The form of the density $p(x|\theta)$ is assumed to be known, but the value of the parameter vector $\theta$ is not known exactly.

2. The initial knowledge about $\theta$ is assumed to be contained in a known a priori density $p(\theta)$ .

3. The rest of the knowledge about $\theta$ is contained in a set $\mathcal{D}$ of n samples $$ x_1, x_2, ... , x_n $$ drawn independently according to the unknown probability density $$ p(x) $$ .

Accordingly, we already know:

p(x|D) = \int p(x|\theta)p(\theta|D)d\theta

and, by Bayes' theorem,

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta}

Now, since we want to express the equation in terms of the individual samples $$ x_k $$ , the independence assumption gives

p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta)

Hence, if the data set $\mathcal{D}$ has n samples, we can denote it as $\mathcal{D}^n = \{x_1, x_2, ... x_n\}$ .

Combining this notation with the equation above gives:

p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta)

Using this equation, we can rewrite the Bayesian parameter estimate in recursive form, where the posterior after $n-1$ samples acts as the prior for the $n$-th sample:

p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta}
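The equivalence of the batch and recursive forms can be checked numerically. The following is a sketch under stated assumptions: $\theta$ is the head probability of a coin, it is discretized on a grid, the prior is flat, and the data are made up. None of these choices come from the lecture material.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def bernoulli(x, theta):
    """p(x | theta) for one coin flip, x = 1 (head) or 0 (tail)."""
    return theta if x == 1 else 1.0 - theta

# Hypothetical setup: theta discretized at 200 midpoints of (0, 1), flat prior.
grid = [(i + 0.5) / 200 for i in range(200)]
prior = normalize([1.0] * len(grid))
data = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up flips: 6 heads, 2 tails

# Batch form: p(theta | D) is proportional to p(theta) * prod_k p(x_k | theta).
batch = normalize([p * math.prod(bernoulli(x, t) for x in data)
                   for t, p in zip(grid, prior)])

# Recursive form: the posterior after x_1..x_{n-1} is the prior for x_n.
recursive = prior
for x in data:
    recursive = normalize([bernoulli(x, t) * p for t, p in zip(grid, recursive)])

# Predictive probability of a head on the next flip:
# p(x = 1 | D) = sum over the grid of p(x = 1 | theta) p(theta | D).
p_head_next = sum(bernoulli(1, t) * p for t, p in zip(grid, recursive))
print(max(abs(a - b) for a, b in zip(batch, recursive)), p_head_next)
```

Both routes give the same posterior up to rounding. The recursive form only needs the current posterior and the newest sample, which is convenient when data arrive one at a time.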

Bayesian Parameter Estimation: Example

The objective of the following experiments is to evaluate how varying parameters affect density estimation:

1. 1D Binomial data density estimation when varying the number of training samples.

2. 1D Binomial data density estimation using different prior distributions.

3. 2D synthetic data density estimation when updating our prior guess.

The 1D Binomial test is based on flipping a biased coin. The probability that the biased coin shows heads is assumed to be p, so that the probability of tails is 1-p. In this experiment, we introduce another well-known estimator, the maximum a posteriori (MAP) estimator. The reason for introducing MAP when comparing MLE and BPE is that MAP can be treated as an intermediate step between MLE and BPE, since it also takes the prior into account. Note that we can simply define MAP as follows:

\hat{\theta}_{\mathrm{MAP}}(x)= \underset{\theta}{\operatorname{arg\,max}} \ f(x | \theta) \, h(\theta)

where $h(\theta)$ is the prior density. Dropping the factor $h(\theta)$ recovers the maximum likelihood estimate $\hat{\theta}_{\mathrm{ML}}(x)$ .
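For the coin example, the three estimators have simple closed forms if we assume a Beta($\alpha$, $\beta$) prior on p and take the posterior mean as the BPE point estimate. The sketch below uses these standard conjugate-prior results. The function names and the example counts are illustrative, and the MAP formula assumes the posterior mode lies strictly inside (0, 1).

```python
def mle(k, n):
    """Maximum likelihood estimate of p from k heads in n flips."""
    return k / n

def map_estimate(k, n, a, b):
    """Mode of the Beta(k + a, n - k + b) posterior.

    Assumes k + a > 1 and n - k + b > 1 so the mode is interior.
    """
    return (k + a - 1) / (n + a + b - 2)

def posterior_mean(k, n, a, b):
    """Mean of the Beta(k + a, n - k + b) posterior (BPE point estimate)."""
    return (k + a) / (n + a + b)

# Hypothetical data: 4 heads in 5 flips, with a Beta(4, 2) prior (mean 2/3).
k, n, a, b = 4, 5, 4, 2
print(mle(k, n), map_estimate(k, n, a, b), posterior_mean(k, n, a, b))
```

With so few flips, the MLE follows the data alone, while the other two estimates are pulled toward the prior. As n grows, all three formulas approach k/n.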

First of all, we examine how the number of training samples affects BPE, MLE and MAP. The question is which one performs best when the training data are insufficient. To answer this question, we formulate the problem of flipping a biased coin in the following way:

1. The number of training samples varies from 5 to 200 in steps of 10.

2. For each case, we use the same prior knowledge, namely that $\theta$ follows a Beta distribution (mean = 2/3).

3. For each case, we run 30 trials, which gives us a reasonable mean and variance, where the ground truth of p is 2/3.
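The experiment described above could be simulated roughly as follows. This is only a sketch: the text does not give the shape parameters of the prior, so Beta(4, 2) (mean 2/3) is an assumption, as are the random seed and the use of the posterior mean as the BPE point estimate.

```python
import random
import statistics

def simulate(n, trials=30, p_true=2 / 3, a=4.0, b=2.0, rng=None):
    """Mean and sample variance of each estimator over repeated trials.

    a, b are the assumed Beta prior parameters (mean a / (a + b) = 2/3).
    """
    rng = rng or random.Random(0)
    estimates = {"MLE": [], "MAP": [], "BPE": []}
    for _ in range(trials):
        k = sum(rng.random() < p_true for _ in range(n))  # heads in n flips
        estimates["MLE"].append(k / n)
        estimates["MAP"].append((k + a - 1) / (n + a + b - 2))
        estimates["BPE"].append((k + a) / (n + a + b))
    return {name: (statistics.mean(v), statistics.variance(v))
            for name, v in estimates.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: simulate(n, rng=rng) for n in range(5, 201, 10)}

for n in (5, 95, 195):
    row = ", ".join(f"{name}: mean={m:.3f} var={v:.5f}"
                    for name, (m, v) in results[n].items())
    print(f"n={n:3d}  {row}")
```

Because MAP and BPE are shrunken versions of the head count, their variance across trials cannot exceed that of MLE for the same data, which can be compared with the variance discussion in the text.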

The probability density function of the beta distribution, for $0 \le x \le 1$ and shape parameters $\alpha,\beta > 0$ , is

f(x;\alpha,\beta) = \frac{1}{ B(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1}
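The density can be evaluated with the standard library alone by using $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ . The helper below is an illustrative sketch. For simplicity it returns 0 at the endpoints, which is a simplification when $\alpha$ or $\beta$ is at most 1.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, computed in log space for stability."""
    if not 0.0 < x < 1.0:
        return 0.0  # simplification: endpoints and outside values map to 0
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta_fn
    return math.exp(log_pdf)

# Midpoint-rule check that Beta(4, 2) integrates to about 1 on (0, 1).
steps = 1000
area = sum(beta_pdf((i + 0.5) / steps, 4, 2) for i in range(steps)) / steps
print(beta_pdf(0.5, 2, 2), area)
```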

[Figure 1: Posterior mean with increasing number of samples]

In Figure 1, all curves converge to the true mean as the number of training samples increases. However, when the number of samples is small, BPE gives a better estimate because it takes all of the prior information into account, whereas MAP shows a large offset even though it also uses some prior information. In terms of the mean value, the performance of MLE lies somewhere between BPE and MAP.

If we take a closer look at the variance in each case, we can see that MLE tends to have a larger variance, especially when the number of samples is small, which means MLE is more uncertain about what it tries to estimate. On the other hand, BPE and MAP have smaller variance because the prior information limits the uncertainty to a certain range. We can infer that if our prior distribution has a narrower peak at the true mean, rather than a Beta distribution with a wide ramp, the variance of the estimate will be much smaller.

[Figure 2: Variance of $\hat{p}$ with different prior information]

Figure 2 supports the inference above. In this case, we temporarily let the Beta distribution have a mean of 0.5 and vary its two parameters ($\alpha$ and $\beta$) to obtain different variances, which represent the uncertainty of our initial guess. Figure 3 shows how the Beta distribution changes with different parameters. Returning to Figure 2, we can conclude that the certainty of the prior knowledge determines the variance of our estimate.

[Figure 3: Beta distribution when varying $\alpha,\beta$]
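To see how the shape parameters control the certainty of the prior while its mean stays at 0.5, note that a symmetric Beta($a$, $a$) distribution has variance $1/(4(2a+1))$ . The short sketch below tabulates this. The particular values of $a$ are arbitrary and are not the ones used in the figures.

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

shapes = [1, 2, 5, 20, 50]  # illustrative values with alpha = beta
table = [(a, *beta_mean_var(a, a)) for a in shapes]
for a, mean, var in table:
    print(f"alpha = beta = {a:2d}: mean = {mean:.2f}, variance = {var:.5f}")
```

Larger shape parameters give a narrower prior, that is, a more confident initial guess.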

You may also ask how the number of samples affects the variance. Figure 4 tells us that, starting from a very small number of samples (5 in our case), the variance tends to go up and then decrease toward zero. The reason is that when the number of samples is very small, the prior is dominant, so the estimate is simply a reflection of the prior, which tends to have a small variance.

[Figure 4: Variance of $\hat{p}$ with increasing number of samples]

[Figure 5: Posterior mean with different initial guess]

Now, let's discuss what happens if our prior knowledge is biased, say the true mean is 0.6 but we model our prior as a Gaussian centered at 0.2. Still using the problem we formulated before, where the ground truth is 2/3, we force our prior to be biased. As Figures 5 and 6 show, four initial guesses are used with a relatively small number of samples. We can see that the results from different priors vary a lot and the effect of the prior is dominant in this case. With more data, this effect is attenuated and the data dominate the estimate. Figure 7 shows how the posterior updates under different priors.

[Figure 6: Prior: Beta distribution with various parameters]

[Figure 7: Posterior: likelihood $\times$ prior]
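The fading influence of a biased prior can be made concrete with the Beta prior used earlier. For k heads in n flips, the posterior mean is a weighted average of the prior mean and the sample proportion, with weight $(\alpha+\beta)/(n+\alpha+\beta)$ on the prior. The three priors and the two data sets below are hypothetical and are not the ones behind the figures.

```python
def posterior_mean(k, n, a, b):
    """Posterior mean of p under a Beta(a, b) prior after k heads in n flips."""
    return (k + a) / (n + a + b)

def prior_weight(n, a, b):
    """Weight of the prior mean in the posterior mean."""
    return (a + b) / (n + a + b)

# Hypothetical priors with different means but equal strength (a + b = 10).
priors = {"mean 0.2": (2, 8), "mean 0.5": (5, 5), "mean 0.8": (8, 2)}
# Hypothetical data sets, both with 2/3 heads: (heads, flips).
datasets = {"small": (10, 15), "large": (1000, 1500)}

spread = {}
for label, (k, n) in datasets.items():
    estimates = [posterior_mean(k, n, a, b) for a, b in priors.values()]
    spread[label] = max(estimates) - min(estimates)
    print(label, [round(e, 4) for e in estimates], round(spread[label], 4))
```

With 15 flips the estimates differ by 0.24 across priors. With 1500 flips they agree to within about 0.004, because the prior weight drops from 0.4 to under 0.01.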

Secondly, I will discuss how to update the prior recursively to reach a better estimate. In this experiment, assume an intruding UFO is detected by global radar in the year 2050. With advanced technology, the UFO can add Gaussian noise to its apparent position to deceive our radar. However, the aliens don't know that we have learned Bayesian estimation.

For simplicity, we limit the detection of the UFO to the region [3:5, 4:6]; the true location is [3.5, 4.5], which is unknown to us. What we know from our military radar is shown in Figure 8. The false positions that the aliens create follow a Gaussian distribution with a standard deviation of 2, centered at the true location.

[Figure 8: UFO location on radar]

To start off, our initial guess is just a uniform distribution over the region. In this case, we have 100 observations on our radar. After each observation, we use the previous posterior as the new prior. Figure 9 illustrates three stages of our detection. As we can see, as more data are collected, our prior becomes more concentrated. In other words, our confidence in the detection grows with the number of observations.

[Figure 9: Updating prior with data: first line represents X,Y coordinates, second line is the updated prior distribution]
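A rough sketch of this recursive update on a grid is given below. The region, true location, noise level, and number of observations follow the text. The grid spacing, the random seed, the isotropic form of the Gaussian, and all function names are assumptions made for illustration, so the numbers will not reproduce the figures exactly.

```python
import math
import random

SIGMA = 2.0            # standard deviation of the radar noise (from the text)
TRUE_LOC = (3.5, 4.5)  # true location, unknown to the estimator
xs = [3.0 + 0.05 * i for i in range(41)]  # assumed grid over [3, 5]
ys = [4.0 + 0.05 * j for j in range(41)]  # assumed grid over [4, 6]

def likelihood(obs, loc, sigma=SIGMA):
    """Unnormalized isotropic Gaussian p(obs | loc); constants cancel later."""
    dx, dy = obs[0] - loc[0], obs[1] - loc[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))

def update(belief, obs):
    """One Bayes step: multiply the current prior by the likelihood, normalize."""
    new = [[belief[i][j] * likelihood(obs, (xs[i], ys[j]))
            for j in range(len(ys))] for i in range(len(xs))]
    total = sum(map(sum, new))
    return [[v / total for v in row] for row in new]

rng = random.Random(0)  # fixed seed so the run is repeatable
observations = [(rng.gauss(TRUE_LOC[0], SIGMA), rng.gauss(TRUE_LOC[1], SIGMA))
                for _ in range(100)]

n_cells = len(xs) * len(ys)
belief = [[1.0 / n_cells] * len(ys) for _ in xs]  # uniform initial guess
peak_history = []
for obs in observations:
    belief = update(belief, obs)  # the posterior becomes the next prior
    peak_history.append(max(map(max, belief)))

# Posterior-mean estimate of the location over the grid.
est_x = sum(xs[i] * belief[i][j] for i in range(len(xs)) for j in range(len(ys)))
est_y = sum(ys[j] * belief[i][j] for i in range(len(xs)) for j in range(len(ys)))
print(est_x, est_y, peak_history[-1])
```

Normalizing after every observation keeps the numbers well scaled, and the height of the peak in `peak_history` gives a simple view of how concentrated the belief has become.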

References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.
