

Bayesian Parameter Estimation with examples

A slecture by ECE student Yu Wang

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.



Introduction: Bayesian Estimation

Suppose that we have an observable random variable $ X $ for an experiment that takes values in a set $ S $. Suppose that the distribution of $ X $ depends on a parameter $ \theta $ taking values in a parameter space $ \Theta $. We will denote the probability density function of $ X $ for a given value of $ \theta $ by $ f(x \mid \theta) $ for $ x \in S $ and $ \theta \in \Theta $. Of course, our data variable $ X $ is almost always vector-valued. The parameter $ \theta $ may also be vector-valued.

In Bayesian analysis, named for the famous Thomas Bayes, we treat the parameter $ \theta $ as a random variable with a given probability density function $ h(\theta) $ for $ \theta \in \Theta $. The corresponding distribution is called the prior distribution of $ \theta $ and is intended to reflect our knowledge (if any) of the parameter before we gather data. After observing $ x \in S $, we then use Bayes' theorem to compute the conditional probability density function of $ \theta $ given $ X = x $.

First recall that the joint probability density function of $ (X, \theta) $ is the mapping on $ S \times \Theta $ given by

$ (x, \theta) \mapsto h(\theta) f(x \mid \theta) $

Next recall that the (marginal) probability density function $ f $ of $ X $ is given by

$ f(x) = \sum_{\theta \in \Theta} h(\theta) f(x \mid \theta), \quad x \in S $

if the parameter has a discrete distribution, or

$ f(x) = \int_\Theta h(\theta) f(x \mid \theta) \, d\theta, \quad x \in S $

if the parameter has a continuous distribution. Finally, the conditional probability density function of $ \theta $ given $ X = x $ is

$ h(\theta \mid x) = \frac{h(\theta) f(x \mid \theta)}{f(x)}, \quad \theta \in \Theta, \; x \in S $

The conditional distribution of $ \theta $ given $ X = x $ is called the posterior distribution; it is an updated distribution that incorporates the information in the data. Finally, if $ \theta $ is a real parameter, the conditional expected value $ \mathbb{E}(\theta \mid X) $ is the Bayes estimator of $ \theta $. Recall that $ \mathbb{E}(\theta \mid X) $ is a function of $ X $ and, among all functions of $ X $, is closest to $ \theta $ in the mean square sense. Thus, once we collect the data and observe $ X = x $, the estimate of $ \theta $ is $ \mathbb{E}(\theta \mid X = x) $.
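To make these formulas concrete, here is a minimal sketch in Python for the discrete case, assuming a hypothetical three-point parameter space and a Binomial data model chosen purely for illustration; it evaluates $ f(x) $, $ h(\theta \mid x) $, and the Bayes estimate $ \mathbb{E}(\theta \mid X = x) $ directly from the formulas above.

from math import comb

# Hypothetical discrete parameter space Theta and uniform prior h(theta)
thetas = [0.2, 0.5, 0.8]
prior = {t: 1.0 / len(thetas) for t in thetas}

def likelihood(x, theta, n=10):
    # f(x | theta): Binomial(n, theta) pmf, a hypothetical data model for illustration
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

x = 7                                                            # observed value of X
marginal = sum(prior[t] * likelihood(x, t) for t in thetas)      # f(x): sum over Theta
posterior = {t: prior[t] * likelihood(x, t) / marginal for t in thetas}   # h(theta | x)
bayes_estimate = sum(t * posterior[t] for t in thetas)           # E[theta | X = x]

print(posterior, bayes_estimate)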


Bayesian Parameter Estimation: General Theory

We start with a general approach that can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:

1. The form of the density $ p(x|\theta) $ is assumed to be known, but the value of the parameter vector $ \theta $ is not known exactly.

2. The initial knowledge about $ \theta $ is assumed to be contained in a known a priori density $ p(\theta) $.

3. The rest of the knowledge about $ \theta $ is contained in a set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $ drawn independently according to the unknown probability density $ p(x) $.

Accordingly, we already know that

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $

and, by Bayes' theorem,

$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $


Now, since we want to express the posterior in terms of the individual samples $ x_k $, the independence assumption gives

$ p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta) $

Hence, if the set $ \mathcal{D} $ contains n samples, we can denote it by $ \mathcal{D}^n = \{x_1, x_2, \ldots, x_n\} $.

Combining this definition with the equation above:


$ p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta) $

Using this recursion, we can rewrite the Bayesian parameter estimate in the incremental form:

$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $
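As an illustration of this recursion, the following sketch (Python, with a hypothetical Bernoulli likelihood and made-up data) updates the posterior one sample at a time on a discretized grid over $ \Theta $, approximating the normalizing integral by a sum.

import numpy as np

theta_grid = np.linspace(0.01, 0.99, 99)                      # discretized parameter space Theta
posterior = np.full_like(theta_grid, 1.0 / len(theta_grid))   # p(theta): uniform prior on the grid

def likelihood(x_k, theta):
    # p(x_k | theta) for a single Bernoulli observation x_k in {0, 1} (illustrative model)
    return theta**x_k * (1.0 - theta)**(1 - x_k)

data = [1, 0, 1, 1, 0, 1, 1, 1]                 # samples x_1, ..., x_n (made up)
for x_k in data:
    posterior = likelihood(x_k, theta_grid) * posterior   # numerator p(x_n|theta) p(theta|D^{n-1})
    posterior /= posterior.sum()                          # denominator: the normalizing integral

posterior_mean = float(np.sum(theta_grid * posterior))    # E[theta | D^n]
print(posterior_mean)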




Bayesian Parameter Estimation: Gaussian Case

The Univariate Case: $ p(\mu|\mathcal{D}) $

Consider the case where $ \mu $ is the only unknown parameter. For simplicity we assume:

$ p(x|\mu) \sim N(\mu, \sigma^2) $
and
$ p(\mu) \sim N(\mu_0, \sigma_0^2) $

From the previous section, the following expression can be obtained using Bayes' formula:

$ p(\mu|D) = \alpha\, p(\mu) \prod_{k = 1}^n p(x_k|\mu) $

where $ \alpha $ is a normalization factor that does not depend on $ \mu $.

Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with:

$ p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right] $
$ p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right] $

The equation has now become:

$ p(\mu|D) = \alpha\, \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right] \prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right] $

Absorbing the factors that do not depend on $ \mu $ into new constants $ \alpha' $ and $ \alpha'' $, we obtain

$ p(\mu|D) = \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^n\left(\frac{x_k-\mu}{\sigma}\right)^{2} + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right)\right] $
$ p(\mu|D) = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^n x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right] $

Since this exponent is a quadratic function of $ \mu $, $ p(\mu|D) $ is again a normal density; write it as

$ p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right] $
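Expanding this Gaussian form and matching the coefficients of $ \mu^2 $ and $ \mu $ with those in the exponent above gives the two defining equations

$ \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{1}{\sigma^2}\sum_{k=1}^n x_k + \frac{\mu_0}{\sigma_0^2} = \frac{n}{\sigma^2}\bar{x}_n + \frac{\mu_0}{\sigma_0^2} $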

Solving these two equations, the posterior mean $ \mu_n $ is:

$ \mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $

where $ \bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k $ is the sample mean and $ n $ is the sample size.

The corresponding posterior variance $ \sigma_n^2 $ associated with $ \mu_n $ is:

$ \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2} $
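A minimal sketch in Python of these closed-form posterior parameters, assuming the data variance $ \sigma^2 $ and the prior parameters $ \mu_0, \sigma_0^2 $ are given (all numerical values below are hypothetical, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 2.0, 1.0            # sigma^2 is assumed known; mu_true only generates toy data
mu_0, sigma_0 = 0.0, 2.0             # prior p(mu) ~ N(mu_0, sigma_0^2)

x = rng.normal(mu_true, sigma, size=20)   # samples x_1, ..., x_n
n, x_bar = len(x), float(x.mean())

# posterior p(mu | D) ~ N(mu_n, sigma_n^2), using the formulas above
mu_n = (n * sigma_0**2 / (n * sigma_0**2 + sigma**2)) * x_bar \
     + (sigma**2 / (n * sigma_0**2 + sigma**2)) * mu_0
sigma_n2 = (sigma_0**2 * sigma**2) / (n * sigma_0**2 + sigma**2)

print(mu_n, sigma_n2)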


Observation:

As $ n \to \infty $, $ \sigma_n \to 0 $, and $ p(\mu|D) $ becomes more and more sharply peaked around $ \mu_n $.

The Univariate Case: $ p(x|\mathcal{D}) $

Having obtained the posterior density $ p(\mu|\mathcal{D}) $ for the mean, the remaining task is to estimate the class-conditional density $ p(x|\mathcal{D}) $.

Based on Section 3.4.2 of Duda's text [2] and Prof. Boutin's lecture notes [1]:


$ p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu $
$ p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu $


$ p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}\right]f(\sigma,\sigma_n) $


where $ f(\sigma, \sigma_n) $ is defined as:


$ f(\sigma,\sigma_n) = \int \exp\left[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2}\left(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2}\right)^2\right]d\mu $

Hence, $ p(x|D) $ is normally distributed as:

$ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
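As a quick sanity check, the sketch below (Python; the numerical values of $ \mu_n $, $ \sigma $, and $ \sigma_n^2 $ are hypothetical stand-ins for the quantities computed above) compares the closed-form density $ N(\mu_n, \sigma^2 + \sigma_n^2) $ with a direct numerical evaluation of the integral $ \int p(x|\mu)\,p(\mu|\mathcal{D})\,d\mu $ at a test point.

import numpy as np

def normal_pdf(t, mean, var):
    # density of N(mean, var) evaluated at t
    return np.exp(-0.5 * (t - mean)**2 / var) / np.sqrt(2.0 * np.pi * var)

mu_n, sigma, sigma_n2 = 1.8, 1.0, 0.05    # hypothetical posterior quantities
x_test = 2.5

# numerical integration of p(x|mu) p(mu|D) over a fine grid of mu
mu_grid = np.linspace(mu_n - 10.0, mu_n + 10.0, 20001)
integrand = normal_pdf(x_test, mu_grid, sigma**2) * normal_pdf(mu_grid, mu_n, sigma_n2)
p_numeric = float(np.sum(integrand) * (mu_grid[1] - mu_grid[0]))

# closed-form predictive density N(mu_n, sigma^2 + sigma_n^2)
p_closed = float(normal_pdf(x_test, mu_n, sigma**2 + sigma_n2))

print(p_numeric, p_closed)   # the two values agree closely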


References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.

Questions and comments

If you have any questions, comments, etc. please post them on this page.


