Difference between revisions of "Bayersian Parameter Estimation" - Rhea

Revision as of 11:33, 22 April 2014

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

Introduction

Bayesian Parameter Estimation: General Theory

Fundamentals of Bayesian Estimation

As stated in Duda's book, the conceptual difference between maximum likelihood estimation (MLE) and Bayesian learning is that in MLE $\theta$ is a fixed but unknown vector, while in Bayesian estimation $\theta$ is treated as a random variable.

Given the samples $\mathcal{D}$, Bayes' formula gives the posterior class probability: $$P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)}$$

Furthermore, $p(x|D)$ can be computed as: $$p(x|D) = \int p(x|\theta)p(\theta|D)d\theta$$

\subsection{Bayesian Parameter Estimation: General Theory}
\paragraph{It is important to know that:}
\begin{enumerate}
\item The form of the density $p(x|\theta)$ is assumed to be known, but the value of the parameter vector $\theta$ is not known exactly.
\item The initial knowledge about $\theta$ is assumed to be contained in a known a priori density $p(\theta)$.
\item The rest of the knowledge about $\theta$ is contained in a set $\mathcal{D}$ of $n$ samples $x_1, x_2, \ldots, x_n$ drawn independently according to the unknown probability density $p(x)$.
\end{enumerate}

Accordingly, the two basic equations are $$p(x|D) = \int p(x|\theta)p(\theta|D)d\theta$$ and, by Bayes' theorem, $$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta}$$

Since the samples are drawn independently, the likelihood factorizes over the samples $x_k$: $$p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta)$$ Denote a set of $n$ samples by $\mathcal{D}^n = \{x_1, x_2, \ldots, x_n\}$. Then $$p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta)$$ Substituting this into Bayes' theorem gives the recursive form of Bayesian parameter estimation: $$p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta}$$
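The recursive update can be checked numerically. The following is a minimal sketch, assuming a scalar $\theta$ discretized on a grid, a Gaussian likelihood with known $\sigma = 1$, and a $N(0, 2^2)$ prior. The function names, grid limits, and sample values are illustrative assumptions and do not come from the lecture material.

```python
import numpy as np

def normal_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def recursive_posterior(samples, theta_grid, prior, likelihood):
    """Apply p(theta|D^n) proportional to p(x_n|theta) p(theta|D^(n-1)), one sample at a time."""
    dtheta = theta_grid[1] - theta_grid[0]
    posterior = prior / (prior.sum() * dtheta)  # p(theta|D^0) is the prior
    for x in samples:
        unnormalized = likelihood(x, theta_grid) * posterior
        # The denominator integral is approximated by a Riemann sum on the grid.
        posterior = unnormalized / (unnormalized.sum() * dtheta)
    return posterior

def likelihood(x, theta):
    """Assumed likelihood p(x|theta): Gaussian with known sigma = 1."""
    return normal_pdf(x, theta, 1.0)

# Hypothetical setup: grid over theta, N(0, 2^2) prior, four made-up samples.
theta_grid = np.linspace(-10.0, 10.0, 4001)
dtheta = theta_grid[1] - theta_grid[0]
prior = normal_pdf(theta_grid, 0.0, 2.0)
samples = np.array([1.2, 0.7, 1.9, 1.1])

posterior = recursive_posterior(samples, theta_grid, prior, likelihood)

# Batch posterior from p(D|theta) p(theta), for comparison with the recursion.
batch = prior * np.prod([likelihood(x, theta_grid) for x in samples], axis=0)
batch = batch / (batch.sum() * dtheta)

posterior_mean = (theta_grid * posterior).sum() * dtheta
print("recursive equals batch:", np.allclose(posterior, batch))
print("posterior mean on grid:", posterior_mean)
```

Processing the samples one at a time and processing them all at once give the same posterior, which is what the factorization $p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta)$ implies.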

\section{Extension: Investigation of Estimator's Accuracy of Bayesian Parameter Estimation: Gaussian Case}
\subsection{The Univariate Case: $p(\mu|\mathcal{D})$}
Assumptions: $$p(x|\mu) \sim N(\mu, \sigma^2)$$ $$p(\mu) \sim N(\mu_0, \sigma_0^2)$$
From the previous section, the following expression is easily obtained: $$p(\mu|D) = \alpha\, p(\mu) \prod_{k = 1}^n p(x_k|\mu)$$ where $\alpha$ is a normalization factor independent of $\mu$. Now substitute $p(x_k|\mu)$ and $p(\mu)$ with: $$p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right]$$ $$p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right]$$ Hence, $$p(\mu|D) = \alpha \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right] \prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right]$$

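Collecting the exponentials (the prior appears once, while the likelihood contributes one term per sample):

$$p(\mu|D) = \alpha \frac{1}{(2\pi\sigma_0^2)^{1/2}} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2} -\frac{1}{2}\sum_{k=1}^n\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right]$$

Absorb the constant factors into a new scaling factor $\beta$:

$$p(\mu|D) = \beta \exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2} -\frac{1}{2}\sum_{k=1}^n\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right]$$

Expand the squares and absorb every factor that does not depend on $\mu$ into a further scaling factor $\gamma$:

$$p(\mu|D) = \gamma \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 -2\left(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$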

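Since $p(\mu|D)$ is the exponential of a quadratic function of $\mu$, it is itself a Gaussian density and can be written as $$p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right]$$ Equating the coefficients of $\mu^2$ and $\mu$ in the two expressions gives $$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x}_n + \frac{\mu_0}{\sigma_0^2}$$ Solving for $\mu_n$ yields the estimate: $$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0$$ where $\bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k$ is the sample mean and $n$ is the sample size. The variance $\sigma_n^2$ associated with $\mu_n$ is: $$\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}$$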
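The closed-form expressions for $\mu_n$ and $\sigma_n^2$ translate directly into a few lines of code. This is a minimal sketch. The function name `gaussian_posterior` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_posterior(samples, sigma, mu0, sigma0):
    """Return (mu_n, sigma_n^2) for a N(mu, sigma^2) likelihood and N(mu0, sigma0^2) prior."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    x_bar = samples.mean()                       # sample mean
    denom = n * sigma0 ** 2 + sigma ** 2
    mu_n = (n * sigma0 ** 2 / denom) * x_bar + (sigma ** 2 / denom) * mu0
    var_n = sigma0 ** 2 * sigma ** 2 / denom
    return mu_n, var_n

# Hypothetical data: four samples, known sigma = 1, prior N(0, 2^2).
mu_n, var_n = gaussian_posterior([1.2, 0.7, 1.9, 1.1], sigma=1.0, mu0=0.0, sigma0=2.0)
print("mu_n =", mu_n, "sigma_n^2 =", var_n)
```

As $n$ grows, the weight on the sample mean approaches 1 and $\sigma_n^2$ shrinks roughly like $\sigma^2/n$, so the prior matters less and less.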

\subsection{The Univariate Case: $p(x|\mathcal{D})$} Having obtained the a posteriori density $p(\mu|\mathcal{D})$ for the mean, the remaining task is to estimate the ``class-conditional'' density $p(x|\mathcal{D})$.

Following the text by \textbf{Duda}, $$p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu$$ $$p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}\right]f(\sigma,\sigma_n)$$ where $f(\sigma, \sigma_n)$ is defined as: $$f(\sigma,\sigma_n) = \int \exp\left[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}\left(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2}\right)^2\right]d\mu$$ The integral $f(\sigma,\sigma_n)$ does not depend on $x$, so as a function of $x$ the density is proportional to $\exp\left[-\frac{1}{2}\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right]$. Hence, $p(x|D)$ is normally distributed: $$p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$
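The result $p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$ can be sanity-checked by evaluating the integral $\int p(x|\mu)p(\mu|\mathcal{D})d\mu$ numerically. The sketch below assumes illustrative values for $\mu_n$, $\sigma_n^2$, $\sigma$, and $x$. None of these numbers come from the source, and the helper names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (var is a variance, not a standard deviation)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def predictive_by_integration(x, mu_n, var_n, sigma, num_points=20001):
    """Approximate the integral of p(x|mu) p(mu|D) over mu with a Riemann sum."""
    half_width = 10.0 * np.sqrt(var_n)           # covers essentially all posterior mass
    mu_grid = np.linspace(mu_n - half_width, mu_n + half_width, num_points)
    dmu = mu_grid[1] - mu_grid[0]
    integrand = normal_pdf(x, mu_grid, sigma ** 2) * normal_pdf(mu_grid, mu_n, var_n)
    return integrand.sum() * dmu

# Assumed illustrative values.
mu_n, var_n, sigma = 1.15, 0.25, 1.0
x = 2.0

numeric = predictive_by_integration(x, mu_n, var_n, sigma)
closed_form = normal_pdf(x, mu_n, sigma ** 2 + var_n)
print("numeric:", numeric, "closed form:", closed_form)
```

The predictive variance $\sigma^2 + \sigma_n^2$ combines the noise in $x$ itself with the remaining uncertainty about $\mu$.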

\subsection{Experiment of Bayesian Parameter Estimation}

\paragraph{Design} Assume $n$ samples are drawn from a class with unknown mean $\mu$ and known $\sigma$, with $$p(x|\mu) \sim N(\mu, \sigma^2)$$ $$p(\mu) \sim N(\mu_0, \sigma_0^2)$$ where $\sigma = \sigma_0$ is held constant and $\mu_0 = 0$ (the particular value assumed for $\mu_0$ turns out to matter little, as verified shortly). The goal is to estimate $\mu$ from the sample data $x_i \in \mathcal{D}$, $i = 1,2,3,\ldots,n$.

The following results will be obtained: \begin{enumerate} \item The impact of $\mu_0$ on the estimate $\hat{\mu}$ \item The impact of the sample size $n$ on estimation accuracy \end{enumerate}
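An experiment of this kind can be sketched as a short simulation. The code below is not the code used to produce the figures. It assumes that the estimate $\hat{\mu}$ is the posterior mean $\mu_n$, and the true mean, the range of $\mu_0$ values, the sample size, and the random seed are all hypothetical choices.

```python
import numpy as np

def posterior_mean(samples, sigma, mu0, sigma0):
    """Posterior mean mu_n for a Gaussian likelihood with a Gaussian prior on the mean."""
    n = samples.size
    denom = n * sigma0 ** 2 + sigma ** 2
    return (n * sigma0 ** 2 * samples.mean() + sigma ** 2 * mu0) / denom

def run_experiment(true_mu, sigma, sigma0, mu0_values, n, trials, seed=0):
    """Return the mean and variance of the estimate over `trials` sample sets, per mu0."""
    rng = np.random.default_rng(seed)
    estimates = np.empty((len(mu0_values), trials))
    for t in range(trials):
        samples = rng.normal(true_mu, sigma, size=n)   # one sample set of size n
        for i, mu0 in enumerate(mu0_values):
            estimates[i, t] = posterior_mean(samples, sigma, mu0, sigma0)
    return estimates.mean(axis=1), estimates.var(axis=1)

# Assumed settings: sigma = sigma0 = 1, true mean 0, 50 repetitions, n = 20.
mu0_values = np.linspace(-2.0, 2.0, 9)
mean_est, var_est = run_experiment(true_mu=0.0, sigma=1.0, sigma0=1.0,
                                   mu0_values=mu0_values, n=20, trials=50)
for mu0, m, v in zip(mu0_values, mean_est, var_est):
    print(f"mu0={mu0:+.2f}  mean estimate={m:+.4f}  variance={v:.5f}")
```

Because every value of $\mu_0$ is evaluated on the same sample sets, differences between rows reflect only the prior mean and not sampling noise.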

\paragraph{Results} \begin{center} \includegraphics[scale=1]{BPE_1.png}

Figure 21. The impact of $\mu_0$ on estimated $\hat{\mu}$ averaged over 50 samples

\includegraphics[scale=1]{BPE_2.png}

Figure 22. The impact of $\mu_0$ on the variance of estimated $\hat{\mu}$ over 50 samples

\end{center}

\paragraph{Conclusion} The estimated mean shifts upward as $\mu_0$ increases. \textbf{Based on the experiment, the most accurate estimate is obtained when $\mu_0 = \mu$. According to the plot, however, even when $\mu_0$ differs from $\mu$, the estimation error remains acceptable (in our case, within the region $[-0.1, +0.06]$).} The variance of the estimated mean, on the other hand, can be assumed to be identical to that of the \textbf{real empirical mean}.
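To see why the prior mean has limited influence in this setup, the closed-form expressions for $\mu_n$ and $\sigma_n^2$ can be specialized to $\sigma = \sigma_0$. This is a short worked step under the assumption, not stated explicitly in the source, that the plotted estimate $\hat{\mu}$ is the posterior mean $\mu_n$: $$\mu_n = \frac{n}{n+1}\bar{x}_n + \frac{1}{n+1}\mu_0, \qquad \sigma_n^2 = \frac{\sigma^2}{n+1}$$ For a fixed sample set, changing $\mu_0$ by $\Delta$ shifts $\mu_n$ by only $\Delta/(n+1)$, so the prior acts like a single extra observation located at $\mu_0$.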

\begin{center} \includegraphics[scale=0.7]{e23456.png}

Figure 23. The impact of sample size $n$ on the accuracy of the estimated shape (sample sizes = 2,3,4,5,6)

\includegraphics[scale=0.6]{ece662_14.png}

Figure 24. The impact of sample size $n$ on the accuracy of the estimated shape (sample sizes = 4,10,20,50,100)

\end{center}

\paragraph{Conclusion} Figures 23 and 24 demonstrate that with an insufficient sample size, the estimated density predicts the distribution of the points poorly.

