=Lecture 7, [[ECE662]]: Decision Theory=
  
Lecture notes for [[ECE662:BoutinSpring08_Old_Kiwi|ECE662 Spring 2008]], Prof. [[user:mboutin|Boutin]].
  
Other lectures: [[Lecture 1 - Introduction_Old Kiwi|1]],
[[Lecture 2 - Decision Hypersurfaces_Old Kiwi|2]],
[[Lecture 3 - Bayes classification_Old Kiwi|3]],
[[Lecture 4 - Bayes Classification_Old Kiwi|4]],
[[Lecture 5 - Discriminant Functions_Old Kiwi|5]],
[[Lecture 6 - Discriminant Functions_Old Kiwi|6]],
[[Lecture 7 - MLE and BPE_Old Kiwi|7]],
[[Lecture 8 - MLE, BPE and Linear Discriminant Functions_Old Kiwi|8]],
[[Lecture 9 - Linear Discriminant Functions_Old Kiwi|9]],
[[Lecture 10 - Batch Perceptron and Fisher Linear Discriminant_Old Kiwi|10]],
[[Lecture 11 - Fischer's Linear Discriminant again_Old Kiwi|11]],
[[Lecture 12 - Support Vector Machine and Quadratic Optimization Problem_Old Kiwi|12]],
[[Lecture 13 - Kernel function for SVMs and ANNs introduction_Old Kiwi|13]],
[[Lecture 14 - ANNs, Non-parametric Density Estimation (Parzen Window)_Old Kiwi|14]],
[[Lecture 15 - Parzen Window Method_Old Kiwi|15]],
[[Lecture 16 - Parzen Window Method and K-nearest Neighbor Density Estimate_Old Kiwi|16]],
[[Lecture 17 - Nearest Neighbors Clarification Rule and Metrics_Old Kiwi|17]],
[[Lecture 18 - Nearest Neighbors Clarification Rule and Metrics(Continued)_Old Kiwi|18]],
[[Lecture 19 - Nearest Neighbor Error Rates_Old Kiwi|19]],
[[Lecture 20 - Density Estimation using Series Expansion and Decision Trees_Old Kiwi|20]],
[[Lecture 21 - Decision Trees(Continued)_Old Kiwi|21]],
[[Lecture 22 - Decision Trees and Clustering_Old Kiwi|22]],
[[Lecture 23 - Spanning Trees_Old Kiwi|23]],
[[Lecture 24 - Clustering and Hierarchical Clustering_Old Kiwi|24]],
[[Lecture 25 - Clustering Algorithms_Old Kiwi|25]],
[[Lecture 26 - Statistical Clustering Methods_Old Kiwi|26]],
[[Lecture 27 - Clustering by finding valleys of densities_Old Kiwi|27]],
[[Lecture 28 - Final lecture_Old Kiwi|28]]
----
----
== Lecture Content ==
  
* Maximum Likelihood Estimation and Bayesian Parameter Estimation
* Parametric Estimation of Class Conditional Density
  
== Relevant Links ==
*MLE: [[Maximum Likelihood Estimation_Old Kiwi| Maximum Likelihood Estimation]]
**[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi|Examples of MLE: Exponential and Geometric Distributions ]]
**[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi|Examples of MLE: Binomial and Poisson Distributions]]
*BPE: [[Bayesian Parameter Estimation_Old Kiwi|Bayesian Parameter Estimation]]
*[[Comparison of MLE and Bayesian Parameter Estimation_Old Kiwi|Comparison of MLE and Bayesian Parameter Estimation]]
*[[Parametric Estimators_Old Kiwi|Parametric Estimators]]
----
----
The class conditional density <math>p(\vec{x}|w_i)</math> can be estimated using training data. We denote the parameter to be estimated by <math>\vec{\theta}</math>. Two methods of estimation are discussed:
  
MLE: [[Maximum Likelihood Estimation_Old Kiwi| Maximum Likelihood Estimation]]

BPE: [[Bayesian Parameter Estimation_Old Kiwi|Bayesian Parameter Estimation]]
 
  
== Maximum Likelihood Estimation ==
  
Let "c" denote the number of classes. D, the entire collection of sample data. <math>D_1, \ldots, D_c</math> represent the classification of data into classes <math>\omega_1, \ldots, \omega_c</math>. It is assumed that:
* Samples in <math>D_i</math> give no information about the samples in <math>D_j, i \neq j</math>, and
* Each sample is drawn independently.
  
'''Example:''' The class conditional density <math>p(\vec{x}|w_i)</math> depends on the parameter <math>\vec{\theta_i}</math>. If <math>X \sim N(\mu,\sigma^2)</math> is the class conditional density, then <math>\vec{\theta}=[\mu,\sigma^2]</math>.
  
Let n be the size of the training sample, and let <math>D=\{\vec{X_1}, \ldots, \vec{X_n}\}</math>. Then,
  
<math>p(\vec{X}|\omega_i,\vec{\theta_i})</math> equals <math>p(\vec{X}|\vec{\theta})</math> for a single class.
The ''Likelihood Function'' is then defined as <math>p(D|\vec{\theta})=\displaystyle \prod_{k=1}^n p(\vec{X_k}|\vec{\theta})</math>, which needs to be maximized to obtain the parameter.
  
Since the logarithm is a monotonic function, maximizing the likelihood is the same as maximizing the log of the likelihood, which is defined as
<math>l(\vec{\theta})=log p(D|\vec{\theta})=\displaystyle log(\prod_{k=1}^n p(\vec{X_k}|\vec{\theta}))=\displaystyle \sum_{k=1}^n log(p(\vec{X_k}|\vec{\theta}))</math>.
  
 
"l" is the log likelihood function.
 
"l" is the log likelihood function.
  
Maximize the log likelihood function with respect to <math>\vec{\theta}</math>:
  
<math>\rightarrow \hat{\theta} = argmax \left( l (\vec{\theta}) \right)</math>
  
If <math>l(\vec{\theta})</math> is a differentiable function
  
Let <math>\vec{\theta} = \left[ \theta_1, \theta_2, \cdots , \theta_p \right]</math> be a 1 by p vector, then
  
<math>\nabla_{\vec{\theta}} = \left[ \frac{\partial}{\partial\theta_1} \frac{\partial}{\partial\theta_2} \cdots \frac{\partial}{\partial\theta_p} \right]^{t}</math>
 
Then, we can compute the first derivatives of the log likelihood function,
  
<math>\rightarrow \nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = \sum_{k=1}^{n} \nabla_{\vec{\theta}} \left[  log(p(\vec{x_k} | \vec{\theta})) \right]</math>
 
  
 
and equate this first derivative to zero:
  
<math>\rightarrow \nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = 0</math>
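A minimal numerical sketch of this recipe (not from the lecture), assuming the exponential model <math>p(x|\theta)=\theta e^{-\theta x}</math> treated in the linked MLE examples and simulated data; a grid search over <math>\theta</math> stands in for solving <math>\nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = 0</math> analytically, and the result is compared with the closed-form solution <math>\hat{\theta}=1/\bar{x}</math>.

<pre>
import numpy as np

rng = np.random.default_rng(1)
D = rng.exponential(scale=1.0 / 3.0, size=200)   # hypothetical samples, true rate theta = 3

def log_likelihood(theta, samples):
    # l(theta) = sum_k log( theta * exp(-theta * x_k) ) for the exponential model
    return np.sum(np.log(theta) - theta * samples)

# Grid search over theta stands in for solving the gradient equation analytically.
thetas = np.linspace(0.1, 10.0, 2000)
l_values = np.array([log_likelihood(t, D) for t in thetas])
theta_hat_grid = thetas[np.argmax(l_values)]

# Setting dl/dtheta = n/theta - sum_k x_k = 0 gives the closed form 1 / sample mean.
theta_hat_closed = 1.0 / np.mean(D)

print(theta_hat_grid, theta_hat_closed)   # the two estimates nearly coincide
</pre>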
  
== Example of the Gaussian case ==
  
Assume that the covariance matrix <math>\Sigma</math> is known.
  
<math>p(\vec{x_k} | \vec{\mu}) = \frac{1}{ \left( (2\pi)^{d} |\Sigma| \right)^{\frac{1}{2}}} exp \left[ - \frac{1}{2} (\vec{x_k} - \vec{\mu})^{t} \Sigma^{-1} (\vec{x_k} - \vec{\mu}) \right]</math>
  
'''Step 1: Take log'''
  
<math>log p(\vec{x_k} | \vec{\mu}) = -\frac{1}{2} log \left( (2\pi)^d |\Sigma| \right) - \frac{1}{2} (\vec{x_k} - \vec{\mu})^{t} \Sigma^{-1} (\vec{x_k} - \vec{\mu})</math>
  
'''Step 2: Take derivative'''
  
<math>\frac{\partial}{\partial\vec{\mu}} \left( log p(\vec{x_k} | \vec{\mu}) \right) = \frac{1}{2} \left[ (\vec{x_k} - \vec{\mu})^t \Sigma^{-1}\right]^t + \frac{1}{2} \left[ \Sigma^{-1} (\vec{x_k} - \vec{\mu}) \right] = \Sigma^{-1} (\vec{x_k} - \vec{\mu})</math>
  
'''Step 3: Equate to 0'''
  
<math>\sum_{k=1}^{n} \Sigma^{-1} (\vec{x_k} - \vec{\mu}) = 0</math>
  
<math>\rightarrow \Sigma^{-1} \sum_{k=1}^{n} (\vec{x_k} - \vec{\mu}) = 0</math>
  
<math>\rightarrow \Sigma^{-1} \left[ \sum_{k=1}^{n} \vec{x_k} - n \vec{\mu}\right] = 0</math>
  
<math>\Longrightarrow \hat{\vec{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \vec{x_k}</math>
 
  
 
This is the sample mean for a sample size n.
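The sketch below checks this result numerically for a two-dimensional Gaussian with a known covariance matrix (the numbers are made up for illustration): the sample mean is computed and the Step 3 condition <math>\sum_{k=1}^{n} \Sigma^{-1} (\vec{x_k} - \hat{\vec{\mu}}) = 0</math> is verified to hold up to floating-point error.

<pre>
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                           # known covariance matrix
X = rng.multivariate_normal(mu_true, Sigma, size=500)    # hypothetical training samples

mu_hat = X.mean(axis=0)                                  # the MLE derived above: the sample mean

# Step 3 check: the gradient sum_k Sigma^{-1} (x_k - mu_hat) should vanish at mu_hat.
gradient = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)

print(mu_hat)      # close to mu_true for large n
print(gradient)    # numerically the zero vector
</pre>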
 
  
[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi|Examples of MLE: Exponential and Geometric Distributions ]]
  
[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi|Examples of MLE: Binomial and Poisson Distributions]]
  
'''Advantages of MLE:'''
* Simple
* Converges
* Asymptotically unbiased (though biased for small N)
  
== Bayesian Parameter Estimation ==
  
 
For a given class, let <b><math>x</math></b> be a feature vector of the class and <b><math>\theta</math></b> be the parameter of the pdf of <b><math>x</math></b> to be estimated.
  
And let <b><math>D= \{  x_1, x_2, \cdots, x_n  \} </math></b>
, where <b><math>x_1, x_2, \cdots, x_n</math></b> are training samples of the class.
  
Note that <b><math>\theta</math></b> is a random variable with probability density <b><math>p(\theta)</math></b>.
  
<math>p(\mathbf{x} \vert D)=\displaystyle \int p(\mathbf{x} \vert \mathbf{\theta} ) p(\mathbf{\theta} \vert D) d \mathbf{\theta }</math>
  
where
  
<math>p(\mathbf{\theta} \vert D)=\frac {\displaystyle p(D \vert \mathbf{\theta} ) p(\mathbf{\theta} )} {\displaystyle  \int p(D \vert \mathbf{\theta} ) p(\mathbf{\theta}  ) d \mathbf{\theta } }</math>
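A minimal grid-based sketch of these two relations (not from the lecture), assuming a Gaussian model for <math>x</math> with known variance and a Gaussian prior on <math>\theta</math>; the prior, the data, and the grid are chosen only for illustration. The posterior is obtained by normalizing <math>p(D|\theta)p(\theta)</math>, and the class conditional density <math>p(x|D)</math> is approximated by numerical integration over the grid.

<pre>
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                                              # known variance of x
D = rng.normal(loc=0.8, scale=np.sqrt(sigma2), size=20)   # hypothetical training samples

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

theta = np.linspace(-4.0, 4.0, 2001)     # grid over the unknown mean
dtheta = theta[1] - theta[0]
prior = normal_pdf(theta, 0.0, 1.0)      # p(theta): a N(0,1) prior, chosen for illustration

# p(theta | D) is proportional to p(D | theta) p(theta); normalize by the integral over theta.
log_lik = np.array([np.sum(np.log(normal_pdf(D, t, sigma2))) for t in theta])
unnormalized = np.exp(log_lik - log_lik.max()) * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

# p(x | D) = integral of p(x | theta) p(theta | D) d(theta), approximated at one test point x.
x_new = 1.0
p_x_given_D = np.sum(normal_pdf(x_new, theta, sigma2) * posterior) * dtheta
print(p_x_given_D)
</pre>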
  
== EXAMPLE: Bayesian Inference for Gaussian Mean ==
 
The univariate case. The variance is assumed to be known.
 
 
Here's a summary of results:
 
  
* Univariate Gaussian density <math>p(x|\mu)\sim N(\mu,\sigma^{2})</math>
* Prior density of the mean <math>p(\mu)\sim N(\mu_{0},\sigma_{0}^{2})</math>
* Posterior density of the mean <math>p(\mu|D)\sim N(\mu_{n},\sigma_{n}^{2})</math>
where
* <math>\mu_{n}=\left(\frac{n\sigma_{0}^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\right)\hat{\mu}_{n}+\frac{\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\mu_{0}</math>
* <math>\sigma_{n}^{2}=\frac{\sigma_{0}^{2}\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}</math>
* <math>\hat{\mu}_{n}=\frac{1}{n}\sum_{k=1}^{n}x_{k}</math>
Finally, the class conditional density is given by <math>p(x|D)\sim N(\mu_{n},\sigma^{2}+\sigma_{n}^{2})</math>
The above formulas can be interpreted as follows: in making a prediction for a single new observation, the variance of the estimate will have two components:
1) <math>\sigma^{2}</math> - the inherent variance within the distribution of x, i.e. the variance that would never be eliminated even with perfect information about the underlying distribution model;
2) <math>\sigma_{n}^{2}</math> - the variance introduced by the estimation of the mean <math>\mu</math>; this component can be eliminated given exact prior information or a very large training set (as N goes to infinity).
  
<center>[[Image:BayesianInference_GaussianMean_small_Old Kiwi.jpg]]</center>
  
 
The above figure illustrates Bayesian inference for the mean of a Gaussian distribution, for which the variance is assumed to be known. The curves show the prior distribution over <math>\mu</math> (the curve labeled N=0), which in this case is itself Gaussian, along with the posterior distributions for an increasing number N of data points. The figure makes clear that as the number of data points increases, the posterior distribution peaks around the true value of the mean. This phenomenon is known as ''Bayesian learning''.
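The sketch below reproduces this behavior with the closed-form update given above, using made-up values for the prior and the known variance: as N grows, <math>\sigma_{n}^{2}</math> shrinks and <math>\mu_{n}</math> moves from the prior mean toward the sample mean.

<pre>
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2 = 0.8, 0.1      # sigma^2 is the known variance of x
mu0, sigma0_2 = 0.0, 0.1        # prior p(mu) ~ N(mu0, sigma0^2)
x = rng.normal(mu_true, np.sqrt(sigma2), size=100)

for n in (0, 1, 2, 10, 100):
    if n == 0:
        mu_n, sigma_n2 = mu0, sigma0_2        # no data: the posterior is just the prior
    else:
        mu_hat_n = np.mean(x[:n])             # sample mean of the first n observations
        w = n * sigma0_2 / (n * sigma0_2 + sigma2)
        mu_n = w * mu_hat_n + (1.0 - w) * mu0                     # posterior mean mu_n
        sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)    # posterior variance sigma_n^2
    print(n, round(mu_n, 3), round(sigma_n2, 4))
# As N grows, sigma_n^2 shrinks and mu_n concentrates near the true mean (Bayesian learning).
</pre>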
----
----
== For more information ==

*MLE: [[Maximum Likelihood Estimation_Old Kiwi| Maximum Likelihood Estimation]]
**[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi|Examples of MLE: Exponential and Geometric Distributions ]]
**[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi|Examples of MLE: Binomial and Poisson Distributions]]
*BPE: [[Bayesian Parameter Estimation_Old Kiwi|Bayesian Parameter Estimation]]
*[[Comparison of MLE and Bayesian Parameter Estimation_Old Kiwi|Comparison of MLE and Bayesian Parameter Estimation]]
*[[Parametric Estimators_Old Kiwi|Parametric Estimators]]
----
[[ECE662:BoutinSpring08_Old_Kiwi|Back to ECE662, Spring 2008, Prof. Boutin]]
  
[[Category:lecture notes]]
