(New page: =Notes for Lecture 11, ECE662, Spring 2010, Prof. Boutin= == Maximum Likelihood Estimation (MLE) == ---- '''General Principles''' Given vague knowledge about ...)
 
 
Line 1: Line 1:
=Notes for [[Lecture8ECE662S10|Lecture 11]], [[ECE662]], Spring 2010, Prof. Boutin=
+
=Notes for [[Lecture11ECE662S10|Lecture 11]], [[ECE662]], Spring 2010, Prof. Boutin=
  
 
== Maximum Likelihood Estimation (MLE) ==  
 
== Maximum Likelihood Estimation (MLE) ==  

Latest revision as of 09:17, 11 May 2010

Notes for Lecture 11, ECE662, Spring 2010, Prof. Boutin

Maximum Likelihood Estimation (MLE)


General Principles

Given vague knowledge about a situation and some training data (i.e. feature vector values for which the class is known) $ \vec{x}_l, \qquad l=1,\ldots,\text{hopefully large number} $

we want to estimate $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $

  1. Assume a parameter form for $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $
  2. Use training data to estimate the parameters of $ p(\vec{x}|\omega_i) $, e.g. if you assume $ p(\vec{x}|\omega_i)=\mathcal{N}(\mu,\Sigma) $, then need to estimate $ \mu $ and $ \Sigma $.
  3. Hope that as cardinality of training set increases, estimate for parameters converges to true parameters.


Let $ \mathcal{D}_i $ be the training set for class $ \omega_i, \qquad i=1,\ldots,k $. Assume elements of $ \mathcal{D}_i $ are i.i.d. with $ p(\vec{x}|\omega_i) $. Choose a parametric form for $ p(\vec{x}|\omega_i) $.

$ p(\vec{x}|\omega_i, \vec{\Theta}_i) $ where $ \vec{\Theta}_i $ are the parameters.


How to estimate $ \vec{\Theta}_i $?

  • Consider each class separately
    • $ \vec{\Theta}_i \to \vec{\Theta} $
    • $ \mathcal{D}_i \to \mathcal{D} $
    • Let $ N = \vert \mathcal{D}_i \vert $
    • samples $ \vec{x}_1,\vec{x}_2,\ldots,\vec{x}_N $ are independent

So $ p(\mathcal{D}|\vec{\Theta})=\prod_{j=1}^N(p(x_j|\vec{\Theta})) $

Definition: The maximum likelihood estimate of $ \vec{\Theta} $ is the value $ \hat{\Theta} $ that maximizes $ p(\mathcal{D}|\vec{\Theta}) $.

Observe. $ \hat{\Theta} $ also maximizes $ l(\Theta)=\text{ln} p(\mathcal{D}|\vec{\Theta}) = \sum_{l=1}^N \text{ln} p(\vec{x}_l|\vec{\Theta}) $ where ln is the logarithm base of natural number $ e $.

$ l(\Theta) $ called log likelihood.

$ \hat{\Theta}=\arg\max_\vec{\Theta}l(\vec{\Theta}) $

If $ p(\mathcal{D}|\vec{\Theta}) $ is a derivative function of $ \hat{\Theta} \in \mathbb{R}^p $, the max can be obtained by $ \bigtriangledown_\Theta $ descent. Data does not have to be normally distributed to use this technique.


Example 1: 1D Gaussian with unknown mean

$ p(\vec{x}|\vec{\Theta})=\mathcal{N}(\mu,\Sigma) $ where $ \mu $ is unknown and $ \Sigma $ is known.

Then $ \vec{\Theta}=\mu $. We have $ p(\mathcal{D}|\mu)=\prod_{l=1}^N(p(x_l|\mu)) $

$ \text{ln }p(\mathcal{D}|\mu)=\sum_{l=1}^N\text{ln }p(x_l|\mu) = \sum_{l=1}^N\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu) $

$ \Rightarrow \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \bigtriangledown_{\mu}(\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu)) $

$ \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \sum^{-1}(x_l - \mu) $

Looking for $ \hat{\mu} $ such that $ \bigtriangledown l(\mu)|_{\hat{\mu}}=0 $

$ \Longleftrightarrow \sum_{l=1}^N \sum^{-1}(x_l - \mu) = 0 $

$ \Longleftrightarrow \sum_{l=1}^N x_l - \sum_{l=1}^N \hat{\mu} = 0 $

$ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N x_l $

M.L. estimate for mean is the sample mean

Example 2: 1D Gaussian with unknown mean and variance

$ \hat{\Theta}=(\Theta_1,\Theta_2) \text{where} \Theta_1=\mu \text{and} \Theta_2=\Sigma=\sigma^2 $

$ \ln p(x_l|\mu,\sigma^2)=\frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2 $

$ \Rightarrow \ln p(\mathcal{D}|\mu,\sigma^2)=\sum_{l=1}^N \ln p(x_l|\mu,\sigma^2) = \sum_{l=1}^N \frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2 $

$ \bigtriangledown_{(\mu,\sigma^2)}\ln p(\mathcal{D}|\mu,\sigma^2)= \sum_{l=1}^N \left( \begin{array}{c} \frac{\partial}{\partial\mu} \ln p(\mathcal{D}|\mu,\sigma^2) \\ \frac{\partial}{\partial\sigma^2} \ln p(\mathcal{D}|\mu,\sigma^2) \\ \end{array}\right) = \sum_{l=1}^N \left( \begin{array}{c} \frac{1}{\sigma^2}(x_l-\mu) \\ \frac{-1}{\sigma^2}+\frac{(x_l-\mu)^2}{2\sigma^2} \end{array}\right) $

$ = \left( \begin{array}{c} \frac{1}{\sigma^2}(\sum_{l=1}^N x_l - N_{\mu}) \\ \frac{1}{2\sigma^2} (\sum_{l=1}^N \frac{(x_l-\mu)^2}{\sigma^2} - N_{\mu}) \\ \end{array}\right)= \left( \begin{array}{c} 0 \\ 0 \\ \end{array}\right) $

So we get $ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N \vec{x}_l $ and $ \hat{\Sigma}=\frac{1}{N}\sum_{l=1}^N(\vec{x_l}-\hat{\mu})(\vec{x}_l-\hat{\mu})^T $

Slight problem "bias" of $ \hat{\Sigma} $

The expected value of $ \hat{\Sigma} $ over all data sets of size $ N $ is not the true variance $ \Sigma $.

Claim: $ \mathbb{E}(\hat{\Sigma})=\frac{N-1}{N}\Sigma $

Proof in 1D:

$ \mathbb{E}(\hat{\Sigma})=\mathbb{E}(\frac{1}{N}\Sigma_{l=1}^N (x_l-\hat{\mu})^2) = \frac{1}{N}\Sigma_{l=1}^N \mathbb{E}(x_l x_l - 2x_l\hat{\mu} + \hat{\mu}\hat{\mu}) $

where $ \hat{\mu}=\Sigma_{m=1}^N \frac{x_m}{N} $

$ \mathbb{E}(\hat{\Sigma})= \frac{1}{N} \Sigma_{l=1}^N(\mathbb{E}(x_l x_l) - 2\mathbb{E}(x_l \Sigma_{m=1}^N \frac{x_m}{N}) + \mathbb{E}(\Sigma_{m=1}^N \frac{x_m}{N} \Sigma_{s=1}^N \frac{x_s}{N})) $

$ \mathbb{E}(\hat{\Sigma})= \frac{1}{N} \Sigma_{l=1}^N [\mathbb{E}(x_l x_l) - \frac{2}{N}\Sigma_{m=1}^N \mathbb{E}(x_l x_m) + \frac{1}{N^2} \Sigma_{m,s=1}^N \mathbb{E}(x_m x_s)] $

by independence, for $ l \neq m $

$ \mathbb{E}(x_l x_m) = \mathbb{E}(x_l)\mathbb{E}(x_m) = \mu \mu = \mu^2 $

if $ l = m \Rightarrow \mathbb{E}(x_l x_l) = \mathbb{E}(x_l^2) $ and similarly when $ l = m $

Write $ \mathbb{E}(x_l^2) = \mathbb{E}(x^2) $

$ \Rightarrow \mathbb{E}(\hat{\Sigma}) = \frac{1}{N} \Sigma_{l=1}^N [ \mathbb{E}(x^2) - \frac{2}{N} \Sigma_{l \neq m} \mu^2 - \frac{2}{N} \Sigma_{l=m} \mathbb{E}(x^2) + \frac{1}{N^2} \Sigma_{m \neq s} \mu^2 + \frac{1}{N^2} \Sigma_{m = s} \mathbb{E}(x^2)] $

$ = \frac{1}{N} \Sigma_{l=1}^N (1 - \frac{1}{N})\mathbb{E}(x^2) + (\frac{1}{N} - 1)\mu^2 $

$ = \frac{1}{N} \cdot N ((1 - \frac{1}{N})\mathbb{E}(x^2) + (\frac{1}{N} - 1)\mu^2) $

$ = \frac{N-1}{N}(\mathbb{E}(x^2)-\mu^2) = \frac{N-1}{N}(\mathbb{E}((x-\mu)^2)) = \frac{N-1}{N} \sigma^2 $


End of lecture 11

--Gmodeloh 13:37, 21 April 2010 (UTC)


Back to ECE662 Spring 2010, Prof. Boutin

Alumni Liaison

Ph.D. 2007, working on developing cool imaging technologies for digital cameras, camera phones, and video surveillance cameras.

Buyue Zhang