Revision as of 21:32, 5 May 2014
Expected Value of MLE estimate over standard deviation and expected deviation
A slecture by ECE student Zhenpeng Zhao
Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
1. Motivation
- The estimate most likely converges as the number of training samples increases.
- Simpler than alternative methods such as Bayesian techniques.
2. MLE as a Parametric Density Estimation
- Statistical Density Theory Context
- Given c classes + some knowledge about features $ x \in \mathbb{R}^n $ (or some other space)
- Given training data $ x_j\sim\rho(x)=\sum\limits_{i=1}^c\rho(x|w_i) Prob(w_i) $, where the class $ w_{ij} $ of each $ x_j $ is known, $ \forall{j}=1,...,N $ (N hopefully large enough)
- In order to make a decision, we need to estimate $ \rho(x|w_i) $ and $ Prob(w_i) $ $ \rightarrow $ use Bayes rule, or only $ \rho(x|w_i) $ $ \rightarrow $ use the Neyman-Pearson criterion
- To estimate these two quantities, use the training data.
- The parametric pdf/probability estimation problem
- Let $ D=\{x_1,x_2,...,x_N\} $, where each $ x_j $ is drawn independently from some probability law.
- Choose a parametric form $ \rho(x|\theta) $ for the pdf of x, or $ Prob(x|\theta) $ for the probability of x, where $ \theta $ is an unknown parameter vector
- Use $ D $ to estimate $ \theta $
- Definition: The maximum likelihood estimate of $ \theta $ is the value $ \hat{\theta} $ that maximizes $ \rho_D(D|\theta) $ if x is a continuous R.V., or $ Prob(D|\theta) $ if x is a discrete R.V.
- Observation: By independence, $ \rho_D(D|\theta)=\rho(x_1,x_2,...,x_N|\theta) $ = $ \prod\limits_{j=1}^N\rho(x_j|\theta) $
- Simple Example One:
Suppose we want to estimate the priors $ Prob(w_1), Prob(w_2) $ for $ c=2 $ classes.
Let $ Prob(w_1)=P $ $ \Rightarrow $ $ Prob(w_2)=1-P $, with $ P $ as the unknown parameter ($ \theta=P $).
Let $ w_{ij} $ be the class of $ x_j $, $ j\in\{1,2,...,N\} $.
$ Prob(D|P) $ = $ \prod\limits_{j=1}^N Prob(w_{ij}|P) $, $ x_j\sim \rho(x) $
=$ \prod\limits_{j=1}^{N_1} Prob(w_{ij}|P)\prod\limits_{j=1}^{N_2}Prob(w_{ij}|P) $
=$ P^{N_1}\cdot(1-P)^{N-N_1} $
where the first product runs over the $ N_1 $ samples with $ w_{ij}=w_1 $, the second over the $ N_2=N-N_1 $ samples with $ w_{ij}=w_2 $,
and $ N_1 $ is the number of samples from class 1. $ Prob(D|P) $ is infinitely differentiable with respect to $ P $, so a local maximum in the interior of $ [0,1] $ is where the derivative is 0.
$ \frac{d}{dP} Prob(D|P)=\frac{d}{dP} P^{N_1}(1-P)^{N-N_1} $
$ =N_1P^{N_1-1}(1-P)^{N-N_1}-(N-N_1)P^{N_1}(1-P)^{N-N_1-1} $
$ =P^{N_1-1}(1-P)^{N-N_1-1}[N_1(1-P)-(N-N_1)P]=0 $
$ \Rightarrow $ So either $ P=0 $, $ P=1 $, or $ N_1(1-P)-(N-N_1)P=0 $
The last condition gives $ N_1-NP=0 \Leftrightarrow P=\frac{N_1}{N} $
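The result $ P=\frac{N_1}{N} $ can be checked numerically. The following is a minimal sketch, not part of the lecture: the label list and the function names `log_likelihood` and `mle_prior` are hypothetical, and the grid search works on the logarithm of $ P^{N_1}(1-P)^{N-N_1} $, which has the same maximizer because the logarithm is monotonic.

```python
import math

def log_likelihood(p, n1, n):
    """Log of P^n1 * (1 - P)^(n - n1), defined for 0 < p < 1."""
    return n1 * math.log(p) + (n - n1) * math.log(1.0 - p)

def mle_prior(labels):
    """Closed-form MLE of Prob(w_1): the fraction of samples from class 1."""
    n1 = sum(1 for w in labels if w == 1)
    return n1 / len(labels)

# Hypothetical class labels for N = 10 training samples (1 = w_1, 2 = w_2).
labels = [1, 1, 2, 1, 2, 2, 1, 1, 2, 1]
n1 = labels.count(1)
p_hat = mle_prior(labels)

# Brute-force check: evaluate the log-likelihood on a grid of candidate P values.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, n1, len(labels)))

print(p_hat, p_grid)
```

Both numbers should agree up to the grid resolution, since the closed-form estimate is the unique interior maximizer.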
- Simple Example Two: Continuous R.V.: Estimate mean of Gaussian with Known $ \Sigma $
$ \rho(\vec{x}|\vec{\mu})=N(\vec{\mu},\Sigma) $, where $ \vec{\mu} $ is unknown and $ \Sigma $ is known.
$ \rho_D(D|\vec{\mu}) = \prod\limits_{j=1}^{N}\rho(x_j|\vec{\mu}) $
Observe that the MLE $ \hat{\theta} $ also maximizes $ \ln\rho_D(D|\theta) $, since the logarithm is monotonic.
$ \ln\rho_D(D|\vec{\mu}) $ = $ \sum\limits_{j=1}^{N}\ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp(-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2})) $
= $ \sum\limits_{j=1}^{N}(\ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}})-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}) $
which is infinitely differentiable with respect to $ \vec{\mu} $, so local maxima are where $ \nabla=0 $
compute $ \nabla $, $ \nabla_{\vec{\mu}}ln\rho_{D}(D|\vec{\mu}) $
=$ \sum\limits_{j=1}^{N}\nabla_{\vec{\mu}} (ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}) $ $ -\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}) $
=$ -1/2\sum\limits_{j=1}^{N}\nabla_{\vec{\mu}}[(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})] $
=$ -1/2\sum\limits_{j=1}^{N} \begin{bmatrix} \frac{\partial}{\partial\mu_1} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \frac{\partial}{\partial\mu_2} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ \frac{\partial}{\partial\mu_n} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix} $
But $ \frac{\partial}{\partial\mu_i} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu}) $
=$ (\frac{\partial}{\partial \mu_i}(x_j-\mu)^T)\Sigma^{-1} $ $ (x_j-\mu)+(x_j-\mu)^T\Sigma^{-1}\frac{\partial}{\partial \mu_i}(x_j-\mu) $
=$ 2(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T)\Sigma^{-1}(x_j-\mu) $, since $ \Sigma^{-1} $ is symmetric
=$ 2(0,0,0,...,-1,0,...,0)\Sigma^{-1}(x_j-\mu) $
=$ -2\vec{e_i}^{T}\Sigma^{-1}(x_j-\mu) $
so, $ \nabla{ln\rho_D(D|\mu)} = -1/2\sum\limits_{j=1}^{N} $ $ \begin{bmatrix} -2\vec{e_1}^{T}\Sigma^{-1}(x_j-{\mu})\\ -2\vec{e_2}^{T}\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ -2\vec{e_n}^{T}\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix} $
=$ \sum\limits_{j=1}^{N} $ $ \begin{bmatrix} \vec{e_1}^{T}\\ \vec{e_2}^{T}\\ \vdots \\ \vec{e_n}^{T}\\ \end{bmatrix} $ $ \Sigma^{-1}(x_j-\mu) $, where the $ \vec{e_i} $ are the standard basis vectors of the feature space, so the stacked matrix is the identity
=$ \sum\limits_{j=1}^{N}\Sigma^{-1}(x_j-\mu) $
=$ \Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu) $, which we set equal to 0
$ \Rightarrow \Sigma\Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu) = \Sigma \cdot 0 $
$ \Rightarrow \sum\limits_{j=1}^{N}(x_j-\mu) = 0 $
$ \Rightarrow \frac{1}{N}\sum\limits_{j=1}^{N}x_j = \mu $
$ \rightarrow $ the sample mean is the maximum likelihood estimate for $ \mu $
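As a numerical sanity check of this derivation, the sketch below evaluates the gradient $ \Sigma^{-1}\sum_{j}(x_j-\mu) $ at the sample mean and at another point. It is an illustration under stated assumptions: the 2-D data, the covariance matrix and the helper names (`sample_mean`, `inv2`, `grad_log_likelihood`) are all hypothetical.

```python
def sample_mean(points):
    """Component-wise average of a list of equal-length tuples."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def inv2(s):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def grad_log_likelihood(points, mu, sigma):
    """Gradient Sigma^{-1} * sum_j (x_j - mu) of the log-likelihood (2-D case)."""
    s_inv = inv2(sigma)
    resid = [sum(p[i] - mu[i] for p in points) for i in range(2)]
    return [s_inv[i][0] * resid[0] + s_inv[i][1] * resid[1] for i in range(2)]

# Hypothetical 2-D samples and a known (assumed) covariance matrix.
points = [(1.0, 2.0), (2.0, 0.0), (4.0, 3.0), (5.0, 7.0)]
sigma = [[2.0, 0.5], [0.5, 1.0]]

mu_hat = sample_mean(points)
grad_at_mle = grad_log_likelihood(points, mu_hat, sigma)
grad_elsewhere = grad_log_likelihood(points, [0.0, 0.0], sigma)

print(mu_hat)
print(grad_at_mle)      # vanishes at the sample mean
print(grad_elsewhere)   # generally non-zero away from it
```

Note that the value of $ \Sigma $ does not affect where the gradient vanishes, only its magnitude elsewhere, which matches the cancellation of $ \Sigma^{-1} $ in the derivation.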
- Example three: 1-D Gaussian, both $ \mu $ and $ \sigma^2 $ unknown
$ \theta = (\theta_1, \theta_2) = (\mu, \sigma^2) $
We have $ \ln\rho(x_k|\mu,\sigma^2) = $ $ \ln(\frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}) $
=$ -\frac{1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_k-\mu)^2 $
$ \ln\rho_D(D|\mu, \sigma^2) $
=$ \ln\prod\limits_{k=1}^{N}\rho(x_k|\mu,\sigma^2) $
=$ \sum\limits_{k=1}^{N}(-\frac{1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_k-\mu)^2) $
$ \nabla_{\mu,\sigma^2}\ln\rho_D(D|\mu,\sigma^2) $
=$ \begin{bmatrix} \frac{\partial}{\partial \mu}\ln\rho_D(D|\mu,\sigma^2)\\ \frac{\partial}{\partial \sigma^2}\ln\rho_D(D|\mu,\sigma^2)\\ \end{bmatrix} $
=$ \begin{bmatrix} \frac{\partial}{\partial \mu}(-\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2)\\ \frac{\partial}{\partial \sigma^2}(-\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2)\\ \end{bmatrix} $
=$ \begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}-\frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2\\ \end{bmatrix} $
=$ \begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+\frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2\\ \end{bmatrix} $, which we set equal to 0
From $ \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)=0 $ $ \Leftrightarrow $ $ \sum\limits_{k=1}^{N}x_k-N\mu=0 $
$ \Leftrightarrow \mu=\frac{1}{N}\sum\limits_{k=1}^{N}x_k $, which is the sample mean $ \hat{\mu} $.
From $ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}-\frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2=0 $ and $ \hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k $ $ \Rightarrow $ $ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+\frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2=0 \Leftrightarrow $
$ -\frac{N}{2}+\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2=0 \Leftrightarrow $
$ \frac{1}{2\sigma^2}=\frac{N}{2}\cdot \frac{1}{\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2} \Leftrightarrow $
$ \sigma^2 = \frac{1}{N}\cdot \sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2 $ =$ \hat{\sigma}^2 $, which is the MLE of $ \sigma^2 $
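The two closed-form estimates can be verified against the gradient of the log-likelihood. This is a minimal sketch under assumptions: the data set is made up for illustration and the function names (`gaussian_mle`, `log_likelihood`, `gradient`) are hypothetical; the formulas are the ones derived in this example.

```python
import math

def gaussian_mle(xs):
    """MLE of (mu, sigma^2) for a 1-D Gaussian: sample mean and 1/N variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

def log_likelihood(xs, mu, var):
    """ln rho_D(D | mu, sigma^2) for independent 1-D Gaussian samples."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

def gradient(xs, mu, var):
    """Partial derivatives of the log-likelihood w.r.t. mu and sigma^2."""
    n = len(xs)
    d_mu = sum(x - mu for x in xs) / var
    d_var = -n / (2 * var) + sum((x - mu) ** 2 for x in xs) / (2 * var ** 2)
    return d_mu, d_var

# Hypothetical 1-D training samples.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = gaussian_mle(xs)

print(mu_hat, var_hat)
print(gradient(xs, mu_hat, var_hat))  # both components vanish at the MLE
```

Perturbing either parameter away from `(mu_hat, var_hat)` and re-evaluating `log_likelihood` should give a smaller value.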
In general, when $ x\sim N(\vec{\mu}, \Sigma) $, $ x\in \mathbb{R}^n $, with $ \vec{\mu} $ and $ \Sigma $ unknown, the MLEs for $ \vec{\mu} $ and $ \Sigma $ are: $ \hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k $, $ \hat{\Sigma} = \frac{1}{N}\sum\limits_{k=1}^{N}(x_k-\hat{\mu})(x_k-\hat{\mu})^T $
$ \Sigma $ is non-singular, but $ \hat{\Sigma} $ can be singular $ \Rightarrow $ no inverse. This happens when the number of points $ N $ is smaller than the dimension $ n $ of the feature space ($ N<n $).
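The singular case can be illustrated with a small sketch. It assumes 3-D features ($ n=3 $) and made-up integer data, and uses exact rational arithmetic from the standard `fractions` module so that the determinant comes out exactly zero rather than merely small; `mle_covariance` and `det3` are hypothetical helper names.

```python
from fractions import Fraction

def mle_covariance(points):
    """MLE covariance (1/N) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T, in exact arithmetic."""
    n_pts = len(points)
    dim = len(points[0])
    mu = [sum(Fraction(p[i]) for p in points) / n_pts for i in range(dim)]
    return [[sum((Fraction(p[i]) - mu[i]) * (Fraction(p[j]) - mu[j]) for p in points) / n_pts
             for j in range(dim)]
            for i in range(dim)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

few = [(1, 0, 2), (3, 1, 0)]                                    # N = 2 < n = 3
many = [(1, 0, 2), (3, 1, 0), (0, 2, 1), (2, 3, 3), (4, 1, 1)]  # N = 5 > n = 3

det_few = det3(mle_covariance(few))
det_many = det3(mle_covariance(many))

print(det_few)    # zero: the estimate is singular and has no inverse
print(det_many)   # positive for this particular data set
```

With only two points, the centered samples span at most a line, so the estimated covariance cannot have full rank in three dimensions.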
What happens when we repeat the sampling and the estimation?
Sample: $ (x_1^i, x_2^i,...,x_N^i) \Rightarrow $ $ \hat{\mu}^i = \frac{1}{N}\sum\limits_{k=1}^{N}x_k^i $
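One way to explore this question is by simulation. The sketch below is an illustration only, not the lecture's analysis: it assumes a 1-D Gaussian with made-up parameters $ \mu=1 $, $ \sigma=2 $, draws many independent samples of size $ N=5 $, computes the MLE on each, and averages the estimates. The names `mle` and `average_estimates` are hypothetical.

```python
import random

def mle(xs):
    """MLE of (mu, sigma^2) for one 1-D sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

def average_estimates(mu, sigma, n, trials, seed=0):
    """Average the MLEs over `trials` independent samples of size `n`."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    mu_sum = 0.0
    var_sum = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        m, v = mle(xs)
        mu_sum += m
        var_sum += v
    return mu_sum / trials, var_sum / trials

# Assumed true parameters: mu = 1, sigma^2 = 4, with N = 5 points per sample.
avg_mu, avg_var = average_estimates(mu=1.0, sigma=2.0, n=5, trials=20000)
print(avg_mu, avg_var)
```

The averaged $ \hat{\mu}^i $ should land close to the true $ \mu $, while the averaged $ \hat{\sigma}^2 $ should come out below the true $ \sigma^2 $, near $ \frac{N-1}{N}\sigma^2 $, because the $ \frac{1}{N} $ variance estimate is biased for finite $ N $.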
Questions and comments
If you have any questions, comments, etc. please post them on https://kiwi.ecn.purdue.edu/rhea/index.php/ECE662Selecture_ZHenpengMLE_Ques.