
Latest revision as of 10:51, 22 January 2015


Expected Value of MLE estimate over standard deviation and expected deviation

A slecture by ECE student Zhenpeng Zhao

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.




1. Motivation

  • The estimate most likely converges to the true parameter as the number of training samples increases.
  • Simpler than alternative methods such as Bayesian techniques.



2. MLE as a Parametric Density Estimation

  • Statistical Density Theory Context
    • Given c classes + some knowledge about features $ x \in \mathbb{R}^n $ (or some other space)
    • Given training data, $ x_j\sim\rho(x)=\sum\limits_{i=1}^c\rho(x|w_i) Prob(w_i) $, where the class $ w_{ij} $ of each $ x_j $ is known, $ \forall{j}=1,...,N $ (N hopefully large enough)
    • In order to make a decision, we need to estimate $ \rho(x|w_i) $ and $ Prob(w_i) $ $ \rightarrow $ use Bayes rule, or only $ \rho(x|w_i) $ $ \rightarrow $ use the Neyman-Pearson criterion
    • Both quantities are estimated from the training data.
  • The parametric pdf/probability estimation problem
    • Let $ D=\{x_1,x_2,...,x_N\} $, where each $ x_j $ is drawn independently from some probability law.
    • Choose a parametric form $ \rho(x|\theta) $ for the pdf of x, or $ Prob(x|\theta) $ for the probability of x, where $ \theta $ is an unknown parameter vector
    • Use $ D $ to estimate $ \theta $
  • Definition: The maximum likelihood estimate of $ \theta $ is the value $ \hat{\theta} $ that maximizes $ \rho_D(D|\theta) $ if x is a continuous R.V., or $ Prob(D|\theta) $ if x is a discrete R.V.
  • Observation: By independence, $ \rho_D(D|\theta)=\rho(x_1,x_2,...,x_N|\theta) $ = $ \prod\limits_{j=1}^N\rho(x_j|\theta) $
    • Simple Example One:

Suppose we want to estimate the priors $ Prob(w_1), Prob(w_2) $ for $ c=2 $ classes.

Let $ Prob(w_1)=P $ $ \Rightarrow $ $ Prob(w_2)=1-P $, so that $ P $ is the unknown parameter ($ \theta=P $).

Let $ w_{ij} $ be the class of $ x_j $, $ j\in\{1,2,...,N\} $, with $ x_j\sim \rho(x) $. Then

$ Prob(D|P) $ = $ \prod\limits_{j=1}^N Prob(w_{ij}|P) $

=$ \prod\limits_{j:\, w_{ij}=w_1} Prob(w_{ij}|P)\prod\limits_{j:\, w_{ij}=w_2}Prob(w_{ij}|P) $

=$ P^{N_1}\cdot(1-P)^{N-N_1} $

where the first product runs over the samples with $ w_{ij}=w_1 $, the second over those with $ w_{ij}=w_2 $, and $ N_1 $ is the number of samples from class 1.

$ Prob(D|P) $ is infinitely differentiable in $ P $, so a local maximum in the interior of $ [0,1] $ occurs where the derivative is 0.

$ \frac{d}{dP} Prob(D|P)=\frac{d}{dP} P^{N_1}(1-P)^{N-N_1} $

$ =N_1P^{N_1-1}(1-P)^{N-N_1}-(N-N_1)P^{N_1}(1-P)^{N-N_1-1} $

$ =P^{N_1-1}(1-P)^{N-N_1-1}[N_1(1-P)-(N-N_1)P]=0 $

$ \Rightarrow $ So either $ P=0 $, or $ P=1 $, or $ N_1(1-P)-(N-N_1)P=0 $. When $ 0<N_1<N $ the likelihood is 0 at $ P=0 $ and $ P=1 $, so the maximum is at the third solution:

$ N_1-NP=0 \Leftrightarrow \hat{P}=\frac{N_1}{N} $
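
The result of Example One can be checked numerically. The following is a minimal sketch, not part of the original lecture: the label list and the function names `likelihood` and `mle_prior` are hypothetical, chosen only for illustration. It compares the closed-form estimate $ N_1/N $ with a brute-force search over a grid of candidate values of $ P $.

```python
from fractions import Fraction

def likelihood(p, n1, n):
    """Prob(D|P) = P^N1 * (1-P)^(N-N1) for N1 class-1 labels out of N."""
    return p ** n1 * (1 - p) ** (n - n1)

def mle_prior(labels):
    """Closed-form MLE of Prob(w_1): the fraction of samples labeled 1."""
    n1 = sum(1 for w in labels if w == 1)
    return Fraction(n1, len(labels))

# Hypothetical labels: 1 means class w_1, 2 means class w_2.
labels = [1, 2, 1, 1, 2, 1, 2, 1, 1, 2]
p_hat = mle_prior(labels)

# Independent check: evaluate the likelihood on the grid P = 0, 0.01, ..., 1
# and keep the maximizer.
n1, n = labels.count(1), len(labels)
grid = [Fraction(k, 100) for k in range(101)]
p_grid = max(grid, key=lambda p: likelihood(p, n1, n))

print(p_hat, p_grid)  # prints "3/5 3/5"
```

Exact fractions are used so that the comparison between the two estimates does not depend on floating-point rounding.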



    • Simple Example Two: Continuous R.V.: Estimate mean of Gaussian with Known $ \Sigma $

$ \rho(\vec{x}|\vec{\mu})=N(\vec{\mu},\Sigma) $, where $ \vec{\mu} $ is unknown and $ \Sigma $ is known.

$ \rho_D(D|\vec{\mu}) = \prod\limits_{j=1}^{N}\rho(x_j|\vec{\mu}) $

Observe that the MLE $ \hat{\theta} $ also maximizes $ \ln\rho_D(D|\theta) $, since the logarithm is monotonically increasing.

$ \ln\rho_D(D|\vec{\mu}) $ = $ \sum\limits_{j=1}^{N}\ln\left(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}\right)\right) $

= $ \sum\limits_{j=1}^{N}\left[\ln\left(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\right) -\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}\right] $

which is infinitely differentiable with respect to $ \vec{\mu} $, so local maxima occur where $ \nabla_{\vec{\mu}}=0 $.

Compute the gradient $ \nabla_{\vec{\mu}}\ln\rho_{D}(D|\vec{\mu}) $

=$ \sum\limits_{j=1}^{N}\nabla_{\vec{\mu}} \left[\ln\left(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\right) -\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}\right] $

=$ -\frac{1}{2}\sum\limits_{j=1}^{N}\nabla_{\vec{\mu}}[(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})] $

=$ -1/2\sum\limits_{j=1}^{N} \begin{bmatrix} \frac{\partial}{\partial\mu_1} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \frac{\partial}{\partial\mu_2} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ \frac{\partial}{\partial\mu_n} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix} $

But $ \frac{\partial}{\partial\mu_i} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu}) $

=$ \left(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T\right)\Sigma^{-1}(x_j-\mu)+(x_j-\mu)^T\Sigma^{-1}\frac{\partial}{\partial \mu_i}(x_j-\mu) $

=$ 2\left(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T\right)\Sigma^{-1}(x_j-\mu) $ (the two terms are equal because $ \Sigma^{-1} $ is symmetric)

=$ 2(0,0,...,0,-1,0,...,0)\Sigma^{-1}(x_j-\mu) $ (with the $ -1 $ in position $ i $)

=$ -2\vec{e_i}^{T}\Sigma^{-1}(x_j-\mu) $

so, $ \nabla_{\vec{\mu}}{\ln\rho_D(D|\mu)} = -\frac{1}{2}\sum\limits_{j=1}^{N} $ $ \begin{bmatrix} -2\vec{e_1}^{T}\Sigma^{-1}(x_j-{\mu})\\ -2\vec{e_2}^{T}\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ -2\vec{e_n}^{T}\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix} $

=$ \sum\limits_{j=1}^{N} $ $ \begin{bmatrix} \vec{e_1}^{T}\\ \vec{e_2}^{T}\\ \vdots \\ \vec{e_n}^{T}\\ \end{bmatrix} $ $ \Sigma^{-1}(x_j-\mu) $, where $ \vec{e_i} $ is the $ i $-th standard basis vector of the feature space, so the stacked matrix is the identity

=$ \sum\limits_{j=1}^{N}\Sigma^{-1}(x_j-\mu) $

=$ \Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu) $, which is set to 0

$ \Rightarrow \Sigma\Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu) = \Sigma \cdot 0 $

$ \Rightarrow \sum\limits_{j=1}^{N}(x_j-\mu) = 0 $

$ \Rightarrow \hat{\mu} = \frac{1}{N}\sum\limits_{j=1}^{N}x_j $

$ \rightarrow $ the sample mean is the maximum likelihood estimate for $ \mu $
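
As a rough numerical illustration of Example Two, the sketch below evaluates the Gaussian log-likelihood for a small 2-D data set with a known covariance and checks that shifting $ \mu $ away from the sample mean lowers it. The data, the covariance matrix, the shifts, and the helper names `log_likelihood` and `sample_mean` are all assumptions made for this example; they do not come from the lecture.

```python
import math

def log_likelihood(data, mu, sigma):
    """ln rho_D(D|mu) for 2-D Gaussian samples with a known 2x2 covariance."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of a 2x2 matrix
    # ln(1 / ((2*pi)^(n/2) * |Sigma|^(1/2))) with n = 2
    const = -math.log(2 * math.pi) - 0.5 * math.log(det)
    total = 0.0
    for x in data:
        r0, r1 = x[0] - mu[0], x[1] - mu[1]
        quad = (r0 * (inv[0][0] * r0 + inv[0][1] * r1)
                + r1 * (inv[1][0] * r0 + inv[1][1] * r1))
        total += const - 0.5 * quad
    return total

def sample_mean(data):
    """Component-wise average of the 2-D samples."""
    n = len(data)
    return (sum(x[0] for x in data) / n, sum(x[1] for x in data) / n)

# Hypothetical data and known covariance.
data = [(1.0, 2.0), (2.0, 0.5), (0.0, 1.5), (3.0, 3.0)]
sigma = ((2.0, 0.5), (0.5, 1.0))

mu_hat = sample_mean(data)
best = log_likelihood(data, mu_hat, sigma)

# Log-likelihood at a few points shifted away from the sample mean.
shifts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.3, -0.2)]
perturbed = [log_likelihood(data, (mu_hat[0] + dx, mu_hat[1] + dy), sigma)
             for dx, dy in shifts]

print(mu_hat, all(best > v for v in perturbed))
```

A check at a handful of shifted points is only a sanity test, not a proof; the derivation above is what establishes that the sample mean is the maximizer.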




    • Example Three: 1-D Gaussian, both $ \mu $ and $ \sigma^2 $ unknown

$ \theta = (\theta_1, \theta_2) = (\mu, \sigma^2) $

We have $ \ln\rho(x_k|\mu,\sigma^2) = $ $ \ln\left(\frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}\right) $

=$ -\frac{1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_k-\mu)^2 $

so that $ \ln\rho_D(D|\mu, \sigma^2) $

=$ \ln\prod\limits_{k=1}^{N}\rho(x_k|\mu,\sigma^2) $

=$ \sum\limits_{k=1}^{N}\left(-\frac{1}{2}\ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}(x_k-\mu)^2\right) $

The gradient is $ \nabla_{\mu,\sigma^2}\ln\rho_D(D|\mu,\sigma^2) $

=$ \begin{bmatrix} \frac{\partial}{\partial \mu}\ln\rho_D(D|\mu,\sigma^2)\\ \frac{\partial}{\partial \sigma^2}\ln\rho_D(D|\mu,\sigma^2)\\ \end{bmatrix} $

=$ \begin{bmatrix} \frac{\partial}{\partial \mu}\left(-\frac{N}{2}\ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2\right)\\ \frac{\partial}{\partial \sigma^2}\left(-\frac{N}{2}\ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2\right)\\ \end{bmatrix} $

=$ \begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}- \frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2 \end{bmatrix} $

=$ \begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+ \frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2 \end{bmatrix} $, which is set to 0

From the first component, $ \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)=0 $ $ \Leftrightarrow $ $ \sum\limits_{k=1}^{N}x_k-N\mu=0 $

$ \Leftrightarrow \hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k $, which is the sample mean.

From the second component, $ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}- $ $ \frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2=0 $, with $ \mu=\hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k $ $ \Rightarrow $

$ -\frac{N}{2}\cdot\frac{1}{\sigma^2}+ $ $ \frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2=0 \Leftrightarrow $

$ -\frac{N}{2}+\frac{1}{2\sigma^2} $ $ \sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2=0 \Leftrightarrow $

$ \frac{1}{2\sigma^2}= $ $ \frac{N}{2}\cdot \frac{1}{\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2} $ $ \Leftrightarrow $

$ \sigma^2 = \frac{1}{N}\cdot \sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2 $ =$ \hat{\sigma^2} $, which is the MLE of $ \sigma^2 $
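
The two estimates of Example Three can be sketched in a few lines. The code below is an illustration under assumed data (the list `xs` and the function names `gaussian_mle` and `gradient` are hypothetical). It computes $ \hat{\mu} $ and $ \hat{\sigma^2} $ and then evaluates both components of the gradient derived above at those values, which should vanish.

```python
def gaussian_mle(xs):
    """MLE of (mu, sigma^2) for 1-D Gaussian samples: sample mean and 1/N variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

def gradient(xs, mu, var):
    """Gradient of ln rho_D(D|mu, sigma^2) with respect to (mu, sigma^2)."""
    n = len(xs)
    d_mu = sum(x - mu for x in xs) / var
    d_var = -n / (2 * var) + sum((x - mu) ** 2 for x in xs) / (2 * var ** 2)
    return d_mu, d_var

# Hypothetical samples.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = gaussian_mle(xs)
g_mu, g_var = gradient(xs, mu_hat, var_hat)

print(mu_hat, var_hat)  # prints "5.0 4.0"
print(g_mu, g_var)      # both components are 0.0 at the MLE
```

Note that the variance estimate divides by $ N $, not $ N-1 $, exactly as in the derivation.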

In general, when $ x\sim N(\vec{\mu}, \Sigma) $, $ x\in \mathbb{R}^n $, with $ \vec{\mu} $ and $ \Sigma $ unknown, the MLEs for $ \vec{\mu} $ and $ \Sigma $ are: $ \hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k $, $ \hat{\Sigma} = \frac{1}{N}\sum\limits_{k=1}^{N} $ $ (x_k-\hat{\mu})(x_k-\hat{\mu})^T $

$ \Sigma $ is nonsingular, but $ \hat{\Sigma} $ can be singular $ \Rightarrow $ no inverse. This happens, for example, when the number of points $ N $ is smaller than the dimension $ n $ of the feature space.
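
The singularity of $ \hat{\Sigma} $ for small $ N $ can be seen on a toy case. The sketch below is an assumption-laden illustration (the points and the helpers `mle_covariance` and `det3` are made up for this purpose): with $ N=2 $ points in $ n=3 $ dimensions the estimated covariance has determinant exactly 0, while four points in general position give a positive determinant.

```python
from fractions import Fraction

def mle_covariance(points):
    """MLE covariance (1/N) * sum (x_k - mean)(x_k - mean)^T, in exact arithmetic."""
    n_pts, dim = len(points), len(points[0])
    mean = [sum(Fraction(p[i]) for p in points) / n_pts for i in range(dim)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n_pts
             for j in range(dim)] for i in range(dim)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical data sets in R^3.
few = [(1, 2, 3), (4, 6, 8)]                         # N = 2 < n = 3
many = [(1, 2, 3), (4, 6, 8), (0, 1, 5), (2, 0, 1)]  # N = 4 > n = 3

det_few = det3(mle_covariance(few))
det_many = det3(mle_covariance(many))

print(det_few == 0, det_many > 0)  # prints "True True"
```

Exact fractions avoid the situation where floating-point rounding makes a singular matrix look merely ill-conditioned.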

What happens when we repeat the sampling and the estimation?

Sample: $ (x_1^i, x_2^i,...,x_N^i) \Rightarrow $ $ \hat{\mu}^i = \frac{1}{N}\sum\limits_{k=1}^{N}x_k^i $


    • $ E(\hat{\mu})=? $

We have $ E(\hat{\mu})= E\left(\frac{1}{N}\sum\limits_{k=1}^{N}x_k\right) $ $ =\frac{1}{N}\sum\limits_{k=1}^{N}E(x_k)= $ $ \frac{1}{N}\sum\limits_{k=1}^{N}\mu = \mu $, so $ \hat{\mu} $ is an unbiased estimate of $ \mu $.

But how far do we expect $ \hat{\mu} $ to deviate from the mean?

$ E(|\hat{\mu}-\mu|^2) = E((\hat{\mu}-\mu)(\hat{\mu}-\mu)) $ $ =E(\hat{\mu}\cdot\hat{\mu}-\hat{\mu}\cdot{\mu} $ $ -{\mu}\cdot\hat{\mu}+{\mu}\cdot{\mu}) $

$ =E(\hat{\mu}\cdot\hat{\mu})-2\cdot \mu E(\hat{\mu})+\mu \cdot \mu $

$ =E(\hat{\mu}\cdot\hat{\mu})-\mu\cdot\mu $

$ =E\left(\frac{1}{N}\sum\limits_{k=1}^{N}x_k \cdot \frac{1}{N}\sum\limits_{j=1}^{N}x_j\right)-\mu\cdot\mu $

$ =\frac{1}{N^2}\sum\limits_{k=1}^{N}\sum\limits_{j=1}^{N}E(x_k \cdot x_j)-\mu\cdot\mu $

$ =\frac{1}{N^2}\left[\sum\limits_{k,j=1,k\neq j}^{N} E(x_k )\cdot E(x_j)+\sum\limits_{k=1}^{N} E(x_k \cdot x_k)\right]-\mu\cdot\mu $ (by independence, $ E(x_k \cdot x_j)=E(x_k)\cdot E(x_j) $ for $ k\neq j $)

$ =\frac{1}{N^2}\left[N\cdot (N-1)\mu\cdot \mu+ \sum\limits_{k=1}^{N}E(x_k^2)\right]-\mu\cdot\mu $

$ =-\frac{1}{N}\mu\cdot\mu+\frac{1}{N^2}\sum\limits_{k=1}^{N}E(x_k^2) $

By $ E[(x-\mu)(x-\mu)] = \sigma^2 \Rightarrow $ $ E(x \cdot x)-\mu^2 = \sigma^2 \Rightarrow $ $ E(x \cdot x) = \sigma^2+\mu^2 $


So: $ E(|\hat{\mu}-\mu|^2) = -\frac{1}{N}\mu \cdot \mu + $ $ \frac{1}{N}(\sigma^2+\mu \cdot \mu) = \frac{1}{N}\sigma^2 $
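
The result $ E(|\hat{\mu}-\mu|^2)=\sigma^2/N $ can be illustrated by repeated sampling. The following is a Monte Carlo sketch under assumed values ($ \mu=3 $, $ \sigma=2 $, $ N=10 $, 20000 repetitions, and a fixed seed; the function name `mean_squared_deviation` is hypothetical). The simulated average is only an approximation of the expectation, so it should be close to, not equal to, the theoretical value.

```python
import random

def mean_squared_deviation(mu, sigma, n, trials, seed=0):
    """Average of (mu_hat - mu)^2 over many independent samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(trials):
        # Draw one data set of size n and compute its sample mean.
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (mu_hat - mu) ** 2
    return total / trials

# Assumed true parameters and sample size.
mu, sigma, n = 3.0, 2.0, 10
estimate = mean_squared_deviation(mu, sigma, n, trials=20000)
theory = sigma ** 2 / n  # sigma^2 / N from the derivation above

print(estimate, theory)
```

Increasing `n` shrinks both numbers in proportion to $ 1/N $, which is the practical meaning of the formula: averaging more samples makes the estimated mean more reliable.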



  • Bias: The maximum likelihood estimate of the variance $ \sigma^2 $ is biased, meaning that the expected value of the sample variance over all data sets of size n is not equal to the true variance:

$ E\left[\frac{1}{n}\sum\limits_{k=1}^{n}(x_k-\bar{x})^2\right] = \frac{n-1}{n} $ $ \sigma^2 \neq \sigma^2 $

However, as n $ \rightarrow \infty $ the factor $ \frac{n-1}{n} $ tends to 1, so the expected value of the MLE of $ \sigma^2 $ approaches $ \sigma^2 $.
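
The bias factor $ \frac{n-1}{n} $ can be verified exactly on a small discrete distribution, because the identity only needs i.i.d. samples with finite variance and does not depend on the Gaussian assumption. The sketch below makes that substitution explicitly: it uses a uniform distribution on three hypothetical values instead of a Gaussian, and enumerates every possible data set of size $ n=4 $. The function name `expected_mle_variance` is an assumption for illustration.

```python
from fractions import Fraction
from itertools import product

def expected_mle_variance(values, n):
    """Exact E[(1/n) * sum (x_k - xbar)^2] for x_k i.i.d. uniform on `values`."""
    total = Fraction(0)
    count = 0
    # Every ordered data set of size n is equally likely.
    for data in product(values, repeat=n):
        xbar = Fraction(sum(data), n)
        total += sum((x - xbar) ** 2 for x in data) / n
        count += 1
    return total / count

# Hypothetical support of the distribution.
values = [0, 1, 3]
m = Fraction(sum(values), len(values))
true_var = sum((v - m) ** 2 for v in values) / len(values)  # sigma^2 = 14/9

n = 4
expected = expected_mle_variance(values, n)

print(true_var, expected)  # prints "14/9 7/6"
```

Here $ 7/6 = \frac{3}{4}\cdot\frac{14}{9} $, in agreement with the factor $ \frac{n-1}{n} $ for $ n=4 $.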



Questions and comments

If you have any questions, comments, etc. please post them on https://kiwi.ecn.purdue.edu/rhea/index.php/ECE662Selecture_ZHenpengMLE_Ques.

