Expected Value of MLE estimate over standard deviation and expected deviation

A slecture by ECE student Zhenpeng Zhao

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

### 1. Motivation

• The estimate generally converges as the number of training samples increases.
• Simpler than alternative methods such as Bayesian techniques.

### 2. MLE as a Parametric Density Estimation

• Statistical Density Theory Context
• Given c classes + some knowledge about features $x \in \mathbb{R}^n$ (or some other space)
• Given training data, $x_j\sim\rho(x)=\sum\limits_{i=1}^c\rho(x|w_i) Prob(w_i)$, where the class $w_{ij}$ of each $x_j$ is known, $\forall{j}=1,...,N$ (N hopefully large enough)
• In order to make a decision, we need to estimate $\rho(x|w_i)$ and $Prob(w_i)$ $\rightarrow$ use Bayes rule, or only $\rho(x|w_i)$ $\rightarrow$ use the Neyman-Pearson criterion
• To estimate these quantities, use the training data.
• The parametric pdf/probability estimation problem
• Let $D=\{x_1,x_2,...,x_N\}$, where each $x_j$ is drawn independently from some probability law.
• Choose a parametric form $\rho(x|\theta)$ for the pdf of x, or $Prob(x|\theta)$ for the probability of x, where $\theta$ is an unknown parameter vector
• Use $D$ to estimate $\theta$
• Definition: The maximum likelihood estimate of $\theta$ is the value $\hat{\theta}$ that maximizes $\rho_D(D|\theta)$ if x is a continuous R.V., or $Prob(D|\theta)$ if x is a discrete R.V.
• Observation: By independence, $\rho_D(D|\theta)=\rho(x_1,x_2,...,x_N|\theta)$ = $\prod\limits_{j=1}^N\rho(x_j|\theta)$
• Simple Example One:

We want to estimate the priors $Prob(w_1), Prob(w_2)$ for $c=2$ classes.

Let $Prob(w_1)=P$ $\Rightarrow$ $Prob(w_2)=1-P$, with $P$ the unknown parameter ($\theta=P$).

Let $w_{ij}$ be the class of $x_j$, $j\in\{1,2,...,N\}$.

$Prob(D|P)$ = $\prod\limits_{j=1}^N Prob(w_{ij}|P)$, $x_j\sim \rho(x)$

=$\prod\limits_{j:\, w_{ij}=w_1} Prob(w_{ij}|P)\prod\limits_{j:\, w_{ij}=w_2}Prob(w_{ij}|P)$

=$P^{N_1}\cdot(1-P)^{N-N_1}$,

where the first product runs over the samples with $w_{ij}=w_1$, the second over the samples with $w_{ij}=w_2$, and $N_1$ = number of samples from class 1.

$Prob(D|P)$ is infinitely differentiable in $P$, so a local max in the interior $0<P<1$ is where the derivative = 0.

$\frac{d}{dP} Prob(D|P)=\frac{d}{dP} P^{N_1}(1-P)^{N-N_1}$

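$=N_1P^{N_1-1}(1-P)^{N-N_1}-(N-N_1)P^{N_1}(1-P)^{N-N_1-1}$

$=P^{N_1-1}(1-P)^{N-N_1-1}[N_1(1-P)-(N-N_1)P]=0$

$\Rightarrow$ So either $P=0$, or $P=1$, or $N_1(1-P)-(N-N_1)P=0$. When $0<N_1<N$ the likelihood is 0 at $P=0$ and $P=1$, so the maximum comes from the last factor: $N_1-NP=0$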

$\Leftrightarrow P=\frac{N_1}{N}$
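
The closed form $P=\frac{N_1}{N}$ can be checked numerically. The following is a minimal Python sketch, not part of the original lecture: the true prior `true_p`, the sample size, and the helper names `mle_prior` and `log_likelihood` are assumptions chosen for illustration. It draws class labels at random, computes $N_1/N$, and compares it with a brute-force search over a grid of candidate values of $P$.

```python
import math
import random


def mle_prior(labels):
    """MLE of P = Prob(w_1): the fraction N_1 / N of samples from class 1."""
    n1 = sum(1 for w in labels if w == 1)
    return n1 / len(labels)


def log_likelihood(p, n1, n):
    """ln Prob(D|P) = N_1 ln P + (N - N_1) ln(1 - P), valid for 0 < P < 1."""
    return n1 * math.log(p) + (n - n1) * math.log(1.0 - p)


random.seed(0)
true_p = 0.3  # hypothetical true prior, used only to generate the labels
n = 1000
labels = [1 if random.random() < true_p else 2 for _ in range(n)]
n1 = labels.count(1)

p_hat = mle_prior(labels)

# Brute-force check: evaluate the log-likelihood on the grid 1/N, ..., (N-1)/N.
grid = [k / n for k in range(1, n)]
p_grid = max(grid, key=lambda p: log_likelihood(p, n1, n))

print("closed form N_1/N :", p_hat)
print("grid-search argmax:", p_grid)
```

Because the grid contains every multiple of $1/N$, the grid search should land on the same value as the closed form whenever $0<N_1<N$.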

• Simple Example Two: Continuous R.V.: Estimate mean of Gaussian with Known $\Sigma$

$\rho(\vec{x}|\vec{\mu})=N(\vec{\mu},\Sigma)$, where $\vec{\mu}$ is unknown and $\Sigma$ is known.


$\rho(D|\vec{\mu}) = \prod\limits_{j=1}^{N}\rho(x_j|\vec{\mu})$

Observe that the MLE $\hat{\theta}$ also maximizes $ln\rho_D(D|\theta)$, since ln is monotonically increasing:

$ln\rho_D(D|\vec{\mu})$ = $\sum\limits_{j=1}^{N}ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}$ $\exp(-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}))$

= $\sum\limits_{j=1}^{N}(ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}})$ $-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2})$

which is infinitely differentiable with respect to $\vec{\mu}$, so local maxima are where $\nabla=0$.

Compute the gradient $\nabla_{\vec{\mu}}ln\rho_{D}(D|\vec{\mu})$:

=$\sum\limits_{j=1}^{N}\nabla_{\vec{\mu}} (ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}})$ $-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2})$

=$-1/2\sum\limits_{j=1}^{N}\nabla_{\vec{\mu}}[(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})]$

=$-1/2\sum\limits_{j=1}^{N} \begin{bmatrix} \frac{\partial}{\partial\mu_1} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \frac{\partial}{\partial\mu_2} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ \frac{\partial}{\partial\mu_n} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix}$

But $\frac{\partial}{\partial\mu_i} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})$

=$(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T)\Sigma^{-1}$ $(x_j-\mu)+(x_j-\mu)^T\Sigma^{-1}\frac{\partial}{\partial \mu_i}(x_j-\mu)$

=$2(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T)\Sigma^{-1}(x_j-\mu)$ (the two terms are equal because $\Sigma^{-1}$ is symmetric)

=$2(0,0,0,...,-1,0,...,0)\Sigma^{-1}(x_j-\mu)$

=$-2\vec{e_i}^{T}\Sigma^{-1}(x_j-\mu)$

so, $\nabla{ln\rho_D(D|\mu)} = -1/2\sum\limits_{j=1}^{N}$ $\begin{bmatrix} -2\vec{e_1}^{T}\Sigma^{-1}(x_j-{\mu})\\ -2\vec{e_2}^{T}\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ -2\vec{e_n}^{T}\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix}$

=$\sum\limits_{j=1}^{N}$ $\begin{bmatrix} \vec{e_1}^{T}\\ \vec{e_2}^{T}\\ \vdots \\ \vec{e_n}^{T}\\ \end{bmatrix}$ $\Sigma^{-1}(x_j-\mu)$, where $\vec{e_i}$ is the i-th standard basis vector of the feature space, so the stacked rows $\vec{e_i}^{T}$ form the identity matrix

=$\sum\limits_{j=1}^{N}\Sigma^{-1}(x_j-\mu)$

=$\Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu)$ set to be 0

$\Rightarrow \Sigma\Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu) = \Sigma \cdot 0$

$\Rightarrow \sum\limits_{j=1}^{N}(x_j-\mu) = 0$

$\Rightarrow \frac{1}{N}\sum\limits_{j=1}^{N}x_j = \mu$

$\rightarrow$ the sample mean is the maximum likelihood estimate for $\mu$
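
As a sanity check on this result, here is a small Python sketch for the 2-D case. It is an illustration under stated assumptions rather than part of the lecture: the covariance `SIGMA`, the mean `true_mu`, and the helper names are hypothetical. It evaluates the gradient $\Sigma^{-1}\sum_j(x_j-\mu)$ at the sample mean and compares the log-likelihood there with its value at nearby points.

```python
import math
import random

# Known covariance (hypothetical values) and its inverse, written out for 2-D.
SIGMA = [[2.0, 0.6], [0.6, 1.0]]
DET = SIGMA[0][0] * SIGMA[1][1] - SIGMA[0][1] * SIGMA[1][0]
SIGMA_INV = [[SIGMA[1][1] / DET, -SIGMA[0][1] / DET],
             [-SIGMA[1][0] / DET, SIGMA[0][0] / DET]]


def draw_sample(mu):
    """Draw one x ~ N(mu, SIGMA) using the Cholesky factor of SIGMA."""
    l11 = math.sqrt(SIGMA[0][0])
    l21 = SIGMA[1][0] / l11
    l22 = math.sqrt(SIGMA[1][1] - l21 * l21)
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)


def log_likelihood(mu, data):
    """ln rho_D(D|mu) without the constant term that does not depend on mu."""
    total = 0.0
    for x in data:
        d0, d1 = x[0] - mu[0], x[1] - mu[1]
        quad = (d0 * (SIGMA_INV[0][0] * d0 + SIGMA_INV[0][1] * d1)
                + d1 * (SIGMA_INV[1][0] * d0 + SIGMA_INV[1][1] * d1))
        total -= 0.5 * quad
    return total


def gradient(mu, data):
    """Gradient Sigma^{-1} * sum_j (x_j - mu) of the log-likelihood."""
    s0 = sum(x[0] - mu[0] for x in data)
    s1 = sum(x[1] - mu[1] for x in data)
    return (SIGMA_INV[0][0] * s0 + SIGMA_INV[0][1] * s1,
            SIGMA_INV[1][0] * s0 + SIGMA_INV[1][1] * s1)


random.seed(1)
true_mu = (1.0, -2.0)  # hypothetical true mean
data = [draw_sample(true_mu) for _ in range(200)]

mu_hat = (sum(x[0] for x in data) / len(data),
          sum(x[1] for x in data) / len(data))

print("sample mean            :", mu_hat)
print("gradient at sample mean:", gradient(mu_hat, data))
```

The gradient at the sample mean should vanish up to floating-point rounding, and moving $\mu$ away from the sample mean in any direction should lower the log-likelihood.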

• Example three: 1-D Gaussian, both $\mu$ and $\sigma^2$ unknown

$\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$

We have $ln\rho(x_k|\mu,\sigma^2) =$ $ln(\frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-\frac{(x_k-\mu)^2}{2\sigma^2}})$

=$-\frac{1}{2}ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_k-\mu)^2$

$ln\rho_D(D|\mu, \sigma^2)$

=$ln\prod\limits_{k=1}^{N}\rho(x_k|\mu,\sigma^2)$

=$\sum\limits_{k=1}^{N}(-\frac{1}{2}ln(2\pi\sigma^2)$ $-\frac{1}{2\sigma^2}(x_k-\mu)^2)$

$\nabla_{\mu,\sigma^2}ln\rho_D(D|\mu,\sigma^2)$

=$\begin{bmatrix} \frac{\partial}{\partial \mu}ln\rho_D(D|\mu,\sigma^2)\\ \frac{\partial}{\partial \sigma^2}ln\rho_D(D|\mu,\sigma^2)\\ \end{bmatrix}$

=$\begin{bmatrix} \frac{\partial}{\partial \mu}(-\frac{N}{2}ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2)\\ \frac{\partial}{\partial \sigma^2}(-\frac{N}{2}ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2)\\ \end{bmatrix}$

=$\begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}- \frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2 \end{bmatrix}$

=$\begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+ \frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2 \end{bmatrix}$ set to be 0

From $\frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)=0$ $\Leftrightarrow$ $\sum\limits_{k=1}^{N}x_k-N\mu=0$

$\Leftrightarrow \mu=\frac{1}{N}\sum\limits_{k=1}^{N}x_k$, which is the sample mean.

From $-\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+$ $\frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2=0$ and $\mu=\hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k \Rightarrow$

$-\frac{N}{2\sigma^2}+$ $\frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2=0 \Leftrightarrow$

$-\frac{N}{2}+\frac{1}{2\sigma^2}$ $\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2=0 \Leftrightarrow$

$\frac{1}{2\sigma^2}=$ $\frac{N}{2}\cdot \frac{1}{\sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2}$ $\Leftrightarrow$

$\sigma^2 = \frac{1}{N}\cdot \sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2$ =$\hat{\sigma^2}$, which is the MLE of $\sigma^2$
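
The two estimates can be computed directly and the stationarity condition checked numerically. The sketch below is only an illustration: the data are simulated with hypothetical parameters ($\mu=5$, $\sigma=2$), and the function names `gaussian_mle` and `gradient` are made up for this example. It evaluates both components of the gradient at the estimates and compares the variance estimate with `statistics.pvariance` from the Python standard library, which also divides by $N$.

```python
import random
import statistics


def gaussian_mle(data):
    """Return (mu_hat, sigma2_hat) for 1-D data, using the 1/N formulas."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, sigma2_hat


def gradient(mu, sigma2, data):
    """Gradient of ln rho_D(D|mu, sigma^2) with respect to (mu, sigma^2)."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    d_mu = sum(x - mu for x in data) / sigma2
    d_sigma2 = -n / (2.0 * sigma2) + sq / (2.0 * sigma2 ** 2)
    return d_mu, d_sigma2


random.seed(2)
# Hypothetical true parameters: mu = 5, sigma = 2.
data = [random.gauss(5.0, 2.0) for _ in range(500)]

mu_hat, sigma2_hat = gaussian_mle(data)
print("mu_hat     :", mu_hat)
print("sigma2_hat :", sigma2_hat)
print("gradient   :", gradient(mu_hat, sigma2_hat, data))
print("pvariance  :", statistics.pvariance(data))
```

Both gradient components should be zero up to rounding error, which confirms that the pair of estimates is a stationary point of the log-likelihood.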

In general, when $x\sim N(\vec{\mu}, \Sigma),$ $x\in \mathbb{R}^n$, with $\vec{\mu}$ and $\Sigma$ unknown, the MLEs for $\vec{\mu}$ and $\Sigma$ are: $\hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k$ $, \hat{\Sigma} = \frac{1}{N}\sum\limits_{k=1}^{N}$ $(x_k-\hat{\mu})(x_k-\hat{\mu})^T$

$\Sigma$ is non-singular, but $\hat{\Sigma}$ can be singular $\Rightarrow$ no inverse $\rightarrow$ this happens, for example, when the number of points $N$ is smaller than the dimension $n$ of the feature space.
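
To make the singular case concrete, the following sketch (an illustration with made-up points, not taken from the lecture) computes $\hat{\Sigma}$ in $\mathbb{R}^3$ from only two points and then from four points, and prints the determinant in each case. The helper names `sample_covariance_mle` and `det3` are hypothetical.

```python
def sample_covariance_mle(points):
    """MLE covariance: (1/N) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T."""
    n_pts = len(points)
    dim = len(points[0])
    mu_hat = [sum(p[i] for p in points) / n_pts for i in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for p in points:
        d = [p[i] - mu_hat[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += d[i] * d[j] / n_pts
    return cov


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# N = 2 points in R^3 (n = 3): fewer points than dimensions.
few_points = [(1.0, 2.0, 3.0), (4.0, 0.0, -1.0)]
# N = 4 points in R^3 that do not lie in a common plane.
more_points = few_points + [(0.0, 1.0, 0.0), (2.0, 2.0, 5.0)]

det_few = det3(sample_covariance_mle(few_points))
det_more = det3(sample_covariance_mle(more_points))
print("det of Sigma_hat with N=2:", det_few)
print("det of Sigma_hat with N=4:", det_more)
```

With two points, every deviation $x_k-\hat{\mu}$ lies on a single line, so $\hat{\Sigma}$ has rank 1 and its determinant vanishes. With four points in general position the deviations span the whole space and the determinant is positive.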

What happens when we repeat the sampling and the estimation?

Sample: $(x_1^i, x_2^i,...,x_N^i) \Rightarrow$ $\hat{\mu}^i = \frac{1}{N}\sum\limits_{k=1}^{N}x_k^i$

• $E(\hat{\mu})=?$

We have $E(\hat{\mu})= E(\frac{1}{N}\sum\limits_{k=1}^{N}x_k)=$ $\frac{1}{N}\sum\limits_{k=1}^{N}E(x_k)=$ $\frac{1}{N}\sum\limits_{k=1}^{N}\mu = \mu$

But how far do we expect $\hat{\mu}$ to deviate from the mean?

$E(|\hat{\mu}-\mu|^2) = E((\hat{\mu}-\mu)(\hat{\mu}-\mu))$ $=E(\hat{\mu}\cdot\hat{\mu}-\hat{\mu}\cdot{\mu}$ $-{\mu}\cdot\hat{\mu}+{\mu}\cdot{\mu})$

$=E(\hat{\mu}\cdot\hat{\mu})-2\cdot \mu E(\hat{\mu})+\mu \cdot \mu$

$=E(\hat{\mu}\cdot\hat{\mu})-\mu\cdot\mu$

$=E(\frac{1}{N}\sum\limits_{k=1}^{N}x_k \cdot \frac{1}{N}\sum\limits_{j=1}^{N}x_j)-\mu\cdot\mu$

$=\frac{1}{N^2}\sum\limits_{k=1}^{N}\sum\limits_{j=1}^{N}E(x_k \cdot x_j)-\mu\cdot\mu$

$=\frac{1}{N^2}[\sum\limits_{k,j=1,k\neq j}^{N}$ $E(x_k )\cdot E(x_j)+\sum\limits_{k=1}^{N}$ $E(x_k \cdot x_k)]-\mu\cdot\mu$ (by independence, $E(x_k\cdot x_j)=E(x_k)\cdot E(x_j)$ for $k\neq j$)

$=\frac{1}{N^2}[N\cdot (N-1)\mu\cdot \mu+$ $\sum\limits_{k=1}^{N}E(x^2)]-\mu\cdot\mu$

$=-\frac{1}{N}\mu\cdot\mu+\frac{1}{N^2}\sum\limits_{k=1}^{N}E(x^2)$

by $E[(x-\mu)(x-\mu)] = \sigma^2 \Rightarrow$ $E(x \cdot x)-\mu^2 = \sigma^2 \rightarrow$ $E(x \cdot x) = \sigma^2+\mu^2$

So: $E(|\hat{\mu}-\mu|^2) = -\frac{1}{N}\mu \cdot \mu +$ $\frac{1}{N}(\sigma^2+\mu \cdot \mu) = \frac{1}{N}\sigma^2$
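
Both results, $E(\hat{\mu})=\mu$ and $E(|\hat{\mu}-\mu|^2)=\frac{1}{N}\sigma^2$, can be illustrated with a small Monte Carlo experiment. The sketch below assumes 1-D Gaussian data with hypothetical values $\mu=2$, $\sigma=3$, $N=25$. The function name `simulate_sample_mean` and the number of trials are likewise choices made for this example only.

```python
import random


def simulate_sample_mean(mu, sigma, n, trials, seed=0):
    """Repeat sampling and estimation `trials` times.

    Returns the average of mu_hat and the average of (mu_hat - mu)^2.
    """
    rng = random.Random(seed)
    sum_mu_hat = 0.0
    sum_sq_dev = 0.0
    for _ in range(trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        sum_mu_hat += mu_hat
        sum_sq_dev += (mu_hat - mu) ** 2
    return sum_mu_hat / trials, sum_sq_dev / trials


mu, sigma, n = 2.0, 3.0, 25  # hypothetical true parameters and sample size
avg_mu_hat, mse = simulate_sample_mean(mu, sigma, n, trials=20000)

print("average of mu_hat         :", avg_mu_hat, " theory:", mu)
print("average squared deviation :", mse, " theory:", sigma ** 2 / n)
```

The simulated averages only approximate the expectations. The agreement should improve as the number of trials grows.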

• Bias: The maximum likelihood estimate of the variance $\sigma^2$ is biased. This means that

the expected value, over all data sets of size $N$, of the sample variance is not equal to the true variance:

$E[\frac{1}{N}\sum\limits_{k=1}^{N}(x_k-\bar{x})^2] = \frac{N-1}{N}$ $\sigma^2 \neq \sigma^2$

But as $N \rightarrow \infty$, the factor $\frac{N-1}{N}$ tends to 1, so the expected value of the MLE of $\sigma^2$ approaches $\sigma^2$
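
The bias is easy to observe in simulation when the data sets are small. The following sketch is an illustration under assumed values ($\sigma=2$, data sets of size 5, called `n` in the code). The function name `average_variance_mle` is hypothetical. It averages the maximum likelihood variance estimate over many independent data sets and prints it next to the predicted value, which is the true variance scaled by (size $-$ 1)/size.

```python
import random


def average_variance_mle(sigma, n, trials, seed=0):
    """Average the 1/n variance estimate over `trials` data sets of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        x_bar = sum(xs) / n
        total += sum((x - x_bar) ** 2 for x in xs) / n
    return total / trials


sigma, n = 2.0, 5  # hypothetical true standard deviation and data-set size
avg_estimate = average_variance_mle(sigma, n, trials=20000)

print("average MLE of the variance:", avg_estimate)
print("predicted (n-1)/n * sigma^2:", (n - 1) / n * sigma ** 2)
print("true variance sigma^2      :", sigma ** 2)
```

Increasing `n` in this sketch should bring the average estimate closer to the true variance, in line with the limit stated above.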
