Maximum Likelihood Estimators and Examples
A slecture by Lu Zhang

Partially based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

# Outline of the slecture

• Background and Introduction of ML estimator
• Finding Maximum Likelihood Estimators and Examples
• Summary and Suggestions
• References

## Background and Introduction of ML estimator

Once we have decided on a model(Probability Distribution), our next step is often to estimate some information from the observed data. There are generally two parametric frameworks for estimating unknown information from data. We will refer to these two general frameworks as the Frequentist and Baysian approaches. One very widely used Frequentist estimator is known as the Maximum Likelihood estimator.

In the frequentist approach, one treats the unknown quantity as a deterministic, but unknown parameter vector, $\theta \in \Omega$. So for example, after we observe the random vector $Y \in \mathbb{R}^{n}$, then our objective is to use $Y$ to estimate the unknown scalar or vector $\theta$. In order to formulate this problem, we will assume that the vector $Y$ has a probability density function given by $p_{\theta}(y)$ where $\theta$ parameterizes a family of density functions for $Y$. We may then use this family of distributions to determine a function, $T : \mathbb{R}^{n} \rightarrow \Omega$, that can be used to compute an estimate of the unknown parameter as

$\hat{\theta} = T(Y)$

Notice, that since $T(Y)$ is a function of random vector $Y$, the estimate, $\hat{\theta}$, is a random vector. The mean of the estimator, $\bar{\theta}$, can be computed as

$\bar{\theta} = E_{\theta}[\hat{\theta}] = \int_{\mathbb{R}^{n}} T(y)p_{\theta}(y)dy$

The difference between the mean of the estimator and the value of the parameter is known as the bias and is given by

$bias_{\theta} = \bar{\theta} -\theta$

Similarly, the variance of the estimator is given by

$var_{\theta} = E_{\theta}[(\hat{\theta} -\bar{\theta})^2]$

and it is easily shown that the mean squared error (MSE) of the estimate is then given by

$MSE_{\theta} = E_{\theta}[(\hat{\theta}-\theta)^2] = var_{\theta} + (bias_{\theta})^2$

Since the bias, variance, and the MSE of the estimator will depend on the specific value of $\theta$, it is often unclear precisely how to compare the accuracy of different estimators. Even estimators that seem quite poor may produce small or zero error for certain values of $\theta$. For example, consider the estimator which is fixed to the value $\hat{\theta}=1$, independent of the data. This would seem to be a very poor estimator, but it has an MSE of 0 when $\theta=1$.

An estimator is said to be consistent if for all $\theta \in \Omega$, the MSE of the estimator goes to zero as the number of independent data samples, n, goes to infinity. If an estimator is not consistent, this means that even with arbitrarily large quantities of data, the estimate will not approach the true value of the parameter. Consistency would seem to be the least we would expect of an estimator, but we will later see that even some very intuitive estimators are not always consistent.

Ideally, it would be best if one could select an estimator which has uniformly low bias and variance for all values of $\theta$. This is not always possible, but when it is we have names for such estimators. For example, $\hat{\theta}$ is said to be an unbiased estimator if for all values of $\theta$ the bias is zero, i.e. $\theta = \bar{\theta}$. If in addition, for all values of $\theta$, the variance of an estimator is less than or equal to that of all other unbiased estimators, then we say that the estimator is uniformly minimum variance unbiased(UMVU) estimator.

There are many excellent estimators that have been proposed through the years for many different types of problems. However, one very widely used Frequentist estimator is known as the maximum likelihood(ML) estimator given by

$\hat{\theta} = \arg\max_{\theta \in \Omega} p_{\theta}(Y)$
$= \arg\max_{\theta \in \Omega} \log p_{\theta}(Y)$

where the notation "argmax" denotes the value of the argument that achieves the global maximum of the function. Notice that these formulas for the ML estimate actually use the random variable $Y$ as an argument to the density function $p_{\theta}(y)$. This implies that $\hat{\theta}$ is function of $Y$, which in turn means that $\hat{\theta}$ is a random variable.

When the density function, $p_{\theta}(y)$, is a continuously differentiable function of $\theta$, then a necessary condition when computing the ML estimate is that the gradient of the likelihood is zero.

$\bigtriangledown_{\theta} p_{\theta}(Y)|_{\theta=\hat{\theta}} = 0$

While the ML estimate is generally not unbiased, it does have a number of desirable properties. It can be shown that under relatively mild technical conditions, the ML estimate is both consistent and asymptotically efficient. This means that the ML estimate attains the Cramer-Rao bound asymptotically as the number of independent data samples,n, goes to infinity. Since this Cramer-Rao bound is a lower bound on the variance of an unbiased estimator, this means that the ML estimate is about as good as any unbiased estimator can be. So if one has no idea what values of $\theta$ may occur, and needs to guarantee good performance for all cases, then the ML estimator is usually a good choice.

# Example 1: 1D Gaussuan, $\theta$ unknown, $\sigma^2$ known

Let $\{Y_k\}_{k=1}^n$ be i.i.d. random variables with Gaussian distribution $N(\theta,1)$ and unknown mean parameter $\theta$. For this case, the logarithm of the density function is given by:

$\log p_{\theta}(Y) = -\frac{1}{2\sigma^2}\sum_{k=1}^{n}(Y_k-\theta)^2 - \frac{n}{2} \log (2\pi\sigma^2)$

Local max are where gradient is zero. Differentiating the log likelihood results in the following expression.

$\bigtriangledown_{\theta} \log p_{\theta}(Y) = \frac{d\log p_{\theta}(Y)}{d\theta} |_{\theta = \hat{\theta}} = \frac{1}{\sigma^2}\sum_{k=1}^{n}(Y_k - \hat{\theta}) = 0$

From this we obtain the ML estimate for $\theta$. So

$\hat{\theta}=\frac{1}{n}\sum_{k=1}^{n} y_k$

Notice that, by the following argument, this particular ML estimate is unbiased.

$E_{\theta}[\hat{\theta}] = E_{\theta}[\frac{1}{n}\sum_{k=1}^{n}Y_k] = \frac{1}{n}\sum_{k=1}^{n} E_{\theta}[Y_k] = \frac{1}{n} \sum_{k=1}^{n} \theta = \theta$

Moreover, for this special case, it can be shown that the ML estimator is also the UMVU estimator.

# Example 2: 1D Gaussuan, $(\mu,\sigma^2)$ unknown

Let $\{Y_k\}_{k=1}^n$ be i.i.d. random variables with Gaussian distribution $N(\mu,\sigma^2)$ and unknown mean parameter $\theta$. For this case, the density function is giving by:

$p_{\theta}(Y) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp(- \frac{1}{2\sigma^2}\sum_{k=1}^{n}(Y_k - \mu))$

The logarithm of the density function is given by:

$\log p_{\theta}(Y) = -\frac{1}{2\sigma^2}\sum_{k=1}^{n}(Y_k-\mu)^2 - \frac{n}{2} \log (2\pi\sigma^2)$

Similarly, differentiating the log likelihood results in the following expressions.

$\bigtriangledown_{\mu} \log p_{\theta}(Y) = \bigtriangledown_{\mu} (-\frac{n}{2}\log 2\pi-\frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (Y_k-\mu)^2)$

$= -\frac{1}{2\sigma^2}\cdot 2\cdot (-1) \sum_{k=1}^{n}(Y_k-\hat{\mu})$

$= \frac{1}{\sigma^2} \sum_{k=1}^{n}(Y_k-\hat{\mu})$

$= \frac{1}{\sigma^2}(\sum_{k=1}^{n}Y_k -\hat{\mu}n) = 0 --- 1$

$\bigtriangledown_{\sigma^2} \log p_{\theta}(Y) = \bigtriangledown_{\sigma^2} (-\frac{n}{2}\log 2\pi-\frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (Y_k-\mu)^2)$

$= -\frac{n}{2} \frac{1}{\sigma^2}+\frac{1}{2} \sum_{k=1}^{n} (Y_k - \mu)^2(\hat{\sigma^2})^{-2} = 0 --- 2$

From Equation 1 and 2, we can find the ML estimator for the two parameters:

$\hat{\mu} = \frac{1}{n}\sum_{k=1}^n Y_k$
$\hat{\sigma^2} = \frac{1}{n}\sum_{k=1}^n (Y_k - \hat{\mu})^2$

# Example 3: pD Gaussuan, $(\mu,R)$ unknown

Let $Y_1,...,Y_n$ be i.i.d. Gaussian random vectors with distribution $N(\mu,R)$ where $\mu \in \mathbb{R}^{p}$ and $R \in \mathbb{R}^{p\times p}$ is a symmetric positive definite matrix. Let $\theta = [\mu,R]$ denote the parameter vector for the distribution.

We would to know the density function for the entire observation first. In order to simplify the notation, we will represent the data as a matrix $Y$.

$Y = [Y_1,...,Y_n]$

Since the $n$ observations, $Y_1,...,Y_n$ are i.i.d., we know we can represent the density of $Y$ as the following product of densities for each vector $Y_k$.

$p(x|\theta) = \prod_{k=1}^n \frac{1}{(2\pi)^{p/2}} |R|^{-1/2} \exp\{-\frac{1}{2} (y_k-\mu)^t R^{-1}(y_k-\mu)\}$
$p(x|\theta) = \frac{1}{(2\pi)^{pn/2}} |R|^{-\frac{n}{2}} \exp\{-\frac{1}{2}\sum_{k=1}^n (y_k-\mu)^t R^{-1}(y_k-\mu)\}$

This expression may be furthur simplified by defining the following two new sample statistics,

$b = \sum_{k=1}^n y_k$
$S = \sum_{k=1}^n y_ky_k^t = yy^t$

Where $yy^t$ is the product of the matrix $y$ with its transpose. Also,

$\sum_{k=1}^n (y_k-\mu)^t R^{-1}(y_k-\mu)$
$= \sum_{k=1}^n (tr((y_k-\mu)^t R^{-1}(y_k-\mu)))$
$= \sum_{k=1}^n (tr((y_k-\mu)^t (y_k-\mu)R^{-1}))$
$= \sum_{k=1}^n tr((y_k y_k^t-2\mu y_k^t + \mu\mu^t)R^{-1})$
$= tr(\sum_{k=1}^ny_k y_k^t R^{-1}) - 2tr(\sum_{k=1}^ny_k^tR^{-1}\mu) + ntr(\mu^tR^{-1}\mu)$

where $tr\{A\}$ denotes the trace of the matrix $A$, the summation of the diagonal elements.

In general, a statistic is any function of the observed data, but typically, statistics summarize the data in some useful way. Using these two statistics, the pdf of $Y$ may be written as:

$p(x|\theta) = \frac{1}{(2\pi)^{pn/2}} |R|^{-\frac{n}{2}} \exp \{tr\{SR^{-1}\} +b^tR^{-1}\mu - \frac{n}{2}\mu^tR^{-1}\mu\}---3$

Maximizing this expression with respect to $\theta$ then results in the ML estimate of the mean and covariance given by

$\arg\max_{[\mu, R]} p(y|\mu,R) = [\hat{\mu},\hat{R}]$

The logarithm of the density function is given by:

$\log p(y|\theta) = -\frac{pn}{2}\log{2\pi} -\frac{n}{2}\log |R| - \frac{1}{2} tr(SR^{-1}) + b^t R^{-1}\mu - \frac{n}{2}\mu^tR^{-1}\mu$

Differentiating the log likelihood over $\mu$ results in the following expressions.

$\bigtriangledown_{\mu} \log p_{\theta}(Y|\theta) = b^t R^{-1} - \frac{n}{2}R^{-1}\hat{\mu}*2$
$\hat{\mu} = \frac{b}{n} = \frac{1}{n} \sum_{k=1}^n y_k$

However, finding the gradient of covariance matrix $R$ is more complicated. Let

$M = \sum_{k=1}^n (y_k - \hat{\mu})(y_k-\hat{\mu})^t$
$B = M^{\frac{1}{2}}R^{-1} M^{\frac{1}{2}}$

Then

$R^{-1} = M^{\frac{1}{2}} B M^{-\frac{1}{2}}$
$|R|^{-1} = |M^{-1}| |B|$

Using these two matrix, the density function can be written by:

$|R|^{-\frac{n}{2}} \exp \{-\frac{1}{2}tr(\sum_{k=1}^n(y_k-\hat{\mu})(y_k-\hat{\mu})^tR^{-1})\}$
$= |M|^{-1/2}|B|^{n/2} \exp(-\frac{1}{2}tr(B))$
$\propto \prod_{i=1}^{p} \lambda_i ^{\frac{n}{2}} \exp(-\frac{1}{2} \sum_{i=1}^{p} \lambda_i) = \prod_{i=1}^{p} \lambda_i ^{\frac{n}{2}} \exp(-\frac{\lambda_i}{2} )$

Therefore,

$\hat{R} = M^{1/2}B^{-1}M^{1/2} = \frac{1}{n} M$
$\hat{R} = \frac{1}{n} \sum_{k=1}^{n} (y_k-\hat{\mu})(y_k-\hat{\mu})^t$
$\hat{R} = \frac{1}{n} \sum_{k=1}^{n} y_ky_k^t + \sum_{k=1}^{n}\hat{\mu}{\hat{\mu}}^t - 2\sum_{k=1}^{n} y_k {\hat{\mu}}^t$
$\hat{R} = \frac{S}{n} + \hat{\mu}{\hat{\mu}}^t -\frac{2}{n}b{\hat{\mu}}^t = \frac{S}{n}- \hat{\mu}{\hat{\mu}}^t$
$\hat{R} = \frac{S}{n}- \hat{\mu}{\hat{\mu}}^t = \frac{1}{n} \sum_{k=1}^{n} (Y_k - \mu)(Y_k - \mu)^t$

So from this we see that the ML estimate of the mean and covariance of a multivariate Gaussian distribution is simply given by the sample mean and covariance of the observations.

Now for example 3, we would like to know if the estimator is biased or not(mentioned in the background section), i.e. check if $bias_\theta = E_\theta[\hat{\theta}] - \theta = 0$

Case 1, when $\mu$ is known, check if $\hat{R}$ is biased.

$E[ \hat{R} ] = E[ \frac{1}{n} \sum_{k=1}^{n} (Y_k - \mu)(Y_k - \mu)^t]$

$= \sum_{k=1}^{n} E[ \frac{1}{n} (Y_k - \mu)(Y_k - \mu)^t]$

$= \frac{1}{n} \sum_{k=1}^{n} E [Y_k Y_k^t - Y_k \mu^t - \mu Y^t + \mu \mu^t]$

$= \frac{1}{n} n E[ Y_k Y_k^t ] - \mu\mu^t$

$= E[ Y Y^t - \mu\mu^t] \frac{}{}$

$= E[ Y Y^t - 2\mu\mu^t + \mu\mu^t] \frac{}{}$

$= E[ Y Y^t] - E[Y]\mu^t - \mu E[Y^t] + \mu\mu^t \frac{}{}$

$= E[(Y-\mu)(Y-\mu)^t] = R \frac{}{}$

Therefore when $\mu$ is known, the ML estimator of covariance matrix $R$ is unbiased, that means the mean of estimated value of covariance matrix is equal to the true $R$.

Case 2, when $[\mu, R]$ unknown (initial settings of example 3), check if $\hat{R}$ is still unbiased.

$E[\hat{R}] = \frac{1}{n} E[ \sum_{k=1}^{n} (Y_k - \hat{\mu})(Y_k - \hat{\mu})^t]$

$= \frac{1}{n} \sum_{k=1}^{n} E[Y_k Y_k^t- Y_k \hat{\mu}^t - \hat{\mu} Y^t + \hat{\mu} \hat{\mu}^t]$

$= \frac{1}{n} \sum_{k=1}^{n} ( E[ Y_k Y_k^t ] - E[Y_K \hat{\mu}^t] - E[ \hat{\mu} Y^T] - E[\hat{\mu} \hat{\mu}^t] )$

$= \frac{1}{n} \sum_{k=1}^{n} \left(E[ Y_k Y_k^t ] - E [(\frac{1}{n}\sum_{l=1}^{n}Y_l)Y_k]\right) - E [\frac{1}{n}\sum_{l=1}^{n}Y_k Y_l^t]+ E[\frac{1}{n}\sum_{l=1}^{n}Y_l \frac{1}{n}\sum_{m=1}^{n}X_m^T]$

Need to divide the summation shown in the equation above into two parts, $l = k$ and $l \neq k$,

$E[ \hat{R} ] = \frac{1}{n} \sum_{k=1}^{n} \{E[Y Y^T] - \frac{1}{n} R_{l=1, l \neq k}^{n}E(Y_l) E (Y_k^t)-\frac{1}{n} R_{l=1, l = k}^{n} E( Y_k Y_k^t )$

$-\frac{1}{n}R_{l=1, l \neq k}^{n} E(Y_k) E(Y_l^t)-\frac{1}{n} R_{l=1, l = k}^{n} E( Y_k Y_k^t )+ \frac{1}{n^2} R_{l,m=1, l \neq m}^{K} E(Y_l)E(Y_m^t) + \frac{1}{n^2} R_{l,m=1, l = m}^{K} E(Y_l)E(Y_l^t)\}$

$= \frac{1}{n} \sum_{k=1}^{n} \{E[Y Y^T] - \frac{1}{n} R_{l=1, l \neq k}^{n}\mu\mu^t-\frac{1}{n} R_{l=1, l = k}^{n} E( Y_k Y_k^t )-\frac{1}{n}R_{l=1, l \neq k}^{n} \mu\mu^t$

$-\frac{1}{n} R_{l=1, l = k}^{n} E( Y_k Y_k^T )+ \frac{1}{n^2} R_{l,m=1, l \neq m}^{K} \mu\mu^t + \frac{1}{n^2} R_{l,m=1, l = m}^{K} E(Y_l)E(Y_l^t)$

$E[ \hat{R} ] = \frac{1}{n} \left[n(1 - \frac{1}{n} E( Y Y^T ) + n( \frac{1}{n} - 1) \mu\mu^t \right]$

$= (1 - \frac{1}{n} ) \left[E( Y Y^T ) - \mu\mu^t \right]$

$= ( \frac{n- 1}{n} ) E(Y Y^t - \mu\mu^t))$

$= ( \frac{n - 1}{n} )E[(Y - \mu)(Y - \mu)^t] \neq R$

Therefore, when both of the parameters are unknown for a set of Gaussian distributed random vector, the ML estimator of covariance matrix is not unbiased any more.

However, we can notice that as the size of the data increases, i.e. $n$, the expected value of ML estimator will converge to its true value.

## Summary and Suggestions

Since the ML estimator is asymptotically efficient, it is guaranteed to do well for all values of the unknown parameter $\theta$, as the amount of data grows towards to infinity(shown in conclusion of example 3). So it might seem that nothing more can be done.

However,as I mentioned in the background section, there are two general frameworks for estimating unknown information from data, the frequentist and Bayesian approaches. What I have in this section is a presentation of maximum likelihood (ML) estimator, which is an important example of the frequentist approach. You might want to learn Bayesian approaches from other slectures. Bayesian methods attempts to exceed the accuracy of Frequentist estimators, such as the ML estimates, by making assumptions about values of the parameters that are most likely to occur. In practice, you might have already seen in homework 2 that Bayesian estimation methods are most useful when the amount of data is relatively small compared to the dimension of the unknown parameter.

Hopefully, you will see that each approach, has its potential advantages, and the best choice tends to depend on the particular situation. Generally, we will see that the ML estimator tends to be a good choice when the ratio of the amount of data to the number of unknowns is much greater than 1, whereas the Bayesian parameter estimator tends to be a good choice when this ratio approaches or is less than 1.

## Reference

. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

. C. Bouman, ECE 641 'Model Based Image Processing', Spring 2013. 