Maximum Likelihood Estimators and Examples
A slecture by Lu Zhang

Partially based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

# Outline of the slecture

• Background and introduction to the ML estimator
• Derivation of the Maximum Likelihood Estimate (MLE)
• Examples
• Summary
• References

## Background and Introduction to the ML Estimator

Once we have decided on a model (probability distribution), our next step is often to estimate some information from the observed data. There are generally two parametric frameworks for estimating unknown information from data. We will refer to these two general frameworks as the Frequentist and Bayesian approaches. One very widely used Frequentist estimator is known as the Maximum Likelihood estimator.

In the frequentist approach, one treats the unknown quantity as a deterministic, but unknown parameter vector, $\theta \in \Omega$. So for example, after we observe the random vector $Y \in \mathbb{R}^{n}$, then our objective is to use $Y$ to estimate the unknown scalar or vector $\theta$. In order to formulate this problem, we will assume that the vector $Y$ has a probability density function given by $p_{\theta}(y)$ where $\theta$ parameterizes a family of density functions for $Y$. We may then use this family of distributions to determine a function, $T : \mathbb{R}^{n} \rightarrow \Omega$, that can be used to compute an estimate of the unknown parameter as

$\hat{\theta} = T(Y)$

Notice that, since $T(Y)$ is a function of the random vector $Y$, the estimate, $\hat{\theta}$, is itself a random vector. The mean of the estimator, $\bar{\theta}$, can be computed as

$\bar{\theta} = E_{\theta}[\hat{\theta}] = \int_{\mathbb{R}^{n}} T(y)p_{\theta}(y)dy$

The difference between the mean of the estimator and the value of the parameter is known as the bias and is given by

$bias_{\theta} = \bar{\theta} -\theta$

Similarly, the variance of the estimator is given by

$var_{\theta} = E_{\theta}[(\hat{\theta} -\bar{\theta})^2]$

and it is easily shown that the mean squared error (MSE) of the estimate is then given by

$MSE_{\theta} = E_{\theta}[(\hat{\theta}-\theta)^2] = var_{\theta} + (bias_{\theta})^2$
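This decomposition can be checked numerically. The sketch below uses a deliberately biased estimator of a Gaussian mean; the shrinkage factor, seed, and sample sizes are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 50, 20000

# Deliberately biased estimator of the mean: shrink the sample sum by n + 1.
samples = rng.normal(theta, 1.0, size=(trials, n))
est = samples.sum(axis=1) / (n + 1)

bias = est.mean() - theta
var = est.var()                      # population-style variance (ddof=0)
mse = np.mean((est - theta) ** 2)

# The decomposition MSE = var + bias^2 holds as an algebraic identity
# when the empirical moments are computed this way.
assert np.isclose(mse, var + bias ** 2)
```

Note that the identity holds exactly for the empirical moments, so the assertion passes up to floating-point rounding, not just approximately.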

Since the bias, variance, and the MSE of the estimator will depend on the specific value of $\theta$, it is often unclear precisely how to compare the accuracy of different estimators. Even estimators that seem quite poor may produce small or zero error for certain values of $\theta$. For example, consider the estimator which is fixed to the value $\hat{\theta}=1$, independent of the data. This would seem to be a very poor estimator, but it has an MSE of 0 when $\theta=1$.

An estimator is said to be consistent if for all $\theta \in \Omega$, the MSE of the estimator goes to zero as the number of independent data samples, n, goes to infinity. If an estimator is not consistent, this means that even with arbitrarily large quantities of data, the estimate will not approach the true value of the parameter. Consistency would seem to be the least we would expect of an estimator, but we will later see that even some very intuitive estimators are not always consistent.
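Consistency can likewise be illustrated empirically: for the sample-mean estimator of a Gaussian mean, the empirical MSE shrinks as $n$ grows. A minimal sketch, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, trials = 2.0, 5000

def mse_of_sample_mean(n):
    # Empirical MSE of the sample-mean estimator over many repeated trials.
    samples = rng.normal(theta, 1.0, size=(trials, n))
    est = samples.mean(axis=1)
    return np.mean((est - theta) ** 2)

# For the sample mean, MSE = sigma^2 / n, so it shrinks as n grows.
assert mse_of_sample_mean(1000) < mse_of_sample_mean(10)
```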

Ideally, it would be best if one could select an estimator which has uniformly low bias and variance for all values of $\theta$. This is not always possible, but when it is we have names for such estimators. For example, $\hat{\theta}$ is said to be an unbiased estimator if for all values of $\theta$ the bias is zero, i.e. $\theta = \bar{\theta}$. If in addition, for all values of $\theta$, the variance of an estimator is less than or equal to that of all other unbiased estimators, then we say that the estimator is a uniformly minimum variance unbiased (UMVU) estimator.

There are many excellent estimators that have been proposed through the years for many different types of problems. However, one very widely used Frequentist estimator is known as the maximum likelihood (ML) estimator, given by

$\hat{\theta} = \arg\max_{\theta \in \Omega} p_{\theta}(Y)$
$= \arg\max_{\theta \in \Omega} \log p_{\theta}(Y)$

where the notation "argmax" denotes the value of the argument that achieves the global maximum of the function. Notice that these formulas for the ML estimate actually use the random variable $Y$ as an argument to the density function $p_{\theta}(y)$. This implies that $\hat{\theta}$ is a function of $Y$, which in turn means that $\hat{\theta}$ is a random variable.

When the density function, $p_{\theta}(y)$, is a continuously differentiable function of $\theta$, a necessary condition for the ML estimate is that the gradient of the likelihood be zero.

$\bigtriangledown_{\theta} p_{\theta}(Y)|_{\theta=\hat{\theta}} = 0$

While the ML estimate is generally not unbiased, it does have a number of desirable properties. It can be shown that under relatively mild technical conditions, the ML estimate is both consistent and asymptotically efficient. This means that the ML estimate attains the Cramér-Rao bound asymptotically as the number of independent data samples, $n$, goes to infinity. Since the Cramér-Rao bound is a lower bound on the variance of an unbiased estimator, this means that the ML estimate is about as good as any unbiased estimator can be. So if one has no idea what values of $\theta$ may occur, and needs to guarantee good performance for all cases, then the ML estimator is usually a good choice.
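As a sanity check on the argmax definition, the ML estimate can also be found by brute-force numerical search. The sketch below grids over candidate values of $\theta$ for a 1D Gaussian with known unit variance and confirms that the grid argmax agrees with the closed-form answer, the sample mean; the grid range, spacing, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 1.5
y = rng.normal(theta_true, 1.0, size=500)

# Gaussian log-likelihood (sigma^2 = 1), dropping the constant term,
# evaluated on a grid of candidate theta values.
grid = np.linspace(0.0, 3.0, 3001)
loglik = np.array([-0.5 * np.sum((y - t) ** 2) for t in grid])

theta_hat = grid[np.argmax(loglik)]

# The grid argmax should agree with the sample mean up to the grid spacing.
assert abs(theta_hat - y.mean()) <= grid[1] - grid[0]
```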

## Example 1: 1D Gaussian, $\theta$ unknown, $\sigma^2$ known

Let $\{Y_k\}_{k=1}^n$ be i.i.d. random variables with Gaussian distribution $N(\theta,\sigma^2)$, where the variance $\sigma^2$ is known and the mean $\theta$ is unknown. For this case, the logarithm of the density function is given by:

$\log p_{\theta}(Y) = -\frac{1}{2\sigma^2}\sum_{k=1}^{n}(Y_k-\theta)^2 - \frac{n}{2} \log (2\pi\sigma^2)$

Local maxima occur where the gradient is zero. Differentiating the log likelihood results in the following expression.

$\bigtriangledown_{\theta} \log p_{\theta}(Y) = \frac{d\log p_{\theta}(Y)}{d\theta} |_{\theta = \hat{\theta}} = \frac{1}{\sigma^2}\sum_{k=1}^{n}(Y_k - \hat{\theta}) = 0$

From this we obtain the ML estimate for $\theta$:

$\hat{\theta}=\frac{1}{n}\sum_{k=1}^{n} Y_k$

Notice that, by the following argument, this particular ML estimate is unbiased.

$E_{\theta}[\hat{\theta}] = E_{\theta}[\frac{1}{n}\sum_{k=1}^{n}Y_k] = \frac{1}{n}\sum_{k=1}^{n} E_{\theta}[Y_k] = \frac{1}{n} \sum_{k=1}^{n} \theta = \theta$

Moreover, for this special case, it can be shown that the ML estimator is also the UMVU estimator.
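The unbiasedness argument above can be confirmed with a quick Monte Carlo check: averaging the sample-mean estimate over many independent trials should recover $\theta$. The trial count, sample size, and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 2.0, 30, 50000

# The ML estimate for each trial is the sample mean of that trial's data.
samples = rng.normal(theta, 1.0, size=(trials, n))
theta_hat = samples.mean(axis=1)

# Averaging the estimates across trials approximates E[theta_hat] = theta.
assert abs(theta_hat.mean() - theta) < 0.01
```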

## Example 2: 1D Gaussian, $(\theta,\sigma^2)$ unknown

Let $\{Y_k\}_{k=1}^n$ be i.i.d. random variables with Gaussian distribution $N(\theta,\sigma^2)$, where both the mean $\theta$ and the variance $\sigma^2$ are unknown. For this case, the density function is given by:

$p_{\theta}(Y) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp(- \frac{1}{2\sigma^2}\sum_{k=1}^{n}(Y_k - \theta)^2)$

The logarithm of the density function is given by:

$\log p_{\theta}(Y) = -\frac{1}{2\sigma^2}\sum_{k=1}^{n}(Y_k-\theta)^2 - \frac{n}{2} \log (2\pi\sigma^2)$

Similarly, differentiating the log likelihood results in the following expressions.

$\bigtriangledown_{\theta} \log p_{\theta}(Y) = \bigtriangledown_{\theta} (-\frac{n}{2}\log 2\pi-\frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (Y_k-\theta)^2)$

$= -\frac{1}{2\sigma^2}\cdot 2\cdot (-1) \sum_{k=1}^{n}(Y_k-\hat{\theta})$

$= \frac{1}{\sigma^2} \sum_{k=1}^{n}(Y_k-\hat{\theta})$

$= \frac{1}{\sigma^2}(\sum_{k=1}^{n}Y_k - n\hat{\theta}) = 0$

Solving gives the same sample-mean estimate as in Example 1, $\hat{\theta}=\frac{1}{n}\sum_{k=1}^{n} Y_k$. Differentiating the log likelihood with respect to $\sigma^2$ gives

$\bigtriangledown_{\sigma^2} \log p_{\theta}(Y) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{k=1}^{n}(Y_k-\hat{\theta})^2 = 0$

so the ML estimate of the variance is

$\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(Y_k-\hat{\theta})^2$

Notice that this variance estimate is biased: its expectation is $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.
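The closed-form ML estimates for this two-parameter Gaussian case are the sample mean and the biased (divide-by-$n$) sample variance. A minimal numerical sketch, with illustrative seed and parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true, sigma2_true = 1.0, 4.0
y = rng.normal(theta_true, np.sqrt(sigma2_true), size=2000)

# Closed-form ML estimates: sample mean, and the biased (1/n) sample variance.
theta_hat = y.mean()
sigma2_hat = np.mean((y - theta_hat) ** 2)   # equals np.var(y, ddof=0)

assert np.isclose(sigma2_hat, y.var(ddof=0))
assert abs(theta_hat - theta_true) < 0.3
```

Note the variance estimate divides by $n$, not $n-1$; with numpy this corresponds to `ddof=0`, the default for `np.var`.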
