Advantages of MLE

for ECE662: Decision Theory

Complement to Lecture 7: Maximum Likelihood Estimation and Bayesian Parameter Estimation, ECE662, Spring 2010, Prof. Boutin


MLE

  1. MLE generally has good convergence properties as the number of training samples increases.
  2. MLE is often simpler than other methods of parameter estimation.
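The convergence claim in item 1 can be illustrated with a small simulation. This is a minimal sketch, not part of the lecture: the function name `sample_mean_rmse`, the true parameters, the trial count, and the seed are all assumptions chosen for illustration. It estimates the mean of a univariate Gaussian by the sample mean (its MLE) and reports how the root-mean-square error changes with the training-set size.

```python
import math
import random

def sample_mean_rmse(n, trials=500, true_mu=2.0, true_sigma=1.0, seed=0):
    """RMS error of the sample mean (the Gaussian MLE of mu) over many training sets.

    All default values are illustrative assumptions, not values from the lecture.
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        # Draw one training set of n samples and compute the MLE of the mean.
        samples = [rng.gauss(true_mu, true_sigma) for _ in range(n)]
        mu_hat = sum(samples) / n
        squared_errors.append((mu_hat - true_mu) ** 2)
    return math.sqrt(sum(squared_errors) / trials)

rmse_by_n = {n: sample_mean_rmse(n) for n in (10, 100, 1000)}
for n, rmse in rmse_by_n.items():
    print(f"n = {n:5d}  RMS error of mu_hat = {rmse:.4f}")
```

For this particular estimator the error should shrink roughly like $ 1/\sqrt{n} $, so each tenfold increase in $ n $ is expected to cut the error by about a factor of three.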

Parameter Estimation by MLE

Example 1: The Gaussian Case: Unknown $ \mu $

Suppose the samples are drawn from a multivariate normal population with mean $ \mu $ and covariance matrix $ \Sigma $. In this example only the mean is unknown. Let $ x_k $ be a $ d $-dimensional sample point. Then

$ \ln p(x_k|\mu) = -\frac{1}{2} \ln \left[ (2\pi)^d|\Sigma| \right] - \frac{1}{2} (x_k - \mu)^t \Sigma^{-1} (x_k - \mu) $

$ \nabla_{\mu} \ln p(x_k|\mu) = \Sigma^{-1}(x_k-\mu) $

Summing this gradient over all $ n $ training samples and setting the result to zero, we get

$ \sum_{k=1}^n \Sigma^{-1} (x_k-\hat{\mu}) = 0 $

Multiplying by $ \Sigma $ and rearranging, we obtain

$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^n x_k $

Thus the MLE for the unknown population mean is the arithmetic average of the training samples, called *the sample mean*.
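The result can be checked numerically. The following is a minimal sketch for the two-dimensional case: the covariance matrix `SIGMA`, the true mean, the sample count, and the helper names are all assumptions made up for illustration. It verifies that the summed gradient $ \sum_k \Sigma^{-1}(x_k - \mu) $ vanishes at the sample mean and that the log-likelihood drops when $ \mu $ is moved away from it.

```python
import math
import random

# Known covariance matrix (an assumed example value) and its inverse, 2x2 case.
SIGMA = [[2.0, 0.6],
         [0.6, 1.0]]
det = SIGMA[0][0] * SIGMA[1][1] - SIGMA[0][1] * SIGMA[1][0]
SIGMA_INV = [[ SIGMA[1][1] / det, -SIGMA[0][1] / det],
             [-SIGMA[1][0] / det,  SIGMA[0][0] / det]]

def quad_form(x, mu):
    """(x - mu)^t Sigma^{-1} (x - mu) for 2-D points."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (d0 * (SIGMA_INV[0][0] * d0 + SIGMA_INV[0][1] * d1)
            + d1 * (SIGMA_INV[1][0] * d0 + SIGMA_INV[1][1] * d1))

def log_likelihood(samples, mu):
    """Log-likelihood, omitting the additive constant that does not depend on mu."""
    return -0.5 * sum(quad_form(x, mu) for x in samples)

def summed_gradient(samples, mu):
    """sum_k Sigma^{-1} (x_k - mu), the gradient of the log-likelihood in mu."""
    s0 = sum(x[0] - mu[0] for x in samples)
    s1 = sum(x[1] - mu[1] for x in samples)
    return (SIGMA_INV[0][0] * s0 + SIGMA_INV[0][1] * s1,
            SIGMA_INV[1][0] * s0 + SIGMA_INV[1][1] * s1)

# Draw correlated samples x = mu + L z, where L is the Cholesky factor of SIGMA.
l00 = math.sqrt(SIGMA[0][0])
l10 = SIGMA[1][0] / l00
l11 = math.sqrt(SIGMA[1][1] - l10 ** 2)
true_mu = (1.0, -2.0)            # assumed true mean, used only to generate data
rng = random.Random(0)
samples = []
for _ in range(200):
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    samples.append((true_mu[0] + l00 * z0, true_mu[1] + l10 * z0 + l11 * z1))

# MLE of the mean: the component-wise arithmetic average.
n = len(samples)
mu_hat = (sum(x[0] for x in samples) / n, sum(x[1] for x in samples) / n)

grad = summed_gradient(samples, mu_hat)
shifted = (mu_hat[0] + 0.1, mu_hat[1] - 0.1)
print("mu_hat =", mu_hat)
print("summed gradient at mu_hat =", grad)
print("log-likelihood is lower at a shifted mu:",
      log_likelihood(samples, shifted) < log_likelihood(samples, mu_hat))
```

Note that `mu_hat` is computed without using `SIGMA` at all, which mirrors the derivation: the covariance matrix cancels once the stationarity condition is multiplied by $ \Sigma $.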

Example 2: The Gaussian Case: Unknown $ \mu $ and $ \Sigma $

In this example both the mean $ \mu $ and the covariance matrix $ \Sigma $ are unknown. These unknown parameters constitute the components of the parameter vector $ \theta $. Consider the univariate case with $ \theta_1 = \mu $ and $ \theta_2 = \sigma^2 $.

$ \ln p(x_k|\theta) = -\frac{1}{2} \ln (2\pi\theta_2) - \frac{1}{2\theta_2}(x_k - \theta_1)^2 $

Taking the gradient of the above equation with respect to $ \theta $ gives

$ \nabla_{\theta} \ln p(x_k|\theta) = \begin{bmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\ -\frac{1}{2\theta_2} +\frac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}. $

Summing each component over the $ n $ samples and setting it to zero, we get

$ \sum_{k=1}^n \frac{1}{\hat{\theta_2}}(x_k-\hat{\theta_1}) = 0 $

and

$ -\sum_{k=1}^{n} \frac{1}{\hat{\theta_2}} + \sum_{k=1}^n \frac{(x_k-\hat{\theta_1})^2}{\hat{\theta_2}^2} = 0 $

where $ \hat{\theta_1} $ and $ \hat{\theta_2} $ are the maximum likelihood estimates for $ \theta_1 $ and $ \theta_2 $ respectively. Multiplying the first equation by $ \hat{\theta_2} $ and the second by $ \hat{\theta_2}^2 $, then substituting $ \hat{\mu} = \hat{\theta_1} $ and $ \hat{\sigma}^2 = \hat{\theta_2} $, we obtain

$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^n x_k $

and

$ \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^n(x_k - \hat{\mu})^2. $
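The two estimates translate directly into code. This is a minimal sketch: the function name `gaussian_mle`, the synthetic data, and the seed are assumptions for illustration. It computes $ \hat{\mu} $ and $ \hat{\sigma}^2 $ as derived above and compares the variance estimate with the standard library's `statistics.pvariance` (which divides by $ n $) and `statistics.variance` (which divides by $ n - 1 $).

```python
import random
import statistics

def gaussian_mle(samples):
    """Return (mu_hat, sigma2_hat), the univariate Gaussian MLEs."""
    n = len(samples)
    mu_hat = sum(samples) / n
    # The MLE of the variance divides by n, not by n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, sigma2_hat

rng = random.Random(0)
# Synthetic training data with assumed true mu = 5 and sigma^2 = 4.
samples = [rng.gauss(5.0, 2.0) for _ in range(1000)]

mu_hat, sigma2_hat = gaussian_mle(samples)
print(f"mu_hat     = {mu_hat:.3f}")
print(f"sigma2_hat = {sigma2_hat:.3f}")

# The MLE agrees with the divide-by-n (population) variance formula ...
print(abs(sigma2_hat - statistics.pvariance(samples)) < 1e-9)
# ... and is slightly smaller than the divide-by-(n - 1) sample variance.
print(sigma2_hat < statistics.variance(samples))
```

Because it divides by $ n $ rather than $ n - 1 $, the MLE of the variance is a biased estimator, although the difference becomes negligible as $ n $ grows.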

