(See Also: Lecture 7_OldKiwi and BPE_OldKiwi)

Advantages of MLE:

  1. MLE nearly always has good convergence properties as the number of training samples increases.
  2. MLE is often simpler than other methods of parameter estimation.
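
The convergence behaviour in the first point can be illustrated numerically. The following is a minimal sketch, not part of the original notes: it uses the sample mean (shown in Example 1 to be the MLE of a Gaussian mean) and measures its average absolute error for growing sample sizes. The function name `mean_abs_error` and the values of `true_mu`, `true_sigma`, `trials`, and `seed` are assumptions chosen only for illustration.

```python
import numpy as np

def mean_abs_error(n, trials=200, true_mu=3.0, true_sigma=2.0, seed=0):
    """Average |mu_hat - true_mu| over several independent data sets of size n.

    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Each row is one independent training set of n univariate Gaussian samples.
    samples = rng.normal(true_mu, true_sigma, size=(trials, n))
    mu_hat = samples.mean(axis=1)  # MLE of the mean for each training set
    return np.mean(np.abs(mu_hat - true_mu))

for n in (10, 100, 1000, 10000):
    print(n, mean_abs_error(n))
```

Under these assumptions the printed error should shrink roughly like $ 1/\sqrt{n} $ as $ n $ grows.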

Parameter Estimation by MLE

Example 1: The Gaussian Case: Unknown $ \mu $

Suppose the samples are drawn from a multivariate normal population with mean $ \mu $ and covariance matrix $ \Sigma $. In this example only the mean is unknown. Let $ x_k $ be a sample point. Then

$ \ln p(x_k|\mu) = -\frac{1}{2} \ln \left[ (2\pi)^d|\Sigma| \right] - \frac{1}{2} (x_k - \mu)^t \Sigma^{-1} (x_k - \mu) $

$ \nabla_{\mu} \ln p(x_k|\mu) = \Sigma^{-1}(x_k-\mu) $

Summing this gradient over all $ n $ samples and setting the result to 0, we find that the MLE $ \hat{\mu} $ must satisfy

$ \sum_{k=1}^n \Sigma^{-1} (x_k-\hat{\mu}) = 0 $

Multiplying by $ \Sigma $ and rearranging, we obtain

$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^n x_k $

Thus the MLE for the unknown population mean is the arithmetic average of the training samples, called *the sample mean*.
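
The result of Example 1 can be checked numerically. The sketch below is an illustration under assumed values, not part of the original notes: it draws samples from a 2-D Gaussian with a known covariance matrix, computes the sample mean, and evaluates the summed gradient $ \sum_k \Sigma^{-1}(x_k - \mu) $ at that point. The helper names `log_likelihood` and `gradient_sum`, and the values of `true_mu` and `sigma`, are hypothetical.

```python
import numpy as np

def log_likelihood(samples, mu, sigma):
    """Sum over k of ln p(x_k | mu) for a Gaussian with known covariance sigma."""
    d = samples.shape[1]
    diff = samples - mu
    sigma_inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)  # ln|Sigma|
    # Quadratic form (x_k - mu)^t Sigma^{-1} (x_k - mu) for every sample k.
    quad = np.einsum("ki,ij,kj->k", diff, sigma_inv, diff)
    return np.sum(-0.5 * (d * np.log(2 * np.pi) + logdet) - 0.5 * quad)

def gradient_sum(samples, mu, sigma):
    """Sum over k of Sigma^{-1} (x_k - mu), the gradient of the log-likelihood."""
    return np.linalg.solve(sigma, (samples - mu).sum(axis=0))

# Assumed values, chosen only for illustration.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
samples = rng.multivariate_normal(true_mu, sigma, size=500)

mu_hat = samples.mean(axis=0)  # the sample mean, i.e. the MLE of mu

print("mu_hat:", mu_hat)
print("gradient at mu_hat:", gradient_sum(samples, mu_hat, sigma))
print("log-likelihood at mu_hat:", log_likelihood(samples, mu_hat, sigma))
print("log-likelihood at shifted mean:", log_likelihood(samples, mu_hat + 0.1, sigma))
```

The gradient at `mu_hat` should vanish up to floating-point error, and shifting the mean away from `mu_hat` should lower the log-likelihood. Note that the estimate does not depend on $ \Sigma $, as the derivation shows.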

Example 2: The Gaussian Case: Unknown $ \mu $ and $ \Sigma $

In this example both the mean $ \mu $ and the covariance matrix $ \Sigma $ are unknown. These unknown parameters constitute the components of the parameter vector $ \theta $. Consider the univariate case with $ \theta_1 = \mu $ and $ \theta_2 = \sigma^2 $.

$ \ln p(x_k|\theta) = -\frac{1}{2} \ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2 $

Taking the gradient of the above equation with respect to $ \theta $ gives

$ \nabla_{\theta}l = \nabla_{\theta} \ln p(x_k|\theta) = [ \frac{1}{\theta_2}(x_k - \theta_1) ; -\frac{1}{2\theta_2} +\frac{(x_k-\theta_1)^2}{2\theta_2^2}]. $

Summing over all $ n $ samples and setting each component to 0, we get

$ \sum_{k=1}^n \frac{1}{\hat{\theta_2}}(x_k-\hat{\theta_1}) = 0 $

and

$ -\sum_{k=1}^{n} \frac{1}{\hat{\theta_2}} + \sum_{k=1}^n \frac{(x_k-\hat{\theta_1})^2}{\hat{\theta_2}^2} = 0 $

where $ \hat{\theta_1} $ and $ \hat{\theta_2} $ are the maximum likelihood estimates for $ \theta_1 $ and $ \theta_2 $ respectively. Substituting $ \hat{\mu} = \hat{\theta_1} $ and $ \hat{\sigma}^2 = \hat{\theta_2} $, we obtain

$ \hat{\mu} = \frac{1}{n} \sum_{k=1}^n x_k $

and

$ \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^n(x_k - \hat{\mu})^2. $
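
As a small worked check of Example 2, the sketch below computes both estimates for a toy data set and confirms that the two summed gradient components vanish at $ (\hat{\theta_1}, \hat{\theta_2}) $. The function names `gaussian_mle` and `score` and the data values are assumptions made for illustration only.

```python
import numpy as np

def gaussian_mle(x):
    """Return (mu_hat, sigma2_hat) for univariate Gaussian data."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    # Divides by n (not n - 1), exactly as in the MLE formula.
    sigma2_hat = np.mean((x - mu_hat) ** 2)
    return mu_hat, sigma2_hat

def score(x, theta1, theta2):
    """Gradient of the log-likelihood summed over all samples.

    theta1 plays the role of mu and theta2 the role of sigma^2.
    """
    x = np.asarray(x, dtype=float)
    g1 = np.sum((x - theta1) / theta2)
    g2 = np.sum(-1.0 / (2 * theta2) + (x - theta1) ** 2 / (2 * theta2 ** 2))
    return np.array([g1, g2])

# Toy data chosen so the arithmetic is easy to follow by hand.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu_hat, sigma2_hat = gaussian_mle(x)

print(mu_hat, sigma2_hat)             # 5.0 4.0
print(score(x, mu_hat, sigma2_hat))   # both components are zero up to rounding
```

For this data the sum of the samples is 40, so $ \hat{\mu} = 40/8 = 5 $, and the squared deviations sum to 32, so $ \hat{\sigma}^2 = 32/8 = 4 $. The variance estimate matches `np.var(x)` with its default `ddof=0`, which also divides by $ n $.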

MLE Examples: Exponential and Geometric Distributions_OldKiwi

MLE Examples: Binomial and Poisson Distributions_OldKiwi
