m (Have great difficulty in pasting formula...)
 
 
(27 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<br>
 
 
<center><font size="4"></font>  
 
<center><font size="4"></font>  
<font size="4">'''Introduction to Maximum Likelihood Estimation''' <br> </font> <font size="2">A [https://www.projectrhea.org/learning/slectures.php slecture] by Wen Yi </font>  
+
<font size="4">'''Introduction to Maximum Likelihood Estimation''' <br> </font>  
 +
 
 +
<font size="2">A [http://www.projectrhea.org/learning/slectures.php slecture] by Wen Yi </font>  
 +
 
 +
Partly based on the [[2014_Spring_ECE_662_Boutin_Statistical_Pattern_recognition_slectures|ECE662 Spring 2014 lecture]] material of [[user:mboutin|Prof. Mireille Boutin]].
  
<font size="2"></font> [[Introduction to Maximum Likelihood Estimation.pdf|pdf file:Introduction to Maximum Likelihood Estimation.pdf]]
 
 
</center>  
 
</center>  
 
<br>  
 
<br>  
Line 11: Line 13:
 
----
 
----
  
<span lang="EN-US" style="font-size:14.0pt">&nbsp;</span>  
+
<br>  
  
<span lang="EN-US" style="font-size:14.0pt">&nbsp;</span>  
+
=== <br> 1. Introduction  ===
  
<span lang="EN-US">&nbsp;</span>  
+
&nbsp; Maximum Likelihood Estimation (MLE) is a parametric method of density estimation. When we apply MLE to a data set drawn from a fixed density, it provides estimates for the parameters of the density model. In practice, we search over all possible sets of parameter values and pick the set with the maximum likelihood, that is, the set under which the observed samples are most probable.<br>
  
 
<br>  
 
<br>  
  
'''<span lang="EN-US" style="font-size:12.0pt">1. Introduction</span>'''
+
----
  
<span lang="EN-US">In statistics,
+
----
maximum-likelihood estimation (MLE) is a method of estimating the parameters of
+
a statistical model. When applied to a data set and given a statistical model,
+
maximum-likelihood estimation provides estimates for the model's parameters.</span>
+
  
<span lang="EN-US">In maximum
+
<br>  
likelihood estimation, we search over all possible sets of parameter values for
+
a specified model to find the set of values for which the observed sample was
+
most likely. That is, we find the set of parameter values that, given a model,
+
were most likely to have given us the data that we have in hand.</span>  
+
  
<span lang="EN-US">&nbsp;</span>
+
=== 2. Basic method  ===
  
'''<span lang="EN-US" style="font-size:12.0pt">2. Basic method</span>'''<br>  
+
&nbsp; Suppose we have a set of n independent and identically distributed observation samples. The density function is fixed, but unknown to us. We assume that the density function belongs to a certain family of distributions, and let&nbsp;θ be the vector of parameters for this family. The goal of MLE is to find the parameter vector that is as close as possible to the true parameter value.<br>
  
<span lang="EN-US">Suppose there is
+
&nbsp; To use MLE, we first write the joint density function for all the sample observations. For an i.i.d. data set of samples, the joint density function is:<br>
a sample <math>x_1,\ x_2,\ \dots ,\ x_N</math> of n independent and identically distributed observations from
+
a distribution with an unknown probability density function <span class="texhtml">''f''<sub>0</sub></span>. We can say that the function </span><span lang="EN-US">&lt;img width=12 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image002.png"&gt;</span><span lang="EN-US">&nbsp;belongs to a certain family of distributions </span><span lang="EN-US">&lt;img width=96 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image003.png"&gt;</span><span lang="EN-US">, where θ is a vector of parameters for this family, so that
+
</span><span lang="EN-US">&lt;img width=76 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image004.png"&gt;</span><span lang="EN-US">. The value </span><span lang="EN-US">&lt;img width=14 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image005.png"&gt;</span><span lang="EN-US">&nbsp;is unknown and is referred to as the true value of the
+
parameter. So, using MLE, we want to find an estimator which would be as close
+
to the true value </span><span lang="EN-US">&lt;img width=14 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image005.png"&gt;</span><span lang="EN-US">&nbsp;as possible.</span>  
+
  
<span lang="EN-US">To use the
+
[[Image:GMMimage006.png|center]]
method of maximum likelihood, one first specifies the joint density function
+
for all observations. For an independent and identically distributed sample,
+
this joint density function is</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; As the samples x_i are independent of each other, the likelihood of θ given the data set of samples x_1,x_2,…,x_n can be defined as:
width=347 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image006.png"&gt;</span>
+
  
<span lang="EN-US">As each sample </span><span lang="EN-US">&lt;img width=12 height=21
+
[[Image:GMMimage010.png|center]]&nbsp; In practice, it’s more convenient to take ln of both sides, which gives the log-likelihood. The formula then becomes:
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image007.png"&gt;</span><span lang="EN-US">&nbsp;is independent with each other, the likelihood of </span><span lang="EN-US">&lt;img width=8 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image008.png"&gt;</span><span lang="EN-US">&nbsp;with the observation of samples </span><span lang="EN-US">&lt;img width=70 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image009.png"&gt;</span><span lang="EN-US">&nbsp;can be defined as:</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage011.png|center]]<span style="line-height: 1.5em;">&nbsp; Then, for a fixed set of samples, to maximize the likelihood we should choose the θ that satisfies:</span><br>[[Image:GMMimage012.png|center]]&nbsp; To find the maximum of lnL(θ;x_1,x_2,…,x_N ), we take its derivative with respect to θ and find the θ value that makes the derivative equal to 0.
width=315 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image010.png"&gt;</span>  
+
  
<span lang="EN-US">In practice, it’s
+
[[Image:GMMimage014.png|center]]
more convenient to take ln of both sides, which gives the log-likelihood. Then the
+
formula becomes:</span>
+
  
<span lang="EN-US">&lt;img
+
<br>  
width=216 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image011.png"&gt;</span>  
+
  
<span lang="EN-US">Then, for a
+
&nbsp; To check our result, we should guarantee that the second derivative of lnL(θ;x_1,x_2,…,x_n ) with respect to θ is negative.
fixed set of samples, to maximize the likelihood of </span><span lang="EN-US">&lt;img width=8 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image008.png"&gt;</span><span lang="EN-US">, we should choose the value that satisfies:</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage016.png|center]]
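
Because the formulas of this section are shown only as images, the steps above can be restated in one place as the following sketch. The notation <math>f(x_i \mid \theta)</math> for the density of a single sample and <math>\hat{\theta}</math> for the estimate is an assumption of this sketch and may differ from the symbols used in the figures.

```latex
\begin{align}
f(x_1, x_2, \dots, x_n \mid \theta) &= \prod_{i=1}^{n} f(x_i \mid \theta)
  && \text{joint density of an i.i.d. sample} \\
L(\theta; x_1, x_2, \dots, x_n) &= \prod_{i=1}^{n} f(x_i \mid \theta)
  && \text{likelihood of } \theta \\
\ln L(\theta; x_1, x_2, \dots, x_n) &= \sum_{i=1}^{n} \ln f(x_i \mid \theta)
  && \text{log-likelihood} \\
\hat{\theta} &= \arg\max_{\theta} \, \ln L(\theta; x_1, x_2, \dots, x_n)
  && \text{ML estimate} \\
\left. \frac{\partial \ln L}{\partial \theta} \right|_{\theta = \hat{\theta}} &= 0,
  \qquad
  \left. \frac{\partial^2 \ln L}{\partial \theta^2} \right|_{\theta = \hat{\theta}} < 0
  && \text{first- and second-order conditions}
\end{align}
```
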
width=397 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image012.png"&gt;</span>
+
  
<span lang="EN-US">To find the
+
<br>  
maximum of </span><span lang="EN-US">&lt;img width=116 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image013.png"&gt;</span><span lang="EN-US">, we take the derivative of </span><span lang="EN-US">&lt;img width=8 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image008.png"&gt;</span><span lang="EN-US">&nbsp;on it and find the</span><span lang="EN-US">&lt;img width=8 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image008.png"&gt;</span><span lang="EN-US">&nbsp;value that makes the derivative equal to 0.</span>
+
  
<span lang="EN-US">&lt;img
+
----
width=161 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image014.png"&gt;</span>
+
  
<span lang="EN-US">To check our
+
----
result we should guarantee that the second derivative of </span><span lang="EN-US">&lt;img width=8 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image008.png"&gt;</span><span lang="EN-US">&nbsp;on </span><span lang="EN-US">&lt;img width=115 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image015.png"&gt;</span><span lang="EN-US">&nbsp;is negative.</span>
+
  
<span lang="EN-US">&lt;img
+
width=167 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image016.png"&gt;</span>
+
  
<span lang="EN-US">&nbsp;</span>  
+
=== <br> 3. Practice considerations  ===
  
'''<span lang="EN-US" style="font-size:12.0pt">3. Practice</span>''' considerations
+
==== 3.1 Log-likelihood  ====
  
<span lang="EN-US">3.1 Log-likelihood</span>  
+
&nbsp; As the likelihood comes from the joint density function, it is usually a product of the probabilities of all the observations, which is hard to calculate and analyse. Also, as the probability of an observation is at most 1 (say each one is 0.1), the more data we have, the smaller the likelihood value becomes (e.g. 0.00000001 or smaller). Such small values make the likelihood difficult to calculate and store.<br>
  
<span lang="EN-US">Just as
+
&nbsp; To solve this problem, we take the natural log of the original likelihood, so that the joint probability is expressed as the sum of the natural logs of the individual probabilities. In this way, the value stays easy to handle as the number of samples increases. Note that as the probability of a single observation is at most 1, the log-likelihood is never greater than 0.<br>
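
The difference between the product and the sum of logs can be seen in a minimal Python sketch. The data here are hypothetical (400 observations, each with probability 0.1, echoing the 0.1 used in the text), and the variable names are chosen only for this illustration.

```python
import math

# Hypothetical data: 400 observations, each with probability 0.1 under the model.
probs = [0.1] * 400

# Direct product: 0.1**400 is far below the smallest positive double,
# so the result underflows to exactly 0.0.
likelihood = math.prod(probs)

# Sum of natural logs: the same information, but easily representable.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0
print(log_likelihood)  # roughly -921.03, i.e. 400 * ln(0.1)
```

The product can no longer distinguish between parameter values once it reaches 0.0, while the log-likelihood still can.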
mentioned above, to make life a little easier, we can work with the natural log
+
of likelihoods rather than the likelihoods themselves. The main reason for this
+
is, computational rather than theoretical. If you multiply lots of very small
+
numbers together (say all less than 0.0001) then you will very quickly end up
+
with a number that is too small to be represented by any calculator or computer
+
as different from zero. This situation will often occur in calculating
+
likelihoods, when we are often multiplying the probabilities of lots of rare
+
but independent events together to calculate the joint probability.</span>  
+
  
<span lang="EN-US">With
+
==== 3.2 Removing the constant  ====
log-likelihoods, we simply add them together rather than multiply them
+
(log-likelihoods will always be negative, and will just get larger (more
+
negative) rather than approaching 0).</span>
+
  
<span lang="EN-US">So,
+
&nbsp; Take the binomial distribution as an example. The likelihood for this distribution is:
log-likelihoods are conceptually no different to normal likelihoods. When we optimize
+
the log-likelihood, with respect to the model parameters, we also optimize the
+
likelihood with respect to the same parameters, for there is a one-to-one
+
(monotonic) relationship between numbers and their logs.</span>
+
  
<span lang="EN-US">&nbsp;</span>
+
[[Image:GMMimage017.png|center]]
  
<span lang="EN-US">3.2 Removing the constant</span>  
+
&nbsp; In MLE, the total number of samples, n, and the number of occurrences, k, are fixed by the data. The first part of this likelihood does not depend on the value of p, so it stays constant as p changes. Removing it therefore does not affect the comparison of likelihoods between different values of p. As a result, we can estimate the likelihood of the binomial distribution as follows rather than in the way above:<br>
  
<span lang="EN-US">For example the
+
[[Image:GMMimage018.png|center]]&nbsp; There is another reason to do this: the first part is never smaller than 1 and can become very large as the number of samples increases, which makes the likelihood harder to calculate and store. Removing the constant part therefore also makes life easier.
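
A small Python sketch can check that dropping the first part leaves the maximizing p unchanged. The data (k = 7 occurrences in n = 10 trials), the function names and the candidate grid are all hypothetical choices for this illustration.

```python
import math

def binomial_likelihood(p, n, k):
    """Full binomial likelihood, including the binomial coefficient."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_kernel(p, n, k):
    """Likelihood with the constant first part removed."""
    return p**k * (1 - p)**(n - k)

# Hypothetical data: k = 7 occurrences in n = 10 trials.
n, k = 10, 7
candidates = [i / 100 for i in range(1, 100)]  # p = 0.01, 0.02, ..., 0.99

best_full = max(candidates, key=lambda p: binomial_likelihood(p, n, k))
best_kernel = max(candidates, key=lambda p: binomial_kernel(p, n, k))

print(best_full, best_kernel)  # both are 0.7, which equals k / n
```

The two functions differ only by the factor math.comb(n, k), which is the same for every candidate p, so the ranking of the candidates does not change.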
likelihood function for the binomial distribution is:</span>
+
  
<span lang="EN-US">&lt;img
+
<br>  
width=175 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image017.png"&gt;</span>  
+
  
<span lang="EN-US">In the context
+
==== 3.3 Numerical MLE  ====
of MLE, we noted that the values representing the data will be fixed: these are
+
''n'' and'' k''. In this case, the binomial 'co-efficient' depends only
+
upon these constants. Because it does not depend on the value of the parameter ''p''
+
we can essentially ignore this first term. This is because any value for ''p''
+
which maximizes the above quantity will also maximize</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Sometimes we cannot write an equation that can be differentiated to find the MLE parameter estimates. In these cases, we may resort to an exhaustive search, trying every candidate value to find the one with the maximum likelihood. With this method, the step between candidate values determines the calculation time. Thus, we should choose the step (0.01, 0.001 or 0.0000001) according to the accuracy we need.<br>
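
The trade-off between step size and accuracy can be sketched in Python as follows. This is only an illustration on a case where the exact answer is known: the binomial log-likelihood without its constant term, with hypothetical data of 7 occurrences in 9 trials. The function names are invented for this sketch.

```python
import math

def log_likelihood(p, n, k):
    """Binomial log-likelihood without the constant term."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def grid_search_mle(n, k, step):
    """Evaluate every p on a grid inside (0, 1) and return the best one."""
    count = round(1 / step)
    grid = [i / count for i in range(1, count)]  # excludes p = 0 and p = 1
    return max(grid, key=lambda p: log_likelihood(p, n, k))

# Hypothetical data: 7 occurrences in 9 trials, so the exact MLE is 7/9.
for step in (0.1, 0.01, 0.001):
    print(step, grid_search_mle(9, 7, step))
```

Each result lies within one step of 7/9, and a ten times smaller step costs ten times as many likelihood evaluations.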
width=80 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image018.png"&gt;</span>  
+
  
<span lang="EN-US">This means that
+
<br>  
the likelihood will have no meaningful scale in and of itself. This is not
+
usually important, however, for as we shall see, we are generally interested
+
not in the absolute value of the likelihood but rather in the ''ratio ''between
+
two likelihoods - in the context of a likelihood ratio test.</span>  
+
  
<span lang="EN-US">We may often
+
----
want to ignore the parts of the likelihood that do not depend upon the
+
parameters in order to reduce the computational intensity of some problems.
+
Even in the simple case of a binomial distribution, if the number of trials
+
becomes very large, the calculation of the factorials can become infeasible.</span>
+
  
<span lang="EN-US">&nbsp;</span>
+
----
  
<span lang="EN-US">3.3 Numerical MLE</span>  
+
<br>  
  
<span lang="EN-US">Sometimes we cannot write an equation that
+
=== 4. Some basic examples  ===
can be differentiated to find the MLE parameter estimates. This is especially
+
likely if the model is complex and involves many parameters and/or complex
+
probability functions. (e.g. the normal mixture probability distribution)</span>
+
  
<span lang="EN-US">In this scenario, it is also typically not
+
==== 4.1 Poisson Distribution  ====
feasible to evaluate the likelihood at all points, or even a reasonable number
+
of points. In the parameter space of the problem in the coin toss example, the
+
parameter space was only one-dimensional (i.e. only one parameter) and ranged
+
between 0 and 1. Nonetheless, because p can theoretically take any value
+
between 0 and 1, the MLE will always be an approximation (albeit an incredibly
+
accurate one) if we just evaluate the likelihood for a finite number of
+
parameter values. For example, we chose to evaluate the likelihood at steps of
+
0.02. But we could have chosen steps of 0.01, of 0.001, of 0.000000001, etc. In
+
theory and practice, one has to set a minimum tolerance by which you are happy
+
for your estimates to be out. This is why computers are essential for these
+
types of problems: they can tabulate lots and lots of values very quickly and
+
therefore achieve a much finer resolution.</span>
+
  
<span lang="EN-US">&nbsp;</span>
+
&nbsp; For the Poisson distribution, the probability mass function is:
  
'''<span lang="EN-US" style="font-size:12.0pt">4. Some basic examples</span>'''
+
[[Image:GMMimage019.png|center]]
  
<span lang="EN-US">4.1 Poisson Distribution</span>
+
&nbsp; Let X_1,X_2,…,X_N be independent and identically distributed (iid) Poisson random variables. The joint frequency function is then the product of the marginal frequency functions, so the log-likelihood of the Poisson distribution is:
  
<span lang="EN-US">For Poisson
+
[[Image:GMMimage021.png|center]]
distribution the expression of probability is:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Take the derivative with respect to λ and find the λ value that makes the derivative equal to 0.
width=108 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image019.png"&gt;</span>
+
  
<span lang="EN-US">Let </span><span lang="EN-US">&lt;img width=75 height=21
+
[[Image:GMMimage022.PNG|center]]
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image020.png"&gt;</span><span lang="EN-US">&nbsp;be the Independent and identically distributed (iid) Poisson random
+
variables. Then, we will have a joint frequency function that is the product of
+
marginal frequency functions. The log likelihood of Poisson distribution thus
+
should be:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Thus, the ML estimate for the Poisson distribution is:
width=554 height=125
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image021.png"&gt;</span>
+
  
<span lang="EN-US">Take the derivative
+
[[Image:GMMimage027.png|center]]
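
As a numerical check of this result, the following Python sketch compares the closed-form estimate (the sample mean, which is what setting the derivative of the Poisson log-likelihood to 0 yields) with a direct search over candidate values of λ. The count data and all names are hypothetical.

```python
import math

def poisson_log_likelihood(lam, xs):
    """ln L(lambda) = sum_i (x_i ln(lambda) - lambda - ln(x_i!))."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Hypothetical count data (not taken from the lecture).
xs = [2, 4, 3, 5, 1, 3, 4, 2]

# Closed-form estimate: the sample mean.
lam_hat = sum(xs) / len(xs)

# Numerical check on a grid of candidate rates 0.01, 0.02, ..., 10.00.
grid = [i / 100 for i in range(1, 1001)]
lam_grid = max(grid, key=lambda lam: poisson_log_likelihood(lam, xs))

print(lam_hat, lam_grid)  # both are 3.0
```

math.lgamma(x + 1) is used for ln(x!) so that large counts do not overflow.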
of </span><span lang="EN-US">&lt;img width=7 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image022.png"&gt;</span><span lang="EN-US">&nbsp;on it and find the</span><span lang="EN-US">&lt;img width=7 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image022.png"&gt;</span><span lang="EN-US">&nbsp;value that makes the derivative equal to 0.</span>
+
  
<span lang="EN-US">&lt;img
+
==== <br> 4.2 Exponential distribution  ====
width=162 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image023.png"&gt;</span>  
+
  
<span lang="EN-US">&lt;img
+
&nbsp; For the exponential distribution, the probability density function is:<br>
width=217 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image024.png"&gt;</span>  
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage028.png|border|center]]
width=95 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image025.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Let X_1,X_2,…,X_N be independent and identically distributed (iid) exponential random variables. As the density is 0 when x&lt;0, no samples can fall in the x&lt;0 region. Thus, for all X_1,X_2,…,X_N, we only need to consider the x≥0 part. The joint density function is then the product of the marginal density functions, so the log-likelihood of the exponential distribution is:
width=68 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image026.png"&gt;</span>
+
  
<span lang="EN-US">Thus, the ML
+
[[Image:GMMimage031.png|border|center]]
estimation for Poisson distribution should be:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Take the derivative with respect to λ and find the λ value that makes the derivative equal to 0.
width=35 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image027.png"&gt;</span>
+
  
<span lang="EN-US">&nbsp;</span>
+
[[Image:GMMimage032.PNG|border|center]]&nbsp; Thus, the ML estimate for the exponential distribution is:
  
<span lang="EN-US">4.2 Exponential distribution</span>
+
[[Image:GMMimage035.png|border|center]]
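
Since the formulas are shown only as images, the derivation can be sketched in LaTeX as follows. The sketch assumes the rate parameterization <math>f(x;\lambda) = \lambda e^{-\lambda x}</math> for <math>x \ge 0</math>, which may not match the figures: if they parameterize the distribution by its mean instead, the estimate is the reciprocal of the one below.

```latex
\begin{align}
\ln L(\lambda; x_1, \dots, x_N)
  &= \sum_{i=1}^{N} \ln\!\left( \lambda e^{-\lambda x_i} \right)
   = N \ln \lambda - \lambda \sum_{i=1}^{N} x_i \\
\frac{\partial \ln L}{\partial \lambda}
  &= \frac{N}{\lambda} - \sum_{i=1}^{N} x_i = 0
  \quad\Longrightarrow\quad
  \hat{\lambda} = \frac{N}{\sum_{i=1}^{N} x_i} \\
\frac{\partial^2 \ln L}{\partial \lambda^2}
  &= -\frac{N}{\lambda^2} < 0
\end{align}
```

The negative second derivative confirms that the stationary point is a maximum.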
  
<span lang="EN-US">For exponential distribution the expression
+
==== <br> 4.3 Gaussian distribution  ====
of probability is:</span>  
+
  
<span lang="EN-US">&lt;img
+
&nbsp; For the Gaussian distribution, the probability density function is:
width=167 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image028.png"&gt;</span>
+
  
<span lang="EN-US">Let </span><span lang="EN-US">&lt;img width=75 height=21
+
[[Image:GMMimage036.png|frame|center]]
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image020.png"&gt;</span><span lang="EN-US">&nbsp;be the Independent and identically distributed (iid)
+
exponential random variables. As </span><span lang="EN-US">&lt;img width=80 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image029.png"&gt;</span><span lang="EN-US">&nbsp;when x&lt;0, no samples can sit in x&lt;0 region. Thus, for
+
all </span><span lang="EN-US">&lt;img width=75 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image020.png"&gt;</span><span lang="EN-US">, we can only focus on the </span><span lang="EN-US">&lt;img width=34 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image030.png"&gt;</span><span lang="EN-US">&nbsp;part. Then, we will have a joint frequency function that is
+
the product of marginal frequency functions. The log likelihood of exponential
+
distribution thus should be:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Let X_1,X_2,…,X_N be independent and identically distributed (iid) Gaussian random variables. The joint density function is then the product of the marginal density functions, so the log-likelihood of the Gaussian distribution is:
width=517 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image031.png"&gt;</span>
+
  
<span lang="EN-US">Take the derivative
+
[[Image:GMMimage037.png|border|center]]
of </span><span lang="EN-US">&lt;img width=7 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image022.png"&gt;</span><span lang="EN-US">&nbsp;on it and find the</span><span lang="EN-US">&lt;img width=7 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image022.png"&gt;</span><span lang="EN-US">&nbsp;value that makes the derivative equal to 0.</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Take the derivatives with respect to μ and Σ and find the μ and Σ values that make the derivatives equal to 0.
width=162 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image023.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage038.PNG|border|center]][[Image:GMMimage039.PNG|border|center]]&nbsp; Thus, the ML estimates for the Gaussian distribution are:
width=149 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image032.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage040.PNG|border|center]]
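
For the one-dimensional case, where Σ reduces to a scalar variance, the estimates can be computed with a short Python sketch. The samples and variable names are hypothetical. The sketch also shows that the ML variance, which divides by N, differs from the unbiased sample variance, which divides by N - 1.

```python
import statistics

# Hypothetical one-dimensional samples (not taken from the lecture).
xs = [4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 4.8, 5.4]
n = len(xs)

# ML estimate of the mean: the sample average.
mu_hat = statistics.fmean(xs)

# ML estimate of the variance: average squared deviation, dividing by N.
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

# The unbiased sample variance divides by N - 1, so it is slightly larger.
var_unbiased = statistics.variance(xs)

print(mu_hat, var_hat, var_unbiased)
```

The two variance values are related by var_hat = var_unbiased * (N - 1) / N, so the gap shrinks as N grows.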
width=87 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image033.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
<br>  
width=67 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image034.png"&gt;</span>  
+
  
<span lang="EN-US">Thus, the ML
+
----
estimation for </span><span lang="EN-US">exponential</span><span lang="EN-US"> distribution
+
should be:</span>
+
  
<span lang="EN-US">&lt;img
+
----
width=35 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image035.png"&gt;</span>
+
  
<span lang="EN-US">&nbsp;</span>  
+
<br>  
  
<span lang="EN-US">4.3 Gaussian distribution</span>
+
=== 5. Some advanced examples  ===
  
<span lang="EN-US">For Gaussian distribution the expression of
+
==== 5.1 Expression of Estimated Parameters  ====
probability is:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; The estimates above are all based on the assumption that the data follow a single distribution. But what about estimating a mixture of distributions?
width=249 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image036.png"&gt;</span>
+
  
<span lang="EN-US">Let </span><span lang="EN-US">&lt;img width=75 height=21
+
&nbsp; To simplify the problem, we only discuss the Gaussian Mixture Model (GMM) here. The same method extends easily to other kinds of mixture models and to mixtures of different models.
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image020.png"&gt;</span><span lang="EN-US">&nbsp;be the Independent and identically distributed (iid) Gaussian random
+
variables. Then, we will have a joint frequency function that is the product of
+
marginal frequency functions. The log likelihood of Gaussian distribution thus
+
should be:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; To start with, note that if we leave the number of Gaussian functions in the GMM free, it never settles on a best value, because adding more Gaussian functions keeps improving the fit to the data. Determining how many Gaussian functions a GMM includes is a clustering problem, so here we assume that the number of Gaussian functions is known and equal to k.<br>
width=554 height=104
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image037.png"&gt;</span>  
+
  
<span lang="EN-US">Take the derivative
+
&nbsp; As this distribution is a mixture of Gaussians, the probability density function is:
of </span><span lang="EN-US">&lt;img width=21 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image038.png"&gt;</span><span lang="EN-US">&nbsp;on it and find the </span><span lang="EN-US">&lt;img width=21 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image038.png"&gt;</span><span lang="EN-US">&nbsp;value that makes the derivative equal to 0.</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage046.png|border|center]]
width=417 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image039.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; α_j is the weight of Gaussian function g_j (x). <br>  
width=94 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image040.png"&gt;</span>  
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage049.png|border|center]]&nbsp; Thus, the parameters to be estimated are:
width=68 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image041.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage050.png|border|center]]&nbsp; Let X_1,X_2,…,X_N be independent and identically distributed (iid) random variables drawn from the Gaussian Mixture Model (GMM).
width=491 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image042.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Following Bayes rule, the responsibility that a mixture component takes for explaining an observation X_i is:
width=108 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image043.png"&gt;</span>
+
  
<span lang="EN-US">Thus, the ML
+
[[Image:GMMimage052.png|border|center]]&nbsp; The joint density function is then the product of the marginal density functions, so the log-likelihood of the Gaussian Mixture Model is:
estimation for </span><span lang="EN-US">Gaussian</span><span lang="EN-US"> distribution
+
should be:</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage053.png|border|center]]&nbsp; Take the derivatives with respect to μ_j and Σ_j and find the μ_j and Σ_j values that make the derivatives equal to 0.
width=36 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image044.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage055.png|border|center]][[Image:GMMimage056.PNG|border|center]]
width=110 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image045.png"&gt;</span>
+
  
<span lang="EN-US">&nbsp;</span>
+
[[Image:GMMimage059.png|border|center]][[Image:GMMimage060.PNG|border|center]]
  
'''<span lang="EN-US" style="font-size:12.0pt">5. Some</span>''' advanced examples
+
&nbsp; The weights α_j are subject to the constraint:
  
<span lang="EN-US">5.1 Expression of Estimated Parameters</span>
+
[[Image:GMMimage065.png|border|center]]
  
<span lang="EN-US">The above
+
&nbsp; Basic optimization theory shows that the optimal α_j is:
estimation all base on the assumption that the distribution to be estimated follows
+
the distribution of a single function, but how about the estimation of the mixture
+
of functions?</span>
+
  
<span lang="EN-US">To simplify the problem,
+
[[Image:GMMimage067.png|border|center]]
we only talk about Gaussian Mixture Model (GMM) here. Using the same method, it’s
+
easy to extend it to other kind of mixture model and the mixture between
+
different models.</span>
+
  
<span lang="EN-US">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; To
+
<br>  
start with, we should know that if we set the number of Gaussian function to be
+
used in the GMM estimation flexible, we will find out that the number of Gaussian
+
function will never reach a best solution, as adding more Gaussian functions
+
into the estimation will subsequently improve the accuracy anyway. As
+
calculating how many Gaussian function is include in GMM is a clustering
+
problem. We assume to know the number of Gaussian function in GMM as k here.</span>  
+
  
<span lang="EN-US">As this
+
&nbsp; Thus, the ML estimates for the Gaussian Mixture Model are:
distribution is a mixture of Gaussian, the expression of probability is:</span>
+
  
<span lang="EN-US">&lt;img
+
[[Image:GMMimage068.PNG|border|center]]
width=139 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image046.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img width=13 height=21
+
==== 5.2 Practical Implementation  ====
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image047.png"&gt;</span><span lang="EN-US">&nbsp;is the weight of Gaussian function </span><span lang="EN-US">&lt;img width=33 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image048.png"&gt;</span><span lang="EN-US">. </span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; Now we can observe that, as a Gaussian Mixture Model with K Gaussian functions has 3K parameters, finding the best parameter vector θ means searching for the optimum in a 3K-dimensional space. As the Gaussian Mixture Model includes more Gaussian functions, the cost of computing the best θ becomes extremely high. Also, the expressions for μ, Σ and α all depend on themselves directly or indirectly, so it is impossible to obtain the parameter values in a single calculation.
width=237 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image049.png"&gt;</span>
+
  
<span lang="EN-US">&nbsp;</span>
+
&nbsp; Now it’s time to introduce a method for finding the maximum likelihood when the model involves a large number of latent variables: the Expectation–maximization (EM) algorithm.
  
<span lang="EN-US">Thus, the
+
&nbsp; In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables (in a GMM, which Gaussian function generated each sample). The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
parameters to be estimated are:</span>
+
  
<span lang="EN-US">&lt;img
+
&nbsp; In short words, to get the best θ for our maximum likelihood, firstly, for the expectation step, we should evaluate the weight of each cluster with the current parameters. Then, for the maximization step, we re-estimate parameters using the existing weight.  
width=263 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image050.png"&gt;</span>
+
  
<span lang="EN-US">Let </span><span lang="EN-US">&lt;img width=75 height=21
+
&nbsp;&nbsp;By repeating these calculation process for several times, the parameters will approach the value for the maximum likelihood.<br>  
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image020.png"&gt;</span><span lang="EN-US">&nbsp;be the Independent and identically distributed (iid) Gaussian
+
Mixture Model (GMM) random variables. </span>  
+
  
<span lang="EN-US">Following Bayes
+
[[Image:EMresult1.png|border|center]]<br>  
rule, the responsibility that a mixture component takes for explaining an
+
observation </span><span lang="EN-US">&lt;img width=13 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image051.png"&gt;</span><span lang="EN-US">&nbsp;is:</span>  
+
  
<span lang="EN-US">&lt;img
+
[[Image:EMresult2.png|border|center]]
width=264 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image052.png"&gt;</span>
+
  
<span lang="EN-US">Then, we will
+
----
have a joint frequency function that is the product of marginal frequency
+
----
functions. The log likelihood of Gaussian Mixture Model distribution thus
+
==== 6. References  ====
should be:</span>  
+
[http://www.cscu.cornell.edu/news/statnews/stnews50.pdf www.cscu.cornell.edu/news/statnews/stnews50.pdf]<br>  
  
<span lang="EN-US">&lt;img
+
[http://en.wikipedia.org/wiki/Maximum_likelihood en.wikipedia.org/wiki/Maximum_likelihood ]
width=217 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image053.png"&gt;</span>
+
  
<span lang="EN-US">Take the derivative
+
[http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm ]
of </span><span lang="EN-US">&lt;img width=31 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image054.png"&gt;</span><span lang="EN-US">&nbsp;on it and find the </span><span lang="EN-US">&lt;img width=31 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image054.png"&gt;</span><span lang="EN-US">&nbsp;value that make the derivation equals to 0.</span>
+
  
<span lang="EN-US">&lt;img
+
[http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html]<br>  
width=554 height=374
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image055.png"&gt;</span>  
+
  
<span lang="EN-US">&lt;img
+
[http://eniac.cs.qc.cuny.edu/andrew/gcml-11/lecture10c.pptx eniac.cs.qc.cuny.edu/andrew/gcml-11/lecture10c.pptx ]
width=183 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image056.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
[http://statweb.stanford.edu/~susan/courses/s200/lectures/lect11.pdf statweb.stanford.edu/~susan/courses/s200/lectures/lect11.pdf]
width=146 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image057.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
[http://books.google.com/books?isbn=1461457432 Statistics and Measurement Concepts with OpenStat, by William Miller]]
width=128 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image058.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
----
width=554 height=437
+
----
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image059.png"&gt;</span>
+
= [[MLEforGMM_comments| Questions and comments]]  =
  
<span lang="EN-US">&lt;img
+
If you have any questions, comments, etc. please post them on  [[MLEforGMM_comments|this page]].  
width=182 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image060.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
----
width=246 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image061.png"&gt;</span>
+
 
+
<span lang="EN-US">&lt;img
+
width=208 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image062.png"&gt;</span>
+
  
<span lang="EN-US">&lt;img
+
<br>  
width=175 height=62
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image063.png"&gt;</span>
+
 
+
<span lang="EN-US">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The
+
</span><span lang="EN-US">&lt;img width=16 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image064.png"&gt;</span><span lang="EN-US">is subject to </span><span lang="EN-US">&lt;img width=69 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image065.png"&gt;</span><span lang="EN-US">. Basic optimization theories show that </span><span lang="EN-US">&lt;img width=13 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image047.png"&gt;</span><span lang="EN-US">&nbsp;</span><span lang="EN-US">&lt;img width=97 height=21
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image066.png"&gt;</span><span lang="EN-US">:</span>
+
 
+
<span lang="EN-US">&lt;img
+
width=114 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image067.png"&gt;</span>
+
 
+
<span lang="EN-US">Thus, the ML
+
estimation for </span><span lang="EN-US">Gaussian</span><span lang="EN-US"> Mixture Model
+
distribution should be:</span>  
+
  
<span lang="EN-US">&lt;img width=104 height=42
+
<br>  
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image068.png"&gt;</span><span lang="EN-US">; </span><span lang="EN-US">&lt;img width=135 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image069.png"&gt;</span><span lang="EN-US">; </span><span lang="EN-US">&lt;img width=93 height=42
+
src="Introduction%20to%20Maximum%20Likelihood%20Estimation%20-%20copy.files/image070.png"&gt;</span>
+
 
+
<span lang="EN-US">&nbsp;</span>
+
 
+
<span lang="EN-US">5.2 Practical Implementation</span>
+
 
+
<span lang="EN-US">Now we can
+
observe that, as the Gaussian Mixture Model with K Gaussian functions have 3K
+
parameters, to find the best vector of parameters set, </span><span style="font-family:宋体">θ</span><span lang="EN-US">, is to find the optimized
+
parameters in 3K dimension space. As the Gaussian Mixture Model include more
+
Gaussian functions, the complexity of computing the best </span><span style="font-family:宋体">θ</span><span lang="EN-US"> will go incrediblily high.
+
Also, we can see that all the expressions of </span><span style="font-family: 宋体">μ</span><span lang="EN-US">, </span><span style="font-family:宋体">Σ</span><span lang="EN-US"> and </span><span style="font-family:宋体">α</span><span lang="EN-US">
+
include themselves directly or indirectly, it’s implossible to get the value of
+
the parameters within one time calculation.</span>
+
 
+
<span lang="EN-US">Now it’s time to
+
introduce a method for finding maximum likelihood with large number of latent variables
+
(parameters), Expectation–maximization (EM) algorithm.</span>
+
 
+
<span lang="EN-US">In statistics,
+
an expectation–maximization (EM) algorithm is an iterative method for finding maximum
+
likelihood estimates of parameters in statistical models, where the model
+
depends on unobserved latent variables (the parameters). The EM iteration
+
alternates between performing an expectation (E) step, which creates a function
+
for the expectation of the log-likelihood evaluated using the current estimate
+
for the parameters, and a maximization (M) step, which computes parameters
+
maximizing the expected log-likelihood found on the E step. These
+
parameter-estimates are then used to determine the distribution of the latent
+
variables in the next E step.</span>
+
 
+
<span lang="EN-US">In short words,
+
to get the best </span><span style="font-family:宋体">θ</span><span lang="EN-US">
+
for our maximum likelihood, firstly, for the expectation step, we should evaluate
+
the weight of each cluster with the current parameters. Then, for the
+
maximization step, we re-estimate parameters using the existing weight.</span>
+
 
+
<span lang="EN-US">By repeating
+
these calculation process for several times, the parameters will approach the
+
value for the maximum likelihood.</span>
+
 
+
<span lang="EN-US">&nbsp;</span>
+
 
+
'''<span lang="EN-US" style="font-size:12.0pt">6. References</span>'''
+
 
+
[http://www.cscu.cornell.edu/news/statnews/stnews50.pdf www.cscu.cornell.edu/news/statnews/stnews50.pdf]
+
 
+
[http://en.wikipedia.org/wiki/Maximum_likelihood en.wikipedia.org/wiki/Maximum_likelihood]<br>
+
 
+
[http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm]
+
 
+
[http://statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html]<br>  
+
 
+
[http://eniac.cs.qc.cuny.edu/andrew/gcml-11/lecture10c.pptx eniac.cs.qc.cuny.edu/andrew/gcml-11/lecture10c.pptx]
+
 
+
[http://statweb.stanford.edu/~susan/courses/s200/lectures/lect11.pdf statweb.stanford.edu/~susan/courses/s200/lectures/lect11.pdf]
+
  
<br>
+
[[Category:ECE662]] [[Category:Bayes'_Theorem]] [[Category:Probability]] [[Category:Bayes'_Rule]] [[Category:Bayes'_Classifier]] [[Category:Slecture]] [[Category:ECE662Spring2014Boutin]] [[Category:ECE]] [[Category:Pattern_recognition]]

Latest revision as of 10:50, 22 January 2015

Introduction to Maximum Likelihood Estimation

A slecture by Wen Yi

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.






1. Introduction

  Maximum Likelihood Estimation (MLE) is a parametric method of density estimation. Applied to a data set drawn from a fixed density, MLE provides estimates for the parameters of the assumed density model. In practice, we search over the possible parameter values and pick the set of parameters with the maximum likelihood, that is, the set under which the observed samples are most probable.





2. Basic method

  Suppose we have a set of n independent and identically distributed (i.i.d.) observation samples. Their density function is fixed but unknown to us. We assume that the density function belongs to a certain family of distributions, and let θ be the vector of parameters of this family. The goal of MLE is then to find the parameter vector that is as close as possible to the true parameter value of the distribution.

  To use MLE, we first write the joint density function of all the sample observations. For an i.i.d. data set, the joint density function is:

GMMimage006.png

  As the samples x_i are independent of each other, the likelihood of θ given the samples x_1,x_2,…,x_n can be defined as:

GMMimage010.png
  In practice, it’s more convenient to take the natural logarithm of both sides, which gives the log-likelihood. The formula then becomes:
GMMimage011.png
  Then, for a fixed set of samples, to maximize the likelihood we should choose the θ that satisfies:
GMMimage012.png
  To find the maximum of lnL(θ;x_1,x_2,…,x_n ), we take its derivative with respect to θ and find the value of θ at which the derivative equals 0.
GMMimage014.png
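
For reference, the chain of definitions in this section can be written out as below. This is a reconstruction under the assumption that f(x_i | θ) denotes the density of a single sample; the symbol f and the hat notation are chosen here for illustration and are not taken from the original figures.

```latex
\begin{align*}
f(x_1, x_2, \dots, x_n \mid \theta) &= \prod_{i=1}^{n} f(x_i \mid \theta) \\
L(\theta; x_1, x_2, \dots, x_n) &= \prod_{i=1}^{n} f(x_i \mid \theta) \\
\ln L(\theta; x_1, x_2, \dots, x_n) &= \sum_{i=1}^{n} \ln f(x_i \mid \theta) \\
\hat{\theta} &= \arg\max_{\theta} \, \ln L(\theta; x_1, x_2, \dots, x_n) \\
\frac{\partial}{\partial \theta} \ln L(\theta; x_1, x_2, \dots, x_n) &= 0
\end{align*}
```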


  To check the result, we should verify that the second derivative of lnL(θ;x_1,x_2,…,x_n ) with respect to θ is negative at that value, so that it is a maximum rather than a minimum.

GMMimage016.png
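
The two derivative conditions can be checked numerically. The sketch below is only an illustration: it assumes a Bernoulli model with made-up 0/1 samples (neither comes from the original text) and uses finite differences to confirm that the first derivative of the log-likelihood vanishes and the second derivative is negative at the closed-form estimate.

```python
import math

def bernoulli_log_likelihood(p, samples):
    """ln L(p) = sum_i [x_i ln p + (1 - x_i) ln(1 - p)] for 0/1 samples."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in samples)

def first_derivative(f, t, h=1e-5):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

def second_derivative(f, t, h=1e-4):
    """Central finite-difference approximation of f''(t)."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / (h * h)

# Hypothetical data: 7 ones out of 10 observations.
samples = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Closed-form MLE of the Bernoulli parameter: the fraction of ones.
p_hat = sum(samples) / len(samples)

def log_l(p):
    return bernoulli_log_likelihood(p, samples)

slope = first_derivative(log_l, p_hat)       # close to 0 at the maximum
curvature = second_derivative(log_l, p_hat)  # negative at the maximum
print(p_hat, slope, curvature)
```

A negative curvature confirms that the stationary point is a maximum of the log-likelihood rather than a minimum.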





3. Practical considerations

3.1 Log-likelihood

  As the likelihood comes from the joint density function, it is usually a product of the probabilities of all the observations, which is hard to calculate and analyse. Also, when each observation has a probability less than 1 (say 0.1), the likelihood shrinks as more data are added (e.g. to 0.00000001 or smaller). Such small values are difficult to compute and store accurately.

  To solve this problem, we take the natural log of the likelihood, so that the joint probability is expressed as the sum of the natural logs of the individual probabilities. In this way, the value stays easy to handle as the number of samples increases. Note that when each term is a probability of at most 1, the log-likelihood is never greater than 0; for continuous distributions the density values can exceed 1, so this need not hold.
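
The storage problem can be seen directly in floating-point arithmetic. The following sketch assumes 400 hypothetical observations, each with probability 0.1 (the count is chosen only for illustration), and compares the raw product with the sum of logs.

```python
import math

# Hypothetical data: 400 observations, each with probability 0.1.
probabilities = [0.1] * 400

# Direct product: the true value 1e-400 is below the smallest positive
# double-precision number, so the running product underflows to 0.0.
likelihood = 1.0
for p in probabilities:
    likelihood *= p

# Sum of logs: the same information, stored as an ordinary number.
log_likelihood = sum(math.log(p) for p in probabilities)

print(likelihood)       # 0.0
print(log_likelihood)   # about -921.03
```

Once the product has underflowed to zero, different parameter values can no longer be compared, whereas the log-likelihood remains usable.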

3.2 Removing the constant

  Take the binomial distribution as an example. Its likelihood is:

GMMimage017.png

  In this estimation, the total number of samples, n, and the number of occurrences, k, are fixed. The first factor of the likelihood does not depend on p, so it stays constant as p changes. Removing it therefore does not affect the comparison of the likelihood between different values of p. As a result, we can use the following expression in place of the one above:

GMMimage018.png
  There is a second reason to do this: the first factor (the binomial coefficient) is at least 1 and grows very quickly as the number of samples increases, which makes it expensive to compute and store. Removing the constant part therefore also simplifies the calculation.
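
A quick numerical check of this argument is sketched below. The values n = 50 and k = 18 and the function names are assumptions made for illustration; the point is only that the maximizing p is the same with and without the constant factor.

```python
import math

def full_log_likelihood(p, n, k):
    """Binomial log-likelihood including the constant ln C(n, k)."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)

def reduced_log_likelihood(p, n, k):
    """Same expression with the p-independent constant removed."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

n, k = 50, 18  # hypothetical experiment: 18 occurrences in 50 trials

# Candidate values of p strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]

best_full = max(grid, key=lambda p: full_log_likelihood(p, n, k))
best_reduced = max(grid, key=lambda p: reduced_log_likelihood(p, n, k))

print(best_full, best_reduced)  # both equal k / n = 0.36
```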


3.3 Numerical MLE

  Sometimes we cannot write an equation that can be differentiated to find the MLE parameter estimates. In these cases, we can resort to an exhaustive search over candidate parameter values and keep the one with the largest likelihood. With this method, the step between candidate values determines both the accuracy and the computation time, so we should choose the step (e.g. 0.01, 0.001 or 0.0000001) according to the accuracy we need.
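
The exhaustive search can be sketched as a simple grid search. The helper name grid_search_mle, the data (n = 40, k = 13) and the search bounds are hypothetical; the binomial log-likelihood without its constant factor is used only because its true maximum, k / n, is known and makes the effect of the step size easy to see.

```python
import math

def grid_search_mle(log_likelihood, lower, upper, step):
    """Return the grid point in [lower, upper] with the largest log-likelihood."""
    count = int(round((upper - lower) / step)) + 1
    candidates = [lower + i * step for i in range(count)]
    return max(candidates, key=log_likelihood)

n, k = 40, 13  # hypothetical data; the exact maximizer is k / n = 0.325

def log_l(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

coarse = grid_search_mle(log_l, 0.01, 0.99, 0.01)    # 99 evaluations
fine = grid_search_mle(log_l, 0.001, 0.999, 0.001)   # 999 evaluations

print(coarse, fine)
```

A ten times smaller step costs ten times as many likelihood evaluations, which is the trade-off between accuracy and computation time described above.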





4. Some basic examples

4.1 Poisson Distribution

  For the Poisson distribution, the probability mass function is:

GMMimage019.png

  Let X_1,X_2,…,X_N be independent and identically distributed (iid) Poisson random variables. Their joint frequency function is then the product of the marginal frequency functions, so the log-likelihood of the Poisson distribution is:

GMMimage021.png

  Take the derivative with respect to λ and find the value of λ at which the derivative equals 0.

GMMimage022.PNG

  Thus, the ML estimate for the Poisson distribution is:

GMMimage027.png
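
Written out, the derivation for the Poisson case runs as follows. This is a sketch in standard notation, assuming the usual Poisson probability mass function with rate λ; the hat on the estimate is notation introduced here.

```latex
\begin{align*}
P(X = x) &= \frac{\lambda^{x} e^{-\lambda}}{x!} \\
\ln L(\lambda; X_1, \dots, X_N) &= \ln\lambda \sum_{i=1}^{N} X_i - N\lambda - \sum_{i=1}^{N} \ln(X_i!) \\
\frac{\partial \ln L}{\partial \lambda} &= \frac{1}{\lambda} \sum_{i=1}^{N} X_i - N = 0 \\
\hat{\lambda} &= \frac{1}{N} \sum_{i=1}^{N} X_i
\end{align*}
```

The ML estimate of λ is therefore the sample mean.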


4.2 Exponential distribution

  For the exponential distribution, the probability density function is:

GMMimage028.png

  Let X_1,X_2,…,X_N be independent and identically distributed (iid) exponential random variables. As the density is 0 for x<0, no sample can fall in the x<0 region, so for all X_1,X_2,…,X_N we only need to consider the x≥0 part. The joint frequency function is then the product of the marginal frequency functions, so the log-likelihood of the exponential distribution is:

GMMimage031.png

  Take the derivative with respect to λ and find the value of λ at which the derivative equals 0.

GMMimage032.PNG
  Thus, the ML estimate for the exponential distribution is:
GMMimage035.png
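
The exponential case can be illustrated with a short numerical sketch. It assumes the rate parameterization f(x) = λ·exp(−λx) for x ≥ 0, under which the stationary point of the log-likelihood is N divided by the sum of the samples; if the distribution is instead parameterized by its mean, the estimate is the sample mean. The data are made up.

```python
import math

def exponential_log_likelihood(rate, samples):
    """ln L(rate) = N ln(rate) - rate * sum(x_i), assuming f(x) = rate * exp(-rate * x)."""
    return len(samples) * math.log(rate) - rate * sum(samples)

# Hypothetical non-negative observations.
samples = [0.5, 1.2, 0.3, 2.0, 1.0]

# Setting d/d(rate) ln L = N / rate - sum(x_i) to zero gives this estimate.
rate_hat = len(samples) / sum(samples)

# The log-likelihood at the estimate is larger than at nearby rates.
at_estimate = exponential_log_likelihood(rate_hat, samples)
below = exponential_log_likelihood(0.9 * rate_hat, samples)
above = exponential_log_likelihood(1.1 * rate_hat, samples)
print(rate_hat, at_estimate, below, above)
```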


4.3 Gaussian distribution

  For the Gaussian distribution, the probability density function is:

GMMimage036.png

  Let X_1,X_2,…,X_N be independent and identically distributed (iid) Gaussian random variables. Their joint frequency function is then the product of the marginal frequency functions, so the log-likelihood of the Gaussian distribution is:

GMMimage037.png

  Take the derivatives with respect to μ and Σ and find the values of μ and Σ at which they equal 0.

GMMimage038.PNG
GMMimage039.PNG
  Thus, the ML estimates for the Gaussian distribution are:
GMMimage040.PNG
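
For a one-dimensional Gaussian, the estimates reduce to the sample mean and the sample variance with divisor N. The sketch below restricts itself to the scalar case (an assumption made for brevity, with made-up data) and checks that moving away from the estimates lowers the log-likelihood.

```python
import math

def gaussian_mle(samples):
    """Return the ML estimates (mean, variance) of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The ML variance divides by n, not by n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance

def gaussian_log_likelihood(mean, variance, samples):
    """ln L = -(n / 2) ln(2 pi variance) - sum((x - mean)^2) / (2 variance)."""
    n = len(samples)
    squared_error = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

# Hypothetical observations.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_hat, var_hat = gaussian_mle(samples)
best = gaussian_log_likelihood(mean_hat, var_hat, samples)
shifted_mean = gaussian_log_likelihood(mean_hat + 0.1, var_hat, samples)
larger_var = gaussian_log_likelihood(mean_hat, var_hat + 0.5, samples)
smaller_var = gaussian_log_likelihood(mean_hat, var_hat - 0.5, samples)
print(mean_hat, var_hat)
```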





5. Some advanced examples

5.1 Expression of Estimated Parameters

  The estimates above all assume that the data follow a single distribution function. But how do we estimate a mixture of functions?

  To simplify the problem, we only discuss the Gaussian Mixture Model (GMM) here. The same method extends easily to other kinds of mixture models and to mixtures of different models.

  To start with, note that if the number of Gaussian functions in the GMM is left free, there is no best solution: adding more Gaussian functions to the estimate always improves the fit to the data. Determining how many Gaussian functions a GMM contains is a clustering problem, so here we assume that the number of Gaussian functions is known and equal to K.

  As this distribution is a mixture of Gaussians, the probability density is:

GMMimage046.png

  α_j is the weight of Gaussian function g_j (x).

GMMimage049.png
  Thus, the parameters to be estimated are:
GMMimage050.png
  Let X_1,X_2,…,X_N be independent and identically distributed (iid) random variables following the Gaussian Mixture Model (GMM).

  Following Bayes’ rule, the responsibility that mixture component j takes for explaining an observation X_i is:

GMMimage052.png
  The joint frequency function is then the product of the marginal frequency functions, so the log-likelihood of the Gaussian Mixture Model is:
GMMimage053.png
  Take the derivatives with respect to μ_j and Σ_j and find the values of μ_j and Σ_j at which they equal 0.
GMMimage055.png
GMMimage056.PNG
GMMimage059.png
GMMimage060.PNG

  The weights α_j are subject to the constraint:

GMMimage065.png

  Basic optimization theory (for example, a Lagrange multiplier for this constraint) shows that the optimal α_j is:

GMMimage067.png


  Thus, the ML estimates for the Gaussian Mixture Model are:

 
GMMimage068.PNG
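
In standard notation, the quantities of this section can be summarized as below. This is a reconstruction rather than a copy of the original figures: the symbol γ_{ij} for the responsibility of component j for observation X_i and the hats on the estimates are assumptions, and K denotes the number of Gaussian functions.

```latex
\begin{align*}
p(x \mid \theta) &= \sum_{j=1}^{K} \alpha_j \, g_j(x), \qquad \sum_{j=1}^{K} \alpha_j = 1 \\
\gamma_{ij} &= \frac{\alpha_j \, g_j(X_i)}{\sum_{l=1}^{K} \alpha_l \, g_l(X_i)} \\
\hat{\mu}_j &= \frac{\sum_{i=1}^{N} \gamma_{ij} X_i}{\sum_{i=1}^{N} \gamma_{ij}} \\
\hat{\Sigma}_j &= \frac{\sum_{i=1}^{N} \gamma_{ij} (X_i - \hat{\mu}_j)(X_i - \hat{\mu}_j)^{T}}{\sum_{i=1}^{N} \gamma_{ij}} \\
\hat{\alpha}_j &= \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}
\end{align*}
```

Each estimate depends on the responsibilities γ_{ij}, which in turn depend on all the parameters, so these equations do not give the parameters in closed form.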

5.2 Practical Implementation

  We can now observe that a Gaussian Mixture Model with K Gaussian functions has 3K parameters (α_j, μ_j and Σ_j for each component), so finding the best parameter vector θ means optimizing over a 3K-dimensional space. As the model includes more Gaussian functions, the cost of computing the best θ becomes very high. Moreover, the expressions for μ, Σ and α all depend on one another, directly or indirectly, through the responsibilities, so it is impossible to obtain the parameter values in a single calculation.

  It is now time to introduce a method for finding maximum likelihood estimates in models with a large number of latent variables: the Expectation–maximization (EM) algorithm.

  In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models that depend on unobserved latent variables (for a GMM, which component generated each sample). The EM iteration alternates between an expectation (E) step, which builds the expectation of the log-likelihood using the current parameter estimates, and a maximization (M) step, which computes the parameters that maximize the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

  In short, to get the best θ, in the expectation step we evaluate the weight (responsibility) of each cluster for each sample using the current parameters. Then, in the maximization step, we re-estimate the parameters using these weights.

  By repeating these two steps several times, the parameters approach a value that maximizes the likelihood, at least locally.

EMresult1.png

EMresult2.png
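
The two alternating steps can be sketched for a one-dimensional mixture as follows. This is a minimal illustration, not the implementation behind the figures above: the function names, the synthetic data (two Gaussians centred at 0 and 6), the initial guesses and the fixed number of iterations are all assumptions.

```python
import math
import random

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_gmm_1d(samples, means, variances, weights, iterations=50):
    """Run a fixed number of EM iterations for a 1-D Gaussian mixture."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(samples)
    for _ in range(iterations):
        # E step: responsibility of each component for each sample (Bayes' rule).
        responsibilities = []
        for x in samples:
            terms = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(terms)
            responsibilities.append([t / total for t in terms])
        # M step: re-estimate the parameters from the weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in responsibilities)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, samples)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(responsibilities, samples)) / nj
            weights[j] = nj / n
    return means, variances, weights

# Synthetic data: 300 samples from N(0, 1) and 300 samples from N(6, 1).
random.seed(0)
samples = ([random.gauss(0.0, 1.0) for _ in range(300)]
           + [random.gauss(6.0, 1.0) for _ in range(300)])

# Deliberately rough initial guesses for the two components.
means, variances, weights = em_gmm_1d(samples, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

Because EM only finds a local maximum of the likelihood, a different initial guess can lead to a different result; in practice several starting points are often tried.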


6. References

www.cscu.cornell.edu/news/statnews/stnews50.pdf

en.wikipedia.org/wiki/Maximum_likelihood

en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

statgen.iop.kcl.ac.uk/bgim/mle/sslike_1.html

eniac.cs.qc.cuny.edu/andrew/gcml-11/lecture10c.pptx

statweb.stanford.edu/~susan/courses/s200/lectures/lect11.pdf

Statistics and Measurement Concepts with OpenStat, by William Miller






