Example: Gaussian Mixture

Taken from Wikipedia


Assume that a sample of $ m $ vectors (or scalars) $ \mathbf y_1, \dots, \mathbf y_m $, where $ \mathbf y_j \in \mathbb{R}^D $, is drawn from a mixture of $ n $ Gaussian distributions, each vector coming from exactly one of them. Let $ z_j \in \{1,2,\ldots,n\} $ denote which Gaussian $ \mathbf y_j $ is from. The probability density of a particular $ \mathbf y $, given that it comes from the $ i^{\mathrm{th}} $ $ D $-dimensional Gaussian, is

$ P(\mathbf y | z=i,\theta) = \mathcal{N}(\mu_i,\sigma_i) = (2\pi)^{-D/2} {\left| \sigma_i \right|}^{-1/2} \exp\left(-\frac{1}{2}(\mathbf y - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y - \mathbf \mu_i)\right) $

Our task is to estimate the unknown parameters $ \theta = \left\{ \mu_1, \dots, \mu_n, \sigma_1, \dots, \sigma_n, P(z=1), \dots, P(z=n) \right\} $, that is, the mean and covariance of each Gaussian and the prior probability of each Gaussian being the one a given point is drawn from. (Strictly speaking, the covariances should not be allowed to take arbitrary values: the maximum likelihood is unbounded, since centering one Gaussian on a single data point and shrinking its covariance toward zero drives the likelihood to infinity.)
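As a concrete illustration, the following is a minimal NumPy sketch (an added example, not part of the original text) of the component density above. The function name `gaussian_pdf` and the variables `mu` (mean vector) and `sigma` (covariance matrix) are illustrative choices.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Density of the D-dimensional Gaussian N(mu, sigma) evaluated at y."""
    D = y.shape[0]
    diff = y - mu
    # (2*pi)^(-D/2) |sigma|^(-1/2) exp(-1/2 (y - mu)^T sigma^{-1} (y - mu))
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

# Example: a 2-D point under a component with zero mean and identity covariance
y = np.array([1.0, 0.5])
print(gaussian_pdf(y, mu=np.zeros(2), sigma=np.eye(2)))
```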

E-step

Estimate the distribution of each unobserved $ z_j $ (which Gaussian the point came from), conditioned on the observations, using the parameter values $ \theta_t $ from the last maximization step:

$ p(z_j=i|\mathbf y_j,\theta_t) = \frac{p(z_j=i, \mathbf y_j | \theta_t)}{p(\mathbf y_j|\theta_t)} = \frac{p(\mathbf y_j|z_j=i,\theta_t) p(z_j=i|\theta_t)}{\sum_{k=1}^n p(\mathbf y_j | z_j=k, \theta_t) p(z_j=k|\theta_t)} $
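In code, this is a per-point application of Bayes' rule. Below is a minimal sketch, assuming SciPy's `multivariate_normal` for the component density; the names `Y` (the $ m \times D $ data matrix), `mus`, `sigmas`, and `priors` for the current parameters $ \theta_t $ are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, mus, sigmas, priors):
    """Return an (m, n) matrix whose (j, i) entry is p(z_j = i | y_j, theta_t)."""
    # Numerator: p(y_j | z_j = i, theta_t) * p(z_j = i | theta_t), one column per component
    resp = np.column_stack([
        priors[i] * multivariate_normal.pdf(Y, mean=mus[i], cov=sigmas[i])
        for i in range(len(priors))
    ])
    # Denominator: sum over the n components for each data point
    return resp / resp.sum(axis=1, keepdims=True)
```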

M-step

We now want to maximize the expected log-likelihood of the joint event:

$ \begin{align} Q(\theta) & = E_{z} \left[ \ln \prod_{j=1}^m p \left(\mathbf y_j, z_j | \theta \right) \,\Big|\, \mathbf y_1, \dots, \mathbf y_m \right] \\ & = E_{z} \left[ \sum_{j=1}^m \ln p \left(\mathbf y_j, z_j | \theta \right) \,\Big|\, \mathbf y_1, \dots, \mathbf y_m \right] \\ & = \sum_{j=1}^m E_{z_j} \left[ \ln p \left(\mathbf y_j, z_j | \theta \right) \Big| \mathbf y_j \right] \\ & = \sum_{j=1}^m \sum_{i=1}^n p \left(z_j=i | \mathbf y_j, \theta_t \right) \ln p\left(z_j=i, \mathbf y_j | \theta \right) \\ \end{align} $

If we expand the probability of the joint event, we get

$ Q(\theta) = \sum_{j=1}^m \sum_{i=1}^n p(z_j=i | \mathbf y_j, \theta_t) \ln \left( p(\mathbf y_j | z_j=i, \theta) p(z_j=i | \theta) \right) $

The class prior $ p(z_j=i|\theta) $ is the same for every $ j $, so we write it simply as $ p(z=i|\theta) $. We have the constraint

$ \sum_{i=1}^{n} p(z=i|\theta) = 1 $

Adding a Lagrange multiplier for this constraint and expanding the pdf, we get

$ \begin{align} \mathcal{L}(\theta) = & \left( \sum_{j=1}^m \sum_{i=1}^n p(z_j=i | \mathbf y_j, \theta_t) \left( - \frac{D}{2} \ln (2\pi) - \frac{1}{2} \ln \left| \sigma_i \right| - \frac{1}{2}(\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) + \ln p(z=i | \theta) \right) \right) \\ & - \lambda \left( \sum_{i=1}^{n} p(z=i | \theta) - 1 \right) \end{align} $

To find the new estimate $ \theta_{t+1} $, we find a maximum where $ \frac{\partial \mathcal{L}(\theta)}{\partial \theta} = 0 $.

New estimate for mean (using some differentiation rules from matrix calculus):

$ \begin{align} \frac{\partial \mathcal{L}(\theta)}{\partial \mu_i} & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \left( - \frac{\partial}{\partial \mu_i} \frac{1}{2}(\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) \right) \\ & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \left( - \frac{1}{2}(\sigma_i^{-1} +\sigma_i^{-T})(\mathbf y_j - \mathbf \mu_i)(-1) \right) \\ & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \left( \sigma_i^{-1}(\mathbf y_j - \mathbf \mu_i) \right) \\ & = 0 \\ & \Downarrow \\ \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \sigma_i^{-1} \mathbf \mu_i & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \sigma_i^{-1} \mathbf y_j \\ & \Downarrow \\ \mu_i \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \mathbf y_j \\ & \Downarrow \\ \mu_i & = \frac{\sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \mathbf y_j}{\sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t)} \\ \end{align} $
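A minimal NumPy sketch of this mean update, assuming `resp` is the $ m \times n $ responsibility matrix from the E-step sketch above and `Y` the data matrix (both names are illustrative):

```python
import numpy as np

def update_means(Y, resp):
    """mu_i = sum_j p(z_j=i | y_j, theta_t) y_j  /  sum_j p(z_j=i | y_j, theta_t)."""
    # resp.T @ Y has shape (n, D); each row is divided by that component's total responsibility
    return (resp.T @ Y) / resp.sum(axis=0)[:, None]
```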

New estimate for covariance:

$ \begin{align} \frac{\partial \mathcal{L}(\theta)}{\partial \sigma_i} & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \left( - \frac{\partial}{\partial \sigma_i} \frac{1}{2} \ln \left| \sigma_i \right| - \frac{\partial}{\partial \sigma_i} \frac{1}{2}(\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) \right) \\ & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \left( - \frac{1}{2} \sigma_i^{-T} + \frac{1}{2} \sigma_i^{-T}(\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-T} \right) \\ & = 0 \\ & \Downarrow \\ \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \sigma_i^{-1} & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T \sigma_i^{-1} \\ & \Downarrow \\ \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) & = \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \sigma_i^{-1} (\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T \\ & \Downarrow \\ \sigma_i & = \frac{\sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) (\mathbf y_j - \mathbf \mu_i) (\mathbf y_j - \mathbf \mu_i)^T}{\sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t)} \\ \end{align} $
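A corresponding sketch of the covariance update, again assuming the illustrative `Y`, `resp`, and the just-updated means `mus`:

```python
import numpy as np

def update_covariances(Y, resp, mus):
    """sigma_i = weighted average of (y_j - mu_i)(y_j - mu_i)^T with weights p(z_j=i | y_j, theta_t)."""
    n, D = mus.shape
    sigmas = np.zeros((n, D, D))
    for i in range(n):
        diff = Y - mus[i]                       # shape (m, D)
        weighted = resp[:, i][:, None] * diff   # weight each row by its responsibility
        sigmas[i] = weighted.T @ diff / resp[:, i].sum()
    return sigmas
```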

New estimate for class probability:

$ \begin{align} \frac{\partial \mathcal{L}(\theta)}{\partial p(z=i|\theta)} & = \left( \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \frac{\partial \ln p(z=i|\theta)}{\partial p(z=i|\theta)} \right) - \lambda \left( \frac{\partial p(z=i|\theta)}{\partial p(z=i|\theta)} \right) \\ & = \left( \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \frac{1}{p(z=i|\theta)} \right) - \lambda \\ & = 0 \\ & \Downarrow \\ \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \frac{1}{p(z=i|\theta)} & = \lambda \\ & \Downarrow \\ p(z=i|\theta) & = \frac{1}{\lambda} \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \\ \end{align} $

Inserting into the constraint:

$ \begin{align} \sum_{i=1}^{n} p(z=i|\theta) & = \sum_{i=1}^{n} \frac{1}{\lambda} \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \\ & = 1 \\ & \Downarrow \\ \lambda & = \sum_{i=1}^{n} \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \\ \end{align} $

Since the responsibilities sum to one over $ i $ for each $ j $, this double sum is simply $ m $, i.e. $ \lambda = m $. Inserting $ \lambda $ into our estimate:

$ \begin{align} p(z=i|\theta) & = \frac{1}{\lambda} \sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \\ & = {\displaystyle\frac{\sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t)}{\sum_{k=1}^{n} \sum_{j=1}^m p(z_j=k | \mathbf y_j, \theta_t)}} \\ & = \frac{1}{m}\sum_{j=1}^m p(z_j=i | \mathbf y_j, \theta_t) \end{align} $
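In code, this update is just the column-wise average of the responsibility matrix; a one-line sketch using the illustrative `resp` from before:

```python
import numpy as np

def update_priors(resp):
    """p(z = i | theta_{t+1}) = (1/m) * sum_j p(z_j = i | y_j, theta_t)."""
    return resp.mean(axis=0)
```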

These estimates now become our $ \theta_{t+1} $, to be used in the next estimation step.
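Putting the pieces together, here is a self-contained sketch of the full iteration on synthetic data. It is only an illustration of the updates derived above; the data, initialization, and fixed iteration count are arbitrary assumptions, and SciPy's `multivariate_normal` again stands in for the component density.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic data: two well-separated 2-D Gaussians
Y = np.vstack([rng.normal([0.0, 0.0], 1.0, (100, 2)),
               rng.normal([4.0, 4.0], 1.0, (100, 2))])
m, D = Y.shape
n = 2

# Crude initialization of theta_0
mus = Y[rng.choice(m, n, replace=False)]
sigmas = np.array([np.eye(D) for _ in range(n)])
priors = np.full(n, 1.0 / n)

for t in range(50):
    # E-step: responsibilities p(z_j = i | y_j, theta_t)
    resp = np.column_stack([
        priors[i] * multivariate_normal.pdf(Y, mean=mus[i], cov=sigmas[i])
        for i in range(n)
    ])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: the three closed-form updates derived above
    Nk = resp.sum(axis=0)
    mus = (resp.T @ Y) / Nk[:, None]
    for i in range(n):
        diff = Y - mus[i]
        sigmas[i] = (resp[:, i][:, None] * diff).T @ diff / Nk[i]
    priors = Nk / m

print("means:", mus)
print("class probabilities:", priors)
```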
