The Expectation Maximization Algorithm

The Expectation Maximization (EM) algorithm is a widely used, powerful method for estimating the parameters of a statistical model. Like the famous Newton-Raphson method, it is an iterative optimization procedure; the difference is that the EM algorithm is designed for the case where the data has missing values or the model has latent variables. It has been applied to a variety of problems, including clustering, density estimation, and parameter estimation for hidden Markov models.

The EM algorithm is an iterative procedure with two main steps: the E-step and the M-step. In the E-step, the algorithm computes the expected value of the latent variables given the observed data and the current estimate of the parameters. In the M-step, the algorithm maximizes the likelihood function with respect to the parameters, using the expected values of the latent variables computed in the E-step. These two steps are repeated until the algorithm converges to a local maximum of the likelihood function.
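As a rough sketch of that loop (the e_step, m_step, and log_likelihood callables below are hypothetical placeholders for model-specific computations; a concrete Gaussian-mixture example appears near the end of this article):

```python
# A minimal sketch of the EM loop; e_step, m_step, and log_likelihood are
# hypothetical placeholders for model-specific computations.

def expectation_maximization(x, theta, e_step, m_step, log_likelihood,
                             tol=1e-6, max_iter=1000):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    prev_ll = log_likelihood(x, theta)
    for _ in range(max_iter):
        expectations = e_step(x, theta)   # E-step: expectations of the latent variables
        theta = m_step(x, expectations)   # M-step: re-estimate the parameters
        ll = log_likelihood(x, theta)
        if ll - prev_ll < tol:            # converged to a local maximum
            break
        prev_ll = ll
    return theta
```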

Basic Definitions

Convex Set

A convex set is a set of points such that the line segment connecting any two points in the set lies entirely within the set. Formally, a set S is convex if, for any two points x and y in S and any λ ∈ [0, 1], the point λx + (1 - λ)y also lies in S.

Convex Function & Concave Function

  • Convex Function

A convex function is defined on a convex set. A twice-differentiable function of a single variable is convex if and only if its second derivative is nonnegative on its entire domain; that is, f(x) is convex if \frac{\mathrm{d}^2 f(x)}{\mathrm{d} x^2} \geq 0 for all x in the domain of f.

  • Concave Function

A concave function is likewise defined on a convex set. A twice-differentiable function of a single variable is concave if and only if its second derivative is nonpositive on its entire domain; that is, f(x) is concave if \frac{\mathrm{d}^2 f(x)}{\mathrm{d} x^2} \leq 0 for all x in the domain of f.
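As a quick illustration (not part of the definitions above), we can check the second-derivative condition numerically for two textbook examples: f(x) = x², which is convex, and f(x) = log x, which is concave on x > 0.

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

xs = np.linspace(0.1, 5.0, 50)
print(all(second_derivative(np.square, x) >= 0 for x in xs))  # True: x^2 is convex
print(all(second_derivative(np.log, x) <= 0 for x in xs))     # True: log x is concave
```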

Jensen’s Inequality

For a convex function f and a random variable X, the following inequality holds:

f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]

Similarly, with the direction reversed, for a concave function f and a random variable X, the following inequality holds:

f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]
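A small Monte Carlo check (an illustration, not from the original definitions) makes the convex case concrete with f(x) = x²:

```python
import numpy as np

# Jensen's inequality for the convex function f(x) = x^2:
# f(E[X]) should be no larger than E[f(X)].
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)

lhs = np.mean(x) ** 2   # f(E[X])
rhs = np.mean(x ** 2)   # E[f(X)]
print(lhs <= rhs)       # True: roughly 1.0 versus roughly 5.0
```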

Maximum Likelihood Estimation

In statistics, Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

Assume we have a set of independent and identically distributed (i.i.d.) random variables X = \{X_{1}, X_{2}, \ldots, X_{n}\}; since they are independent, their joint distribution factorizes as:

\mathbf{P}(X | \theta) = \prod_{i=1}^{n} P(X_{i} | \theta)

where θ is the parameter of the distribution. We can then define the likelihood function as:

\mathbf{L}(\theta | X) = P(X | \theta) = \prod_{i=1}^{n} P(X_{i} | \theta)

Maximum Likelihood Estimation finds the parameter θ that maximizes the likelihood function L(θ | X), i.e.

\hat{\theta} = \arg \max_{\theta} \mathbf{L}(\theta | X)
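As an illustration (assuming, purely for concreteness, that the data are i.i.d. draws from a univariate Gaussian), maximizing the log-likelihood numerically recovers estimates close to the familiar sample mean and standard deviation:

```python
import numpy as np
from scipy.optimize import minimize

# MLE for a univariate Gaussian: maximize L(theta | X) by minimizing the
# negative log-likelihood over theta = (mu, log sigma).
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1_000)

def neg_log_likelihood(params, x):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize by log sigma to keep sigma positive
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to the sample mean and standard deviation of data
```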

The Expectation Maximization Algorithm

Definitions

As mentioned before, the Expectation Maximization (EM) algorithm iterates between the E-step and the M-step until it converges to a local maximum of the likelihood function. Suppose we have a set of observed data on a random variable X and a latent variable Z, and that the joint distribution of X and Z is defined as:

\mathbf{P}(x, z; \theta)

where P is the probability mass/density function, x and z are values for X and Z, and θ parameterizes the distribution.

In practice, we often don’t observe the latent variable Z, and we wish to find the value of θ that makes x most likely under our model. Since z is not observed, the likelihood must marginalize over z; taking the logarithm gives the log-likelihood:

\mathbf{l}(\theta) = \log \mathbf{L}(\theta | x) = \log \sum_{z} \mathbf{P}(x, z; \theta)

or, when the latent variable is continuous,

\mathbf{l}(\theta) = \log \mathbf{L}(\theta | x) = \log \int \mathbf{P}(x, z; \theta) \, \mathrm{d}z
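The sum (or integral) inside the logarithm is what makes direct maximization hard, and it is where Jensen’s inequality from the definitions above comes in. As a sketch of the standard lower-bound argument (for any distribution Q(z) over the latent variable, using the fact that the logarithm is concave):

\mathbf{l}(\theta) = \log \sum_{z} Q(z) \frac{\mathbf{P}(x, z; \theta)}{Q(z)} \geq \sum_{z} Q(z) \log \frac{\mathbf{P}(x, z; \theta)}{Q(z)}

Choosing Q(z) to be the posterior of z given x under the current parameters makes this bound tight at the current estimate (the E-step); maximizing the bound with respect to θ is then the M-step.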

The E-step

The E-step computes the expected value of the latent variables given the observed data and the current estimate of the parameters; in other words, it evaluates the posterior distribution of the latent variables under the current parameters and uses it to form the required expectations. It is called the “expectation” step for this reason.

The M-step

The M-step maximizes the likelihood function with respect to the parameters, using the expected values of the latent variables computed in the E-step. It is called the “maximization” step for this reason.
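To make both steps concrete, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture (an illustrative example, not a specific model from the text): the E-step computes each point’s responsibility under each component, and the M-step re-estimates the mixture weights, means, and variances from those responsibilities.

```python
import numpy as np

# Minimal EM sketch for a two-component 1-D Gaussian mixture.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guesses: mixing weights, means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibility of each component for each data point,
    # i.e. the posterior P(z = k | x_i) under the current parameters.
    weighted = pi * gaussian_pdf(x[:, None], mu, var)      # shape (n, 2)
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=0)                                   # effective counts
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi, mu, var)  # should be close to the weights, means, and variances used above
```

Each pass through the loop can only increase (or leave unchanged) the log-likelihood, which is why the procedure converges to a local maximum.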

Conclusion

The Expectation Maximization (EM) algorithm is a powerful algorithm that can be used to estimate the parameters of a statistical model. It is a versatile algorithm that can be used in many different settings, and it is an important tool to have in your machine learning toolbox.

Published Jun 26, 2015
