E13 EM Algorithm
17341015 Hongzheng Chen
Contents
1 Chinese Football Dataset
2 EM
  2.1 The Gaussian Distribution
  2.2 Mixtures of Gaussians
    2.2.1 Introduction
    2.2.2 About Latent Variables
  2.3 EM for Gaussian Mixtures
  2.4 EM Algorithm
3 Tasks
4 Codes and Results
1 Chinese Football Dataset
Country                2006WorldCup  2010WorldCup  2014WorldCup  2018WorldCup  2007AsianCup  2011AsianCup  2015AsianCup
China                  50            50            50            40            9             9             5
Japan                  28            9             29            15            4             1             5
South Korea            17            15            27            19            3             3             2
Iran                   25            40            28            18            5             5             5
Saudi Arabia           28            40            50            26            2             9             9
Iraq                   50            50            40            40            1             5             4
Qatar                  50            40            40            40            9             5             9
United Arab Emirates   50            40            50            40            9             9             3
Uzbekistan             40            40            40            40            5             4             9
Thailand               50            50            50            40            9             17            17
Vietnam                50            50            50            50            5             17            17
Oman                   50            50            40            50            9             17            9
Bahrain                40            40            50            50            9             9             9
North Korea            40            32            50            50            17            9             9
Indonesia              50            50            50            50            9             17            17
Australia              16            21            30            30            9             2             1
The scoring rules are as follows (a small illustrative encoding of the rules is sketched after this list):
• For the FIFA World Cup, teams that qualify score the same as their final ranking; teams that reach the Asia Top Ten but do not qualify for the World Cup score 40; teams that fail to reach the Asia Top Ten score 50.
• For the AFC Asian Cup, teams that reach the top four score the same as their final ranking; teams that reach the top eight but not the top four score 5, and teams that reach the top sixteen but not the top eight score 9; teams that do not pass the group stage score 17.
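To make the rules concrete, they can be encoded as two small helper functions. This is only an illustrative sketch; the function names and the stage labels ("top8", "top16", "group") are our own and are not part of the dataset or the assignment code.

def world_cup_score(ranking=None, asia_top_ten=False):
    """Score for one FIFA World Cup campaign: the final ranking if the team
    qualified, 40 if it only reached the Asia Top Ten, 50 otherwise."""
    if ranking is not None:
        return ranking
    return 40 if asia_top_ten else 50

def asian_cup_score(ranking=None, stage="group"):
    """Score for one AFC Asian Cup campaign: the final ranking for a top-four
    team, 5 for top eight, 9 for top sixteen, 17 for not passing the group stage."""
    if ranking is not None and ranking <= 4:
        return ranking
    return {"top8": 5, "top16": 9, "group": 17}[stage]

# Example: a team knocked out in the Asian Cup quarter-finals scores 5.
print(asian_cup_score(stage="top8"))  # 5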
We aim to classify the above 16 teams into three classes according to their performance: the first class, the second class and the third class. In our opinion, Australia, Iran, South Korea and Japan belong to the first class, while the Chinese football team belongs to the third class.
2 EM
2.1 The Gaussian Distribution
The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}    (2.1.1)

where µ is the mean and σ² is the variance.
For a D-dimensional vector x, the multivariate Gaussian distribution takes the form

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu) \right\}    (2.1.2)

where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
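The two densities above are easy to evaluate numerically. The sketch below implements (2.1.2) directly with NumPy and cross-checks it against scipy.stats.multivariate_normal; the helper name gaussian_pdf and the example numbers are ours, chosen only for illustration.

import numpy as np
from scipy.stats import multivariate_normal  # used only as a cross-check

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the D-dimensional Gaussian density (2.1.2) at a single point x."""
    D = x.shape[0]
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # the two values should agree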
2.2 Mixtures of Gaussians
2.2.1 Introduction
While the Gaussian distribution has some important analytical properties, it suffers from significant limitations when it comes to modelling real data sets. Consider the example shown in Figure 1. This is known as the Old Faithful data set, and comprises 272 measurements of the eruption of the Old Faithful geyser at Yellowstone National Park in the USA. Each measurement comprises the duration of the eruption in minutes (horizontal axis) and the time in minutes to the next eruption (vertical axis). We see that the data set forms two dominant clumps, and that a simple Gaussian distribution is unable to capture this structure, whereas a linear superposition of two Gaussians gives a better characterization of the data set.
Figure 1: Example of a Gaussian mixture distribution
Such superpositions, formed by taking linear combinations of more basic distributions such as Gaussians, can be formulated as probabilistic models known as mixture distributions. In Figure 1 we see that a linear combination of Gaussians can give rise to very complex densities. By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy.
We therefore consider a superposition of K Gaussian densities of the form

p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)    (2.2.1)

which is called a mixture of Gaussians. Each Gaussian density N(x|µk,Σk) is called a component of the mixture and has its own mean µk and covariance Σk.
The parameters πk in (2.2.1) are called mixing coefficients. If we integrate both sides of (2.2.1) with respect to x, and note that both p(x) and the individual Gaussian components are normalized, we obtain
\sum_{k=1}^{K} \pi_k = 1.    (2.2.2)
Also, the requirement that p(x) ≥ 0, together with N(x|µk,Σk) ≥ 0, implies πk ≥ 0 for all k. Combining this with condition (2.2.2) we obtain
0 ≤ πk ≤ 1. (2.2.3)
We therefore see that the mixing coefficients satisfy the requirements to be probabilities.
From the sum and product rules, the marginal density is given by

p(x) = \sum_{k=1}^{K} p(k)\, p(x \mid k)    (2.2.4)

which is equivalent to (2.2.1), in which we can view πk = p(k) as the prior probability of picking the kth component, and the density N(x|µk,Σk) = p(x|k) as the probability of x conditioned on k. The posterior probabilities p(k|x), known as responsibilities, will also play an important role; from Bayes' theorem these are given by

\gamma_k(x) \equiv p(k \mid x) = \frac{p(k)\, p(x \mid k)}{\sum_{l} p(l)\, p(x \mid l)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l} \pi_l \mathcal{N}(x \mid \mu_l, \Sigma_l)}.    (2.2.5)
The form of the Gaussian mixture distribution is governed by the parameters π, µ and Σ, where we have used the notation π = {π1,…,πK}, µ = {µ1,…,µK} and Σ = {Σ1,…,ΣK}. One way to set the values of these parameters is to use maximum likelihood. From (2.2.1) the log of the likelihood function is given by

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}    (2.2.6)

where X = {x1,…,xN}. One approach to maximizing the likelihood function is to use iterative numerical optimization techniques. Alternatively we can employ a powerful framework called expectation maximization (EM).
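As a concrete illustration of (2.2.1) and (2.2.6), the sketch below evaluates the log likelihood of a toy two-component mixture; the function name and all numbers are made up for the example and are not taken from the football dataset.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """Evaluate (2.2.6): sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }."""
    densities = np.column_stack([
        multivariate_normal(mean=mu, cov=Sigma).pdf(X)
        for mu, Sigma in zip(mus, Sigmas)
    ])                                  # shape (N, K)
    return np.sum(np.log(densities @ pi))

# Toy two-component mixture in two dimensions.
X = np.array([[0.1, 0.2], [3.9, 4.1], [4.2, 3.8]])
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, pi, mus, Sigmas))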
2.2.2 About Latent Variables
We now turn to a formulation of Gaussian mixtures in terms of discrete latent variables. This will provide us with a deeper insight into this important distribution, and will also serve to motivate the expectation-maximization (EM) algorithm.
Recall from (2.2.1) that the Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form
p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)    (2.2.7)
Let us introduce a K-dimensional binary random variable z having a 1-of-K representation in which a particular element zk is equal to 1 and all other elements are equal to 0. The values of zk therefore satisfy zk ∈ {0,1} and Σkzk = 1, and we see that there are K possible states for the vector z according to which element is nonzero. We shall define the joint distribution p(x,z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z). The marginal distribution over z is specified in terms of the mixing coefficients πk, such that
p(zk = 1) = πk (2.2.8)
where the parameters {πk} must satisfy
0 ≤ πk ≤ 1 (2.2.9)
together with
\sum_{k=1}^{K} \pi_k = 1    (2.2.10)
in order to be valid probabilities. Because z uses a 1-of-K representation, we can also write this distribution in the form
p(z) = \prod_{k=1}^{K} \pi_k^{z_k}.    (2.2.11)
Similarly, the conditional distribution of x given a particular value for z is a Gaussian

p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)    (2.2.12)

which can also be written in the form

p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}.    (2.2.13)
The joint distribution is given by p(z)p(x|z), and the marginal distribution of x is then obtained by summing the joint distribution over all possible states of z to give

p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)    (2.2.14)

where we have made use of (2.2.12) and (2.2.13). Thus the marginal distribution of x is a Gaussian mixture of the form (2.2.7). If we have several observations x1,…,xN, then, because we have represented the marginal distribution in the form p(x) = \sum_z p(x, z), it follows that for every observed data point xn there is a corresponding latent variable zn.
We have therefore found an equivalent formulation of the Gaussian mixture involving an explicit latent variable. It might seem that we have not gained much by doing so. However, we are now able to work with the joint distribution p(x,z) instead of the marginal distribution p(x), and this will lead to significant simplifications, most notably through the introduction of the expectation-maximization (EM) algorithm.
Another quantity that will play an important role is the conditional probability of z given x. We shall use γ(zk) to denote p(zk = 1|x), whose value can be found using Bayes' theorem

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}.    (2.2.15)
We shall view πk as the prior probability of zk = 1, and the quantity γ(zk) as the corresponding posterior probability once we have observed x. As we shall see later, γ(zk) can also be viewed as the responsibility that component k takes for explaining the observation x.
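The responsibilities (2.2.15) are just a normalized vector of weighted component densities, as the short sketch below shows; the helper name responsibilities and the toy parameters are ours.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, Sigmas):
    """Evaluate gamma(z_k) = pi_k N(x|mu_k,Sigma_k) / sum_j pi_j N(x|mu_j,Sigma_j)."""
    weighted = np.array([
        pi_k * multivariate_normal(mean=mu, cov=Sigma).pdf(x)
        for pi_k, mu, Sigma in zip(pi, mus, Sigmas)
    ])
    return weighted / weighted.sum()

# A point near the second component's mean is assigned almost entirely to it.
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(responsibilities(np.array([3.5, 4.5]), pi, mus, Sigmas))  # close to [0, 1]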
2.3 EM for Gaussian Mixtures
Initially, we shall motivate the EM algorithm by giving a relatively informal treatment in the context of the Gaussian mixture model.
Let us begin by writing down the conditions that must be satisfied at a maximum of the likelihood function. Setting the derivatives of ln p(X|π,µ,Σ) with respect to the means µk of the Gaussian components to zero, we obtain

0 = -\sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \Sigma_k (x_n - \mu_k)    (2.3.1)
Multiplying by \Sigma_k^{-1} (which we assume to be nonsingular) and rearranging, we obtain

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n    (2.3.2)

where we have defined

N_k = \sum_{n=1}^{N} \gamma(z_{nk}).    (2.3.3)
If we set the derivative of ln p(X|π,µ,Σ) with respect to Σk to zero, and follow a similar line of reasoning, making use of the maximum likelihood result for the covariance matrix of a single Gaussian, we obtain

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{\mathrm{T}}    (2.3.4)

which has the same form as the corresponding result for a single Gaussian fitted to the data set, but again with each data point weighted by the corresponding posterior probability and with the denominator given by the effective number of points associated with the corresponding component.
Finally, we maximize ln p(X|π,µ,Σ) with respect to the mixing coefficients πk. Here we must take account of the constraint \sum_k \pi_k = 1. This can be achieved using a Lagrange multiplier and maximizing the quantity

\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)    (2.3.5)

which gives

0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda    (2.3.6)

where again we see the appearance of the responsibilities. If we now multiply both sides by πk and sum over k, making use of the constraint \sum_k \pi_k = 1, we find λ = −N. Using this to eliminate λ and rearranging we obtain

\pi_k = \frac{N_k}{N}    (2.3.7)
so that the mixing coefficient for the kth component is given by the average responsibility which that component takes for explaining the data points.
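The three update equations (2.3.2), (2.3.4) and (2.3.7) can be written as one vectorized M step once the responsibilities are available. The following is a minimal sketch, assuming gamma is the N × K responsibility matrix produced by the E step; the function name m_step is ours.

import numpy as np

def m_step(X, gamma):
    """Closed-form updates (2.3.2), (2.3.4), (2.3.7) given responsibilities gamma (N x K)."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                                        # (2.3.3)
    mus = (gamma.T @ X) / Nk[:, None]                             # (2.3.2)
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # (2.3.4)
    pis = Nk / N                                                  # (2.3.7)
    return pis, mus, Sigmas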
2.4 EM Algorithm
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (comprising the means and covariances of the components and the mixing coefficients).
1. Initialize the means µk, covariances Σk and mixing coefficients πk, and evaluate the initial value of the log likelihood.
2. E step. Evaluate the responsibilities using the current parameter values

\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}    (2.4.1)
3. M step. Re-estimate the parameters using the current responsibilities

\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n    (2.4.2)

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\mathrm{T}}    (2.4.3)

\pi_k^{\text{new}} = \frac{N_k}{N}    (2.4.4)

where

N_k = \sum_{n=1}^{N} \gamma(z_{nk}).    (2.4.5)
4. Evaluate the log likelihood

\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}    (2.4.6)
and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied return to step 2.
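The convergence test in step 4 can be as simple as monitoring the change in (2.4.6) between iterations. A minimal sketch, with the tolerance value chosen arbitrarily for illustration:

def has_converged(log_liks, tol=1e-6):
    """Stop when the improvement in the log likelihood (2.4.6) drops below tol."""
    return len(log_liks) >= 2 and abs(log_liks[-1] - log_liks[-2]) < tol

# Inside the EM loop one would append the value of (2.4.6) each iteration:
#     log_liks.append(log_likelihood)
#     if has_converged(log_liks):
#         break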
3 Tasks
• Assuming that the score vectors of teams in the same class are normally distributed, we can adopt the Gaussian mixture model. Please classify the teams into 3 classes using the EM algorithm. If necessary, you can refer to pages 430-439 of the book Pattern Recognition and Machine Learning.pdf and the website https://blog.csdn.net/jinping_shi/article/details/59613054, which is a Chinese translation.
• You should show the values of these parameters: γ, µ and Σ. If necessary, you can plot the clustering results. Note that γ is essential for classifying.
• Please submit a file named E13 YourNumber.pdf and send it to ai 201901@foxmail.com
4 Codes and Results
The classification results are listed below.
Class 0: Japan, South Korea, Iran, Australia
Class 1: Saudi Arabia, United Arab Emirates, Uzbekistan, Bahrain, North Korea
Class 2: China, Iraq, Qatar, Thailand, Vietnam, Oman, Indonesia
And the γ, µ and Σ are shown in Fig. 2.
Figure 2: Values of γ, µ and Σ
The classification plot in 2D is shown in Fig. 3 (only the first two dimensions are plotted).
Figure 3: Classification plot
The following code, em.py, was exported from the Jupyter notebook em.ipynb.
# coding: utf-8

# In[ ]:

import numpy as np
import pandas as pd
import sys

dataset = pd.read_csv("football.txt", sep=",")
dataset.head()


# In[ ]:

def prob(x, mu, sigma):
    r"""
    Calculate the Gaussian density
    N(x|mu,Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\{-\frac{1}{2}(x-mu)^T \Sigma^{-1}(x-mu)\}
    Notice the input x here should be a 1-d vector
    """
    if x.ndim != 1 or mu.ndim != 1:
        raise RuntimeError("Dimension error!")
    # print(x, mu, sigma)
    D = x.shape[0]
    expOn = -1 / 2 * np.matmul(np.matmul((x - mu).T, np.linalg.inv(sigma)), x - mu)
    divBy = np.power(2 * np.pi, D / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(expOn) / divBy
def EM(dataMat, n_components=3, maxIter=100):
    r"""
    Expectation-Maximization (EM) algorithm
    This implementation has been extended to support a different number of components
    """
    n_samples, D = np.shape(dataMat)  # n_samples=16, D=7

    # 1. Initialize Gaussian parameters
    pi_k = np.ones(n_components) / n_components  # mixing coefficients
    # randomly select k (n_components) samples as the mean of each class
    # choices = np.random.choice(n_samples, n_components)
    # predefined mean of each class
    choices = [1, 13, 11]
    print([(i, dataset["Country"][i]) for i in choices])
    mu_k = np.array([dataMat[i, :] for i in choices]).reshape(n_components, D)  # k * D
    # mu_k = [dataMat[5, :], dataMat[21, :], dataMat[26, :]]
    sigma_k = [np.eye(D) for x in range(n_components)]  # k * D * D
    gamma_k = np.zeros((n_samples, n_components))  # n * k

    # Iterate for maxIter times
    for i in range(maxIter):
        r"""
        2. E step
        gamma(z_{nk}) = \frac{\pi_k prob(x_n|mu_k,Sigma_k)}{sum_pi_mul}
        sum_pi_mul = \sum_{j=1}^K \pi_j prob(x_n|mu_j,Sigma_j)
        """
        for n in range(n_samples):
            sum_pi_mul = 0  # denominator
            for k in range(n_components):
                gamma_k[n, k] = pi_k[k] * prob(dataMat[n, :], mu_k[k], sigma_k[k])
                sum_pi_mul += gamma_k[n, k]
            # normalization
            for k in range(n_components):
                gamma_k[n, k] /= sum_pi_mul

        # summarize gamma_k along different samples
        N_k = np.sum(gamma_k, axis=0)

        r"""
        3. M step
        mu_k^{new} = \frac{1}{N_k}\sum_{n=1}^N gamma(z_{nk}) x_n
        Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^N gamma(z_{nk})(x_n-mu_k^{new})(x_n-mu_k^{new})^T
        pi_k^{new} = \frac{N_k}{N}
        N_k = \sum_{n=1}^N gamma(z_{nk})
        """
        for k in range(n_components):
            # Calculate mu_k
            mu_k[k] = np.zeros(D, dtype=np.float64)
            for n in range(n_samples):
                mu_k[k] += gamma_k[n, k] * dataMat[n, :]
            mu_k[k] /= N_k[k]
            # Calculate Sigma_k
            sigma_k[k] = np.zeros((D, D), dtype=np.float64)
            for n in range(n_samples):
                # be careful of the outer product!
                sigma_k[k] += gamma_k[n, k] * np.matmul(
                    (dataMat[n, :] - mu_k[k]).reshape(1, -1).T,
                    (dataMat[n, :] - mu_k[k]).reshape(1, -1))
            sigma_k[k] /= N_k[k]
            # Calculate new mixing coefficient
            pi_k[k] = N_k[k] / n_samples
            # regularize the covariance to keep it nonsingular
            sigma_k[k] += np.eye(D)

    print("gamma: ", gamma_k)
    print("mu: ", mu_k)
    print("Sigma: ", sigma_k)
    return gamma_k


# In[ ]:
def gaussianCluster(dataMat, n_components, max_iter):
    n_samples, D = np.shape(dataMat)
    centroids = np.zeros((n_components, D))
    gamma = EM(dataMat, n_components, max_iter)
    # get the cluster result
    clusterAssign = np.zeros((n_samples, 2))
    for n in range(n_samples):
        clusterAssign[n, :] = np.argmax(gamma[n, :]), np.amax(gamma[n, :])
    # calculate the final results
    for k in range(n_components):
        pointsInCluster = dataMat[np.nonzero(clusterAssign[:, 0] == k)[0]]
        centroids[k, :] = np.mean(pointsInCluster, axis=0)
    return centroids, clusterAssign[:, 0]


# In[ ]:

from sklearn import mixture

def sklearn_em(data, n_components, max_iter):
    clst = mixture.GaussianMixture(n_components=n_components, max_iter=max_iter,
                                   covariance_type="full")
    clst.fit(data)
    predicted_labels = clst.predict(data)
    return clst.means_, predicted_labels


# In[ ]:
import matplotlib.pyplot as plt

def showCluster(dataset, k, centroids, clusterAssment):
    numSamples, dim = dataset.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry! Your k is too large!")
        return 1
    # draw all samples
    for i in range(numSamples):
        markIndex = int(clusterAssment[i])
        plt.plot(dataset[i, 0], dataset[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


# In[ ]:

n_components = 3
data = dataset[dataset.columns.values[dataset.columns.values != "Country"]].to_numpy()
# centroids, labels = sklearn_em(data, n_components, 100)
centroids, labels = gaussianCluster(data.astype(np.float64), n_components, 100)
showCluster(data, n_components, centroids, labels)

res = {0: [], 1: [], 2: []}
for i, label in enumerate(labels):
    res[label].append(dataset["Country"][i])
for key in res:
    print(key, *res[key])