
Le Song

• Please read the following submission rules carefully. We will strictly follow these rules starting from this assignment.

• Matlab submissions will be graded in the Matlab Online R2019b environment. Python submissions will be graded in a Python 3.6 environment on a regular laptop. Please make sure your program runs in those environments. Any program that cannot run is subject to zero credit. Such situations include, but are not limited to, exceeding 5 minutes of running time, exceeding the memory limit, and invalid text characters.

• Put your report and code files into a single flat folder and submit it as a single .zip file.

• Recommended reading: PRML Sections 1.5, 1.6, 2.5, 9.2, 9.3

1 EM for Mixture of Gaussians

A mixture of K Gaussians is represented as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{1}$$

where $\pi_k$ represents the probability that a data point belongs to the $k$th component. Since it is a probability, it satisfies $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$. In this problem, we are going to represent this in a slightly different manner, with explicit latent variables. Specifically, we introduce the 1-of-K coding representation for latent variables $z^{(k)} \in \mathbb{R}^K$ for $k = 1, \dots, K$. Each $z^{(k)}$ is a binary vector of size $K$, with a 1 in the $k$th element and 0 in all others. That is,

$$z^{(1)} = [1; 0; \dots; 0], \quad z^{(2)} = [0; 1; \dots; 0], \quad \dots, \quad z^{(K)} = [0; 0; \dots; 1].$$

For example, if the second component generated data point $x_n$, its latent variable $z_n$ is given by $[0; 1; \dots; 0] = z^{(2)}$. With this representation, we can express $p(z)$ as

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k},$$

where $z_k$ indicates the $k$th element of vector $z$. Also, $p(x \mid z)$ can be represented similarly as

$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}.$$

By the sum rule of probability, (1) can be represented by

$$p(x) = \sum_{z \in Z} p(z)\, p(x \mid z), \tag{2}$$

where $Z = \{z^{(1)}, z^{(2)}, \dots, z^{(K)}\}$.

(a) Show that (2) is equivalent to (1). [5 pts]

(b) In reality, we do not know which component each data point came from. Thus, we estimate the responsibility (the expectation of $z_{kn}$) in the E-step of EM. Since $z_{kn}$ is either 1 or 0, its expectation is the probability that point $x_n$ belongs to component $k$. In other words, we estimate $\gamma(z_{kn}) = p(z_{kn} = 1 \mid x_n)$. Derive the formula for this estimation by using Bayes' rule. Note that, in the E-step, we assume all other parameters, i.e., $\pi_k$, $\mu_k$, and $\Sigma_k$, are fixed, and we want to express $\gamma(z_{kn})$ as a function of these fixed parameters. [10 pts]

(c) In the M-step, we re-estimate the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$ by maximizing the log-likelihood. Given $N$ i.i.d. (independent, identically distributed) data samples, derive the update formula for each parameter. Note that in order to obtain an update rule for the M-step, we fix the responsibilities, i.e., $p(z_{kn} \mid x_n)$, which we have already calculated in the E-step. [15 pts] Hint: Use a Lagrange multiplier for $\pi_k$ to apply the constraint on it.
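To make the E- and M-steps of parts (b) and (c) concrete, here is a minimal numerical sketch for a 1-D mixture (so $\Sigma_k$ reduces to a scalar variance). The function name, quantile initialization, and toy data are our own illustrative choices, not part of the assignment:

```python
import math
import random

def em_gmm_1d(xs, K=2, iters=50):
    """EM for a 1-D mixture of K Gaussians: the E-step computes
    responsibilities, the M-step re-estimates pi_k, mu_k, var_k."""
    xs_sorted = sorted(xs)
    pi = [1.0 / K] * K
    # spread the initial means over the data quantiles
    mu = [xs_sorted[(2 * k + 1) * len(xs) // (2 * K)] for k in range(K)]
    var = [1.0] * K
    for _ in range(iters):
        # E-step: gamma[n][k] = p(z_k = 1 | x_n), via Bayes' rule
        gamma = []
        for x in xs:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(K)]
            s = sum(w)
            gamma.append([wk / s for wk in w])
        # M-step: maximize the expected log-likelihood with gamma held fixed
        for k in range(K):
            Nk = sum(g[k] for g in gamma)
            pi[k] = Nk / len(xs)
            mu[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / Nk
            var[k] = max(sum(g[k] * (x - mu[k]) ** 2
                             for g, x in zip(gamma, xs)) / Nk, 1e-6)
    return pi, mu, var

# two well-separated clusters around 0 and 10
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + \
       [rng.gauss(10, 1) for _ in range(200)]
pi, mu, var = em_gmm_1d(data, K=2)
```

On this toy data the recovered means land near 0 and 10 with mixing weights near 0.5, which is a useful sanity check for the update formulas you derive.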

(d) EM and K-Means [10 pts]

K-means can be viewed as a particular limit of EM for Gaussian mixtures. Consider a mixture model in which all components have covariance $\epsilon I$, where $I$ is the identity matrix. Show that in the limit $\epsilon \to 0$, maximizing the expected complete-data log-likelihood for this model is equivalent to minimizing the objective function of K-means:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \, \|x_n - \mu_k\|^2,$$

where $\gamma_{nk} = 1$ if $x_n$ belongs to the $k$th cluster and $\gamma_{nk} = 0$ otherwise.

2 Density Estimation

Consider a histogram-like density model in which the space of $x$ is divided into fixed regions, such that the density $p(x)$ takes a constant value $h_i$ over the $i$th region, and the volume of region $i$ is denoted $\Delta_i$. Suppose we have a set of $N$ observations of $x$ such that $n_i$ of these observations fall in region $i$.

(a) What is the log-likelihood function? [8 pts]

(b) Derive an expression for the maximum likelihood estimator of $h_i$. [10 pts]

Hint: This is a constrained optimization problem. Remember that $p(x)$ must integrate to unity. Since $p(x)$ has the constant value $h_i$ over region $i$, which has volume $\Delta_i$, the normalization constraint is $\sum_i h_i \Delta_i = 1$. Use a Lagrange multiplier by adding $\lambda(\sum_i h_i \Delta_i - 1)$ to your objective function.
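As a numerical sanity check for parts (a) and (b), the log-likelihood can be evaluated for any candidate density that satisfies the normalization constraint. A small sketch, where the counts, region volumes, and candidate densities are made-up toy values:

```python
import math

def hist_loglik(counts, h, volumes):
    """Log-likelihood sum_i n_i * log(h_i) of bin counts under a
    piecewise-constant density taking value h_i on region i."""
    # a valid density must satisfy the constraint sum_i h_i * Delta_i = 1
    assert abs(sum(hi * v for hi, v in zip(h, volumes)) - 1.0) < 1e-9
    return sum(n * math.log(hi) for n, hi in zip(counts, h))

counts = [3, 5, 2]           # n_i: observations falling in each region, N = 10
volumes = [1.0, 1.0, 2.0]    # Delta_i: region volumes
flat = [0.25, 0.25, 0.25]    # a uniform candidate density (0.25*1 + 0.25*1 + 0.25*2 = 1)
freq = [0.3, 0.5, 0.1]       # a candidate weighted toward the observed counts
```

Evaluating `hist_loglik` on both candidates shows the count-weighted one scores higher, which is consistent with the constrained maximization the hint sets up.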

(c) Mark T if it is always true, and F otherwise. Briefly explain why. [12 pts]

• Non-parametric density estimation usually does not have parameters.

• The Epanechnikov kernel is the optimal kernel function for all data.

• Histogram is an efficient way to estimate density for high-dimensional data.

• Parametric density estimation assumes the shape of probability density.

3 Information Theory

In the lecture you became familiar with the concepts of entropy for one random variable and mutual information. For a pair of discrete random variables X and Y with joint distribution p(x, y), the joint entropy H(X, Y) is defined as

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y), \tag{3}$$

which can also be expressed as

$$H(X, Y) = -\mathbb{E}\left[\log p(X, Y)\right]. \tag{4}$$

Let X and Y take on values $x_1, x_2, \dots, x_r$ and $y_1, y_2, \dots, y_s$ respectively. Let Z also be a discrete random variable with $Z = X + Y$.

(a) Prove that $H(X, Y) \le H(X) + H(Y)$. [4 pts]

(b) Show that $I(X; Y) = H(X) + H(Y) - H(X, Y)$. [2 pts]

(c) Under what conditions does $H(Z) = H(X) + H(Y)$ hold? [4 pts]
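The inequality in (a) and the identity in (b) are easy to check numerically on a toy joint distribution; a short sketch, where the particular probabilities are arbitrary:

```python
import math

def H(probs):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# a toy joint distribution p(x, y) over two binary variables
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
# marginals p(x) and p(y) by summing out the other variable
px = [sum(p for (xx, _), p in pxy.items() if xx == x) for x in (0, 1)]
py = [sum(p for (_, yy), p in pxy.items() if yy == y) for y in (0, 1)]

Hxy = H(pxy.values())
Hx, Hy = H(px), H(py)
Ixy = Hx + Hy - Hxy   # mutual information via the identity in (b)
```

For this joint, `Hxy` stays below `Hx + Hy` and the mutual information is strictly positive, since X and Y are not independent here.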

4 Programming: Text Clustering

In this problem, we will explore the use of the EM algorithm for text clustering. Text clustering is a technique for unsupervised document organization and information retrieval. We want to find out how to group a set of different text documents based on their topics. First we will introduce a model to represent the data.

Bag of Words

The simplest model for text documents is to understand them as a collection of words. To keep the model simple, we keep the collection unordered, disregarding grammar and word order. We count how often each word appears in each document and store the word counts in a matrix, where each row represents one document and each column represents a specific word from the document dictionary. Suppose we represent the set of $n_d$ documents using a matrix of word counts like this:

$$D_{1:n_d} = \begin{bmatrix} 2 & \cdots & 4 \\ 6 & \cdots & 0 \\ \vdots & \ddots & \vdots \end{bmatrix} = T$$

This means that word $W_1$ occurs twice in document $D_1$, while word $W_{n_w}$ occurs 4 times in document $D_1$ and not at all in document $D_2$.
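Word counting of this kind can be sketched in a few lines (the helper name and toy documents are our own; the assignment's data file already contains such a matrix):

```python
def count_matrix(docs):
    """Build a document-by-word count matrix T from tokenized documents.
    Rows are documents; columns follow a sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    T = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc:
            T[i][index[w]] += 1   # one more occurrence of word w in doc i
    return vocab, T

docs = [["the", "cat", "sat", "the"], ["the", "dog"]]
vocab, T = count_matrix(docs)
```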

Multinomial Distribution

The simplest distribution for representing a text document is the multinomial distribution (Bishop, Chapter 2.2). The probability of a document $D_i$ is:

$$p(D_i) = \prod_{j=1}^{n_w} \mu_j^{T_{ij}}$$

Here, $\mu_j$ denotes the probability of a particular word in the text being equal to $w_j$, and $T_{ij}$ is the count of that word in document $D_i$. So the probability of document $D_1$ would be $p(D_1) = \mu_1^{2} \cdots \mu_{n_w}^{4}$.

Mixture of Multinomial Distributions

In order to do text clustering, we want to use a mixture of multinomial distributions, so that each topic has a particular multinomial distribution associated with it, and each document is a mixture of different topics. We define $p(c) = \pi_c$ as the mixture coefficient for a document containing topic $c$, and model each topic by a multinomial distribution $p(D_i \mid c)$ with parameters $\mu_{jc}$. Then we can write each document as a mixture over topics as

$$p(D_i) = \sum_{c=1}^{n_c} p(D_i \mid c)\, p(c) = \sum_{c=1}^{n_c} \pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}$$

EM for Mixture of Multinomials

In order to cluster a set of documents, we need to fit this mixture model to data. The EM algorithm can be used for this fitting, giving a simple topic model for documents in which each topic is a multinomial distribution over words (a mixture component). EM for such a topic model consists of iterating the following steps:

1. Expectation

Compute the expectation of document $D_i$ belonging to cluster $c$:

$$\gamma_{ic} = p(c \mid D_i) = \frac{\pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}}{\sum_{c'=1}^{n_c} \pi_{c'} \prod_{j=1}^{n_w} \mu_{jc'}^{T_{ij}}}$$

2. Maximization

Update the mixture parameters, i.e., the probability of a word being $W_j$ in cluster (topic) $c$, as well as the prior probability of each cluster.
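The two steps above can be sketched in log space, which avoids numerical underflow on long documents (function name, random initialization, and the small smoothing constant are our own choices; your submission should follow the setup in homework2.m):

```python
import math
import random

def em_mixture_multinomial(T, nc, iters=30, seed=0):
    """EM for a mixture of multinomials on a doc-by-word count matrix T."""
    rng = random.Random(seed)
    nd, nw = len(T), len(T[0])
    pi = [1.0 / nc] * nc
    mu = []
    for _ in range(nc):                      # random normalized mu[c][j]
        w = [rng.random() + 0.1 for _ in range(nw)]
        s = sum(w)
        mu.append([x / s for x in w])
    gamma = []
    for _ in range(iters):
        # E-step: gamma[i][c] proportional to pi_c * prod_j mu_cj^T_ij
        gamma = []
        for row in T:
            logp = [math.log(pi[c]) + sum(t * math.log(mu[c][j])
                        for j, t in enumerate(row) if t > 0)
                    for c in range(nc)]
            m = max(logp)                    # log-sum-exp stabilization
            w = [math.exp(l - m) for l in logp]
            s = sum(w)
            gamma.append([x / s for x in w])
        # M-step: re-estimate pi_c and mu_cj with gamma held fixed
        for c in range(nc):
            Nc = sum(g[c] for g in gamma)
            pi[c] = Nc / nd
            tot = [sum(gamma[i][c] * T[i][j] for i in range(nd)) + 1e-9
                   for j in range(nw)]       # tiny smoothing avoids log(0)
            s = sum(tot)
            mu[c] = [t / s for t in tot]
    return pi, mu, gamma

# toy corpus: two documents about words 0-1, two about words 2-3
T = [[5, 4, 0, 0], [6, 3, 0, 0], [0, 0, 5, 5], [0, 0, 4, 6]]
pi, mu, gamma = em_mixture_multinomial(T, nc=2)
```

After fitting, each row of `gamma` gives the cluster responsibilities for one document, which is what you compare against the provided true clusters.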

Task [20 pts]

Implement the algorithm and run it on the toy dataset data.mat. You can find a detailed description of the data in the homework2.m file. Observe the results and compare them with the provided true clusters to which each document belongs. Report an evaluation (e.g., accuracy) of your implementation.

Hint: We have already done the word counting for you, so the data file contains only a count matrix like the one shown above. For the toy dataset, set the number of clusters $n_c = 4$. You will need to initialize the parameters. Try several different random initial values for the probability of a word being $W_j$ in topic $c$, $\mu_{jc}$, and make sure it is normalized. Make sure that you do not use the true cluster information during your learning phase.

Extra Credit: Realistic Topic Models [20 pts]

Specifically, consider the log-likelihood of the joint distribution of documents and words,

$$L = \sum_{d \in D} \sum_{w \in W} T_{dw} \log P(d, w), \tag{5}$$

where $T_{dw}$ is the count of word $w$ in document $d$. This count matrix is provided as input.

The joint distribution of a specific document and a specific word is modeled as a mixture

$$P(d, w) = \sum_{z \in Z} P(z) P(w \mid z) P(d \mid z), \tag{6}$$

where $P(z)$ is the mixture proportion, $P(w \mid z)$ is the distribution over the vocabulary for the $z$th topic, and $P(d \mid z)$ is the probability of the document for the $z$th topic. These are the parameters of the model.

The E-step calculates the posterior distribution of the latent variable conditioned on all other variables:

$$P(z \mid d, w) = \frac{P(z) P(w \mid z) P(d \mid z)}{\sum_{z' \in Z} P(z') P(w \mid z') P(d \mid z')}. \tag{7}$$

In the M-step, we maximize the expected complete log-likelihood with respect to the parameters, and get the following update rules:

$$P(w \mid z) = \frac{\sum_{d \in D} T_{dw} P(z \mid d, w)}{\sum_{w' \in W} \sum_{d \in D} T_{dw'} P(z \mid d, w')}, \tag{8}$$

$$P(d \mid z) = \frac{\sum_{w \in W} T_{dw} P(z \mid d, w)}{\sum_{d' \in D} \sum_{w \in W} T_{d'w} P(z \mid d', w)}, \tag{9}$$

$$P(z) = \frac{\sum_{d \in D} \sum_{w \in W} T_{dw} P(z \mid d, w)}{\sum_{d \in D} \sum_{w \in W} T_{dw}}. \tag{10}$$

Task

Implement EM for maximum likelihood estimation and cluster the text data provided in the nips.mat file you downloaded. You can print out the top keywords for the topics/clusters using the show topics.m utility. It takes two parameters: 1) your learned conditional distribution matrix, i.e., $P(w \mid z)$, and 2) a cell array of words corresponding to the vocabulary. You can find the cell array wl in the nips.mat file. Try different values of k and see which values produce sensible topics. In assessing your code, we will use another dataset and observe the produced topics.
