Homework 5: EM with Mixtures, PCA, and Graphical Models
This homework assignment will have you work with EM for mixtures, PCA, and graphical models. We encourage you to read sections 9.4 and 8.2.5 of the course textbook.
Please type your solutions after the corresponding problems using this LaTeX template, and start each problem on a new page.
Please submit the writeup PDF to the Gradescope assignment ‘HW5’. Remember to assign pages for each question.
Please submit your LaTeX file and code files to the Gradescope assignment ‘HW5 – Supplemental’.
Problem 1 (Expectation-Maximization for Gamma Mixture Models, 25 pts)
In this problem we will explore expectation-maximization for a Categorical-Gamma Mixture model.
Let us suppose the following generative story for an observation x: first one of K classes is randomly selected, and then the features x are sampled according to this class. If
z ∼ Categorical(θ)
indicates the selected class, then x is sampled according to the class or “component” distribution corresponding to z. (Here, θ is the mixing proportion over the K components: ∑_k θ_k = 1 and θ_k > 0.) In this problem, we assume these component distributions are gamma distributions with a shared shape parameter but different rate parameters:
x | z ∼ Gamma(α, β_k).
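To make the generative story concrete, here is a minimal sampling sketch, assuming NumPy and hypothetical example values for α, θ, and the rates:

import numpy as np

# Hypothetical parameters for illustration only.
rng = np.random.default_rng(0)
alpha = 2.0                         # shared shape
theta = np.array([0.5, 0.3, 0.2])   # mixing proportions, sums to 1
betas = np.array([1.0, 2.0, 4.0])   # per-component rates

z = rng.choice(len(theta), p=theta)               # z ~ Categorical(θ)
x = rng.gamma(shape=alpha, scale=1.0 / betas[z])  # x | z ~ Gamma(α, β_z)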
In an unsupervised setting, we are only given a set of observations as our training dataset: {x_n}_{n=1}^N. The EM algorithm allows us to learn the underlying generative process (the parameters θ and {β_k}) despite not having the latent variables {z_n} corresponding to our training data.
1. Intractability of the Data Likelihood We are generally interested in finding the parameters θ and {β_k} that maximize the likelihood of the observed data:
max_{θ, {β_k}} p({x_n}_{n=1}^N; θ, {β_k}).
Expand the data likelihood to include the necessary sums over observations xn and to marginalize out the latents zn. Why is optimizing this likelihood directly intractable?
2. Complete Data Log Likelihood The complete dataset includes the latents z_n.
Write out the negative complete data log likelihood:
L(θ, {β_k}) = −log p({x_n, z_n}_{n=1}^N; θ, {β_k}).
Apply the power trick and simplify your expression using the indicator elements z_nk.ᵃ Notice that optimizing this loss is now computationally tractable if we know the z_n.
ᵃ The “power trick” is used when terms in a PDF are raised to the power of indicator components of a one-hot vector. For example, it allows us to rewrite p(z_n; θ) = ∏_k θ_k^{z_nk}.
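Concretely, taking the log of this product turns the indicator exponents into a sum, which is the step that makes the complete data log likelihood tractable:

log p(z_n; θ) = log ∏_k θ_k^{z_nk} = ∑_k z_nk log θ_k.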
3. Expectation Step Our next step is to introduce a mathematical expression for q_n, the posterior over the hidden component variables z_n conditioned on the observed data x_n with fixed parameters. That is:
q_n = p(z_n | x_n; θ, {β_k}).
Write down and simplify the expression for q_n. Note that because q_n represents the posterior over the hidden categorical variables z_n, the components of the vector q_n must sum to 1. The main work is to find an expression for p(z_n | x_n; θ, {β_k}) for any choice of z_n, i.e., for any one-hot encoded z_n. With this, you can then construct the different components that make up the vector q_n.
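As a concrete illustration (a sketch, not the required derivation), such a posterior could be computed by Bayes’ rule as follows; the helper name e_step_sketch, the array shapes, and the use of SciPy are our own assumptions:

import numpy as np
from scipy.stats import gamma

def e_step_sketch(x, theta, alpha, betas):
    # Unnormalized posterior: p(z_n = k | x_n) ∝ θ_k · Gamma(x_n; α, β_k).
    # SciPy parameterizes the gamma by shape and scale, so scale = 1 / rate.
    # x: (N,), theta: (K,), betas: (K,)
    lik = gamma.pdf(x[:, None], alpha, scale=1.0 / betas[None, :])  # (N, K)
    q = theta[None, :] * lik
    return q / q.sum(axis=1, keepdims=True)  # each row sums to 1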
4. Maximization Step Using the q_n estimates from the Expectation Step, derive updates that maximize the expected complete data log likelihood in terms of θ and {β_k}.
(a) Derive an expression for the expected complete data log likelihood using q_n.
(b) Find an expression for θ that maximizes the expected complete data log likelihood. Why does this optimal θ make intuitive sense?
(c) Find an expression for β_k that maximizes the expected complete data log likelihood. Why does this optimal β_k make intuitive sense?
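For orientation, one consistent shape for these updates (again a sketch, not the required derivation; m_step_sketch and its arguments are our own naming):

import numpy as np

def m_step_sketch(x, q, alpha):
    # Mixing proportions: the average responsibility assigned to each component.
    theta_hat = q.mean(axis=0)
    # Rate updates: the stationary point of the expected complete data log
    # likelihood in β_k for a gamma with known shape α (cf. part 4c).
    beta_hats = alpha * q.sum(axis=0) / (q * x[:, None]).sum(axis=0)
    return theta_hat, beta_hats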
5. Suppose that this had been a classification problem. That is, you were provided the “true” components z_n for each observation x_n, and you were going to perform the classification by inverting the provided generative model (i.e., now you’re predicting z_n given x_n). Could you reuse any of your derivations above to estimate the parameters of the model?
6. Finally, implement your solution in p1.ipynb and attach the final plot below.
You will receive no points for code not included below.
Solution
1.
2.
3.
4. (a)
(b)
(c)
5.
6. Plot:
Code:
def e_step(theta, betas):
    q = 'not implemented'
    return q

def m_step(q):
    theta_hat = 'not implemented'
    beta_hats = 'not implemented'
    return theta_hat, beta_hats

def log_px(x, theta, betas):
    p = 'not implemented'
    return p

def run_em(theta, betas, iterations=1000):
    theta = 'not implemented'
    betas = 'not implemented'
    return theta, betas
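One numerically stable way to fill in a marginal log-likelihood like log_px is to work in log space with logsumexp. A sketch, assuming SciPy is available and that the shape α is passed explicitly (the template’s signature leaves it implicit):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import gamma

def log_px_sketch(x, theta, alpha, betas):
    # log p(x_n) = log Σ_k θ_k · Gamma(x_n; α, β_k), computed in log space
    # to avoid underflow when the component densities are tiny.
    log_joint = np.log(theta)[None, :] + gamma.logpdf(
        x[:, None], alpha, scale=1.0 / betas[None, :]
    )  # (N, K)
    return logsumexp(log_joint, axis=1)  # (N,)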
Problem 2 (PCA, 15 pts)
For this problem you will implement PCA from scratch on the first 6000 images of the MNIST dataset. Your job is to apply PCA on MNIST and discuss what kind of structure is found. Implement your solution in p2.ipynb and attach the final plots below.
You will receive no points for using third-party PCA implementations (e.g., scikit-learn). You will receive no points for code not included below.
1. Compute the PCA. Plot the eigenvalues corresponding to the most significant 500 components in order from most significant to least. Make another plot that describes the cumulative proportion of variance explained by the first k most significant components for values of k from 1 through 500. How much variance is explained by the first 500 components? Describe how the cumulative proportion of variance explained changes with k. Include this plot below.
2. Plot the mean image of the dataset and plot an image corresponding to each of the first 10 principal components. How do the principal component images compare to the cluster centers from K-means? Discuss any similarities and differences. Include these two plots below.
Reminder: Center the data before performing PCA.
3. Compute the reconstruction error on the data set using the mean image of the dataset. Then compute the reconstruction error using the first 10 principal components. How do these errors compare to the final objective loss achieved by using K-means on the dataset? Discuss any similarities and differences.
For consistency in grading, define the reconstruction error as the squared L2 norm averaged over all data points.
4. Suppose you took the matrix of principal components U that you found and multiplied it by some rotation matrix R. Would that change the reconstruction error from the previous part? The interpretation of the components? Why or why not?
Solution
Plots:
Code:
def pca(x, n_comps=500):
    top_eigvals = 'not implemented'
    top_pcomps = 'not implemented'
    return top_eigvals, top_pcomps

def calc_cfvs(eigvals):
    cum_frac_vars = 'not implemented'
    return cum_frac_vars

def calc_errs(x, pcomps):
    err_mean = 'not implemented'
    err_pcomp = 'not implemented'
    return err_mean, err_pcomp
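For orientation, a minimal from-scratch sketch of the two computational pieces, assuming NumPy only (per the rule against third-party PCA implementations) and our own helper names:

import numpy as np

def pca_sketch(x, n_comps=500):
    # Center the data, eigendecompose the covariance, keep the top components.
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_comps]  # most significant first
    return eigvals[order], eigvecs[:, order].T   # (n_comps,), (n_comps, D)

def calc_errs_sketch(x, pcomps):
    # Squared L2 reconstruction error averaged over data points: first using
    # only the mean image, then using a projection onto the top components.
    mu = x.mean(axis=0)
    xc = x - mu
    err_mean = np.mean(np.sum(xc ** 2, axis=1))
    proj = xc @ pcomps.T @ pcomps                # back-project onto the span
    err_pcomp = np.mean(np.sum((xc - proj) ** 2, axis=1))
    return err_mean, err_pcomp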
1.
2.
3.
4.
Problem 3 (Bayesian Networks, 10 pts)
In this problem we explore the conditional independence properties of a Bayesian Network. Consider the following Bayesian network representing a fictitious person’s activities. Each random variable is binary (true/false).
The random variables are:
• Weekend: Is it the weekend?
• Friends over: Does the person have friends over?
• Traveling: Is the person traveling?
• Sick: Is the person sick?
• Eat exotic foods: Is the person eating exotic foods?
• Get Sleep: Is the person getting sleep?
For the following questions, A ⊥ B means that events A and B are independent and A ⊥ B|C means that events A and B are independent conditioned on C.
Use the concept of d-separation to answer the questions and show your work (i.e., state what the blocking path(s) is/are and what nodes block the path; or explain why each path is not blocked).
Example Question: Is Friends over ⊥ Traveling? If NO, give intuition for why.
Example Answer: NO. The path from Friends over – Weekend – Traveling is not blocked following the d-separation rules as we do not observe Weekend. Thus, the two are not independent.
Actual Questions:
1. Is Weekend ⊥ Get Sleep? If NO, give intuition for why.
2. Is Sick ⊥ Weekend? If NO, give intuition for why.
3. Is Sick ⊥ Friends over|Eat exotic foods? If NO, give intuition for why.
4. Is Friends over ⊥ Get Sleep? If NO, give intuition for why.
5. Is Friends over ⊥ Get Sleep|Traveling? If NO, give intuition for why.
6. Suppose the person stops traveling in ways that affect their sleep patterns. Travel still affects whether they eat exotic foods. Draw the modified network. (Feel free to reference the handout file for the commands for displaying the new network in LaTeX.)
7. For this modified network, is Friends over ⊥ Get Sleep? If NO, give an intuition why. If YES, describe what observations (if any) would cause them to no longer be independent.
Solution
1.
2.
3.
4.
5.
6.
7.
Whom did you work with, and did you use any resources beyond cs181-textbook and your notes?
Calibration
Approximately how long did this homework take you to complete (in hours)?