CS155 – Set 6

• In this course, we will be using Google Colab for code submissions. You will need a Google account.
• This set uses PyTorch, a Python package for neural networks. We recommend using Google Colab, which comes with PyTorch already installed.
• Submit your report as a single .pdf file to Gradescope, under “Set 6 Report”.
• In the report, include any images generated by your code along with your answers to the questions.
• Submit your code by sharing a link in your report to your Google Colab notebook for each problem (see naming instructions below). Make sure to set sharing permissions to at least “Anyone with the link can view”. Links that cannot be run by the TAs will not be counted as turned in. Check your links in an incognito window before submitting to be sure.
• For instructions specifically pertaining to the Gradescope submission process, see https://www.gradescope.com/get_started#student-submission.
Google Colab Instructions
For each notebook, you need to save a copy to your drive.
1. Open the GitHub preview of the notebook, and click the icon to open the Colab preview.
2. On the colab preview, go to File → Save a copy in Drive.
3. Edit your file name to “lastname firstname set problem”, e.g. “yue yisong set6 prob2.ipynb”.
1 Class-Conditional Densities for Binary Data [25 Points, 8 EC Points]
This problem will test your understanding of probabilistic models, especially Naive Bayes. Consider a generative classifier for C classes, with class-conditional density p(x|y) and a uniform class prior p(y). Suppose all D features are binary, x_j ∈ {0, 1}. If we assume all of the features are conditionally independent, as in Naive Bayes, we can write

p(x | y = c) = ∏_{j=1}^{D} p(x_j | y = c) = ∏_{j=1}^{D} θ_{jc}^{x_j} (1 − θ_{jc})^{1 − x_j},   where θ_{jc} = p(x_j = 1 | y = c).

This requires storing DC parameters.
Now consider a different model, which we will call the ‘full’ model, in which all the features are fully dependent.
Problem A [9 points]: Use the chain rule of probability to factorize p(x | y), and let θ_{x_j c} = p(x_j | x_1, …, x_{j−1}, y = c). Assuming we store each θ_{x_j c}, how many parameters are needed to represent this factorization? Use big-O notation.
Problem B [8 points]: Assume we did no such factorization, and just used the joint probability p(x | y = c). How many parameters would we need to estimate in order to be able to compute p(x | y = c) for arbitrary x and c? How does this compare to your answer from the previous part? Again, use big-O notation.
Problem C [4 points]: Assume the number of features D is fixed. Let there be N training cases. If the sample size N is very small, which model (Naive Bayes or full) is likely to give lower test set error, and why?
Problem D [4 points]: If the sample size N is very large, which model (Naive Bayes or full) is likely to give lower test set error, and why?
Problem E [8 EC points]: Assume all the parameter estimates have been computed. What is the computational complexity of making a prediction, i.e., computing p(y | x), using Naive Bayes for a single test case? What is the computational complexity of making a prediction with the full model? In justifying your answer for the full model, choose either the implementation in 1A or 1B and state your choice. For the full-model case, assume that converting a D-bit vector to an array index is an O(D) operation. Also, recall that we have assumed a uniform class prior.
2 Sequence Prediction [75 Points]
In this problem, we will explore some of the various algorithms associated with Hidden Markov Models (HMMs), as discussed in lecture. We have also uploaded a note on HMMs to the class GitHub repository that might be helpful.
Sequence Prediction
These next few problems will require extensive coding, so be sure to start early!
• You will write an implementation for the hidden Markov model in the cell for HMM Code in the notebook given to you, within the appropriate functions where indicated. There should be no need to write additional functions or use NumPy in your implementation, but feel free to do so if you would like.
• You can (and should!) use the helper cells for each of the subproblems in the notebook, namely 2A, 2Bi, 2Bii, 2C, 2D, and 2F. These can be used to run and check your implementations for each of the corresponding problems. The cells provide useful output in an easy-to-read format. There is no need to modify these cells.
• Lastly, the cell for Utility contains some functions used for loading data directly from the class github repository. There is no need to modify this cell.
The supplementary data folder of the class GitHub repository contains 6 files titled sequence_data0.txt, sequence_data1.txt, …, sequence_data5.txt. Each file specifies a trained HMM. The first row contains two tab-delimited numbers: the number of states Y and the number of types of observations X (i.e., the observations are 0, 1, …, X−1). The next Y rows of Y tab-delimited floating-point numbers describe the state transition matrix: each row represents the current state, each column represents a state to transition to, and each entry is the probability of that transition occurring. The next Y rows of X tab-delimited floating-point numbers describe the output emission matrix, encoded analogously to the state transition matrix. The file ends with five emission sequences generated from that HMM; these are the input sequences used in problems 2A and 2B.
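For reference, a file in this format can be parsed in a few lines. The sketch below is illustrative only (the notebook's Utility cell already provides the real loader); the helper name load_hmm_file and the assumption that each emission is a single digit are ours, not part of the assignment.

# Hypothetical sketch of parsing a sequence_dataN.txt file in the format described above.
def load_hmm_file(path):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    Y, X = map(int, lines[0].split('\t'))            # number of states, number of observation types
    # Next Y rows: state transition matrix (Y x Y), rows = current state.
    A = [[float(v) for v in row.split('\t')] for row in lines[1:1 + Y]]
    # Next Y rows: output emission matrix (Y x X), rows = current state.
    O = [[float(v) for v in row.split('\t')] for row in lines[1 + Y:1 + 2 * Y]]
    # Remaining rows: the example emission sequences (assumed single-digit emissions).
    sequences = [[int(c) for c in row] for row in lines[1 + 2 * Y:]]
    return A, O, sequences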
The supplementary data folder also contains one additional file titled ron.txt. This is used in problems 2C and 2D and is explained in greater detail there.
Problem A [10 points]: For each of the six trained HMMs, find the max-probability state sequence for each of the five input sequences at the end of the corresponding file. To complete this problem, you will have to implement the Viterbi algorithm (in viterbi() of the HiddenMarkovModel object). Write your implementation well, as we will be reusing it in a later problem. See the end of problem 2B for a big hint! Note that you do not need to worry about underflow in this part.
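As a rough illustration of the recursion (not the notebook's actual viterbi() interface), a dynamic-programming Viterbi pass over a transition matrix A and emission matrix O, laid out as in the file format above, might look like the sketch below. The uniform start-state distribution is an assumption.

def viterbi(A, O, x):
    # A[i][j] = P(next state j | current state i); O[i][k] = P(emitting k | state i).
    # Assumes a uniform distribution over the initial state.
    L = len(A)                                        # number of hidden states
    probs = [O[i][x[0]] / L for i in range(L)]        # best probability of a length-1 path ending in state i
    paths = [[i] for i in range(L)]                   # the corresponding best paths
    for obs in x[1:]:
        new_probs, new_paths = [], []
        for j in range(L):
            # Pick the best predecessor state i for state j at this step.
            best_i = max(range(L), key=lambda i: probs[i] * A[i][j])
            new_probs.append(probs[best_i] * A[best_i][j] * O[j][obs])
            new_paths.append(paths[best_i] + [j])
        probs, paths = new_probs, new_paths
    best = max(range(L), key=lambda i: probs[i])
    return paths[best]                                # max-probability state sequence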
Problem B: Implement the Forward algorithm and the Backward algorithm (subproblems 2Bi and 2Bii, respectively). In your report, show your results on the 6 files for each.
Note that the probability of emitting an input sequence can be found using either the α vectors from the Forward algorithm or the β vectors from the Backward algorithm. You don't need to worry about this, as it is done for you in probability_alphas() and probability_betas().
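For orientation, the forward recursion can be sketched as below (the backward pass is analogous, running right to left). The uniform start-state distribution and the choice to leave the α's un-normalized are assumptions, not necessarily how the notebook's forward() is specified.

def forward(A, O, x):
    # alphas[t][i] = joint probability of the first t+1 observations and ending in state i,
    # with an assumed uniform prior over the first state.
    L = len(A)
    alphas = [[O[i][x[0]] / L for i in range(L)]]
    for obs in x[1:]:
        prev = alphas[-1]
        alphas.append([O[j][obs] * sum(prev[i] * A[i][j] for i in range(L))
                       for j in range(L)])
    return alphas      # the probability of the full sequence is sum(alphas[-1])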
After you complete problems 2A and 2B, you can compare your results (the probabilities from the Forward and Backward algorithms should be the same) for the file titled sequence_data0.txt with the values given in the table below:

Dataset | Emission Sequence         | Max-probability State Sequence | Probability of Sequence
0       | 25421                     | 31033                          | 4.537e-05
0       | 01232367534               | 22222100310                    | 1.620e-11
0       | 5452674261527433          | 1031003103222222               | 4.348e-15
0       | 7226213164512267255       | 1310331000033100310            | 4.739e-18
0       | 0247120602352051010255241 | 2222222222222222222222103      | 9.365e-24
HMM Training
Ron is an avid music listener, and his genre preferences at any given time depend on his mood. Ron’s possible moods are happy, mellow, sad, and angry. Ron experiences one mood per day (as humans are known to do) and chooses one of ten genres of music to listen to that day depending on his mood.
Ron’s roommate, who is known to take to odd hobbies, is interested in how Ron’s mood affects his music selection, and thus collects data on Ron’s mood and music selection for six years (2190 data points). This data is contained in the supplementary file ron.txt. Each row contains two tab-delimited strings: Ron’s mood and Ron’s genre preference that day. The data is split into 12 sequences, each corresponding to half a year’s worth of observations. The sequences are separated by a row containing only the character -.
Problem C [10 points]: Use a single M-step to train a supervised Hidden Markov Model on the data in ron.txt. What are the learned state transition and output emission matrices?
Tip: the (1,1) entry of your transition matrix should be 2.833e-01, and the (1,1) entry of your observation matrix should be 1.486e-01.
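Since the supervised case is a single M-step, training amounts to counting transitions and emissions in the labelled data and normalizing each row. Here is a hedged sketch under the assumption that each sequence has already been encoded as a list of (state, observation) integer pairs; the notebook's actual data structures and interface may differ.

# Sketch of a single supervised M-step: maximum-likelihood estimates are normalized counts.
def supervised_m_step(sequences, L, D):
    A = [[0.0] * L for _ in range(L)]    # transition counts, later normalized per row
    O = [[0.0] * D for _ in range(L)]    # emission counts, later normalized per row
    for seq in sequences:
        for (s, o) in seq:
            O[s][o] += 1
        for (s, _), (s_next, _) in zip(seq, seq[1:]):
            A[s][s_next] += 1
    # Normalize each row of both matrices so the rows sum to 1.
    for mat in (A, O):
        for row in mat:
            total = sum(row)
            if total > 0:
                for k in range(len(row)):
                    row[k] /= total
    return A, O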
Problem D [15 points]: Now suppose that Ron has a third roommate who is also interested in how Ron’s mood affects his music selection. This roommate is lazier than the other one, so he simply steals the first roommate’s data. Unfortunately, he only manages to grab half the data, namely, Ron’s choice of music for each of the 2190 days.
In this problem, we will train an unsupervised Hidden Markov Model on this data. Recall that unsupervised HMM training is done using the Baum-Welch algorithm and will require repeated EM steps. For this problem, we will use 4 hidden states and run the algorithm for 1000 iterations. The transition and observation matrices are initialized for you in the helper functions supervised_learning() and unsupervised_learning() such that they are random and normalized.
What are the learned state transition and output emission matrices? Please report the result using random seed state 1, as is done by default in the notebook.
Tips for debugging:
• The rows of the state transition and output emission matrices should sum to 1.
• Your matrices should not change drastically every iteration.
• After many iterations, your matrices should converge.
• If you used random seed 1 for this computation (as is done by default in the notebook), the (1,1) entry of the state transition matrix should be 5.075e-01, and the (1,1) entry of the output emission matrix should be 1.117e-01.
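For intuition, one Baum-Welch (EM) iteration combines the α and β vectors into expected state and transition counts and then re-normalizes. The sketch below assumes forward() and backward() return per-timestep lists of un-normalized α's and β's, as in the sketch after problem 2B; this interface is an assumption, not the notebook's specification.

# Sketch of one EM iteration of Baum-Welch over a set of observation sequences.
def baum_welch_step(A, O, sequences):
    L, D = len(A), len(O[0])
    A_num = [[0.0] * L for _ in range(L)]; A_den = [0.0] * L
    O_num = [[0.0] * D for _ in range(L)]; O_den = [0.0] * L
    for x in sequences:
        M = len(x)
        alphas, betas = forward(A, O, x), backward(A, O, x)
        for t in range(M):
            # E-step: marginal P(state_t = i | x).
            gamma = [alphas[t][i] * betas[t][i] for i in range(L)]
            norm = sum(gamma)
            for i in range(L):
                g = gamma[i] / norm
                O_num[i][x[t]] += g
                O_den[i] += g
                if t < M - 1:
                    A_den[i] += g
        for t in range(M - 1):
            # E-step: pair marginal P(state_t = i, state_{t+1} = j | x).
            xi = [[alphas[t][i] * A[i][j] * O[j][x[t + 1]] * betas[t + 1][j]
                   for j in range(L)] for i in range(L)]
            norm = sum(sum(row) for row in xi)
            for i in range(L):
                for j in range(L):
                    A_num[i][j] += xi[i][j] / norm
    # M-step: re-normalize the accumulated expected counts.
    A_new = [[A_num[i][j] / A_den[i] for j in range(L)] for i in range(L)]
    O_new = [[O_num[i][k] / O_den[i] for k in range(D)] for i in range(L)]
    return A_new, O_new

Repeating such a step 1000 times, starting from randomly initialized and row-normalized matrices, is the procedure described in this problem.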
Sequence Generation
Hidden Markov Models fall under the umbrella of generative models and therefore can be used not only to predict sequential data, but also to generate it.
Problem F [5 points]: Run the cell for this problem. The code in the cell loads the trained HMMs from the files titled sequence_data0.txt, … , sequence_data5.txt and uses the six models to probabilistically generate five sequences of emissions from each model, each of length 20. In your report, show your results.
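Conceptually, generation just walks the Markov chain, sampling an emission and then a next state at each step. A minimal sketch is shown below, assuming a uniform distribution over the start state; the code in the notebook's cell may differ in its details.

import random

# Sketch of sampling a length-n emission sequence from a trained HMM.
def generate_emission(A, O, n):
    L = len(A)
    state = random.randrange(L)        # assumed uniform distribution over the start state
    emissions = []
    for _ in range(n):
        # Sample an observation from the current state's emission distribution,
        # then sample the next state from the current state's transition row.
        emissions.append(random.choices(range(len(O[state])), weights=O[state])[0])
        state = random.choices(range(L), weights=A[state])[0]
    return emissions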
Visualization & Analysis
Once you have implemented the HMM code part of the notebook, load and run the cells for the following subproblems. Here you will apply the HMM you have implemented to the Constitution. There is no coding required for this part, only analysis.
Answer the following problems in the context of the visualizations in the notebook.
Problem G [3 points]: What can you say about the sparsity of the trained A and O matrices? How does this sparsity affect the transition and observation behaviour at each state?
Problem H [5 points]: How do the sample emission sentences from the HMM change as the number of hidden states is increased? What happens in the special case where there is only one hidden state? In general, when the number of hidden states is unknown while training an HMM for a fixed observation set, can we increase the training data likelihood by allowing more hidden states?
Problem I [5 points]: Pick a state that you find semantically meaningful, and analyze this state and its wordcloud. What does this state represent? How does this state differ from the other states? Back up your claim with a few key words from the wordcloud.
