TAs: Sana, Abhi, Sami, Helena, Chi
This is the final homework assignment. This assignment revisits Bayes Nets (from HW7) and Reinforcement Learning (from HW8). The new topics covered are Ensemble Methods, K-Means, PCA, and Recommender Systems.
START HERE: Instructions
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/courses/10601/syllabus.html
• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?
⬤ Matt Gormley / Henry Chai
◯ Marie Curie
◯ Noam Chomsky

If you need to change an answer, you may cross out the previous answer and bubble in the new answer:

Select One: Who taught this course?
⬤ Matt Gormley / Henry Chai
◯ Marie Curie
⊗ Noam Chomsky (crossed out)

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ None of the above

Again, if you need to change an answer, you may cross out the previous answer(s) and fill in the new answer(s):

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
⊠ I don’t know (crossed out)

For fill-in-the-blank questions, please make sure your final answer is fully included in the given space:

Fill in the blank: What is the course number?
10-601
Written Questions (75 points)
1 Bayes Nets, Revisited (4 points)
Consider the joint distribution over the binary random variables A,B,C,D,E represented by the Bayesian Network shown in the figure.
1. (1 point) [SOLO] Write the joint probability distribution for P(A,B,C,D,E) factorized as much as possible using the conditional independence assumptions expressed by the Bayesian Network.
2. (1 point) [SOLO] Which nodes are in the Markov boundary of B? Note that the Markov boundary is the smallest possible Markov blanket.
Select all that apply:
□ A
□ C
□ D
□ E
□ None of the above
3. (1 point) [SOLO] Which nodes are in the Markov boundary of C?
Select all that apply:
□ A
□ B
□ D
□ E
□ None of the above
4. (1 point) [SOLO] True or False: E is conditionally independent of C given {A,B,D}. That is,
E ⊥ C | {A,B,D}.
Select one:
◯ True
◯ False
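For reference, two standard facts used throughout this question: a Bayesian network factorizes the joint distribution as each variable conditioned on its parents,
$$P(X_1,\dots,X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{parents}(X_i)\big),$$
and the Markov boundary of a node consists of its parents, its children, and its children’s other parents.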
2 Reinforcement Learning, Revisited (9 points)
While attending the ML conference The Fellowship of the Ring, you meet Elon Musk, founder of SpaceX. He has a new idea for destroying the evil lord Sauron’s precious ring: fly the ring directly into the Sun. Elon has asked you to develop a reinforcement learning agent capable of carrying out the space-flight from Earth to the Sun. You model this problem as a Markov decision process (MDP). The figure below depicts the state space.
Here are the details:
1. Each grid cell is a state SA,SB,…,SY corresponding to a position in the solar system.
2. The action space includes movement up/down/left/right. Transitions are deterministic. It is not possible to move into blocked states, which are shaded grey, since they contain other planets.
3. The start state is SA (Earth). The terminal states are SY (the Sun) and SE (the asteroid Metis, home to Sauron’s cousin).
4. Non-zero rewards are depicted with arrows. Flying into the Sun from the left gives positive reward R(SX, right) = +50. Flying into the Sun from below gives positive reward R(ST, up) = +100. Flying to Metis is inadvisable and gives negative reward R(SJ, down) = −100. All other rewards are zero.
5. The discount factor is γ = 0.5.
Below, let V*(s) denote the value function for state s under the optimal policy π*(s), and let Q*(s, a) denote the Q-function for π*.
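For reference, V* and Q* satisfy the Bellman optimality equations; since transitions here are deterministic, writing s′ = δ(s, a) for the successor state, they reduce to
$$Q^*(s,a) = R(s,a) + \gamma\, V^*\big(\delta(s,a)\big), \qquad V^*(s) = \max_{a} Q^*(s,a),$$
with V*(s) = 0 at terminal states.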
1. (1 point) [SOLO] What is the value V*(ST)?
2. (1 point) [SOLO] What is the value V*(SO)?
3. (1 point) [SOLO] What is the value V*(SA)?
4. (1 point) [SOLO] What is the value Q*(ST, up)?
5. (1 point) [SOLO] What is the value Q*(ST, down)?
6. (1 point) [SOLO] What action does the optimal policy take from state SQ (i.e., what is π*(SQ))? (Note: if the optimal policy is not unique and there are multiple optimal actions, select them all.)
Select all that apply:
□ Up
□ Down
□ Left
□ Right
Now suppose you employ Q-Learning to learn table values Q(s,a) for each state s and action a. The table is initialized to all zeros. On the very first episode of training, you begin at state SA (Earth), take eight steps, and arrive in state SE (Metis). At each step, you perform a Q-Learning update of the appropriate entry in Q(s,a). Assume a learning rate α = 1.
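For reference, the tabular Q-Learning update after observing a transition (s, a, r, s′) is
$$Q(s,a) \;\leftarrow\; (1-\alpha)\,Q(s,a) + \alpha\Big(r + \gamma \max_{a'} Q(s',a')\Big),$$
so with α = 1 the old entry is simply overwritten by r + γ max_{a′} Q(s′, a′).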
7. (1 point) [SOLO] What is the new table value found in Q(SJ,down) after this episode?
8. (1 point) [SOLO] What is the new table value found in Q(SO,down) after this episode?
9. (1 point) [SOLO] True or False: The Q-function is guaranteed to converge to the true Q-values in this environment given the specified initialization and assuming that the Q-Learning algorithm visits each state-action pair infinitely often.
Select one:
◯ True
◯ False
3 PCA (15 points)
Some PCA Theory
1. (2 points) [SOLO] Assume we apply PCA to a matrix X ∈ R^{n×m} and obtain a set of PCA features Z ∈ R^{n×m}. We divide this set into two, Z1 and Z2. The first set, Z1, corresponds to the top principal components. The second set, Z2, corresponds to the remaining principal components. Which is more common in the training data?
Select one:
◯ a point with large feature values in Z1 and small feature values in Z2
◯ a point with large feature values in Z2 and small feature values in Z1
◯ a point with large feature values in Z2 and large feature values in Z1
◯ a point with small feature values in Z2 and small feature values in Z1
2. (1 point) [SOLO] For the data set shown below, assume the data are centered at the origin. Further, refer to the x-axis as component 1, and the y-axis as component 2. What is its first principal component, if it exists?
Select one:
◯ a
◯ b
◯ c
◯ d
◯ None of the above
3. (1 point) [SOLO] Note: this continues from the previous question. What is the second principal component in the figure from the previous question, if it exists?
Select one:
◯ a
◯ b
◯ c
◯ d
◯ None of the above
4. (1 point) [SOLO] Note: this continues from the previous question. What is the third principal component in the figure from the previous question, if it exists?
Select one:
◯ a
◯ b
◯ c
◯ d
◯ None of the above
PCA in Practice
For this section, refer to the PCA demo linked here. In this demonstration, we have performed PCA for you on a simple four-feature dataset. The questions below have also been added to the linked Colab notebook for ease of access. Run the code in the notebook, then answer the questions based on the results. Once you have answered the following questions, feel free to make a copy of the notebook to further explore how PCA works.
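If you want to reproduce the notebook’s pipeline offline, a minimal sketch is below (the variable names and the random stand-in data are hypothetical; the actual notebook’s loading code may differ). It mirrors the steps the questions refer to: min-max scaling, eigendecomposition of the covariance matrix, and cumulative explained variance.

    import numpy as np

    # Hypothetical stand-in for the notebook's four-feature dataset (n samples x 4 features).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 4))

    # Min-max scale each column to [0, 1] (Q5).
    X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Eigendecomposition of the covariance matrix of the centered data (Q7).
    X_centered = X_scaled - X_scaled.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: covariance matrices are symmetric
    order = np.argsort(eigvals)[::-1]            # sort components by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Eigenvectors of a symmetric matrix are orthonormal, so this prints ~identity (Q7).
    print(np.round(eigvecs.T @ eigvecs, 6))

    # Cumulative fraction of variance captured by the first k components (Q8).
    cum_var = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum_var, 0.95)) + 1  # smallest k preserving >= 95% of the variance
    print(cum_var, k)

    # Project onto the top-2 principal components (Q9).
    Z = X_centered @ eigvecs[:, :2]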
5. (2 points) [GROUP] We begin by normalizing each of the columns in our data (i.e., rescaling each feature to lie between 0 and 1, here implemented through min-max scaling). Why is this a good idea? Additionally, why might it be a bad idea to standardize our features (i.e., rescale all features to have a variance of 1)?
6. (2 points) [GROUP] Below this question (in the colab file), we plot each of the features against each other. Do you see any special relationships between any of the features? In particular, take a look at the petallength feature. How would you describe its association with each of the other features? Note: there is no need to copy the plots to your submission.
7. (2 points) [GROUP] To get the principal components of the features, we will calculate the eigenvalues and eigenvectors of the covariance matrix. If we took the dot product of any two eigenvectors, what should it be? What does it mean to obtain this value and why is this property beneficial?
8. (2 points) [GROUP] If we wanted to find k principal components such that we preserve at least 95% of the variance in the data, what would be the value of k? Hint: it is helpful here to look at the cumulative variance in the first k components, which we have calculated for you.
9. (2 points) [GROUP] Taking note of the principal components plot, let’s look back at the scatter matrix. If we wanted to reduce the data to just two features, we could pick any two features from the dataset and train a classifier on just those. What is one reason we might prefer the PCA features over two of the original features to represent our data?
4 K-Means (20 points)
Figure 1: Datasets — (a) Dataset A, (b) Dataset B, (c) Dataset C [clustering figures not reproduced here]
1. Consider the three datasets A, B, and C shown in Figure 1. Each dataset is partitioned into k clusters, represented by different colors in the figure. For each dataset, select the image whose cluster centers (denoted by an X) could have been generated by K-means. The distance measure used here is the Euclidean distance.
(a) (1 point) [SOLO] Dataset A (Select one)
A.1
A.2
(b) (1 point) [SOLO] Dataset B (Select one)
B.1
B.2
(c) (1 point) [SOLO] Dataset C (Select one)
C.1
C.2
2. Consider running K-means with k = 2 (clusters 0 and 1) on the following dataset D of five two-dimensional points:

$$D = \begin{bmatrix} 5.5 & 3.1 \\ 5.1 & 4.8 \\ 6.6 & 3.0 \\ 5.5 & 4.6 \\ 6.8 & 3.8 \end{bmatrix}$$
(a) (2 points) [SOLO] Which of the following points will be the center for cluster 0? Select one:
◯ (5.7, 4.1)
◯ (5.6, 4.8)
◯ (6.3, 3.3)
◯ (6.7, 3.4)
(b) (2 points) [SOLO] Which of the following points will be the center for cluster 1? Select one:
◯ (6.1, 3.8)
◯ (5.5, 4.6)
◯ (5.4, 4.7)
◯ (5.3, 4.7)
(c) (1 point) [SOLO] How many points will belong to cluster 0?
(d) (1 point) [SOLO] How many points will belong to cluster 1?
3. Recall that in K-means clustering we attempt to find k cluster centers c_j ∈ R^d (where d is the dimension of the data), j ∈ {1,…,k}, such that the total distance between each datapoint and the nearest cluster center is minimized. The objective function is

$$\min_{c_1,\dots,c_k} \; \sum_{i=1}^{n} \; \min_{j \in \{1,\dots,k\}} \|x_i - c_j\|^2 \tag{1}$$
Suppose a friend, John, asks how small Eq. (1) can be made if we are also free to choose k. Specifically, you convinced John by providing two values: α, the minimum possible value of Eq. (1), and β, the value of k at which Eq. (1) attains that minimum.
(a) (1 point) [SOLO] What is the value of α when n = 100?
(b) (1 point) [SOLO] What is the minimum value of β when n = 100?
(c) (2 points) [SOLO] We want to see how K-means clustering works on a single dimension. Consider the case in which k = 3 and we have 4 data points x1 = 1,x2 = 2,x3 = 5,x4 = 7. What is the optimal value of the objective Eq. (1) when k is fixed at 3?
4. (3 points) [GROUP] Consider the following simple brute-force algorithm for minimizing the K-means objective:
1. Enumerate all the possible partitionings of the n points into k clusters.
2. For each possible partitioning, compute the optimal centers c1,…,ck by taking the mean of the points in each cluster and compute the corresponding K-means objective value.
3. Output the best clustering found.
This algorithm is guaranteed to output the optimal set of centers, but unfortunately its running time is exponential in the number of data points. For the case k = 2, justify that the running time of the brute-force algorithm above is exponential in the number of data points n.
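As an illustration (a sketch to experiment with, not a substitute for the justification asked above), the following code enumerates every 2-clustering of n points by assigning each point to one of the two clusters, fixing the first point’s cluster to avoid double-counting; the loop runs 2^(n−1) times:

    import itertools
    import numpy as np

    def brute_force_kmeans_k2(X):
        """Enumerate all 2-clusterings of the rows of X; return the best objective and the count."""
        n = len(X)
        best, count = np.inf, 0
        # Fixing point 0 in cluster 0 halves the enumeration: 2^(n-1) partitionings.
        for bits in itertools.product([0, 1], repeat=n - 1):
            labels = np.array((0,) + bits)
            count += 1
            obj = 0.0
            for j in (0, 1):
                pts = X[labels == j]
                if len(pts):  # the optimal center of a cluster is its mean
                    obj += ((pts - pts.mean(axis=0)) ** 2).sum()
            best = min(best, obj)
        return best, count

    X = np.random.default_rng(1).normal(size=(10, 2))
    best, count = brute_force_kmeans_k2(X)
    print(best, count)  # count == 2**(10 - 1) == 512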
5. Initializing the centers has a big impact on the performance of Lloyd’s clustering algorithm. Usually, we randomly initialize k cluster centers. However, there are other methods, namely, furthest point initialization and k-means++ initialization.
(a) (2 points) [GROUP] In k-means++, the first cluster center is chosen uniformly at random from the data points; each subsequent cluster center is then chosen from the remaining (not-yet-chosen) data points, with each point’s probability of being chosen proportional to its squared distance to the closest existing cluster center. (A code sketch of this selection rule appears after this question.)
Figure 2: 2D Dataset
Explain in 1–2 sentences why k-means++ initialization is more likely than random initialization to choose one sample from each cluster of the dataset in Figure 2 above. Recall that random initialization chooses a set of data points uniformly at random as the starting cluster centers.
(b) (2 points) [GROUP] Another initialization method is furthest point initialization. The first cluster center is chosen at random; the remaining cluster centers are then picked one by one, each being the datapoint furthest from the cluster centers chosen so far.
Given this definition, is the furthest point initialization sensitive to outliers? Explain why or why not.
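A minimal sketch of the k-means++ selection rule described in part (a), assuming a data matrix X of shape (n, d):

    import numpy as np

    def kmeans_pp_init(X, k, rng=None):
        """Pick k initial centers: the first uniformly at random, each subsequent one
        with probability proportional to squared distance to the closest chosen center."""
        if rng is None:
            rng = np.random.default_rng()
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from every point to its closest already-chosen center.
            d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
            probs = d2 / d2.sum()  # far-away points are proportionally more likely to be picked
            centers.append(X[rng.choice(len(X), p=probs)])
        return np.array(centers)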
5 Ensemble Methods (17 points)
1. (1 point) [SOLO] Which of the following is false?
Select one:
◯ In the weighted majority algorithm, the weights associated with the weak learners are learned during training.
◯ In the AdaBoost algorithm, the weights associated with the weak learners are learned during training.
◯ In the weighted majority algorithm, the weak learners are learned during training.
◯ In the AdaBoost algorithm, the weak learners are learned during training.
2. (1 point) [SOLO] True or False: Provided enough iterations are performed, AdaBoost will give zero training error on any dataset regardless of the type of weak learner used.
Select one:
◯ True
◯ False
3. (1 point) [SOLO]
Select one:
◯ True
◯ False
4. (1 point) [SOLO] True or False: If AdaBoost reaches perfect training accuracy, all weak learners created in subsequent iterations will be identical.
Select one:
◯ True
◯ False
5. (1 point) [SOLO] Consider the following table containing recorded weights from iterations of AdaBoost. Which of the training points (A, B, C, D, E, F) must have been misclassified in Round 220 in order to produce the updated weights shown at the start of Round 221?
Round   Dt(A)   Dt(B)   Dt(C)   Dt(D)   Dt(E)   Dt(F)
…
220     1/14    1/14    7/14    1/14    2/14    2/14
221     5/40    5/40    14/40   2/40    10/40   4/40
…
Select all that apply:
□ A
□ B
□ C
□ D
□ E
□ F
6. In the following question, we will examine the generalization error of AdaBoost using a concept known as the classification margin.
Throughout the question, use the following definitions:
• T: the number of iterations used to train AdaBoost.
• N: the number of training samples.
• S = {(x1,y1), ··· , (xN,yN)}: the training samples, with binary labels (∀i ∈ [N], yi ∈ {−1,+1}).
• d: the VC-dimension of the weak learner hypothesis class.
• ht: the weak learner constructed at time t.
• εt: the error of ht on Dt.
• Zt: the normalization factor for the distribution update at time t.
• ft(x) = (α1 h1(x) + ··· + αt ht(x)) / (α1 + ··· + αt): the majority vote of the weak learners, rescaled by the total weights.
• Ht(x) = sign(ft(x)): the voting classifier decision function.
• $\widehat{\mathrm{err}}(H)$: the training error of classifier H.
• $\mathrm{err}(H)$: the generalization error of classifier H.
(a) (1 point) [GROUP] What is the range of values the classification margin can take on?
(b) (1 point) [GROUP] Let margint(x,y) represent the margin for our AdaBoost classifier at iteration t on the sample (x,y). Write a single inequality in terms of margint(x,y) that is true if and only if the classifier makes a mistake on the input (x,y) (i.e., provide a bound on the margin in the case the classifier is incorrect). Assume the classifier makes a mistake on ties.
(c) (1 point) [GROUP] For a given input and label (xi,yi), write margint(xi,yi) in terms of xi,yi and ft.
(d) (1 point) [GROUP] Recall the update AdaBoost performs on the distribution of weights:
• D1(i) = 1/N
• Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt
(e) (1 point) [GROUP] Rewrite your answer from the previous part in terms of yi, α, ft, and xi.
(f) (1 point) [GROUP] Rewrite your above answer in terms of margint(xi,yi) and α.
(g) (1 point) [GROUP] Note that α is constant across the input points. Using the classification margin, describe which points AdaBoost assigns high weight to at time t.
Consider $\widehat{\Pr}_{(x_i,y_i)\sim S}[\mathrm{margin}_T(x_i,y_i) \leq \theta]$, the fraction of training inputs whose margin falls below a threshold θ > 0 for the combined hypothesis. Suppose we are able to find weak learners with error below 1/2 at all time steps. As a result of the weighting behavior you discovered in the parts above, AdaBoost decreases $\widehat{\Pr}_{(x_i,y_i)\sim S}[\mathrm{margin}_T(x_i,y_i) \leq \theta]$ exponentially fast in the number of iterations T.
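For reference, the weak-learner weight referred to in parts (e)–(g) is the standard AdaBoost choice,
$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right),$$
which is strictly positive exactly when $\epsilon_t < 1/2$.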
Now, consider the following high-probability bounds on the generalization (true) error of AdaBoost. The first bound stems from a traditional PAC learning analysis, while the second stems from a classification margin analysis.

$$\text{Bound 1:}\quad \mathrm{err}(H_T) \;\leq\; \widehat{\mathrm{err}}(H_T) + \tilde{O}\!\left(\sqrt{\frac{Td}{N}}\right)$$

$$\text{Bound 2:}\quad \mathrm{err}(H_T) \;\leq\; \widehat{\Pr}_{(x,y)\sim S}\big[\mathrm{margin}_T(x,y) \leq \theta\big] + \tilde{O}\!\left(\sqrt{\frac{d}{N\theta^2}}\right)$$
(h) Suppose we have achieved $\widehat{\mathrm{err}}(H_T) = 0$. Considering only the first bound, and supposing we want as tight a bound on the true error as possible, should we continue training? Why or why not?
i. (1 point) [GROUP] Select one:
◯ Continue training
◯ Stop training
(i) Suppose we have achieved $\widehat{\mathrm{err}}(H_T) = 0$. Considering only the second bound, and supposing we want as tight a bound on the true error as possible, should we continue training? Why or why not?
i. (1 point) [GROUP] Select one:
◯ Continue training
◯ Stop training
(j) (1 point) [GROUP] True or False: Considering both bounds, it is reasonable to stop training AdaBoost as soon as $\widehat{\mathrm{err}}(H_T) = 0$.
Select one:
◯ True
◯ False
6 Recommender Systems (10 points)
1. (2 points) [SOLO] In which of the following situations would a collaborative filtering system be a more appropriate learning algorithm than linear or logistic regression?
Select all that apply:
□ You manage an online bookstore and you have the book ratings from many users. For each user, you want to recommend other books she will enjoy, based on her own ratings and the ratings of other users.
□ You run an online news aggregator, and for every user, you know some subset of articles that the user likes and some different subset that the user dislikes. You’d want to use this to find other articles that the user likes.
□ You’ve written a piece of software that has downloaded news articles from many news websites. In your system, you also keep track of which articles you personally like vs. dislike, and the system also stores away features of these articles (e.g., word counts, name of author). Using this information, you want to build a system to try to find additional new articles that you personally will like.
□ You manage an online bookstore and you have the book ratings from many users. You want to learn to predict the expected sales volume (number of books sold) as a function of the average rating of a book.
□ None of the above
2. (2 points) [SOLO] What is the basic intuition behind matrix factorization?
Select all that apply:
□ That content filtering and collaborative filtering are just two different factorizations of the same rating matrix.
□ That factoring user and item matrices can partition the users and items into clusters that can be treated identically, reducing the complexity of making recommendations.
□ That user-user and item-item correlations are more efficiently computed by factoring matrices.
□ That user-item relations can be well described in a low-dimensional space that can be computed from the rating matrices.
□ None of the above
3. To plan your schedule for next semester, you decide to use a recommender system, checking the courses your 10-601 TAs have taken. The course chart below (Table 1) shows everyone’s course status. For entry values, 1 indicates that the course has been taken and 0 indicates that it has not been taken yet.
15122 15150 15210 15213 10301 10403 10725 11344
You 1 1 0 0 1 0 0 0
Abhi 1 1 1 0 1 1 1 1
Chi 1 0 0 1 1 0 0 0
Helena 1 1 0 0 1 1 0 1
Sana 1 1 0 0 1 1 1 1
Sami 1 1 1 1 1 1 0 0
To use the content-based filtering system, the following similarity table is provided:
15122 15150 15210 15213 10301 10403 10725 11344
15122 1 0.8 0.7 0.8 0.5 0.4 0.1 0.1
15150 0.8 1 0.9 0.6 0.3 0.1 0.1 0.1
15210 0.7 0.9 1 0.6 0.2 0.1 0.1 0.1
15213 0.8 0.6 0.6 1 0.2 0.1 0.1 0.1
10301 0.5 0.3 0.2 0.2 1 0.8 0.6 0.6
10403 0.4 0.1 0.1 0.1 0.8 1 0.9 0.6
10725 0.1 0.1 0.1 0.1 0.6 0.9 1 0.8
11344 0.1 0.1 0.1 0.1 0.6 0.6 0.8 1
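One common way to turn a similarity table like this into content-based recommendations is to score each untaken item by its total similarity to the items already taken. This scoring rule is an assumption for illustration, not necessarily the one the assignment intends. A sketch:

    import numpy as np

    courses = ["15122", "15150", "15210", "15213", "10301", "10403", "10725", "11344"]
    taken = np.array([1, 1, 0, 0, 1, 0, 0, 0])   # the "You" row of Table 1
    S = np.array([                                # the similarity table above
        [1.0, 0.8, 0.7, 0.8, 0.5, 0.4, 0.1, 0.1],
        [0.8, 1.0, 0.9, 0.6, 0.3, 0.1, 0.1, 0.1],
        [0.7, 0.9, 1.0, 0.6, 0.2, 0.1, 0.1, 0.1],
        [0.8, 0.6, 0.6, 1.0, 0.2, 0.1, 0.1, 0.1],
        [0.5, 0.3, 0.2, 0.2, 1.0, 0.8, 0.6, 0.6],
        [0.4, 0.1, 0.1, 0.1, 0.8, 1.0, 0.9, 0.6],
        [0.1, 0.1, 0.1, 0.1, 0.6, 0.9, 1.0, 0.8],
        [0.1, 0.1, 0.1, 0.1, 0.6, 0.6, 0.8, 1.0],
    ])

    scores = S @ taken                 # total similarity of each course to the taken ones
    scores[taken == 1] = -np.inf       # only recommend courses not yet taken
    print(courses[int(np.argmax(scores))])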
(a) (1 point) [SOLO] Using the content-based filtering recommender system, which one of the courses that you (the first row of Table 1) have not taken is the most recommended?
(b) (1 point) [SOLO] Using the collaborative filtering neighborhood method, which one of the courses that you have not taken is the most recommended?
(c) (1 point) [GROUP] Which one of the recommender systems do you prefer? Choose one and justify your choice in one sentence.
4. (3 points) [GROUP] When building a recommender system using matrix factorization, the regularized objective function we wish to minimize is:
$$J(W,H) = \sum_{(u,i)\in Z} \big(v_{ui} - w_u^T h_i\big)^2 + \lambda\Big(\sum_u \|w_u\|^2 + \sum_i \|h_i\|^2\Big)$$
where v_{ui} is user u’s rating of item i; w_u is the u-th row of W and the vector representing user u; h_i is the i-th row of H and the vector representing item i; Z is the index set of observed user/item ratings in the training set; and λ is the weight of the L2 regularizer. One method of solving this optimization problem is to apply Block Coordinate Descent. The algorithm proceeds as shown below:
• while not converged:
 – for u in {1, …, Nu}:
  * w_u ← argmin_{w_u} J(W,H)
 – for i in {1, …, Ni}:
  * h_i ← argmin_{h_i} J(W,H)
Doing so yields an algorithm called Alternating Least Squares (ALS) for matrix factorization. Which of the following is equal to the transpose of argmin_{w_u} J(W,H)? Note: v_u is the vector of user u’s ratings across all items. Select one:
◯ $v_u H (H^T H + \lambda I)^{-1}$
◯ $(H^T H + \lambda I)^{-T} v_u H$
◯ $v_u H (H^T H)^{-1}$
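A minimal numpy sketch of one such block-coordinate update for a single user, under the simplifying assumption that user u has rated every item (in practice only the rows of H corresponding to items rated by u in Z would be used):

    import numpy as np

    def update_user_vector(v_u, H, lam):
        """Holding H fixed, solve the ridge-regression subproblem
        argmin_w ||v_u - H w||^2 + lam * ||w||^2 via its normal equations."""
        k = H.shape[1]
        return np.linalg.solve(H.T @ H + lam * np.eye(k), H.T @ v_u)

    # Toy example with hypothetical numbers: 5 items, rank-2 factors.
    rng = np.random.default_rng(0)
    H = rng.normal(size=(5, 2))
    v_u = rng.normal(size=5)
    w_u = update_user_vector(v_u, H, lam=0.1)
    print(w_u)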
7 Collaboration Questions
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
3. Did you find or come across code that implements any part of this assignment? If so, include full details.