Exam 3 Practice Problems
Room:
Time Limit: N/A
Exam Number:
Instructions:
• Clearly mark your answers in the allocated space on the front of each page. If needed, use the back of a page for scratch work, but you will not receive credit for anything written on the back of a page. If you have made a mistake, cross out the invalid parts of your solution and circle those that should be graded.
• Please write all answers in pen.
• You have N/A to complete the exam. Good luck!
Topic (# Questions)
Ensemble Methods (8)
Recommender Systems (3)
Hidden Markov Models (3)
Graphical Models (2)
Principal Component Analysis (2)
Reinforcement Learning (6)
K-Means (2)
Support Vector Machines (8)
Kernel Methods (7)
1 Ensemble Methods
1. [3pts] In the AdaBoost algorithm, if the final hypothesis makes no mistakes on the training data, which of the following is correct?
Select all that apply:
Additional rounds of training can help reduce the errors made on unseen data.
Additional rounds of training have no impact on unseen data.
The individual weak learners also make zero error on the training data.
Additional rounds of training always lead to worse performance on unseen data.
2. [3pts] Which of the following is true about ensemble methods?
Select all that apply:
Ensemble methods combine together many simple, poorly performing classifiers in order to produce a single, high quality classifier.
Neural networks can be used in ensemble methods.
For the weighted majority algorithm, the weak classifiers are learned along the way.
For the weighted majority algorithm, we want to give higher weights to better performing models.
3. [2pt] True or False: In AdaBoost, the weights of the misclassified examples go up by the same multiplicative factor.
True
False
4. [2pt] True or False: AdaBoost will eventually give zero training error regardless of the type of weak classifier it uses, provided enough iterations are performed.
True
False
5. [12pts] Last semester, someone used AdaBoost to train on some data and recorded all the weights across iterations, but some entries in the table are no longer recognizable. Clever as you are, you decide to employ your knowledge of AdaBoost to determine some of the missing information.
Below, you can see part of the table that was used in the problem set. There are columns for the round number and for the weights of the six training points (A, B, C, D, E, and F) at the start of each round. Some of the entries, marked with "?", are impossible for you to read.
Round   Dt(A)   Dt(B)   Dt(C)   Dt(D)   Dt(E)   Dt(F)
1       ?       ?       1/6     ?       ?       ?
2       ?       ?       ?       ?       ?       ?
…
219     ?       ?       ?       ?       ?       ?
220     1/14    1/14    7/14    1/14    2/14    2/14
221     1/8     1/8     7/20    1/20    1/4     1/10
…
3017    1/2     1/4     1/8     1/16    1/16    0
…
8888    1/8     3/8     1/8     2/8     3/8     1/8
(a) [3pts] The weak classifier chosen in Round 1 correctly classified training points A, B, C, and E but misclassified training points D and F. What should the updated weights have been in the following round, Round 2? Please complete the form below.
Round D2(A) D2(B) D2(C) D2(D) D2(E) D2(F)
2
(b) [3pts] During Round 219, which of the training points (A, B, C, D, E, F) must have been misclassified, in order to produce the updated weights shown at the start of Round 220? List all the points that were misclassified. If none were misclassified, write ‘None’. If it can’t be decided, write ‘Not Sure’ instead.
(c) [3pts] During Round 220, which of the training points (A, B, C, D, E, F) must have been misclassified in order to produce the updated weights shown at the start of Round 221? List all the points that were misclassified. If none were misclassified, write ‘None’. If it can’t be decided, write ‘Not Sure’ instead.
(d) [3pts] You observe that the weights in round 3017 or 8888 (or both) cannot possibly be right. Which one is incorrect? Why? Please explain in one or two short sentences.
Round 3017 is incorrect.
Round 8888 is incorrect.
Both rounds 3017 and 8888 are incorrect.
6. [3 pts] What condition must a weak learner satisfy in order for boosting to work? Short answer:
7. [3 pts] After an iteration of training, AdaBoost more heavily weights which data points to train the next weak learner? (Provide an intuitive answer with no math symbols.) Short answer:
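For reference, a minimal sketch of AdaBoost's re-weighting step, assuming binary labels in {−1, +1}; the function and variable names here are illustrative, not from the original exam:

import numpy as np

def adaboost_reweight(D, y, y_hat):
    """One AdaBoost re-weighting step for labels in {-1, +1}.

    D     : current weight distribution over training points (sums to 1)
    y     : true labels
    y_hat : the weak learner's predictions
    """
    eps = D[y != y_hat].sum()                # weighted training error
    alpha = 0.5 * np.log((1 - eps) / eps)    # weak learner's vote weight
    D_new = D * np.exp(-alpha * y * y_hat)   # up-weight mistakes, down-weight hits
    return D_new / D_new.sum()               # renormalize to a distribution

# Toy example (not from the table above): 4 points, 1 mistake.
D = np.full(4, 0.25)
y     = np.array([+1, +1, +1, +1])
y_hat = np.array([+1, +1, +1, -1])
print(adaboost_reweight(D, y, y_hat))        # mistake -> 1/2, each hit -> 1/6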
8. [3 pts extra credit] Do you think that a deep neural network is nothing but a case of boosting? Why or why not? Impress us.
Answer:
2 Recommender Systems
1. [4pts] In which of the following situations will a collaborative filtering system be the most appropriate learning algorithm compared to linear or logistic regression?
Select all that apply:
You manage an online bookstore and you have the book ratings from many users. For each user, you want to recommend other books she will enjoy, based on her own ratings and the ratings of other users.
You run an online news aggregator, and for every user, you know some subset of articles that the user likes and some different subset that the user dislikes. You’d want to use this to find other articles that the user likes.
You’ve written a piece of software that has downloaded news articles from many news websites. In your system, you also keep track of which articles you personally like vs. dislike, and the system also stores away features of these articles (e.g., word counts, name of author). Using this information, you want to build a system to try to find additional new articles that you personally will like.
You manage an online bookstore and you have the book ratings from many users. You want to learn to predict the expected sales volume (number of books sold) as a function of the average rating of a book.
2. [3pts] What is the basic intuition behind matrix factorization?
Select all that apply:
That content filtering and collaborative filtering are just two different factorizations of the same rating matrix.
That factoring user and item matrices can partition the users and items into clusters that can be treated identically, reducing the complexity of making recommendations.
The user-user and item-item correlations are more efficiently computed by factoring matrices.
That user-item relations can be well described in a low dimensional space that can be computed from the rating matrices.
3. [3pts] When building a recommender system using matrix factorization, the regularized objective function we wish to minimize is:
J(W,H) = Σ_{(u,i)∈Z} (v_{ui} − w_u^T h_i)² + λ ( Σ_u ‖w_u‖² + Σ_i ‖h_i‖² )
where w_u is the u-th row of W and the vector representing user u; h_i is the i-th row of H and the vector representing item i; Z is the index set of observed user/item ratings in the training set; and λ is the weight of the L2 regularizer. One method of solving this optimization problem is to apply Block Coordinate Descent. The algorithm proceeds as shown below:
while not converged:
    for u in {1, …, N_u}:
        w_u ← argmin_{w_u} J(W, H)
    for i in {1, …, N_i}:
        h_i ← argmin_{h_i} J(W, H)
Doing so yields an algorithm called Alternating Least Squares (ALS) for matrix factorization. Which of the following is equal to the transpose of argmin_{w_u} J(W, H)? Select one:
v_u H (H^T H + λI)^{−1}
(H^T H + λI)^{−T} v_u H
v_u H (H^T H + λI)^{−T}
v_u H (H^T H)^{−1}
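As a sketch of the resulting closed-form update (NumPy assumed; the restriction to observed ratings follows the definition of Z above, and all names here are illustrative):

import numpy as np

def als_user_update(H, v_u, rated, lam):
    """Closed-form argmin over w_u with H held fixed.

    H     : (num_items, k) item-factor matrix
    v_u   : (num_items,) ratings by user u; only indices in `rated` are observed
    rated : indices i with (u, i) in Z
    lam   : L2 regularization weight
    """
    H_u = H[rated]                        # rows of H for observed items only
    k = H.shape[1]
    A = H_u.T @ H_u + lam * np.eye(k)     # normal equations with the L2 term
    b = H_u.T @ v_u[rated]
    return np.linalg.solve(A, b)

Alternating this with the symmetric update for each h_i until J stops decreasing is exactly ALS.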
3 Hidden Markov Models
1. Recall that the Hidden Markov Model (HMM) can be used to model sequential data with local dependence structure. In this question, let Yt be the hidden state at time t, Xt be the observation at time t, Y be all the hidden states, and X be all the observations.
(a) [2 pts] Draw the HMM as a Bayesian network where the observation sequence has length 3 (i.e., t = 1,2,3), labelling nodes with Y1,Y2,Y3 and X1,X2,X3.
(b) [2 pts] Write out the factorized joint distribution of P(X,Y) using the independencies/conditional independencies assumed by the HMM graph, using terms Y1,Y2,Y3 and X1,X2,X3. P(X,Y) =
(c) [2 pts] True or False: In general, we should not include unobserved variables in a graphical model because we cannot learn anything useful about them without observations.
True False
2. Consider an HMM with states Yt ∈ {S1,S2,S3}, observations Xt ∈ {A,B,C}, and parameters consisting of an initial state distribution π, a transition matrix A, and an emission matrix B.
(a) [3 pts] What is P(Y5 = S3)?
(b) [2 pts] What is P(Y5 = S3|X1:7 = AABCABC)?
(c) [4 pts] Fill in the following table assuming the observation AABCABC. The α’s are values obtained during the forward algorithm: αt(i) = P(X1,…,Xt,Yt = i).
t αt(1) αt(2) αt(3)
(d) [3 pts] Write down the sequence of Y1:7 with the maximal posterior probability assuming the observation AABCABC. What is that posterior probability?
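The numeric entries of π, A, and B were in a figure that did not survive extraction. As a generic aid for part (c), here is a minimal sketch of the forward recursion that fills such an α table once π, A, and B are known (NumPy assumed; names are illustrative):

import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(X_1, ..., X_t, Y_t = i).

    pi  : (S,) initial state distribution
    A   : (S, S) transitions, A[i, j] = P(Y_{t+1} = j | Y_t = i)
    B   : (S, M) emissions,  B[i, o] = P(X_t = o | Y_t = i)
    obs : observation indices, e.g. A = 0, B = 1, C = 2
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                         # base case
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]     # recursion
    return alpha                          # P(X_1..X_T) = alpha[-1].sum()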
3. Consider the HMM in the figure above. The HMM has k states (s1,…,sk). sk is the terminal state. All states have the same emission probabilities (shown in the figure). The HMM always starts at s1 as shown. Transition probabilities for all states except sk
are also the same as shown. Once a run reaches sk it outputs a symbol based on the sk state emission probability and terminates.
1. [5 pts] Assume we observed the output AABAABBA from the HMM. Select all answers below that COULD be correct.
k > 8
k < 8
k > 6
k < 6
k = 7
2. [9 pts] Now assume that k = 4. Let P('AABA') be the probability of observing AABA from a full run of the HMM. For the following equations, fill in the box with >, <, = or ? (? implies it is impossible to tell).
(a) P('AAB') □ P('BABA')
(b) P('ABAB') □ P('BABA')
(c) P('AAABA') □ P('BBAB')
4 Graphical Models [16 pts]
1. Consider the following two Bayesian networks.
(a) Answer whether the following conditional independence statements are true.
[2 pts] X1 ⊥ X2 | X3?
Circle one: Yes No
Please explain briefly in one sentence.
[2 pts] X1 ⊥ X4?
Circle one: Yes No
Please explain briefly in one sentence.
[2 pts] X5 ⊥ X2 | X3?
Circle one: Yes No
Please explain briefly in one sentence.
(b) [4 pts] Write out the joint probability in a form that utilizes as many independence/conditional independence assumptions contained in the graph as possible. Answer: P(X1,X2,X3,X4,X5) =
(c) [2 pts] In the Hidden Markov Model (HMM), a state depends only on the corresponding observation and its previous state.
Circle one: True False
(d) [2 pts] In a graphical model, if X1 ⊥ X2, then X1 ⊥ X2|Y for every node Y in the graph.
Circle one: True False
(e) [2 pts] In a graphical model, if X1 ⊥ X2|Y for some node Y in the graph, it is always true that X1 ⊥ X2.
Circle one: True False
2. Consider the graphical model shown above for questions (a)-(f). Assume all variables are boolean-valued.
(a) [2 pt. ] (Short answer) Write down the factorization of the joint probability P(A,B,C,D,E) for the above graphical model, as a product of the five distributions associated with the five variables.
(b) [2 pt. ] T or F: Is C conditionally independent of D given B (i.e. is (C ⊥ D)|B)?
(c) [2 pt. ] T or F: Is A conditionally independent of D given C (i.e. is (A ⊥ D)|C)?
(d) [2 pt. ] T or F: Is A independent of B (i.e. is A ⊥ B)?
(e) [4 pt. ] Write an expression for P(C = 1|A = 1,B = 0,D = 1,E = 0) in terms of the
parameters of Conditional Probability Distributions associated with this graphical model.
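To make part (e) concrete, a small sketch assuming, for illustration only, the structure A → C ← B with C → D and C → E (the original figure is not reproduced here, and the CPD numbers below are invented):

# Assumed structure for illustration: A -> C <- B, C -> D, C -> E.
# All CPD values below are made up; plug in the figure's actual numbers.
p_C1_given_A1B0 = 0.7                 # P(C=1 | A=1, B=0)
p_D1_given_C = {1: 0.9, 0: 0.2}       # P(D=1 | C=c)
p_E0_given_C = {1: 0.4, 0: 0.6}       # P(E=0 | C=c)

# By Bayes rule the P(A=1)P(B=0) factors cancel, leaving only C's neighbors:
num = p_C1_given_A1B0 * p_D1_given_C[1] * p_E0_given_C[1]
den = num + (1 - p_C1_given_A1B0) * p_D1_given_C[0] * p_E0_given_C[0]
print("P(C=1 | A=1, B=0, D=1, E=0) =", num / den)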
5 Principal Component Analysis
1. (i) [5 pts] Consider the following two plots of data. Draw arrows from the mean of the data to denote the direction and relative magnitudes of the principal components.
(ii) [5 pts] Now consider the following two plots, where we have drawn only the principal components. Draw the data ellipse or place data points that could yield the given principal components for each plot. Note that for the right hand plot, the principal components are of equal magnitude.
2. Circle one answer and explain.
In the following two questions, assume that using PCA we factorize X ∈ R^(n×m) as Z^T U ≈ X, for Z ∈ R^(m×n) and U ∈ R^(m×m), where the rows of X contain the data points, the rows of U are the prototypes/principal components, and Z^T U = X̂.
(i) [2 pts] Removing the last row of U will still result in an approximation of X, but this will never be a better approximation than X̂.
Circle one: True False
(ii) [2 pts] X̂ X̂^T = Z^T Z.
Circle one: True False
(iii) [2 pts] The goal of PCA is to interpret the underlying structure of the data in terms of the principal components that are best at predicting the output variable.
Circle one: True False
(iv) [2 pts] The output of PCA is a new representation of the data that is always of lower dimensionality than the original feature representation.
Circle one: True False
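A minimal sketch of the Z^T U factorization via SVD, under the notation above (NumPy assumed; the synthetic data is illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)                  # PCA operates on centered data

# Rows of U are the principal components (sorted by variance explained).
_, _, U = np.linalg.svd(Xc, full_matrices=False)   # U here is SVD's V^T
Z = U @ Xc.T                             # (m, n) scores, matching Z above
print(np.allclose(Z.T @ U, Xc))          # with all m components, Z^T U = Xc

# Dropping the last row of U (and of Z) gives the best rank-(m-1)
# least-squares approximation of Xc (cf. question (i)).
X_hat = Z[:-1].T @ U[:-1]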
6 Reinforcement Learning
6.1 Markov Decision Process
Lord Farquaad is hoping to evict all fairytale creatures from his kingdom of Duloc, and has one final ogre to evict: Shrek. Unfortunately, all his previous attempts to catch the crafty ogre have fallen short, and he turns to you, with your knowledge of Markov Decision Processes (MDPs), to help him catch Shrek once and for all.
Consider the following MDP environment where the agent is Lord Farquaad:
Figure 1: Kingdom of Duloc, circa 2001
Here’s how we will define this MDP:
• S (state space): a set of states the agent can be in. In this case, the agent (Farquaad) can be in any location (row, col) and in any orientation ∈ {N,E,S,W}. Therefore, a state is represented by a three-tuple (row, col, dir), and S is the set of all such tuples. Farquaad's start state is (1,1,E).
• A (action space): a set of actions that the agent can take. Here, we will have just three actions: turn right, turn left, and move forward (turning does not change row or col, just dir). So our action space is {R,L,M}. Note that Farquaad is debilitatingly short, so he cannot travel through (or over) the walls. Moving forward when facing a wall results in no change in state (but counts as an action).
• R(s,a) (reward function): In this scenario, Farquaad gets a reward of 5 by moving into the swamp (the cell containing Shrek), and a reward of 0 otherwise.
• p(s′|s,a) (transition probabilities): We'll use a deterministic environment, so this will be 1 if s′ is reachable from s by taking a, and 0 if not.
1. What are |S| and |A| (size of state space and size of action space)?
2. Why is it called a "Markov" decision process? (Hint: what is the assumption made with p?)
3. What are the following transition probabilities?
p((1,1,N) | (1,1,N), M) =
p((1,1,N) | (1,1,E), L) =
p((2,1,S) | (1,1,S), M) =
p((2,1,E) | (1,1,S), M) =
4. Given a start position of (1,1,E) and a discount factor of γ = 0.5, what is the expected discounted future reward from a = R? For a = L? (Fix γ = 0.5 for following problems).
5. What is the optimal action from each state, given that orientation is fixed at E? (if there are multiple options, choose any)
6. Farquaad’s chief strategist (Vector from Despicable Me) suggests that having γ = 0.9 will result in a different set of optimal policies. Is he right? Why or why not?
7. Vector then suggests the following setup: R(s,a) = 0 when moving into the swamp, and R(s,a) = −1 otherwise. Will this result in a different set of optimal policies? Why or why not?
8. Vector now suggests the following setup: R(s,a) = 5 when moving into the swamp, and R(s,a) = 0 otherwise, but with γ = 1. Could this result in a different optimal policy?
Why or why not?
9. Surprise! Elsa from Frozen suddenly shows up. Vector hypnotizes her and forces her to use her powers to turn the ground into ice. The environment is now stochastic: since the ground is slippery, when choosing the action M, with a 0.2 chance Farquaad will slip and move two squares instead of one. What is the expected discounted future reward from s = (2,4,S)?
6.2 Value and Policy Iteration
1. Which of the following environment characteristics would increase the computational complexity per iteration for a value iteration algorithm? Choose all that apply:
Large Action Space
A Stochastic Transition Function
Large State Space
Unknown Reward Function
None of the Above
2. Which of the following environment characteristics would increase the computational complexity per iteration for a policy iteration algorithm? Choose all that apply:
Large Action Space
A Stochastic Transition Function
Large State Space
Unknown Reward Function
None of the Above
3. In the image below is a representation of the game that you are about to play. There are 5 states: A, B, C, D, and the goal state. The goal state, when reached, gives 100 points as a reward. In addition to the goal's points, you also get points by moving between states. The number of points you get is shown next to the arrows. You start at state B. To figure out the best policy, you use asynchronous value iteration with a decay (γ) of 0.9.
(i) When you first start playing the game, what action would you take (up, down, left, right) at state B?
(ii) What is the total reward at state B at this time?
(iii) Let’s say you keep playing until your total values for each state has converged. What action would you take at state B?
(iv) What is the total reward at state B at this time?
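For reference, a minimal sketch of synchronous value iteration assuming a known tabular model (the names and array shapes are my own, not the exam's); note how the per-iteration cost depends on the sizes of the state and action spaces:

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards.

    Each sweep computes an (S, A) lookahead, so the per-iteration cost
    grows with both the state-space and action-space sizes.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)          # one-step lookahead, shape (S, A)
        V_new = Q.max(axis=1)            # greedy backup
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new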
6.3 Q-Learning
1. For the following true/false, circle one answer and provide a one-sentence explanation:
(i) One advantage that Q-learning has over Value and Policy iteration is that it can account for non-deterministic policies.
Circle one: True False
(ii) You can apply Value or Policy iteration to any problem that Q-learning can be applied to.
Circle one: True False
(iii) Q-learning is guaranteed to converge to the true value Q* for a greedy policy.
Circle one: True False
2. For the following parts of this problem, recall that the update rule for Q-learning with function-approximation weights w is:

w ← w + α ( r + γ max_{a′} q(s′, a′; w) − q(s, a; w) ) ∇_w q(s, a; w)
(i) From the update rule, let's look at the specific term X = r + γ max_{a′} q(s′, a′; w). Describe in English the role of X in the weight update.
(ii) Is this update rule synchronous or asynchronous?
(iii) A common adaptation to Q-learning is to incorporate rewards from more time steps into the term X. Thus, the usual target r_t + γ max_{a_{t+1}} q(s_{t+1}, a_{t+1}; w) would become r_t + γ r_{t+1} + γ² max_{a_{t+2}} q(s_{t+2}, a_{t+2}; w). What are the advantages of using more rewards in this estimation?
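A minimal sketch of this update for a linear function approximator, where ∇_w q(s, a; w) is simply the feature vector (the feature function and names are illustrative assumptions):

import numpy as np

def q_update(w, feats, s, a, r, s_next, actions, lr, gamma):
    """One Q-learning step for a linear q(s, a; w) = w . feats(s, a).

    `feats` is an assumed feature function mapping a (state, action)
    pair to a NumPy vector; `lr` is the learning rate.
    """
    X = r + gamma * max(w @ feats(s_next, ap) for ap in actions)  # target
    td_error = X - w @ feats(s, a)
    # For a linear approximator, grad_w q(s, a; w) is just feats(s, a).
    return w + lr * td_error * feats(s, a)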
7 K-Means
1. For True or False questions, circle your answer and justify it; for QA questions, write down your answer.
(i) For a particular dataset and a particular k, k-means always produces the same result if the initial centers are the same. Assume there are no ties when assigning points to clusters.
True
False
Justify your answer:
(ii) k-means can always converge to the global optimum.
True
False
Justify your answer:
(iii)
True
False
Justify your answer:
(iv) k-means is not sensitive to outliers.
True
False
Justify your answer:
(v) The k in k-nearest neighbors and the k in k-means have the same meaning.
True
False
Justify your answer:
(vi) What’s the biggest difference between k-nearest neighbors and k-means?
Write your answer in one sentence:
2. In k-means, random initialization could possibly lead to a local optimum with very bad performance. To alleviate this issue, instead of initializing all of the centers completely randomly, we decide to use a smarter initialization method. This leads us to k-means++.
The only difference between k-means and k-means++ is the initialization strategy, and all of the other parts are the same. The basic idea of k-means++ is that instead of simply choosing the centers to be random points, we sample the initial centers iteratively, each time putting higher probability on points that are far from any existing center. Formally, the algorithm proceeds as follows.
Given: data set x^(i), i = 1, …, n
Initialize: choose the first center µ^(1) uniformly at random from the data points.
For j = 2, …, k:
    Compute the probability p_i of selecting each point, proportional to its squared distance to the closest already-chosen center, normalized so that the p_i sum to 1.
    Select the next center given these probabilities: µ^(j) ∼ Categorical(p_1, …, p_n)
Note: n is the number of data points and k is the number of clusters. The center of cluster 1 is simply a randomly chosen data point. Each time you initialize a new center, first compute, for every data point, the squared distance to the center closest to it; after computing these values for all data points, normalize them to obtain the sampling probabilities for the new center.
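A minimal sketch of this initialization, assuming the standard k-means++ squared-distance weighting described in the note (NumPy; names are illustrative):

import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: first center uniform, then D^2-weighted sampling."""
    n = len(X)
    centers = [X[rng.integers(n)]]                    # cluster 1: random point
    for _ in range(2, k + 1):
        # squared distance of each point to its closest existing center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                         # normalize
        centers.append(X[rng.choice(n, p=probs)])     # sample the next center
    return np.array(centers)

X = np.array([[0, 0], [1, 2], [2, 3], [3, 1], [4, 1]], dtype=float)
print(kmeanspp_init(X, 3, np.random.default_rng(0)))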
Now assume we have 5 data points (n=5): (0, 0), (1, 2), (2, 3), (3, 1), (4, 1). The number of clusters is 3 (k=3). The center of cluster 1 is randomly chosen as (0, 0).
These data points are shown in the figure below.
(i) [5 pts] What is the probability of every data point being chosen as the center for cluster 2? (The answer should contain 5 probabilities, each for every data point)
(ii) [1 pt] Which data point is most likely to be chosen as the center for cluster 2?
(iii) [5 pts] Assume the center for cluster 2 is chosen to be the most likely one as you computed in the previous question. Now what is the probability of every data point being chosen as the center for cluster 3? (The answer should contain 5 probabilities, each for every data point)
(iv) [1 pt] Which data point is most likely to be chosen as the center for cluster 3?
(v) [3 pts] Assume the center for cluster 3 is also chosen to be the most likely one as you computed in the previous question. Now the initialization of all 3 centers is finished. List the data points that are assigned to clusters 1, 2, and 3, respectively.
(vi) [3 pts] Based on the above clustering result, what’s the new center for every cluster?
8 Support Vector Machines
1. [3 pts.] Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.
Circle one: True False
2. [2 pts] The support vectors for a soft margin SVM include the points within the margin as well as those that are incorrectly classified.
Circle one: True False
One line justification (only if False):
3. [2 pts] Consider a hard-margin SVM, min_w (1/2)‖w‖² s.t. y_i(w^T x_i) ≥ 1 for all i, trained on linearly separable data. Suppose one of the support vectors is removed from the training set and the SVM is retrained. The decision boundary:
• Shifts toward the point removed
• Shifts away from the point removed
• Does not change
4. [2 pts] After the same removal and retraining, the size of the margin:
– Remains the same
– Increases
– Decreases
5. SVM is a discriminative classifier, whereas Naïve Bayes is a generative classifier. Describe one statistical advantage SVM has over Naïve Bayes. Describe in one sentence:
6. [4 pt.] SVM and Perceptron both give linear decision boundaries. Which of the following are correct about the differences between SVM and perceptron?
(a) SVM maximizes the margin of the classifier while perceptron does not necessarily
(b) SVM can take advantage of the kernel trick while perceptron cannot
(c) SVM can allow classification errors with soft margin while perceptron cannot
(d) Perceptron is more suitable for online learning while SVM is less suitable
(e) Perceptron has fewer parameters to learn while SVM has more
(f) Perceptron is guaranteed to converge in the linearly separable case while SVM is not
7. [4 pts.] Consider the dataset in Fig. 2. Under the SVM formulation in problem (3),
(a) Draw the decision boundary on the graph.
(b) What is the size of the margin?
(c) Circle all the support vectors on the graph.
Figure 2: SVM toy dataset
8. [Extra Credit: 3 pts.] One formulation of soft-margin SVM optimization problem is:
min_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^N ξ_i
s.t. y_i(w^T x_i) ≥ 1 − ξ_i ∀ i = 1, …, N
ξ_i ≥ 0 ∀ i = 1, …, N
C ≥ 0
where (xi,yi) are training samples and w defines a linear decision boundary.
Derive a formula for ξ_i when the objective function achieves its minimum (no steps necessary). Note it is a function of y_i w^T x_i. Sketch a plot of ξ_i with y_i w^T x_i on the x-axis and the value of ξ_i on the y-axis. What is the name of this function?
Figure 3: Plot here
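A short self-check sketch: for fixed w, the objective pushes each ξ_i down to the smallest value the two constraints allow, which can be plotted directly (matplotlib assumed; names illustrative):

import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 3, 200)         # the margin quantity y_i * w^T x_i
xi = np.maximum(0.0, 1.0 - m)       # smallest slack satisfying both constraints

plt.plot(m, xi)
plt.xlabel("y_i w^T x_i")
plt.ylabel("xi_i")
plt.title("Optimal slack vs. margin")
plt.show()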
9 Kernel Methods
1. [3 pts.] Applying the kernel trick enables features to be mapped into a higher dimensional space, at a cost of higher computational complexity to operate in the higher dimensional space.
Circle one: True False
2. [3 pts.] Since the VC dimension for an SVM with a Radial Basis Function kernel is infinite, such an SVM must have a larger generalization error than an SVM without a kernel, which has a finite VC dimension.
Circle one: True False
3. [3 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x,z) = φ(x) · φ(z). Then K(x,z) will always be a valid kernel function.
Circle one: True False
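A quick numerical illustration of why such a K is valid: its Gram matrix is always symmetric positive semidefinite (the feature map below is arbitrary, chosen just for the check):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
phi = lambda x: np.concatenate([x, x ** 2, [1.0]])   # arbitrary feature map

Phi = np.array([phi(x) for x in X])
G = Phi @ Phi.T                                      # G[i, j] = K(x_i, x_j)
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))       # nonnegative eigenvalues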
4. [3 pts.] Suppose φ(x) is the feature map induced by a polynomial kernel K(x,z) of degree d; then φ(x) should be a d-dimensional vector.
Circle one: True False
5. [3 pts.] The decision boundary that we get from a Gaussian Naive Bayes model with class-conditional variance is quadratic. Can we in principle reproduce this with an SVM and a polynomial kernel?
Circle one: Yes No
6. Let’s go kernelized!
(a) Perceptron review. Assume we have a binary classification task with dataset {(x^(i), y^(i))}, where x^(i) ∈ R^d and y^(i) ∈ {−1,1}. Recall that the perceptron learns a linear classifier y = sign(w^T x) by applying the following algorithm.
Algorithm 1: Perceptron algorithm
Initialize the weights w = 0;
for i = 1, 2, ··· do
    Predict ŷ^(i) = sign(w^T x^(i));
    if ŷ^(i) ≠ y^(i) then
        Update w = w + y^(i) x^(i);
    end
end
Final classifier: h(x) = sign(w^T x)
Show that the final weight vector w is a linear combination of all the samples x^(i) (i = 1, 2, ··· , T) it has been trained on, and hence for prediction we can write w^T x in the form Σ_i α_i K(x^(i), x) for some α_i, where K(x, z) = x^T z.
(b) Kernelized perceptron. Now we are going to introduce a kernel function K(x, z) to kernelize the perceptron algorithm. Based on your findings in the previous question, fill in the blanks below to complete the kernelized perceptron algorithm using the kernel K(x, z). Assume the training loop stops after it has seen T training samples.
(c) Short answer. Describe one advantage and one disadvantage of using kernelized perceptron compared to using vanilla perceptron.
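One common way the blanks are filled in, keeping a per-sample coefficient α_i instead of an explicit w (a sketch, not necessarily the exam's intended pseudocode):

import numpy as np

def kernel_perceptron_train(X, y, K, epochs=10):
    """Kernelized perceptron: track mistake counts alpha instead of w."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # Prediction uses only kernel evaluations, never phi explicitly.
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
            if np.sign(score) != y[i]:
                alpha[i] += 1                 # record a mistake on sample i
    return alpha

def kernel_perceptron_predict(X, y, alpha, K, x):
    return np.sign(sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(X))))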
7. Suppose we have six training samples that lie in a two-dimensional space as shown in Figure 4a. Four of them belong to the blue class: (0, 0.5), (0, 2), (0.5, 0), (2, 0), and two of them belong to the red class: (√2/2, √2/2), (1.5, 1.5). Unfortunately, this dataset is not linearly separable. You recall that the kernel trick is one technique you can take advantage of to address this problem. The trick uses a kernel function K(x, z) which implicitly defines a feature map φ(x) from the original space to the feature space. Consider the following normalized kernel:
K(x, z) = x^T z / (‖x‖ ‖z‖).
Figure 4: (a) Data points in the original space. (b) Data points in the feature space.
(a) What is the feature map φ(x) that corresponds to this kernel? Draw φ(x) for each training sample in Figure 4b.
(b) The samples should now be linearly separable in the feature space. The classifier in the feature space that gives the maximum margin can be represented as a line w^T x + α = 0. Draw the decision boundary of this classifier in Figure 4b. What are the coefficients of the weight vector w = (w_1, w_2)^T? Hint: you don't need to compute them.
(c) Now we map the decision boundary obtained in (b) back to the original space. Write down the corresponding boundary in the original space in the format of an explicit equation. You can keep α in your equation. Try to plot its rough shape in Figure 4a.
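Assuming the cosine-similarity form of the normalized kernel reconstructed above, φ simply maps each point to the unit circle, which can be checked numerically (a sketch):

import numpy as np

pts = np.array([[0, 0.5], [0, 2], [0.5, 0], [2, 0],               # blue
                [np.sqrt(2) / 2, np.sqrt(2) / 2], [1.5, 1.5]])    # red
phi = pts / np.linalg.norm(pts, axis=1, keepdims=True)            # phi(x) = x/||x||
print(np.round(phi, 3))   # both red points map to the same feature-space point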