SVMs, K-Means, PCA, Graphical Models
TAs: Chenxi, Loki, Eric, Gauri, Daniel
START HERE: Instructions
Homework 9 covers topics on SVMs, K-Means, PCA and Graphical Models. The homework includes multiple choice, True/False, and short answer questions.
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/courses/10601/about.html
• Submitting your work:
For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each question. For LaTeX users, use the provided commands for shaded boxes and circles, and do not change anything else.
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
○ Noam Chomsky

If you need to change an answer, cross out the previously filled bubble and fill in the new one:

Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
⊗ Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don’t know

If you need to change an answer, cross out the previously filled square and fill in the new one(s):

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
⊠ I don’t know
Fill in the blank: What is the course number?

10-601

If you need to change a written answer, cross out the old value (e.g., a struck-through 10-7601) and write the new one.
1 Support Vector Machines [19 pts]
In class, we discussed the properties and formulation of hard-margin SVMs, where we assume the decision boundary is linear and attempt to find the hyperplane with the largest margin. Here, we introduce a new class of SVM called soft margin SVM, which adds slack variables e_i to the optimization problem and relaxes these assumptions. The formulation of soft margin SVM with no kernel is
$$
\begin{aligned}
\underset{w,\,b,\,e}{\text{minimize}} \quad & \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{i=1}^{N} e_i \\
\text{subject to} \quad & y^{(i)}\big(w^T x^{(i)} + b\big) \ge 1 - e_i, \quad \forall\, i = 1, \dots, N \\
& e_i \ge 0, \quad \forall\, i = 1, \dots, N
\end{aligned}
$$
1. [3pts] Consider the ith training example (x^{(i)}, y^{(i)}) and its corresponding slack variable e_i. Assuming C > 0 is fixed, what would happen as e_i → ∞?
Select all that apply:
The constraint y^{(i)}(w^T x^{(i)} + b) ≥ 1 − e_i would hold for almost all w.
There would be no vector w that satisfies the constraint y^{(i)}(w^T x^{(i)} + b) ≥ 1 − e_i.
The objective function would approach infinity.
With this in mind, we hope that you can see why soft margin SVM can be applied even when the data is not linearly separable.
2. [5pts] What could happen as C → ∞? Do not assume that the data is linearly separable unless specified.
Select all that apply:
When the data is linearly separable, the solution to the soft margin SVM would converge to the solution of hard margin SVM.
There is no solution w,b satisfying all the constraints in the optimization problem.
Any arbitrary vector w and scalar b can satisfy the constraints in the optimization problem.
The optimal weight vector would converge to the zero vector 0.
When C approaches infinity, it could help reduce overfitting.
3. [5pts] What could happen as C → 0? Do not assume that the data is linearly separable unless specified.
Select all that apply:
When the data is linearly separable, the solution to the soft margin SVM would converge to the solution of hard margin SVM.
There is no solution w,b satisfying all the constraints in the optimization problem.
Any arbitrary vector w and scalar b can satisfy the constraints in the optimization problem.
The optimal weight vector would converge to the zero vector 0.
When C approaches 0, it could help reduce overfitting.
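To build intuition for the two limits above, here is a small sketch (not part of the original assignment) using scikit-learn's SVC with a linear kernel on made-up data; it simply reports how the margin 1/||w|| changes as C varies.

```python
# Sketch (not part of the assignment): how the penalty C in soft margin SVM
# trades off slack against margin width. The toy data below is made up.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in [1e-3, 1.0, 1e3]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w_norm = np.linalg.norm(clf.coef_[0])
    # Large C: slack is expensive, so the solution approaches the hard margin SVM.
    # Small C: slack is cheap, so ||w|| shrinks toward 0 and the margin 1/||w|| grows.
    print(f"C={C:g}  ||w||={w_norm:.3f}  margin={1 / w_norm:.3f}")
```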
4. [3pts] An extension to soft margin SVM (or, an extension to the hard margin SVM we discussed in class) is the 2-norm SVM, with the following primal formulation:
$$
\begin{aligned}
\underset{w,\,b,\,e}{\text{minimize}} \quad & \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{i=1}^{N} e_i^2 \\
\text{subject to} \quad & y^{(i)}\big(w^T x^{(i)} + b\big) \ge 1 - e_i, \quad \forall\, i = 1, \dots, N \\
& e_i \ge 0, \quad \forall\, i = 1, \dots, N
\end{aligned}
$$
Which of the following is true about the 2-norm SVM? (Hint: think about ℓ1-regularization versus ℓ2-regularization!)
Select one:
If a particular pair of parameters w∗,b∗ minimizes the objective function in soft margin SVM, then this pair of parameters is guaranteed to minimize the objective function in 2-norm SVM.
2-norm SVM penalizes large e_i's more heavily than soft margin SVM.
One drawback of 2-norm SVM is that it cannot utilize the kernel trick.
None of the above.
Figure 1: SVM dataset
5. [3pts] Consider the dataset shown in Figure 1. Which of the following models, when properly tuned, could correctly classify ALL the data points?
Select all that apply:
Logistic Regression without any kernel
Hard margin SVM without any kernel
Soft margin SVM without any kernel
Hard margin SVM with RBF Kernel
Soft margin SVM with RBF Kernel
2 Kernels [19pts]
1. [2pt] Consider the following kernel function:
$$
K(x, x') = \begin{cases} 1, & \text{if } x = x' \\ 0, & \text{otherwise} \end{cases}
$$
True or False: In this kernel space, any labeling of points from any training data X will be linearly separable.
True
False
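As a quick sanity check on what this kernel does, the sketch below (not from the handout) builds its Gram matrix for a few arbitrary distinct points; the result is the identity matrix, i.e., every training point gets its own orthogonal direction in the implicit feature space.

```python
# Sketch: for K(x, x') = 1 if x == x' else 0, the Gram matrix over n distinct
# training points is the n x n identity matrix.
import numpy as np

X = np.array([[0.0], [1.5], [3.2], [7.0]])      # any distinct points
K = np.array([[1.0 if (xi == xj).all() else 0.0 for xj in X] for xi in X])
print(K)                                         # identity matrix
```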
2. [3pts] Suppose that input-space is two-dimensional, x = (x1,x2)T . The feature mapping is defined as –
What is the corresponding kernel function, i.e., K(x, z)?
Select one:
(x_1 z_1)^2 + (x_2 z_2)^2 + 1
(1 + x^T z)^2
(x^T z)^2
x^T z
3. [3pts] Suppose that input-space is three-dimensional, x = (x1,x2,x3)T . The feature mapping is defined as –
Suppose we want to compute the value of the kernel function K(x, z) on two vectors x, z ∈ R^3. We want to check how many additions and multiplications are needed if we map the input vectors to the feature space and then take the dot product of the mapped features. Report α + β, where α is the number of multiplications and β is the number of additions.
Note: Multiplication/Addition with constants should also be included in the counts.
4. [3pts] Suppose that input-space is three-dimensional, x = (x1,x2,x3)T . The feature mapping is defined as –
Suppose we want to compute the value of the kernel function K(x, z) on two vectors x, z ∈ R^3. We want to check how many additions and multiplications are needed if we do the computation through the kernel function derived above. Report α + β, where α is the number of multiplications and β is the number of additions.
Note: Multiplication/Addition with constants should also be included in the counts.
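The handout's exact feature mapping is not reproduced above, so the following sketch assumes the common quadratic map φ(x) = (x1², x2², x3², √2·x1x2, √2·x1x3, √2·x2x3), for which K(x, z) = (xᵀz)². It only illustrates why evaluating the kernel directly needs far fewer operations than mapping and then taking a dot product; the counts for the homework's actual mapping may differ.

```python
# Sketch comparing an explicit feature map against the kernel trick.
# Assumed mapping: phi(x) = (x1^2, x2^2, x3^2, sqrt(2)x1x2, sqrt(2)x1x3, sqrt(2)x2x3),
# which corresponds to K(x, z) = (x^T z)^2. The homework's actual mapping may differ.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([x1*x1, x2*x2, x3*x3, s*x1*x2, s*x1*x3, s*x2*x3])

def k_explicit(x, z):
    return phi(x) @ phi(z)      # map both vectors, then take the dot product

def k_trick(x, z):
    return (x @ z) ** 2         # 3 mults + 2 adds for the dot product, + 1 mult to square

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])
print(k_explicit(x, z), k_trick(x, z))   # both print 1024.0
```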
5. [3pts] Suppose one dataset contains four data points in R1 space, as shown in Figure 2
Figure 2: Data in R1
Different shapes of the points indicate different labels. What is the lowest training error achievable by a linear classifier on R^1?
0.25
6. [3pts] Following the above question, which of the feature mappings below can we use to project the dataset to higher dimensional space such that training a linear classifier on the projected dataset would yield zero training error?
φ(x) = (x, 1)
φ(x) = (x, x^3)
φ(x) = (x, x^2)
φ(x) = (x, (x + 1)^2)
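Since the points and labels in Figure 2 are not reproduced here, the sketch below uses hypothetical values with a "+ − − +" pattern to illustrate why a quadratic feature such as x² can make 1-D data linearly separable, while φ(x) = (x, 1) cannot help.

```python
# Sketch: a hypothetical 1-D labeling with the pattern (+, -, -, +) cannot be
# classified perfectly by any threshold on x alone, but after the mapping
# phi(x) = (x, x^2) a single hyperplane separates it. The actual points and
# labels of Figure 2 are not reproduced here.
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])   # hypothetical points
y = np.array([1, -1, -1, 1])           # hypothetical labels: + - - +

# In the mapped space (x, x^2), the hyperplane x^2 - 2.5 = 0 classifies all
# four points correctly, which is impossible with phi(x) = (x, 1).
preds = np.sign(x**2 - 2.5)
print(np.all(preds == y))              # True -> zero training error
```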
7. [2pt] True or False: Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.
True
False
3 K Means [19pts]
Figure 3: Datasets. Panels: (a) Dataset A, (b) Dataset B, (c) Dataset C.
1. [3pts] Consider the 3 datasets A, B, and C shown in Figure 3. Each dataset is partitioned into k clusters, represented by different colors in the figure. For each dataset, determine which image with cluster centers (denoted by X) was generated by the K-means method. The distance measure used here is the Euclidean distance.
1.1. [1pt] Dataset A (Select one)
A.1
A.2
1.2. [1pt] Dataset B (Select one)
B.1
B.2
1.3. [1pt] Dataset C (Select one)
C.1
C.2
2. [10pts] Consider running the K-means algorithm with k = 3 on the following dataset of five points in R^2 (one point per row):
$$
D = \begin{bmatrix}
5.5 & 3.1 \\
5.1 & 4.8 \\
6.6 & 3.0 \\
5.5 & 4.6 \\
6.8 & 3.8
\end{bmatrix}
$$
2.1. [3pts] Which of the following points will be the center for cluster 0 after the first iteration? Select one:
(5.7 , 4.1)
(5.6 , 4.8)
(6.3 , 3.3)
(6.7 , 3.4)
2.2. [3pts] Which of the following points will be the center for cluster 1 after the first iteration? Select one:
(6.1 , 3.8)
(5.5 , 4.6)
(5.4 , 4.7)
(5.3 , 4.7)
2.3. [2pt] How many points will belong to cluster 1 after the first iteration?
2.4. [2pt] How many points will belong to cluster 2 after the first iteration?
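As a reference for how one iteration proceeds, here is a sketch of a single K-means assignment-and-update step on D. The initial cluster centers from the original problem are not reproduced above, so the three centers below are hypothetical placeholders.

```python
# Sketch of one K-means iteration (assignment + mean update) on dataset D.
# The initial centers below are hypothetical; the original problem's centers
# are not reproduced in this document.
import numpy as np

D = np.array([[5.5, 3.1],
              [5.1, 4.8],
              [6.6, 3.0],
              [5.5, 4.6],
              [6.8, 3.8]])
centers = np.array([[5.0, 5.0],    # cluster 0 (hypothetical)
                    [5.5, 3.5],    # cluster 1 (hypothetical)
                    [7.0, 3.0]])   # cluster 2 (hypothetical)

# Assignment step: each point goes to its nearest center (Euclidean distance).
dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
assign = dists.argmin(axis=1)

# Update step: each center moves to the mean of the points assigned to it.
new_centers = np.array([D[assign == j].mean(axis=0) for j in range(3)])
print(assign)
print(np.round(new_centers, 2))
```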
3. [6pts] Recall that in K-means clustering we attempt to find k cluster centers c_j ∈ R^d, j ∈ {1, …, k}, such that the total squared distance between each data point and its nearest cluster center is minimized. The objective function is
$$
J(c_1, \dots, c_k) \;=\; \sum_{i=1}^{n} \min_{j \in \{1, \dots, k\}} \lVert x^{(i)} - c_j \rVert_2^2 \tag{1}
$$
Specifically, you convinced John by providing two values: α, the minimum possible value of Eq. (1), and β, the value of k at which Eq. (1) is minimized.
3.1. [3pts] What is the value of α + β when n = 100? 100
3.2. [3pts] We want to see how k-means clustering works on a single dimension. Consider the case in which k = 3 and we have 4 data points x1 = 1,x2 = 2,x3 = 5,x4 = 7. What is the optimal value of the objective Eq. (1)?
0.5
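A brute-force check of this 1-D example (an illustration, not part of the assignment) confirms the optimum: with k = 3, grouping {1, 2}, {5}, {7} gives objective 0.5.

```python
# Sketch verifying the 1-D example: with k = 3 and points {1, 2, 5, 7},
# the best partition is {1, 2}, {5}, {7}, giving objective 0.5.
from itertools import product
import numpy as np

xs = np.array([1.0, 2.0, 5.0, 7.0])
best = float("inf")
# Brute force over all assignments of the 4 points to 3 clusters.
for assign in product(range(3), repeat=4):
    assign = np.array(assign)
    cost = 0.0
    for j in range(3):
        pts = xs[assign == j]
        if len(pts):
            # Squared distance to the cluster mean (the optimal center).
            cost += ((pts - pts.mean()) ** 2).sum()
    best = min(best, cost)
print(best)   # 0.5
```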
4 PCA [13pts]
1. [4pts] Assume we are given a dataset X for which the eigenvalues of the covariance matrix are (2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001). What is the smallest value of k we can use if we want to retain 75% of the total variance (i.e., of the sum of all the eigenvalues) using the first k principal components?
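Since the eigenvalues are given explicitly, the cumulative-variance computation can be sketched directly (illustrative, not part of the assignment):

```python
# Sketch: find the smallest k whose leading eigenvalues capture at least 75%
# of the total variance, using the eigenvalues stated in the question.
import numpy as np

eig = np.array([2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001])
frac = np.cumsum(eig) / eig.sum()      # fraction of variance retained by first k components
k = int(np.argmax(frac >= 0.75)) + 1   # first index where the threshold is reached
print(frac.round(3))
print(k)
```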
2. [3pts] Assume we apply PCA to a matrix X ∈ R^{n×m} and obtain a set of PCA features Z ∈ R^{n×m}. We divide this set into two: Z1 and Z2. The first set, Z1, corresponds to the top principal components. The second set, Z2, corresponds to the remaining principal components. Which is more common in the training data? Select one:
a point with large feature values in Z1 and small feature values in Z2
a point with large feature values in Z2 and small feature values in Z1
a point with large feature values in Z2 and large feature values in Z1
a point with small feature values in Z2 and small feature values in Z1
3. [2pts] For the data set shown below, what will be its first principal component?
Select one:
(a)
(b)
(c)
(d)
4. [2pts] Note: This question continues from the previous one. What is the second principal component in the figure from the previous question?
Select one:
(a)
(b)
(c)
(d)
5. [2pts] Note: This question continues from the previous one. What is the third principal component in the figure from the previous question?
Select one:
(a)
(b)
(c)
(d)
None of the above
5 Graphical Models [15pts]
In the Kingdom of Westeros, Summer has come. Jon Snow, the King in the North, has taken the responsibility to defeat the Giants and protect the realm.
If Jon Snow can get Queen Cersei and Daenerys, Queen of the Dragons, to help him, he is likely to beat the Giants. Cersei and Daenerys are powerful women who are skeptical of the existence of Giants and will most likely only consider joining Jon if they are shown evidence of an imminent Giant attack. They can only be shown such evidence if Jon captures a live Giant.
The Bayesian network that represents the relationships between the events described above is shown below. Use the following notation for your variables: Jon Snow captures a live Giant (X1), Jon shows Cersei and Daenerys a live Giant (X2), Cersei agrees to help (X3), Daenerys agrees to help (X4), and Giants defeated (X5).
1. [1pt] Write down the factorization of the above directed graphical model.
P(x1)P(x2|x1)P(x3|x2)P(x4|x2)P(x5|x3,x4)
2. [1pt] Each random variable represented in the above Bayesian network is binary valued (i.e. either the event happens or it does not). State the minimum number of parameters you need to fully specify this Bayesian network.
3. [1pt] If we didn’t use these conditional independence assumptions above, what would be the minimum number of parameters we would need to model any joint distribution over the same set of random variables?
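As a reminder of the counting rules behind these two questions: in a Bayesian network over binary variables, each node X_i needs one free parameter per configuration of its parents, while an unrestricted joint over d binary variables needs one parameter per outcome minus one for normalization. In symbols:

$$
\#\text{params}(\text{BN}) \;=\; \sum_{i=1}^{d} 2^{\,|\mathrm{Pa}(X_i)|},
\qquad
\#\text{params}(\text{full joint}) \;=\; 2^{d} - 1 .
$$

These formulas can be applied directly to the factorization given in question 1 above.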
4. [5pts] For the following questions, fill in the blank with the smallest set S of random variables that needs to be conditioned on in order for the independence assumption to hold, e.g., Xi ⊥ Xj | S. What is the smallest set S that makes each statement true? The empty set ∅ is a valid answer. If the independence assumption cannot be satisfied no matter what we condition on, then your answer should be 'Not possible'.
(a) [1pt] X1 ⊥ X3 |
(b) [1pt] X1 ⊥ X5 |
(c) [1pt] X2 ⊥ X4 |
(d) [1pt] X3 ⊥ X4 |
(e) [1pt] X2 ⊥ X5 |
5. [7pts] Jon gets his friend Sam to calculate some estimates of his chances. Sam returns to Jon with the following conditional probability tables:
Table 1: Sam’s Conditional Probability tables
Using the conditional probabilities for our graphical model, compute the following (give your answers to 5 decimal places):
(a) [2pts] P(X1 = 0,X2 = 1,X3 = 0,X4 = 1,X5 = 0). 0.02016
(b) [5pts]P(X1 = 1|X3 = 1)
0.67384
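Because Table 1 is not reproduced above, the following sketch uses hypothetical CPT values; it shows how P(X1 = 1 | X3 = 1) would be computed by enumerating the joint distribution implied by the factorization from question 1. Substituting Sam's actual numbers would reproduce the values above.

```python
# Sketch: computing P(X1 = 1 | X3 = 1) by enumerating the joint implied by the
# factorization P(x1)P(x2|x1)P(x3|x2)P(x4|x2)P(x5|x3,x4). The CPT values below
# are hypothetical placeholders, not Sam's numbers from Table 1.
from itertools import product

p_x1 = 0.7                                            # P(X1 = 1), hypothetical
p_x2_x1 = {1: 0.9, 0: 0.2}                            # P(X2 = 1 | X1)
p_x3_x2 = {1: 0.8, 0: 0.3}                            # P(X3 = 1 | X2)
p_x4_x2 = {1: 0.6, 0: 0.1}                            # P(X4 = 1 | X2)
p_x5_x34 = {(1, 1): 0.9, (1, 0): 0.5,                 # P(X5 = 1 | X3, X4)
            (0, 1): 0.4, (0, 0): 0.1}

def bern(p1, v):                                      # P(V = v) given P(V = 1) = p1
    return p1 if v == 1 else 1.0 - p1

def joint(x1, x2, x3, x4, x5):
    return (bern(p_x1, x1) * bern(p_x2_x1[x1], x2)
            * bern(p_x3_x2[x2], x3) * bern(p_x4_x2[x2], x4)
            * bern(p_x5_x34[(x3, x4)], x5))

num = sum(joint(1, x2, 1, x4, x5) for x2, x4, x5 in product((0, 1), repeat=3))
den = sum(joint(x1, x2, 1, x4, x5) for x1, x2, x4, x5 in product((0, 1), repeat=4))
print(round(num / den, 5))
```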
Collaboration Questions. Please answer the following:
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
3. Did you find or come across code that implements any part of this assignment? If so, include full details.