10-601 – Homework 9: SVMs, K-Means, AdaBoost, PCA

TAs: Rongye, Rawal, Jeremy
START HERE: Instructions
Homework 9 covers SVMs, K-Means, PCA, and AdaBoost. The homework includes multiple choice, True/False, and short answer questions.
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/courses/10601bd-f18/about.html#6-general-policies
• Submitting your work: Write your solutions neatly so that they can be interpreted correctly by our AI-assisted grader. In addition, please tag the problems to the corresponding pages when submitting your work.
For multiple choice or select-all-that-apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LaTeX users, use the provided commands for shaded boxes and circles, and don't change anything else.
Instructions for Specific Problem Types
For "Select One" questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
○ Noam Chomsky

If you need to change an answer, you may cross out the previous answer and bubble in the new answer:

Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
⊗ Noam Chomsky

For "Select all that apply" questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don't know

If you need to change an answer, you may cross out the previous answers and fill in the new answers:

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
⊠ I don't know

Fill in the blank: What is the course number?
10-601
1 Support Vector Machines [19 pts]
In class, we discussed the properties and formulation of the hard-margin SVM, where we assume the decision boundary to be linear and attempt to find the hyperplane with the largest margin. Here, we introduce a new class of SVM called the soft-margin SVM, where we introduce slack variables $e_i$ into the optimization problem and relax the assumptions. The formulation of the soft-margin SVM with no kernel is
$$
\begin{aligned}
\min_{w,\,b,\,e}\quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{N} e_i \\
\text{subject to}\quad & y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - e_i, \quad \forall\, i = 1,\dots,N \\
& e_i \ge 0, \quad \forall\, i = 1,\dots,N
\end{aligned}
$$
1. [3pts] Consider the ith training example $(x^{(i)}, y^{(i)})$ and its corresponding slack variable $e_i$. Assuming $C > 0$ is fixed, what would happen as $e_i \to \infty$?
Select all that apply:
The constraint $y^{(i)}(w^T x^{(i)} + b) \ge 1 - e_i$ would hold for almost all $w$.
There would be no vector $w$ that satisfies the constraint $y^{(i)}(w^T x^{(i)} + b) \ge 1 - e_i$.
The objective function would approach infinity.
With this in mind, we hope that you can see why soft margin SVM can be applied even when the data is not linearly separable.
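To build intuition for the role of the slack penalty C in the next two questions, here is a minimal sketch (not part of the original handout) that fits a linear soft-margin SVM to a made-up, non-separable toy dataset for several values of C. It assumes scikit-learn and NumPy are available.

```python
# A minimal sketch, assuming scikit-learn and NumPy are installed.
# The toy data below is hypothetical and only meant to show how the
# slack penalty C changes the fitted soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so the data is NOT linearly separable.
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[1.5, 1.5], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 1e4]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    # Margin width is 2 / ||w||: small C tolerates slack (wide margin),
    # large C penalizes slack heavily (behavior closer to hard margin).
    print(f"C={C:g}  ||w||={np.linalg.norm(w):.3f}  "
          f"train acc={clf.score(X, y):.2f}  #SV={len(clf.support_)}")
```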
2. [5pts] What could happen as C → ∞? Do not assume that the data is linearly separable unless specified.
Select all that apply:
When the data is linearly separable, the solution to the soft margin SVM would converge to the solution of hard margin SVM.
There is no solution w,b satisfying all the constraints in the optimization problem.
Any arbitrary vector w and scalar b can satisfy the constraints in the optimization problem.
The optimal weight vector would converge to the zero vector 0.
When C approaches infinity, it could help reduce overfitting.
3. [5pts] What could happen as C → 0? Do not assume that the data is linearly separable unless specified.
Select all that apply:
When the data is linearly separable, the solution to the soft margin SVM would converge to the solution of hard margin SVM.
There is no solution w,b satisfying all the constraints in the optimization problem.
Any arbitrary vector w and scalar b can satisfy the constraints in the optimization problem.
The optimal weight vector would converge to be the zero vector 0.
When C approaches 0, it could help reduce overfitting.
4. [3pts] An extension to the soft-margin SVM (or, an extension to the hard-margin SVM we discussed in class) is the 2-norm SVM, with the following primal formulation:

$$
\begin{aligned}
\min_{w,\,b,\,e}\quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{N} e_i^2 \\
\text{subject to}\quad & y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - e_i, \quad \forall\, i = 1,\dots,N \\
& e_i \ge 0, \quad \forall\, i = 1,\dots,N
\end{aligned}
$$
Which of the following is true about the 2-norm SVM? (Hint: think about $\ell_1$-regularization versus $\ell_2$-regularization!)
Select one:
If a particular pair of parameters w∗,b∗ minimizes the objective function in soft margin SVM, then this pair of parameters is guaranteed to minimize the objective function in 2-norm SVM.
2-norm SVM penalizes large ei’s more heavily than soft margin SVM.
One drawback of 2-norm SVM is that it cannot utilize the kernel trick.
None of the above.

Figure 1: SVM dataset
5. [3pts] Consider the dataset shown in Figure 1. Which of the following models, when properly tuned, could correctly classify ALL the data points?
Select all that apply:
Logistic Regression without any kernel
Hard margin SVM without any kernel
Soft margin SVM without any kernel
Hard margin SVM with RBF Kernel
Soft margin SVM with RBF Kernel
2 Kernels [19pts]
1. [2pt] Consider the following kernel function:
$$
K(x, x') = \begin{cases} 1, & \text{if } x = x' \\ 0, & \text{otherwise} \end{cases}
$$
True or False: In this kernel space, any labeling of points from any training data X will be linearly separable.
True
False
2. [3pts] Suppose that the input space is three-dimensional, $x = (x_1, x_2, x_3)^T$. The feature mapping is defined as –

What is the corresponding kernel function, i.e., $K(x, z)$? Select one:
$(x_1 z_1)^2 + (x_2 z_2)^2 + (x_3 z_3)^2$
$(x^T z)^3$
$(x^T z)^2$
$x^T z$
3. [3pts] Suppose that the input space is three-dimensional, $x = (x_1, x_2, x_3)^T$. The feature mapping is defined as –

Suppose we want to compute the value of the kernel function $K(x, z)$ on two vectors $x, z \in \mathbb{R}^3$. We want to check how many additions and multiplications are needed if you map the input vectors to the feature space and then take the dot product of the mapped features. Report $\alpha + \beta$, where $\alpha$ is the number of multiplications and $\beta$ is the number of additions.
Note: Multiplication/Addition with constants should also be included in the counts.

4. [3pts] Suppose that the input space is three-dimensional, $x = (x_1, x_2, x_3)^T$. The feature mapping is defined as –

Suppose we want to compute the value of the kernel function $K(x, z)$ on two vectors $x, z \in \mathbb{R}^3$. We want to check how many additions and multiplications are needed if you do the computation through the kernel function you derived above. Report $\alpha + \beta$, where $\alpha$ is the number of multiplications and $\beta$ is the number of additions.
Note: Multiplication/Addition with constants should also be included in the counts.
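Questions 3 and 4 above contrast computing K(x, z) through an explicit feature map with computing it through the kernel trick. The feature mapping from the original handout is not reproduced in this document, so the sketch below uses a hypothetical quadratic map φ(x) = (all pairwise products x_i x_j), whose kernel is (xᵀz)², purely to illustrate the two computation routes (NumPy assumed).

```python
# A minimal sketch, assuming NumPy. The feature map here is a stand-in
# (all pairwise products, kernel (x^T z)^2), not the one from the handout.
import numpy as np

def phi(x):
    # Explicit feature map: all pairwise products x_i * x_j (9 features for d = 3).
    return np.outer(x, x).ravel()

def kernel_explicit(x, z):
    # Route 1: map both inputs, then dot product in the 9-dimensional feature space.
    return phi(x) @ phi(z)

def kernel_trick(x, z):
    # Route 2: the same value computed directly in the 3-dimensional input space.
    return (x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
assert np.isclose(kernel_explicit(x, z), kernel_trick(x, z))
# Route 2 uses far fewer multiplications and additions than Route 1,
# which is exactly the kind of gap the two questions ask you to count.
```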

5. [3pts] Suppose a dataset contains four data points in $\mathbb{R}^1$, as shown in Figure 2.

Figure 2: Data in R1
Different shapes of the points indicate different labels. If we train a linear classifier on this dataset, what is the lowest training error achievable by a linear classifier on $\mathbb{R}^1$?

6. [3pts] Following the above question, we use the feature mapping $\phi(x) = (x, x^2)$ to project the data into $\mathbb{R}^2$. Again we train a linear classifier on the projected dataset. What is the lowest training error for a linear classifier on $\mathbb{R}^2$?

7. [2pt] True or False: Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.
True
False
3 K-Means [22pts]

(a) Dataset A

(b) Dataset B

(c) Dataset C
Figure 3: Datasets
1. [3pts] Consider the 3 datasets A, B, and C shown in Figure 3. Each dataset is partitioned into k clusters, represented by different colors in the figure. For each dataset, determine which image with cluster centers (denoted by X) is generated by the K-means method. The distance measure used here is the Euclidean distance.
1.1. [1pt] Dataset A (Select one)
A.1
A.2
1.2. [1pt] Dataset B (Select one)
B.1
B.2
1.3. [1pt] Dataset C (Select one)
C.1
C.2
5.5
5.1 D = 6.4

5.6

6.7 3.1
4.8
3.6
4.7
3.7
2.1. [3pts] Which of the following points will be the center for cluster 0 after the first iteration? Select one:
(5.4 , 3.2)
(5.5 , 3.2)
(5.5 , 3.1)
(5.4 , 3.1)
2.2. [3pts] Which of the following points will be the center for cluster 1 after the first iteration? Select one:
(5.2 , 4.7)
(5.5 , 4.7)
(5.5 , 4.6)
(5.3 , 4.7)
2.3. [3pts] Which of the following points will be the center for cluster 2 after the first iteration? Select one:
(6.5 , 3.5)
(6.5 , 3.9)
(6.5 , 3.8)
(6.5 , 3.6)
2.4. [2pt] How many points will belong to cluster 1 after the first iteration? Select one:
2
3
5
1
2.5. [2pt] How many points will belong to cluster 2 after the first iteration? Select one:
2
3
5
1
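Here is the K-means sketch referenced in question 2: a single Lloyd's iteration (assign, then update) on the dataset D, for checking questions 2.1-2.5 by hand. The initial cluster centers from the original handout are not reproduced in this document, so the three centers below are placeholders to replace with the ones given in the problem (NumPy assumed).

```python
# A minimal sketch of one K-means (Lloyd's) iteration, assuming NumPy.
# The initial centers below are placeholders, not the ones from the handout.
import numpy as np

D = np.array([[5.5, 3.1],
              [5.1, 4.8],
              [6.4, 3.6],
              [5.6, 4.7],
              [6.7, 3.7]])
centers = np.array([[5.5, 3.1],   # cluster 0 (placeholder)
                    [5.1, 4.8],   # cluster 1 (placeholder)
                    [6.4, 3.6]])  # cluster 2 (placeholder)

# Assignment step: each point goes to the nearest center (Euclidean distance).
dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Update step: each center moves to the mean of its assigned points.
new_centers = np.array([D[assignments == k].mean(axis=0) for k in range(3)])
print(assignments)               # cluster index of each of the 5 points
print(np.round(new_centers, 1))  # centers after the first iteration
```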
3. [6pts] Recall that in K-means clustering we attempt to find k cluster centers $c_j \in \mathbb{R}^d$, $j \in \{1,\dots,k\}$, such that the total (squared) distance between each data point and its nearest cluster center is minimized. The objective function is then

$$
\sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \left\| x^{(i)} - c_j \right\|_2^2 \qquad (1)
$$
Specifically, you convinced John by providing two values α, the minimum possible value of Eq. (1), and β, the value of k when Eq. (1) is minimized.
3.1. [3pts] What is the value of α + β when n = 100?

3.2. [3pts] We want to see how k-means clustering works on a single dimension. Consider the case in which k = 3 and we have 4 data points x1 = 1,x2 = 2,x3 = 5,x4 = 7. What is the optimal value of the objective Eq. (1)?
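A sanity-check sketch for question 3.2 (not a required part of the solution): in one dimension the optimal clusters are contiguous in sorted order, so Eq. (1) can be minimized by brute force over the placements of k − 1 cut points.

```python
# A minimal brute-force check of Eq. (1) in 1-D, assuming only the standard library.
from itertools import combinations

def kmeans_objective_1d(points, k):
    pts = sorted(points)
    n = len(pts)
    best = float("inf")
    # Choose k - 1 cut positions between consecutive sorted points.
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        total = 0.0
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            group = pts[lo:hi]
            center = sum(group) / len(group)          # optimal center is the mean
            total += sum((x - center) ** 2 for x in group)
        best = min(best, total)
    return best

print(kmeans_objective_1d([1, 2, 5, 7], k=3))  # value of Eq. (1) at the optimum
```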

4 PCA [13pts]
1. [4pts] Assume we are given a dataset X for which the eigenvalues of the covariance matrix are (2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001). What is the smallest value of k we can use if we want to retain at least 90% of the variance (i.e., 90% of the sum of all the eigenvalues) using the first k principal components?
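One quick way to check this is to compute the cumulative fraction of variance explained by the leading eigenvalues; a minimal sketch with NumPy, using the eigenvalues given above:

```python
# A minimal sketch, assuming NumPy; eigenvalues are the ones stated in the question.
import numpy as np

eigvals = np.array([2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001])
frac = np.cumsum(eigvals) / eigvals.sum()
for k, f in enumerate(frac, start=1):
    print(f"k={k}: cumulative variance fraction = {f:.3f}")
# The answer is the smallest k whose cumulative fraction reaches 0.90.
```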

2. [3pts] Assume we apply PCA to a matrix $X \in \mathbb{R}^{n \times m}$ and obtain a set of PCA features $Z \in \mathbb{R}^{n \times m}$. We divide this set into two, Z1 and Z2. The first set, Z1, corresponds to the top principal components. The second set, Z2, corresponds to the remaining principal components. Which is more common in the training data?
Select one:
a point with large feature values in Z1 and small feature values in Z2
a point with large feature values in Z2 and small feature values in Z1
a point with large feature values in Z2 and large feature values in Z1
a point with small feature values in Z2 and small feature values in Z1
3. [2pts] For the data set shown below, what will be its first principal component?

Select one:
(a)
(b)
(c)
(d)
4. [2pts] NOTE: This is continued from the previous question. What is the second principal component in the figure from the previous question?
Select one:
(a)
(b)
(c)
(d)
5. [2pts] NOTE: This is continued from the previous question. What is the third principal component in the figure from the previous question?
Select one:
(a)
(b)
(c)
(d)
None of the above
5 Lagrange Multipliers [12pts]
Lagrange multipliers are helpful in solving constrained optimization problems. In this problem, we will analyze how to apply this method to a simple question: finding the MLE of a categorical distribution. Let $\mathbb{1}$ be the indicator function, where

$$
\mathbb{1}\{a = b\} = \begin{cases} 1, & \text{if } a = b \\ 0, & \text{otherwise} \end{cases}
$$
The probability mass function of a categorical distribution with K categories, assuming K > 1, can then be written as

$$
P(x) = \prod_{k=1}^{K} p_k^{\mathbb{1}\{x = k\}},
$$
where $\sum_{k=1}^{K} p_k = 1$ and $0 \le p_k \le 1$. Let $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ be the samples we have; the log likelihood of the categorical distribution is then

$$
\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\} \log(p_k).
$$
Finding the MLE of the categorical distribution is equivalent to the following optimization problem:

$$
\begin{aligned}
\min_{p_1,\dots,p_K}\quad & -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\} \log(p_k) \\
\text{subject to}\quad & p_k \ge 0, \quad \forall\, k = 1,\dots,K \\
& p_k \le 1, \quad \forall\, k = 1,\dots,K \\
& \sum_{k=1}^{K} p_k = 1
\end{aligned}
$$
We attempt to solve this problem by solving its dual. To do this, recall first that the dual problem can be written as

$$
\max_{\theta}\; \min_{p_1,\dots,p_K}\; L(p_1,\dots,p_K;\theta),
$$

where $\theta$ denotes the variables we introduce when deriving the dual and $L(p_1,\dots,p_K;\theta)$ is the Lagrangian function.
1. [2pts] Which of the following is the formulation for the Lagrangian function L? Select one.
$$
\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\}\log(p_k) + \sum_{k=1}^{K}\alpha_k p_k + \sum_{k=1}^{K}\beta_k(p_k - 1) + \gamma\Big(\sum_{k=1}^{K} p_k - 1\Big)
$$

$$
-\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\}\log(p_k) - \sum_{k=1}^{K}\alpha_k p_k + \sum_{k=1}^{K}\beta_k(p_k - 1) + \gamma\Big(\sum_{k=1}^{K} p_k - 1\Big)
$$

$$
-\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\}\log(p_k) - \sum_{k=1}^{K} p_k + \sum_{k=1}^{K}\beta_k(p_k - 1) + \gamma\Big(\sum_{k=1}^{K} p_k - 1\Big)
$$

$$
-\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\}\log(p_k) - \sum_{k=1}^{K} p_k + \sum_{k=1}^{K}(p_k - 1) + \sum_{k=1}^{K} p_k
$$
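For intuition about how the Lagrangian is used here (not required by the question), below is a sketch of the standard derivation that keeps only the equality constraint, under the assumption that the inequality constraints are inactive at the optimum:

$$
\begin{aligned}
L(p_1,\dots,p_K;\gamma) &= -\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}\{x^{(i)} = k\}\log(p_k) + \gamma\Big(\sum_{k=1}^{K} p_k - 1\Big), \\
\frac{\partial L}{\partial p_k} = -\frac{N_k}{p_k} + \gamma = 0
&\;\Rightarrow\; p_k = \frac{N_k}{\gamma}, \qquad \text{where } N_k = \sum_{i=1}^{N}\mathbb{1}\{x^{(i)} = k\}, \\
\sum_{k=1}^{K} p_k = 1 &\;\Rightarrow\; \gamma = \sum_{k=1}^{K} N_k = N
\;\Rightarrow\; p_k^{\mathrm{MLE}} = \frac{N_k}{N},
\end{aligned}
$$

which recovers the familiar count-based MLE of the categorical distribution.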
Now, we would like to use the method of Lagrange Multipliers to prove that maximizing the variance in PCA gives an eigenvector. In general, a constrained optimization problem has the following standard form:

$$
\begin{aligned}
\min_{w}\quad & f(w) \\
\text{s.t.}\quad & g_i(w) \le 0, \quad i = 1,\dots,k \\
& h_i(w) = 0, \quad i = 1,\dots,l
\end{aligned}
$$
This is also called the primal problem. We introduce new variables $\alpha_i \ge 0$ and $\beta_i$ (which need not be non-negative), called Lagrange multipliers, and study the generalized Lagrangian defined by
$$
L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)
$$
The discussion of solving constrained optimization problems is based on the fact that if $w^*$ is a feasible local minimum of the objective function satisfying all the constraints, then there exist Lagrange multipliers $\alpha$ and $\beta$ such that the Karush-Kuhn-Tucker (KKT) conditions are satisfied:

$$
\begin{aligned}
\nabla_w L(w^*, \alpha, \beta) &= 0 & &\text{(Stationarity)} \\
\alpha_i\, g_i(w^*) &= 0, \quad i = 1,\dots,k & &\text{(Complementary slackness)} \\
g_i(w^*) &\le 0, \quad i = 1,\dots,k & &\text{(Primal feasibility)} \\
h_i(w^*) &= 0, \quad i = 1,\dots,l & &\text{(Primal feasibility)} \\
\alpha_i &\ge 0, \quad i = 1,\dots,k & &\text{(Dual feasibility)}
\end{aligned}
$$
The points that satisfy the KKT conditions give the set of all local optima. In the following examples, we can efficiently solve the optimization problem by finding the KKT points and investigating each of them to optimize the objective function globally.
From class, we know that minimizing the reconstruction error is equivalent to maximizing the variance, i.e.,

$$
\max_{v}\; \frac{1}{N}\sum_{i=1}^{N} \left(v^T x^{(i)}\right)^2 \quad \text{s.t. } v^T v = 1.
$$

Let $X = [x^{(1)}, x^{(2)}, \dots, x^{(N)}]$. Using the covariance matrix $A = \frac{1}{N} X X^T$, we arrive at the following standard constrained minimization problem:

$$
\begin{aligned}
\min_{v}\quad & -v^T A v \\
\text{s.t.}\quad & v^T v - 1 = 0.
\end{aligned}
$$
2. [2pts] Please write down the Lagrangian $L(v, \lambda)$, using the Lagrange multiplier $\lambda$.

3. [4pts] Please write down the KKT conditions of the optimization problem.

4. [4pts] Please solve the optimization problem by solving the KKT conditions, and show that the optimal solution is $(v^*, \lambda^*)$, where $\lambda^*$ and $v^*$ are the largest eigenvalue of A and the corresponding eigenvector, respectively.
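For reference, a sketch of one standard route through questions 2-4, under the setup above:

$$
\begin{aligned}
L(v, \lambda) &= -\,v^T A v + \lambda\,(v^T v - 1), \\
\nabla_v L = -2Av + 2\lambda v = 0 &\;\Rightarrow\; Av = \lambda v, \qquad v^T v = 1, \\
-\,v^T A v &= -\lambda\, v^T v = -\lambda,
\end{aligned}
$$

so every KKT point is a unit eigenvector of A with its eigenvalue as λ, and the objective −vᵀAv is smallest (the variance vᵀAv is largest) when λ* is the largest eigenvalue of A and v* is the corresponding unit eigenvector.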

Figure 4: AdaBoost dataset.
6 AdaBoost [25pts]
Decision stumps are a popular family of binary classifiers used to implement the AdaBoost algorithm; they typically serve as "weak classifiers." A decision stump is an axis-aligned linear separator. In other words, it is a one-level decision tree: a single split at the root node. More formally, suppose the inputs are vectors in $\mathbb{R}^d$. A decision stump is a function $h$ where

$$
h(\vec{x}) = \begin{cases} 1 & \text{if } x_i \ge c \\ -1 & \text{otherwise} \end{cases}
$$
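A minimal code sketch of this decision stump, assuming the split feature index i and threshold c are given (NumPy assumed):

```python
# A minimal sketch of an axis-aligned decision stump, assuming NumPy.
import numpy as np

def stump_predict(X, i, c):
    """Predict +1 where feature i is at least c, otherwise -1 (one row per example)."""
    return np.where(X[:, i] >= c, 1, -1)
```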
1. [2pts] For the data in Figure 4, what is the lowest training error that can be achieved by a single decision stump?
Select one:
25%
50%
75%
100%
Point x1 x2 y
1 1 3 -1
2 1 1 +1
3 2 2 +1
4 3 3 +1
5 3 1 -1
Table 1: Data for AdaBoost Q2
2. [23pts] Now suppose we have the data in Table 1. On the first iteration of AdaBoost using decision stumps, we choose the classifier $h$ where

$$
h(\vec{x}) = \begin{cases} 1 & \text{if } x_1 \ge 1.5 \\ -1 & \text{otherwise.} \end{cases}
$$

Fill in the blanks to complete this iteration of the algorithm. Use the normalization factor $Z_t = 2\sqrt{\mathrm{err}_t(1 - \mathrm{err}_t)}$, where $\mathrm{err}_t$ is the weighted training error of the chosen classifier.
(a) [3pts] err =
(b) [3pts] α1 =
(c) [2pts] Which weight(s) will be increased? (Note: D(i) is the weight of point i)
Select all that apply:
D(1)
D(2)
D(3)
D(4)
D(5)
(d) [7pts] After updating, what are the weights D(i)? Fill in the table below, rounding to 4 decimal places.

(e) [2pts] On each iteration, AdaBoost will choose the weak classifier $h(\vec{x})$ which minimizes the weighted error

$$
\mathrm{err}_t = \sum_{i=1}^{N} D_t(i)\, \mathbb{1}\{h(\vec{x}^{(i)}) \ne y^{(i)}\}.
$$
Continuing from above, which decision stump would we choose as our classifier in the second iteration?
Select one:

$h(\vec{x}) = \begin{cases} 1 & \text{if } x_2 \ge 1.5 \\ -1 & \text{otherwise} \end{cases}$

$h(\vec{x}) = \begin{cases} 1 & \text{if } x_2 \le 2.5 \\ -1 & \text{otherwise} \end{cases}$

$h(\vec{x}) = \begin{cases} 1 & \text{if } x_1 \le 2.5 \\ -1 & \text{otherwise} \end{cases}$

$h(\vec{x}) = \begin{cases} 1 & \text{if } x_1 \ge 1.5 \\ -1 & \text{otherwise} \end{cases}$
(f) [6pts] What is the least number of iterations in which it is possible to achieve 0 training error on this dataset? (Hint: implementing AdaBoost is suggested to test your answer; a minimal sketch follows this question.)
Select one:
2
3
4
5
None of the above
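As the hint in part (f) suggests, a small implementation makes these parts easy to check. Below is a minimal sketch of the first AdaBoost round on the Table 1 data, using the standard update rule (α_t = ½ ln((1 − err_t)/err_t), D_{t+1}(i) ∝ D_t(i) e^{−α_t y^{(i)} h(x^{(i)})}, normalized by Z_t); verify it against the exact conventions used in class before relying on it.

```python
# A minimal sketch of one AdaBoost round on the Table 1 data, assuming NumPy
# and the standard weight-update rule (check against the version used in class).
import numpy as np

X = np.array([[1, 3], [1, 1], [2, 2], [3, 3], [3, 1]], dtype=float)
y = np.array([-1, 1, 1, 1, -1])
D = np.full(len(y), 1 / len(y))          # initial uniform weights D_1(i) = 1/5

h = np.where(X[:, 0] >= 1.5, 1, -1)      # the stump fixed in the problem: +1 if x1 >= 1.5
err = D[h != y].sum()                    # weighted training error err_1
alpha = 0.5 * np.log((1 - err) / err)    # classifier weight alpha_1
D = D * np.exp(-alpha * y * h)           # up-weight mistakes, down-weight correct points
D = D / D.sum()                          # normalize (equivalently, divide by Z_1)

print(f"err = {err:.4f}, alpha = {alpha:.4f}")
print(np.round(D, 4))                    # updated weights D_2(i)
```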
Collaboration Questions. Please answer the following:
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
3. Did you find or come across code that implements any part of this assignment? If so, include full details.
