Description
1
DECISION TREES, K-NN, PERCEPTRON, REGRESSION
http://mlcourse.org
TAs: Weyxin, Kevin, Joseph, Mukund, Brendon
START HERE: Instructions
˜mgormley/courses/10601/syllabus.html#7-academic-integrity-policies
• Submitting your work:
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
Matt Gormley / Henry Chai
◯ Marie Curie
◯ Noam Chomsky
Select One: Who taught this course?
Matt Gormley / Henry Chai
# Marie Curie
@ Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
□ Stephen Hawking
Albert Einstein
□ Isaac Newton
□ None of the above
Select all that apply: Which are scientists?
Stephen Hawking
Albert Einstein Isaac Newton
@ I don’t know
Fill in the blank: What is the course number?
10-S7601S
1 Decision Tree (Revisited)
1. (4 points) Consider the following 4 × 4 checkerboard pattern. Suppose our goal is to perfectly classify the range shown such that all black regions are labeled as +1 and all white regions are labeled as −1.
Figure 1: Checkerboard Pattern
(a) (2 points) [SOLO] What is the minimum depth of decision tree that perfectly classifies the 4 × 4 colored regions in Figure 1, using features that only inspect either x or y but not both x and y? (How you use each of x and y to construct a feature is up to you.)
(b) (2 points) [SOLO] What is the minimum depth of decision trees to perfectly classify the colored regions in Figure 1, using ANY features of x and y?
Consider the graph below analyzing the size of tree vs. accuracy for a decision tree which has been pruned back to the red line.
Figure 2: Pruned decision tree
2. (1 point) [SOLO] Refer to Figure 2. Let’s say that we have a third dataset Dnew (from the same data distribution), which is not used for training or pruning.
If we evaluate this new dataset, approximately what is the accuracy when the size of the tree is at 25 nodes, and why? Select one.
Select one:
◯ Around 0.76 (slightly higher than the accuracy for validation data at 25 nodes)
◯ Around 0.73 (the same as the accuracy for validation data at 25 nodes)
◯ Around 0.70 (slightly lower than the accuracy for validation data at 25 nodes) ◯ None of the above
3. (1 point) [SOLO] Which of the following gives us the best approximation of the true error?
◯ Line corresponding to training data
◯ Line corresponding to validation data
◯ Line corresponding to new dataset Dnew
4. (2 points) [SOLO] Which of the following are valid ways to avoid overfitting? Select all that apply.
Select all that apply:
□ Decrease the training set size.
□ Set a threshold for a minimum number of samples required to split at an internal node.
□ Prune the tree so that cross-validation error is minimal.
□ Maximize the tree depth.
□ None of the above.
5. (3 points) Ensemble of Decision Tree. Say we have a dataset shown below. There are 12 data points, with 6 in label ”-” and 6 in label ”+”. We would like to use Decision Trees to solve this binary classification problem. However, in our problem setting, each Decision Tree has access to only ONE line. That is to say, our Decision Tree would have access to only one attribute, and so has max-depth of 1.
By accessing this line, the Decision Tree could know (and only know) whether the data point is on the right side of this line or the left side. (Unofficial definition: let’s assume the right side of a line shares the same direction with the green normal vector of that line.)
Finally, please use majority vote strategy to make classification decision at each leaf.
Figure 3: Decision Tree Ensemble
(a) (1 point) [SOLO] In Figure 3, if we train only 1 Decision Tree, what is the best/lowest error rate? Note that we have in total 12 data points. Round to 4 decimal places after the decimal point.
(b) (2 points) [SOLO] If we could use 2 Decision Trees in Figure 3, what is the best/lowest error rate? If we have two Decision Trees, then each would predict each data point with ’+’ or ’-’. Then, we would like to combine these predictions as the final result. If both trees predict ’+’, then the result is ’+’. The same with ’-’. However, if one predicts ’+’ while one predicts ’-’, then we always choose ’-’ as the final result to break ties. Round to 4 decimal places after the decimal point.
(c) (2 points) [SOLO] Now let’s train 3 Decision Trees as a forest in Figure 3. What is the best/lowest error rate? The ensemble strategy is now unanimous voting. That is, if every Decision Tree agrees, then the model predicts a positive label. However, if one of them has a different answer from the other two, then we predict negative. That means, we train each Decision Tree individually, and each Decision Tree choose one unique line as its decision boundary such that it would try its best to achieve maximum accuracy. And, next, if all the Decision Trees agree, then we assign the point a positive label. Round to 4 decimal places after the decimal point.
6. (6 points) Consider a binary classification problem using 1-nearest neighbors with the Euclidean dis-
(1) (2) (N) tance metric. We have N 1-dimensional training points x ,x ,…x and corresponding labels y(1),y(2),…y(N) with x(i) ∈ R and y(i) ∈ {0,1}. Assume the points x(1),x(2),…x(N) are in ascending order by value. If there are ties during the 1-NN algorithm, we break ties by choosing the label
(i) corresponding to the x with lower value.
(a) (2 points) [GROUP] Is it possible to build a decision tree that behaves exactly the same as the 1-nearest neighbor classifier? Assume that the decision at each node takes the form of “x ≤ t or x > t,” where t ∈R.
◯ Yes
◯ No
If your answer is yes, please explain how you will construct the decision tree. If your answer is no, explain why it’s not possible.
(b) (2 points) [GROUP] Let’s add a dimension! Now assume the training points are 2-dimensional where and the decision at each node takes the form of “xj ≤ t or xj > t,” where t ∈ R and j ∈ {1,2}. Give an example with at most 3 training points for which it isn’t possible to build a decision tree that behaves exactly the same as a 1-nearest neighbor classifier.
(c) (2 points) [GROUP] Assuming we have 2-dimensional training points x and the decision at each node takes the form of “xj ≤ t or xj > t,” where t ∈R and j ∈ {1,2}, under what conditions is it possible to build a decision tree that behaves exactly the same as a 1-nearest neighbor classifier? Explain why it is possible to build the decision tree as stated in part (a) but, in general, not possible in part (b).
2 k-Nearest Neighbors
Figure 4: k-NN Dataset
1. (2 points) [SOLO] Consider a k-nearest neighbors (k-NN) binary classifier which assigns the class of a test point to be the class of the majority of the k-nearest neighbors, according to a Euclidean distance metric. Using Figure 4 shown above to train the classifier and choosing k = 6, what is the classification error on the training set?
Answer as a decimal with precision 4, e.g. (6.051, 0.1230, 1.234e+7)
2. (1 point) [SOLO] Again using Figure 4, what is the value of k that minimizes the training error? Let’s assume we break ties by selecting one of the labels uniformly at random.
3. (2 points) [SOLO] Let’s say that we have a new test point (not present in our training data) xnew = [3,9]T that we would like to apply our k-NN classifier to, as seen in figure 5.
Figure 5: k-NN Dataset with Test Point
For which values of k is this test point correctly classified by the k-NN algorithm?
Select all that apply:
□ k = 1
□ k = 5
□ k = 9
□ k = 12
□ None of the above
4. (5 points) Assume we have a training set and a test set drawn from the same distribution, and we would like to classify points in the test set using a k-NN classifier.
(a) (1 point) [SOLO] In order to minimize the classification error on this test set, we should always choose the value of k which minimizes the training set error.
Select one:
◯ True
◯ False
(b) (2 points) [SOLO] Instead of choosing the hyper-parameters by merely minimizing the training set error, we instead consider splitting the training-all data set into a training and a validation data set, and choose the hyper-parameters that lead to lower validation error. Is choosing hyperparameters based on validation error better than choosing hyper-parameters based on training error? Justify your opinion with no more than 3 sentences.
Select one:
◯ Yes
(c) (2 points) [SOLO] Your friend Sally suggests that instead of splitting the training set into separate training and validation sets, we should instead use the test set as the validation data for choosing hyper-parameters. Is this a good idea? Justify your opinion with no more than 3 sentences.
Select one:
◯ Yes
5. (3 points) [SOLO] Consider a binary k-NN classifier where k = 4 and the two labels are “triangle” and “square”. Consider classifying a new point x= (1,1), where two of the x’s nearest neighbors are labeled “triangle” and two are labeled “square” as shown below.
Which of the following methods can be used to break ties or avoid ties on this dataset?
1. Assign x the label of its nearest neighbor
2. Flip a coin to randomly assign a label to x (from the labels of its 4 closest points)
3. Use k = 3 instead
4. Use k = 5 instead
Select one:
◯ 1 only
◯ 2 only
◯ 2, 3, 4 ◯ 2, 4
◯ 4 only
◯ 1, 2, 3, 4
◯ None of the above
6. (2 points) [SOLO] In this question, we would like to compare the differences among k-NN, the perceptron algorithm, and linear regression.
Select all that apply:
□ For classification tasks, both k-NN and the perceptron algorithm can have linear decision boundaries.
□ For classification tasks, both k-NN and the perceptron algorithm always have linear decision boundaries.
□ All three models can be susceptible to overfitting.
□ In all three models, after the training is completed, we must store the training data to make predictions on the test data.
□ None of the above.
7. (3 points) [SOLO] Please select all that apply about k-NN in the following options. Select all that apply:
□ A larger k gives a smoother decision boundary.
□ To reduce the impact of noise or outliers in our data, we should increase the value k.
□ If we make k too large, we could end up overfitting the data.
□ We can use cross-validation to help us select the value of k.
□ We should never select the k that minimizes the error on the validation dataset.
□ None of the above.
1 2.5 3.8 45 2 3.6 2.4 55
3 3.3 3.5 91
4 4.0 4.0 142
5 2.2 3.2 88
6 3.8 3.0 2600
7 3.0 2.0 163
8 3.3 2.8 67
9 3.9 3.8 unknown
(a) (2 points) [GROUP] Among Students 1 to 8, who is the nearest neighbor to Student 9, using Euclidean distance? Answer the Student ID only.
(b) (3 points) [GROUP] Now, our task is to predict the salary Student 9 earns after graduation. We apply k-NN to this regression problem: the prediction for the numerical target (salary in this example) is equal to the average of salaries for the top k nearest neighbors. If k = 4, what is our prediction for Student 9’s salary? Round your answer to the nearest integer. Be sure to use the same unit of measure (thousands of dollars per year) as the table above.
(c) (2 points) [GROUP] Suppose that the first 8 students shown above are only a subset of your full training data set, which consists of 10,000 students. We apply k-NN regression using Euclidean distance to this problem and we define training loss on this full data set to be the mean squared error (MSE) of salary. Now consider the possible consequences of modifying the data in various ways. Which of the following changes could have an effect on training loss on the full data set as measured by mean squared error (MSE) of salary?
Select all that apply:
□ None of the above.
9. (3 points) [GROUP] Suppose we have N training examples, and each one has M features. Derive the prediction time complexity of running a naive k-NN binary classifier with Euclidean distance on one test point (for a fixed k) and use pseudo-code to justify your explanation.
10. (3 points) [GROUP] We have been dealing with k-NN with features that are all continuous values. Can we use k-NN on a data set that has categorical features that take three or more values? If yes, what is a distance metric that could be used to compute distance between categorical data points. Explain your reasoning.
3 Perceptron
1. (1 point) [SOLO] Consider running the online perceptron algorithm on some sequence of examples S
′
(an example is a data point and its label). Let S be the same set of examples as S, but presented in a different order.
True or False: the online perceptron algorithm is guaranteed to make the same number of mistakes on
′
S as it does on S .
Select one:
◯ True
◯ False
2. (3 points) [SOLO] Suppose we have a perceptron whose inputs are 2-dimensional vectors and each feature vector component is either 0 or 1, i.e., xi ∈ {0,1}. The prediction function y = sign(w1x1 + w2x2+ b), and
sign(z) = {1, if z ≥ 0
0, otherwise.
Which of the following functions can be implemented with the above perceptron? That is, for which of the following functions does there exist a set of parameters w,b that correctly define the function. Select all that apply:
□ AND function, i.e., the function that evaluates to 1 if and only if all inputs are 1, and 0 otherwise.
□ OR function, i.e., the function that evaluates to 1 if and only if at least one of the inputs are 1, and 0 otherwise. □ XOR function, i.e., the function that evaluates to 1 if and only if the inputs are not all the same. For example
XOR(1,0) = 1, but XOR(1,1) = 0.
□ None of the above.
3. (2 points) [SOLO] Suppose we have a dataset {(x(1),y(1)),…,(x(N),y(N))}, where x(i) ∈RM, y(i) ∈ {+1,−1}. We would like to apply the perceptron algorithm on this dataset. Assume there is no intercept term. How many parameter values is the perceptron algorithm learning?
Select one:
◯ N
◯ N × M
◯ M
4. (2 points) [SOLO] Which of the following are true about the perceptron algorithm?
Select all that apply:
□ The number of mistakes the perceptron algorithm makes is proportional to the number of points in the dataset.
□ The perceptron algorithm converges on any dataset.
□ The perceptron algorithm can be used in the context of online learning.
□ For linearly separable data, the perceptron algorithm always finds the separating hyperplane with the largest margin.
□ None of the above.
5. (2 points) [SOLO] The following table shows a data set and the number of times each point is misclassified during a run of the perceptron algorithm. What is the separating plane w found by the algorithm, i.e. w = [b,w1,w2,w3]? Assume that the initial weights are all zero.
x1 x2 x3 y Times Misclassified
2 1 5 1 10
5 3 3 1 5
4 1 2 1 8
8 4 8 -1 2
3 2 6 -1 3
Select one:
◯ [3,22,11,24]
◯ [0,−5,20,−10]
◯ [16,56,18,47]
◯ [18,52,19,47]
6. (2 points) [SOLO] Please select the correct statement(s) about the mistake bound of the perceptron algorithm.
Select all that apply:
□ If the minimum distance from any data point to the separating hyperplane is increased, without any other change to the data points, the mistake bound will also increase.
□ If the whole dataset is shifted away from origin, then the mistake bound will also increase. □ If the size of the data set (i.e., the maximum pair-wise distance between data points) is increased, then the mistake bound will also increase.
□ The mistake bound is linearly inverse-proportional to the minimum distance of any data point to the separating hyperplane of the data.
□ None of the above.
7. (2 points) [SOLO] Suppose we have data whose elements are of the form [x1,x2], where x1− x2 = 0. We do not know the label for each element. Suppose the perceptron algorithm starts with θ = [3,5], which of the following values will θ never take on in the process of running the perceptron algorithm on the data?
Select one:
◯ [−1,1]
◯ [4,6]
◯ [−3,0]
◯ [−6,−4]
8. (2 points) [SOLO] Consider the linear decision boundary below and the training dataset shown. Which of the following are valid weights θ and its corresponding training error? (Note: Assume the decision boundary is fixed and does not change while evaluating training error.)
Select all that apply:
□ θ= [2,1], error = 5/13
□ θ= [−2,1], error = 5/13
□ θ= [2,−1], error = 8/13
□ θ= [2,−1], error = 5/13
□ θ= [−2,1], error = 8/13 □ θ= [−2,−1], error = 8/13 □ None of the above.
9. (9 points) The following problem will walk you through an application of the Perceptron Mistake Bound. The following table shows a linearly separable dataset. Your task will be to determine the mistake bound for the dataset below. Then, you will run the training of the perceptron.
(a) (2 points) [SOLO] Compute the radius R of the ”circle” which bounds the data points. Round your answer to 3 decimal places after the decimal point.
(b) (2 points) [SOLO] Assume that the linear separator with the largest margin is given by
⎢⎡ θ ⎢⎢⎢⎢y1⎥⎥⎥⎥⎥⎥⎥⎦ = 0,, where θ∗ = ⎢⎢⎢⎣⎢⎢⎢⎡⎢⎢435⎥⎥⎥⎥⎥⎥⎥⎥⎤⎦
∗T ⎢⎢x⎤⎥
⎣⎢
Now, compute the margin of the dataset. Round your answer to 3 decimal places after the decimal point.
(c) (1 point) [SOLO] Based on the above values, what is the theoretical perceptron mistake bound for this dataset, given this linear separator? Round your answer to 3 decimal places after the decimal point.
(d) (3 points) [SOLO] Finally, train the perceptron using the dataset. Train using the datapoints in the given order, top to bottom repeatedly until convergence. Give the final weights in a commaseparated list ( w1, w2, b ).
(e) (1 point) [SOLO] How many mistakes did the perceptron make?
4 Linear Regression
1. (3 points) We would like to fit a linear regression estimate to the dataset D = {(x(1),y(1)),(x(2),y(2)),⋯,(x(N),y(N))}
(i) M
with x ∈R by minimizing the ordinary least square (OLS) objective function:
2
J(w) = 2 1 ∑N= ⎛⎜⎝y(i) − j∑M=1wjxj(i)⎟⎞⎠
i 1
(a) (2 points) [SOLO] Specifically, we solve for each coefficient wk (1 ≤ k ≤ M) by deriving an expression of wk from the critical point . What is the expression for each wk in terms of the dataset (x(1),y(1)),⋯,(x(N),y(N)) and w1,⋯,wk−1,wk+1,⋯,wM?
Select one:
◯
◯
◯
◯
(b) (1 point) [SOLO] How many coefficients do you need to estimate? When solving for these coefficients, how many equations do you have?
Select one:
◯ N coefficients, M equations
◯ M coefficients, N equations
◯ M coefficients, M equations
◯ N coefficients, N equations
2. (3 points) [SOLO] Consider the following 3 data points for linear regression: x(1) = [0,1,2]T, x(2) = [1,0,2]T and x(3) = [2,1,0]T. The corresponding y values are y(1) = 3, y(2) = 6, y(3) = 9.
Assume the intercept to be 0. Find the weights θ= [θ1,θ2,θ3]T ∈R3 such that the mean squared error J(θ) = (y−Xθ)T(y−Xθ) is minimized on this training set. X is the design matrix where X .
3. (1 point) [SOLO] Assume that a data set has M data points and N variables, where M > N. Different loss functions would return the same sets of solutions as long as they are convex.
Select one:
◯ True
◯ False
4. (1 point) [SOLO] Suppose we are working with datasets where the number of features is 3. The optimal solution for linear regression is always unique regardless of the number of data points that are in this dataset.
Select one:
◯ True
◯ False
5. (1 point) [SOLO] Consider the following dataset:
x1 1.5 2.5 3.5 4.5 5.5
x2 3.0 5.0 7.0 9.0 11.0
y 4.0 7.0 8.0 11.0 17.0
We want to carry out a multiple-linear regression between y (dependent variable) and x1 and x2 (independent variables). The closed-form solution given by w = (XTX)−1XTY will return the unique solution.
Note: The ith row of X contains the ith data point while the ith row of Y contains the ith
(i) data point y .
Select one:
◯ True
◯ False
6. (3 points) [SOLO] Identifying whether a function is a convex function is useful because a convex function’s local minimum has the nice property that it has to be the global minimum. Please select all functions below that are convex functions. Note dom(f) denotes the domain of the function f. Select all that apply:
□ f(x) = x3+ 3x + 7,dom(f) =R
□ f(x) =−logx,dom(f) =R++ (the set of positive real numbers)
□ f(x) =−∣x∣,dom(f) =R
□ f(x) =−2×2,dom(f) =R
□
□ None of the above.
7. (2 points) [GROUP] Typically we can solve linear regression problems in two ways. One is through direct methods, e.g. solving the closed form solution, and the other is through iterative methods (gradient descent). Consider a linear regression on data (X,y). We assume each row in X denotes one input in the dataset. Please select all correct options.
Select all that apply:
□ If the matrix XTX is invertible, the exact solution is always preferred for solving the solution to linear regression as computing matrix inversions and multiplications are fast regardless of the size of the dataset.
□ Assume N is the number of examples and M is the number of features. The computational complexity of N iterations of batch gradient descent is O(MN).
□ The computational complexity of the closed form solution is linear in number of parameters/features.
□ None of the above.
8. (8 points) Consider the following dataset:
x 9.0 2.0 6.0 1.0 8.0 y 1.0 0.0 3.0 0.0 1.0
Let x be the vector of datapoints and y be the label vector. Here, we are fitting the data using gradient descent. Note that our objective function is where N is the number of data points, w is the weight, and b is the intercept.
(a) (2 points) [GROUP] If we initialize the weight as 3.0 and intercept as 0.0, what is the gradient of the loss function with respect to the weight w, calculated over all the data points, in the first step of the gradient descent update? Round to 2 decimal places after the decimal point.
(b) (4 points) [GROUP] Compute the direct solution of the weight and the intercept for the objective function defined in the previous question. Round to 2 decimal places after the decimal point.
(c) (2 points) [GROUP] Let the learning rate be 0.001. Perform five steps of batch gradient descent on the data. Fill in the blank with the value of the weight after five steps of batch gradient descent. Round to 2 decimal places after the decimal point.
9. (6 points) Consider a dataset D = ((x(1),y(1)),…,(x(n),y(n))) and the solution to linear regression on
D is y = w1x + b1.
(a) (2 points) [GROUP] Now, suppose we have the dataset (x(1) + α,y(1) + β),…,(x(n) + α,y(n) + β) where α > 0,β > 0 and w1α ≠ β. The solution to the linear regression on this dataset is y = w2x + b2. Select the correct statement about w1,w2,b1,b2 below. Note that the statement should hold no matter what values α,β take on within the specified constraints.
Select one:
◯ w1 = w2,b1 = b2
◯ w1 ≠ w2,b1 = b2
◯ w1 = w2,b1 ≠ b2
◯ w1 ≠ w2,b1 ≠ b2
(b) (2 points) [GROUP] Let x¯ and y¯ be the mean of the x and y coordinates, respectively. After mean centering the dataset to create Dnew = ((x(1) − x,y¯ (1) − y¯),…,(x(n) − x,y¯ (n) − y¯)), let the solution to linear regression on Dnew be y = w3x + b3. Explain how w3 compares to w1 and justify.
5 Collaboration Questions
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
3. Did you find or come across code that implements any part of this assignment ? If so, include full details.
Reviews
There are no reviews yet.