## Description

HOMEWORK 3

DECISION TREES, K-NN, PERCEPTRON, REGRESSION

TAs: Haohui, Rakshith, Sahithya, Siva, Monica, Neural the Narwhal

Summary It’s time to practice what you’ve learned! In this assignment, you will answer questions on topics we’ve covered in class so far, including Decision Trees, K-Nearest Neighbors, Perceptron, and Linear Regression. This assignment consists of a written component split into four sections, one for each topic. These questions are designed to test your understanding of the theoretical and mathematical concepts related to each topic. For each topic, you will also apply your understanding of the concept to the related ideas such as overfitting, error rates, and model selection. This homework is designed to help you apply what you’ve learned and solve a few concrete problems.

START HERE: Instructions

• Late Submission Policy: For this homework, you will only have 2 late days instead of the usual 3. This allows us to provide feedback before the exam. See the late submission policy here: http: //www.cs.cmu.edu/˜mgormley/courses/10601/syllabus.html

• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.

1

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley

Marie Curie

Noam Chomsky

Henry Chai

Marie Curie

@@ Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

⌅ Stephen Hawking

⌅ Albert Einstein ⌅ Isaac Newton I don’t know

Select all that apply: Which are scientists?

⌅ Stephen Hawking

⌅ Albert Einstein

⌅ Isaac Newton

@⌅ I don’t know

10-601 10-◆S6301

◆S

1 LATEX Bonus Point and Template Alignment (1 points)

1. (1 point) Select one: Did you use LATEX for the entire written portion of this homework?

is Yes

No

2. (0 points) Select one: I have ensured that my final submission is aligned with the original template given to me in the handout file and that I haven’t deleted or resized any items or made any other modifications which will result in a misaligned template. I understand that incorrectly responding yes to this question will result in a penalty equivalent to 2% of the points on this assignment.

Note: Failing to answer this question will not exempt you from the 2% misalignment penalty.

068 Yes

2 Decision Tree (Revisited) (15 points)

1. Consider the following 4 ⇥ 4 checkerboard pattern. Suppose our goal is to perfectly classify the rangeand all white regions are labeled as 1. Let the shown such that all black regions are labeled as +1 horizontal axis denote feature x1 and vertical axis denote feature x2.

NOTE: If a point is on the border or corner of a region, it has the same label as the region that is above it and/or to its right. For example, (1,0) has label +1, (1,1) has label 1, and (1,1.5) has label 1.

(a) (2 points) What is the minimum depth of a binary decision tree that perfectly classifies the colored regions in Figure 1, using only features of the form x1 < c or of the form x2 < c for different constants c?

(b) (2 points) What is the minimum depth of a binary decision tree that perfectly classifies the colored regions (0 < x1 < 4) in Figure 1, using features that only inspect either x1 or x2 but not both x1 and x2? Feel free to get creative in your use of the features x1 and x2 for the splits!

Since this is a binary decision tree, we can only use features that split into two branches. Some

feature examples: ceil(x1)%2 = 0, 2 < x1 < 4, x2 < 1 or x2 > 3

(c) (2 points) What is the minimum depth of a binary decision tree that perfectly classifies the colored regions in Figure 1, using ANY features that involve x1, x2, or both?

2. Consider the graph below analyzing the size of tree vs. accuracy for a decision tree which has been pruned back to the vertical (red) line. Assume all of our labeled data for this task was randomly divided into a training dataset Dtrain, a validation data set Dval, and a test dataset Dtest. The full tree was trained only on Dtrain, then reduced-error pruning was applied using Dval.

Figure 2: Pruned decision tree. The lowest running dotted curve is the validation performance of the unpruned tree as it grows during training. Finally, we prune it back to the size given by the red line with a reduced validation error (the upper running dotted curve)

(a) (1 point) Select one: Refer to Figure 2. Note that Dtest was not used for training or pruning. When the size of the pruned tree is at 25 nodes, what is its accuracy on Dtest likely to be?

Slightly higher than the pruned tree’s accuracy on Dval

The same as the pruned tree’s accuracy on Dval

Slightly lower than the pruned tree’s accuracy on Dval

(b) (1 point) Select one: Which of the following gives us the best approximation of the true error?

Error corresponding to training data Dtrain

Error corresponding to validation data Dval

Error corresponding to test data Dtest

3. (2 points) Select all that apply: Which of the following are valid ways to avoid overfitting?

Decrease the training set size.

Set a threshold for a minimum number of examples required to split at an internal node.

Prune the tree so that cross-validation error is minimal.

Increase the tree depth.

None of the above. 4. Consider a binary classification problem using 1-nearest neighbors with the Euclidean distance metric. We have N 1-dimensional training points x(1),x(2),…x(N) and corresponding labels y(1),y(2),…y(N), with x(i) 2R and y(i) 2{0,1}.

Assume the points x(1),x(2),…x(N) are in ascending order by value. If there are ties during the 1-NN algorithm, we break ties by choosing the label corresponding to the x(i) with lower value.

NOTE: For subparts (a) and (b), 2 models “behave exactly the same” when the decision boundaries produced by them are the same, leading to the same performance on all possible test data.

(a) (2 points) Select one: Is it possible to build a decision tree that behaves exactly the same as the 1-nearest neighbor classifier? Assume that the decision at each node takes the form of “x t” or “x > t”, where t 2R.

Yes

No

If your answer is yes, please explain how you will construct the decision tree. If your answer is no, explain why it’s not possible.

(b) (3 points) Select all that apply: Let’s add a dimension! Now the training points are 2-dimensional where x . For which of the following training sets is it possible to build a decision tree that behaves exactly the same as a 1-nearest neighbor classifier? Assume that (1) the decision at each node takes the form of “xj t” or ”xj > t”, where t 2R and j 2{1,2}. (2) the decision tree has finite depth.

Points: {( 5,5),(5,5)}, Labels: {0,1} Points: {(0, 9),(0,10)}, Labels: {0,1}

Points: {(3,4),( 3, 4)}, Labels: {0,1}

Points: {(0,0),(0,1),(1,0)}, Labels: {0,0,1}

Points: {(0,0),(0,1),(1,0)}, Labels: {0,1,1}

Points: {(1,2),(3,4),(5,6)}, Labels: {0,1,1} no

None of the above

3 k-Nearest Neighbors (26 points)

Figure 3: k-NN Dataset

1. Consider a k-nearest neighbors (k-NN) binary classifier which assigns the class of a test point to be the class of the majority of the k-nearest neighbors in the training dataset, according to the Euclidean distance metric. Assume that ties are broken by selecting one of the labels uniformly at random.

NOTE:-} An example tie scenario can occur when the classes of the 6 nearest neighbors are {+, +, +, -, -, i.e. the number of neighbors belonging to each class type is equal. In this case, you can assume the

test point’s class to be + or – randomly.

(a) (2 points) Using Figure 3 shown above to train the classifier and choosing k = 6, what is the classification error on the training set? Report your answer either as a fraction or as a decimal with 4 decimal places after the decimal point.

(b) (2 points) Select all that apply: Let’s say that we have a new test point (not present in our training data) xnew = [3,9]T that we would like to apply our k-NN classifier to, as seen in figure 4.

Figure 4: k-NN Dataset with Test Point

For which values of k is this test point always correctly classified by the k-NN algorithm?

k = 1 k = 5 k = 9 k = 12

None of the above

2. Select one: Assume we have a large labeled dataset that is randomly divided into a training set and a test set, and we would like to classify points in the test set using a k-NN classifier.

(a) (1 point) In order to minimize the classification error on this test set, we should always choose the value of k which minimizes the training set error.

True

False

(b) (2 points) Select one: Instead of choosing the hyperparameters by merely minimizing the training set error, we instead consider splitting the training-all data set into a training and a validation data set, and choose the hyperparameters that lead to lower validation error. Is choosing hyperparameters based on validation error better than choosing hyper-parameters based on training error?

Yes, lowering validation error is better for the model because cross-validation guarantees a better test error.

No, lowering training error is better for the model because we have to learn the training set as well as possible to guarantee the best possible test error.

(c) (2 points) Select one: Your friend Sally suggests that instead of splitting the original training set into separate training and validation sets, we should instead use the test set as the validation data for choosing hyperparameters. Is this a good idea? Justify your opinion with no more than 3 sentences.

Yes

3. (2 points) Select all that apply: Consider a binary k-NN classifier where k = 4 and the two labels are “triangle” and “square”. Consider classifying a new point x = (1,1), where two of the x’s nearest neighbors are labeled “triangle” and two are labeled “square” as shown below.

Which of the following methods can be used to break ties or avoid ties on this dataset?

Assign x the label of its nearest neighbor an Flip a coin to randomly assign a label to x (from the labels of its 4 closest points)

Use k = 3 instead ee Use k = 5 instead

None of the above.

4. (3 points) Select all that apply: Which of the following is/are correct statement(s) about k-NN models?

A larger k tends to give a smoother decision boundary.

come To reduce the impact of noise or outliers in our data, we should increase the value k.

If we make k too large, we could end up overfitting the data.

We can use cross-validation to help us select the value of k.

We should never select the k that minimizes the error on the validation dataset.

None of the above.

1 2.5 3.8 45

2 3.3 3.5 90

3 4.0 4.0 142

4 3.0 2.0 163

5 3.8 3.0 2600

6 3.3 2.8 67

7 3.9 3.8 unknown

(a) (2 points) Among Students 1 to 6, who is the nearest neighbor to Student 7, using Euclidean distance? Answer the Student ID only.

(b) (2 points) Now, our task is to predict the salary Student 7 earns after graduation. We apply k-NN to this regression problem: the prediction for the numerical target (salary in this example) is equal to the average of salaries for the top k nearest neighbors. If k = 3, what is our prediction for Student 7’s salary? Be sure to use the same unit of measure (thousands of dollars per year) as the table above.

Round your answer to the nearest integer.

(c) (2 points) Select all that apply: Suppose that the first 6 students shown above are only a subset of your full training data set, which consists of 10,000 students. We apply k-NN regression using Euclidean distance to this problem and we define training loss on this full data set to be the mean squared error (MSE) of salary. Now consider the possible consequences of modifying the data in various ways. Which of the following changes could have an effect on training loss on the full data set as measured by mean squared error (MSE) of salary?

6. An archaeologist discovers a 242 kilobyte 8-inch floppy disk buried beneath the hedges near Wean Hall. The floppy disk contains a few hundred black and white images of 3×3 pixels. You are asked to classify them as either a photo (y = +) or artwork (y = ) to aid in the analysis.

You build a k-Nearest Neighbor (k-NN) classifier trained on a training dataset obtained from the web (converted to similarly small black and white images). Suppose you are informed that each image is represented as a 3⇥3 matrix xpixel images as follows:of binary values and you plan to use Hamming distance to measure the distance between each pair of 3 ⇥ 3

the number of pixels that differ between u and v

While calculating the distance metric, if there is a tie in distance among the points competing for k nearest points, the classifier increases k to include all those tied points in the majority vote. If, in the end, there is a tie in the vote, your classifier returns yˆ =?. You can try out your k-NN implementation on the images below.

Table 2: Test Data

(a) (1 point) What is the distance between x(2) and x(6)?

(b) (1 point) Select one: What would a k-NN classifier with k = 1 predict as the label for test point x(7)?

yˆ = + yˆ =

0 yˆ =?

(c) (1 point) Select one: What would a k-NN classifier with k = 3 predict as the label for test point x(7)?

yˆ =?

(d) (1 point) Select one: What would a k-NN classifier with k = 5 predict as the label for test point x(7)?

yˆ = + yˆ = yˆ =?

(e) (2 points) Short answer: Your friend says that you should try using Euclidean distance because it might give better results. Do you agree that switching could lead to lower test error? Why or why not?

Your answer:

NO The euclideandistance calculations would be similar to using hammingdistance The hamming distance is summing

TimagestesttheherefortheeuclideanperrorixandeelstheHammingbecausethadistancpredictiont areeeuddifferentwouldistancclideanswoubeedistanceisBecauseldjustbea betterthethewouldallsquaresamemetrithenotvalvesrootandclowerforosfarecomparingohammingtestwozerouerrorldandthedistancetheone

my friend should not switch to it

4 Perceptron (19 points)

1. (1 point) Select one: Consider running the online perceptron algorithm on some sequence of examples S (an example is a data point and its label). Let S0 be the same set of examples as S, but presented in a different order.

True or False: The online perceptron algorithm is guaranteed to make the same number of mistakes on S as it does on S0.

True

False

2. (3 points) Select all that apply: Suppose we have a perceptron whose inputs are 2-dimensional vectors and each feature vector component is either 0 or 1, i.e., xi 2 {0,1}. The prediction function is y = sign(w1x1 + w2x2 + b), and

, if z > 0

, otherwise.

Which of the following functions can be implemented with the above perceptron? That is, for which of the following functions does there exist a set of parameters w,b that correctly define the function.

AND function, i.e., the function that evaluates to 1 if and only if all inputs are 1, and 0 otherwise.

OR function, i.e., the function that evaluates to 1 if and only if at least one of the inputs are 1, and 0 otherwise.

XOR function, i.e., the function that evaluates to 1 if and only if the inputs are not all the same. For example

XOR(1,0) = 1, but XOR(1,1) = 0.

None of the above.

3. (2( points) Select one: Suppose we have a dataset x(1),y(1) ,…, x(N),y(N) , where x(i) 2RM, y i) 2{+1, 1}. We would like to apply the perceptron algorithm on this dataset. Assume there is no intercept term. How many parameter values is the perceptron algorithm learning? W

4. (2 points) Select one: The following table shows a data set and the number of times each point is misclassified during a run of the perceptron algorithm. What is the separating plane ✓ found by the algorithm, i.e. ✓ = [b,✓1,✓2,✓3]? Assume that the initial weights are all zero.

x1 x2 x3 y Times Misclassified

2 1 5 1 10

5 3 3 1 5

1 6 2 1 8

7 2 1 -1 2

3 2 6 -1 3

75 10 W 20 050

25 1515D is w 452565

1498 6448 1816265 2213 WW 533969797381

6 18 w 306361

[18,25,14,34]

[18,30,63,61]

[16,56,18,47]

[18,52,19,47]

5. (2 points) Select all that apply: Which of the following is/are correct statement(s) about the mistake bound of the perceptron algorithm?

If the minimum distance from any data point to the separating hyperplane is increased, without any other change to the data points, the mistake bound will also increase.

If the whole dataset is shifted away from origin i.e. translated by adding a constant to each feature, then the mistake bound will also increase. RT

If the pair-wise distance between data points is increased, i.e. the data is scaled by some constant value, then the mistake bound will also increase.

The mistake bound is linearly inverse-proportional to the minimum distance of any data point to the separating hyperplane of the data.

None of the above.

6. (2 points) Select one: Suppose we have data whose examples are of the form [x1,x2], where x1 x2 = 0. We do not know the label for each element. Suppose the perceptron algorithm starts with ✓ = [3,5]; which of the following values will ✓ never take on in the process of running the perceptron algorithm on the data?

[ 1,1]w 3,57 41,52

[4,6]

ooo [ 3,0]

[ 6, 4]

7. (2 points) Select all that apply: Consider the linear decision boundary below and the test dataset shown. Which of the following weight vectors ✓ is paired with its corresponding test error on this dataset? (Note: Assume the decision boundary is fixed and does not change while evaluating error.)

✓ = [ 2,1], error = 5/13

✓ = [2, 1], error = 8/13

✓ = [2, 1], error = 5/13

✓ = [ 2,1], error = 8/13 None of the above.

8. The following problem will walk you through an application of the Perceptron Mistake Bound. The following table shows a linearly separable dataset, and your task will be to determine the mistake bound for the dataset.

NOTE: The proof of the perceptron mistake bound requires that the optimal linear separator passes through the origin. To make the linear separator pass through the origin, we fold the bias into the weights and prepend a 1 to each training example’s input. The original data is on the left, and the result of this prepending is shown on the right. Be sure to use the modified dataset on the right in your calculations.

x0 x1 x2 y

1 -2 2 1

1 -1 -3 -1

1 -2 -3 -1

1 0 1 1

1 2 -1 1

(a) (2 points) Compute the radius R of the “circle” centered at the origin that bounds the data points.

Round to 4 decimal places after the decimal point.

(b) (2 points) Assume that the linear separator with the largest margin is given by

6 445 19 4

✓,, where ✓⇤ = 233

Now, compute the margin of the dataset.

Round to 4 decimal places after the decimal point.

(c) (1 point) Based on the above rounded values, what is the theoretical perceptron mistake bound for this dataset, given this linear separator?

Round to 4 decimal places after the decimal point.

Mistake Bound:

13.3439

5 Linear Regression (19 points)

1. Consider the following dataset:

x 9.0 2.0 6.0 1.0 8.0 y 1.0 0.0 3.0 0.0 1.0

Let x be the vector of datapoints and y be the label vector. Here, we are fitting the data using gradient descent, and our objective function is where N is the number of data points, w is the weight, and b is the intercept. 2 Wxitb Yi Xi

(a) (2 points) If we initialize the weight as 3.0 and intercept as 0.0, what is the gradient of the loss function with respect to the weight w, calculated over all the data points, in the first step of the gradient descent update?

(b) (2 points) What is the gradient of the loss function with respect to the intercept b, calculated over all the data points, in the first step of the gradient descent update?

(c) (2 points) Let the learning rate be 0.01. Perform one step of gradient descent on the data. Fill in the following blanks with the value of the weight and the value of the intercept after this step.

Round to 4 decimal places after the decimal point.

Oeo rg

We W JTWJ b 5 85J

W 3 0.01 209.2 W 0.908

6 0 0.01 29.2 b O 2920

2. Consider a dataset D1 = {(x(1),y(1)),…,is y(=x(Nw)1,yx(+N)b)1}.. Assume the linear regression model that minimizes the mean-squared error on D1

↵, y

(a) minimizes the mean-squared error onw(21points,w(N2,b) +)1,bSelect one:2)} where ↵Now, suppose we have the dataset> 0, > 0 Dand2 iswy1↵==6w2x. Assume the linear regression model that+Db22=. Select the correct statement about{(x(1)+↵, y(1)+ )take on within,…,(x(N)+

below. Note that the statement should hold no matter what values ↵,

the specified constraints.

w1 = w2,b1 = b2 Y Wix tb w 6= w ,b = b 42 Wax by

(b) subset of the rows in(2 points) We decide to ask a friend to analyzeD1. Explain why the linear regression parameters that minimize mean-squaredmay differ from the parameters learned onD1; however, he makes a mistake by duplicating aD1, i.e. w1 and b1. error on the duplicated data

Your answer:

The duplicated rows could have been of outliers in the data which would bias the regression parameters to try to fit

to provide extra weightage to the repeated point

3. We wish to learn a linear regression model on the datasetk. For this question, define the log-cosh loss as D = {(x(1),y(1)),…,(x(N),y(N))} where x2R

`(y,yˆ ) = log(cosh(ˆy y))

In this question, log function has base e.

In particular, for a given point x(i), the log-cosh loss of a model with parameters ✓ is

J(i)(✓) = log⇣cosh⇣✓T x(i) y(i)⌘⌘

We are interested in minimizing loss over our training data, so we minimize the average log-cosh loss over all points in D.

(a) (2 points) What is the objective J(✓) in this setting?

(c) (2 points) What is the gradient of J(i)(✓) with respect to the entire parameter vector ✓?

6 Collaboration Questions

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.

3. Did you find or come across code that implements any part of this assignment? If so, include full details.

## Reviews

There are no reviews yet.