## Description

TAs: Abhi, Tara, Sami, Alex, Hayden, Chu

START HERE: Instructions

˜mgormley/courses/10601/syllabus.html#7-academic-integrity-policies

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/

˜mgormley/courses/10601/syllabus.html#late-homework-policy

• Submitting your work:

– Programming: You will submit your code for programming questions on the homework to

Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.9.12) and versions of permitted libraries (e.g. numpy 1.23.0) match those used on Gradescope. You have a total of 10 Gradescope programming submissions. Use them wisely. In order to not waste code submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before any Gradescope coding submission. The above is true for future assignments, but this one allows unlimited submissions.

1

• Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on the course website.

For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LATEX users, replace choice with CorrectChoice to obtain a shaded box/circle, and don’t change anything else.

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley

⃝ Marie Curie

⃝ Noam Chomsky

Select One: Who taught this course?

Henry Chai

⃝ Marie Curie

@@ Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

■ Stephen Hawking

■ Albert Einstein

■ Isaac Newton

2 I don’t know

Select all that apply: Which are scientists?

■ Stephen Hawking

■ Albert Einstein

■ Isaac Newton

@■ I don’t know

Fill in the blank: What is the course number?

10-S6301

S

1 Programming: Majority Vote Classifier [30 Points]

1.1 Introduction

The goal of this assignment is to ensure that you:

1. Have a way to edit and test your code (i.e. a text editor and compiler/interpreter)

2. Are familiar with submitting to Gradescope

3. Are familiar with file I/O and standard output in Python

1.2 Majority Vote Classifier

1.2.1 Algorithm

The training procedure should store the label used for prediction at test time. In the case of a tie, output the value that is numerically higher (or comes last alphabetically). At test time, each example should be passed through the classifier. Its predicted label becomes the label most commonly occurring in the train set.

Looking ahead: This simple algorithm acts as a small component of the Decision Tree that you will implement in the next homework assignment. We hope that you will employ best practices when coding so that you can re-use your own code here in the next assignment. A Majority Vote Classifier is simply a decision tree of depth zero (it predicts a class label for the input instance based on the most commonly occurring label present in the data).

1.2.2 The Datasets

Materials Download the zip file from course website, which contains all the data that you will need in order to complete this assignment.

Datasets The handout contains two datasets. Each one contains attributes and labels and is already split into training and testing data. The first row of each .tsv file contains the name of each attribute, and the class label is always the last column.

1. heart: The first task is to predict whether a patient has been (or will be) diagnosed with heart disease, based on available patient information. The attributes (aka. features) are:

(a) sex: The sex of the patient—1 if the patient is male, and 0 if the patient is female.

(b) chest_pain: 1 if the patient has chest pain, and 0 otherwise.

(c) high_blood_sugar: 1 if the patient has high blood sugar (>120 mg/dl fasting), or 0 otherwise.

(d) abnormal_ecg: 1 if exercise induced angina in the patient, and 0 otherwise. Angina is a type of severe chest pain.

(e) flat_ST: 1 if the patient’s ST segment (a section of an ECG) was flat during exercise, or 0 if it had some slope.

(f) fluoroscopy: 1 if a physician used fluoroscopy, and 0 otherwise. Fluoroscopy is an imaging technique used to see the flow of blood through the heart.

(h) heart_disease: 1 if the patient was diagnosed with heart disease, and 0 otherwise. This is the class label you should predict.

The training data is in heart_train.tsv, and the test data in heart_test.tsv.

The handout zip file also contains the predictions and metrics from a reference implementation of a Majority Vote Classifier for the heart and education datasets (see subfolder example output). You can check your own output against these to see if your implementation is correct.

Note: For simplicity, all attributes are discretized into just two categories. This applies to all the datasets in the handout, as well as the additional datasets on which we will evaluate your Majority Vote Classifier.

1.2.3 Command Line Arguments

The autograder runs and evaluates the output from the files generated, using the following command:

$ python majority_vote.py [args…]

Where above [args…] is a placeholder for five command-line arguments: <train input> <test input> <train out> <test out> <metrics out>. These arguments are described in detail below:

1. <train input>: path to the training input .tsv file

2. <test input>: path to the test input .tsv file

3. <train out>: path of output .txt file to which the predictions on the training data should be written

4. <test out>: path of output .txt file to which the predictions on the test data should be written

5. <metrics out>: path of the output .txt file to which metrics such as train and test error should be written

As an example, the following command line would run your program on the heart dataset. The train predictions would be written to heart_train_labels.txt, the test predictions to heart_test_labels.txt, and the metrics to heart_metrics.txt.

$ python majority_vote.py heart_train.tsv heart_test.tsv heart_train_labels.txt heart_test_labels.txt heart_metrics.txt

1.2.4 Output: Labels Files

Your program should write two output .txt files containing the predictions of your model on training data (<train out>) and test data (<test out>). Each should contain the predicted labels for each example printed on a new line. Use ‘ ’ to create a new line.

Your labels should exactly match those of a reference majority vote classifier implementation—this will be checked by the autograder by running your program and evaluating your output file against the reference solution.

The first few lines of an example output file is given below for the heart dataset:

0

0

0

0

0

0 …

1.2.5 Output: Metrics File

Generate another file where you should report the training error and testing error. This file should be written to the path specified by the command line argument <metrics out>. Your reported numbers should be within 0.0001 of the reference solution. You do not need to round your reported numbers! The autograder will automatically incorporate the right tolerance for float comparisons. The file should be formatted as follows:

error(train): 0.490000 error(test): 0.402062

1.3 Command Line Arguments

In this and future programming assignments, we will use command line arguments to run your programs with different parameters. Below, we provide some simple examples for how to do this in Python. In the examples below, suppose your program takes two arguments: an input file and an output file.

Python:

import sys

if __name__ == ’__main__’: infile = sys.argv[1] outfile = sys.argv[2]

print(“Theinput file is: %s” % (infile)) print(“Theoutput file is: %s” % (outfile))

1.4 Code Submission

You must submit a file named majority_vote.py. The autograder is case sensitive. You must submit this file to the corresponding homework link on Gradescope.

Written Questions (74 points)

1 LATEX Bonus Point (1 points)

1. (1 point) Select one: Did you use LATEX for the entire written portion of this homework?

⃝ Yes

⃝ No

2 Course Policies (8 points)

This section covers important course policies that every student should know and understand. These questions MUST be finished in order for the whole homework to be considered for grading.

1. (1 point) Select one: Assignment turned in late without prior approval will incur a daily penalty. How much is the penalty? Up to 1 day: Up to 2 day: Up to 3 day: Up to 4 day:

⃝ 5%, 10%, 15%, 20%

⃝ 10%, 20%, 30%, 40%

⃝ 25%, 50%, 75%, 100%

⃝ 20%, 40%, 60%, 80%

⃝ As many as I want; Of course!

⃝ 6; No

⃝ 6; Yes

⃝ 8; No

⃝ 8; Yes

3. (1 point) Select all that apply: Seeking help from other students in understanding course materials needed to solve homework problems is ALLOWED under which of the following conditions?

2 Any written notes are taken on an impermanent surface (e.g. whiteboard, chalkboard) and discarded before writing up one’s solution alone.

2 Learning is facilitated not circumvented; i.e., the purpose of seeking help is to learn and understand the problem instead of merely getting an answer

2 Help both given and received is reported in collaboration questions in the homework

2 The student updates his/her collaborative questions even if it is after submitting their own assignment

2 None of the above

4. (1 point) Select all that apply: Which of the following is (are) strictly forbidden in solving and submitting homework?

2 Searching on the internet for solutions or sample codes

2 Consulting people outside this class who have seen or solved the problem before

2 Turning in someone else’s homework

2 Using anyone else’s, or allowing other classmates to use your computer or Gradescope account in connection with this course

2 None of the above

5. (1 point) Select one: If you solved your assignment completely on your own, you can skip the collaboration questions at the end of each homework.

⃝ True

⃝ False

6. (1 point) Select all that apply: What is (are) the consequence(s) of being caught cheating in this course?

First time:

2 Failure of the course

2 Negative 100% on the assignment

Second time:

2 Failure of the course

2 Negative 100% on the assignment

7. (1 point) Select one: Assume a difficult situation arises in the middle of the semester (e.g. medical, personal etc.) that might prevent you from submitting assignments on time or working as well as you would like. What should you do?

⃝ Do not speak to the course staff, try to finish the class, reach out to the course staff in the end of the semester explaining your special situation

3 Probability and Statistics (25 points)

Use the following data to answer questions 1-2. Consider data created by flipping a coin five times S

= [1, 1, 0, 1, 1] , where 1 denotes that the coin turned up heads and 0 denotes that it turned up tails.

1. (3 points) Select one: What is the probability of observing any combination of this data (4 heads and 1 tails), assuming it was generated by flipping a coin X with an unequal probability of heads (1) and tails (0), where the distribution is P(X = 1) = 0.75, P(X = 0) = 0.25?

2. (1 point) Note that the probability of this data sample would be greater if the value of P(X = 1) was not 0.75, but instead some other value. What is the value of P(X = 1) that maximizes the probability of the sample S? Provide your answer as a fraction.

3. (1 point) Select one: For events A and B, where A ∩ B indicates A AND B, and A ∪ B indicates A

OR B,

P(A ∩ B) = P(A) + P(B) − P(A ∪ B)

⃝ True

⃝ False

4. (1 point) Select one: For events A and B,

P(A1 ∩ A2 ∩ A3) = P(A3|A2 ∩ A1)P(A2|A1)P(A1)

⃝ True

⃝ False

P(R) = 0.4

P(W|R) = 0.8 P(W|¬R) = 0.2

What is the probability that your car is wet in the morning?

⃝ 0.64

⃝ 0.56

⃝ 0.44

⃝ 0.4

ble where both X and Y are binary variables:

X Y Probability

0 0 0.1

0 1 0.4

1 0 0.2

1 1 0.3

6. (1 point) Select one: What is P(X = 1|Y = 1)?

7. (1 point) Select one: What is P(Y = 0)?

⃝ 0.2

⃝ 0.6

⃝ 0.5

⃝ 0.3

Use the following information to answer questions 8-10. Let X be a random variable with expected

value E[X] = 1 and variance V ar[X] = 2.

8. (1 point) Select one: What is E[6X]?

⃝ 1

⃝ 3

⃝ 6

⃝ 36

9. (1 point) Select one: What is V ar[2X]?

⃝ 1

⃝ 4

⃝ 8

⃝ 16

10. (1 point) Select one: What is V ar[2X − 3]?

⃝ 3

⃝ 4

⃝ 5

⃝ 8

Use the following information to answer questions 11-14: Let A, B, and C be random variables with

discrete probability distributions. Consider the following two joint probability tables: one relating A and B, and the other relating B and C.

AB b1 b2 b3

a1 0.1 0.05 0.

a2 0.1 0.05 0.3

a3 0.05 0.15 0.05

B C

c1 c2 c3 c4

b1 0.02 0.

0.06 0.03

b2 0.03 0.05 0 0.17

b3 0.35 0.04 0 0.11

11. (1 point) Select all that apply: Which of the following statements are necessarily false? Note X Y indicates that random variable X is independent of random variable Y .

2 AB

2 BC

2 AC

2 None of the above.

12. (2 points) What is P(B = b1|A = a2,C = c4)? If this value cannot be computed, write N/A. Ignore when P(A = a2,C = c4) = 0 for this problem, if such a case is possible.

16. (1 point) Select one: What is the mean, variance and entropy of a Bernoulli (p) random variable?

⃝ p, p(1 − p), −(1 − p)log(1 − p) − plog(p)

⃝ p(1 − p), p, −(1 − p)log(1 − p) − plog(p)

⃝ p, p(1 − p), log(1 − p) − plog(p)

⃝ The entropy of a Bernoulli variable is not defined.

17. (2 points) Please match the probability density function of the random variable X to its corresponding distribution name.

A) prob

B) prob(X = x) = λe−λx when x ≥ 0; 0 otherwise

C) prob

D) prob when a ≤ x ≤ b; 0 otherwise

E) prob(X = x) = px(1 − p)1−x

4 Calculus (8 points)

1. (2 points) Evaluate the derivative of y with respect to x, where .

2. (2 points) Select one: Find the partial derivative of y with respect to x, where y = 3×2 sin(z)e−x

⃝ 3xsin(z)e−x(2 + x) ⃝ −6xsin(z)e−x

⃝ 3xsin(z)e−x(2 − x)

⃝ 6xcos(z)e−x

3. (2 points) Select one: For the function f(x) = 4×3−5×2−2x, the value sets the derivative to be 0. Additionally, the second order derivative of is negative. What can you say about f(x) at the point :

⃝ a local minimum

⃝ a local maximum

⃝ a local minimum or a local maximum

⃝ None of the above

4. (2 points) Select one: Suppose that f(x|θ) = xTθ, where x,θ ∈ Rn. The function g(θ) is defined as for some x(1) ∈ Rn and y(1) ∈ R. What is the function type of g(θ):

⃝ g : Rn → R

⃝ g : R → R

⃝ g : R → Rn

⃝ g : (Rn × Rn) → R

5 Vectors and Matrices (13 points)

1. (1 point) Select one: Consider the matrix X and the vectors y and z below:

X , y .

What is the inner product of the vectors y and z (this is also sometimes called the dot product)?

2. (1 point) Select one: Using the same values for X, y, and z as above, what is the product Xy?

3. (2 points) Select all that apply: Consider u and V . Which of these are valid

operations?

2 uTV 2 VTu

2 uV

2 VV

2 None of the above

4. (2 points) Select one: For the matrices A and B , what is the

product AB?

19

−14

−4 5

−9

18

19

9

−2

−20

−14

2

19

−14

−5

5. (1 point) True or False: The matrix A from the previous question has an inverse.

⃝ True ⃝ False

x1 y1

6. (2 points) Select one: Consider two vectors x = x2 and y = y2. Let z = xTy. What is ?

x3 y3

⃝ y2

⃝ x2

⃝ x

⃝ y

7. (2 points) Select one: Given matrix X and the column vector y , what is the

eigenvalue of X associated with y? (Recall an eigenvector of a matrix A ∈ Rn×n is a nonzero vector v ∈ Rn such that Av = λv where we call the scalar λ the associated eigenvalue for v.)

⃝ −5

⃝ −3

⃝ 2

⃝ 1.5

8. (2 points) Select one: Preparing for his linear algebra final, Joe is finding eigenvectors and eigenvalues for different matrices. Joe finds out that some matrix A (not given) has eigenvalues 4 and 5. He also

3 1

notices that both 117 and 39 are solutions to the equation Av = 4v, and concludes that both

9 3

of these vectors are eigenvectors corresponding to the eigenvalue of 4. Which statement regarding his conclusion is true?

⃝ It must be wrong because there cannot be multiple eigenvectors corresponding to a single eigenvalue.

⃝ It must be wrong because the eigenvectors are linearly dependent and thus cannot both be solutions to Av = 4v.

⃝ It is correct because both vectors are solutions to Av = 4v.

6 Geometry (5 points)

1. (2 points) Select one: What relationship does the vector w share with the line wTx + b = 0? (assume x and w are both two dimensional column vectors, and wT indicates the transpose of the column vector w.)

⃝ parallel

⃝ orthogonal

⃝ depends on the value of b

2. (1 point) Select one: With reference to the above question, select the statement which best explains why w and wTx + b = 0 share the above relationship.

⃝ The inner product wT(x′ − x′′), where x′ and x′′ are two points on the line wTx+b = 0, is 0

⃝ The inner product wT(x′ − x′′), where x′ and x′′ are two points on the line wTx+b = 0, is 1

⃝ The inner product wT(x′ − x′′), where x′ and x′′ are two points on the line wTx+b = 0, is b

3. (2 points) Select one: What is the distance from the origin to the line wTx + b = 0?

(In the following answers, λ is some constant)

⃝ |Tb| w w w

7 CS Foundations (14 points)

1. (1 point) Select one: If f(n) = ln(n) and g(n) = log3(n), which of the following is true?

⃝ f(n) ∈ O(g(n))

⃝ g(n) ∈ O(f(n))

⃝ Both

⃝ Neither

2. (1 point) Select one: If f(n) = n10 and g(n) = 10n, which of the following is true?

⃝ f(n) ∈ O(g(n))

⃝ g(n) ∈ O(f(n))

⃝ Both

⃝ Neither

Figure 1: Britain’s Royal Family

3. (2 points) Select one: Using the tree shown in Figure 1, how many nodes would depth-first-search visit in finding Mia Tindall (including her node)? Assume we search left-to-right and top-down.

⃝ 3

⃝ 12

⃝ 15

⃝ 18

Figure 2: A Binary Tree with indexed nodes

4. (2 points) Figure 2 is a Binary Tree with indexed nodes. Assume root node is node 1. What is the node-visit order of DFS and BFS of the above Binary Tree?

A depth-first search (DFS) traversal of a binary tree starts with visiting the root node, and recursively searches down the left subtree (i.e., the tree rooted at the left node) before going to search the right subtree (i.e., the tree rooted at the right node) until the traversal is done.

Note: Alternatively, we can also look right subtree before left subtree too, for the question please consider left to right order!

A breadth-first search (BFS) traversal of a binary tree visits every node (assuming a left-to-right order) on a level (with the same distance to the root) before going to a lower level until the traversal is done.

The node-visit order of DFS is (separate values with commas, e.g., 1,2,3,4):

The node-visit order of BFS is (separate values with commas, e.g., 1,2,3,4):

5. (2 points) Fill in the blanks in the pseudo code for key search using recursive depth-first search (DFS) traversal. (Note: Please put your answer in the boxes below, not on the lines.)

class TreeNode:

def __init__(self, val): self.val = val self.leftNode = None self.rightNode = None

# (a) the left/right node is denoted as

# node.leftNode/node.rightNode

# (b) left/right node are of type TreeNode

# (c) the value of the node is denoted as node.val

# (d) recursive DFS to search for the node

# with value key in a binary tree

# (e) the left node is assumed to be searched

# before the right node

def find_val(node, key):

if node is None: return None

if (1) :

return node

else:

result = (2)

if result is None:

result = (3)

return

Your pseudo code for missing field (1):

if n = 0 if n = 1

Ln−1 + Ln−2 if n > 1

6. (2 points) Select one: Which of the following is the numerical value for L32?

⃝ 3010349

⃝ 3524578

⃝ 4870847

⃝ 7881196

Consider the following information to answer questions 7-8:

Given the functions of computing a Fibonacci number:

def fib_1(n): if n == 0 or n == 1: return 1

return fib_1(n – 1) + fib_1(n – 2)

d = {} d[0] = 1 d[1] = 1 def fib_2(n):

if n in d.keys():

return d[n]

d[n] = fib_2(n – 1) + fib_2(n – 2)

return d[n]

7. (2 points) Select one: Which of the following is the tightest upper bound on the time complexity of computing fib_1(n)?

⃝ O(n)

⃝ O(nlogn)

⃝ O(2n)

⃝ O(n!)

8. (2 points) Select one: Which of the following is the tightest upper bound on the time complexity of computing fib_2(n)?

⃝ O(n)

⃝ O(nlogn)

⃝ O(2n)

⃝ O(n!)

9 Collaboration Questions

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.

3. Did you find or come across code that implements any part of this assignment ? If so, include full details.

## Reviews

There are no reviews yet.