Description
DECISION TREES
TAs: Catherine, Zachary, Ari, Siyuan, Anoushka
Summary It’s time to build your first end-to-end learning system! In this assignment, you will build a Decision Tree classifier and apply it to several binary classification problems. This assignment consists of several parts: In Written component, you will work through some Information Theory basics in order to “learn” a Decision Tree on paper. Then in Programming component, you will implement Decision Tree learning, prediction, and evaluation. Using that implementation, you will answer the empirical questions found at the end of the Written component.
START HERE: Instructions
˜mgormley/courses/10601/syllabus.html
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ ˜mgormley/courses/10601/syllabus.html
• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.
– Programming: You will submit your code for programming questions on the homework to
Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.9.6, OpenJDK 11.0.11, g++ 7.5.0) and versions of permitted libraries (e.g. numpy 1.21.2 and scipy 1.7.1) match those used on Gradescope. You have 10 Gradescope programming submissions. We recommend debugging your implementation on your local machine (or the Linux servers) and making sure your code is running correctly first before submitting you code to Gradescope.
• Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on Piazza.
1
Linear Algebra Libraries When implementing machine learning algorithms, it is often convenient
EJML for Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. The autograder will use EJML version 0.41. When compiling and running your code, we will add the additional command line argument
-cp “linalg_lib/ejml-v0.41-libs/*:linalg_lib/nd4j-v1.0.0-M1.1-libs/*:./” to ensure that all the EJML jars are on the classpath as well as your code.
ND4J for Java ND4J is a library for multidimensional tensors with an interface akin to Python’s NumPy. The autograder will use ND4J version 1.0.0-M1.1. When compiling and running your code, we will add the additional command line argument
-cp “linalg_lib/ejml-v0.41-libs/*:linalg_lib/nd4j-v1.0.0-M1.1-libs/*:./” to ensure that all the ND4J jars are on the classpath as well as your code.
Eigen for C++ Eigen is a header-only library, so there is no linking to worry about—just #include whatever components you need. The autograder will use Eigen version 3.4.0. The command line arguments above demonstrate how we will call you code. When compiling your code we will include, the argument -I./linalg_lib in order to include the linalg_lib/Eigen subdirectory, which contains all the headers.
We have included the correct versions of EJML/ND4J/Eigen in the linalg_lib.zip posted on the Coursework page of the course website for your convenience. It contains the same linalg_lib/ directory that we will include in the current working directory when running your tests. Do not include EJML, ND4J, or Eigen in your homework submission; the autograder will ensure that they are in place.
a
https://ejml.org
b
https://javadoc.io/doc/org.nd4j/nd4j-api/latest/index.html c
http://eigen.tuxfamily.org/
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
Matt Gormley / Henry Chai
◯ Marie Curie
◯ Noam Chomsky
Select One: Who taught this course?
Matt Gormley / Henry Chai
# Marie Curie
@ Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
□ Stephen Hawking
Albert Einstein
□ Isaac Newton
□ None of the above
Select all that apply: Which are scientists?
Stephen Hawking
Albert Einstein Isaac Newton
@ I don’t know
Fill in the blank: What is the course number?
10-S7601S
1 Written Questions (30 points)
1.1 Warm-Up
First, let’s think a little bit about decision trees. The following dataset D consists of 8 examples, each with 3 attributes, (A,B,C), and a label, Y .
A B C Y
1 2 0 1
0 1 0 0
0 0 1 0
0 2 0 1
1 1 0 1
1 0 1 0
1 2 1 0
1 1 0 1
Use the data above to answer the following questions.
A few important notes:
• All calculations should be done without rounding! After you have finished all of your calculations, write your rounded solutions in the boxes below.
• Note that, throughout this homework, we will use the convention that the leaves of the trees do not count as nodes, and as such are not included in calculations of depth and number of splits. (For example, a tree which classifies the data based on the value of a single attribute will have depth 1, and contain 1 split.)
• Note that the dataset contains duplicate rows; treat each of these as their own example, do not remove duplicate rows.
1. (1 point) What is the entropy of Y in bits, H(Y )? In this and subsequent questions, when we request the units in bits, this simply means that you need to use log base 2 in your calculations. (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
2. (1 point) What is the mutual information of Y and A in bits, I(Y ;A)? (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
3. (1 point) What is the mutual information of Y and B in bits, I(Y ;B)? (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
4. (1 point) What is the mutual information of Y and C in bits, I(Y ;C)? (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
5. (1 point) Consider the dataset given above. Which attribute (A, B, or C) would a decision tree algorithm pick first to branch on, if its splitting criterion is mutual information?
Select one:
#
A
B # C
6. (1 point) Consider the dataset given above. After making the first split, which attribute would pick to branch on next, if the splitting criterion is mutual information? (Hint: Notice that this question correctly presupposes that there is exactly one second attribute.)
Select one:
#
A
B # C
7. (1 point) If the same algorithm continues until the tree perfectly classifies the data, what would the depth of the tree be?
1.2 Empirical Questions
The following questions should be completed as you work through the programming portion of this assignment.
1. (3 points) Train and test your decision tree on the politician dataset and the education dataset with four different values of max-depth, {0,1,2,4}. Report your findings in the HW2 solutions template provided. A Decision Tree with max-depth 0 is simply a majority vote classifier; a Decision Tree with max-depth 1 is called a decision stump. If desired, you could even check that your answers for these two simple cases are correct using your favorite spreadsheet application (e.g. Excel, Google Sheets). (Please round each number to the fourth decimal place, e.g. 0.1234)
2. (3 points) For the politicians dataset, create a computer-generated plot showing error on the y-axis against depth of the tree on the x-axis. Plot both training error and testing error, clearly labeling which is which. That is, for each possible value of max-depth (0,1,2,…, up to the number of attributes in the dataset), you should train a decision tree and report train/test error of the model’s predictions. You should include an image file below using the provided, commented out code in LaTeX, switching out politician.png to your file name as needed.
3. (3 points) Suppose your research advisor asks you to run some model selection experiments and then report your results. You select the Decision Tree model’s max-depth to be the one with lowest test error in metrics.txt and then report that model’s test error as the performance of our classifier on held out test data. Is this a good experimental setup? If so, why? If not, why not?
4. (3 points) In this assignment, we used max-depth as our stopping criterion, and as a mechanism to prevent overfitting. Alternatively, we could stop splitting a node whenever the mutual information for the best attribute is lower than a threshold value. This threshold would be another hyperparameter. Theoretically, how would increasing this threshold value affect the number of nodes and depth of the learned trees?
5. (3 points) Continuing from the previous question, how would you set-up model training to choose the threshold value?
6. (2 points) Print (do not handwrite!) the decision tree which is produced by your algorithm for the politician data with max depth 3. Instructions on how to print the tree could be found in section 3.7.
7. (2 points) Describe the decision tree for the mushrooms dataset. How would you differentiate between poisonous and edible mushrooms?
2 Collaboration Questions
1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.
3. Did you find or come across code that implements any part of this assignment ? If so, include full details.
3 Programming: (70 points)
3.1 The Tasks and Datasets
Materials Download the zip file from the course website. The zip file will have a handout folder that contains all the data that you will need in order to complete this assignment.
Datasets The handout contains four datasets. Each one contains attributes and labels and is already split into training and testing data. The first line of each .tsv file contains the name of each attribute, and the class is always the last column.
1. politician: The first task is to predict whether a US politician is a member of the Democrat or Republican party, based on their past voting history. Attributes (aka. features) are short descriptions of bills that were voted on, such as Aid to nicaraguan contras or Duty free exports. Values are given as ‘y’ for yes votes and ‘n’ for no votes. The training data is in politicians_train.tsv, and the test data in politicians_test.tsv.
3. small: We also include small_train.tsv and small_test.tsv—a small, purely for demonstration version of the politicians dataset, with only attributes Anti satellite test ban and
Export south africa. For this small dataset, the handout tar file also contains the predictions from a reference implementation of a Decision Tree with max-depth 3 (see small_3_train.labels, small_3_test.labels, small_3_metrics.txt). You can check your own output against
2 these to see if your implementation is correct.
4. mushrooms: Finally, we have a large dataset of mushroom samples, mushrooms_train.tsv and mushrooms_test.tsv, for you to test your algorithm against. Each sample has discrete attributes which were split into boolean attributes. For example, cap-size could be bell, conical, convex, flat, knobbed, or sunken. This was split into boolean attributes cap-size bell, cap-size conical, cap-size convex, etc. Your goal is to differentiate the poisonous versus edible mushrooms.
Note: For simplicity, all attributes are discretized into just two categories (i.e. each node will have at most two descendents). This applies to all the datasets in the handout, as well as the additional datasets on which we will evaluate your Decision Tree.
2Yes, you read that correctly: we are giving you the correct answers.
3.2 Program #1: Inspecting the Data [5pts]
Write a program inspection.{py|java|cpp} to calculate the label entropy at the root (i.e. the entropy of the labels before any splits) and the error rate (the percent of incorrectly classified instances) of classifying using a majority vote (picking the label with the most examples). You do not need to look at the values of any of the attributes to do these calculations, knowing the labels of each example is sufficient. Entropy should be calculated in bits using log base 2.
Command Line Arguments The autograder runs and evaluates the output from the files generated, using the following command:
For Python:
For Java:
For C++:
$ python3 inspection.py <input> <output>
$ javac inspection.java; java inspect <input> <output>
$ g++ inspection.cpp; ./a.out <input> <output>
Your program should accept two command line arguments: an input file and an output file. It should read the .tsv input file (of the format described in Section 3.1), compute the quantities above, and write them to the output file so that it contains:
entropy: <entropy value> error: <error value>
Example For example, suppose you wanted to inspect the file small_train.tsv and write out the results to small_inspect.txt. For Python, you would run the command below:
$ python3 inspection.py small_train.tsv small_inspect.txt
Afterwards, your output file small_inspect.txt should contain the following:
entropy: 0.996316519559 error: 0.464285714286
For your own records, run your program on each of the datasets provided in the handout—this error rate for a majority vote classifier is a baseline over which we would (ideally) like to improve.
3.3 Program #2: Decision Tree Learner [65pts]
In decisionTree.{py ∣ java ∣ cpp}, implement a Decision Tree learner. This file should learn a decision tree with a specified maximum depth, print the decision tree in a specified format, predict the labels of the training and testing examples, and calculate training and testing errors.
Your implementation must satisfy the following requirements:
• Use mutual information to determine which attribute to split on.
• Be sure you’re correctly weighting your calculation of mutual information. For a split on attribute X,
I(Y ;X) = H(Y ) − H(Y ∣X) = H(Y ) − P(X = 0)H(Y ∣X = 0) − P(X = 1)H(Y ∣X = 1).
• As a stopping rule, only split on an attribute if the mutual information is > 0.
• You should split with replacement. That is, when you split on a column, you should retain this column in child datasets.
• Do not grow the tree beyond a max-depth specified on the command line. For example, for a maximum depth of 3, split a node only if the mutual information is > 0 and the current level of the node is < 3.
• Use a majority vote of the labels at each leaf to make classification decisions. If the vote is tied, choose the label that comes last in the lexicographical order (i.e. Republican should be chosen before Democrat)
• It is possible for attributes to have equal values for mutual information. In this case, you should split on the first attribute to break ties.
Careful planning will help you to correctly and concisely implement your Decision Tree learner. Here are a few hints to get you started:
• Write helper functions to calculate entropy and mutual information.
• Write a function to train a stump (tree with only one level). Then call that function recursively to create the sub-trees.
• In the recursion, keep track of the depth of the current tree so you can stop growing the tree beyond the max-depth.
• Implement a function that takes a learned decision tree and data as inputs, and generates predicted labels. You can write a separate function to calculate the error of the predicted labels with respect to the given (ground-truth) labels.
• Be sure to correctly handle the case where the specified maximum depth is greater than the total number of attributes.
• Be sure to handle the case where max-depth is zero (i.e. a majority vote classifier).
• Look under the FAQ’s on Piazza for more useful clarifications about the assignment.
3.4 Command Line Arguments
The autograder runs and evaluates the output from the files generated, using the following command:
For Python:
For Java:
For C++:
Where above [args…] is a placeholder for six command-line arguments: <train input> <test input> <max depth> <train out> <test out> <metrics out>. These arguments are described in detail below:
1. <train input>: path to the training input .tsv file (see Section 3.1)
2. <test input>: path to the test input .tsv file (see Section 3.1)
3. <max depth>: maximum depth to which the tree should be built
4. <train out>: path of output .labels file to which the predictions on the training data should be written (see Section 3.5)
5. <test out>: path of output .labels file to which the predictions on the test data should be written (see Section 3.5)
6. <metrics out>: path of the output .txt file to which metrics such as train and test error should be written (see Section 3.6)
$ python3 decisionTree.py [args…]
$ javac decisionTree.java; java decisionTree [args…]
$ g++ -g decisionTree.cpp -o decisionTree; ./decisionTree [args…]
As an example, if you implemented your program in Python, the following command line would run your program on the politicians dataset and learn a tree with max-depth of two. The train predictions would be written to pol_2_train.labels, the test predictions to pol_2_test.labels, and the metrics to pol_2_metrics.txt.
$ python3 decisionTree.py politicians_train.tsv politicians_test.tsv
2 pol_2_train.labels pol_2_test.labels pol_2_metrics.txt
The following example would run the same learning setup except with max-depth three, and conveniently writing to analogously named output files, so you can can compare the two runs.
$ python3 decisionTree.py politicians_train.tsv politicians_test.tsv
3 pol_3_train.labels pol_3_test.labels pol_3_metrics.txt
3.5 Output: Labels Files
Your program should write two output .labels files containing the predictions of your model on training data (<train out>) and test data (<test out>). Each should contain the predicted labels for each example printed on a new line. Use ‘ ’ to create a new line.
Your labels should exactly match those of a reference decision tree implementation—this will be checked by the autograder by running your program and evaluating your output file against the reference solution.
Note: You should output your predicted labels using the same string identifiers as the original training data:
e.g., for the politicians dataset you should output democrat/republican and for the education dataset you should output A/notA. The first few lines of an example output file is given below for the politician dataset:
democrat democrat democrat republican democrat …
3.6 Output: Metrics File
Generate another file where you should report the training error and testing error. This file should be written to the path specified by the command line argument <metrics out>. Your reported numbers should be within 0.01 of the reference solution. You do not need to round your reported numbers! The Autograder will automatically incorporate the right tolerance for float comparisons. The file should be formatted as follows:
error(train): 0.0714 error(test): 0.1429
The values above correspond to the results from training a tree of depth 3 on smalltrain.tsv and testing on smalltest.tsv. (There is one space between the colon and value)
3.7 Output: Printing the Tree
Below, we have provided the recommended format for printing the tree (example for python). You can print it directly to standard out rather than to a file. This functionality of your program will not be autograded.
$ python3 decisionTree.py small_train.tsv small_test.tsv 2 small_2_train.labels small_2_test.labels small_2_metrics.txt
[15 democrat/13 republican]
| Anti_satellite_test_ban = y: [13 democrat/1 republican]
| | Export_south_africa = y: [13 democrat/0 republican]
| | Export_south_africa = n: [0 democrat/1 republican]
| Anti_satellite_test_ban = n: [2 democrat/12 republican]
| | Export_south_africa = y: [2 democrat/7 republican]
| | Export_south_africa = n: [0 democrat/5 republican]
However, you should be careful that the tree might not be full. For example, after swapping the train/test files in the example above, you could end up with a tree like the following.
$ python3 decisionTree.py small_test.tsv small_train.tsv 2 swap_2_train.labels swap_2_test.labels swap_2_metrics.txt
[13 democrat/15 republican]
| Anti_satellite_test_ban = y: [9 democrat/0 republican]
| Anti_satellite_test_ban = n: [4 democrat/15 republican]
| | Export_south_africa = y: [4 democrat/10 republican]
| | Export_south_africa = n: [0 democrat/5 republican]
The following pretty-print shows the education dataset with max-depth 3. Use this example to check your code before submitting your pretty-print of the politics dataset (asked in question 14 of the Empirical questions).
$ python3 decisionTree.py education_train.tsv education_test.tsv 3 edu_3_train.labels edu_3_test.labels edu_3_metrics.txt
[135 A/65 notA]
| F = A: [119 A/23 notA]
| | M4 = A: [56 A/2 notA]
| | | P1 = A: [41 A/0 notA]
| | | P1 = notA: [15 A/2 notA]
| | M4 = notA: [63 A/21 notA]
| | | M2 = A: [37 A/3 notA]
| | | M2 = notA: [26 A/18 notA]
| F = notA: [16 A/42 notA]
| | M2 = A: [13 A/15 notA]
| | | M4 = A: [6 A/1 notA]
| | | M4 = notA: [7 A/14 notA]
| | M2 = notA: [3 A/27 notA]
| | | M4 = A: [3 A/5 notA]
| | | M4 = notA: [0 A/22 notA]
The numbers in brackets give the number of positive and negative labels from the training data in that part of the tree.
At this point, you should be able to go back and answer questions 1-7 in the ”Empirical Questions” section of this handout. Write your solutions in the template provided.
Programming Please ensure you have completed the following files for submission.
inspection.{py|java|cpp} decisionTree.{py|java|cpp}
When submitting your solution, make sure to select and upload both files. Ensure the files have the exact same spelling and letter casing as above. You can either directly zip the two files (by selecting the two files and compressing them – do not compress the folder containing the files) or directly drag them to Gradescope for submission.
Note: Please make sure the programming language that you use is consistent within this assignment (e.g. don’t use C++ for inspect and Python for decisionTree).
Reviews
There are no reviews yet.