10-601 Homework 2: Decision Trees

Answer the following questions in the HW2 solutions template provided. DO NOT show your work. Then upload your solutions to Gradescope.
1.1 Warm-up
First, let’s think a little bit about decision trees. The following dataset consists of 7 examples, each with 3 attributes (A, B, C) and a label Y.
A B C Y
1 1 0 0
1 1 2 1
1 0 0 1
1 1 2 1
0 0 2 0
0 1 1 0
0 0 0 0
Use the data above to answer the following questions.
A few important notes:
• All calculations should be done without rounding! After you have finished all of your calculations, write your rounded solutions in the boxes below.
• Note that, throughout this homework, we will use the convention that the leaves of the trees do not count as nodes, and as such are not included in calculations of depth and number of splits. (For example, a tree which classifies the data based on the value of a single attribute will have depth 1, and contain 1 split.)
1. [1pt] What is the entropy of Y in bits, H(Y )? In this and subsequent questions, when we request the units in bits, this simply means that you need to use log base 2 in your calculations. (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
0.9852
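Worked calculation: Y takes the value 1 in 3 of the 7 examples, so H(Y) = -(3/7) log2(3/7) - (4/7) log2(4/7) ≈ 0.9852 bits.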
2. [1pt] What is the mutual information of Y and A in bits, I(Y ;A)? (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
0.5216
3. [1pt] What is the mutual information of Y and B in bits, I(Y ;B)? (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
0.0202
4. [1pt] What is the mutual information of Y and C in bits, I(Y ;C)? (Please include one number rounded to the fourth decimal place, e.g. 0.1234)
0.1981
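The four values above can be double-checked with a short script. Below is a minimal sketch in Python that recomputes them directly from the warm-up table; the helper names are our own, not part of the assignment's starter code.

```python
# Recompute H(Y) and I(Y; A), I(Y; B), I(Y; C) for the warm-up dataset.
import math
from collections import Counter

data = [  # rows of (A, B, C, Y)
    (1, 1, 0, 0), (1, 1, 2, 1), (1, 0, 0, 1), (1, 1, 2, 1),
    (0, 0, 2, 0), (0, 1, 1, 0), (0, 0, 0, 0),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(attr, rows):
    # I(Y; attr) = H(Y) - H(Y | attr), all logs base 2
    cond = sum(len(sub) / len(rows) * entropy([r[-1] for r in sub])
               for sub in ([r for r in rows if r[attr] == v]
                           for v in set(r[attr] for r in rows)))
    return entropy([r[-1] for r in rows]) - cond

print(f"H(Y)     = {entropy([r[-1] for r in data]):.4f}")  # 0.9852
for name, idx in (("A", 0), ("B", 1), ("C", 2)):
    print(f"I(Y; {name}) = {mutual_information(idx, data):.4f}")
# expected: 0.5216, 0.0202, 0.1981
```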
5. [1pt] Consider the dataset given above. Which attribute (A, B, or C) would a decision tree algorithm pick first to branch on, if its splitting criterion is mutual information?
Select one:
A
B
C
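Solution
A, since it has the highest mutual information with Y (0.5216 bits, versus 0.0202 for B and 0.1981 for C).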
6. [1pt] Consider the dataset given above. Which attribute would the algorithm pick second to branch on, if its splitting criterion is mutual information? (Hint: Notice that this question correctly presupposes that there is exactly one second attribute.)
Select one:
A
B
C
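Solution
C. The A = 0 branch is already pure (all Y = 0), and within the A = 1 branch I(Y ;C) = 0.3113 bits while I(Y ;B) = 0.1226 bits, so C is the only second attribute chosen.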
7. [1pt] If the same algorithm continues until the tree perfectly classifies the data, what would the depth of the tree be?
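Solution
3: the tree splits on A at depth 1, on C at depth 2 (within the A = 1 branch), and on B at depth 3 to separate the last two examples.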

8. [4pt] Draw your completed Decision Tree. Label the non-leaf nodes with which attribute the tree will split on (e.g. B), the edges with the value of the attribute (e.g. 1 or 0), and the leaf nodes with the classification decision (e.g. Y = 0).
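Solution (rendered as text; the branch A = 1, C = 1 contains no training examples and is omitted here, though an implementation with a majority-vote fallback would predict Y = 1 on that edge):

A = 1:
    C = 2: Y = 1
    C = 0:
        B = 1: Y = 0
        B = 0: Y = 1
A = 0: Y = 0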

1.2 Empirical Questions
The following questions should be completed as you work through the programming portion of this assignment (Section ??).
9. [2pt] Train and test your decision tree on the politician dataset and the education dataset with four different values of max-depth, {0,1,2,4}. Report your findings in the HW2 solutions template provided. A Decision Tree with max-depth 0 is simply a majority vote classifier; a Decision Tree with max-depth 1 is called a decision stump. If desired, you could even check that your answers for these two simple cases are correct using your favorite spreadsheet application (e.g. Excel, Google Sheets). (Please round each number to the fourth decimal place, e.g. 0.1234)

10. [3pt] For the politicians dataset, create a computer-generated plot showing error on the y-axis against depth of the tree on the x-axis. Plot both training error and testing error, clearly labeling which is which. That is, for each possible value of max-depth (0,1,2,…, up to the number of attributes in the dataset), you should train a decision tree and report train/test error of the model’s predictions.
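One possible way to produce this plot is sketched below in Python. It uses scikit-learn's DecisionTreeClassifier (with criterion="entropy") as a stand-in for your own implementation; the file names and the assumption that attributes are already encoded as 0/1 integers in tab-separated files with the label in the last column are hypothetical, so adapt them to your setup.

```python
# Sketch of the error-vs-depth plot for question 10. scikit-learn stands in
# for your own decision tree; file names and 0/1 tab-separated format are
# assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

def error_rate(y_true, y_pred):
    return float(np.mean(y_true != y_pred))

def train_and_test_error(X_tr, y_tr, X_te, y_te, depth):
    if depth == 0:
        # max-depth 0 is a majority-vote classifier (see question 9)
        majority = np.bincount(y_tr).argmax()
        return (error_rate(y_tr, np.full_like(y_tr, majority)),
                error_rate(y_te, np.full_like(y_te, majority)))
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    clf.fit(X_tr, y_tr)
    return (error_rate(y_tr, clf.predict(X_tr)),
            error_rate(y_te, clf.predict(X_te)))

# hypothetical file names; header row skipped, label in the last column
train = np.loadtxt("politicians_train.tsv", delimiter="\t", skiprows=1)
test = np.loadtxt("politicians_test.tsv", delimiter="\t", skiprows=1)
X_tr, y_tr = train[:, :-1], train[:, -1].astype(int)
X_te, y_te = test[:, :-1], test[:, -1].astype(int)

depths = list(range(X_tr.shape[1] + 1))  # 0, 1, ..., number of attributes
train_err, test_err = zip(*(train_and_test_error(X_tr, y_tr, X_te, y_te, d)
                            for d in depths))

plt.plot(depths, train_err, marker="o", label="train error")
plt.plot(depths, test_err, marker="o", label="test error")
plt.xlabel("max-depth")
plt.ylabel("error")
plt.legend()
plt.savefig("politicians_error_vs_depth.png")
```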

11. [2pt] Suppose your research advisor asks you to run some model selection experiments and then report your results. You select the Decision Tree model’s max-depth to be the one with the lowest test error in metrics.txt and then report that model’s test error as the performance of your classifier on held-out test data. Is this a good experimental setup? If so, why? If not, why not?
Solution
No, this is not a good setup. The test set is meant to provide an unbiased estimate of performance on unseen data. If we choose max-depth to minimize test error, the test set has effectively been used for model selection, so the reported test error is an optimistically biased estimate of generalization. Model selection should instead use a separate validation set (or cross-validation), reserving the test set for a single final evaluation.
12. [2pt] In this assignment, we used max-depth as our stopping criterion, and as a mechanism to prevent overfitting. Alternatively, we could stop splitting a node whenever the mutual information for the best attribute is lower than a threshold value. This threshold would be another hyperparameter. Theoretically, how would increasing this threshold value affect the number of nodes and depth of the learned trees?
Solution
Increasing the threshold makes the stopping criterion trigger earlier: a node becomes a leaf as soon as the best attribute’s mutual information falls below the threshold, and a larger threshold is crossed sooner. Therefore both the number of nodes and the depth of the learned trees will decrease (or at most stay the same) as the threshold increases.
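As a sketch of how small this change is in a typical recursive learner, the code below reuses the mutual_information helper from the Section 1.1 sketch above; the majority_label helper and the tuple-based tree representation are our own assumptions rather than the assignment's required interface.

```python
from collections import Counter

def majority_label(rows):
    # most common label among the rows reaching this node
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def grow_tree(rows, attributes, threshold):
    best = max(attributes, key=lambda a: mutual_information(a, rows),
               default=None)
    # stop splitting when no attribute remains, or when the best
    # attribute's mutual information falls below the threshold
    if best is None or mutual_information(best, rows) < threshold:
        return majority_label(rows)
    children = {v: grow_tree([r for r in rows if r[best] == v],
                             attributes - {best}, threshold)
                for v in set(r[best] for r in rows)}
    return (best, children)

# e.g. on the warm-up data: grow_tree(data, {0, 1, 2}, threshold=0.1)
```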
13. [2pt] From question 12, how would you set-up model training to choose the threshold value?
Solution
Treat the threshold as a hyperparameter: starting from 0, sweep over increasing threshold values, train a tree at each value, and evaluate each tree on a held-out validation set (not the test set, for the reasons given in question 11). Choose the threshold with the lowest validation error, then report performance once on the test set.
14. [3pt] Print (do not handwrite!) the decision tree that is produced by your algorithm for the politician data with max-depth 3. Instructions on how to print the tree can be found in Section ??.
