Description
1. Decision Trees as Interpretable Models
(a) Download the Acute Inflammations data from https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations.
(b) Build a decision tree on the whole data set and plot it.
(c) Convert the decision tree into a set of IF-THEN rules.
(d) Use cost-complexity pruning to find a minimal decision tree and a set of decision rules with high interpretability. (A sketch covering parts (b)-(d) follows this list.)
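For parts (b)-(d), a minimal sketch using scikit-learn. The DataFrame df and its "target" column are placeholders for however you load the Acute Inflammations file; the data set has two decision attributes, so pick one (or combine them) as your write-up requires.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text

# Placeholder: `df` holds the Acute Inflammations data with a column
# named "target" for the chosen decision attribute.
X, y = df.drop(columns=["target"]), df["target"]

# (b) Fit a tree on the whole data set and plot it.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
plot_tree(tree, feature_names=list(X.columns), filled=True)
plt.show()

# (c) export_text prints each root-to-leaf path, which maps directly
# to IF-THEN rules.
print(export_text(tree, feature_names=list(X.columns)))

# (d) Cost-complexity pruning: sweep the effective alphas and keep the
# smallest tree that still classifies the data well.
path = tree.cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"train acc={pruned.score(X, y):.3f}")
```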
2. The LASSO and Boosting for Regression
(a) Download the Communities and Crime data from https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. Use the first 1495 rows of data as the training set and the rest as the test set.
(b) The data set has missing values. Use a data imputation technique to deal with them. The data description marks some features as nonpredictive; ignore those features. (Illustrative code sketches for parts (b)-(j) follow this list.)
(c) Plot a correlation matrix for the features in the data set.
(d) Calculate the Coefficient of Variation CV = s/m for each feature, where s is the sample standard deviation and m is the sample mean.
(e) Pick the ⌊√128⌋ features with the highest CV, and make scatter plots and box plots for them. Can you draw conclusions about the significance of those features just from the scatter plots?
(f) Fit a linear model using least squares to the training set and report the test error.
(g) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
(h) Fit a LASSO model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with a list of the variables selected by the model. Repeat with standardized features. Report the test error for both cases and compare them.
(i) Fit a PCR model on the training set, with M (the number of principal components) chosen by cross-validation. Report the test error obtained.
(j) In this section, we would like to fit a boosting tree to the data. As in classification trees, one can use any type of regression at each node to build a multivariate regression tree. Because the number of variables is large in this problem, one can use L1-penalized regression at each node. Such a tree is called an L1-penalized gradient boosting tree. You can use XGBoost to fit the model. Determine α (the regularization term) using cross-validation. (See the boosting sketch below.)
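For parts (b) and (c), a minimal sketch. The file name, the use of "?" as the missing-value marker, and the treatment of the first five columns (state, county, community, communityname, fold) as the nonpredictive ones follow the data description in communities.names; median imputation is one reasonable choice, since the assignment leaves the technique open.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

df = pd.read_csv("communities.data", header=None, na_values="?")
df = df.drop(columns=df.columns[:5])   # drop the nonpredictive columns

train, test = df.iloc[:1495], df.iloc[1495:]

# (b) Median imputation, fit on the training rows and applied to both splits.
imputer = SimpleImputer(strategy="median")
train_imp = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_imp = pd.DataFrame(imputer.transform(test), columns=test.columns)

# (c) Correlation matrix of the features (the last column is the target).
corr = train_imp.iloc[:, :-1].corr()
plt.matshow(corr)
plt.colorbar()
plt.show()
```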
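For parts (d) and (e), continuing from train_imp above. The data set has 128 attributes, so ⌊√128⌋ = 11 features are kept; ranking by |CV| is an assumption that guards against near-zero means flipping the sign.

```python
import numpy as np
import matplotlib.pyplot as plt

X_train, y_train = train_imp.iloc[:, :-1], train_imp.iloc[:, -1]

# (d) Coefficient of Variation CV = s/m for each feature.
cv = X_train.std() / X_train.mean()

# (e) The floor(sqrt(128)) = 11 features with the highest CV.
top = cv.abs().sort_values(ascending=False).head(int(np.sqrt(128))).index

# Scatter plots of each selected feature against the target.
fig, axes = plt.subplots(3, 4, figsize=(14, 9))
for ax, col in zip(axes.ravel(), top):
    ax.scatter(X_train[col], y_train, s=5)
    ax.set_title(str(col))
plt.tight_layout()
plt.show()

# Box plots for the same features.
X_train[top].boxplot()
plt.show()
```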
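For parts (f)-(h), a sketch with scikit-learn's cross-validated estimators, continuing from the frames above; the alpha grid and the 5-fold CV are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X_test, y_test = test_imp.iloc[:, :-1], test_imp.iloc[:, -1]

# (f) Least squares.
ols = LinearRegression().fit(X_train, y_train)
print("OLS test MSE:", mean_squared_error(y_test, ols.predict(X_test)))

# (g) Ridge with lambda chosen by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-4, 2, 50), cv=5).fit(X_train, y_train)
print("Ridge test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))

# (h) LASSO with lambda chosen by cross-validation; the variables with
# nonzero coefficients are the ones the model selects.
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
print("LASSO test MSE:", mean_squared_error(y_test, lasso.predict(X_test)))
print("Selected:", list(X_train.columns[lasso.coef_ != 0]))

# Repeat on standardized features (scaler fit on the training set only).
scaler = StandardScaler().fit(X_train)
lasso_std = LassoCV(cv=5, random_state=0).fit(scaler.transform(X_train), y_train)
print("Standardized LASSO test MSE:",
      mean_squared_error(y_test, lasso_std.predict(scaler.transform(X_test))))
```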
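For part (i), principal components regression built as a pipeline so that cross-validating over M also refits the PCA inside each fold; the standardization step is an assumption, since PCR is usually run on standardized features. This continues from the imports and frames above.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# (i) PCR: standardize -> PCA -> least squares, with M chosen by CV.
pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("ols", LinearRegression())])
grid = GridSearchCV(pcr,
                    {"pca__n_components": range(1, X_train.shape[1] + 1)},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)
print("Best M:", grid.best_params_["pca__n_components"])
print("PCR test MSE:", mean_squared_error(y_test, grid.predict(X_test)))
```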
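For part (j), one reading of the hint: fit an XGBoost regressor and choose its L1 regularization term (reg_alpha, the assignment's α) by cross-validation. The grid values and the default tree booster are assumptions; this tunes XGBoost's built-in L1 penalty rather than literally running a LASSO at every node.

```python
from xgboost import XGBRegressor

# (j) Cross-validate the L1 regularization term alpha.
search = GridSearchCV(XGBRegressor(),
                      {"reg_alpha": np.logspace(-5, 1, 20)},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print("Best alpha:", search.best_params_["reg_alpha"])
print("Boosting test MSE:",
      mean_squared_error(y_test, search.predict(X_test)))
```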