Homework 2: Classification and Bias-Variance Trade-offs
Introduction
As a general note, for classification problems we imagine that we have the input matrix $X \in \mathbb{R}^{N \times D}$ (or perhaps the inputs have been mapped to some basis $\Phi$, without loss of generality) with outputs now “one-hot encoded.” This means that if there are $K$ output classes, rather than representing the output label $y$ as an integer $1, 2, \ldots, K$, we represent $y$ as a “one-hot” vector of length $K$. A “one-hot” vector is defined as having every component equal to 0 except for a single component which has value equal to 1. For example, if there are $K = 7$ classes and a particular data point belongs to class 3, then the target vector for this data point would be $y = [0, 0, 1, 0, 0, 0, 0]$. We will define $C_1$ to be the one-hot vector for the 1st class, $C_2$ for the 2nd class, etc. Thus, in the previous example $y = C_3$. If there are $K$ total classes, then the set of possible labels is $\{C_1, C_2, \ldots, C_K\}$. Throughout the assignment we will assume that each label $y \in \{C_1, \ldots, C_K\}$ unless otherwise specified. The most common exception is the case of binary classification ($K = 2$), in which case labels are the typical integers $y \in \{0, 1\}$.
Please type your solutions after the corresponding problems using this LATEX template, and start each problem on a new page.
Please submit your LATEX file and code files to the Gradescope assignment ‘HW2 – Supplemental’.
Problem 1 (Exploring Bias and Variance, 10 pts)
In this problem, we will explore the bias and variance of a few different model classes in the context of logistic regression.
Consider the true data generating process $y \sim \mathrm{Bern}(f(x))$, $f(x) = 0.4\sin(1.2x) + 0.5$, where $x \in [-3, 3]$ and $y \in \{0, 1\}$. Recall that for a given $x$, bias and variance are defined in terms of expectations over randomly drawn datasets $D$ from this underlying data distribution:
\[
\text{Bias}[\hat{f}(x)] = \mathbb{E}_D[\hat{f}(x)] - f(x)
\]
\[
\text{Variance}[\hat{f}(x)] = \mathbb{E}_D\left[\left(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]\right)^2\right]
\]
Here, $\hat{f}(x)$ is our estimator (learned through logistic regression on a given dataset $D$). We will directly explore the bias-variance trade-off by drawing multiple such datasets and fitting different logistic regression models to each. Remember that we, the modelers, do not usually see the true data distribution. Knowledge of the true $f(x)$ is only exposed in this problem to (1) make possible the simulation of drawing multiple datasets, and (2) serve as a pedagogical tool in allowing verification of the true bias.
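For concreteness, the following NumPy sketch (not part of the starter code) shows how bias and variance can be estimated empirically on a grid of inputs once you have predictions from models fit to several datasets. The `predictions` array here is only a stand-in for the 10 fitted prediction functions you will produce in Question 1 below.

\begin{verbatim}
import numpy as np

def f_true(x):
    # True data-generating probability from the problem statement.
    return 0.4 * np.sin(1.2 * x) + 0.5

x_grid = np.linspace(-3, 3, 200)

# Stand-in predictions for illustration only; in the homework these would be
# sigmoid(w^T phi(x)) evaluated on x_grid for each of the 10 fitted models.
rng = np.random.default_rng(0)
predictions = f_true(x_grid) + 0.05 * rng.standard_normal((10, x_grid.size))

mean_prediction = predictions.mean(axis=0)                       # E_D[f_hat(x)]
bias = mean_prediction - f_true(x_grid)                          # Bias[f_hat(x)]
variance = ((predictions - mean_prediction) ** 2).mean(axis=0)   # Var[f_hat(x)]

print(np.abs(bias).mean(), variance.mean())
\end{verbatim}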
1. Consider the three bases $\phi_1(x) = [1, x]$, $\phi_2(x) = [1, x, x^2]$, $\phi_3(x) = [1, x, x^2, x^3, x^4, x^5]$. For each of these bases, generate 10 datasets of size $N = 30$ using the starter code provided, and fit a logistic regression model using $\mathrm{sigmoid}(w^\top \phi(x))$ to each dataset by using gradient descent to minimize the negative log likelihood. This means you will be running gradient descent 10 times for each basis, once for each dataset. Note that the classes are represented with 0’s and 1’s.
Use random starting values of $w$, $\eta = 0.001$, take 10,000 update steps for each gradient descent run, and make sure to average the gradient over the data points (for each step). These parameters, while not perfect, will ensure your code runs in a reasonable amount of time. The emphasis of this problem is on capturing the bias-variance trade-off, so don’t worry about attaining perfect precision in the gradient descent as long as this trade-off is captured in the final models. (A minimal sketch of one such gradient-descent loop appears after this problem’s questions.)
3. How are bias and variance reflected in the 3 types of curves on the graphs? How do the fits of the individual and mean prediction functions change? Keeping in mind that none of the model classes match the true generating process exactly, discuss the extent to which each of the bases approximates the true process.
Note: In this problem, we are not interested in whether the model is more biased for certain inputs $x$ compared to other inputs $x'$. We are interested in the overall bias and variance of $\hat{f}(x)$ across the different basis choices. In other words, we want to investigate how the bias between $\hat{f}(x)$ and the ground truth, as well as the variance of $\hat{f}(x)$, will be different over different basis choices.
4. If we were to increase the size of each dataset drawn from N = 30 to a larger number, how would the variance change? The bias? Why might this be the case?
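As referenced above, here is a minimal sketch of one such gradient-descent fit, assuming a dataset (xs, ys) with labels in {0, 1} has already been generated. The function names (make_basis, fit_logistic) are illustrative and not the starter code’s actual interface.

\begin{verbatim}
import numpy as np

def make_basis(xs, degree):
    # phi(x) = [1, x, x^2, ..., x^degree]
    return np.stack([xs ** d for d in range(degree + 1)], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(xs, ys, degree, eta=0.001, n_steps=10_000, seed=0):
    Phi = make_basis(xs, degree)              # N x D design matrix
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Phi.shape[1])     # random starting weights
    for _ in range(n_steps):
        p = sigmoid(Phi @ w)                  # predicted P(y = 1 | x)
        grad = Phi.T @ (p - ys) / len(ys)     # gradient of the averaged NLL
        w -= eta * grad
    return w

# Toy usage with a dataset of size N = 30 drawn from the true process:
rng = np.random.default_rng(1)
xs = rng.uniform(-3, 3, size=30)
ys = (rng.uniform(size=30) < 0.4 * np.sin(1.2 * xs) + 0.5).astype(float)
print(fit_logistic(xs, ys, degree=2))         # degree=2 corresponds to phi_2
\end{verbatim}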
Solution
Problem 2 (Maximum likelihood in classification, 15pts)
Consider now a generative $K$-class model. We adopt class prior $p(y = C_k; \pi) = \pi_k$ for all $k \in \{1, \ldots, K\}$ (where $\pi_k$ is a parameter of the prior). Let $p(x \mid y = C_k)$ denote the class-conditional density of features $x$ (in this case for class $C_k$). Consider the data set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where as above each $y_i$ is encoded as a one-hot target vector and the data are independent.
1. Write out the log-likelihood of the data set, $\ln p(D; \pi)$.
2. Since the prior forms a distribution, it has the constraint that $\sum_k \pi_k - 1 = 0$. Using the hint on Lagrange multipliers below, give the expression for the maximum-likelihood estimator for the prior class-membership probabilities, i.e. $\hat{\pi}_k$. Make sure to write out the intermediary equation you need to solve to obtain this estimator. Briefly state why your final answer is intuitive.
For the remaining questions, let the class-conditional probabilities be Gaussian distributions with the same covariance matrix
\[
p(x \mid y = C_k) = \mathcal{N}(x \mid \mu_k, \Sigma), \quad \text{for } k \in \{1, \ldots, K\},
\]
and different means $\mu_k$ for each class.
3. Derive the gradient of the log-likelihood with respect to the vector $\mu_k$. Write the expression in matrix form as a function of the variables defined throughout this exercise. Simplify as much as possible for full credit.
4. Derive the maximum-likelihood estimator $\hat{\mu}_k$ for the vector $\mu_k$. Briefly state why your final answer is intuitive.
5. Derive the gradient of the log-likelihood with respect to the covariance matrix $\Sigma$ (i.e., looking to find an MLE for the covariance). Since you are differentiating with respect to a matrix, the resulting expression should be a matrix!
6. Derive the maximum-likelihood estimator $\hat{\Sigma}$ of the covariance matrix.
Hint: Lagrange Multipliers. Lagrange Multipliers are a method for optimizing a function f with respect to an equality constraint, i.e.
\[
\min_x f(x) \quad \text{s.t.} \quad g(x) = 0.
\]
This can be turned into an unconstrained problem by introducing a Lagrange multiplier λ and constructing the Lagrangian function,
L(x,λ) = f(x) + λg(x).
It can be shown that it is a necessary condition that the optimum is a critical point of this new function. We can find this point by solving two equations:
\[
\frac{\partial L(x, \lambda)}{\partial x} = 0 \quad \text{and} \quad \frac{\partial L(x, \lambda)}{\partial \lambda} = 0.
\]
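To make the recipe concrete, here is a small worked example on a toy problem unrelated to the homework quantities: minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$.
\[
L(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1)
\]
\[
\frac{\partial L}{\partial x} = 2x + \lambda = 0, \qquad
\frac{\partial L}{\partial y} = 2y + \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = x + y - 1 = 0.
\]
The first two equations give $x = y = -\lambda/2$; substituting into the constraint yields $\lambda = -1$ and hence $x = y = 1/2$, the constrained minimizer.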
Cookbook formulas. Here are some formulas you might want to consider using to compute difficult gradients. You can use them in the homework without proof. If you are looking to hone your matrix calculus skills, try to find different ways to prove these formulas yourself (will not be part of the evaluation of this homework). In general, you can use any formula from the matrix cookbook, as long as you cite it. We opt for the following common notation: $X^{-\top} := (X^\top)^{-1}$
\[
\frac{\partial\, a^\top X^{-1} b}{\partial X} = -X^{-\top} a b^\top X^{-\top}
\]
\[
\frac{\partial \ln |\det(X)|}{\partial X} = X^{-\top}
\]
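If you want to sanity-check these identities numerically (optional, not part of the submission), a finite-difference comparison along the following lines should agree with the second formula on a random positive-definite matrix:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T + 4 * np.eye(4)            # positive definite, so det(X) > 0

def log_abs_det(M):
    return np.linalg.slogdet(M)[1]     # ln|det(M)|, computed stably

eps = 1e-6
numeric_grad = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric_grad[i, j] = (log_abs_det(X + E) - log_abs_det(X - E)) / (2 * eps)

analytic_grad = np.linalg.inv(X).T     # X^{-T}
print(np.max(np.abs(numeric_grad - analytic_grad)))   # should be very small
\end{verbatim}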
Solution
Problem 3 (Classifying Stars, 15pts)
You’re tasked with classifying three different kinds of stars using their magnitudes and temperatures.
See star.png for a plot of the data, which are adapted from http://astrosci.scimuze.com/stellar_data.htm and available as data/hr.csv in the GitHub repository.
The CSV file has three columns: type, magnitude, and temperature. The first few lines look like this:
Type,Magnitude,Temperature
Dwarf,-5.8,-0.35
Dwarf,-4.1,-0.31 …
In this problem, you will code up 4 different classifiers for this task:
a) A three-class generalization of logistic regression, also known as softmax regression, in which you implement gradient descent on the negative log-likelihood. In Question 2 you will explore the effect of using different values for the learning rate $\eta$ (self.eta) and regularization strength $\lambda$ (self.lam). Make sure to include a bias term and to use L2 regularization. See CS181 Textbook’s Chapter 3.6 for details on multi-class logistic regression and softmax. For your implementation, use the loss and gradient expressions provided there. (A minimal sketch of this setup appears after this list.)
b) A generative classifier with Gaussian class-conditional densities with a shared covariance matrix across all classes. Feel free to re-use your Problem 2 results.
c) Another generative classifier with Gaussian class-conditional densities, but now with a separate covariance matrix learned for each class. (Note: The staff implementation can switch between the two Gaussian generative classifiers with just a few lines of code.)
d) A kNN classifier in which you classify based on the k = 1,3,5 nearest neighbors and the following distance function:
\[
\operatorname{dist}(\text{star}_1, \text{star}_2) = \left(\frac{\text{mag}_1 - \text{mag}_2}{3}\right)^2 + (\text{temp}_1 - \text{temp}_2)^2,
\]
where nearest neighbors are those with the smallest distances from a given point.
Note 2: The grid of points for which you are making predictions should be interpreted as our test space. Thus, if a test point happens to lie exactly on top of a training point, you do not need to make the test point ignore that training point when selecting its nearest neighbors.
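As referenced in part (a), here is a minimal sketch of a softmax-regression fit with a bias term and L2 regularization. It assumes a feature matrix X of shape (N, 2) (magnitude, temperature) and integer labels y in {0, 1, 2}; the names and hyperparameter values are illustrative rather than the starter code’s actual interface.

\begin{verbatim}
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # stabilize the exponentials
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax(X, y, n_classes=3, eta=0.001, lam=0.01, n_iters=10_000, seed=0):
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend a bias column
    Y = np.eye(n_classes)[y]                  # one-hot targets, shape (N, K)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((Xb.shape[1], n_classes)) * 0.01
    for _ in range(n_iters):
        P = softmax(Xb @ W)                   # predicted class probabilities
        grad = Xb.T @ (P - Y) + 2 * lam * W   # NLL gradient plus L2 penalty
        W -= eta * grad
    return W
\end{verbatim}

Whether to average the gradient over data points and whether to exempt the bias row from the penalty are modeling choices; follow the textbook’s expressions for your actual implementation.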
After implementing the above classifiers, complete the following exercises:
1. Plot the decision boundaries generated by each classifier for the dataset. Include them in your PDF. Identify the similarities and differences among the classifiers. What explains the differences?
2. For logistic regression only, make a plot with “Number of Iterations” on the x-axis and “Negative Log-Likelihood Loss” on the y-axis for several configurations of the hyperparameters η and λ. Specifically, try the values 0.05, 0.01, and 0.001 for each hyperparameter. Limit the number of gradient descent iterations to 200,000. What are your final choices of learning rate (η) and regularization strength (λ), and why are they reasonable? How does altering these hyperparameters affect the ability to converge, the rate of convergence, and the final loss (a qualitative description is sufficient)? You only need to submit one plot for your final choices of hyperparameters.
Note: The likelihood of the model is the probability of data given the model—it should not include the regularization term. The objective is the combination of the likelihood and the regularizer.
3. For both Gaussian generative models, report the negative log-likelihood loss. Which model has a lower loss, and why? For the separate-covariance model, be sure to use the covariance matrix that matches the true class of each data point. (A minimal sketch of one way to compute this loss appears after this list.)
4. Consider a star with Magnitude 6 and Temperature 2. To what class does each classifier assign this star? Do the classifiers give any indication as to whether or not you should trust them?
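As referenced in exercise 3, here is a minimal sketch of one way to compute the negative log-likelihood of a fitted Gaussian generative model, using the joint density $p(x, y)$; check the starter code for the exact convention expected. It assumes integer class labels y in {0, ..., K-1} and fitted arrays priors[k], means[k], and covariances[k] (identical across k in the shared-covariance case); these names are illustrative.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def generative_nll(X, y, priors, means, covariances):
    # Sum over points of -ln p(x_i, y_i) = -(ln pi_k + ln N(x_i | mu_k, Sigma_k)),
    # where k is the true class of point i.
    nll = 0.0
    for x_i, k in zip(X, y):
        log_joint = np.log(priors[k]) + multivariate_normal.logpdf(
            x_i, mean=means[k], cov=covariances[k])
        nll -= log_joint
    return nll
\end{verbatim}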
Implementation notes: Run the controller file, T2_P3.py, to test your code. Write the actual implementations in the GaussianGenerativeModel, LogisticRegression, and KNNModel classes, which are defined in the three T2_P3_ModelName.py files. These classes follow the same interface pattern as sklearn. Their code currently outputs nonsense predictions just to show the high-level interface, so you should replace their predict() implementations. You’ll also need to modify the hyperparameter values in T2_P3.py for logistic regression.
Solution
Whom did you work with, and did you use any resources beyond cs181-textbook and your notes?
Calibration
Approximately how long did this homework take you to complete (in hours)?