Homework 1: Regression
Introduction
This homework is on different forms of linear regression and focuses on loss functions, optimizers, and regularization. Linear regression will be one of the few models that we see that has an analytical solution. These problems focus on deriving these solutions and exploring their properties.
If you find that you are having trouble with the first couple of problems, we recommend going over the fundamentals of linear algebra and matrix calculus (see links on the course website). The relevant parts of the cs181-textbook notes are Sections 2.1–2.7. We strongly recommend reading the textbook before beginning the homework.
We also encourage you to first read the Bishop textbook, particularly: Section 2.3 (Properties of Gaussian Distributions), Section 3.1 (Linear Basis Regression), and Section 3.3 (Bayesian Linear Regression). (Note that our notation is slightly different but the underlying mathematics remains the same!).
LaTeX Basics and a LaTeX tutorial with exercises are available in Overleaf.
Homeworks will be submitted through Gradescope. You will be added to the course Gradescope once you join the course Canvas page. If you haven’t received an invitation, contact the course staff through Ed.
Please submit the writeup PDF to the Gradescope assignment ‘HW1’. Remember to assign pages for each question.
Please submit your LaTeX file and code files to the Gradescope assignment ‘HW1 – Supplemental’. Your files should be named in the same way as we provide them in the repository, e.g. T1_P1.py, etc.
Problem 1 (Optimizing a Kernel, 15pts)
Kernel-based regression techniques are similar to nearest-neighbor regressors: rather than fit a parametric model, they predict values for new data points by interpolating values from existing points in the training set. In this problem, we will consider a kernel-based regressor of the form
f(x∗) = Σn K(xn, x∗) yn,
where (xn, yn) are the training data points, and K(x, x′) is a kernel function that defines the similarity between two inputs x and x′. Assume that each xi is represented as a column vector, i.e. a D by 1 vector where D is the number of features for each data point. A popular choice of kernel is a function that decays as the distance between the two points increases, such as
K(x, x′) = exp(−(x − x′)⊤(x − x′)/τ),
where τ represents the square of the lengthscale (a scalar value). In this problem, we will consider optimizing what that (squared) lengthscale should be.
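As a quick illustration of this setup, here is a minimal Python sketch of the kernel and predictor, assuming exactly the forms written above. It is not the provided T1_P1.py starter code, and the names kernel and f_hat are illustrative.

import numpy as np

def kernel(x, x_prime, tau):
    # K(x, x') = exp(-(x - x')^T (x - x') / tau); works for scalars or column vectors.
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(diff.ravel(), diff.ravel()) / tau))

def f_hat(x_star, xs, ys, tau):
    # Prediction at x_star: kernel-weighted sum of the training targets y_n.
    return sum(kernel(x_n, x_star, tau) * y_n for x_n, y_n in zip(xs, ys))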
1. Let {(x1, y1), …, (xN, yN)} be our training data set. Suppose we are interested in minimizing the residual sum of squares. Write down this loss over the training data, L(τ), as a function of τ.
Important: When computing the prediction f(xi) for a point xi in the training set, carefully consider for which points x′ you should be including the term K(xi,x′) in the sum.
2. Take the derivative of the loss function with respect to τ.
3. Consider the following data set:
x , y
0 , 0
1 , 0.5
2 , 1
3 , 2
4 , 1
6 , 1.5
8 , 0.5
And the following lengthscales: τ = 0.01, τ = 2, and τ = 100. For each lengthscale, compute the loss L(τ) over the training data and note which lengthscale gives the lowest loss. (A minimal computation and plotting sketch appears after this problem.)
4. Plot the function (x∗, f(x∗)) for each of the lengthscales above, with x∗ on the x-axis and the prediction f(x∗) on the y-axis. For the test inputs x∗, use an evenly spaced grid with spacing 0.1 between x∗ = 0 and x∗ = 12. (Note: it is possible that a test input x∗ lands right on top of one of the training inputs above. You can still use the formula!)
Initial impressions: Briefly describe what happens in each of the three cases. Is what you see consistent with which lengthscale appeared to be numerically best above? Describe why or why not.
5. Bonus: Code up a gradient descent to optimize the kernel for the data set above. Start your gradient descent from τ = 2. Report on what you find.
Note: Gradient descent is discussed in Section 3.4 of the cs181-textbook notes and Section 5.2.4 of Bishop, and will be covered later in the course!
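For items 3 and 4, the following is a minimal, self-contained sketch (again assuming the predictor form above; it is not the provided starter code). It computes the loss for each lengthscale, excluding the K(xi, xi) term as suggested by the hint in item 1, and plots the predictions over the test grid.

import numpy as np
import matplotlib.pyplot as plt

xs = np.array([0., 1., 2., 3., 4., 6., 8.])
ys = np.array([0., 0.5, 1., 2., 1., 1.5, 0.5])

def K(x, xp, tau):
    return np.exp(-(x - xp) ** 2 / tau)    # scalar inputs, so the squared distance is (x - x')^2

def predict(x_star, tau, exclude=None):
    # Kernel-weighted sum of the targets; optionally skip one training index.
    return sum(K(xs[j], x_star, tau) * ys[j]
               for j in range(len(xs)) if j != exclude)

def loss(tau):
    # Residual sum of squares over the training set, leaving out x_i itself.
    return sum((ys[i] - predict(xs[i], tau, exclude=i)) ** 2
               for i in range(len(xs)))

for tau in (0.01, 2.0, 100.0):
    print(f"tau = {tau}: loss = {loss(tau):.3f}")

x_grid = np.arange(0.0, 12.0 + 1e-9, 0.1)          # even grid, spacing 0.1
for tau in (0.01, 2.0, 100.0):
    plt.plot(x_grid, [predict(x, tau) for x in x_grid], label=f"tau = {tau}")
plt.scatter(xs, ys, c="k", label="training data")
plt.xlabel("x*")
plt.ylabel("f(x*)")
plt.legend()
plt.show()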
Problem 2 (Kernels and kNN, 10pts)
Now, let us compare the kernel-based approach to an approach based on nearest-neighbors. Recall that kNN uses a predictor of the form
f(x∗) = (1/k) Σn yn I(xn is one of the k-closest to x∗),
where I is an indicator variable. For this problem, you will use the same dataset and kernel as in Problem 1.
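The following is a minimal sketch of such a kNN regressor (illustrative, not the provided starter code), assuming the kernel value is used directly as the similarity and ties are broken by sort order.

import numpy as np

def knn_predict(x_star, xs, ys, k, tau=1.0):
    # Similarity under the Problem 1 kernel; a larger value means closer.
    sims = np.exp(-(np.asarray(xs, dtype=float) - x_star) ** 2 / tau)
    nearest = np.argsort(-sims)[:k]     # indices of the k most similar training points
    return float(np.mean(np.asarray(ys, dtype=float)[nearest]))

For example, knn_predict(3.5, xs, ys, k=3) averages the targets of the three training points most similar to x∗ = 3.5.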
Make sure to include all required plots in your PDF.
1. Implement kNN for k = {1,3,N − 1} where N is the size of the dataset, then plot the results for each k. To find the distance between points, use the kernel function from Problem 1 with lengthscale τ = 1.
As before, plot x∗ on the x-axis and the prediction f(x∗) on the y-axis. For the test inputs x∗, use an evenly spaced grid with spacing 0.1 between x∗ = 0 and x∗ = 12. (Like in Problem 1, if a test point lies on top of a training input, use the formula without excluding that training input.)
2. Describe what you see: What is the behavior of the functions in these three plots? How does it compare to the behavior of the functions in the three plots from Problem 1? Are there situations in which kNN and kernel-based regression interpolate similarly? Extrapolate similarly? Based on what you see, do you believe there exist some values of k and τ for which the kNN and kernel-based regressors produce the exact same regressor (i.e., given any point x, the two regressors will produce the same prediction f(x))? Explain your answer.
3. Why did we not vary τ for the kNN approach?
Problem 3 (Deriving Linear Regression, 10pts)
The solution for least squares linear regression “looks” kind of like a ratio of covariance and variance terms. In this problem, we will make that connection more explicit.
We will consider fitting data drawn from a joint distribution p(x, y) over scalar random variables x and y with the best linear model possible, that is, a linear model of the form ŷ = wx that minimizes the expected squared loss Ex,y[(y − ŷ)2].
Notes: The notation Ex,y indicates an expectation taken over the joint distribution p(x,y). Since x and y are scalars, w is also a scalar.
1. Derive an expression for the optimal w, that is, the w that minimizes the expected squared loss above. You should leave your answer in terms of moments of the distribution, e.g. terms like Ex[x], Ex[x2], Ey[y], Ey[y2], Ex,y[xy] etc.
2. Provide unbiased and consistent formulas to estimate Ex,y[yx] and Ex[x2] given observed data {(x1, y1), …, (xN, yN)}. (A minimal sketch of these plug-in estimators appears after this problem.)
3. In general, moment terms like Ex,y[yx], Ex[x2], etc. can easily be estimated from the data (like you did above). If you substitute in these empirical moments, how does your expression for the optimal w∗ in this problem compare with the optimal w∗ that we see in Section 2.6 of the cs181-textbook?
4. Many common probabilistic linear regression models assume that variables x and y are jointly Gaussian. Did any of your above derivations rely on the assumption that x and y are jointly Gaussian? Why or why not?
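As a small illustration of item 2, sample means are the usual plug-in estimators of these moments; the helper below is hypothetical and not part of any starter code.

import numpy as np

def estimate_moments(x, y):
    # Plug-in (sample-mean) estimators from observed pairs (x_n, y_n);
    # both are unbiased and consistent for the corresponding moments.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    E_xy = np.mean(x * y)     # (1/N) * sum_n x_n y_n   estimates Ex,y[yx]
    E_x2 = np.mean(x ** 2)    # (1/N) * sum_n x_n^2     estimates Ex[x^2]
    return E_xy, E_x2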
Problem 4 (Modeling Changes in Republicans and Sunspots, 15pts)
The objective of this problem is to learn about linear regression with basis functions by modeling the number of Republicans in the Senate. The file data/year-sunspots-republicans.csv contains the data you will use for this problem. It has three columns. The first one is an integer that indicates the year. The second is the number of Sunspots observed in that year. The third is the number of Republicans in the Senate for that year. The data file looks like this:
Year,Sunspot_Count,Republican_Count
1960,112.3,36
1962,37.6,34
1964,10.2,32
1966,47.0,36
You can see scatterplots of the data in the figures below. The horizontal axis is the Year, and the vertical axis is the Number of Republicans and the Number of Sunspots, respectively.
(Data Source: http://www.realclimate.org/data/senators_sunspots.txt)
Make sure to include all required plots in your PDF.
1. In this problem you will implement ordinary least squares regression using 4 different basis functions for Year (x-axis) v. Number of Republicans in the Senate (y-axis). Some starter Python code that implements simple linear regression is provided in T1_P4.py.
First, plot the data and regression lines for each of the following sets of basis functions, and include the generated plot as an image in your submission PDF. You will therefore make 4 total plots:
(a) ϕj(x) = x^j for j = 1, …, 5, i.e., use basis y = a1x + a2x^2 + a3x^3 + a4x^4 + a5x^5 for some constants {a1, …, a5}.
(b) ϕj(x) = exp(−(x − µj)^2/25) for µj = 1960, 1965, 1970, 1975, …, 2010
(c) ϕj(x) = cos(x/j) for j = 1,…,5
(d) ϕj(x) = cos(x/j) for j = 1,…,25
* Note: Please make sure to add a bias term for all your basis functions above in your implementation of the make_basis function in T1_P4.py.
Second, for each plot, include the residual sum-of-squares error. Submit the generated plot and residual sum-of-squares error for each basis in your LaTeX write-up; a minimal least-squares sketch follows below.
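For reference, here is a minimal, self-contained least-squares sketch. It is independent of the provided T1_P4.py starter code and its make_basis function: it builds a design matrix with a bias column for bases (a), (c), and (d) and reports the residual sum-of-squares error. Basis (b) follows the same pattern with its year-centered features.

import numpy as np

def make_phi(x, basis):
    x = np.asarray(x, dtype=float)
    if basis == "a":
        cols = [x ** j for j in range(1, 6)]             # x, x^2, ..., x^5
    elif basis == "c":
        cols = [np.cos(x / j) for j in range(1, 6)]      # cos(x/j), j = 1..5
    elif basis == "d":
        cols = [np.cos(x / j) for j in range(1, 26)]     # cos(x/j), j = 1..25
    else:
        raise ValueError(f"unknown basis {basis!r}")
    return np.column_stack([np.ones_like(x)] + cols)     # bias term in the first column

def fit_and_rss(x, y, basis):
    Phi = make_phi(x, basis)
    y = np.asarray(y, dtype=float)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # ordinary least squares
    residuals = y - Phi @ w
    return w, float(np.sum(residuals ** 2))              # weights and RSS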
2. Repeat the same exact process as above but for Number of Sunspots (x-axis) v. Number of Republicans in the Senate (y-axis). Now, however, only use data from before 1985, and only use basis functions (a), (c), and (d) – ignore basis (b). You will therefore make 3 total plots. For each plot make sure to also include the residual sum of squares error.
Which of the three bases (a, c, d) provided the “best” fit? Choose one, and keep in mind the generalizability of the model.
Given the quality of this fit, do you believe that the number of sunspots controls the number of Republicans in the Senate (Yes or No)?
Collaborators and Resources
Whom did you work with, and did you use any resources beyond the cs181-textbook and your notes?
Calibration
Approximately how long did this homework take you to complete (in hours)?