Description
Regression
Antoni Chan
In this programming assignment you will implement and test some of the regression methods that were introduced in the lecture and in the problem sets. Let f(x,θ) be a function with input x ∈ Rd and parameters θ ∈ RD, and
f(x,θ) = φ(x)T θ, (1)
where φ(x) : Rd → RD is a feature transformation of x. For example, the K-th order polynomial with input x ∈ R can be expressed as
K
f(x,θ) = Xxkθk = φ(x)T θ, (2)
k=0
where the feature transformation and parameters are
. (3)
Our goal is to obtain the best estimate of the function given iid samples D = {(x1,y1),…,(xn,yn)}, where yi are the noisy observations of f(xi,θ). We have seen several ways of doing the regression, using different assumed noise models and formulations. For convenience, define the following quan-
tities,
, . (4)
Here is a summary of the various regression algorithms we have seen so far:
method objective function parameter estimate θˆ prediction f∗ for input x∗
least-squares (LS) θˆLS = (ΦΦT)−1Φy f∗ = φ(x∗)Tθˆ
regularized LS (RLS) θˆRLS = (ΦΦT + λI)−1Φy f∗ = φ(x∗)Tθˆ
L1-regularized LS (LASSO) QP solver (see Prob. 3.13) f∗ = φ(x∗)Tθˆ
robust regression (RR) LP solver (see Prob. 2.10) f∗ = φ(x∗)Tθˆ
distributions posterior predictive
Bayesian regression (BR) θ ∼ N(0,αI) θ|X,y ∼ N(µˆθ,σˆθ2)
y|x,θ ∼ N(f(x,θ),σ2) µˆ∗ = φ(x∗)Tµˆθ
Part 1 Polynomial function
In the first problem, you will test these regression methods on a simple dataset, and look at the effects of using different objective functions, regularization terms, and formulations. The MATLAB data file poly data.mat (or poly data *.txt if you are not using MATLAB) contains a set of samples generated from a 5th order polynomial function, with observation noise σ2 = 5. The file contains the following variables:
• sampx – sample input values (xi), each entry is an input value.
• sampy – sample output values (yi), each entry is an output.
• polyx – input values for the true function (also these are the test inputs x∗).
• polyy – output values for the true function.
• thtrue – the true value of the θ used to generate the data.
A plot of the samples (sampx,sampy) and the true polynomial function (polyx, polyy) are shown below.
The goal is to try to estimate the polynomial function from the samples.
(a) Implement the above 5 regression algorithms for the K-th order polynomial given in (2). In the next problem, you will use these regression methods with a different feature transformation φ(x). Hence, it would be better to separate the regression algorithm and the feature transformation in your implementation.
(b) For each regression method, use the sample data (sampx, sampy) to estimate the parameters of a 5th order polynomial function. Make a plot of the estimated function using polyx as inputs, along with the sample data. For BR, also plot the standard deviation around the mean. What is the mean-squared error between the learned function outputs and the true function outputs (polyy), averaged over all input values in polyx? For algorithms with hyperparameters, select some values that tend to work well.
(c) Repeat (b), but reduce the amount of training data available, by selecting a subset of thesamples (e.g., 10%, 25%, 50%, 75%). Plot the estimated functions. Which models are more robust with less data, and which tend to overfit? Make a plot of error versus training size, and comment on any important trends and findings. (You should run multiple trials with different random subsets, and take the average error).
(d) Add some outliers output values (e.g., add large numbers to a few values in sampy), and repeat (b). Which methods are robust to the presence of outliers, and which are most sensitive? Why?
(e) Repeat (b) but estimate a higher-order polynomial (e.g., 10th order). Which models tend tooverfit the data when learning a more complex model, and which models do not? Verify your observations by examining the estimated parameter values.
………
Part 2 A real world regression problem – counting people
Now we will consider a real world regression problem – estimating the number of people in an image. Below is an example image of a sidewalk with people.
From each image, we extract a feature vector xi ∈ R9, which is based on some properties of the crowd (e.g., the area covered by the crowd, the perimeter of the crowd, etc). We also have the number of people in the image, yi ∈ R. The goal is to learn a regression function between this feature vector and the number of people in the image. (Let’s ignore the fact that the count is actually a non-negative integer, and just apply our standard regression methods).
The data can be found in the count data.mat MATLAB file (or count data *.txt if you are not using MATLAB). The mat file contains the following variables:
• trainx – training inputs. Each column is a 9-dimensional feature vector.
• trainy – training outputs (mean-subtracted counts for each input vector).
• testx – test inputs.
• testy – true outputs for the test inputs.
Note that the output values are actually the mean-subtracted counts, so you will see some negative outputs.
(a) Let’s first look at using the features directly, i.e., set φ(x) = x. Use the training set (trainx, trainy), estimate a function with some of the regression algorithms above. Use the test set inputs testx to predict the output, and compare the predictions to the true output testy. (You can round the function outputs so they are counting numbers). Calculate the mean-absolute error and mean-squared error. Which method works the best? Plot the test predictions and the true counts, and discuss any interesting findings.
(b) Now try some other feature transformations. For example, you can create a simple 2nd orderpolynomial as . This could be expanded to include more crossterms xixj. Try some other non-linear transformation of the features. Can you improve your results from (a)?
………
Part 3 Estimating hyperparameters (optional)
So far you probably have been setting the hyperparameters of each method (e.g., λ) by hand to get the best performance on the test set. This is not a fair way to do an experiment, since it amounts to optimizing to the test set, and doesn’t reflect whether the model can generalize well to unseen data. In a real scenario, you wouldn’t have the true outputs for the test set! Here are two common ways to set the hyperparameters using only the training set:
1. N-fold cross-validation: With this method, the original training set is split into a new training set and a validation set. The regression function is estimated from the new (smaller) training set, and tested on the validation set, for different hyperparameter values. This is repeated for different partitions of the original training set. Formally, separate the training set D into N equal partitions, {D1,··· ,DN}. For each partition i and candidate hyperparameter value λj: 1) train a regression function using λj on DDi (all the data except for Di); 2) test the function on Di; 3) record the error ei,j. Finally, sum the error over the N partitions,
. Select the λj with the lowest overall error, and train the final regression
function on the full training set D. (usually people use between N = 3 to N = 10 folds, depending on the size of the training set).
2. maximum marginal likelihood: For Bayesian regression, another way to select the hyperparameters is to maximize the marginal likelihood of the training data (see Problem 3.12 for details),
α,ˆ σˆ2 = argmaxp(y|X,σ2,α) = argmaxlogp(y|X,σ2,α) (5)
α,σ2 α,σ2
where the marginal log-likelihood is
The procedure is as follows: for each candidate hyperparameter pair {αj,σj2}, calculate the marginal likelihood in (7). Select the hyperparameter pair that results in the largest marginal likelihood.
(a) For Problem 2, use cross-validation or maximum marginal likelihood to select the hyperparameters using only the training data. Does the accuracy increase or decrease? How close were the automatically selected parameters to your hand-selected ones?
………
• What to hand in – You need to turn in the following things:
1. Answers, plots, analysis, discussion, etc. for the above problems.
2. Source code files.
You must submit your programming assignment using the Canvas website. Go to “Assignments” ⇒ “Programming Assignment 1” and submit the assignment.
• Grading – The marks for this assignment will be distributed as follows:
– 10% – Implementation of the regression algorithms (1a).
– 40% – Results testing the regression functions (1b, 1c, 1d, 1e).
– 20% – Results on counting people (2a).
– 10% – Trying out other features for counting people (2b).
– 20% – Quality of the written report. More points given to insightful observations and analysis.
Note: if you really cannot implement the algorithms correctly, it is okay to use 3rd party software. You will not get marks for implementation, but you can still get marks for presenting the results.
• Optional Problems – The optional problem 3 will not be graded.
Reviews
There are no reviews yet.