## Description

Machine Learning Course

Martin Jaggi & Ru¨diger Urbanke mlo.epfl.ch/page-157255-en-html/ epfmlcourse@gmail.com

Goals. The goal of this week’s lab is to

• Implement grid search, gradient descent and stochastic gradient descent.

• Learn to debug your implementations.

• Learn to visualize results.

• Understand advantages and disadvantages of these algorithms.

• Study the effect of outliers using MSE and MAE cost functions.

Setup, data and sample code. Obtain the folder labs/ex02 of the course github repository

github.com/epfml/ML course

We will use the dataset height weight genders.csv in this exercise, and we have provided sample code templates that already contain useful snippets of code required for this exercise.

You will be working in the notebook ex02.ipynb for all exercises of this week, by filling in the corresponding functions. The notebook already provides a lot of template code, as well as code to load the data, and normalize the features, visualize the results.

Additionally, please also take a look at the files helpers.py and plots.py, and make sure you understand them.

1 Computing the Cost Function

In this exercise, we will focus on simple linear regression which takes the following form,

yn ≈ f(xn1) = w0 + w1xn1. (1)

We will use height as the input variable xn1 and weight as the output variable yn. The coefficients w0 and w1 are also called model parameters. We will use a mean-square-error (MSE) function defined as follows,

. (2)

Our goal is to find and that minimize this cost.

Let us start by the array data type in NumPy. We store all the (yn,xn1) pairs in a vector and a matrix as shown below.

y1 1 x11

y2 1 x21 y = … Xf = … … (3)

yN 1 xN1

Exercise 1:

To understand this data format, answer the following warmup questions:

• What does each column of X˜ represent?

• What does each row of X˜ represent?

• Why do we have 1’s in X˜ ?

• If we have heights and weights of 3 people, what would be the size of y and X˜ ? What would X˜ 32 represent?

• In helpers.py, we have already provided code to form arrays for y and X˜ . Have a look at the code, and make sure you understand how they are constructed.

• Check if the sizes of the variables make sense (use functions shape).

a) Now we will compute the MSE. Let us introduce the vector notation e = y − Xw˜ , for given model parameters w = [w0, w1]>. Prove that the MSE can also be rewritten in terms of the vector e, as

L(w) = … (4) b) Complete the implementation of the notebook function compute loss(y, tx, w). You can start by setting w = [1,2]>, and test your function.

2 Grid Search

Now we are ready to implement our first optimization algorithm: Grid Search. Revise the lecture notes.

Exercise 2:

a) Fill in the notebook function grid search(y, tx, w0, w1) to implement grid search. You will have to write one for-loop per dimension, and compute the cost function for each setting of w0 and w1. Once you have all values of cost function stored in the variable loss, the code finds an approximate minimum (as discussed in the class).

The code should print the obtained minimum value of the cost function along with the found and . It should also show a contour plot and the plot of the fit, as shown in Figure 1.

Figure 1: Grid Search Visualization

b) Does this look like a good estimate? Why not? What is the problem? Why is the MSE plot not smooth?

Repeat the above exercise by changing the grid spacing to 10 instead of 50. Compare the new fit to the old one.

c) Discuss with your peers:

• To obtain an accurate fit, do you need a coarse grid or a fine grid?

• Try different values of grid spacing. What do you observe?

• How does increasing the number of values affect the computational cost? How fast or slow does your code run?

3 Gradient Descent

In the lecture, we derived the following expressions for the gradient (the vector of partial derivatives) of the MSE for linear regression,

(5)

(6)

Denoting the gradient by ∇L(w), we can write these operations in vector form as follows,

X˜ >e (7)

Exercise 3:

a) Now implement a function that computes the gradients. Implement the notebook function compute gradient(y, tx, w) using Equation (7). Verify that the function returns the right values. First, manually compute the gradients for hand-picked values of y, X˜ , and w and compare them to the output of compute gradient.

b) Once you make sure that your gradient code is correct, get some intuition about the gradient values:

Compute the gradients for

• w0 = 100 and w1 = 20

• w0 = 50 and w1 = 10

What do the values of these gradients tell us? For example, think about the norm of this vector. In which case are they bigger? What does that mean?

Hint: Imagine a quadratic function and estimate its gradient near its minimum and far from it. Hint 2: As we know from the lecture notes, the update rule for gradient descent at step t is

w(t+1) = w(t) − γ ∇L(w(t)) (8)

where γ > 0 is the step size, and ∇L ∈ R2 is the gradient vector.

c) Fill in the notebook function gradient descent(y, tx, initial w, …). Run the code and visualize the iterations. Also, look at the printed messages that show L and values of and . Take a detailed look at these plots,

• Is the cost being minimized?

• Is the algorithm converging? What can be said about the convergence speed?

• How good are the final values of w1 and w0 found?

d) Now let’s experiment with the value of the step size and initialization parameters and see how they influences the convergence. In theory, gradient descent converges to the optimum on convex functions, when the value of the step size is chosen appropriately.

• Try the following values of step size: 0.001, 0.01, 0.5, 1, 2, 2.5. What do you observe? Did the procedure converge?

• Try different initializations with fixed step size γ = 0.1, for instance:

– w0 = 0, w1 = 0

– w0 = 100, w1 = 10

– w0 = −1000, w1 = 1000

What do you observe? Did the procedure converge?

4 Stochastic Gradient Descent

Exercise 4:

Let us implement stochastic gradient descent. Recall from the lecture notes that the update rule for stochastic

gradient descent on an objective function at step t is

w(t+1) = w(t) − γ ∇Ln(w(t)) . (9)

HINT: You can use the function batch iter() in the file of helpers.py to generate mini-batch data for stochastic gradient descent.

5 Effect of Outliers and MAE Cost Function

Exercise 5:

Let’s simulate the presence of two outliers, and their effect on linear regression under MSE cost function,

• Reload the data through function load data() by setting sub sample=True to keep only a few data examples.

• Plot the data. You should get a cloud of points similar, but less dense, than what you saw before with the whole dataset.

• As before, find the values of w0,w1 to fit a linear model (using MSE cost function), and plot the resulting f together with the data points.

• Now we will add two outliers points simulating the mistake that we entered the weights in pounds instead of kilograms. For example, you can achieve this by setting add outlier=True in load data(). Feel free to add more outlier points.

• Fit the model again to the augmented dataset with the outliers. Does it look like a good fit?

One way to deal with outliers is to use a more robust cost function, such as the Mean Absolute Error (MAE), as discussed in the class.

6 Subgradient Descent

Exercise 6:

Modify the function compute loss(y, tx, w) for the Mean Absolute Error cost function.

Unfortunately, you cannot directly use gradient descent, since the MAE function is non-differentiable at several points.

a) Compute a subgradient of the MAE cost function, for every given vector w.

Hint: Use the chain rule to compute the subgradient of the absolute value function. For a function L(w) := h(q(w)) with q differentiable, the subgradient can be computed using ∂L(w) = ∂h(q(w))·∇q(w), where each ∂.. denotes the set of all subgradient vectors.

b) Implement subgradient descent for the MAE cost function.

To do so, write a new function compute gradient(y, tx, w) for the new MAE objective, and modify it to return a subgradient if the given w turns out to be a non-differentiable point.

Plot the resulting model f together with the two curves obtained in the previous exercise.

• Is the fit using MAE better than the one using MSE?

• Did your optimization algorithm ever encounter a non-differentiable point?

c) Implement stochastic subgradient descent (SGD) for the MAE cost function.

How is the picture different when you compare the two algorithm variants on MAE, compared to what you have observed on MSE?

Wrap-Up

After you have finished the implementation of the above exercises the notebook ex02.ipynb, you can wrap up by copying your code to separate .py files for later re-use. For example, you’ll be re-using your code from this week later on, for example for Project 1 and some of the subsequent labs.

We have provided template files for this, namely

cost.py, grid search.py, gradient descent.py and stochastic gradient descent.py,

## Reviews

There are no reviews yet.