Machine Learning Course
EPFL
Martin Jaggi & Rüdiger Urbanke mlo.epfl.ch/page-157255-en-html/ epfmlcourse@gmail.com
Matrix Factorizations and Recommender Systems
Goals. The goal of this exercise is to
• Build a recommendation system.
• Learn to evaluate its performance.
• Implement and understand matrix factorization using SGD.
• Implement and understand the alternating least-squares (ALS) algorithm.
• Choose the appropriate number of factors, as well as regularization parameters.
• Compare your system to a few baselines.
Setup, data and sample code. Obtain the folder labs/ex10 of the course github repository
github.com/epfml/ML_course
We will use the dataset movielens100k.csv in this exercise, and we have provided sample code templates that already contain useful snippets of code required for this exercise.
1 Visualization and Creating Train-Test Splits
Since our goal is to predict the unseen ratings, we can create a test set by “punching holes” in the matrix, i.e. we randomly select some of the ratings in the matrix as test points, while we leave the rest for training.
Figure 1 shows the number of ratings for each user and each movie. We can see that the number of ratings varies quite a lot among users and movies. This is very typical of datasets for recommendation systems, and Big Data in general.
Exercise 1:
Fill in the notebook function split_data() to split the dataset into a train and a test set. We only consider users and movies with more than 10 ratings, and discard all other users and movies. Among the remaining valid ratings, randomly select each rating to become a test rating with probability 10%, and a training rating with probability 90%. This way, we keep a minimum amount of training data even for users and movies that have few ratings.
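A minimal sketch of such a split, assuming the ratings are stored as a dense D × N numpy array with zeros marking missing entries (the provided template may use a sparse matrix instead; argument names here are illustrative):

```python
import numpy as np

def split_data(ratings, min_num_ratings=10, p_test=0.1, seed=988):
    """Split the rating matrix into train and test parts.

    ratings: D x N array of ratings, with 0 marking a missing rating.
    Movies (rows) and users (columns) with at most `min_num_ratings`
    ratings are discarded before splitting.
    """
    np.random.seed(seed)
    # keep only movies and users with more than min_num_ratings ratings
    per_movie = (ratings != 0).sum(axis=1)
    per_user = (ratings != 0).sum(axis=0)
    valid = ratings[per_movie > min_num_ratings, :][:, per_user > min_num_ratings]

    train = valid.copy()
    test = np.zeros_like(valid)
    for d, n in zip(*valid.nonzero()):
        if np.random.rand() < p_test:   # each rating becomes a test point w.p. 10%
            test[d, n] = valid[d, n]
            train[d, n] = 0             # "punch a hole" in the training matrix
    return valid, train, test
```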
We can visualize the resulting split as in Figure 2.
2 Performance Evaluation
We will use the root mean-square error (RMSE) to measure performance. Given a true test rating x_{dn} for the d-th movie and n-th user, and our prediction \hat{x}_{dn} = (WZ^\top)_{dn} = W_{d:}(Z_{n:})^\top, we compute the RMSE as follows:
\[
\text{RMSE}(W, Z) = \sqrt{\frac{1}{|\Omega|} \sum_{(d,n) \in \Omega} \big( x_{dn} - (WZ^\top)_{dn} \big)^2}, \tag{1}
\]
where W ∈ R^{D×K} and Z ∈ R^{N×K}. Here Ω ⊆ [D] × [N] is the set of indices of the observed ratings of the input matrix X. The RMSE can be computed either on the training set of ratings Ω, or on a held-out test set.
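Under the same conventions (a dense matrix with 0 marking a missing rating), Equation (1) can be computed as in the following sketch; the helper name compute_rmse is illustrative:

```python
import numpy as np

def compute_rmse(data, W, Z):
    """RMSE over the observed entries of `data` (0 = missing).

    W: D x K movie features, Z: N x K user features.
    """
    d_idx, n_idx = data.nonzero()          # the index set Omega
    pred = (W @ Z.T)[d_idx, n_idx]         # predictions (W Z^T)_{dn}
    errors = data[d_idx, n_idx] - pred
    return np.sqrt(np.mean(errors ** 2))
```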
Figure 1: The left plot shows the (ordered) number of ratings for each user, while the right plot shows the (ordered) number of ratings for each movie. Both are shown in descending order for clarity.
Figure 2: This figure shows the obtained train–test split. The left plot shows the training data and the right plot shows the test data. In each plot, a dot indicates a user-movie pair with a non-zero rating.
3 Baseline Models
As baselines, we will use the following three models, each of which predicts with a mean:
\[
\text{Global Mean:} \quad \hat{x}_{dn} = \frac{1}{|\Omega|} \sum_{(d',n') \in \Omega} x_{d'n'}, \tag{2}
\]
\[
\text{User Mean:} \quad \hat{x}_{dn} = \frac{1}{|\Omega_{:n}|} \sum_{d' \in \Omega_{:n}} x_{d'n}, \tag{3}
\]
\[
\text{Movie Mean:} \quad \hat{x}_{dn} = \frac{1}{|\Omega_{d:}|} \sum_{n' \in \Omega_{d:}} x_{dn'}, \tag{4}
\]
where Ω is the set of non-zero rating indices (d,n) in the training data matrix, Ω:n is the set of movies rated by the n-th user, and Ωd: is the set of users who have rated the d-th movie.
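For illustration, the user-mean baseline of Equation (3), evaluated with the test RMSE of Equation (1), could look as follows under the same dense-matrix convention (a sketch; the global- and movie-mean baselines are analogous along the other axes):

```python
import numpy as np

def baseline_user_mean(train, test):
    """Predict each rating of user n by the mean of that user's training ratings."""
    num_users = train.shape[1]
    user_means = np.zeros(num_users)
    for n in range(num_users):
        observed = train[:, n][train[:, n] != 0]   # ratings in Omega_{:n}
        if observed.size > 0:
            user_means[n] = observed.mean()
    # evaluate on the held-out test ratings
    d_idx, n_idx = test.nonzero()
    errors = test[d_idx, n_idx] - user_means[n_idx]
    return np.sqrt(np.mean(errors ** 2))
```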
Exercise 2:
We will compare the above three baselines first.
• Before implementing, think about the following: which of the three models will give the best performance, and why?
• Implement the notebook functions baseline_global_mean(), baseline_user_mean() and baseline_item_mean().
• Compare the resulting models. Which model gives you the lowest training RMSE? Which one gives the lowest test RMSE?
Hint: You can change the random seed of your train/test split, and thereby generate several estimates.
4 Matrix Factorization with SGD
Exercise 3:
Task: Implement Stochastic Gradient Descent
1. Derive the full gradient ∇_{W,Z} f(W,Z) for f = RMSE(W,Z). Note that since we have (D + N) × K variables, the gradient also has (D + N) × K entries, one for each entry of W and Z.
2. Derive a stochastic gradient G using the sum structure of f over the Ω elements. We want to do this in such a way that G only depends on a single observed rating (d,n) ∈ Ω.
3. Implement Stochastic Gradient Descent as described in the lecture, for our objective function being RMSE as defined in Equation (1).
4. Implement SGD for the regularized cost function
\[
\min_{W,Z} \; \frac{1}{2} \sum_{(d,n) \in \Omega} \big( x_{dn} - (WZ^\top)_{dn} \big)^2 + \frac{\lambda_w}{2} \|W\|_{\text{Frob}}^2 + \frac{\lambda_z}{2} \|Z\|_{\text{Frob}}^2,
\]
where λ_w, λ_z > 0 are scalars. (A sketch of the resulting update rule is given after this list.)
5. Experimentally find the best stepsize γ to obtain the lowest training error value.
6. Does the test error also decrease monotonically during optimization, or does it increase again after some time? Try for different choices of K.
7. (OPTIONAL: Can you speed up your code, for example by maintaining the values (WZ^⊤)_{dn} for the observed entries (d,n) ∈ Ω, thereby avoiding the full matrix multiplication WZ^⊤ in every step?)
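A minimal sketch of the resulting SGD loop for the regularized objective of step 4, using the gradient of the squared-error term for each sampled rating, and again assuming a dense D × N training matrix with zeros for missing entries; all hyperparameter values are illustrative, not tuned:

```python
import numpy as np

def matrix_factorization_sgd(train, K=20, gamma=0.01, lambda_w=0.1,
                             lambda_z=0.1, num_epochs=20, seed=988):
    """SGD on the regularized factorization objective (sketch)."""
    rng = np.random.default_rng(seed)
    D, N = train.shape
    W = rng.standard_normal((D, K)) / np.sqrt(K)   # movie features
    Z = rng.standard_normal((N, K)) / np.sqrt(K)   # user features
    omega = list(zip(*train.nonzero()))            # observed (d, n) pairs

    for epoch in range(num_epochs):
        rng.shuffle(omega)                         # visit ratings in random order
        for d, n in omega:
            err = train[d, n] - W[d, :] @ Z[n, :]
            # gradient step on the single term (d, n), plus regularization
            W_d_old = W[d, :].copy()
            W[d, :] += gamma * (err * Z[n, :] - lambda_w * W[d, :])
            Z[n, :] += gamma * (err * W_d_old - lambda_z * Z[n, :])
    return W, Z
```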
5 Alternating Least Squares
Exercise 4:
Theory Question: Alternating Least Squares with Missing Entries.
1. Mathematically derive the precise updates for W and Z, for the Alternating Least Squares (ALS) algorithm for the more general setting, when only the ratings (d,n) ∈ Ω contribute to the cost, i.e.
\[
\min_{W,Z} \; \frac{1}{2} \sum_{(d,n) \in \Omega} \big( x_{dn} - (WZ^\top)_{dn} \big)^2.
\]
Hint: Compute the gradient with respect to each group of variables, and set to zero.
2. Do the same for the regularized cost function
\[
\min_{W,Z} \; \frac{1}{2} \sum_{(d,n) \in \Omega} \big( x_{dn} - (WZ^\top)_{dn} \big)^2 + \frac{\lambda_w}{2} \|W\|_{\text{Frob}}^2 + \frac{\lambda_z}{2} \|Z\|_{\text{Frob}}^2,
\]
where λw,λz > 0 are scalars.
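As a sanity check for your derivation: writing w_d ∈ R^K for the d-th row of W and z_n ∈ R^K for the n-th row of Z, setting the gradient of the regularized cost with respect to z_n to zero leads to a K × K linear system, so the updates should take a form like the following (the exact scaling of the regularization term depends on how the cost is normalized):
\[
z_n = \Big( \sum_{d \in \Omega_{:n}} w_d w_d^\top + \lambda_z I_K \Big)^{-1} \sum_{d \in \Omega_{:n}} x_{dn}\, w_d,
\qquad
w_d = \Big( \sum_{n \in \Omega_{d:}} z_n z_n^\top + \lambda_w I_K \Big)^{-1} \sum_{n \in \Omega_{d:}} x_{dn}\, z_n.
\]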
Exercise 5:
Using the provided template, implement the ALS algorithm for regularized matrix completion, i.e., for the objective function defined above. A sketch of one half-step is given after the steps below.
• Step 1 Initialize the movie (item) matrix W by assigning the average rating of each movie to its first column, and small random numbers to the remaining entries;
• Step 2 Fix W, solve for Z by minimizing the objective function (the regularized sum of squared errors);
• Step 3 Fix Z, solve for W by minimizing the objective function similarly;
• Step 4 Repeat Steps 2 and 3 until a stopping criterion is satisfied.
Try different values of the latent dimension K.
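For Step 2, one half-step could look as follows (a sketch under the dense-matrix convention, using the regularized update derived in Exercise 4; Step 3 is symmetric, with the roles of W and Z swapped):

```python
import numpy as np

def als_step_users(train, W, Z, lambda_z):
    """One half of an ALS iteration: fix W and solve the regularized
    least-squares problem for each user row of Z in closed form."""
    K = W.shape[1]
    for n in range(train.shape[1]):
        rated = train[:, n].nonzero()[0]           # movies in Omega_{:n}
        if rated.size == 0:
            continue
        W_n = W[rated, :]                          # feature rows of rated movies
        A = W_n.T @ W_n + lambda_z * np.eye(K)     # K x K system matrix
        b = W_n.T @ train[rated, n]
        Z[n, :] = np.linalg.solve(A, b)            # closed-form update for z_n
    return Z
```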