CSE/ISYE 6740 Homework 3

Le Song
• Please read the following submission rules carefully. We will strictly follow these rules starting from this assignment.
• We recommend writing your report with LaTeX. Other text editors such as MS Word are also accepted. A hand-written report must be clear; unreadable handwriting is subject to zero credit. The report must be a single pdf file named in the format “HWnumber GTaccount.pdf”, for example “HW3 yyang123.pdf”. Any additional document is ignored.
• Matlab submissions will be graded in the Matlab Online R2019b environment. Python submissions will be graded in a Python 3.6 environment on a regular laptop. Please make sure your program is runnable in those environments. Any program that cannot run is subject to zero credit. Situations include, but are not restricted to, exceeding 5 minutes of running time, exceeding the memory limit, and invalid text characters.
• Put your report and code files into a single flat folder and submit it as a single .zip file.
• Recommended reading: PRML Section 3.1, 3.2
1 Linear Regression [30 pts]
In class, we derived a closed form solution (normal equation) for the linear regression problem: $\hat{\theta} = (X^T X)^{-1} X^T Y$. A probabilistic interpretation of linear regression tells us that we are relying on an assumption that each data point is actually sampled from a linear hyperplane, with some noise. The noise follows a zero-mean Gaussian distribution with constant variance. Specifically,
$Y^i = \theta^T X^i + \epsilon^i$,    (1)
where $\epsilon^i \sim N(0, \sigma^2)$, and $\{X^i, Y^i\}$ is the $i$-th data point. In other words, we are assuming that every data point is independent of the others and that every data point has the same variance.
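As a quick illustration of this closed form (not required for the report), the normal equation can be evaluated directly; below is a minimal numpy sketch on synthetic data, where all variable names are only for this example.

```python
import numpy as np

# Synthetic data from a linear model with zero-mean Gaussian noise
# (illustrative only; not part of the assignment data).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Normal equation: theta_hat = (X^T X)^{-1} X^T Y.
# Solving the linear system is numerically preferable to forming the inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)  # should be close to theta_true
```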
(a) Using the normal equation and the model (Eqn. 1), derive the expectation $E[\hat{\theta}]$. Note that here $X$ is fixed and only $Y$ is random, i.e., a “fixed design” as in statistics. [6 pts]
(b) Similarly, derive the variance $Var[\hat{\theta}]$. [6 pts]
(c) Under the white noise assumption above, someone claims that $\hat{\theta}$ follows a Gaussian distribution with the mean and variance derived in (a) and (b), respectively. Do you agree with this claim? Why or why not? [8 pts]
(d) Weighted linear regression. Suppose now that the noise terms are still independent zero-mean Gaussians, but each data point has its own variance, so that the noise covariance matrix is diagonal:

$\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$.    (2)

Derive the estimator $\hat{\theta}$ (similar to the normal equations) for this problem using matrix-vector notation with $\Sigma$. [10 pts]
2 Ridge Regression [15 pts]
For linear regression, it is often assumed that $y = \theta^T x + \epsilon$, where $\theta, x \in \mathbb{R}^m$ (by absorbing the constant term) and $\epsilon \sim N(0, \sigma^2)$ is a Gaussian random variable. Given $n$ i.i.d. samples $(x_1, y_1), \ldots, (x_n, y_n)$, we define $y = (y_1, \ldots, y_n)^T$ and $X = (x_1, \ldots, x_n)^T$. Thus, we have $y \sim N(X\theta, \sigma^2 I)$. Show that the ridge regression estimate is the mean of the posterior distribution under a Gaussian prior $\theta \sim N(0, \tau^2 I)$. Find the explicit relation between the regularization parameter $\lambda$ in the ridge regression estimate of the parameter $\theta$ and the variances $\sigma^2, \tau^2$.
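For reference while working on this problem, the ridge regression estimate in question has the standard closed form $\hat{\theta}_{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y$; below is a minimal numpy sketch on synthetic data (the variable names and the value of $\lambda$ are only examples).

```python
import numpy as np

# Synthetic design matrix and responses; lam is an arbitrary example lambda.
rng = np.random.default_rng(1)
n, m = 50, 5
X = rng.normal(size=(n, m))
y = X @ rng.normal(size=m) + rng.normal(scale=0.5, size=n)

lam = 0.1
# Ridge estimate: (X^T X + lambda I)^{-1} X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
print(theta_ridge)
```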
3 Bayes Classifier
3.1 Bayes Classifier With General Loss Function
In class, we talked about the popular 0-1 loss function in which $L(a,b) = 1$ for $a \neq b$ and $0$ otherwise, which means all wrong predictions cause equal loss. Yet in many other cases, including cancer detection, an asymmetric loss is often preferred (misdiagnosing cancer as no-cancer is much worse). In this problem, we assume such an asymmetric loss function where $L(a,a) = L(b,b) = 0$ and $L(a,b) = p$, $L(b,a) = q$, $p \neq q$. Write down the Bayes classifier $f: X \to Y$ for binary classification $Y \in \{-1, +1\}$. Simplify the classification rule as much as you can. [20 pts]
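The simplified rule is what you are asked to derive, but as a sanity check on the setup, the Bayes classifier can always be described as choosing the label with the smaller posterior expected loss. Below is a small sketch under one particular convention for the asymmetric losses (the function and variable names are hypothetical).

```python
def bayes_decision(post_pos, p, q):
    """Pick the label in {-1, +1} with the smaller posterior expected loss.

    post_pos : P(Y = +1 | X = x)
    p        : loss incurred when predicting +1 but the true label is -1
    q        : loss incurred when predicting -1 but the true label is +1
    (This assignment of p and q to the two error types is one possible convention.)
    """
    risk_pos = p * (1.0 - post_pos)  # expected loss of predicting +1
    risk_neg = q * post_pos          # expected loss of predicting -1
    return 1 if risk_pos < risk_neg else -1

# With q much larger than p, we predict +1 even at a modest posterior probability.
print(bayes_decision(post_pos=0.3, p=1.0, q=10.0))  # -> 1
```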
3.2 Gaussian Class Conditional distribution
(a) Suppose the class conditional distributions are Gaussian. Based on the general loss function in problem 3.1, write the Bayes classifier as $f(X) = \mathrm{sign}(h(X))$ and simplify $h$ as much as possible. What is the geometric shape of the decision boundary? [10 pts]
(b) Repeat (a) but assume the two Gaussians have identical covariance matrices. What is the geometric shape of the decision boundary? [10 pts]
(c) Repeat (a) but assume now that the two Gaussians have covariance matrix which is equal to the identity matrix. What is the geometric shape of the decision boundary? [10 pts]
4 Logistic Regression
Logistic regression is named after the log-odds of success (the logit of the probability), defined as

$\log \dfrac{p(X)}{1 - p(X)}$,    where    $p(X) = P(Y = 1 \mid X = x) = \dfrac{1}{1 + \exp(-\theta_0 - \theta^T x)}$.
(a) Show that log-odds of success is a linear function of X. [6 pts]
(b) Show that the logistic loss L(z) = log(1 + exp(−z)) is a convex function. [9 pts]
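This is not a substitute for the proof in (b), but as an informal numerical check one can verify the defining convexity inequality of the logistic loss on random points:

```python
import numpy as np

def L(z):
    # Logistic loss L(z) = log(1 + exp(-z)), computed stably with log1p.
    return np.log1p(np.exp(-z))

# Check L(t*z1 + (1-t)*z2) <= t*L(z1) + (1-t)*L(z2) on random samples.
rng = np.random.default_rng(2)
z1, z2 = rng.uniform(-10, 10, size=(2, 100_000))
t = rng.uniform(0, 1, size=100_000)
lhs = L(t * z1 + (1 - t) * z2)
rhs = t * L(z1) + (1 - t) * L(z2)
print(bool(np.all(lhs <= rhs + 1e-12)))  # True: the chord lies above the curve
```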
5 Programming: Recommendation System [40 pts]
Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Machine learning techniques play a key role in extracting individual preferences over items. In this assignment, we explore this popular business application of machine learning by implementing a simple matrix-factorization-based recommender using gradient descent.
Suppose you are an employee at Netflix. You are given a set of ratings (from one star to five stars) from users on many movies they have seen. Using this information, your job is to implement a personalized rating predictor for a given user on unseen movies. That is, a rating predictor can be seen as a function $f: U \times I \to \mathbb{R}$, where $U$ and $I$ are the set of users and items, respectively. Typically the range of this function is restricted to between 1 and 5 (stars), which is the allowed range of ratings.
Our approach to this problem is matrix factorization. Specifically, we assume that the rating matrix $M$ is a low-rank matrix. Intuitively, this reflects our assumption that there is only a small number of factors (e.g., genre, director, main actor/actress, release year, etc.) that determine like or dislike. Let us define $r$ as the number of factors. Then, we learn a user profile $U \in \mathbb{R}^{m \times r}$ and an item profile $V \in \mathbb{R}^{n \times r}$. (Recall that $m$ and $n$ are the number of users and items, respectively.) We want to approximate a rating by an inner product of two length-$r$ vectors, one representing the user profile and the other the item profile. Mathematically, a rating by user $u$ on movie $i$ is approximated by
$M_{u,i} \approx \sum_{k=1}^{r} U_{u,k} V_{i,k}$.    (3)
We want to fit each element of U and V by minimizing squared reconstruction error over all training data points. That is, the objective function we minimize is given by
$E(U,V) = \sum_{(u,i) \in M} \left( M_{u,i} - \sum_{k=1}^{r} U_{u,k} V_{i,k} \right)^2$    (4)
where $U_u$ is the $u$-th row of $U$ and $V_i$ is the $i$-th row of $V$. We observe that this looks very similar to linear regression. Recall that in linear regression we minimize:
$E(\theta) = \sum_{i=1}^{m} (Y^i - \theta^T x^i)^2 = \sum_{i=1}^{m} \left( Y^i - \sum_{k=1}^{r} \theta_k x_k^i \right)^2$    (5)
where $m$ is the number of training data points. Let us compare (4) and (5). $M_{u,i}$ in (4) corresponds to $Y^i$ in (5), in that both are the observed labels. $\sum_{k=1}^{r} U_{u,k} V_{i,k}$ in (4) corresponds to $\theta^T x^i$ in (5), in that both are our model's estimate. The only difference is that both $U$ and $V$ are parameters to be learned in (4), while only $\theta$ is learned in (5). This is where we personalize our estimation: with linear regression, we apply the same $\theta$ to any input $x^i$, but with matrix factorization, a different profile $U_u$ is applied depending on the user $u$.
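To make this concrete, here is a minimal numpy sketch (the names are illustrative and not from the skeleton code) showing that each predicted rating is the inner product of a user row of $U$ and an item row of $V$:

```python
import numpy as np

# m users, n items, r factors, with random profiles for illustration.
rng = np.random.default_rng(3)
m, n, r = 4, 6, 2
U = rng.normal(size=(m, r))   # user profiles, one row per user
V = rng.normal(size=(n, r))   # item profiles, one row per item

M_hat = U @ V.T               # m x n matrix of predicted ratings
u, i = 1, 3
print(M_hat[u, i], U[u] @ V[i])  # the same rating, computed two ways
```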
As $U$ and $V$ are interrelated in (4), there is no closed form solution, unlike the linear regression case. Thus, we need to use gradient descent:

$U_{v,k} \leftarrow U_{v,k} - \mu \dfrac{\partial E(U,V)}{\partial U_{v,k}}, \qquad V_{j,k} \leftarrow V_{j,k} - \mu \dfrac{\partial E(U,V)}{\partial V_{j,k}},$    (6)

where $\mu$ is a hyper-parameter deciding the update rate. It would be straightforward to take the partial derivatives of $E(U,V)$ in (4) with respect to each element $U_{v,k}$ and $V_{j,k}$. Then, we update each element of $U$ and $V$ using the gradient descent formula in (6).
(a) Derive the update formula in (6) by solving the partial derivatives. [10 pts]
(b) To avoid overfitting, we usually add regularization terms, which penalize large values in $U$ and $V$. Redo part (a) using the regularized objective function below. [5 pts]
$E(U,V) = \sum_{(u,i) \in M} \left( M_{u,i} - \sum_{k=1}^{r} U_{u,k} V_{i,k} \right)^2 + \lambda \sum_{u,k} U_{u,k}^2 + \lambda \sum_{i,k} V_{i,k}^2$
(λ is a hyper-parameter controlling the degree of penalization.)
(c) Implement myRecommender.m by filling in the gradient descent part.
You are given a skeleton code myRecommender.m. Using the training data rateMatrix, you will implement your own recommendation system of rank lowRank. The only file you need to edit and submit is myRecommender.m. In the gradient descent part, repeatedly apply your update formula from (b), monitoring the average reconstruction error between your estimate and the ground truth on the training set. You need to set a stopping criterion based on this reconstruction error as well as a maximum number of iterations. You should play with several different values for µ and λ to make sure that your final prediction is accurate. A minimal illustrative sketch of such an update loop is given after the input/output specification below. Formatting information is as follows:
Input
• rateMatrix: training data set. Each row represents a user, while each column an item. Observed values are one of {1,2,3,4,5}, and missing values are 0.
• lowRank: the number of factors (dimension) of your model. With higher values, you would expect more accurate prediction.
Output
• U: the user profile matrix of dimension user count × low rank.
• V: the item profile matrix of dimension item count × low rank.
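For intuition only, the sketch below shows one possible batch gradient-descent step on the regularized objective from part (b). It is not the provided skeleton: zeros are treated as missing ratings as specified above, and the step size and regularization values are arbitrary examples.

```python
import numpy as np

def gradient_step(rateMatrix, U, V, mu=0.0005, lam=0.01):
    """One batch gradient-descent step on the regularized objective.

    Illustrative sketch only: zeros in rateMatrix mark missing ratings,
    and mu / lam are placeholder hyper-parameter values.
    """
    observed = rateMatrix > 0
    err = (rateMatrix - U @ V.T) * observed         # residuals on observed entries only
    U_new = U + mu * (2 * err @ V - 2 * lam * U)    # descend on the U gradient
    V_new = V + mu * (2 * err.T @ U - 2 * lam * V)  # descend on the V gradient
    return U_new, V_new

# Toy usage: a random 5-user x 8-item rating matrix (0 = missing) and rank 2.
rng = np.random.default_rng(4)
M = rng.integers(0, 6, size=(5, 8)).astype(float)
U = rng.normal(scale=0.1, size=(5, 2))
V = rng.normal(scale=0.1, size=(8, 2))
for _ in range(200):
    U, V = gradient_step(M, U, V)
```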
Evaluation [15 pts]
Upload your myRecommender.m implementation file. (Do not copy and paste your code in your report. Be sure to upload your myRecommender.m file.)
Your predictor is evaluated by the root mean squared error (RMSE) over the observed entries:

$\mathrm{RMSE} = \sqrt{\dfrac{1}{|M|} \sum_{(u,i) \in M} \left( M_{u,i} - f(u,i) \right)^2}$
Note that we provide homework3.m just to help you evaluate your code easily. You are not expected to alter or submit it. In other words, we will not use this file when we grade your submission. The only file we care about is myRecommender.m. Grading criteria:
• Your code should output U and V as specified. The dimensions should match the specification.
• We will test your output on another test dataset, which is not provided to you. The test RMSE on this dataset should be at most 1.05 to get at least partial credit.
• We will measure the elapsed time for learning. If your implementation takes longer than 3 minutes for rank 5, you should definitely try to make your code faster or adjust your parameters. Any code running more than 5 minutes is not eligible for credit.
• Your code should not crash. Any crashing code will not be credited.
Report [10 pts]
Note
• Do not print anything in your code (e.g., iteration 1 : err=2.4382) in your final submission.
• Do not alter the input and output format of the skeleton file. (E.g., do not add a new parameter without specifying its default value.) Please make sure that you return all necessary outputs according to the given skeleton.
• Please do not use additional files. This task is simple enough to fit in just one file.
• Submit your code with the best parameters you found. We will grade without modifying your code. (Applying cross-validation to find the best parameters is fine, though you are not required to do so.)
• Please be sure that your program finishes within a fixed number of iterations. Always think of a case where your stopping criterion is never satisfied. This can happen at any time depending on the data, not because your code is incorrect. For this reason, we recommend setting a maximum number of iterations in addition to your other stopping criteria.
Grand Prize
Similar to the Netflix competition held from 2006 to 2009, the student who achieves the lowest RMSE on the secret test set will earn the Grand Prize. We will give 10 extra bonus points to the winner and share the student's code with everyone. (Note that the winner must satisfy all other grading criteria perfectly, including the answer sanity check and the timing requirement. Otherwise, the next student will be considered the winner.)
Typing Bonus
As usual, we will give 5 bonus points for typed submissions. Note that all questions should be typed to get this credit.
