Description
• In this course, we will be using Google Colab for code submissions. You will need a Google account.
• Submit your report as a single .pdf file to Gradescope (entry code 7426YK), under ”Set 5 Report”.
• In the report, include any images generated by your code along with your answers to the questions.
• Submit your code by sharing a link in your report to your Google Colab notebook for each problem (see naming instructions below). Make sure to set sharing permissions to at least ”Anyone with the link can view”. Links that can not be run by TAs will not be counted as turned in. Check your links in an incognito window before submitting to be sure.
• For instructions specifically pertaining to the Gradescope submission process, see https://www.
gradescope.com/get_started#student-submission.
Google Colab Instructions
For each notebook, you need to save a copy to your drive.
1. Open the github preview of the notebook, and click the icon to open the colab preview.
2. On the colab preview, go to File → Save a copy in Drive.
3. Edit your file name to “lastname firstname originaltitle”, e.g.”yue yisong 3 notebook part1.ipynb”
1 SVD and PCA [35 Points]
Relevant materials: Lectures 10, 11
Problem A [3 points]: Let X be a N ×N matrix. For the singular value decomposition (SVD) X = UΣV T, show that the columns of U are the principal components of X. What relationship exists between the singular values of X and the eigenvalues of XXT?
Solution A:
Problem B [4 points]: Provide both an intuitive explanation and a mathematical justification for why the eigenvalues of the PCA of X (or rather XXT) are non-negative. Such matrices are called positive semidefinite and possess many other useful properties.
Solution B:
Problem C [5 points]: In calculating the Frobenius and trace matrix norms, we claimed that the trace is invariant under cyclic permutations (i.e., Tr(ABC) = Tr(BCA) = Tr(CAB)). Prove that this holds for any number of square matrices.
Hint: First prove that the identity holds for two matrices and then generalize. Recall that Tr(AB) = . Can you find a way to expand (AB)ii in terms of another sum?
Solution C:
Problem D [3 points]: Outside of learning, the SVD is commonly used for data compression. Instead of storing a full N × N matrix X with SVD X = UΣV T, we store a truncated SVD consisting of the k largest singular values of Σ and the corresponding columns of U and V . One can prove that the SVD is the best rank-k approximation of X, though we will not do so here. Thus, this approximation can often re-create the matrix well even for low k. Compared to the N2 values needed to store X, how many values do we need to store a truncated SVD with k singular values? For what values of k is storing the truncated SVD more efficient than storing the whole matrix?
Hint: For the diagonal matrix Σ, do we have to store every entry?
Solution D:
Dimensions & Orthogonality
In class, we claimed that a matrix X of size D × N can be decomposed into UΣV T, where U and V are orthogonal and Σ is a diagonal matrix. This is a slight simplification of the truth. In fact, the singular value decomposition gives an orthogonal matrix U of size D × D, an orthogonal matrix V of size N × N, and a rectangular diagonal matrix Σ of size D × N, where Σ only has non-zero values on entries (Σ)ii, i ∈ {1,…,K}, where K is the rank of the matrix X.
Problem E [3 points]: Assume that D > N and that X has rank N. Show that UΣ = U′Σ′, where Σ′ is the N × N matrix consisting of the first N rows of Σ, and U′ is the D × N matrix consisting of the first N columns of U. The representation U′Σ′V T is called the “thin” SVD of X.
Solution E:
Problem F [3 points]: Show that since U′ is not square, it cannot be orthogonal according to the definition given in class. Recall that a matrix A is orthogonal if AAT = ATA = I.
Solution F:
Problem G [4 points]: Even though U′ is not orthogonal, it still has similar properties. Show that
U′TU′ = IN×N. Is it also true that U′U′T = ID×D? Why or why not? Note that the columns of U′ are still orthonormal. Also note that orthonormality implies linear independence.
Solution G:
Pseudoinverses
Let X be a matrix of size D × N, where D > N, with “thin” SVD X = UΣV T. Assume that X has rank N.
Problem H [4 points]: Assuming that Σ is invertible, show that the pseudoinverse X+ = V Σ+UT as given in class is equivalent to V Σ−1UT. Refer to lecture 11 for the definition of pseudoinverse.
Solution H:
Problem I [4 points]: Another expression for the pseudoinverse is the least squares solution X+′ = (XTX)−1XT. Show that (again assuming Σ invertible) this is equivalent to V Σ−1UT.
Solution I:
Problem J [2 points]: One of the two expressions in problems H and I for calculating the pseudoinverse is highly prone to numerical errors. Which one is it, and why? Justify your answer using condition numbers.
Hint: Note that the transpose of a matrix is easy to compute. Compare the condition numbers of Σ and
XTX. The condition number of a matrix A is given by , where σmax(A) and σmin(A) are the maximum and minimum singular values of A, respectively.
Solution J:
2 Matrix Factorization [30 Points]
Relevant materials: Lecture 11
In the setting of collaborative filtering, we derive the coefficients of the matrices U ∈ RM×K and V ∈ RN×K by minimizing the regularized square error:
argmin
i,j
where uTi and vjT are the ith and jth rows of U and V , respectively, and ∥·∥F represents the Frobenius norm.
Then Y ∈ RM×N ≈ UV T, and the ij-th element of .
Problem A [5 points]: Derive the gradients of the above regularized squared error with respect to ui and vj, denoted ∂ui and ∂vj respectively. We can use these to compute U and V by stochastic gradient descent using the usual update rule:
ui = ui − η∂ui vj = vj − η∂vj
where η is the learning rate.
Solution A:
Problem B [5 points]: Another method to minimize the regularized squared error is alternating least squares (ALS). ALS solves the problem by first fixing U and solving for the optimal V , then fixing this new V and solving for the optimal U. This process is repeated until convergence.
Derive closed form expressions for the optimal ui and vj. That is, give an expression for the ui that minimizes the above regularized square error given fixed V , and an expression for the vj that minimizes it given fixed U.
Solution B:
Problem C [10 points]: Download the provided MovieLens dataset (train.txt and test.txt). The format of the data is (user, movie, rating), where each triple encodes the rating that a particular user gave to a particular movie. Make sure you check if the user and movie ids are 0 or 1-indexed, as you should with any real-world dataset.
In your implementation, you should:
• Initialize the entries of U and V to be small random numbers; set them to uniform random variables in the interval [−0.5,0.5].
• Use a learning rate of 0.03.
• Randomly shuffle the training data indices before each SGD epoch.
• Set the maximum number of epochs to 300, and terminate the SGD process early via the following early stopping condition:
– Keep track of the loss reduction on the training set from epoch to epoch, and stop when the relative loss reduction compared to the first epoch is less than ϵ = 0.0001. That is, if ∆0,1 denotes the loss reduction from the initial model to end of the first epoch, and ∆i,i−1 is defined analogously, then stop after epoch t if ∆t−1,t/∆0,1 ≤ ϵ.
Solution C:
Problem D [5 points]: Use your code from the previous problem to train your model using k = 10,20,30,50,100, and plot your Ein,Eout against k. Note that Ein and Eout are calculated via the squared loss, i.e. via . What trends do you notice in the plot? Can you explain them?
Solution D:
Problem E [5 points]: Now, repeat problem D, but this time with the regularization term. Use the following regularization values: λ ∈ {10−4,10−3,0.01,0.1,1}. For each regularization value, use the same range of values for k as you did in the previous part. What trends do you notice in the graph? Can you explain them in the context of your plots for the previous part? You should use your code you wrote for part C in 2 notebook.ipynb.
Solution E:
3 Word2Vec Principles [35 Points]
Relevant materials: Lecture 12
The Skip–gram model is part of a family of techniques that try to understand language by looking at what words tend to appear near what other words. The idea is that semantically similar words occur in similar contexts. This is called “distributional semantics”, or “you shall know a word by the company it keeps”.
The Skip–gram model does this by defining a conditional probability distribution p(wO|wI) that gives the probability that, given that we are looking at some word wI in a line of text, we will see the word wO nearby. To encode p, the Skip-gram model represents each word in our vocabulary as two vectors in RD: one vector for when the word is playing the role of wI (“input”), and one for when it is playing the role of wO (“output”). (The reason for the 2 vectors is to help training — in the end, mostly we’ll only care about the wI vectors.) Given these vector representations, p is then computed via the familiar softmax function:
(2)
where vw and are the “input” and “output” vector representations of word a w ∈ {1,…,W}. (We assume all words are encoded as positive integers.)
Given a sequence of training words w1,w2,…,wT, the training objective of the Skip-gram model is to maximize the average log probability
(1)
where s is the size of the “training context” or “window” around each word. Larger s results in more training examples and higher accuracy, at the expense of training time.
Problem A [5 points]: If we wanted to train this model with naive gradient descent, we’d need to compute all the gradients ∇logp(wO|wI) for each wO, wI pair. How does computing these gradients scale with W, the number of words in the vocabulary, and D, the dimension of the embedding space? To be specific, what is the time complexity of calculating ∇logp(wO|wI) for a single wO, wI pair?
Solution A:
Problem B [10 points]: When the number of words in the vocabulary W is large, computing the regular softmax can be computationally expensive (note the normalization constant on the bottom of Eq. 2). For reference, the standard fastText pre-trained word vectors encode approximately W ≈ 218000 words in D = 100 latent dimensions. One trick to get around this is to instead represent the words in a binary tree format and compute the hierarchical softmax.
When the words have all the same frequency, then any balanced binary tree will minimize the average representation length and maximize computational efficiency of the hierarchical softmax. But in practice, Table 1: Words and frequencies for Problem B
Word Occurrences
do 18
you 4
know 7
the 20
way 9
of 4
devil 5
queen 6
words occur with very different frequencies — words like ”a”, ”the”, and ”in” will occur many more times than words like ”representation” or ”normalization”.
The original paper (Mikolov et al. 2013) uses a Huffman tree instead of a balanced binary tree to leverage this fact. For the 8 words and their frequencies listed in the table below, build a Huffman tree using the algorithm found here. Then, build a balanced binary tree of depth 3 to store these words. Make sure that each word is stored as a leaf node in the trees.
The representation length of a word is then the length of the path (the number of edges) from the root to the leaf node corresponding to the word. For each tree you constructed, compute the expected representation length (averaged over the actual frequencies of the words).
Solution B:
Problem C [3 points]: In principle, one could use any D for the dimension of the embedding space. What do you expect to happen to the value of the training objective as D increases? Why do you think one might not want to use very large D?
Solution C:
Implementing Word2Vec
Word2Vec is an efficient implementation of the Skip–gram model using neural network–inspired training techniques. We’ll now implement Word2Vec on text datasets using Pytorch. This blog post provides an overview of the particular Word2Vec implementation we’ll use.
At a high level, we’ll do the following:
(i) Load in a list L of the words in a text file
(ii) Given a window size s, generate up to 2s training points for word Li. The diagram below shows an example of training point generation for s = 2:
Figure 1: Generating Word2Vec Training Points
(iii) Fit a neural network consisting of a single hidden layer of 10 units on our training data. The hidden layer should have no activation function, the output layer should have a softmax activation, and the loss function should be the cross entropy function.
Notice that this is exactly equivalent to the Skip–gram formulation given above where the embedding dimension is 10: the columns (or rows, depending on your convention) of the input–to–hidden weight matrix in our network are the wI vectors, and those of the hidden–to–output weight matrix are the wO vectors.
(iv) Discard our output layer and use the matrix of weights between our input layer and hidden layer as the matrix of feature representations of our words.
(v) Compute the cosine similarity between each pair of distinct words and determine the top 30 pairs of most-similar words.
Implementation
See set5 prob3.ipynb, which implements most of the above.
Problem D [10 points]: Fill out the TODOs in the skeleton code; specifically, add code where indicated to train a neural network as described in (iii) above and extract the weight matrix of its input–to–hidden weight matrix. Also, fill out the generate traindata() function, which generates our data and label matrices.
Solution D:
Running the code
Run your model on dr seuss.txt and answer the following questions:
Problem E [2 points]: What is the dimension of the weight matrix of your hidden layer?
Solution E:
Problem F [2 points]: What is the dimension of the weight matrix of your output layer?
Solution F:
Problem G [1 points]: List the top 30 pairs of most similar words that your model generates.
Solution G:
Problem H [2 points]: What patterns do you notice across the resulting pairs of words?
Solution H:





Reviews
There are no reviews yet.