Description
CSCI-GA 2572 Deep Learning
The goal of homework 3 is to test your understanding of Energy-Based Models, and to show you one application in structured prediction.
In the theoretical part, we’ll mostly test your intuition. You’ll need to write brief answers to questions about how EBMs work. In part 2, we will implement a simple optical character recognition system.
For part 1, you should submit all your answers in a PDF file. As before, we recommend using LaTeX.
For part 2, you will implement some neural networks by adding your code to the provided ipynb file.
As before, please use numerator layout.
Submit the following files:
• hw3_theory.pdf
• hw3_practice.ipynb
Note: we will subtract points for Campuswire posts containing solutions to problems. Campuswire shouldn’t be a platform where you can get your solution checked; its purpose is to help you clear up any misunderstandings associated with the homework.
The following behaviors will result in a penalty to your final score:
1. 10% penalty for submitting your files without using the correct naming format (e.g., naming the zip, PDF, or Python file incorrectly, or including extra files, such as testing scripts, in the zip folder).
3. 20% penalty for a code submission that cannot be executed by following the steps we described.
1 Theory (50pt)
1.1 Energy Based Models Intuition
This question tests your intuitive understanding of Energy-based models and their properties.
(a) How do energy-based models allow for modeling situations where the mapping from an input $x_i$ to an output $y_i$ is not one-to-one but one-to-many?
(b) How do energy-based models differ from models that output probabilities?
(c) How can you use the energy function $F_W(x, y)$ to calculate a probability $p(y \mid x)$?
(d) What are the roles of the loss function and energy function?
(e) Can the loss function be equal to the energy function?
(g) Briefly explain the three methods that can be used to shape the energy function.
(h) Provide an example of a loss function that uses negative examples. The format should be as follows: $\ell_{\text{example}}(x, y, W) = F_W(x, y)$.
1.2 Negative log-likelihood loss
Let’s consider an energy-based model that we are training to classify an input among $n$ classes: $y \in \{1, \ldots, n\}$. Here $F_W(x, y)$ is the energy of input $x$ paired with class $y$. (A small numerical sketch of this setup is given after the questions below.)
(i) For a given input x, write down an expression for a Gibbs distribution over labels y that this energy-based model specifies. Use β for the constant multiplier.
(ii) Let’s say that for a particular data sample $x$, we have the label $y$. Give the expression for the negative log-likelihood loss, i.e. the negative log-likelihood of the correct label (don’t copy expressions from the slides; show a step-by-step derivation of the loss function from the expression in the previous subproblem). For easier calculations in the following subproblem, multiply the loss by .
(iii) Now, derive the gradient of that expression with respect to W (just providing the final expression is not enough). Why can it be intractable to compute it, and how can we get around the intractability?
(iv) Explain why the negative log-likelihood loss pushes the energy of the correct example to negative infinity and the energies of all others to positive infinity, no matter how close the two examples are, resulting in an energy surface with very sharp edges in the case of continuous $y$ (this is usually not an issue for discrete $y$, because there is no distance measure between different classes).
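For concreteness, here is a minimal numerical sketch of the setup above, written in PyTorch (assumed here only because it is the framework used in part 2; the names energies, beta, and target are hypothetical and purely illustrative). It shows how a vector of per-class energies $F_W(x, y)$ can be turned into a Gibbs distribution and a negative log-likelihood value; it is not a substitute for the derivations requested in (ii) and (iii).

import torch

# Hypothetical sketch: energies[y] plays the role of F_W(x, y) for one input x
# and classes y = 0, ..., n-1; beta is the constant multiplier from (i).

def gibbs_log_probs(energies: torch.Tensor, beta: float) -> torch.Tensor:
    # log p(y | x) = -beta * F_W(x, y) - log sum_{y'} exp(-beta * F_W(x, y'))
    logits = -beta * energies
    return logits - torch.logsumexp(logits, dim=0)  # logsumexp for numerical stability

def nll_of_correct_label(energies: torch.Tensor, target: int, beta: float) -> torch.Tensor:
    # Negative log-likelihood of the correct class under the Gibbs distribution.
    return -gibbs_log_probs(energies, beta)[target]

# Example with four made-up energies; the correct label is class 2.
energies = torch.tensor([1.3, 0.2, -0.5, 2.0])
print(nll_of_correct_label(energies, target=2, beta=1.0).item())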
1.3 Comparing Contrastive Loss Functions
In this problem, we’re going to compare a few contrastive loss functions. We are going to look at the behavior of their gradients and understand what each loss function is useful for. In the following subproblems, $m \in \mathbb{R}$ is a margin, $x$ is the input, $y$ is the correct label, and $\bar{y}$ is the incorrect label. Define the loss in the following format: $\ell_{\text{example}}(x, y, \bar{y}, W) = F_W(x, y)$. (A short code sketch that evaluates the three losses defined below appears at the end of this subsection.)
(a) The simple loss function is defined as follows:
$\ell_{\text{simple}}(x, y, \bar{y}, W) = [F_W(x, y)]^+ + [m - F_W(x, \bar{y})]^+$
Assuming we know the derivative $\frac{\partial F_W(x, y)}{\partial W}$ for any $x, y$, give an expression for the partial derivative of $\ell_{\text{simple}}$ with respect to $W$.
(b) The hinge loss is defined as follows:
$\ell_{\text{hinge}}(x, y, \bar{y}, W) = [F_W(x, y) - F_W(x, \bar{y}) + m]^+$
Assuming we know the derivative $\frac{\partial F_W(x, y)}{\partial W}$ for any $x, y$, give an expression for the partial derivative of $\ell_{\text{hinge}}$ with respect to $W$.
(c) The square-square loss is defined as follows:
$\ell_{\text{square-square}}(x, y, \bar{y}, W) = \left([F_W(x, y)]^+\right)^2 + \left([m - F_W(x, \bar{y})]^+\right)^2$
Assuming we know the derivative $\frac{\partial F_W(x, y)}{\partial W}$ for any $x, y$, give an expression for the partial derivative of $\ell_{\text{square-square}}$ with respect to $W$.
(d) Comparison.
(i) Explain how NLL loss is different from the three losses above.
(ii) What is the role of the margin in the hinge loss? Why do we take only the positive part of $F_W(x, y) - F_W(x, \bar{y}) + m$?
(iii) How are simple loss and square-square loss different from hinge loss? In what situations would you use simple loss, and in what situations would you use square-square loss?
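As a reference for the comparison above, the following is a small sketch (in PyTorch, assumed only for convenience) that evaluates the three losses exactly as they are defined in (a)–(c). The names energy_pos, energy_neg, and m are our own illustrative choices, with $[z]^+$ realised as torch.clamp(z, min=0).

import torch

# Hypothetical sketch: energy_pos = F_W(x, y), energy_neg = F_W(x, ybar), m = margin.

def simple_loss(energy_pos, energy_neg, m):
    # [F_W(x, y)]^+ + [m - F_W(x, ybar)]^+
    return torch.clamp(energy_pos, min=0) + torch.clamp(m - energy_neg, min=0)

def hinge_loss(energy_pos, energy_neg, m):
    # [F_W(x, y) - F_W(x, ybar) + m]^+
    return torch.clamp(energy_pos - energy_neg + m, min=0)

def square_square_loss(energy_pos, energy_neg, m):
    # ([F_W(x, y)]^+)^2 + ([m - F_W(x, ybar)]^+)^2
    return torch.clamp(energy_pos, min=0) ** 2 + torch.clamp(m - energy_neg, min=0) ** 2

# Example with made-up energies and margin m = 1.0.
e_pos, e_neg = torch.tensor(0.3), torch.tensor(0.8)
for loss_fn in (simple_loss, hinge_loss, square_square_loss):
    print(loss_fn.__name__, loss_fn(e_pos, e_neg, 1.0).item())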
2 Implementation (50pt)
Please add your solutions to the notebook hw3_practice.ipynb. Please use your NYU account to access the notebook. The notebook contains parts marked as TODO, where you should put your code or explanations. The notebook is a Google Colab notebook: copy it to your Drive, add your solutions, and then download it and submit it to NYU Classes. You’re also free to run it on any other machine, as long as the version you send us can be run on Google Colab.