16720 – Instructions
1. Each question (for points) is marked with a Q.
2. Attempt to verify your implementation as you proceed: if you don't verify that your implementation is correct on toy examples, you risk having a huge mess when you put everything together.
3. Some questions will ask you to "Include your code in the writeup". For those questions, you can either copy/paste the code into a verbatim environment, or include screenshots of your code.
4. Submission: The submission is on Gradescope; you will be submitting both your writeup and a code zip file. The zip file, <andrew-id>.zip, contains your code and any results files we ask you to save. Note: You have to submit your writeup separately to Gradescope, and include results (and code snippets, when requested) in the writeup.
5. Do not submit anything from the data/ folder in your submission.
6. To get the data, we have included some scripts in the scripts/ folder.
7. For each coding question, refer to the comments inside the given Python scripts to see which function to implement. Insert your code into the designated place (where it says "your code here"), and DO NOT remove any of the given code elsewhere.
Contents
1 Theory
2 Implement a Fully Connected Network
2.1 Network Initialization
2.2 Forward Propagation
2.3 Backwards Propagation
2.4 Training Loop
2.5 Numerical Gradient Checker
3 Training Models
4 Extract Text from Images
5 (Extra Credit) Image Compression with Autoencoders
5.1 Building the Autoencoder
5.2 Training the Autoencoder
5.3 Evaluating the Autoencoder
6 PyTorch
6.1 Train a neural network in PyTorch
6.2 Fine Tuning
6.3 Neural Networks in the Real World
7 Deliverables
8 Appendix: Neural Network Overview
8.1 Basic Use
8.2 Backprop

1 Theory
Q1.1 Theory [3 points] Prove that softmax is invariant to translation, that is
softmax(x) = softmax(x + c) ∀c ∈ R.
Softmax is defined as below, for each index i in a vector x:

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
Often we use c = −max_i x_i. Why is that a good idea? (Tip: consider the range of values that the numerator will have with c = 0 and with c = −max_i x_i.)
Q1.2 Theory [3 points] Softmax can be written as a three-step process, with s_i = e^{x_i}, S = Σ_i s_i and softmax(x_i) = s_i / S.
• For x ∈ R^d, what are the properties of softmax(x)? Namely, what is the range of each element, and what is the sum over all elements?
• One could say that "softmax takes an arbitrary real-valued vector x and turns it into a __________".
• Can you see the role of each step in the multi-step process now? Explain them.
Q1.3 Theory [3 points] Show that multi-layer neural networks without a non-linear activation function are equivalent to linear regression.
Q1.4 Theory [3 points] Given the sigmoid activation function σ(x) = 1 / (1 + e^{−x}), derive the gradient of the sigmoid function and show that it can be written as a function of σ(x) (without having access to x directly).
Q1.5 Theory [12 points] Given y = Wx + b (a single fully-connected layer) and the gradient of some loss J with respect to y, show how to get ∂J/∂W, ∂J/∂x and ∂J/∂b. Be sure to do the derivatives with scalars and re-form the matrix form afterwards. One notational suggestion: write δ = ∂J/∂y for the given gradient.
We won’t grade the derivative with respect to b but you should do it anyways, you will need it later in the assignment.
Q1.6 Theory [4 points] When the neural network applies an elementwise activation function (such as sigmoid), the gradient of the activation function scales the backpropagation update. This is directly from the chain rule: in the notation of the appendix, ∂J/∂a^{(t)} = ∂J/∂h^{(t)} ⊙ g′(a^{(t)}).
1. Consider the sigmoid activation function for deep neural networks. Why might it lead to a “vanishing gradient” problem if it is used for many layers (consider plotting Q1.4)?
2. Often it is replaced with tanh(x). What are the output ranges of both tanh and sigmoid? Why might we prefer tanh?
3. Why does tanh(x) have less of a vanishing gradient problem? (Plotting the derivatives helps! For reference: tanh′(x) = 1 − tanh²(x).)
4. tanh is a scaled and shifted version of the sigmoid. Show how tanh(x) can be written in terms of σ(x).
2 Implement a Fully Connected Network
When implementing the following functions, make sure you run python/run_q2.py along the way to check that your implemented functions work as expected.
2.1 Network Initialization
Q2.1.1 Theory [3 points] Why is it not a good idea to initialize a network with all zeros? If you imagine that every layer has weights and biases, what can a zero-initialized network output after training?
Q2.1.2 Code [3 points] In python/nn.py, implement a function to initialize one layer's weights with Xavier initialization [1], where W ∼ U[−√(6/(n_in + n_out)), √(6/(n_in + n_out))], n_in and n_out being the dimensionalities of the vectors going into and out of the layer, and you use a uniform distribution to sample random numbers (see eq 16 in the paper). Include your code in the writeup.
Q2.1.3 Theory [3 points] Why do we initialize with random numbers? Why do we scale the initialization depending on layer size (see near Fig 6 in the paper)?
2.2 Forward Propagation
The appendix (sec 8) has the math for forward propagation; we will implement it here.
Q2.2.1 Code [4 points] In python/nn.py, implement sigmoid, along with forward propagation for a single layer with an activation function, namely y = σ(XW + b), returning the output and intermediate results for an N × D dimension input X, with examples along the rows and data dimensions along the columns. Include your code in the writeup.
Q2.2.2 Code [3 points] In python/nn.py, implement the softmax function. Be sure to use the numerical stability trick you derived in Q1.1. Include your code in the writeup.
Q2.2.3 Code [3 points] In python/nn.py, write a function to compute the accuracy of a set of labels, along with the scalar loss across the data. The loss function generally used for classification is the cross-entropy loss.
L_f(D) = − Σ_{(x,y) ∈ D} y · log(f(x))
Here D is the full training dataset of data samples x (N × 1 vectors, N = dimensionality of data) and labels y (C × 1 one-hot vectors, C = number of classes), and f : R^N → [0,1]^C is the classifier. The log is the natural log. Include your code in the writeup.
2.3 Backwards Propagation
Q2.3 Code [7 points] In python/nn.py, write a function to compute backpropagation for a single layer, given the original weights, the appropriate intermediate results, and the gradient with respect to the loss. You should return the gradient with respect to X so you can feed it into the next layer. As a size check, your gradients should be the same dimensions as the original objects. Include your code in the writeup.
2.4 Training Loop
You will tend to see gradient descent in three forms: "normal", "stochastic" and "batch". "Normal" gradient descent aggregates the updates for the entire dataset before changing the weights. Stochastic gradient descent applies updates after every single data example. Batch gradient descent is a compromise, where random subsets of the full dataset are evaluated before applying the gradient update. Note: For batch input, you should sum your backpropagation gradients. A sketch of such a loop follows.
2.5 Numerical Gradient Checker
Q2.5 [5 points] In python/run_q2.py, implement a numerical gradient checker. Instead of using the analytical gradients computed from the chain rule, add an ϵ offset to each element in the weights, and compute the numerical gradient of the loss with central differences. Central differences is just (f(x + ϵ) − f(x − ϵ)) / (2ϵ). Remember, this needs to be done for each scalar dimension in all of your weights independently. This should help you check your gradient code, so there is no need to show the result, but do include your code in the writeup.
3 Training Models
First, from the scripts folder, run the script get_data.sh. This will download and unzip http://www.cs.cmu.edu/~lkeselma/16720a_data/data.zip and http://www.cs.cmu.edu/~lkeselma/16720a_data/images.zip, and extract them to the data and images folders. (If you are on an operating system such as Windows that cannot run the .sh script, you can also download the files manually by clicking the links and unzipping them.)
Since our input images are 32×32 images, unrolled into one 1024-dimensional vector that gets multiplied by W^{(1)}, each row of W^{(1)} can be seen as a weight image. Reshaping each row into a 32×32 image can give us an idea of what types of images each unit in the hidden layer has a high response to.
We have provided three data .mat files to use for this section. The training data in nist36_train.mat contains samples for each of the 26 upper-case letters of the alphabet and the 10 digits. This is the set you should use for training your network. The cross-validation set in nist36_valid.mat contains samples from each class, and should be used in the training loop to see how the network is performing on data that it is not training on. This will help to spot overfitting. Finally, nist36_test.mat contains the test data, and should be used for the final evaluation of your best model to see how well it will generalize to new, unseen data.
Use python/run_q3.py for this question, and refer to the comments for what to implement.
Q3.1 Code [5 points] Train a network from scratch. Use a single hidden layer with 64 hidden units, and train for at least 50 epochs. The script will generate two plots: one showing the accuracy on both the training and validation sets over the epochs, and the other showing the cross-entropy loss averaged over the data. Tune the batch size and learning rate to get an accuracy on the validation set of at least 75%. Include the plots in your writeup. Hint: Use fixed random seeds to improve reproducibility.
Q3.2 Writeup [3 points] Use the script to train three networks, one with your tuned learning rate, one with 10 times that learning rate, and one with one tenth that learning rate. Include all six plots in your writeup (two will be the same from the previous question). Comment on how the learning rates affect the training, and report the final accuracy of the best network on the test set. Hint: Use fixed random seeds to improve reproducibility.
Q3.3 Writeup [3 points] The script will visualize the first layer weights as 64 32×32 images, both immediately after initialization and after fully training. Include both visualizations in your writeup. Comment on the learned weights and compare them to the initialized weights. Do you notice any patterns?
Q3.4 Writeup [3 points] Visualize and include the confusion matrix of the test data for your best model. Comment on the top few pairs of classes that are most commonly confused.
4 Extract Text from Images

Figure 1: Sample image with handwritten characters annotated with boxes around each character.
Now that you have a network that can recognize handwritten letters with reasonable accuracy, you can use it to parse text in an image. Given an image with some text on it, our goal is to have a function that returns the actual text in the image. However, since your neural network expects a binary image with a single character, you will need to process the input image to extract each character. There are various approaches you can take, so feel free to use any strategy you like.
Here we outline one possible method; another is given in a tutorial:
1. Process the image (blur, threshold, opening morphology, etc., perhaps in that order) to classify all pixels as being part of a character or background.
2. Find connected groups of character pixels (see skimage.measure.label). Place a bounding box around each connected component.
3. Group the letters based on which line of the text they are a part of, and sort each group so that the letters are in the order they appear on the page.
4. Take each bounding box one at a time and resize it to 32 × 32, classify it with your network, and report the characters in order (inserting spaces when it makes sense).
Since the network you trained likely does not have perfect accuracy, you can expect there to be some errors in your final text parsing. Whichever method you choose to implement for the character detection, you should be able to place a box on most of the characters in the image. We have provided you with 01_list.jpg, 02_letters.jpg, 03_haiku.jpg and 04_deep.jpg to test your implementation on.
Q4.1 Theory [4 points] The method outlined above is pretty simplistic, and while it works for the given text samples, it makes several assumptions. What are two big assumptions that the sample method makes? In your writeup, include two example images where you expect the character detection to fail (for example, miss valid letters, misclassify letters or respond to non-letters).
Q4.2 Code [10 points] In python/q4.py, implement the function to find letters in the image. Given an RGB image, this function should return bounding boxes for all of the located handwritten characters in the image, as well as a binary black-and-white version of the image, im. Each row of the matrix should contain [y1, x1, y2, x2], the positions of the top-left and bottom-right corners of the box. The black-and-white image should be floating point, 0 to 1, with the characters in black and the background in white. Hint: Since we read text left to right, top to bottom, we can use this to cluster the coordinates. Include your code in the writeup.
Q4.3 Writeup [5 points] Using python/run_q4.py, visualize all of the located boxes on top of the binary image to show the accuracy of your findLetters(..) function. Include all the resulting images in your writeup.
Q4.4 Code/Writeup [8 points] In python/run_q4.py, you will now load the image, find the character locations, classify each one with the network you trained in Q3.1, and return the text contained in the image. Be sure you try to make your detected images look like the images from the training set. Visualize them and adjust accordingly. If you find that your classifier performs poorly, consider dilation (see skimage.morphology) to make the letters thicker.
Your solution is correct if you can correctly detect approximately 100% of the letters and classify approximately 80% of the letters in each of the sample images.
Run your run_q4.py on all of the provided sample images in images/. Include the extracted text in your writeup. It is fine if your code ignores spaces, but if so, please add them manually in the writeup.
5 (Extra Credit) Image Compression with Autoencoders
An autoencoder is a neural network that is trained to attempt to copy its input to its output, but it usually allows copying only approximately. This is typically achieved by restricting the number of hidden nodes inside the autoencoder; in other words, the autoencoder would be forced to learn to represent data with this limited number of hidden nodes. This is a useful way of learning compressed representations. In this section, we will continue using the NIST36 dataset you have from the previous questions. Use python/run_q5.py for this question.
5.1 Building the Autoencoder
Q5.1.1 (Extra Credit) Code [5 points] Implement an autoencoder where the layers are:
• 1024 to 32 dimensions, followed by a ReLU
• 32 to 32 dimensions, followed by a ReLU
• 32 to 32 dimensions, followed by a ReLU
• 32 to 1024 dimensions, followed by a sigmoid (this normalizes the image output for us)
The loss function that you’re using is total squared error for the output image compared to the input image (they should be the same!). Include your code in the writeup.
Q5.1.2 (Extra Credit) Code [5 points] To help even more with convergence speed, we will implement momentum. Now, instead of updating W = W − α ∂J/∂W, we will use the update rules M_W = 0.9 M_W − α ∂J/∂W and W = W + M_W. To implement this, populate the parameters dictionary with zero-initialized momentum accumulators, one for each parameter. Then simply perform both update equations for every batch. Include your code in the writeup.
5.2 Training the Autoencoder
Q5.2 (Extra Credit) Writeup/Code [5 points] Using the provided default settings, train the network for 100 epochs. Plot the training loss curve and include it in the writeup. What do you observe?
5.3 Evaluating the Autoencoder
Q5.3.1 (Extra Credit) Writeup/Code [5 points] Now let’s evaluate how well the autoencoder has been trained. Select 5 classes from the total 36 classes in the validation set and for each selected class include in your report 2 validation images and their reconstruction. What differences do you observe in the reconstructed validation images compared to the original ones?
Q5.3.2 (Extra Credit) Writeup [5 points] Let's evaluate the reconstruction quality using Peak Signal-to-Noise Ratio (PSNR). PSNR is defined as

PSNR = 20 × log10(MAX_I) − 10 × log10(MSE) (1)

where MAX_I is the maximum possible pixel value of the image and MSE is the mean squared error between the original and reconstructed images.
6 PyTorch
While you were able to derive manual backpropagation rules for sigmoid and fully-connected layers, wouldn't it be nice if someone did that for lots of useful primitives and made it fast and easy to use for general computation? Meet automatic differentiation. Since we have high-dimensional inputs (images) and low-dimensional outputs (a scalar loss), it turns out reverse mode AD is very efficient. Popular autodiff packages include pytorch (Facebook), tensorflow (Google), and autograd (Boston-area academics). Autograd provides a drop-in replacement for numpy operators, except you can ask for gradients now. The other two are able to utilize GPUs to perform highly optimized and parallel computations, and are very popular with researchers who train large networks. Tensorflow asks you to build a computational graph using its API, and is then able to pass data through that graph. PyTorch builds a dynamic graph and allows you to mix autograd functions with normal python code much more smoothly, so it is currently more popular in academia.
We will use PyTorch as a framework. Many computer vision projects use neural networks as a basic building block, so familiarity with one of these frameworks is a good skill to develop. Here, we basically replicate and slightly expand our handwritten character recognition networks, but do it in PyTorch instead of doing it ourselves. Feel free to use any tutorial you like, but we like the official one, or this tutorial (in a Jupyter notebook), or these slides (starting from number 35).
For this section, you’re free to implement these however you like. All of the tasks required here are fairly small and don’t require a GPU if you use small networks. Include the neural network code for each part in the writeup.
6.1 Train a neural network in PyTorch
Q6.1.1 Code/Writeup [5 points] Re-write and re-train your fully-connected network on the included NIST36 in PyTorch. Plot training accuracy and loss over time.
Q6.1.2 Code/Writeup [5 points] Train a convolutional neural network with PyTorch on the included NIST36 dataset. Compare its performance with the previous fully-connected network.
Q6.1.3 Code/Writeup [5 points] Train a convolutional neural network with PyTorch on CIFAR-10 (torchvision.datasets.CIFAR10). Plot training accuracy and loss over time.
Q6.1.4 Code/Writeup [10 points] In Homework 1, we tried scene classification with the bag-of-words (BoW) approach on a subset of the SUN database. Use the same dataset in HW1, and implement a convolutional neural network with PyTorch for scene classification. Compare your result with the one you got in HW1, and briefly comment on it.
6.2 Fine Tuning
When training from scratch, a lot of epochs and data are often needed to learn anything meaningful. One way to avoid this is to instead initialize the weights more intelligently.
These days, it is most common to initialize a network with weights from another deep network that was trained for a different purpose. This is because, whether we are doing image classification, segmentation, recognition, etc., most real images share common properties. Simply copying the weights from the other network to yours gives your network a head start, so your network does not need to learn these common weights from scratch all over again. This is also referred to as fine-tuning.
Q6.2 Code/Writeup [5 points] Fine-tune a single-layer classifier using PyTorch on the flowers 17 (or flowers 102!) dataset using squeezenet1_1, as well as an architecture you've designed yourself (for example, 3 convolutional layers followed by 2 fully connected layers; a standard design, see slide 6) and trained from scratch. How do they compare?
We include a script in scripts/ to fetch the flowers dataset and extract it, in data/oxford-flowers17, in a way that torchvision.datasets.ImageFolder can consume (see an example). You should look at how SqueezeNet is defined, and just replace the classifier layer. There exists a pretty good example of fine-tuning in PyTorch.
6.3 Neural Networks in the Real World
Often, we train neural networks on standard datasets and evaluate on held out validation data. How would a model trained on ImageNet perform in the wild?
Q6.3 (Extra Credit) Neural Networks in the Real World [20 points] Download an ImageNet-pretrained image classification model of your choice from torchvision. Using this model, pick a single category (out of the 1000 used to train the model) and evaluate the validation performance on this category. You can download the ImageNet validation data from the challenge page by creating an account (top right). Torchvision has a dataloader to help you load and process ImageNet data automatically. Next, find an instance of this selected category in the real world and take a dynamic (i.e., with some movement) video of this object. Extract all of the frames from this video, apply your pretrained model to each frame, and compare the accuracy of the classifier on your video frames with its accuracy on the validation images. If the two differ, why might that be? Can you suggest ways to make your model more robust?
7 Deliverables
The assignment should be submitted to Gradescope. The writeup should be submitted as a pdf named <AndrewId>.pdf. The code should be submitted as a zip named <AndrewId>.zip. The zip when uncompressed should produce the following files.
• nn.py
• q4.py
• run_q2.py
• run_q3.py
• run_q4.py
• run_q5.py (extra credit)
• any .py files for Q6
• util.py
References
[1] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. 2010. http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf.
[2] P. J. Grother. NIST Special Database 19 – Handprinted Forms and Characters Database. https://www.nist.gov/srd/nist-special-database-19, 1995.
8 Appendix: Neural Network Overview
Deep learning has quickly become one of the most widely applied machine learning techniques in computer vision. Convolutional neural networks have been applied to many different computer vision problems, such as image classification, recognition, and segmentation, with great success. In this assignment, you will first implement a fully connected feed-forward neural network for handwritten character classification. Then, in the second part, you will implement a system to locate characters in an image, which you can then classify with your deep network. The end result will be a system that, given an image of handwritten text, will output the text contained in the image.
8.1 Basic Use
Here we will give a brief overview of the math for a single hidden layer feed forward network. For a more detailed look at the math and derivation, please see the class slides.
A fully-connected network f, for classification, applies a series of linear and non-linear functions to an input data vector x of size N × 1 to produce an output vector f(x) of size C × 1, where each element i of the output vector represents the probability of x belonging to class i. Since the data samples are of dimensionality N, this means the input layer has N input units. To compute the value of the output units, we must first compute the values of all the hidden layers. The first hidden layer pre-activation a^{(1)}(x) is given by
a^{(1)}(x) = W^{(1)} x + b^{(1)}
Then the post-activation values of the first hidden layer h^{(1)}(x) are computed by applying a non-linear activation function g to the pre-activation values
h^{(1)}(x) = g(a^{(1)}(x)) = g(W^{(1)} x + b^{(1)})
Subsequent hidden layer (1 < t ≤ T) pre- and post-activations are given by:

a^{(t)}(x) = W^{(t)} h^{(t−1)}(x) + b^{(t)}

h^{(t)}(x) = g(a^{(t)}(x))
The output layer pre-activations a^{(T)}(x) are computed in a similar way

a^{(T)}(x) = W^{(T)} h^{(T−1)}(x) + b^{(T)}
and finally the post-activation values of the output layer are computed with

f(x) = o(a^{(T)}(x)) = o(W^{(T)} h^{(T−1)}(x) + b^{(T)})
where o is the output activation function. Please note the difference between g and o! For this assignment, we will be using the sigmoid activation function for the hidden layer, so:

g(a) = σ(a) = 1 / (1 + e^{−a})

where, when g is applied to a vector, it is applied elementwise across the vector.

Figure 2: Samples from NIST Special Database 19 [2]
Since we are using this deep network for classification, a common output activation function to use is the softmax function. This will allow us to turn the real-valued, possibly negative values of a^{(T)}(x) into a set of probabilities (a vector of positive numbers that sums to 1). Letting x_i denote the ith element of the vector x, the softmax function is defined as:

softmax(x_i) = e^{x_i} / Σ_{j=1}^{C} e^{x_j}
Gradient descent is an iterative optimization algorithm used to find a local optimum. To find a local minimum, we start at a point on the function and move in the direction of the negative gradient (steepest descent) until some stopping criterion is met.
8.2 Backprop
The update equations for a general weight W^{(t)}_{ij} and bias b^{(t)}_{i} are

W^{(t)}_{ij} ← W^{(t)}_{ij} − α ∂J/∂W^{(t)}_{ij}

b^{(t)}_{i} ← b^{(t)}_{i} − α ∂J/∂b^{(t)}_{i}

where α is the learning rate. Please refer to the backpropagation slides for more details on how to derive the gradients. Note that here we are using the softmax loss (which is different from the least-squares loss in the slides).
