## Description

TAs: Aakanksha, Edgar, Sida, Varsha

Summary: In this assignment, you will build a handwriting recognition system using a neural network. As a warm-up, Section 1 will lead you through an on-paper example of how to implement a neural network. Then, in Section 2, you will implement an end-to-end system that learns to perform handwritten letter classification.

START HERE: Instructions

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/courses/10601bd-f18/about.html

• Submitting your work: You will use Gradescope to submit answers to all questions, and Autolab to submit your code. Please follow instructions at the end of this PDF to correctly submit all your code to Autolab.

– Autolab: You will submit your code for programming questions on the homework to Autolab (https://autolab.andrew.cmu.edu/). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). The software installed on the VM is identical to that on linux.andrew.cmu.edu, so you should check that your code runs correctly there. If developing locally, check that the version number of the programming language environment (e.g. Python 2.7/3.5, Octave 3.8.2, OpenJDK 1.8.0, g++ 4.8.5) and versions of permitted libraries (e.g. numpy 1.7.1) match those on linux.andrew.cmu.edu. (Octave users: Please make sure you do not use any Matlab-specific libraries in your code that might make it fail against our tests.) You have a total of 10 Autolab submissions. Use them wisely. In order to not waste Autolab submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code runs correctly before any Autolab submission.

• Materials: Download from Autolab the tar file ("Download handout"). The tar file will contain all the data that you will need in order to complete this assignment.

For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LaTeX users, use the commands provided in the template for shaded boxes and circles, and don't change anything else.

Instructions for Specific Problem Types

For "Select One" questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

◯ Matt Gormley
◯ Marie Curie
◯ Noam Chomsky

To select "Noam Chomsky," you would fill in the bubble like so:

Select One: Who taught this course?

◯ Matt Gormley
◯ Marie Curie
⬤ Noam Chomsky

For "Select all that apply" questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

☐ Stephen Hawking
☐ Albert Einstein
☐ Isaac Newton
☐ I don't know

To select "Stephen Hawking," "Albert Einstein," and "Isaac Newton," you would fill in the squares like so:

Select all that apply: Which are scientists?

☑ Stephen Hawking
☑ Albert Einstein
☑ Isaac Newton
☐ I don't know

Fill in the blank: What is the course number?

10-601

1 Written Questions [25 points]

1.1 Example Feed Forward and Backpropagation [15 points]

Figure 1.1: A One Hidden Layer Neural Network

Network Overview Consider the neural network with one hidden layer shown in Figure 1.1. The input layer consists of 6 features x = [x1,…,x6]T , the hidden layer has 4 nodes z = [z1,…,z4]T , and the output layer is a probability distribution y = [y1,y2,y3]T over 3 classes. We also add a bias to the input, x0 = 1 and the hidden layer z0 = 1, both of which are fixed to 1.

ฮฑ is the matrix of weights from the inputs to the hidden layer and ฮฒ is the matrix of weights from the hidden layer to the output layer. ฮฑj,i represents the weight going to the node zj in the hidden layer from the node xi in the input layer (e.g. ฮฑ1,2 is the weight from x2 to z1), and ฮฒ is defined similarly. We will use a sigmoid activation function for the hidden layer and a softmax for the output layer.

Network Details Equivalently, we define each of the following.

The input:

The input:

x = [x_1, x_2, x_3, x_4, x_5, x_6]^T   (1.1)

Linear combination at first (hidden) layer:

a_j = α_{j,0} + Σ_{i=1}^{6} α_{j,i} x_i,   ∀ j ∈ {1, …, 4}   (1.2)

Activation at first (hidden) layer:

z_j = σ(a_j) = 1 / (1 + exp(−a_j)),   ∀ j ∈ {1, …, 4}   (1.3)

Linear combination at second (output) layer:

b_k = β_{k,0} + Σ_{j=1}^{4} β_{k,j} z_j,   ∀ k ∈ {1, …, 3}   (1.4)

Activation at second (output) layer:

ŷ_k = exp(b_k) / Σ_{l=1}^{3} exp(b_l),   ∀ k ∈ {1, …, 3}   (1.5)

Note that the linear combination equations can be written equivalently as the product of the transpose of the weight matrix with the input vector. We can even fold in the bias term ฮฑ0 by thinking of x0 = 1, and fold in ฮฒ0 by thinking of z0 = 1.

Loss We will use cross-entropy loss, ℓ(ŷ, y). If y represents our target output, which will be a one-hot vector representing the correct class, and ŷ represents the output of the network, the loss is calculated by:

ℓ(ŷ, y) = −Σ_{k=1}^{3} y_k log(ŷ_k)   (1.6)

Prediction When doing prediction, we will predict the argmax of the output layer. For example, if yห1 = 0.3,yห2 = 0.2,yห3 = 0.5 we would predict class 3. If the true class from the training data was 2 we would have a one-hot vector y with values y1 = 0, y2 = 1, y3 = 0.
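The prediction rule and one-hot encoding above can be sketched in a few lines of numpy (an illustration only; the variable names are ours, and the class indices follow the 1-based numbering of this example):

```python
import numpy as np

# Predict the argmax of the network's output distribution, and build
# the one-hot target vector, as described in the text above.
y_hat = np.array([0.3, 0.2, 0.5])       # example output from the text
predicted_class = np.argmax(y_hat) + 1  # classes are numbered 1..3 here

true_class = 2
y = np.zeros(3)
y[true_class - 1] = 1.0                 # one-hot encoding: y = [0, 1, 0]
```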

1. [4 points] We initialize the weights as:

α = ⎡ 1  2 −3  0  1 −3 ⎤
    ⎢ 3  1  2  1  0  2 ⎥
    ⎢ 2  2  2  2  2  1 ⎥
    ⎣ 1  0  2  1 −2  2 ⎦

β = ⎡ 1  2 −2  1 ⎤
    ⎢ 1 −1  1  2 ⎥
    ⎣ 3  1 −1  1 ⎦

And weights on the bias terms (α_{j,0} and β_{k,0}) are initialized to 1.

You are given a training example x(1) = [1,1,0,0,1,1]T with label class 2, so y(1) = [0,1,0]T . Using the initial weights, run the feed forward of the network over this example (without rounding during the calculation) and then answer the following questions.

(a) What is a1?

(g) Which class would we predict on this example?

(h) What is the total loss on this example?
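If you want to check your on-paper arithmetic, the feed forward pass can be reproduced in a few lines of numpy. This is a sketch using the weight matrices and the example x(1) given above; the variable names are ours:

```python
import numpy as np

# Weight matrices from the question (bias weights handled separately).
alpha = np.array([[1, 2, -3, 0,  1, -3],
                  [3, 1,  2, 1,  0,  2],
                  [2, 2,  2, 2,  2,  1],
                  [1, 0,  2, 1, -2,  2]], dtype=float)
beta = np.array([[1,  2, -2, 1],
                 [1, -1,  1, 2],
                 [3,  1, -1, 1]], dtype=float)
alpha0 = np.ones(4)  # bias weights alpha_{j,0}, all initialized to 1
beta0 = np.ones(3)   # bias weights beta_{k,0}, all initialized to 1

x = np.array([1, 1, 0, 0, 1, 1], dtype=float)  # training example x(1)

a = alpha0 + alpha @ x               # linear combination at hidden layer
z = 1.0 / (1.0 + np.exp(-a))         # sigmoid activation
b = beta0 + beta @ z                 # linear combination at output layer
y_hat = np.exp(b) / np.exp(b).sum()  # softmax
```

Printing a, z, b, and y_hat (without rounding) gives the intermediate quantities the sub-questions ask about.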

2. [5 points] Now use the results of the previous question to run backpropagation over the network and update the weights. Use learning rate η = 1.

Do your backpropagation calculations without rounding, then answer the following questions; in your responses, round to four decimal places.

(a) What is the updated value of β2,1?

(b) What is the updated weight of the hidden layer bias term applied to y1 (i.e. ฮฒ1,0)?

(c) What is the updated value of ฮฑ3,4?

(d) What is the updated weight of the input layer bias term applied to z2 (i.e. ฮฑ2,0)?

(e) If we ran backpropagation on this example for a large number of iterations and then ran feed forward over the same example again, which class would we predict?

3. [6 points] Let us now introduce regularization into our neural network. For this question, we will incorporate L2 regularization into our loss function ℓ(ŷ, y), with the parameter λ controlling the weight given to the regularization term.

(a) Write the expression for the regularized loss function of our network after adding L2 regularization (Hint: Remember that bias terms should not be regularized!)

(b) Compute the regularized loss for training example x(1) (assume ฮป = 0.01 and use the weights before backpropagation)
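Once you have written down the regularized loss for part (a), the penalty can be computed mechanically. The sketch below assumes the convention that the penalty is λ times the sum of squared non-bias weights; verify this against your own expression before relying on it:

```python
import numpy as np

def regularized_loss(cross_entropy, alpha, beta, lam):
    """Cross-entropy plus an L2 penalty on the non-bias weights.

    alpha and beta here EXCLUDE the bias columns, since bias terms
    are not regularized. Assumes penalty = lam * (sum of squares).
    """
    penalty = lam * (np.sum(alpha ** 2) + np.sum(beta ** 2))
    return cross_entropy + penalty
```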

Suppose the weight initialization for α is changed to the following:

α = ⎡ 10 20 −30  0  10 −30 ⎤
    ⎢ 30 10  20 10   0  20 ⎥
    ⎢ 20 20  20 20  20  10 ⎥
    ⎣ 10  0  20 10 −20  20 ⎦

β and bias terms are not changed.

(c) Report the non-regularized loss for the network on training example x(1)

(d) Report the regularized loss for the network on training example x(1) (ฮป = 0.01)

(f) Based on your observations from the previous questions, select all statements which are true:

☐ The non-regularized loss is always higher than the regularized loss

☐ As weights become larger, the regularized loss increases faster than the non-regularized loss

☐ On adding regularization to the loss function, gradient updates for the network become larger

☐ When using large initial weights, weight values decrease more rapidly for a network which uses regularized loss

☐ None of the above

1.2 Empirical Questions [10 points]

The following questions should be completed after you work through the programming portion of this assignment (Section 2).

For these questions, use the large dataset.

Use the following values for the hyperparameters unless otherwise specified:

Parameter                Value
Number of Hidden Units   50
Weight Initialization    RANDOM
Learning Rate            0.01

Table 1.1: Default values of hyperparameters for experiments in Section 1.2.

For the following questions, submit your solutions to Gradescope. Please submit computer-generated plots for Q4 and Q6. Do not include any visualization-related code when submitting to Autolab! Note: we expect it to take about 5 minutes to train each of these networks.

4. [4 points] Train a single hidden layer neural network using the hyperparameters mentioned in Table 1.1, except for the number of hidden units which should vary among 5, 20, 50, 100, and 200. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy (sum of the cross-entropy terms over the training dataset divided by the total number of training examples) on the y-axis vs. the number of hidden units on the x-axis. In the same figure, plot the average test cross-entropy.

5. [1 points] Examine and comment on the plots of training and test cross-entropy. What is the effect of changing the number of hidden units?

6. [4 points] Train a single hidden layer neural network using the hyperparameters mentioned in Table 1.1, except for the learning rate which should vary among 0.1, 0.01, and 0.001. Run the optimization for 100 epochs each time. For each learning rate, plot the average training and test cross-entropy on the y-axis vs. the number of epochs on the x-axis.

7. [1 points] Examine and comment on the plots of training and test cross-entropy. How does adjusting the learning rate affect the convergence of cross-entropy on each dataset?

2 Programming [75 points]

Figure 2.1: 10 Random Images of Each of 10 Letters in OCR

2.1 The Task and Datasets

Materials Download the tar file from Autolab ("Download handout"). The tar file will contain all the data that you will need in order to complete this assignment.

Datasets We will be using a subset of an Optical Character Recognition (OCR) dataset. This data includes images of all 26 handwritten letters; our subset will include only the letters "a," "e," "g," "i," "l," "n," "o," "r," "t," and "u." The handout contains three datasets drawn from this data: a small dataset with 60 samples per class (50 for training and 10 for test), a medium dataset with 600 samples per class (500 for training and 100 for test), and a large dataset with 1000 samples per class (900 for training and 100 for test). Figure 2.1 shows a random sample of 10 images of a few letters from the dataset.

File Format Each dataset (small, medium, and large) consists of two csv files: train and test. Each row contains 129 columns separated by commas. The first column contains the label and columns 2 to 129 represent the pixel values of a 16 × 8 image in row-major format. Label 0 corresponds to "a," 1 to "e," 2 to "g," 3 to "i," 4 to "l," 5 to "n," 6 to "o," 7 to "r," 8 to "t," and 9 to "u." Because the original images are black-and-white (not grayscale), the pixel values are either 0 or 1. However, you should write your code to accept arbitrary pixel values in the range [0, 1]. The images in Figure 2.1 were produced by converting these pixel values into .png files for visualization. Observe that no feature engineering has been done here; instead, the neural network you build will learn features appropriate for the task of character recognition.
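In Python, for example, each csv file can be loaded with a few lines of numpy (a sketch; the function name is ours):

```python
import numpy as np

def load_dataset(path):
    """Load one of the OCR csv files: the first column is the integer
    label, the remaining 128 columns are pixel values in [0, 1]."""
    data = np.loadtxt(path, delimiter=',')
    labels = data[:, 0].astype(int)
    pixels = data[:, 1:]
    return pixels, labels
```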

2.2 Model Definition

In this assignment, you will implement a single-hidden-layer neural network with a sigmoid activation function for the hidden layer and a softmax on the output layer. Let the input vectors x be of length M, the hidden layer z consist of D hidden units, and the output layer ŷ be a probability distribution over K classes. That is, each element ŷ_k of the output vector represents the probability of x belonging to the class k.

We can compactly express this model by assuming that x0 = 1 is a bias feature on the input and that z0 = 1 is also fixed. In this way, we have two parameter matrices α ∈ ℝ^(D×(M+1)) and β ∈ ℝ^(K×(D+1)). The extra 0th column of each matrix (i.e. α_{·,0} and β_{·,0}) holds the bias parameters.

The objective function we will use for training the neural network is the average cross-entropy over the training dataset D = {(x^(i), y^(i))}_{i=1}^{N}:

J(α, β) = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_k^(i) log(ŷ_k^(i))   (2.1)

In Equation 2.1, J is a function of the model parameters α and β because ŷ^(i) is implicitly a function of x^(i), α, and β, since it is the output of the neural network applied to x^(i). Of course, ŷ_k^(i) and y_k^(i) are the kth components of ŷ^(i) and y^(i) respectively.

To train, you should optimize this objective function using stochastic gradient descent (SGD), where the gradient of the parameters for each training example is computed via backpropagation.

2.2.1 Initialization

In order to use a deep network, we must first initialize the weights and biases in the network. This is typically done with a random initialization, or by initializing the weights from some other training procedure. For this assignment, we will be using two possible initializations:

RANDOM The weights are initialized randomly from a uniform distribution from -0.1 to 0.1. The bias parameters are initialized to zero.

ZERO All weights are initialized to 0.

You must support both of these initialization schemes.
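A sketch of the two schemes in numpy (the function name and the seed argument are ours; the bias column is the 0th column, following the convention in Section 2.2):

```python
import numpy as np

def init_weights(rows, cols, init_flag, seed=None):
    """Return a (rows x (cols+1)) weight matrix; the 0th column is the bias.

    init_flag == 1 (RANDOM): non-bias weights uniform in [-0.1, 0.1],
                             bias column zero.
    init_flag == 2 (ZERO):   everything zero.
    """
    w = np.zeros((rows, cols + 1))
    if init_flag == 1:
        rng = np.random.default_rng(seed)
        w[:, 1:] = rng.uniform(-0.1, 0.1, size=(rows, cols))
    return w
```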

2.3 Implementation

Write a program neuralnet.{py|java|cpp|m} that implements an optical character recognizer using a one hidden layer neural network with sigmoid activations. Your program should learn the parameters of the model on the training data, report the cross-entropy at the end of each epoch on both train and validation data, and at the end of training write out its predictions and error rates on both datasets.

Your implementation must satisfy the following requirements:

• Use a sigmoid activation function on the hidden layer and softmax on the output layer to ensure it forms a proper probability distribution.

• The number of hidden units for the hidden layer should be determined by a command line flag.

• Support two different initialization strategies, as described in Section 2.2.1, selecting between them via a command line flag.

• Use stochastic gradient descent (SGD) to optimize the parameters for the one hidden layer neural network. The number of epochs will be specified as a command line flag.

• Set the learning rate via a command line flag.

• Perform stochastic gradient descent updates on the training data in the order that the data is given in the input file. Although you would typically shuffle training examples when using stochastic gradient descent, in order to autograde the assignment, we ask that you DO NOT shuffle examples in this assignment.

Implementing a neural network can be tricky: the parameters are not just a simple vector, but a collection of many parameters; computational efficiency of the model itself becomes essential; the initialization strategy dramatically impacts overall learning quality; other aspects which we will not change (e.g. activation function, optimization method) also have a large effect. These tips should help you along the way:

• Try to "vectorize" your code as much as possible; this is particularly important for Python and Octave. For example, in Python, you want to avoid for-loops and instead rely on numpy calls to perform operations such as matrix multiplication, transpose, subtraction, etc. over an entire numpy array at once. Why? Because these operations are actually implemented in fast C code, which won't get bogged down the way a high-level scripting language like Python will.

• For low-level languages such as Java/C++, the use of primitive arrays and for-loops would not pose any computational efficiency problems; however, it is still helpful to make use of a linear algebra library to cut down on the number of lines of code you will write.

For Python:

$ python neuralnet.py [args…]

For Java:

$ javac -cp "./lib/ejml-v0.33-libs/*:./" neuralnet.java
$ java -cp "./lib/ejml-v0.33-libs/*:./" neuralnet [args…]

For C++:

$ g++ -g -std=c++11 -I./lib neuralnet.cpp; ./a.out [args…]

For Octave:

$ octave -qH neuralnet.m [args…]

• Implement a finite difference test to check whether your implementation of backpropagation is correctly computing gradients. If you choose to do this, comment out this functionality once your backward pass starts giving correct results and before submitting to Autolab, since it will otherwise slow down your code.
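Such a finite difference check can be sketched as follows, using central differences (the function name is ours; loss_fn is assumed to close over the data and any parameters you are not differentiating):

```python
import numpy as np

def finite_difference_grad(loss_fn, theta, eps=1e-5):
    """Numerically estimate d loss_fn / d theta via central differences.

    loss_fn takes a flat parameter vector and returns a scalar loss."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = eps
        grad[i] = (loss_fn(theta + bump) - loss_fn(theta - bump)) / (2 * eps)
    return grad
```

Compare the result against your backpropagation gradient with something like np.allclose(..., atol=1e-4); a mismatch indicates a bug in the forward or backward pass.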

2.3.1 Command Line Arguments

The autograder runs and evaluates the output from the files generated, using the commands shown above (one for each language). In those commands, [args…] is a placeholder for nine command-line arguments: <train input> <test input> <train out> <test out> <metrics out> <num epoch> <hidden units> <init flag> <learning rate>. These arguments are described in detail below:

1. <train input>: path to the training input .csv file (see Section 2.1)

2. <test input>: path to the test input .csv file (see Section 2.1)

3. <train out>: path to output .labels file to which the predictions on the training data should be written (see Section 2.3.2)

4. <test out>: path to output .labels file to which the predictions on the test data should be written (see Section 2.3.2)

5. <metrics out>: path of the output .txt file to which metrics such as train and test error should be written (see Section 2.3.3)

6. <num epoch>: integer specifying the number of times backpropagation loops through all of the training data (e.g., if <num epoch> equals 5, then each training example will be used in backpropagation 5 times).

7. <hidden units>: positive integer specifying the number of hidden units.

8. <init flag>: integer taking value 1 or 2 that specifies whether to use RANDOM or ZERO initialization (see Section 2.2.1 and Section 2.3); that is, if init_flag==1, initialize your weights randomly from a uniform distribution over the range [-0.1, 0.1] (i.e. RANDOM); if init_flag==2, initialize all weights to zero (i.e. ZERO). For both settings, always initialize bias terms to zero.

9. <learning rate>: float value specifying the learning rate for SGD.
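In Python, the nine positional arguments can be unpacked directly from sys.argv (a sketch; the function and key names are ours):

```python
import sys

def parse_args(argv):
    """Unpack the nine positional command-line arguments described above."""
    (train_input, test_input, train_out, test_out, metrics_out,
     num_epoch, hidden_units, init_flag, learning_rate) = argv[1:10]
    return {
        'train_input': train_input,
        'test_input': test_input,
        'train_out': train_out,
        'test_out': test_out,
        'metrics_out': metrics_out,
        'num_epoch': int(num_epoch),        # numeric flags must be converted
        'hidden_units': int(hidden_units),
        'init_flag': int(init_flag),
        'learning_rate': float(learning_rate),
    }
```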

As an example, if you implemented your program in Python, the following command line would run your program with 4 hidden units on the small data provided in the handout for 2 epochs using zero initialization and a learning rate of 0.1.

$ python neuralnet.py smalltrain.csv smalltest.csv model1train_out.labels model1test_out.labels model1metrics_out.txt 2 4 2 0.1

Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. Autolab will use EJML version 3.3. The command line arguments above demonstrate how we will call your code. The classpath inclusion

-cp "./lib/ejml-v0.33-libs/*:./" will ensure that all the EJML jars are on the classpath as well as your code.

C++ Eigen is a header-only library, so there is no linking to worry about; just #include whatever components you need. Autolab will use Eigen version 3.3.4. The command line arguments above demonstrate how we will call your code. The argument -I./lib will include the lib/Eigen subdirectory, which contains all the headers.

We have included the correct versions of EJML/Eigen in the handout.tar for your convenience. Do not include EJML or Eigen in your Autolab submission tar; the autograder will ensure that they are in place.

EJML: https://ejml.org    Eigen: http://eigen.tuxfamily.org/

2.3.2 Output: Labels Files

Your program should write two output .labels files containing the predictions of your model on the training data (<train out>) and test data (<test out>). Each should contain the predicted labels for each example printed on a new line. Use \n to create a new line.

Your labels should exactly match those of a reference implementation โ this will be checked by the autograder by running your program and evaluating your output file against the reference solution.

Note: You should output your predicted labels using the same integer identifiers as the original training data. You should also insert an empty line (again using \n) at the end of each sequence (as is done in the input data files). The first few lines of the predicted labels for the test dataset are given below:

6

4

8

8

2.3.3 Output Metrics

Generate a file where you report the following metrics:

cross entropy After each Stochastic Gradient Descent (SGD) epoch, report the mean cross-entropy on the training data, crossentropy(train), and on the test data, crossentropy(test) (see Equation 2.1). These two cross-entropy values should be reported at the end of each epoch and prefixed by the epoch number: for example, after the second pass through the training examples, they should be prefixed by epoch=2. The total number of train losses you print out should equal the number of epochs; likewise for the total number of test losses.

error After the final epoch (i.e. when training has completed fully), report the final training error, error(train), and test error, error(test).

A sample output is given below. It contains the train and test losses for the first 2 epochs and the final error rates when using the command given above.

epoch=1 crossentropy(train): 2.18506276114
epoch=1 crossentropy(test): 2.18827302588
epoch=2 crossentropy(train): 1.90103257727
epoch=2 crossentropy(test): 1.91363803461
error(train): 0.77
error(test): 0.78

Take care that your output has the exact same format as shown above. There is an equal sign = between the word epoch and the epoch number, but no spaces. There should be a single space after the epoch number (e.g. a space after epoch=1), and a single space after the colon preceding the metric value (e.g. a space after epoch=1 crossentropy(train):). Each line should be terminated by a Unix line ending \n.
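One way to guarantee this exact format is to build every line explicitly and join with \n, as in this Python sketch (the function name is ours):

```python
def write_metrics(path, train_ce, test_ce, train_err, test_err):
    """Write the metrics file in the exact format shown above.

    train_ce / test_ce are lists of per-epoch mean cross-entropies."""
    lines = []
    for epoch, (tr, te) in enumerate(zip(train_ce, test_ce), start=1):
        lines.append('epoch=%d crossentropy(train): %s' % (epoch, tr))
        lines.append('epoch=%d crossentropy(test): %s' % (epoch, te))
    lines.append('error(train): %s' % train_err)
    lines.append('error(test): %s' % test_err)
    with open(path, 'w') as f:
        # join with Unix line endings, including one after the final line
        f.write('\n'.join(lines) + '\n')
```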

2.4 Autolab Submission

You must submit a .tar file named neuralnet.tar containing neuralnet.{py|m|java|cpp}. You can create that file by running:

tar -cvf neuralnet.tar neuralnet.{py|m|java|cpp}

from the directory containing your code.

Some additional tips:

• DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf.

• DO NOT put the above files in a folder and then tar the folder.

• Autolab is case sensitive, so observe that all your files should be named in lowercase.

• You must submit this file to the corresponding homework link on Autolab.

The autograder for Autolab prints out some additional information about the tests that it ran. You can view this output by selecting "Handin History" from the menu and then clicking one of the scores you received for a submission. For example, on this assignment, among other things, the autograder will print out which language it detects (e.g. Python, Octave, C++, Java).

Python3 Users: Please include a blank file called python3.txt (case-sensitive) in your tar submission and we will execute your submitted program using Python 3 instead of Python 2.7.

A Implementation Details for Neural Networks

This section provides a variety of suggestions for how to efficiently and succinctly implement a neural network and backpropagation.

A.1 SGD for Neural Networks

Consider the neural network described in Section 2.3 applied to the ith training example (x, y), where y is a one-hot encoding of the true label. Our neural network outputs ŷ = h_{α,β}(x), where α and β are the parameters of the first and second layers respectively and h_{α,β}(·) is a one-hidden-layer neural network with a sigmoid activation and softmax output. The loss function is the cross-entropy J = ℓ(ŷ, y) = −y^T log(ŷ). J = J_{x,y}(α, β) is actually a function of our training example (x, y) and our model parameters α, β, though we write just J for brevity.

In order to train our neural network, we are going to apply stochastic gradient descent. Because we want the behavior of your program to be deterministic for testing on Autolab, we make a few simplifications: (1) you should not shuffle your data and (2) you will use a fixed learning rate. In the real world, you would not make these simplifications.

SGD proceeds as follows, where E is the number of epochs and ฮณ is the learning rate.

Algorithm 1 Stochastic Gradient Descent (SGD)

1: procedure SGD(Training data D, test data Dt)
2:   Initialize parameters α, β    ▷ Use either RANDOM or ZERO from Section 2.2.1
3:   for e ∈ {1, 2, …, E} do    ▷ For each epoch
4:     for (x, y) ∈ D do    ▷ For each training example
5:       Compute neural network layers:
6:         o = object(x, a, b, z, ŷ, J) = NNFORWARD(x, y, α, β)
7:       Compute gradients via backprop:
8:         g_α = ∇_α J, g_β = ∇_β J = NNBACKWARD(x, y, α, β, o)
9:       Update parameters:
10:        α ← α − γ g_α
11:        β ← β − γ g_β
12:     Evaluate training mean cross-entropy J_D(α, β)
13:     Evaluate test mean cross-entropy J_Dt(α, β)
14:   return parameters α, β

The functions NNFORWARD and NNBACKWARD are described in Algorithms 3 and 4 respectively. At test time, we output the most likely prediction for each example:

Algorithm 2 Prediction at Test Time

The gradients we need above are themselves matrices of partial derivatives. Let M be the number of input features, D the number of hidden units, and K the number of outputs.

α = ⎡ α_{1,0} α_{1,1} … α_{1,M} ⎤
    ⎢ α_{2,0} α_{2,1} … α_{2,M} ⎥
    ⎢    ⋮       ⋮     ⋱    ⋮    ⎥
    ⎣ α_{D,0} α_{D,1} … α_{D,M} ⎦

and g_α is the matrix of partial derivatives ∂J/∂α_{j,i} with the same layout.   (A.1)

β = ⎡ β_{1,0} β_{1,1} … β_{1,D} ⎤
    ⎢ β_{2,0} β_{2,1} … β_{2,D} ⎥
    ⎢    ⋮       ⋮     ⋱    ⋮    ⎥
    ⎣ β_{K,0} β_{K,1} … β_{K,D} ⎦

and g_β is the matrix of partial derivatives ∂J/∂β_{k,j} with the same layout.   (A.2)

Observe that we have (in a rather tricky fashion) defined the matrices such that both α and g_α are D × (M+1) matrices. Likewise, β and g_β are K × (D+1) matrices. The +1 comes from the extra columns α_{·,0} and β_{·,0} which are the bias parameters for the first and second layer respectively. We will always assume x0 = 1 and z0 = 1. This should greatly simplify your implementation as you will see in Section A.3.

A.2 Recursive Derivation of Backpropagation

In class, we described a very general approach to differentiating arbitrary functions: backpropagation. One way to understand how we go about deriving the backpropagation algorithm is to consider the natural consequence of recursive application of the chain rule.

In practice, the partial derivatives that we need for learning are ∂J/∂α_{j,i} and ∂J/∂β_{k,j}.

A.2.1 Symbolic Differentiation

Note In this section, we motivate backpropagation via a strawman: that is, we will work through the wrong approach first (i.e. symbolic differentiation) in order to see why we want a more efficient method (i.e. backpropagation). Do not use this symbolic differentiation in your code.

1. Considering the computational graph for the neural network, we observe that α_{j,i} has exactly one child: a_j = Σ_{i=0}^{M} α_{j,i} x_i. That is, a_j is the first and only intermediate quantity that uses α_{j,i}. Applying the chain rule, we obtain

∂J/∂α_{j,i} = (∂J/∂a_j)(∂a_j/∂α_{j,i}) = (∂J/∂a_j) x_i

2. So far so good, now we just need to compute ∂J/∂a_j. Not a problem! We can just apply the chain rule again. a_j has exactly one child as well, namely z_j = σ(a_j). The chain rule gives us ∂J/∂a_j = (∂J/∂z_j)(∂z_j/∂a_j) = (∂J/∂z_j) z_j(1 − z_j). Substituting back into the equation above we find that

∂J/∂α_{j,i} = (∂J/∂z_j) z_j(1 − z_j) x_i

3. How do we get ∂J/∂z_j? You guessed it: apply the chain rule yet again. This time, however, there are multiple children of z_j in the computation graph; they are b_1, b_2, …, b_K. Applying the chain rule gives us ∂J/∂z_j = Σ_{k=1}^{K} (∂J/∂b_k) β_{k,j}. Substituting back into the equation above gives:

∂J/∂α_{j,i} = (Σ_{k=1}^{K} (∂J/∂b_k) β_{k,j}) z_j(1 − z_j) x_i

4. Next we need ∂J/∂b_k, which we again obtain via the chain rule: ∂J/∂b_k = Σ_{l=1}^{K} (∂J/∂ŷ_l)(∂ŷ_l/∂b_k) = Σ_{l=1}^{K} (∂J/∂ŷ_l) ŷ_l([l = k] − ŷ_k). Substituting back in above gives:

∂J/∂α_{j,i} = (Σ_{k=1}^{K} Σ_{l=1}^{K} (∂J/∂ŷ_l) ŷ_l([l = k] − ŷ_k) β_{k,j}) z_j(1 − z_j) x_i

5. Finally, we know that ∂J/∂ŷ_l = −y_l/ŷ_l, which we can again substitute back in to obtain our final result:

∂J/∂α_{j,i} = (Σ_{k=1}^{K} Σ_{l=1}^{K} (−y_l/ŷ_l) ŷ_l([l = k] − ŷ_k) β_{k,j}) z_j(1 − z_j) x_i = (Σ_{k=1}^{K} (ŷ_k − y_k) β_{k,j}) z_j(1 − z_j) x_i

Although we have successfully derived the partial derivative w.r.t. α_{j,i}, the result is far from satisfying. It is overly complicated and requires deeply nested for-loops to compute.

The above is an example of symbolic differentiation. That is, at the end we get an equation representing the partial derivative w.r.t. α_{j,i}. At this point, you should be saying to yourself: What a mess! Isn't there a better way? Indeed there is, and it's called backpropagation. The algorithm works just like the above symbolic differentiation, except that we never substitute the partial derivative from the previous step back in. Instead, we work "backwards" through the steps above, computing partial derivatives in a top-down fashion.

A.2.2 Backpropagation

The backpropagation algorithm for the neural network used in this assignment is shown below. We proceed first through steps 1 through 5 to compute the cross-entropy of a given training example (x, y) using parameters α and β. This is the forward computation. Next we work through steps 6 through 10 in order to compute the partial derivatives ∂J/∂α and ∂J/∂β of the loss with respect to the parameters.

Figure A.1: Backpropagation for 1-hidden layer neural network (forward steps 1 through 5, backward steps 6 through 10)

Notice in step 8 above that we compute partial derivatives with respect to both β_{k,j} and z_j. This is because there are two types of inputs needed to compute b_k in the forward pass. By contrast, in step 10, we only compute ∂J/∂α_{j,i}, because ∂J/∂x_i is not needed.

Below, we rewrite the same computation, but with two simplifications:

1. We substitute the quantities from the third column above into the second column above. This yields a single โBackwardโ column below.

2. Below, we denote each of the intermediate partial derivatives that we need by a named variable. Specifically, for any node v in the computation graph we let g_v ≡ dJ/dv.

Note that Figure A.1 and Figure A.2 are identical except for these changes. This yields the following version of the backpropagation algorithm.

Forward (steps 1 through 7 as in Figure A.1)    Backward

8. g_{β_{k,j}} = g_{b_k} z_j,    g_{z_j} = Σ_{k=1}^{K} g_{b_k} β_{k,j}

9. g_{a_j} = g_{z_j} z_j(1 − z_j)

10. g_{α_{j,i}} = g_{a_j} x_i

Figure A.2: Backpropagation for 1-hidden layer neural network (with intermediate variables)

A.3 Matrix / Vector Operations for Neural Networks

Some programming languages are fast and some are slow. Below is a simple benchmark to show this concretely. The task is to compute a dot-product a^T b between two vectors a ∈ ℝ^500 and b ∈ ℝ^500 one thousand times. Table A.1 shows the time taken for several combinations of programming language and data structure.

language   data structure    time (ms)
Python     list              200.99
Python     numpy array         1.01
Java       float[]             4.00
C++        vector<float>       0.81

Table A.1: Computation time required for dot-product in various languages.

Notice that Java and C++ with standard data structures are quite efficient. By contrast, Python differs dramatically depending on which data structure you use: with a standard list object (e.g. a = [float(i) for i in range(500)]) the computation time is an appallingly slow 200+ milliseconds. Simply by switching to a numpy array (e.g. a = np.arange(500, dtype=float)) we obtain a 200x speedup. This is because a numpy array carries out the dot-product computation in pure C, which is just as fast as our C++ benchmark, modulo some Python overhead.
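The benchmark described above is easy to re-create (a sketch; absolute timings will differ on your machine, but the relative gap should be dramatic):

```python
import timeit
import numpy as np

# Time 1000 dot-products of two length-500 vectors: first with Python
# lists and a generator expression, then with numpy arrays.
a_list = [float(i) for i in range(500)]
b_list = [float(i) for i in range(500)]
a_np = np.arange(500, dtype=float)
b_np = np.arange(500, dtype=float)

t_list = timeit.timeit(
    lambda: sum(x * y for x, y in zip(a_list, b_list)), number=1000)
t_numpy = timeit.timeit(lambda: a_np.dot(b_np), number=1000)
```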

Thus, for this assignment, Java and C++ programmers could easily implement the entire neural network using standard data structures and some for-loops. However, Python or Octave programmers would find that their code is simply too slow if they tried to do the same. As such, particularly for Python and Octave users, one must convert all the deeply nested for-loops into efficient โvectorizedโ math via numpy. Doing so will ensure efficient code. Java and C++ programmers can also benefit from linear algebra packages since it can cut down on the total number of lines of code you need to write.

A.4 Procedural Method of Implementation

Perhaps the simplest way to implement a 1-hidden-layer neural network is procedurally. Note that this approach has some drawbacks that weโll discuss below (Section A.4.2).

The procedural method is simply the one we derived in Section A.2: one function computes the outputs of the neural network and all intermediate quantities, o = NNFORWARD(x, y, α, β) = object(x, a, b, z, ŷ, J), where the object is just some struct (i.e. steps 1 through 5 below). Then another function computes the gradients of our parameters, g_α, g_β = NNBACKWARD(x, y, α, β, o), where o is a data structure that stores all the forward computation (i.e. steps 6 through 11 below). Here we describe the same computation shown in Figures A.1 and A.2, but this time we have shown how it could be carried out using matrix/vector operations.

Forward                     Backward

5. J = −y^T log ŷ           6. g_ŷ = −y ÷ ŷ

4. ŷ = softmax(b)           7. g_b^T = g_ŷ^T (diag(ŷ) − ŷ ŷ^T)

3. b = βz                   8. g_β = g_b z^T

                            9. g_z = β^T g_b

2. z = σ(a)                 10. g_a = g_z ⊙ z ⊙ (1 − z)

1. a = αx                   11. g_α = g_a x^T

Figure A.3: Backpropagation for 1-hidden layer neural network (with matrix/vector operations)

Above, ⊙ denotes element-wise multiplication and ÷ denotes element-wise division between two vectors. For any vector v ∈ ℝ^D, we have that diag(v) returns a D × D diagonal matrix whose diagonal entries are v_1, v_2, …, v_D and whose non-diagonal entries are zero. σ(a) denotes element-wise application of the sigmoid function and softmax(b) applies the softmax function.

One must be careful to ensure that the sigmoid and softmax functions are also vectorized. For example, the sigmoid function can be efficiently computed as

σ(a) = 1 ÷ (1 + exp(−a)) (A.3)

where 1 is a column vector of all ones, and exp is (efficiently) applied element-wise to the negated vector a. All of these operations should avoid for-loops when working in a high-level language like Python / Octave. We can compute the softmax function in a similar vectorized manner as,

softmax(b) = exp(b) รท sum(exp(b)) (A.4)

where the function sum(·) returns the sum of the entries in the vector.
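As an illustration (not required code), the two vectorized formulas above can be written in numpy roughly as follows; the max-subtraction in the softmax is a standard numerical-stability trick, not part of Equation A.4:

```python
import numpy as np

def sigmoid(a):
    # Equation A.3: numpy broadcasts the scalar 1 across the vector,
    # so no explicit column vector of ones is needed.
    return 1.0 / (1.0 + np.exp(-a))

def softmax(b):
    # Equation A.4, with max(b) subtracted before exponentiating so that
    # exp cannot overflow; the shift does not change the result.
    e = np.exp(b - np.max(b))
    return e / np.sum(e)
```

Neither function contains a for-loop; numpy applies exp and the arithmetic element-wise over the whole vector.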

A.4.1 Algorithm Definitions

We can write the above computation as two functions, NNFORWARD() and NNBACKWARD(). These two functions complete the learning algorithm presented in Algorithm 1.

Algorithm 3 Forward Computation

1: procedure NNFORWARD(Training example (x, y), Parameters α, β)

2: a = αx

3: z = σ(a)

4: b = βz

5: ŷ = softmax(b)

6: J = −yᵀ log ŷ

7: o = object(x, a, z, b, ŷ, J)

8: return intermediate quantities o

Algorithm 4 Backpropagation

1: procedure NNBACKWARD(Training example (x, y), Parameters α, β, Intermediates o)

2: Place intermediate quantities x, a, z, b, ŷ, J in o in scope

3: g_ŷ = −y ÷ ŷ

4: g_b = g_ŷᵀ (diag(ŷ) − ŷŷᵀ)

5: g_β = g_b zᵀ

6: g_z = βᵀ g_b

7: g_a = g_z ⊙ z ⊙ (1 − z)

8: g_α = g_a xᵀ

9: return parameter gradients g_α, g_β
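To make steps 1-11 concrete, here is one possible numpy sketch of the procedural pair; the function names and the dict-as-struct choice are mine, not prescribed by the assignment:

```python
import numpy as np

def nn_forward(x, y, alpha, beta):
    # Steps 1-5: forward pass, storing every intermediate quantity in o
    a = alpha @ x
    z = 1.0 / (1.0 + np.exp(-a))          # sigmoid
    e = np.exp(beta @ z - np.max(beta @ z))
    y_hat = e / np.sum(e)                 # softmax
    b = beta @ z
    J = -np.dot(y, np.log(y_hat))         # cross-entropy
    return {'x': x, 'a': a, 'z': z, 'b': b, 'y_hat': y_hat, 'J': J}

def nn_backward(x, y, alpha, beta, o):
    # Steps 6-11: backward pass using the stored intermediates
    y_hat, z = o['y_hat'], o['z']
    g_yhat = -y / y_hat
    g_b = (np.diag(y_hat) - np.outer(y_hat, y_hat)) @ g_yhat
    g_beta = np.outer(g_b, z)
    g_z = beta.T @ g_b
    g_a = g_z * z * (1 - z)
    g_alpha = np.outer(g_a, x)
    return g_alpha, g_beta
```

Note that the only loops here are hidden inside numpy's matrix-vector products (`@`) and outer products.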

A.4.2 Drawbacks to Procedural Method

As noted in Section A.6, it is possible to use a finite difference method to check that the backpropagation algorithm is correctly computing the gradient of its corresponding forward computation. We strongly encourage you to do this.

There is a big problem, however: what if your finite difference check informs you that the gradient is not being computed correctly? How will you know which part of your NNFORWARD() or NNBACKWARD() functions has a bug? There are two possible solutions here:

1. As usual, you can (and should) work through a tiny example dataset on paper. Compute each intermediate quantity and each gradient. Check that your code reproduces each number. The one that does not should indicate where to find the bug.

2. Replace your procedural implementation with a module-based one (as described in Section A.5) and then run a finite-difference check on each layer of the model individually. The finite-difference check that fails should indicate where to find the bug.

Of course, rather than waiting until you have a bug in your procedural implementation, you could jump straight to the module-based version; though it increases the complexity slightly (i.e. more lines of code), it might save you some time in the long run.

A.5 Module-based Method of Implementation

Module-based automatic differentiation (AD) is a technique that has long been used to develop libraries for deep learning. Dynamic neural network packages are those that allow specification of the computation graph dynamically at runtime, such as Torch, PyTorch, and DyNet; these all employ module-based AD in the sense that we will describe here.

The key idea behind module-based AD is to componentize the computation of the neural network into layers. Each layer can be thought of as consolidating numerous nodes of the computation graph (a subset of them) into one vector-valued node. Such a vector-valued node, which we call a module, should be capable of the following:

1. Forward computation of the output b = [b_1, …, b_B] given the input a = [a_1, …, a_A] via some differentiable function f. That is, b = f(a).

2. Backward computation of the gradient of the input g_a ≜ ∇_a J given the gradient of the output g_b ≜ ∇_b J, where J is the final real-valued output of the entire computation graph. This is done via the chain rule, g_{a_i} = Σ_{j=1}^{B} g_{b_j} (∂b_j / ∂a_i), for all i ∈ {1, …, A}.

A.5.1 Module Definitions

The modules we would define for our neural network correspond to a Linear layer, a Sigmoid layer, a Softmax layer, and a Cross-Entropy layer. Each module defines a forward function b = *FORWARD(a) and a backward function g_a = *BACKWARD(a, b, g_b). These methods accept additional parameters if appropriate. The dimensions A and B are specific to the module, such that we have input a ∈ ℝ^A, output b ∈ ℝ^B, gradient of the output g_b ≜ ∇_b J ∈ ℝ^B, and gradient of the input g_a ≜ ∇_a J ∈ ℝ^A.

Sigmoid Module The sigmoid layer has only one input vector a. Below, σ is the sigmoid applied element-wise, and ⊙ is element-wise multiplication, s.t. u ⊙ v = [u_1 v_1, …, u_A v_A].

1: procedure SIGMOIDFORWARD(a)

2: b = σ(a)

3: return b

4: procedure SIGMOIDBACKWARD(a, b, g_b)

5: g_a = g_b ⊙ b ⊙ (1 − b)

6: return g_a
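A quick numpy rendering of this module (illustrative only), using the fact that σ′(a) = σ(a)(1 − σ(a)) = b ⊙ (1 − b):

```python
import numpy as np

def sigmoid_forward(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_backward(a, b, g_b):
    # sigma'(a) = b * (1 - b) where b = sigma(a), applied element-wise
    return g_b * b * (1.0 - b)
```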

Softmax Module The softmax layer has only one input vector a. For any vector v ∈ ℝ^D, diag(v) returns a D × D diagonal matrix whose diagonal entries are v_1, v_2, …, v_D and whose off-diagonal entries are zero.

1: procedure SOFTMAXFORWARD(a)

2: b = softmax(a)

3: return b

4: procedure SOFTMAXBACKWARD(a, b, g_b)

5: g_a = g_bᵀ (diag(b) − bbᵀ)

6: return g_a
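The softmax Jacobian diag(b) − bbᵀ is symmetric, so the row-vector form above is equivalent to the column-vector numpy sketch below (illustrative, not required):

```python
import numpy as np

def softmax_forward(a):
    e = np.exp(a - np.max(a))  # max-shift for numerical stability
    return e / np.sum(e)

def softmax_backward(a, b, g_b):
    # Jacobian of softmax at b = softmax(a) is diag(b) - b b^T
    return (np.diag(b) - np.outer(b, b)) @ g_b
```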

Linear Module The linear layer has two inputs: a vector a and parameters ω ∈ ℝ^(B×A). The output b is not used by LINEARBACKWARD, but we pass it in for consistency of form.

1: procedure LINEARFORWARD(a, ω)

2: b = ωa

3: return b

4: procedure LINEARBACKWARD(a, ω, b, g_b)

5: g_ω = g_b aᵀ

6: g_a = ωᵀ g_b

7: return g_ω, g_a
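In numpy, with ω stored as a B × A matrix, this module is a sketch of two lines per function (names are mine):

```python
import numpy as np

def linear_forward(a, omega):
    return omega @ a

def linear_backward(a, omega, b, g_b):
    g_omega = np.outer(g_b, a)  # gradient w.r.t. parameters, shape B x A
    g_a = omega.T @ g_b         # gradient w.r.t. input, shape A
    return g_omega, g_a
```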

Cross-Entropy Module The cross-entropy layer has two inputs: a gold one-hot vector a and a predicted probability distribution â. Its output b ∈ ℝ is a scalar. Below, ÷ is element-wise division. The output b is not used by CROSSENTROPYBACKWARD, but we pass it in for consistency of form.

1: procedure CROSSENTROPYFORWARD(a, â)

2: b = −aᵀ log â

3: return b

4: procedure CROSSENTROPYBACKWARD(a, â, b, g_b)

5: g_â = −g_b (a ÷ â)

6: return g_â

It's also quite common to combine the Cross-Entropy and Softmax layers into one. The reason is the cancellation of numerous terms that results from the zeros in the cross-entropy backward calculation. (This trick is not required to obtain a sufficiently fast implementation for Autolab.)
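The cancellation can be seen directly: substituting g_ŷ = −y ÷ ŷ into the softmax backward gives (diag(ŷ) − ŷŷᵀ)(−y ÷ ŷ) = −y + ŷ(yᵀ1) = ŷ − y, since a one-hot y sums to 1. A small numpy check of this identity (illustrative only; function name is mine):

```python
import numpy as np

def softmax_xent_backward(y, y_hat):
    # Combined softmax + cross-entropy backward: g_b = y_hat - y
    return y_hat - y

# Check the identity against composing the two separate modules
y = np.array([0.0, 1.0, 0.0])        # one-hot gold label
b = np.array([0.5, -0.2, 1.3])       # arbitrary logits
y_hat = np.exp(b) / np.sum(np.exp(b))
g_yhat = -y / y_hat                  # cross-entropy backward
g_b = (np.diag(y_hat) - np.outer(y_hat, y_hat)) @ g_yhat
assert np.allclose(g_b, softmax_xent_backward(y, y_hat))
```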

A.5.2 Module-based AD for Neural Network

Using these modules, we can re-define our functions NNFORWARD (Algorithm 3) and NNBACKWARD (Algorithm 4) as follows.

Algorithm 5 Forward Computation

1: procedure NNFORWARD(Training example (x, y), Parameters α, β)

2: a = LINEARFORWARD(x, α)

3: z = SIGMOIDFORWARD(a)

4: b = LINEARFORWARD(z, β)

5: ŷ = SOFTMAXFORWARD(b)

6: J = CROSSENTROPYFORWARD(y, ŷ)

7: o = object(x, a, z, b, ŷ, J)

8: return intermediate quantities o

Algorithm 6 Backpropagation

1: procedure NNBACKWARD(Training example (x, y), Parameters α, β, Intermediates o)

2: Place intermediate quantities x, a, z, b, ŷ, J in o in scope

3: g_J = dJ/dJ = 1 ▷ Base case

4: g_ŷ = CROSSENTROPYBACKWARD(y, ŷ, J, g_J)

5: g_b = SOFTMAXBACKWARD(b, ŷ, g_ŷ)

6: g_β, g_z = LINEARBACKWARD(z, β, b, g_b)

7: g_a = SIGMOIDBACKWARD(a, z, g_z)

8: g_α, g_x = LINEARBACKWARD(x, α, a, g_a) ▷ We discard g_x

9: return parameter gradients g_α, g_β
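Putting the pieces together, Algorithms 5 and 6 might look like the following in numpy. This is a self-contained sketch (module and variable names are mine), with the intermediates stored in a plain dict:

```python
import numpy as np

def linear_forward(a, omega):
    return omega @ a

def linear_backward(a, omega, b, g_b):
    return np.outer(g_b, a), omega.T @ g_b

def sigmoid_forward(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_backward(a, b, g_b):
    return g_b * b * (1.0 - b)

def softmax_forward(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def softmax_backward(a, b, g_b):
    return (np.diag(b) - np.outer(b, b)) @ g_b

def xent_forward(a, a_hat):
    return -np.dot(a, np.log(a_hat))

def xent_backward(a, a_hat, b, g_b):
    return -g_b * (a / a_hat)

def nn_forward(x, y, alpha, beta):
    a = linear_forward(x, alpha)
    z = sigmoid_forward(a)
    b = linear_forward(z, beta)
    y_hat = softmax_forward(b)
    J = xent_forward(y, y_hat)
    return dict(x=x, a=a, z=z, b=b, y_hat=y_hat, J=J)

def nn_backward(x, y, alpha, beta, o):
    g_J = 1.0                                            # base case
    g_yhat = xent_backward(y, o['y_hat'], o['J'], g_J)
    g_b = softmax_backward(o['b'], o['y_hat'], g_yhat)
    g_beta, g_z = linear_backward(o['z'], beta, o['b'], g_b)
    g_a = sigmoid_backward(o['a'], o['z'], g_z)
    g_alpha, _ = linear_backward(x, alpha, o['a'], g_a)  # discard g_x
    return g_alpha, g_beta
```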

Here's the big takeaway: we can actually view these two functions as themselves defining another module! It is a 1-hidden-layer neural network module. That is, the cross-entropy of the neural network on a single training example is itself a differentiable function, and we know how to compute the gradients of its inputs given the gradients of its outputs.

A.6 Testing Backprop with Numerical Differentiation

Numerical differentiation provides a convenient method for testing gradients computed by backpropagation. The centered finite difference approximation is:

∂J/∂θ_i ≈ (J(θ + ε d_i) − J(θ − ε d_i)) / (2ε) (A.5)

where d_i is a 1-hot vector consisting of all zeros except for the ith entry, which has value 1. Unfortunately, in practice, this method suffers from issues of floating point precision. Therefore, it is typically only appropriate to use it on small examples with an appropriately chosen ε.

In order to apply this technique to test the gradients of your backpropagation implementation, you will need to ensure that your code is appropriately factored. Any of the modules including NNFORWARD (Algorithm 3 or Algorithm 5) and NNBACKWARD (Algorithm 4 or Algorithm 6) could be tested in this way.

For example, you could use three functions: forward(x,y,theta), which computes the cross-entropy for a training example; backprop(x,y,theta), which computes the gradient of that cross-entropy via backpropagation; and finite_diff(x,y,theta), defined below, which approximates the gradient by the centered finite difference method. The following pseudocode provides an overview of the entire procedure.

def finite_diff(x, y, theta):
    epsilon = 1e-5
    grad = zero_vector(theta.length)
    for m in [1, ..., theta.length]:
        d = zero_vector(theta.length)
        d[m] = 1
        v = forward(x, y, theta + epsilon * d)
        v -= forward(x, y, theta - epsilon * d)
        v /= 2 * epsilon
        grad[m] = v
    return grad

# Compute the gradient by backpropagation
grad_bp = backprop(x, y, theta)
# Approximate the gradient by the centered finite difference method
grad_fd = finite_diff(x, y, theta)

# Check that the gradients are (nearly) the same
diff = grad_bp - grad_fd   # element-wise difference of two vectors
print l2_norm(diff)        # this value should be small (e.g. < 1e-7)

A.6.1 Limitations

This does not catch all bugsโthe only thing it tells you is whether your backpropagation implementation is correctly computing the gradient for the forward computation. Suppose your forward computation is incorrect, e.g. you are always computing the cross-entropy of the wrong label. If your backpropagation is also using the same wrong label, then the check above will not expose the bug. Thus, you always want to separately test that your forward implementation is correct.

A.6.2 Finite Difference Checking of Modules

Note that the above tests the gradient for the entire end-to-end computation carried out by the neural network. However, if you implement a module-based automatic differentiation method (as in Section A.5), then you can test each individual component for correctness. The only difference is that you need to run the finite-difference check for each of the output values (i.e. a double for-loop).
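One way to implement that double for-loop in numpy is sketched below (the helper name is mine): for each output index j, run the module's backward with g_b set to the j-th standard basis vector, and compare each entry of the resulting g_a against a centered finite difference of output j with respect to input i.

```python
import numpy as np

def finite_diff_module(forward, backward, a, epsilon=1e-5):
    # Double for-loop over outputs j and inputs i: compares backward's
    # g_a (with g_b = e_j) against (f(a + eps*d_i)[j] - f(a - eps*d_i)[j])
    # / (2*eps), and returns the largest absolute discrepancy found.
    b = forward(a)
    max_err = 0.0
    for j in range(b.shape[0]):
        g_b = np.zeros_like(b)
        g_b[j] = 1.0
        g_a = backward(a, b, g_b)
        for i in range(a.shape[0]):
            d = np.zeros_like(a)
            d[i] = epsilon
            fd = (forward(a + d)[j] - forward(a - d)[j]) / (2 * epsilon)
            max_err = max(max_err, abs(g_a[i] - fd))
    return max_err
```

For a module with correct backward code, the returned discrepancy should be tiny (well below 1e-7 on small inputs).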
