## Description

Summary In this assignment, you will build a handwriting recognition system using a neural network. As a warmup, Section 1 will lead you through an on-paper example of how to implement a neural network. Then, in Section 2, you will implement an end-to-end system that learns to perform handwritten letter classification.

START HERE: Instructions

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ ˜mgormley/courses/10601/about.html

• Submitting your work: You will use Gradescope to submit answers to all questions, and Autolab to submit your code. Please follow instructions at the end of this PDF to correctly submit all your code to Autolab.

– Autolab: You will submit your code for programming questions on the homework to Autolab (https://autolab.andrew.cmu.edu/). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). The software installed on the VM is identical to that on linux.andrew.cmu.edu, so you should check that your code runs correctly there. If developing locally, check that the version number of the programming language environment (e.g. Python 2.7/3.5, Octave 3.8.2, OpenJDK 1.8.0, g++ 4.8.5) and versions of permitted libraries (e.g. numpy 1.7.1) match those on linux.andrew.cmu.edu. (Octave users: Please make sure you do not use any Matlabspecific libraries in your code that might make it fail against our tests.) You have a total of 10 Autolab submissions. Use them wisely. In order to not waste Autolab submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before any Autolab submission.

• Materials: Download from Autolab the tar file (“Download handout”). The tar file will contain all the data that you will need in order to complete this assignment.

For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LATEXusers, use and for shaded boxes and circles, and don’t change anything else.

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley

Marie Curie

Noam Chomsky

Select One: Who taught this course?

Matt Gormley

Marie Curie

@@Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein

Isaac Newton

I don’t know

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein Isaac Newton

@ I don’t know

Fill in the blank: What is the course number?

10-S7601

S

1 Written Questions [25 points]

1.1 Example Feed Forward and Backpropagation [15 points]

Figure 1.1: A One Hidden Layer Neural Network

Network Overview Consider the neural network with one hidden layer shown in Figure 1.1. The input layer consists of 6 features x = [x1,…,x6]T , the hidden layer has 4 nodes z = [z1,…,z4]T , and the output layer is a probability distribution y = [y1,y2,y3]T over 3 classes. We also add a bias to the input, x0 = 1 and the hidden layer z0 = 1, both of which are fixed to 1.

α is the matrix of weights from the inputs to the hidden layer and β is the matrix of weights from the hidden layer to the output layer. αj,i represents the weight going to the node zj in the hidden layer from the node xi in the input layer (e.g. α1,2 is the weight from x2 to z1), and β is defined similarly. We will use a sigmoid activation function for the hidden layer and a softmax for the output layer.

Network Details Equivalently, we define each of the following.

The input:

x = [x1,x2,x3,x4,x5,x6]T

Linear combination at the first (hidden) layer: (1.1)

(1.2)

Activation at the first (hidden) layer:

(1.3)

Linear combination at the second (output) layer:

(1.4)

Activation at the second (output) layer:

(1.5)

Note that the linear combination equations can be written equivalently as the product of the weight matrix with the input vector. We can even fold in the bias term α0 by thinking of x0 = 1, and fold in βj,0 by thinking of z0 = 1.

Loss We will use cross entropy loss, `(yˆ,y). If y represents our target output, which will be a one-hot vector representing the correct class, and yˆ represents the output of the network, the loss is calculated by:

(1.6)

For the below questions use natural log in the equation.

Prediction When doing prediction, we will predict the argmax of the output layer. For example, if yˆ1 = 0.3,yˆ2 = 0.2,yˆ3 = 0.5 we would predict class 3. If the true class from the training data was 2 we would have a one-hot vector y with values y1 = 0, y2 = 1, y3 = 0.

1. [6 points] In the following questions you will derive the matrix and vector forms of the previous equations which define out neural network. These are what you should hope to program in order to keep your program under the Autolab time-out.

When working these out it is important to keep a note of the vector and matrix dimensions in order for you to easily identify what is and isn’t a valid multiplication. Suppose you are given an training example: x(1) = [x1,x2,x3,x4,x5,x6]T with label class 2, so y(1) = [0,1,0]T . We initialize the network weights as:

α1,2 α2,2 α3,2 α4,2 α1,3 α2,3 α3,3 α4,3 α1,4 α2,4 α3,4 α4,4 α1,5 α2,5 α3,5 α4,5 α1,6 α2,6 α3,6 α4,6

We want to also consider the bias term and the weights on the bias terms (αj,0 and βk,0). To account for these we can add a new column to the beginning of our initial weight matrices.

α1,0 α1,1 α = α2,0 α2,1

α3,0 α3,1 α4,0 α4,1 α1,2 α2,2 α3,2 α4,2 α1,3 α2,3 α3,3 α4,0 α1,4 α2,4 α3,4 α4,4 α1,5 α1,6 α2,5 α2,6 α3,5 α3,6 α4,5 α4,6

β1,1 β2,1 β3,1 β1,2 β2,2 β3,2 β1,3 β2,3 β3,3

And we can set our first value of our input vectors to always be 1 ( ), so our input becomes:

x(1) = [1,x1,x2,x3,x4,x5,x6]T

(a) [1 points] By examining the shapes of the initial weight matrices, how many neurons do we have in the first hidden layer of the neural network? (Not including the bias neuron) (b) [1 points] How many output neurons will our neural network have?

(c) [1 points] What is the vector a whose elements are made up of the entries aj in equation (1.2). Write your answer in terms of α and x(1).

(d) [1 points] What is the vector z whose elements are made up of the entries zj in equation (1.3)? Write your answer in terms of a.

(e) [1 points] Select one: We cannot take the matrix multiplication of our weights β and our vector z since they are not compatible shapes. Which of the following would allow us to take the matrix multiplication of β and z such that the entries of the vector b = β∗z are equivalent to the values of bk in equation (1.4)?

Remove the last column of β

Remove the first row of z

Append a value of 1 to be the first entry of z

Append an additional column of 1’s to be the first column of β

Append a row of 1’s to be the first row of β

Take the transpose of β

(f) [1 points] What are the entries of the output vector yˆ? Your answer should be written in terms

2. [12 points] We will now derive the matrix and vector forms for the backpropagation algorithm.

The mathematics which you have to derive in this section jump significantly in difficultly, you should always be examining the shape of the matrices and vectors and making sure that you are comparing your matrix elements with calculations of individual derivatives to make sure they match (e.g. the element of the matrix should be equal to ). Recall that ` is our loss function defined in equation (1.6)

(a) [3 points] The derivative of the softmax function with respect to bk is as follows:

where I[k = l] is an indicator function such that if k = l then it it returns value 1 and 0 otherwise. Using this, write the derivative in a smart way such that you do not need this indicator function? Write your solutions in terms of yˆ,y.

(b) [2 points] What are the elements of the vector ? (Recall that y(1) = [0,1,0]T )

[yˆ ,yˆ − 1,yˆ ]T

1 2 3

(c) [1 points] What is the derivative ? Your answer should be in terms of dd`b and z.

(d) [1 points] Explain in one short sentance why must we go back to using the matrix β∗ (The matrix β without the first column of ones) when calculating the matrix ?

Because the bias term β0 has no relation with α

(e) [1 points] What is the derivative dd`z? Your answer should be in terms of dd`b and β∗

(g) [1 points] What is the matrix ? Your answer should be in terms of .

3. [9 points] Now you will put these equations to use in an example with numerical values. You should use the answers you get here to debug your code.

You are given a training example x(1) = [1,1,0,0,1,1]T with label class 2, so y(1) = [0,1,0]T . We

initialize the network weights as:

1 2 −3 0 1 −3

α∗ = 3 1 2 1 0 2

2 2 2 2 2 1

1 0 2 1 −2 2

We want to also consider the bias term and the weights on the bias terms (αj,0 and βj,0). Lets say they are all initialized to 1. To account for this we can add a column of 1’s to the beginning of our initial weight matrices.

1 1

α = 1 3

1 2

1 1 2

1

2

0 −3

2

2

2

1

1

3 2

−1

1 −2 1

1 2

−1 1

And we can set our first value of our input vectors to always be 1 ( ), so our input becomes:

x(1) = [1,1,1,0,0,1,1]T

Using the initial weights, run the feed forward of the network over this example (rounding to 4 decimal places during the calculation) and then answer the following questions.

(a) [1 points] What is a1? (b) [1 points] What is a2?

(c) [1 points] What is z1?

(d) [1 points] What is z3?

0.9997

(e) [1 points] What is b1?

2.7604

(f) [1 points] What is b2?

3.6430

(g) [1 points] What is yˆ2?

0.2615

(h) [1 points] Which class would we predict on this example? Your answer should just be an integer.

3

(i) [1 points] What is the total loss on this example?

1.3412

4. [7 points] Now use the results of the previous question to run backpropagation over the network and update the weights. Use learning rate η = 1.

Do your backpropagation calculations rounding to 4 decimal places then answer the following questions:

(a) [1 points] What is the value of ?

0.1082

(b) [1 points] What is the updated value of the weight β1,0?

0.8918

(c) [2 points] What is the updated value of the weight α3,4?

2

(d) [2 points] What is the updated weight of the input layer bias term applied to z2 (i.e. α2,0)?

0.9986

(e) [1 points] If we ran backpropagation on this example for a large number of iterations and then ran feed forward over the same example again, which class would we predict?

1.2 Empirical Questions [10 points]

The following questions should be completed after you work through the programming portion of this assignment (Section 2).

For these questions, use the large dataset.

Use the following values for the hyperparameters unless otherwise specified:

Paramater Value

Number of Hidden Units 50

Weight Initialization RANDOM

Learning Rate 0.01

Table 1.1: Default values of hyperparameters for experiments in Section 1.2.

For the following questions, submit your solutions to Gradescope. Please submit computer-generated plots for Q4 and Q6. Do not include any visualization-related code when submitting to Autolab! Note: we expect it to take about 5 minutes to train each of these networks.

5. [4 points] Train a single hidden layer neural network using the hyperparameters mentioned in Table 1.1, except for the number of hidden units which should vary among 5, 20, 50, 100, and 200. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy (sum of the cross-entropy terms over the training dataset divided by the total number of training examples) on the y-axis vs number of hidden units on the x-axis. In the same figure, plot the average testcross-entropy.

6. [1 points] Examine and comment on the the plots of training and testcross-entropy. What is the effect of changing the number of hidden units?

7. [4 points] Train a single hidden layer neural network using the hyperparameters mentioned in Table 1.1, except for the learning rate which should vary among 0.1, 0.01, and 0.001. Run the optimization for 100 epochs each time.

2 Programming [75 points]

Figure 2.1: 10 Random Images of Each of 10 Letters in OCR

2.1 The Task and Datasets

Materials Download the tar file from Autolab (“Download handout”). The tar file will contain all the data that you will need in order to complete this assignment.

Datasets We will be using a subset of an Optical Character Recognition (OCR) dataset. This data includes images of all 26 handwritten letters; our subset will include only the letters “a,” “e,” “g,” “i,” “l,” “n,” “o,” “r,” “t,” and “u.” The handout contains three datasets drawn from this data: a small dataset with 60 samples per class (50 for training and 10 for test), a medium dataset with 600 samples per class (500 for training and 100 for test), and a large dataset with 1000 samples per class (900 for training and 100 for test). Figure 2.1 shows a random sample of 10 images of few letters from the dataset.

File Format Each dataset (small, medium, and large) consists of two csv files—train and test. Each row contains 129 columns separated by commas. The first column contains the label and columns 2 to 129 represent the pixel values of a 16 × 8 image in a row major format. Label 0 corresponds to “a,” 1 to “e,” 2 to “g,” 3 to “i,” 4 to “l,” 5 to “n,” 6 to “o,” 7 to “r,” 8 to “t,” and 9 to “u.” Because the original images are black-and-white (not grayscale), the pixel values are either 0 or 1. However, you should write your code to accept arbitrary pixel values in the range [0,1]. The images in Figure 2.1 were produced by converting these pixel values into .png files for visualization. Observe that no feature engineering has been done here; instead the neural network you build will learn features appropriate for the task of character recognition.

2.2 Model Definition

In this assignment, you will implement a single-hidden-layer neural network with a sigmoid activation function for the hidden layer, and a softmax on the output layer. Let the input vectors x be of length M, the hidden layer z consist of D hidden units, and the output layer yˆ be a probability distribution over K classes. That is, each element yk of the output vector represents the probability of x belonging to the class k.

We can compactly express this model by assuming that x0 = 1 is a bias feature on the input and that z0 = 1 is also fixed. In this way, we have two parameter matrices α ∈ RD×(M+1) and β ∈ RK×(D+1). The extra 0th column of each matrix (i.e. α·,0 and β·,0) hold the bias parameters.

The objective function we will use for training the neural network is the average cross entropy over the training dataset D = {(x(i),y(i))}:

(2.1)

In Equation 2.1, J is a function of the model parameters α and β because is implicitly a function of x(i), α, and β since it is the output of the neural network applied to x(i). Of course, and yk(i) are the kth components of yˆ(i) and y(i) respectively.

To train, you should optimize this objective function using stochastic gradient descent (SGD), where the gradient of the parameters for each training example is computed via backpropagation.

2.2.1 Initialization

In order to use a deep network, we must first initialize the weights and biases in the network. This is typically done with a random initialization, or initializing the weights from some other training procedure. For this assignment, we will be using two possible initialization:

RANDOM The weights are initialized randomly from a uniform distribution from -0.1 to 0.1. The bias parameters are initialized to zero.

ZERO All weights are initialized to 0.

You must support both of these initialization schemes.

2.3 Implementation

Write a program neuralnet.{py|java|cpp|m} that implements an optical character recognizer using a one hidden layer neural network with sigmoid activations. Your program should learn the parameters of the model on the training data, report the cross-entropy at the end of each epoch on both train and validation data, and at the end of training write out its predictions and error rates on both datasets.

Your implementation must satisfy the following requirements:

• Use a sigmoid activation function on the hidden layer and softmax on the output layer to ensure it forms a proper probability distribution.

• Number of hidden units for the hidden layer should be determined by a command line flag.

• Support two different initialization strategies, as described in Section 2.2.1, selecting between them via a command line flag.

• Use stochastic gradient descent (SGD) to optimize the parameters for one hidden layer neural network. The number of epochs will be specified as a command line flag.

• Set the learning rate via a command line flag.

• Perform stochastic gradient descent updates on the training data in the order that the data is given in the input file. Although you would typically shuffle training examples when using stochastic gradient descent, in order to autograde the assignment, we ask that you DO NOT shuffle trials in this assignment.

Implementing a neural network can be tricky: the parameters are not just a simple vector, but a collection of many parameters; computational efficiency of the model itself becomes essential; the initialization strategy dramatically impacts overall learning quality; other aspects which we will not change (e.g. activation function, optimization method) also have a large effect. These tips should help you along the way:

• Try to “vectorize” your code as much as possible—this is particularly important for Python and Octave. For example, in Python, you want to avoid for-loops and instead rely on numpy calls to perform operations such as matrix multiplication, transpose, subtraction, etc. over an entire numpy array at once. Why? Because these operations are actually implemented in fast C code, which won’t get bogged down the way a high-level scripting language like Python will.

• For low level languages such as Java/C++, the use of primitive arrays and for-loops would not pose any computational efficiency problems—however, it is still helpful to make use of a linear algebra library to cut down on the number of lines of code you will write.

$ python neuralnet.py [args…]

$ javac -cp “./lib/ejml-v0.33-libs/*:./” neuralnet.java

$ java -cp “./lib/ejml-v0.33-libs/*:./” neuralnet [args…]

$ g++ -g -std=c++11 -I./lib neuralnet.cpp; ./a.out [args…]

$ octave -qH neuralnet.m [args…]

• Implement a finite difference test to check whether your implementation of backpropagation is correctly computing gradients. If you choose to do this, comment out this functionality once your backward pass starts giving correct results and before submitting to Autolab—since it will otherwise slow down your code.

2.3.1 Command Line Arguments

The autograder runs and evaluates the output from the files generated, using the following command:

For Python:

For Java:

For C++:

For Octave:

Where above [args…] is a placeholder for nine command-line arguments: <traininput> testinput> <trainout> <testout> <metricsout> <numepoch>

<hiddenunits> <initflag> <learningrate>. These arguments are described in detail below:

1. <train input>: path to the training input .csv file (see Section 2.1)

2. <testinput>: path to the test input .csv file (see Section 2.1)

3. <train out>: path to output .labels file to which the prediction on the training data should be written (see Section 2.3.2)

4. <testout>: path to output .labels file to which the prediction on the test data should be written (see Section 2.3.2)

5. <metricsout>: path of the output .txt file to which metrics such as train and testerror should be written (see Section 2.3.3)

6. <numepoch>: integer specifying the number of times backpropogation loops through all of the training data (e.g., if <numepoch> equals 5, then each training example will be used in backpropogation 5 times).

7. <hidden units>: positive integer specifying the number of hidden units.

8. <initflag>: integer taking value 1 or 2 that specifies whether to use RANDOM or ZERO initialization (see Section 2.2.1 and Section 2.3)—that is, if init_flag==1 initialize your weights randomly from a uniform distribution over the range [-0.1,0.1] (i.e. RANDOM), if init_flag==2 initialize all weights to zero (i.e. ZERO). For both settings, always initialize bias terms to zero.

9. <learningrate>: float value specifying the learning rate for SGD.

As an example, if you implemented your program in Python, the following command line would run your program with 4 hidden units on the small data provided in the handout for 2 epochs using zero initialization and a learning rate of 0.1.

$ python neuralnet.py smalltrain.csv smalltest.csv model1train_out.labels model1test_out.labels model1metrics_out.txt 2 4 2 0.1

Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. Autolab will use EJML version 3.3. The command line arguments above demonstrate how we will call you code. The classpath inclusion

-cp “./lib/ejml-v0.33-libs/*:./” will ensure that all the EJML jars are on the classpath as well as your code.

C++ Eigen is a header-only library, so there is no linking to worry about—just #include whatever components you need. Autolab will use Eigen version 3.3.4. The command line arguments above demonstrate how we will call you code. The argument -I./lib will include the lib/Eigen subdirectory, which contains all the headers.

We have included the correct versions of EJML/Eigen in the handout.tar for your convenience. Do not include EJML or Eigen in your Autolab submission tar; the autograder will ensure that they are in place.

ahttps://ejml.org bhttp://eigen.tuxfamily.org/

2.3.2 Output: Labels Files

Your program should write two output .labels files containing the predictions of your model on training data (<trainout>) and testdata (<testout>). Each should contain the predicted labels for each example printed on a new line. Use to create a new line.

Your labels should exactly match those of a reference implementation – this will be checked by the autograder by running your program and evaluating your output file against the reference solution.

Note: You should output your predicted labels using the same integer identifiers as the original training data. You should also insert an empty line (again using ) at the end of each sequence (as is done in the input data files). The first few lines of the predicted labels for the testdataset is given below

6

4

8

8

2.3.3 Output Metrics

Generate a file where you report the following metrics:

cross entropy After each Stochastic Gradient Descent (SGD) epoch, report mean cross entropy on the training data crossentropy(train) and testdata crossentropy(test) (See Equation 2.1). These two cross-entropy values should be reported at the end of each epoch and prefixed by the epoch number. For example, after the second pass through the training examples, these should be prefixed by epoch=2. The total number of train losses you print out should equal numepoch—likewise for the total number of testlosses. error After the final epoch (i.e. when training has completed fully), report the final training error error(train) and testerror error(test).

A sample output is given below. It contains the train and testlosses for the first 2 epochs and the final error rate when using the command given above.

epoch=1 crossentropy(train): 2.18506276114 epoch=1 crossentropy(test): 2.18827302588 epoch=2 crossentropy(train): 1.90103257727 epoch=2 crossentropy(test): 1.91363803461 error(train): 0.77 error(test): 0.78

Take care that your output has the exact same format as shown above. There is an equal sign = between the word epoch and the epoch number, but no spaces. There should be a single space after the epoch number (e.g. a space after epoch=1), and a single space after the colon preceding the metric value (e.g. a space after epoch=1 likelihood(train):). Each line should be terminated by a Unix line ending .

2.4 Autolab Submission

You must submit a .tar file named neuralnet.tar containing neuralnet.{py|m|java|cpp}. You can create that file by running:

tar -cvf neuralnet.tar neuralnet.{py|m|java|cpp}

from the directory containing your code.

Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf. DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all your files should be named in lowercase. You must submit this file to the corresponding homework link on Autolab. The autograder for Autolab prints out some additional information about the tests that it ran. You can view this output by selecting ”Handin History” from the menu and then clicking one of the scores you received for a submission. For example on this assignment, among other things, the autograder will print out which language it detects (e.g. Python, Octave, C++, Java).

A Implementation Details for Neural Networks

This section provides a variety of suggestions for how to efficiently and succinctly implement a neural network and backpropagation.

A.1 SGD for Neural Networks

Consider the neural network described in Section 2.3 applied to the ith training example (x,y) where y is a one-hot encoding of the true label. Our neural network outputs yˆ = hα,β(x), where α and β are the parameters of the first and second layers respectively and hα,β(·) is a one-hidden layer neural network with a sigmoid activation and softmax output. The loss function is negative cross-entropy J = `(yˆ,y) = −yT log(yˆ). J = Jx,y(α,β) is actually a function of our training example (x,y), and our model parameters α,β though we write just J for brevity.

In order to train our neural network, we are going to apply stochastic gradient descent. Because we want the behavior of your program to be deterministic for testing on Autolab, we make a few simplifications: (1) you should not shuffle your data and (2) you will use a fixed learning rate. In the real world, you would not make these simplifications.

SGD proceeds as follows, where E is the number of epochs and γ is the learning rate.

Algorithm 1 Stochastic Gradient Descent (SGD) without Shuffle

1: procedure SGD(Training data D, testdata Dt)

2: Initialize parameters α,β . Use either RANDOM or ZERO from Section 2.2.1

3: for e ∈ {1,2,…,E} do . For each epoch

4: for (x,y) ∈ D do . For each training example (No shuffling)

5: Compute neural network layers:

6: o = object(x,a,b,z,yˆ,J) = NNFORWARD(x,y,α,β)

7: Compute gradients via backprop:

8: gα = ∇αJ)

= NNBACKWARD(x,y,α,β,o)

gβ = ∇βJ

9: Update parameters:

10: α ← α − γgα

11: β ← β − γgβ

12: Evaluate training mean cross-entropy JD(α,β)

13: Evaluate testmean cross-entropy JDt(α,β)

14: return parameters α,β

At test time, we output the most likely prediction for each example:

Algorithm 2 Prediction at Test Time

1: procedure PREDICT(Unlabeled train or testdataset D0, Parameters α,β)

2: for x ∈ D0 do

3: Compute neural network prediction yˆ = h(x)

4: Predict the label with highest probability l = argmaxk yˆk

The gradients we need above are themselves matrices of partial derivatives. Let M be the number of input features, D the number of hidden units, and K the number of outputs.

α10 α11 … α1M

α20 α21 … α2M

α = … … … … g(A.1)

αD0 αD1 … αDM

β10 β11 … β1D

β20 β21 … β2D

β = … … … … g(A.2)

βK0 βK1 … βKD

Observe that we have (in a rather tricky fashion) defined the matrices such that both α and gα are D×(M + 1) matrices. Likewise, β and gβ are K × (D + 1) matrices. The +1 comes from the extra columns α·,0 and β·,0 which are the bias parameters for the first and second layer respectively. We will always assume x0 = 1 and z0 = 1. This should greatly simplify your implementation as you will see in Section A.3.

A.2 Recursive Derivation of Backpropagation

In class, we described a very general approach to differentiating arbitrary functions: backpropagation. One way to understand how we go about deriving the backpropagation algorithm is to consider the natural consequence of recursive application of the chain rule.

In practice, the partial derivatives that we need for learning are and .

A.2.1 Symbolic Differentiation

Note In this section, we motivate backpropagation via a strawman: that is, we will work through the wrong approach first (i.e. symbolic differentiation) in order to see why we want a more efficient method (i.e. backpropagation). Do not use this symbolic differentiation in your code.

1. Considering the computational graph for the neural network, we observe that αij has exactly one child

. That is aj is the first and only intermediate quantity that uses αij. Applying the

chain rule, we obtain

2. So far so good, now we just need to compute . Not a problem! We can just apply the chain rule again. aj just has exactly one child as well, namely zj = σ(aj). The chain rule gives us that

. Substituting back into the equation above we find that

3. How do we get ? You guessed it: apply the chain rule yet again. This time, however, there are multiple children of zj in the computation graph; they are b1,b2,…bK. Applying the chain rule gives us that . Substituting back into the equation above gives:

4. Next we need , which we again obtain via the chain rule: l] − yˆk). Substituting back in above gives:

5. Finally, we know that which we can again substitute back in to obtain our final result:

Although we have successfully derived the partial derivative w.r.t. αij, the result is far from satisfying. It is overly complicated and requires deeply nested for-loops to compute.

The above is an example of symbolic differentiation. That is, at the end we get an equation representing the partial derivative w.r.t. αij. At this point, you should be saying to yourself: What a mess! Isn’t there a better way? Indeed there is and its called backpropagation. The algorithm works just like the above symbolic differentiation except that we never subsitute the partial derivative from the previous step back in. Instead, we work “backwards” through the steps above computing partial derivatives in a top-down fashion.

A.3 Matrix / Vector Operations for Neural Networks

Some programming languages are fast and some are slow. Below is a simple benchmark to show this concretely. The task is to compute a dot-product aT b between two vectors a ∈ R500 and b ∈ R500 one thousand times. Table ?? shows the time taken for several combinations of programming language and data structure.

language data structure time (ms)

Python list 200.99

Python numpy array 1.01

Java float[] 4.00

C++ vector<float> 0.81

Notice that Java and C++ with standard data structures are quite efficient. By contrast, Python differs dramatically depending on which data structure you use: with a standard list object

(e.g. a = [float(i) for x in range(500)]) the computation time is an appallingly slow 200+ milliseconds. Simply by switching to a numpy array (e.g. a = np.arange(500, dtype=float)) we obtain a 200x speedup. This is because a numpy array is actually carrying out the dot-product computation in pure C, which is just as fast as our C++ benchmark, modulo some Python overhead.

Thus, for this assignment, Java and C++ programmers could easily implement the entire neural network using standard data structures and some for-loops. However, Python or Octave programmers would find that their code is simply too slow if they tried to do the same. As such, particularly for Python and Octave users, one must convert all the deeply nested for-loops into efficient “vectorized” math via numpy. Doing so will ensure efficient code. Java and C++ programmers can also benefit from linear algebra packages since it can cut down on the total number of lines of code you need to write.

A.4 Procedural Method of Implementation

Perhaps the simplest way to implement a 1-hidden-layer neural network is procedurally. Note that this approach has some drawbacks that we’ll discuss below (Section A.4.1).

The procedural method: one function computes the outputs of the neural network and all intermediate quantities o = NNFORWARD(x,y,α,β) = object(x,a,b,z,yˆ,J), where the object is just some struct. Then another function computes the gradients of our parameters gα,gβ = NNBACKWARD(x,y,α,β,o), where o is a data structure that stores all the forward computation.

One must be careful to ensure that functions are vectorized. For example, your Sigmoid function should be able to take a vector input and return a vector output with the Sigmoid function applied to all of its elements. All of these operations should avoid for-loops when working in a high-level language like Python / Octave. We can compute the softmax function in a similar vectorized manner.

A.4.1 Drawbacks to Procedural Method

As noted in Section A.6, it is possible to use a finite difference method to check that the backpropagation algorithm is correctly computing the gradient of its corresponding forward computation. We strongly encourage you to do this.

There is a big problem however: what if your finite difference check informs you that the gradient is not being computed correctly. How will you know which part of your NNFORWARD() or NNBACKWARD() functions has a bug? There are two possible solutions here:

1. As usual, you can (and should) work through a tiny example dataset on paper. Compute each intermediate quantity and each gradient. Check that your code reproduces each number. The one that does not should indicate where to find the bug.

2. Replace your procedural implementation with a module-based one (as described in Section A.5) and then run a finite-difference check on each layer of the model individually. The finite-difference check that fails should indicate where to find the bug.

Of course, rather than waiting until you have a bug in your procedural implementation, you could jump straight to the module-based version—though it increases the complexity slightly (i.e. more lines of code) it might save you some time in the long run.

A.5 Module-based Method of Implementation

Module-based automatic differentiation (AD) is a technique that has long been used to develop libraries for deep learning. Dynamic neural network packages are those that allow a specification of the computation graph dynamically at runtime, such as Torch , PyTorch , and DyNet —these all employ module-based AD in the sense that we will describe here.

The key idea behind module-based AD is to componentize the computation of the neural-network into layers. Each layer can be thought of as consolidating numerous nodes in the computation graph (a subset of them) into one vector-valued node. Such a vector-valued node should be capable of the following and we call each one a module:

1. Forward computation of output b = [b1,…,bB] given input a = [a1,…,aA] via some differentiable function f. That is b = f(a).

2. Backward computation of the gradient of the input ga given the gradient of output gb , where J is the final real-valued output of the entire computation graph. This is done via the chain rule for all i ∈ {1,…,A}.

A.5.1 Module Definitions

The modules we would define for our neural network would correspond to a Linear layer, a Sigmoid layer, a Softmax layer, and a Cross-Entropy layer. Each module defines a forward function b = *FORWARD(a), and a backward function ga = *BACKWARD(a,b,gb) method. These methods accept parameters if appropriate. The dimensions A and B are specific to the module such that we have input a ∈ RA, output b ∈ RB, gradient of output ga , ∇aJ ∈ RA, and gradient of input gb , ∇bJ ∈ RB. We have provided you the pseudocode for the Linear Module as an example.

Linear Module The linear layer has two inputs: a vector a and parameters ω ∈ RB×A. The output b is not used by LINEARBACKWARD, but we pass it in for consistency of form.

1: procedure LINEARFORWARD(a, α)

2: b = αa

3: return b

4: procedure LINEARBACKWARD(a, α, b, gb)

5: gα = gbaT

6: ga = αT gb

7: return gα,ga

It’s also quite common to combine the Cross-Entropy and Softmax layers into one. The reason for this is the cancelation of numerous terms that result from the zeros in the cross-entropy backward calculation. (Said trick is not required to obtain a sufficiently fast implementation for Autolab.)

A.5.2 Module-based AD for Neural Network

Using these modules, we can re-define our functions NNFORWARD and NNBACKWARD as follows.

Algorithm 3 Forward Computation

1: procedure NNFORWARD(Training example (x, y), Parameters α, β)

2: a = LINEARFORWARD(x,α)

3: z = SIGMOIDFORWARD(a)

4: b = LINEARFORWARD(z,β)

5: yˆ = SOFTMAXFORWARD(b)

6: J = CROSSENTROPYFORWARD(y,yˆ)

7: o = object(x,a,z,b,yˆ,J)

8: return intermediate quantities o

Algorithm 4 Backpropagation

1: procedure NNBACKWARD(Training example (x, y), Parameters α, β, Intermediates o) 2: Place intermediate quantities x,a,z,b,yˆ,J in o in scope

3: . Base case

4: gyˆ = CROSSENTROPYBACKWARD(y,yˆ,J,gJ)

5: gb = SOFTMAXBACKWARD(b,yˆ,gyˆ)

6: gβ,gz = LINEARBACKWARD(z,b,gb)

7: ga = SIGMOIDBACKWARD(a,z,gz)

8: gα,gx = LINEARBACKWARD(x,a,ga) .

9: return parameter gradients gα,gβ We discard gx

Here’s the big takeaway: we can actually view these two functions as themselves defining another module! It is a 1-hidden layer neural network module. That is, the cross-entropy of the neural network for a single training example is itself a differentiable function and we know how to compute the gradients of its inputs, given the gradients of its outputs.

A.6 Testing Backprop with Numerical Differentiation

Numerical differentiation provides a convenient method for testing gradients computed by backpropagation. The centered finite difference approximation is:

(A.3)

where di is a 1-hot vector consisting of all zeros except for the ith entry of di, which has value 1. Unfortunately, in practice, it suffers from issues of floating point precision. Therefore, it is typically only appropriate to use this on small examples with an appropriately chosen .

In order to apply this technique to test the gradients of your backpropagation implementation, you will need to ensure that your code is appropriately factored. Any of the modules including NNFORWARD and NNBACKWARD could be tested in this way.

For example, you could use two functions: forward(x,y,theta) computes the cross-entropy for a training example. backprop(x,y,theta) computes the gradient of the cross-entropy for a training example via backpropagation. Finally, finite_diff as defined below approximates the gradient by the centered finited difference method. The following pseudocode provides an overview of the entire procedure.

def finite_diff(x, y, theta):

epsilon = 1e-5 grad = zero_vector(theta.length) for m in [1, …, theta.length]:

d = zero_vector(theta.length) d[m] = 1 v = forward(x, y, theta + epsilon * d) v -= forward(x, y, theta – epsilon * d) v /= 2*epsilon grad[m] = v

# Compute the gradient by backpropagation grad_bp = backprop(x, y, theta)

# Approximate the gradient by the centered finite difference method grad_fd = finite_diff(x, y, theta)

# Check that the gradients are (nearly) the same diff = grad_bp – grad_fd # element-wise difference of two vectors print l2_norm(diff) # this value should be small (e.g. < 1e-7)

A.6.1 Limitations

This does not catch all bugs—the only thing it tells you is whether your backpropagation implementation is correctly computing the gradient for the forward computation. Suppose your forward computation is incorrect, e.g. you are always computing the cross-entropy of the wrong label. If your backpropagation is also using the same wrong label, then the check above will not expose the bug. Thus, you always want to separately test that your forward implementation is correct.

A.6.2 Finite Difference Checking of Modules

Note that the above would test the gradient for the entire end-to-end computation carried output by the neural network. However, if you implement a module-based automatic differentiation method (as in Section A.5), then you can test each individual component for correctness. The only difference is that you need to run the finite-difference check for each of the output values (i.e. a double for-loop).

## Reviews

There are no reviews yet.