## Description

Summary In this assignment, you will build an image recognition system using a neural network. In the Written component, you will walk through an on-paper example of how to implement a neural network. Then, in the Programming component, you will implement an end-to-end system that learns to perform image classification.

START HERE: Instructions

˜mgormley/courses/10601/syllabus.html

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ ˜mgormley/courses/10601/syllabus.html

• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.

– Programming: You will submit your code for programming questions on the homework to

Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.9.6, OpenJDK 11.0.11, g++ 7.5.0) and versions of permitted libraries (e.g. numpy 1.21.2 and scipy 1.7.1) match those used on Gradescope. You have 10 Gradescope programming submissions. We recommend debugging your implementation on your local machine (or the Linux servers) and making sure your code is running correctly first before submitting you code to Gradescope.

• Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on Piazza.

1

EJML for Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. The autograder will use EJML version 0.41. When compiling and running your code, we will add the additional command line argument -cp “linalg_lib/ejml-v0.41-libs/*:linalg_lib/nd4j-v1.0.0-M1.1-libs/*:./” to ensure that all the EJML jars are on the classpath as well as your code.

ND4J for Java ND4J is a library for multidimensional tensors with an interface akin to Python’s NumPy. The autograder will use ND4J version 1.0.0-M1.1. When compiling and running your code, we will add the additional command line argument

-cp “linalg_lib/ejml-v0.41-libs/*:linalg_lib/nd4j-v1.0.0-M1.1-libs/*:./” to ensure that all the ND4J jars are on the classpath as well as your code.

Eigen for C++ Eigen is a header-only library, so there is no linking to worry about—just #include whatever components you need. The autograder will use Eigen version 3.4.0. The command line arguments above demonstrate how we will call you code. When compiling your code we will include, the argument -I./linalg_lib in order to include the linalg_lib/Eigen subdirectory, which contains all the headers.

We have included the correct versions of EJML/ND4J/Eigen in the linalg_lib.zip posted on the Coursework page of the course website for your convenience. It contains the same linalg_lib/ directory that we will include in the current working directory when running your tests. Do not include EJML, ND4J, or Eigen in your homework submission; the autograder will ensure that they are in place.

ahttps://ejml.org bhttps://javadoc.io/doc/org.nd4j/nd4j-api/latest/index.html chttp://eigen.tuxfamily.org/

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley / Henry Chai

# Marie Curie

# Noam Chomsky

Select One: Who taught this course?

Matt Gormley / Henry Chai

# Marie Curie

@ Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

2 Stephen Hawking Albert Einstein

Isaac Newton 2 None of the above

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein Isaac Newton

@ I don’t know

Fill in the blank: What is the course number?

10-S7601S

Written Questions (53 points)

1 Example Feed Forward and Backpropagation

Figure 1: A One Hidden Layer Neural Network

Network Overview Consider the neural network with one hidden layer shown in Figure 1. The input layer consists of 6 features x = [x1,…,x6]T , the hidden layer has 4 nodes z = [z1,…,z4]T , and the output layer is a probability distribution y = [y1,y2,y3]T over 3 classes (1-indexed such that yi is the probability of label i). We also allow for a bias term by adding a dimension with constant value one to the input (x0 = 1) and to the hidden layer (z0 = 1).

α is the matrix of weights from the inputs to the hidden layer and β is the matrix of weights from the hidden layer to the output layer.

αj,i represents the weight going to the node zj in the hidden layer from the node xi in the input layer (e.g. α1,2 is the weight from x2 to z1), and β is defined similarly. We will use a sigmoid activation function for the hidden layer and a softmax for the output layer.

Network Details Equivalently, we define each of the following.

The input:

x = [x1,x2,x3,x4,x5,x6]T

Linear combination at the first (hidden) layer: (1)

(2)

Activation at the first (hidden) layer:

(3)

Equivalently, we can write this as vector operation where the sigmoid activation is applied individually to each element of the vector a:

z = σ(a)

Linear combination at the second (output) layer: (4)

4

bk = βk,0 + Xβk,j · zj, ∀k ∈ {1,…,3}

j=1

Activation at the second (output) layer: (5)

(6)

Note that the linear combination equations can be written equivalently as the product of the weight matrix with the input vector. We can even fold in the bias term α0 by thinking of x0 = 1, and fold in βj,0 by thinking of z0 = 1.

Loss We will use cross entropy loss, `(yˆ,y). If y represents our target output, which will be a one-hot vector representing the correct class, and yˆ represents the output of the network, the loss is calculated by:

3

`(yˆ,y) = −Xyi log(ˆyi) (7)

i=1 For the below questions use natural log in the equation.

Prediction When doing prediction, we will predict the argmax of the output layer. For example, if yˆ1 = 0.3,yˆ2 = 0.2,yˆ3 = 0.5 we would predict class 3. If the true class from the training data was 2 we would have a one-hot vector y with values y1 = 0, y2 = 1, y3 = 0.

1. In the following questions you will derive the matrix and vector forms of the previous equations which define our neural network. These are what you should hope to program in order to keep your program under the Gradescope time-out.

When working these out it is important to keep a note of the vector and matrix dimensions in order for you to easily identify what is and isn’t a valid multiplication. Suppose you are given an training example: x(1) = [x1,x2,x3,x4,x5,x6]T with label class 2, so y(1) = [0,1,0]T . We initialize the

network weights as:

α1,1 α∗ = α2,1

α3,1 α4,1 α1,2 α2,2 α3,2 α4,2 α1,3 α2,3 α3,3 α4,3 α1,4 α2,4 α3,4 α4,4 α1,5 α2,5 α3,5 α4,5 α1,6 α2,6 α3,6 α4,6

We want to also consider the bias term and the weights on the bias terms (αj,0 and βk,0). To account for these we can add a new column to the beginning of our initial weight matrices to represent biases.

α1,0 α1,1 α = α2,0 α2,1

α3,0 α3,1 α4,0 α4,1 α1,2 α2,2 α3,2 α4,2 α1,3 α2,3 α3,3 α4,0 α1,4 α2,4 α3,4 α4,4 α1,5 α1,6 α2,5 α2,6 α3,5 α3,6 α4,5 α4,6

β1,1 β2,1 β3,1 β1,2 β2,2 β3,2 β1,3 β2,3 β3,3

We then add a corresponding new first dimension to our input vectors, always set to 1 ( ), so our input becomes: x(1) = [1,×1,x2,x3,x4,x5,x6]T

(a) (1 point) By examining the shapes of the initial weight matrices, how many neurons do we have in the first hidden layer of the neural network? (Not including the bias neuron)

(b) (1 point) How many output neurons will our neural network have?

(c) (1 point) What is the vector a whose elements are made up of the entries aj in Equation 2 (using in place of xi). Write your answer in terms of α and x(1).

(d) (1 point) Select one: We cannot take the matrix multiplication of our weights β and our vector z since they are not compatible shapes. Which of the following would allow us to take the matrix multiplication of β and z such that the entries of the vector b = βz are equivalent to the values of bk in Equation 5?

A) Remove the last column of β

B) Remove the first row of z

C) Append a value of 1 to be the first entry of z

D) Append an additional column of 1’s to be the first column of β

E) Append a row of 1’s to be the first row of β

F) Take the transpose of β

(e) (1 point) What are the entries of the output vector yˆ? Your answer should be written in terms of

2. We will now derive the matrix and vector forms for the backpropagation algorithm.

The mathematics which you have to derive in this section jump significantly in difficulty, you should always be examining the shape of the matrices and vectors and making sure that you are comparing your matrix elements with calculations of individual derivatives to make sure they match (e.g., the element of the matrix should be equal to ). Recall that ` is our loss function defined in Equation 7:

(a) (3 points) The derivative of the softmax function with respect to bk is as follows:

where I[k = l] is an indicator function such that if k = l then it returns value 1 and 0 otherwise. Using this, write the derivative in a smart way such that you do not need this indicator function. Write your solutions in terms of yˆk,yk. Show your work below.

HINT: Recall that .

(b) (1 point) What are the elements of the vector ∂∂`b? Use the convention that ∂∂`b is a row vector.

(c) (2 points) What is the derivative ? Your answer should be in terms of ∂∂`b and z. Use the convention that ∂∂`b is a row vector and z is a column vector.

You should first consider a single entry in this matrix: .

(d) (1 point) Explain in one short sentence why we use the matrix β∗ (the matrix β without the first column of ones) when calculating the derivative matrix ?

(e) (1 point) What is the derivative ∂∂`z? Your answer should be in terms of ∂∂`b and β∗. Use the convention that ∂∂`b is a row vector.

(f) (1 point) What is the derivative in terms of and zj?

(g) (1 point) What is the matrix ? Your answer should be in terms of ∂∂`a and x(1). Use the convention that ∂∂`a is a row vector and x(1) is a column vector.

3. Now you will put these equations to use in an example with numerical values. You should use the answers you get here to debug your code.

You are given a training example x(1) = [1,1,0,0,1,1]T with label class 2, so y(1) = [0,1,0]T . We initialize the network weights as:

1 2 −3 0 1 −3

α∗ = 3 1 2 1 0 2

2 2 2 2 2 1

1 0 2 1 −2 2

1 2 −2 1 β∗ = 1 −1 1 2

3 1 −1 1

We want to also consider the bias term and the weights on the bias terms (αj,0 and βj,0). Lets say they are all initialized to 1. To account for this we can add a column of 1’s to the beginning of our initial weight matrices.

1 1

α = 1 3

1 2

1 1 2

1

2

0 −3

2

2

2 0 1 −3

1 0 2

2 2 1

1 −2 2

1 β = 1

1 1

1

3 2

−1

1

And we can set our first value of our input vectors to always be 1 ( ), so our input becomes:

x(1) = [1,1,1,0,0,1,1]T

Using the initial weights, run the feed forward of the network over this example (rounding to 4 decimal places during the calculation) and then answer the following questions.

(c) (1 point) What is b2? We have computed z2 = 0.9991, z3 = 0.9997, z4 = 0.8808 for you.

(d) (1 point) What is yˆ2? We have computed b1 = 2.7604,b3 = 4.5226 for you.

(e) (1 point) Which class would we predict on this example? Your answer should just be an integer ∈ {1,2,3}.

(f) (1 point) What is the total loss on this example?

4. Now use the results of the previous question to run backpropagation over the network and update the weights. Use learning rate η = 1.

Do your backpropagation calculations rounding to 4 decimal places then answer the following questions:

(b) (1 point) What is the updated value of the weight β1,0?

(d) (1 point) What is the updated value of the weight α3,4?

(e) (2 points) What is the updated weight of the input layer bias term applied to z2 (i.e. α2,0)?

2 Convolutional Neural Networks

In this problem, consider only the convolutional layer of a standard implementation of a CNN as described in Lecture 13.

1. We are given image X and filter F below.

(a) (1 point) Let X be convolved with F using no padding and a stride of 1 to produce an output Y . What is value of j in the output Y ?

(b) (1 point) Suppose you had an input feature map of size (height × width) 6×4 and filter size 2×2, using no padding and a stride of 2, what would be the resulting output size? Write your answer in the format height × width.

2. Parameter sharing is a very important concept for CNN because it drastically reduces the complexity of the learning problem. For the following questions, assume that there is no bias term in our convolutional layer.

(a) (1 point) Which of the following are parameters of a convolutional layer?

Select all that apply:

stride size

padding size

image size

filter size

weights in the filter None of above.

(b) (1 point) Which of the following are hyperparameters of a convolutional layer?

Select all that apply:

stride size

padding size

image size

filter size

weights in the filter None of above.

(c) (1 point) Suppose for the convolutional layer, we are given grayscale images of size 22 × 22. Using one single 4 × 4 filter with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?

(d) (1 point) Suppose instead of sharing the same filter for the entire image, you learn a new filter each time you move across the image. Using 4 × 4 filters with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?

(e) (1 point) Now suppose you are given a 40 × 40 colored image, which consists of 3 channels (so your input is a 40 × 40 × 3 tensor), each representing the intensity of one primary color. Suppose

you learn a new filter each time you move across the image. Using 4 × 4 filters with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?

(f) (1 point) In a sentence, describe a reason why parameter sharing is a good idea for a convolutional layer applied to image data, besides the reduction in number of learned parameters.

3 Empirical Questions

The following questions should be completed after you work through the programming portion of this assignment.

For these questions, use the small dataset. Use the following values for the hyperparameters unless otherwise specified:

Parameter Value

Number of Hidden Units 50

Weight Initialization RANDOM

Learning Rate 0.01

Please submit computer-generated plots for (a)i and (b)i. Note: we expect it to take about 5 minutes to train each of these networks.

1. Hidden Units

(a) (2 points) Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the number of hidden units which should vary among 5, 20, 50, 100, and 200. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy (sum of the cross-entropy terms over the training dataset divided by the total number of training examples) on the y-axis vs number of hidden units on the x-axis. In the same figure, plot the average validation cross-entropy.

(b) (2 points) Examine and comment on the the plots of training and validation cross-entropy. What is the effect of changing the number of hidden units?

(c) (2 points) In the handout folder, we provide vallosssgdsmall.txt, a text file with the validation cross-entropy loss values for SGD performed using 100 epochs, 50 hidden units, random init, and 0.01 learning rate. In the same figure, plot them against your validation results for SGD with Adagrad using the same set of parameters and the small dataset.

(d) (2 points) Examine and compare the two results. What do you observe?

2. Learning Rate

(a) (6 points) Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the learning rate which should vary among 0.1, 0.01, and 0.001. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy on the y-axis vs the number of epochs on the x-axis for the mentioned learning rates. In the same figure, plot the average validation cross-entropy loss. Make a separate figure for each learning rate.

(b) (2 points) Examine and comment on the plots of training and validation cross-entropy. How does adjusting the learning rate affect the convergence of cross-entropy on the datasets?

3. Weight Initialization

$ python neuralnet.py smallTrain.csv smallValidation.csv smallTrain_out.labels smallValidation_out.labels smallMetrics_out.txt 1 4 2 0.1

Compare the values across rows and columns in α and β. Comment on what you observed. Do you think it is reasonable to use zero initialization? Why or why not?

4 Collaboration Questions

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.

3. Did you find or come across code that implements any part of this assignment ? If so, include full details.

Programming (61 points)

Figure 2: 10 random images from each of 10 image categories in CIFAR-10. You will be using a simplified version of this dataset.

5 The Task

Your goal in this assignment is to implement a neural network to classify images using a single hidden layer neural network. In addition, you will implement Adagrad, a variant of stochastic gradient descent.

6 The Datasets

Datasets We will be using a subset of a standard Computer Vision dataset, CIFAR-10. This data includes color images of various vehicles and animals; our subset will include black and white images of the 4 classes automobile, bird, frog, ship. The handout contains one dataset drawn from this data with 500 samples for training and 100 for validation.

File Format Each dataset (small, medium, and large) consists of two csv files—train and validation. Each row contains 1025 columns separated by commas. The first column contains the label and columns 2 to 1025 represent the pixel values of a 32×32 image in a row major format. Label 0 corresponds to automobile, 1 to bird, 2 to frog, and 3 to ship.

You should write your code to accept arbitrary pixel values in the range [0, 1]. The images in Figure 2 were produced by converting these pixel values into .png files for visualization. Observe that no feature engineering has been done here; instead the neural network you build will learn features appropriate for the task of image recognition.

7 Model Definition

In this assignment, you will implement a single-hidden-layer neural network with a sigmoid activation function for the hidden layer, and a softmax on the output layer. Let the input vectors x be of length M, and the hidden layer z consist of D hidden units. In addition, let the output layer yˆ be a probability distribution over K classes. That is, each element yˆk of the output vector represents the probability of x belonging to the class k.

We can compactly express this model by assuming that x0 = 1 is a bias feature on the input and that z0 = 1 is also fixed. In this way, we have two parameter matrices α ∈ RD×(M+1) and β ∈ RK×(D+1). The extra 0th column of each matrix (i.e. α·,0 and β·,0) hold the bias parameters.

The objective function we will use for training the neural network is the average cross entropy over the training dataset D = {(x(i),y(i))}:

(8)

In Equation 8, J is a function of the model parameters α and β because is implicitly a function of x(i), α, and β since it is the output of the neural network applied to x(i). Of course, and yk(i) are the kth components of yˆ(i) and y(i) respectively.

To train, you should optimize this objective function using stochastic gradient descent (SGD), where the gradient of the parameters for each training example is computed via backpropagation. You should not shuffle the training points when performing SGD. Note that SGD has a slight impact on the objective function, where we are “summing” over the current point, i:

K

JSGD(α,β) = −Xyk(i) log(ˆyk(i)) (9)

k=1

Lastly, let’s take a look at the Adagrad update that you will be performing. For parameters θt, you will first compute an intermediate value st, and then use this to compute θt+1. st will contain the element-wise sums (denoted by ) of all the element-wise squared gradients. Therefore, st should have the same shape as should be initialized once, before the first epoch, to a zero vector. The update equations for s and θ are below.

s . (10)

Then, we use st to scale the gradient for the update:

. (11)

Here, η is the learning rate, and .

7.1 Initialization

In order to use a deep network, we must first initialize the weights and biases in the network. This is typically done with a random initialization, or initializing the weights from some other training procedure. For this assignment, we will be using two possible initialization:

RANDOM The weights are initialized randomly from a uniform distribution from -0.1 to 0.1. The bias parameters are initialized to zero.

ZERO All weights are initialized to 0.

You must support both of these initialization schemes.

8 Implementation

Write a program neuralnet.{py|java|cpp|m} that implements an optical character recognizer using a one hidden layer neural network with sigmoid activations. Your program should learn the parameters of the model on the training data, report the cross-entropy at the end of each epoch on both train and validation data, and at the end of training write out its predictions and error rates on both datasets.

Your implementation must satisfy the following requirements:

• Use a sigmoid activation function on the hidden layer and softmax on the output layer to ensure it forms a proper probability distribution.

• Number of hidden units for the hidden layer should be determined by a command line flag.

• Support two different initialization strategies, as described in Section 7.1, selecting between them via a command line flag.

• Use stochastic gradient descent (SGD) to optimize the parameters for one hidden layer neural network. The number of epochs will be specified as a command line flag.

• Set the learning rate via a command line flag.

• Perform stochastic gradient descent updates on the training data in the order that the data is given in the input file. Although you would typically shuffle training examples when using stochastic gradient descent, in order to autograde the assignment, we ask that you DO NOT shuffle trials in this assignment.

• In case there is a tie in the output layer yˆ, predict the smallest index to be the label.

Implementing a neural network can be tricky: the parameters are not just a simple vector, but a collection of many parameters; computational efficiency of the model itself becomes essential; the initialization strategy dramatically impacts overall learning quality; other aspects which we will not change (e.g. activation function, optimization method) also have a large effect. These tips should help you along the way:

• Try to “vectorize” your code as much as possible—this is particularly important for Python. For example, in Python, you want to avoid for-loops and instead rely on numpy calls to perform operations such as matrix multiplication, transpose, subtraction, etc. over an entire numpy array at once. Why? Because these operations are actually implemented in fast C code, which won’t get bogged down the way a high-level scripting language like Python will.

• For low level languages such as Java/C++, the use of primitive arrays and for-loops would not pose any computational efficiency problems—however, it is still helpful to make use of a linear algebra library to cut down on the number of lines of code you will write.

$ python3 neuralnet.py [args…]

$ javac -cp “./lib/ejml-v0.38-libs/*:./” neuralnet.java

$ java -cp “./lib/ejml-v0.38-libs/*:./” neuralnet [args…]

$ g++ -g -std=c++11 -I./lib neuralnet.cpp; ./a.out [args…]

• Implement a finite difference test to check whether your implementation of backpropagation is correctly computing gradients. If you choose to do this, comment out this functionality once your backward pass starts giving correct results and before submitting to Gradescope—since it will otherwise slow down your code.

8.1 Command Line Arguments

The autograder runs and evaluates the output from the files generated, using the following command:

For Python:

For Java:

For C++:

Where above [args…] is a placeholder for nine command-line arguments: <traininput> validation input> <trainout> <validationout> <metricsout> <numepoch> <hiddenunits> <initflag> <learningrate>. These arguments are described in detail below:

1. <train input>: path to the training input .csv file (see Section 6)

2. <validationinput>: path to the validation input .csv file (see Section 6)

3. <train out>: path to output .labels file to which the prediction on the training data should be written (see Section 8.2)

4. <validationout>: path to output .labels file to which the prediction on the validation data should be written (see Section 8.2)

5. <metricsout>: path of the output .txt file to which metrics such as train and validationerror should be written (see Section 8.3)

6. <numepoch>: integer specifying the number of times backpropogation loops through all of the training data (e.g., if <numepoch> equals 5, then each training example will be used in backpropogation 5 times).

7. <hidden units>: positive integer specifying the number of hidden units.

8. <initflag>: integer taking value 1 or 2 that specifies whether to use RANDOM or ZERO initialization (see Section 7.1 and Section 8)—that is, if init_flag==1 initialize your weights randomly from a uniform distribution over the range [-0.1, 0.1] (i.e. RANDOM), if init_flag==2 initialize all weights to zero (i.e. ZERO). For both settings, always initialize bias terms to zero.

9. <learningrate>: float value specifying the base learning rate for SGD with Adagrad.

As an example, if you implemented your program in Python, the following command line would run your program with 4 hidden units on the small data provided in the handout for 2 epochs using zero initialization and a learning rate of 0.1.

$ python3 neuralnet.py smallTrain.csv smallValidation.csv smallTrain_out.labels smallValidation_out.labels smallMetrics_out.txt

2 4 2 0.1

8.2 Output: Labels Files

Your program should write two output .labels files containing the predictions of your model on training data (<trainout>) and validationdata (<validationout>). Each should contain the predicted labels for each example printed on a new line. Use to create a new line.

Your labels should exactly match those of a reference implementation – this will be checked by the autograder by running your program and evaluating your output file against the reference solution.

Note: You should output your predicted labels using the same integer identifiers as the original training data. You should also insert an empty line (again using ’ ’) at the end of each sequence (as is done in the input data files).

8.3 Output Metrics

Generate a file where you report the following metrics:

cross entropy After each epoch, report mean cross entropy on the training data crossentropy(train) and validationdata crossentropy(validation) (See Equation 8). These two cross-entropy values should be reported at the end of each epoch and prefixed by the epoch number. For example, after the second pass through the training examples, these should be prefixed by epoch=2. The total number of train losses you print out should equal numepoch—likewise for the total number of validationlosses.

error After the final epoch (i.e. when training has completed fully), report the final training error error(train) and validationerror error(validation).

A sample output is given below. It contains the train and validationlosses for the first 2 epochs and the final error rate when using the command given above.

epoch=1 crossentropy(train): 1.37990286268 epoch=1 crossentropy(validation): 1.40340064552 epoch=2 crossentropy(train): 1.37986222014 epoch=2 crossentropy(validation): 1.40252226818 error(train): 0.728 error(validation): 0.74

Take care that your output has the exact same format as shown above. There is an equal sign = between the word epoch and the epoch number, but no spaces. There should be a single space after the epoch number (e.g. a space after epoch=1), and a single space after the colon preceding the metric value (e.g. a space after epoch=1 likelihood(train):). Each line should be terminated by a Unix line ending .

8.4 Tiny Data Set

To help you with this assignment, we have also included a tiny data set, tinyTrain.csv and tinyValidation.csv, and a reference output file tinyOutput.txt for you to use. The tiny dataset is in a format similar to the other datasets, but it only contains two samples with five features. The reference file contains outputs from each layer of one correctly implemented neural network, for both forward and back-propagation steps. We advise you to use this set to help you debug in case your implementation doesn’t produce the same results as in the written part.

For your reference, tinyOutput.txt is generated from the following command line specifications:

$ python3 neuralnet.py tinyTrain.csv tinyValidation.csv tinyTrain_out.labels tinyValidation_out.labels tinyMetrics_out.txt 1 4 2 0.1

The specific output file names are not important, but be sure to keep the other arguments exactly as they are shown above.

9 Gradescope Submission

You should submit your neuralnet.{py|java|cpp} to Gradescope. Please do not use any other file name for your implementation. This will cause problems for the autograder to correctly detect and run your code.

Some additional tips: Make sure to read the autograder output carefully. The autograder for Gradescope prints out some additional information about the tests that it ran. For this programming assignment we’ve specially designed some buggy implementations that you might implement and will try our best to detect those and give you some more useful feedback in Gradescope’s autograder. Make wise use of autograder’s output for debugging your code.

A Implementation Details for Neural Networks

This section provides a variety of suggestions for how to efficiently and succinctly implement a neural network and backpropagation.

A.1 SGD for Neural Networks

Consider the neural network described in Section 8 applied to the ith training example (x,y) where y is a one-hot encoding of the true label. Our neural network outputs yˆ = hα,β(x), where α and β are the parameters of the first and second layers respectively and hα,β(·) is a one-hidden layer neural network with a sigmoid activation and softmax output. The loss function is negative cross-entropy J = `(yˆ,y) = −yT log(yˆ). J = Jx,y(α,β) is actually a function of our training example (x,y), and our model parameters α,β though we write just J for brevity.

In order to train our neural network, we are going to apply stochastic gradient descent. Because we want the behavior of your program to be deterministic for testing on Gradescope, we make a few simplifications: (1) you should not shuffle your data and (2) you will use a fixed learning rate. In the real world, you would not make these simplifications.

SGD proceeds as follows, where E is the number of epochs and γ is the learning rate.

Algorithm 1 Stochastic Gradient Descent (SGD) without Shuffle

1: procedure SGD(Training data D, testdata Dt)

2: Initialize parameters α,β . Use either RANDOM or ZERO from Section 7.1

3: for e ∈ {1,2,…,E} do . For each epoch

4: for (x,y) ∈ D do . For each training example (No shuffling)

5: Compute neural network layers:

6: o = object(x,a,b,z,yˆ,J) = NNFORWARD(x,y,α,β)

7: Compute gradients via backprop:

8: g

NNBACKWARD(x,y,α,β,o)

g

9: Update parameters with Adagrad updates :

10:

11:

12: Evaluate training mean cross-entropy JD(α,β)

13: Evaluate testmean cross-entropy JDt(α,β)

14: return parameters α,β

At test time, we output the most likely prediction for each example:

Algorithm 2 Prediction at Test Time

1: procedure PREDICT(Unlabeled train or testdataset D0, Parameters α,β)

2: for x ∈ D0 do

3: Compute neural network prediction yˆ = h(x)

4: Predict the label with highest probability l = argmaxk yˆk

The gradients we need above are themselves matrices of partial derivatives. Let M be the number of input features, D the number of hidden units, and K the number of outputs.

α10 α11 … α1M

α20 α21 … α2M

α = … … … … g(12)

αD0 αD1 … αDM

β10 β11 … β1D

β20 β21 … β2D

β = … … … … g(13)

βK0 βK1 … βKD

Observe that we have (in a rather tricky fashion) defined the matrices such that both α and gα are D×(M + 1) matrices. Likewise, β and gβ are K ×(D+1) matrices. The +1 comes from the extra columns α·,0 and β·,0 which are the bias parameters for the first and second layer respectively. We will always assume x0 = 1 and z0 = 1. This should greatly simplify your implementation as you will see in Section A.3.

A.2 Recursive Derivation of Backpropagation

In class, we described a very general approach to differentiating arbitrary functions: backpropagation. One way to understand how we go about deriving the backpropagation algorithm is to consider the natural consequence of recursive application of the chain rule.

In practice, the partial derivatives that we need for learning are and .

A.2.1 Symbolic Differentiation

Note In this section, we motivate backpropagation via a strawman: that is, we will work through the wrong approach first (i.e. symbolic differentiation) in order to see why we want a more efficient method (i.e. backpropagation). Do not use this symbolic differentiation in your code.

1. Considering the computational graph for the neural network, we observe that αij has exactly one child

. That is ai is the first and only intermediate quantity that uses αij. Applying the

chain rule, we obtain

2. So far so good, now we just need to compute . Not a problem! We can just apply the chain rule again. ai just has exactly one child as well, namely zi = σ(ai). The chain rule gives us that

. Substituting back into the equation above we find that

3. How do we get ? You guessed it: apply the chain rule yet again. This time, however, there are multiple children of zi in the computation graph; they are b1,b2,…bK. Applying the chain rule gives us that . Substituting back into the equation above gives:

4. Next we need , which we again obtain via the chain rule: l] − yˆk). Substituting back in above gives:

5. Finally, we know that which we can again substitute back in to obtain our final result:

Although we have successfully derived the partial derivative w.r.t. αij, the result is far from satisfying. It is overly complicated and requires deeply nested for-loops to compute.

The above is an example of symbolic differentiation. That is, at the end we get an equation representing the partial derivative w.r.t. αij. At this point, you should be saying to yourself: What a mess! Isn’t there a better way? Indeed there is and its called backpropagation. The algorithm works just like the above symbolic differentiation except that we never subsitute the partial derivative from the previous step back in. Instead, we work “backwards” through the steps above computing partial derivatives in a top-down fashion.

A.3 Matrix / Vector Operations for Neural Networks

Some programming languages are fast and some are slow. Below is a simple benchmark to show this concretely. The task is to compute a dot-product aT b between two vectors a ∈ R500 and b ∈ R500 one thousand times. Table 1 shows the time taken for several combinations of programming language and data structure.

language data structure time (ms)

Python list 200.99

Python numpy array 1.01

Java float[] 4.00

C++ vector<float> 0.81

Table 1: Computation time required for dot-product in various languages.

Notice that Java and C++ with standard data structures are quite efficient. By contrast, Python differs dramatically depending on which data structure you use: with a standard list object

(e.g. a = [float(i) for x in range(500)]) the computation time is an appallingly slow 200+ milliseconds. Simply by switching to a numpy array (e.g. a = np.arange(500, dtype=float)) we obtain a 200x speedup. This is because a numpy array is actually carrying out the dot-product computation in pure C, which is just as fast as our C++ benchmark, modulo some Python overhead.

Thus, for this assignment, Java and C++ programmers could easily implement the entire neural network using standard data structures and some for-loops. However, Python programmers would find that their code is simply too slow if they tried to do the same. As such, particularly for Python users, one must convert all the deeply nested for-loops into efficient “vectorized” math via numpy. Doing so will ensure efficient code. Java and C++ programmers can also benefit from linear algebra packages since it can cut down on the total number of lines of code you need to write.

A.4 Procedural Method of Implementation

Perhaps the simplest way to implement a 1-hidden-layer neural network is procedurally. Note that this approach has some drawbacks that we’ll discuss below (Section A.4.1).

The procedural method: one function computes the outputs of the neural network and all intermediate quantities o = NNFORWARD(x,y,α,β) = object(x,a,b,z,yˆ,J), where the object is just some struct. Then another function computes the gradients of our parameters gα,gβ = NNBACKWARD(x,y,α,β,o), where o is a data structure that stores all the forward computation.

One must be careful to ensure that functions are vectorized. For example, your Sigmoid function should be able to take a vector input and return a vector output with the Sigmoid function applied to all of its elements. All of these operations should avoid for-loops when working in a high-level language like Python. We can compute the softmax function in a similar vectorized manner.

A.4.1 Drawbacks to Procedural Method

As noted in Section A.6, it is possible to use a finite difference method to check that the backpropagation algorithm is correctly computing the gradient of its corresponding forward computation. We strongly encourage you to do this.

There is a big problem however: what if your finite difference check informs you that the gradient is not being computed correctly. How will you know which part of your NNFORWARD() or NNBACKWARD() functions has a bug? There are two possible solutions here:

1. As usual, you can (and should) work through a tiny example dataset on paper. Compute each intermediate quantity and each gradient. Check that your code reproduces each number. The one that does not should indicate where to find the bug.

2. Replace your procedural implementation with a module-based one (as described in Section A.5) and then run a finite-difference check on each layer of the model individually. The finite-difference check that fails should indicate where to find the bug.

Of course, rather than waiting until you have a bug in your procedural implementation, you could jump straight to the module-based version—though it increases the complexity slightly (i.e. more lines of code) it might save you some time in the long run.

A.5 Module-based Method of Implementation

Module-based automatic differentiation (AD) is a technique that has long been used to develop libraries for deep learning. Dynamic neural network packages are those that allow a specification of the computation graph dynamically at runtime, such as Torch , PyTorch , and DyNet —these all employ module-based AD in the sense that we will describe here.

The key idea behind module-based AD is to componentize the computation of the neural-network into layers. Each layer can be thought of as consolidating numerous nodes in the computation graph (a subset of them) into one vector-valued node. Such a vector-valued node should be capable of the following and we call each one a module:

1. Forward computation of output b = [b1,…,bB] given input a = [a1,…,aA] via some differentiable function f. That is b = f(a).

2. Backward computation of the gradient of the input ga given the gradient of output gb , where J is the final real-valued output of the entire computation graph. This is done via the chain rule for all i ∈ {1,…,A}.

A.5.1 Module Definitions

The modules we would define for our neural network would correspond to a Linear layer, a Sigmoid layer, a Softmax layer, and a Cross-Entropy layer. Each module defines a forward function b = *FORWARD(a), and a backward function ga = *BACKWARD(a,b,gb) method. These methods accept parameters if appropriate. You’ll want to pay close attention to the dimensions that you pass into and return from your modules.

Linear Module The linear layer has two inputs: a vector a and parameters ω ∈ RB×A. The output b is not used by LINEARBACKWARD, but we pass it in for consistency of form.

1: procedure LINEARFORWARD(a, α)

2: Compute b

3: return b

4: procedure LINEARBACKWARD(a, α, b, gb)

5: Compute gα

6: Compute ga

7: return gα,ga

It’s also quite common to combine the Cross-Entropy and Softmax layers into one. The reason for this is the cancelation of numerous terms that result from the zeros in the cross-entropy backward calculation. (Said trick is not required to obtain a sufficiently fast implementation for Gradescope.)

A.5.2 Module-based AD for Neural Network

Using these modules, we can re-define our functions NNFORWARD and NNBACKWARD as follows.

Algorithm 3 Forward Computation

1: procedure NNFORWARD(Training example (x, y), Parameters α, β)

2: a = LINEARFORWARD(x,α)

3: z = SIGMOIDFORWARD(a)

4: b = LINEARFORWARD(z,β)

5: yˆ = SOFTMAXFORWARD(b)

6: J = CROSSENTROPYFORWARD(y,yˆ)

7: o = object(x,a,z,b,yˆ,J)

8: return intermediate quantities o

Algorithm 4 Backpropagation

1: procedure NNBACKWARD(Training example (x, y), Parameters α, β, Intermediates o) 2: Place intermediate quantities x,a,z,b,yˆ,J in o in scope

3: gJ = ∂J∂J = 1 . Base case

4: gyˆ = CROSSENTROPYBACKWARD(y,yˆ,gJ)

5: gb = SOFTMAXBACKWARD(b,yˆ,gyˆ)

6: gβ,gz = LINEARBACKWARD(z,β,gb)

7: ga = SIGMOIDBACKWARD(a,z,gz)

8: gα,gx = LINEARBACKWARD(x,α,ga) .

9: return parameter gradients gα,gβ We discard gx

Here’s the big takeaway: we can actually view these two functions as themselves defining another module! It is a 1-hidden layer neural network module. That is, the cross-entropy of the neural network for a single training example is itself a differentiable function and we know how to compute the gradients of its inputs, given the gradients of its outputs.

A.6 Testing Backprop with Numerical Differentiation

Numerical differentiation provides a convenient method for testing gradients computed by backpropagation. The centered finite difference approximation is:

(14)

where di is a 1-hot vector consisting of all zeros except for the ith entry of di, which has value 1. Unfortunately, in practice, it suffers from issues of floating point precision. Therefore, it is typically only appropriate to use this on small examples with an appropriately chosen .

In order to apply this technique to test the gradients of your backpropagation implementation, you will need to ensure that your code is appropriately factored. Any of the modules including NNFORWARD and NNBACKWARD could be tested in this way.

For example, you could use two functions: forward(x,y,theta) computes the cross-entropy for a training example. backprop(x,y,theta) computes the gradient of the cross-entropy for a training example via backpropagation. Finally, finite_diff as defined below approximates the gradient by the centered finited difference method. The following pseudocode provides an overview of the entire procedure.

def finite_diff(x, y, theta):

epsilon = 1e-5

grad = zero_vector(theta.length) for m in [1, …, theta.length]:

d = zero_vector(theta.length) d[m] = 1 v = forward(x, y, theta + epsilon * d) v -= forward(x, y, theta – epsilon * d) v /= 2*epsilon grad[m] = v

# Compute the gradient by backpropagation grad_bp = backprop(x, y, theta)

# Approximate the gradient by the centered finite difference method grad_fd = finite_diff(x, y, theta)

# Check that the gradients are (nearly) the same diff = grad_bp – grad_fd # element-wise difference of two vectors print l2_norm(diff) # this value should be small (e.g. < 1e-7)

A.6.1 Limitations

This does not catch all bugs—the only thing it tells you is whether your backpropagation implementation is correctly computing the gradient for the forward computation. Suppose your forward computation is incorrect, e.g. you are always computing the cross-entropy of the wrong label. If your backpropagation is also using the same wrong label, then the check above will not expose the bug. Thus, you always want to separately test that your forward implementation is correct.

A.6.2 Finite Difference Checking of Modules

Note that the above would test the gradient for the entire end-to-end computation carried output by the neural network. However, if you implement a module-based automatic differentiation method (as in Section A.5), then you can test each individual component for correctness. The only difference is that you need to run the finite-difference check for each of the output values (i.e. a double for-loop).

A.7 Why AdaGrad?

So far, the loss functions we have discussed make it quite easy to find a global optimum. In these cases, a larger step size makes it quick and easy to reach convergence. This isn’t always the case with neural networks. Nonconvex loss functions are much harder to optimize over. Here we want step sizes that will adapt to the domain in which they optimize. We want to take larger steps where possible, but smaller steps where we are in danger of overshooting the optima. Adagrad implicitly changes the step size based on the shape of the function inferred from the gradients. Interested? Read all about it here.

## Reviews

There are no reviews yet.