## Description

TAs: Alex, Lavanya, Neural, Qiuyi, Tara, Yuchen

Summary In this assignment, you will build an image recognition system using a neural network. In the Written component, you will walk through an on-paper example of how to implement a neural network. Then, in the Programming component, you will implement an end-to-end system that learns to perform image classification.

START HERE: Instructions

หmgormley/courses/10601/syllabus.html

โข Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ หmgormley/courses/10601/syllabus.html

โข Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.

โ Programming: You will submit your code for programming questions on the homework to

Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.9.12) and versions of permitted libraries (e.g. numpy 1.23.0) match those used on Gradescope. You have 10 free Gradescope programming submissions. After 10 submissions, you will begin to lose points from your total programming score. We recommend debugging your implementation on your local machine (or the Linux servers) and making sure your code is running correctly first before submitting your code to Gradescope.

โข Materials: The data and reference output that you will need in order to complete this assignment is posted along with the writeup and template on the course website.

1

Instructions for Specific Problem Types

For โSelect Oneโ questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley

โ Marie Curie

โ Noam Chomsky

Select One: Who taught this course?

Henry Chai

โ Marie Curie

@@ Noam Chomsky

For โSelect all that applyโ questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

โ Stephen Hawking

โ Albert Einstein

โ Isaac Newton

2 I donโt know

Select all that apply: Which are scientists?

โ Stephen Hawking

โ Albert Einstein

โ Isaac Newton

@โ I donโt know

Fill in the blank: What is the course number?

10-S6301S

Written Questions (29 points)

1 Example Feed Forward and Backpropagation (15 points)

Figure 1: Computational Graph for a One Hidden Layer Neural Network

Network Overview Consider the neural network with one hidden layer shown in Figure 1. The input layer consists of 6 features x = [x1,…,x6]T , the hidden layer has 4 nodes z = [z1,…,z4]T , and the output layer is a probability distribution y = [y1,y2,y3]T over 3 classes (1-indexed such that yi is the probability of label i).

ฮฑโ is the matrix of weights from the inputs to the hidden layer and ฮฒโ is the matrix of weights from the hidden layer to the output layer.

represents the weight going to the node zj in the hidden layer from the node xi in the input layer is the weight from x2 to z1), and ฮฒโ is defined similarly. We will use a sigmoid activation function for the hidden layer and a softmax for the output layer.

The bias vectors ฮฑb,ฮฒb are defined such that the jth value of ฮฑb (which we denote ฮฑj,b) is the bias value for aj and the kth value of ฮฒb is the bias value for bk.

Network Details Equivalently, we define each of the following.

The input:

x = [x1,x2,x3,x4,x5,x6]T

Linear combination at the first (hidden) layer: (1)

6

aj = ฮฑj,b + Xฮฑj,iโ ยท xi, โj โ {1,…,4}

i=1

Activation at the first (hidden) layer: (2)

(3)

Equivalently, we can write this as vector operation where the sigmoid activation is applied individually to each element of the vector a:

z = ฯ(a)

Linear combination at the second (output) layer: (4)

4 bk = ฮฒk,b + Xฮฒk,jโ ยท zj, โk โ {1,…,3} (5)

j=1

Activation at the second (output) layer:

(6)

Loss We will use cross entropy loss, โ(yห,y). If y represents our target output, which will be a one-hot vector representing the correct class, and yห represents the output of the network, the loss is calculated by:

3

โ(yห,y) = โXyi log(หyi) (7)

i=1

For the below questions use natural log in the equation.

Prediction When doing prediction, we will predict the argmax of the output layer. For example, if yห1 = 0.3,yห2 = 0.2,yห3 = 0.5 we would predict class 3. If the true class from the training data was 2 we would have a one-hot vector y with values y1 = 0, y2 = 1, y3 = 0.

1. In the following questions you will derive the matrix and vector forms of the previous equations which define our neural network. These are what you should hope to program in order to keep your program under the Gradescope time-out.

When working these out it is important to keep a note of the vector and matrix dimensions in order for you to easily identify what is and isnโt a valid multiplication. Suppose you are given an training example: x(1) = [x1,x2,x3,x4,x5,x6]T with label class 2, so y(1) = [0,1,0]T . We initialize the

network weights as:

๏ฃฎฮฑ1,1 ฮฑโ = ๏ฃฏ๏ฃฏฮฑ2,1

๏ฃฐฮฑ3,1 ฮฑ4,1 ฮฑ1,2 ฮฑ2,2 ฮฑ3,2 ฮฑ4,2 ฮฑ1,3 ฮฑ2,3 ฮฑ3,3 ฮฑ4,3 ฮฑ1,4 ฮฑ2,4 ฮฑ3,4 ฮฑ4,4 ฮฑ1,5 ฮฑ2,5 ฮฑ3,5 ฮฑ4,5 ฮฑ1,6๏ฃน ฮฑ2,6๏ฃบ ฮฑ3,6๏ฃบ๏ฃป ฮฑ4,6

We want to also consider the bias term and the weights on the bias terms (ฮฑj,b and ฮฒk,b). To account for these we can add them as a new column to the beginning of our initial weight matrices to represent biases, (e.g. ฮฑ1,0 = ฮฑ1,b).

ฮฑ1,2 ฮฑ2,2 ฮฑ3,2 ฮฑ4,2 ฮฑ1,3 ฮฑ2,3 ฮฑ3,3 ฮฑ4,0 ฮฑ1,4 ฮฑ2,4 ฮฑ3,4 ฮฑ4,4 ฮฑ1,5 ฮฑ1,6๏ฃน ฮฑ2,5 ฮฑ2,6๏ฃบ ฮฑ3,5 ฮฑ3,6๏ฃบ๏ฃป ฮฑ4,5 ฮฑ4,6

ฮฒ1,1 ฮฒ2,1 ฮฒ3,1 ฮฒ1,2 ฮฒ2,2 ฮฒ3,2 ฮฒ1,3 ฮฒ2,3 ฮฒ3,3

We then add a corresponding new first dimension to our input vectors, always set to 1 ( ), so our input becomes: x(1) = [1,x1,x2,x3,x4,x5,x6]T

(a) (1 point) By examining the shapes of the initial weight matrices, how many neurons do we have in the first hidden layer of the neural network? Do not include the bias in your count.

(b) (1 point) How many output neurons will our neural network have?

(c) (1 point) What is the vector a whose elements are made up of the entries aj in Equation 2 (using

(1) (1).

xi in place of xi). Write your answer in terms of ฮฑ and x

(d) (1 point) Select one: We cannot take the matrix multiplication of our weights ฮฒ and the vector z = [z1,z2,z3,z4]T since they are not compatible shapes. Which of the following would allow us to take the matrix multiplication of ฮฒ and z such that the entries of the vector b = ฮฒz are equivalent to the values of bk in Equation 5?

โ A) Remove the first row of z

B) Append a value of 1 to be the first entry of z.

โ C) Append an additional column of 1โs to be the first column of ฮฒ

โ D) Append a row of 1โs to be the first row of ฮฒ

(e) (1 point) What are the entries of the output vector yห? Your answer should be written in terms of

2. We will now derive the matrix and vector forms for the backpropagation algorithm, for example

The mathematics which you have to derive in this section jump significantly in difficulty, you should always be examining the shape of the matrices and vectors and making sure that you are comparing your matrix elements with calculations of individual derivatives to make sure they match (e.g., the element of the matrix should be equal to ). Recall that โ is our loss function defined in Equation 7:

Note: all vectors are column vectors (i.e. an n dimensional vector v โ Rnร1). Assume that all input vectors to linear layers have a bias term folded in, unless otherwise specified. All partial derivatives should be written in denominator layout notation. An example of denominator notation is that โฮฒโโ โ R3ร5 because ฮฒ โ R3ร5.

(a) (1 point) What is the derivative , where 1 โค i โค 3? Your answer should be in terms of yi and yหi. Recall that we define the loss โ(y,yห ) as follows (note: log is a natural log):

3

โ(yห,y) = โXyi log(หyi) (7)

(b) (3 points) The derivative of the softmax function with respect to bk is as follows:

(8)

where I[k = l] is an indicator function such that if k = l then it returns value 1 and 0 otherwise.

Using this and your result from (a), write the derivative in a smart way such that you do not need the indicator function in Equation 8. Write your solutions in terms of yหk,yk. Show your work below.

Hint 1: Recall that .

Hint 2: After substituting in your expressions for , try to rearrange terms so that you encounter the expression yหk Pl yl. What is the value of Pl yl?

(c) (2 points) What is the derivative ? Your answer should be in terms of โโโb and z.

You should first consider a single entry in this matrix: .

(d) (1 point) Select one: Why do we use the matrix ฮฒโ (the matrix ฮฒ without the first column of bias values) instead of ฮฒ when calculating the derivative matrix โฮฑโโ ? (Hint: try drawing a computation graph with the bias unfolded).

โ A) The bias terms do not update, so there is no need to include them in backpropagation.

โ B) It is the computationally cheapest column to remove to ensure that the dimensions match.

C) The elements ฮฒk,0 are not determined by the values of ฮฑ

โ D) The derivative of loss with respect to the bias terms is always zero.

(e) (1 point) What is the derivative (not including the bias term)? Your answer should be in terms of โโโb and ฮฒโ.

(f) (1 point) What is the derivative in terms of and zj?

(g) (1 point) What is the matrix ? Your answer should be in terms of โโโa and x(1).

2 Empirical Questions (14 points)

The following questions should be completed after you work through the programming portion of this assignment. For any plotting questions, you must using curves/line graph, title your graph, label your axes and provide units (if applicable), and provide a legend in order to receive full credit.

For these questions, use the small dataset. Use the following values for the hyperparameters unless otherwise specified:

Parameter Value

Number of Hidden Units 50

Weight Initialization RANDOM

Learning Rate 0.001

Please submit computer-generated plots for all parts. To get full credit, your plots must be line graphs with labels for both axes with the value plotted on that axis and legends labeling every line.

1. Hidden Units

(a) (2 points) Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the number of hidden units which should vary among 5, 20, 50, 100, and 200. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy (sum of the cross-entropy terms over the training dataset divided by the total number of training examples) of the final epoch on the y-axis vs number of hidden units on the x-axis. In the same figure, plot the average validation cross-entropy. The xaxis should be the number of hidden units, the y-axis should be average cross-entropy loss, and there should be one curve for validation loss and one curve for train loss.

(b) (2 points) Examine and comment on the the plots of training and validation cross-entropy. What problem arises with too few hidden units, and why does it happen?

Answer

With too few hidden units, we have very high training and validation loss and there seems to be not much learning actually taking place. We can see that there is not much gap in the two errors but this is not because it is fitting well, the average CE loss is still extremely high. The network performs this way because we donโt have enough learned features mapping the input to the output. That is, we donโt have enough learned features (hidden units) to accurately approximate a complex enough mapping from input to output.

2. Learning Rate

(a) (6 points) Train a single hidden layer neural network using the hyperparameters mentioned in the table above, except for the learning rate which should vary among 0.03, 0.003, and 0.0003. Run the optimization for 100 epochs each time.

Plot the average training cross-entropy on the y-axis vs the number of epochs on the x-axis for the mentioned learning rates. In the same figure, plot the average validation cross-entropy loss. Make a separate figure for each learning rate. The x-axis should be epoch number, the y-axis should be average cross-entropy loss, and there should be one curve for training loss and one curve for validation loss.

(b) (2 points) Examine and comment on the plots of training and validation cross-entropy. Are there any learning rates for which convergence is not achieved? Are there any learning rates that exhibit other problems? If so, describe these issues and list the learning rates that cause them.

Answer

This experiment is a good one because is shows the full spectrum of possibilities. With the learning rate of 0.03, it looks like we are severely overtraining with a too big of step size. The training losses converges quickly while the validation loss starts to improve then actually gets worse! The learning rate of 0.003 is the sweet spot, the training and validation loss mostly converge. There is some overfitting but it isnโt as extreme as the 0.03. The 0.0003 learning rate does not learn fast enough and ends up not converging! The training and validation loss slowly decrease together (with minimal overfitting) but there is no elbow in the graph that signifies that they are converging.

3. Weight Initialization

$ python neuralnet.py small_train.csv small_validation.csv small_train_out.labels small_validation_out.labels small_metrics_out.txt 1 4 2 0.1

Answer

What I noticed in this experiment was that in the first few iterations that the rows of the weight matrices are changing and trending together. Even in the first few iterations, many of the columns also had the same values but was not as glaring as the rows. This means that the hidden units are learning very similar weightings for the features (inputs). I think this means that we are not going to learn orthogonal features to effectively map the input to the output and we likely will take much much longer (more epochs) to truly converge. In terms of the model capacity, the potential is less because we will not learn as many orthogonal features.

3 Collaboration Questions

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.

3. Did you find or come across code that implements any part of this assignment? If so, include full details.

Programming (94 points)

Figure 2: 10 random images of each of the 10 letters in the OCR dataset.

4 The Task

Your goal in this assignment is to implement a neural network to classify images using a single hidden layer neural network.

5 The Datasets

Datasets We will be using a subset of an Optical Character Recognition (OCR) dataset. This data includes images of all 26 handwritten letters; our subset will include only the letters โa,โ โe,โ โg,โ โi,โ โl,โ โn,โ โo,โ โr,โ โt,โ and โu.โ The handout contains a small dataset with 60 samples per class (50 for training and 10 for validation). We will also evaluate your code on a medium dataset with 600 samples per class (500 for training and 100 for validation). Figure 2 shows a random sample of 10 images of a few letters from the dataset (not the same ones weโre classifying in this assignment).

File Format Each dataset (small, medium, and large) consists of two csv filesโtrain and validation. Each row contains 129 columns separated by commas. The first column contains the label and columns 2 to 129 represent the pixel values of a 16 ร 8 image in a row major format. Label 0 corresponds to โa,โ 1 to โe,โ 2 to โg,โ 3 to โi,โ 4 to โl,โ 5 to โn,โ 6 to โo,โ 7 to โr,โ 8 to โt,โ and 9 to โu.โ

Because the original images are black-and-white (not grayscale), the pixel values are either 0 or 1. However, you should write your code to accept arbitrary pixel values in the range [0, 1]. The images in Figure 2 were produced by converting these pixel values into .png files for visualization. Observe that no feature engineering has been done here; instead the neural network you build will learn features appropriate for the task of character recognition.

6 Model Definition

In this assignment, you will implement a single-hidden-layer neural network with a sigmoid activation function for the hidden layer, and a softmax on the output layer. Let the input vectors x be of length M, and the hidden layer z consist of D hidden units. In addition, let the output layer yห be a probability distribution over K classes. That is, each element yหk of the output vector represents the probability of x belonging to the class k.

We can compactly express this model by assuming that x0 = 1 is a bias feature on the input and that z0 = 1 is also fixed. In this way, we have two parameter matrices ฮฑ โ RDร(M+1) and ฮฒ โ RKร(D+1). The extra 0th column of each matrix (i.e. ฮฑยท,0 and ฮฒยท,0) hold the bias parameters.

yหk = Softmax

The objective function we will use for training the neural network is the average cross entropy over the training dataset D = {(x(i),y(i))}:

(9)

In Equation 9, J is a function of the model parameters ฮฑ and ฮฒ because is the output of the neural network applied to x(i) and is therefore implicitly a function of x(i), ฮฑ, and and yk(i) are the kth components of yห(i) and y(i) respectively.

To train, you should optimize this objective function using stochastic gradient descent (SGD), where the gradient of the parameters for each training example is computed via backpropagation. You should shuffle the training points when performing SGD using the provided shuffle function, passing in the epoch number as a random seed. Note that SGD has a slight impact on the objective function as we are โsummingโ over just the current point, i, and not the entire dataset:

K

JSGD(ฮฑ,ฮฒ) = โXyk(i) log(หyk(i)) (10)

k=1

You will use the (hopefully at this point) familiar SGD update rule to update the parameters of your model:

(11)

(12)

where ฮณ is the learning rate, and ฮฑt and ฮฒt are the values of ฮฑ and ฮฒ at step t (similarly for ฮฑt+1 and ฮฒt+1). 6.1 Initialization

In order to use a deep network, we must first initialize the weights and biases in the network. This is typically done with a random initialization, or initializing the weights from some other training procedure. For this assignment, we will be using two possible initializations:

RANDOM The weights are initialized randomly from a uniform distribution from -0.1 to 0.1. The bias parameters are initialized to zero.

ZERO All weights are initialized to 0.

You must support both of these initialization schemes.

7 Implementation

Write a program neuralnet.py that implements an optical character recognizer using a one hidden layer neural network with sigmoid activations. Your program should learn the parameters of the model on the training data, report the cross-entropy at the end of each epoch on both train and validation data, and at the end of training write out its predictions and error rates on both datasets.

Your implementation must satisfy the following requirements:

โข Use a sigmoid activation function on the hidden layer and softmax on the output layer to ensure it forms a proper probability distribution.

โข Number of hidden units for the hidden layer should be determined by a command line flag. (More details on command line flags provided below.)

โข Support two different initialization strategies, as described in Section 6.1, selecting between them via a command line flag.

โข Use stochastic gradient descent (SGD) to optimize the parameters for one hidden layer neural network. The number of epochs will be specified as a command line flag.

โข Set the learning rate via a command line flag.

โข Perform stochastic gradient descent updates on the training data on the data shuffled with the provided function. For each epoch, you must reshuffle the original file data, not the data from the previous epoch.

โข In case there is a tie in the output layer yห, predict the smallest index to be the label. (Hint: you shouldnโt need to write extra code for tie-breaking if you are using the appropriate NumPy function.)

Implementing a neural network can be tricky: the parameters are not just a simple vector, but a collection of many parameters; computational efficiency of the model itself becomes essential; the initialization strategy dramatically impacts overall learning quality; other aspects which we will not change (e.g. activation function, optimization method) also have a large effect. These tips should help you along the way:

โข Try to โvectorizeโ your code as much as possibleโthis is particularly important for Python. For example, in Python, you want to avoid for-loops and instead rely on numpy calls to perform operations such as matrix multiplication, transpose, subtraction, etc., over an entire numpy array at once. Why? Because those calls can be much faster! Those operations are actually implemented in fast C code, which wonโt get bogged down the way a high-level scripting language like Python will.

โข Implement a finite difference test to check whether your implementation of backpropagation is correctly computing gradients. If you choose to do this, comment out this functionality once your backward pass starts giving correct results and before submitting to Gradescopeโsince it will otherwise slow down your code.

7.1 Command Line Arguments

The autograder runs and evaluates the output from the files generated, using the following command:

$ python3 neuralnet.py [args…]

Where above [args…] is a placeholder for nine command-line arguments: <traininput> validation input> <trainout> <validationout> <metricsout> <numepoch> <hiddenunits> <initflag> <learningrate>. These arguments are described in detail below:

1. <train input>: path to the training input .csv file (see Section 5)

2. <validationinput>: path to the validation input .csv file (see Section 5)

3. <train out>: path to output .labels file to which the prediction on the training data should be written (see Section 7.2)

4. <validationout>: path to output .labels file to which the prediction on the validation data should be written (see Section 7.2)

5. <metricsout>: path of the output .txt file to which metrics such as train and validationerror should be written (see Section 7.3)

6. <numepoch>: integer specifying the number of times backpropagation loops through all of the training data (e.g., if <numepoch> equals 5, then each training example will be used in backpropagation 5 times).

7. <hidden units>: positive integer specifying the number of hidden units.

8. <initflag>: integer taking value 1 or 2 that specifies whether to use RANDOM or ZERO initialization (see Section 6.1 and Section 6)โthat is, if init_flag==1 initialize your weights randomly from a uniform distribution over the range [-0.1, 0.1] (i.e. RANDOM), if init_flag==2 initialize all weights to zero (i.e. ZERO). For both settings, always initialize bias terms to zero.

9. <learningrate>: float value specifying the learning rate for SGD.

10. <–debug>: (optional argument) set the logging level, set to DEBUG to show logging.

As an example, if you implemented your program in Python, the following command line would run your program with 4 hidden units on the small data provided in the handout for 2 epochs using zero initialization and a learning rate of 0.1.

python3 neuralnet.py small_train.csv small_validation.csv small_train_out.labels small_validation_out.labels small_metrics_out.txt 2 4 2 0.1

7.2 Output: Labels Files

Your program should write two output .labels files containing the predictions of your model on training data (<trainout>) and validationdata (<validationout>). Each should contain the predicted labels for each example printed on a new line. Use to create a new line.

Your labels should exactly match those of a reference implementation โ this will be checked by the autograder by running your program and evaluating your output file against the reference solution.

Note: You should output your predicted labels using the same integer identifiers as the original training data. You should also insert an empty line (using โ โ) at the end of each sequence (as is done in the input data files).

7.3 Output Metrics

Generate a file where you report the following metrics:

cross entropy After each epoch, report mean cross entropy on the training data crossentropy(train) and validationdata crossentropy(validation) (See Equation 9). These two cross-entropy values should be reported at the end of each epoch and prefixed by the epoch number. For example, after the second pass through the training examples, these should be prefixed by epoch=2. The total number of train losses you print out should equal numepochโlikewise for the total number of validationlosses.

error After the final epoch (i.e. when training has completed fully), report the final training error error(train) and validationerror error(validation).

A sample output is given below. It contains the train and validationlosses for the first 2 epochs and the final error rate when using the command given above.

epoch=1 crossentropy(train): 2.1415670910950144 epoch=1 crossentropy(validation): 2.1502231738985618 epoch=2 crossentropy(train): 1.8642629963917074 epoch=2 crossentropy(validation): 1.8780601379038728 error(train): 0.73 error(validation): 0.72

Take care that your output has the exact same format as shown above. There is an equal sign = between the word epoch and the epoch number, but no spaces. There should be a single space after the epoch number (e.g. a space after epoch=1), and a single space after the colon preceding the metric value (e.g. a space after epoch=1 crossentropy(train):). Each line should be terminated by a Unix line ending .

python3 neuralnet.py small_train.csv small_validation.csv small_train_out.labels small_validation_out.labels small_metrics_out.txt 1 4 2 0.1

The specific output file names are not important, but be sure to keep the other arguments exactly as they are shown above.

8 Gradescope Submission

You should submit your neuralnet.py to Gradescope. Any other files will be deleted. Please do not use any other file name for your implementation. This will cause problems for the autograder to correctly detect and run your code.

Make sure to read the autograder output carefully. The autograder for Gradescope prints out some additional information about the tests that it ran. For this programming assignment weโve specially designed some buggy implementations that you might implement and will try our best to detect those and give you some more useful feedback in Gradescopeโs autograder. Make wise use of autograderโs output for debugging your code.

9 Implementation Details

9.1 Module-based Method of Implementation

Module-based automatic differentiation (AD) is a technique that has long been used to develop libraries for deep learning, and is the method of implementation that you are encouraged to follow in this assignment. Dynamic neural network packages are those that allow a specification of the computation graph dynamically at runtime, such as Torch , PyTorch , and DyNet โthese all employ module-based AD in the sense that we will describe here.

The key idea behind module-based AD is to componentize the computation of the neural-network into layers. Each layer can be thought of as consolidating numerous nodes in the computation graph (a subset of them) into one vector-valued node. Such a vector-valued node should be capable of the following and we call each one a module (corresponding to a class in Python):

1. Forward computation of output b = [b1,…,bB] given input a = [a1,…,aA] via some differentiable function f. That is, b = f(a).

2. Backward computation of the gradient of the input ga given the gradient of output gb , where J is the final real-valued output of the entire computation graph. This is done via the chain rule for all i โ {1,…,A}.

9.1.1 Module Definitions

Finally, youโll want to pay close attention to the dimensions that you pass into and return from your modules. The dimensions A and B are specific to the module such that we have input a โ RA, output b โ RB, gradient of output ga โ โaJ โ RA, and gradient of input gb โ โbJ โ RB.

We have provided you the pseudocode for the Linear Module as an example.

Linear Module

1: procedure FORWARD(a)

2: Compute b using this layerโs weight matrix

3: Cache intermediate value(s) for the backward pass โท See Written Question 1.2(c) 4: return b

5: procedure BACKWARD(gb)

6: Bring the necessary cached values into scope

7: Compute gฮฑ

8: Compute ga

9: Store gฮฑ for subsequent SGD update

10: return ga

11: procedure STEP()

12: Apply SGD update to weights ฮฑ using stored gradient gฮฑ

9.1.2 Module-based AD for Neural Network

Given that our one hidden layer neural network is a composition of modules, we can define functions for forward and backward propagation using these modules as follows:

Algorithm 1 Forward Computation

1: procedure NNFORWARD(Training example (x, y))

2: a = LINEAR1.FORWARD(x) โท First linear layer module

3: z = SIGMOID.FORWARD(a) โท Sigmoid activation module

4: b = LINEAR2.FORWARD(z) โท Second linear layer module

5: yห = SOFTMAX(b) โท Softmax function

6: J = CROSSENTROPY(y,yห)

7: return J,yห โท CrossEntropy function

Algorithm 2 Backpropagation

1: procedure NNBACKWARD(y, yห)

2: gJ = โJโJ = 1 โท Base case

3: gb = DSOFTMAXCROSSENTROPY(y,yห,gJ) โท See Written Question 1.2(b)

4: gz = LINEAR2.BACKWARD(gb)

5: ga = SIGMOID.BACKWARD(gz)

6: gx = LINEAR1.BACKWARD(ga) โท We discard gx

Hereโs the big takeaway: we can actually view these two functions as themselves defining another module!

It is a 1-hidden layer neural network module. That is, the cross-entropy of the neural network for a single training example is itself a differentiable function and we know how to compute the gradients of its inputs, given the gradients of its outputs.

9.2 Training Procedure

Consider the neural network described in Section 6 applied to the ith training example (x,y) where y is a one-hot encoding of the true label. Our neural network outputs yห = hฮฑ,ฮฒ(x), where ฮฑ and ฮฒ are the parameters of the first and second layers respectively and hฮฑ,ฮฒ is a one-hidden layer neural network with a sigmoid activation and softmax output. The loss function is negative cross-entropy J = โ(yห,y) = โyT log(yห). J = Jx,y(ฮฑ,ฮฒ) is actually a function of our training example (x,y) as well as our model parameters ฮฑ,ฮฒ, though we write just J for brevity.

In order to train our neural network, we are going to apply stochastic gradient descent (SGD). Because we want the behavior of your program to be approximately deterministic for testing on Gradescope, we will require that (1) you should use our provided shuffle function to shuffle your data at the start of each epoch and (2) you will use a fixed learning rate.

SGD proceeds as follows, where E is the number of epochs and ฮณ is the learning rate.

Algorithm 3 Training with Stochastic Gradient Descent (SGD)

1: procedure SGD(Training data Dtrain, testdata Dt)

2: Initialize parameters ฮฑ,ฮฒ โท Use either RANDOM or ZERO from Section 6.1

3: for e โ {1,2,…,E} do โท For each epoch

4: D = SHUFFLE(Dtrain,e)

5: for (x,y) โ D do โท For each training example

6: Compute neural network forward prop:

7: J,yห = NN.FORWARD(x,y,ฮฑ,ฮฒ)

8: Compute gradients via backprop:

9: g given by NN.BACKWARD(y,yห)

g

10: Update parameters with SGD updates gฮฑ,gฮฒ:

11: ฮฑ โ ฮฑ โ ฮณgฮฑ

12: ฮฒ โ ฮฒ โ ฮณgฮฒ

13: Evaluate training mean cross-entropy JD(ฮฑ,ฮฒ)

14: Evaluate testmean cross-entropy JDt(ฮฑ,ฮฒ)

15: return parameters ฮฑ,ฮฒ

9.3 Test Time Procedure

At test time, we output the most likely prediction for each example:

Algorithm 4 Prediction at Test Time

1: procedure PREDICT(Unlabeled train or testdataset Dโฒ)

2: for x โ Dโฒ do

3: Compute neural network prediction yห = h(x)

4: Predict the label with highest probability l = argmaxk yหk

## Reviews

There are no reviews yet.