## Description

TAs: Scott Liu, Yiming Wen, Ani Chowdhury, Kelly Shi, Yufei Wang

Summary In this assignment, you will build a sentiment polarity analyzer, which will be capable of analyzing the overall sentiment polarity (positive or negative) . In Section 1 you will warm up by deriving stochastic gradient descent updates for binary and multinomial logistic regression. Then in Section 2 you will implement a binary logistic regression model as the core of your natural language processing system.

START HERE: Instructions

โข Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ หmgormley/courses/10601/about.html#late-homework-policy

โข Submitting your work: You will use Gradescope to submit answers to all questions and your code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.

โ Written: For written problems such as derivations, proofs, or plots we will be using Gradescope

โ Programming: You will submit your code for programming questions on the homework to

Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g.

1

Python 3.6.9, Octave 4.2.2, OpenJDK 11.0.5, g++ 7.4.0) and versions of permitted libraries (e.g. numpy 1.17.0 and scipy 1.4.1) match those used on Gradescope. (Octave users: Please make sure you do not use any Matlab-specific libraries in your code that might make it fail against our tests.) You have a total of 10 Gradescope programming submissions. Use them wisely. In order to not waste code submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before any Gradescope coding submission.

โข Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on Piazza.

Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. Gradescope will use EJML version 0.38. The command line arguments above demonstrate how we will call you code. The classpath inclusion

-cp “./lib/ejml-v0.38-libs/*:./” will ensure that all the EJML jars are on the classpath as well as your code.

C++ Eigen is a header-only library, so there is no linking to worry aboutโjust #include whatever components you need. Gradescope will use Eigen version 3.3.7. The command line arguments above demonstrate how we will call you code. The argument -I./lib will include the lib/Eigen subdirectory, which contains all the headers.

Do not include EJML or Eigen in your Gradescope submission tar; the autograder will ensure that they are in place.

ahttps://ejml.org bhttp://eigen.tuxfamily.org/

1 Written Questions [30 points]

1.1 Perceptron and Stochastic Gradient Descent [4 points]

( x, if x โฅ 0

(x)+ =

0, otherwise

Select one:

2. (2 points) Continuing with the above question, what is the gradient of the correct loss function when the current data we are seeing is x ?

Select one:

(โy(i)x(i), if

0, otherwise.

โy(i)x(i) y(i)x(i)

(y(i)x(i), if

0, otherwise.

1.2 Multinomial Logistic Regression [13 points]

Multinomial logistic regression, also known as softmax regression or multiclass logistic regression, is a generalization of binary logistic regression. In this problem setting we have a dataset:

where x(i) โ RM,y(i) โ {1,…,K} for i = 1,…,N

Here N is the number of training examples, M is the number of features, and K is the number of possible classes, which is usually greater than two to be interesting and not equivalent to binary logistic regression.

1. (2 points) To motivate multinomial logistic regression, we will first look at a general way to extend a binary classifier to a multiclass classifier and apply it to logistic regression. Suppose we only have the resources to train K binary logistic regression classifiers where K corresponds to the number of classes. Using all of the trained classifiers, what are the possible ways to determine the class for each unlabelled data point x(โ)

Select all that apply:

Each trained classifier hi(x) will determine if a point x is in class i or not for i = 1,…,K. The class that has the highest probability from the K classifiers will be the predicted label of point x

Each trained classifier hi(x) will determine if a point x is in class i or not for i = 1,…,K. The class that has the highest โconfidence scoreโ from the K classifiers will be the predicted label of point x, where โconfidence scoreโ of classifier i is defined as wiT x + b

Each trained classifier hi(x) will determine if a point x is in class i or not for i = 1,…,K. The class that has the highest โconfidence scoreโ from the K classifiers will be the predicted label

wiTx+b of point x, where โconfidence scoreโ of classifier i is defined as |wi| (the signed distance from a point to the plane defined by wiT x + b = 0)

None of the above

2. (1 point) Now we would like a method to do multiclass classification without having to train more than one classifier. Multinomial logistic regression is such a method. Remember that in multinomial logistic regression, we have

exp(ฮธyx)

softmax((ฮx)y) (1.1)

where ฮ is the parameter matrix of size K ร (M + 1) and ฮธy denotes the yth row of ฮ, which is the parameter vector for class y. Since we have folded the bias term into ฮ we now have x โ RM+1. Let us represent class Ck with a one-hot encoding, specifically let Ck โ RK where the kth entry in Ck is 1, and 0 everywhere else. Let us also define a target matrix T of size N ร K, where the ith row of T is Cy(i), where only the y(i)th entry is 1, and 0 else where.

Write down the data conditional likelihood L(ฮ | T,X) in terms of N, K, T and . Please note that L(ฮ | T,X) = p(T | ฮ,X), where likelihood is a function of parameters (not probability), and it is equal in value to the label probability conditioned on data and parameters.

3. (1 point) Write down the negative conditional log-likelihood of the data in terms of N, K, T and

. This will be your objective function J(ฮ), also known as cross-entropy loss. To help

you with the next part, write down the objective function after replacing using equation

1.1 given in part (b). Do not include the literal term โsoftmaxโ in your answer.

4. (4 points) Now letโs derive the partial derivative of the objective function with respect to the kth parameter vector ฮธk. That is, derive , where J(ฮ) is the objective function that you provided above. Show that the partial derivative is as follows:

Show all steps of the derivation. (Hint: A good first step would be to simplify your answer from part

(c) as much as you can, if you havenโt already done so in the previous part)

5. (2 points) Write down pseudocode corresponding to the stochastic gradient descent update steps for an arbitrary ฮธk using the ith training example in terms of x .

Hint: Recall the buggy SGD program from lecture.

6. (1 point) If you train multinomial logistic regression for infinite iterations without `1 = ||ฮ||1 (sum of absolute values of all entries in the matrix ) or `2 = ||ฮ||2 (square root of sum of squares of all entries in the matrix) regularization, the weights can go to infinity in magnitude. What is an explanation for this phenomenon? (Hint: Think about what happens to the probabilities if we train an unregularized logistic regression, and the role of the weights when calculating such probabilities)

7. (2 points) How does regularization such as `1 and `2 help correct the problem?

Select all that apply:

`1 regularization prevents weights from going to infinity by eliminating the weights equal to 0, thus only include non-zero weights.

`1 regularization prevents weights from going to infinity by reducing some of the weights to 0, thus removing some features all together.

`2 regularization prevents weights from going to infinity by reducing the magnitude of the the weights to close to 0 (thus not removing any feature).

Regularization such as `1 or `2 has the effect of preventing weights from going to infinity by appropriately penalizing the magnitude of the parameter vector elements.

None of the above.

1.3 Binary Logistic Regression on a Small Dataset [5 points]

The following questions should be completed before you start the programming portion of this assignment. (Section 2).

The following dataset consists of 4 training examples, where denotes the kth dimension of the ith training example, and the corresponding label y(i). k โ {1,2,3,4,5} and i โ {1,2,3,4}

i x1 x2 x3 x4 x5 y(i)

1 0 0 1 0 1 0

2 0 1 0 0 0 1

3 0 1 1 0 0 1

4 1 0 0 1 0 0

A binary logistic regression model is trained on this data. After n iterations, the parameter vector ฮธ = [1.5,2,1,2,3]T

Use the data above to answer the following questions. For all numerical answers, please use one number rounded to the fourth decimal place, e.g., 0.1234.

1. (1 point) Calculate J(ฮธ), the negative log-likelihood over the given data after iteration n (Note here we are using natural log, i.e., the base is e).

2. (2 points) Calculate the gradients with respect to ฮธj, for all j โ {1,2,3,4,5}

3. (1 point) Update the parameters following the parameter update step and give the final (numerical) value of the vector ฮธ. Consider ฮท = 1.

4. (1 point) Following table shows the sparse feature representation for the given data

i label y(i) features x(i)

1 0 {x3 : 1,x5 : 1}

2 1 {x2 : 1}

3 1 {x2 : 1,x3 : 1}

4 0 {x1 : 1,x4 : 1}

Calculate the probability after the update step in 3., using only k unique multiplication operations where k is the number of non-zero features in x3. Explicitly show these multiplication operations.

1.4 Programming Empirical Questions [8 points]

The following questions should be completed as you work through the programming portion of this assignment (Section 2).

1. (2 points) For Model 1, using the data in the largedata folder in the handout, make a plot that shows the average negative log likelihood for the training and validation data sets after each of 200 epochs. The y-axis should show the negative log likelihood and the x-axis should show the number of epochs.

2. (2 points) For Model 2, make a plot as in the previous question.

3. (2 points) Write a few sentences explaining the output of the above experiments. In particular do the training and validation log likelihood curves look the same or different? Why?

4. (2 points) Make a table with your train and test error for the large data set (found in the largedata folder in the handout) for each of the 2 models after running for 50 epochs. Please use one number rounded to the fourth decimal place, e.g., 0.1234.

Collaboration Questions Please answer the following:

1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.

3. Did you find or come across code that implements any part of this assignment ? If so, include full details.

2 Programming [70 points]

2.1 The Task

Note: Before starting the programming, you should work through section 1.3 to get a good understanding of important concepts that are useful for this programming section.

2.2 The Datasets

Datasets Download the zip file from Piazza, which contains all the data that you will need in order to complete this assignment. The handout contains data from the Movie Review Polarity dataset. In the data files, each line is a data point that consists of a label (0 for negatives and 1 for positives) and a attribute (a set of words as a whole). The label and attribute are separated by a tab. In the attribute, words are separated using white-space (punctuations are also separated with white-space). All characters are lowercased. The format of each data point (each line) is label word1 word2 word3 … wordN .

Examples of the data are as follows:

1 david spade has a snide , sarcastic sense of humor that works …

0 ” mission to mars ” is one of those annoying movies where , in …

1 anyone who saw alan rickmanโs finely-realized performances in …

1 ingredients : man with amnesia who wakes up wanted for murder , …

1 ingredients : lost parrot trying to get home , friends synopsis : …

0 aspiring broadway composer robert ( aaron williams ) secretly …

0 americaโs favorite homicidal plaything takes a wicked wife in ” …

We have provided you with two subsets of the movie review dataset. Each dataset is divided into a training, a validation, and a test dataset. The small dataset (smalltrain_data.tsv, smallvalid_data.tsv, and smalltest_data.tsv) can be used while debugging your code. We have included the reference output files for this dataset after 30 training epochs (see directory smalloutput/). We have also included a larger dataset (train_data.tsv, valid_data.tsv, test_data.tsv) with reference outputs for this dataset after 60 training epochs (see directory largeoutput/) . This dataset can be used to ensure that your code runs fast enough to pass the autograder tests. Your code should be able to perform 60-epoch training and finish predictions through all of the data in less than one minute for each of the models: one minute for Model 1 and one minute for Model 2.

films 0 adapted 1 from 2 comic 3

2.3 Model Definition

Assume you are given a dataset with N training examples and M features. We first write down the negative conditional log-likelihood of the training data in terms of the design matrix X, the labels y, and the parameter vector ฮธ. This will be your objective function J(ฮธ) for gradient descent. (Recall that ith row of the design matrix X contains the features x(i) of the ith training example. The ith entry in the vector y is the label y(i) of the ith training example. Here we assume that each feature vector x(i) contains a bias feature,

e.g. . As such, the bias parameter is folded into our parameter vector ฮธ.

Taking x(i) to be a (M + 1)-dimensional vector where , the likelihood p(y|X,ฮธ) is:

(2.1)

(2.2)

Hence, the negative conditional log-likelihood is:

(2.3)

The partial derivative of the negative log-likelihood J(ฮธ) with respect to ฮธj ,j โ {0,…,M} is:

(2.4)

The gradient descent update rule for binary logistic regression for parameter element ฮธj is

(2.5)

Then, the stochastic gradient descent update for parameter element ฮธj using the ith datapoint (x(i),y(i)) is:

(2.6)

2.4 Implementation

2.4.1 Overview

The implementation consists of two programs, a feature extraction program (feature.{py|java|cpp|m}) and a sentiment analyzer program (lr.{py|java|cpp|m}) using binary logistic regression. The programming pipeline is illustrated as follows.

Figure 2.1: Programming pipeline for sentiment analyzer based on binary logistic regression

This first program is feature.{py|java|cpp|m}, that converts raw data (e.g., train_data.tsv, valid_data.tsv, and test_data.tsv) into formatted training, validation and test data based on the vocabulary information in the dictionary file dict.txt. To be specific, this program is to transfer the whole movie review text into a feature vector using some feature extraction methods. The formatted datasets should be stored in .tsv format. Details of formatted datasets will be introduced in Section 2.4.2 and Section 2.5.1.

The second program is lr.{py|java|cpp|m}, that implements a sentiment polarity analyzer using binary logistic regression. The file should learn the parameters of a binary logistic regression model that predicts a sentiment polarity (i.e. label) for the corresponding feature vector of each movie review. The program should output the labels of the training and test examples and calculate training and test error (percentage of incorrectly labeled reviews). As discussed in Appendix A.2 and A.3, efficient computation can be obtained with the help of the indexing information in the dictionary file dict.txt.

2.4.2 Feature Engineering

Your implementation of feature.{py|java|cpp|m} should have an input argument <featureflag> that specifies one of two types of feature extraction structures that should be used by the logistic regression model. The two structures are illustrated below as probabilities of the labels given the inputs.

Model 1 p(y(i) | 1occur(x(i),Vocab),ฮธ): This model defines a probability distribution over the current label y(i) using the parameters ฮธ and a bag-of-word feature vector 1occur(x(i),Vocab) indicating which word in vocabulary Vocab of the dictionary occurs at least once in the movie review example

x(i). The entry in the indicator vector associated to the occurring word will set to one (otherwise, it is zero). This bag-of-word model should be used when <featureflag> is set to 1.

Model 2 p(y(i) | 1trim(x(i),Vocab, t),ฮธ): This model defines a probability distribution over the current label y(i) using the parameters ฮธ and a trimmed bag-of-word feature vector 1trim(x(i),Vocab, t) indicating (1) which word in vocabulary Vocab of the dictionary occurs in the movie review example x(i), AND (2) the count of the word is LESS THAN (<) threshold t. The entry in the indicator vector associated to the word that satisfies both conditions will set to one (otherwise, it is zero, including no shown and high-frequent words). This trimmed bag-of-word model should be used when <featureflag> is set to 2. In this assignment, use the constant trimming threshold t = 4.

Note that above 1occur and 1trim are described as a dense feature representation as showed in Tables A.2 for illustration purpose. In your implementation, you should further convert it to the representation in A.3 for

Model 1 and the representation in A.5 for Model 2, such that the formatted data outputs match Section 2.5.1.

2.4.3 Command Line Arguments

The autograder runs and evaluates the output from the files generated, using the following command (note feature will be run before lr):

For Python: $ python feature.py [args1…]

$ python lr.py [args2…]

For Java: $ java feature.java [args1…]

$ java lr.java [args2…]

For C++: $ g++ feature.cpp ./a.out [args1…]

$ g++ lr.cpp ./a.out [args2…]

For Octave: $ octave -qH feature.m [args1…]

$ octave -qH lr.m [args2…]

Where above [args1…] is a placeholder for eight command-line arguments:<traininput>

<validationinput> <testinput> <dictinput> <formattedtrainout>

<formatted validationout> <formattedtestout> <featureflag>. These arguments are described in detail below:

1. <train input>: path to the training input .tsv file (see Section 2.2)

2. <validationinput>: path to the validation input .tsv file (see Section 2.2)

3. <testinput>: path to the test input .tsv file (see Section 2.2)

4. <dictinput>: path to the dictionary input .txt file (see Section 2.2)

5. <formattedtrain out>: path to output .tsv file to which the feature extractions on the training data should be written (see Section 2.5.1)

6. <formattedvalidationout>: path to output .tsv file to which the feature extractions on the validation data should be written (see Section 2.5.1)

7. <formattedtest out>: path to output .tsv file to which the feature extractions on the test data should be written (see Section 2.5.1)

8. <featureflag>: integer taking value 1 or 2 that specifies whether to construct the Model 1 feature set or the Model 2 feature set (see Section 2.4.2)โthat is, if feature_flag==1 use Model 1 features; if feature_flag==2 use Model 2 features

On the other hand, [args2…] is a placeholder for eight command-line arguments:<formattedtraininput>

<formatted validationinput> <formattedtestinput> <dictinput> <trainout> <testout> <metricsout> <numepoch>. These arguments are described in detail below:

1. <formattedtraininput>: path to the formatted training input .tsv file (see Section 2.5.1)

2. <formattedvalidationinput>: path to the formatted validation input .tsv file (see Section 2.5.1)

3. <formattedtest input>: path to the formatted test input .tsv file (see Section 2.5.1)

4. <dictinput>: path to the dictionary input .txt file (see Section 2.2)

5. <train out>: path to output .labels file to which the prediction on the training data should be written (see Section 2.5.2)

6. <testout>: path to output .labels file to which the prediction on the test data should be written (see Section 2.5.2)

7. <metricsout>: path of the output .txt file to which metrics such as train and test error should be written (see Section 2.5.3)

8. <numepoch>: integer specifying the number of times SGD loops through all of the training data (e.g., if <numepoch> equals 5, then each training example will be used in SGD 5 times).

As an example, if you implemented your program in Python, the following two command lines would run your programs on the data provided in the handout for 60 epochs using the features from Model 1.

$ python feature.py train_data.tsv valid_data.tsv test_data.tsv dict.txt formatted_train.tsv formatted_valid.tsv formatted_test.tsv 1

$ python lr.py formatted_train.tsv formatted_valid.tsv formatted_test

.tsv dict.txt train_out.labels test_out.labels metrics_out.txt 60

Important Note: You will not be writing out the predictions on validation data, only on train and test data. The validation data is only used to give you an estimate of held-out negative log-likelihood at the end of each epoch during training. You are asked to graph the negative log-likelihood vs. epoch of the validation and training data in section 1.4. a

aFor this assignment, we will always specify the number of epochs. However, a more mature implementation would monitor the performance on validation data at the end of each epoch and stop SGD when this validation log-likelihood appears to have converged. You should not implement such a convergence check for this assignment.

2.5 Program Outputs

2.5.1 Output: Formatted Data Files

Your feature program should write three output .tsv files converting original data to formatted data on <formatted trainout>, <formattedvalidout>, and <formattedtestout>. Each should contain the formatted presentation for each example printed on a new line. Use to create a new line. The format for each line should exactly match label index[word1]:value1 index[word2]:value2 …index[wordM]:valueM

Where above, the first column is label, and the rest are โindex[word]:valueโ feature elements. index[word] is the index of the word in the dictionary, and value is the value of this feature (in this assignment, the value is one or zero). There is a colon, :, between index[word] and corresponding value. Columns are separated using a table character, . The handout contains example <formattedtrainout>, <formatted validout>, and <formattedtestout> for your reference.

The formatted output will be checked separately by the autograder by running your feature program on some unseen datasets and evaluating your output file against the reference formatted files. Examples of content of formatted output file are given below.

0 2915:1 21514:1 166:1 32:1 10699:1 305:1 …

0 7723:1 51:1 8701:1 74:1 370:1 8:1 …

1 229:1 48:1 326:1 43:1 576:1 55:1 …

1 8126:1 1349:1 58:1 4709:1 48:1 8319:1 …

2.5.2 Output: Labels Files

Your lr program should produce two output .labels files containing the predictions of your model on training data (<train out>) and test data (<testout>). Each should contain the predicted labels for each example printed on a new line. Use to create a new line.

Your labels should exactly match those of a reference implementation โ this will be checked by the autograder by running your program and evaluating your output file against the reference solution. Examples of the content of the output file are given below.

0

0

1

0

2.5.3 Output Metrics

Generate a file where you report the following metrics:

error After the final epoch (i.e. when training has completed fully), report the final training error error(train) and test error error(test).

All of your reported numbers should be within 0.01 of the reference solution. The following is the reference solution for large dataset with Model 1 feature structure after 60 training epochs. See model1_metrics_out.txt in the handout.

error(train): 0.074167 error(test): 0.247500

Take care that your output has the exact same format as shown above. Each line should be terminated by a Unix line ending . There is a whitespace character after the colon.

2.6 Evaluation and Submission

2.6.1 Evaluation

Gradescope will test your implementations on hidden datasets with the same format as the two datasets provided in the handout. feature program and lr program will be tested separately. To ensure that your code can pass the gradescope tests in under 5 minutes (the maximum time length) be sure that your code can complete 60-epoch training and finish predictions through all of the data in the largedata folder in around one minute for each of the models.

2.6.2 Requirements

Your implementation must satisfy the following requirements:

โข The feature.{py|java|cpp|m} must produce a sparse representation of the data using the label-index-value format {label index[word1]:value1 index[word2]:value2… }. We will use unseen data to test your feature output separately. (see Section 2.5.1 and Section 2.4.2 on feature engineering for details on how to do this).

โข Ignore the words not in the vocabulary of dict.txt when the analyzer encounters one in the test or validation data.

โข Set the trimming threshold to a constant t = 4 for Model 2 feature extraction (see Section 2.4.2).

โข Initialize all model parameters to 0.

โข Use stochastic gradient descent (SGD) to optimize the parameters for a binary logistic regression model. The number of times SGD loops through all of the training data (numepoch) will be specified as a command line flag. Set your learning rate as a constant ฮท = 0.1.

โข Perform stochastic gradient descent updates on the training data in the order that the data is given in the input file. Although you would typically shuffle training examples when using stochastic gradient descent, in order to autograde the assignment, we ask that you DO NOT shuffle trials in this assignment.

โข Be able to select which one of two feature extractions you will use in your logistic regression model using a command line flag (see Section 2.4.2)

โข Do not hard-code any aspects of the datasets into your code. We will autograde your programs on multiple (hidden) datasets that include different attributes and output labels.

2.6.3 Hints

Careful planning will help you to correctly and concisely implement your program. Here are a few hints to get you started.

โข Work through section 1.3

โข Write a function that takes a single SGD step on the ith training example. Such a function should take as input the model parameters, the learning rate, and the features and label for the ith training example. It should update the model parameters in place by taking one stochastic gradient step.

โข Write a function that takes in a set of features, labels, and model parameters and then outputs the error (percentage of labels incorrectly predicted). You can also write a separate function that takes the same inputs and outputs the negative log-likelihood of the regression model.

โข You can either treat the bias term as separate variable, or fold it into the parameter vector. In either case, make sure you update the bias term correctly.

2.6.4 Gradescope Submission

You should submit your feature.{py|m|java|cpp} and lr.{py|m|java|cpp} to Gradescope. Note: please do not use other file names. This will cause problems for the autograder to correctly detect and run your code.

Some additional tips: Make sure to read the autograder output carefully. The autograder for Gradescope prints out some additional information about the tests that it ran. For this programming assignment weโve specially designed some buggy implementations that you might do, and try our best to detect those and give you some more useful feedback in Gradescopeโs autograder. Make wise use of autograderโs output for debugging your code.

A Implementation Details for Logistic Regression

A.1 Examples of Features

Here we provide examples of the features constructed by Model 1 and Model 2. Table A.1 shows an example input file, where column i indexes the ith movie review example. Rather than working directly with this input file, you should transform from the sentiment/text representation into a label/feature vector representation.

Table A.2 shows the dense occurrence-indicator representation expected for Model 1. The size of each feature vector (i.e. number of feature columns in the table) is equal to the size of the entire vocabulary of words stored in the given dict.txt (this dictionary is actually constructed from the same training data in largeset). Each row corresponds to a single example, which we have indexed by i.

It would be highly impractical to actually store your feature vectors x(i) โ RM in the dense representation shown in Table A.2 which takes O(M) space per vector (M is around 40 thousands for the dictionary). This is because the features are extremely sparse: for the second example (i = 2), only three of the features is non-zero for Model 1 and only two for Model 2. As such, we now consider a sparse representation of the features that will save both memory and computation.

Table A.3 shows the sparse representation (bag-of-word representation) of the feature vectors. Each feature vector is now represented by a map from the index of the feature (e.g. index[“apple”]) to its value which is 1. The space savings comes from the fact that we can omit from the map any feature whose value is zero. In this way, the map only contains non-zero entry for each Model 1 feature vector.

Using the same sparse representation of features, we present an example of the features used by Model 2. This involves two step: (1) construct the count-of-word representation of the feature vector (see Table A.4); (2) trim/remove the highly repetitive words/features and set the value of all remaining features to one (see Table A.5).

A.2 Efficient Computation of the Dot-Product

In simple linear models like logistic regression, the computation is often dominated by the dot-product ฮธT x of the parameters ฮธ โ RM with the feature vector x โ RM . When a dense representation of x (such as that shown in Table A.2) is used, this dot-product requires O(M) computation. Why? Because the dot-product requires a sum over each entry in the vector:

(A.1)

However, if our feature vector is represented sparsely, we can observe that the only elements of the feature vector that will contribute a non-zero value to the sum are those where xm 6= 0, since this would allow ฮธmxm to be nonzero. As such, we can write the dot-product as below:

ฮธT x = X ฮธmxm (A.2)

mโ{1,…,M} s.t. xm6=0

This requires only computation proportional to the number of non-zero entries in x, which is generally very small for Model 1 and Model 2 compared to the size of the vocabulary. To ensure that your code runs quickly it is best to write the dot-product in the latter form (Equation (A.2)).

A.3 Data Structures for Fast Dot-Product

Lastly, there is a question of how to implement this dot-product efficiently in practice. The key is choosing appropriate data structures. The most common approach is to choose a dense representation for ฮธ. In C++ or Java, you could choose an array of float or double. In Python, you could choose a numpy array or a list.

To represent your feature vectors, you might need multiple data structures. First, you could create a shared mapping from a feature name (e.g. apple or boy) to the corresponding index in the dense parameter vector. This shared mapping has already been provided to you in the dict.txt, and you can extract the index of the word from the dictionary file for all later computation. In fact, you should be able to construct the dictionary on your own from the training data (we have done this step for you in the handout). Once you know the size of this mapping (which is the size of the dictionary file), you know the size of the parameter vector ฮธ.

Another data structure should be used to represent the feature vectors themselves. This assignment use the option to directly store a mapping from the integer index in the dictionary mapping (i.e. the index m) to the value of the feature xm. Only the indices of words satisfying certain conditions will be stored, and all other indices are implies to have zero value of the feature xm. This structure option will ensure that your code runs fast so long as you are doing an efficient computation instead of the O(M) version.

example index i sentiment y(i) review text x(i)

1 pos apple boy , cat dog

2 pos boy boy : dog dog ; dog dog . dog egg egg

3 neg apple apple apple apple boy cat cat dog

4 neg egg fish

i label y(i)

1 1 0 … 1 1 1 1 0 0 0 0 … 0

2 1 0 … 0 1 0 1 1 0 0 0 … 0

3 0 0 … 1 1 1 1 0 0 0 0 … 0

4 0 0 … 0 0 0 0 1 1 0 0 … 0

Table A.1: Abstract representation of the input file format. The ith row of this file will be used to construct the ith training example using either Model 1 features (Table A.3) or Model 2 features (Table A.5).

Table A.2: Dense feature representation for Model 1 corresponding to the input file in Table A.1. The ith row corresponds to the ith training example. Each dense feature has the size of the vocabulary in the dictionary. Punctuations are excluded.

i label y(i) features x(i)

1 1 { index[โappleโ]: 1, index[โboyโ]: 1, index[โcatโ]: 1, index[โdogโ]: 1 }

2 1 { index[โboyโ]: 1, index[โdogโ]: 1, index[โeggโ]: 1 }

3 0 { index[โappleโ]: 1, index[โboyโ]: 1, index[โcatโ]: 1, index[โdogโ]:1 }

4 0 { index[โeggโ]: 1, index[โfishโ]: 1 }

Table A.3: Sparse feature representation (bag-of-word representation) for Model 1 corresponding to the input file in Table A.1.

i label y(i) features x(i)

1 1 { index[โappleโ]: 1, index[โboyโ]: 1, index[โcatโ]: 1, index[โdogโ]: 1 }

2 1 { index[โboyโ]: 2, index[โdogโ]: 5, index[โeggโ]: 2 }

3 0 { index[โappleโ]: 4, index[โboyโ]: 1, index[โcatโ]: 2, index[โdogโ]: 1 }

4 0 { index[โeggโ]: 1, index[โfishโ]: 1 }

Table A.4: Count of word representation for Model 2 corresponding to the input file in Table A.1.

i label y(i) features x(i)

1 1 { index[โappleโ]: 1, index[โboyโ]: 1, index[โcatโ]: 1, index[โdogโ]: 1 }

2 1 { index[โboyโ]: 1, index[โeggโ]:1 }

3 0 { index[โboyโ]: 1, index[โcatโ]: 1, index[โdogโ]: 1 }

4 0 { index[โeggโ]: 1, index[โfishโ]: 1 }

Table A.5: Sparse feature representation for Model 2 corresponding to the input file in Table A.1. Assume that the trimming threshold is 4. As a result, โdogโ in example 2 and โappleโ in example 3 are removed and the value of all remaining features are reset to value 1.

## Reviews

There are no reviews yet.