## Description

TAs: Aakanksha, Edgar, Sida, Varsha

Summary In this assignment you will implement a new named entity recognition system using Hidden Markov Models. You will begin by going through some multiple choice warm-up problems to build your intuition for these models and then use that intuition to build your own HMM models.

START HERE: Instructions

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/ courses/10601bd-f18/about.html

• Submitting your work: You will use Gradescope to submit answers to all questions, and Autolab to submit your code. Please follow instructions at the end of this PDF to correctly submit all your code to Autolab.

– Autolab: You will submit your code for programming questions on the homework to Autolab (https://autolab.andrew.cmu.edu/). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). The software installed on the VM is identical to that on linux.andrew.cmu.edu, so you should check that your code runs correctly there. If developing locally, check that the version number of the programming language environment (e.g. Python 2.7, Octave 3.8.2, OpenJDK 1.8.0, g++ 4.8.5) and versions of permitted libraries (e.g. numpy 1.7.1) match those on linux.andrew.cmu.edu. Octave users: Please make sure you do not use any Matlab-specific libraries in your code that might make it fail against our tests. Python3 users: Please include a blank file called python3.txt (case-sensitive) in your tar submission. You have a total of 10 Autolab submissions. Use them wisely. In order to not waste Autolab submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before any Autolab submission.

• Materials: Download from autolab the tar file (”Download handout”). The tar file will contain all the data that you will need in order to complete this assignment.

For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LATEXusers, use and for shaded boxes and circles, and don’t change anything else.

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley

Marie Curie

Noam Chomsky

Select One: Who taught this course?

Matt Gormley

Marie Curie

@@ Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein

Isaac Newton

I don’t know

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein Isaac Newton

@ I don’t know

Fill in the blank: What is the course number?

10-S7601S

1 Written Questions [20 pts]

1.1 Multiple Choice [10 pts]

In this section we will test your understanding of several aspects of HMMs. Shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions below.

1. (2 points. Select all that apply) Which of the following are true under the (first-order) Markov assumption in an HMM:

The states are independent

The observations are independent

yt ⊥ yt−1|yt−2

yt ⊥ yt−2|yt−1

None of the above

2. (2 points. Select all that apply) Which of the following independence assumptions hold in an HMM:

The current observation xt is conditionally independent of all other observations given the current state yt

The current observation xt is conditionally independent of all other states given the current state yt

The current state yt is conditionally independent of all states given the previous state yt−1

The current observation xt is conditionally independent of xt−2 given the previous observation xt−1.

None of the above

In the remaining questions you will always see two quantities and decide what is the strongest relation between them. (? means it’s not possible to assign any true relation). As such there is only one correct answer.

3. (2 points. Select one) What is the relation between )) and P(x)? Select only the strongest relation that necessarily holds.

?

4. (2 points. Select one) What is the relation between P(y4 = s1,y5 = s2,x) and α4(s1) · β5(s2)? Select only the strongest relation that necessarily holds.

?

5. (2 points. Select one) What is the relation between α5(i) and β5(i)? Select only the strongest relation that necessarily holds.

?

1.2 Warm-up Exercise: Forward-Backward Algorithm [4 pts]

To help you prepare to implement the HMM forward-backward algorithm (see Section 2.3 for a detailed explanation), we have provided a small example for you to work through by hand. This toy data set consists of a training set of three sequences with three unique words and two tags and a test set with a single sequence composed of the same unique words used in the training set. Before going through this example, please carefully read the algorithm description in Sections 2.2 and 2.3.

Training set:

you_B eat_A fish_B you_B fish_B eat_A eat_A fish_B

Where the training word sequences are:

And the corresponding tags are:

Test set:

fish eat you

or

The following four questions are meant to encourage you to work through the forward backward algorithm by hand using this test example. Feel free to use a calculator, being careful to carry enough significant figures through your computations to avoid rounding errors. For each question below, please report the requested value in the text box next to the question (these boxes are only visible in the template document). When a number is requested, only write the number in the box. When a word/tag is requested, only write that word or tag. DO NOT include explanations or derivation work in the text box. Points will be deducted if anything else is included in the box.

1. (1 point) Compute α2(A), the α value associated with the tag “A” for the second word in the test sequence. Please round your answer to THREE decimal places.

2. (1 point) Compute β2(B), the β value associated with the tag “B” for the second word in the test sequence. Please round your answer to THREE decimal places.

3. (1 point) Predict the tag for the third word in the test sequence.

4. (1 point) Compute the log-likelihood for the entire test sequence, “fish eat you”. Please round your answer to THREE decimal places.

1.3 Empirical Questions [6 pts]

[Return to these questions after implementing your learnhmm.{py|java|cpp|m} and forwardbackward.{py|java|cpp|m} functions]

Using the fulldata set trainwords.txt in the handout using your implementation of learnhmm.{py|java|cpp|m} to learn parameters for an hmm model using the first 10, 100, 1000, and 10000 sequences in the file. Use these learned parameters perform prediction on the trainwords.txt and the testwords.txt files using your forwardbackward.{py|java|cpp|m}. Construct a plot with number of sequences used for training on the x-axis (log-scale) and average log likelihood across all sequences from the trainwords.txt or the testwords.txt on the y-axis (see Section 2.3 for details on computing the log data likelihood for a sequence). Each table entry is worth 0.5 points. Write the resulting log likelihood values in the table in the template. Include your plot in the large box in the template (2 points). To receive credit for your plot, you must submit a computer generated plot. DO NOT hand draw your plot.

#sequences Train average log-

likelihood Test average

likelihood log-

10

100

1000

10000

1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.

3. Did you find or come across code that implements any part of this assignment ? If so, include full details.

2 Programming [80 pts]

2.1 The Tasks and Data Sets

The handout file for this assignments contains several files that you will use in the homework. The contents and formatting of each of these files is explained below.

1. trainwords.txt This file contains labeled text data that you will use in training your model in the Learning problem. Specifically the text contains one sentence per line that has already been preprocessed, cleaned and tokenized. You should treat every line as a separate sequence and assume that it has the following format:

<Word0> <Tag0> <Word1> <Tag1> … <WordN> <TagN> where every <WordK> <TagK> unit token is separated by white space.

2. testwords.txt: This file contains labeled data that you will use to evaluate your model in the Experiments section. This file contains the gold standard labels. This file has the same format as trainwords.txt.

3. index to word.txt and index to tag.txt: These files contain a list of all words or tags that appear in the data set. In your functions, you will convert the string representation of words or tags to indices corresponding to the location of the word or tag in these files. For example, if Austria is on line 729 of index to word.txt, then all appearances of Austria in the data sets should be converted to the index 729. This index will also correspond to locations in the parameter matrices. For example, the word Austria corresponds to the parameters in column 729 of the matrix stored in hmmemit.txt. This will be useful for your forward-backward algorithm implementation (see Section 2.3).

4. toytrain.txt, toytest.txt, toy index to word.txt, and toy index to tag.txt: These files are analogous to trainwords.txt, testwords.txt, index to word.txt, and index to tag.txt and are used to compare your implementation to your hand calculations in Section 1.2.

5. predicttest.txt The predicttest.txt file contains labeled data that you will use to debug your implementation. The labels in this file are not gold standard but are generated by running our decoder using the features from trainwords.txt. This file has the same format as trainwords.txt.

6. metrics.txt The metrics.txt file contains the metrics you will compute for the dataset. For this assignment, you need to compute the average log likelihood and your prediction accuracy on the test data. Note that in Named Entity Recognition, F-1 score is a more common metric to evaluate the model performance, here you only need to report your accuracy for tag prediction of each word.

Sample metrics file for the toy dataset

7. hmmtrans.txt, hmmemit.txt and hmmprior.txt: These files contain pre-trained model parameters of an HMM that you will use in testing your implementation of the Learning and Evaluation and Decoding problems. The format of the first two files are analogous and is as follows. Every line in these files consists of a conditional probability distribution. In the case of transition probabilities, this distribution corresponds to the probability of transitioning into another state, given a current state. Similarly, in the case of emission probabilities, this distribution corresponds to the probability of emitting a particular symbol, given a current state. For example, every line in hmmtrans.txt has the following format:

hmmtrans.txt:

<ProbS1S1> … <ProbS1SN> <ProbS2S1> … <ProbS2SN>… and every line in hmmemit.txt has the following format: hmmemit.txt:

<ProbS1Word1> … <ProbS1WordN>

<ProbS2Word1> … <ProbS2WordN>…

In both cases, elements in the same row are separated by white space. Each row corresponds to a line of text (using to create new lines).

The format of hmmprior.txt is similarly defined except that it only contains a single probability distribution over starting states. Therefore each row only has a single element. Therefore hmmprior.txt has the following format:

hmmprior.txt:

<ProbS1>

<ProbS2>…

Note that the data provided to you is to help in developing your implementation of the HMM algorithms. Your code will be tested on Autolab using different data with different HMM parameters, likely coming from a different domain although the format will be identical.

2.2 Learning

Your first task is to implement an algorithm to learn the hidden Markov model parameters needed to apply the forward backward algorithm (See Section 2.3). There are three sets of parameters that you will need to estimate: the initialization probabilities π, the transition probabilities A, and the emission probabilities B. For this assignment, we model each of these probabilities using a multinomial distribution with parameters πj = P(y1 = j), ajk = P(yt = k|yt−1 = j), and bjk = P(xt = k|yt = j). These can be estimated using maximum likelihood, which results in the following parameter estimators:

1. , where Nπj equals the number of times state sj is associated with the first word of a sentence in the training data set.

2. , where NAjk is the number of times state sj is followed by state sk in the training data set.

3. , where is the number of times that the state sj is associated with the word k in the training data set.

Note that for each count, a “+1” is added to make a pseudocount. This is slightly different from pure maximum likelihood estimation, but it is useful in improving performance when evaluating unseen cases during evaluation of your test set.

You should implement a function that reads in the training data set (trainwords.txt), and then estimates π, A, and B using the above maximum likelihood solutions.

Your outputs should be in the same format as hmmprior.txt, hmmtrans.txt, and hmmemit.txt (including the same number of decimal places to ensure there are no rounding errors during prediction). The autograder will use the following commands to call your function:

For Python: $ python learnhmm.py [args…]

For Java: $ javac -cp “./lib/ejml-v0.33-libs/*:./” learnhmm.java; java -cp “./lib/ejml-v0.33-libs/*:./” learnhmm [args…]

For C++: $ g++ -g -std=c++11 -I./lib learnhmm.cpp; ./a.out [args…]

For Octave: $ octave -qH learnhmm.m [args…]

Where above [args…] is a placeholder for six command-line arguments:<train input><index to word> <index to tag><hmmprior><hmmemit><hmmtrans>. These arguments are described in detail below:

1. <train input>: path to the training input .txt file (see Section 2.1)

2. <index to word>: path to the .txt that specifies the dictionary mapping from words to indices. The tags are ordered by index, with the first word having index of 1, the second word having index of 2, etc.

3. <index to tag>: path to the .txt that specifies the dictionary mapping from tags to indices. The tags are ordered by index, with the first tag having index of 1, the second tag having index of 2, etc.

4. <hmmprior>: path to output .txt file to which the estimated prior (π) will be written. The file output to this path should be in the same format as the handout hmmprior.txt (see Section 2.1).

5. <hmmemit>: path to output .txt file to which the emission probabilities (B) will be written. The file output to this path should be in the same format as the handout hmmemit.txt (see Section 2.1)

6. <hmmtrans>: path to output .txt file to which the transition probabilities (A) will be written. The file output to this path should be in the same format as the handout hmmtrans.txt (see Section 2.1)..

2.3 Evaluation and Decoding

2.3.1 Forward Backward Algorithm and Minimal Bayes Risk Decoding

Your next task is to implement the forward-backward algorithm. Suppose we have a set of sequence consisting of T words, x1,…,xT. Each word is associated with a label yt ∈ {1,…,J}. In the forward-backward algorithm we seek to approximate P(yt|x1,…,xT) up to a multiplication constant. This is done by first breaking P(yt|x1,…,xT) into a “forward” component and a “backward” component as follows:

P(yt = sj|x1,…,xT) ∝ P(yt = sj,xt+1,…,xT|x1,…,xt)

∝ P(yt = sj|x1,…,xt)P(xt+1,…,xT|yt = sj,x1,…,xt)

∝ P(yt = sj|x1,…,xt)P(xt+1,…,xT|yt = sj) ∝ P(yt = sj,x1,…,xt)P(xt+1,…,xT|yt = sj)

where P(yt = sj|x1,…,xt) is computed by passing forward recursively through the model and P(xt+1,…,xT|yt = j) is computed by passing recursively backwards through the model.

Forward Algorithm

Define αt(j) = P(yt = j,x1:t). We can rearrange our definition of αt(j) as follows:

Using this definition, the α’s can be computed using the following recursive procedure:

1. α1(j) = πjbjx1.

2. For

Backward Algorithm Define βt(j) = P(xt+1,…,xT|yt = sj). We can rearrange our definition of βt(j) as follows:

Just like the α’s, the β’s can also be computed using the following backward recursive procedure:

1. βT(j) = 1 (All states could be ending states)

2. For 1 (Generate xt+1 from any state)

Forward-Backward Algorithm As stated above, the goal of the Forward-Backward algorithm is to compute P(yt = sj|x1,…,xT). This can be done using the following equation:

P(yt = sj|x1,…,xT) ∝ P(yt = sj,x1,…,xt)P(xt+1,…,xT|yt = sj)

After running your forward and backward passes through the sequence, you are now ready to estimate the conditional probabilities as:

P(yt|x1,…,xT) ∝ αt ◦ βt

where ◦ is the element-wise product.

Minimum Bayes Risk Prediction We will assign tags using the minimum Bayes risk predictor, defined for this problem as follows:

yˆt = argmax

j 1,…,J

To resolve ties, select the tag that appears earlier in the <index to tag> input file.

Computing the Log Likelihood of a Sequence When we compute the log likelihood of a sequence, we are interested in the computing the quantity P(x1,…,xT). We can rewrite this in terms of values we have already computed in the forward-backward algorithm as follows:

2.3.2 Implementation Details

You should now implement your forward-backward algorithm as a program, forwardbackward.{py|java|cpp|m}. The program will read in test data and the parameter files produced by learnhmm.{py|java|cpp|m}. The autograder will use the following commands to call your function:

For Python: $ python forwardbackward.py [args…]

For Java: $ javac -cp “./lib/ejml-v0.33-libs/*:./” forwardbackward.java; java -cp “./lib/ejml-v0.33-libs/*:./” forwardbackward [args…]

For C++: $ g++ -g -std=c++11 -I./lib forwardbackward.cpp; ./a.out [args…]

For Octave: $ octave -qH forwardbackward.m [args…]

Where above [args…] is a placeholder for seven command-line arguments:<test input><index to word>

<index to tag><hmmprior><hmmemit><hmmtrans><predicted file><metric file>. These arguments are described in detail below:

1. <test input>: path to the test input .txt file that will be evaluated by your forward backward algorithm (see Section 2.1)

2. <index to word>: path to the .txt that specifies the dictionary mapping from words to indices. The tags are ordered by index, with the first word having index of 1, the second word having index of 2, etc. This is the same file as was described for learnhmm.{py|java|cpp|m}.

3. <index to tag>: path to the .txt that specifies the dictionary mapping from tags to indices. The tags are ordered by index, with the first tag having index of 1, the second tag having index of 2, etc. This is the same file as was described for learnhmm.{py|java|cpp|m}.

4. <hmmprior>: path to input .txt file which contains the estimated prior (π).

5. <hmmemit>: path to input .txt file which contains the emission probabilities (B).

6. <hmmtrans>: path to input .txt file which contains transition probabilities (A).

7. <predicted file>: path to the output .txt file to which the predicted tags will be written. The file should be in the same format as the <test input> file.

8. <metric file>: path to the output .txt file to which the metrics will be written.

Example command for python users:

$ python forwardbackward.py toydata/toytrain.txt

toydata/toy_index_to_word.txt toydata/toy_index_to_tag.txt hmmprior.txt hmmemit.txt hmmtrans.txt predicted.txt metrics.txt

After running the command above, the <predicted file> output should be:

you_B eat_A fish_B you_B fish_B eat_A eat_A fish_B

And the <metric file> output should be:

Average Log-Likelihood: -0.924531673593 Accuracy: 1.0

Take care that your output has the exact same format as shown above. There should be a single space after the colon preceding the metric value (e.g. a space after Average Log-Likelihood:). Each line should be terminated by a Unix line ending .

2.3.3 Log-Space Arithmetic for Avoiding Underflow

Handling underflow properly is a critical step in implementing an HMM. For HW7, there are two ways of avoiding underflow (Options 1 and 2 below). Note that Option 1 is only an option because of the datasets we’ve given you, whereas Option 2 would handle underflow correctly for almost any real-world dataset. Thus, we strongly recommend Option 1 for HW7, but you are required to read and understand Option 2 below.

Option 1: Use normalization For this assignment only, you can avoid issues with underflow by simply following the algorithms as described, representing alpha and beta probabilities in probabilityspace, and using 64-bit floating point numbers when you run your experiments for the empirical questions. To represent alpha and beta matrices as probabilities, remember to normalize each column in both matrices (each column corresponds to the alpha/ beta scores for all states at that timestep)

Note that in general, this is not a solution; but it will work fine for HW7 because the datasets we are giving you will only include ”short” sentences for which we’ve tested that the alpha/beta probabilities do not underflow. We encourage this solution as it allows you to simply focus on the algorithms and what they are doing.

Option 2: Compute in Log-Space Option 2 offers the proper solution to handling underflow. It’s really only a simple tweak to your algorithm. You are encouraged to (optionally) give this a try after you’ve completed and submitted the assignment.

The algorithms described above are all written in terms of probabilities, not their logs. In addition to storing the α’s and β’s as log-probabilities, you should also update the computation so as to avoid underflow. There are two cases to address:

multiplication Suppose the algorithm multiplies two probabilities p and q together to get a third probability r. That is, r = p·q. Replace this computation with lr = lp·lq where lr, lp, and lq are variables storing log(r), log(p) and log(q) respectively.

Note: The emission and transition probabilities tend to be large enough to represent in memory as probabilities. So, if q is already stored elsewhere as a probability, you can directly compute its log in place: lr = lp · log(q).

addition Suppose the algorithm adds two probabilities p and q to get a third probability r. That is, r = p + q. Again we will use variables lr, lp, and lq that store log(r), log(p) and log(q) respectively.

The naive solution here would be to compute lr = log(exp(lp) + exp(lq)), but this will just result in underflow of p = exp(lp) and q = exp(lq).

Instead, use the log-sum-exp function to carry out the addition in log-space. Notice below that we only exponentiate the absolute value of the difference between the two log-probabilities, which is unlikely to overflow.

logsumexp(x,y) = max(x,y) − log(1 + exp(abs(x − y))) (1)

where abs(·) is the absolute value function. Then we compute lr = logsumexp(lp,lq).

Note that we need to be particularly careful of how we represent zero probabilities. For example, we could represent the log(0) as −∞ since the standard floating point specification explicitly supports representation of negative infinity. Note however that your code will have to explicitly handle this case; for example, some libraries throw errors when you take the log of zero. A more careful implementation of the logsumexp function is given in Algorithm 1.

Algorithm 1 Addition in Log-Space

1: procedure logsumexp(x,y)

2: if x = −∞ then 3: return y

4: if y = −∞ then

5: return x

6: m = max(x,y)

7: return m + log(1 + exp(abs(x − y)))

Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. Autolab will use EJML version 3.3. The commands above demonstrate how we will call you code. The classpath inclusion -cp “./lib/ejml-v0.33-libs/*:./” will ensure that all the EJML jars are on the classpath as well as your code.

a https://ejml.org

b

http://eigen.tuxfamily.org/

2.4 Autolab Submission

You must submit a .tar file named hmm.tar containing learnhmm.{py|m|java|cpp} and forwardbackward.{py|m|java|cpp} You can create that file by running:

tar -cvf hmm.tar learnhmm.{py|m|java|cpp} forwardbackward.{py|m|java|cpp}

from the directory containing your code.

Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf. DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all your files should be named in lowercase. You must submit this file to the corresponding homework link on Autolab. The autograder for Autolab prints out some additional information about the tests that it ran. You can view this output by selecting ”Handin History” from the menu and then clicking one of the scores you received for a submission. For example on this assignment, among other things, the autograder will print out which language it detects (e.g. Python, Octave, C++, Java).

Python3 Users: Please include a blank file called python3.txt (case-sensitive) in your tar submission and we will execute your submitted program using Python 3 instead of Python 2.7.

## Reviews

There are no reviews yet.