Description
HIDDEN MARKOV MODELS
https://mlcourse.org
TAs: Mike Chen, Eu Jing Chua, Filipp Shelobolin, Ayushi Sood
Summary In this assignment you will implement a new named entity recognition system using Hidden Markov Models. You will begin by going through some multiple choice warm-up problems to build your intuition for these models and then use that intuition to build your own HMM models.
START HERE: Instructions
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ ˜mgormley/courses/10601/about.html#late-homework-policy
• Submitting your work: You will use Gradescope to submit answers to all questions and your code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.
– Written: For written problems such as derivations, proofs, or plots we will be using Gradescope
– Programming: You will submit your code for programming questions on the homework to
Gradescope (https://gradescope.com). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). When you are developing, check that the version number of the programming language environment (e.g. Python 3.6.9, Octave 4.2.2, OpenJDK 11.0.5, g++ 7.4.0) and versions of permitted libraries (e.g. numpy 1.17.0 and scipy 1.4.1) match those used on Gradescope. (Octave users: Please make sure you do not use any Matlab-specific libraries in your code that might make it fail against our tests.) You have a total of 10 Gradescope programming submissions. Use them wisely. In order to not waste code submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before any Gradescope coding submission.
• Materials: The data that you will need in order to complete this assignment is posted along with the writeup and template on Piazza.
Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. Gradescope will use EJML version 0.38. The command line arguments above demonstrate how we will call you code. The classpath inclusion
-cp “./lib/ejml-v0.38-libs/*:./” will ensure that all the EJML jars are on the classpath as well as your code.
C++ Eigen is a header-only library, so there is no linking to worry about—just #include whatever components you need. Gradescope will use Eigen version 3.3.7. The command line arguments above demonstrate how we will call you code. The argument -I./lib will include the lib/Eigen subdirectory, which contains all the headers.
Do not include EJML or Eigen in your Gradescope submission tar; the autograder will ensure that they are in place.
ahttps://ejml.org bhttp://eigen.tuxfamily.org/
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
Matt Gormley
Marie Curie
Noam Chomsky
Select One: Who taught this course?
Matt Gormley
Marie Curie
@@Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
Stephen Hawking
Albert Einstein
Isaac Newton
I don’t know
Select all that apply: Which are scientists?
Stephen Hawking
Albert Einstein Isaac Newton
@ I don’t know
Fill in the blank: What is the course number?
10-S7601
S
1 Written Questions [20 pts]
1.1 Multiple Choice [10 pts]
In this section we will test your understanding of several aspects of HMMs. Shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions below.
1. (2 points) (Select all that apply) Which of the following are true under the (first-order) Markov assumption in an HMM:
The states are independent
The observations are independent
yt ⊥ yt−1|yt−2 yt ⊥ yt−2|yt−1
2 None of the above
2. (2 points) (Select all that apply) Which of the following independence assumptions hold in an HMM:
The current observation xt is conditionally independent of all other observations given the current state yt
The current observation xt is conditionally independent of all other states given the current state yt
The current state yt is conditionally independent of all states given the previous state yt−1
The current observation xt is conditionally independent of xt−2 given the previous observation xt−1. 2 None of the above
In the remaining questions you will always see two quantities and decide what is the strongest relation between them. (? means it’s not possible to assign any true relation). As such there is only one correct answer. α and β in the following questions and questions in section 1.2 follow the definition in section 2.
3. (1 point) (Select one) What is the relation between ? Select only the strongest relation that necessarily holds.
?
4. (1 point) (Select one) What is the relation between P(y4 = s1,y5 = s2,x) and α4(s1)·β5(s2)? Select only the strongest relation that necessarily holds.
?
5. (1 point) (Select one) What is the relation between α5(i) and β5(i)? Select only the strongest relation that necessarily holds.
?
1.2 Viterbi Decoding
Suppose we have a set of sequence consisting of T observed states, x1,…,xT , where each xt ∈ {X,Y,Z}. Each observed state is associated with a hidden state yt ∈ {A,B}. In the Viterbi algorithm we seek to find the most probable hidden state sequence y1,…,yT given the observation x1,…,xT . To compute such sequence, recall from lecture the following dynamic programming algorithm:
1. Initialize v1(sj) = πsjbsjx1 and p1(sj) = sj 2. For t > 1, we have
pt(sj) = argmax
k 1,…,J
where vt(sj) is the highest product of all the probabilities taken through path y1,…,yt−1 that ends with yt at state sj, and pt(sj) is the backpointers that store the path through hidden states that give us the highest product.
We can obtain the most probable sequence by backtracing through the backpointers as follows:
1. yˆT = argmaxk∈{1,…,J} vT (sk).
2. For t = T,…,1: yˆt−1 = pt(yˆt)
3. Return yˆ1,…,yˆT
For the following question, consider the Hidden Markov Model specified below. Working with probabilities in the logarithm scale, we have that logp(y1 = A) = logp(y1 = B) = −1, and the state transition model and emission probability tables are given as follows.
xt logp(xt|yy = A) logp(xt|yt = B)
X -1 -1.74
Y -1.32 -1
Z -3.32 -2.32
We observed x1 = X and x2 = Y . (Note that taking the maximum of log probabilities will give you the same result as taking the maximum of probabilities as log is a monotonically increasing function.)
1. (1 point) Compute logv1(sj = A) and logv1(sj = B). If your answers involves decimal numbers, please round your answer to TWO decimal places.
2. (2 points) (Select one) Which of the following is the most likely sequence of hidden states?
Don’t have enough information.
1.3 Warm-up Exercise: Forward-Backward Algorithm [4 pts]
To help you prepare to implement the HMM forward-backward algorithm (see Section 2.3 for a detailed explanation), we have provided a small example for you to work through by hand. This toy data set consists of a training set of three sequences with three unique words and two tags and a test set with a single sequence composed of the same unique words used in the training set. Before going through this example, please carefully read the algorithm description in Sections 2.2 and 2.3. Note: pseudo count used in section 2 is also used here.
Training set:
you_B eat_A fish_B you_B fish_B eat_A eat_A fish_B
Where the training word sequences are:
you eat fish x = you fish eat
eat fish
And the corresponding tags are:
B A B y = B B A A B
Test set:
fish eat you
or
fish eat you
The following four questions are meant to encourage you to work through the forward backward algorithm by hand using this test example. Feel free to use a calculator, being careful to carry enough significant figures through your computations to avoid rounding errors. For each question below, please report the requested value in the text box next to the question (these boxes are only visible in the template document). When a number is requested, only write the number in the box. When a word/tag is requested, only write that word or tag. DO NOT include explanations or derivation work in the text box. Points will be deducted if anything else is included in the box.
1. (1 point) Compute α2(A), the α value associated with the tag “A” for the second word in the test sequence. Please round your answer to THREE decimal places.
2. (1 point) Compute β2(B), the β value associated with the tag “B” for the second word in the test sequence. Please round your answer to THREE decimal places.
3. (1 point) Predict the tag for the third word in the test sequence.
4. (1 point) Compute the log-likelihood for the entire test sequence, “fish eat you”. Please round your answer to THREE decimal places.
1.4 Empirical Questions [6 pts]
Using the fulldata set trainwords.txt in the handout using your implementation of learnhmm.{py|java|cpp|m} to learn parameters for an hmm model using the first 10, 100, 1000, and 10000 sequences in the file. Use these learned parameters perform prediction on the trainwords.txt and the testwords.txt files using your forwardbackward.{py|java|cpp|m}. Construct a plot with number of sequences used for training on the x-axis (log-scale) and average log likelihood across all sequences from the trainwords.txt or the testwords.txt on the y-axis (see Section 2.3 for details on computing the log data likelihood for a sequence). Each table entry is worth 0.5 points. Write the resulting log likelihood values in the table in the template. Include your plot in the large box in the template (2 points). To receive credit for your plot, you must submit a computer generated plot. DO NOT hand draw your plot.
#sequences Train average loglikelihood Test average log-
likelihood
10
100
1000
10000
1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.
3. Did you find or come across code that implements any part of this assignment ? If so, include full details.
2 Programming [80 pts]
2.1 The Tasks and Data Sets
The handout file for this assignments contains several files that you will use in the homework. The contents and formatting of each of these files is explained below (these files are organized in two folders for the full and toy dataset, with corresponding prefixes).
1. trainwords.txt This file contains labeled text data that you will use in training your model in the Learning problem. Specifically the text contains one sentence per line that has already been preprocessed, cleaned and tokenized. You should treat every line as a separate sequence and assume that it has the following format:
<Word0> <Tag0> <Word1><Tag1> … <WordN><TagN>
where every <WordK><TagK> unit token is separated by white space.
2. testwords.txt: This file contains labeled data that you will use to evaluate your model in the Experiments section. This file contains the gold standard labels. This file has the same format as trainwords.txt.
3. index to word.txt and index to tag.txt: These files contain a list of all words or tags that appear in the data set. In your functions, you will convert the string representation of words or tags to indices corresponding to the location of the word or tag in these files. For example, if Austria is on line 729 of index to word.txt, then all appearances of Austria in the data sets should be converted to the index 729. This index will also correspond to locations in the parameter matrices. For example, the word Austria corresponds to the parameters in column 729 of the matrix stored in hmmemit.txt. This will be useful for your forward-backward algorithm implementation (see Section 2.3).
4. toytrain.txt, toytest.txt, toy index to word.txt, and toy index to tag.txt: These files are analogous to trainwords.txt, testwords.txt, index to word.txt, and index to tag.txt and are used to compare your implementation to your hand calculations in Section 1.3.
5. predicttest.txt The predicttest.txt file contains labeled data that you will use to debug your implementation. The labels in this file are not gold standard but are generated by running our decoder using the features from trainwords.txt. This file has the same format as trainwords.txt.
6. metrics.txt The metrics.txt file contains the metrics you will compute for the dataset. For this assignment, you need to compute the average log likelihood and your prediction accuracy on the test data. Note that in Named Entity Recognition, F-1 score is a more common metric to evaluate the model performance, here you only need to report your accuracy for tag prediction of each word.
Sample metrics file for the toy dataset
7. hmmtrans.txt, hmmemit.txt and hmmprior.txt: These files contain pre-trained model parameters of an HMM that you will use in testing your implementation of the Learning and Evaluation and Decoding problems. The format of the first two files are analogous and is as follows. Every line in these files consists of a conditional probability distribution. In the case of transition probabilities, this distribution corresponds to the probability of transitioning into another state, given a current state. Similarly, in the case of emission probabilities, this distribution corresponds to the probability of emitting a particular symbol, given a current state. For example, every line in hmmtrans.txt has the following format: hmmtrans.txt:
<ProbS1S1> … <ProbS1SN>
<ProbS2S1> … <ProbS2SN>…
and every line in hmmemit.txt has the following format:
hmmemit.txt:
<ProbS1Word1> … <ProbS1WordN>
<ProbS2Word1> … <ProbS2WordN>…
In both cases, elements in the same row are separated by white space. Each row corresponds to a line of text (using to create new lines).
The format of hmmprior.txt is similarly defined except that it only contains a single probability distribution over starting states. Therefore each row only has a single element. Therefore hmmprior.txt has the following format:
hmmprior.txt:
<ProbS1>
<ProbS2>…
Note that the data provided to you is to help in developing your implementation of the HMM algorithms. Your code will be tested on Autolab using different data with different HMM parameters, likely coming from a different domain although the format will be identical.
2.2 Learning
Your first task is to implement an algorithm to learn the hidden Markov model parameters needed to apply the forward backward algorithm (See Section 2.3). There are three sets of parameters that you will need to estimate: the initialization probabilities π, the transition probabilities A, and the emission probabilities B. For this assignment, we model each of these probabilities using a multinomial distribution with parameters πj = P(y1 = j), ajk = P(yt = k|yt−1 = j), and bjk = P(xt = k|yt = j). These can be estimated using maximum likelihood, which results in the following parameter estimators:
1. , where Nπj equals the number of times state sj is associated with the first word of a sentence in the training data set.
2. , where NAjk is the number of times state sj is followed by state sk in the training data set.
3., where NBjk is the number of times that the state sj is
p
associated with the word k in the training data set.
Note that for each count, a “+1” is added to make a pseudocount. This is slightly different from pure maximum likelihood estimation, but it is useful in improving performance when evaluating unseen cases during evaluation of your test set.
You should implement a function that reads in the training data set (trainwords.txt), and then estimates π, A, and B using the above maximum likelihood solutions.
Your outputs should be in the same format as hmmprior.txt, hmmtrans.txt, and hmmemit.txt (including the same number of decimal places to ensure there are no rounding errors during prediction). The autograder will use the following commands to call your function:
For Python: $ python learnhmm.py [args…]
For Java: $ javac -cp “./lib/ejml-v0.33-libs/*:./” learnhmm.java; java -cp “./lib/ejml-v0.33-libs/*:./” learnhmm [args…]
For C++: $ g++ -g -std=c++11 -I./lib learnhmm.cpp; ./a.out [args…]
For Octave: $ octave -qH learnhmm.m [args…]
Where above [args…] is a placeholder for six command-line arguments:<traininput><indextoword> <indextotag><hmmprior><hmmemit><hmmtrans>. These arguments are described in detail below:
1. <traininput>: path to the training input .txt file (see Section 2.1)
2. <index toword>: path to the .txt that specifies the dictionary mapping from words to indices. The tags are ordered by index, with the first word having index of 1, the second word having index of 2, etc.
3. <index totag>: path to the .txt that specifies the dictionary mapping from tags to indices. The tags are ordered by index, with the first tag having index of 1, the second tag having index of 2, etc.
4. <hmmprior>: path to output .txt file to which the estimated prior (π) will be written. The file output to this path should be in the same format as the handout hmmprior.txt (see Section 2.1).
5. <hmmemit>: path to output .txt file to which the emission probabilities (B) will be written. The file output to this path should be in the same format as the handout hmmemit.txt (see Section 2.1)
6. <hmmtrans>: path to output .txt file to which the transition probabilities (A) will be written. The file output to this path should be in the same format as the handout hmmtrans.txt (see Section 2.1)..
2.3 Evaluation and Decoding
2.3.1 Forward Backward Algorithm and Minimal Bayes Risk Decoding
Your next task is to implement the forward-backward algorithm. Suppose we have a set of sequence consisting of T words, x1,…,xT . Each word is associated with a label yt ∈ {1,…,J}. In the forward-backward algorithm we seek to approximate P(yt|x1,…,xT ) up to a multiplication constant. This is done by first breaking P(yt|x1,…,xT ) into a “forward” component and a “backward” component as follows:
P(yt = sj|x1,…,xT ) ∝ P(yt = sj,xt+1,…,xT |x1,…,xt)
∝ P(yt = sj|x1,…,xt)P(xt+1,…,xT |yt = sj,x1,…,xt)
∝ P(yt = sj|x1,…,xt)P(xt+1,…,xT |yt = sj) ∝ P(yt = sj,x1,…,xt)P(xt+1,…,xT |yt = sj)
where P(yt = sj|x1,…,xt) is computed by passing forward recursively through the model and P(xt+1,…,xT |yt =
j) is computed by passing recursively backwards through the model.
Forward Algorithm
Define αt(j) = P(yt = j,x1:t). We can rearrange our definition of αt(j) as follows:
αt(j) =P(yt = sj,x1:t)
=XP(yt = sj,yt−1 = sk,x1:t)
k
=XP(yt−1 = sk,x1:t|yt = sj)P(yt = sj)
k
=XP(xt|yt = sj)P(yt−1 = sk,x1:t−1|yt = sj)P(yt = sj)
k
=P(xt|yt = sj)XP(yt = sj,yt−1 = sk,x1:t−1)
k
=P(xt|yt = sj)XP(yt = sj,x1:t−1|yt−1 = sk)P(yt−1 = sk)
k
=P(xt|yt = sj)XP(yt = sj|yt−1 = sk)P(x1:t−1|yt−1 = sk)P(yt−1 = sk)
k
=P(xt|yt = sj)XP(yt = sj|yt−1 = sk)P(yt−1 = sk,x1:t−1)
k
=bjxt Xakjαt−1(k) (2.1)
k
Using this definition, the α’s can be computed using the following recursive procedure:
1. α1(j) = πjbjx1.
2. For
Backward Algorithm Define βt(j) = P(xt+1,…,xT |yt = sj). We can rearrange our definition of βt(j) as follows:
βt(j) = P(xt+1,…,xT |yt = sj)
J
= XP(yt+1 = sk,xt+1,…,xT |yt = sj)
k=1
J
= XP(xt+1,…,xT |yt = sj,yt+1 = sk)P(yt+1 = sk|yt = sj)
k=1
J
= XP(xt+1|yt+1 = sk)P(xt+2,…,xT |yt+1 = sk)P(yt+1 = sk|yt = sj) k=1
J
= Xbkxt+1βt+1(k)ajk (2.2)
k=1
Just like the α’s, the β’s can also be computed using the following backward recursive procedure:
1. βT (j) = 1 (All states could be ending states)
2. For 1 ≤ t < T, βt(j) = PJk=1 bkxt+1βt+1(k)ajk (Generate xt+1 from any state)
Forward-Backward Algorithm As stated above, the goal of the Forward-Backward algorithm is to compute P(yt = sj|x1,…,xT ). This can be done using the following equation:
P(yt = sj|x1,…,xT ) ∝ P(yt = sj,x1,…,xt)P(xt+1,…,xT |yt = sj)
After running your forward and backward passes through the sequence, you are now ready to estimate the conditional probabilities as:
P(yt|x1,…,xT ) ∝ αt ◦ βt
where ◦ is the element-wise product.
Minimum Bayes Risk Prediction We will assign tags using the minimum Bayes risk predictor, defined for this problem as follows:
yˆt = argmax P(yt = sj|x1,…,xT )
j∈{1,…,J}
To resolve ties, select the tag that appears earlier in the <indextotag> input file.
Computing the Log Likelihood of a Sequence When we compute the log likelihood of a sequence, we are interested in the computing the quantity P(x1,…,xT ). We can rewrite this in terms of values we have already computed in the forward-backward algorithm as follows:
2.3.2 Implementation Details
You should now implement your forward-backward algorithm as a program, forwardbackward.{py|java|cpp|m}.
The program will read in test data and the parameter files produced by learnhmm.{py|java|cpp|m}. The autograder will use the following commands to call your function:
For Python: $ python forwardbackward.py [args…]
For Java: $ javac -cp “./lib/ejml-v0.33-libs/*:./” forwardbackward.java; java -cp “./lib/ejml-v0.33-libs/*:./” forwardbackward [args…]
For C++: $ g++ -g -std=c++11 -I./lib forwardbackward.cpp; ./a.out [args…]
For Octave: $ octave -qH forwardbackward.m [args…]
Where above [args…] is a placeholder for seven command-line arguments:<testinput><indextoword>
<indextotag><hmmprior><hmmemit><hmmtrans><predictedfile><metricfile>.
These arguments are described in detail below:
1. <testinput>: path to the test input .txt file that will be evaluated by your forward backward algorithm (see Section 2.1)
2. <index toword>: path to the .txt that specifies the dictionary mapping from words to indices.
The tags are ordered by index, with the first word having index of 1, the second word having index of 2, etc. This is the same file as was described for learnhmm.{py|java|cpp|m}.
3. <index totag>: path to the .txt that specifies the dictionary mapping from tags to indices. The tags are ordered by index, with the first tag having index of 1, the second tag having index of 2, etc. This is the same file as was described for learnhmm.{py|java|cpp|m}.
4. <hmmprior>: path to input .txt file which contains the estimated prior (π).
5. <hmmemit>: path to input .txt file which contains the emission probabilities (B).
6. <hmmtrans>: path to input .txt file which contains transition probabilities (A).
7. <predictedfile>: path to the output .txt file to which the predicted tags will be written. The file should be in the same format as the <testinput> file.
8. <metric file>: path to the output .txt file to which the metrics will be written.
Example command for python users:
$ python forwardbackward.py toydata/toytrain.txt toydata/toy_index_to_word.txt toydata/toy_index_to_tag.txt hmmprior.txt hmmemit.txt hmmtrans.txt predicted.txt metrics.txt
After running the command above, the <predictedfile> output should be:
you_B eat_A fish_B you_B fish_B eat_A eat_A fish_B
And the <metricfile> output should be:
Average Log-Likelihood: -2.776793
Accuracy: 1.0
Take care that your output has the exact same format as shown above. There should be a single space after the colon preceding the metric value (e.g. a space after Average Log-Likelihood:). Each line should be terminated by a Unix line ending .
2.3.3 Log-Space Arithmetic for Avoiding Underflow
For this homework, using log-space trick starts with transforming Eq.(2.1) and Eq.(2.2) into logarithmic form (how to do that is straightforward and left as an exercise). Please use e as the base for logarithm calculation (natural log).
logXexp(vi)
i
Algorithm 1 Log-Sum-Exp Trick
logsumexpv1,v2,··· ,vn m = max(vi), i = {1,2,··· ,n} return m + log(Pi exp(vi − m))
2.4 Gradescope Submission
You should submit your learnhmm.{py|m|java|cpp} and forwardbackward.{py|m|java|cpp} to Gradescope. Note: please do not use other file names. This will cause problems for the autograder to correctly detect and run your code.
Some additional tips: Make sure to read the autograder output carefully. The autograder for Gradescope prints out some additional information about the tests that it ran. Make wise use of autograder’s output for debugging your code.
Reviews
There are no reviews yet.