Description
CENG499
Introduction to Machine Learning
Homework 4 – Naive Bayes, Hidden Markov Models
version 1
1 Introduction
In this assignment, you have 2 independent tasks: sentiment analysis using Naive Bayes and evaluation and decoding tasks of Hidden Markov Models. No report is required in this homework. All related files can be downloaded from http://user.ceng.metu.edu.tr/~artun/ceng499/hw4_files.zip.
2 Sentiment Analysis using Naive Bayes
2.1 Dataset
The dataset is created using Twitter US Airline Sentiment dataset. You need to use the version we provided. The examples are some tweets that mention some airline firms in US. They are labeled as negative, neutral, and positive and given as plain text files in hw4_data/sentiment.
2.2 Background
2.2.1 Naive Bayes
Let h be our hypothesis which is defined as
h(x) = argmaxP(y|x),
y
where y is the label and x is the feature. It basically selects the label that has the highest probability given x. Using the Bayes’ Rule, we can obtain
h(x) = argmaxP(y|x)
y
Bayes’ Rule
= argmaxP(x|y)P(y) P(x) does not depend on y;
y
however, estimating P(x|y) is still hard since we need to see the exact example in out dataset. In our case, it is highly unlikely that we see the exact same test tweet in our training dataset. To achieve more feasible solution, Naive Bayes assumes conditional independence between the dimensions of the features given label. Although this independence does not hold in our case in reality, it still works well in practice. Assuming the independence, we achieve
d
h(x) = argmaxP(y) Y P(xj|y)
y j=1
d
h(x) = argmaxlog(P(y) Y P(xj|y))
y j=1 d
= argmaxlog(P(y)) + Xlog(P(xj|y)).
y j=1
2.2.2 Our case
We are going to assume that the positions of the words do not matter and simply count the occurrences of the words in an example. This kind of modelling is called bag-of-words. To do that, first, we are going to create our vocabulary which will contain every word in the training set. (We could also use every word in the English dictionary but in this homework, we are going to use the training data we already have.) Secondly, for a specific example (a tweet), we will count the occurrences of every word in the vocabulary.
To give an example let’s say our tweet is “the ball is on the floor”. Then, our representation will be something like in Table 1 assuming that our vocabulary is {a, amazing, ball, floor, is, on, the, zoom}.
a0 amazing0 ball1 floor1 is1 on1 the2 zoom0
Table 1: The representation of “the ball is on the floor”.
We have two probabilities to estimate: P(x|y) and P(y). P(y) is easy to estimate. For class c we will count the examples labeled as class c and divide it by the number of all examples. Let πc = P(y = c) be the probability of a tweet being in class c. Then, it can be estimated as
where I is the indicator function which returns 1 when the condition holds and 0 otherwise, y(i) is the label of ith example in the dataset, and n is the number of examples in the training dataset.
In our model, we will think like while writing a tweet, every word is selected by rolling a die with d sides where d is the number of words in our vocabulary. Let θjc be the probability of rolling the word j in our vocabulary given that our class label is c. Then, the probability of generating a tweet with length
m is Multinomial Distribution (https://en.wikipedia.org/wiki/Multinomial_distribution) and can be expressed as
.
Since we are taking argmax, we don’t actually need to calculate the term since it will be the same for every class. Therefore, finally our hypothesis function will be
d
h(x) = argmaxlog(πˆc) + Xxjlog(θˆjc)
c
j=1
where θˆjc is the estimation of θjc. We will see how to estimate that in the next section.
2.2.3 Additive (Laplace) Smoothing
Calculation of θˆjc’s will be very similar to the calculation of πˆc’s. We will simply divide the number of occurrences of the jth word in all examples labeled with class c by the number of total words in the examples labeled with c. More formally, it can be described as
where y(i) is the label of the ith example, x(ji) is the number of occurrences of the jth word in our vocabulary in the ith example. The term simply calculates the number of words in the ith example.
In additive smoothing with smoothing parameter 1 (https://en.wikipedia.org/wiki/Additive_ smoothing), we are going to behave like every word appears at least once in the training dataset. If we update our estimation formula above, the number of every word will increase by one so the denominator will increase by the number of words in our vocabulary (d) and the numerator will increase by 1 and we will get
.
2.3 Implementation
Notice that although the equations seem very complicated, all it does is counting and dividing the results to the total number of what it counts. To estimate πc, we can count how many times a specific label c occurs in the training data and then divide it by the the length of our the training set. Similarly, to estimate θ:c, we can count how many times every word occurs in the training data labeled with specific class c by only traversing it once and divide the values by the total number of words for class c. Additive smoothing can also be easily adapted to this kind of implementation. If you get rid of the redundancies in your code in this way, your implementation should be able to finish its job under 1 second.
2.4 Task
You are expected to train a Naive Bayes classifier, test it using the test set and report the accuracy. The function templates are given in the nb.py and some tests are given in the nb_mini_test.py. Passing all these tests does not mean you will get full points.
• Implement the functions in nb.py according to their descriptions. During test time, you can skip the words that are not in the vocabulary created using the train set. The scores of the test function should be calculated using (as mentioned above)
d
h(x) = argmaxlog(πˆc) + Xxjlog(θˆjc).
c
j=1
• Calculate the accuracy of your model using the functions in the nb.py on the test set and print it.
3 Hidden Markov Models
In this part, you are going to work on evaluation and decoding tasks of Hidden Markov Models (HMM). To do that, you are going to implement forward and Viterbi algorithms. The template of the functions are given in “hmm.py”. You can check the recitation slides for the detailed explanations of forward and Viterbi algorithms and understand the notation used here.
3.1 Data
Only filling the forward and viterbi functions is enough for this homework. Arguments will be directly fed into those functions. Therefore, there is no need to explicitly parse the input or printing the output. You only need to return the expected outputs.
Although the outputs of the tasks are different their inputs are the same. The inputs are as mentioned in Table 2.
The data will be given in numpy arrays. A[i,j] is the state transition probability from state i to state j. B[i,j] is the probability of observing observation j in state i. pi[i] is the probability of initial state being
i. O[t] is the tth observation, which is an index between 0 and M-1 (inclusive).
Input Symbol Input Name Input Size
A State Transition Matrix NxN
B Observation Probability Matrix NxM pi Initial State Probabilities N
O Observation Sequence T
Table 2: Explanation of the input symbols in tasks. N, M, T represent number of states, number of possible observations, length of the observation sequence, respectively. (2≤N≤10, 2≤M≤10,
1≤T≤30)
3.2 Evaluation Task
For this task, you are going to implement forward algorithm by filling the forward function in “hmm.py”. The output should be the probability of an observation sequence given A, B, pi and the calculated alpha values in a numpy array. This can be done by calculating this probability for every possible state sequence and summing them up; however, this takes exponential time. Therefore, in this homework, you are asked to implement the forward algorithm, which runs in N2T time. If your algorithm runs in exponential time, you won’t be able to get full points.
3.3 Decoding Task
Similar to the evaluation task, you are going to implement Viterbi algorithm by filling the viterbi function in “hmm.py”. The output should be the numpy array of most likely observation sequence given A, B, pi and the calculated deltas in a numpy array. This can be done by calculating this probability for every possible state sequence then selecting the maximum; however, this takes exponential time. Therefore, in this homework, you are asked to implement the Viterbi algorithm, which runs in N2T time. If your algorithm runs in exponential time, you won’t be able to get full point.
4 Specifications
• The codes must be in Python3 and use only numpy. Any other programming language or library will not be accepted.
• Falsifying results is strictly forbidden and you will receive 0 if this is the case. Your programs will be examined to see if you have actually reached the results and if it is working correctly.
• The late policy given in the syllabus applies here. You have total of 6 late days for all your homeworks but you can spend at most 3 for a specific homework. The late submission penalty will be calculated using 5n2, that is, 1 day late submission will cost you 5 points, 2 days will cost you 20 points and 3 days will cost you 45 points. No late submission is accepted after reaching a total of 3 late days.
• Using any piece of code that is not your own is strictly forbidden and constitutes as cheating. This includes friends, previous homeworks, or the internet. The violators will be punished according to the department regulations.
• Follow the course page on ODTUCLASS for any updates and clarifications. Please ask your questions on Homework 4 Discussion section of ODTUCLASS instead of e-mailing if the question does not contain code or solution.
5 Submission
Submission will be done via ODTUCLASS. If you do not have access to ODTUCLASS send send an email to “artun@ceng.metu.edu.tr” as soon as possible. You will submit a zip file called “hw4.zip” that contains all the source code, README file.
Reviews
There are no reviews yet.