CS7650 – Instructions Solved

1. This homework has two parts: Q1, Q2 and Q3 are theory questions and Q4 is a programming assignment with some parts requiring a written answer. Each part needs to be submitted as follows:
• The programming assignment requires you to work on boilerplate code. Submit the answers to the programming assignment in a zip that contains all the code files. This submission is to be made on Canvas for the assignment corresponding to Homework 2 Programming. Name the zip file LastName FirstName.zip.
3. The second question is meant for graduate students only. Undergraduate students do not need to attempt Q2. Each of the other three questions is mandatory for all students. There is no extra credit for answering more questions than required.
5. The code files needed to complete the homework are included in a zip file on Canvas.
1. A collection of reviews about comedy movies (data D) contains the following keywords and binary labels for whether each movie was funny (+) or not funny (-). The data are shown below: for example, the cell at the intersection of “Review 1” and “laugh” indicates that the text of Review 1 contains 2 tokens of the word “laugh.”

Review  laugh  hilarious  awesome  dull  yawn  bland  Y
1       2      1          1        1     1     0      +
2       0      1          2        0     0     0      +
3       3      0          0        0     0     1      +
4       0      1          0        2     1     0      -
5       1      1          1        2     0     2      -
6       1      0          0        2     2     0      -

(a) Assume that you have trained a Naive Bayes model on data D to detect funny vs. not funny movie reviews. Compute the model’s predicted scores for funny and not funny on the following sentence S (i.e. P(+|S) and P(−|S)), and determine which label the model will apply to S. [4 pts]
S: “This film was hilarious! I didn’t yawn once. Not a single bland moment. Every minute was a laugh.”
(c) What is an additional feature that you could extract from text to improve the classification of sentences like S, and how would it help improve the classification? [2 pts]
2. [CS 7650 Only]
Assume that you are training several logistic regression models. After training on the same data, θ̂ is the optimal weight for an unregularized logistic regression model and θ∗ is the optimal weight for a logistic regression model with L2 regularization. Prove that .
3. Language modeling is the technique that allows us to compute the probabilities of word sequences. Using the chain rule, the probability of a sequence W can be written as the product of the probabilities of each word given its history:
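$$
\begin{aligned}
P(W) &= P(w_1, w_2, \ldots, w_n) \\
     &= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \\
     &= \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})
\end{aligned}
$$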
(a) Using an n-gram model allows us to approximate the above probability using only the previous n − 1 words of the history at each step. Simplify the above expression for the general n-gram case and for the bi-gram case. [3 pts]
(b) A common way to mark the start and the end of a sentence is to add the [BOS] (beginning of sentence) and [EOS] (end of sentence) tokens at the start and end of every sentence. Consider the following text snippet:
[BOS] i made cheese at home [EOS]
[BOS] i like home made cheese [EOS]
[BOS] cheese made at home is tasty [EOS]
[BOS] i like cheese that is salty [EOS]
Using the expression derived in (a), find the probability of the following sequence under the bi-gram model: P([BOS] I like cheese made at home [EOS]). [5 pts]
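As an illustration of the counting involved, here is a minimal sketch of a maximum-likelihood bi-gram estimator. The function names `bigram_mle` and `sequence_probability` and the two-sentence toy corpus are hypothetical and unrelated to the homework data; the sketch assumes whitespace tokenization.

```python
from collections import Counter

def bigram_mle(sentences):
    """Return p(prev, word) estimated by relative frequency of bi-grams."""
    context, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = sent.split()
        # A token counts as a context only when another token follows it.
        context.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if context[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        return bigrams[(prev, word)] / context[prev]

    return prob

def sequence_probability(prob, sentence):
    """Multiply the conditional probabilities of consecutive token pairs."""
    tokens = sentence.split()
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= prob(prev, word)
    return p

# Hypothetical toy corpus, not the homework snippet.
toy = ["[BOS] a b [EOS]", "[BOS] a a [EOS]"]
prob = bigram_mle(toy)
p_seq = sequence_probability(prob, "[BOS] a b [EOS]")  # 1 * (1/3) * 1
```

Any bi-gram that never occurs in the training text drives the whole product to zero, which is the motivation for the smoothing discussed later in this question.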
(c) In practice, instead of raw probability, perplexity is used as the metric for evaluating a language model. Define perplexity and find the value of perplexity for the sequence in (b) for the bi-gram case. [2 pts]
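For reference, a minimal sketch of a perplexity computation, assuming perplexity is taken as the inverse probability of the sequence normalized by the number N of predicted tokens. Conventions differ on whether [BOS] counts toward N, so treat that choice as an assumption. The function name `perplexity` and the input probabilities are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity from the conditional probabilities of each predicted token."""
    if not token_probs:
        raise ValueError("need at least one predicted token")
    if any(p == 0.0 for p in token_probs):
        return math.inf  # a zero-probability token gives infinite perplexity
    n = len(token_probs)
    # Work in log space to avoid underflow on long sequences.
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# Hypothetical per-token probabilities, not derived from the homework data.
pp = perplexity([0.5, 0.25, 0.5, 0.25])
```

A model that assigns a uniform probability of 1/4 to every token has perplexity 4, which is why perplexity is often read as an effective branching factor.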
(d) One way to deal with unseen word arrangements in the test set is to use Laplace smoothing, which adds 1 to all bi-gram counts, before we normalize them into probabilities. An alternative to Laplace smoothing (add-1 smoothing) is add-k smoothing, where k is a fraction that allows assigning a lesser probability mass to unseen word arrangements. Find the probability of the sequence in (b) with add-k smoothing for k = 0.1. [5 pts]
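A minimal sketch of add-k smoothing for bi-grams follows. It assumes the vocabulary size V counts every token type seen in training, including [BOS] and [EOS]; other conventions exist, so check which one the course uses. The function name `add_k_bigram` and the toy corpus are hypothetical.

```python
from collections import Counter

def add_k_bigram(sentences, k=0.1):
    """Return (prob, vocab) where prob(prev, word) uses add-k smoothing."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = sent.split()
        vocab.update(tokens)
        context.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    v = len(vocab)  # assumption: V includes [BOS] and [EOS]

    def prob(prev, word):
        # Add k to the bi-gram count and k * V to the context count.
        return (bigrams[(prev, word)] + k) / (context[prev] + k * v)

    return prob, vocab

# Hypothetical toy corpus, not the homework snippet.
toy = ["[BOS] a b [EOS]", "[BOS] a a [EOS]"]
prob, vocab = add_k_bigram(toy, k=0.1)
p_seen = prob("a", "b")    # (1 + 0.1) / (3 + 0.1 * 4)
p_unseen = prob("b", "a")  # (0 + 0.1) / (1 + 0.1 * 4)
```

Because k * V is added to every denominator, the smoothed probabilities for a fixed context still sum to 1 over the vocabulary.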
(e) To deal with unseen words in the test set, a common way is to fix a vocabulary by thresholding on the frequency of words, and assigning an [UNK] token to represent all out-of-vocabulary words. In the example from (b), use a threshold of count > 1 to fix the vocabulary. Find the probability of the following sequence under an add-0.1 smoothed bi-gram model: P([BOS] i like pepperjack cheese [EOS]). [5 pts]
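The vocabulary-thresholding step can be sketched as below. The function names `build_vocab` and `replace_oov` and the toy corpus are hypothetical. In a full pipeline the training sentences would typically be mapped through the same replacement before bi-gram counts are collected, so that [UNK] itself acquires counts; that step is left out here.

```python
from collections import Counter

def build_vocab(sentences, threshold=1):
    """Keep only token types whose training count is greater than threshold."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, c in counts.items() if c > threshold}

def replace_oov(sentence, vocab, unk="[UNK]"):
    """Map every out-of-vocabulary token to the unknown-word token."""
    return " ".join(tok if tok in vocab else unk for tok in sentence.split())

# Hypothetical toy corpus, not the homework snippet.
toy = ["[BOS] a b [EOS]", "[BOS] a c [EOS]"]
vocab = build_vocab(toy, threshold=1)               # b and c occur only once
mapped_train = [replace_oov(s, vocab) for s in toy]
mapped_test = replace_oov("[BOS] a d [EOS]", vocab)
```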
4. In this problem, you will do text classification for hate speech. You need to both answer the questions and submit your code.
Hate speech is (a) a deliberate attack, (b) directed towards a specific group of people, (c) motivated by aspects of the group’s identity.
The three premises must be true for a sentence to be categorized as HATE. Here are two examples:
(a) “Poor white kids being forced to treat apes and parasites as their equals.”
(b) “Islam is a false religion however unlike some other false religions it is crude and appeals to crude people such as arabs.”
In (a), the speaker uses “apes” and “parasites” to refer to children of dark skin and implies they are not equal to “white kids”. That is, it is an attack on the group composed of children of dark skin based on an identifying characteristic, namely, their skin colour. Thus, all the premises are true and (a) is a valid example of HATE. Example (b) brands all people of Arab origin as crude. That is, it attacks the group composed of Arab people based on their origin. Thus, all the premises are true and (b) is a valid example of HATE.
This problem will require programming in Python 3. The goal is to build a Naive Bayes model and a logistic regression model, as covered in class, on a real-world hate speech classification dataset. Finally, you will explore how to design better features and improve the accuracy of your models for this task.
The dataset you will be using is collected from Twitter. Each example is labeled as 1 (hatespeech) or 0 (non-hatespeech). To get started, you should first download the data and starter code from https://www.cc.gatech.edu/classes/AY2020/cs7650_spring/programming/h2_text_classification.zip. Try to run:
python main.py --model AlwaysPredictZero
This will load the data and run a default classifier, AlwaysPredictZero, which always predicts label 0 (non-hatespeech). You should see a reported train accuracy of 0.4997. In other words, always predicting non-hatespeech is no better than chance on this dataset. Let’s try to build better classifiers!
Note that you need to implement the models without using any machine learning packages such as sklearn. We will only provide the training set, and we will evaluate your code on our test set.
For a quick check of your implementations, you can randomly split the dataset we give you into a train set and a test set at a ratio of 8:2, and compare the accuracy of the models you have implemented against the corresponding models in sklearn. You should expect an accuracy of around 0.65 (or above) on your test set.
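One possible way to make the 8:2 split is sketched below. The function name `train_test_split` and its parameters are hypothetical and written from scratch here; this is not the sklearn function of the same name, and it does not rely on any helper from the starter code.

```python
import random

def train_test_split(examples, labels, test_ratio=0.2, seed=0):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        # Keep examples and labels aligned by selecting with the same indices.
        return [examples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

# Hypothetical data: ten integer "examples" labeled by parity.
xs = list(range(10))
ys = [x % 2 for x in xs]
(train_x, train_y), (test_x, test_y) = train_test_split(xs, ys)
```

Fixing the seed keeps the split reproducible, so accuracy differences between runs reflect model changes rather than a different partition.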
(a) (Naive Bayes) In this part, you should implement a Naive Bayes model with add-1 smoothing, as taught in class. You are required to implement the NaiveBayesClassifier class in classifiers.py. You will probably want to take a look at the UnigramFeature class in utils.py, which we have already implemented for you. After you finish your code, run python main.py --model NaiveBayes to check the performance. List the 10 words that, under your model, have the highest ratio of (the most distinctly hatespeech words). List the 10 words with the lowest ratio. What trends do you see? [25 pts]
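To make the add-1 smoothing concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over token lists. The class name `TinyNaiveBayes`, its `fit`/`predict` methods, and the toy documents are hypothetical; the interface expected by NaiveBayesClassifier in the starter code may well differ, so this shows only the arithmetic.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-1 smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.log_prior, self.log_likelihood = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            # Add 1 to every word count, hence |V| to the denominator.
            total = sum(counts.values()) + len(self.vocab)
            self.log_likelihood[c] = {
                tok: math.log((counts[tok] + 1) / total) for tok in self.vocab
            }
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Tokens outside the training vocabulary are skipped.
            scores[c] = self.log_prior[c] + sum(
                self.log_likelihood[c][tok] for tok in doc if tok in self.vocab
            )
        return max(scores, key=scores.get)

# Hypothetical toy documents, not the homework dataset.
docs = [["good", "good"], ["good", "fun"], ["bad"], ["bad", "boring"]]
labels = [1, 1, 0, 0]
model = TinyNaiveBayes().fit(docs, labels)
pred_pos = model.predict(["good"])
pred_neg = model.predict(["bad", "boring"])
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents.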
(c) (Features) In the last part, you’ll explore and implement a more sophisticated set of features. You need to implement the class BigramFeature or modify the class CustomFeature in utils.py. Here are some common strategies (you are welcome to implement some of them, but try to come up with more!):
i. Remove stopwords (e.g. a, the, in),
ii. Use a mixture of unigrams, bigrams or trigrams,
iii. Use TF-IDF (refer to http://www.tfidf.com/) features.
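As a rough illustration of the first two strategies, the sketch below drops stopwords and then counts a mixture of unigrams and bigrams. The function name `ngram_features` and the three-word stopword set (taken from the examples in item i) are assumptions; how such counts plug into BigramFeature or CustomFeature depends on the starter code, which is not shown here.

```python
from collections import Counter

# Assumption: a tiny stopword set built from the examples in item i.
STOPWORDS = {"a", "the", "in"}

def ngram_features(tokens, max_n=2, stopwords=STOPWORDS):
    """Count all n-grams up to max_n after lowercasing and stopword removal."""
    kept = [t.lower() for t in tokens if t.lower() not in stopwords]
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(kept) - n + 1):
            feats[" ".join(kept[i:i + n])] += 1
    return feats

# Hypothetical input sentence.
feats = ngram_features("The cat sat in a hat".split())
```

Note that removing stopwords before forming bigrams makes non-adjacent words such as "sat" and "hat" appear adjacent; whether that helps is an empirical question worth checking on a held-out split.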
Use your creativity for this problem and try to obtain an accuracy as high as possible on your test set! After you implement CustomFeature, run:
python main.py --model NaiveBayes --feature customized
python main.py --model LogisticRegression --feature customized
Describe the features that you have implemented. We’ll evaluate your two models on the test set. [Bonus: 10 points]
You will receive up to 10 bonus points: up to 5 points based on the novel features you try and the rest based on how well your models perform compared to other submissions:

For example, if you rank first in the class, you will receive the full bonus points. We will also share the winners’ code.
