CS6375 – Assignment III (Solution)

Text Classification

Naive Bayes and Logistic Regression for Text Classification
In this homework you will implement and evaluate Naive Bayes and Logistic Regression for text classification. Use Python to implement your algorithms.

0 points – Download the spam/ham dataset (ham means not spam) available on eLearning. The dataset is divided into two sets: a training set and a test set. It was used in the Metsis et al. paper [1]. Each set has two directories: spam and ham. All files in the spam directories are spam messages, and all files in the ham directories are legitimate (non-spam) messages.
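As a minimal sketch of reading one split into memory, the helper below walks the spam and ham directories and tokenizes each file. The function names, the tokenizer regular expression, and the latin-1 encoding are assumptions made for illustration; the assignment does not prescribe them.

```python
from pathlib import Path
import re

# Assumed tokenizer: runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())

def load_split(root):
    """Return a list of (tokens, label) pairs for root/spam and root/ham."""
    docs = []
    for label in ("spam", "ham"):
        for path in sorted(Path(root, label).iterdir()):
            if path.is_file():
                # latin-1 is an assumed encoding; it can decode any byte sequence.
                text = path.read_text(encoding="latin-1")
                docs.append((tokenize(text), label))
    return docs
```

Calling `load_split("train")` and `load_split("test")` (hypothetical directory names) would give the two lists of labeled documents.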

25 points – Implement the multinomial Naive Bayes algorithm for text classification described here: http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf (see Figure 13.2). Note that the algorithm uses add-one Laplace smoothing. Make sure that you do all the calculations in log-scale to avoid underflow. Use your algorithm to learn from the training set and report accuracy on the test set.
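The training and classification steps can be sketched roughly as follows. This is one possible reading of the algorithm in Figure 13.2, assuming each document is already a (tokens, label) pair; the function names are illustrative. Priors and add-one smoothed conditional probabilities are stored as logarithms, so classification sums logs instead of multiplying probabilities.

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns (vocab, log_prior, log_cond)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    classes = sorted({label for _, label in docs})
    log_prior, log_cond = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for tokens in class_docs for t in tokens)
        # Add-one (Laplace) smoothing: +1 per term, +|V| in the denominator.
        denom = sum(counts.values()) + len(vocab)
        log_cond[c] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
    return vocab, log_prior, log_cond

def apply_multinomial_nb(vocab, log_prior, log_cond, tokens):
    """Return the class with the highest log-score; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_cond[c][t] for t in tokens if t in vocab
        )
    return max(scores, key=scores.get)

def accuracy(model, docs):
    """Fraction of (tokens, label) pairs that the model classifies correctly."""
    correct = sum(
        apply_multinomial_nb(*model, tokens) == label for tokens, label in docs
    )
    return correct / len(docs)
```

With `model = train_multinomial_nb(train_docs)`, the test accuracy would be `accuracy(model, test_docs)`.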

25 points – Implement the MCAP Logistic Regression algorithm with L2 regularization that we discussed in class (see Mitchell’s new book chapter). Try different values of λ. Use your algorithm to learn from the training set and report accuracy on the test set for different values of λ. Use gradient ascent for learning the weights. Do not run gradient ascent until convergence; you should put a suitable hard limit on the number of iterations.
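A minimal sketch of batch gradient ascent on the L2-penalized conditional log-likelihood is shown below. It assumes each message has already been turned into a numeric feature vector (for example, word counts) with labels 1 for spam and 0 for ham. The names `train_lr` and `predict_lr`, the default learning rate `eta`, the default `max_iters`, and the choice to leave the bias weight unregularized are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_lr(X, y, lam, eta=0.01, max_iters=100):
    """X: list of feature lists, y: list of 0/1 labels.

    Runs exactly max_iters gradient-ascent steps (the hard limit)
    and returns the weights, where w[0] is the bias.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(max_iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = yi - sigmoid(z)  # Y minus estimated P(Y=1 | X, W)
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        # L2 penalty subtracts lam * w_j; the bias is left unpenalized (assumed).
        w = [w[0] + eta * grad[0]] + [
            wj + eta * (gj - lam * wj) for wj, gj in zip(w[1:], grad[1:])
        ]
    return w

def predict_lr(w, x):
    """Return 1 if the estimated P(Y=1 | x) is at least 0.5, else 0."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0
```

Trying several values of λ then amounts to calling `train_lr` once per value of `lam` and recording the test accuracy for each.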

25 points – Improve your Naive Bayes and Logistic Regression algorithms by throwing away (i.e., filtering out) stop words such as “he”, “of”, and “or” from all the documents. A list of stop words can be found here: https://www.ranks.nl/stopwords. Report accuracy for both Naive Bayes and Logistic Regression on this filtered set. Does the accuracy improve? Explain why it improves or why it does not.
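The filtering step itself can be sketched in a few lines. The `STOP_WORDS` set below is only a tiny illustrative subset, not the list from the linked page, and the function name is an assumption.

```python
# Illustrative subset only; a real run would load the full stop-word list.
STOP_WORDS = {"he", "of", "and", "or", "the", "a", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]
```

Applying this to every training and test document before building the vocabulary keeps both classifiers on the same filtered feature set.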

What to Turn in

- Your code and a README file for compiling and executing your code.
- A detailed write-up (worth 25 points) that reports the accuracy obtained on the test set and the parameters used (e.g., values of λ, hard limit on the number of iterations, etc.). We should be able to replicate your results based on your write-up.

References
[1] V. Metsis, I. Androutsopoulos and G. Paliouras, “Spam Filtering with Naive Bayes – Which Naive Bayes?”. Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.
