Portfolio Assignment: Author Attribution
Objectives:
• Gain experience with machine learning using sklearn
• Experiment with the NLP task author attribution
Turn in:
• Use Google Colab or a local Jupyter notebook
• File -> Print to PDF; turn in the PDF to eLearning and upload it to your portfolio
Background:
The Federalist Papers is a collection of documents written by Alexander Hamilton, James Madison, and John Jay, collectively under the pseudonym Publius. The documents were written to persuade voters to ratify the US Constitution. They remain influential to this day and are frequently cited in federal court rulings, law blogs, and political opinions.
Overview:
The data set used in this assignment is a collection of Federalist Papers from Project Gutenberg. It contains 83 documents and has two columns: one for the author(s) and one for the text of the document.
The NLP task of authorship attribution is the attempt to identify the author of a document, given samples of authors’ work. In this data set, the breakdown by author is as follows:
• Alexander Hamilton 49
• James Madison 15
• John Jay 5
There are several documents for which authorship is in dispute by historians:
• Hamilton or Madison 11
• Hamilton and Madison 3
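As a rough illustration of the two-column layout described above, the sketch below loads a small CSV with pandas, converts the author column to categorical data, and tallies documents per author. The column names `author` and `text`, the author labels, and the four toy rows are assumptions made for illustration; they are not taken from the actual data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: column names, labels, and rows are made up.
csv_text = """author,text
HAMILTON,toy document one about commerce and revenue
MADISON,toy document two about faction and liberty
JAY,toy document three about treaties and peace
HAMILTON,toy document four about taxation and credit
"""

# With a real file this would be pd.read_csv(<path to the csv>).
df = pd.read_csv(io.StringIO(csv_text))

# Store the author labels as a categorical dtype rather than plain strings.
df["author"] = df["author"].astype("category")

print(df.head())                       # first few rows
counts = df["author"].value_counts()   # number of documents per author
print(counts)
```

On the real data, the same `value_counts()` call should reproduce the per-author breakdown listed above.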
CS 4395 Intro to NLP Dr. Karen Mazidi
Instructions:
1. Read in the csv file using pandas. Convert the author column to categorical data. Display the first few rows. Display the counts by author.
2. Divide into train and test, with 80% in train. Use random state 1234. Display the shape of train and test.
3. Process the text by removing stop words and performing tf-idf vectorization. Fit the vectorizer to the training data only, then apply it to both train and test. Output the training set shape and the test set shape.
4. Try a Bernoulli Naïve Bayes model. What is your accuracy on the test set?
6. Try logistic regression. Adjust at least one parameter in the LogisticRegression() model to see if you can improve on the results obtained with the default parameters. What are your results?
7. Try a neural network. Try different topologies until you get good results. What is your final accuracy?
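The following is a minimal sketch of how the split, vectorization, and modeling steps above might fit together in sklearn. It runs on a made-up toy corpus, not the Federalist data, so the accuracies it prints say nothing about the real task. The column names, the two toy authors, and every hyperparameter value (`C`, `class_weight`, `hidden_layer_sizes`, `max_iter`) are assumptions chosen only to make the example run; they are not suggested answers.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier

# Toy corpus (NOT the Federalist data): two made-up authors with distinct vocabularies.
docs_a = [f"the union commerce treasury revenue taxation paper {i}" for i in range(10)]
docs_b = [f"the faction republic liberty representation people essay {i}" for i in range(10)]
df = pd.DataFrame({"author": ["A"] * 10 + ["B"] * 10, "text": docs_a + docs_b})
df["author"] = df["author"].astype("category")

# 80/20 train/test split with the random state given in the instructions.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["author"], train_size=0.8, random_state=1234
)
print("train:", X_train.shape, "test:", X_test.shape)

# Remove English stop words; fit tf-idf on train only, then transform both sets.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print("train:", X_train_tfidf.shape, "test:", X_test_tfidf.shape)

# Three candidate classifiers; all settings here are illustrative assumptions.
models = {
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(
        C=10.0, class_weight="balanced", max_iter=1000
    ),
    "MLPClassifier": MLPClassifier(
        hidden_layer_sizes=(16, 8), max_iter=2000, random_state=1234
    ),
}

# Fit each model on the training vectors and record its test-set accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_tfidf))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Calling `fit_transform` on the training text and only `transform` on the test text keeps test-set vocabulary and document frequencies out of the learned tf-idf weights, which is the point of fitting to the training data only.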
Grading Rubric:
Element    Points
Step 1     10
Step 2     10
Step 3     20
Step 4     10
Step 5     20
Step 6     10
Step 7     20
Total      100