Description
Marks: 10
In this assignment you have to write a multi-threaded Python program for the following problem. Make sure you use Python version 3.10 or newer.
You are given a text document collection (in plain text format) along with the class labels. The documents in a folder correspond to a particular class. The goal is to produce the top k (k is an integer) unique word n-grams from the collection based on their class salience score. A word n-gram is a consecutive sequence of n words that appears in a document. The class salience score of an n-gram is defined as
\[ \text{score} = \frac{\text{count of the n-gram in a class}}{\#\ \text{documents in the class}} \]
Thus, if there are 20 classes, and a particular n-gram appears in all the classes, then the n-gram will have 20 scores (one for each class). The top k will be strictly based on descending order of the scores of the n-grams.
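The scoring step can be sketched as follows. This is a minimal, single-threaded illustration, not a reference solution. It assumes the score of an n-gram in a class is its total count in that class divided by the number of documents in the class, that "unique" means each n-gram is reported once with its highest class score, and that ties are broken alphabetically. The function names and the input layout (class label mapped to a list of token lists) are hypothetical.

```python
from collections import Counter


def salience_scores(class_docs, n):
    """Return {(label, ngram): score} for every n-gram seen in every class.

    class_docs maps a class label to a list of documents, where each
    document is already tokenized into a list of words (assumed layout).
    """
    scores = {}
    for label, docs in class_docs.items():
        counts = Counter()
        for words in docs:
            # Slide a window of n consecutive words over the document.
            counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        for gram, count in counts.items():
            # Count of the n-gram in the class / number of documents in the class.
            scores[(label, gram)] = count / len(docs)
    return scores


def top_k(scores, k):
    """Return the k best (ngram, label, score) triples, one per n-gram."""
    best = {}
    for (label, gram), score in scores.items():
        # Assumption: keep only the highest-scoring class for each n-gram.
        if gram not in best or score > best[gram][0]:
            best[gram] = (score, label)
    # Descending score; alphabetical n-gram order as an assumed tie-break.
    ranked = sorted(best.items(), key=lambda item: (-item[1][0], item[0]))
    return [(gram, label, score) for gram, (score, label) in ranked[:k]]
```

If every (class, n-gram) pair should be ranked separately instead, the deduplication loop in `top_k` could be dropped and the raw `scores` dictionary sorted directly.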
Tokenization rule (breaking documents into words): You must generate words from a document by breaking it on any non-alphanumeric character. You must also lowercase all the words.
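One possible reading of the tokenization rule is sketched below. It assumes "alphanumeric" means ASCII letters and digits only, so underscores and accented characters also act as separators. The function name `tokenize` is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it on every non-alphanumeric character."""
    # Collecting maximal runs of [a-z0-9] is equivalent to splitting on
    # everything else and discarding the empty strings between separators.
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("Hello, World! It's 2024_ok"))
# ['hello', 'world', 'it', 's', '2024', 'ok']
```

Note that `\W` is not used here because it treats the underscore as a word character, which would not match the rule as stated.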
Link to data: https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroupsmld/20_newsgroups.tar.gz
We will evaluate your program on a Linux system from the command line with the arguments as follows:
python <your-code.py> <path to data directory> <# threads> <value of n for n-gram> <value of k>
The above format is very important for evaluation. Thus, your program must accept its arguments in exactly this order.
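A minimal sketch of reading the four positional arguments and loading the documents with a thread pool might look like this. It assumes the data directory contains one sub-folder per class with one file per document, and it decodes files as latin-1 only because that encoding never raises a decode error. All function names are hypothetical.

```python
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_args(argv):
    """Read <data dir> <# threads> <n> <k> in that fixed order."""
    if len(argv) != 5:
        raise SystemExit(f"usage: python {argv[0]} <data dir> <# threads> <n> <k>")
    data_dir = Path(argv[1])
    threads, n, k = (int(value) for value in argv[2:5])
    if min(threads, n, k) < 1:
        raise SystemExit("threads, n and k must be positive integers")
    return data_dir, threads, n, k


def list_documents(data_dir):
    """Return (class label, file path) pairs, one sub-folder per class."""
    return [(class_dir.name, path)
            for class_dir in sorted(data_dir.iterdir()) if class_dir.is_dir()
            for path in sorted(class_dir.iterdir()) if path.is_file()]


def read_document(item):
    label, path = item
    # Assumed encoding: latin-1 maps every byte to a character.
    return label, path.read_text(encoding="latin-1")


def load_all(data_dir, threads):
    """Read every document using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map keeps the results in the same order as the input list.
        return list(pool.map(read_document, list_documents(data_dir)))


def main(argv):
    data_dir, threads, n, k = parse_args(argv)
    documents = load_all(data_dir, threads)
    print(f"read {len(documents)} documents using {threads} threads")
```

Calling `main(sys.argv)` under an `if __name__ == "__main__":` guard would wire this to the command line. Tokenizing and counting could be moved into the worker function so that each thread returns per-document counts for the main thread to merge. Because of CPython's global interpreter lock, threads mainly help with the file reading here rather than with the counting itself.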
Submission guidelines:
Important notes:
1. No credit will be given if your program does not run or produces wrong output.
2. No credit will be given if your program is not multithreaded.
4. It is your responsibility to check that the file has been submitted successfully.