Description
Marks: 10
You will be given a file of text documents, where each line corresponds to one document. For a given word (say W), the goal is to find:
1) The top k words positively associated with W.
2) The top k words negatively associated with W.
The association is computed based on word co-occurrence in documents using pointwise mutual information (PMI) scores. A word must not contain anything other than English letters. While computing co-occurrence, you must lowercase all the words and you must also remove the stopwords available here: https://github.com/terrier-org/terrier-desktop/blob/master/share/stopword-list.txt
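The cleaning rules above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: tokens are split on whitespace, any token containing a non-letter character is discarded rather than trimmed (the brief does not say which), and the stopword set passed in the example is a tiny made-up sample rather than the full list at the URL above. The name `clean_tokens` is hypothetical.

```python
import re

# A valid word consists of English letters only (checked after lowercasing).
WORD_RE = re.compile(r"[a-z]+")


def clean_tokens(line, stopwords):
    """Lowercase one document and keep letters-only, non-stopword tokens.

    Assumption: tokens are whitespace-separated, and a token such as
    "fast," or "spark-sql" is dropped entirely instead of being trimmed.
    """
    kept = []
    for token in line.lower().split():
        if WORD_RE.fullmatch(token) and token not in stopwords:
            kept.append(token)
    return kept


if __name__ == "__main__":
    sample_stopwords = {"is", "and", "too"}  # made-up sample, not the real list
    print(clean_tokens("Spark is FAST, and Spark-SQL is fast too", sample_stopwords))
```

If punctuation attached to a word should be stripped instead of causing the token to be dropped, the split step would need to change; the brief leaves this open.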
$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
where P(w1, w2) = co/N and P(w) = m/N, with
co -> # documents where the two words appear together
m -> # documents where w is present
N -> # documents
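As a worked example of these definitions, the following sketch computes PMI from document counts on a toy corpus. The helper names `pmi` and `pmi_from_docs` and the four sample documents are made up for illustration. Returning `None` when the two words never co-occur is an assumption: the logarithm is undefined there and the brief does not say how to treat that case.

```python
import math


def pmi(co, m1, m2, n):
    """PMI from counts: co = docs with both words, m1/m2 = docs with each word, n = all docs."""
    if co == 0:
        return None  # assumption: log2(0) is undefined, so report "no score"
    return math.log2((co / n) / ((m1 / n) * (m2 / n)))


def pmi_from_docs(docs, w1, w2):
    """Count over a list of word sets (one set per document) and apply pmi()."""
    n = len(docs)
    m1 = sum(1 for d in docs if w1 in d)
    m2 = sum(1 for d in docs if w2 in d)
    co = sum(1 for d in docs if w1 in d and w2 in d)
    return pmi(co, m1, m2, n)


if __name__ == "__main__":
    toy_docs = [
        {"spark", "scala"},
        {"spark", "python"},
        {"spark", "scala", "jvm"},
        {"python", "pandas"},
    ]
    # spark/scala: co=2, m=3 and 2, N=4 -> log2((2/4) / ((3/4)*(2/4))) = log2(4/3) > 0
    print(pmi_from_docs(toy_docs, "spark", "scala"))
    # spark/python: co=1, m=3 and 2, N=4 -> log2((1/4) / ((3/4)*(2/4))) = log2(2/3) < 0
    print(pmi_from_docs(toy_docs, "spark", "python"))
```

A positive score means the pair shares documents more often than independent occurrence would predict; a negative score means less often.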
Your goal is to write a Spark program for the above problem. You can use either Scala or PySpark. Your code must have a main function.
Output format: The output must be printed on screen: first the list of positively associated words along with their PMI scores, then the list of negatively associated words along with their PMI scores.
We will evaluate your program on a Linux system from the command line with the following arguments:
spark-submit <your-code> <path to file> <query-word> <k>
where "query-word" is the given word and k is the number of top positively associated and top negatively associated words to report for "query-word". This format is very important for evaluation, so your program arguments must follow this sequence. Your program must have a main function.
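One possible shape for such a program in PySpark is sketched below. It is not a reference solution, and it rests on several assumptions the brief does not settle:

- The stopword list has been saved locally as `stopword-list.txt` (a hypothetical path).
- The query word is lowercased.
- Words that never co-occur with the query are skipped, because their PMI is undefined.
- "Positively" and "negatively" associated are read as PMI > 0 and PMI < 0.
- The tab-separated output layout is a guess.

The helper names `parse_args` and `doc_words` are made up.

```python
import math
import re
import sys
from operator import add

WORD_RE = re.compile(r"[a-z]+")
STOPWORD_FILE = "stopword-list.txt"  # hypothetical local copy of the stopword list


def parse_args(argv):
    """Read <path to file> <query-word> <k> in the order the brief fixes."""
    return argv[1], argv[2].lower(), int(argv[3])


def doc_words(line, stopwords):
    """Distinct lowercase, letters-only, non-stopword tokens of one document."""
    return {t for t in line.lower().split()
            if WORD_RE.fullmatch(t) and t not in stopwords}


def main(argv):
    # Imported here so the helpers above can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    path, query, k = parse_args(argv)
    with open(STOPWORD_FILE) as f:
        stopwords = {line.strip() for line in f if line.strip()}

    spark = SparkSession.builder.appName("pmi-association").getOrCreate()
    sc = spark.sparkContext
    stop = sc.broadcast(stopwords)

    docs = sc.textFile(path).map(lambda line: doc_words(line, stop.value)).cache()
    n = docs.count()                                    # N
    with_query = docs.filter(lambda ws: query in ws).cache()
    m_query = with_query.count()                        # m for the query word

    # m for every word, and co for every word sharing a document with the query.
    doc_freq = docs.flatMap(lambda ws: ws).map(lambda w: (w, 1)).reduceByKey(add)
    co_freq = (with_query.flatMap(lambda ws: ws - {query})
               .map(lambda w: (w, 1)).reduceByKey(add))

    # PMI = log2((co/N) / ((m_query/N) * (m_w/N))) = log2(co * N / (m_query * m_w))
    scores = co_freq.join(doc_freq).mapValues(
        lambda cm: math.log2(cm[0] * n / (m_query * cm[1])))

    # Assumption: sign of the score decides which list a word belongs to.
    positive = scores.filter(lambda kv: kv[1] > 0).takeOrdered(k, key=lambda kv: -kv[1])
    negative = scores.filter(lambda kv: kv[1] < 0).takeOrdered(k, key=lambda kv: kv[1])

    print("Positively associated words:")
    for word, score in positive:
        print(f"{word}\t{score:.4f}")
    print("Negatively associated words:")
    for word, score in negative:
        print(f"{word}\t{score:.4f}")
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 4:
        main(sys.argv)
    else:
        print("usage: spark-submit <your-code> <path to file> <query-word> <k>",
              file=sys.stderr)
```

Counting co-occurrences only over documents that contain the query word keeps the job from building all word pairs, which would be far more expensive on a large file.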
Submission guidelines:
Important notes:
1. No credit will be given if your program does not run or produces wrong output.
3. It is your responsibility to check that the file has been submitted successfully.