Enron Emails
Folder /home/public/enron contains individual emails of former Enron employees. There is one file per email, and the subfolder name is the name of the employee. These emails were used in the actual trial, and the judge decided to release them to the public. (I have all of them, but I am sharing only a few with you so that you do not necessarily need to write a script to load the data. You are encouraged to write a script, but because you have only a few emails you can also insert them manually.)
Recently I read in the newspaper FakeInnovations that entrepreneur Bogus John Enron wants to fund a company with the same employees (those jailed will work from jail). Bogus John needs an HBase database that will track all the emails. The management of the company wants to be able to quickly query all emails for a user, all emails within a time period, and all emails for a given user within a time period.
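For those who choose to write a loading script, the following is a minimal sketch of how the files might be read with Python's standard library. It assumes each file is a plain-text message with a Date header, and that the layout is root/<employee>/.../<file>; the function name iter_emails is hypothetical.

```python
import os
from email import message_from_file
from email.utils import parsedate_to_datetime


def iter_emails(root):
    """Yield (user, sent_datetime, body) for every file under root/<user>/..."""
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        # Walk nested folders too, in case emails sit below the user folder.
        for dirpath, _dirs, files in os.walk(user_dir):
            for name in sorted(files):
                path = os.path.join(dirpath, name)
                # latin-1 never fails to decode; an assumption about the files.
                with open(path, encoding="latin-1") as fh:
                    msg = message_from_file(fh)
                if msg["Date"] is None:
                    continue  # skip files without a usable Date header
                # Assumes non-multipart messages, so the payload is a string.
                yield user, parsedate_to_datetime(msg["Date"]), msg.get_payload()
```

Calling list(iter_emails("/home/public/enron")) would then give one tuple per email, ready to be turned into rows.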
You have to perform the following tasks:
1. Create an HBase database model.
2. Import all emails into HBase.
3. Return the bodies of all emails for a user of your choice (as a single text file).
4. Return the bodies of all emails written during a particular month of your choice (as a single text file).
5. Return the bodies of all emails of a given user during a particular month, with both the user and the month of your choice (as a single text file).
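One possible way to approach the model in task 1 is to encode the user and the timestamp in the row key, because HBase stores rows sorted by key and range queries are cheapest over contiguous keys. The sketch below is an assumption, not the required design: the separator, the msg_id parameter, and all function names are hypothetical. It shows two key layouts (user first and time first, which could live in two tables) and the half-open key ranges that tasks 3 to 5 would query.

```python
from datetime import datetime

SEP = "|"  # hypothetical separator; assumed not to occur in user names


def ts(dt):
    """Fixed-width timestamp so that string order equals chronological order."""
    return dt.strftime("%Y%m%d%H%M%S")


def user_first_key(user, dt, msg_id):
    """Key layout for 'emails of a user' and 'emails of a user in a period'."""
    return SEP.join([user, ts(dt), msg_id])


def time_first_key(user, dt, msg_id):
    """Key layout for 'all emails in a period', regardless of user."""
    return SEP.join([ts(dt), user, msg_id])


def month_bounds(year, month):
    """Half-open [start, stop) timestamp prefixes covering one month."""
    ny, nm = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year:04d}{month:02d}", f"{ny:04d}{nm:02d}"


def user_range(user):
    """Half-open key range covering every email of one user."""
    return user + SEP, user + chr(ord(SEP) + 1)


def user_month_range(user, year, month):
    """Half-open key range covering one user's emails in one month."""
    start, stop = month_bounds(year, month)
    return user + SEP + start, user + SEP + stop
```

For example, user_month_range("lay-k", 2001, 10) yields the pair ("lay-k|200110", "lay-k|200111"), where "lay-k" is only an illustrative user name. The trade-off of this sketch is that every email is stored twice, once per layout, in exchange for each query being a single contiguous range.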
You can choose between two options:
1. Write an HBase script for all of the tasks. This is the route of least effort, but also the least rewarding.
2. The HBase instance on wolf offers RESTful access. You can use either Python or Java to perform the tasks.
Python: use the starbase module (https://github.com/barseghyanartur/starbase) or HappyBase (https://happybase.readthedocs.io/en/latest/). Note that HappyBase talks to HBase through the Thrift gateway rather than the REST interface.
The RESTful interface for HBase on wolf runs on port 20550, so with starbase you have to initialize the connection as c = Connection(port=20550)
To fetch a single record, use ‘get’; to fetch a range of records, use ‘scan’.
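The difference between the two operations can be illustrated without a running cluster. The sketch below is a pure-Python stand-in, not the HBase, starbase, or HappyBase API: the class TinyTable, the helper bodies_as_text, and the row key format user|timestamp|id are all hypothetical. It keeps rows sorted by key, so that get is an exact lookup and scan walks a half-open key range [start, stop).

```python
from bisect import bisect_left


class TinyTable:
    """In-memory stand-in for a table whose rows are sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> dict of column values

    def put(self, key, row):
        if key not in self._rows:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._rows[key] = row

    def get(self, key):
        """Fetch exactly one row, or None if the key is absent."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Yield (key, row) for every key with start <= key < stop."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1


def bodies_as_text(table, start, stop):
    """Join the bodies of all rows in a key range into one text blob."""
    return "\n".join(row["body"] for _key, row in table.scan(start, stop))
```

With keys of the assumed form, bodies_as_text(table, "lay-k|200110", "lay-k|200111") would collect one user's emails for one month, and the result can be written to a single text file with an ordinary open(..., "w") call.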