COMP 598 Homework 7 – Data Scraping (Solution)

30 pts
This is an INDIVIDUAL assignment: each student must complete it on their own. There are no teams for Homework 7.
Non-standard (i.e., not built-in) Python libraries you can use:
– pandas
– requests
– BeautifulSoup
Task 1: Scraping relationships (10 pts)
In lecture, we began work on a system for scraping the whosdatedwho website. Here, you need to finish that system.

Write a script collect_relationships.py that collects the relationships for a set of celebrities provided in a JSON configuration file as follows:

python scripts/collect_relationships.py -c <config-file.json> -o <output_file.json>

where config-file.json contains a single JSON dictionary with the following structure (the exact path and list of celebrities can, obviously, change):

{
"cache_dir": ".data/wdw_cache",
"target_people": [ "robert-downey-jr", "justin-bieber" ]
}
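
As a minimal sketch of the command-line handling described above, the script could parse the two flags and load the configuration as follows. The function names `parse_args` and `load_config` and the long option names `--config` and `--output` are assumptions made for illustration; the assignment only specifies `-c` and `-o`.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the -c and -o flags (long names are an illustrative assumption)."""
    parser = argparse.ArgumentParser(description="Collect celebrity relationships.")
    parser.add_argument("-c", "--config", required=True, help="path to the JSON config file")
    parser.add_argument("-o", "--output", required=True, help="path to the JSON output file")
    return parser.parse_args(argv)


def load_config(path):
    """Load the config dictionary and check that both expected keys are present."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    for key in ("cache_dir", "target_people"):
        if key not in config:
            raise KeyError(f"config is missing required key: {key}")
    return config
```

Checking the keys up front is a design choice, not a requirement of the assignment; it makes a malformed config fail before any page is fetched.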

Your script will then go and fetch the relationships for the target individuals. Note that the target people are indicated using the identifier that follows “/dating/”. All pages visited MUST be cached in the cache directory specified – as described in the lecture. This means that, if run twice on the same config file, it will use data exclusively from the cache the second time.
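
The caching requirement can be sketched roughly as below. This is not necessarily the scheme shown in lecture: naming cache files by a SHA-1 hash of the URL, and the helper names `cache_path` and `fetch_cached`, are assumptions. The `fetch` argument is any function that takes a URL and returns the page text, which keeps the sketch independent of the network.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir, url):
    """Map a URL to a file inside cache_dir (hash-based naming is an assumption)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def fetch_cached(url, cache_dir, fetch):
    """Return the page for url, calling fetch(url) only on a cache miss."""
    path = cache_path(cache_dir, url)
    if path.exists():
        # Cache hit: no network access at all.
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

With the requests library, `fetch` could be something like `lambda u: requests.get(u).text`. On a second run with the same config file, every URL hits the cache and `fetch` is never called.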

The output format for the file is:

{
"robert-downey-jr": [ "person-1", "person-2", "person-3" ],
"justin-bieber": []
}

The identifiers in each list are the people with whom that person has had a relationship. A person who has had no relationships gets an empty list.
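
One way to turn link targets into identifiers and write the output file is sketched below. It assumes the caller has already collected the `href` values from the relationship section of the page; which part of the page that is depends on the site's markup, which is not described here. The names `ids_from_hrefs` and `write_output` are hypothetical.

```python
import json
import re

# Capture whatever follows "/dating/" up to the next "/", "?" or "#".
DATING_RE = re.compile(r"/dating/([^/?#]+)")


def ids_from_hrefs(hrefs, exclude=None):
    """Return unique identifiers found after '/dating/', in first-seen order."""
    people = []
    for href in hrefs:
        match = DATING_RE.search(href)
        if not match:
            continue
        person = match.group(1)
        # Skip the target's own page and any repeated links.
        if person != exclude and person not in people:
            people.append(person)
    return people


def write_output(relationships, path):
    """Write the {target: [partners]} dictionary as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(relationships, fh, indent=4)
```

With BeautifulSoup, the `hrefs` list could come from `[a["href"] for a in section.find_all("a", href=True)]`, where `section` is whichever element holds the relationship entries. A target with no matching links naturally yields an empty list.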
Task 2: Getting course information (20 pts)
python scripts/scrape_courses.py -c <caching_dir> <page#>
Your script must cache to the directory specified. The page# indicates which URL will be loaded. The courses should be printed in CSV format to stdout with the following columns (header included): CourseID, Course Name, # of credits
You should assume that all courses will be delivered with structure like this:

Where “ACCT 626” is the CourseID, “Data Analytics in Accounting” is the course name, and “1.5” is the # of credits. If a course you encounter does NOT have this structure, ignore it. (Note that the number part of the course ID can contain letters as well, e.g., “ACCT 645D1”.)
Your MyCourses submission must be a single zip file entitled HW7_<studentid>.zip. It should contain the following items:
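
A rough sketch of the parsing and CSV step follows. The exact course layout is not reproduced in this text, so the sketch ASSUMES each course title is a single string of the form `ACCT 626 Data Analytics in Accounting (1.5 credits)` and that subject codes are four capital letters; adjust the pattern to the real markup. The names `parse_course` and `write_courses_csv` are hypothetical.

```python
import csv
import re
import sys

# ASSUMED title shape: "<SUBJ> <number> <name> (<n> credits)".
# The number part may carry a letter/digit suffix, e.g. "645D1".
COURSE_RE = re.compile(
    r"^\s*([A-Z]{4} \d{3}[A-Z0-9]*)\s+(.+?)\s+\((\d+(?:\.\d+)?) credits?\)\s*$"
)


def parse_course(title):
    """Return (course_id, name, credits) or None if the title does not match."""
    match = COURSE_RE.match(title)
    return match.groups() if match else None


def write_courses_csv(titles, out=sys.stdout):
    """Print the header plus one CSV row per well-formed course title."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["CourseID", "Course Name", "# of credits"])
    for title in titles:
        course = parse_course(title)
        if course is not None:  # courses without the expected structure are ignored
            writer.writerow(course)
```

Using `csv.writer` rather than joining fields by hand means a course name that contains a comma is quoted correctly.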
– scripts/
  o collect_relationships.py – script for Task 1
  o scrape_courses.py – script for Task 2
