UE19CS322 Big Data Assignment 1 Solved



Analysis of US Road Accident Data using MapReduce
This assignment consists of two tasks and focuses on running MapReduce jobs to analyse data recorded from accidents in the USA.
The files required for the assignment can be found here.
Assignment Objectives and Outcomes
1. This assignment will help students become familiar with the MapReduce programming environment and HDFS.
2. At the end of this assignment, the student will be able to write and debug MapReduce code.
Ethical practices
The Dataset
You will be provided with a link to the dataset on PESU Forum. You will be working with the following set of attributes.
Key Type Description
Severity integer Severity of the accident (between 1 and 4)
Start_Time datetime Start time of accident in local time zone
Start_Lat float Latitude as GPS coordinate of the start point
Start_Lng float Longitude as GPS coordinate of the start point
Description string Natural language description of the accident
Visibility(mi) float Visibility (in miles) during the accident
Precipitation(in) float Precipitation amount in inches, if there is any
Weather_Condition string Weather condition during the accident (rain, snow, thunderstorm, fog, etc.)
Sunrise_Sunset string Period of day (i.e. day or night) during the accident
Software/Languages to be used:
1. Python 3.8.x
2. Hadoop v3.2.2 only
Task 1: 2 marks
Task 2: 2 marks
Report: 1 mark
Tasks Overview:
1. Load the data into HDFS.
2. Create mapper.py and reducer.py for Task 1 and Task 2
3. Run your code on the sample dataset until you get the right answer
4. Submit the files to the portal
5. Submit a one-page report based on the template and answer the questions in the report
Submission Link
Portal for Big Data Assignment Submissions
Submission Guidelines
You will need to make the following changes to your mapper.py and reducer.py scripts to run them on the portal:
1. Include the following shebang on the first line of your code

2. Convert your files to an executable

3. Convert line breaks in DOS format to Unix format (this is necessary if you are coding on Windows – your code will not run on our portal otherwise)
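The three steps above can also be applied with a short Python helper. This is only a minimal sketch: the shebang shown (`#!/usr/bin/env python3`) is an assumption, since the exact line required by the portal is not reproduced here, and `prepare_script` is a hypothetical helper name. It rewrites CRLF line endings as LF, prepends the shebang if none is present, and sets the executable bits much as `chmod +x` would.

```python
import os
import stat

# Assumed shebang; replace it with the exact line the portal asks for.
SHEBANG = b"#!/usr/bin/env python3\n"


def prepare_script(path):
    """Normalise line endings, add a shebang if missing, and mark the file executable."""
    with open(path, "rb") as f:
        data = f.read()

    # DOS (CRLF) line breaks -> Unix (LF) line breaks.
    data = data.replace(b"\r\n", b"\n")

    # Only add a shebang when the file does not already start with one.
    if not data.startswith(b"#!"):
        data = SHEBANG + data

    with open(path, "wb") as f:
        f.write(data)

    # Add execute permission for user, group and others.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling `prepare_script("mapper.py")` and `prepare_script("reducer.py")` before uploading would cover both files.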

Check out a detailed list of submission guidelines here.
Task Specifications
Task 1
Problem Statement
Find record count per hour
Find the number of accidents occurring per hour that satisfy a set of conditions and display them in sorted order.
All the following conditions must be satisfied by a record.
Attribute Condition
Description Accident should result in either a “lane blocked”, “shoulder blocked” or an “overturned vehicle”
Severity >= 2
Sunrise_Sunset Night
Visibility(mi) <= 10
Precipitation(in) >= 0.2 inches
Weather_Condition Should either be “Heavy Snow”, “Thunderstorm”, “Heavy Rain”, “Heavy Rain Showers” or “Blowing Dust”
Ignore records that do not satisfy all of the conditions above. Additionally, ignore any record in which a required attribute is NaN. This task does not require any command line arguments.
Recommended module: datetime
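The filtering rules above could be sketched in a mapper roughly as follows. Several details here are assumptions rather than part of the specification: that each input line is one JSON object keyed by the attribute names, that `Start_Time` begins with `YYYY-MM-DD HH:MM:SS`, that the description phrases are matched as case-insensitive substrings, and that the mapper emits tab-separated `hour<TAB>1` pairs. The names `is_missing`, `accident_hour` and `map_lines` are hypothetical.

```python
import json
import math
from datetime import datetime

DESCRIPTION_PHRASES = ("lane blocked", "shoulder blocked", "overturned vehicle")
WEATHER_CONDITIONS = {
    "Heavy Snow", "Thunderstorm", "Heavy Rain", "Heavy Rain Showers", "Blowing Dust",
}
REQUIRED = (
    "Description", "Severity", "Sunrise_Sunset", "Visibility(mi)",
    "Precipitation(in)", "Weather_Condition", "Start_Time",
)


def is_missing(value):
    """Treat None, float NaN, and empty or 'nan' strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in ("", "nan")


def accident_hour(record):
    """Return the hour (0-23) if the record meets every condition, else None."""
    if any(is_missing(record.get(key)) for key in REQUIRED):
        return None
    description = str(record["Description"]).lower()
    matches = (
        any(phrase in description for phrase in DESCRIPTION_PHRASES)
        and int(record["Severity"]) >= 2
        and record["Sunrise_Sunset"] == "Night"
        and float(record["Visibility(mi)"]) <= 10
        and float(record["Precipitation(in)"]) >= 0.2
        and record["Weather_Condition"] in WEATHER_CONDITIONS
    )
    if not matches:
        return None
    # Assumed layout: "YYYY-MM-DD HH:MM:SS", optionally followed by fractional seconds.
    return datetime.strptime(record["Start_Time"][:19], "%Y-%m-%d %H:%M:%S").hour


def map_lines(lines):
    """Yield 'hour<TAB>1' for every input line that passes the filter."""
    for line in lines:
        try:
            hour = accident_hour(json.loads(line))
        except (ValueError, TypeError, AttributeError):
            continue  # skip lines that are malformed or not JSON objects
        if hour is not None:
            yield f"{hour}\t1"
```

In an actual mapper.py, the script body would be along the lines of `for out in map_lines(sys.stdin): print(out)`.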
Output Format
For each hour that contains accident data that satisfies the provided conditions, print the hour followed by the number of accidents in that hour on a separate line. For hours that do not contain any accident records, do not print anything.
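A reducer for this output could look something like the sketch below. It assumes the mapper emits tab-separated `hour<TAB>1` lines and that the hour and count are printed separated by a single space; the exact output layout is not reproduced here, so treat the separator as a placeholder. Hours are sorted numerically in the reducer because a plain text sort would place `10` before `2`. The function names are hypothetical.

```python
from collections import Counter


def count_per_hour(lines):
    """Sum the counts for each hour and return (hour, count) pairs sorted by hour."""
    counts = Counter()
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        hour, value = parts
        counts[int(hour)] += int(value)
    # Hours with no records never enter the Counter, so they are never printed.
    return sorted(counts.items())


def format_counts(pairs):
    """Render each (hour, count) pair as one output line (separator is an assumption)."""
    return [f"{hour} {count}" for hour, count in pairs]
```

A reducer.py built on this would print each line of `format_counts(count_per_hour(sys.stdin))`.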

Task 2
Problem Statement
Find record count per city and state
Find the number of accidents occurring per city and state where the distance between the start coordinates of the accident and a given pair of coordinates (LATITUDE, LONGITUDE) is within D. Use Euclidean distance to determine whether the calculated distance is within D.
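As a rough illustration, the distance check could be written as below. The distance is computed directly on the latitude and longitude values, so D is in the same units as the coordinates. Treating "within D" as inclusive (`<=`) is an assumption, and `within_distance` is a hypothetical name.

```python
import math


def within_distance(start_lat, start_lng, latitude, longitude, d):
    """Return True if the Euclidean distance between the two points is at most d."""
    distance = math.hypot(start_lat - latitude, start_lng - longitude)
    return distance <= d  # inclusive comparison is an assumption
```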
For each record, you will make a request to obtain the city and state information. The endpoint accepts only POST requests and expects a JSON payload containing a pair of start coordinates in the following format.

The endpoint responds with a JSON payload containing city and state information in the following format.
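The request could be made along the following lines. Everything specific in this sketch is a placeholder: the URL, the request keys `latitude` and `longitude`, and the response keys `city` and `state` are all hypothetical and must be replaced with the values given in the assignment. The `post` parameter exists only so the function can be exercised without a network connection.

```python
# Hypothetical placeholder; use the address provided with the assignment.
API_URL = "http://example.invalid/lookup"


def lookup_city_state(lat, lng, post=None, url=API_URL):
    """POST the start coordinates as JSON and return (city, state) from the reply."""
    if post is None:
        import requests  # third-party module, installed via pip3
        post = requests.post

    # Payload key names are assumptions, not the documented format.
    response = post(url, json={"latitude": lat, "longitude": lng}, timeout=10)
    response.raise_for_status()

    body = response.json()
    # Response key names are assumptions as well.
    return body["city"], body["state"]
```

Since one request is made per record, it may be worth applying the distance filter first so that only qualifying records trigger a lookup.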

You are required to take in 3 command line arguments in your mapper.py script in the format given below.
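Reading the three arguments might look like the sketch below. The order LATITUDE, LONGITUDE, D is an assumption, since the required format is not reproduced here, and `parse_args` is a hypothetical helper.

```python
def parse_args(argv):
    """Parse LATITUDE, LONGITUDE and D (assumed order) from a sys.argv-style list."""
    if len(argv) != 4:
        raise SystemExit("usage: mapper.py LATITUDE LONGITUDE D")
    latitude, longitude, d = (float(arg) for arg in argv[1:4])
    return latitude, longitude, d
```

In mapper.py this would be called as `parse_args(sys.argv)`.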
Recommended module: requests (to be installed via pip3)
You will not be allowed to install any other libraries or use any other APIs to execute your code.
Output Format
For each state, you will first have to display the name of the state. Following this, you will have to determine the number of accidents that occur in each city in that state, and display each city’s count on a separate line. You do not have to display cities where the count is zero. Finally, display the state again and the total number of accidents for that entire state.
D = 5.3
Taking the last state ME as an example, the counts for the cities Ellsworth, Hope and Trenton are determined to be 3, 1 and 1 respectively. Hence, the total count for the state is 5.
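The grouping described above can be sketched as a reducer like the one below. It assumes the mapper emits tab-separated `state<TAB>city<TAB>1` lines, that states and cities are listed alphabetically, and that names and counts are separated by a single space; none of these details is confirmed by the text here. `summarise` is a hypothetical name.

```python
from collections import defaultdict


def summarise(lines):
    """Return output lines: state, each city with its count, then state with its total."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        state, city, value = parts
        counts[state][city] += int(value)

    output = []
    for state in sorted(counts):
        output.append(state)
        for city in sorted(counts[state]):
            output.append(f"{city} {counts[state][city]}")
        # Cities with a zero count never appear, because they are never emitted.
        output.append(f"{state} {sum(counts[state].values())}")
    return output
```

Feeding it three Ellsworth records, one Hope record and one Trenton record for ME gives per-city counts of 3, 1 and 1 and a state total of 5, matching the example above.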

Helpful Commands
Running the MapReduce Job without Hadoop
A MapReduce job can also be run without Hadoop. Although slower, this utility helps you debug faster and helps you isolate Hadoop errors from code errors.
cat path_to_dataset | python3 mapper.py [command line arguments] | sort -k 1,1 | python3 reducer.py [command line arguments] > output.txt
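The pipeline above mirrors what the framework does: map every record, sort the intermediate pairs by key, then reduce each group of equal keys. As an illustrative sketch only, the same flow can be reproduced in a single Python process. `run_local`, `map_fn` and `reduce_fn` are hypothetical names, and this is not how Hadoop itself is implemented.

```python
from itertools import groupby


def run_local(records, map_fn, reduce_fn):
    """Map each record, sort the (key, value) pairs by key, then reduce per key."""
    pairs = [pair for record in records for pair in map_fn(record)]
    pairs.sort(key=lambda pair: pair[0])  # plays the role of `sort -k 1,1`
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]
```

For example, `run_local(["a", "b", "a"], lambda r: [(r, 1)], lambda k, vs: (k, sum(vs)))` counts how often each record occurs.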
Starting Hadoop
If you are running Hadoop for the first time, run

Hadoop can be started using the following command.

You can view all the Java processes running on your system using jps.
After running jps you should see the following processes running (in any order) along with their process IDs:

HDFS Operations
HDFS supports the usual file operations, and its commands closely resemble the file system commands available on Linux. You can access HDFS from the command line using hdfs dfs, prefixing the file system command with a hyphen (-).
Loading a file into HDFS
A file can be loaded into HDFS using the following command.

Listing files on HDFS
Files can be listed on HDFS using

Similarly, HDFS also supports -mkdir, -rm and more.
Running a MapReduce Job
A MapReduce job can be run using the following command

-mapper absolute_path_to_mapper.py command_line_arguments
-reducer absolute_path_to_reducer.py command_line_arguments
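For anyone launching the job from a script, the argument list could be assembled as in the sketch below. It relies on the standard Hadoop Streaming options `-input`, `-output`, `-mapper` and `-reducer`. The location of the streaming jar depends on the installation, so it is passed in as a parameter rather than guessed, and `streaming_command` is a hypothetical helper.

```python
def streaming_command(streaming_jar, input_path, output_path,
                      mapper, reducer, mapper_args=(), reducer_args=()):
    """Build the argument list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,      # path to the dataset on HDFS
        "-output", output_path,    # HDFS output directory for the job
        # The script path and its command line arguments form a single value.
        "-mapper", " ".join([mapper, *map(str, mapper_args)]),
        "-reducer", " ".join([reducer, *map(str, reducer_args)]),
    ]
```

The resulting list can be handed to `subprocess.run(command, check=True)` once Hadoop is running.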

