In this project, we look at NYC Subway data to find out whether more people ride the subway when it’s raining than when it’s not.
We will combine New York City Subway data with weather data, and use statistical methods and data visualisation to draw conclusions about the subway from the data set we analyse.
The NYC public transportation agency, the Metropolitan Transportation Authority (MTA), provides data for download as CSV files. Among the information available is data from the subway turnstiles: weekly logs of cumulative entries and exits by turnstile and subway station over a given time frame.
For this project, we only used the information available here.
To follow along with the data analysis process, please see my GitHub repo for directions and instructions.
This project has three parts:
- Data Gathering: Collecting and transforming the data for the analysis section.
- Data Analysis: Data visualisations and statistical conclusions.
- MapReduce: A MapReduce program in Python to calculate the total number of entries for each UNIT (see metadata here).
Part 1: Data Gathering
In this section we acquire and clean the data to establish some basic facts about it through a statistical approach.
Our aim is to explore the relationship between data from the NYC Subway turnstiles and the city’s weather. For this, besides the subway data, we will also need data about the weather in NYC.
Part 1.1: Downloading data
We will be downloading data for June 2017. For this purpose we check for “1706” in each URL, to avoid extra downloads.
The above code downloads the data in three steps:
- Parses the HTML of the developer resources page on the MTA website, which hosts the turnstile data.
- Checks for “1706” in each URL so that only June 2017 data is downloaded.
- Saves each downloaded file under a consistent name format.
The following four files are added to the project directory after the download is complete:
Now we create four functions for data transformation (sketched after this list):
1. create_master_turnstile_file(filenames, output_file): Combines all four files into a single output_file, so the remaining steps don’t have to be repeated for each file.
2. filter_by_regular(filename): Keeps only the “REGULAR” scheduled audit events (these normally occur every four hours). Here filename is the output_file produced by the first function.
3. get_hourly_entries(df): Gets hourly entries. We use the pandas shift() method to calculate the difference between two successive cumulative counts. df is the dataframe imported from master_file.txt (created by the first function) and then filtered by the second function.
4. get_hourly_exits(df): Gets hourly exits, again using shift(). df is the same one we used in the previous function.
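A minimal sketch of these four functions, assuming the master-file column names shown below (the actual layout may differ):

```python
import pandas as pd

def create_master_turnstile_file(filenames, output_file):
    # Combine the downloaded files into one master file, writing a single
    # header line up front (assumes the raw files have no header of their own).
    with open(output_file, "w") as master:
        master.write("C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n")
        for name in filenames:
            with open(name) as f:
                master.writelines(f)

def filter_by_regular(filename):
    # Keep only the "REGULAR" scheduled audit events.
    df = pd.read_csv(filename)
    return df[df["DESCn"] == "REGULAR"]

def get_hourly_entries(df):
    # ENTRIESn is cumulative, so the difference between two successive
    # readings gives the entries for that interval; the fill value used
    # for the first row is a placeholder.
    df["ENTRIESn_hourly"] = (df["ENTRIESn"] - df["ENTRIESn"].shift(1)).fillna(1)
    return df

def get_hourly_exits(df):
    # Same idea for the cumulative exit counter.
    df["EXITSn_hourly"] = (df["EXITSn"] - df["EXITSn"].shift(1)).fillna(0)
    return df
```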
Part 2: Data Analysis
Now let’s see some statistical results from our data (the sketch after this list shows how such numbers can be computed). Some numerical statistics of this data are:
- It was rainy 44104 times out of 131951 records (almost one-third).
- The maximum temperature on foggy days is 81 degrees Fahrenheit.
- The distribution of hourly entries for rainy and non-rainy days.
Observation: The distribution is highly right-skewed in both cases, and hourly entries look significantly higher when it doesn’t rain, suggesting that more people use the subway when it is not raining.
- The mean of entries for both conditions.
It turns out the means show a contradicting result: the mean for rainy conditions (1105) is slightly higher than for non-rainy conditions (1090).
But since the data is highly skewed, we cannot trust the mean value alone. We will proceed with a visual interpretation of the distribution.
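A sketch of how these statistics might be computed, assuming a combined subway-and-weather file with rain, fog, maxtempi, and ENTRIESn_hourly columns (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("turnstile_weather.csv")  # hypothetical combined file

# How often was it raining? ("rain" assumed to be a 0/1 indicator per record.)
print((df["rain"] == 1).sum(), "rainy records out of", len(df))

# Maximum temperature on foggy days.
print(df.loc[df["fog"] == 1, "maxtempi"].max())

# Mean hourly entries under each condition.
print(df.groupby("rain")["ENTRIESn_hourly"].mean())

# Distribution of hourly entries for rainy vs. non-rainy records.
df[df["rain"] == 0]["ENTRIESn_hourly"].plot(kind="hist", bins=50, alpha=0.5, label="no rain")
df[df["rain"] == 1]["ENTRIESn_hourly"].plot(kind="hist", bins=50, alpha=0.5, label="rain")
plt.legend()
plt.show()
```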
Part 3: MapReduce
Now we will create a mapper function for the weather data.
In a preview of the weather data passed to the mapper, you can see that units are repeated multiple times. The mapper parses ENTRIESn_hourly (together with its UNIT) from each line and prints it to stdout. We feed the data file to the mapper via stdin, and the mapper results are saved in mapper_result.txt.
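A minimal sketch of such a mapper; the column indices assume UNIT is the second comma-separated field and ENTRIESn_hourly the seventh, which may need adjusting to the actual layout:

```python
# mapper.py - illustrative sketch.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.strip().split(",")
        # Skip the header row and anything malformed.
        if len(fields) < 7 or fields[1] == "UNIT":
            continue
        unit, entries_hourly = fields[1], fields[6]
        print("{0}\t{1}".format(unit, entries_hourly))

if __name__ == "__main__":
    mapper()

# Usage: python mapper.py < turnstile_weather.csv > mapper_result.txt
```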
A preview of mapper results:
Now, we’ll create the reducer.
Given the mapper result from the previous part (read from stdin), the reducer will print (not return) one line per UNIT, with the total number of ENTRIESn_hourly during June (which is our data duration), separated by a tab. The result will be saved to reducer_result.txt.
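A minimal sketch of such a reducer, assuming the mapper output is already grouped by UNIT:

```python
# reducer.py - illustrative sketch.
import sys

def reducer():
    entries_total = 0.0
    old_unit = None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        unit, entries = data[0], float(data[1])
        # A new key means the previous unit is finished: emit its total.
        if old_unit is not None and unit != old_unit:
            print("{0}\t{1}".format(old_unit, entries_total))
            entries_total = 0.0
        old_unit = unit
        entries_total += entries
    if old_unit is not None:
        print("{0}\t{1}".format(old_unit, entries_total))

if __name__ == "__main__":
    reducer()

# Usage: python reducer.py < mapper_result.txt > reducer_result.txt
```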
A preview of reducer results:
Each line contains a UNIT and the total number of hourly entries through it.
Assumption: You can assume that the input to the reducer is ordered so that all lines corresponding to a particular unit are grouped together. However, the reducer output may still contain repetition, as some units appear in more than one input file.
We compared the hourly trends for subway users when it rains versus when it doesn’t. From the distribution plot above, it looks like both distributions are highly skewed and, on average, more people use the subway when the sky is clear.
Then we used Hadoop Streaming to perform a MapReduce job in Python. Since Hadoop is written in Java, Hadoop Streaming allows us to write and execute MapReduce programs in any language.
The mapper function emits the number of hourly entries for every unit. And since the keys are already sorted in the input that stdin feeds to mapper(), there is no need to perform the intermediate shuffle-and-sort step. All we have to do is pass the mapper result to reducer(), which then computes an aggregated sum for each key (UNIT in this case), covering every turnstile interaction with subway users.
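Locally, assuming the input really is grouped by UNIT, the whole job can be simulated with a single pipe, e.g. python mapper.py < turnstile_weather.csv | python reducer.py > reducer_result.txt (file names as in the sketches above); on a cluster, Hadoop Streaming wires the same two scripts together.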
Not even a single mention of HDFS?
You must be wondering why I did not use HDFS. That’s because the goal of this project is to give practical experience of writing MapReduce jobs in Python. You can still use HDFS by downloading the Cloudera VM and running Hadoop jobs in a distributed file system on your local machine. It will run in pseudo-distributed mode, because it is just a single machine; in production, Hadoop runs on multiple commodity machines for distributed computing.
If you want to run this same analysis on HDFS on your local machine (pseudo-distributed mode), be sure to check out the documentation on how to set up the Cloudera QuickStart VM and get started with HDFS.
Ever wondered about the origin of the name Hadoop? Watch this interview with its creator, Doug Cutting.