Does rain affect NYC’s subway ridership?

Dataset: New York City Metropolitan Transportation Authority (MTA) turnstile data
Window period: 26 April 2015 - 15 June 2015

The original dataset provided by Udacity covered 1 May 2011 to 31 May 2011 and did not have enough rainy days for statistical testing once the data was segmented to more granular levels (i.e. into weekday/weekend, time periods, etc.). This caused some sample size related issues. Instead, a larger window of more recent data was taken. However, this set requires more data wrangling than what was required with the Udacity dataset.

Parts:

  1. Data Acquisition / Wrangling
  2. Data Exploration

Data Acquisition / Wrangling

If you wish to skip this stage, you can access the final product that will be used for the statistical testing here:

https://www.dropbox.com/s/oq2am8ffdmlclnu/subway_data.csv?dl=0

Downloading the turnstile data from the MTA website. The files can be downloaded by conventional means, but it is done here within R to show the entire process.

setwd("/Users/raymondcarl/Dropbox/")
library("dplyr")
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("data.table")
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, last
url1="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150620.txt"
url2="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150613.txt"
url3="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150606.txt"
url4="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150530.txt"
url5="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150523.txt"
url6="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150516.txt"
url7="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150509.txt"
url8="http://web.mta.info/developers/data/nyct/turnstile/turnstile_150502.txt"

#Uncomment if the download is actually required, as it will cause a substantial drag in performance:
#download.file(url1,"mta_150620.txt")
#download.file(url2,"mta_150613.txt")
#download.file(url3,"mta_150606.txt")
#download.file(url4,"mta_150530.txt")
#download.file(url5,"mta_150523.txt")
#download.file(url6,"mta_150516.txt")
#download.file(url7,"mta_150509.txt")
#download.file(url8,"mta_150502.txt")

Data loading:

The data is imported into an R data table (a type of data frame); by all accounts, data tables are faster and easier to work with. However, I have found out the hard way that the syntax is not one-to-one between the base R language and the commands used for a data table. Any slight improvement in speed is welcome when working on a larger dataset and a not-so-fast computer.

Here is some discussion on data tables: http://stackoverflow.com/questions/18001120/what-is-the-practical-difference-between-data-frame-and-data-table-in-r
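As a quick illustration of the syntax difference, here is a toy sketch with made-up values (separate from the analysis itself):

# A small data table for demonstration
toy<-data.table(station=c("A","A","B"), entries=c(10,20,30))

# Base-R data frame style: subset with [rows, columns]
toy[toy$station=="A", ]

# data.table style: filter and aggregate inside the brackets
toy[station=="A", sum(entries)]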

# Load the text files into data tables
master_subway_data<-data.table(read.csv("mta_150620.txt", header=TRUE, sep=","))
b_data<-data.table(read.csv("mta_150613.txt", header=TRUE, sep=","))
c_data<-data.table(read.csv("mta_150606.txt", header=TRUE, sep=","))
d_data<-data.table(read.csv("mta_150530.txt", header=TRUE, sep=","))
e_data<-data.table(read.csv("mta_150523.txt", header=TRUE, sep=","))
f_data<-data.table(read.csv("mta_150516.txt", header=TRUE, sep=","))
g_data<-data.table(read.csv("mta_150509.txt", header=TRUE, sep=","))
h_data<-data.table(read.csv("mta_150502.txt", header=TRUE, sep=","))

# Save the dataframe names as strings.
df_names<-c("b_data","c_data","d_data","e_data","f_data","g_data","h_data")

# Merge into one data table (master_subway_data)
for (i in seq_along(df_names)) {
master_subway_data<-rbind(master_subway_data,eval(parse(text=df_names[i])))
}

# Remove the partial datasets
rm(list=df_names)
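As an aside, data.table offers a one-call alternative to growing the table with rbind() inside a loop (which copies the accumulated table on every iteration). A sketch of the same merge using fread() and rbindlist(), not run here:

# Read all weekly files and bind them in a single call:
weekly_files<-c("mta_150620.txt","mta_150613.txt","mta_150606.txt",
                "mta_150530.txt","mta_150523.txt","mta_150516.txt",
                "mta_150509.txt","mta_150502.txt")
#master_subway_data<-rbindlist(lapply(weekly_files, fread))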

Data Wrangling:

# A "serial number" named "serial" is created a concatenated column in order to create a unique identifier for each turnstile unit.
master_subway_data$unit_serial<-paste(master_subway_data$C.A,master_subway_data$UNIT,master_subway_data$SCP,sep='')

# The observations are sorted by the newly created serial, then date, then time
master_subway_data<-arrange(master_subway_data,unit_serial,DATE,TIME)

# The fields ENTRIES and EXITS do not show the entries and exits for each period.  They are the running totals on the turnstile counters.  To get actual entries and exits, the difference between the counter at t and t-1 must be taken.

#Note:  This is done using dplyr features.

master_subway_data<-master_subway_data %>%
  group_by(unit_serial) %>%  #group_by is needed because the differencing must be performed within the observations for each individual turnstile unit.
  mutate(entries_hourly= ENTRIES - lag(ENTRIES))

# Do the same thing for the exits
master_subway_data<-master_subway_data %>%
  group_by(unit_serial) %>%
  mutate(exits_hourly= EXITS - lag(EXITS))
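To make the differencing concrete, here is a toy illustration with made-up counter readings (not part of the dataset):

# Hypothetical counter readings for a single turnstile:
counter<-c(100, 130, 170)
counter - lag(counter)  # returns NA 30 40; the NA is the first observation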

Data Cleaning

Several issues need to be addressed.

  1. There are some errors in the MTA data where the counters behind ENTRIES and EXITS actually run in reverse, creating negative entries or exits. This could indicate a turnstile replacement, where the effect on the counter was probably not a concern for the installers. Or it could simply be a broken counter that decrements when someone enters. We can find these instances by searching for negative values of entries_hourly or exits_hourly.

  2. We need to delete the first observation for each turnstile because we don't have a prior turnstile count to calculate the actual entrance and exit traffic.

# Taking a closer look at the units with negative entries, negative exits, or NA (we already know the NAs are generated by lag() for the first observation of each turnstile unit).
suspect_data<-filter(master_subway_data, entries_hourly<0 | exits_hourly<0 | is.na(entries_hourly) | is.na(exits_hourly))

suspect_data_summary<-suspect_data %>%
                      group_by(unit_serial) %>%
                      summarise(error_count = n())

#The summary of the suspect data shows that, on average, a machine with issues shows up about 11 times.  However, one machine has 303 error occurrences.  That should be checked more closely, as it's an indicator that something is not right, and maybe all the data from that turnstile should be removed.
summary(suspect_data_summary)
##  unit_serial         error_count    
##  Length:4570        Min.   :  1.00  
##  Class :character   1st Qu.:  7.00  
##  Mode  :character   Median :  8.00  
##                     Mean   : 10.85  
##                     3rd Qu.: 10.00  
##                     Max.   :303.00
#Actually, quite a few turnstiles have many issues.  Taking the top one (N305R01701-03-04), let's look for any indicator of what the problem is.
arrange(suspect_data_summary,desc(error_count))
## Source: local data table [4,570 x 2]
## 
##          unit_serial error_count
## 1   N305R01701-03-04         303
## 2   A011R08001-00-03         301
## 3   N062R01101-00-01         301
## 4  R161AR45201-06-02         299
## 5   A025R02301-03-02         297
## 6   A069R04401-06-01         297
## 7  N063AR01100-00-04         296
## 8  N063AR01100-00-05         296
## 9  N063AR01100-00-08         296
## 10  R148R03301-00-01         296
## ..               ...         ...
#Obviously, something is very wrong with this unit, since its counter decreases so many times.
bad_unit_investigation<-master_subway_data %>%
  filter(unit_serial=="N305R01701-03-04") %>%
  select(.,DATE, TIME, DESC, ENTRIES, EXITS,entries_hourly )
head(bad_unit_investigation,15)
##          DATE     TIME    DESC   ENTRIES     EXITS entries_hourly
## 1  06/13/2015 00:00:00 REGULAR 335273399 1.594e+09             NA
## 2  06/13/2015 04:00:00 REGULAR 335273285 1.594e+09           -114
## 3  06/13/2015 08:00:00 REGULAR 335273246 1.594e+09            -39
## 4  06/13/2015 12:00:00 REGULAR 335273126 1.594e+09           -120
## 5  06/13/2015 16:00:00 REGULAR 335272834 1.594e+09           -292
## 6  06/13/2015 20:00:00 REGULAR 335272536 1.594e+09           -298
## 7  06/14/2015 00:00:00 REGULAR 335272374 1.594e+09           -162
## 8  06/14/2015 04:00:00 REGULAR 335272303 1.594e+09            -71
## 9  06/14/2015 08:00:00 REGULAR 335272276 1.594e+09            -27
## 10 06/14/2015 12:00:00 REGULAR 335272139 1.594e+09           -137
## 11 06/14/2015 16:00:00 REGULAR 335271897 1.594e+09           -242
## 12 06/14/2015 20:00:00 REGULAR 335271638 1.594e+09           -259
## 13 06/15/2015 00:00:00 REGULAR 335271541 1.594e+09            -97
## 14 06/15/2015 04:00:00 REGULAR 335271521 1.594e+09            -20
## 15 06/15/2015 08:00:00 REGULAR 335271448 1.594e+09            -73

What's the solution?

As a reminder, we have to delete the first record for each turnstile unit as we don't have the prior counter reading in this data set.

But for the units with counter oddities, a more difficult decision has to be made.

Option 1: Delete ALL records of units showing a backward movement in the counter. This is the most conservative option in terms of data quality because we exclude any turnstiles that have data issues. From a statistical point of view, we are decreasing the size of our sample. Sample size would not be a huge issue here given we have over 1 million records, but we could be introducing bias into the sample by removing the data from an entire turnstile.

Option 2: Delete only the records showing negative entries or exits. This is an issue for data quality because the rollback of a turnstile's counter could be an indicator that the turnstile is producing bad data for all of its records. However, this option preserves a higher sample size and reduces the chances of introducing bias into the sample.

Given the tradeoffs, it seems like Option 2 is the best avenue. This creates issues where the entry and exit counts for the row following a deletion may be distorted, but it seems to be the lesser of two evils.

#Delete the first record of each turnstile using slice:

master_subway_data<-master_subway_data %>%
  group_by(unit_serial) %>%
  slice(-1)

#Delete any rows with negative entries_hourly or exits_hourly:

master_subway_data<-master_subway_data %>%
                    filter(entries_hourly >= 0 & exits_hourly >= 0)
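As a quick sanity check, the minimum of both columns should now be non-negative:

# Verify that no negative differences survived the filter:
min(master_subway_data$entries_hourly)
min(master_subway_data$exits_hourly)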

Light housekeeping:

  1. Drop the columns that were concatenated into the unit_serial column.

  2. There's a lot of repeated data here. We don't need to know the station, linename, and division for every single turnstile on every single row. Since we are using a data table, and we're not necessarily copying the entire table each time we do a small operation, it might not make a difference, but I will create a table that maps the station name, linename, and division to each turnstile. This will allow us to shrink the size of the table.

#Delete tables no longer needed following data investigation
rm(bad_unit_investigation,suspect_data,suspect_data_summary)

#Remove unwanted columns
master_subway_data<-select(master_subway_data,-UNIT, -C.A, -SCP)
head(master_subway_data)
##        unit_serial       STATION LINENAME DIVISION       DATE     TIME
## 1 A002R05102-00-00 LEXINGTON AVE   NQR456      BMT 06/13/2015 04:00:00
## 2 A002R05102-00-00 LEXINGTON AVE   NQR456      BMT 06/13/2015 08:00:00
## 3 A002R05102-00-00 LEXINGTON AVE   NQR456      BMT 06/13/2015 12:00:00
## 4 A002R05102-00-00 LEXINGTON AVE   NQR456      BMT 06/13/2015 16:00:00
## 5 A002R05102-00-00 LEXINGTON AVE   NQR456      BMT 06/13/2015 20:00:00
## 6 A002R05102-00-00 LEXINGTON AVE   NQR456      BMT 06/14/2015 00:00:00
##      DESC ENTRIES   EXITS entries_hourly exits_hourly
## 1 REGULAR 5181959 1753215             30            7
## 2 REGULAR 5181981 1753241             22           26
## 3 REGULAR 5182091 1753342            110          101
## 4 REGULAR 5182349 1753404            258           62
## 5 REGULAR 5182661 1753457            312           53
## 6 REGULAR 5182855 1753481            194           24
#Create turnstile to station data map
station_turnstile_map<-master_subway_data %>%
  select(STATION, LINENAME, DIVISION, unit_serial) %>%
  distinct(.)

#Remove the station data as it's no longer needed.  Station-related information
#can be recovered from station_turnstile_map via a join if needed.
master_subway_data<-select(master_subway_data,-STATION, -LINENAME, -DIVISION)
head(master_subway_data)
##        unit_serial       DATE     TIME    DESC ENTRIES   EXITS
## 1 A002R05102-00-00 06/13/2015 04:00:00 REGULAR 5181959 1753215
## 2 A002R05102-00-00 06/13/2015 08:00:00 REGULAR 5181981 1753241
## 3 A002R05102-00-00 06/13/2015 12:00:00 REGULAR 5182091 1753342
## 4 A002R05102-00-00 06/13/2015 16:00:00 REGULAR 5182349 1753404
## 5 A002R05102-00-00 06/13/2015 20:00:00 REGULAR 5182661 1753457
## 6 A002R05102-00-00 06/14/2015 00:00:00 REGULAR 5182855 1753481
##   entries_hourly exits_hourly
## 1             30            7
## 2             22           26
## 3            110          101
## 4            258           62
## 5            312           53
## 6            194           24
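If station-level detail is needed later, it can be re-attached with a join against the map. For example, a sketch using dplyr's left_join (with_stations is a hypothetical name):

# Recover station, linename, and division by joining on the turnstile serial:
with_stations<-left_join(master_subway_data, station_turnstile_map,
                         by="unit_serial")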

This data set should be a lot easier to work with and (hopefully) will run faster when we use the more complicated functions.

The Data Wrangling stage is now (generally) complete, and hopefully no further major manipulation will be needed. This dataset is written to a CSV file, and the final versions can be accessed here:

https://www.dropbox.com/s/oq2am8ffdmlclnu/subway_data.csv?dl=0
https://www.dropbox.com/s/u60jrzrjir9xqp6/station_turnstile_map.csv?dl=0

#Uncomment if needed.  Otherwise, writing these files is a heavy operation on your computer.

#write.table(master_subway_data, file="subway_data.csv", sep = ",", col.names=TRUE)
#write.table(station_turnstile_map, file="station_turnstile_map.csv", sep = ",", col.names=TRUE)
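To pick the analysis back up from the saved files, the CSVs can be read straight back into data tables. A sketch, noting that write.table as used above also writes row names as an unnamed first column, which read.csv treats as row names:

#master_subway_data<-data.table(read.csv("subway_data.csv"))
#station_turnstile_map<-data.table(read.csv("station_turnstile_map.csv"))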

Actually, I concluded the data analysis too early! To my amazement, the weather data is not part of the MTA data; the MTA data only has ridership-related information. Getting the weather will be a whole different process. This could be done in R, but I did that part in Python. You can see the Python script here. However, you won't be able to run the Python code yourself because I had to mask the API key, as www.wunderground.com will block my account if I hit the ceiling on the number of requests coming in under my key. You can, however, visit the site and create a key for free just as I did.
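For reference, the same request pattern could be sketched in R with jsonlite. The history endpoint format below is an assumption based on the Weather Underground API as documented at the time (it has since been retired), and YOUR_KEY is a placeholder for a real key:

library("jsonlite")

# Placeholder -- substitute a key obtained from www.wunderground.com
api_key<-"YOUR_KEY"

# Fetch one day of NYC weather history; date is a "YYYYMMDD" string.
# (Endpoint format assumed from the history API docs; one request per
# day, so mind the free-tier request ceiling mentioned above.)
get_weather<-function(date) {
  url<-paste0("http://api.wunderground.com/api/", api_key,
              "/history_", date, "/q/NY/New_York.json")
  fromJSON(url)
}

#weather<-get_weather("20150501")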