Initially, basic statistical testing shows no statistically significant difference between mean ridership during periods of rain and periods of no rain. However, when blocking for exogenous variables (weekend/weekday and time of day), a completely different picture starts to emerge. A key issue with this data set is that its window didn’t contain enough time periods with rain, despite a very high total sample size of 42,649 observations. When rain versus no-rain ridership is compared for specific time periods, mean ridership is higher during rain for some time periods, lower for others, and not statistically significantly different for the rest. The weekday results are the most reliable, since weekdays occupy 5 days a week versus 2, creating a larger sample to test. They show that subway ridership is lower during rain for the 20:00 time slot and higher during rain for the midnight time slot. The key conclusion I drew from this data is not earthshattering: subway traffic takes on different characteristics through the course of a day and the course of the week, and ridership in each time slot responds differently to rain.
Key to taking this forward would be to 1) expand the data set, and 2) subset the data even further than was done here (for example, flag holidays, look at each weekday individually, etc.). These things were not done due to issues with obtaining a larger data set (see “Sidenote on the data”).
In addition to answering the original question, this project also contains a few detours taken while trying to form an answer.
Data set: May 1, 2011 - May 31, 2011, New York City Metropolitan Transportation Authority (MTA). The data set was taken from Udacity as part of the Intro to Data Science course. You can access the data set from Udacity here:
https://www.dropbox.com/s/1lpoeh2w6px4diu/improved-dataset.zip?dl=0
Note: You don’t need to be a Udacity student to access the data set or the course content. The zipfile contains the CSV file used for this project and a PDF with a description of the variables.
The course content is free with registration; a subscription is required only if you want support and a course completion certificate:
https://www.udacity.com/course/intro-to-data-science--ud359
Originally, when I started this Udacity project, I did not feel the historical period in the data set provided by Udacity had enough time periods with rain. I felt that by expanding the time horizon, I could increase the sample size of rainy time periods and get more of a “smoking gun” in terms of pinning down exactly how rain affects subway ridership.
So with the STAT 497D project, I thought this would be a good opportunity to build a bigger data set and gain experience working with raw data in R rather than using Udacity’s. So, I went to the MTA website to get the data myself:
http://web.mta.info/developers/turnstile.html
I worked quite hard building the data set from scratch in R and was quite proud of myself when I finished. However, I made one huge oversight: I assumed Udacity had downloaded ALL of its data from the MTA website. Unfortunately for me, I figured out at the end of the data munging process that the MTA website only includes subway-related data and does not include any weather data. Udacity must have taken the subway data and joined it with weather data from a separate source.
You can see that process here:
http://rpubs.com/raymondcarl/92744
Important to note is that wrangling the raw MTA data is almost an entire R project in itself. Additionally, a lot of the problems I had with data wrangling were solved with a package called dplyr. There is a whole history behind the package, but I was amazed at how dplyr is almost like a second generation of R programming in terms of how fast and simple it makes working with data. Following that experience, most of the code I write in R now involves at least one use of dplyr.
Back to the original point: I was not deterred by the fact that the raw MTA data did not contain weather data. I thought, if Udacity could join the weather data to the MTA data, why couldn’t I? I planned to download the weather data with Python and then do the wrangling in R. The idea was to download all of the weather data and match it to the right subway stations according to the best-fit GPS coordinates.
Unfortunately, I discovered after building 2 different Python scripts that the data exists, but hourly historical data for the approx. 100 personal weather stations (“PWS”) scattered throughout NYC is not free. This data costs $180 for a 1-month subscription via openweathermap.org, and no student discounts exist. Weather Underground also has hourly historical data for the same PWS, but they didn’t respond to my emails and the price is not listed, so it is probably safe to assume it is expensive.
These Python scripts are fully functional and will download data from the two websites; however, I have deleted my API key to avoid breaching the API call limit on my account. You are certainly welcome to sign up for a free API key with both websites, paste in your key, and run the code. It should download the data, subject to the restrictions listed below.
http://www.wunderground.com Weather Underground allows you to call historical hourly data in JSON format for free. The problem is that under the free subscription, you can only make 500 calls per day. That seems like a lot, but with c. 100 PWS in NYC and each call returning only 1 station-day of data, 500 calls buys roughly 5 days of history per day of downloading, so I was hitting the limit very quickly. They never replied to any of my emails asking for a student discount given the nature of my data usage. Intuitively, I think this site powers quite a few weather-related applications, and app developers are their primary customer base.
You can see the Python code for the API client I made here (‘subway_data.py’):
https://github.com/raymondcarl/python_NYsubway_files.git
http://openweathermap.org With “open” in the title, this site seemed more promising, so I built another Python API client (modified from the Weather Underground one) to pull hourly historical data. That seemed to be going well, but I noticed I kept getting yesterday’s historical data no matter what time frame I requested in the call. After a few hours of troubleshooting, I discovered, buried deep in the subscription selection menu, that there is no limit on how many calls you can make for free data, but you are limited to yesterday’s historical data on the free subscription!
The API client modified for OpenWeatherMap is saved here (‘open_weather.py’):
https://github.com/raymondcarl/python_NYsubway_files.git
At this point, I almost caved in and bought the $180 subscription from OpenWeatherMap just because I wanted to see this project through from raw data to statistical conclusion. However, I decided against it due to the cost and the time constraints of the project due date, as I would still need to match the subway stations with the closest available PWS and join the weather data with the MTA data for the specified times. I also decided against it because $180 only gets you the last month’s hourly historical data, which wouldn’t put me in a much better place than the Udacity data set.
So in the end… I threw in the towel. It wasn’t a complete loss, because building an API client was something I had never done; I didn’t even know exactly what an API was. I now feel pretty confident accessing data via APIs and getting it out of JSON format into something useful. Also, in my personal project to dump Microsoft Excel and migrate to Python (pandas) and R, I made HUGE strides in learning how to manipulate data in R with dplyr and feel like I am now on the cusp of being able to uninstall Excel from my work PC. But at this point, I had to make a U-turn and go back to working with the Udacity data set.
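For anyone curious what that API-to-JSON pattern looks like on the R side (the actual scripts above are in Python), here is a minimal sketch using the jsonlite package. The URL, query parameters, and response field are placeholders, not a real endpoint:
library("jsonlite")
# Hypothetical endpoint and key -- substitute a real API URL and your own key.
url <- "http://api.example.com/history?station=SOME_PWS&date=20110501&key=YOUR_KEY"
resp <- fromJSON(url)        # downloads and parses the JSON response into R lists/data frames
hourly <- resp$observations  # hypothetical field name; depends on the API's schema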
Now back to answering the original question…
Load Required R Packages:
library("data.table")
library("dplyr")
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:data.table':
##
## between, last
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")
Setting working directory and loading the CSV file:
setwd("~/Dropbox/improved-dataset")
NY_subway <- data.table(read.csv("turnstile_weather_v2.csv"))
#Data sample
head(NY_subway)
## UNIT DATEn TIMEn ENTRIESn EXITSn ENTRIESn_hourly EXITSn_hourly
## 1: R003 05-01-11 00:00:00 4388333 2911002 0 0
## 2: R003 05-01-11 04:00:00 4388333 2911002 0 0
## 3: R003 05-01-11 12:00:00 4388333 2911002 0 0
## 4: R003 05-01-11 16:00:00 4388333 2911002 0 0
## 5: R003 05-01-11 20:00:00 4388333 2911002 0 0
## 6: R003 05-02-11 00:00:00 4388348 2911036 15 34
## datetime hour day_week weekday station latitude
## 1: 2011-05-01 00:00:00 0 6 0 CYPRESS HILLS 40.68995
## 2: 2011-05-01 04:00:00 4 6 0 CYPRESS HILLS 40.68995
## 3: 2011-05-01 12:00:00 12 6 0 CYPRESS HILLS 40.68995
## 4: 2011-05-01 16:00:00 16 6 0 CYPRESS HILLS 40.68995
## 5: 2011-05-01 20:00:00 20 6 0 CYPRESS HILLS 40.68995
## 6: 2011-05-02 00:00:00 0 0 1 CYPRESS HILLS 40.68995
## longitude conds fog precipi pressurei rain tempi wspdi
## 1: -73.87256 Clear 0 0 30.22 0 55.9 3.5
## 2: -73.87256 Partly Cloudy 0 0 30.25 0 52.0 3.5
## 3: -73.87256 Mostly Cloudy 0 0 30.28 0 62.1 6.9
## 4: -73.87256 Mostly Cloudy 0 0 30.26 0 57.9 15.0
## 5: -73.87256 Mostly Cloudy 0 0 30.28 0 52.0 10.4
## 6: -73.87256 Mostly Cloudy 0 0 30.31 0 50.0 6.9
## meanprecipi meanpressurei meantempi meanwspdi weather_lat weather_lon
## 1: 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 2: 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 3: 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 4: 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 5: 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 6: 0 30.23833 54.16667 8.25 40.70035 -73.88718
Firstly, the data arrives with 2 columns related to ridership: ENTRIESn_hourly and EXITSn_hourly. These variables represent the number of entries and exits at the respective turnstile unit at the respective station since the previous reading of that unit’s counter. (Note: these are created by taking the difference in the cumulative counters for each individual turnstile. It is not as trivial as it seems; you can see how to do it in the data wrangling link above.)
Creating a “net” entries variable (entries minus exits) should align more closely with actual ridership than entries or exits alone.
Also, the “gross” turnstile metrics have a zero lower bound, since you can’t have negative entries or exits. “Net” entries are not bounded at zero, which could help transform the data into something closer to a normal distribution. Without even looking at the actual data, we can surmise that the gross entries and exits variables have a long right skew.
# Create the net entries variable and plot the three densities with titles and axis labels.
NY_subway$ENTRIES_hourly_net <- NY_subway$ENTRIESn_hourly - NY_subway$EXITSn_hourly
ggplot(NY_subway, aes(x = ENTRIESn_hourly)) + geom_density() +
  labs(title = "Gross hourly entries", x = "Entries per 4-hour period")
ggplot(NY_subway, aes(x = EXITSn_hourly)) + geom_density() +
  labs(title = "Gross hourly exits", x = "Exits per 4-hour period")
ggplot(NY_subway, aes(x = ENTRIES_hourly_net)) + geom_density() +
  labs(title = "Net hourly entries", x = "Net entries (entries - exits) per 4-hour period")
Even though this is not a normal distribution, it at least now looks symmetrical, which may allow us to use parametric statistical tests.
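One quick way to eyeball that symmetry claim is a normal Q-Q plot; a minimal sketch using base R:
# Points near the reference line indicate approximate normality;
# heavy tails will bow away from it.
qqnorm(NY_subway$ENTRIES_hourly_net)
qqline(NY_subway$ENTRIES_hourly_net)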
summary(NY_subway$ENTRIES_hourly_net)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -31070.0 -167.0 95.0 525.1 811.0 28420.0
Before doing any comparison between rain and non-rain, what would be the main intuitive factors behind ridership patterns (without using any statistical analysis), in order of most to least important? Intuitively, factors such as time of day, weekday versus weekend, and station would top the list, with climate ranking only around #4 as an explanatory variable for ridership.
So when comparing rain versus non-rain, a block design is needed to eliminate the variability introduced by those first three factors. Assuming the above is true, a linear regression model using rain is not likely to have much explanatory power. Additionally, a regression model would add complexity that isn’t really necessary to determine whether ridership during rain periods is statistically significantly different from ridership during non-rain periods. Hence, basic t-testing is the most appropriate tool, since we are not trying to predict total ridership.
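As a quick preview of what that blocking looks like with dplyr (a sketch; the formal t-tests follow below), the data can be grouped by day type, time slot, and a rain indicator before comparing means:
NY_subway %>%
  mutate(rain_flag = precipi > 0) %>%
  group_by(weekday, TIMEn, rain_flag) %>%
  summarize(mean_net = mean(ENTRIES_hourly_net), n = n())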
T-test on all observations: Across the entire data set (prior to any blocking or segmentation), mean ridership shows no statistically significant difference between rain and non-rain time periods.
T-test:
Null Hypothesis: rain ENTRIES_hourly_net (mean) = non-rain ENTRIES_hourly_net (mean)
Alternative Hypothesis: rain ENTRIES_hourly_net (mean) != non-rain ENTRIES_hourly_net (mean)
Significance Level: 0.05
#This should actually be much easier with dplyr's group_by and do functions, but I couldn't quite get it to work. Would have been great to cut this amount of code in half.
y1 <- NY_subway %>%
  filter(precipi == 0) %>%
  select(ENTRIES_hourly_net)
y2 <- NY_subway %>%
  filter(precipi > 0) %>%
  select(ENTRIES_hourly_net)
t.test(y1, y2, var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: y1 and y2
## t = 0.71566, df = 3264.8, p-value = 0.4743
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -53.79891 115.64758
## sample estimates:
## mean of x mean of y
## 527.1483 496.2240
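As an aside to the comment in the code above: one way to skip building y1 and y2 entirely is the formula interface of t.test(), which splits the response by a two-level grouping. A minimal sketch that should reproduce the same Welch test:
# Welch's t-test of net entries, split by a rain indicator (Welch is the default).
t.test(ENTRIES_hourly_net ~ (precipi > 0), data = NY_subway)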
You may notice the use of the variable ‘precipi’. Originally, all of the R code and statistical testing was based on the binary variable ‘rain’, which I thought referred to whether or not it rained during the time period of each specific observation. But after getting some odd results, I read the fine print and discovered that it refers to whether or not rain occurred at any time during that day. Obviously, rain in the afternoon might not affect that morning’s ridership, so the ‘rain’ variable would not work, and all the code was subsequently changed to use ‘precipi’ = 0 and ‘precipi’ > 0.
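A quick way to see the mismatch between the two variables (a sketch): count the observations that fall on a flagged rain day but had no precipitation during that specific period.
NY_subway %>%
  filter(rain == 1 & precipi == 0) %>%
  summarize(dry_periods_on_rain_days = n())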
At this point, the hypothesis forming was that rain is a factor for weekend ridership but not for weekday ridership. For NYC, that makes sense, as people have a stronger ability to change their travel patterns on weekends, whereas work commitments would be a strong deterrent on weekdays.
Now, we extract a list of the time intervals at which the observations were recorded, using dplyr functions. From this we can see that all the data is binned into 6 four-hour time buckets.
times <- NY_subway %>%
  select(TIMEn) %>%
  distinct(TIMEn)
times
## TIMEn
## 1: 00:00:00
## 2: 04:00:00
## 3: 12:00:00
## 4: 16:00:00
## 5: 20:00:00
## 6: 08:00:00
# The dplyr result above is a data table of time values. Convert it to a
# character vector, which the t-test loop below needs.
times <- as.character(times$TIMEn)
The next few lines of code are really my crowning achievement with R & dplyr. I want to run rain versus non-rain t-tests for each of the 6 time periods, split by weekday and weekend. For 2-sample testing, that should lead to running 12 different t-tests. I have spared you the original code, and this might look long in itself, but let’s just say figuring this out with dplyr allowed me to remove a few feet in lines of code.
This will run the t-tests for the weekday time periods. It looks really nice, but it has to be duplicated for the weekend time periods. tapply and mapply were potential options I looked into, but I don’t see those working in this case because I’m trying to subset the data into so many different levels (I could be wrong though!).
weekday_ttest <- lapply(times, function(x) {
  t.test(
    filter(NY_subway, TIMEn == x & weekday == 1 & precipi == 0) %>% select(ENTRIES_hourly_net),
    filter(NY_subway, TIMEn == x & weekday == 1 & precipi > 0) %>% select(ENTRIES_hourly_net)
  )
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(weekday_ttest)){
est_mean_clear <- c(est_mean_clear,weekday_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,weekday_ttest[[i]]$estimate[2])
p_values <- c(p_values,weekday_ttest[[i]]$p.value)
}
#Build the data frame with a summary of the t-tests
weekday_summary<-data.table(times, p_values, est_mean_clear, est_mean_rain)
weekday_summary
## times p_values est_mean_clear est_mean_rain
## 1: 00:00:00 4.394022e-05 245.0258 585.67824
## 2: 04:00:00 6.617888e-01 16.8990 29.59427
## 3: 12:00:00 9.069500e-01 944.7769 970.89709
## 4: 16:00:00 5.692041e-01 836.2668 801.08367
## 5: 20:00:00 6.116940e-13 1201.7513 -760.75000
## 6: 08:00:00 4.551330e-01 456.0016 405.88455
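For what it’s worth, the extraction loop above could likely be collapsed with sapply(), avoiding the growing vectors inside the for loop. A sketch (weekday_summary_alt is just an illustrative name):
# Equivalent construction of the summary table without the for loop.
weekday_summary_alt <- data.table(
  times,
  p_values = sapply(weekday_ttest, function(tt) tt$p.value),
  est_mean_clear = sapply(weekday_ttest, function(tt) tt$estimate[1]),
  est_mean_rain = sapply(weekday_ttest, function(tt) tt$estimate[2])
)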
Now, the same thing for weekend data except with a twist…
The only difference here is that we run into a small-sample-size issue because (surprisingly) there are weekend rain observations only for the 4am and 8am time blocks.
# For weekend days, create a rain flag: 0 if precipi equals 0, 1 otherwise.
# Show the count of rain / no-rain observations for the weekends across the 6 time slots.
# NAs indicate there were no observations for that time slot / rain combination. As you can see, there is no rain for the 00:00, 12:00, 16:00, and 20:00 time slots, so we can't run t-tests for those slots.
NY_subway %>%
filter(weekday==0) %>%
mutate (new_rain_var = ifelse(precipi == 0,0,1)) %>%
group_by(TIMEn, new_rain_var) %>%
summarize(count = n()) %>%
spread(., key=new_rain_var, count, fill = NA)
## Source: local data table [6 x 3]
## Groups:
##
## TIMEn 0 1
## 1 00:00:00 2142 NA
## 2 04:00:00 1986 154
## 3 08:00:00 1335 152
## 4 12:00:00 2132 NA
## 5 16:00:00 2134 NA
## 6 20:00:00 2144 NA
#Create a new list of weekend times over which to apply the t-tests. I could do this manually, but I'd like to preserve as much flexibility as possible for any future data sets. By manually assigning the time slots to a list, I would just be creating a failure point when running the same R code on data sets with different time horizons.
weekend_rain_times <- NY_subway %>%
  filter(weekday == 0 & precipi > 0) %>%
  select(TIMEn) %>%
  distinct(TIMEn)
weekend_rain_times <- as.character(weekend_rain_times$TIMEn)
Run the weekend t-tests, extract the key results, and save them to a data frame summary:
weekend_ttest <- lapply(weekend_rain_times, function(x) {
  t.test(
    filter(NY_subway, TIMEn == x & weekday == 0 & precipi == 0) %>% select(ENTRIES_hourly_net),
    filter(NY_subway, TIMEn == x & weekday == 0 & precipi > 0) %>% select(ENTRIES_hourly_net)
  )
})
#extract the t-test data
p_values <- est_mean_clear <- est_mean_rain <- c()
for (i in 1:length(weekend_ttest)){
est_mean_clear <- c(est_mean_clear,weekend_ttest[[i]]$estimate[1])
est_mean_rain <- c(est_mean_rain,weekend_ttest[[i]]$estimate[2])
p_values <- c(p_values,weekend_ttest[[i]]$p.value)
}
#Build the data frame with a summary of the t-tests
weekend_summary<-data.table(weekend_rain_times, p_values, est_mean_clear, est_mean_rain)
weekend_summary
## weekend_rain_times p_values est_mean_clear est_mean_rain
## 1: 04:00:00 0.03270717 -9.620342 77.60390
## 2: 08:00:00 0.75330454 60.261423 54.58553
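One caveat before reading too much into any single p-value: 8 t-tests were run across the weekday and weekend blocks, so some adjustment for multiple comparisons is prudent. A sketch using base R's p.adjust():
# Bonferroni-adjust all 8 p-values; only the strongest weekday results survive.
p.adjust(c(weekday_summary$p_values, weekend_summary$p_values), method = "bonferroni")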
So, as mentioned in the Overview section, it’s very difficult to start drawing conclusions at this point despite seeing some statistically significant differences. I do think rain impacts subway ridership, but to really understand how, a larger data set is needed that would allow more granular subsetting. With a data set this small, we are very limited in how much subsetting we can do before running into not just small sample sizes, but zero sample sizes.
But just to highlight that I think a relationship exists, and to demonstrate ggplot, here is a graph showing overall weekend subway ridership during periods of rain and no rain.
no_rain <- NY_subway %>%
  filter(precipi == 0 & weekday == 0) %>%
  select(ENTRIES_hourly_net)
rain <- NY_subway %>%
  filter(precipi > 0 & weekday == 0) %>%
  select(ENTRIES_hourly_net)
# Map fill inside aes() so ggplot draws a legend; with fills set outside aes(),
# theme(legend.position = "top") had no legend to place.
ggplot() +
  geom_density(aes(x = ENTRIES_hourly_net, fill = "no rain"), data = no_rain, alpha = 0.2) +
  geom_density(aes(x = ENTRIES_hourly_net, fill = "rain"), data = rain, alpha = 0.2) +
  scale_fill_manual(name = NULL, values = c("no rain" = "blue", "rain" = "red")) +
  theme(legend.position = "top")