Assignment

For this week’s homework:

Identify a data with multilevel structure
You can use your own data if it has a multilevel structure
Or you can search for one on Social Explorer, Kaggle, or data.world Analyze a continuous dependent variable
Begin with a complete-pooling model
Followed by no-pooling model
Then random intercept model
Finally random slope model Discuss your findings

This homework is due by the end of Nov. 11.

Introduction & Data

NYC OpenData takes the public data created by NYC agencies and organizations and makes it available for public use in an open repository. There are over 1,300 data sets available on the NYC open data portal from almost 60 different city agencies. The collection is organized by category as well as by agency or organization, and each set includes descriptions, any collection methods used, and other contextual information to make the data easier to use and understand. Though privacy issues mean that not all data is eligible to become open data, the city is trying to make the process of reviewing documents for inclusion in NYC OpenData as transparent as possible. Currently, all of the data sets are available in machine-readable formats, and they are automatically refreshed whenever new data is available, making it easy for researchers, developers, and entrepreneurs to access the most up-to-date information. I wanted to use this data to learn more about subway ridership by borough. Specifically, I wanted to see if rainy weather increases or decreases subway ridership, and if the effect of rainy weather is the same across stations in all five boroughs. (Yes, Staten Island has a railway, too.)

I thought https://data.ny.gov/ would be a great place to start for the subway ridership data, but the weather data was a little trickier to track down. The MTA dataset doesn’t include information on rain, because weather data comes from a separate agency. Conveniently, Kaggle user Mathijs Waegemakers had already compiled 2016 NYC weather data from the Central Park weatherstation into a simple CSV (https://www.kaggle.com/mathijs/weather-data-in-new-york-city-2016/), and the turnstile data for 2016 was also available on NYC’s data.gov site (https://catalog.data.gov/dataset/turnstile-usage-data-2016). I planned to compile these two datasets into a single one that I could parse, but the MTA Turnstile Usage dataset turned out to be so extremely large that my aging laptop was struggling to process it.

I also figured I was not the first person to be interested in weather-related subway ridership, so I looked for an already-combined dataset of MTA & weather statistics, and I found a 2011 dataset from an old “intro to data science” Udacity course. The course materials were inaccessible to me, but the dataset was free to download at: https://www.dropbox.com/s/1lpoeh2w6px4diu/improved-dataset.zip?dl=0

This dataset is not as comprehensive as the actual MTA dataset, but I was able to work with more efficiently on my old laptop, and it includes turnstile information from 5 boroughs (just one from Staten Island, but I’ll take it!) and weather data. Therefore, I have the info I need to find out if rainy weather increases subway ridership, and if the effect of rainy weather is the same for all stations. This requires testing several approaches–including complete pooling, ecological regression, no pooling, and partial pooling–to see which is the best model for answering this question.

The data has 42,649 observations across 27 variables. I am most interested in the number of turnstile entries (ENTRIESn_hourly), subway station (station), whether it is raining at the time (rain), and the amount of precipitation at the time and location (precipi).

Results

library(dplyr)
library(nlme)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
library(readr)
subway <- read_csv("/Users/meredithpowers/Desktop/turnstile.csv")
head(subway)
# We know there are 472 actual subway stations in operation (424 if you count the connected transfer stations as single stations), but how many subway stations are in this dataset? 
length(unique(subway$station))
## [1] 207
# how many riders per hour at these stations? 
subway %>%
  group_by(station) %>%
  summarise(ENTRIESn_hourly = n())
stationd <- subway %>%
  group_by(station) %>%
  summarise(mean_e = mean(ENTRIESn_hourly, na.rm = TRUE), mean_r = mean(rain, na.rm = TRUE))
head(stationd)
ecoreg <- lm(mean_e ~ mean_r, data = stationd)
summary(ecoreg)
## 
## Call:
## lm(formula = mean_e ~ mean_r, data = stationd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2339.4  -925.4  -499.6   326.5  8304.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)      623       1744   0.357    0.721
## mean_r          4956       7735   0.641    0.522
## 
## Residual standard error: 1642 on 205 degrees of freedom
## Multiple R-squared:  0.001999,   Adjusted R-squared:  -0.002869 
## F-statistic: 0.4106 on 1 and 205 DF,  p-value: 0.5224

The second level command, mean_r, gets at the proportion of rainy days at a particular station (rain is a dummy variable, with 0 being not raining and 1 being raining). From this, we can see one unit increase in the proportion of rainy days increase a stations’s mean hourly turnstile swipe rate by 4956. However, the P values indicate that this is not significant.

# cpooling model
cpooling <- lm(ENTRIESn_hourly ~ rain, data = subway)
summary(cpooling)
## 
## Call:
## lm(formula = ENTRIESn_hourly ~ rain, data = subway)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2028.2 -1605.5  -982.2   358.5 30968.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1845.54      16.23 113.702  < 2e-16 ***
## rain          182.66      34.24   5.335 9.61e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2951 on 42647 degrees of freedom
## Multiple R-squared:  0.0006669,  Adjusted R-squared:  0.0006435 
## F-statistic: 28.46 on 1 and 42647 DF,  p-value: 9.612e-08

Here we have some statistically significant information. Remember that rain is a dummy variable: the intercept refers to rain = 0, aka not raining. So we can see that when it is raining, the hourly turnstile swipe rate (our proxy for subway ridership) increases by around 183. We can complicate this with amount of precipitation later if we like, but for now let’s build more models.

#no pooling
dcoef <- subway %>% 
    group_by(station) %>% 
    do(mod = lm(ENTRIESn_hourly ~ rain, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()

dcoef <- subway %>% 
    group_by(station) %>% 
    do(mod = lm(ENTRIESn_hourly ~ rain, data = .))
coef <- dcoef %>% do(data.frame(rainy = coef(.$mod)[2]))
ggplot(coef, aes(x = rainy)) + geom_histogram()

Looking at the intercept and slope histograms, the non-rainy weather condition is a positively skewed distribution with a few outliers – these represent the busiest stations.

The slope (which is the difference in weather condition) is also positively skewed. Most stations are little busier in the rain, though there are some stations that are less busy in the rain.

# random intercept models
library(nlme)
m1_lme <- lme(ENTRIESn_hourly ~ rain, data = subway, random = ~1|station, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: subway 
##        AIC      BIC  logLik
##   786984.1 787018.7 -393488
## 
## Random effects:
##  Formula: ~1 | station
##         (Intercept) Residual
## StdDev:    1626.759 2431.754
## 
## Fixed effects: ENTRIESn_hourly ~ rain 
##                 Value Std.Error    DF   t-value p-value
## (Intercept) 1700.5770 113.91036 42441 14.929081       0
## rain         173.4985  28.22256 42441  6.147511       0
##  Correlation: 
##      (Intr)
## rain -0.056
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.79616689 -0.39271278 -0.09117636  0.18261944 10.61785732 
## 
## Number of Observations: 42649
## Number of Groups: 207
m2_lme <- lme(ENTRIESn_hourly ~ rain, data = subway, random = ~rain|station, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: subway 
##        AIC      BIC    logLik
##   786932.3 786984.3 -393460.2
## 
## Random effects:
##  Formula: ~rain | station
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 1579.5565 (Intr)
## rain         205.1101 0.999 
## Residual    2430.1467       
## 
## Fixed effects: ENTRIESn_hourly ~ rain 
##                 Value Std.Error    DF   t-value p-value
## (Intercept) 1704.6111 110.65009 42441 15.405419       0
## rain         154.2064  31.60621 42441  4.878991       0
##  Correlation: 
##      (Intr)
## rain 0.397 
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.12454467 -0.38937091 -0.09115785  0.18031890 10.43718819 
## 
## Number of Observations: 42649
## Number of Groups: 207
#anova(cpooling, m1_lme, m2_lme)
AIC(cpooling, m1_lme, m2_lme)

The AIC is lowest for m2_lme, though it is pretty close to m1_lme. We can see that the rainy weather condition increases hourly ridership by about 205.

At this point, I’d like to add another variable precipi – the amount of precipitation at the specific location and time. This might have a greater affect on station turnstile usage than the general “raining or not” categorical variable.

m3_lme <- lme(ENTRIESn_hourly ~ rain + precipi, data = subway, random = ~rain|station, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: subway 
##        AIC    BIC    logLik
##   786791.4 786852 -393388.7
## 
## Random effects:
##  Formula: ~rain | station
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 1577.8557 (Intr)
## rain         214.9994 0.999 
## Residual    2426.0542       
## 
## Fixed effects: ENTRIESn_hourly ~ rain + precipi 
##                 Value Std.Error    DF    t-value p-value
## (Intercept)  1703.563  110.5311 42440  15.412518       0
## rain          272.162   33.3923 42440   8.150449       0
## precipi     -5795.970  484.2604 42440 -11.968705       0
##  Correlation: 
##         (Intr) rain  
## rain     0.396       
## precipi  0.001 -0.297
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.11619661 -0.39198616 -0.09296074  0.18684959 10.39244841 
## 
## Number of Observations: 42649
## Number of Groups: 207

Some interesting results: the rainy weather condition leads to an increase in hourly ridership (+272). However, the amount of precipitation actually decreases ridership – when units of precipitation increase, the ridership drops by about 5795. (To be discussed further down.)

m4_lme <- lme(ENTRIESn_hourly ~ rain*precipi, data = subway, random = ~rain|station, method = "ML", control=lmeControl(returnObject=TRUE))
summary(m4_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: subway 
##        AIC      BIC    logLik
##   786793.3 786862.6 -393388.6
## 
## Random effects:
##  Formula: ~rain | station
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 1578.139 (Intr)
## rain         214.909 1     
## Residual    2426.048       
## 
## Fixed effects: ENTRIESn_hourly ~ rain * precipi 
##                   Value Std.Error    DF  t-value p-value
## (Intercept)    1703.399    110.55 42439 15.40803  0.0000
## rain            272.358     33.39 42439  8.15607  0.0000
## precipi       12515.477  47741.72 42439  0.26215  0.7932
## rain:precipi -18313.687  47745.30 42439 -0.38357  0.7013
##  Correlation: 
##              (Intr) rain   precip
## rain          0.395              
## precipi      -0.004  0.011       
## rain:precipi  0.004 -0.015 -1.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.11612003 -0.39198811 -0.09296505  0.18691153 10.39250657 
## 
## Number of Observations: 42649
## Number of Groups: 207

So what do these outputs mean?

The intercept is the hourly turnstile swipe rate for all people at all stations. The rain is the difference for rainy weather (i.e., there is a 272 boost in the hourly ridership in rainy conditions.) The precipi variable refers to the amount of precipitation at the location – for every unit increase in precipitation, the ridership jumps 12515. This is a lot! It’s worth pointing out that precipi is a little too broad and can actually refer to rain, snow, hail, etc. Weather data is hard to deal with for this reason. But in this table, this precipi line amount assumes the 1 condition of the rain variable, which is when it is raining. To get the effect of the amount of precipitation in the non-rainy condition specifically, we can add the rain:precipi number to the precipi number. So the sum of the two is -5803 (give or take some rounding), which means the more it precipitates but isn’t specically raining, the lower the ridership. This is interesting! But is this the best model?

# anova(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)
AIC(cpooling, m1_lme, m2_lme, m3_lme, m4_lme)

Now, m3_lme appears to be the best-fitting model, although the difference between m3 and m4_lme is very minor. Looking back at the two tables, this checks out – m3_lme suggested that ridership drops by about 5795 when the precipitation amount increases, and m4_lme suggested that ridership drops by about 5803 when the precipitation amount increases. This is really close!

Discussion

We could center the covariates like we did in class, but it doesn’t make sense to do it. In this dataset, there actually are times/locations when the precipitation is zero!

So how do we reconcile the fact that ridership increases whenever the precipitation amount is high in the rain condition, but actually decreases with the amount of precipitation overall? Are people taking cabs or just not going out?

Using the numbers from model 3 (m3_lme), we know that the rainy weather condition leads to an increase in hourly ridership (+272). However, the amount of precipitation actually decreases ridership – when units of precipitation increase, the overall ridership drops by about 5795.

Given what we know about weather data, this change in ridership can actually most likely be attributed to other more extreme weather conditions like snow or hail or something. When it rains a few inches, people might still take the train to work or leisure. When it rains a LOT of inches, they might take a bus or a cab/rideshare, since we know the trains are weak when it rains. Most likely, though, people are still taking the trains when it rains – the decrease can probably safely be attributed to the other weather conditions. When there are multiple inches of snow on the ground, NYC schools and colleges close, people who can work from home stay home to do just that, and so on – so the decrease in ridership makes a lot of sense.