Predicting NYC Bike Share Daily Trips Based on Weather

Part 1 - Introduction

NYC Bike Share launched in New York in May of 2013. As part of the agreement with NYC, Motivate, the company that operates NYC Bike Share (commonly referred to as Citi Bike), publishes quarterly reports which aggregate the number of trips per day.

The Central Park weather station has provided detailed weather information for Manhattan since 1868, including minimum, maximum, and average temperature, precipitation, and humidity.

Using these two data sources, we can perform an observational study of the impact of temperature and precipitation on the number of trips taken on the NYC Bike Share system.

The population under investigation is Citi Bike system users. This study will look at the whole population of system users rather than a sample. While it would be possible to use the Citi Bike users as a proxy for all NYC bikers, there may be important differences in the populations and behaviours of bike share users compared to all NYC bikers that would limit the applicability of findings from this study to all NYC bikers. Bike share users are not a random sample of all bikers.

Part 2 - Data

Dependent Variable: Number of trips per day

Citi Bike trips are recorded whenever a bike is unlocked using the system software. Motivate provides aggregated and anonymized data to the public as part of its agreement with NYC. For this study, we will examine trips from the four quarterly reports provided for 2016. (2016 was chosen to match weather data.) Number of trips per day is the dependent variable in the study.

Independent Variables: Maximum Daily Temperature and Precipitation

Weather data was collected at the Central Park Weather Station by the National Oceanic and Atmospheric Administration (NOAA) in 2016 and compiled into a CSV by Mathijs Waegemakers for a Kaggle competition related to weather and transportation.

For the purpose of this study, I will evaluate maximum temperature (degrees F) and precipitation (T/F) as the independent variables.

Cases: Days

Each case is a single day in the system, developed from joining the two data sets by date. Cases from 2016 are the entire population of Bike Share trips, not a sample.

Data Preparation

Data from each source must be loaded and joined into a single dataset. Weather and number of trips are reported by date in each source, and so can be joined by date after reformatting the “date” data in each source.
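As a minimal sketch of the join described above, the following uses two small stand-in data frames rather than the real files. The column names (`date`, `Date`) and the two date formats are assumptions made for illustration; the actual sources may differ.

```r
# Hypothetical stand-ins for the two sources. Column names and date
# formats are assumptions chosen to illustrate the reformatting step.
weather <- data.frame(
  date = c("1-1-2016", "2-1-2016", "3-1-2016"),  # day-month-year strings
  maximum.temperature = c(42, 40, 45),
  stringsAsFactors = FALSE
)
trips <- data.frame(
  Date = c("1/1/16", "1/2/16", "1/3/16"),        # month/day/two-digit year
  n_trips = c(11009, 14587, 15499),
  stringsAsFactors = FALSE
)

# Convert both date columns to Date objects using explicit formats
weather$date <- as.Date(weather$date, format = "%d-%m-%Y")
trips$date   <- as.Date(trips$Date, format = "%m/%d/%y")
trips$Date   <- NULL  # drop the original text column

# Inner join on the shared date column: one row per day
joined <- merge(weather, trips, by = "date")
joined
```

Stating the format explicitly for each source avoids silently swapping day and month, which is easy to miss when the first few dates are all in early January.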

In the weather data, precipitation is recorded in inches or as “T” for trace amounts. I have recoded the data as Boolean True or False where any amount of precipitation greater than 0 inches is coded as “True” and 0 inches or “T” are coded as “False” to create a categorical variable.

##         date maximum.temperature precipitation n_trips
## 1 2016-01-01                  42         FALSE   11009
## 2 2016-01-02                  40         FALSE   14587
## 3 2016-01-03                  45         FALSE   15499
## 4 2016-01-04                  36         FALSE   19593
## 5 2016-01-05                  29         FALSE   18053
## 6 2016-01-06                  41         FALSE   24569
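The precipitation recoding described above can be sketched as follows. The raw values in `precip_raw` are hypothetical examples of how inches and trace amounts might appear as text; only the recoding rule itself comes from the description.

```r
# Hypothetical raw values: inches stored as text, "T" marking a trace amount
precip_raw <- c("0.00", "T", "0.15", "1.80", "0")

# Treat trace amounts as zero, then convert the text to numeric inches
precip_inches <- as.numeric(ifelse(precip_raw == "T", "0", precip_raw))

# TRUE only when a measurable amount (> 0 inches) was recorded
precipitation <- precip_inches > 0
precipitation
```

Replacing "T" before calling `as.numeric()` avoids the coercion warning and the `NA` values that the conversion would otherwise produce.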

Assumptions

Because the bike share area in 2016 is limited to portions of Manhattan and close-in Brooklyn and Queens, we will assume that the Central Park weather data is roughly consistent with weather across the bike share area.
I assume the data provided by Kaggle and Citi Bike are relatively free from error, both in the initial collection and the subsequent provision as datasets available online.

Part 3 - Exploratory data analysis

First, we can inspect the data and create some exploratory charts of the variables.

summary(casesRaw)

##       date            maximum.temperature precipitation      n_trips     
##  Min.   :2016-01-01   Min.   :15.00       Mode :logical   Min.   :    0  
##  1st Qu.:2016-04-01   1st Qu.:50.00       FALSE:250       1st Qu.:24677  
##  Median :2016-07-01   Median :64.50       TRUE :116       Median :38332  
##  Mean   :2016-07-01   Mean   :64.63                       Mean   :37819  
##  3rd Qu.:2016-09-30   3rd Qu.:81.00                       3rd Qu.:51742  
##  Max.   :2016-12-31   Max.   :96.00                       Max.   :69758

The minimum number of trips is 0. On closer inspection, every 0-trip day falls in late January, starting with a day of precipitation (January 23) and continuing for the three days after it. This, it turns out, was when the system was shut down for several days due to a blizzard.

library(dplyr)  # provides %>% and filter()
casesRaw %>% filter(n_trips == 0)

##         date maximum.temperature precipitation n_trips
## 1 2016-01-23                  27          TRUE       0
## 2 2016-01-24                  35         FALSE       0
## 3 2016-01-25                  39         FALSE       0
## 4 2016-01-26                  48         FALSE       0

Because the system was shut down, we will exclude these dates from the rest of the analysis.

cases <- casesRaw %>% filter(n_trips > 0)

Next, we can look at the relationship of temperature and precipitation to the number of trips, separately and together.

boxplot(cases$n_trips ~ cases$precipitation, 
        main = "Number of Trips by Daily Precipitation, 2016",
        ylab = "Total Daily Bike Share Trips")

Above we can see that the median number of trips is lower on days with some precipitation (as expected). The effect looks roughly (although not perfectly) like a constant downward shift: the interquartile range moves down by about the same amount as the median on days where there is precipitation.
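A boxplot draws the median and quartiles of each group, so the shift can also be read off numerically. The sketch below uses a small made-up data frame (`toy`) in place of `cases`; the numbers are illustrative only and are not taken from the 2016 data.

```r
# Made-up data standing in for `cases`; values are illustrative only
toy <- data.frame(
  n_trips = c(30000, 40000, 50000, 20000, 30000, 40000),
  precipitation = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Median and interquartile range of daily trips within each group
medians <- tapply(toy$n_trips, toy$precipitation, median)
iqrs    <- tapply(toy$n_trips, toy$precipitation, IQR)

# Difference in medians: wet days minus dry days
shift <- unname(medians["TRUE"] - medians["FALSE"])

medians
iqrs
shift
```

If the two groups have similar interquartile ranges and differ mainly in their medians, a single additive term for precipitation is a reasonable starting point.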

library(ggplot2)
qplot(y=n_trips, x=maximum.temperature, data=cases, 
      main = "Number of Daily Bike Share Trips by Max. Temp (2016)",
      xlab = "Max. Temp. (Degrees Fahrenheit)",
      ylab = "Total Daily Trips")

Above, we see that there is a roughly linear relationship between number of trips and the daily high temperature.

qplot(y=n_trips, x=maximum.temperature, data=cases, colour=precipitation, 
      main = "Number of Daily Bike Share Trips by Max. Temp (2016)",
      xlab = "Max. Temp. (Degrees Fahrenheit)",
      ylab = "Total Daily Trips")

Finally, we see two roughly parallel linear relationships between maximum temperature and the total number of bike trips: one for days with precipitation and one for days without.

Part 4 - Inference

model_weather <- lm(cases$n_trips ~ cases$maximum.temperature + cases$precipitation)
summary(model_weather)

## 
## Call:
## lm(formula = cases$n_trips ~ cases$maximum.temperature + cases$precipitation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23828.0  -6141.8   -812.3   5348.9  24278.9 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1402.01    1891.99   0.741    0.459    
## cases$maximum.temperature   615.86      27.63  22.291   <2e-16 ***
## cases$precipitationTRUE   -9920.43    1060.21  -9.357   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9391 on 359 degrees of freedom
## Multiple R-squared:  0.6201, Adjusted R-squared:  0.618 
## F-statistic:   293 on 2 and 359 DF,  p-value: < 2.2e-16

Running a multiple regression model, we find that both temperature and precipitation are highly significant, each with a p-value below 2e-16.
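To make the significance figures less of a black box, the t values and p-values can be approximately recomputed from the estimates and standard errors printed in the summary. This is a sketch that retypes the rounded numbers from the output, so results may differ from the printed ones in the last digit. The vector names are my own.

```r
# Estimates and standard errors copied (rounded) from the summary output
est <- c(intercept = 1402.01, max_temp = 615.86, precip_true = -9920.43)
se  <- c(intercept = 1891.99, max_temp = 27.63,  precip_true = 1060.21)
df_resid <- 359  # residual degrees of freedom reported by summary()

# t statistic = estimate / standard error
t_value <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

round(t_value, 3)
signif(p_value, 3)
round(ci, 1)
```

The confidence intervals give a sense of how precisely each effect is estimated, which the p-values alone do not convey.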

The adjusted R-squared for the multiple-regression model (0.618) indicates that the combination of max. temperature and the presence or absence of precipitation explains about 62% of the variation in the number of trips.
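As a quick check on how the adjusted value relates to the multiple R-squared, the standard adjustment can be applied by hand. The inputs are taken from the summary output; the number of days is inferred as the 359 residual degrees of freedom plus the 3 estimated coefficients.

```r
r_squared <- 0.6201  # multiple R-squared from the summary output
n <- 359 + 3         # days used: residual df plus estimated coefficients
p <- 2               # predictors: maximum temperature and precipitation

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 3)
```

With only two predictors and several hundred days, the penalty is small, which is why the two values are nearly identical.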

The formula can be written as:

Number of trips = 1402 + 616 * max.temp - 9920 * precipitationTRUE

where precipitationTRUE is a dummy variable equal to 1 when precipitation is present and 0 otherwise.
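As a worked example of the fitted equation, the sketch below plugs in a 70 degree F day with and without precipitation. The function name `predict_trips` is hypothetical, and the coefficients are the rounded values from the equation, so results are approximate.

```r
# Rounded coefficients from the fitted equation; `predict_trips` is a
# hypothetical helper, not part of the original analysis
predict_trips <- function(max_temp, precipitation) {
  1402 + 616 * max_temp - 9920 * as.numeric(precipitation)
}

dry_day <- predict_trips(70, FALSE)  # 1402 + 616 * 70
wet_day <- predict_trips(70, TRUE)   # same, minus 9920

dry_day
wet_day
dry_day - wet_day
```

The difference between the two predictions is the same at every temperature, which is what it means for the two fitted lines to be parallel.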

#reference:https://cran.r-project.org/web/packages/ggiraphExtra/vignettes/ggPredict.html
equation1=function(x){coef(model_weather)[2]*x+coef(model_weather)[1]}
equation2=function(x){coef(model_weather)[2]*x+coef(model_weather)[1]+coef(model_weather)[3]}

cases %>% ggplot(aes(y=n_trips,x=maximum.temperature,color=precipitation)) + 
  geom_point() +
  stat_function(fun=equation1,geom="line",color=scales::hue_pal()(2)[1]) +
  stat_function(fun=equation2,geom="line",color=scales::hue_pal()(2)[2]) +
  ggtitle("Impact of Daily High Temperature and Precipitation on\nNYC Bike Share Trips (2016)")+
  xlab("Daily High Temperature (Degrees Fahrenheit)") +
  ylab("Number of Bike Share Trips (24 hr period)")

Checking Assumptions

Before concluding, we must first check the following assumptions:

Linear Relationship Between Variables:

This is true for temperature, and reasonably true for precipitation, as shown above.

Multivariate normality

To a rough approximation, both number of trips and maximum temperature are normally distributed.

hist(cases$n_trips, main = "Number of Bike Share Trips, 2016",
     xlab = "Total Daily Trips")

hist(cases$maximum.temperature, main = "Maximum Daily Temperature, 2016",
     xlab = "Degrees Fahrenheit", breaks = 20)

Independent Residuals

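Because the cases are ordered by date, this plot shows the residuals over the course of the year. There appears to be some slight curvature in the residuals, which may indicate that they are not fully independent over time and that the model is missing some structure.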

plot(resid(model_weather))
abline(h=0)
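One way to put a number on what the residual plot shows is a Durbin-Watson style statistic together with the lag-1 autocorrelation. The sketch below runs on simulated data, not the 2016 cases, and the variable names are my own; with the real model, `e` would be `resid(model_weather)` with the rows in date order.

```r
set.seed(1)

# Simulated stand-in for the real data: a seasonal temperature curve
# plus noise, and trips that depend linearly on temperature
day   <- 1:362
temp  <- 60 + 25 * sin(2 * pi * (day - 100) / 365) + rnorm(362, sd = 8)
trips <- 1400 + 616 * temp + rnorm(362, sd = 9000)

fit <- lm(trips ~ temp)
e   <- resid(fit)  # residuals in date order

# Durbin-Watson statistic: values near 2 suggest no lag-1 autocorrelation,
# values well below 2 suggest positive autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals, for comparison
r1 <- acf(e, lag.max = 1, plot = FALSE)$acf[2]

dw
r1
```

Because the simulated errors are independent by construction, this example should show little autocorrelation; the real residuals may behave differently.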

Constant Variance (Homoscedasticity)

This appears to be roughly true, except at the very top of the range, where the model shows some tendency to overestimate.

plot(model_weather$residuals ~ fitted(model_weather))
abline(h=0)

plot(model_weather$residuals ~ cases$maximum.temperature)
abline(h=0)
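To complement the visual check, the spread of the residuals can be compared across temperature bands. This is a rough sketch on simulated data with constant error variance; the band boundaries are arbitrary choices of mine, and with the real model the inputs would be `cases$maximum.temperature` and `model_weather$residuals`.

```r
set.seed(2)

# Simulated stand-in for the real data, with constant error variance
temp  <- runif(362, min = 15, max = 96)
trips <- 1400 + 616 * temp + rnorm(362, sd = 9000)
fit   <- lm(trips ~ temp)

# Group days into temperature bands (boundaries are arbitrary)
band <- cut(temp, breaks = c(0, 40, 60, 80, 100))

# Standard deviation of residuals within each band
spread_by_band <- tapply(resid(fit), band, sd)
round(spread_by_band)
```

Broadly similar standard deviations across bands are consistent with constant variance; a band with a much larger or smaller spread would be worth a closer look.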

Normally Distributed Residuals

In the normal Q-Q plot, there appears to be some slight curvature at the upper end, which may indicate a departure from normality in the residuals.

qqnorm(model_weather$residuals)
qqline(model_weather$residuals)
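A formal test can accompany the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to simulated normal values standing in for `model_weather$residuals`, so the result says nothing about the real residuals.

```r
set.seed(3)

# Simulated stand-in for model_weather$residuals
e <- rnorm(362, sd = 9000)

# Shapiro-Wilk test: the null hypothesis is that the values are normal
sw <- shapiro.test(e)

sw$statistic  # W, close to 1 for normal-looking data
sw$p.value    # a small p-value would suggest non-normal residuals
```

With several hundred observations, the test can flag departures too small to matter in practice, so it is best read alongside the Q-Q plot rather than instead of it.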

Part 5 - Conclusion

Based on the model, for each additional degree (F) in the daily maximum temperature, we should expect about 616 additional trips; on a day with precipitation, we should expect about 9,920 fewer trips than on a dry day at the same temperature. Theoretically, the intercept means that on a dry day with a maximum temperature of 0 degrees F, we would still expect to see 1,402 trips; however, the lowest observed daily maximum is 15 degrees F, and the intercept is not statistically significant (p = 0.459). The model should not be used to predict values outside the observed range of 15 to 96 degrees F.

The model performs best in the range between about 40 degrees and 90 degrees. Intuitively this makes sense: in general, people like to ride bikes more when it’s warmer, but above 90 degrees F it could be considered too hot to bike.

Further analysis could restrict the range of temperatures to see if the model explains more of the variation within a narrower range, with perhaps a separate linear relationship for extremely high or low temperatures.
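As a rough sketch of what that follow-up might look like, the code below compares the original specification with a fit restricted to 40 to 90 degrees F and with a fit that adds a squared temperature term. The data are simulated with a deliberate flattening at high temperatures, which is an assumption for illustration and not a finding; all object names are my own.

```r
set.seed(4)

# Simulated data in which trips flatten out at high temperatures
temp   <- runif(362, min = 15, max = 96)
precip <- runif(362) < 0.3
trips  <- 1400 + 900 * temp - 3 * temp^2 - 9900 * precip +
  rnorm(362, sd = 6000)
d <- data.frame(trips, temp, precip)

# Original specification on all days
fit_all <- lm(trips ~ temp + precip, data = d)

# Same specification restricted to 40-90 degrees F
fit_narrow <- lm(trips ~ temp + precip,
                 data = subset(d, temp >= 40 & temp <= 90))

# Alternative: allow curvature with a squared temperature term
fit_quad <- lm(trips ~ temp + I(temp^2) + precip, data = d)

adj_r2 <- c(
  all    = summary(fit_all)$adj.r.squared,
  narrow = summary(fit_narrow)$adj.r.squared,
  quad   = summary(fit_quad)$adj.r.squared
)
round(adj_r2, 3)

# F test for whether the squared term improves on the linear model
anova(fit_all, fit_quad)
```

Note that restricting the temperature range also reduces the variation in the predictor, so a lower R-squared on the narrower range would not by itself mean the model fits worse there.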