NYC Bike Share launched in New York in May of 2013. As part of the agreement with NYC, Motivate, the company that operates NYC Bike Share (commonly referred to as Citi Bike), publishes quarterly reports which aggregate the number of trips per day.
The Central Park weather station has provided detailed weather information since 1868, including minimum, maximum, and average temperature, precipitation, and humidity in Manhattan.
Using these two data sources, we can perform an observational study of the impact of temperature and precipitation on the number of trips taken on the NYC Bike Share system.
The population under investigation is Citi Bike system users. This study will look at the whole population of system users rather than a sample. While it would be possible to use the Citi Bike users as a proxy for all NYC bikers, there may be important differences in the populations and behaviours of bike share users compared to all NYC bikers that would limit the applicability of findings from this study to all NYC bikers. Bike share users are not a random sample of all bikers.
Citi Bike trips are recorded whenever a bike is unlocked using the system software. Motivate provides aggregated and anonymized data to the public as part of its agreement with NYC. For this study, we will examine trips from the four quarterly reports provided for 2016. (2016 was chosen to match weather data.) Number of trips per day is the dependent variable in the study.
Weather data was collected at the Central Park Weather Station by the National Oceanic and Atmospheric Administration (NOAA) in 2016 and compiled into a CSV by Mathijs Waegemakers for a Kaggle competition related to weather and transportation.
For the purpose of this study, I will evaluate maximum temperature (degrees F) and precipitation (T/F) as the independent variables.

#### Cases: Days
Each case is a single day in the system, developed from joining the two data sets by date. Cases from 2016 are the entire population of Bike Share trips, not a sample.
Data from each source must be loaded and joined into a single dataset. Weather and number of trips are reported by date in each source, and so can be joined by date after reformatting the “date” column in each source.
In the weather data, precipitation is recorded in inches or as “T” for trace amounts. I have recoded the data as Boolean True or False where any amount of precipitation greater than 0 inches is coded as “True” and 0 inches or “T” are coded as “False” to create a categorical variable.
## date maximum.temperature precipitation n_trips
## 1 2016-01-01 42 FALSE 11009
## 2 2016-01-02 40 FALSE 14587
## 3 2016-01-03 45 FALSE 15499
## 4 2016-01-04 36 FALSE 19593
## 5 2016-01-05 29 FALSE 18053
## 6 2016-01-06 41 FALSE 24569
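The recoding and join described above could be sketched roughly as follows. This is a minimal illustration, not the code used for the study: the two small data frames, their values, the raw date formats, and the helper name `recode_precip` are all assumptions made for the example.

```r
# Toy inputs with invented values; the real files' column names and
# date formats may differ (both are assumptions here).
weather_demo <- data.frame(
  date = c("1-1-2016", "2-1-2016", "3-1-2016"),       # assumed day-month-year
  maximum.temperature = c(42, 40, 45),
  precipitation = c("0.00", "T", "0.12"),              # inches, or "T" for trace
  stringsAsFactors = FALSE
)
trips_demo <- data.frame(
  date = c("1/1/2016", "1/2/2016", "1/3/2016"),        # assumed month/day/year
  n_trips = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE only for a measurable amount above 0 inches.
# "T" (trace) cannot be parsed as a number, becomes NA, and is coded FALSE.
recode_precip <- function(x) {
  amount <- suppressWarnings(as.numeric(x))
  !is.na(amount) & amount > 0
}

# Put both date columns into the Date class so they can be matched.
weather_demo$date <- as.Date(weather_demo$date, format = "%d-%m-%Y")
trips_demo$date <- as.Date(trips_demo$date, format = "%m/%d/%Y")
weather_demo$precipitation <- recode_precip(weather_demo$precipitation)

# Inner join on date: one row (case) per day present in both sources.
cases_demo <- merge(weather_demo, trips_demo, by = "date")
print(cases_demo)
```

Coding trace amounts as `FALSE` follows the rule stated above; a study that treated trace precipitation as rain would only need to change the helper.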
Because the bike share area in 2016 is limited to portions of Manhattan and close-in Brooklyn and Queens, we will assume that the Central Park weather data is roughly consistent with weather across the bike share area.
I assume the data provided by Kaggle and Citi Bike are relatively free from error, both in the initial collection and the subsequent provision as datasets available online.
First, we can inspect the data and create some exploratory charts of the variables.
summary(casesRaw)
## date maximum.temperature precipitation n_trips
## Min. :2016-01-01 Min. :15.00 Mode :logical Min. : 0
## 1st Qu.:2016-04-01 1st Qu.:50.00 FALSE:250 1st Qu.:24677
## Median :2016-07-01 Median :64.50 TRUE :116 Median :38332
## Mean :2016-07-01 Mean :64.63 Mean :37819
## 3rd Qu.:2016-09-30 3rd Qu.:81.00 3rd Qu.:51742
## Max. :2016-12-31 Max. :96.00 Max. :69758
The minimum number of trips is 0. On closer inspection, the 0-trip days are four consecutive days in late January (January 23 to 26), beginning on a day with precipitation. This, it turns out, is when the system was shut down for several days due to a blizzard.
library(dplyr)  # provides %>% and filter()
casesRaw %>% filter(n_trips == 0)
## date maximum.temperature precipitation n_trips
## 1 2016-01-23 27 TRUE 0
## 2 2016-01-24 35 FALSE 0
## 3 2016-01-25 39 FALSE 0
## 4 2016-01-26 48 FALSE 0
Because the system was shut down, we will exclude these dates from the rest of the analysis.
cases <- casesRaw %>% filter(n_trips > 0)
Next, we can look at the relationship between number of trips and temperature and precipitation, separately and together.
boxplot(cases$n_trips ~ cases$precipitation,
main = "Number of Trips by Daily Precipitation, 2016",
ylab = "Total Daily Bike Share Trips")
Above we can see that the median number of trips is lower on days with some precipitation (as expected). This relationship is somewhat (although not perfectly) linear: the interquartile range shifts down by roughly the same amount as the median on days with precipitation.
library(ggplot2)
qplot(y=n_trips, x=maximum.temperature, data=cases,
main = "Number of Daily Bike Share Trips by Max. Temp (2016)",
xlab = "Max. Temp. (Degrees Fahrenheit)",
ylab = "Total Daily Trips")
Above, we see that there is a roughly linear relationship between number of trips and the daily high temperature.
qplot(y=n_trips, x=maximum.temperature, data=cases, colour=precipitation,
main = "Number of Daily Bike Share Trips by Max. Temp (2016)",
xlab = "Max. Temp. (Degrees Fahrenheit)",
ylab = "Total Daily Trips")
Finally, we see two roughly parallel linear relationships between maximum temperature and the total number of bike trips, one for days with precipitation and one for days without.
model_weather <- lm(cases$n_trips ~ cases$maximum.temperature + cases$precipitation)
summary(model_weather)
##
## Call:
## lm(formula = cases$n_trips ~ cases$maximum.temperature + cases$precipitation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23828.0 -6141.8 -812.3 5348.9 24278.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1402.01 1891.99 0.741 0.459
## cases$maximum.temperature 615.86 27.63 22.291 <2e-16 ***
## cases$precipitationTRUE -9920.43 1060.21 -9.357 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9391 on 359 degrees of freedom
## Multiple R-squared: 0.6201, Adjusted R-squared: 0.618
## F-statistic: 293 on 2 and 359 DF, p-value: < 2.2e-16
Running a multiple regression model, we find that both temperature and precipitation are highly significant, with p-values below 2e-16.
The adjusted R-squared for the multiple-regression model indicates that the combination of max. temperature and presence or absence of precipitation explains 62% of the variation of the number of trips.
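As a quick check on the reported figures, the adjusted value can be recovered from the multiple R-squared. This assumes the standard adjustment formula, with n = 362 days (366 less the four excluded shutdown days) and p = 2 predictors, which matches the 359 residual degrees of freedom in the output:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.6201)\,\frac{361}{359} \approx 0.618$$

With only two predictors and several hundred cases, the adjustment is small, so the two values agree to about two decimal places.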
The formula can be written as:
$$\text{Number of trips} = 1402 + 616 \times \text{max.temp} - 9920 \times \text{precipitationTRUE}$$

where precipitationTRUE is a dummy variable equal to 1 when precipitation is present and 0 otherwise.
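To make the fitted equation concrete, here is a small worked example that plugs in a 70 degree F day with and without precipitation. The function name `predict_trips` is hypothetical, and the coefficients are copied by hand from the model summary above rather than read from the fitted object.

```r
# Hypothetical helper using the coefficients reported in summary(model_weather).
# precipitation is TRUE/FALSE and is converted to the 1/0 dummy variable.
predict_trips <- function(max_temp, precipitation) {
  1402.01 + 615.86 * max_temp - 9920.43 * as.numeric(precipitation)
}

dry_day <- predict_trips(70, FALSE)  # 1402.01 + 615.86 * 70, about 44,512 trips
wet_day <- predict_trips(70, TRUE)   # the same, less 9,920.43: about 34,592 trips

print(round(c(dry = dry_day, wet = wet_day)))
```

The gap between the two predictions is the precipitation coefficient itself, regardless of temperature, which is why the two fitted lines are parallel.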
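# Reference: https://cran.r-project.org/web/packages/ggiraphExtra/vignettes/ggPredict.html
# Fitted line for days without precipitation
equation1 <- function(x) { coef(model_weather)[2] * x + coef(model_weather)[1] }
# Fitted line for days with precipitation (adds the precipitationTRUE coefficient)
equation2 <- function(x) { coef(model_weather)[2] * x + coef(model_weather)[1] + coef(model_weather)[3] }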
cases %>% ggplot(aes(y=n_trips,x=maximum.temperature,color=precipitation)) +
geom_point() +
stat_function(fun=equation1,geom="line",color=scales::hue_pal()(2)[1]) +
stat_function(fun=equation2,geom="line",color=scales::hue_pal()(2)[2]) +
ggtitle("Impact of Daily High Temperature and Precipitation on\nNYC Bike Share Trips (2016)")+
xlab("Daily High Temperature (Degrees Fahrenheit)") +
ylab("Number of Bike Share Trips (24 hr period)")
Before concluding, we must first check the following assumptions:
Linearity: this is true for temperature, and reasonably true for precipitation, as shown above.
To a rough approximation, both number of trips and maximum temperature are normally distributed.
hist(cases$n_trips, main = "Number of Bike Share Trips, 2016",
xlab = "Total Daily Trips")
hist(cases$maximum.temperature, main = "Maximum Daily Temperature, 2016",
xlab = "Degrees Fahrenheit", breaks = 20)
There appears to be some slight curvature in the residuals, which may indicate a problem with the model.
plot(resid(model_weather))
abline(h=0)
This appears to be roughly true, except at the very top of the range, where the model shows some tendency to overestimate.
plot(model_weather$residuals ~ fitted(model_weather))
abline(h=0)
plot(model_weather$residuals ~ cases$maximum.temperature)
abline(h=0)
There appears to be some slight curvature in the residuals at the upper end, which may indicate a problem with the model.
qqnorm(model_weather$residuals)
qqline(model_weather$residuals)
Based on the model, for every additional degree (F) in the daily maximum temperature on a given day, we should expect an additional 616 trips, less 9,920 trips if it rains or snows. Theoretically, the intercept means that on a day with a maximum temperature of 0 degrees F, we would still expect to see 1,402 trips; however, the minimum observed maximum temperature is 15 degrees F, and the intercept is not statistically significant. The model should not be used to predict values outside the observed temperature range.
The model performs best in the range between about 40 and 90 degrees F. Intuitively this makes sense: while people generally like to ride bikes more when it’s warmer, above 90 degrees F it could be considered too hot to bike.
Further analysis could restrict the range of temperatures to see if the model explains more of the variation within a narrower range, with perhaps a separate linear relationship for extremely high or low temperatures.
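One way such a follow-up might be set up is sketched below. It is only an illustration: the helper name `fit_in_range` is hypothetical, and `demo` is simulated data loosely shaped like the fitted model, not the 2016 Citi Bike data, so the printed numbers say nothing about the real system.

```r
# Hypothetical helper: refit the same two-predictor model on days whose
# maximum temperature falls within [low, high]; return adjusted R-squared.
fit_in_range <- function(data, low, high) {
  in_range <- data$maximum.temperature >= low & data$maximum.temperature <= high
  fit <- lm(n_trips ~ maximum.temperature + precipitation, data = data[in_range, ])
  summary(fit)$adj.r.squared
}

# Simulated stand-in for the `cases` data frame (assumption, not real data).
set.seed(1)
demo <- data.frame(
  maximum.temperature = runif(200, min = 15, max = 96),
  precipitation = runif(200) < 0.3
)
demo$n_trips <- 1400 + 616 * demo$maximum.temperature -
  9920 * demo$precipitation + rnorm(200, sd = 9000)

full_range <- fit_in_range(demo, 15, 96)
mid_range <- fit_in_range(demo, 40, 90)
print(c(full = full_range, mid = mid_range))
```

Note that narrowing the temperature range also reduces the variation in the predictor, which by itself tends to lower R-squared, so comparing residual standard errors across ranges may be more informative than comparing R-squared alone.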
Bike Share Trips Data Source: Citi Bike (2016 chosen to match weather data)