library(dplyr)
library(ggplot2)
library(GGally)
library(lubridate)
library(caret)
library(gbm)
library(tidyverse)
The NYC “CitiBike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013 and has been expanding ever since, both in daily ridership and in geographic footprint, as a growing number of “docking stations” welcomes riders in Brooklyn, Queens, and northern parts of Manhattan that were not previously served.
One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system requires that more bikes be purchased and put into service to accommodate them. Heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate into an increase in revenue, because clever riders can avoid paying surcharges by keeping each trip below a specified limit (either 30 or 45 minutes, depending on user category).
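To make the surcharge mechanics concrete, here is a minimal sketch of that fee rule. The trips data frame, its column names, and the mapping of the 45-minute allowance to subscribers (30 minutes for casual users) are illustrative assumptions, not taken from the dataset documentation:
# Hypothetical sketch of the surcharge rule described above.
# Assumes a data frame `trips` with tripduration (seconds) and usertype columns.
free_limit_secs <- ifelse(trips$usertype == "Subscriber", 45 * 60, 30 * 60)
trips$incurs_surcharge <- trips$tripduration > free_limit_secs
mean(trips$incurs_surcharge)  # share of trips that would incur a surcharge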
We examine CitiBike ridership data, joined with daily NYC weather data, to study the impact of weather on shared bike usage and to build a predictive model estimating the number of trips taken each day.
The goal is to estimate future demand, enabling the system operator to make expansion plans.
Our finding is that ridership exhibits strong seasonality and correlates with weather-related variables such as daily temperature and precipitation. Additionally, ridership is segmented by user type (annual subscribers use the system much more heavily than casual users), gender (there are many more male users than female), and age (a large share of users are clustered in their late 30s).
Since 2013, a shared bicycle system known as CitiBike has been available in New York City. The benefits of having such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of CitiBike’s pricing constraints.
There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments.
The available usage data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.
I obtained data from CitiBike, which makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.
For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset: for example, if a user checks out a bike from a dock but returns it within one minute, the system drops the record, as such “trips” are not interesting.)
There are currently 108 monthly data files for the New York City bikeshare system, spanning July 2013 through December 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million trips this past summer. Because of the computational limitations this presented, I created 1/1000 samples. The samples were drawn non-deterministically, by randomly selecting nrow(file)/1000 rows from each monthly file.
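The sampling code itself is not shown here; a minimal sketch of how such a 1/1000 draw could look, assuming one monthly file has been read into a data frame named file:
# Draw roughly 1/1000 of the rows from one monthly file.
# No seed is set, so each draw is non-deterministic, as described above.
keep <- sample(nrow(file), size = floor(nrow(file) / 1000))
file_sample <- file[keep, ]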
<- read.csv("~/Downloads/citybike.train.csv") citibike_2019
In this section, I examine selected individual variables from the CitiBike and Weather datasets. These items require transformation and/or cleaning as there are missing values or outliers which impede analysis otherwise.
The trip duration (tripduration) is specified in seconds, but there are some outliers which may be erroneous: the maximum value is 3.5984 × 10^4 seconds, or about 0.416 days. We can assume that this data is bad, as nobody would willingly rent a bicycle for this period of time, given the fees that would be charged. Here is a histogram of the original data distribution:
Let’s assume that nobody would rent a bicycle for more than a specified time limit (say, 3 hours), and drop any records which exceed this:
num_long_trips_removed <- nrow(citibike_2019[citibike_2019$tripduration > 10800, ])
citibike_2019 <- citibike_2019[citibike_2019$tripduration <= 10800, ]
print(paste0("Removed ", num_long_trips_removed, " trips of longer than 3 hours."))
## [1] "Removed 49 trips of longer than 3 hours."
Other inconsistencies concern the collection of birth_year, from which we can infer a rider’s age. In some months this value is omitted, while in others all values are populated. Moreover, a few records suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does contain such anomalies. Thus, a substantial amount of time was needed to detect and clean these inconsistencies.
The birth year for some users is as early as 1886, which is not plausible:
summary(citibike_2019$birth.year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1886 1969 1982 1980 1990 2003
citibike_2019$age <- 2019 - citibike_2019$birth.year
num_old_age_removed <- nrow(citibike_2019[citibike_2019$age > 90, ])
citibike_2019 <- citibike_2019[citibike_2019$age < 90, ]
print(paste0("Removed ", num_old_age_removed, " trips of people older than 90 years"))
## [1] "Removed 27 trips of people older than 90 years"
library(geosphere)
# Great-circle distance between start and end station coordinates,
# plus simple absolute latitude/longitude deltas as separate features
citibike_2019$distance <- distHaversine(citibike_2019[, 5:6], citibike_2019[, 9:10])
citibike_2019$dist.lat <- abs(citibike_2019$start.station.latitude - citibike_2019$end.station.latitude)
citibike_2019$dist.long <- abs(citibike_2019$start.station.longitude - citibike_2019$end.station.longitude)
summary(citibike_2019$distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 590.7 1071.9 1374.2 1816.5 10273.2
is_weekday <- function(timestamp) {
  lubridate::wday(timestamp, week_start = 1) < 6
}
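A quick sanity check of the helper on two illustrative dates:
is_weekday(as.Date("2019-07-01"))  # a Monday: TRUE
is_weekday(as.Date("2019-07-06"))  # a Saturday: FALSE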
citibike_2019$Hour <- hour(citibike_2019$starttime)
citibike_2019$dayofweek <- as.factor(wday(citibike_2019$starttime))
citibike_2019$weekday <- as.factor(as.numeric(sapply(citibike_2019$starttime, is_weekday)))
head(citibike_2019 %>% select("Hour", "dayofweek", "weekday"))
## Hour dayofweek weekday
## 1 13 5 1
## 2 7 5 1
## 3 6 5 1
## 4 9 5 1
## 5 20 6 1
## 6 16 7 0
citibike_2019$usertype <- as.factor(citibike_2019$usertype)
citibike_2019$gender <- as.factor(citibike_2019$gender)
colstouse <- setdiff(names(citibike_2019),
                     c("starttime", "start.station.id", "end.station.id",
                       "stoptime", "start.station.name", "end.station.name",
                       "start.station.latitude", "start.station.longitude",
                       "end.station.latitude", "end.station.longitude",
                       "weekday", "bikeid", "birth.year"))
train_data <- citibike_2019[, which(names(citibike_2019) %in% colstouse)]
names(train_data)
## [1] "tripduration" "usertype" "gender" "age" "distance"
## [6] "dist.lat" "dist.long" "Hour" "dayofweek"
linear_model <- lm(tripduration ~ ., data = train_data)
summary(linear_model)
##
## Call:
## lm(formula = tripduration ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1904.6 -222.6 -116.4 51.5 10071.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.136e+02 1.920e+01 42.363 < 2e-16 ***
## usertypeSubscriber -5.741e+02 9.171e+00 -62.593 < 2e-16 ***
## gender1 -7.316e+01 1.215e+01 -6.024 1.72e-09 ***
## gender2 6.331e+00 1.268e+01 0.499 0.618
## age 1.890e+00 2.325e-01 8.130 4.39e-16 ***
## distance 7.272e-01 2.926e-02 24.850 < 2e-16 ***
## dist.lat 1.507e+04 4.587e+02 32.846 < 2e-16 ***
## dist.long -5.788e+04 3.044e+03 -19.012 < 2e-16 ***
## Hour 4.977e+00 5.508e-01 9.037 < 2e-16 ***
## dayofweek2 -5.842e+01 1.224e+01 -4.774 1.82e-06 ***
## dayofweek3 -1.210e+02 1.205e+01 -10.043 < 2e-16 ***
## dayofweek4 -1.133e+02 1.148e+01 -9.869 < 2e-16 ***
## dayofweek5 -1.200e+02 1.161e+01 -10.338 < 2e-16 ***
## dayofweek6 -9.834e+01 1.140e+01 -8.630 < 2e-16 ***
## dayofweek7 1.544e+01 1.195e+01 1.292 0.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 590.4 on 49909 degrees of freedom
## Multiple R-squared: 0.4097, Adjusted R-squared: 0.4095
## F-statistic: 2474 on 14 and 49909 DF, p-value: < 2.2e-16
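The residual distribution above is heavily right-skewed (median -116.4 but max 10071.3), so the standard base-R diagnostic plots are worth a look before trusting the linear fit:
# Residuals vs. fitted, normal Q-Q, scale-location, and leverage plots
par(mfrow = c(2, 2))
plot(linear_model)
par(mfrow = c(1, 1))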
colstouse <- setdiff(names(citibike_2019),
                     c("starttime", "start.station.id", "end.station.id",
                       "stoptime", "start.station.name", "end.station.name",
                       "birth.year"))
train_data <- citibike_2019[, which(names(citibike_2019) %in% colstouse)]
names(train_data)
## [1] "tripduration" "start.station.latitude"
## [3] "start.station.longitude" "end.station.latitude"
## [5] "end.station.longitude" "bikeid"
## [7] "usertype" "gender"
## [9] "age" "distance"
## [11] "dist.lat" "dist.long"
## [13] "Hour" "dayofweek"
## [15] "weekday"
library(gbm)
# 10-fold cross-validation over an 18-combination tuning grid
caretGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                         n.trees = c(150, 300, 500),
                         shrinkage = c(0.01, 0.001),
                         n.minobsinnode = 10)
trainControl <- trainControl(method = "cv", number = 10)
metric <- "RMSE"
gbmmodel <- train(tripduration ~ .,
                  data = train_data,
                  distribution = "gaussian",
                  method = "gbm",
                  trControl = trainControl,
                  tuneGrid = caretGrid,
                  metric = metric,
                  bag.fraction = 0.75,
                  verbose = FALSE)
print(gbmmodel)
## Stochastic Gradient Boosting
##
## 49924 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 44932, 44931, 44931, 44931, 44931, 44932, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.001 1 150 748.4269 0.2190057 504.6563
## 0.001 1 300 732.7663 0.2330968 485.7097
## 0.001 1 500 715.6451 0.2851715 465.8422
## 0.001 3 150 736.0388 0.3450818 491.8834
## 0.001 3 300 710.3006 0.3533046 462.7603
## 0.001 3 500 683.6676 0.3637127 431.7517
## 0.001 5 150 731.7017 0.3730394 488.4610
## 0.001 5 300 702.6904 0.3821968 455.4932
## 0.001 5 500 673.0168 0.3909381 420.6314
## 0.010 1 150 657.3529 0.3767908 400.2601
## 0.010 1 300 617.2269 0.4024616 353.9881
## 0.010 1 500 594.4068 0.4248682 328.1240
## 0.010 3 150 613.1282 0.4140863 346.8069
## 0.010 3 300 582.4484 0.4384686 310.0238
## 0.010 3 500 571.1499 0.4513926 295.6785
## 0.010 5 150 600.9113 0.4307728 334.0063
## 0.010 5 300 574.0135 0.4506206 300.1469
## 0.010 5 500 566.0267 0.4594682 289.5523
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 5, shrinkage = 0.01 and n.minobsinnode = 10.
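Rather than reading the winning combination off the printout, caret exposes it directly:
# Hyperparameters of the selected model, and the resampling profile across the grid
gbmmodel$bestTune
plot(gbmmodel)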
summary(gbmmodel)
## var rel.inf
## dist.lat dist.lat 53.36517774
## distance distance 22.05152386
## usertypeSubscriber usertypeSubscriber 19.89447980
## end.station.latitude end.station.latitude 1.13960198
## Hour Hour 0.89695741
## gender1 gender1 0.65178326
## weekday1 weekday1 0.55642742
## start.station.latitude start.station.latitude 0.52993178
## dist.long dist.long 0.52141913
## age age 0.10518781
## bikeid bikeid 0.08494985
## start.station.longitude start.station.longitude 0.08383532
## end.station.longitude end.station.longitude 0.08031992
## dayofweek7 dayofweek7 0.03239408
## dayofweek2 dayofweek2 0.00601062
## gender2 gender2 0.00000000
## dayofweek3 dayofweek3 0.00000000
## dayofweek4 dayofweek4 0.00000000
## dayofweek5 dayofweek5 0.00000000
## dayofweek6 dayofweek6 0.00000000
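caret’s varImp() reports the same relative-influence ranking rescaled to 0–100, which can be easier to compare across models:
# Variable importance scaled so the top predictor is 100
varImp(gbmmodel)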
y_test <- train_data$tripduration
predicted <- predict(gbmmodel, train_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE, 3), '\n')
## The root mean square error of the train data is 561.726
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss / tss)
cat('The R-square of the train data is ', round(rsq, 3), '\n')
## The R-square of the train data is 0.466
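For reference, caret’s postResample() computes RMSE, R-squared, and MAE in one call; note that its R-squared is a squared correlation, so it can differ slightly from the 1 - RSS/TSS figure above.
# One-call check of the hand-computed metrics
postResample(pred = predicted, obs = y_test)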
options(repr.plot.width = 8, repr.plot.height = 4)
my_data <- as.data.frame(cbind(predicted = predicted, observed = y_test))
# Plot predictions vs observed values on the training data
ggplot(my_data, aes(predicted, observed)) +
  geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) +
  ggtitle("Gradient Boosting Machine: Predictions vs Training Data") +
  xlab("Predicted Trip Duration in Seconds") +
  ylab("Observed Trip Duration in Seconds") +
  theme(plot.title = element_text(size = 16, hjust = 0.5),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 12, hjust = .5),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14))
## `geom_smooth()` using formula 'y ~ x'
<- read.csv("~/Downloads/citybike.test10000Awithduration.csv") validation_data
<-validation_data[validation_data$tripduration<=10800,]
validation_data$age <- 2019 - validation_data$birth.year
validation_data<-validation_data[validation_data$age<90,]
validation_data$distance <- distHaversine(validation_data[,5:6], validation_data[,9:10])
validation_data$dist.lat <- abs((validation_data$start.station.latitude - validation_data$end.station.latitude))
validation_data$dist.long <- abs((validation_data$start.station.longitude - validation_data$end.station.longitude))
validation_data$Hour<-hour(validation_data$starttime)
validation_data$dayofweek <- as.factor(wday(validation_data$starttime))
validation_data$weekday<-as.factor(as.numeric(sapply(validation_data$starttime, is_weekday)))
validation_datahead(validation_data %>% select("Hour","dayofweek","weekday"))
## Hour dayofweek weekday
## 1 8 3 1
## 2 19 4 1
## 3 18 6 1
## 4 8 2 1
## 5 12 5 1
## 6 18 3 1
validation_data$usertype <- as.factor(validation_data$usertype)
validation_data$gender <- as.factor(validation_data$gender)
y_test <- validation_data$tripduration
predicted <- predict(gbmmodel, validation_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the validation data is ', round(RMSE, 3), '\n')
## The root mean square error of the validation data is 564.821
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss / tss)
cat('The R-square of the validation data is ', round(rsq, 3), '\n')
## The R-square of the validation data is 0.466
options(repr.plot.width = 8, repr.plot.height = 4)
my_data <- as.data.frame(cbind(predicted = predicted, observed = y_test))
# Plot predictions vs observed values on the validation data
ggplot(my_data, aes(predicted, observed)) +
  geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) +
  ggtitle("Gradient Boosting Machine: Predictions vs Validation Data") +
  xlab("Predicted Trip Duration in Seconds") +
  ylab("Observed Trip Duration in Seconds") +
  theme(plot.title = element_text(size = 16, hjust = 0.5),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 12, hjust = .5),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14))
## `geom_smooth()` using formula 'y ~ x'
<-read.csv("~/Downloads/citybike.test20000.csv") test_data
$age <- 2019 - test_data$birth.year
test_data$distance <- distHaversine(test_data[,4:5], test_data[,8:9])
test_data$dist.lat <- abs((test_data$start.station.latitude - test_data$end.station.latitude))
test_data$dist.long <- abs((test_data$start.station.longitude - test_data$end.station.longitude))
test_data$Hour<-hour(test_data$starttime)
test_data$dayofweek <- as.factor(wday(test_data$starttime))
test_data$weekday<-as.factor(as.numeric(sapply(test_data$starttime, is_weekday)))
test_data$usertype<-as.factor(test_data$usertype)
test_data$gender<-as.factor(test_data$gender) test_data
= predict(gbmmodel,test_data)
predicted <-as.data.frame(x=predicted)
pred$ID<-c(1:20000)
predcolnames(pred) <- c("Predicted","ID")
<-subset(pred, select = c(ID, Predicted))
predwrite.csv(pred, file = "valhalla5.csv", row.names = FALSE)
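A quick sanity check on the exported submission (a minimal sketch based on the shapes established above):
# Confirm the file round-trips with the expected dimensions and columns
submission <- read.csv("valhalla5.csv")
stopifnot(nrow(submission) == 20000,
          identical(names(submission), c("ID", "Predicted")))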