Loading Required Libraries

library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(GGally)     # pairwise plot matrices
library(lubridate)  # date/time handling
library(caret)      # model training and cross-validation
library(gbm)        # gradient boosted models
library(tidyverse)

1 Abstract

The NYC “CitiBike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013, and has been expanding ever since, both in daily ridership and in geographic footprint, as a growing number of “docking stations” welcomes riders in Brooklyn, Queens, and northern parts of Manhattan which were not previously served.

One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system requires that more bikes be purchased and put into service to accommodate them. Heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate into an increase in revenue, because clever riders can avoid paying surcharges by keeping each trip below a specified time limit (either 30 or 45 minutes, depending on user category).

We seek to examine CitiBike ridership data, joined with daily NYC weather data, to study the impact of weather on shared bike usage and to generate a predictive model which can estimate the number of trips taken on each day. The goal is to estimate future demand, which would enable the system operator to make expansion plans.

Our finding is that ridership exhibits strong seasonality, with correlation to weather-related variables such as daily temperature and precipitation. Additionally, ridership is segmented by user_type (annual subscribers use the system much more heavily than casual users), gender (there are many more male users than female users) and age (a large number of users are clustered in their late 30s).

2 Introduction

Since 2013 a shared bicycle system known as CitiBike has been available in New York City. The benefits of having such a system include reducing New Yorkers’ dependence on automobiles and promoting public health through the exercise attained by cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of CitiBike’s pricing constraints.

There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments.

The available usage data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.

3 Data Gathering

Data sources and loading

I obtained data from the following source:

1. CitiBike trip dataset

CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.

For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)

There are monthly data files for the New York City bikeshare system spanning July 2013 through December 2019, or 78 files in all. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million trips in the summer of 2019. Because of the computational limitations this presented, I created 1-in-1000 samples. The samples were created non-deterministically, by randomly selecting nrow(file)/1000 rows from each file.
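
A minimal sketch of how such sampling could be reproduced; the directory and file pattern below are hypothetical placeholders, not the actual layout of the source data:

# Hypothetical sketch: draw a 1-in-1000 sample from each monthly trip file.
monthly_files <- list.files("~/Downloads/tripdata", pattern = "\\.csv$", full.names = TRUE)
sample_month <- function(path) {
  month_data <- read.csv(path)
  # sample nrow/1000 rows at random (no seed set, mirroring the original
  # non-deterministic sampling)
  month_data[sample(nrow(month_data), nrow(month_data) %/% 1000), ]
}
citibike_sample <- do.call(rbind, lapply(monthly_files, sample_month))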

# Load the sampled CitiBike trips used for training
citibike_2019 <- read.csv("~/Downloads/citybike.train.csv")

Description of Data

In this section, I examine selected individual variables from the CitiBike and Weather datasets. These items require transformation and/or cleaning, as they contain missing values or outliers which would otherwise impede analysis.

Examine variable tripduration:

The tripduration is specified in seconds, but there are some outliers which may be incorrect, as the value for Max is quite high: 35,984 seconds, or 0.4164815 days. We can assume that this data is bad, as nobody would willingly rent a bicycle for this period of time, given the fees that would be charged. Here is a histogram of the original data distribution:
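
A sketch of plotting code that would produce such a histogram (the binwidth is an arbitrary choice):

# Histogram of raw trip durations; the long right tail comes from the outliers
ggplot(citibike_2019, aes(x = tripduration)) +
  geom_histogram(binwidth = 300) +
  labs(x = "Trip duration (seconds)", y = "Count")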

Delete cases with unreasonable tripduration values

Let’s assume that nobody would rent a bicycle for more than a specified time limit (say, 3 hours), and drop any records which exceed this:

num_long_trips_removed <- nrow(citibike_2019[citibike_2019$tripduration > 10800, ])
citibike_2019 <- citibike_2019[citibike_2019$tripduration <= 10800, ]
print(paste0("Removed ", num_long_trips_removed, " trips longer than 3 hours."))
## [1] "Removed 49 trips longer than 3 hours."

Examine birth_year

Other inconsistencies concern the collection of birth_year, from which we can infer the age of the participant. In some months this value is omitted, while in other months all values are populated. There are also a few records which suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does contain such anomalies. Thus, a substantial amount of time was needed to detect and clean such inconsistencies.

The birth year for some users is as early as 1886, which is clearly impossible:

summary(citibike_2019$birth.year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1886    1969    1982    1980    1990    2003
citibike_2019$age <- 2019 - citibike_2019$birth.year  # infer rider age from birth year

Remove trips associated with very old users (age>90)

num_old_age_removed <- nrow(citibike_2019[citibike_2019$age > 90, ])
citibike_2019 <- citibike_2019[citibike_2019$age <= 90, ]  # keep the filter consistent with the count above
print(paste0("Removed ", num_old_age_removed, " trips of people older than 90 years"))
## [1] "Removed 27 trips of people older than 90 years"

Compute distance between start and end stations

library(geosphere)
# distHaversine() expects (longitude, latitude) pairs and returns the
# great-circle ("as the crow flies") distance in meters.
citibike_2019$distance <- distHaversine(
  citibike_2019[, c("start.station.longitude", "start.station.latitude")],
  citibike_2019[, c("end.station.longitude", "end.station.latitude")])
citibike_2019$dist.lat <- abs(citibike_2019$start.station.latitude - citibike_2019$end.station.latitude)
citibike_2019$dist.long <- abs(citibike_2019$start.station.longitude - citibike_2019$end.station.longitude)
summary(citibike_2019$distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   590.7  1071.9  1374.2  1816.5 10273.2

Data Wrangling

# Returns TRUE for Monday through Friday (week_start = 1 makes Monday day 1)
is_weekday <- function(timestamp){
  lubridate::wday(timestamp, week_start = 1) < 6
}
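
A quick check of the helper (July 4, 2019 fell on a Thursday):

is_weekday(as.Date("2019-07-04"))  # TRUE: Thursday is a weekday
is_weekday(as.Date("2019-07-06"))  # FALSE: Saturday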

Feature Extraction

citibike_2019$Hour <- hour(citibike_2019$starttime)
citibike_2019$dayofweek <- as.factor(wday(citibike_2019$starttime))  # default wday(): Sunday = 1
citibike_2019$weekday <- as.factor(as.numeric(is_weekday(citibike_2019$starttime)))  # wday() is vectorized, so sapply() is unnecessary
head(citibike_2019 %>% select("Hour","dayofweek","weekday"))
##   Hour dayofweek weekday
## 1   13         5       1
## 2    7         5       1
## 3    6         5       1
## 4    9         5       1
## 5   20         6       1
## 6   16         7       0

Convert into factor variables

citibike_2019$usertype <- as.factor(citibike_2019$usertype)
citibike_2019$gender <- as.factor(citibike_2019$gender)  # per the data dictionary: 0 = unknown, 1 = male, 2 = female

Determine the columns to use for prediction

colstouse <- setdiff(names(citibike_2019),
                     c("starttime","start.station.id","end.station.id",
                       "stoptime","start.station.name","end.station.name",
                       "start.station.latitude","start.station.longitude",
                       "end.station.latitude","end.station.longitude",
                       "weekday","bikeid","birth.year"))
train_data <- citibike_2019[, colstouse]
names(train_data)
## [1] "tripduration" "usertype"     "gender"       "age"          "distance"    
## [6] "dist.lat"     "dist.long"    "Hour"         "dayofweek"
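
Linear Model

As a baseline, fit an ordinary least-squares regression of trip duration on the remaining features: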
linear_model <- lm(tripduration ~ ., data = train_data)
summary(linear_model)
## 
## Call:
## lm(formula = tripduration ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1904.6  -222.6  -116.4    51.5 10071.3 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         8.136e+02  1.920e+01  42.363  < 2e-16 ***
## usertypeSubscriber -5.741e+02  9.171e+00 -62.593  < 2e-16 ***
## gender1            -7.316e+01  1.215e+01  -6.024 1.72e-09 ***
## gender2             6.331e+00  1.268e+01   0.499    0.618    
## age                 1.890e+00  2.325e-01   8.130 4.39e-16 ***
## distance            7.272e-01  2.926e-02  24.850  < 2e-16 ***
## dist.lat            1.507e+04  4.587e+02  32.846  < 2e-16 ***
## dist.long          -5.788e+04  3.044e+03 -19.012  < 2e-16 ***
## Hour                4.977e+00  5.508e-01   9.037  < 2e-16 ***
## dayofweek2         -5.842e+01  1.224e+01  -4.774 1.82e-06 ***
## dayofweek3         -1.210e+02  1.205e+01 -10.043  < 2e-16 ***
## dayofweek4         -1.133e+02  1.148e+01  -9.869  < 2e-16 ***
## dayofweek5         -1.200e+02  1.161e+01 -10.338  < 2e-16 ***
## dayofweek6         -9.834e+01  1.140e+01  -8.630  < 2e-16 ***
## dayofweek7          1.544e+01  1.195e+01   1.292    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 590.4 on 49909 degrees of freedom
## Multiple R-squared:  0.4097, Adjusted R-squared:  0.4095 
## F-statistic:  2474 on 14 and 49909 DF,  p-value: < 2.2e-16
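
For later comparison with the boosted model, the linear baseline’s in-sample RMSE can be computed from its residuals (a quick sketch; this value is not part of the original output):

# In-sample RMSE of the linear baseline
sqrt(mean(residuals(linear_model)^2))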

Gradient Boosted Model

colstouse <- setdiff(names(citibike_2019),
                     c("starttime","start.station.id","end.station.id",
                       "stoptime","start.station.name","end.station.name",
                       "birth.year"))
train_data <- citibike_2019[, colstouse]
names(train_data)
##  [1] "tripduration"            "start.station.latitude" 
##  [3] "start.station.longitude" "end.station.latitude"   
##  [5] "end.station.longitude"   "bikeid"                 
##  [7] "usertype"                "gender"                 
##  [9] "age"                     "distance"               
## [11] "dist.lat"                "dist.long"              
## [13] "Hour"                    "dayofweek"              
## [15] "weekday"

# Tuning grid: tree depth, number of trees, and learning rate (shrinkage)
caretGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                         n.trees = c(150, 300, 500),
                         shrinkage = c(0.01, 0.001),
                         n.minobsinnode = 10)

trainControl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

metric <- "RMSE"

gbmmodel <- train(tripduration ~ .,
                  data = train_data,
                  distribution = "gaussian",
                  method = "gbm",
                  trControl = trainControl,
                  tuneGrid = caretGrid,
                  metric = metric,
                  bag.fraction = 0.75,
                  verbose = FALSE)
print(gbmmodel)
## Stochastic Gradient Boosting 
## 
## 49924 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 44932, 44931, 44931, 44931, 44931, 44932, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   0.001      1                  150      748.4269  0.2190057  504.6563
##   0.001      1                  300      732.7663  0.2330968  485.7097
##   0.001      1                  500      715.6451  0.2851715  465.8422
##   0.001      3                  150      736.0388  0.3450818  491.8834
##   0.001      3                  300      710.3006  0.3533046  462.7603
##   0.001      3                  500      683.6676  0.3637127  431.7517
##   0.001      5                  150      731.7017  0.3730394  488.4610
##   0.001      5                  300      702.6904  0.3821968  455.4932
##   0.001      5                  500      673.0168  0.3909381  420.6314
##   0.010      1                  150      657.3529  0.3767908  400.2601
##   0.010      1                  300      617.2269  0.4024616  353.9881
##   0.010      1                  500      594.4068  0.4248682  328.1240
##   0.010      3                  150      613.1282  0.4140863  346.8069
##   0.010      3                  300      582.4484  0.4384686  310.0238
##   0.010      3                  500      571.1499  0.4513926  295.6785
##   0.010      5                  150      600.9113  0.4307728  334.0063
##   0.010      5                  300      574.0135  0.4506206  300.1469
##   0.010      5                  500      566.0267  0.4594682  289.5523
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  5, shrinkage = 0.01 and n.minobsinnode = 10.
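
The selected hyperparameters can also be retrieved programmatically:

gbmmodel$bestTune  # the row of the tuning grid with the lowest cross-validated RMSE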
summary(gbmmodel)

##                                             var     rel.inf
## dist.lat                               dist.lat 53.36517774
## distance                               distance 22.05152386
## usertypeSubscriber           usertypeSubscriber 19.89447980
## end.station.latitude       end.station.latitude  1.13960198
## Hour                                       Hour  0.89695741
## gender1                                 gender1  0.65178326
## weekday1                               weekday1  0.55642742
## start.station.latitude   start.station.latitude  0.52993178
## dist.long                             dist.long  0.52141913
## age                                         age  0.10518781
## bikeid                                   bikeid  0.08494985
## start.station.longitude start.station.longitude  0.08383532
## end.station.longitude     end.station.longitude  0.08031992
## dayofweek7                           dayofweek7  0.03239408
## dayofweek2                           dayofweek2  0.00601062
## gender2                                 gender2  0.00000000
## dayofweek3                           dayofweek3  0.00000000
## dayofweek4                           dayofweek4  0.00000000
## dayofweek5                           dayofweek5  0.00000000
## dayofweek6                           dayofweek6  0.00000000
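
caret’s varImp() reports a rescaled version of the same ranking:

varImp(gbmmodel)  # relative influence rescaled so the top variable scores 100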

Evaluate Performance on the train data

y_train <- train_data$tripduration  # actual durations from the training set
predicted <- predict(gbmmodel, train_data)
residuals <- y_train - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is  561.726
y_train_mean <- mean(y_train)
# Calculate total sum of squares
tss <- sum((y_train - y_train_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is  0.466
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_train)
# Plot predictions vs observed training data
ggplot(my_data, aes(predicted, observed)) +
  geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) +
  ggtitle("Gradient Boosting Machine: Prediction vs Training Data") +
  xlab("Predicted Trip Duration in Seconds") + ylab("Observed Trip Duration in Seconds") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
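
Because the same RMSE/R-squared computation is repeated for the validation data below, it could be folded into a small helper; a refactoring sketch, not part of the original workflow:

# Hypothetical helper: report RMSE and R-squared of a model on a data set
eval_performance <- function(model, data, label) {
  observed  <- data$tripduration
  predicted <- predict(model, data)
  rmse <- sqrt(mean((observed - predicted)^2))
  rsq  <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
  cat('The root mean square error of the', label, 'data is ', round(rmse, 3), '\n')
  cat('The R-square of the', label, 'data is ', round(rsq, 3), '\n')
}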

Evaluate Performance on the validation data

validation_data <- read.csv("~/Downloads/citybike.test10000Awithduration.csv")  # held-out trips with known durations

Perform the same preprocessing on the validation data set

validation_data <- validation_data[validation_data$tripduration <= 10800, ]
validation_data$age <- 2019 - validation_data$birth.year
validation_data <- validation_data[validation_data$age <= 90, ]
validation_data$distance <- distHaversine(
  validation_data[, c("start.station.longitude", "start.station.latitude")],
  validation_data[, c("end.station.longitude", "end.station.latitude")])
validation_data$dist.lat <- abs(validation_data$start.station.latitude - validation_data$end.station.latitude)
validation_data$dist.long <- abs(validation_data$start.station.longitude - validation_data$end.station.longitude)
validation_data$Hour <- hour(validation_data$starttime)
validation_data$dayofweek <- as.factor(wday(validation_data$starttime))
validation_data$weekday <- as.factor(as.numeric(is_weekday(validation_data$starttime)))
head(validation_data %>% select("Hour","dayofweek","weekday"))
##   Hour dayofweek weekday
## 1    8         3       1
## 2   19         4       1
## 3   18         6       1
## 4    8         2       1
## 5   12         5       1
## 6   18         3       1
validation_data$usertype <- as.factor(validation_data$usertype)
validation_data$gender <- as.factor(validation_data$gender)
y_test <- validation_data$tripduration
predicted <- predict(gbmmodel, validation_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the validation data is ', round(RMSE,3),'\n')
## The root mean square error of the validation data is  564.821
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the validation data is ', round(rsq,3), '\n')
## The R-square of the validation data is  0.466
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_test)
# Plot predictions vs validation data
ggplot(my_data, aes(predicted, observed)) +
  geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) +
  ggtitle("Gradient Boosting Machine: Prediction vs Validation Data") +
  xlab("Predicted Trip Duration in Seconds") + ylab("Observed Trip Duration in Seconds") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

Run Inference on the test data

test_data <- read.csv("~/Downloads/citybike.test20000.csv")  # 20,000 trips whose durations are to be predicted

Perform data preprocessing

test_data$age <- 2019 - test_data$birth.year
test_data$distance <- distHaversine(
  test_data[, c("start.station.longitude", "start.station.latitude")],
  test_data[, c("end.station.longitude", "end.station.latitude")])
test_data$dist.lat <- abs(test_data$start.station.latitude - test_data$end.station.latitude)
test_data$dist.long <- abs(test_data$start.station.longitude - test_data$end.station.longitude)
test_data$Hour <- hour(test_data$starttime)
test_data$dayofweek <- as.factor(wday(test_data$starttime))
test_data$weekday <- as.factor(as.numeric(is_weekday(test_data$starttime)))
test_data$usertype <- as.factor(test_data$usertype)
test_data$gender <- as.factor(test_data$gender)
predicted <- predict(gbmmodel, test_data)
pred <- data.frame(ID = 1:20000, Predicted = predicted)  # same result as building, renaming, and reordering columns
write.csv(pred, file = "valhalla5.csv", row.names = FALSE)
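
As a final sanity check (an addition, not part of the original workflow), verify that the submission has the expected shape and contains no missing predictions:

# Stop if the submission is malformed or incomplete
stopifnot(nrow(pred) == 20000, !anyNA(pred$Predicted))
summary(pred$Predicted)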