Loading Required Libraries

library(dplyr)
library(ggplot2)
library(GGally)
library(lubridate)
library(caret)
library(gbm)
library(tidyverse)

Abstract

The NYC “CitiBike” bicycle sharing scheme went live in midtown and downtown Manhattan in 2013 and has been expanding ever since, both in daily ridership and in geographic footprint, as a growing number of “docking stations” bring the system to riders in Brooklyn, Queens, and northern parts of Manhattan that were not previously served.

One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system requires that more bikes be purchased and put into service to accommodate them, and heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate into an increase in revenue, because clever riders can avoid paying surcharges by keeping each trip below a specified time limit (either 30 or 45 minutes, depending on user category).

We examine CitiBike ridership data, joined with daily NYC weather data, to study the impact of weather on shared bike usage and to build a predictive model that estimates the number of trips taken each day. The goal is to estimate future demand, which would enable the system operator to make expansion plans.

Our finding is that ridership exhibits strong seasonality and is correlated with weather-related variables such as daily temperature and precipitation. Additionally, ridership is segmented by usertype (annual subscribers use the system much more heavily than casual users), by gender (there are many more male users than female), and by age (a large share of users are clustered in their late 30s).

Introduction

Since 2013 a shared bicycle system known as CitiBike has been available in New York City. The benefits of having such a system include reducing New Yorkers’ dependence on automobiles and promoting public health through the exercise of cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical, so long as they are aware of CitiBike’s pricing constraints.

There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments.

The available usage data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.

Data Gathering

Data sources and uploading

I obtained data from the following two sources:

1. CitiBike trip dataset

CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.

For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)

There are 78 monthly data files for the New York City bikeshare system, spanning July 2013 through December 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million trips in the summer of 2019. Because of the computational limitations this presented, I created 1/1000 samples: the samples were drawn non-deterministically, selecting nrow(file)/1000 rows uniformly at random from each file.
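A minimal sketch of that per-file sampling step, assuming the monthly CSVs have been unzipped into a local directory (the path and file pattern here are hypothetical); the analysis below then reads a pre-combined 2019 extract and samples it down further:

# Hypothetical location of the unzipped monthly trip files.
files <- list.files("~/citibike_raw", pattern = "tripdata.csv$", full.names = TRUE)

# Keep roughly 1 in 1000 trips from each month (left unseeded, matching the
# non-deterministic sampling described above).
sample_month <- function(path) {
  month <- read.csv(path)
  month[sample(nrow(month), nrow(month) %/% 1000), ]
}
citibike_samples <- do.call(rbind, lapply(files, sample_month))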

# Read the pre-combined 2019 trip file and sample it down to 20,000 trips
# to keep model fitting tractable.
citibike_2019 <- read.csv("~/citibike_2019.csv")
citibike_2019 <- citibike_2019[sample(nrow(citibike_2019), 20000), ]

2. Central Park daily weather data

I also obtained historical weather information for the year 2019 from the NCDC (National Climatic Data Center) by submitting an online request at https://www.ncdc.noaa.gov/cdo-web/search . Although the weather may vary slightly within New York City, I opted to use the Central Park observations as a proxy for the entire city’s weather.

We believe that the above data provides a reasonable representation of the target population (all CitiBike rides) and the citywide weather.

# Read the NCDC extract; the first row holds the column names.
weather <- read.csv("~/Downloads/Datasets/2812766.csv", header = FALSE)
names(weather) <- weather[1, ]
weather <- weather[-1, ]

# Keep only the Central Park station and the year 2019.
weather <- weather[weather$NAME == "NY CITY CENTRAL PARK, NY US" & year(weather$DATE) == 2019, ]

# Drop the "ATTRIBUTES" columns and the station-identification columns.
attr_indexes <- grep("ATTR", names(weather))
weather <- weather[, -attr_indexes]
weather <- weather[, -c(1:5)]

# Convert the measurements to numeric, treat missing values as 0, and parse dates.
weather <- weather %>%
  mutate(across(c(where(is.character), -DATE), as.numeric)) %>%
  mutate_all(~replace_na(., 0))
weather$DATE <- as.Date(weather$DATE)

# Retain the variables of interest: precipitation, snow, temperature, and wind.
weather <- weather %>% select("DATE","PRCP","SNOW","SNWD","TMAX","TMIN","WT01","WDF2","WDF5","WSF2","WSF5","WT08")
head(weather)
##             DATE PRCP SNOW SNWD TMAX TMIN WT01 WDF2 WDF5 WSF2 WSF5 WT08
## 58385 2019-01-01 0.06    0    0   58   39    1    0    0    0    0    0
## 58386 2019-01-02 0.00    0    0   40   35    0    0    0    0    0    0
## 58387 2019-01-03 0.00    0    0   44   37    0    0    0    0    0    0
## 58388 2019-01-04 0.00    0    0   47   35    0    0    0    0    0    0
## 58389 2019-01-05 0.50    0    0   47   41    1    0    0    0    0    0
## 58390 2019-01-06 0.00    0    0   49   31    0    0    0    0    0    0

Description of Data

In this section, I examine selected variables from the CitiBike and weather datasets. Several of them require transformation or cleaning, as missing values and outliers would otherwise impede analysis.

print(paste0("The citibike data consists of ", nrow(citibike_2019), " data points spread across ", ncol(citibike_2019), " features"))
## [1] "The citibike data consists of 20000 data points spread across 15 features"

Examine variable tripduration:

The tripduration is specified in seconds, but there are some outliers which may be incorrect, as the maximum value is quite high: 2,679,841 seconds, or about 31 days. We can assume that this data is bad, as nobody would willingly rent a bicycle for this period of time given the fees that would be charged. Here is a histogram of the original data distribution:
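A minimal sketch of that plot, with a log-scaled x-axis to handle the heavy right skew:

ggplot(citibike_2019, aes(x = tripduration)) +
  geom_histogram(bins = 50, fill = "steelblue") +
  scale_x_log10() +
  labs(x = "Trip duration (seconds, log scale)", y = "Count")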

Delete cases with unreasonable tripduration values

Let’s assume that nobody would rent a bicycle for more than a specified time limit (say, 2 hours, i.e. 7,200 seconds), and drop any records which exceed this:

# Count and then drop trips longer than 7,200 seconds (2 hours).
num_long_trips_removed <- nrow(citibike_2019[citibike_2019$tripduration > 7200, ])
citibike_2019 <- citibike_2019[citibike_2019$tripduration <= 7200, ]
print(paste0("Removed ", num_long_trips_removed, " trips of longer than 2 hours."))
## [1] "Removed 52 trips of longer than 2 hours."

Examine birth.year

Other inconsistencies concern the collection of birth.year, from which we can infer the age of the rider. In some months this value is omitted, while in others all values are populated. There are also a few records which suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does contain such anomalies. A substantial amount of time was therefore needed to detect and clean these inconsistencies.

The birth year for some users is as early as 1888, which is not plausible:

summary(citibike_2019$birth.year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1888    1969    1983    1980    1990    2003
citibike_2019$age <- 2019 - citibike_2019$birth.year  # all trips are from 2019

Remove trips associated with very old users (age>90)

num_old_age_removed <- nrow(citibike_2019[citibike_2019$age > 90, ])
citibike_2019 <- citibike_2019[citibike_2019$age <= 90, ]
print(paste0("Removed ", num_old_age_removed, " trips of people older than 90 years"))
## [1] "Removed 7 trips of people older than 90 years"

Compute distance between start and end stations

library(geosphere)
# distHaversine expects each point as (longitude, latitude) and returns meters.
citibike_2019$distance <- distHaversine(
  citibike_2019[, c("start.station.longitude", "start.station.latitude")],
  citibike_2019[, c("end.station.longitude", "end.station.latitude")])
citibike_2019$dist.lat <- abs(citibike_2019$start.station.latitude - citibike_2019$end.station.latitude)
citibike_2019$dist.long <- abs(citibike_2019$start.station.longitude - citibike_2019$end.station.longitude)
summary(citibike_2019$distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   565.9  1033.0  1329.8  1761.5 10524.7

Data Wrangling

# Returns TRUE for Monday through Friday (week_start = 1 makes Monday day 1).
is_weekday = function(timestamp){
  lubridate::wday(timestamp, week_start = 1) < 6
}
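A quick check of the helper on known dates:

is_weekday(as.Date("2019-01-04"))  # a Friday   -> TRUE
is_weekday(as.Date("2019-01-05"))  # a Saturday -> FALSE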

Feature Extraction

citibike_2019$start_date <- as.Date(citibike_2019$starttime)
citibike_2019$Hour <- hour(citibike_2019$starttime)
citibike_2019$dayofweek <- as.factor(wday(citibike_2019$starttime))
# is_weekday() is vectorized, so sapply() is unnecessary.
citibike_2019$weekday <- as.factor(as.numeric(is_weekday(citibike_2019$starttime)))
head(citibike_2019 %>% select("Hour","dayofweek","weekday"))
##        Hour dayofweek weekday
## 22763    14         6       1
## 151522   18         2       1
## 193496   10         5       1
## 111534   13         3       1
## 92034    17         4       1
## 99954    17         7       0

Convert categorical variables into factors

citibike_2019$usertype<-as.factor(citibike_2019$usertype)
citibike_2019$gender<-as.factor(citibike_2019$gender)

Convert trip duration from seconds to minutes

# Convert seconds to whole minutes, rounding down.
citibike_2019$tripduration <- floor(citibike_2019$tripduration/60)
head(citibike_2019$tripduration)
## [1]  2 17 20 12  8  7

Join Citibike and Weather data

# Attach to each trip the weather observed on its start date.
citibike_2019 <- citibike_2019 %>% inner_join(weather, by = c("start_date" = "DATE"))
head(citibike_2019)
##   tripduration                starttime                 stoptime
## 1            2 2019-03-29 14:02:53.9670 2019-03-29 14:05:21.3790
## 2           17 2019-09-16 18:25:57.1700 2019-09-16 18:43:50.1910
## 3           20 2019-11-14 10:13:32.5140 2019-11-14 10:33:39.4500
## 4           12 2019-07-02 13:54:31.9720 2019-07-02 14:06:51.8100
## 5            8 2019-07-17 17:53:20.4200 2019-07-17 18:01:28.8080
## 6            7 2019-07-27 17:21:39.2650 2019-07-27 17:29:16.3190
##   start.station.id          start.station.name start.station.latitude
## 1              415   Pearl St & Hanover Square               40.70472
## 2              523             W 38 St & 8 Ave               40.75467
## 3             3584 Eastern Pkwy & Franklin Ave               40.67078
## 4             3263    Cooper Square & Astor Pl               40.72951
## 5              490             8 Ave & W 33 St               40.75155
## 6              435             W 21 St & 6 Ave               40.74174
##   start.station.longitude end.station.id                 end.station.name
## 1               -74.00926            351             Front St & Maiden Ln
## 2               -73.99138            266                Avenue D & E 8 St
## 3               -73.95768           3349 Grand Army Plaza & Plaza St West
## 4               -73.99075           3746                6 Ave & Broome St
## 5               -73.99393           3235            E 41 St & Madison Ave
## 6               -73.99416            161            LaGuardia Pl & W 3 St
##   end.station.latitude end.station.longitude bikeid   usertype birth.year
## 1             40.70531             -74.00613  26789 Subscriber       1968
## 2             40.72368             -73.97575  30419 Subscriber       1979
## 3             40.67297             -73.97088  27801   Customer       1969
## 4             40.72431             -74.00473  35292 Subscriber       1997
## 5             40.75217             -73.97992  19691   Customer       1992
## 6             40.72917             -73.99810  21233 Subscriber       1993
##   gender age  distance     dist.lat  dist.long start_date Hour dayofweek
## 1      1  51  349.4083 0.0005918400 0.00313455 2019-03-29   14         6
## 2      1  40 1983.4821 0.0309823000 0.01563339 2019-09-16   18         2
## 3      0  50 1470.9324 0.0021912000 0.01319974 2019-11-14   10         5
## 4      1  22 1564.1685 0.0052066406 0.01397766 2019-07-02   13         3
## 5      1  27 1559.9298 0.0006142806 0.01401206 2019-07-17   17         4
## 6      1  26  584.6764 0.0125694400 0.00394675 2019-07-27   17         7
##   weekday PRCP SNOW SNWD TMAX TMIN WT01 WDF2 WDF5 WSF2 WSF5 WT08
## 1       1 0.00    0    0   59   49    0  250  230 10.1 16.1    0
## 2       1 0.00    0    0   77   68    0   60   60 12.1 15.0    0
## 3       1 0.00    0    0   45   28    0  260  230 12.1 19.9    0
## 4       1 0.02    0    0   85   71    0  270  330  8.9 14.1    0
## 5       1 1.82    0    0   93   73    1  290  300 14.1 27.1    0
## 6       0 0.00    0    0   85   72    0  220  220 12.1 19.0    0

Correlations of individual trip data features

We can examine correlations between variables both to understand their relationships and to stay alert to potential problems of multicollinearity. Here we compute Pearson (linear) and Spearman (rank) correlations between key variables in the individual CitiBike trip data.
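A minimal sketch of that computation on a few key numeric variables (ggpairs, from the GGally package loaded above, combines scatterplots and correlations in one view):

num_vars <- citibike_2019 %>% select(tripduration, distance, age, TMAX, PRCP)
cor(num_vars, method = "pearson")    # linear correlations
cor(num_vars, method = "spearman")   # rank correlations
ggpairs(num_vars)                    # pairwise plot matrix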

Train - Test Split

# Hold out 20% of the trips for testing; set a seed for reproducibility.
set.seed(123)
smp_size <- floor(0.8 * nrow(citibike_2019))
train_index <- sample(seq_len(nrow(citibike_2019)), size = smp_size)
train_data <- citibike_2019[train_index, ]
test_data <- citibike_2019[-train_index, ]
rm(citibike_2019)  # free memory; only train_data and test_data are used from here on

Prediction

Linear Model

Determine the columns to use for prediction

# Drop identifiers, timestamps, and (for this first model) all weather
# variables except TMAX.
colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","PRCP","DATE","SNOW","SNWD",
                                          "TMIN","WT01","WDF2","WDF5","WSF2","WSF5","WT08"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
names(X_train)
##  [1] "tripduration"           "start.station.latitude" "usertype"              
##  [4] "gender"                 "age"                    "distance"              
##  [7] "dist.lat"               "dist.long"              "Hour"                  
## [10] "dayofweek"              "TMAX"
linear_model<-lm(tripduration~., data = X_train)
summary(linear_model)
## 
## Call:
## lm(formula = tripduration ~ ., data = X_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.714  -3.490  -1.785   0.854 111.442 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.835e+02  9.075e+01  -4.226 2.39e-05 ***
## start.station.latitude  9.678e+00  2.228e+00   4.344 1.41e-05 ***
## usertypeSubscriber     -7.960e+00  2.323e-01 -34.269  < 2e-16 ***
## gender1                -2.295e+00  3.138e-01  -7.314 2.72e-13 ***
## gender2                -1.017e+00  3.287e-01  -3.093 0.001983 ** 
## age                     2.579e-02  5.820e-03   4.431 9.45e-06 ***
## distance                1.249e-02  7.714e-04  16.191  < 2e-16 ***
## dist.lat                2.394e+02  1.214e+01  19.714  < 2e-16 ***
## dist.long              -1.013e+03  8.037e+01 -12.607  < 2e-16 ***
## Hour                    7.824e-02  1.369e-02   5.716 1.11e-08 ***
## dayofweek2             -1.075e+00  2.639e-01  -4.074 4.64e-05 ***
## dayofweek3             -8.718e-01  2.564e-01  -3.400 0.000675 ***
## dayofweek4             -1.021e+00  2.594e-01  -3.936 8.31e-05 ***
## dayofweek5             -1.282e+00  2.634e-01  -4.867 1.15e-06 ***
## dayofweek6             -8.385e-01  2.581e-01  -3.249 0.001162 ** 
## dayofweek7              3.322e-01  2.655e-01   1.251 0.210872    
## TMAX                    1.935e-02  4.123e-03   4.694 2.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.423 on 15935 degrees of freedom
## Multiple R-squared:  0.4479, Adjusted R-squared:  0.4473 
## F-statistic: 807.8 on 16 and 15935 DF,  p-value: < 2.2e-16
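For comparability with the ensemble models below, here is a minimal sketch evaluating the linear model on the held-out test set (using the same RMSE and R-squared formulas applied to those models later):

y <- test_data$tripduration
pred <- predict(linear_model, test_data)
res <- y - pred
sqrt(mean(res^2))                      # test RMSE
1 - sum(res^2) / sum((y - mean(y))^2)  # test R-squared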

Stochastic Gradient Boosting Model

colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","DATE","SNOW","SNWD"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
names(X_train)
##  [1] "tripduration"           "start.station.latitude" "usertype"              
##  [4] "gender"                 "age"                    "distance"              
##  [7] "dist.lat"               "dist.long"              "Hour"                  
## [10] "dayofweek"              "PRCP"                   "TMAX"                  
## [13] "TMIN"                   "WT01"                   "WDF2"                  
## [16] "WDF5"                   "WSF2"                   "WSF5"                  
## [19] "WT08"
caretGrid <- expand.grid(interaction.depth = c(5, 7, 9, 11),
                         n.trees = (1:30) * 50,  # start at 50; gbm cannot fit 0 trees
                         shrinkage = 0.01,
                         n.minobsinnode = 20)

trainControl <- trainControl(method = "cv", number = 10)

metric <- "RMSE"

# bag.fraction < 1 subsamples the training rows at each boosting iteration,
# which is what makes the gradient boosting "stochastic".
gbmmodel <- train(tripduration ~ .,
                  data = X_train,
                  distribution = "gaussian",
                  method = "gbm",
                  trControl = trainControl,
                  tuneGrid = caretGrid,
                  metric = metric,
                  bag.fraction = 0.75,
                  verbose = FALSE)
gbmmodel
## Stochastic Gradient Boosting 
## 
## 15947 samples
##    20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 14352, 14352, 14353, 14352, 14351, 14353, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   5                  150      8.396824  0.4842752  5.071274
##   5                  300      7.969259  0.5036105  4.570354
##   5                  500      7.841625  0.5128264  4.408715
##   9                  150      8.233405  0.4984280  4.921338
##   9                  300      7.876072  0.5122470  4.472726
##   9                  500      7.791592  0.5180836  4.351670
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  9, shrinkage = 0.01 and n.minobsinnode = 20.
summary(gbmmodel)

##                                           var      rel.inf
## dist.lat                             dist.lat 54.316202421
## distance                             distance 19.083842261
## usertypeSubscriber         usertypeSubscriber 14.977561106
## start.station.latitude start.station.latitude  3.432306201
## dist.long                           dist.long  2.387969483
## Hour                                     Hour  1.229529693
## gender1                               gender1  1.011399331
## age                                       age  0.866055486
## WDF5                                     WDF5  0.488515892
## WSF5                                     WSF5  0.436981549
## TMAX                                     TMAX  0.426249924
## WDF2                                     WDF2  0.280782470
## dayofweek7                         dayofweek7  0.274624728
## dayofweek4                         dayofweek4  0.214640256
## PRCP                                     PRCP  0.160037537
## TMIN                                     TMIN  0.153969439
## WSF2                                     WSF2  0.135890022
## gender2                               gender2  0.069431631
## dayofweek2                         dayofweek2  0.021243737
## WT08                                     WT08  0.015523094
## WT01                                     WT01  0.009623772
## dayofweek6                         dayofweek6  0.004684990
## dayofweek5                         dayofweek5  0.002934976
## dayofweek3                         dayofweek3  0.000000000
## SNOW                                     SNOW  0.000000000
## SNWD                                     SNWD  0.000000000
trellis.par.set(caretTheme())
plot(gbmmodel)  

trellis.par.set(caretTheme())
plot(gbmmodel, metric = "RMSE", plotType = "level",
     scales = list(x = list(rot = 90)))

trellis.par.set(caretTheme())
densityplot(gbmmodel, pch = "|")

Evaluate Performance on the train data

y_train <- train_data$tripduration
predicted <- predict(gbmmodel, train_data)
residuals <- y_train - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is  7.974
y_train_mean <- mean(y_train)
# Calculate total sum of squares
tss <- sum((y_train - y_train_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is  0.505
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_train)
# Plot predictions vs the training data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method=lm) + ggtitle("Gradient Boosting Machine: Prediction vs Training Data") +
      xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
        theme(plot.title = element_text(size=16,hjust = 0.5),
         axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
         axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

Evaluate Performance on the test data

y_test <- test_data$tripduration
predicted <- predict(gbmmodel, test_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
## The root mean square error of the test data is  7.533
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the test data is ', round(rsq,3), '\n')
## The R-square of the test data is  0.54
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_test)
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method=lm) + ggtitle("Gradient Boosting Machine: Prediction vs Test Data") +
      xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
        theme(plot.title = element_text(size=16,hjust = 0.5),
         axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
         axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

Extreme Gradient Boosting (xgboost)

colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","DATE"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
# xgboost requires numeric input, so recode the factor columns as integers.
X_train <- X_train %>% mutate(across(where(is.factor), as.numeric))
summary(X_train)
##   tripduration    start.station.latitude    usertype         gender     
##  Min.   :  1.00   Min.   :40.66          Min.   :1.000   Min.   :1.000  
##  1st Qu.:  6.00   1st Qu.:40.72          1st Qu.:2.000   1st Qu.:2.000  
##  Median : 10.00   Median :40.74          Median :2.000   Median :2.000  
##  Mean   : 13.18   Mean   :40.74          Mean   :1.856   Mean   :2.167  
##  3rd Qu.: 18.00   3rd Qu.:40.76          3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :119.00   Max.   :40.82          Max.   :2.000   Max.   :3.000  
##       age           distance       dist.lat          dist.long       
##  Min.   :16.00   Min.   :   0   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:29.00   1st Qu.: 564   1st Qu.:0.003804   1st Qu.:0.003967  
##  Median :36.00   Median :1028   Median :0.008241   Median :0.008306  
##  Mean   :38.75   Mean   :1329   Mean   :0.011924   Mean   :0.010961  
##  3rd Qu.:49.00   3rd Qu.:1768   3rd Qu.:0.016128   3rd Qu.:0.015002  
##  Max.   :85.00   Max.   :9836   Max.   :0.114938   Max.   :0.088153  
##       Hour         dayofweek          PRCP             SNOW        
##  Min.   : 0.00   Min.   :1.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:10.00   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :15.00   Median :4.000   Median :0.0000   Median :0.00000  
##  Mean   :13.89   Mean   :4.048   Mean   :0.1029   Mean   :0.01736  
##  3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:0.0400   3rd Qu.:0.00000  
##  Max.   :23.00   Max.   :7.000   Max.   :1.8300   Max.   :4.00000  
##       SNWD              TMAX            TMIN            WT01       
##  Min.   :0.00000   Min.   :14.00   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:57.00   1st Qu.:42.00   1st Qu.:0.0000  
##  Median :0.00000   Median :72.00   Median :56.00   Median :0.0000  
##  Mean   :0.02712   Mean   :68.19   Mean   :53.65   Mean   :0.3994  
##  3rd Qu.:0.00000   3rd Qu.:81.00   3rd Qu.:67.00   3rd Qu.:1.0000  
##  Max.   :3.90000   Max.   :95.00   Max.   :82.00   Max.   :1.0000  
##       WDF2            WDF5            WSF2            WSF5      
##  Min.   :  0.0   Min.   :  0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 50.0   1st Qu.: 50.0   1st Qu.: 8.90   1st Qu.:15.00  
##  Median :180.0   Median :190.0   Median :12.10   Median :19.00  
##  Mean   :163.9   Mean   :164.8   Mean   :11.41   Mean   :18.47  
##  3rd Qu.:270.0   3rd Qu.:270.0   3rd Qu.:14.10   3rd Qu.:23.00  
##  Max.   :360.0   Max.   :360.0   Max.   :25.10   Max.   :40.90  
##       WT08       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1579  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

Convert the training set into a numeric matrix for xgboost. (xgboost’s native interface works on its DMatrix class; when xgboost is called through caret, as below, that conversion is handled internally.)
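For reference, a minimal sketch of the native-interface conversion, which is not needed below since caret manages this step:

# Native xgboost interface: build a DMatrix from predictors and labels.
dtrain <- xgboost::xgb.DMatrix(
  data  = as.matrix(X_train %>% select(-tripduration)),
  label = train_data$tripduration)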

library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
X_train<-as.matrix(X_train %>% select(-tripduration))
Y_train<-train_data$tripduration
xgb_trcontrol = trainControl(
  method = "cv",
  number = 10,  
  allowParallel = TRUE,
  verboseIter = FALSE,
  returnData = FALSE
)

xgbGrid <- expand.grid(nrounds = c(100,200,300),
                       max_depth = c(5,10, 15, 20, 25),
                       colsample_bytree = seq(0.5, 0.9, length.out = 5),
                       eta = 0.1,
                       gamma=0,
                       min_child_weight = 1,
                       subsample = 1
                      )
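This grid is fairly large: 3 values of nrounds × 5 of max_depth × 5 of colsample_bytree gives 75 candidate settings, so 10-fold cross-validation fits 750 models. A quick size check:

nrow(xgbGrid)  # 75 combinations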

xgb_model = train(
  X_train, Y_train,  
  trControl = xgb_trcontrol,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)
print(xgb_model)
## eXtreme Gradient Boosting 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 14353, 14352, 14352, 14352, 14353, 14352, ... 
## Resampling results across tuning parameters:
## 
##   max_depth  colsample_bytree  nrounds  RMSE      Rsquared   MAE     
##   10         0.5               100      7.990662  0.4939391  4.459054
##   10         0.5               200      8.028360  0.4898218  4.507921
##   10         0.6               100      7.985940  0.4950087  4.448261
##   10         0.6               200      8.031759  0.4901626  4.501365
##   10         0.7               100      8.031170  0.4907930  4.459901
##   10         0.7               200      8.073485  0.4865763  4.512910
##   10         0.8               100      8.089547  0.4845796  4.470737
##   10         0.8               200      8.128734  0.4808463  4.529494
##   10         0.9               100      8.091202  0.4850372  4.487502
##   10         0.9               200      8.134691  0.4808676  4.546471
##   15         0.5               100      8.115434  0.4767399  4.671796
##   15         0.5               200      8.120068  0.4761192  4.679978
##   15         0.6               100      8.100678  0.4797061  4.611761
##   15         0.6               200      8.106604  0.4790475  4.625847
##   15         0.7               100      8.113880  0.4784341  4.584467
##   15         0.7               200      8.120708  0.4777263  4.598688
##   15         0.8               100      8.163480  0.4754049  4.565503
##   15         0.8               200      8.172585  0.4745141  4.583392
##   15         0.9               100      8.304776  0.4620645  4.618012
##   15         0.9               200      8.313216  0.4613428  4.634577
##   20         0.5               100      8.158981  0.4727551  4.739602
##   20         0.5               200      8.159012  0.4727186  4.741379
##   20         0.6               100      8.116777  0.4771562  4.648182
##   20         0.6               200      8.117489  0.4770717  4.650551
##   20         0.7               100      8.143830  0.4745711  4.608688
##   20         0.7               200      8.145033  0.4744563  4.611458
##   20         0.8               100      8.212076  0.4681641  4.600514
##   20         0.8               200      8.213553  0.4680448  4.604386
##   20         0.9               100      8.309599  0.4605969  4.640181
##   20         0.9               200      8.311563  0.4604600  4.644056
##   25         0.5               100      8.155933  0.4728047  4.748209
##   25         0.5               200      8.156126  0.4727438  4.750144
##   25         0.6               100      8.111947  0.4775316  4.639966
##   25         0.6               200      8.112506  0.4774562  4.642173
##   25         0.7               100      8.173949  0.4707071  4.607061
##   25         0.7               200      8.174838  0.4706242  4.609271
##   25         0.8               100      8.169826  0.4733180  4.595642
##   25         0.8               200      8.171277  0.4731924  4.598050
##   25         0.9               100      8.318991  0.4586380  4.632721
##   25         0.9               200      8.320909  0.4584981  4.635162
## 
## Tuning parameter 'eta' was held constant at a value of 0.1
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## 
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 100, max_depth = 10, eta
##  = 0.1, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 1.
summary(xgb_model)
##               Length  Class              Mode       
## handle              1 xgb.Booster.handle externalptr
## raw           4084628 -none-             raw        
## niter               1 -none-             numeric    
## call                5 -none-             call       
## params              8 -none-             list       
## callbacks           1 -none-             list       
## feature_names      20 -none-             character  
## nfeatures           1 -none-             numeric    
## xNames             20 -none-             character  
## problemType         1 -none-             character  
## tuneValue           7 data.frame         list       
## obsLevels           1 -none-             logical    
## param               0 -none-             list
trellis.par.set(caretTheme())
plot(xgb_model)  

trellis.par.set(caretTheme())
plot(xgb_model, metric = "RMSE", plotType = "level",
     scales = list(x = list(rot = 90)))

trellis.par.set(caretTheme())
densityplot(xgb_model, pch = "|")
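caret can also report which features drive the xgboost model, mirroring the relative-influence table shown for gbm above; a minimal sketch:

xgb_imp <- varImp(xgb_model)  # importance scaled to 0-100 by caret
print(xgb_imp)
plot(xgb_imp, top = 10)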

Evaluate Performance on the train data

y_train <- train_data$tripduration
predicted <- predict(xgb_model, X_train)
residuals <- y_train - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is  7.944
y_train_mean <- mean(y_train)
# Calculate total sum of squares
tss <- sum((y_train - y_train_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is  0.508
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_train)
# Plot predictions vs the training data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method=lm) + ggtitle("XGBoost: Prediction vs Training Data") +
      xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
        theme(plot.title = element_text(size=16,hjust = 0.5),
         axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
         axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

Evaluate Performance on the test data

y_test <- test_data$tripduration
X_test <- test_data[, which(names(test_data) %in% colstouse)]
# Mirror the training preprocessing: integer-code the factors and drop the target.
X_test <- X_test %>% mutate(across(where(is.factor), as.numeric)) %>% select(-tripduration)

predicted <- predict(xgb_model, as.matrix(X_test))
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
## The root mean square error of the test data is  7.628
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the test data is ', round(rsq,3), '\n')
## The R-square of the test data is  0.528
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_test)
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method=lm) + ggtitle("XGBoost: Prediction vs Test Data") +
      xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
        theme(plot.title = element_text(size=16,hjust = 0.5),
         axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
         axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

Random Forest Model

colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","DATE","SNOW","SNWD"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
names(X_train)
##  [1] "tripduration"           "start.station.latitude" "usertype"              
##  [4] "gender"                 "age"                    "distance"              
##  [7] "dist.lat"               "dist.long"              "Hour"                  
## [10] "dayofweek"              "PRCP"                   "TMAX"                  
## [13] "TMIN"                   "WT01"                   "WDF2"                  
## [16] "WDF5"                   "WSF2"                   "WSF5"                  
## [19] "WT08"
library(randomForest)

control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 3)

# Models are compared on root mean squared error (RMSE)
metric <- "RMSE"
set.seed(123)

# mtry is the number of variables randomly sampled at each split;
# sqrt(number of predictors) is a common heuristic.
mtry <- sqrt(ncol(X_train) - 1)
tunegrid <- expand.grid(.mtry = mtry)
rfmodel <- train(tripduration ~ ., 
                      data = X_train,
                      method='rf', 
                      metric=metric, 
                      tuneGrid=tunegrid, 
                      trControl=control)
rfmodel
## Random Forest 
## 
## 15947 samples
##    18 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 14352, 14351, 14352, 14350, 14352, 14354, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.874974  0.5089116  4.489987
## 
## Tuning parameter 'mtry' was held constant at a value of 4.242641
summary(rfmodel)
##                 Length Class      Mode     
## call                4  -none-     call     
## type                1  -none-     character
## predicted       15947  -none-     numeric  
## mse               500  -none-     numeric  
## rsq               500  -none-     numeric  
## oob.times       15947  -none-     numeric  
## importance         24  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             11  -none-     list     
## coefs               0  -none-     NULL     
## y               15947  -none-     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             24  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           1  -none-     logical  
## param               0  -none-     list
trellis.par.set(caretTheme())
densityplot(rfmodel, pch = "|")
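As with the boosted models, variable importance can be extracted from the fitted forest; a minimal sketch:

rf_imp <- varImp(rfmodel)  # importance scaled to 0-100 by caret
plot(rf_imp, top = 10)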

Evaluate Performance on the train data

y_train <- train_data$tripduration
predicted <- predict(rfmodel, train_data)
residuals <- y_train - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is  7.916
y_train_mean <- mean(y_train)
# Calculate total sum of squares
tss <- sum((y_train - y_train_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is  0.512
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_train)
# Plot predictions vs the training data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method=lm) + ggtitle("Random Forest: Prediction vs Training Data") +
      xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
        theme(plot.title = element_text(size=16,hjust = 0.5),
         axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
         axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

Evaluate Performance on the test data

y_test <- test_data$tripduration
predicted <- predict(rfmodel, test_data)  # score the random forest on the held-out set
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
## The root mean square error of the test data is  7.533
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the test data is ', round(rsq,3), '\n')
## The R-square of the test data is  0.54
options(repr.plot.width=8, repr.plot.height=4)
my_data <- data.frame(predicted = predicted, observed = y_test)
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method=lm) + ggtitle("Random Forest: Prediction vs Test Data") +
      xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
        theme(plot.title = element_text(size=16,hjust = 0.5),
         axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
         axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
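Finally, a minimal sketch that gathers the held-out metrics of the three ensemble models into one table, assuming gbmmodel, xgb_model, rfmodel, test_data, and the X_test built above are still in memory:

# Compute test RMSE and R-squared for a vector of predictions.
eval_on_test <- function(pred, y) {
  res <- y - pred
  c(RMSE = sqrt(mean(res^2)),
    Rsq  = 1 - sum(res^2) / sum((y - mean(y))^2))
}
y <- test_data$tripduration
rbind(
  GBM          = eval_on_test(predict(gbmmodel, test_data), y),
  XGBoost      = eval_on_test(predict(xgb_model, as.matrix(X_test)), y),
  RandomForest = eval_on_test(predict(rfmodel, test_data), y))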