library(dplyr)
library(ggplot2)
library(GGally)
library(lubridate)
library(caret)
library(gbm)
library(tidyverse)
The NYC “CitiBike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013 and has been expanding ever since, both in daily ridership and in geographic footprint, as a growing number of “docking stations” bring the system to riders in Brooklyn, Queens, and northern parts of Manhattan that were not previously served.
One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system requires that more bikes be purchased and put into service to accommodate them. Heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate to an increase in revenue, because clever riders can avoid paying surcharges by keeping each trip below a specified limit (either 30 or 45 minutes, depending on user category).
We seek to examine CitiBike ridership data, joined with daily NYC weather data, to study the impact of weather on shared bike usage and generate a predictive model which can estimate the number of trips that would be taken on each day.
The goal is to estimate future demand, which would enable the system operator to make expansion plans.
Our finding is that ridership exhibits strong seasonality, with correlation to weather-related variables such as daily temperature and precipitation. Additionally, ridership is segmented by user type (annual subscribers use the system much more heavily than casual users), by gender (there are many more male users than female users), and by age (a large number of users are clustered in their late 30s).
Since 2013 a shared bicycle system known as CitiBike has been available in New York City. The benefits of such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of CitiBike’s pricing constraints.
There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments.
The available usage data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.
I obtained data from the following two sources:
CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.
For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)
There are currently 108 monthly data files for the New York City bikeshare system, spanning July 2013 through December 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million trips this past summer. Because of the computational limitations this presented, I created 1/1000 samples. The samples were created non-deterministically, by randomly selecting nrow(file)/1000 rows from each file.
citibike_2019 <- read.csv("~/citibike_2019.csv")
citibike_2019 <- citibike_2019[sample(nrow(citibike_2019), 20000),]
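The read above uses the combined 2019 extract; for reference, a minimal sketch of the per-file 1/1000 sampling described earlier, assuming the monthly CSVs have been downloaded to a local directory (the path and file layout are hypothetical):

# Hypothetical layout: one trip CSV per month in ~/Downloads/citibike_monthly
files <- list.files("~/Downloads/citibike_monthly", pattern = "\\.csv$", full.names = TRUE)
sample_month <- function(f) {
  trips <- read.csv(f)
  trips[sample(nrow(trips), floor(nrow(trips)/1000)), ]  # keep ~1/1000 of the rows
}
citibike_sample <- do.call(rbind, lapply(files, sample_month))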
I also obtained historical weather information for the year 2019 from the NCDC (National Climatic Data Center) by submitting an online request to https://www.ncdc.noaa.gov/cdo-web/search . Although the weather may vary slightly within New York City, I opted to use just the Central Park observations as a proxy for the entire city’s weather.
We believe that the above data provides a reasonable representation of the target population (all CitiBike rides) and the citywide weather.
weather <- read.csv("~/Downloads/Datasets/2812766.csv", header=FALSE)
names(weather) <- weather[1,]
weather <- weather[-1,]
# Keep only the Central Park station's observations for 2019
weather <- weather[weather$NAME == "NY CITY CENTRAL PARK, NY US" & year(weather$DATE) == 2019,]
# Drop the "ATTRIBUTES" columns and the leading station-metadata columns
attr_indexes <- names(weather) %>% grep("ATTR", x = .)
weather <- weather[,-attr_indexes]
weather <- weather[,-c(1:5)]
# Convert measurements to numeric, treating missing values as zero
weather <- weather %>% mutate(across(c(where(is.character), -DATE), as.numeric)) %>%
  mutate_all(~replace_na(., 0))
weather$DATE <- as.Date(weather$DATE)
weather <- weather %>% select("DATE","PRCP","SNOW","SNWD","TMAX","TMIN","WT01","WDF2","WDF5","WSF2","WSF5","WT08")
head(weather)
## DATE PRCP SNOW SNWD TMAX TMIN WT01 WDF2 WDF5 WSF2 WSF5 WT08
## 58385 2019-01-01 0.06 0 0 58 39 1 0 0 0 0 0
## 58386 2019-01-02 0.00 0 0 40 35 0 0 0 0 0 0
## 58387 2019-01-03 0.00 0 0 44 37 0 0 0 0 0 0
## 58388 2019-01-04 0.00 0 0 47 35 0 0 0 0 0 0
## 58389 2019-01-05 0.50 0 0 47 41 1 0 0 0 0 0
## 58390 2019-01-06 0.00 0 0 49 31 0 0 0 0 0 0
In this section, I examine selected individual variables from the CitiBike and weather datasets. These items require transformation and/or cleaning, as they contain missing values or outliers which would otherwise impede analysis.
print(paste0("The citibike data consists of ", nrow(citibike_2019), " data points spread across ", ncol(citibike_2019), " features"))
## [1] "The citibike data consists of 20000 data points spread across 15 features"
The trip_duration is specified in seconds, but there are some outliers which may be incorrect, as the maximum value is quite high: 2,679,841 seconds, or about 31 days. We can assume that this data is bad, as nobody would willingly rent a bicycle for this period of time, given the fees that would be charged. Here is a histogram of the original data distribution:
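One way to draw that histogram, as a sketch (log-scaling the x axis, since the raw durations have a very long right tail):

# Sketch: distribution of raw trip durations (seconds), log-scaled x axis
ggplot(citibike_2019, aes(x = tripduration)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = "Trip duration (seconds)", y = "Count")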
Let’s assume that nobody would rent a bicycle for more than a specified time limit (say, 2 hours), and drop any records which exceed this:
num_long_trips_removed <- nrow(citibike_2019[citibike_2019$tripduration > 7200,])
citibike_2019 <- citibike_2019[citibike_2019$tripduration <= 7200,]
print(paste0("Removed ", num_long_trips_removed, " trips of longer than 2 hours."))
## [1] "Removed 52 trips of longer than 2 hours."
Other inconsistencies concern the collection of birth.year, from which we can infer the age of the rider. There are some months in which this value is omitted, while there are other months in which all values are populated. Moreover, a few records suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does contain such anomalies. Thus, a substantial amount of time was needed for detecting and cleaning these inconsistencies.
The birth year for some users is as old as 1888, which is not possible:
summary(citibike_2019$birth.year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1888 1969 1983 1980 1990 2003
citibike_2019$age <- 2019 - citibike_2019$birth.year
num_old_age_removed <- nrow(citibike_2019[citibike_2019$age > 90,])
citibike_2019 <- citibike_2019[citibike_2019$age < 90,]
print(paste0("Removed ", num_old_age_removed, " trips of people older than 90 years"))
## [1] "Removed 7 trips of people older than 90 years"
library(geosphere)
# Great-circle distance between start and end stations
# (note: distHaversine expects columns in longitude/latitude order)
citibike_2019$distance <- distHaversine(citibike_2019[,6:7], citibike_2019[,10:11])
citibike_2019$dist.lat <- abs(citibike_2019$start.station.latitude - citibike_2019$end.station.latitude)
citibike_2019$dist.long <- abs(citibike_2019$start.station.longitude - citibike_2019$end.station.longitude)
summary(citibike_2019$distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 565.9 1033.0 1329.8 1761.5 10524.7
is_weekday = function(timestamp){
  lubridate::wday(timestamp, week_start = 1) < 6
}
citibike_2019$start_date <- as.Date(citibike_2019$starttime)
citibike_2019$Hour <- hour(citibike_2019$starttime)
citibike_2019$dayofweek <- as.factor(wday(citibike_2019$starttime))
citibike_2019$weekday <- as.factor(as.numeric(sapply(citibike_2019$starttime, is_weekday)))
head(citibike_2019 %>% select("Hour","dayofweek","weekday"))
## Hour dayofweek weekday
## 22763 14 6 1
## 151522 18 2 1
## 193496 10 5 1
## 111534 13 3 1
## 92034 17 4 1
## 99954 17 7 0
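As an aside, lubridate::wday() is vectorized, so the per-element sapply() above could be replaced by a direct comparison; a sketch:

# Equivalent vectorized form of the weekday flag (no sapply needed)
citibike_2019$weekday <- as.factor(as.numeric(
  lubridate::wday(citibike_2019$starttime, week_start = 1) < 6))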
citibike_2019$usertype <- as.factor(citibike_2019$usertype)
citibike_2019$gender <- as.factor(citibike_2019$gender)
# Convert trip duration from seconds to whole minutes
citibike_2019$tripduration <- floor(citibike_2019$tripduration/60)
head(citibike_2019$tripduration)
## [1] 2 17 20 12 8 7
citibike_2019 <- citibike_2019 %>% inner_join(weather, by = c("start_date" = "DATE"))
head(citibike_2019)
## tripduration starttime stoptime
## 1 2 2019-03-29 14:02:53.9670 2019-03-29 14:05:21.3790
## 2 17 2019-09-16 18:25:57.1700 2019-09-16 18:43:50.1910
## 3 20 2019-11-14 10:13:32.5140 2019-11-14 10:33:39.4500
## 4 12 2019-07-02 13:54:31.9720 2019-07-02 14:06:51.8100
## 5 8 2019-07-17 17:53:20.4200 2019-07-17 18:01:28.8080
## 6 7 2019-07-27 17:21:39.2650 2019-07-27 17:29:16.3190
## start.station.id start.station.name start.station.latitude
## 1 415 Pearl St & Hanover Square 40.70472
## 2 523 W 38 St & 8 Ave 40.75467
## 3 3584 Eastern Pkwy & Franklin Ave 40.67078
## 4 3263 Cooper Square & Astor Pl 40.72951
## 5 490 8 Ave & W 33 St 40.75155
## 6 435 W 21 St & 6 Ave 40.74174
## start.station.longitude end.station.id end.station.name
## 1 -74.00926 351 Front St & Maiden Ln
## 2 -73.99138 266 Avenue D & E 8 St
## 3 -73.95768 3349 Grand Army Plaza & Plaza St West
## 4 -73.99075 3746 6 Ave & Broome St
## 5 -73.99393 3235 E 41 St & Madison Ave
## 6 -73.99416 161 LaGuardia Pl & W 3 St
## end.station.latitude end.station.longitude bikeid usertype birth.year
## 1 40.70531 -74.00613 26789 Subscriber 1968
## 2 40.72368 -73.97575 30419 Subscriber 1979
## 3 40.67297 -73.97088 27801 Customer 1969
## 4 40.72431 -74.00473 35292 Subscriber 1997
## 5 40.75217 -73.97992 19691 Customer 1992
## 6 40.72917 -73.99810 21233 Subscriber 1993
## gender age distance dist.lat dist.long start_date Hour dayofweek
## 1 1 51 349.4083 0.0005918400 0.00313455 2019-03-29 14 6
## 2 1 40 1983.4821 0.0309823000 0.01563339 2019-09-16 18 2
## 3 0 50 1470.9324 0.0021912000 0.01319974 2019-11-14 10 5
## 4 1 22 1564.1685 0.0052066406 0.01397766 2019-07-02 13 3
## 5 1 27 1559.9298 0.0006142806 0.01401206 2019-07-17 17 4
## 6 1 26 584.6764 0.0125694400 0.00394675 2019-07-27 17 7
## weekday PRCP SNOW SNWD TMAX TMIN WT01 WDF2 WDF5 WSF2 WSF5 WT08
## 1 1 0.00 0 0 59 49 0 250 230 10.1 16.1 0
## 2 1 0.00 0 0 77 68 0 60 60 12.1 15.0 0
## 3 1 0.00 0 0 45 28 0 260 230 12.1 19.9 0
## 4 1 0.02 0 0 85 71 0 270 330 8.9 14.1 0
## 5 1 1.82 0 0 93 73 1 290 300 14.1 27.1 0
## 6 0 0.00 0 0 85 72 0 220 220 12.1 19.0 0
We can examine the correlations between variables to understand their relationships, and also to stay alert to potential problems of multicollinearity. Here we compute both Pearson correlations and Spearman rank correlations between key variables in the individual CitiBike trip data.
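A minimal sketch of that computation, restricted to a few of the numeric variables engineered above:

# Sketch: Pearson and Spearman correlation matrices for key numeric variables
num_vars <- citibike_2019 %>% select(tripduration, distance, age, Hour, TMAX, PRCP)
cor(num_vars, method = "pearson")
cor(num_vars, method = "spearman")
ggpairs(num_vars)  # GGally pairwise overview: scatterplots, densities, correlations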
smp_size <- floor(0.8*nrow(citibike_2019))
set.seed(123)
train_index <- sample(seq_len(nrow(citibike_2019)), size = smp_size)
train_data <- citibike_2019[train_index,]
test_data <- citibike_2019[-train_index,]
rm(citibike_2019)
colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","PRCP","DATE","SNOW","SNWD",
                                          "TMIN","WT01","WDF2","WDF5","WSF2","WSF5","WT08"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
names(X_train)
## [1] "tripduration" "start.station.latitude" "usertype"
## [4] "gender" "age" "distance"
## [7] "dist.lat" "dist.long" "Hour"
## [10] "dayofweek" "TMAX"
linear_model <- lm(tripduration ~ ., data = X_train)
summary(linear_model)
##
## Call:
## lm(formula = tripduration ~ ., data = X_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.714 -3.490 -1.785 0.854 111.442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.835e+02 9.075e+01 -4.226 2.39e-05 ***
## start.station.latitude 9.678e+00 2.228e+00 4.344 1.41e-05 ***
## usertypeSubscriber -7.960e+00 2.323e-01 -34.269 < 2e-16 ***
## gender1 -2.295e+00 3.138e-01 -7.314 2.72e-13 ***
## gender2 -1.017e+00 3.287e-01 -3.093 0.001983 **
## age 2.579e-02 5.820e-03 4.431 9.45e-06 ***
## distance 1.249e-02 7.714e-04 16.191 < 2e-16 ***
## dist.lat 2.394e+02 1.214e+01 19.714 < 2e-16 ***
## dist.long -1.013e+03 8.037e+01 -12.607 < 2e-16 ***
## Hour 7.824e-02 1.369e-02 5.716 1.11e-08 ***
## dayofweek2 -1.075e+00 2.639e-01 -4.074 4.64e-05 ***
## dayofweek3 -8.718e-01 2.564e-01 -3.400 0.000675 ***
## dayofweek4 -1.021e+00 2.594e-01 -3.936 8.31e-05 ***
## dayofweek5 -1.282e+00 2.634e-01 -4.867 1.15e-06 ***
## dayofweek6 -8.385e-01 2.581e-01 -3.249 0.001162 **
## dayofweek7 3.322e-01 2.655e-01 1.251 0.210872
## TMAX 1.935e-02 4.123e-03 4.694 2.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.423 on 15935 degrees of freedom
## Multiple R-squared: 0.4479, Adjusted R-squared: 0.4473
## F-statistic: 807.8 on 16 and 15935 DF, p-value: < 2.2e-16
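Given the multicollinearity concern noted earlier (distance, dist.lat, and dist.long are clearly related), variance inflation factors offer a quick check; a sketch, assuming the car package is installed:

# Sketch: VIFs for the linear model; values above ~5-10 suggest problematic collinearity
library(car)
vif(linear_model)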
colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","DATE","SNOW","SNWD"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
names(X_train)
## [1] "tripduration" "start.station.latitude" "usertype"
## [4] "gender" "age" "distance"
## [7] "dist.lat" "dist.long" "Hour"
## [10] "dayofweek" "PRCP" "TMAX"
## [13] "TMIN" "WT01" "WDF2"
## [16] "WDF5" "WSF2" "WSF5"
## [19] "WT08"
library(gbm)
caretGrid <- expand.grid(interaction.depth = c(5,7,9,11),
                         n.trees = (0:30)*50,
                         shrinkage = c(0.01),
                         n.minobsinnode = 20)
trainControl <- trainControl(method = "cv", number = 10)
metric <- "RMSE"
gbmmodel <- train(tripduration ~ .,
                  data = X_train,
                  distribution = "gaussian",
                  method = "gbm",
                  trControl = trainControl,
                  tuneGrid = caretGrid,
                  metric = metric,
                  bag.fraction = 0.75,
                  verbose = FALSE)
gbmmodel
## Stochastic Gradient Boosting
##
## 15947 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14352, 14352, 14353, 14352, 14351, 14353, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 5 150 8.396824 0.4842752 5.071274
## 5 300 7.969259 0.5036105 4.570354
## 5 500 7.841625 0.5128264 4.408715
## 9 150 8.233405 0.4984280 4.921338
## 9 300 7.876072 0.5122470 4.472726
## 9 500 7.791592 0.5180836 4.351670
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 9, shrinkage = 0.01 and n.minobsinnode = 20.
summary(gbmmodel)
## var rel.inf
## dist.lat dist.lat 54.316202421
## distance distance 19.083842261
## usertypeSubscriber usertypeSubscriber 14.977561106
## start.station.latitude start.station.latitude 3.432306201
## dist.long dist.long 2.387969483
## Hour Hour 1.229529693
## gender1 gender1 1.011399331
## age age 0.866055486
## WDF5 WDF5 0.488515892
## WSF5 WSF5 0.436981549
## TMAX TMAX 0.426249924
## WDF2 WDF2 0.280782470
## dayofweek7 dayofweek7 0.274624728
## dayofweek4 dayofweek4 0.214640256
## PRCP PRCP 0.160037537
## TMIN TMIN 0.153969439
## WSF2 WSF2 0.135890022
## gender2 gender2 0.069431631
## dayofweek2 dayofweek2 0.021243737
## WT08 WT08 0.015523094
## WT01 WT01 0.009623772
## dayofweek6 dayofweek6 0.004684990
## dayofweek5 dayofweek5 0.002934976
## dayofweek3 dayofweek3 0.000000000
## SNOW SNOW 0.000000000
## SNWD SNWD 0.000000000
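The same importances can also be pulled through caret's varImp(), which rescales them to 0-100 and plots cleanly; a sketch:

# Sketch: caret's scaled variable importance for the tuned GBM
gbm_imp <- varImp(gbmmodel)
plot(gbm_imp, top = 10)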
trellis.par.set(caretTheme())
plot(gbmmodel)
trellis.par.set(caretTheme())
plot(gbmmodel, metric = "RMSE", plotType = "level",
scales = list(x = list(rot = 90)))
trellis.par.set(caretTheme())
densityplot(gbmmodel, pch = "|")
### Evaluate Performance on the train data
y_test <- train_data$tripduration
predicted <- predict(gbmmodel, train_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is 7.974
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is 0.505
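The RMSE and R-squared computations above are repeated verbatim for each model and data split below; a small helper, as a sketch, would consolidate them:

# Sketch: report RMSE and R-squared for any fitted model and data split
evaluate_model <- function(model, data, observed) {
  predicted <- predict(model, data)
  residuals <- observed - predicted
  c(RMSE = sqrt(mean(residuals^2)),
    Rsquared = 1 - sum(residuals^2) / sum((observed - mean(observed))^2))
}
# e.g. evaluate_model(gbmmodel, train_data, train_data$tripduration)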
options(repr.plot.width=8, repr.plot.height=4)
my_data <- as.data.frame(cbind(predicted = predicted,
                               observed = y_test))
# Plot predictions vs training data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) + ggtitle("Gradient Boosting Machine: Prediction vs Training Data") +
  xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
y_test <- test_data$tripduration
predicted <- predict(gbmmodel, test_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
## The root mean square error of the test data is 7.533
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the test data is ', round(rsq,3), '\n')
## The R-square of the test data is 0.54
options(repr.plot.width=8, repr.plot.height=4)
my_data <- as.data.frame(cbind(predicted = predicted,
                               observed = y_test))
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) + ggtitle("Gradient Boosting Machine: Prediction vs Test Data") +
  xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","DATE"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
X_train <- X_train %>% mutate(across(c(where(is.factor)), as.numeric))
summary(X_train)
## tripduration start.station.latitude usertype gender
## Min. : 1.00 Min. :40.66 Min. :1.000 Min. :1.000
## 1st Qu.: 6.00 1st Qu.:40.72 1st Qu.:2.000 1st Qu.:2.000
## Median : 10.00 Median :40.74 Median :2.000 Median :2.000
## Mean : 13.18 Mean :40.74 Mean :1.856 Mean :2.167
## 3rd Qu.: 18.00 3rd Qu.:40.76 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :119.00 Max. :40.82 Max. :2.000 Max. :3.000
## age distance dist.lat dist.long
## Min. :16.00 Min. : 0 Min. :0.000000 Min. :0.000000
## 1st Qu.:29.00 1st Qu.: 564 1st Qu.:0.003804 1st Qu.:0.003967
## Median :36.00 Median :1028 Median :0.008241 Median :0.008306
## Mean :38.75 Mean :1329 Mean :0.011924 Mean :0.010961
## 3rd Qu.:49.00 3rd Qu.:1768 3rd Qu.:0.016128 3rd Qu.:0.015002
## Max. :85.00 Max. :9836 Max. :0.114938 Max. :0.088153
## Hour dayofweek PRCP SNOW
## Min. : 0.00 Min. :1.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:10.00 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :15.00 Median :4.000 Median :0.0000 Median :0.00000
## Mean :13.89 Mean :4.048 Mean :0.1029 Mean :0.01736
## 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:0.0400 3rd Qu.:0.00000
## Max. :23.00 Max. :7.000 Max. :1.8300 Max. :4.00000
## SNWD TMAX TMIN WT01
## Min. :0.00000 Min. :14.00 Min. : 2.00 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:57.00 1st Qu.:42.00 1st Qu.:0.0000
## Median :0.00000 Median :72.00 Median :56.00 Median :0.0000
## Mean :0.02712 Mean :68.19 Mean :53.65 Mean :0.3994
## 3rd Qu.:0.00000 3rd Qu.:81.00 3rd Qu.:67.00 3rd Qu.:1.0000
## Max. :3.90000 Max. :95.00 Max. :82.00 Max. :1.0000
## WDF2 WDF5 WSF2 WSF5
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 50.0 1st Qu.: 50.0 1st Qu.: 8.90 1st Qu.:15.00
## Median :180.0 Median :190.0 Median :12.10 Median :19.00
## Mean :163.9 Mean :164.8 Mean :11.41 Mean :18.47
## 3rd Qu.:270.0 3rd Qu.:270.0 3rd Qu.:14.10 3rd Qu.:23.00
## Max. :360.0 Max. :360.0 Max. :25.10 Max. :40.90
## WT08
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1579
## 3rd Qu.:0.0000
## Max. :1.0000
Convert the training and testing sets into numeric matrices; DMatrix, which xgboost builds from such matrices, is its recommended input class.
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
X_train <- as.matrix(X_train %>% select(-tripduration))
Y_train <- train_data$tripduration
xgb_trcontrol <- trainControl(
  method = "cv",
  number = 10,
  allowParallel = TRUE,
  verboseIter = FALSE,
  returnData = FALSE
)
xgbGrid <- expand.grid(nrounds = c(100,200,300),
                       max_depth = c(5,10,15,20,25),
                       colsample_bytree = seq(0.5, 0.9, length.out = 5),
                       eta = 0.1,
                       gamma = 0,
                       min_child_weight = 1,
                       subsample = 1)
xgb_model <- train(
  X_train, Y_train,
  trControl = xgb_trcontrol,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)
print(xgb_model)
## eXtreme Gradient Boosting
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14353, 14352, 14352, 14352, 14353, 14352, ...
## Resampling results across tuning parameters:
##
## max_depth colsample_bytree nrounds RMSE Rsquared MAE
## 10 0.5 100 7.990662 0.4939391 4.459054
## 10 0.5 200 8.028360 0.4898218 4.507921
## 10 0.6 100 7.985940 0.4950087 4.448261
## 10 0.6 200 8.031759 0.4901626 4.501365
## 10 0.7 100 8.031170 0.4907930 4.459901
## 10 0.7 200 8.073485 0.4865763 4.512910
## 10 0.8 100 8.089547 0.4845796 4.470737
## 10 0.8 200 8.128734 0.4808463 4.529494
## 10 0.9 100 8.091202 0.4850372 4.487502
## 10 0.9 200 8.134691 0.4808676 4.546471
## 15 0.5 100 8.115434 0.4767399 4.671796
## 15 0.5 200 8.120068 0.4761192 4.679978
## 15 0.6 100 8.100678 0.4797061 4.611761
## 15 0.6 200 8.106604 0.4790475 4.625847
## 15 0.7 100 8.113880 0.4784341 4.584467
## 15 0.7 200 8.120708 0.4777263 4.598688
## 15 0.8 100 8.163480 0.4754049 4.565503
## 15 0.8 200 8.172585 0.4745141 4.583392
## 15 0.9 100 8.304776 0.4620645 4.618012
## 15 0.9 200 8.313216 0.4613428 4.634577
## 20 0.5 100 8.158981 0.4727551 4.739602
## 20 0.5 200 8.159012 0.4727186 4.741379
## 20 0.6 100 8.116777 0.4771562 4.648182
## 20 0.6 200 8.117489 0.4770717 4.650551
## 20 0.7 100 8.143830 0.4745711 4.608688
## 20 0.7 200 8.145033 0.4744563 4.611458
## 20 0.8 100 8.212076 0.4681641 4.600514
## 20 0.8 200 8.213553 0.4680448 4.604386
## 20 0.9 100 8.309599 0.4605969 4.640181
## 20 0.9 200 8.311563 0.4604600 4.644056
## 25 0.5 100 8.155933 0.4728047 4.748209
## 25 0.5 200 8.156126 0.4727438 4.750144
## 25 0.6 100 8.111947 0.4775316 4.639966
## 25 0.6 200 8.112506 0.4774562 4.642173
## 25 0.7 100 8.173949 0.4707071 4.607061
## 25 0.7 200 8.174838 0.4706242 4.609271
## 25 0.8 100 8.169826 0.4733180 4.595642
## 25 0.8 200 8.171277 0.4731924 4.598050
## 25 0.9 100 8.318991 0.4586380 4.632721
## 25 0.9 200 8.320909 0.4584981 4.635162
##
## Tuning parameter 'eta' was held constant at a value of 0.1
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
##
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 100, max_depth = 10, eta
## = 0.1, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 1.
summary(xgb_model)
## Length Class Mode
## handle 1 xgb.Booster.handle externalptr
## raw 4084628 -none- raw
## niter 1 -none- numeric
## call 5 -none- call
## params 8 -none- list
## callbacks 1 -none- list
## feature_names 20 -none- character
## nfeatures 1 -none- numeric
## xNames 20 -none- character
## problemType 1 -none- character
## tuneValue 7 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
trellis.par.set(caretTheme())
plot(xgb_model)
trellis.par.set(caretTheme())
plot(xgb_model, metric = "RMSE", plotType = "level",
scales = list(x = list(rot = 90)))
trellis.par.set(caretTheme())
densityplot(xgb_model, pch = "|")
y_test <- train_data$tripduration
predicted <- predict(xgb_model, X_train)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is 7.944
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is 0.508
options(repr.plot.width=8, repr.plot.height=4)
my_data <- as.data.frame(cbind(predicted = predicted,
                               observed = y_test))
# Plot predictions vs training data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) + ggtitle("XGBoost: Prediction vs Training Data") +
  xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
y_test <- test_data$tripduration
X_test <- test_data[, which(names(test_data) %in% colstouse)]
X_test <- X_test %>% mutate(across(c(where(is.factor)), as.numeric))

predicted <- predict(xgb_model, X_test)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
## The root mean square error of the test data is 7.628
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the test data is ', round(rsq,3), '\n')
## The R-square of the test data is 0.528
options(repr.plot.width=8, repr.plot.height=4)
my_data <- as.data.frame(cbind(predicted = predicted,
                               observed = y_test))
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) + ggtitle("XGBoost: Prediction vs Test Data") +
  xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
colstouse <- setdiff(names(train_data), c("starttime","start.station.id","end.station.id",
                                          "stoptime","start.station.name","end.station.name",
                                          "start.station.longitude","start_date",
                                          "end.station.latitude","end.station.longitude",
                                          "weekday","bikeid","birth.year","DATE","SNOW","SNWD"))
X_train <- train_data[, which(names(train_data) %in% colstouse)]
names(X_train)
## [1] "tripduration" "start.station.latitude" "usertype"
## [4] "gender" "age" "distance"
## [7] "dist.lat" "dist.long" "Hour"
## [10] "dayofweek" "PRCP" "TMAX"
## [13] "TMIN" "WT01" "WDF2"
## [16] "WDF5" "WSF2" "WSF5"
## [19] "WT08"
library(randomForest)
control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 3)
# Metric used to compare models is root mean squared error
metric <- "RMSE"
set.seed(123)
# mtry: number of variables randomly sampled as split candidates
mtry <- sqrt(ncol(X_train)-1)
tunegrid <- expand.grid(.mtry = mtry)
rfmodel <- train(tripduration ~ .,
                 data = X_train,
                 method = 'rf',
                 metric = metric,
                 tuneGrid = tunegrid,
                 trControl = control)
rfmodel
## Random Forest
##
## 15947 samples
## 18 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 14352, 14351, 14352, 14350, 14352, 14354, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.874974 0.5089116 4.489987
##
## Tuning parameter 'mtry' was held constant at a value of 4.242641
summary(rfmodel)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 15947 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 15947 -none- numeric
## importance 24 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 15947 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 24 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
trellis.par.set(caretTheme())
densityplot(rfmodel, pch = "|")
y_test <- train_data$tripduration
predicted <- predict(rfmodel, train_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the train data is ', round(RMSE,3),'\n')
## The root mean square error of the train data is 7.916
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the train data is ', round(rsq,3), '\n')
## The R-square of the train data is 0.512
options(repr.plot.width=8, repr.plot.height=4)
my_data <- as.data.frame(cbind(predicted = predicted,
                               observed = y_test))
# Plot predictions vs training data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) + ggtitle("Random Forest: Prediction vs Training Data") +
  xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'
y_test <- test_data$tripduration
predicted <- predict(rfmodel, test_data)
residuals <- y_test - predicted
RMSE <- sqrt(mean(residuals^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
## The root mean square error of the test data is 7.533
y_test_mean <- mean(y_test)
# Calculate total sum of squares
tss <- sum((y_test - y_test_mean)^2)
# Calculate residual sum of squares
rss <- sum(residuals^2)
# Calculate R-squared
rsq <- 1 - (rss/tss)
cat('The R-square of the test data is ', round(rsq,3), '\n')
## The R-square of the test data is 0.54
options(repr.plot.width=8, repr.plot.height=4)
my_data <- as.data.frame(cbind(predicted = predicted,
                               observed = y_test))
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) + geom_point(color = "darkred", alpha = 0.5) +
  geom_smooth(method = lm) + ggtitle("Random Forest: Prediction vs Test Data") +
  xlab("Predicted Trip Duration in Minutes") + ylab("Observed Trip Duration in Minutes") +
  theme(plot.title = element_text(size=16, hjust = 0.5),
        axis.text.y = element_text(size=12), axis.text.x = element_text(size=12, hjust=.5),
        axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'