Bicycling yields many benefits: riders improve their health through exercise, and when riders shift out of cars, traffic congestion falls, with a corresponding reduction in pollution from carbon emissions [5]. In recent years, bike sharing has become popular in a growing list of cities around the world. The NYC “Citibike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013 and has been expanding ever since, both in daily ridership and in geographic footprint, as a growing number of “docking stations” [9-10] welcome riders in Brooklyn, Queens, and northern parts of Manhattan which were not previously served. One problem many bikeshare systems face is financial: an increase in the number of riders requires that more bikes be purchased and put into service, and heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate into an increase in revenue, because clever riders can avoid paying surcharges by keeping each trip below a specified limit (either 30 or 45 minutes, depending on user category). We examine Citibike ridership data [2-3], joined with daily NYC weather data [4], to study the impact of weather on shared bike usage and to build a predictive model that estimates the number of trips taken each day [6-8]. The goal is to estimate future demand, enabling the system operator to make expansion plans. We find that ridership exhibits strong seasonality and correlates with weather-related variables such as daily temperature and precipitation [6-8]. Additionally, ridership is segmented by user type (annual subscribers use the system much more heavily than casual users), by gender (there are many more male users than female), and by age (a large share of users are clustered in their late 30s) [11-12].
Bikeshare, Weather, Cycling, CitiBike, New York City
Since 2013 a shared bicycle system known as Citibike has been available in New York City [1]. The benefits of having such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling [5]. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of Citibike’s pricing constraints. There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments. The available usage data provides a wealth of information which can be mined for trends in usage [2-3]. With such intelligence, the company would be better positioned to determine which actions might optimize its revenue stream.
We obtained data from two sources:
CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.
For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)
There are currently 77 monthly data files for the New York City bikeshare system, spanning July 2013 through November 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million during the summer of 2019. In total there are more than 90 million entries, comprising 17GB of data. Because of the computational limitations this presented, we created samples of 1/1000 and 1/100 of the data. The samples were created deterministically, by subsetting the files on each 1000th (or, 100th) row.
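The sampling chunk is not echoed here; a minimal sketch of the deterministic subsetting, assuming the monthly trip CSVs have been downloaded and unzipped into a local directory (the `./tripdata` path is hypothetical):

```r
library(readr)
library(dplyr)

# Deterministic 1/1000 sample: keep every 1000th row of each monthly file
tripfiles <- list.files(path = "./tripdata", pattern = "\\.csv$", full.names = TRUE)
sample_trips <- lapply(tripfiles, function(f) {
  read_csv(f) %>% filter(row_number() %% 1000 == 0)
}) %>% bind_rows()
```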
We also obtained historical weather information for 2013-2019 from the NCDC (National Climatic Data Center) by submitting an online request to https://www.ncdc.noaa.gov/cdo-web/search. Although the weather may vary slightly within New York City, we opted to use the data from the Central Park observations as a proxy for the entire city’s weather.
We believe that the above data provides a reasonable representation of the target population (all CitiBike rides) and the citywide weather.
## trip_duration s_time e_time s_station_id s_station_name s_lat s_long
## 1 634 2013-06-30 21:00:00 2013-06-30 21:10:34 164 E 47 St & 2 Ave 40.75323 -73.97033
## 2 437 2013-07-01 03:54:02 2013-07-01 04:01:19 479 9 Ave & W 45 St 40.76019 -73.99126
## 3 1398 2013-07-01 05:03:38 2013-07-01 05:26:56 157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4 1124 2013-07-01 05:37:40 2013-07-01 05:56:24 496 E 16 St & 5 Ave 40.73726 -73.99239
## 5 1199 2013-07-01 06:16:59 2013-07-01 06:36:58 432 E 7 St & Avenue A 40.72622 -73.98380
## 6 221 2013-07-01 08:50:21 2013-07-01 08:54:02 475 E 16 St & Irving Pl 40.73524 -73.98759
## e_station_id e_station_name e_lat e_long bike_id user_type birth_year gender
## 1 504 1 Ave & E 15 St 40.73222 -73.98166 16950 Customer NA 0
## 2 243 Fulton St & Rockwell Pl 40.68798 -73.97847 16151 Subscriber 1987 1
## 3 375 Mercer St & Bleecker St 40.72679 -73.99695 15997 Subscriber 1987 1
## 4 500 Broadway & W 51 St 40.76229 -73.98336 17750 Subscriber 1959 2
## 5 466 W 25 St & 6 Ave 40.74395 -73.99145 17671 Subscriber 1983 2
## 6 537 Lexington Ave & E 24 St 40.74026 -73.98409 16490 Subscriber 1956 1
## [1] 92565
## [1] 15
# Weather data is obtained from the NCDC (National Climatic Data Center) via https://www.ncdc.noaa.gov/cdo-web/
# click on search tool https://www.ncdc.noaa.gov/cdo-web/search
# select "daily summaries"
# select Search for Stations
# Enter Search Term "USW00094728" for Central Park Station:
# https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail
# "add to cart"
weatherfilenames=list.files(path="./",pattern = '.csv$', full.names = T) # ending with .csv ; not .zip
#weatherfilenames
weatherfile <- "DATA/NYC_Weather_Data_2013-2019.csv"
weatherspec <- cols(
STATION = col_character(),
NAME = col_character(),
LATITUDE = col_double(),
LONGITUDE = col_double(),
ELEVATION = col_double(),
DATE = col_date(format = "%F"), # readr::parse_datetime() : "%F" = "%Y-%m-%d"
#DATE = col_date(format = "%m/%d/%Y"), #col_date(format = "%F")
AWND = col_double(), # Average Daily Wind Speed
AWND_ATTRIBUTES = col_character(),
PGTM = col_double(), # Peak Wind-Gust Time
PGTM_ATTRIBUTES = col_character(),
PRCP = col_double(), # Amount of Precipitation
PRCP_ATTRIBUTES = col_character(),
SNOW = col_double(), # Amount of Snowfall
SNOW_ATTRIBUTES = col_character(),
SNWD = col_double(), # Depth of snow on the ground
SNWD_ATTRIBUTES = col_character(),
TAVG = col_double(), # Average Temperature (not populated)
TAVG_ATTRIBUTES = col_character(),
TMAX = col_double(), # Maximum temperature for the day
TMAX_ATTRIBUTES = col_character(),
TMIN = col_double(), # Minimum temperature for the day
TMIN_ATTRIBUTES = col_character(),
TSUN = col_double(), # Daily Total Sunshine (not populated)
TSUN_ATTRIBUTES = col_character(),
WDF2 = col_double(), # Direction of fastest 2-minute wind
WDF2_ATTRIBUTES = col_character(),
WDF5 = col_double(), # Direction of fastest 5-second wind
WDF5_ATTRIBUTES = col_character(),
WSF2 = col_double(), # Fastest 2-minute wind speed
WSF2_ATTRIBUTES = col_character(),
WSF5 = col_double(), # fastest 5-second wind speed
WSF5_ATTRIBUTES = col_character(),
WT01 = col_double(), # Fog
WT01_ATTRIBUTES = col_character(),
WT02 = col_double(), # Heavy Fog
WT02_ATTRIBUTES = col_character(),
WT03 = col_double(), # Thunder
WT03_ATTRIBUTES = col_character(),
WT04 = col_double(), # Sleet
WT04_ATTRIBUTES = col_character(),
WT06 = col_double(), # Glaze
WT06_ATTRIBUTES = col_character(),
WT08 = col_double(), # Smoke or haze
WT08_ATTRIBUTES = col_character(),
WT13 = col_double(), # Mist
WT13_ATTRIBUTES = col_character(),
WT14 = col_double(), # Drizzle
WT14_ATTRIBUTES = col_character(),
WT16 = col_double(), # Rain
WT16_ATTRIBUTES = col_character(),
WT18 = col_double(), # Snow
WT18_ATTRIBUTES = col_character(),
WT19 = col_double(), # Unknown source of precipitation
WT19_ATTRIBUTES = col_character(),
WT22 = col_double(), # Ice fog
WT22_ATTRIBUTES = col_character()
)
# load all the daily weather data
weather <- read_csv(weatherfile, col_types = weatherspec)
weather_df1 = as.data.frame(weather)
# Check the number of rows and columns in the weather data frame
nrow(weather_df1)
ncol(weather_df1)
## [1] 2556
## [1] 28
# Select only those columns that are useful for our analysis
weather_df = select(weather_df1, STATION, NAME, DATE, AWND, PRCP, SNOW, SNWD, TMAX, TMIN, WDF2, WDF5, WSF2, WSF5, WT01)
# Check how many columns have empty values
sapply(weather_df, function(x) sum(is.na(x)))
## STATION NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 0 0 0 166 0 1 0 0 0 163 179 163 179 1703
# Perform data imputation on weather_df: replace missing values with mean values
weather_df$AWND[is.na(weather_df$AWND)] = mean(weather_df1$AWND, na.rm=TRUE)
weather_df$SNOW[is.na(weather_df$SNOW)] = mean(weather_df1$SNOW, na.rm=TRUE)
weather_df$WDF2[is.na(weather_df$WDF2)] = mean(weather_df1$WDF2, na.rm=TRUE)
weather_df$WDF5[is.na(weather_df$WDF5)] = mean(weather_df1$WDF5, na.rm=TRUE)
weather_df$WSF2[is.na(weather_df$WSF2)] = mean(weather_df1$WSF2, na.rm=TRUE)
weather_df$WSF5[is.na(weather_df$WSF5)] = mean(weather_df1$WSF5, na.rm=TRUE)
# Check again: imputation should have removed the missing values (WT01 is left as-is, since NA there indicates no fog was recorded that day)
sapply(weather_df, function(x) sum(is.na(x)))
## STATION NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 0 0 0 0 0 0 0 0 0 0 0 0 0 1703
## [1] 92565 15
## [1] 2556 14
## [1] "trip_duration" "s_time" "e_time" "s_station_id" "s_station_name" "s_lat"
## [7] "s_long" "e_station_id" "e_station_name" "e_lat" "e_long" "bike_id"
## [13] "user_type" "birth_year" "gender"
## [1] "STATION" "NAME" "DATE" "AWND" "PRCP" "SNOW" "SNWD" "TMAX" "TMIN" "WDF2" "WDF5"
## [12] "WSF2" "WSF5" "WT01"
## STATION NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 1 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-01 6.93 0 0 0 40 26 310 300 15.0 25.9 NA
## 2 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-02 5.82 0 0 0 33 22 310 340 15.0 21.9 NA
## 3 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-03 4.47 0 0 0 32 24 260 260 13.0 19.9 NA
## 4 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-04 8.05 0 0 0 37 30 290 250 17.9 28.0 NA
## 5 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-05 6.71 0 0 0 42 32 310 310 17.0 25.9 NA
## 6 USW00094728 NY CITY CENTRAL PARK, NY US 2013-01-06 6.71 0 0 0 46 34 290 270 13.0 19.9 NA
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| supplied_secs | 61.0000000 | 373.0000000 | 618.0000000 | 906.8727165 | 1061.0000000 | 1688083.0000 |
| calc_secs | 60.0000000 | 373.7270000 | 618.6000001 | 907.3520470 | 1061.9250000 | 1688083.0000 |
| calc_mins | 1.0000000 | 6.2287833 | 10.3100000 | 15.1225341 | 17.6987500 | 28134.7167 |
| calc_hours | 0.0166667 | 0.1038131 | 0.1718333 | 0.2520422 | 0.2949792 | 468.9119 |
| calc_days | 0.0006944 | 0.0043255 | 0.0071597 | 0.0105018 | 0.0122908 | 19.5380 |
The above indicates that the trip durations (in seconds) include values in the millions – likely reflecting trips which were never properly closed out.
Delete cases with unreasonable trip_duration values: let’s assume that nobody would rent a bicycle for longer than a specified time limit (say, 3 hours), and drop any records which exceed it:
## [1] "Removed 158 trips (0.171%) of longer than 3 hours."
## [1] "Remaining number of trips: 92407"
Other inconsistencies concern the recording of birth_year, from which we infer the rider’s age. In some months this value is omitted, while in others all values are populated. There are also a few records which suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does contain such anomalies. A substantial amount of time was therefore needed to detect and clean these inconsistencies.
The birth year for some users is as old as
## Min.
## 1885
which is not possible:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1885 1969 1981 1978 1988 2003 5828
Remove trips associated with very old users (age > 90), and also remove trips where birth_year is missing:
## [1] "Removed 41 trips (0.044%) of users older than 90 years."
## [1] "Removed 5828 trips (6.296%) of users where age is unknown (birth_year unspecified)."
## [1] "Remaining number of trips: 86538"
This is the straight-line distance between (longitude, latitude) points – it does not reflect an actual bicycle route.
There are services (e.g., from Google) which can compute a recommended bicycle route between two points, but use of such services requires a subscription and incurs a cost.
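The distance chunk itself is not echoed here; a minimal haversine (great-circle) sketch of the straight-line computation, which yields the `distance_km` column used later (the original may use a different but equivalent formula):

```r
# Great-circle distance in km between two (lat, long) points via the haversine formula
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))  # Earth radius ~6371 km
}
CB$distance_km <- haversine_km(CB$s_lat, CB$s_long, CB$e_lat, CB$e_long)
```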
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5885 1.0477 1.3090 1.7280 10.6603
In this subset of the data, the maximum distance between stations is
## Max.
## 10.66028
km. There are some stations in the data whose latitude and longitude are recorded as zero, which would make the computed distance between such a station and a real one many thousands of kilometers. If such records exist, we delete them:
## [1] "No unusually long distances were found in this subset of the data."
There is a time-based usage fee for rides longer than an initial period:
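The fee-calculation chunk is not echoed here; a minimal sketch of how a `trip_fee` column (used in the aggregation below) could be derived. The 45- and 30-minute free periods follow the user categories described above; the overage rate is purely illustrative:

```r
# Free period depends on user category: 45 minutes for Subscribers, 30 for Customers
free_mins <- ifelse(CB$user_type == "Subscriber", 45, 30)
# Minutes beyond the free period (trip_duration is in seconds)
overage_mins <- pmax(0, CB$trip_duration / 60 - free_mins)
# Illustrative surcharge of $2.50 per started 15-minute overage block (rate assumed)
CB$trip_fee <- ceiling(overage_mins / 15) * 2.50
```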
#### Summary of trip durations AFTER censoring/truncation:
#express trip duration in seconds, minutes, hours, days
# note: we needed to fix the November daylight savings problem to eliminate negative trip times
#### Supplied seconds
#print("Supplied Seconds:")
supplied_secs<-summary(CB$trip_duration)
#### Seconds
CB$trip_duration_s = as.numeric(CB$e_time - CB$s_time,"secs")
calc_secs<-summary(CB$trip_duration_s)
#### Minutes
CB$trip_duration_m = as.numeric(CB$e_time - CB$s_time,"mins")
calc_mins<-summary(CB$trip_duration_m)
#### Hours
CB$trip_duration_h = as.numeric(CB$e_time - CB$s_time,"hours")
calc_hours<-summary(CB$trip_duration_h)
#### Days
CB$trip_duration_d = as.numeric(CB$e_time - CB$s_time,"days")
calc_days <-summary(CB$trip_duration_d)
# library(kableExtra) # loaded above
rbind(supplied_secs, calc_secs, calc_mins, calc_hours, calc_days) %>%
kable(caption = "Summary of trip durations - AFTER truncations:") %>%
kable_styling(c("bordered","striped"),latex_options = "hold_position")
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| supplied_secs | 61.0000000 | 363.0000000 | 592.0000000 | 775.7573898 | 994.0000000 | 10617.0000000 |
| calc_secs | 60.0000000 | 363.0000000 | 593.0000000 | 776.2454997 | 994.7922499 | 10617.5160000 |
| calc_mins | 1.0000000 | 6.0500000 | 9.8833333 | 12.9374250 | 16.5798708 | 176.9586000 |
| calc_hours | 0.0166667 | 0.1008333 | 0.1647222 | 0.2156237 | 0.2763312 | 2.9493100 |
| calc_days | 0.0006944 | 0.0042014 | 0.0068634 | 0.0089843 | 0.0115138 | 0.1228879 |
We could have chosen to censor the data, in which case we would not drop observations, but would instead move them to a limiting value, such as three hours (for trip time) or an age of 90 years (for adjusting birth_year).
As there were few such cases, we instead decided to truncate the data by dropping such observations from the dataset.
Because there is so much data, it is difficult to analyze the entire universe of trip-by-trip data unless one has high-performance computational resources.
We can examine the correlations between variables to understand their relationships and to stay alert to potential problems of multicollinearity. Here we compute both linear (Pearson) and rank (Spearman) correlations between key variables on the individual CitiBike trip data. (Later we will compute correlations on daily aggregated data which has been joined with the daily weather observations.)
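As a concrete illustration, a minimal sketch of such a computation on a few numeric trip variables (the exact variable list used in the original chunk is not echoed here):

```r
# Pearson (linear) and Spearman (rank) correlation matrices on key trip variables
key_vars <- CB[, c("trip_duration", "distance_km", "birth_year")]
round(cor(key_vars, method = "pearson",  use = "pairwise.complete.obs"), 2)
round(cor(key_vars, method = "spearman", use = "pairwise.complete.obs"), 2)
```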
Aggregate individual CitiBike trip data by day, and join to daily weather data
We will perform our calculations on an aggregated basis. We will group each day’s rides together, but we will segment by user_type (“Subscriber” or “Customer”) and by gender (“Male” or “Female”). There are some cases where the user_type is not specified, so we have designated those as “Unknown.” For gender, there are cases where the CitiBike data set contains a zero, which indicates that the gender of the user was not recorded.
For each day, we will aggregate the following items across each of the above groupings: mean and median trip duration, total distance traveled (km), total trip fees, average rider age, and the number of trips.
We will split the aggregated data into a training dataset, consisting of all (grouped, daily) aggregations from 2013-2018, and a test dataset, consisting of (grouped, daily) aggregations from 2019.
We will then join each aggregated CitiBike data element with the corresponding weather observation for that date.
#summary(CB)
CB$gender <- recode_factor(CB$gender, '1' = "Male", '2' = "Female", '0' = "UNKNOWN")
# make the training data set
train <- CB %>%
mutate(start_date = as.Date(s_time, format="%Y-%m-%d"),
# user_type = as.character(user_type),
train = 1) %>%
filter(start_date < '2019-01-01') %>%
group_by(start_date, user_type, train, gender) %>%
summarise(
mean_duration = mean(trip_duration),
median_duration = median(trip_duration),
sum_distance_km = sum(distance_km),
sum_trip_fee = sum(trip_fee),
avg_age = mean(age),
trips = n()
) %>%
ungroup()
train_rows = dim(train)[1]
#summary(train)
# make the test data set
test <- CB %>%
mutate(start_date = as.Date(s_time, format="%Y-%m-%d"),
# user_type = as.character(user_type),
train = 0) %>%
filter(start_date >= '2019-01-01') %>%
group_by(start_date, user_type, train, gender) %>%
summarise(
mean_duration = mean(trip_duration),
median_duration = median(trip_duration),
sum_distance_km = sum(distance_km),
sum_trip_fee = sum(trip_fee),
avg_age = mean(age),
trips = n()
) %>%
ungroup()
test_rows = dim(test)[1]
# Join train with weather data (there should be no rows with missing values)
train_weather <- weather %>% inner_join(train, by = c("DATE" = "start_date" ))
#dim(train_weather)
# Join test with weather data (there should be no rows with missing values)
test_weather <- weather %>% inner_join(test, by = c("DATE" = "start_date" ))
df.join = rbind(train_weather, test_weather)
# Impute the remaining missing weather values with predictive mean matching
# (m = 2 imputed datasets, 10 iterations, fixed seed for reproducibility)
tmp.df <- mice(
  df.join,
  m = 2,
  maxit = 10,
  meth = 'pmm',
  seed = 500,
  print = FALSE
)
df.join <- complete(tmp.df)
# Drop any columns which still contain missing values after imputation
naCols <- names(which(sapply(df.join, anyNA)))
df.join = df.join[, (!names(df.join) %in% naCols)]
df.join.agg = df.join %>%
  dplyr::group_by(DATE, user_type, train, gender) %>%
  summarise(
    AWND = max(AWND), PRCP = max(PRCP), SNOW = max(SNOW), SNWD = max(SNWD),
    TMAX = max(TMAX), TMIN = max(TMIN), WDF2 = max(WDF2), WDF5 = max(WDF5),
    WSF2 = max(WSF2), WSF5 = max(WSF5),
    mean_duration = max(mean_duration), median_duration = max(median_duration),
    sum_distance_km = sum(sum_distance_km), sum_trip_fee = sum(sum_trip_fee),
    avg_age = max(avg_age), trips = sum(trips)
  )
df.join.agg$user_type = as.character(df.join.agg$user_type)
df.join.agg$user_type[df.join.agg$user_type == "Subscriber"] = 2
df.join.agg$user_type[df.join.agg$user_type == "Customer"] = 1
df.join.agg$user_type[df.join.agg$user_type == "UNKNOWN"] = 0
df.join.agg$user_type = as.integer(df.join.agg$user_type)
df.join.agg$gender = as.character(df.join.agg$gender)
df.join.agg$gender[df.join.agg$gender == "UNKNOWN"] = 0
df.join.agg$gender[df.join.agg$gender == "Female"] = 1
df.join.agg$gender[df.join.agg$gender == "Male"] = 2
df.join.agg$gender = as.integer(df.join.agg$gender)
train = df.join.agg[df.join.agg$train ==1,]
test = df.join.agg[df.join.agg$train !=1,]
train.agg = train %>% dplyr::group_by(DATE) %>% summarise(trips = sum(trips))
train.ts = ts(train.agg$trips, start = c(2013,181), end = c(2018,365), frequency = 365)
test.agg = test %>% group_by(DATE) %>% summarise(trips = sum(trips))
test.ts = ts(test.agg$trips, start = c(2019,1), end = c(2019,365), frequency = 365)
train.ts = tsclean(train.ts)
test.ts = tsclean(test.ts)
## [1] "Order of seasonal difference required = 0"
## [1] "Order of difference required = 1"
snaive.model = snaive(train.ts, h=365)
fcst = forecast(snaive.model, h = 365)
df.fcst.snaive = data.frame(fcst)
df.fcst.snaive = df.fcst.snaive %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('snaive', df.fcst.snaive$fcst, test.ts, 'Time Series')
## [1] "RMSE for model snaive = 19.6318165022517"
stl.model = stlf(train.ts, h=365)
fcst = forecast(stl.model, h = 365)
df.fcst.stl = data.frame(fcst)
df.fcst.stl = df.fcst.stl %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('stl', df.fcst.stl$fcst, test.ts, 'Time Series')
## [1] "RMSE for model stl = 20.8565111276799"
rwf.model =rwf(train.ts , h=365, drift = TRUE, lambda = BoxCox.lambda(train.ts))
fcst = forecast(rwf.model, h = 365, lambda =BoxCox.lambda(train.ts) )
df.fcst.rwf = data.frame(fcst)
df.fcst.rwf = df.fcst.rwf %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('rwf', df.fcst.rwf$fcst, test.ts, 'Time Series')
## [1] "RMSE for model rwf = 51.8626824385318"
func <- auto.arima(train.ts, seasonal = TRUE)
fcst = forecast(func, h = 365)
df.fcst.arima = data.frame(fcst)
df.fcst.arima = df.fcst.arima %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('auto arima', df.fcst.arima$fcst, test.ts, 'Time Series')
## [1] "RMSE for model auto arima = 45.8177058461619"
func = ets(train.ts, model="ZZZ") # automatic error/trend/seasonal selection, fit on the training series
fcst = forecast(func, h = 365)
df.fcst.ets = data.frame(fcst)
df.fcst.ets = df.fcst.ets %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('ETS', df.fcst.ets$fcst, test.ts, 'Time Series')
## [1] "RMSE for model ETS = 37.7541272011348"
func = thetaf(train.ts, h= 365)
fcst = forecast(func, h = 365)
df.fcst.theta = data.frame(fcst)
df.fcst.theta = df.fcst.theta %>% dplyr::rename('fcst' = 'Point.Forecast')
calcRMSE('Theta', df.fcst.theta$fcst, test.ts, 'Time Series')
## [1] "RMSE for model Theta = 17.2872284591559"
# Note: this ensemble averages the upper-80% forecast bounds (Hi.80) of the
# theta, snaive, and stl models, rather than their point forecasts
df.cons = data.frame( cbind(df.fcst.theta$Hi.80, df.fcst.snaive$Hi.80, df.fcst.stl$Hi.80))
names(df.cons) = c("fcst1", "fcst2", "fcst3")
df.cons$fcst = (df.cons$fcst1 + df.cons$fcst2 + df.cons$fcst3) /3
calcRMSE('TS Ensemble', df.cons$fcst, test.ts, 'Time Series')
## [1] "RMSE for model TS Ensemble = 16.5124625909039"
ModelName | RMSE | ModelType |
---|---|---|
TS Ensemble | 16.51246 | Time Series |
Theta | 17.28723 | Time Series |
snaive | 19.63182 | Time Series |
stl | 20.85651 | Time Series |
ETS | 37.75413 | Time Series |
auto arima | 45.81771 | Time Series |
rwf | 51.86268 | Time Series |
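The caret models below reference `ctrl` and `testX`, which are defined in an earlier, unechoed chunk of the full document. A plausible minimal sketch of those definitions (the 10-fold cross-validation scheme is an assumption):

```r
library(caret)
library(dplyr)

# Resampling control shared by the caret::train() calls below (assumed scheme)
ctrl <- trainControl(method = "cv", number = 10)

# Predictors for the hold-out year: the aggregated 2019 data without the response
testX <- test %>% select(-trips)
```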
lm.model <- train(trips ~ ., data = train,
method = "lm",
preProcess = c( "center", "scale"),
trControl = ctrl)
lm.pred <- predict(lm.model, newdata = testX)
df.pred = data.frame(pred = lm.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('lm', df.pred$pred, test.ts, 'Linear Model')
## [1] "RMSE for model lm = 5.86958506650074"
autoplot(train.ts) +
autolayer(test.ts, series = 'Test Data') +
autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
ggtitle("Forecast - Citibike ridership") +
ylab("Ride Count")
plsTune <- train(trips ~ .,
data = train,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:19),
trControl = ctrl)
pls.pred <- predict(plsTune, newdata = testX)
df.pred = data.frame(pred = pls.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('pls', df.pred$pred, test.ts, 'Linear Model')
## [1] "RMSE for model pls = 5.87525402411147"
pcrTune <- train(trips ~ .,
data = train,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:19),
trControl = ctrl)
pcr.pred <- predict(pcrTune, newdata = testX)
df.pred = data.frame(pred = pcr.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('pcr', df.pred$pred, test.ts, 'Linear Model')
## [1] "RMSE for model pcr = 5.86958506650087"
ridgeGrid <- expand.grid(lambda = seq(0, .025, length = 11)) # defined but unused below; caret's default grid applies
train.m = train[, -nearZeroVar(train)] # drop near-zero-variance predictors
ridgeTune <- train(trips ~ .,
data = train.m,
method = "ridge",
#tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("BoxCox", "center", "scale")
)
ridge.pred <- predict(ridgeTune, newdata = testX)
df.pred = data.frame(pred = ridge.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('ridge', df.pred$pred, test.ts, 'Linear Model')
## [1] "RMSE for model ridge = 5.78166711055302"
enetGrid <- expand.grid(lambda = c(seq(0,1,length=6)),
fraction = seq(.05, 1, length = 20))
enetTune <- train(trips ~ .,
data = train.m,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("BoxCox", "center", "scale")
)
enet.pred <- predict(enetTune, newdata = testX)
df.pred = data.frame(pred = enet.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('elastic net', df.pred$pred, test.ts, 'Linear Model')
## [1] "RMSE for model elastic net = 5.78143488861333"
ModelName | RMSE | ModelType |
---|---|---|
elastic net | 5.781435 | Linear Model |
ridge | 5.781667 | Linear Model |
lm | 5.869585 | Linear Model |
pcr | 5.869585 | Linear Model |
pls | 5.875254 | Linear Model |
df.pred = data.frame(pred = enet.pred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
autolayer(test.ts, series = 'Test Data') +
autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
ggtitle("Forecast - Citibike ridership") +
ylab("Ride Count")
#### Forecast with the elastic net model is much improved compared to the time series models
nnetGrid <- expand.grid(size = c(1:10),
decay = c(0, 0.01, 0.1))
nnetModel <- train(
trips~ .,
data = train,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
#preProcess = c("BoxCox", "center", "scale", "pca"),
maxit = 500
)
#plot(nnetModel)
#varImp(nnetModel)
nnetPred <- predict(nnetModel, newdata = testX)
df.pred = data.frame(pred = nnetPred, Date = test$DATE)
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('Neural Net', df.pred$pred, test.ts, 'Nonlinear Model')
library(earth)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:30)
marsModel <- train(
trips ~ .,
data = train,
method = "earth",
tuneGrid = marsGrid,
trControl = ctrl
)
marsPred <- predict(marsModel, newdata = testX)
df.pred = data.frame(pred = marsPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('MARS', df.pred$pred, test.ts, 'Nonlinear Model')
svmModel <- train(
trips ~ .,
data = train,
method = "svmRadial",
preProc = c("center", "scale"),
#tuneLength = 14,
trControl = ctrl
)
svmPred <- predict(svmModel, newdata = testX)
df.pred = data.frame(pred = svmPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('SVM_Radial', df.pred$pred, test.ts, 'Nonlinear Model')
## [1] "RMSE for model SVM_Radial = 8.93912270538394"
svmModel2 <- train(
trips ~ .,
data = train,
method = "svmPoly",
preProc = c("center", "scale"),
#tuneLength = 14,
trControl = ctrl
)
svmPred2 <- predict(svmModel2, newdata = testX)
df.pred = data.frame(pred = svmPred2, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('SVM_Poly', df.pred$pred, test.ts, 'Nonlinear Model')
## [1] "RMSE for model SVM_Poly = 4.46486569718514"
svmModel3 <- train(
trips~ .,
data =train,
method = "svmLinear",
preProc = c("center", "scale"),
tuneGrid = data.frame(.C = c(.25, .5, 1)),
trControl = ctrl
)
svmPred3 <- predict(svmModel3, newdata = testX)
df.pred = data.frame(pred = svmPred3, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('SVM_Linear', df.pred$pred, test.ts, 'Nonlinear Model')
## [1] "RMSE for model SVM_Linear = 6.00371767358531"
knnModel <- train(
trips ~ .,
data = train,
method = "knn",
preProcess = c("center", "scale"),
tuneGrid = data.frame(.k = 1:10),
trControl = ctrl
)
knnPred <- predict(knnModel, newdata = testX)
df.pred = data.frame(pred = knnPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('KNN', df.pred$pred, test.ts, 'Nonlinear Model')
## [1] "RMSE for model KNN = 7.90295893180088"
ModelName | RMSE | ModelType |
---|---|---|
MARS | 4.434103 | Nonlinear Model |
SVM_Poly | 4.464866 | Nonlinear Model |
SVM_Linear | 6.003718 | Nonlinear Model |
KNN | 7.902959 | Nonlinear Model |
SVM_Radial | 8.939123 | Nonlinear Model |
Neural Net | 55.240198 | Nonlinear Model |
df.pred = data.frame(pred = marsPred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
autolayer(test.ts, series = 'Test Data') +
autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
ggtitle("Forecast - Citibike ridership") +
ylab("Ride Count")
#### Forecast with the MARS model is much improved compared to the time series and linear models
library(randomForest)
rf_model <- randomForest(trips ~ ., data = train, ntree = 300)
rf_predicted <- predict(rf_model, testX)
df.pred = data.frame(pred = rf_predicted, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('Random Forest', df.pred$pred, test.ts, 'Treebased models')
## [1] "RMSE for model Random Forest = 5.39192947837228"
library(vip)
rfImp1 <- rf_model$importance
vip(rf_model, color = 'red', fill='green') +
ggtitle('Random Forest Var Imp')
gbmGrid = expand.grid(
  interaction.depth = seq(1, 5, by = 2),
  n.trees = seq(100, 1000, by = 100),
  shrinkage = 0.1,
  n.minobsinnode = 5
)
gbm_model <- train(trips ~ ., data = train, tuneGrid = gbmGrid, verbose = FALSE, method = "gbm")
gbm_predicted <- predict(gbm_model, testX)
df.pred = data.frame(pred = gbm_predicted, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('Boosted Trees', df.pred$pred, test.ts, "Treebased models")
## [1] "RMSE for model Boosted Trees = 4.43923091284317"
ModelName | RMSE | ModelType |
---|---|---|
Boosted Trees | 4.439231 | Treebased models |
Random Forest | 5.391929 | Treebased models |
df.pred = data.frame(pred = gbm_predicted, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
autolayer(test.ts, series = 'Test Data') +
autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
ggtitle("Forecast - Citibike ridership") +
ylab("Ride Count")
ModelName | RMSE | ModelType |
---|---|---|
MARS | 4.434103 | Nonlinear Model |
Boosted Trees | 4.439231 | Treebased models |
SVM_Poly | 4.464866 | Nonlinear Model |
Random Forest | 5.391929 | Treebased models |
elastic net | 5.781435 | Linear Model |
ridge | 5.781667 | Linear Model |
lm | 5.869585 | Linear Model |
pcr | 5.869585 | Linear Model |
pls | 5.875254 | Linear Model |
SVM_Linear | 6.003718 | Nonlinear Model |
KNN | 7.902959 | Nonlinear Model |
SVM_Radial | 8.939123 | Nonlinear Model |
TS Ensemble | 16.512463 | Time Series |
Theta | 17.287229 | Time Series |
snaive | 19.631816 | Time Series |
stl | 20.856511 | Time Series |
ETS | 37.754127 | Time Series |
auto arima | 45.817706 | Time Series |
rwf | 51.862682 | Time Series |
Neural Net | 55.240198 | Nonlinear Model |
final.pred = (gbm_predicted + marsPred + svmPred2)/3
df.pred = data.frame(pred = final.pred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
calcRMSE('Ensemble (gbm, mars, svm_poly)', df.pred$pred, test.ts, "Top Ensemble Models")
## [1] "RMSE for model Ensemble (gbm, mars, svm_poly) = 4.16579042936282"
ModelName | RMSE | ModelType |
---|---|---|
Ensemble (gbm, mars, svm_poly) | 4.165790 | Top Ensemble Models |
MARS | 4.434103 | Nonlinear Model |
Boosted Trees | 4.439231 | Treebased models |
SVM_Poly | 4.464866 | Nonlinear Model |
Random Forest | 5.391929 | Treebased models |
elastic net | 5.781435 | Linear Model |
ridge | 5.781667 | Linear Model |
lm | 5.869585 | Linear Model |
pcr | 5.869585 | Linear Model |
pls | 5.875254 | Linear Model |
SVM_Linear | 6.003718 | Nonlinear Model |
KNN | 7.902959 | Nonlinear Model |
SVM_Radial | 8.939123 | Nonlinear Model |
TS Ensemble | 16.512463 | Time Series |
Theta | 17.287229 | Time Series |
snaive | 19.631816 | Time Series |
stl | 20.856511 | Time Series |
ETS | 37.754127 | Time Series |
auto arima | 45.817706 | Time Series |
rwf | 51.862682 | Time Series |
Neural Net | 55.240198 | Nonlinear Model |
df.pred = data.frame(pred = final.pred, Date = test$DATE)
names(df.pred) = c('pred', 'Date')
df.pred = df.pred %>% group_by(Date) %>%summarise(pred = sum(pred))
autoplot(train.ts) +
autolayer(test.ts, series = 'Test Data') +
autolayer(ts(df.pred$pred, start = c(2019,1), end = c(2019,365), frequency = 365), series = 'Forecast') +
ggtitle("Ensemble model Forecast - Citibike ridership") +
ylab("Ride Count")
The Citibike ridership analysis presented in this paper has many implications. The analysis reveals statistically significant patterns: Citibike ridership shows strong annual seasonality and an increasing trend. Ridership rises in summer and falls in winter, and the upward trend from 2013 to 2019 reflects the growing popularity of the service over that period. Most Citibike riders are young, in their 30s and 40s, and customers prefer to use the service for short trips of less than 2 km. It is also established that weather has a significant impact on ridership patterns: good weather is positively correlated with ridership counts.
For predicting future Citibike ridership counts, statistical time series models alone are not sufficient. Based on RMSE and residual analysis, the time series models do not forecast ridership well, because they cannot leverage the rich feature set available for model optimization (e.g., the weather data). Linear models forecast ridership counts better than the time series models because they are multivariate and exploit these input features.
Nonlinear models further improve the RMSE and outperform the linear models. The fact that nonlinear models are better suited to the CitiBike dataset suggests a nonlinear relationship between the input features and ridership.
The best performance comes from the tree-based models. The boosted trees model outperforms all other individual models in terms of RMSE, and its residual analysis is the most satisfactory: the residuals are approximately normally distributed, and the ACF plot shows no correlation among them. The variable importance of the boosted trees model reveals which factors most strongly influence Citibike ridership patterns.
These factors are key inputs for the Citibike system operator when planning increases in bicycle availability and expansion of the geographic footprint of docking stations.
For a complex business operation like Citibike, which involves many internal and external factors, no single model is sufficient for ridership forecasting. When we tried an ensemble of the top three models (Boosted Trees, MARS, and SVM Poly), we obtained the best performance in terms of RMSE (a 7% improvement). We conclude that the best modeling strategy for predicting future Citibike ridership is to use model ensembles. The winning ensembles established in this paper will enable the Citibike system operator to make informed expansion plans, improving ROI and the long-term profitability and viability of Citibike operations.