INTRODUCTION

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.

Apart from their interesting real-world applications, bike-sharing systems generate data whose characteristics make them attractive for research. Features such as trip duration, departure and arrival positions, and the total number of bikes rented turn a bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

Capital Bikeshare has more than 4300 bikes available at 500 stations across 7 jurisdictions. At that scale, Capital Bikeshare provides residents and visitors with a convenient, fun, and affordable transportation option for getting from point A to point B. People use Capital Bikeshare to commute to work or school, run errands, get to appointments or social engagements, and more.

DATA PREPARATION

We aggregated the data on a daily basis, limited the analysis to a single system (Capital Bikeshare), and focused only on the number of bikes rented.

library(dplyr)
library(tidyr)
library(lubridate)
library(forecast)
library(tseries)

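# read the daily data and keep only the date and the total rental count
dataset <- read.csv("data_input/day.csv")
dataset <- dataset %>% select(dteday, cnt)
# parse the date column (year-month-day) into Date objects
dataset$dteday <- ymd(dataset$dteday)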

# check for missing dates; returns TRUE if no date is missing
complete_date <- seq.Date(from = min(dataset$dteday), to = max(dataset$dteday), by = "day")
all(complete_date == dataset$dteday)
#> [1] TRUE
# check for missing values; returns FALSE if there are no missing values
anyNA(dataset$cnt)
#> [1] FALSE
# convert to a time series object
# data      : daily
# pattern   : yearly
# frequency : 365
bike_ts <- ts(data = dataset$cnt,
              start = c(2011),
              frequency = 365)
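
Had the date check above returned FALSE, the gaps would need to be made explicit before building the time series. The following is a minimal sketch of one way to do that, assuming a hypothetical toy data frame (`toy`) with the same two columns. Padding with merge() is an illustration only and is not part of the original analysis.

```r
# Hypothetical toy data with one missing day (2011-01-03)
toy <- data.frame(
  dteday = as.Date(c("2011-01-01", "2011-01-02", "2011-01-04")),
  cnt    = c(10, 12, 9)
)

# Full daily calendar between the first and last observed date
full_dates <- data.frame(
  dteday = seq.Date(from = min(toy$dteday), to = max(toy$dteday), by = "day")
)

# Left join: every calendar day is kept, missing days get cnt = NA
padded <- merge(full_dates, toy, by = "dteday", all.x = TRUE)

# Dates that were absent from the original data
missing_dates <- padded$dteday[is.na(padded$cnt)]
print(missing_dates)
```

The NA rows can then be imputed or handled explicitly before calling ts(), which assumes equally spaced observations.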

EXPLORATORY DATA ANALYSIS

📈 Insight :

  • Bike rentals show an increasing trend from 2011 to 2012
  • Rentals peak mid-year (summer to fall), possibly because the weather and temperature are comfortable for cycling
  • The data has both trend and seasonality, and the seasonal pattern is additive
  • Based on these observations, we will use Triple Exponential Smoothing and ARIMA models

CROSS-VALIDATION

# hold out the last 30 days as the test set
bike_test <- tail(bike_ts, 30)
# keep everything before that as the training set
bike_train <- head(bike_ts, -30)
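
A split like the one above is usually used by fitting the model on the training part only and scoring its forecasts on the held-out part. The sketch below illustrates that workflow on a synthetic weekly-seasonal series. All names (`toy_ts`, `toy_train`, `toy_test`, `toy_hw`) are hypothetical, base R's predict() stands in for forecast(), and the smoothing parameters are fixed to arbitrary values so the example is deterministic. A seasonal HoltWinters() fit needs at least two full periods of training data, which is why a short period of 7 is assumed here.

```r
set.seed(1)
n <- 120   # length of the synthetic series
h <- 30    # size of the held-out test set

# Synthetic series: linear trend + weekly seasonality + noise
t_idx  <- seq_len(n)
toy_ts <- ts(100 + 0.5 * t_idx + 10 * sin(2 * pi * t_idx / 7) + rnorm(n, sd = 3),
             frequency = 7)

# Train on everything except the last h points; test on the last h points
toy_train <- ts(toy_ts[1:(n - h)], start = start(toy_ts), frequency = 7)
toy_test  <- as.numeric(toy_ts[(n - h + 1):n])

# Additive Holt-Winters fitted on the training part only.
# alpha, beta, gamma are arbitrary fixed values (an assumption of this sketch).
toy_hw <- HoltWinters(toy_train, alpha = 0.3, beta = 0.1, gamma = 0.1,
                      seasonal = "additive")

# Forecast exactly as many steps as were held out
toy_pred <- as.numeric(predict(toy_hw, n.ahead = h))

# Out-of-sample MAPE (in percent) on the held-out points
toy_mape <- mean(abs((toy_test - toy_pred) / toy_test)) * 100
print(toy_mape)
```

Scoring on held-out points gives an out-of-sample error, which is generally a more honest estimate of forecast quality than an error computed on the data used for fitting.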

BUILD MODEL

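# additive Holt-Winters (triple exponential smoothing);
# note that the model is fitted on the full series bike_ts
bike_hw <- HoltWinters(bike_ts,
                       seasonal = "additive")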

FORECASTING

TRIPLE EXPONENTIAL SMOOTHING

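# forecast the next 365 days with the Holt-Winters model
bike_forecast <- forecast(bike_hw,
                          h = 365)
# plot the observed series together with the forecast mean
bike_ts %>% 
  autoplot() +
  autolayer(bike_forecast$mean, series = "forecast HoltWinter")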

ARIMA

STATIONARITY

bike_ts %>% autoplot()

We use adf.test() to check the stationarity assumption, with the following hypotheses:

  • H0: data is not stationary
  • H1: data is stationary

A p-value below 0.05 (alpha) means that H0 is rejected.

adf.test(bike_ts)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  bike_ts
#> Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
#> alternative hypothesis: stationary

📈 Insight :

  • The p-value (0.7327) is greater than 0.05, so we fail to reject H0: the data is not stationary
  • We have to apply differencing

bike_ts %>% diff() %>% adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -13.798, Lag order = 8, p-value = 0.01
#> alternative hypothesis: stationary

📈 Insight :

  • The p-value (0.01) is less than 0.05, so the data is stationary after one round of differencing
  • So we set d = 1
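
First-order differencing replaces each value with its change from the previous observation, y_t - y_{t-1}, which is what diff() computes. A minimal sketch on a synthetic random walk (the series `rw` is hypothetical and unrelated to the bike data) shows that diff() matches the manual calculation and shortens the series by one observation:

```r
set.seed(42)

# Synthetic random walk: non-stationary by construction
rw <- cumsum(rnorm(200))

# First difference, as used for d = 1
d1 <- diff(rw, differences = 1)

# The same quantity computed by hand: y_t - y_{t-1}
d1_manual <- rw[-1] - rw[-length(rw)]

print(all.equal(d1, d1_manual))
#> [1] TRUE
print(length(d1))
#> [1] 199
```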

PACF & ACF PLOT

diff(bike_ts, differences = 1) %>% 
  tsdisplay()

📈 Insight :

  • The PACF plot suggests candidate lags for p: 1, 2, 3, 4, 5
  • The ACF plot suggests candidate lags for q: 1, 2, 3
  • From the differencing step and the PACF and ACF plots, the candidate orders for the ARIMA model are:
      p : 0, 1, 2, 3, 4, 5
      d : 1
      q : 0, 1, 2, 3

Hence we get the following combinations:

  • c(0,1,0), c(0,1,1), c(0,1,2), c(0,1,3)
  • c(1,1,0), c(1,1,1), c(1,1,2), c(1,1,3)
  • c(2,1,0), c(2,1,1), c(2,1,2), c(2,1,3)
  • c(3,1,0), c(3,1,1), c(3,1,2), c(3,1,3)
  • c(4,1,0), c(4,1,1), c(4,1,2), c(4,1,3)
  • c(5,1,0), c(5,1,1), c(5,1,2), c(5,1,3)

However, we fit only two of these combinations: one with a low value of p and a high value of q, and one with a high value of p and a low value of q.
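
The full grid of candidate orders can also be generated and compared programmatically instead of being listed by hand. The sketch below is an illustration under stated assumptions: it uses a synthetic series (`toy`, hypothetical) rather than the bike data, and base R's arima() stands in for Arima() from the forecast package. Fits that fail are recorded as NA rather than stopping the loop.

```r
set.seed(123)

# Synthetic integrated AR(1) series standing in for the rental counts
toy <- cumsum(arima.sim(model = list(ar = 0.5), n = 300))

# All (p, d, q) combinations: p in 0..5, d fixed at 1, q in 0..3
orders <- expand.grid(p = 0:5, d = 1, q = 0:3)

# Fit each candidate and record its AIC; NA when a fit fails
orders$aic <- sapply(seq_len(nrow(orders)), function(i) {
  o <- c(orders$p[i], orders$d[i], orders$q[i])
  tryCatch(suppressWarnings(arima(toy, order = o)$aic),
           error = function(e) NA_real_)
})

# Candidates sorted by AIC, lowest (preferred) first
ranked <- orders[order(orders$aic), ]
print(head(ranked, 3))
```

Comparing all 24 candidates this way avoids relying on two hand-picked orders, at the cost of a few extra seconds of fitting time.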

bike_arima1 <- Arima(bike_ts, order = c(1,1,3))
bike_arima2 <- Arima(bike_ts, order = c(5,1,0))
bike_arima1$aic
#> [1] 12051.98
bike_arima2$aic
#> [1] 12070.63
accuracy(bike_arima1$fitted,bike_ts)
#>                ME     RMSE      MAE       MPE     MAPE        ACF1 Theil's U
#> Test set 12.16737 923.0264 645.7667 -44.25499 58.28445 -0.00071928  2.309738
accuracy(bike_arima2$fitted,bike_ts)
#>                ME     RMSE   MAE       MPE     MAPE        ACF1 Theil's U
#> Test set 4.220283 933.7811 658.8 -44.23189 58.70165 -0.01593013  2.074105

📈 Insight :

  • The MAPE of both models is above 50%, which means that neither model forecasts well enough
  • To address this, we recommend adding more data, because the dataset spans only 2 years
  • The models cannot capture the pattern well because the dataset contains only 2 yearly repetitions
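
For reference, the two error measures discussed above can be reproduced by hand. This is a minimal sketch with hypothetical toy numbers (`actual`, `predicted`), not values from the bike data. Because MAPE divides each error by the actual value, observations with very small actual values can inflate it considerably.

```r
# Hypothetical actual values and forecasts
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# Forecast errors
err <- actual - predicted

# Root mean squared error: sqrt of the average squared error
rmse <- sqrt(mean(err^2))

# Mean absolute percentage error, in percent
mape <- mean(abs(err / actual)) * 100

# RMSE is about 12.91 and MAPE about 6.67 for these toy numbers
print(round(c(RMSE = rmse, MAPE = mape), 3))
```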

ACKNOWLEDGEMENT

Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto INESC Porto
Campus da FEUP Rua Dr. Roberto Frias, 378 4200 - 465 Porto, Portugal

Original dataset : https://www.kaggle.com/c/bike-sharing-demand
Capital Bikeshare trip data : http://capitalbikeshare.com/system-data
Weather Information : https://openweathermap.org/history
Holiday Schedule : http://dchr.dc.gov/page/holiday-schedule

A work by Taufan Anggoro Adhi

tf.anggoro@gmail.com