library(tidyverse)
library(lubridate) # manipulate date
library(plotly) # interactive visualization package
library(zoo) # manipulate missing value
library(tseries) # time series package
library(forecast) # forecast package
library(MLmetrics) # calculate error

Introduction

People nowadays concerns about their healthy and global warming issues. Global warming issue itself has been public’s attention since 2008 and global warming issue itself affected to our daily life, climate change is one of them. Climate change is forcing cities to re-imaging their transportation infrastructure. Shared mobility concepts, such as car sharing, bike sharing or scooter sharing become more and more popular. And if they are implemented well, they can actually contribute to mitigating climate change. Bike sharing in particular is interesting because no electricity of gasoline is necessary (unless e-bikes are used) for this mode of transportation. Many cities are provide the bike sharing facilities, Washington DC is one of them. This Learning By Building (LBB) will do simple forecast on Bike Sharing in Washington DC data set.

Read Data

bike <- read.csv("bike_sharing_dataset.csv")
head(bike)

A Peek of Data

Let’s take a peek this data set further

dim(bike) # checking dimensions of data
## [1] 2922   29

This data set has 2,922 days and have 29 columns that will present our target and predictors. Then we need to know, how long this 2,922 days recorded on this data set by using range () function.

range(bike$date)
## [1] "2011-01-01" "2018-12-31"

This data set has period of time from 01 Jan 2011 till 31 Des 2018. Then, we need to know the contents of each columns further by using glimpse() function

glimpse(bike)
## Rows: 2,922
## Columns: 29
## $ date              <chr> "2011-01-01", "2011-01-02", "2011-01-03", "2011-0...
## $ temp_avg          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ temp_min          <dbl> -1.566667, 0.880000, -3.442857, -5.957143, -4.293...
## $ temp_max          <dbl> 11.9733333, 13.8066667, 7.4642857, 4.6428571, 6.1...
## $ temp_observ       <dbl> 2.7727273, 7.3272727, -3.0600000, -3.1000000, -1....
## $ precip            <dbl> 0.069333333, 1.037349398, 1.878823529, 0.00000000...
## $ wind              <dbl> 2.575, 3.925, 3.625, 1.800, 2.950, 1.600, 2.550, ...
## $ wt_fog            <dbl> 1, 1, NA, NA, NA, NA, 1, 1, NA, NA, 1, 1, NA, 1, ...
## $ wt_heavy_fog      <dbl> NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA,...
## $ wt_thunder        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_sleet          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA...
## $ wt_hail           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_glaze          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA...
## $ wt_haze           <dbl> 1, NA, NA, NA, NA, NA, 1, 1, NA, NA, 1, 1, NA, NA...
## $ wt_drift_snow     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_high_wind      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, 1, ...
## $ wt_mist           <dbl> 1, 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1, NA, NA...
## $ wt_drizzle        <dbl> NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA,...
## $ wt_rain           <dbl> 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, ...
## $ wt_freeze_rain    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_snow           <dbl> NA, NA, NA, NA, NA, 1, 1, 1, NA, NA, 1, NA, NA, 1...
## $ wt_ground_fog     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_ice_fog        <dbl> NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, NA,...
## $ wt_freeze_drizzle <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA...
## $ wt_unknown        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ casual            <dbl> 330, 130, 120, 107, 82, 88, 148, 68, 54, 41, 43, ...
## $ registered        <dbl> 629, 651, 1181, 1429, 1489, 1485, 1345, 871, 748,...
## $ total_cust        <dbl> 959, 781, 1301, 1536, 1571, 1573, 1493, 939, 802,...
## $ holiday           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

Contents of Columns

  • date : date with the format yyyy-mm-dd
  • temp_avg : average daily temperature in degree Celsius
  • temp_min : minimum daily temperature in degree Celsius
  • temp_max : maximum daily temperature in degree Celsius
  • temp_observ : temperature at the time of observation in degree Celsius
  • precip : amount of precipitation in mm
  • wind : wind speed in meters per second
  • wt_fog : weather type fog, ice fog, or freezing fog (may include heavy fog)
  • wtheavyfog : weather type heavy fog or heaving freezing fog (not always distinguished from fog)
  • wt_thunder : weather type thunder
  • wt_sleet : weather type ice pellets, sleet, snow pellets, or small hail
  • wt_hail : weather type hail (may include small hail)
  • wt_glaze : weather type glaze or rime
  • wt_haze : weather type smoke or haze
  • wtdriftsnow : weather type blowing or drifting snow
  • wthighwind : weather type high or damaging winds
  • wt_mist : weather type mist
  • wt_drizzle : weather type drizzle
  • wt_rain : weather type rain (may include freezing rain, drizzle, and freezing drizzle)
  • wtfreezerain : weather type freezing rain
  • wt_snow : weather type snow, snow pellets, snow grains, or ice crystals
  • wtgroundfog : weather type ground fog
  • wticefog : weather type ice fog or freezing fog
  • wtfreezedrizzle : weather type freezing drizzle
  • wt_unknown : weather type unknown source of precipitation
  • casual : number of unregistered customers
  • registered : number of registered customers
  • total_cust : sum of registered and casual customers
  • holiday : indicates whether the day is a holiday or not

Data Pre-processing

After take a peek into data set we know that most of columns have missing value, then we need to know how bad or how many the missing value is

colSums(is.na(bike))
##              date          temp_avg          temp_min          temp_max 
##                 0               821                 0                 0 
##       temp_observ            precip              wind            wt_fog 
##                 0                 0                 0              1419 
##      wt_heavy_fog        wt_thunder          wt_sleet           wt_hail 
##              2714              2228              2793              2872 
##          wt_glaze           wt_haze     wt_drift_snow      wt_high_wind 
##              2769              2217              2915              2664 
##           wt_mist        wt_drizzle           wt_rain    wt_freeze_rain 
##              2551              2794              2516              2917 
##           wt_snow     wt_ground_fog        wt_ice_fog wt_freeze_drizzle 
##              2838              2886              2912              2918 
##        wt_unknown            casual        registered        total_cust 
##              2921                 4                 4                 4 
##           holiday 
##              2833

Well most of contents of the columns have more than 5% missing value, then we need to remove them. We’re gonna select column date and total_cust since column total_cust contains total number of registered and unregistered customers.

bike_clean <- bike %>% 
  mutate(date = ymd(date)) %>% 
  select(date, total_cust)
colSums(is.na(bike_clean))
##       date total_cust 
##          0          4

Fill the missing value by mean value before and after missing data by using na.fill() function by zoo package

bike_clean <- bike_clean %>% 
  mutate(total_cust = na.fill(object = total_cust,
                              fill = "extend"))

Visualize Data

Let’s visualize data before create time series data of this data set.

plot1 <- ggplot(data = bike_clean, mapping = aes(date, total_cust)) +
  geom_line() +
  geom_point() +
  labs(y = "Total Customer",
       x = "Date")+
  theme_minimal()
ggplotly(plot1)

Based on graph above we could indicate that this data has trend and seasonal pattern. Now, let’s move on to time series object.

Time Series & Forecast

We have data set about Bike Sharing that records 8 years of the bike sharing data. How many customers that use this bike sharing per day. Now, we would like to forecast on 2018 of this data set. First, we need to create time series object.

Create Time Series Object and Check Stationary

In this section, we’re going to create initial time series object of data set to observe is this data set is an additive or multiplicative?

# initial time series object (first attempt)
bike_ts <- ts(data = bike_clean$total_cust,
              start = range(bike_clean$date)[[1]],
              frequency = (7*4)) # monthly pattern (7*4)

After create time series object, we’re gonna verify is this data has additive or multiplicative pattern or no by using decompose() function.

bike_ts %>% 
  decompose() %>% 
  autoplot()

Plot above shows that this data has multiplicative pattern and range of errors were from -5,000 to 5,000, and on trend plot shows up and down pattern are still exist. Therefore, we could indicate that this data set has Multiple Seasonal Time Series Object. How do we create time series object if our data set is a multiple seasonal time series? By using msts() function we could create multiple time series object and use mstl() function to create decomposition on multiple seasonal time series object.

# time series object with multiple seasonal (2nd attempt)
msts(data = bike_clean$total_cust,
     seasonal.periods = c(7*4, 7*4*4)) %>% # multiple seasonal ts (monthly, quarterly)
  mstl() %>% # multiple seasonal decomposition
  autoplot()

Take a peek on Trend section. The ups and downs are still exist. This mean that we need to find the right frequency/seasonal.periods.

# 2nd attempt on multiple seasonal ts (3rd attempt on time series object)
msts(data = bike_clean$total_cust,
     seasonal.periods = c(7*4, 7*4*12)) %>% # multiple seasonal ts (monthly, yearly)
  mstl() %>% # multiple seasonal decomposition
  autoplot()

Although on the 3rd attempt there are still slightly ups and downs, the trend is smoother than the 2nd attempt and the first attempt. Other than that, plot above shows that this data set has additive pattern. Then, we will assign the 3rd attempt into bike_msts. Afterthat, we need to know is time series object a stationary or not by using adf.test() function.

# assign the 3rd attempt
bike_msts <- msts(data = bike_clean$total_cust,
     seasonal.periods = c(7*4, 7*4*12))

# check stationary
adf.test(bike_msts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  bike_msts
## Dickey-Fuller = -3.6953, Lag order = 14, p-value = 0.02421
## alternative hypothesis: stationary

Result above shows that the p-value of bike_ts is stationary because p-value of data is smaller than 0.05 (p-value < 0.05). There’s no need to do differencing, thus we could applied ARIMA or SARIMA model building.

Cross Validation

Separate data to be data train and test. Data test were taken from last year (2018) of data.

bike_test <- bike_msts %>% 
  tail(7*4*12)

bike_train <- bike_msts %>% 
  head(length(bike_msts) - (7*4*12))

Build Model

This section, create 3 model. There are Holt-Winter model, model using auto.arima() function (model auto), and Seasonal ARIMA (SARIMA) model by using stlm() function.

# create Holt-Winter (Triple Exponential Smoothing/TES) model
bike_tes <- stlm(y = bike_train, method = "ets")

# create auto.arima model
bike_auto <- auto.arima(y = bike_train, seasonal = T)

# create SARIMA model
bike_stlm <- stlm(y = bike_train, method = "arima")

Forecast The Data

After build model, let’s forecast each model into data test.

# forecast on Holt-Winter model
bike_forecast1 <- forecast(object = bike_tes, h = 336)

# forecast on auto.arima model
bike_forecast2 <- forecast(object = bike_auto, h = 336)

# forecast on SARIMA (stlm) model
bike_forecast3 <- forecast(object = bike_stlm, h = 336)

Evaluate The Model

After forecast each model and assign it into bike_forecast1 (for TES/Holt-Winter model), bike_forecast2 (for auto.arima model), and bike_forecast3 (for SARIMA model), it will be better to evaluate each model. There are several ways to evaluate the model. In this section, I will use accuracy() function to compare Mean Absolute Percentage Error (MAPE) of each model.

accuracy(bike_forecast1)
##                    ME     RMSE      MAE       MPE    MAPE      MASE      ACF1
## Training set 8.330484 1408.851 1025.135 -15.97484 30.2332 0.4375529 0.1693021
accuracy(bike_forecast2)
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 11.50366 1680.878 1175.023 -23.69987 38.46132 0.5015287
##                      ACF1
## Training set 0.0002851324
accuracy(bike_forecast3)
##                    ME    RMSE      MAE       MPE     MAPE      MASE        ACF1
## Training set 12.79538 1368.62 987.4546 -15.15221 28.97288 0.4214699 0.004439838

Mean Absolute Percentage Error (MAPE) of each model was:

  • TES/Holt-Winter Model: 30.2332%
  • Auto ARIMA Model: 38.46%
  • SARIMA Model: 28.97%

There were a huge gap, the difference is huge between MAPE of TES/Holt-Winter Model, Auto ARIMA Model, and SARIMA Model.

Another way, we could visualize and compare between the actual and each model.

Visualization on TES/Holt-Winter Model

plot1 <- bike_msts %>% 
  autoplot(series = "Actual") +
  autolayer(bike_forecast1$mean, series = "Predict Test TES/Holt-Winter") +
  autolayer(bike_forecast1$fitted, series = "Predict Train TES/Holt-Winter") +
  labs(title = "Comparation Actual and Forecast using Holt-Winter Model") +
  theme_minimal()

ggplotly(plot1)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Visualization on Auto ARIMA Model

plot2 <- bike_msts %>% 
  autoplot(series = "Actual") +
  autolayer(bike_forecast2$mean, series = "Predict Test Auto ARIMA") +
  autolayer(bike_forecast2$fitted, series = "Predict Train Auto ARIMA") +
  labs(title = "Comparation Actual and Forecast using Auto ARIMA Model") +
  theme_minimal()

ggplotly(plot2)

Visualization on SARIMA Model

plot3 <- bike_msts %>% 
  autoplot(series = "Actual") +
  autolayer(bike_forecast3$mean, series = "Predict Test SARIMA") +
  autolayer(bike_forecast3$fitted, series = "Predict Train SARIMA") +
  labs(title = "Comparation Actual and Forecast using SARIMA Model") +
  theme_minimal()

ggplotly(plot3)

Assumption Check

Because the error of SARIMA Model is smaller than others, I will do assumption check only on SARIMA Model. There were two assumption that we had to check.

1. No-autocorrelation Residual

To check No-autocorrelation Residual, we could use Box.test() function and input type parameter as Ljung-Box.

  • \(H_0\): residual has no-autocorrelation => p-value > 0.05 (alpha)
  • \(H_1\): residual has autocorrelation => p-value < 0.05 (alpha)

2. Normality Residual

To check Normality Residual, we could use shapiro.test() function.

  • \(H_0\): residuals spread normally => p-value > 0.05 (alpha)
  • \(H_1\): residuals spread unnormally => p-value < 0.05 (alpha)
# Assumption Check by LJung-Box Test 
Box.test(x = bike_stlm$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  bike_stlm$residuals
## X-squared = 0.051035, df = 1, p-value = 0.8213
# Shapiro Test
shapiro.test(bike_stlm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  bike_stlm$residuals
## W = 0.95916, p-value < 2.2e-16

Both p-value on autocorrelation and normality test result are exceeding 0.05 (alpha). This means the residuals on SARIMA Model has no-autocorrelation and spread normally.

Conclusion

We have compare each model by the value of MAPE and even visualize it. There are several conclusion that we obtained,

  • If we compare MAPE between TES/Holt-Winter Model, Auto ARIMA Model, and SARIMA Model, SARIMA Model has smaller error than TES/Holt-Winter and Auto Arima Model. If we would like to forecast, it is more tolerable to use SARIMA Model than other Model.

  • On plot1 that represent visualization of the actual data and TES/Holt-Winter Model, on the forecast/predict train were able to capture the pattern of the actual, but when we will do some forecast on test there were failure to forecast and most of pattern is not captured.

  • On plot2 that represent visualization of the actual data and Auto ARIMA Model, the forecast/predict train were able to capture the pattern of the actual, although there were not exact pattern captured. But on the forecast/predict test were unable to capture any pattern.

  • On plot3 that represent visualization of the actual data and SARIMA Model, the forecast/predict train is just right (able to capture the pattern of the actual). But on the forecast/predict test, most of pattern unable to captured but it’s still better than TES/Holt-Winter or Auto ARIMA Model.

  • Why does SARIMA Model is better? Based on assumption check, the residuals on SARIMA Model has:

    • No-autocorrelation, means that each predictors/residuals has no correlation
    • Spreads normally, the prediction/forecast results that have small error.

It is recommended to use SARIMA Model, than other models (TES/Holt-Winter and Auto ARIMA).

But, it is recommended to do trial and error further. There is possibilities using other models than SARIMA, Auto ARIMA, and TES/Holt-Winter has smaller error and better performance.