library(tidyverse)
library(lubridate) # manipulate date
library(plotly) # interactive visualization package
library(zoo) # manipulate missing value
library(tseries) # time series package
library(forecast) # forecast package
library(MLmetrics) # calculate errorPeople nowadays concerns about their healthy and global warming issues. Global warming issue itself has been public’s attention since 2008 and global warming issue itself affected to our daily life, climate change is one of them. Climate change is forcing cities to re-imaging their transportation infrastructure. Shared mobility concepts, such as car sharing, bike sharing or scooter sharing become more and more popular. And if they are implemented well, they can actually contribute to mitigating climate change. Bike sharing in particular is interesting because no electricity of gasoline is necessary (unless e-bikes are used) for this mode of transportation. Many cities are provide the bike sharing facilities, Washington DC is one of them. This Learning By Building (LBB) will do simple forecast on Bike Sharing in Washington DC data set.
bike <- read.csv("bike_sharing_dataset.csv")
head(bike)Let’s take a peek this data set further
dim(bike) # checking dimensions of data## [1] 2922 29
This data set has 2,922 days and have 29 columns that will present our target and predictors. Then we need to know, how long this 2,922 days recorded on this data set by using range () function.
range(bike$date)## [1] "2011-01-01" "2018-12-31"
This data set has period of time from 01 Jan 2011 till 31 Des 2018. Then, we need to know the contents of each columns further by using glimpse() function
glimpse(bike)## Rows: 2,922
## Columns: 29
## $ date <chr> "2011-01-01", "2011-01-02", "2011-01-03", "2011-0...
## $ temp_avg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ temp_min <dbl> -1.566667, 0.880000, -3.442857, -5.957143, -4.293...
## $ temp_max <dbl> 11.9733333, 13.8066667, 7.4642857, 4.6428571, 6.1...
## $ temp_observ <dbl> 2.7727273, 7.3272727, -3.0600000, -3.1000000, -1....
## $ precip <dbl> 0.069333333, 1.037349398, 1.878823529, 0.00000000...
## $ wind <dbl> 2.575, 3.925, 3.625, 1.800, 2.950, 1.600, 2.550, ...
## $ wt_fog <dbl> 1, 1, NA, NA, NA, NA, 1, 1, NA, NA, 1, 1, NA, 1, ...
## $ wt_heavy_fog <dbl> NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA,...
## $ wt_thunder <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_sleet <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA...
## $ wt_hail <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_glaze <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA...
## $ wt_haze <dbl> 1, NA, NA, NA, NA, NA, 1, 1, NA, NA, 1, 1, NA, NA...
## $ wt_drift_snow <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_high_wind <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, 1, ...
## $ wt_mist <dbl> 1, 1, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1, NA, NA...
## $ wt_drizzle <dbl> NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA,...
## $ wt_rain <dbl> 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, ...
## $ wt_freeze_rain <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_snow <dbl> NA, NA, NA, NA, NA, 1, 1, 1, NA, NA, 1, NA, NA, 1...
## $ wt_ground_fog <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ wt_ice_fog <dbl> NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, NA,...
## $ wt_freeze_drizzle <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA...
## $ wt_unknown <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ casual <dbl> 330, 130, 120, 107, 82, 88, 148, 68, 54, 41, 43, ...
## $ registered <dbl> 629, 651, 1181, 1429, 1489, 1485, 1345, 871, 748,...
## $ total_cust <dbl> 959, 781, 1301, 1536, 1571, 1573, 1493, 939, 802,...
## $ holiday <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
After take a peek into data set we know that most of columns have missing value, then we need to know how bad or how many the missing value is
colSums(is.na(bike))## date temp_avg temp_min temp_max
## 0 821 0 0
## temp_observ precip wind wt_fog
## 0 0 0 1419
## wt_heavy_fog wt_thunder wt_sleet wt_hail
## 2714 2228 2793 2872
## wt_glaze wt_haze wt_drift_snow wt_high_wind
## 2769 2217 2915 2664
## wt_mist wt_drizzle wt_rain wt_freeze_rain
## 2551 2794 2516 2917
## wt_snow wt_ground_fog wt_ice_fog wt_freeze_drizzle
## 2838 2886 2912 2918
## wt_unknown casual registered total_cust
## 2921 4 4 4
## holiday
## 2833
Well most of contents of the columns have more than 5% missing value, then we need to remove them. We’re gonna select column date and total_cust since column total_cust contains total number of registered and unregistered customers.
bike_clean <- bike %>%
mutate(date = ymd(date)) %>%
select(date, total_cust)
colSums(is.na(bike_clean))## date total_cust
## 0 4
Fill the missing value by mean value before and after missing data by using na.fill() function by zoo package
bike_clean <- bike_clean %>%
mutate(total_cust = na.fill(object = total_cust,
fill = "extend"))Let’s visualize data before create time series data of this data set.
plot1 <- ggplot(data = bike_clean, mapping = aes(date, total_cust)) +
geom_line() +
geom_point() +
labs(y = "Total Customer",
x = "Date")+
theme_minimal()
ggplotly(plot1)Based on graph above we could indicate that this data has trend and seasonal pattern. Now, let’s move on to time series object.
We have data set about Bike Sharing that records 8 years of the bike sharing data. How many customers that use this bike sharing per day. Now, we would like to forecast on 2018 of this data set. First, we need to create time series object.
In this section, we’re going to create initial time series object of data set to observe is this data set is an additive or multiplicative?
# initial time series object (first attempt)
bike_ts <- ts(data = bike_clean$total_cust,
start = range(bike_clean$date)[[1]],
frequency = (7*4)) # monthly pattern (7*4)After create time series object, we’re gonna verify is this data has additive or multiplicative pattern or no by using decompose() function.
bike_ts %>%
decompose() %>%
autoplot()Plot above shows that this data has multiplicative pattern and range of errors were from -5,000 to 5,000, and on trend plot shows up and down pattern are still exist. Therefore, we could indicate that this data set has Multiple Seasonal Time Series Object. How do we create time series object if our data set is a multiple seasonal time series? By using msts() function we could create multiple time series object and use mstl() function to create decomposition on multiple seasonal time series object.
# time series object with multiple seasonal (2nd attempt)
msts(data = bike_clean$total_cust,
seasonal.periods = c(7*4, 7*4*4)) %>% # multiple seasonal ts (monthly, quarterly)
mstl() %>% # multiple seasonal decomposition
autoplot()Take a peek on Trend section. The ups and downs are still exist. This mean that we need to find the right frequency/seasonal.periods.
# 2nd attempt on multiple seasonal ts (3rd attempt on time series object)
msts(data = bike_clean$total_cust,
seasonal.periods = c(7*4, 7*4*12)) %>% # multiple seasonal ts (monthly, yearly)
mstl() %>% # multiple seasonal decomposition
autoplot()Although on the 3rd attempt there are still slightly ups and downs, the trend is smoother than the 2nd attempt and the first attempt. Other than that, plot above shows that this data set has additive pattern. Then, we will assign the 3rd attempt into bike_msts. Afterthat, we need to know is time series object a stationary or not by using adf.test() function.
# assign the 3rd attempt
bike_msts <- msts(data = bike_clean$total_cust,
seasonal.periods = c(7*4, 7*4*12))
# check stationary
adf.test(bike_msts)##
## Augmented Dickey-Fuller Test
##
## data: bike_msts
## Dickey-Fuller = -3.6953, Lag order = 14, p-value = 0.02421
## alternative hypothesis: stationary
Result above shows that the p-value of bike_ts is stationary because p-value of data is smaller than 0.05 (p-value < 0.05). There’s no need to do differencing, thus we could applied ARIMA or SARIMA model building.
Separate data to be data train and test. Data test were taken from last year (2018) of data.
bike_test <- bike_msts %>%
tail(7*4*12)
bike_train <- bike_msts %>%
head(length(bike_msts) - (7*4*12))This section, create 3 model. There are Holt-Winter model, model using auto.arima() function (model auto), and Seasonal ARIMA (SARIMA) model by using stlm() function.
# create Holt-Winter (Triple Exponential Smoothing/TES) model
bike_tes <- stlm(y = bike_train, method = "ets")
# create auto.arima model
bike_auto <- auto.arima(y = bike_train, seasonal = T)
# create SARIMA model
bike_stlm <- stlm(y = bike_train, method = "arima")After build model, let’s forecast each model into data test.
# forecast on Holt-Winter model
bike_forecast1 <- forecast(object = bike_tes, h = 336)
# forecast on auto.arima model
bike_forecast2 <- forecast(object = bike_auto, h = 336)
# forecast on SARIMA (stlm) model
bike_forecast3 <- forecast(object = bike_stlm, h = 336)After forecast each model and assign it into bike_forecast1 (for TES/Holt-Winter model), bike_forecast2 (for auto.arima model), and bike_forecast3 (for SARIMA model), it will be better to evaluate each model. There are several ways to evaluate the model. In this section, I will use accuracy() function to compare Mean Absolute Percentage Error (MAPE) of each model.
accuracy(bike_forecast1)## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 8.330484 1408.851 1025.135 -15.97484 30.2332 0.4375529 0.1693021
accuracy(bike_forecast2)## ME RMSE MAE MPE MAPE MASE
## Training set 11.50366 1680.878 1175.023 -23.69987 38.46132 0.5015287
## ACF1
## Training set 0.0002851324
accuracy(bike_forecast3)## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 12.79538 1368.62 987.4546 -15.15221 28.97288 0.4214699 0.004439838
Mean Absolute Percentage Error (MAPE) of each model was:
There were a huge gap, the difference is huge between MAPE of TES/Holt-Winter Model, Auto ARIMA Model, and SARIMA Model.
Another way, we could visualize and compare between the actual and each model.
plot1 <- bike_msts %>%
autoplot(series = "Actual") +
autolayer(bike_forecast1$mean, series = "Predict Test TES/Holt-Winter") +
autolayer(bike_forecast1$fitted, series = "Predict Train TES/Holt-Winter") +
labs(title = "Comparation Actual and Forecast using Holt-Winter Model") +
theme_minimal()
ggplotly(plot1)## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot2 <- bike_msts %>%
autoplot(series = "Actual") +
autolayer(bike_forecast2$mean, series = "Predict Test Auto ARIMA") +
autolayer(bike_forecast2$fitted, series = "Predict Train Auto ARIMA") +
labs(title = "Comparation Actual and Forecast using Auto ARIMA Model") +
theme_minimal()
ggplotly(plot2)plot3 <- bike_msts %>%
autoplot(series = "Actual") +
autolayer(bike_forecast3$mean, series = "Predict Test SARIMA") +
autolayer(bike_forecast3$fitted, series = "Predict Train SARIMA") +
labs(title = "Comparation Actual and Forecast using SARIMA Model") +
theme_minimal()
ggplotly(plot3)Because the error of SARIMA Model is smaller than others, I will do assumption check only on SARIMA Model. There were two assumption that we had to check.
1. No-autocorrelation Residual
To check No-autocorrelation Residual, we could use Box.test() function and input type parameter as Ljung-Box.
2. Normality Residual
To check Normality Residual, we could use shapiro.test() function.
# Assumption Check by LJung-Box Test
Box.test(x = bike_stlm$residuals, type = "Ljung-Box")##
## Box-Ljung test
##
## data: bike_stlm$residuals
## X-squared = 0.051035, df = 1, p-value = 0.8213
# Shapiro Test
shapiro.test(bike_stlm$residuals)##
## Shapiro-Wilk normality test
##
## data: bike_stlm$residuals
## W = 0.95916, p-value < 2.2e-16
Both p-value on autocorrelation and normality test result are exceeding 0.05 (alpha). This means the residuals on SARIMA Model has no-autocorrelation and spread normally.
We have compare each model by the value of MAPE and even visualize it. There are several conclusion that we obtained,
If we compare MAPE between TES/Holt-Winter Model, Auto ARIMA Model, and SARIMA Model, SARIMA Model has smaller error than TES/Holt-Winter and Auto Arima Model. If we would like to forecast, it is more tolerable to use SARIMA Model than other Model.
On plot1 that represent visualization of the actual data and TES/Holt-Winter Model, on the forecast/predict train were able to capture the pattern of the actual, but when we will do some forecast on test there were failure to forecast and most of pattern is not captured.
On plot2 that represent visualization of the actual data and Auto ARIMA Model, the forecast/predict train were able to capture the pattern of the actual, although there were not exact pattern captured. But on the forecast/predict test were unable to capture any pattern.
On plot3 that represent visualization of the actual data and SARIMA Model, the forecast/predict train is just right (able to capture the pattern of the actual). But on the forecast/predict test, most of pattern unable to captured but it’s still better than TES/Holt-Winter or Auto ARIMA Model.
Why does SARIMA Model is better? Based on assumption check, the residuals on SARIMA Model has:
It is recommended to use SARIMA Model, than other models (TES/Holt-Winter and Auto ARIMA).
But, it is recommended to do trial and error further. There is possibilities using other models than SARIMA, Auto ARIMA, and TES/Holt-Winter has smaller error and better performance.