1 Initialization

1.2 Data Pre-Processing

1.2.1 Read Data Source

We’re going to use Mobile App Usage data from Kaggle

app <- read.csv("app.csv")
app$time <- dmy_hm(app$time)
app$date <- day(app$time)
app$hour <- hour(app$time)

2 Working with time series

2.1 Understanding Time Series Data

app.ts <- ts(app[,2], start=c(22,9), frequency = 24)
plot(app.ts)

We can see from the chart that at some points of time in a day, the mobile app usage is spiking high and low at the other part.

2.2 Forecasting using ETS

app_ets <- ets(app.ts, model="ZZZ")  
plot(app.ts)
lines(app_ets$fitted, col="tomato3")

app_ets
## ETS(A,N,A) 
## 
## Call:
##  ets(y = app.ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.7521 
##     gamma = 0.0001 
## 
##   Initial states:
##     l = 78.2899 
##     s = -23.5187 -27.9652 -27.3668 -29.3756 -26.7561 -18.3827
##            -8.7678 -3.8701 1.2268 5.1463 14.9795 21.2725 23.3213 21.9408 25.8274 21.8564 22.3235 23.3003 14.1858 5.8434 1.0617 -3.3981 -11.9184 -20.9661
## 
##   sigma:  8.9982
## 
##      AIC     AICc      BIC 
## 1635.316 1646.039 1719.823
paste("MAPE: ", accuracy(app_ets)[5])
## [1] "MAPE:  10.2515619320515"

As we can see, the model used for this time series data is Multiplicative Residuals with \(\alpha\) = 0.7521, No Trend, and Multiplicative Seasonal with \(\gamma\) = 0.0001

Reading the plot, we could see that using ETS, the model is quite good with MAPE 10%

2.3 Forecasting using HW

app_hw <- HoltWinters(app.ts)
plot(app.ts)
lines(app_hw$fitted[,1], col="tomato3")

app_hw
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = app.ts)
## 
## Smoothing parameters:
##  alpha: 0.6984261
##  beta : 0.005933009
##  gamma: 0.9193747
## 
## Coefficients:
##            [,1]
## a    46.1555652
## b    -0.7190774
## s1  -15.0936555
## s2   -4.0695143
## s3   -3.6638995
## s4    6.2353025
## s5   21.0877232
## s6   27.2124700
## s7   29.4604973
## s8   14.7678462
## s9   20.9469595
## s10  14.0849840
## s11  15.7683435
## s12  19.6813584
## s13  18.9266996
## s14  13.6909477
## s15  13.2163776
## s16   7.4841191
## s17   2.4173966
## s18 -10.9438309
## s19 -23.2027770
## s20 -30.5274645
## s21 -24.4254469
## s22 -26.3630012
## s23 -24.7087319
## s24 -27.5926508
paste("MAPE: ", accuracy(forecast(app_hw))[5])
## [1] "MAPE:  13.1905470097771"

As we can see, HW read this time series data as Additive Seasonal (differ with ETS), with \(\alpha\) = 0.698, \(\beta\) = 0.06 \(\gamma\) = 0.9193

Reading the plot, we could see that using HW, the model is quite good but worse MAPE 13%

acf(forecast(app_hw)$residuals, na.action = na.pass)

By evaluating the correlogram produced by ACF function above, we can see that there’s only 2 data that fall inside the 95% confidence bounds indicating the residuals appear to be random



2.4 Forecasting with ARIMA

2.4.1 Check Data Stationary

adf.test(app.ts, alternative="s")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  app.ts
## Dickey-Fuller = -4.4686, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

The p-value is smaller than 0.05, then we reject the H0 (data is stationary)

2.4.2 Auto ARIMA

app_autos <- auto.arima(app.ts)
app_auto <- auto.arima(app.ts, seasonal = F)
app_autos$aic
## [1] 1085.202
app_auto
## Series: app.ts 
## ARIMA(2,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     mean
##       1.8659  -0.9097  -1.0432  0.3323  71.7823
## s.e.  0.0446   0.0431   0.0918  0.0728   4.5941
## 
## sigma^2 estimated as 83.1:  log likelihood=-612.34
## AIC=1236.68   AICc=1237.2   BIC=1255.46

Given the AIC value from seasonal and non-seasonal model, we could simply chose the seasonal model of ARIMA for this kind of time series data

Furthermore we will compare the auto ARIMA with manual ARIMA, both using seasonal ARIMA

2.4.3 Manual ARIMA

app_manual <- arima(app.ts, c(0,1,2), list(order = c(0,3,3), period=24))
app_manual$aic
## [1] 888.9787


Using the above order, we get smaller AIC Value (888.98 compare to 1085.202) But before we chose model, we could compare the MAPE value given for each model

paste("MAPE Auto Seasonal: ", accuracy(app_autos)[5])
## [1] "MAPE Auto Seasonal:  10.3635034798368"
paste("MAPE Manual Seasonal: ", accuracy(app_manual)[5])
## [1] "MAPE Manual Seasonal:  10.205680541753"

Looking at below result, we could simply pick the manual ARIMA because of the MAPE value is smaller

2.4.4 Auto-corellation Test

tsdisplay(app_manual$residuals)



Both the ACF and PACF show almost significant spikes at lag 7, indicating some additional non-seasonal terms need to be included in the model. Using the Ljung-Box test we try this model:

Box.test(app_manual$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  app_manual$residuals
## X-squared = 20.038, df = 20, p-value = 0.4555

Here the Ljung-Box test statistic is 20, and the p-value is 0.46, so there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lag 1-20

3 Summary

For this time series data, it’s safe to say that we chose the modelling using Manual Arima with pdq order = (0,1,2), and seasonal order = (0,3,3)