library(dplyr)
library(forecast)
library(lubridate)
library(fpp)
library(ggplot2)
We’re going to use Mobile App Usage data from Kaggle
app <- read.csv("app.csv")
app$time <- dmy_hm(app$time)
app$date <- day(app$time)
app$hour <- hour(app$time)
app.ts <- ts(app[,2], start=c(22,9), frequency = 24)
plot(app.ts)
We can see from the chart that at some points of time in a day, the mobile app usage is spiking high and low at the other part.
app_ets <- ets(app.ts, model="ZZZ")
plot(app.ts)
lines(app_ets$fitted, col="tomato3")
app_ets
## ETS(A,N,A)
##
## Call:
## ets(y = app.ts, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.7521
## gamma = 0.0001
##
## Initial states:
## l = 78.2899
## s = -23.5187 -27.9652 -27.3668 -29.3756 -26.7561 -18.3827
## -8.7678 -3.8701 1.2268 5.1463 14.9795 21.2725 23.3213 21.9408 25.8274 21.8564 22.3235 23.3003 14.1858 5.8434 1.0617 -3.3981 -11.9184 -20.9661
##
## sigma: 8.9982
##
## AIC AICc BIC
## 1635.316 1646.039 1719.823
paste("MAPE: ", accuracy(app_ets)[5])
## [1] "MAPE: 10.2515619320515"
As we can see, the model used for this time series data is Multiplicative Residuals with \(\alpha\) = 0.7521, No Trend, and Multiplicative Seasonal with \(\gamma\) = 0.0001
Reading the plot, we could see that using ETS, the model is quite good with MAPE 10%
app_hw <- HoltWinters(app.ts)
plot(app.ts)
lines(app_hw$fitted[,1], col="tomato3")
app_hw
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = app.ts)
##
## Smoothing parameters:
## alpha: 0.6984261
## beta : 0.005933009
## gamma: 0.9193747
##
## Coefficients:
## [,1]
## a 46.1555652
## b -0.7190774
## s1 -15.0936555
## s2 -4.0695143
## s3 -3.6638995
## s4 6.2353025
## s5 21.0877232
## s6 27.2124700
## s7 29.4604973
## s8 14.7678462
## s9 20.9469595
## s10 14.0849840
## s11 15.7683435
## s12 19.6813584
## s13 18.9266996
## s14 13.6909477
## s15 13.2163776
## s16 7.4841191
## s17 2.4173966
## s18 -10.9438309
## s19 -23.2027770
## s20 -30.5274645
## s21 -24.4254469
## s22 -26.3630012
## s23 -24.7087319
## s24 -27.5926508
paste("MAPE: ", accuracy(forecast(app_hw))[5])
## [1] "MAPE: 13.1905470097771"
As we can see, HW read this time series data as Additive Seasonal (differ with ETS), with \(\alpha\) = 0.698, \(\beta\) = 0.06 \(\gamma\) = 0.9193
Reading the plot, we could see that using HW, the model is quite good but worse MAPE 13%
acf(forecast(app_hw)$residuals, na.action = na.pass)
By evaluating the correlogram produced by ACF function above, we can see that there’s only 2 data that fall inside the 95% confidence bounds indicating the residuals appear to be random
adf.test(app.ts, alternative="s")
##
## Augmented Dickey-Fuller Test
##
## data: app.ts
## Dickey-Fuller = -4.4686, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
The p-value is smaller than 0.05, then we reject the H0 (data is stationary)
app_autos <- auto.arima(app.ts)
app_auto <- auto.arima(app.ts, seasonal = F)
app_autos$aic
## [1] 1085.202
app_auto
## Series: app.ts
## ARIMA(2,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 mean
## 1.8659 -0.9097 -1.0432 0.3323 71.7823
## s.e. 0.0446 0.0431 0.0918 0.0728 4.5941
##
## sigma^2 estimated as 83.1: log likelihood=-612.34
## AIC=1236.68 AICc=1237.2 BIC=1255.46
Given the AIC value from seasonal and non-seasonal model, we could simply chose the seasonal model of ARIMA for this kind of time series data
Furthermore we will compare the auto ARIMA with manual ARIMA, both using seasonal ARIMA
app_manual <- arima(app.ts, c(0,1,2), list(order = c(0,3,3), period=24))
app_manual$aic
## [1] 888.9787
Using the above order, we get smaller AIC Value (888.98 compare to 1085.202) But before we chose model, we could compare the MAPE value given for each model
paste("MAPE Auto Seasonal: ", accuracy(app_autos)[5])
## [1] "MAPE Auto Seasonal: 10.3635034798368"
paste("MAPE Manual Seasonal: ", accuracy(app_manual)[5])
## [1] "MAPE Manual Seasonal: 10.205680541753"
Looking at below result, we could simply pick the manual ARIMA because of the MAPE value is smaller
tsdisplay(app_manual$residuals)
Both the ACF and PACF show almost significant spikes at lag 7, indicating some additional non-seasonal terms need to be included in the model. Using the Ljung-Box test we try this model:
Box.test(app_manual$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: app_manual$residuals
## X-squared = 20.038, df = 20, p-value = 0.4555
Here the Ljung-Box test statistic is 20, and the p-value is 0.46, so there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lag 1-20
For this time series data, it’s safe to say that we chose the modelling using Manual Arima with pdq order = (0,1,2), and seasonal order = (0,3,3)