library(tseries) # for ADF test
library(forecast)# auto.arima
data(AirPassengers) # load
class(AirPassengers)
[1] "ts"
summary(AirPassengers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
104.0 180.0 265.5 280.3 360.5 622.0
start(AirPassengers) # like head(xxx, 1)
[1] 1949 1
end(AirPassengers)
[1] 1960 12
frequency(AirPassengers) # The cycle of this time series: 12
[1] 12
AirPassengers # displayed as matrix / table due to frequency, but actually 1-dim
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
plot(AirPassengers)
linearModel = lm(AirPassengers ~ time(AirPassengers))
abline(reg = linearModel) # fit in a Linear Model (Intercept & Slope), and plot the line
# OBS: Those 3 actions should be done TOGETHER in a chunk /* cuz abline() needs the plot as canvas. */
cycle(AirPassengers) # the cycle across years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 1 2 3 4 5 6 7 8 9 10 11 12
1950 1 2 3 4 5 6 7 8 9 10 11 12
1951 1 2 3 4 5 6 7 8 9 10 11 12
1952 1 2 3 4 5 6 7 8 9 10 11 12
1953 1 2 3 4 5 6 7 8 9 10 11 12
1954 1 2 3 4 5 6 7 8 9 10 11 12
1955 1 2 3 4 5 6 7 8 9 10 11 12
1956 1 2 3 4 5 6 7 8 9 10 11 12
1957 1 2 3 4 5 6 7 8 9 10 11 12
1958 1 2 3 4 5 6 7 8 9 10 11 12
1959 1 2 3 4 5 6 7 8 9 10 11 12
1960 1 2 3 4 5 6 7 8 9 10 11 12
t = aggregate(AirPassengers,FUN=mean)
t
Time Series:
Start = 1949
End = 1960
Frequency = 1
[1] 126.6667 139.6667 170.1667 197.0000 225.0000 238.9167 284.0000 328.2500 368.4167 381.0000 428.3333 476.1667
plot(t) # aggregate the cycles and display a year on year trend.
# the year on year trend clearly shows the incremental pattern.
boxplot(AirPassengers ~ cycle(AirPassengers)) # plot across months, a sense on seasonal effect
ACF: Auto-correlation Function, aka Total Correlation Chart.
ACF is a plot of total correlation between different lag functions.
PACF: partial auto-correlation function.
apNum = as.numeric(AirPassengers)
apDiff = diff(AirPassengers, differences = 1)
op = par(mfrow=c(1,2)) # start multi plot
acf(apDiff, plot = T)
pacf(apDiff, plot = T)
# both titles are "Series apDiff"
par(op) # stop mulit plot
# OBS: this is just example, cuz apDiff is not stationary series.
# OBS: to see the real lag as integers, use the variable apNum instead of AirPassengers.
ARIMA is introduced by Box and Jenkins in 1970s.
plot(AirPassengers)
We can see:
yearly growth trend
yearly variance groth trend
seasonality with a 12 months cycle
For \(x(t) = (mean + trend * t) + error\), discard the \((mean + trend * t)\) part.
This differencing is called Integration in AR(I)MA.
Three parameters: $ p: AR. d: I. q: MA $
Can also be seen from ACF/PACF.
See the manual arima() part below.
To stationarize the series,
for yearly groth: differencing;
for yearly variance groth: log;
for seasonality: ?????
Test:
adf.test(diff(log(AirPassengers)), alternative="stationary", k=0)
p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: diff(log(AirPassengers))
Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
ADF test’s NULL-hypothesis: NOT stationary.
The result tells that the alternative hypothesis (stationary) is choosen, which means our strategy (log + diff) is right.
??? problem: it seems that even the original series is stationary according to ADF test, which should not be true. ???
Warn The series that we are using is now a log-ed series, which will be feeded into SARIMA model.
plot(diff(log(AirPassengers)))
Parameter values \(p,d,q\) can be found using ACF and PACF.
If d = 0:
acf(log(apNum))
If both ACF and PACF decreases gradually (hard to see cut-off), we need more stationary by introducing \(d\).
The decay of ACF chart is very slow, which means that the population is not stationary.
If d = 1:
acf(diff(log(apNum)))
After adding \(d\), ARIMA model becomes ARMA model.
pacf(diff(log(apNum)))
??? Clearly, ACF plot cuts off after the first lag. Hence, we understood that value of p should be 0 as the ACF is the curve getting a cut off. While value of q should be 1 or 2. After a few iterations, we found that (0,1,1) as (p,d,q) comes out to be the combination with least AIC and BIC. ???
Based on previous section, we explore more \((p,d,q)\) combinations.
The one with the lowest BIC (Bayesian Information Criterion, aka Schwarz Criterion, SBC, SBIC) and AIC (Akaike Information Critera) should be our choice.
autoArimaModel = auto.arima(AirPassengers, d = 1)
autoArimaModel
Series: AirPassengers
ARIMA(0,1,1)(0,1,0)[12]
Coefficients:
ma1
-0.3184
s.e. 0.0877
sigma^2 estimated as 138.3: log likelihood=-508.32
AIC=1020.64 AICc=1020.73 BIC=1026.39
autoArimaModelLog = auto.arima(log(AirPassengers), d = 1)
autoArimaModelLog
Series: log(AirPassengers)
ARIMA(0,1,1)(0,1,1)[12]
Coefficients:
ma1 sma1
-0.4018 -0.5569
s.e. 0.0896 0.0731
sigma^2 estimated as 0.001371: log likelihood=244.7
AIC=-483.4 AICc=-483.21 BIC=-474.77
pdqParam = c(0, 1, 1)
manualFit <- arima(log(AirPassengers), pdqParam, seasonal = list(order = pdqParam, period = 12))
manualFit
Call:
arima(x = log(AirPassengers), order = pdqParam, seasonal = list(order = pdqParam,
period = 12))
Coefficients:
ma1 sma1
-0.4018 -0.5569
s.e. 0.0896 0.0731
sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
autoPred = forecast(autoArimaModel, h=25)
plot(autoPred)
autoPred = forecast(autoArimaModelLog, h=25)
autoPred$mean = exp(autoPred$mean)
autoPred$lower = exp(autoPred$lower)
autoPred$upper = exp(autoPred$upper)
autoPred$x = exp(autoPred$x)
autoPred$fitted = exp(autoPred$fitted)
autoPred$residuals = exp(autoPred$residuals)
plot(autoPred)
manualPred <- predict(manualFit, n.ahead = 25)
ts.plot(AirPassengers, exp(manualPred$pred), log = "y", lty = c(1,3))
Ljung-Box test is NOT suitable to validate the model, cuz LB test requires:
stationarity.
white noise process has a normal distribution with mean zero.
ref and related auto.arima() Failed The Ljung-Box
Box.test(autoArimaModel$resid,type="Ljung-Box")
Box-Ljung test
data: autoArimaModel$resid
X-squared = 0.0045141, df = 1, p-value = 0.9464