Data 624 - Predictive Analytics
Chapter 8
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ ggplot2 3.3.3 ✓ fma 2.4
## ✓ forecast 8.13 ✓ expsmooth 2.3
##
library(tseries)
library(urca)
All three figures indicate that the data are white noise, since the ACF bars are all within the dash lines which indicate critical values for the ACF to be considered statistically significance. For data with smaller number of samples, the ACF bars are taller than the data with larger number of samples. Since the data are randomly generated, they are independently and identically distributed, and therefore should have autocorrelation of zero for all of its lags. This is demonstrated by the figures - as more samples are generated, the autocorrelations tend toward zero.
These dash lines are estimated using ±1.96/√‾‾N with zero center. Mathematically, as N gets bigger, the absolute value of critical value become smaller. Statistically, this means that it is “easier” for smaller data set to exhibit correlation by chance than larger data set. So to compensate, the absolute value of critical values are larger for smaller data set - you need to have higher autocorrelation to reject the null hypothesis that the autocorrelation is zero (i.e. the autocorrelation observed are due to chance only).
help(ibmclose)
ggtsdisplay(ibmclose)
help(usnetelec)
ggtsdisplay(usnetelec)
usnetelec.s <- diff(usnetelec, 1)
ggtsdisplay(usnetelec.s)
usnetelec.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 3 lags.
##
## Value of test-statistic is: 0.1585
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
help(usgdp)
ggtsdisplay(usgdp)
bc.lambda <- BoxCox.lambda(usgdp)
paste('The Box-Cox lambda for this time series is:', bc.lambda)
## [1] "The Box-Cox lambda for this time series is: 0.366352049520934"
usgdp.s <- usgdp %>% BoxCox(bc.lambda) %>% diff(1) %>% diff(1)
ggtsdisplay(usgdp.s)
usgdp.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.012
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
help(mcopper)
ggtsdisplay(mcopper)
bc.lambda <- BoxCox.lambda(mcopper)
paste('The Box-Cox lambda for this time series is:', bc.lambda)
## [1] "The Box-Cox lambda for this time series is: 0.191904709003829"
mcopper.s <- mcopper %>% BoxCox(bc.lambda) %>% diff(1)
ggtsdisplay(mcopper.s)
mcopper.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 6 lags.
##
## Value of test-statistic is: 0.0573
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
help(enplanements)
ggtsdisplay(enplanements)
bc.lambda <- BoxCox.lambda(enplanements)
paste('The Box-Cox lambda for this time series is:', bc.lambda)
## [1] "The Box-Cox lambda for this time series is: -0.226946111237065"
enplanements.s <- enplanements %>% BoxCox(bc.lambda) %>% diff(12) %>% diff(1)
ggtsdisplay(enplanements.s)
enplanements.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0424
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
help(visitors)
ggtsdisplay(visitors)
bc.lambda <- BoxCox.lambda(visitors)
paste('The Box-Cox lambda for this time series is:', bc.lambda)
## [1] "The Box-Cox lambda for this time series is: 0.277524910835111"
visitors.s <- visitors %>% BoxCox(bc.lambda) %>% diff(12) %>% diff(1)
ggtsdisplay(visitors.s)
visitors.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.0158
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349873A"], frequency=12, start=c(1982,4))
ggtsdisplay(myts)
bc.lambda <- BoxCox.lambda(myts)
paste('Best Box-Cox lambda for this time series is:', bc.lambda)
## [1] "Best Box-Cox lambda for this time series is: 0.127636859661548"
myts.s <- myts %>% BoxCox(bc.lambda) %>% diff(1) %>% diff(12)
ggtsdisplay(myts.s)
myts.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0138
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ar.1 <- function(phi, sd=1, n=100){
y <- ts(numeric(n))
e <- rnorm(n, sd=sd)
for(i in 2:n)
y[i] <- phi*y[i-1] + e[i]
return(y)
}
data.list <- list()
i <- 0
phi.vec <- c(0.0000001, 0.0001, 0.1) * 6
for (phi in phi.vec){
i <- i + 1
data.list[[i]] <- ar.1(phi)
}
gen.data <- do.call(cbind, data.list)
colnames(gen.data) <- paste('phi=', phi.vec, sep = '')
autoplot(gen.data) + ylab('Data')
par(mfrow=c(1,3))
acf(gen.data[,1], main='Phi=0.0000006')
acf(gen.data[,2], main='Phi=0.0006')
acf(gen.data[,3], main='Phi=0.6')
ma.1 <- function(theta, sd=1, n=100){
y <- ts(numeric(n))
e <- rnorm(n, sd=sd)
e[1] <- 0
for(i in 2:n)
y[i] <- theta*e[i-1] + e[i]
return(y)
}
data.list <- list()
i <- 0
theta.vec <- c(0.0000001, 0.0001, 0.1) * 6
for (theta in theta.vec){
i <- i + 1
data.list[[i]] <- ma.1(theta)
}
gen.data <- do.call(cbind, data.list)
colnames(gen.data) <- paste('theta=', theta.vec, sep = '')
autoplot(gen.data) + ylab('Data')
par(mfrow=c(1,3))
acf(gen.data[,1], main='theta=0.0000006')
acf(gen.data[,2], main='theta=0.0006')
acf(gen.data[,3], main='theta=0.6')
y1 <- ts(numeric(100))
e <- rnorm(100, sd=1)
for(i in 2:100)
y1[i] <- 0.6*y1[i-1] + 0.6*e[i-1] + e[i]
autoplot(y1) +
ggtitle('ARMA(1,1) Generated Data')
y2 <- ts(numeric(100))
e <- rnorm(100, sd=1)
for(i in 3:100)
y2[i] <- -0.8*y2[i-1] + 0.3*y2[i-2] + e[i]
autoplot(y2) +
ggtitle('AR(2) Generated Data')
par(mfrow=c(1,2))
acf(y1, main='ARMA(1,1) Generated Data')
acf(y2, main='AR(2) Generated Data')
help(wmurders)
ggtsdisplay(wmurders, main='Annual Female Murder Rate in USA')
wmurders.s <- wmurders %>% diff(1) %>% diff(1)
ggtsdisplay(wmurders.s)
wmurders.s %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 3 lags.
##
## Value of test-statistic is: 0.0458
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
(fit <- Arima(wmurders, order=c(1,2,1)))
## Series: wmurders
## ARIMA(1,2,1)
##
## Coefficients:
## ar1 ma1
## -0.2434 -0.8261
## s.e. 0.1553 0.1143
##
## sigma^2 estimated as 0.04632: log likelihood=6.44
## AIC=-6.88 AICc=-6.39 BIC=-0.97
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,2,1)
## Q* = 12.419, df = 8, p-value = 0.1335
##
## Model df: 2. Total lags used: 10
forecast(fit, h=3)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 2.470660 2.194836 2.746484 2.048824 2.892496
## 2006 2.363106 1.986351 2.739862 1.786908 2.939304
## 2007 2.252833 1.765391 2.740276 1.507354 2.998313
(1−ϕ1B)(1−B)2yt=c+(1+θ1B)εt (1+B2−2B−ϕ1B−ϕ1B3+2ϕ1B2)yt=εt+θ1Bεt
Applying the backshift operator:
yt+yt−2−2yt−1−ϕ1yt−1−ϕ1yt−3+2ϕ1yt−2=εt+θ1εt−1
Collecting same terms and moving terms to right side:
yt=(2+ϕ1)yt−1−(1+2ϕ1)yt−2+ϕ1yt−3+εt+θ1εt−1
From R, we estimated ϕ1=−0.2434 and θ1=−0.8261, and we use εt=0 for the forecast. Plugging in all the numbers:
yt=1.7566yt−1−0.5132yt−2−0.2434yt−3−0.8261εt−1
To begin forecasting, we will need the last 3 year data and last year residual:
tail(wmurders, 3)
## Time Series:
## Start = 2002
## End = 2004
## Frequency = 1
## [1] 2.797697 2.662227 2.589383
tail(residuals(fit), 1)
## Time Series:
## Start = 2004
## End = 2004
## Frequency = 1
## [1] 0.03708172
For 2005 forecast:
yt=1.7566×2.589383−0.5132×2.662227−0.2434×2.797697−0.8261×0.03708172=2.470663
For 2006 forecast:
yt=1.7566×2.470663−0.5132×2.589383−0.2434×2.662227−0.8261×0=2.363109
For 2007 forecast:
yt=1.7566×2.363109−0.5132×2.470663−0.2434×2.589383−0.8261×0=2.252837
These are nearly identical to the R calculated point forecasts.
autoplot(forecast(fit, h=3))
(fit2 <- auto.arima(wmurders, seasonal=F, stepwise=F, approximation=F))
## Series: wmurders
## ARIMA(0,2,3)
##
## Coefficients:
## ma1 ma2 ma3
## -1.0154 0.4324 -0.3217
## s.e. 0.1282 0.2278 0.1737
##
## sigma^2 estimated as 0.04475: log likelihood=7.77
## AIC=-7.54 AICc=-6.7 BIC=0.35
forecast(fit2, h=3)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 2.408817 2.137718 2.679916 1.994206 2.823428
## 2006 2.365555 1.985092 2.746018 1.783687 2.947423
## 2007 2.290976 1.753245 2.828706 1.468588 3.113363
autoplot(forecast(fit2, h=3))