Testing and Diagnostics
Testing: \({H_0}: {\phi}=0\) or \({H_0}: {\theta}=0\), with the usual p-value calculation for each estimated parameter.
Diagnostics: residual analysis. If the ARIMA model is doing a good job, the residuals are small and the statistic we calculate is also small; if it's big, our ARIMA isn't doing a good job of describing the data.
\({H_0}\): the model adequately describes our TS data.
Test stats: the Box-Pierce statistic or the Ljung-Box statistic. Box.test(mod1$residuals) computes the Box-Pierce statistic. Large p-value: we don't have evidence to say our model is inadequate. Small p-value: back up a couple of steps and improve the model. (Both statistics are sketched below.)
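A minimal sketch of this diagnostic, assuming mod1 is a fitted arima() model (one is fit later in these notes):
# Box-Pierce (default) and Ljung-Box versions of the residual test;
# a large p-value means no evidence that the model is inadequate.
Box.test(mod1$residuals)                 # Box-Pierce
Box.test(mod1$residuals, type = "Ljung") # Ljung-Box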
Decomposing Time Series
# n births per month in NYC
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
Read 168 items
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))
TScomp <- decompose(birthstimeseries) # split the data into trend, seasonal, and random components; the three bottom panels, added together, reproduce the observed series
plot(TScomp)
# subtract seasonal component
noseason <- birthstimeseries - TScomp$seasonal
plot(noseason)
# only trend and irregular component remains
# this is what we removed
plot(TScomp$seasonal)
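The forecast package (attached in the next block) also provides seasadj() as a one-liner for this subtraction; a minimal sketch, assuming the additive decomposition above:
library(forecast) # also attached below
# seasadj() returns the observed series minus the estimated seasonal component
plot(seasadj(TScomp))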
library(forecast)
library(quantmod)
library(tseries)
library(timeSeries)
library(xts)
Seasonal ARIMA: seasonal and non-seasonal differences. If there is seasonality in our data, we take seasonal differences \({y_t}-{y_{t-s}}\), i.e. \({y_t}-{y_{t-12}}\) for monthly data; this helps remove seasonal trends.
Non-seasonal differences remove an upward/downward trend: \({z_t}-{z_{t-1}}\), i.e. d=1. (Both kinds are sketched below.)
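A minimal sketch of both operations on the births series defined above:
# Seasonal difference: y_t - y_{t-12} (lag = frequency of the series)
sdiff <- diff(birthstimeseries, lag = 12)
# Non-seasonal (first) difference: z_t - z_{t-1}
ndiff <- diff(birthstimeseries, lag = 1)
plot.ts(cbind(sdiff, ndiff))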
ACF (autocorrelation function): tells you the correlation of two points separated by a specific lag; the lag is how far apart the two measurements are. PACF (partial autocorrelation function): like the ACF, but controls for the values at shorter lags.
Acf(birthstimeseries)
Pacf(birthstimeseries)
The blue dotted lines are rough significance bounds: spikes that stay inside them are not significantly different from zero.
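For reference, those bounds are approximately \(\pm 1.96/\sqrt{n}\); a quick check for this series:
# approximate 95% bounds drawn by Acf()/Pacf()
qnorm(0.975) / sqrt(length(birthstimeseries))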
#### STEP 1: PLOT DATA
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
Read 168 items
births <- births[-c(1:24)] # drop the first two years
births <- ts(births, frequency=12, start=c(1948,1)) # start is 1948 after dropping 24 months
par(mfrow=c(2,1))
plot.ts(births) # hard to see seasonality due to noise
#this filter helps smooth the noise, seasonality sticks out
plot.ts(filter(births, sides=2, filter=rep(1/3,3)))
#this filter mostly smooths out the seasonality
par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)
[1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
[12] 0.08333333 0.04166667
plot.ts(filter(births, sides=2, filter = wgts))
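Aside: these weights are the same centered 2x12 moving average that decompose() uses internally to estimate the trend, so this smoothed curve should closely track the decomposed trend:
# trend component estimated by decompose() with the same filter
plot.ts(decompose(births)$trend)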
#### STEP 2: DIFFS
First: diffs of lag S to remove seasonality.
Second: diffs to remove trend.
(Skip either step if deemed unnecessary in step 1.)
# Step 2, part A:
diff1 <- diff(births, lag = 12) # s = 12: monthly data, 12 observations per year
plot.ts(diff1)
adf.test(diff1) #null: data is not stationary
Augmented Dickey-Fuller Test
data: diff1
Dickey-Fuller = -3.4166, Lag order = 5, p-value = 0.05488
alternative hypothesis: stationary
The p-value (0.055) is borderline, but low enough that we call the differenced series stationary.
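A complementary check, not in the original notes: the KPSS test from tseries reverses the hypotheses (its null is stationarity), so a large KPSS p-value corroborates the ADF conclusion.
# KPSS test: null = level stationarity (opposite of the ADF null)
kpss.test(diff1)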
STEP 2, part B: doesn’t look necessary anymore (no trend remaining in diff1)
In this example D=1 and d=0
#### STEP 3: ACF and PACF
## Let's look at the nonseasonal p and q first
Acf(diff1)  # spikes at low lags ==> nonseasonal MA (q >= 1)
Pacf(diff1) # spike at low lag ==> nonseasonal AR (p >= 1)
# Our model thus far is p>=1, d=0, q>=1
## Let's look at the seasonal P and Q now
Acf(diff1)  # spikes at lag 12 (S) ==> seasonal MA (Q >= 1)
Pacf(diff1) # spike at lag 12 ==> seasonal AR (P >= 1)
# Our model thus far is P>=1, D>=1, Q>=1
#### STEP 4: FIT A MODEL
# ARIMA (1, 0, 1) x (1, 1, 1) 12
mod1 <- arima(births, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12)) # first argument is the original data set, second is the order of the non-seasonal part, third is a list with two things: the order of the seasonal part and the period
#### STEP 5: EXAMINE MODELS, and maybe make more
# ARIMA (1, 0, 1) x (1, 1, 1) 12
summary(mod1)
Call:
arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
Coefficients:
ar1 ma1 sar1 sma1
0.9884 -0.1564 -0.2199 -0.8820
s.e. 0.0287 0.1747 0.1004 0.1498
sigma^2 estimated as 0.3924: log likelihood = -136.81, aic = 283.61
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.06656965 0.5998188 0.4452406 0.2299257 1.727419 0.3675936 0.0410978
Box.test(mod1$residuals, type = "Ljung")
Box-Ljung test
data: mod1$residuals
X-squared = 0.24832, df = 1, p-value = 0.6183
# no evidence the model is inadequate! (resids small enough)
# but mod1's ma1 is not significantly diff from 0, so let's try removing it
# ARIMA (1, 0, 0) x (1, 1, 1) 12
mod2 <- arima(births, order = c(1,0,0), seasonal = list(order = c(1,1,1), period = 12))
Box.test(mod2$residuals, type = "Ljung") # yay
Box-Ljung test
data: mod2$residuals
X-squared = 0.62922, df = 1, p-value = 0.4276
summary(mod2) # most Wald test stats look decently far from 0
Call:
arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
Coefficients:
ar1 sar1 sma1
0.9750 -0.2263 -0.8595
s.e. 0.0326 0.0998 0.1280
sigma^2 estimated as 0.3986: log likelihood = -137.27, aic = 282.55
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.07991759 0.6045416 0.450681 0.28466 1.748639 0.3720852 -0.06542041
# let's test the parameter with Wald test stat closest to 0
(teststat <- -0.2263/0.0998)
[1] -2.267535
2*pt(abs(teststat), df = 144-3, lower.tail = FALSE) #p value
[1] 0.02487795
# therefore, reject null that this parameter = 0
# we should keep this parameter estimate in the model
# and not remove the seasonal ar parameter
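The same calculation can be applied to every coefficient at once. A small sketch; wald_table is a hypothetical helper (not part of the original notes), using the same n minus number-of-parameters df convention as above:
# hypothetical helper: Wald t-statistics and two-sided p-values
# for all coefficients of a fitted arima() model
wald_table <- function(mod, df) {
  est <- mod$coef
  se  <- sqrt(diag(mod$var.coef))
  tt  <- est / se
  cbind(estimate = est, se = se, t = tt,
        p.value = 2 * pt(abs(tt), df = df, lower.tail = FALSE))
}
wald_table(mod2, df = 144 - 3)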
### Let's see if we can improve our model.
## we said p >=1, P >=1, Q >=1 and we tried all EQUAL to 1
# ARIMA (2, 0, 0) x (1, 1, 1) 12
mod3 <- arima(births, order = c(2,0,0), seasonal = list(order = c(1,1,1), period = 12))
mod3 # ar2 not significantly diff from 0
Call:
arima(x = births, order = c(2, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
Coefficients:
ar1 ar2 sar1 sma1
0.9275 0.0513 -0.2237 -0.8671
s.e. 0.0876 0.0881 0.0999 0.1333
sigma^2 estimated as 0.3965: log likelihood = -137.1, aic = 284.21
# ARIMA (1, 0, 0) x (2, 1, 1) 12
mod4 <- arima(births, order = c(1,0,0), seasonal = list(order = c(2,1,1), period = 12))
mod4 # sar2 IS significantly diff from 0
Call:
arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 1), period = 12))
Coefficients:
ar1 sar1 sar2 sma1
0.9677 -0.4796 -0.3846 -0.6243
s.e. 0.0268 0.1138 0.1022 0.1227
sigma^2 estimated as 0.3673: log likelihood = -131.72, aic = 273.44
AIC(mod4); AIC(mod2) # we improved! mod4 better than mod2
[1] 273.4391
[1] 282.5484
# should we add even more?
# ARIMA (1, 0, 0) x (2, 1, 2) 12
mod5 <- arima(births, order = c(1,0,0), seasonal = list(order = c(2,1,2), period = 12))
mod5 # sma1 NOT significantly diff from 0
Call:
arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 2), period = 12))
Coefficients:
ar1 sar1 sar2 sma1 sma2
0.9612 -0.2486 -0.3563 -0.8751 0.2615
s.e. 0.0290 0.2787 0.1310 0.3022 0.2498
sigma^2 estimated as 0.3631: log likelihood = -131.09, aic = 274.19
# so forget mod5. Simpler (mod4) is better
AIC(mod5); AIC(mod4) #plus they have the same AIC (roughly anyway)
[1] 274.1868
[1] 273.4391
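This fit-and-compare loop can be automated. A compact sketch (an extension of the notes, not part of them) that ranks a few candidate seasonal orders by AIC:
# illustrative candidate set only; in practice, let the ACF/PACF guide it
cands <- list(c(1,1,1), c(2,1,1), c(2,1,2))
fits  <- lapply(cands, function(so)
  arima(births, order = c(1,0,0), seasonal = list(order = so, period = 12)))
setNames(sapply(fits, AIC), sapply(cands, paste, collapse = ","))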
# Diagnostics for ARIMA (1, 0, 0) x (2, 1, 1) 12
mod4
Call:
arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 1), period = 12))
Coefficients:
ar1 sar1 sar2 sma1
0.9677 -0.4796 -0.3846 -0.6243
s.e. 0.0268 0.1138 0.1022 0.1227
sigma^2 estimated as 0.3673: log likelihood = -131.72, aic = 273.44
Box.test(mod4$residuals) # model not inadequate :)
Box-Pierce test
data: mod4$residuals
X-squared = 1.3432, df = 1, p-value = 0.2465
# So our final mod is mod4
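Beyond the single Box-Pierce statistic, an optional visual check (not in the original notes) comes from tsdiag() in base stats:
# standardized residuals, residual ACF, and Ljung-Box p-values
# over a range of lags, all in one plot
tsdiag(mod4)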
data("wineind")
plot(wineind)
tt<-decompose(wineind)
plot(tt)
nooseason <- wineind - tt$seasonal
plot(nooseason)
plot(tt$seasonal)
diff4 <- diff(wineind, lag = 12) # s = 12: monthly data, 12 observations per year
plot.ts(diff4)
adf.test(diff4)
Warning: p-value smaller than printed p-value (the actual p-value is below 0.01)
Augmented Dickey-Fuller Test
data: diff4
Dickey-Fuller = -4.458, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
Acf(nooseason)  # ACF of the seasonally adjusted series
Pacf(nooseason) # (for model identification, the ACF/PACF of diff4 would be the usual choice)
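One plausible next step, following the same recipe as the births example; the orders below are only a starting guess and would need to be checked against the ACF/PACF and residual diagnostics:
modw <- arima(wineind, order = c(1,0,1), seasonal = list(order = c(0,1,1), period = 12))
Box.test(modw$residuals, type = "Ljung") # large p-value ==> no evidence of inadequacy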