Note: This note includes April 27.
In summary, there are 3 major steps on time series modeling:
Model Identification
-Create plot
-identify if there is seasonality
-check stationarity by using ADF test
-Difference if needed
-Examine ACF and PACF
-Identify possible modelModel Building
-estimate the parameter
-we can simply used built in function like Arima() of forecast package
-Use lmtest package and see if there are parameters that are not significant
-Remodel if needed, we can use AIC to select best modelRun Model Diagnostic Check. Goal is to have white noise residual and if possible guassian white noise.
-check normality of residual using qqnorm() function or shapiro.test()
-check check constant variance assumption, residual vs fitted plot
-check residual independence, residual vs order plot(Optional) Forecast Evaluation using in-sample and out sample strategy, split your data, use in sample to estimate the model and check model performance on out sample. This is recommended if you have large data set. (How large? I don’t know haha)
library(forecast)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(readxl)
require(fUnitRoots)
## Loading required package: fUnitRoots
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'fUnitRoots'
#set the working directory to your folder
setwd("C:/Users/DSTD-NilreyC/Desktop/Nilrey Desktop files/Stat and Data Science References/MS Applied Math Materials/Time Series/Math 215 -MSMA")
#Load the data
Example_6_3 <- read_excel("Example 6.3.xlsx")
appdata<-ts(Example_6_3$Applications)
plot(appdata)
#if you have fUnitRoots follow the code below
##adfTest(appdata,lags=1,type="ct")
#if none, use tseries
tseries::adf.test(appdata)
## Warning in tseries::adf.test(appdata): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: appdata
## Dickey-Fuller = -4.4074, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
Acf(appdata)
Pacf(appdata)
Based on the result above, below is summary of the result:
ACF plot ACF - tails off PACF - cuts off after lag 2
Partial ACF plot -ACF - cuts off after lag 2 -PACF - tails off -ARIMA (0,0,2) -MA (2)
model1<-Arima(appdata,order=c(0,0,2))
model2<-Arima(appdata,order=c(2,0,0))
model3<-Arima(appdata,order=c(1,0,1))
model1$aic
## [1] 689.7361
model2$aic
## [1] 682.924
model3$aic
## [1] 689.7336
We see from AIC above that model2 has the lowest AIC.
Now, let us see the coefficients and if they are significant using lmtest package.
require(lmtest)
coeftest(model2)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.265893 0.088983 2.9881 0.002807 **
## ar2 0.412971 0.090068 4.5851 4.538e-06 ***
## intercept 66.853785 1.833404 36.4643 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2$coef
## ar1 ar2 intercept
## 0.2658927 0.4129705 66.8537847
model2$coef[3]*(1-model2$coef[1]-model2$coef[2])
## intercept
## 21.46921
res<-as.numeric(model2$residuals)
fit<-as.numeric(model2$fitted)
qqnorm(res,datax=TRUE,ylab="Residuals",pch=16)
qqline(res,datax=TRUE)
hist(res,xlab="Residuals",ylab="Frequency")
plot(fit,res,xlab="Fitted Value",ylab="Residuals",pch=16)
abline(h=0)
Based on the plots above, the residuals have a constant vairance because the plot is structureless.
Let’s make another plot.
plot(res,type="l",xlab="Order",ylab="Residuals")
abline(h=0)
Acf(res)
Pacf(res)
Box.test(res,lag=20,type="Ljung-Box",fitdf = 3)
##
## Box-Ljung test
##
## data: res
## X-squared = 14.024, df = 17, p-value = 0.6654
Based on above result:
-The residuals are uncorrelated -Seems normally distributted -The residuals are gaussian white noise -Box plot also is non-significant
Now let us forecast 5 step-ahead
forecast(model2,h=5)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 105 62.58571 54.53554 70.63588 50.27404 74.89738
## 106 64.12744 55.79756 72.45731 51.38798 76.86689
## 107 64.36628 55.17133 73.56123 50.30381 78.42875
## 108 65.06647 55.67336 74.45959 50.70094 79.43201
## 109 65.35129 55.72228 74.98029 50.62500 80.07757
Example_2_2<-read_excel("Example 2.2.xlsx")
data1<-ts(Example_2_2$'Dow Jones',start=c(1999,6),frequency=12)
plot(data1)
#adfTest(data1,lags=0,type="nc")
#if no adfTest from fUnitRoot, use tseries
tseries::adf.test(data1)
##
## Augmented Dickey-Fuller Test
##
## data: data1
## Dickey-Fuller = -1.3254, Lag order = 4, p-value = 0.8528
## alternative hypothesis: stationary
-Result = P value is small
-Model identification:
- A decreasing trend from 1999 to 2003
- Increasing trend from 2003 onwards
- nonstationaty
-ADF Test
- We fail to reject the Ho
- Ho: nonstationary
- H1: stationary
So we will difference the data and plot ACF and PACF and conduct ADF test again on difference data.
Ddata1<-diff(data1,1)
plot(Ddata1)
Acf(data1)
Pacf(data1)
Based on ACF and PACF, possible model is ARIMA(0,1,0)
model<-Arima(data1,order=c(0,1,0))
model
## Series: data1
## ARIMA(0,1,0)
##
## sigma^2 estimated as 165630: log likelihood=-623.93
## AIC=1249.85 AICc=1249.9 BIC=1252.28
We already made analysis on the model from last discussion. But we will still employ diagnostic check.
res<-as.numeric(model$residuals[-1])
fit<-as.numeric(model$fitted[-1])
length((Ddata1))
## [1] 84
length(res)
## [1] 84
qqnorm(res,datax=TRUE,ylab="Residual",pch=16)
qqline(res,datax=TRUE)
plot(fit,res,xlab="Fitted Value",ylab="Residual",pch=16)
abline(h=0)
plot(res,xlab="Order",ylab="Residual",type="l")
abline(h=0)
-QQnorm shows that since residuals are close to the link, this suggest that it is normally distributed. -Residual vs Fitted value is structureless, no pattern. -We can clearly see in residual vs order plot that the variance is not constant, hence not white noise.
Note: Sometimes, we can’t find a perfect model we will just settle on what we can get.
Example 6.5 ARIMA(1,1,0)
Note: In case your forecasting equation has \[\epsilon\] , replace it with e (residual)
[Attached photo on ex.6.5 derivation]
Example 6.7, same to 6.3 where model develop is AR(2) or ARIMA(2,0,0). We can use Arima() function from forecast package. Then we will use forecast() function to forecast or predict
appdata<-ts(Example_6_3$Applications)
data2<-ts(Example_6_3$Applications)
model<-arima(data2,order=c(2,0,0))
model2<-arima(appdata,order=c(2,0,0))
appdata
## Time Series:
## Start = 1
## End = 104
## Frequency = 1
## [1] 71 57 62 64 65 67 65 82 70 74 75 81 71 75 82 74 78 75 73 76 66 69 63
## [24] 76 65 73 62 77 76 88 71 72 66 65 73 76 81 84 68 63 66 71 67 69 63 61
## [47] 68 75 66 81 72 77 66 71 59 57 66 51 59 56 57 55 53 74 64 70 74 69 64
## [70] 68 64 70 73 59 68 59 66 63 63 61 73 72 65 70 54 63 62 60 67 59 74 61
## [93] 61 52 55 61 56 61 60 65 55 61 59 63
class(model2)
## [1] "Arima"
#load forcast package
forecast(model2,h=12)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 105 62.58571 54.65250 70.51892 50.45291 74.71851
## 106 64.12744 55.91858 72.33629 51.57307 76.68180
## 107 64.36628 55.30492 73.42764 50.50812 78.22444
## 108 65.06647 55.80983 74.32312 50.90965 79.22330
## 109 65.35129 55.86218 74.84039 50.83895 79.86362
## 110 65.71617 56.13346 75.29889 51.06068 80.37167
## 111 65.93081 56.27109 75.59054 51.15754 80.70409
## 112 66.13857 56.43926 75.83789 51.30475 80.97240
## 113 66.28246 56.55529 76.00962 51.40605 81.15887
## 114 66.40651 56.66341 76.14961 51.50572 81.30730
## 115 66.49892 56.74534 76.25249 51.58211 81.41572
## 116 66.57472 56.81486 76.33458 51.64830 81.50114
[See slides for more details] ARIMA(p,d,q)x(P,D,Q)s
recall: MA(1) cut off after log one ACF, PACF exponential decay. For seasonal, look at the frequency, example if it is monthly data s=12, then look at the behavior of 12,24,36 . . . of ACF and PACF
Example 6.8 ARIMA(0,1,1)x(0,1,1)12
Load the Data Example 6.9
#Load the Data Example 6.9
Example_6_9<-read_excel("Example 6.9.xlsx")
Example_6_9$Mo<-rep(1:12,11)
data6.9<-Example_6_9
data6.9<-ts(data6.9$Employ, start=c(1971,1), frequency = 12)
plot(data6.9)
seasonplot(data6.9)
boxplot(Employ~Mo, Example_6_9)
tseries::adf.test(data6.9)
##
## Augmented Dickey-Fuller Test
##
## data: data6.9
## Dickey-Fuller = -2.5744, Lag order = 5, p-value = 0.3376
## alternative hypothesis: stationary
-Based on graph, there is change in mean -seasonal -Based on adf test, it is non stationary. We will proceed on differencing
Ddata1<-diff(diff(data6.9,12),1)
plot(Ddata1)
tseries::adf.test(Ddata1)
## Warning in tseries::adf.test(Ddata1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: Ddata1
## Dickey-Fuller = -5.9371, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
Now, the data is stationary and the adf test is significant.
Let’s create ACF and PACF plot.
Acf(Ddata1)
Pacf(Ddata1)
-Seasonal lags, ACF cuts of after lag 12 and PACF tails off. -Non seasonal lags, non significant ACF and PACF, but let us say ACF and PACF tails off. -Based on above findins we can try following model ARIMA(0,1,0)x(0,1,1) and ARIMA(1,1,1)(0,1,1)
model1<-Arima(data6.9, order=c(0,1,0), seasonal = c(0,1,1))
model2<-Arima(data6.9, order = c(1,1,1), seasonal = c(0,1,1))
model1$aic
## [1] 1302.617
model2$aic
## [1] 1296.806
# let us see if coeffciencts are significant
library(lmtest)
coeftest(model2)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.28778 0.24101 1.1940 0.232467
## ma1 -0.57017 0.20550 -2.7745 0.005528 **
## sma1 -0.99987 0.12934 -7.7308 1.069e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let us drop AR(1), hence ARIMA(0,1,1)x(0,1,1)12
model3<-Arima(data6.9, order = c(0,1,1), seasonal = c(0,1,1))
model3$aic
## [1] 1295.926
We see that AIC of model 3 is much lower, hence we select model 3. Now let us proceed on model diagnostic step. We saythat all parameters are significant.
#now let us get the residual, first observed that residual still did not consider the differencing
length(model3$residuals)
## [1] 132
#now let us rmeove first 13
res_val<-as.numeric(model3$residuals[-1:-13])
fit_val<-as.numeric(model3$fitted[-1:-13])
#test for nromality
qqnorm(res_val, datax = TRUE, ylab="Residual", pch=16)
qqline(res_val, datax = TRUE, ylab="Residual", pch=16)
shapiro.test(res_val)
##
## Shapiro-Wilk normality test
##
## data: res_val
## W = 0.98081, p-value = 0.08734
Based on residual plot, seems not normally distributed.
let’s plot more about residual
plot(fit_val, res_val,pch=16)
plot(res_val, xlab="Order", type="l")
Based on the plots above, seems that residuals when plotted against fitted value is strucrreless or no pattern which could mean that variance is constant.
In addition, pattern of residual vs order, there is no pattern. Hence, residual is independence.
We could say now that residual is white noise. Let us examine Acf and Pacf to see if there is no significant spikes to support our claim.
Acf(res_val)
Pacf(res_val)
ACF and PACF are not significant based on the plots which indicates that individually autocorrelation and partial-autocorrelation are not significant.
Let us proceed to Ljung Box, where Ho: No serial correlation among the residuals.
Box.test(res_val, lag=20, type = "Ljung-Box", fitdf=2)
##
## Box-Ljung test
##
## data: res_val
## X-squared = 16.908, df = 18, p-value = 0.5294
Based on test, p-value is greater than 0.05 which we say that there is no serail correlation among the residuals.
Say we have in-sample that we used to build the model and out-sample to get the performance.
We will use the out-sample to performa forecast evaluation
insample<-ts(data6.9[1:82], start=c(1971,1),frequency = 12)
modelA<-Arima(insample, order=c(0,1,1), seasonal = c(0,1,1))
modelB<-Arima(data6.9, model=modelA)
FE<-modelB$residuals[83:132]
acf(FE)
pacf(FE)
shapiro.test(FE)
##
## Shapiro-Wilk normality test
##
## data: FE
## W = 0.9486, p-value = 0.02989
If forecast error ACF and PACF plot shows no spikes which means independence and shaprito test shows normality assumption holds, then our model is good to go.
Note: Having out-sample model evaluation is not a hard rule (by sir Capili)
End of note