Transform the data and test using ADF and KPSS tests. Follow the Box-Jenkins methodology to build a time series model based on the data until the end of 2014. Check the model for adequacy and plot a forecast until the end of 2016
Plotting the data
unrate <- Quandl("FRED/UNRATE", type="ts")
plot(unrate, ylab="Unemployment Rate")
To be able to see the accuracy of forecast result later after we build the model, we have to split the data into two: Data up to 2014 and Data After 2014. For the first part, we are going to use Data up to 2013
unrateall <- unrate
unrate1 <- window(unrateall, end=c(2014,12))
unrate2 <- window(unrateall, start=c(2015,1))
unrate <- unrate1
Based on the initial plot of unemployment rate before, we will doing log transformation of the data using the following code
lunrate <- diff(log(unrate), differences=1)
The next step is to test for the stationary. First we can use ADF test
lunrate.urdf <- ur.df(lunrate, type = "trend", selectlags = "BIC")
summary(lunrate.urdf)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.222527 -0.019520 -0.000486 0.020714 0.222800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.384e-04 2.627e-03 0.357 0.721
## z.lag.1 -6.668e-01 4.514e-02 -14.770 < 2e-16 ***
## tt -1.795e-06 5.663e-06 -0.317 0.751
## z.diff.lag -2.330e-01 3.427e-02 -6.799 2.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03704 on 797 degrees of freedom
## Multiple R-squared: 0.4652, Adjusted R-squared: 0.4632
## F-statistic: 231.1 on 3 and 797 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -14.77 72.7304 109.0867
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -3.96 -3.41 -3.12
## phi2 6.09 4.68 4.03
## phi3 8.27 6.25 5.34
From the ADF test we can see that t-statistics are is significant which means that we reject null hypothesis, log of unemployment rate is trend stationary. Furthermore, we also interested to proceed to estimate model B.
lunrate.modelb <- ur.df(lunrate, type ="drift", selectlags = "BIC")
summary(lunrate.modelb)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression drift
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.221867 -0.019703 -0.000217 0.020788 0.223357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0002166 0.0013083 0.166 0.869
## z.lag.1 -0.6663353 0.0450969 -14.776 < 2e-16 ***
## z.diff.lag -0.2332737 0.0342419 -6.813 1.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03702 on 798 degrees of freedom
## Multiple R-squared: 0.4651, Adjusted R-squared: 0.4638
## F-statistic: 347 on 2 and 798 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -14.7756 109.1684
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1 6.43 4.59 3.78
From model B, we still get the same conclusion. Testing for model A, we’ll get the following:
lunrate.modela <- ur.df(lunrate, selectlags = "BIC")
summary(lunrate.modela)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22166 -0.01949 0.00000 0.02100 0.22356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.66618 0.04506 -14.78 < 2e-16 ***
## z.diff.lag -0.23336 0.03422 -6.82 1.8e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.037 on 799 degrees of freedom
## Multiple R-squared: 0.4651, Adjusted R-squared: 0.4638
## F-statistic: 347.4 on 2 and 799 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -14.7843
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
Model A also shows the same conclusion. Testing for the unit root using ADF test showing that null hypothesis rejected at any level, the value of test statistics is smaller than the critical value. Concluding that the data is trend stationary. Confirming with KPSS test we can get the same result
test1 <- kpss.test(lunrate, null = "Trend")
test1
KPSS Test for Trend Stationarity
data: lunrate
KPSS Trend = 0.027704, Truncation lag parameter = 6, p-value = 0.1
The next step to Box-Jenkins Method is to plot the ACF and PACF to decide the order of the univariate time series model
acf(as.data.frame(lunrate), lag=24, main=expression(Log(Unemp_Rate)))
pacf(as.data.frame(lunrate), lag=24, main=expression(Log(Unemp_Rate)))
Looking at the PACF, the most adequate model could possibly ARIMA(1,0,1)(1,0,1)[12] , or ARIMA(2,0,6)(1,0,1)[12] or ARIMA(1,0,6)(1,0,1)[12] or ARIMA(5,0,1)(1,0,1)[12] or ARIMA(5,0,6)(1,0,1)[12] Estimating the model give us the following:
m1 <- arima(lunrate,order=c(1,0,1),seasonal=list(order=c(2,0,1),period=12))
m1
##
## Call:
## arima(x = lunrate, order = c(1, 0, 1), seasonal = list(order = c(2, 0, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1 intercept
## 0.8885 -0.7506 0.4519 -0.0603 -0.7115 0.0007
## s.e. 0.0310 0.0412 0.1043 0.0545 0.0998 0.0014
##
## sigma^2 estimated as 0.001242: log likelihood = 1545.53, aic = -3077.06
tsdiag(m1,gof.lag=12)
The p-values of Ljung Box statistics is bad for this model, signal to check for another model
m2 <- arima(lunrate,order=c(2,0,6),seasonal=list(order=c(1,0,1),period=12))
m2
##
## Call:
## arima(x = lunrate, order = c(2, 0, 6), seasonal = list(order = c(1, 0, 1), period = 12))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 ma6
## 0.1772 0.4243 -0.1505 -0.2705 0.1063 0.0227 0.1146 0.0101
## s.e. NaN NaN NaN NaN NaN 0.0265 0.0437 NaN
## sar1 sma1 intercept
## 0.5668 -0.8219 0.0007
## s.e. 0.0682 0.0497 0.0011
##
## sigma^2 estimated as 0.001204: log likelihood = 1557.95, aic = -3091.89
tsdiag(m2,gof.lag=12)
This model produced far better p-values and smaller AIC, trying third model
m3 <- arima(lunrate,order=c(1,0,1),seasonal=list(order=c(1,0,1),period=12))
m3
##
## Call:
## arima(x = lunrate, order = c(1, 0, 1), seasonal = list(order = c(1, 0, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sma1 intercept
## 0.8905 -0.7533 0.5116 -0.7852 0.0007
## s.e. 0.0305 0.0406 0.0762 0.0573 0.0013
##
## sigma^2 estimated as 0.001244: log likelihood = 1544.93, aic = -3077.86
tsdiag(m3,gof.lag=12)
The result gives us similar result with the first model, trying for the fourth model
m4 <- arima(lunrate,order=c(1,0,1),seasonal=list(order=c(2,0,1),period=12))
m4
##
## Call:
## arima(x = lunrate, order = c(1, 0, 1), seasonal = list(order = c(2, 0, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1 intercept
## 0.8885 -0.7506 0.4519 -0.0603 -0.7115 0.0007
## s.e. 0.0310 0.0412 0.1043 0.0545 0.0998 0.0014
##
## sigma^2 estimated as 0.001242: log likelihood = 1545.53, aic = -3077.06
tsdiag(m4,gof.lag=12)
P-values still have problems, trying fifth model
m5 <- arima(lunrate,order=c(5,0,1),seasonal=list(order=c(2,0,1),period=12))
m5
##
## Call:
## arima(x = lunrate, order = c(5, 0, 1), seasonal = list(order = c(2, 0, 1), period = 12))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 sar1 sar2
## 0.0708 0.1568 0.1381 0.0850 0.1172 -0.0439 0.4721 -0.0775
## s.e. 0.2354 0.0367 0.0548 0.0541 0.0427 0.2360 0.0990 0.0526
## sma1 intercept
## -0.7192 0.0007
## s.e. 0.0941 0.0013
##
## sigma^2 estimated as 0.001205: log likelihood = 1557.71, aic = -3093.42
tsdiag(m5,gof.lag=12)
The result of the fifth model is slightly better than the second model. Trying to estimate the last model
m6 <- arima(lunrate,order=c(5,0,6),seasonal=list(order=c(2,0,1),period=12))
m6
##
## Call:
## arima(x = lunrate, order = c(5, 0, 6), seasonal = list(order = c(2, 0, 1), period = 12))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.0566 0.3787 0.5422 0.0991 -0.4894 -0.0339 -0.2120 -0.4183
## s.e. 0.2832 0.2021 0.1309 0.1792 0.1563 0.2868 0.1898 0.1403
## ma4 ma5 ma6 sar1 sar2 sma1 intercept
## -0.0669 0.5505 -0.0486 0.5104 -0.1042 -0.7268 0.0007
## s.e. 0.1521 0.1325 0.0797 0.1056 0.0550 0.0967 0.0011
##
## sigma^2 estimated as 0.001185: log likelihood = 1564.21, aic = -3096.42
tsdiag(m6,gof.lag=12)
The last model have all p-value significant means there is no serial correlation, however, we cannot use this model since the degree of freedom of Ljung Box will be negative, thus we will use the fifth model. T is equal to 803 thus m will be around 7
m5.LB.lag7 <- Box.test(m5$residuals, lag=7, type="Ljung")
m5.LB.lag7
##
## Box-Ljung test
##
## data: m5$residuals
## X-squared = 0.95102, df = 7, p-value = 0.9956
1-pchisq(m5.LB.lag7$statistic,1)
## X-squared
## 0.32946
Model ARIMA(5,0,1)(1,0,1)[12] is adequate for our forecasting since the t-statistics showing that hypothesis null cannot be rejected, thus we have no serial correlation.
Forecasting with ARIMA(5,0,1)(1,0,1)[12] give us the following result:
m5.fcast <- forecast(m5, h=15)
plot(m5.fcast, xlim=c(2012,2016))
lines(diff(log(unrateall), differences=1))
Although close to the actual data, however, the forecasted data doesn’t show similar movement, we have to try ARIMA(2,0,1) with seasonal component of (1,0,1)
m7 <- arima(lunrate,order=c(2,0,1),seasonal=list(order=c(1,0,1),period=12))
m7
##
## Call:
## arima(x = lunrate, order = c(2, 0, 1), seasonal = list(order = c(1, 0, 1), period = 12))
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1 intercept
## 0.6690 0.1663 -0.6325 0.5351 -0.8013 0.0007
## s.e. 0.0657 0.0398 0.0595 0.0720 0.0528 0.0012
##
## sigma^2 estimated as 0.001219: log likelihood = 1552.9, aic = -3091.8
tsdiag(m7,gof.lag=12)
P values of Ljung-Box statistics is better, and the AIC is quite small. Test the Ljung box statistics
m7.LB.lag7 <- Box.test(m7$residuals, lag=7, type="Ljung")
m7.LB.lag7
##
## Box-Ljung test
##
## data: m7$residuals
## X-squared = 6.6813, df = 7, p-value = 0.4628
1-pchisq(m7.LB.lag7$statistic,4)
## X-squared
## 0.1537187
Ljung-Box statistic showing an adequacy to the model, try to forecast the model will show us
m7.fcast <- forecast(m7, h=15)
plot(m7.fcast, xlim=c(2012,2016))
lines(diff(log(unrateall), differences=1))
The data forecasted is similar with previous forecast however, this time it shows more similar movement with the actual data. I will use ARIMA(2,0,1)(1,0,1)[12] for forecasting.