Direction

Transform the data and test using ADF and KPSS tests. Follow the Box-Jenkins methodology to build a time series model based on the data until the end of 2014. Check the model for adequacy and plot a forecast until the end of 2016

Stationary Test

Plotting the data

unrate <- Quandl("FRED/UNRATE", type="ts")
plot(unrate, ylab="Unemployment Rate")

To be able to see the accuracy of forecast result later after we build the model, we have to split the data into two: Data up to 2014 and Data After 2014. For the first part, we are going to use Data up to 2013

unrateall <- unrate
unrate1 <- window(unrateall, end=c(2014,12))
unrate2 <- window(unrateall, start=c(2015,1))

unrate <- unrate1

Based on the initial plot of unemployment rate before, we will doing log transformation of the data using the following code

lunrate <- diff(log(unrate), differences=1)

The next step is to test for the stationary. First we can use ADF test

lunrate.urdf <- ur.df(lunrate, type = "trend", selectlags = "BIC")
summary(lunrate.urdf)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression trend 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222527 -0.019520 -0.000486  0.020714  0.222800 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.384e-04  2.627e-03   0.357    0.721    
## z.lag.1     -6.668e-01  4.514e-02 -14.770  < 2e-16 ***
## tt          -1.795e-06  5.663e-06  -0.317    0.751    
## z.diff.lag  -2.330e-01  3.427e-02  -6.799 2.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03704 on 797 degrees of freedom
## Multiple R-squared:  0.4652, Adjusted R-squared:  0.4632 
## F-statistic: 231.1 on 3 and 797 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -14.77 72.7304 109.0867 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau3 -3.96 -3.41 -3.12
## phi2  6.09  4.68  4.03
## phi3  8.27  6.25  5.34

From the ADF test we can see that t-statistics are is significant which means that we reject null hypothesis, log of unemployment rate is trend stationary. Furthermore, we also interested to proceed to estimate model B.

lunrate.modelb <- ur.df(lunrate, type ="drift", selectlags = "BIC")
summary(lunrate.modelb)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.221867 -0.019703 -0.000217  0.020788  0.223357 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0002166  0.0013083   0.166    0.869    
## z.lag.1     -0.6663353  0.0450969 -14.776  < 2e-16 ***
## z.diff.lag  -0.2332737  0.0342419  -6.813 1.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03702 on 798 degrees of freedom
## Multiple R-squared:  0.4651, Adjusted R-squared:  0.4638 
## F-statistic:   347 on 2 and 798 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -14.7756 109.1684 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1  6.43  4.59  3.78

From model B, we still get the same conclusion. Testing for model A, we’ll get the following:

lunrate.modela <- ur.df(lunrate, selectlags = "BIC")
summary(lunrate.modela)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22166 -0.01949  0.00000  0.02100  0.22356 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## z.lag.1    -0.66618    0.04506  -14.78  < 2e-16 ***
## z.diff.lag -0.23336    0.03422   -6.82  1.8e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.037 on 799 degrees of freedom
## Multiple R-squared:  0.4651, Adjusted R-squared:  0.4638 
## F-statistic: 347.4 on 2 and 799 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -14.7843 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

Model A also shows the same conclusion. Testing for the unit root using ADF test showing that null hypothesis rejected at any level, the value of test statistics is smaller than the critical value. Concluding that the data is trend stationary. Confirming with KPSS test we can get the same result

test1 <- kpss.test(lunrate, null = "Trend")
test1

    KPSS Test for Trend Stationarity

data:  lunrate
KPSS Trend = 0.027704, Truncation lag parameter = 6, p-value = 0.1

Box-Jenkins Method

The next step to Box-Jenkins Method is to plot the ACF and PACF to decide the order of the univariate time series model

acf(as.data.frame(lunrate), lag=24, main=expression(Log(Unemp_Rate)))

pacf(as.data.frame(lunrate), lag=24, main=expression(Log(Unemp_Rate)))

Looking at the PACF, the most adequate model could possibly ARIMA(1,0,1)(1,0,1)[12] , or ARIMA(2,0,6)(1,0,1)[12] or ARIMA(1,0,6)(1,0,1)[12] or ARIMA(5,0,1)(1,0,1)[12] or ARIMA(5,0,6)(1,0,1)[12] Estimating the model give us the following:

m1 <- arima(lunrate,order=c(1,0,1),seasonal=list(order=c(2,0,1),period=12))
m1
## 
## Call:
## arima(x = lunrate, order = c(1, 0, 1), seasonal = list(order = c(2, 0, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1  intercept
##       0.8885  -0.7506  0.4519  -0.0603  -0.7115     0.0007
## s.e.  0.0310   0.0412  0.1043   0.0545   0.0998     0.0014
## 
## sigma^2 estimated as 0.001242:  log likelihood = 1545.53,  aic = -3077.06
tsdiag(m1,gof.lag=12)

The p-values of Ljung Box statistics is bad for this model, signal to check for another model

m2 <- arima(lunrate,order=c(2,0,6),seasonal=list(order=c(1,0,1),period=12))
m2
## 
## Call:
## arima(x = lunrate, order = c(2, 0, 6), seasonal = list(order = c(1, 0, 1), period = 12))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ma1      ma2     ma3     ma4     ma5     ma6
##       0.1772  0.4243  -0.1505  -0.2705  0.1063  0.0227  0.1146  0.0101
## s.e.     NaN     NaN      NaN      NaN     NaN  0.0265  0.0437     NaN
##         sar1     sma1  intercept
##       0.5668  -0.8219     0.0007
## s.e.  0.0682   0.0497     0.0011
## 
## sigma^2 estimated as 0.001204:  log likelihood = 1557.95,  aic = -3091.89
tsdiag(m2,gof.lag=12)

This model produced far better p-values and smaller AIC, trying third model

m3 <- arima(lunrate,order=c(1,0,1),seasonal=list(order=c(1,0,1),period=12))
m3
## 
## Call:
## arima(x = lunrate, order = c(1, 0, 1), seasonal = list(order = c(1, 0, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1    sar1     sma1  intercept
##       0.8905  -0.7533  0.5116  -0.7852     0.0007
## s.e.  0.0305   0.0406  0.0762   0.0573     0.0013
## 
## sigma^2 estimated as 0.001244:  log likelihood = 1544.93,  aic = -3077.86
tsdiag(m3,gof.lag=12)

The result gives us similar result with the first model, trying for the fourth model

m4 <- arima(lunrate,order=c(1,0,1),seasonal=list(order=c(2,0,1),period=12))
m4
## 
## Call:
## arima(x = lunrate, order = c(1, 0, 1), seasonal = list(order = c(2, 0, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1  intercept
##       0.8885  -0.7506  0.4519  -0.0603  -0.7115     0.0007
## s.e.  0.0310   0.0412  0.1043   0.0545   0.0998     0.0014
## 
## sigma^2 estimated as 0.001242:  log likelihood = 1545.53,  aic = -3077.06
tsdiag(m4,gof.lag=12)

P-values still have problems, trying fifth model

m5 <- arima(lunrate,order=c(5,0,1),seasonal=list(order=c(2,0,1),period=12))
m5
## 
## Call:
## arima(x = lunrate, order = c(5, 0, 1), seasonal = list(order = c(2, 0, 1), period = 12))
## 
## Coefficients:
##          ar1     ar2     ar3     ar4     ar5      ma1    sar1     sar2
##       0.0708  0.1568  0.1381  0.0850  0.1172  -0.0439  0.4721  -0.0775
## s.e.  0.2354  0.0367  0.0548  0.0541  0.0427   0.2360  0.0990   0.0526
##          sma1  intercept
##       -0.7192     0.0007
## s.e.   0.0941     0.0013
## 
## sigma^2 estimated as 0.001205:  log likelihood = 1557.71,  aic = -3093.42
tsdiag(m5,gof.lag=12)

The result of the fifth model is slightly better than the second model. Trying to estimate the last model

m6 <- arima(lunrate,order=c(5,0,6),seasonal=list(order=c(2,0,1),period=12))
m6
## 
## Call:
## arima(x = lunrate, order = c(5, 0, 6), seasonal = list(order = c(2, 0, 1), period = 12))
## 
## Coefficients:
##          ar1     ar2     ar3     ar4      ar5      ma1      ma2      ma3
##       0.0566  0.3787  0.5422  0.0991  -0.4894  -0.0339  -0.2120  -0.4183
## s.e.  0.2832  0.2021  0.1309  0.1792   0.1563   0.2868   0.1898   0.1403
##           ma4     ma5      ma6    sar1     sar2     sma1  intercept
##       -0.0669  0.5505  -0.0486  0.5104  -0.1042  -0.7268     0.0007
## s.e.   0.1521  0.1325   0.0797  0.1056   0.0550   0.0967     0.0011
## 
## sigma^2 estimated as 0.001185:  log likelihood = 1564.21,  aic = -3096.42
tsdiag(m6,gof.lag=12)

The last model have all p-value significant means there is no serial correlation, however, we cannot use this model since the degree of freedom of Ljung Box will be negative, thus we will use the fifth model. T is equal to 803 thus m will be around 7

m5.LB.lag7 <- Box.test(m5$residuals, lag=7, type="Ljung")
m5.LB.lag7
## 
##  Box-Ljung test
## 
## data:  m5$residuals
## X-squared = 0.95102, df = 7, p-value = 0.9956
1-pchisq(m5.LB.lag7$statistic,1)
## X-squared 
##   0.32946

Model ARIMA(5,0,1)(1,0,1)[12] is adequate for our forecasting since the t-statistics showing that hypothesis null cannot be rejected, thus we have no serial correlation.

Forecasting with ARIMA(5,0,1)(1,0,1)[12] give us the following result:

m5.fcast <- forecast(m5, h=15)
plot(m5.fcast, xlim=c(2012,2016))
lines(diff(log(unrateall), differences=1))

Although close to the actual data, however, the forecasted data doesn’t show similar movement, we have to try ARIMA(2,0,1) with seasonal component of (1,0,1)

m7 <- arima(lunrate,order=c(2,0,1),seasonal=list(order=c(1,0,1),period=12))
m7
## 
## Call:
## arima(x = lunrate, order = c(2, 0, 1), seasonal = list(order = c(1, 0, 1), period = 12))
## 
## Coefficients:
##          ar1     ar2      ma1    sar1     sma1  intercept
##       0.6690  0.1663  -0.6325  0.5351  -0.8013     0.0007
## s.e.  0.0657  0.0398   0.0595  0.0720   0.0528     0.0012
## 
## sigma^2 estimated as 0.001219:  log likelihood = 1552.9,  aic = -3091.8
tsdiag(m7,gof.lag=12)

P values of Ljung-Box statistics is better, and the AIC is quite small. Test the Ljung box statistics

m7.LB.lag7 <- Box.test(m7$residuals, lag=7, type="Ljung")
m7.LB.lag7
## 
##  Box-Ljung test
## 
## data:  m7$residuals
## X-squared = 6.6813, df = 7, p-value = 0.4628
1-pchisq(m7.LB.lag7$statistic,4)
## X-squared 
## 0.1537187

Ljung-Box statistic showing an adequacy to the model, try to forecast the model will show us

m7.fcast <- forecast(m7, h=15)
plot(m7.fcast, xlim=c(2012,2016))
lines(diff(log(unrateall), differences=1))

The data forecasted is similar with previous forecast however, this time it shows more similar movement with the actual data. I will use ARIMA(2,0,1)(1,0,1)[12] for forecasting.