Direction

Consider the monthly time series for Housing Starts (New Privately Owned Housing Units), FRED/HOUSTNSA. Note that the data is not seasonally adjusted. Follow the Box-Jenkins methodology to build a time series model based on the data until the end of 2013. After you check the model for adequacy use it to produce and plot a forecast until the end of 2016. If there are several competing speci???cations for the model compare their forecasting accuracy

Building and Checking Adequate Models - Box Jenkins Step

The first thing to do is to identify the variable that we are going to use in the model, FRED/HOUSTNSA is monthly time series data for Housing Starts, to make it simpler, we will define the variable as ht, the data is seasonal, since it is monthly data, thus, seasonal lag will be equal to 12

ht <- Quandl("FRED/HOUSTNSA", type="ts")

To be able to see the accuracy of forecast result later after we build the model, we have to split the data into two: Data up to 2013 and Data After 2013.For the first part, we are going to use Data up to 2013

htall <- ht
ht_1 <- window(htall, end=c(2013,12))
ht_2 <- window(htall, start=c(2014,1))
ht<-ht_1

The next step is to show some series of log transformation of the data and decide which transformation most suitable for the forecasting. We have to produce log of housing starts, first different of log housing starts and second different of log housing starts

lht <- log(ht)
dlht1 <- diff(lht)
dlht1_2 <- diff(lht,4)
ddlht1_2 <- diff(diff(lht,12))

It will be easier if we plot those log transformation data

par(mfrow=c(2,3))
plot(ht, main=expression())
plot(lht, main=expression(log(ht)))
plot.new()
plot(dlht1, main=expression(paste(Delta, "log(ht)")))
plot(dlht1_2, main=expression(paste(Delta[12], "log(ht)")))
plot(ddlht1_2, main=expression(paste(Delta, Delta[12], "log(ht)")))

Based on the plot we will see that except log(ht) which proven nonstationary, we can use the data for forecasting purpose. However, we still have to check each log transformation data one by one, to decide which one that is more adequate than the other by ploting ACF and their PACF

The first model that we are going to analyse is the first different of log transformation ht

Looking at the PACF, the most adequate model could possibly be an AR(1), AR(3) or an AR(4)

m1 <- arima(dlht1,order=c(1,0,0),seasonal=list(order=c(1,0,0),period=12))
m1
## 
## Call:
## arima(x = dlht1, order = c(1, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
## 
## Coefficients:
##           ar1    sar1  intercept
##       -0.3009  0.7550    -0.0002
## s.e.   0.0398  0.0264     0.0127
## 
## sigma^2 estimated as 0.01195:  log likelihood = 518.62,  aic = -1029.23
tsdiag(m1,gof.lag=12)

We can compare this model to the AR(3) specification

m2 <- arima(dlht1,order=c(3,0,0),seasonal=list(order=c(1,0,0),period=12))
m2
## 
## Call:
## arima(x = dlht1, order = c(3, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
## 
## Coefficients:
##           ar1      ar2     ar3    sar1  intercept
##       -0.3208  -0.0508  0.0053  0.7633    -0.0004
## s.e.   0.0425   0.0419  0.0394  0.0267     0.0124
## 
## sigma^2 estimated as 0.01191:  log likelihood = 519.47,  aic = -1026.94
tsdiag(m2,gof.lag=12)

We can see that P-Value for the Ljung Box Statistics showing better plot, however AR(1) AIC is still smaller, let’s compare to AR(4) to convince us further

m3 <- arima(dlht1,order=c(4,0,0),seasonal=list(order=c(1,0,0),period=12))
m3
## 
## Call:
## arima(x = dlht1, order = c(4, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
## 
## Coefficients:
##           ar1      ar2    ar3     ar4    sar1  intercept
##       -0.3277  -0.0516  0.022  0.0478  0.7740    -0.0004
## s.e.   0.0425   0.0418  0.042  0.0414  0.0274     0.0135
## 
## sigma^2 estimated as 0.01188:  log likelihood = 520.14,  aic = -1026.27
tsdiag(m3,gof.lag=12)

From the result, AR(4) is not different from AR(3) thus, we will choose model specification for ARIMA(1,0,0)(1,0,0)[12], this result can be confirmed by the Ljung Box test with T = 659, thus logT=6.497 make m = 6

## List of 5
##  $ statistic: Named num 0.0731
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter: Named num 6
##   ..- attr(*, "names")= chr "df"
##  $ p.value  : num 1
##  $ method   : chr "Box-Ljung test"
##  $ data.name: chr "m3$residuals"
##  - attr(*, "class")= chr "htest"
## 
##  Box-Ljung test
## 
## data:  m3$residuals
## X-squared = 0.073138, df = 6, p-value = 1
## X-squared 
##  0.999925

We can see that Model AR(1) adequate since the hypothesis non cannot be rejected, there is no serial correlation in the residual.

Now, lets check for the first difference and seasonal of Housing Starts. Again, we will start with ACF and PACF

Again, it seems that we are dealing with AR model, either AR(1), AR(2) or AR(3)

m4 <- arima(dlht1_2,order=c(1,0,0),seasonal=list(order=c(1,0,0),period=12))
m4
## 
## Call:
## arima(x = dlht1_2, order = c(1, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
## 
## Coefficients:
##          ar1    sar1  intercept
##       0.5837  0.7910    -0.0020
## s.e.  0.0323  0.0235     0.0595
## 
## sigma^2 estimated as 0.02008:  log likelihood = 344.94,  aic = -681.89
tsdiag(m4,gof.lag=12)

We can see that plot of P-Value of Ljung Box is not too good, signaling to add more order in the AR model, try AR(2)

m5 <- arima(dlht1_2,order=c(2,0,0),seasonal=list(order=c(1,0,0),period=12))
m4
## 
## Call:
## arima(x = dlht1_2, order = c(1, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
## 
## Coefficients:
##          ar1    sar1  intercept
##       0.5837  0.7910    -0.0020
## s.e.  0.0323  0.0235     0.0595
## 
## sigma^2 estimated as 0.02008:  log likelihood = 344.94,  aic = -681.89
tsdiag(m4,gof.lag=12)

The p-value of the Ljung-Box Statistics showing even worse problem, trying AR(3) showing us

m6 <- arima(dlht1_2,order=c(3,0,0),seasonal=list(order=c(1,0,0),period=12))
m6
## 
## Call:
## arima(x = dlht1_2, order = c(3, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
## 
## Coefficients:
##          ar1     ar2      ar3    sar1  intercept
##       0.4823  0.1721  -0.0154  0.8238    -0.0024
## s.e.  0.0425  0.0433   0.0428  0.0247     0.0790
## 
## sigma^2 estimated as 0.01953:  log likelihood = 353.05,  aic = -694.1
tsdiag(m6,gof.lag=12)

Although model estimated in higher lag, AR(3) showing exactly similar result like AR(1), it seems the first difference of seasonal data of the model is not really adequate for the forecast. However, we still have to check the last log transformation data Second Difference of Seasonal Data ht.

Looking at the plot, it seems that we deal with MA model with , however this also could be an AR model with additive seasonal MA component, as there are spike at multiple lag at PACF but it does not seem to died down. We possibly deal with ARIMA (2,0,1)(1,0,0)[12], ARIMA(2,0,1)(1,0,1)[12], ARIMA(2,0,2)(2,0,0)[12],or ARIMA(2,0,2)(2,0,1)[12]

Estimating each model we can get that

m7 <- arima(ddlht1_2,order=c(2,0,1),seasonal=list(order=c(1,0,0),period=12))
m7
## 
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 1), seasonal = list(order = c(1, 0, 0), 
##     period = 12))
## 
## Coefficients:
##           ar1      ar2     ma1     sar1  intercept
##       -0.5653  -0.2009  0.1843  -0.4673     0.0005
## s.e.   0.1820   0.0680  0.1832   0.0356     0.0019
## 
## sigma^2 estimated as 0.01062:  log likelihood = 550.61,  aic = -1089.22
tsdiag(m7,gof.lag=12)

m8 <- arima(ddlht1_2,order=c(2,0,1),seasonal=list(order=c(1,0,1),period=12))
m8
## 
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 1), seasonal = list(order = c(1, 0, 1), 
##     period = 12))
## 
## Coefficients:
##           ar1      ar2     ma1    sar1     sma1  intercept
##       -0.5353  -0.1693  0.1859  0.0703  -0.8940      0e+00
## s.e.   0.2232   0.0761  0.2241  0.0450   0.0215      3e-04
## 
## sigma^2 estimated as 0.008034:  log likelihood = 633.6,  aic = -1253.19
tsdiag(m8,gof.lag=12)

m9 <- arima(ddlht1_2,order=c(2,0,2),seasonal=list(order=c(2,0,0),period=12))
m9
## 
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 2), seasonal = list(order = c(2, 0, 0), 
##     period = 12))
## 
## Coefficients:
##         ar1      ar2      ma1     ma2     sar1     sar2  intercept
##       0.624  -0.1186  -1.0195  0.4478  -0.5952  -0.2976     0.0005
## s.e.  0.151   0.0894   0.1430  0.0813   0.0382   0.0385     0.0018
## 
## sigma^2 estimated as 0.009561:  log likelihood = 583.59,  aic = -1151.18
tsdiag(m9,gof.lag=12)

m10 <- arima(ddlht1_2,order=c(2,0,2),seasonal=list(order=c(2,0,2),period=12))
m10
## 
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 2), seasonal = list(order = c(2, 0, 2), 
##     period = 12))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sar1    sar2    sma1     sma2
##       0.6300  -0.1889  -0.9821  0.4427  -0.8190  0.0927  0.0043  -0.7986
## s.e.  0.1839   0.1064   0.1745  0.1111   0.1262  0.0457  0.1221   0.1055
##       intercept
##           0e+00
## s.e.      4e-04
## 
## sigma^2 estimated as 0.007952:  log likelihood = 637.1,  aic = -1254.21
tsdiag(m10,gof.lag=12)

We can see that ARIMA(2,0,2)(2,0,1)[12] has smaller AIC, checking Ljung-Box Statistic for this specific model we get that

m10.LB <- Box.test(m10$residuals, lag=6, type="Ljung")
m10.LB
## 
##  Box-Ljung test
## 
## data:  m10$residuals
## X-squared = 2.2757, df = 6, p-value = 0.8927

We can see that model ARIMA(2,0,2)(2,0,1)[12] adequate since the hypothesis null cannot be rejected, there is no serial correlation in the residual. We can use the second difference of seasonal data for forecasting data.

Data Forecast

Now, the next thing that we need to do is to check which log transformation has the strongest forecasting ability, is it ARIMA(1,0,0)(1,0,0)[12] or ARIMA(2,0,2)(2,0,1)[12]

m1.fcast <- forecast(m1, h=25)
plot(m1.fcast, xlim=c(2012,2016))
lines(diff(log(htall)))

m10.fcast <- forecast(m10, h=25)
plot(m10.fcast, xlim=c(2012,2016))
lines(diff(log(htall)))

From the data forecasting plot we can conclude that our first model, ARIMA(1,0,0)(1,0,0)[12], is better than the second, thus we are going to use the first model to forecast the data of Housing Starts