Consider the monthly time series for Housing Starts (New Privately Owned Housing Units), FRED/HOUSTNSA. Note that the data is not seasonally adjusted. Follow the Box-Jenkins methodology to build a time series model based on the data until the end of 2013. After you check the model for adequacy use it to produce and plot a forecast until the end of 2016. If there are several competing speci???cations for the model compare their forecasting accuracy
The first thing to do is to identify the variable that we are going to use in the model, FRED/HOUSTNSA is monthly time series data for Housing Starts, to make it simpler, we will define the variable as ht, the data is seasonal, since it is monthly data, thus, seasonal lag will be equal to 12
ht <- Quandl("FRED/HOUSTNSA", type="ts")
To be able to see the accuracy of forecast result later after we build the model, we have to split the data into two: Data up to 2013 and Data After 2013.For the first part, we are going to use Data up to 2013
htall <- ht
ht_1 <- window(htall, end=c(2013,12))
ht_2 <- window(htall, start=c(2014,1))
ht<-ht_1
The next step is to show some series of log transformation of the data and decide which transformation most suitable for the forecasting. We have to produce log of housing starts, first different of log housing starts and second different of log housing starts
lht <- log(ht)
dlht1 <- diff(lht)
dlht1_2 <- diff(lht,4)
ddlht1_2 <- diff(diff(lht,12))
It will be easier if we plot those log transformation data
par(mfrow=c(2,3))
plot(ht, main=expression())
plot(lht, main=expression(log(ht)))
plot.new()
plot(dlht1, main=expression(paste(Delta, "log(ht)")))
plot(dlht1_2, main=expression(paste(Delta[12], "log(ht)")))
plot(ddlht1_2, main=expression(paste(Delta, Delta[12], "log(ht)")))
Based on the plot we will see that except log(ht) which proven nonstationary, we can use the data for forecasting purpose. However, we still have to check each log transformation data one by one, to decide which one that is more adequate than the other by ploting ACF and their PACF
The first model that we are going to analyse is the first different of log transformation ht
Looking at the PACF, the most adequate model could possibly be an AR(1), AR(3) or an AR(4)
m1 <- arima(dlht1,order=c(1,0,0),seasonal=list(order=c(1,0,0),period=12))
m1
##
## Call:
## arima(x = dlht1, order = c(1, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
##
## Coefficients:
## ar1 sar1 intercept
## -0.3009 0.7550 -0.0002
## s.e. 0.0398 0.0264 0.0127
##
## sigma^2 estimated as 0.01195: log likelihood = 518.62, aic = -1029.23
tsdiag(m1,gof.lag=12)
We can compare this model to the AR(3) specification
m2 <- arima(dlht1,order=c(3,0,0),seasonal=list(order=c(1,0,0),period=12))
m2
##
## Call:
## arima(x = dlht1, order = c(3, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
##
## Coefficients:
## ar1 ar2 ar3 sar1 intercept
## -0.3208 -0.0508 0.0053 0.7633 -0.0004
## s.e. 0.0425 0.0419 0.0394 0.0267 0.0124
##
## sigma^2 estimated as 0.01191: log likelihood = 519.47, aic = -1026.94
tsdiag(m2,gof.lag=12)
We can see that P-Value for the Ljung Box Statistics showing better plot, however AR(1) AIC is still smaller, let’s compare to AR(4) to convince us further
m3 <- arima(dlht1,order=c(4,0,0),seasonal=list(order=c(1,0,0),period=12))
m3
##
## Call:
## arima(x = dlht1, order = c(4, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
##
## Coefficients:
## ar1 ar2 ar3 ar4 sar1 intercept
## -0.3277 -0.0516 0.022 0.0478 0.7740 -0.0004
## s.e. 0.0425 0.0418 0.042 0.0414 0.0274 0.0135
##
## sigma^2 estimated as 0.01188: log likelihood = 520.14, aic = -1026.27
tsdiag(m3,gof.lag=12)
From the result, AR(4) is not different from AR(3) thus, we will choose model specification for ARIMA(1,0,0)(1,0,0)[12], this result can be confirmed by the Ljung Box test with T = 659, thus logT=6.497 make m = 6
## List of 5
## $ statistic: Named num 0.0731
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named num 6
## ..- attr(*, "names")= chr "df"
## $ p.value : num 1
## $ method : chr "Box-Ljung test"
## $ data.name: chr "m3$residuals"
## - attr(*, "class")= chr "htest"
##
## Box-Ljung test
##
## data: m3$residuals
## X-squared = 0.073138, df = 6, p-value = 1
## X-squared
## 0.999925
We can see that Model AR(1) adequate since the hypothesis non cannot be rejected, there is no serial correlation in the residual.
Now, lets check for the first difference and seasonal of Housing Starts. Again, we will start with ACF and PACF
Again, it seems that we are dealing with AR model, either AR(1), AR(2) or AR(3)
m4 <- arima(dlht1_2,order=c(1,0,0),seasonal=list(order=c(1,0,0),period=12))
m4
##
## Call:
## arima(x = dlht1_2, order = c(1, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
##
## Coefficients:
## ar1 sar1 intercept
## 0.5837 0.7910 -0.0020
## s.e. 0.0323 0.0235 0.0595
##
## sigma^2 estimated as 0.02008: log likelihood = 344.94, aic = -681.89
tsdiag(m4,gof.lag=12)
We can see that plot of P-Value of Ljung Box is not too good, signaling to add more order in the AR model, try AR(2)
m5 <- arima(dlht1_2,order=c(2,0,0),seasonal=list(order=c(1,0,0),period=12))
m4
##
## Call:
## arima(x = dlht1_2, order = c(1, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
##
## Coefficients:
## ar1 sar1 intercept
## 0.5837 0.7910 -0.0020
## s.e. 0.0323 0.0235 0.0595
##
## sigma^2 estimated as 0.02008: log likelihood = 344.94, aic = -681.89
tsdiag(m4,gof.lag=12)
The p-value of the Ljung-Box Statistics showing even worse problem, trying AR(3) showing us
m6 <- arima(dlht1_2,order=c(3,0,0),seasonal=list(order=c(1,0,0),period=12))
m6
##
## Call:
## arima(x = dlht1_2, order = c(3, 0, 0), seasonal = list(order = c(1, 0, 0), period = 12))
##
## Coefficients:
## ar1 ar2 ar3 sar1 intercept
## 0.4823 0.1721 -0.0154 0.8238 -0.0024
## s.e. 0.0425 0.0433 0.0428 0.0247 0.0790
##
## sigma^2 estimated as 0.01953: log likelihood = 353.05, aic = -694.1
tsdiag(m6,gof.lag=12)
Although model estimated in higher lag, AR(3) showing exactly similar result like AR(1), it seems the first difference of seasonal data of the model is not really adequate for the forecast. However, we still have to check the last log transformation data Second Difference of Seasonal Data ht.
Looking at the plot, it seems that we deal with MA model with , however this also could be an AR model with additive seasonal MA component, as there are spike at multiple lag at PACF but it does not seem to died down. We possibly deal with ARIMA (2,0,1)(1,0,0)[12], ARIMA(2,0,1)(1,0,1)[12], ARIMA(2,0,2)(2,0,0)[12],or ARIMA(2,0,2)(2,0,1)[12]
Estimating each model we can get that
m7 <- arima(ddlht1_2,order=c(2,0,1),seasonal=list(order=c(1,0,0),period=12))
m7
##
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 1), seasonal = list(order = c(1, 0, 0),
## period = 12))
##
## Coefficients:
## ar1 ar2 ma1 sar1 intercept
## -0.5653 -0.2009 0.1843 -0.4673 0.0005
## s.e. 0.1820 0.0680 0.1832 0.0356 0.0019
##
## sigma^2 estimated as 0.01062: log likelihood = 550.61, aic = -1089.22
tsdiag(m7,gof.lag=12)
m8 <- arima(ddlht1_2,order=c(2,0,1),seasonal=list(order=c(1,0,1),period=12))
m8
##
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 1), seasonal = list(order = c(1, 0, 1),
## period = 12))
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1 intercept
## -0.5353 -0.1693 0.1859 0.0703 -0.8940 0e+00
## s.e. 0.2232 0.0761 0.2241 0.0450 0.0215 3e-04
##
## sigma^2 estimated as 0.008034: log likelihood = 633.6, aic = -1253.19
tsdiag(m8,gof.lag=12)
m9 <- arima(ddlht1_2,order=c(2,0,2),seasonal=list(order=c(2,0,0),period=12))
m9
##
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 2), seasonal = list(order = c(2, 0, 0),
## period = 12))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sar2 intercept
## 0.624 -0.1186 -1.0195 0.4478 -0.5952 -0.2976 0.0005
## s.e. 0.151 0.0894 0.1430 0.0813 0.0382 0.0385 0.0018
##
## sigma^2 estimated as 0.009561: log likelihood = 583.59, aic = -1151.18
tsdiag(m9,gof.lag=12)
m10 <- arima(ddlht1_2,order=c(2,0,2),seasonal=list(order=c(2,0,2),period=12))
m10
##
## Call:
## arima(x = ddlht1_2, order = c(2, 0, 2), seasonal = list(order = c(2, 0, 2),
## period = 12))
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sar2 sma1 sma2
## 0.6300 -0.1889 -0.9821 0.4427 -0.8190 0.0927 0.0043 -0.7986
## s.e. 0.1839 0.1064 0.1745 0.1111 0.1262 0.0457 0.1221 0.1055
## intercept
## 0e+00
## s.e. 4e-04
##
## sigma^2 estimated as 0.007952: log likelihood = 637.1, aic = -1254.21
tsdiag(m10,gof.lag=12)
We can see that ARIMA(2,0,2)(2,0,1)[12] has smaller AIC, checking Ljung-Box Statistic for this specific model we get that
m10.LB <- Box.test(m10$residuals, lag=6, type="Ljung")
m10.LB
##
## Box-Ljung test
##
## data: m10$residuals
## X-squared = 2.2757, df = 6, p-value = 0.8927
We can see that model ARIMA(2,0,2)(2,0,1)[12] adequate since the hypothesis null cannot be rejected, there is no serial correlation in the residual. We can use the second difference of seasonal data for forecasting data.
Now, the next thing that we need to do is to check which log transformation has the strongest forecasting ability, is it ARIMA(1,0,0)(1,0,0)[12] or ARIMA(2,0,2)(2,0,1)[12]
m1.fcast <- forecast(m1, h=25)
plot(m1.fcast, xlim=c(2012,2016))
lines(diff(log(htall)))
m10.fcast <- forecast(m10, h=25)
plot(m10.fcast, xlim=c(2012,2016))
lines(diff(log(htall)))
From the data forecasting plot we can conclude that our first model, ARIMA(1,0,0)(1,0,0)[12], is better than the second, thus we are going to use the first model to forecast the data of Housing Starts