The data is monthly US housing starts, downloaded from the FRED database. The analysis also uses monthly NASDAQ data from Yahoo Finance and the federal funds rate, also from FRED. The housing starts series is log-transformed throughout, as the logged data provided better forecasts.
Setup
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(psych)
library(tseries)
library(vars)
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
library(knitr)
Data
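The housingstarts, fedfunds, and nasdaq data frames are assumed to have been read in before this point; a minimal sketch of that step, assuming the FRED series were exported as HOUSTNSA.csv and FEDFUNDS.csv and the monthly Yahoo Finance download was saved as nasdaq.csv (these file names are assumptions, not the originals):
#hypothetical loading step; file names are assumptions
housingstarts = read.csv("HOUSTNSA.csv") #columns: DATE, HOUSTNSA
fedfunds = read.csv("FEDFUNDS.csv") #columns: DATE, FEDFUNDS
nasdaq = read.csv("nasdaq.csv") #monthly NASDAQ data with an Adj.Close column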
housingstarts$HOUSTNSA = log(housingstarts$HOUSTNSA) #log-transform housing starts
ts = ts(housingstarts[,2], frequency = 12, start = c(1959,1)) #monthly series beginning January 1959
hist(ts, main = "Histogram of Log Housing Starts", xlab = "log(Housing Starts)")
describe(ts)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 738 4.72 0.36 4.77 4.75 0.33 3.46 5.46 1.99 -0.73 0.52 0.01
autoplot(ts, main = "US Housing Starts", ylab = "log(Thousands of Units)")
The data has a left skew, as seen in both the histogram and the descriptive statistics (skew = -0.73). There do not appear to be many extreme outliers, although the dramatic decrease during the 2008 recession is clearly visible in the plot.
decomp = decompose(ts)
autoplot(decomp)
pacf(ts)
acf(ts)
adf.test(ts)
##
## Augmented Dickey-Fuller Test
##
## data: ts
## Dickey-Fuller = -2.0419, Lag order = 9, p-value = 0.5606
## alternative hypothesis: stationary
We can see from the ACF and PACF plots that there is significant autocorrelation at many lags, and the ADF test fails to reject the unit-root null hypothesis (p = 0.56). The data is non-stationary, and there is a strong seasonal component.
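The amount of differencing the series needs can also be estimated directly with the forecast package's helpers; a quick sketch:
#estimate how much differencing the series needs
ndiffs(ts) #estimated number of first differences needed
nsdiffs(ts) #estimated number of seasonal differences needed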
Split data
trainData = head(ts, n = (length(ts)-24)) #training set: all but the last 24 months
testData = tail(ts, n = 24) #test set: the final 24 months
ETS
etsMod = ets(trainData)
checkresiduals(etsMod)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 62, df = 10, p-value = 1.513e-09
##
## Model df: 14. Total lags used: 24
etsFC = forecast(etsMod,h=24)
autoplot(etsFC)
This ETS model's residuals show significant autocorrelation (the Ljung-Box test strongly rejects white noise), and the forecast appears to have wide prediction intervals.
ARIMA
arimaMod = auto.arima(trainData)
checkresiduals(arimaMod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,0,3)(1,1,2)[12]
## Q* = 24.092, df = 13, p-value = 0.03029
##
## Model df: 11. Total lags used: 24
arimaFC = forecast(arimaMod, h=24)
autoplot(arimaFC)
The ARIMA residuals also show autocorrelation, though not to the same extent as the ETS model's, and this forecast also has wide prediction intervals.
NNETAR
nnMod = nnetar(trainData)
checkresiduals(nnMod)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
nnFC = forecast(nnMod, h=24)
autoplot(nnFC)
The neural net residuals also show autocorrelation, but less than either the ETS or ARIMA model. By default these forecasts do not come with prediction intervals, so those cannot be compared with the ETS and ARIMA intervals. The performance of these models, along with the upcoming VAR forecasts, will be evaluated at the end.
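If prediction intervals were wanted for the neural net as well, they can be simulated for nnetar models at extra computational cost; a minimal sketch (nnFC2 is just a new object name for this sketch):
#simulate prediction intervals for the neural net forecast (slower)
nnFC2 = forecast(nnMod, h = 24, PI = TRUE)
autoplot(nnFC2)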
VAR - Fed Funds Rate
#append fed funds rate to housing starts data
housingstarts2 = housingstarts
housingstarts2$fedfunds = fedfunds$FEDFUNDS
cor(log(housingstarts2$fedfunds), log(housingstarts2$HOUSTNSA))
## [1] 0.5234226
cor(housingstarts2$fedfunds, housingstarts2$HOUSTNSA)
## [1] 0.2535198
housingstarts2$HOUSTNSA = log(housingstarts2$HOUSTNSA)
housingstarts2$fedfunds = log(housingstarts2$fedfunds)
ts2 = ts(housingstarts2[,c(2,3)], frequency = 12, start = c(1959,01))
train2 = head(ts2, n = (nrow(ts2)-24))
test2 = tail(ts2, n = 24)
select = VARselect(train2[,1:2], lag.max=8,
type="both")[["selection"]]
select
## AIC(n) HQ(n) SC(n) FPE(n)
## 8 5 2 8
var1 <- VAR(train2[,1:2], p=2, type="both")
serial.test(var1, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 112.93, df = 32, p-value = 5.858e-11
var2 <- VAR(train2[,1:2], p=8, type="both")
serial.test(var2, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var2
## Chi-squared = 57.231, df = 8, p-value = 1.623e-09
varFC1 = forecast(var1, h=24)
varFC2 = forecast(var2, h=24)
autoplot(varFC1)
autoplot(varFC2)
For this model, I used the federal funds rate as a stand-in for interest rates in order to predict the number of housing starts. Interest rates and housing starts could plausibly move inversely: lower rates make borrowing cheaper and could incentivize home building, and as more builders demand credit, the cost of borrowing could rise again. Because the two variables could affect each other in both directions, this is the kind of relationship a VAR can model.
In this model, I logged the variables so that the relationship can be read in terms of percentage changes in one variable affecting the other. However, as the correlation figures show, the two variables actually have a positive relationship in both the raw data and the logged data. While completing this analysis I experimented with both the raw data and the logged data; the logged data provided better forecasts, so that is what I used for this model.
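As a quick illustration of why logs give a percentage-change reading, a log difference approximates the percent change for small moves:
#a 3% increase corresponds to a log difference of roughly 0.03
log(103) - log(100) #about 0.0296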
From VARselect, lags of 2 (chosen by SC) and 8 (chosen by AIC and FPE) seemed best, so forecasts were created and evaluated for both.
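Whether each series actually helps predict the other could also be tested formally with the vars package's Granger causality test; a sketch on the lag-8 model (variable names follow the columns of train2):
#does the fed funds rate Granger-cause housing starts, and vice versa?
causality(var2, cause = "fedfunds")$Granger
causality(var2, cause = "HOUSTNSA")$Granger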
VAR - NASDAQ
housingstarts3 = tail(housingstarts, n = nrow(housingstarts)-145) #keep February 1971 onward to match the NASDAQ series
housingstarts3$nasdaq = nasdaq$Adj.Close
cor(housingstarts3$nasdaq, housingstarts3$HOUSTNSA)
## [1] -0.2367606
cor(log(housingstarts3$nasdaq), log(housingstarts3$HOUSTNSA))
## [1] -0.29444
housingstarts3$nasdaq = log(housingstarts3$nasdaq)
housingstarts3$HOUSTNSA = log(housingstarts3$HOUSTNSA)
ts3 = ts(housingstarts3[,c(2,3)], frequency = 12, start = c(1971,02))
train3 = head(ts3, n = (nrow(ts3)-24))
test3 = tail(ts3, n = 24)
select2 = VARselect(train3[,1:2], lag.max=8,
type="both")[["selection"]]
select2
## AIC(n) HQ(n) SC(n) FPE(n)
## 8 3 2 8
var3 <- VAR(train3[,1:2], p=2, type="both")
serial.test(var3, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var3
## Chi-squared = 63.768, df = 32, p-value = 0.0007036
var4 <- VAR(train3[,1:2], p=8, type="both")
serial.test(var4, lags.pt=10, type="PT.asymptotic")
varFC3 = forecast(var3, h=24)
varFC4 = forecast(var4, h=24)
autoplot(varFC3)
autoplot(varFC4)
This model also uses a VAR, this time pairing housing starts with the stock market (NASDAQ). The data start in 1971 rather than 1959 because the NASDAQ data only goes back to 1971. This model also uses the logged data. The correlation between the NASDAQ and the number of housing starts is negative, which is unexpected: conventional wisdom suggests a stronger stock market means a stronger economy and more home building. That does not appear to be the case here.
Per VARselect, I again used lags of 2 and 8. The NASDAQ forecasts are very similar for the two, but the housing starts forecasts are notably different.
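One way to visualize how housing starts respond to a stock market shock under the lag-8 model is an impulse response function from the vars package; a sketch (variable names follow the columns of train3):
#response of housing starts to an orthogonalized NASDAQ shock over 24 months
irfNasdaq = irf(var4, impulse = "nasdaq", response = "HOUSTNSA", n.ahead = 24)
plot(irfNasdaq)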
Check accuracy
testAccuracy = data.frame(matrix(ncol = 8, nrow = 0))
colnames(testAccuracy) = c("ME", "RMSE", "MAE", "MPE", "MAPE", "MASE", "ACF1", "Theil.s.U") #match the column names produced by data.frame(accuracy(...))
#ETS
acc1 = data.frame(accuracy(etsFC, testData))
testAccuracy = rbind(testAccuracy, acc1)
#ARIMA
acc2 = data.frame(accuracy(arimaFC, testData))
testAccuracy = rbind(testAccuracy, acc2)
#Neural Net
testAccuracy =rbind(testAccuracy, data.frame(accuracy(nnFC, testData)))
#VAR - Fed Funds, lag = 2
testAccuracy =rbind(testAccuracy, data.frame(accuracy(varFC1$forecast$HOUSTNSA, test2[,1])))
#VAR - Fed Funds, lag = 8
testAccuracy =rbind(testAccuracy, data.frame(accuracy(varFC2$forecast$HOUSTNSA, test2[,1])))
#VAR - NASDAQ, lag = 2
testAccuracy =rbind(testAccuracy, data.frame(accuracy(varFC3$forecast$HOUSTNSA, test3[,1])))
#VAR - NASDAQ, lag = 8
testAccuracy = rbind(testAccuracy, data.frame(accuracy(varFC4$forecast$HOUSTNSA, test3[,1])))
rownames(testAccuracy) = c('ETS Train', 'ETS Test', 'ARIMA Train', 'ARIMA Test', 'Neural Net Train', 'Neural Net Test', 'VAR Fed Funds Lag 2 Train', 'VAR Fed Funds Lag 2 Test', 'VAR Fed Funds Lag 8 Train', 'VAR Fed Funds Lag 8 Test', 'VAR NASDAQ Lag 2 Train', 'VAR NASDAQ Lag 2 Test', 'VAR NASDAQ Lag 8 Train', 'VAR NASDAQ Lag 8 Test')
kable(testAccuracy, 'simple')
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
|---|---|---|---|---|---|---|---|---|
| ETS Train | -0.0007770 | 0.0904240 | 0.0711499 | -0.0303621 | 1.535611 | 0.4253035 | 0.0409925 | NA |
| ETS Test | 0.0341969 | 0.1385459 | 0.0952886 | 0.7024408 | 2.054694 | 0.5695946 | 0.6325814 | 1.272126 |
| ARIMA Train | -0.0026898 | 0.0871041 | 0.0682849 | -0.0877056 | 1.474930 | 0.4081778 | -0.0083701 | NA |
| ARIMA Test | 0.0592050 | 0.1479810 | 0.1028573 | 1.2344956 | 2.209838 | 0.6148373 | 0.6518485 | 1.354208 |
| Neural Net Train | 0.0001121 | 0.0317371 | 0.0248406 | -0.0086894 | 0.528540 | 0.1484866 | 0.0181731 | NA |
| Neural Net Test | -0.1410085 | 0.2135661 | 0.1533369 | -3.0818767 | 3.343013 | 0.9165827 | 0.6715439 | 1.993381 |
| VAR Fed Funds Lag 2 Train | 0.0000000 | 0.0305446 | 0.0232272 | -0.0407335 | 1.519218 | 0.6412314 | -0.0135200 | NA |
| VAR Fed Funds Lag 2 Test | 0.0083842 | 0.0305003 | 0.0266225 | 0.5114335 | 1.740169 | 0.7349659 | 0.6723893 | 1.292138 |
| VAR Fed Funds Lag 8 Train | 0.0000000 | 0.0293324 | 0.0224154 | -0.0375847 | 1.464148 | 0.6188208 | -0.0567530 | NA |
| VAR Fed Funds Lag 8 Test | 0.0098796 | 0.0296977 | 0.0269270 | 0.6101179 | 1.758312 | 0.7433710 | 0.6428933 | 1.256455 |
| VAR NASDAQ Lag 2 Train | 0.0000000 | 0.0297544 | 0.0227154 | -0.0391049 | 1.489789 | 0.6282778 | -0.0251741 | NA |
| VAR NASDAQ Lag 2 Test | 0.0331089 | 0.0455303 | 0.0401118 | 2.1222425 | 2.596859 | 1.1094376 | 0.6995077 | 1.907584 |
| VAR NASDAQ Lag 8 Train | 0.0000000 | 0.0283749 | 0.0217339 | -0.0354845 | 1.425487 | 0.6011303 | -0.0529143 | NA |
| VAR NASDAQ Lag 8 Test | 0.0203725 | 0.0352108 | 0.0317669 | 1.2936049 | 2.063874 | 0.8786301 | 0.6495320 | 1.479754 |
Most of the models perform similarly, and none shows troubling bias. The two best-performing models were the VAR Fed Funds models, with the lag-8 model slightly outperforming the lag-2 model. The next best were the NASDAQ VAR models, followed by the ETS, ARIMA, and neural net models, in that order (according to test RMSE). Interestingly, the NASDAQ lag-2 model is the only one whose test-set MASE exceeds 1, making it the only model to perform worse than a seasonal naive benchmark on either the training or test data.
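The test-set ranking can be pulled straight from this table; a quick convenience sketch:
#rank the models by out-of-sample RMSE
testRows = testAccuracy[grep("Test", rownames(testAccuracy)), ]
testRows[order(testRows$RMSE), c("RMSE", "MAE", "MASE")]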
These are intriguing forecasts given their low RMSE and ME values. It is also interesting how well the fed funds rate works as a predictor despite the correlation not behaving the way we expected (a positive rather than negative relationship). Even so, the best forecasts come from the VAR Fed Funds model with a lag of 8, and that is the model I would use to forecast this data.