The data is monthly US housing starts, downloaded from the FRED database. The analysis also uses monthly NASDAQ data from Yahoo Finance and the federal funds rate, also from FRED. The housing starts series is log-transformed throughout, as the logged data provided better forecasts.
Setup
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(psych)
library(tseries)
library(vars)
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
library(knitr)
Data
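The housingstarts, fedfunds, and nasdaq data frames are assumed to have been read in before this point; a minimal sketch of that step, assuming the FRED series were exported as HOUSTNSA.csv and FEDFUNDS.csv and the monthly Yahoo Finance download was saved as nasdaq.csv (these file names are assumptions, not the originals):
#hypothetical loading step; file names are assumptions
housingstarts = read.csv("HOUSTNSA.csv") #columns: DATE, HOUSTNSA
fedfunds = read.csv("FEDFUNDS.csv") #columns: DATE, FEDFUNDS
nasdaq = read.csv("nasdaq.csv") #monthly NASDAQ data with an Adj.Close column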
housingstarts$HOUSTNSA = log(housingstarts$HOUSTNSA) #log-transform housing starts
ts = ts(housingstarts[,2], frequency = 12, start = c(1959,1)) #monthly series beginning January 1959
hist(ts, main = "Histogram of Log Housing Starts", xlab = "log(Housing Starts)")
describe(ts)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 738 4.72 0.36 4.77 4.75 0.33 3.46 5.46 1.99 -0.73 0.52 0.01
autoplot(ts, main = "US Housing Starts", ylab = "log(Thousands of Units)")
The data has a left skew, as seen in both the histogram and the descriptive statistics (skew = -0.73). There do not appear to be many extreme outliers, although the dramatic decrease during the 2008 recession is clearly visible in the plot.
decomp = decompose(ts)
autoplot(decomp)
pacf(ts)
acf(ts)
adf.test(ts)
##
## Augmented Dickey-Fuller Test
##
## data: ts
## Dickey-Fuller = -2.0419, Lag order = 9, p-value = 0.5606
## alternative hypothesis: stationary
We can see from the ACF and PACF plots that there is significant autocorrelation at many lags, and the ADF test fails to reject the unit-root null hypothesis (p = 0.56). The data is non-stationary, and there is a strong seasonal component.
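The amount of differencing the series needs can also be estimated directly with the forecast package's helpers; a quick sketch:
#estimate how much differencing the series needs
ndiffs(ts) #estimated number of first differences needed
nsdiffs(ts) #estimated number of seasonal differences needed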
Split data
trainData = head(ts, n = (length(ts)-24)) #training set: all but the last 24 months
testData = tail(ts, n = 24) #test set: the final 24 months
ETS
etsMod = ets(trainData)
checkresiduals(etsMod)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 62, df = 10, p-value = 1.513e-09
##
## Model df: 14. Total lags used: 24
etsFC = forecast(etsMod,h=24)
autoplot(etsFC)
This ETS model's residuals show significant autocorrelation (the Ljung-Box test strongly rejects white noise), and the forecast appears to have wide prediction intervals.
ARIMA
arimaMod = auto.arima(trainData)
checkresiduals(arimaMod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,0,3)(1,1,2)[12]
## Q* = 24.092, df = 13, p-value = 0.03029
##
## Model df: 11. Total lags used: 24
arimaFC = forecast(arimaMod, h=24)
autoplot(arimaFC)
The ARIMA residuals also show autocorrelation, though not to the same extent as the ETS model's, and this forecast also has wide prediction intervals.
NNETAR
nnMod = nnetar(trainData)
checkresiduals(nnMod)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
nnFC = forecast(nnMod, h=24)
autoplot(nnFC)
The neural net residuals also show autocorrelation, but less than either the ETS or ARIMA model. By default these forecasts do not come with prediction intervals, so those cannot be compared with the ETS and ARIMA intervals. The performance of these models, along with the upcoming VAR forecasts, will be evaluated at the end.
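If prediction intervals were wanted for the neural net as well, they can be simulated for nnetar models at extra computational cost; a minimal sketch (nnFC2 is just a new object name for this sketch):
#simulate prediction intervals for the neural net forecast (slower)
nnFC2 = forecast(nnMod, h = 24, PI = TRUE)
autoplot(nnFC2)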
VAR - Fed Funds Rate
#append fed funds rate to housing starts data
housingstarts2 = housingstarts
housingstarts2$fedfunds = fedfunds$FEDFUNDS
cor(log(housingstarts2$fedfunds), log(housingstarts2$HOUSTNSA))
## [1] 0.5234226
cor(housingstarts2$fedfunds, housingstarts2$HOUSTNSA)
## [1] 0.2535198
housingstarts2$HOUSTNSA = log(housingstarts2$HOUSTNSA)
housingstarts2$fedfunds = log(housingstarts2$fedfunds)
ts2 = ts(housingstarts2[,c(2,3)], frequency = 12, start = c(1959,01))
train2 = head(ts2, n = (nrow(ts2)-24))
test2 = tail(ts2, n = 24)
select = VARselect(train2[,1:2], lag.max=8,
type="both")[["selection"]]
select
## AIC(n) HQ(n) SC(n) FPE(n)
## 8 5 2 8
var1 <- VAR(train2[,1:2], p=2, type="both")
serial.test(var1, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 112.93, df = 32, p-value = 5.858e-11
var2 <- VAR(train2[,1:2], p=8, type="both")
serial.test(var2, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var2
## Chi-squared = 57.231, df = 8, p-value = 1.623e-09
varFC1 = forecast(var1, h=24)
varFC2 = forecast(var2, h=24)
autoplot(varFC1)
autoplot(varFC2)
For this model, I used the federal funds rate as a stand-in for interest rates in order to predict the number of housing starts. Interest rates and housing starts could plausibly move inversely: lower rates make borrowing cheaper and could incentivize home building, and as more builders demand credit, the cost of borrowing could rise again. Because the two variables could affect each other in both directions, this is the kind of relationship a VAR can model.
In this model, I logged the variables so that the relationship can be read in terms of percentage changes in one variable affecting the other. However, as the correlation figures show, the two variables actually have a positive relationship in both the raw data and the logged data. While completing this analysis I experimented with both the raw data and the logged data; the logged data provided better forecasts, so that is what I used for this model.
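As a quick illustration of why logs give a percentage-change reading, a log difference approximates the percent change for small moves:
#a 3% increase corresponds to a log difference of roughly 0.03
log(103) - log(100) #about 0.0296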
From VARselect, lags of 2 (chosen by SC) and 8 (chosen by AIC and FPE) seemed best, so forecasts were created and evaluated for both.
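Whether each series actually helps predict the other could also be tested formally with the vars package's Granger causality test; a sketch on the lag-8 model (variable names follow the columns of train2):
#does the fed funds rate Granger-cause housing starts, and vice versa?
causality(var2, cause = "fedfunds")$Granger
causality(var2, cause = "HOUSTNSA")$Granger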
VAR - NASDAQ
housingstarts3 = tail(housingstarts, n = nrow(housingstarts)-145) #keep February 1971 onward to match the NASDAQ series
housingstarts3$nasdaq = nasdaq$Adj.Close
cor(housingstarts3$nasdaq, housingstarts3$HOUSTNSA)
## [1] -0.2367606
cor(log(housingstarts3$nasdaq), log(housingstarts3$HOUSTNSA))
## [1] -0.29444
housingstarts3$nasdaq = log(housingstarts3$nasdaq)
housingstarts3$HOUSTNSA = log(housingstarts3$HOUSTNSA)
ts3 = ts(housingstarts3[,c(2,3)], frequency = 12, start = c(1971,02))
train3 = head(ts3, n = (nrow(ts3)-24))
test3 = tail(ts3, n = 24)
select2 = VARselect(train3[,1:2], lag.max=8,
type="both")[["selection"]]
select2
## AIC(n) HQ(n) SC(n) FPE(n)
## 8 3 2 8
var3 <- VAR(train3[,1:2], p=2, type="both")
serial.test(var3, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var3
## Chi-squared = 63.768, df = 32, p-value = 0.0007036
var4 <- VAR(train3[,1:2], p=8, type="both")
serial.test(var4, lags.pt=10, type="PT.asymptotic")
varFC3 = forecast(var3, h=24)
varFC4 = forecast(var4, h=24)
autoplot(varFC3)
autoplot(varFC4)
This model also uses a VAR, this time pairing housing starts with the stock market (NASDAQ). The data start in 1971 rather than 1959 because the NASDAQ data only goes back to 1971. This model also uses the logged data. The correlation between the NASDAQ and the number of housing starts is negative, which is unexpected: conventional wisdom suggests a stronger stock market means a stronger economy and more home building. That does not appear to be the case here.
Per VARselect, I again used lags of 2 and 8. The NASDAQ forecasts are very similar for the two, but the housing starts forecasts are notably different.
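One way to visualize how housing starts respond to a stock market shock under the lag-8 model is an impulse response function from the vars package; a sketch (variable names follow the columns of train3):
#response of housing starts to an orthogonalized NASDAQ shock over 24 months
irfNasdaq = irf(var4, impulse = "nasdaq", response = "HOUSTNSA", n.ahead = 24)
plot(irfNasdaq)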
Check accuracy
testAccuracy = data.frame(matrix(ncol = 8, nrow = 0))
colnames(testAccuracy) = c("ME", "RMSE", "MAE", "MPE", "MAPE", "MASE", "ACF1", "Theil.s.U") #match the column names produced by data.frame(accuracy(...))
#ETS
acc1 = data.frame(accuracy(etsFC, testData))
testAccuracy = rbind(testAccuracy, acc1)
#ARIMA
acc2 = data.frame(accuracy(arimaFC, testData))
testAccuracy = rbind(testAccuracy, acc2)
#Neural Net
testAccuracy =rbind(testAccuracy, data.frame(accuracy(nnFC, testData)))
#VAR - Fed Funds, lag = 2
testAccuracy =rbind(testAccuracy, data.frame(accuracy(varFC1$forecast$HOUSTNSA, test2[,1])))
#VAR - Fed Funds, lag = 8
testAccuracy =rbind(testAccuracy, data.frame(accuracy(varFC2$forecast$HOUSTNSA, test2[,1])))
#VAR - NASDAQ, lag = 2
testAccuracy =rbind(testAccuracy, data.frame(accuracy(varFC3$forecast$HOUSTNSA, test3[,1])))
#VAR - NASDAQ, lag = 8
testAccuracy = rbind(testAccuracy, data.frame(accuracy(varFC4$forecast$HOUSTNSA, test3[,1])))
rownames(testAccuracy) = c('ETS Train', 'ETS Test', 'ARIMA Train', 'ARIMA Test', 'Neural Net Train', 'Neural Net Test', 'VAR Fed Funds Lag 2 Train', 'VAR Fed Funds Lag 2 Test', 'VAR Fed Funds Lag 8 Train', 'VAR Fed Funds Lag 8 Test', 'VAR NASDAQ Lag 2 Train', 'VAR NASDAQ Lag 2 Test', 'VAR NASDAQ Lag 8 Train', 'VAR NASDAQ Lag 8 Test')
kable(testAccuracy, 'simple')
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil.s.U |
|---|---|---|---|---|---|---|---|---|
| ETS Train | -0.0007770 | 0.0904240 | 0.0711499 | -0.0303621 | 1.535611 | 0.4253035 | 0.0409925 | NA |
| ETS Test | 0.0341969 | 0.1385459 | 0.0952886 | 0.7024408 | 2.054694 | 0.5695946 | 0.6325814 | 1.272126 |
| ARIMA Train | -0.0026898 | 0.0871041 | 0.0682849 | -0.0877056 | 1.474930 | 0.4081778 | -0.0083701 | NA |
| ARIMA Test | 0.0592050 | 0.1479810 | 0.1028573 | 1.2344956 | 2.209838 | 0.6148373 | 0.6518485 | 1.354208 |
| Neural Net Train | 0.0001121 | 0.0317371 | 0.0248406 | -0.0086894 | 0.528540 | 0.1484866 | 0.0181731 | NA |
| Neural Net Test | -0.1410085 | 0.2135661 | 0.1533369 | -3.0818767 | 3.343013 | 0.9165827 | 0.6715439 | 1.993381 |
| VAR Fed Funds Lag 2 Train | 0.0000000 | 0.0305446 | 0.0232272 | -0.0407335 | 1.519218 | 0.6412314 | -0.0135200 | NA |
| VAR Fed Funds Lag 2 Test | 0.0083842 | 0.0305003 | 0.0266225 | 0.5114335 | 1.740169 | 0.7349659 | 0.6723893 | 1.292138 |
| VAR Fed Funds Lag 8 Train | 0.0000000 | 0.0293324 | 0.0224154 | -0.0375847 | 1.464148 | 0.6188208 | -0.0567530 | NA |
| VAR Fed Funds Lag 8 Test | 0.0098796 | 0.0296977 | 0.0269270 | 0.6101179 | 1.758312 | 0.7433710 | 0.6428933 | 1.256455 |
| VAR NASDAQ Lag 2 Train | 0.0000000 | 0.0297544 | 0.0227154 | -0.0391049 | 1.489789 | 0.6282778 | -0.0251741 | NA |
| VAR NASDAQ Lag 2 Test | 0.0331089 | 0.0455303 | 0.0401118 | 2.1222425 | 2.596859 | 1.1094376 | 0.6995077 | 1.907584 |
| VAR NASDAQ Lag 8 Train | 0.0000000 | 0.0283749 | 0.0217339 | -0.0354845 | 1.425487 | 0.6011303 | -0.0529143 | NA |
| VAR NASDAQ Lag 8 Test | 0.0203725 | 0.0352108 | 0.0317669 | 1.2936049 | 2.063874 | 0.8786301 | 0.6495320 | 1.479754 |
Most of the models perform similarly, and none shows troubling bias. The two best-performing models were the VAR Fed Funds models, with the lag-8 model slightly outperforming the lag-2 model. The next best were the NASDAQ VAR models, followed by the ETS, ARIMA, and neural net models, in that order (according to test RMSE). Interestingly, the NASDAQ lag-2 model is the only one whose test-set MASE exceeds 1, making it the only model to perform worse than a seasonal naive benchmark on either the training or test data.
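The test-set ranking can be pulled straight from this table; a quick convenience sketch:
#rank the models by out-of-sample RMSE
testRows = testAccuracy[grep("Test", rownames(testAccuracy)), ]
testRows[order(testRows$RMSE), c("RMSE", "MAE", "MASE")]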
These are intriguing forecasts given their low RMSE and ME values. It is also interesting how well the fed funds rate works as a predictor despite the correlation not behaving the way we expected (a positive rather than negative relationship). Even so, the best forecasts come from the VAR Fed Funds model with a lag of 8, and that is the model I would use to forecast this data.