IHS Markit Coding/Forecasting Assignment

In this assignment, I have been tasked with forecasting U.S. private housing starts of single-family units (HSS), and the stock of single-family housing units (KHS) through 2025Q4, given two explanatory variables (i.e., drivers): Gross Domestic Product (GDP) and the Federal Funds Rate (FFR). For this project, I will attempt to forecast the HSS and KHS series using a vector autoregressive (VAR) model. One of the assumptions of using a VAR model is that the series cannot be cointegrated. Thus, I will also check for cointegration among the variables. If cointegrating relationships are found, then I will use a Vector Error Correction Model to account for the cointegration. I also included a series of U.S. population (from FRED) to explore using in the model as population is a driver of housing demand. One quick note: the analysis is written so as to demonstrate coding/forecasting procedures, not for client consumption. The analysis is organized as follows.

Loading and Setting up Data
Exploring the Properties of the Series
Unit Root Tests on Level Data
Analyzing the Data in First Differences
Setting up the VAR Model for the HSS Forecast
Setting up the VAR Model for the KHS Forecast
Forecasts
Conclusion

Loading and Setting up Data

#Data Wrangling / Setup

#Load data
ihs <- read_excel("Data_Upload.xlsx")

#Filter out the year-quarter column using "select" function w/ piping
# hss = US private housing starts of single-family units
# khs = Stock of single-family housing units 
# gdp = Gross domestic product
# ffr = Federal funds rate
# pop = U.S. population
ihs_filter <- ihs %>%
  dplyr::select(hss, khs, gdp, ffr, pop)

#Create time series objects for each variable from filtered data
hss_ts <- ts(ihs_filter$hss, start =c(1966,1), end = c(2021,1), frequency = 4)
khs_ts <- ts(ihs_filter$khs, start =c(1966,1), end = c(2021,1), frequency = 4)
gdp_ts <- ts(ihs_filter$gdp, start =c(1966,1), end = c(2021,1), frequency = 4)
ffr_ts <- ts(ihs_filter$ffr, start =c(1966,1), end = c(2021,1), frequency = 4)
pop_ts <- ts(ihs_filter$pop, start =c(1966,1), end = c(2021,1), frequency = 4)

Exploring the Properties of the Series

In this section, I visually inspect the four series, plus the population series I included from FRED. Analysis is provided at the end of the section.

#The ggtsdisplay function provides a line graph, ACF and PACF
ggtsdisplay(hss_ts, main = "U.S. Private Housing Starts of Single-Family Units (HSS)", ylab = "Thousand Units")

#The ggtsdisplay function provides a line graph, ACF and PACF
ggtsdisplay(khs_ts, main = "Stock of Single-Family Housing Units (KHS)", ylab = "Million Units")

#The ggtsdisplay function provides a line graph, ACF and PACF
ggtsdisplay(gdp_ts, main = "U.S. Gross Domestic Product (GDP)", ylab = "Billions 2009 USD")

#The ggtsdisplay function provides a line graph, ACF and PACF
ggtsdisplay(ffr_ts, main = "Federal Funds Rate (FFR)", ylab = "Percent (%)")

#The ggtsdisplay function provides a line graph, ACF and PACF
ggtsdisplay(pop_ts, main = "U.S.Population (POP)", ylab = "Thousands")

Based on the graphs and the plots of the AutoCorrelation Function (ACF), these series do not appear stationary. There is significant serial correlation in all five series. The Stock of Single-Family Housing Units (KHS), Gross Domestic Product (GDP), and U.S. Population all display an upward trend. In contrast, no overall trend is evident in the Single-Family Housing Starts (HSS) and Federal Funds Rate (FFR) series. While there is some volatility, particularly during the Great Recession in the HSS series, the variance appears mostly stable in all five series. Thus, I don’t think logging the data will be necessary. I believe each series should be stationary after taking a first difference.

First, though, I will confirm non-stationarity by conducting unit root (UR) tests in the next section.

Unit Root Tests on Level Data

HSS Series in Levels

In this section I will conduct unit root (UR) tests on all five series in levels using the Phillips-Perron (PP) test. This test is preferable to the Augmented Dickey–Fuller (ADF) test, because it does not require the selection of a lag length, and therefore, the results are not sensitive to the lag length selection. In addition, the PP test is robust to general forms of heteroskedasticity in the error term. The null hypothesis of the PP test is that the series contains a unit root (i.e., is non-stationary). I will use a significance level of 5.0% as my standard, since that is a common standard in economics.

I begin with the HSS series. Looking at its graph above, there is no trend evident, so in the PP test I will choose a constant as my model type. Additionally, I choose lags = “long” – since the data contains many observations, I’m not worried about the degrees of freedom penalization of potentially including too many lags.

#PP test
#Constant chosen since series has a non-zero mean
summary(ur.pp(hss_ts, type = "Z-tau", model = "constant", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -278.852  -50.556    6.005   50.067  260.212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.67942   19.84967   2.049   0.0416 *  
## y.l1         0.96098    0.01878  51.165   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 218 degrees of freedom
## Multiple R-squared:  0.9231, Adjusted R-squared:  0.9228 
## F-statistic:  2618 on 1 and 218 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -2.6563 
## 
##          aux. Z statistics
## Z-tau-mu            2.5914
## 
## Critical values for Z statistics: 
##                      1pct      5pct     10pct
## critical values -3.461373 -2.874718 -2.573729

The results of the PP test give a test statistic of -2.6563. This result is significant at the 10.0% level; however, a 5.0% level of significance is more widely accepted, and is also the standard I have chosen. Therefore, using the standard of a 5.0% significance level, I fail to reject the null of non-stationarity. The results of the PP test indicate that the HSS data is not stationary in levels.

KHS Series in Levels

I now move on to the KHS series. Looking at its graph above, there is a trend evident, so I will choose a model with a trend in the code syntax for the PP test.

#PP test
#Trend chosen since series displays a trend
summary(ur.pp(khs_ts, type = "Z-tau", model = "trend", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept and trend 
## 
## 
## Call:
## lm(formula = y ~ y.l1 + trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17360 -0.05731  0.03077  0.05388  0.24772 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.200043   0.386968  -0.517    0.606    
## y.l1         1.006180   0.005317 189.225   <2e-16 ***
## trend       -0.001816   0.001400  -1.297    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08281 on 217 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.481e+06 on 2 and 217 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -1.0506 
## 
##            aux. Z statistics
## Z-tau-mu             -1.6344
## Z-tau-beta            1.0510
## 
## Critical values for Z statistics: 
##                      1pct      5pct     10pct
## critical values -4.002748 -3.431327 -3.139048

The test statistic is -1.0506, which is not statistically significant at any level; therefore, I fail to reject the null of non-stationarity. The results of the PP test indicate that the KHS series is not stationary in levels.

GDP Series in Levels

Next, I move on to the GDP series. Looking at the graph of GDP above, there is a clear upward trend in the data, so I will specify a trend in the PP test syntax.

#PP test
#Trend chosen since series displays a trend
summary(ur.pp(gdp_ts, type = "Z-tau", model = "trend", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept and trend 
## 
## 
## Call:
## lm(formula = y ~ y.l1 + trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1680.88   -34.61    11.82    47.13  1089.84 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 558.38646  188.41877   2.964  0.00338 ** 
## y.l1          0.95180    0.01831  51.972  < 2e-16 ***
## trend         3.45755    1.24727   2.772  0.00605 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 155 on 217 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9987 
## F-statistic: 8.602e+04 on 2 and 217 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -2.4299 
## 
##            aux. Z statistics
## Z-tau-mu              4.1911
## Z-tau-beta            2.5890
## 
## Critical values for Z statistics: 
##                      1pct      5pct     10pct
## critical values -4.002748 -3.431327 -3.139048

The test statistic is -2.4299 , which is not statistically significant at any level; therefore, I fail to reject the null of non-stationarity. The results of the PP test indicate that the GDP series is not stationary in levels.

FFR Series in Levels

Next, I conduct a PP test on the FFR data. There is an upward trend in this series from 1966 to about 1982, which reflects the Federal Reserve’s efforts to combat high inflation during the 1970s. After peaking in the early 1980s, a downward trend in the FFR is observed following the 1980 recession. In summary, this series has two trends, but taken as a whole, there is no overall trend. Thus, I specify a constant in the PP test syntax.

#PP test
#Constant chosen since there is a non-zero mean
summary(ur.pp(ffr_ts, type = "Z-tau", model = "constant", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6574 -0.2516 -0.0216  0.4006  6.2413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.11163    0.10462   1.067    0.287    
## y.l1         0.97419    0.01642  59.324   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9241 on 218 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9414 
## F-statistic:  3519 on 1 and 218 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -1.741 
## 
##          aux. Z statistics
## Z-tau-mu            1.2178
## 
## Critical values for Z statistics: 
##                      1pct      5pct     10pct
## critical values -3.461373 -2.874718 -2.573729

The test statistic is -1.741, which is not statistically significant at any level; therefore, I fail to reject the null of non-stationarity. The results of the PP test indicate that the FFR series is not stationary in levels.

POP Series in Levels

Lastly, I conduct a PP test on the POP series. There is a clear upward trend in the data so I specify a trend in the PP test.

#PP test
#Trend specified since series displays a trend
summary(ur.pp(pop_ts, type = "Z-tau", model = "trend", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept and trend 
## 
## 
## Call:
## lm(formula = y ~ y.l1 + trend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -443.31  -99.46   -5.59   93.97  318.44 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.692e+03  9.753e+02   1.735   0.0841 .  
## y.l1        9.959e-01  3.717e-03 267.918   <2e-16 ***
## trend       2.780e+00  2.430e+00   1.144   0.2539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.2 on 217 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.07e+07 on 2 and 217 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -1.5087 
## 
##            aux. Z statistics
## Z-tau-mu             -0.6109
## Z-tau-beta            1.5769
## 
## Critical values for Z statistics: 
##                      1pct      5pct     10pct
## critical values -4.002748 -3.431327 -3.139048

The test statistic is -1.5087, which is not statistically significant at any level; therefore, I fail to reject the null of non-stationarity. The results of the PP test indicate that the POP series is not stationary in levels.

Summary of Unit Root Tests in Levels

In summary, all five series are non-stationary, according to the results of the PP tests. In the next section, I will perform a first difference on each series to see if they are all stationary in first differences, or \(I(1)\)

Analyzing the Data in First Differences

HSS Series in First Differences

ggtsdisplay(diff(hss_ts), main = "First Difference - U.S. Private Housing Starts of Single-Family Units (HSS)", ylab = "Thousand Units")

The HSS series appears to be mean-reverting in first differences, and thus looks stationary. Looking at the ACF, there is no more serial correlation as the lags are not significant. Again, I do not think this data needs to be logged since the variance is mostly stable (with the exception of the high inflation period of the late 1970s and early 1980s).

I will now perform two unit root tests on the series in first differences to confirm that it is stationary. The first test is the PP test conducted earlier on the series in levels. As mentioned earlier, the null hypothesis of the PP test is that the series contains a unit root (i.e., is non-stationary). I will also conduct a second unit root test, which is the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test. The null hypothesis of the KPSS test is the inverse of the PP test. That is, the null hypothesis of the KPSS test is that the series does not contain a unit root (i.e., is stationary). By using two unit root tests with inverse null hypotheses, I can be more confident in my results.

PP Test on First Differenced HSS Series

#PP test
#No constant chosen as mean is reverting around zero
summary(ur.pp(diff(hss_ts), type = "Z-tau", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -300.536  -49.860    2.812   46.923  296.436 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.29535    5.67282   0.228   0.8196  
## y.l1         0.13253    0.06728   1.970   0.0501 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.94 on 217 degrees of freedom
## Multiple R-squared:  0.01757,    Adjusted R-squared:  0.01304 
## F-statistic:  3.88 on 1 and 217 DF,  p-value: 0.05013
## 
## 
## Value of test-statistic, type: Z-tau  is: -13.0494 
## 
##          aux. Z statistics
## Z-tau-mu            0.2311
## 
## Critical values for Z statistics: 
##                      1pct      5pct    10pct
## critical values -3.461503 -2.874777 -2.57376

The test statistic is -13.0494, which is significant at the 1.0% level. Thus, I can reject the null hypothesis of non-stationarity. The results of the PP test indicate that the HSS series is stationary in first differences, or \(I(1)\).

KPSS Test on First Differenced HSS Series

#KPSS test
#No constant or linear trend present in graph of differenced data so "type" argument left blank
summary(ur.kpss(diff(hss_ts), lags = "long"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 14 lags. 
## 
## Value of test-statistic is: 0.0445 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic from the KPSS test is 0.0445, which is not significant at any level. Thus, I fail to reject the null hypothesis of stationarity. The results of the KPSS test indicate that the HSS series is stationary in first differences, of \(I(1)\).

KHS Series in First Differences

ggtsdisplay(diff(khs_ts), main = "First Difference - Stock of Single-Family Housing Units  (KHS)", ylab = "Million Units")

The graph of the first differenced KHS data looks a bit odd, but seems to be due to the very close proximity of observations over time. This makes sense, because this series is the stock of single-family housing units. This is also apparent when looking at the scale of the y axis, which shows very minuscule changes from one quarter to the next. There is still a fair degree of serial correlation in the ACF, but it is not as much as the level series. The trend has been removed through the first difference, and the variance appears constant throughout.

I will now proceed with unit root tests on the series in first differences to assess whether it is stationary. Since taking the first difference removed the trend, I will specify a constant in the model type for the PP test.

PP Test on First Differenced KHS Series

#KPSS test 
#Model type is constant
summary(ur.pp(diff(khs_ts), type = "Z-tau", model = "constant", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14966 -0.04966  0.01663  0.04920  0.15034 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08451    0.01336   6.323 1.44e-09 ***
## y.l1         0.66287    0.05083  13.041  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06282 on 217 degrees of freedom
## Multiple R-squared:  0.4394, Adjusted R-squared:  0.4368 
## F-statistic: 170.1 on 1 and 217 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -8.3719 
## 
##          aux. Z statistics
## Z-tau-mu             7.964
## 
## Critical values for Z statistics: 
##                      1pct      5pct    10pct
## critical values -3.461503 -2.874777 -2.57376

The test statistic is -8.3719, which is significant at the 1.0% level. Thus, I can reject the null hypothesis of non-stationarity. The results of the PP test indicate that the KHS series is stationary in first differences, or \(I(1)\).

KPSS Test on First Differenced KHS Series

#KPSS test
#Constant chosen (i.e. type = mu) since data had trend prior to differencing
summary(ur.kpss(diff(khs_ts), type = "mu", lags = "long"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 14 lags. 
## 
## Value of test-statistic is: 0.1836 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic is 0.1836, which is not significant at any level. Thus, I cannot reject the null hypothesis of stationarity. The results of the KPSS test indicate that the KHS series is stationary in first differences, or \(I(1)\).

GDP Series in First Differences

ggtsdisplay(diff(gdp_ts), main = "First Difference - U.S. Gross Domestic Product (GDP)", ylab = "Billions 2009 USD")

Taking the first difference has removed the trend, and the series now appears to be mean-reverting. The serial correlation is gone. Again, the variance appears stable, with the exception of 2020. Thus, I do not think this data needs be log-transformed. Since taking the first difference removed the trend, I will specify a constant in the model type for the PP test.

PP Test on First Differenced GDP Series

#PP test 
#Model type is constant
summary(ur.pp(diff(gdp_ts), type = "Z-tau", model = "constant", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1736.27   -36.30     9.09    51.09   880.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.18202   11.32619   6.550 4.12e-10 ***
## y.l1        -0.16921    0.06719  -2.519   0.0125 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 155.7 on 217 degrees of freedom
## Multiple R-squared:  0.0284, Adjusted R-squared:  0.02392 
## F-statistic: 6.343 on 1 and 217 DF,  p-value: 0.01251
## 
## 
## Value of test-statistic, type: Z-tau  is: -17.6888 
## 
##          aux. Z statistics
## Z-tau-mu             6.659
## 
## Critical values for Z statistics: 
##                      1pct      5pct    10pct
## critical values -3.461503 -2.874777 -2.57376

The test statistic is -17.6888, which is significant at the 1.0% level. Thus, I can reject the null hypothesis of non-stationarity. The results of the PP test indicate that the GDP series is stationary in first differences, or \(I(1)\).

KPSS Test on First Differenced GDP Series

#KPSS test
#Constant chosen (i.e. type = mu) since data had trend prior to differencing
summary(ur.kpss(diff(gdp_ts), type = "mu", lags = "long"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 14 lags. 
## 
## Value of test-statistic is: 0.3173 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic is 0.3173 , which is not significant at any level. Thus, I cannot reject the null hypothesis of stationarity. The results of the KPSS test indicate that the GDP series is stationary in first differences, or \(I(1)\).

FFR Series in First Differences

ggtsdisplay(diff(ffr_ts), main = "Federal Funds Rate (FFR)", ylab = "Percent (%)")

After taking the first difference, the series now appears to be mean-reverting around zero. There is some discontinuity in the variance, especially leading up to and during the 1970s inflation crisis. There is also still some autocorrelation present in the ACF.

I will proceed with a PP test on the differenced series to ensure that it is stationary.

PP Test on First Differenced FFR Series

#PP test
#No constant chosen since data did not have trend prior to differencing
summary(ur.pp(diff(ffr_ts), type = "Z-tau", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9371 -0.1955  0.0172  0.2972  6.7802 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01722    0.06127  -0.281 0.778940    
## y.l1         0.22863    0.06607   3.461 0.000649 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9065 on 217 degrees of freedom
## Multiple R-squared:  0.0523, Adjusted R-squared:  0.04793 
## F-statistic: 11.98 on 1 and 217 DF,  p-value: 0.000649
## 
## 
## Value of test-statistic, type: Z-tau  is: -11.3594 
## 
##          aux. Z statistics
## Z-tau-mu           -0.2747
## 
## Critical values for Z statistics: 
##                      1pct      5pct    10pct
## critical values -3.461503 -2.874777 -2.57376

The test statistic is -11.3594, which is significant at the 1.0% level. Thus, I can reject the null hypothesis of non-stationarity. The results of the PP test indicate that the FFR series is stationary in first differences, or \(I(1)\).

KPSS Test on First Differenced FFR Series

#KPSS test
#No constant chosen since differenced series is mean-reverting around zero
summary(ur.kpss(diff(ffr_ts), lags = "long"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 14 lags. 
## 
## Value of test-statistic is: 0.0891 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic is 0.0891, which is not significant at any level. Thus, I cannot reject the null hypothesis of stationarity. The results of the KPSS test indicate that the FFR series is stationary in first differences, or \(I(1)\).

POP Series in First Differences

ggtsdisplay(diff(pop_ts), main = "First Difference - U.S. Population (POP)", ylab = "Thousands")

After taking the first difference, the series appears more mean-reverting, and we have removed the trend. There is still a fair bit of serial correlation, but we have removed a lot of it through differencing.

I will proceed with unit root tests to check if it is stationary. In the PP test below, I specify a constant in the model since the POP series contained a trend prior to differencing.

PP Test on First Differenced POP Series

#PP test
#Constant specified since leveled data contained a trend before differencing 
summary(ur.pp(diff(pop_ts), type = "Z-tau", model = "constant", lags = "long"))

## 
## ################################## 
## # Phillips-Perron Unit Root Test # 
## ################################## 
## 
## Test regression with intercept 
## 
## 
## Call:
## lm(formula = y ~ y.l1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -279.498  -58.546    3.233   60.659  204.644 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.39391   25.71667   4.098 5.88e-05 ***
## y.l1          0.82674    0.04081  20.261  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.51 on 217 degrees of freedom
## Multiple R-squared:  0.6542, Adjusted R-squared:  0.6526 
## F-statistic: 410.5 on 1 and 217 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic, type: Z-tau  is: -4.6824 
## 
##          aux. Z statistics
## Z-tau-mu            4.5295
## 
## Critical values for Z statistics: 
##                      1pct      5pct    10pct
## critical values -3.461503 -2.874777 -2.57376

The test statistic is -4.6824, which is significant at the 1.0% level. Thus, I can reject the null hypothesis of non-stationarity. The POP series is stationary in first differences, or \(I(1)\).

KPSS Test on First Differenced POP Series

#KPSS test
#Constant chosen since series contained a trend prior to differencing
summary(ur.kpss(diff(pop_ts), type = "mu", lags = "long"))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 14 lags. 
## 
## Value of test-statistic is: 0.3006 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic is 0.3006, which is not significant at any level. Thus, I cannot reject the null hypothesis of stationarity. The results of the KPSS test indicate that the POP series is stationary in first differences, or \(I(1)\).

Setting up the VAR Model for the HSS Forecast

I begin by binding the five series into one object, named ihs_ts, using the code below. I will also create several other binded ts objects to perform tests on. Economic theory would suggest that the KHS data (which is single-family housing stock) is largely a function of its own lag, plus lagged single-family housing starts (i.e., the HSS data is an indicator that future stock is on the way). I think the introduction of the GDP, FFR, and POP series into a model forecasting the KHS series will not be well-specified, nor will it be parsimonious. Thus, the two ts objects I will use moving forward are one containing just the HSS and KHS series, and a second containing the HSS, GDP, FFR, and POP data. I will keep the five series ts object in case I want to test a model that includes all five series.

#Creating binded ts objects

#Create ts binded ts object
ihs_ts <- cbind(hss_ts, khs_ts, gdp_ts, ffr_ts, pop_ts)
#Create binded ts object of first differenced variables
ihs_ts_diff <- diff(ihs_ts)

#############################################################################

#Create binded ts object of just hss and khs (2 variables)
ihs_ts_2 <- cbind(khs_ts, hss_ts)
#Create diff series
ihs_ts_2_diff <- diff(ihs_ts_2)

Next, we need to select a lag selection using the VARselect command.

#Create lag_select object
#Lag.max selection is somewhat arbitrary, but we have a large N, so I choose 12
#The typical type to choose is "constant" so I choose that
lag_select <- VARselect(ihs_ts_diff, lag.max = 12, type = "const")
lag_select

## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      6      6      3      6 
## 
## $criteria
##                   1            2            3            4            5
## AIC(n) 2.163602e+01 2.101538e+01 1.965890e+01 1.968216e+01 1.930011e+01
## HQ(n)  2.183066e+01 2.137223e+01 2.017795e+01 2.036341e+01 2.014357e+01
## SC(n)  2.211740e+01 2.189791e+01 2.094257e+01 2.136697e+01 2.138607e+01
## FPE(n) 2.491374e+09 1.339922e+09 3.454732e+08 3.542839e+08 2.425419e+08
##                   6            7            8            9           10
## AIC(n) 1.913772e+01 1.918459e+01 1.926955e+01 1.918434e+01 1.927832e+01
## HQ(n)  2.014338e+01 2.035245e+01 2.059961e+01 2.067661e+01 2.093279e+01
## SC(n)  2.162483e+01 2.207284e+01 2.255895e+01 2.287489e+01 2.337001e+01
## FPE(n) 2.071405e+08 2.184758e+08 2.398899e+08 2.227349e+08 2.480917e+08
##                  11           12
## AIC(n) 1.928535e+01 1.935124e+01
## HQ(n)  2.110202e+01 2.133012e+01
## SC(n)  2.377819e+01 2.424523e+01
## FPE(n) 2.541362e+08 2.771155e+08

The Akaike information criterion (AIC), Hannan–Quinn (HQ) criterion, and Final Prediction Error (FPE) criterion all list six lags as being optimal. Since this is the majority consensus, I will go with six lags.

Next, I estimate a VAR, and explore Granger causality between the five series.

#Estimate VAR
var_1 <- VAR(ihs_ts_diff, p = 6 ,type = "none", season = NULL, exog = NULL)

#Granger Causality 
cause_hss <- causality(var_1, cause = "hss_ts")
cause_khs <- causality(var_1, cause = "khs_ts")
cause_gdp <- causality(var_1, cause = "gdp_ts")
cause_ffr <- causality(var_1, cause = "ffr_ts")
cause_pop <- causality(var_1, cause = "pop_ts")

granger_table <- list(cause_hss, cause_khs, cause_gdp, cause_ffr, cause_pop)
granger_results <- matrix(NA, ncol = 1, nrow = length(granger_table))
for(i in 1:length(granger_table)){
granger_results[i, ] <- granger_table[[i]]$Granger$p.value}
#Rename rows
rownames(granger_results) <- c("Granger-Cause HSS", "Granger-Cause KHS", "Granger-Cause GDP", "Granger-Cause FFR", "Granger-Cause POP")

granger_results

##                           [,1]
## Granger-Cause HSS 0.0000000000
## Granger-Cause KHS 0.2196206074
## Granger-Cause GDP 0.0032002870
## Granger-Cause FFR 0.0003966902
## Granger-Cause POP 0.1982956387

The significant p-values in the table above mean we can reject the null hypothesis of no Granger-causality for those tests. Therefore, the HSS, GDP, and FFR series have Granger causality with respect to one another. Put another way, each of these variables are useful for predicting the values of the other two. In contrast, the KHS and POP data do not indicate Granger-causality, and so I will remove them and re-estimate the VAR model.

#Create binded ts object cotaining HSS, GDP, and FFR)
ihs_ts_3 <- cbind(hss_ts, gdp_ts, ffr_ts)

#Create first differenced series
ihs_ts_3_diff <- diff(ihs_ts_3)

#Restimate VAR
var_2 <- VAR(ihs_ts_3_diff, p = 6 ,type = "none", season = NULL, exog = NULL)

Johansen Test for Cointegration

With a model of HSS, Using six lags, I can now carry out the Johansen Test for cointegration using the Trace and Eigen methods (which should produce the same result). The null hypothesis of this test is that there are no cointegrating relationships, and the alternate hypothesis is that the number of cointegrating relationships is at least one. The alternate hypothesis for the Eigen method is slightly different (that there are \(r+1\) cointegrating relationships). Annotations are given in the code chunk for the arguments of the Johansen test (i.e., “ca.jo”). With respect to the “ecdet” argument, I choose constant, because it is not clear from the graphs or from economic theory that there would be a trend in the linear combination of the three series. Moreover, the series in levels all appear to have a non-zero mean, so a constant is included in the Johansen test.

#Johansen test
#level data is used in this test
#Type = "trace"
#Ecdet = "constant"
#K is the lag order, which was chosen using VARselect as 6, but using VECM we need to subtract 1 lag, so I put 5
jo_test_1 <- ca.jo(ihs_ts_3, type = "trace", ecdet = "const", K = 5, spec = "longrun")
summary(jo_test_1)

## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: trace statistic , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 1.467315e-01 7.809522e-02 3.716556e-02 1.110223e-16
## 
## Values of teststatistic and critical values of test:
## 
##           test 10pct  5pct  1pct
## r <= 2 |  8.18  7.52  9.24 12.97
## r <= 1 | 25.74 17.85 19.96 24.60
## r = 0  | 60.02 32.00 34.91 41.07
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##              hss_ts.l5    gdp_ts.l5     ffr_ts.l5    constant
## hss_ts.l5    1.0000000    1.0000000  1.000000e+00   1.0000000
## gdp_ts.l5    0.1077892   -0.1458167  6.340688e-02  -0.2335988
## ffr_ts.l5   48.2758174 -224.3320035  5.225504e+01 130.6420890
## constant  -962.5027458 1730.7240038 -2.004618e+03 852.1837368
## 
## Weights W:
## (This is the loading matrix)
## 
##              hss_ts.l5    gdp_ts.l5    ffr_ts.l5      constant
## hss_ts.d  2.147814e-03 0.0101086805 -0.041512219 -4.600601e-17
## gdp_ts.d  6.725702e-02 0.0274066416 -0.019924700 -1.396544e-17
## ffr_ts.d -3.432244e-05 0.0004069813  0.000184712  3.300326e-19

In the output of the Johansen test, the column with r refers to the rank of the Johansen test checks. I begin by examining the test statistic for \(r<=0\). Since the test statistic for \(r=0\) is greater than the 5.0% critical value, there is at least one cointegrating relationship present. Going upward, I can see that the test statistic for \(r<=1\) is greater than the respective 5.0% critical values. The first rank that cannot be rejected is typically the number of cointegrating vectors. Thus, based on this result, there appears to be two cointegrating relationships. Moreover, the Eigenvalues are all between 0 and 1, indicating that the system is stable.

#Johansen test
#Type = "eigen"
#Ecdet = "constant"
#K is the lag order, which was chosen using VARselect as 6
jo_test_2 <- ca.jo(ihs_ts_3, type = "eigen", ecdet = "const", K = 5, spec = "longrun")
summary(jo_test_2)

## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 1.467315e-01 7.809522e-02 3.716556e-02 1.110223e-16
## 
## Values of teststatistic and critical values of test:
## 
##           test 10pct  5pct  1pct
## r <= 2 |  8.18  7.52  9.24 12.97
## r <= 1 | 17.56 13.75 15.67 20.20
## r = 0  | 34.28 19.77 22.00 26.81
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##              hss_ts.l5    gdp_ts.l5     ffr_ts.l5    constant
## hss_ts.l5    1.0000000    1.0000000  1.000000e+00   1.0000000
## gdp_ts.l5    0.1077892   -0.1458167  6.340688e-02  -0.2335988
## ffr_ts.l5   48.2758174 -224.3320035  5.225504e+01 130.6420890
## constant  -962.5027458 1730.7240038 -2.004618e+03 852.1837368
## 
## Weights W:
## (This is the loading matrix)
## 
##              hss_ts.l5    gdp_ts.l5    ffr_ts.l5      constant
## hss_ts.d  2.147814e-03 0.0101086805 -0.041512219 -4.600601e-17
## gdp_ts.d  6.725702e-02 0.0274066416 -0.019924700 -1.396544e-17
## ffr_ts.d -3.432244e-05 0.0004069813  0.000184712  3.300326e-19

Running the Johansen test using Eigen statistics gives the same result (i.e., two cointegrating relationships). Since we have cointegrating variables, I cannot use an unrestricted VAR model, and need to proceed with a Vector Error Correction Model (VECM). The VECM model will help link the long-run equilibrium relationship implied by cointegration with the short-run adjustment mechanism that describes how the variables react when they move out of equilibrium.

VECM Model

#Create VECM model
#Estim = ML chosen since there are more than 1 cointegrating relationships (i.e., r doesn't allow 2OLS with more than 1)
#Lag length is reduced by 1 in VECM so lag = 5
vecm_1 <- VECM(ihs_ts_3_diff, lag = 5, r = 2, estim = "ML")
summary(vecm_1)

## Warning in if (class(x) == "numeric") return(noquote(r)): the condition has
## length > 1 and only the first element will be used

## Warning in if (class(x) == "matrix") return(matrix(noquote(r), ncol = ncol(x), :
## the condition has length > 1 and only the first element will be used

## Warning in if (class(x) == "numeric") return(noquote(r)): the condition has
## length > 1 and only the first element will be used

## Warning in if (class(x) == "matrix") return(matrix(noquote(r), ncol = ncol(x), :
## the condition has length > 1 and only the first element will be used

## #############
## ###Model VECM 
## #############
## Full sample size: 220    End sample size: 214
## Number of variables: 3   Number of estimated slope parameters 54
## AIC 3952.408     BIC 4140.903    SSR 6062558
## Cointegrating vector (estimated by ML):
##    hss_ts       gdp_ts   ffr_ts
## r1      1 1.110223e-16 235.3938
## r2      0 1.000000e+00 497.9822
## 
## 
##                 ECT1               ECT2               Intercept          
## Equation hss_ts -0.4404(0.1557)**  0.0827(0.0634)     -6.8842(6.4608)    
## Equation gdp_ts 1.1566(0.3321)***  -0.6009(0.1354)*** 34.9360(13.7867)*  
## Equation ffr_ts 0.0052(0.0017)**   -0.0034(0.0007)*** 0.2116(0.0720)**   
##                 hss_ts -1          gdp_ts -1          ffr_ts -1           
## Equation hss_ts -0.5699(0.1530)*** -0.1243(0.0643).   24.1673(14.0233).   
## Equation gdp_ts -0.9234(0.3265)**  -0.6411(0.1373)*** 29.9248(29.9243)    
## Equation ffr_ts -0.0034(0.0017).   0.0031(0.0007)***  -0.1772(0.1563)     
##                 hss_ts -2           gdp_ts -2          ffr_ts -2           
## Equation hss_ts -0.3689(0.1443)*    -0.2597(0.0681)*** 18.1388(12.7241)    
## Equation gdp_ts -0.7213(0.3079)*    -0.7386(0.1453)*** 27.7629(27.1519)    
## Equation ffr_ts -0.0025(0.0016)     0.0029(0.0008)***  -0.3449(0.1418)*    
##                 hss_ts -3           gdp_ts -3          ffr_ts -3           
## Equation hss_ts -0.2567(0.1273)*    -0.3151(0.0838)*** 6.2902(11.5211)     
## Equation gdp_ts -0.4483(0.2716)     -0.8747(0.1789)*** 25.6239(24.5850)    
## Equation ffr_ts -0.0031(0.0014)*    0.0033(0.0009)***  -0.0560(0.1284)     
##                 hss_ts -4           gdp_ts -4          ffr_ts -4           
## Equation hss_ts -0.1059(0.1073)     -0.1664(0.0912).   -0.2200(8.9127)     
## Equation gdp_ts -0.2795(0.2290)     -0.5696(0.1947)**  19.1973(19.0189)    
## Equation ffr_ts -0.0029(0.0012)*    0.0012(0.0010)     -0.0943(0.0993)     
##                 hss_ts -5           gdp_ts -5          ffr_ts -5          
## Equation hss_ts 0.0312(0.0731)      -0.1685(0.0805)*   -1.6802(6.8426)    
## Equation gdp_ts -0.1345(0.1560)     -0.4611(0.1719)**  5.6432(14.6015)    
## Equation ffr_ts -0.0036(0.0008)***  0.0023(0.0009)*    0.0928(0.0763)

The results of the VECM are mixed. The coefficient on ECT1 for the HSS series is negative and statistically significant, indicating convergence from HSS to the other two variables. Or, put another way, just under half of the disequilibrium is dissipated before the next period. Nonetheless, some of the coefficients are positive, which implies that these processes are not all converging in the long run. That is, the positive coefficients on ECT2 for GDP and FFR indicate that, following a divergence from equilibrium, there will be a tendency for these variables to adjust toward the target value. With more time, I could explore adding/subtracting more variables for this model. But, the negative and statistically significant coefficient for HSS on ECT1 is encouraging, and indicates some convergence of HSS from disequilibrium with the other two variables. These results could also reflect the extreme housing shortage facing the United States, with single-family housing starts still far below pre-Great Recession levels, and failing to keep up with economic growth and demand.

Diagnostic Tests

The commands to run diagnostics are designed for a model specified using a VAR system; therefore, we need to convert the VECM to a VAR.

Create VAR object

#Create VAR object 
var_hss <- vec2var(jo_test_1, r = 2)

Test for Serial Correlation

#Test for serial correlation
#PS.asymptotic selected since sample size is sufficiently large
serial_1 <- serial.test(var_hss, lags.pt = 12, type = "PT.asymptotic")
serial_1

## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var_hss
## Chi-squared = 65.316, df = 66, p-value = 0.5006

The Portmanteau Test checks for autocorrelation in the errors of the VAR model, with the null hypothesis being that there is no autocorrelation. The results of the test show a p-value of 0.5006, which is not statistically significant. Thus, the results indicate no autocorrelation, which is a positive sign for this model.

Test for Heteroskedasticity

#Arch test for conditional heteroskedasticity of errors
arch_1 <- arch.test(var_hss, lags.multi = 8, multivariate.only = TRUE)
arch_1

## 
##  ARCH (multivariate)
## 
## data:  Residuals of VAR object var_hss
## Chi-squared = 519.5, df = 288, p-value = 1.665e-15

The results of the ARCH (multivariate) test give a p-value that is statistically significant at a significance level greater than 1.0%. Thus, these results indicate the presence of heteroskedasticity in the model. With more time, I would explore whether logging the data prior to differencing would help remove this heteroskedasticity.

Normality Test

#Normality tests of residuals
norm_1 <- normality.test(var_hss, multivariate.only = TRUE)
norm_1

## $JB
## 
##  JB-Test (multivariate)
## 
## data:  Residuals of VAR object var_hss
## Chi-squared = 73172, df = 6, p-value < 2.2e-16
## 
## 
## $Skewness
## 
##  Skewness only (multivariate)
## 
## data:  Residuals of VAR object var_hss
## Chi-squared = 2052.2, df = 3, p-value < 2.2e-16
## 
## 
## $Kurtosis
## 
##  Kurtosis only (multivariate)
## 
## data:  Residuals of VAR object var_hss
## Chi-squared = 71120, df = 3, p-value < 2.2e-16

The results of the JB-Test (multivariate) indicate non-normality in the residuals. With more time, I could explore logging the data and then re-running the model in the first difference of the logs. Moreover, although both the PP and KPSS unit root tests indicated that all of the series are \(I(1),\), some of the ACF plots looked like they still had a bit of serial correlation in them after differencing. As such, it is possible that, for example the KHS series, is \(I(2)\). If this is the case, then an ARDL method might be more appropriate. Nonetheless, \(I(2)\) variables are somewhat rare, so I would not think that this would be the case. Overall, the non-normality of residuals indicates that the model specification could be improved.

Setting up the VAR Model for the KHS Forecast

As mentioned earlier, my economic reasoning tells me that the KHS data makes more sense to model with the HSS data. That is, housing starts are a feeder into housing stock, since a housing start becomes housing stock after about eight months.

I start by selecting a lag length for the VAR using the VARselect function.

#Create lag_select object
#Lag.max selection is somewhat arbitrary, but we have a large N, so I choose 12
#The typical type to choose is "constant" so I choose that
lag_select_2 <- VARselect(ihs_ts_2_diff, lag.max = 12, type = "const")
lag_select_2

## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##     12      6      6     12 
## 
## $criteria
##                1         2         3         4         5         6         7
## AIC(n)  3.326351  2.747587  2.604212  2.463893  2.439570  2.334391  2.336044
## HQ(n)   3.365279  2.812468  2.695046  2.580679  2.582309  2.503082  2.530687
## SC(n)   3.422626  2.908045  2.828854  2.752719  2.792579  2.751583  2.817420
## FPE(n) 27.836678 15.605215 13.521255 11.751740 11.470374 10.326537 10.345436
##                8         9        10        11        12
## AIC(n)  2.321596  2.313577  2.314486  2.338235  2.305123
## HQ(n)   2.542192  2.560126  2.586987  2.636689  2.629529
## SC(n)   2.867155  2.923320  2.988412  3.076345  3.107416
## FPE(n) 10.199379 10.120858 10.133701 10.381770 10.048851

There is no majority consensus for the lag length selection, but the Hannan–Quinn (HQ) criterion tends to be more conservative so I will go with six lags. Moreover, 12 lags is somewhat excessive given the relationship between housing starts and housing stock.

Next, I estimate a VAR, and explore Granger causality between the five series.

#Estimate VAR
var_3 <- VAR(ihs_ts_2_diff, p = 6 , type = "const", season = NULL, exog = NULL)

#Granger Causality 
cause_hss_2 <- causality(var_3, cause = "hss_ts")
cause_khs_2 <- causality(var_3, cause = "khs_ts")

#Create Granger table to display results
granger_table_2 <- list(cause_hss_2, cause_khs_2)

granger_results_2 <- matrix(NA, ncol = 1, nrow = length(granger_table_2))
for(i in 1:length(granger_table_2)){
granger_results_2[i, ] <- granger_table_2[[i]]$Granger$p.value}

#Rename rows
rownames(granger_results_2) <- c("Granger-Cause HSS", "Granger-Cause KHS")

granger_results_2

##                         [,1]
## Granger-Cause HSS 0.00000000
## Granger-Cause KHS 0.01325239

The p-values for the Granger-cause tests are both significant, which indicates that these variables have Granger causality with respect to one another. Therefore, these variables are useful for predicting one another.

Johansen Test for Cointegration

#Johansen test
#level data is used in this test
#Type = "trace"
#Ecdet = "none" (no constant or trend) 
#K is the lag order, which was chosen using VARselect as 6, but using VECM we need to subtract 1 lag, so I put 5
jo_test_3 <- ca.jo(ihs_ts_2, type = "trace", ecdet = "const", K = 5, spec = "longrun")
summary(jo_test_3)

## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: trace statistic , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 3.754073e-01 4.567909e-02 5.109799e-18
## 
## Values of teststatistic and critical values of test:
## 
##            test 10pct  5pct  1pct
## r <= 1 |  10.10  7.52  9.24 12.97
## r = 0  | 111.76 17.85 19.96 24.60
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##            khs_ts.l5    hss_ts.l5      constant
## khs_ts.l5   1.000000    1.0000000   1.000000000
## hss_ts.l5  -3.263944    0.3178802   0.004588503
## constant  -78.736507 -327.3274910 -68.042003273
## 
## Weights W:
## (This is the loading matrix)
## 
##              khs_ts.l5     hss_ts.l5      constant
## khs_ts.d -0.0001900433 -1.066115e-05 -5.764606e-17
## hss_ts.d -0.0136055749 -2.542264e-01  1.140659e-13

#Johansen test
#Type = "eigen"
#Ecdet = "constant"
#K is the lag order, which was chosen using VARselect as 6
jo_test_4 <- ca.jo(ihs_ts_2, type = "eigen", ecdet = "const", K = 5, spec = "longrun")
summary(jo_test_4)

## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: maximal eigenvalue statistic (lambda max) , without linear trend and constant in cointegration 
## 
## Eigenvalues (lambda):
## [1] 3.754073e-01 4.567909e-02 5.109799e-18
## 
## Values of teststatistic and critical values of test:
## 
##            test 10pct  5pct  1pct
## r <= 1 |  10.10  7.52  9.24 12.97
## r = 0  | 101.66 13.75 15.67 20.20
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##            khs_ts.l5    hss_ts.l5      constant
## khs_ts.l5   1.000000    1.0000000   1.000000000
## hss_ts.l5  -3.263944    0.3178802   0.004588503
## constant  -78.736507 -327.3274910 -68.042003273
## 
## Weights W:
## (This is the loading matrix)
## 
##              khs_ts.l5     hss_ts.l5      constant
## khs_ts.d -0.0001900433 -1.066115e-05 -5.764606e-17
## hss_ts.d -0.0136055749 -2.542264e-01  1.140659e-13

The Johansen tests with the Trace and Eigen statistics agree with one another, and the results suggest that there is at least one cointegrating relationship. Therefore, I proceed with the VECM.

VECM Model

#Create VECM model
#Estim = ML  
#Lag length is reduced by 1 in VECM so lag = 5
vecm_2 <- VECM(ihs_ts_2_diff, lag = 5, r = 1, estim = "ML")
summary(vecm_2)

## Warning in if (class(x) == "numeric") return(noquote(r)): the condition has
## length > 1 and only the first element will be used

## Warning in if (class(x) == "matrix") return(matrix(noquote(r), ncol = ncol(x), :
## the condition has length > 1 and only the first element will be used

## Warning in if (class(x) == "numeric") return(noquote(r)): the condition has
## length > 1 and only the first element will be used

## Warning in if (class(x) == "matrix") return(matrix(noquote(r), ncol = ncol(x), :
## the condition has length > 1 and only the first element will be used

## #############
## ###Model VECM 
## #############
## Full sample size: 220    End sample size: 214
## Number of variables: 2   Number of estimated slope parameters 24
## AIC 506.6849     BIC 590.8343    SSR 1410454
## Cointegrating vector (estimated by ML):
##    khs_ts   hss_ts
## r1      1 1.877041
## 
## 
##                 ECT                Intercept            khs_ts -1            
## Equation khs_ts 0.0007(6.1e-05)*** -9.3e-05(0.0025)     -1.3104(0.0665)***   
## Equation hss_ts -0.3985(0.1383)**  0.8713(5.7150)       284.2671(149.6290).  
##                 hss_ts -1           khs_ts -2              hss_ts -2          
## Equation khs_ts -0.0011(0.0001)***  -1.2219(0.1055)***     -0.0007(9.8e-05)***
## Equation hss_ts -0.1496(0.2495)     122.2861(237.3364)     -0.1182(0.2207)    
##                 khs_ts -3              hss_ts -3          
## Equation khs_ts -1.0205(0.1163)***     -0.0004(7.6e-05)***
## Equation hss_ts -73.9927(261.5954)     -0.1472(0.1720)    
##                 khs_ts -4              hss_ts -4          
## Equation khs_ts -0.6613(0.1060)***     -0.0002(5.5e-05)** 
## Equation hss_ts -60.0541(238.5457)     -0.0024(0.1244)    
##                 khs_ts -5              hss_ts -5           
## Equation khs_ts -0.3538(0.0673)***     -9.1e-05(3.6e-05)*  
## Equation hss_ts -38.1813(151.4957)     0.0423(0.0813)

The results of the VECM are mixed. The coefficient on ECT1 for the KHS series is positive and significant, and the coefficient on ECT1 for the HSS series is negative and significant. This implies that, although the HSS series converges back toward equilibrium in the long run, the KHS series does not converge. A possible explanation for this could be the economic relationship between housing starts and housing stock, with housing starts feeding into housing stock – the latter of which exhibits somewhat constant growth over time.

Diagnostic Tests

The commands to run diagnostics are designed for a model specified using a VAR system; therefore, we need to convert the VECM to a VAR.

Create VAR object

#create VAR object 
var_khs <- vec2var(jo_test_3, r = 1)

Test for Serial Correlation

#test for serial correlation
#PS.asymptotic selected since sample size is sufficiently large
serial_2 <- serial.test(var_khs, lags.pt = 12, type = "PT.asymptotic")
serial_2

## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var_khs
## Chi-squared = 38.405, df = 30, p-value = 0.1396

Test for Heteroskedasticity

#Arch test for conditional heteroskedasticity of errors
arch_2 <- arch.test(var_khs, lags.multi = 8, multivariate.only = TRUE)
arch_2

## 
##  ARCH (multivariate)
## 
## data:  Residuals of VAR object var_khs
## Chi-squared = 107.55, df = 72, p-value = 0.004223

Normality Test

#Normality tests of residuals
norm_2 <- normality.test(var_khs, multivariate.only = TRUE)
norm_2

## $JB
## 
##  JB-Test (multivariate)
## 
## data:  Residuals of VAR object var_khs
## Chi-squared = 12.141, df = 4, p-value = 0.01634
## 
## 
## $Skewness
## 
##  Skewness only (multivariate)
## 
## data:  Residuals of VAR object var_khs
## Chi-squared = 1.4922, df = 2, p-value = 0.4742
## 
## 
## $Kurtosis
## 
##  Kurtosis only (multivariate)
## 
## data:  Residuals of VAR object var_khs
## Chi-squared = 10.648, df = 2, p-value = 0.004872

The diagnostic results for KHS VAR model show similar results. There is no evidence of autocorrelation in the model, which is good. However, there is heteroskedasticity present, and the residuals appear non-normal (with the exception of skewness). Overall, these results indicate that the model could be improved; however, it does not mean that the model’s results are invalid.

Forecasts

Forecast for HSS Data

##VECM forecast of 19 quarters with 95% confidence
forecast_1 <- predict(var_hss, n.ahead = 19, ci(095))
forecast_1

## $hss_ts
##            fcst    lower     upper       CI
##  [1,]  832.0828 692.7192  971.4463 139.3636
##  [2,]  979.0272 772.5713 1185.4831 206.4559
##  [3,] 1017.5845 746.1580 1289.0109 271.4265
##  [4,] 1052.8033 724.3726 1381.2340 328.4307
##  [5,]  938.6717 552.3840 1324.9593 386.2877
##  [6,]  986.5676 548.5742 1424.5609 437.9933
##  [7,]  986.6183 503.0150 1470.2217 483.6033
##  [8,]  995.2602 471.2818 1519.2386 523.9784
##  [9,]  957.9836 397.2377 1518.7296 560.7460
## [10,]  972.4891 379.5066 1565.4716 592.9825
## [11,]  969.5700 348.2465 1590.8935 621.3235
## [12,]  970.6240 324.1667 1617.0813 646.4573
## [13,]  957.3899 288.0911 1626.6886 669.2988
## [14,]  960.2926 270.3549 1650.2302 689.9377
## [15,]  957.1719 248.3358 1666.0080 708.8361
## [16,]  955.8900 229.5548 1682.2253 726.3352
## [17,]  950.1273 207.3096 1692.9450 742.8177
## [18,]  949.3457 190.9595 1707.7319 758.3862
## [19,]  946.4863 173.2508 1719.7219 773.2355
## 
## $gdp_ts
##           fcst    lower    upper        CI
##  [1,] 17747.09 17456.73 18037.44  290.3556
##  [2,] 18072.97 17710.95 18434.99  362.0157
##  [3,] 18120.07 17706.24 18533.90  413.8335
##  [4,] 18253.42 17802.94 18703.89  450.4787
##  [5,] 18217.64 17705.89 18729.39  511.7513
##  [6,] 18379.59 17817.57 18941.61  562.0222
##  [7,] 18447.71 17835.28 19060.14  612.4313
##  [8,] 18548.54 17888.10 19208.98  660.4379
##  [9,] 18594.42 17880.44 19308.41  713.9819
## [10,] 18701.11 17935.76 19466.45  765.3466
## [11,] 18778.15 17961.22 19595.08  816.9296
## [12,] 18868.28 18000.66 19735.90  867.6219
## [13,] 18940.33 18020.95 19859.72  919.3864
## [14,] 19031.32 18061.01 20001.64  970.3143
## [15,] 19112.72 18091.67 20133.77 1021.0513
## [16,] 19199.01 18127.73 20270.29 1071.2809
## [17,] 19279.51 18157.91 20401.11 1121.5997
## [18,] 19366.08 18194.52 20537.65 1171.5637
## [19,] 19449.61 18228.19 20671.04 1221.4251
## 
## $ffr_ts
##               fcst      lower    upper       CI
##  [1,]  1.221137975 -0.4019827 2.844259 1.623121
##  [2,]  0.142415304 -2.4020380 2.686869 2.544453
##  [3,] -0.377759985 -3.4076447 2.652125 3.029885
##  [4,]  0.053920325 -3.3655934 3.473434 3.419514
##  [5,]  0.525505101 -3.2325599 4.283570 3.758065
##  [6,] -0.008631934 -4.0061298 3.988866 3.997498
##  [7,] -0.227419602 -4.3838997 3.929060 4.156480
##  [8,] -0.084216413 -4.3484321 4.179999 4.264216
##  [9,] -0.003916715 -4.3390911 4.331258 4.335174
## [10,] -0.237947526 -4.6206604 4.144765 4.382713
## [11,] -0.335637441 -4.7523023 4.081027 4.416665
## [12,] -0.334563003 -4.7776613 4.108535 4.443098
## [13,] -0.366700321 -4.8339551 4.100554 4.467255
## [14,] -0.488036075 -4.9803547 4.004283 4.492319
## [15,] -0.561426730 -5.0804095 3.957556 4.518983
## [16,] -0.610545565 -5.1577028 3.936612 4.547157
## [17,] -0.670438006 -5.2470881 3.906212 4.576650
## [18,] -0.754797997 -5.3616757 3.852080 4.606878
## [19,] -0.823865945 -5.4609672 3.813235 4.637101

#VECM forecast chart
fanchart(forecast_1, names = "hss_ts", main = "Forecast - Single-Family Housing Starts (HSS) Through 2025 Q4 (red)", xlab = "Quarters", ylab = "Thousand Units", colors = "red")

Forecast for KHS Data

#VECM forecast of 19 quarters with 95% confidence
forecast_2 <- predict(var_khs, n.ahead = 19, ci(095))
forecast_2

## $khs_ts
##            fcst     lower    upper         CI
##  [1,]  99.97636  99.91484 100.0379 0.06152264
##  [2,] 100.26844 100.19106 100.3458 0.07738205
##  [3,] 100.54867 100.42457 100.6728 0.12409948
##  [4,] 100.83092 100.64363 101.0182 0.18728832
##  [5,] 101.11560 100.84610 101.3851 0.26950636
##  [6,] 101.39995 101.03667 101.7632 0.36328218
##  [7,] 101.68312 101.21246 102.1538 0.47065421
##  [8,] 101.96406 101.37533 102.5528 0.58872735
##  [9,] 102.24437 101.52708 102.9617 0.71729064
## [10,] 102.52360 101.66878 103.3784 0.85482386
## [11,] 102.80163 101.80065 103.8026 1.00097985
## [12,] 103.07815 101.92310 104.2332 1.15504826
## [13,] 103.35334 102.03673 104.6699 1.31661157
## [14,] 103.62724 102.14208 105.1124 1.48515858
## [15,] 103.89984 102.23951 105.5602 1.66032610
## [16,] 104.17108 102.32932 106.0128 1.84176647
## [17,] 104.44098 102.41179 106.4702 2.02918861
## [18,] 104.70956 102.48724 106.9319 2.22231463
## [19,] 104.97682 102.55592 107.3977 2.42090011
## 
## $hss_ts
##           fcst    lower    upper       CI
##  [1,] 1121.985 963.1662 1280.804 158.8190
##  [2,] 1149.061 909.4430 1388.679 239.6179
##  [3,] 1175.634 867.6037 1483.664 308.0301
##  [4,] 1166.578 800.3672 1532.789 366.2110
##  [5,] 1155.109 729.0984 1581.120 426.0108
##  [6,] 1153.152 672.7788 1633.524 480.3728
##  [7,] 1153.255 623.1374 1683.372 530.1172
##  [8,] 1148.520 572.7181 1724.322 575.8018
##  [9,] 1141.960 523.0502 1760.871 618.9102
## [10,] 1136.578 477.3217 1795.834 659.2561
## [11,] 1132.012 434.8934 1829.130 697.1182
## [12,] 1126.937 394.1557 1859.718 732.7810
## [13,] 1121.400 354.7761 1888.024 766.6239
## [14,] 1115.929 317.1224 1914.735 798.8065
## [15,] 1110.660 281.1760 1940.145 829.4843
## [16,] 1105.398 246.5997 1964.197 858.7986
## [17,] 1100.073 213.1775 1986.969 886.8958
## [18,] 1094.749 180.8649 2008.633 913.8840
## [19,] 1089.476 149.6212 2029.331 939.8548

#VECM forecast of 19 quarters with 95% confidence
fanchart(forecast_2, names = "khs_ts", main = "Forecast - Stock of Single-Family Units (KHS) Through 2025 Q4 (red)", xlab = "Quarters", ylab = "Million Units", colors = "red")

Conclusion

With more time, there are several ways this model could potentially be improved. First, the addition of one or more variables might help improve the model. In particular, a variable that could serve as a benchmark for the average 30-year mortgage rate would be worth exploring. I took a look at Freddie Mac’s PMMS for interest rates for 30 year mortgages; however, the data appears to only go back to 1971. To some extent, the FFR captures some of the variation that a 30-year mortgage rate series might have; however, the latter would have a more immediate relation to the HSS series. Other variables, such as the S&P/Case-Shiller U.S. National Home Price Index (available through FRED) would also be good candidates to explore, though this index appears to only go back to 1987. Another way to explore improving the model is to try using logged variables, which might eliminate or reduce some of the heteroskedasticity. From an economics standpoint, the relatively slow/stagnant rate of forecasted single-family housing start growth is plausible. There is a significant construction worker shortage, which is causing delays for many residential development projects. As such, it is realistic to expect that single-family housing starts will continue to be outpaced by demand over the next few years. With respect to the stock of single-family units (KHS) series, it is growing linearly, so the prediction interval is narrower.