Case Study Background

In this case study we are attempting to forecast house price growth.

We will structure these forecasts using the natural temporal persistence in this series, as well as leveraging the information contained in exogenous drivers such as housing starts.

Intended audience

You should have a working knowledge of 1) R and 2) econometric methods.

You do not need prior knowledge of time series methods.

Accompanying materials

Theoretical aspects of the topics discussed are relegated to a series of ``lecture notes’’ available at Prof. Aguilar’s website: aguilar-mike.weebly.com.

Contact information: aguilar-mike@outlook.com | https://www.linkedin.com/in/mike-aguilar-econ/

Acknowledgements

A special thanks to Ziming Huang for wonderful coding and design assistance.

Outline

We will detail the case study in 3 broad steps: 1) Explore, 2) Explain, 3) Forecast.

Package details

Several packages are used throughout this case study. Although the best practice is to invoke the library call for all packages within housekeeping, we invoke near the code implementation so that we can see when and how each package is used.

Explore

Housekeeping

Standard best practice to clear workspace:

rm(list=ls()) # clear work space
cat("\014")  # clear console

Data

Let’s gather data on housing prices, housing starts, and interest rates. There are undoubtedly many other relevant features, but we will focus on this manageable subset so that we can focus on the time series methods. The specific features are chosen to highlight certain time series techniques, not necessarily their explanatory power.

Our data source will be FRED (https://fred.stlouisfed.org/) from the St. Louis Federal Reserve.

We can obtain our data via FRED’s API using the fredr package. To do so, you need to obtain a free API key by visiting (https://fred.stlouisfed.org/docs/api/api_key.html).

After you’ve obtain the key, save it to a sample txt file and store in your working directory.

Connect to FRED API

#install.packages("fredr") 
library(fredr)
fred_api_key = read.delim("API_Key.txt", header = FALSE, sep="")[[1]] 
fredr_set_key(fred_api_key) #Store the API key as an environment variable

Download Housing Prices

In order to download data via FREDR we need the FRED series_id. These are easily obtained by searching directly on FRED’s website.

Price: S&P/Case-Shiller U.S. National Home Price Index (CSUSHPINSA)

CSUSHPINSA<-fredr(
  series_id = "CSUSHPINSA",
  frequency = "m", # monthly
  observation_start = as.Date("1987-01-01"),
  observation_end = as.Date("2023-03-01")
  )

Download Housing Starts

Housing starts: New Privately-Owned Housing Units Started: Total Units (HOUSTNSA)

HOUSTNSA<-fredr(
  series_id = "HOUSTNSA",
  frequency = "m", # monthly
  observation_start = as.Date("1959-01-01"),
  observation_end = as.Date("2023-03-01")
  )
HOUSTNSA$value <- HOUSTNSA$value*1000 # Convert raw data "Thousands of Units" to "Units"

Combine

Let’s combine the data into a single data frame for ease of analysis.

temp_df <- Reduce(function(x,y) merge(x, y, by="date", all = T), lapply(c("HOUSTNSA","CSUSHPINSA"), function(x) rbind(get(x)[,c(1,3)])))
colnames(temp_df) <- c("Date","HousingStart","PriceIndex") # rename dataframe

Check for missing

head(temp_df)

##         Date HousingStart PriceIndex
## 1 1959-01-01        96200         NA
## 2 1959-02-01        99000         NA
## 3 1959-03-01       127700         NA
## 4 1959-04-01       150800         NA
## 5 1959-05-01       152500         NA
## 6 1959-06-01       147800         NA

We can see right away that we don’t have a balanced panel.

summary(is.na(temp_df))

##     Date         HousingStart    PriceIndex     
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:770       FALSE:770       FALSE:432      
##                                  TRUE :338

We can see the issues are most prolific for Prices

When are these missing values?

idxHousingStarts<-is.na(temp_df$HousingStart)
idxPrices<-is.na(temp_df$PriceIndex)
par(mfrow=c(1,2))
plot(idxPrices,main="Prices")
plot(idxHousingStarts,main="HousingStarts")

Importantly, the missing observations only are at the beginning and end of the series.

Clean

Cleaning time series is slightly different than cleaning cross sectional data. If we have a gap in our data we can’t simply skip over it. We need to consider why it is missing and then possibly interpolate.

Thankfully, our data is missing at the edges of the panel so we can simply trim to make it balanced.

df <- na.omit(temp_df) # remove the rows with NA
head(df)

##           Date HousingStart PriceIndex
## 337 1987-01-01       105100     63.734
## 338 1987-02-01       102800     64.134
## 339 1987-03-01       141200     64.469
## 340 1987-04-01       159300     64.973
## 341 1987-05-01       158000     65.547
## 342 1987-06-01       162900     66.218

Basic EDA

Full sample correlation matrix

cor(df[,c(-1)]) # get grid of the date column

##              HousingStart   PriceIndex
## HousingStart  1.000000000 -0.004503851
## PriceIndex   -0.004503851  1.000000000

Descriptive Stats

summary(df)

##       Date             HousingStart      PriceIndex    
##  Min.   :1987-01-01   Min.   : 31900   Min.   : 63.73  
##  1st Qu.:1995-12-24   1st Qu.: 86575   1st Qu.: 81.66  
##  Median :2004-12-16   Median :112350   Median :140.40  
##  Mean   :2004-12-15   Mean   :111177   Mean   :138.56  
##  3rd Qu.:2013-12-08   3rd Qu.:137525   3rd Qu.:176.09  
##  Max.   :2022-12-01   Max.   :197900   Max.   :308.36

Time Series Plots

library(tidyverse)
plot_df <- df %>% 
  `colnames<-`(c("Date","HousingStart","PriceIndex \n(Jan 2000=100)")) %>% 
  pivot_longer(c("HousingStart","PriceIndex \n(Jan 2000=100)")) # Convert wide dataframe to long
ggplot(data = plot_df, aes(x = Date,y = value, group = name, color = name)) + 
  geom_line() + 
  scale_x_date(date_labels = "%Y (%b)") + # format datetime
  facet_grid(name~., scales = "free_y") + 
  labs(x = "", y = '') +
  theme(legend.position = "none")

Several important time series properties are apparent:

The periodic patterns in housing starts is indicative of seasonality.
The trend in the price index is indicative of non stationarity.

We need to address each before we proceed with modeling.

Seasonality

Let’s zoom in on housing starts.

library(ggplot2)
ggplot(data = df, aes(x = Date,y = HousingStart)) + 
  geom_line() + 
  scale_x_date(date_labels = "%Y (%b)") + # format datetime
  labs(x = "", y = "Housing Start Units")

Be sure to stop and think about why there is temporal persistence and why that persistence occurs in a seasonal pattern. If we can’t find an ``economic’’ reason, we may be overlooking an important feature.

There are many ways to adjust for these seasonal patterns.

Prepare the df for analysis

library(xts)
FirstObs=df$Date[1]
FirstObs

## [1] "1987-01-01"

#We could parse the FirstObs chr string, but for our purposes it's easier to input by hand the start date
df_ts <- ts(xts(df[,2:3], order.by = df$Date), frequency = 12, start = c(1987,1))

Moving Avg

A simple moving avg might smooth over the seasonal effects.

library(TTR)
SMA24<-SMA(df_ts[,"HousingStart"],n=24) # Create a simple 24 period moving avg using `SMA()` 
toplot<-cbind(as.xts(df_ts[,"HousingStart"]),as.xts(SMA24))
colnames(toplot)[1]="Level" # Add column name   
colnames(toplot)[2]="24mth MovAvg" # Add column name
plot(toplot,main="Housing Starts",legend.loc="topright", auto.legend=TRUE)

Seasonal Decomposition

We could decompose the housing starts series into 3 pieces: 1) trend, 2) seasonal, 3) noise. The decomposition can be additive or multiplicative. There are numerous techniques for identifying and extracting the 3 components. Let’s demonstrate a few.

Classical Decomposition

Classical decomposition methods compute the seasonal component by essentially including dummy variables for each periodic unit (e.g. month, quarters, etc..)

library(stats)
decomp<-decompose(df_ts[,"HousingStart"], type="additive")
plot(decomp) # plot components

After the decomposition is implemented we can easily construct a seasonally adjusted series

SA = decomp$x-decomp$seasonal # Raw Minus Seasonal

Compare the SA and NSA

Trend = decomp$trend
library(forecast)
forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") + 
  forecast::autolayer(decomp$x-decomp$seasonal, series="Seasonally Adjusted") +
  forecast::autolayer(decomp$trend, series="Trend") +
  xlab("Year") + ylab("") + labs(color = "") + 
  ggtitle("Classical Additive Seasonally Adjusted") +
  scale_colour_manual(values=c("black","blue","red"), breaks=c("Raw Data","Seasonally Adjusted","Trend")) +
  theme(legend.position="bottom")

X11 Decomposition

The X11 approach is a variant of the classical decomposition.

library(seasonal)
x11decomp <- seas(df_ts[,"HousingStart"], x11="")
forecast::autoplot(x11decomp) +
  ggtitle("X11 Decomposition")

Compare SA and NSA

forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
  autolayer(seasadj(x11decomp), series="Seasonally Adjusted") +
  autolayer(trendcycle(x11decomp), series="Trend") +
  xlab("Year") + ylab("") + labs(color = "") + 
  ggtitle("X11 Seasonally Adjusted") +
  scale_colour_manual(values=c("black","blue","red"), breaks=c("Raw Data","Seasonally Adjusted","Trend")) +
  theme(legend.position="bottom")

STL Decomposition

STL is an acronym for “Seasonal and Trend decomposition using Loess”, wherein Loess is a method for estimating nonlinear relationships.

stldecomp <- stl(df_ts[,"HousingStart"], s.window="periodic", robust=TRUE)

forecast::autoplot(stldecomp) +
  ggtitle("STL Decomposition")

Loess is data intensive method and traditionally two-sided, meaning that we need data on either side of each point in order to apply the adjustment.

forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
  autolayer(seasadj(stldecomp), series="Seasonally Adjusted") +
  autolayer(trendcycle(stldecomp), series="Trend") +
  xlab("Year") + ylab("") + labs(color = "") + 
  ggtitle("STL Seasonally Adjusted") +
  scale_colour_manual(values=c("black","blue","red"), breaks=c("Raw Data","Seasonally Adjusted","Trend")) +
  theme(legend.position="bottom")

#### Compare the Decomp Methods

forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
  autolayer(decomp$x-decomp$seasonal, series="Classical Additive Seasonally Adjusted") +
  autolayer(seasadj(x11decomp), series="X11 Seasonally Adjusted") +
  autolayer(seasadj(stldecomp), series="STL Seasonally Adjusted") +
  xlab("Year") + ylab("") + labs(color = "") + 
  ggtitle("Seasonally Adjusted") +
  scale_colour_manual(values=c("black","darkgreen","steelblue","darkred"),
                      breaks=c("Raw Data","Classical Additive Seasonally Adjusted",
                               "X11 Seasonally Adjusted", "STL Seasonally Adjusted")) +
  theme(legend.position="bottom") + guides(colour = guide_legend(ncol = 2))

We’ll accomodate for this seasonally adjusted series within the next section.

Stationarity

Most standard time series techniques require stationarity before implementing. Therefore, it’s essential we are able to detect and remedies any violations to this assumption.

There are many forms of stationarity, but for our purposes we want the first and second moments of the DGP to be constant through time.

Detect Non-Stationarity w/ADF

There are several tests to detect (non) stationarity.

The Augmented Dickey Fuller (ADF) test focuses on a manifestation of non stationarity called a unit root. A case of unit roots is the `Random Walk’ wherein today’s value is equal to yesterday’s + random noise.

Ho: Non Stationary (Unit Root aka Random Walk)

Ha: Stationary

library(tseries)
adf.test(df_ts[,"HousingStart"])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  df_ts[, "HousingStart"]
## Dickey-Fuller = -2.3394, Lag order = 7, p-value = 0.4342
## alternative hypothesis: stationary

Notice the high pvalue implies we faily to reject (i.e. we have a non stationary series).

Detect Non-Stationarity w/ KPSS

The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test can be used in several ways. We’ll focus on its importance for testing trend stationarity (i.e. is the series stationary after adjusting for its trend).

H0: Trend Stationary

Ha: Not Trend Stationary

kpss.test(df_ts[,"HousingStart"], null="Trend")

## 
##  KPSS Test for Trend Stationarity
## 
## data:  df_ts[, "HousingStart"]
## KPSS Trend = 0.49999, Truncation lag parameter = 5, p-value = 0.01

Note the low pvalue implies we reject and those we have a series that is not trend stationary.

Detect Non-Stationarity w/ACF

Autocorrelation Function (ACF) tells us the correlation of a series with its own past. The Partial Autocorrelation Function (PACF) tells us the autocorrelation of a series with its own past after correlation for the impact of other lags of the series. The visualizes are collectively referred to as the (auto)correlogram.

library(astsa)
acf2(df_ts[,"HousingStart"], main = "ACF and PACF of Housing Starts Units")

##      [,1]  [,2]  [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.93  0.84  0.75 0.68 0.64 0.62 0.62 0.64 0.69  0.77  0.83  0.85  0.80
## PACF 0.93 -0.12 -0.10 0.08 0.21 0.06 0.09 0.19 0.37  0.26  0.14  0.01 -0.34
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF   0.72  0.62  0.54  0.50  0.47  0.47  0.48  0.53  0.59  0.64  0.66  0.61
## PACF -0.28 -0.13 -0.09  0.06 -0.05  0.03 -0.11  0.07  0.02  0.00  0.01 -0.08
##      [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## ACF   0.52  0.42  0.34  0.30  0.27  0.26  0.27  0.31  0.37  0.42  0.44  0.38
## PACF -0.07 -0.16 -0.04  0.05  0.02  0.01 -0.06 -0.01  0.02  0.07  0.04 -0.14
##      [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
## ACF   0.30  0.20  0.13  0.08  0.06  0.05  0.06  0.11  0.17  0.23  0.25
## PACF -0.07 -0.01  0.06 -0.01  0.05  0.04 -0.05  0.06  0.10  0.05  0.07

The ACF depicts a seasonal pattern (notice the wave cresting every 12mhts). The PACF strong persistence near the 1st lags, which might be helpful when we model the process with its own past lags (e.g. ARMA).

Remedy w/Differencing

A common approach to overcoming non stationarity is by taking the (percent) difference in the series. Often, if a series is non-stationary, then computing the \(\Delta x_{t}=x_{t}-x_{t-1}\) is stationary. If the first difference is not stationary, we can compute a second difference analogously.

Often in our modeling process these growth rates are the objects of interest, which aids our interpretation of any results.

Let’s first difference the Housing Starts series and recompute the ADF test.

FirstDiff <- diff(df_ts[,"HousingStart"]) # compute the first difference in housing
FirstDiff <- na.omit(FirstDiff) #clean up the data
plot(FirstDiff)

adf.test(FirstDiff)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  FirstDiff
## Dickey-Fuller = -16.219, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Notice that the low pvalue suggests we now have a stationary process. This is confirmed by visual inspection of the time series plot.

We can also accomodate for seasonality within our differencing method by taking year-over-year growth rates.

SeasDiffFirstDiff <- diff(FirstDiff,lag = 12)
plot(SeasDiffFirstDiff)

Pro Tip: Converting your series to logs and then computing the difference approximates a growth RATE.

Converting to growth rates is also helpful to standardize our features. So, let’s do that for housing starts.

library("quantmod")
HousingStartGrowth = na.omit(Delt(df_ts[,"HousingStart"]))*100 # convert to growth rate
adf.test(HousingStartGrowth)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  HousingStartGrowth
## Dickey-Fuller = -15.212, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Notice again that the low pvalue indicates stationarity.

Housing Prices: Stationarity

Let’s test, and adjust, for stationarity in housing prices

adf.test(df_ts[,"PriceIndex"])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  df_ts[, "PriceIndex"]
## Dickey-Fuller = -2.688, Lag order = 7, p-value = 0.2868
## alternative hypothesis: stationary

The high pvalue suggests non-stationarity.

Let’s compute a 1st difference

PriceIndexGrowth = na.omit(Delt(df_ts[,"PriceIndex"]))*100
adf.test(PriceIndexGrowth)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  PriceIndexGrowth
## Dickey-Fuller = -3.109, Lag order = 7, p-value = 0.1089
## alternative hypothesis: stationary

Notice the mildly high pvalue suggesting possible non stationarity. That could be due to the oscillation from seasonal patterns.

plot(PriceIndexGrowth)

So let’s adjust that series for seasonality.

PriceIndexGrowth_decomp<-decompose(df_ts[,"PriceIndex"], type="additive")
PriceIndexGrowthSA = PriceIndexGrowth - PriceIndexGrowth_decomp$seasonal
adf.test(PriceIndexGrowthSA)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  PriceIndexGrowthSA
## Dickey-Fuller = -3.4693, Lag order = 7, p-value = 0.04553
## alternative hypothesis: stationary

The low pvalue now suggests a stationary series.

Prep the df

Let’s group all of these adjusted series together

df_adj <- data.frame(cbind(HousingStartGrowth, PriceIndexGrowthSA)) %>%
  `colnames<-`(c("HousingStartGrowth","PriceIndexGrowth"))
df_adj$Date <- df$Date[c(-1)]
#df_adj<-na.omit(df_adj)

Time Series EDA

Now that we have adjusted our data series, let’s examine their time series properties.

AutoCorrelogram

tsdisplay(df_adj$HousingStartGrowth, main = "Time Series Plot of Housing Start Units Growth")

The spikes in the auto correlogram suggest some short term persistence that we can incorporate into our modeling strategy later.

tsdisplay(df_adj$PriceIndexGrowth, main = "Time Series Plot of Price Index Growth")

Very little temporal persistence remains in our housing price series.

Cross Correlations

Sometimes series lead/lag one another for `economic’ reasons. We can visualize and detect this cross auto correlations with a CCF (Cross Correlation Function).

Here we’ll examine whether growth of housing starts lead or lag price growth.

ccfvalues = ccf(x=df_adj$HousingStartGrowth, y=df_adj$PriceIndexGrowth, 
    main = "CCF of Housing Start Growth and Price Index Growth")

When lag = -1 we are computing the cor(HousingStart_t-1, Price_t); i.e. Starts LEAD Prices.

When lag = +1 we are computing the cor(HousingStart_t+1,Price_t); i.e. Starts LAG Prices.

Positive values indicate that the series move together. Negative values indicate the series move in opposite directions.

The blue lines provide a 95% CI.

There spikes in the negative lag area are suggestive of possible leading effects for housing starts.

Here is an alternative visualization.

library(astsa)
HSGrowth <- df_adj$HousingStartGrowth
PriceGrowth <- df_adj$PriceIndexGrowth
lag2.plot(series1 = HSGrowth, 
          series2 = PriceGrowth,
          max.lag = 11)

# Note: series1 is the one that gets lagged.

Explain

Now that we’ve explore our data’s temporal patterns let’s attempt to explain. We’ll do so with traditional and time series econometric techniques.

Regression

Since we are dealing with time series data we need to be careful about what information we are using and WHEN that occurs. Let’s focus on a CONTEMPORANEOUS regression, wherein we are using data at time t to explain our outcome variable (housing prices) at time t.
\[ Price_{t}=\alpha + \beta Starts_{t} + u_{t}\] This is a traditional regression, but we need to be be careful that the time series data doesn’t contaminate our residual with a sickness called serial correlation, wherein the errors exhibit autocorrelation. If your model is plagued by serial correlation your estimators typically remain unbiased, but are not efficient.

LinearModel = lm(PriceIndexGrowth ~ HousingStartGrowth, data = df_adj)
summary(LinearModel)

## 
## Call:
## lm(formula = PriceIndexGrowth ~ HousingStartGrowth, data = df_adj)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9081 -0.8349 -0.0580  0.8297  3.1626 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.335232   0.052149   6.428 3.44e-10 ***
## HousingStartGrowth 0.024754   0.003955   6.259 9.43e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.081 on 429 degrees of freedom
## Multiple R-squared:  0.08366,    Adjusted R-squared:  0.08153 
## F-statistic: 39.17 on 1 and 429 DF,  p-value: 9.429e-10

Detect Serial Correlation

There are myriad visual and formal tests for serial correlation. Most operate on the residuals of the model.

Visual inspection

Let’s inspect our residuals for troublesome patterns.

checkresiduals(LinearModel)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 387.27, df = 10, p-value < 2.2e-16

Notice the spikes in the residual ACF. These are indicative of some type of omitted temporal pattern, which might suggest the presence of serial correlation.

Durbin Watson Test

The Durbin Watson (DW) test asks whether the model residuals are auto correlated.

Ho: No auto correlation in residuals

Ha: Auto correlation in residuals

library(lmtest)
dwtest(formula = LinearModel,  alternative = "two.sided")

## 
##  Durbin-Watson test
## 
## data:  LinearModel
## DW = 0.36901, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0

The small pvalue suggests there is autocorrelation.

BG Test

Breusch Godfrey (BG) Test

Ho: Model residuals not autocorrelated

Ha: Model residuals are autocorrelated

bgtest(formula = LinearModel)

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  LinearModel
## LM test = 343.45, df = 1, p-value < 2.2e-16

The small pvalue suggests there is autocorrelation.

LB Test

Ljung Box Test

Ho: Model residuals not autocorrelated

Ha: Model residuals are autocorrelated

Box.test(residuals(LinearModel), type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(LinearModel)
## X-squared = 285.85, df = 1, p-value < 2.2e-16

The small pvalue suggests there is autocorrelation.

Remediating Serial Correlation

Eventhough its not needed here, let’s discuss remediation in case you encounter serial correlation in practice.

If we are lucky enough to know the precise form of the serial correlation, then we can include it directly in our modeling via GLS. However, that is seldom the case in practice. So, we rely upon HAC heteroscedasticity and autocorrelation consistent standard errors.

library(sandwich)
NW_VCOV_LinearModel <- NeweyWest(LinearModel, prewhite = F, adjust = T) # The pre-whitening and adjust are set to F, and T respectively to ensure the proper formula and small sample adjustments are made.

# Compute Standard Errors 
HAC_err=sqrt(diag(NW_VCOV_LinearModel))
HAC_err

##        (Intercept) HousingStartGrowth 
##        0.081276725        0.003279163

# Compare with the Original Standard Errors
summary(LinearModel)$coefficient[,2]

##        (Intercept) HousingStartGrowth 
##        0.052149365        0.003955291

Notice the (slight) differences in the standard errors. If we did indeed have strong serial correlation the differences could be much larger.

Rolling Regression

An important feature of time series models is coefficient stability. i.e. is the `economic’ process stable through time? We can investigate by running our model over small windows through time.

Moving Window

Let’s run our regression over a fixed window of 30 periods, grab the results, then move the 30 period window forward by one month, rerun the regression, and repeat until the end of the sample.

library(rollRegres)
library(tidyverse)
RollingReg<-roll_regres(PriceIndexGrowth~HousingStartGrowth, 
                        data = df_adj, width = 30,
            do_compute = c("sigmas", "r.squareds", "1_step_forecasts"),
            do_downdates = TRUE # TRUE if you want a rolling window regressions. Otherwise, an expanding window is used
            )

RollingCoef<-as.data.frame(RollingReg$coefs)%>%na.omit()
RollingCoef<-rownames_to_column(RollingCoef, "Date")
RollingCoef$Date<-as.Date(df_adj$Date[-(1:29)])
#tail(RollingCoef,5)

RollingCoef_plotdf <- RollingCoef %>% pivot_longer(c("(Intercept)","HousingStartGrowth")) 
ggplot(data = RollingCoef_plotdf, aes(x = Date,y = value, group = name, color = name)) + 
  geom_line() + 
  scale_x_date(date_labels = "%Y (%b)") + # format datetime
  facet_grid(name~., scales = "free_y") + 
  labs(x = "", y = '', title = "Rolling Beta (Estimated Window Length = 30, Rolling Window)") +
  theme(legend.position = "none")

Notice for the coefficient on housing starts changes through time, implying that our underlying model may not have been stable during certain episodes. We would have to remedy through subject matter expertise by finding the reason behind this instability.

Expanding Window

Let’s run our regression over 30 periods and grab the coefficient estimates. Then let’s add one month onto our sample, rerun the regression, grab the coefficients, and repeat.

ExpandReg<-roll_regres(PriceIndexGrowth~HousingStartGrowth, data = df_adj, width = 30,
            do_compute = c("sigmas", "r.squareds", "1_step_forecasts"),
            do_downdates = FALSE # TRUE if you want a rolling window regressions. Otherwise, an expanding window is used
            )
ExpandCoef<-as.data.frame(ExpandReg$coefs)%>%na.omit()
ExpandCoef<-rownames_to_column(ExpandCoef, "Date")
ExpandCoef$Date<-as.Date(df_adj$Date[-(1:29)])
tail(ExpandCoef,5)

##           Date (Intercept) HousingStartGrowth
## 398 2022-08-01   0.3438307         0.02446015
## 399 2022-09-01   0.3387951         0.02463122
## 400 2022-10-01   0.3361494         0.02471279
## 401 2022-11-01   0.3350249         0.02477165
## 402 2022-12-01   0.3352316         0.02475433

ExpandCoef_plotdf <- ExpandCoef %>% pivot_longer(c("(Intercept)","HousingStartGrowth")) 
ggplot(data = ExpandCoef_plotdf, aes(x = Date,y = value, group = name, color = name)) + 
  geom_line() + 
  scale_x_date(date_labels = "%Y (%b)") + # format datetime
  facet_grid(name~., scales = "free_y") + 
  labs(x = "", y = '', title = "Rolling Beta (Initial Estimated Window Length = 30, Expanding Window)") +
  theme(legend.position = "none")

Notice how the change in the coefficients around the GFC impact this expanding window model much longer than the moving window counterpart.

Forecast

Now that we’ve explored the data and have explained its movements, we can attempt to forecast. There are myriad models to forecast a time series. Some use exogenous features (e.g. housing starts lags to forecast housing prices), while others use the series’ own temporal persistence (i.e. today is a good predictor for tomorrow). We will explore a few common approaches.

Training and Test Samples

Picking a training and test sample must be taken with extreme care in time series. Unlike cross sectional data, we can’t simply randomly choose some subset for testing since each row of our data frame (month) influences the next (i.e. temporal persistence). As such, we typically order the observations by time (they already are) and chose the most recent “x” observations as testing and all dates prior to train. Note: there are more sophisticated block bootstrapping methods that we won’t cover.

For our example we will use

Train: Jan1987-Dec2013

Test: Jan2014-end of sample

We could create separate dataframes from each subsample, but we will avoid that in order to illustrate some subseting features of the functions we will use.

Let’s adjust the dataframe into a time series object so that we can take advantage of certain time series features.

df_adjts <- ts(df_adj, start = c(1987,02), frequency = 12)

Distributed Lag

A distributed lag model forecasts a series with past values of some exogenous series (depicted here with one lag). \(Price_{t}=\alpha + \beta Starts_{t-1} + u_{t}\).

An autoregressive distributed lag (ADL) also includes lagged values of Price (depicted here with one lag). \(Price_{t}=\alpha + \beta Starts_{t-1} + \gamma Price_{t-1} + u_{t}\).

We can use the CCF to detect leads/lags in our training sample.

ccf(x=window(df_adjts[,"HousingStartGrowth"], end = c(2013,12)),
    y=window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
    main = "CCF of Housing Start Growth and Price Index Growth")

This spikes within the negative lag regions suggests that Starts appear to have some predictive power. Notice also that there appears to be a lagging aspect to the starts series as well. We will discuss this simultaneity later.

Given the temporal persistence of the price series, let’s try an ADL model. We’ll demonstrate with 2 lags on price growth and 12 on starts. We can also use information criterion to guide lag selection.

ADL_alldata <- ts.intersect(df_adjts, 
                      PriceIndexGrowthLag1 = stats::lag(df_adjts[,"PriceIndexGrowth"], -1),
                      PriceIndexGrowthLag2 = stats::lag(df_adjts[,"PriceIndexGrowth"], -2),
                      HousingStartGrowthLag1 = stats::lag(df_adjts[,"HousingStartGrowth"], -1),
                      HousingStartGrowthLag2 = stats::lag(df_adjts[,"HousingStartGrowth"], -2),
                      HousingStartGrowthLag3 = stats::lag(df_adjts[,"HousingStartGrowth"], -3),
                      HousingStartGrowthLag4 = stats::lag(df_adjts[,"HousingStartGrowth"], -4),
                      HousingStartGrowthLag9 = stats::lag(df_adjts[,"HousingStartGrowth"], -9),
                      HousingStartGrowthLag10 = stats::lag(df_adjts[,"HousingStartGrowth"], -10),
                      HousingStartGrowthLag11 = stats::lag(df_adjts[,"HousingStartGrowth"], -11),
                      HousingStartGrowthLag12 = stats::lag(df_adjts[,"HousingStartGrowth"], -12)) %>%
  `colnames<-`(c(colnames(df_adjts), 
                 "PriceIndexGrowthLag1","PriceIndexGrowthLag2","HousingStartGrowthLag1",
                 "HousingStartGrowthLag2","HousingStartGrowthLag3","HousingStartGrowthLag4",
                 "HousingStartGrowthLag9","HousingStartGrowthLag10","HousingStartGrowthLag11","HousingStartGrowthLag12"))

ADL_traindata= window(ADL_alldata, end = c(2013,12))
ADL_testdata= window(ADL_alldata, start = c(2014,1))


### Estimate ADL Model
ADL <- lm(PriceIndexGrowth ~ PriceIndexGrowthLag1+PriceIndexGrowthLag2+
            HousingStartGrowthLag1 + HousingStartGrowthLag2 + HousingStartGrowthLag3 + HousingStartGrowthLag4 +
            HousingStartGrowthLag9 + HousingStartGrowthLag10 + HousingStartGrowthLag11+HousingStartGrowthLag12,
           data = ADL_traindata)
summary(ADL)

## 
## Call:
## lm(formula = PriceIndexGrowth ~ PriceIndexGrowthLag1 + PriceIndexGrowthLag2 + 
##     HousingStartGrowthLag1 + HousingStartGrowthLag2 + HousingStartGrowthLag3 + 
##     HousingStartGrowthLag4 + HousingStartGrowthLag9 + HousingStartGrowthLag10 + 
##     HousingStartGrowthLag11 + HousingStartGrowthLag12, data = ADL_traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85278 -0.12686  0.00694  0.13383  1.10126 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.0497163  0.0140624   3.535 0.000471 ***
## PriceIndexGrowthLag1     1.4902808  0.0469509  31.741  < 2e-16 ***
## PriceIndexGrowthLag2    -0.6466255  0.0465640 -13.887  < 2e-16 ***
## HousingStartGrowthLag1  -0.0043010  0.0012655  -3.399 0.000768 ***
## HousingStartGrowthLag2  -0.0047772  0.0012927  -3.695 0.000261 ***
## HousingStartGrowthLag3  -0.0037941  0.0012612  -3.008 0.002850 ** 
## HousingStartGrowthLag4   0.0024818  0.0012010   2.066 0.039647 *  
## HousingStartGrowthLag9   0.0058646  0.0011571   5.068 7.04e-07 ***
## HousingStartGrowthLag10  0.0005399  0.0012257   0.440 0.659906    
## HousingStartGrowthLag11 -0.0020908  0.0012364  -1.691 0.091878 .  
## HousingStartGrowthLag12 -0.0034716  0.0012442  -2.790 0.005603 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2352 on 300 degrees of freedom
## Multiple R-squared:  0.955,  Adjusted R-squared:  0.9535 
## F-statistic:   637 on 10 and 300 DF,  p-value: < 2.2e-16

acf2(residuals(ADL), main = "ACF and PACF of ADL Model Residuals")

##       [,1] [,2]  [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  -0.07 0.13 -0.08 0.23 0.08 0.05 0.05 0.08 0.09  0.02  0.18  0.37  0.10
## PACF -0.07 0.13 -0.06 0.21 0.13 0.01 0.07 0.05 0.05  0.01  0.16  0.41  0.15
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF   0.05 -0.04  0.07  0.10 -0.01 -0.02  0.09  0.05  0.02  0.12  0.21  0.04
## PACF  0.03 -0.07 -0.16 -0.04 -0.10 -0.10  0.07 -0.01 -0.05 -0.01  0.03 -0.05
##      [,26] [,27] [,28]
## ACF  -0.04  0.01  0.03
## PACF -0.09  0.07  0.04

The model residuals look nice and clean. Moreover, there appears to be just a bit of temporal persistence in the two month lag.

Let’s compute the predicted value for our first observation in the training sample:

predict(ADL, newdata = head(ADL_testdata,1), interval = "confidence", level=0.95)

##        fit      lwr      upr
## 1 1.182278 1.090282 1.274275

The “fit” is the prediction by the ``fitted’’ model and the lwr and upr indicate the confidence bounds on that forecast.

Let’s compare to the observed value

window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation

##          Jan
## 2014 1.46015

Our forecast was outside the upper bound, which is not good. We can quantify the quality of this forecast by various forecast evaluation criterion such as RMSFE (Root Mean Squared Forecast Errors). We’ll do so once we have a few models to compare.

Holt Winters

There is a class of time series prediction models that act entirely on the trend-cycle decomposition we highlighted earlier.

plot(decompose(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), type="additive"))

A preliminary diagnostic to determine if this class of model will be useful for you is whether the random component is very large in magnitude relative to the trend and seasonal. If so, the model likely won’t be very good.

Holt Winters is a flexible decomposition model well suited for forecasting. Within the R implementation there are several settings we can tune if desired:

• alpha: the “base value”. Higher alpha puts more weight on the most recent observations.

• beta: the “trend value”. Higher beta means the trend slope is more dependent on recent trend slopes.

• gamma: the “seasonal component”. Higher gamma puts more weighting on the most recent seasonal cycles.

### Fitting with Holt-Winters
HW1 <- HoltWinters(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), seasonal = "additive")
HW2 <- HoltWinters(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), seasonal = "additive",alpha=0.2,beta=0,gamma=0.1) # Custom Holt-Winters (Trendless)

### Plot fitted value of two models
{plot(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
      ylab="Price Index Growth") # Visually evaluate the fits
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW2$fitted[,1], lty=2, col="red")
legend("bottomleft",legend=c("Raw Data", "R Fit Holt-Winters","Custom Fit Holt-Winters (Trendless)"),
       col=c("black","red", "blue"), lty=c(1,2,2), cex=0.8, box.lty=0)}

Let’s predict one step ahead

predict(HW1, 1, prediction.interval = TRUE, level=0.95)

##               fit      upr       lwr
## Jan 2014 1.443605 1.900272 0.9869381

And compare to the observed value in the testing set

window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation

##          Jan
## 2014 1.46015

The confidence band is wider than for the ADL, but the forecast does appear much closer to the observed value.

ARMA

AutoRegressive Moving Average (ARMA) models are the bedrock of time series modeling. The basic idea is that any series may depend upon its own past (AR), and the errors from that specification may also depend upon their own past (MA).

Detect ARMA patterns

acf2(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
     main = "ACF and PACF of Price Index Growth")

##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8] [,9] [,10] [,11] [,12]
## ACF  0.87  0.57  0.19 -0.18 -0.45 -0.55 -0.46 -0.21 0.14  0.50  0.78  0.87
## PACF 0.87 -0.81 -0.19  0.13  0.04  0.18  0.29  0.26 0.28  0.25  0.14 -0.13
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## ACF   0.76  0.48  0.12 -0.23 -0.48 -0.58 -0.51 -0.27  0.05  0.39  0.65  0.75
## PACF -0.27  0.02  0.04 -0.04 -0.02 -0.14 -0.05 -0.06 -0.02  0.06  0.09 -0.05
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF   0.66  0.40  0.07 -0.26 -0.50 -0.61 -0.54 -0.33 -0.03  0.29  0.54  0.65
## PACF -0.08  0.07  0.02 -0.08  0.02 -0.02 -0.02 -0.05 -0.02  0.03  0.02 -0.02
##      [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
## ACF   0.57  0.34  0.03 -0.29 -0.53 -0.63 -0.58 -0.39 -0.10  0.20  0.44  0.56
## PACF -0.10 -0.01 -0.02 -0.03  0.03 -0.03 -0.04 -0.05 -0.08  0.04  0.07  0.06

The periodic swings in the ACF indicate some type of seasonality. The seasonal wave cresting roughly every twelve months is suggestive of an annual cycle. The spikes in the PACF suggest the possibility of a moving average component.

We can consider a 12month lag.

acf2(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12))%>%diff(lag = 12), 
     main = "ACF and PACF of Price Index Growth Diff 12", max.lag = 36)

##      [,1]  [,2]  [,3] [,4]  [,5] [,6] [,7]  [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.88  0.75  0.62 0.54  0.46 0.40 0.35  0.31 0.28  0.21  0.14  0.05  0.05
## PACF 0.88 -0.12 -0.10 0.18 -0.08 0.04 0.03 -0.02 0.03 -0.20 -0.03 -0.08  0.30
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF   0.05  0.04  0.05  0.07  0.06  0.04  0.02  0.00 -0.01 -0.01 -0.03 -0.04
## PACF -0.09 -0.11  0.23 -0.03 -0.15  0.04  0.02 -0.01 -0.08 -0.01 -0.08  0.11
##      [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF  -0.05 -0.03 -0.03 -0.03 -0.01  0.02  0.04  0.05  0.03 -0.01 -0.06
## PACF -0.03  0.00  0.09  0.04 -0.02  0.06 -0.01 -0.02 -0.14 -0.07 -0.16

Notice the seasonal wave of autocorrelations is gone. The remaining ACF decay and the single PACF indicate a remaining MA component.

Fit ARMA

We can fit arma model by indicating the number of AR lags and how many MA lags. In this case, let’s first an AR(1,0,0) [often written as AR(1)] \(Price_{t} = \alpha + \beta Price_{t-1}+u_{t}\)

ARMA1 <- arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
               order=c(1, 0, 0))
summary(ARMA1)

## 
## Call:
## arima(x = window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)), order = c(1, 
##     0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.8818     0.3375
## s.e.  0.0263     0.2378
## 
## sigma^2 estimated as 0.2662:  log likelihood = -245.33,  aic = 496.65
## 
## Training set error measures:
##                        ME     RMSE       MAE      MPE     MAPE      MASE
## Training set -0.008419172 0.515954 0.4503532 4.465093 217.5368 0.9684304
##                   ACF1
## Training set 0.7449836

An AR(2,0,0) would be \(Price_{t} = \alpha + \beta_{1} Price_{t-1}+\beta_{2}Price_{t-2}+u_{t}\)

An ARMA(1,1) \(Price_{t} = \alpha + \beta Price_{t-1}+\gamma u_{t-1}+u_{t}\)is fit by

ARMA11 <- arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
               order=c(1, 0, 1))
summary(ARMA11)

## 
## Call:
## arima(x = window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)), order = c(1, 
##     0, 1))
## 
## Coefficients:
##          ar1     ma1  intercept
##       0.8382  0.6346     0.3329
## s.e.  0.0311  0.0320     0.2108
## 
## sigma^2 estimated as 0.1451:  log likelihood = -147.85,  aic = 303.7
## 
## Training set error measures:
##                        ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.005567083 0.3809118 0.3210698 -61.76058 180.4082 0.6904219
##                   ACF1
## Training set 0.3422065

Seasonal ARMA

The c(p,d,q) elements indicate the AR(p), MA(q), and the number (d) of differences.

Seasonal components can be added by the following where the ordering of the seasonal component follows analogously. ARMA(1,0,0)(1,1,2)12 would be \(Price_{t}=\alpha+\beta_{1} Price_{t-1}+\beta_{2}Price_{t-12}+\gamma_{1}u_{t-12}+\gamma_{2}u_{t-24}+u_{t}\)

ARMA11S <- arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
               order=c(1, 0, 0), seasonal = list(order = c(0,1,2), period = 12))
summary(ARMA11S)

## 
## Call:
## arima(x = window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)), order = c(1, 
##     0, 0), seasonal = list(order = c(0, 1, 2), period = 12))
## 
## Coefficients:
##          ar1     sma1     sma2
##       0.9249  -0.4969  -0.1038
## s.e.  0.0219   0.0572   0.0608
## 
## sigma^2 estimated as 0.037:  log likelihood = 68.29,  aic = -128.58
## 
## Training set error measures:
##                         ME      RMSE       MAE      MPE     MAPE      MASE
## Training set -0.0005726178 0.1887553 0.1212657 1.169253 94.81513 0.2607673
##                   ACF1
## Training set 0.1697167

Auto Detect Lag Length

Notice the ARMA results provided myriad evaluation criterion (e.g. aic). We can leverage that information to choose the optimal model form.

ARIMA_auto <- auto.arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), seasonal = TRUE)
summary(ARIMA_auto)

## Series: window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)) 
## ARIMA(2,0,2)(0,1,2)[12] 
## 
## Coefficients:
##          ar1     ar2     ma1     ma2     sma1     sma2
##       0.2946  0.5151  0.8426  0.3384  -0.5532  -0.0982
## s.e.  0.1622  0.1511  0.1566  0.0571   0.0601   0.0622
## 
## sigma^2 = 0.03436:  log likelihood = 81.95
## AIC=-149.91   AICc=-149.54   BIC=-123.73
## 
## Training set error measures:
##                        ME      RMSE       MAE       MPE    MAPE      MASE
## Training set -0.001795185 0.1801295 0.1150071 -24.40756 108.185 0.3594878
##                      ACF1
## Training set -0.002475849

Here, R found an ARMA(2,0,2) with a 2 month seasonal component.
\(Price_{t}=\alpha+\beta_{1} Price_{t-1}+\beta_{2}Price_{t-2}+\gamma_{1}u_{t-1}+\gamma_{2}u_{t-2}+\gamma_{t-2}u_{t-12}+\gamma_{4}u_{t-24}+u_{t}\)

Let’s look at the post regression diagnostics

Box.test(residuals(ARIMA_auto), type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(ARIMA_auto)
## X-squared = 0.0019984, df = 1, p-value = 0.9643

Recall the Lung Box Ho: No Autocorrelation in the errors. The very high pvalue suggests we fail to reject (ie no remaining serial correlation).

tsdisplay(residuals(ARIMA_auto), main='Model Residuals')

The visual diagnostics look pretty good. There are some annual spikes in the residuals and some issues with the model surrounding the GFC, which is not surprising.

Forecast Seasonal ARMA

Let’s use this ARMA model to predict one step ahead into the testing sample.

predict(ARIMA_auto, 1, prediction.interval = TRUE, level=0.95)

## $pred
##           Jan
## 2014 1.516968
## 
## $se
##            Jan
## 2014 0.1853686

And recall the observed value:

window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation

##          Jan
## 2014 1.46015

The fit is pretty good within relatively small standard error.

ARMA-GARCH

One of the important characteristics of our price series is that its variability tends to be autoregressive. By imposing an autoregressive structure onto the residuals of the model we can capture this characteristic.

Visualize GARCH Effects

Let’s visualize the variance of the model residuals.

residualSqr = residuals(ARIMA_auto)^2 
tsdisplay(residualSqr, main = "Residual Square")

Notice the persistence in the residual variance. This is indicative of autoregressive volatility.

Test for GARCH

We can test formally for (G)ARCH type effects with an Engle LM Test, which determines if the residual variance is indeed related to its own past.

Ho: No ARCH effects (white noise)

Ha: ARCH effects

library(FinTS)
ArchTest(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), lags = 1, demean = TRUE)

## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12))
## Chi-squared = 168.18, df = 1, p-value < 2.2e-16

Notice that the small p value suggests we reject the null, implying we have ARCH-type effects.

Fitting an ARMA-GARCH

Augmenting our ARMA model with GARCH effects is easy in R. One limitation is the inability to forecast seasonal ARIMA-GARCH directly. So, we will demonstrate without seasonality.

Find approprate ARMA

ARIMA_nonseasonal<-auto.arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), 
                               seasonal = FALSE)
predict(ARIMA_nonseasonal,1, prediction.interval = TRUE, level=0.95)

## $pred
##           Jan
## 2014 1.281348
## 
## $se
##            Jan
## 2014 0.2582665

Fit ARMA-GARCH

We’ll use that optimal ARMA structure as the basis for the “mean” specification. We’ll also use the “workhorse” GARCH specification of a GARCH(1,1), which includes one lag of the variance of the errors (GARCH) as well as one lag of the model residuals (ARCH).

library(rugarch)
spec <- ugarchspec(variance.model=list(model="sGARCH", garchOrder=c(1,1)),
                   mean.model=list(include.mean = T, armaOrder=c(3,2)), 
                   distribution.model="norm")
garch_fit <- ugarchfit(spec = spec, data = window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)))
garch_fit

## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : sGARCH(1,1)
## Mean Model   : ARFIMA(3,0,2)
## Distribution : norm 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error  t value Pr(>|t|)
## mu      0.526849    0.053394  9.86712 0.000000
## ar1     1.086071    0.206627  5.25620 0.000000
## ar2     0.049557    0.350780  0.14128 0.887652
## ar3    -0.532755    0.197247 -2.70095 0.006914
## ma1     0.522348    0.202317  2.58183 0.009828
## ma2     0.166843    0.074011  2.25429 0.024178
## omega   0.000313    0.000462  0.67765 0.497996
## alpha1  0.084036    0.024070  3.49128 0.000481
## beta1   0.914964    0.023651 38.68527 0.000000
## 
## Robust Standard Errors:
##         Estimate  Std. Error  t value Pr(>|t|)
## mu      0.526849    0.132759  3.96847 0.000072
## ar1     1.086071    0.237324  4.57632 0.000005
## ar2     0.049557    0.400229  0.12382 0.901457
## ar3    -0.532755    0.229035 -2.32608 0.020014
## ma1     0.522348    0.224112  2.33074 0.019767
## ma2     0.166843    0.078433  2.12719 0.033404
## omega   0.000313    0.000354  0.88490 0.376209
## alpha1  0.084036    0.015948  5.26923 0.000000
## beta1   0.914964    0.014157 64.63200 0.000000
## 
## LogLikelihood : -9.24581 
## 
## Information Criteria
## ------------------------------------
##                     
## Akaike       0.11298
## Bayes        0.21824
## Shibata      0.11148
## Hannan-Quinn 0.15500
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                          statistic p-value
## Lag[1]                       3.429 0.06405
## Lag[2*(p+q)+(p+q)-1][14]    57.483 0.00000
## Lag[4*(p+q)+(p+q)-1][24]   105.432 0.00000
## d.o.f=5
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic   p-value
## Lag[1]                      17.57 2.764e-05
## Lag[2*(p+q)+(p+q)-1][5]     17.67 8.445e-05
## Lag[4*(p+q)+(p+q)-1][9]     18.09 6.148e-04
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]   0.06801 0.500 2.000  0.7943
## ARCH Lag[5]   0.16718 1.440 1.667  0.9730
## ARCH Lag[7]   0.52521 2.315 1.543  0.9760
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  1.998
## Individual Statistics:              
## mu     0.40153
## ar1    0.03089
## ar2    0.12116
## ar3    0.30154
## ma1    0.05907
## ma2    0.16572
## omega  0.22809
## alpha1 0.30350
## beta1  0.25690
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          2.1 2.32 2.82
## Individual Statistic:     0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value      prob sig
## Sign Bias            2.868 4.412e-03 ***
## Negative Sign Bias   1.689 9.220e-02   *
## Positive Sign Bias   7.478 7.388e-13 ***
## Joint Effect        64.255 7.241e-14 ***
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     24.93      0.16300
## 2    30     31.15      0.35845
## 3    40     44.18      0.26192
## 4    50     70.96      0.02176
## 
## 
## Elapsed time : 0.1534491

1-step Ahead ARMGARCH

ugarchforecast(fitORspec = garch_fit, n.ahead = 1)

## 
## *------------------------------------*
## *       GARCH Model Forecast         *
## *------------------------------------*
## Model: sGARCH
## Horizon: 1
## Roll Steps: 0
## Out of Sample: 0
## 
## 0-roll forecast [T0=Dec 2013]:
##     Series  Sigma
## T+1  1.358 0.3124

For comparison, the observed testing sample observation was

window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation

##          Jan
## 2014 1.46015

Notice the fit is quite good. Moreover, note that the GARCH model also produces a forecast of the variance “sigma” of the forecast.

NNET

A feed forward neural network with hidden layers can leverage lagged inputs for forecasting.

Fit the NNET

set.seed(123) # set random seed so that we can replicate the example  
NNET <- nnetar(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)))

Forecast the NNET

In the following, h = # of periods ahead to forecast. PI = True computes prediction intervals.

forecast(NNET, h = 1, PI=TRUE)

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2014        1.48943 1.237953 1.737922 1.114619 1.882389

As a reminder

window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation

##          Jan
## 2014 1.46015

We get a great forecast with a modestly large confidence interval.

Random Forest

We must use caution when implementing a random forest model with time series data. As you may know, random forests are built by randomly selecting training data, upon which we estimate a regression tree. The process is repeated and then averaged together in an ensemble. The randomly selecting the observations to be used in the training set will disrupt the temporal persistence of the data. We need to ensure that the we are always using training data that occurred prior to the testing data. As such, we use forward-validation rather than k-fold cross validation.

Prep data

This step of prepping the data is sometimes called time embedding.

lag_order <- 12 # the lags number you want to include in your model
RF_data <- embed(df_adjts[,"PriceIndexGrowth"], lag_order + 1) %>% # embedding matrix
  `colnames<-`(c("Raw",paste("Lag",1:lag_order, sep=''))) %>% ts(end = end(df_adjts), frequency = 12)
head(RF_data)

##                 Raw       Lag1      Lag2      Lag3      Lag4       Lag5
## Feb 1988  2.1239480  1.7525175 1.2347540 0.5337933 0.0853447 -0.3179031
## Mar 1988  1.6839198  2.1239480 1.7525175 1.2347540 0.5337933  0.0853447
## Apr 1988  0.9086591  1.6839198 2.1239480 1.7525175 1.2347540  0.5337933
## May 1988  0.2901266  0.9086591 1.6839198 2.1239480 1.7525175  1.2347540
## Jun 1988 -0.3085801  0.2901266 0.9086591 1.6839198 2.1239480  1.7525175
## Jul 1988 -0.3863442 -0.3085801 0.2901266 0.9086591 1.6839198  2.1239480
##                Lag6       Lag7       Lag8       Lag9      Lag10      Lag11
## Feb 1988 -0.4963157 -0.3995805 -0.2208604  0.2851656  1.0196571  1.6127703
## Mar 1988 -0.3179031 -0.4963157 -0.3995805 -0.2208604  0.2851656  1.0196571
## Apr 1988  0.0853447 -0.3179031 -0.4963157 -0.3995805 -0.2208604  0.2851656
## May 1988  0.5337933  0.0853447 -0.3179031 -0.4963157 -0.3995805 -0.2208604
## Jun 1988  1.2347540  0.5337933  0.0853447 -0.3179031 -0.4963157 -0.3995805
## Jul 1988  1.7525175  1.2347540  0.5337933  0.0853447 -0.3179031 -0.4963157
##               Lag12
## Feb 1988  2.2659993
## Mar 1988  1.6127703
## Apr 1988  1.0196571
## May 1988  0.2851656
## Jun 1988 -0.2208604
## Jul 1988 -0.3995805

Assign Train and Test Samples

RF_traindata <- window(RF_data, end = c(2013, 12))
RF_testdata <- window(RF_data, start = c(2014, 1))


y_train <- RF_traindata[, 1] # the target
X_train <- RF_traindata[, -1] # everything but the target
y_test <- RF_testdata[, 1]
X_test <- RF_testdata[, -1]

Fit the Random Forest

library(randomForest)
set.seed(123) # set random seed so that we can replicate the example
RF_fit <- randomForest(X_train, y_train)

Forecast the Random Forest

predict(RF_fit, head(X_test,1))

##        1 
## 1.426869

And a reminder of the testing sample

window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))

##          Jan
## 2014 1.46015

We get a relatively accurate forecast. Unfortunately, a confidence interval isn’t provided natively.

Multi Step Ahead Forecasts

Background

Let’s define Static and Dynamic multi step ahead forecasts.

Static Forecast:
- Train = 1990-2000. Test = 2001 (1 step ahead)
- Train = 1990-2000. Test = 2002 (2 step ahead)
Dynamic Forecast:
- Train = 1990-2000. Test = 2001 (1 step ahead)
- Train = 1990-2000 & 2001 Forecast. Test = 2002 (1 step ahead)

Notice that static forecasts successively further into the future, which can be difficult. Meanwhile, dynamic forecasts build upno potentially error-prone prior forecasts, which can be equally troublesome.

Notice pure distributed lag models (i.e. \(Y_{t}=\alpha + \beta X_{t-1}+u_{T]\)}) don’t permit dynamic forecasts since they require a forecast of \(X\). Moreover, distributed lag models only permit static forecasts up util the horizon of the \(X\) lag (i.e. one in this case).

Forecasts

Let’s use our training set, which ended in Dec2013 to create a series of 24 static forecasts (ie through the end of 2015).

### ARIMA
ARIMA_mulpred <- ts(forecast(ARIMA_auto, h = 24)$mean, start = c(2014,1), frequency =12)


### ARMA-ARCH
Garch_mulpred <- ts(fitted(ugarchforecast(fitORspec = garch_fit, n.ahead = 24)), start = c(2014,1), frequency =12)


### Holt-Winter
HW1_mulpred <- ts(forecast(HW1, h = 24)$mean, start = c(2014,1), frequency =12)


### NNET
NNET_mulpred <- ts(forecast(NNET, h = 24)$mean, start = c(2014,1), frequency =12)


### Random Forest
# We will need to write a loop to get static multiple step ahead forecast
h <- 24
RF_mulpred <- matrix(NA, h, 1)
X_multest <- head(X_test,1)
for (i in 1:h){
  set.seed(123)
  RF_mulpred[i] <- predict(RF_fit, X_multest)
  X_multest[2:ncol(X_multest)] <-  X_multest[-ncol(X_multest)] # update the "regressor" with predicted value
  X_multest[1] <- RF_mulpred[i]
}
RF_mulpred <- ts(RF_mulpred, start = c(2014,1), frequency =12)



### Plot Forecast
Test<-window(df_adjts[,"PriceIndexGrowth"], start = c(2014, 1), end=c(2015,12))
mulpreds <- ts.union(ARIMA_mulpred, Garch_mulpred, HW1_mulpred, NNET_mulpred, RF_mulpred,Test) %>% 
  `colnames<-`(c("ARMA","ARMA-GARCH","Holt Winters", "NNETAR", "Random Forest","TestSample"))
autoplot(mulpreds) +
  labs(color ='', y = "Forecast", x = '', title = "Static 24-Step Ahead Forecast")

Forecast evaluation

Let’s compute the out of sample (OOS) forecast errors

errOOS <- ts((window(df_adjts[,"PriceIndexGrowth"], start = c(2014, 1)) - mulpreds[,-6]),
             start = c(2014,1), frequency =12)%>% 
            `colnames<-`(c("ARMA","ARMA-GARCH","Holt Winters", "NNETAR", "Random Forest"))

We can summarize the errors

summary(errOOS)

##       ARMA            ARMA-GARCH       Holt Winters          NNETAR       
##  Min.   :-0.73038   Min.   :-0.9960   Min.   :-0.49944   Min.   :-1.2421  
##  1st Qu.:-0.42911   1st Qu.:-0.6311   1st Qu.:-0.09299   1st Qu.:-0.7166  
##  Median :-0.06286   Median :-0.1992   Median : 0.05930   Median :-0.2723  
##  Mean   :-0.14576   Mean   :-0.1896   Mean   : 0.07614   Mean   :-0.2477  
##  3rd Qu.: 0.11068   3rd Qu.: 0.2678   3rd Qu.: 0.36356   3rd Qu.: 0.1962  
##  Max.   : 0.41506   Max.   : 0.7548   Max.   : 0.60372   Max.   : 0.8901  
##  Random Forest     
##  Min.   :-0.89069  
##  1st Qu.:-0.62201  
##  Median :-0.31827  
##  Mean   :-0.35636  
##  3rd Qu.:-0.06131  
##  Max.   : 0.10892

Let’s construct the Mean Squared Forecast Error (MSFE).

MSFE<-colMeans(errOOS^2,na.rm = TRUE)
MSFE

##          ARMA    ARMA-GARCH  Holt Winters        NNETAR Random Forest 
##    0.14387261    0.28120525    0.09217917    0.43127173    0.23429218

Interestingly, in this case forecasting accuracy declines with model complexity.

Simulation

There are many uses of simulation in a time series context.

Parametric simulation (i.e. Monte Carlo) is often used when you have a good sense of the DGP. Meanwhile, non parametric simulation (i.e. Bootstrapping) relies upon successive draws of the empirical distribution.

In the following we will demonstrate using bootstrapping for construction forecast confidence intervals.

Bootstrapping Time Series

Recall that bootstrapping randomly resamples from the observed dataset in order to create many pseudo samples for analysis. The model of interest is implemented on each pseudo sample. The range of the resulting estimates provides a confidence interval for the estimation on the observed sample.

Unfortunately, random resampling disrupts the temporal persistence of the process. So, we rely upon block bootstrapping. Briefly, suppose your EDA suggests that the series has a temporal persistence of 12 months. Then we resample from the observed data by drawing blocks of 12 successive months. We can stitch these blocks together to generate a a pseudo sample of the desired size.

Create Block Bootstrap Samples

Note that due to the simulation process the pseudo samples you draw may be different each time. Best practice is to draw several hundred (or thousand) pseudo samples. For illustratives purpose, we’re only drawing for 4 pseudo samples here.

nT = nrow(df_adjts)
Train = window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12))
Test = window(df_adjts[,"PriceIndexGrowth"], start = c(2014, 1))
TrainNum <- round(length(Train))
BootpathNum = 4
BootSamples <- bld.mbb.bootstrap(Train, BootpathNum,block_size=12) %>% data.frame() %>% ts(end= end(Train), frequency=12)
colnames(BootSamples) <- c(lapply(1:(BootpathNum), FUN =function(x){paste("Path",x,sep="")}))
plot(BootSamples, main = "Bootstrapped Samples")

Bootstrapping for a Confidence Interval

Our goal here is to form a confidence interval around our forecast. We want to avoid making distributional assumptions, so we rely upon bootstraps.

Fit an ARIMA to Training Sample

ARMA_fit <- auto.arima(Train)
arimafc <- forecast(ARMA_fit, h=nT-TrainNum, level=95)
arimafc_err <- df_adjts[,"PriceIndexGrowth"] - arimafc$mean

Forecast on each Pseudo Sample

Use that fitted model specification and parameter estimates to construct forecasts for each bootstrap sample.

Bootfcast <- matrix(0, nrow=BootpathNum, ncol=nT-TrainNum)
for(i in seq(BootpathNum)) {
  ARMA_refit <- Arima(BootSamples[,i], model=ARMA_fit) # fit model to bootstrapped samples without re-estimating
  Bootfcast[i,] <- forecast(ARMA_refit, h = nT-TrainNum)$mean}

Gather the Forecasts

We will gather all of the forecasts for Jan2014, and compute the 2.5% and 97.5% quantiles, which will serve to form a 95% confidence interval. Then repeat that for Feb2014, etc…

bootfc <- structure(list(
  mean = ts(colMeans(Bootfcast), start = start(Test), frequency=12),
  lower = ts(apply(Bootfcast, 2, quantile, prob=0.025), start = start(Test), frequency=12),
  upper = ts(apply(Bootfcast, 2, quantile, prob=0.975), start = start(Test), frequency=12),
  level = 95),
  class="forecast")

{plot(bootfc, xlim = c(1988,2023),main="Bootstrap Forecasts", shadecols = "gray70",fcol = NA)
lines(df_adjts[,"PriceIndexGrowth"], col = "black", lty = 2)
legend("bottomleft",legend=c("Raw Data","Bootstrapped Confidence Interval"),
       col=c("black","grey70"), lty=c(2,1), lwd = c(1,4), cex=0.8, box.lty=0)}

Notice that the blue line is the mean of the forecasts on the pseudo sample. This is often referred to as BAGGING (i.e. bootstrap aggregation).

The gray region illustrates the range of the 95% forecast interval.

Notice the large spike of the black line outside of the gray near the end of the sample. That implies our series experienced a “rare” event.

Bootstrap Aggregating (Bagging)

Aggregate forecasts across bootstrap sample (aka “bagging”) can be used to improve forecast accuracy by reducing one or more sources of uncertainty

• Parameter Uncertainty

• DGP Uncertainty (i.e. not sure if the DGP will be the same in the future)

• Functional Form Uncertainty (aka model uncertainty)

Bagging-Version 1

we can use bagging to reduce parameter uncertainty and uncertainty over instability of the DGP into the out of sample period. We will find the optimal function form via information criterion. We will then fit this functional form to each bootstrap sample, and then forecast. In doing so, the ARMA order remains the same, but the parameter estimates are optimized for each bootstrap path.

ARMA_fit <- auto.arima(Train) # optimal ARIMA is ARIMA(2,0,2)(0,1,2)[12]
bagfcast1 <- matrix(0, nrow=BootpathNum, ncol=nT-TrainNum)
for(i in seq(BootpathNum)) {
  ARMA_refit <- Arima(BootSamples[,i], 
                order=c(2, 0, 2), 
                seasonal = list(order = c(0,1,2), period = 12)) # fitted ARIMA(2,0,2)(0,1,2)[12] with re-estimating
  bagfcast1[i,] <- forecast(ARMA_refit, h = nT-TrainNum)$mean} %>% ts(frequency=12, start = start(Test))

Bagging-Version 2

We can use bagging to reduce i) parameter uncertainty, ii) uncertainty over instability of the DGP into the out of sample period, and iii) model uncertainty. We will find the optimal ARIMA structure for each bootstrap path, fit the ARMA model accordingly, and then forecast.

bagfcast2 <- matrix(0, nrow=BootpathNum, ncol=nT-TrainNum)
for(i in seq(BootpathNum)) {
  ARMA_refit <- auto.arima(BootSamples[,i]) # fitted optimal arima model with each sample
  bagfcast2[i,] <- forecast(ARMA_refit, h = nT-TrainNum)$mean}

Compare with (Non) Bagging

Let’s generate bagging forecast errors in the testing period.

bagfc1 <- colMeans(bagfcast1) %>% ts(frequency=12, start = start(Test))
bagfc2 <- colMeans(bagfcast2) %>% ts(frequency=12, start = start(Test))

bagfc1_err <- df_adjts[,"PriceIndexGrowth"] - bagfc1
bagfc2_err <- df_adjts[,"PriceIndexGrowth"] - bagfc2

Now compare the forecast errors built from the optimal arima forecast, bootstrapped forecast, bagging1 and bagging2 forecast errors

{plot(arimafc_err, col = "red", xlab = "", ylab = "OOS Error", main = "ARIMA OOS Forecast Error")
lines(bagfc1_err, col = "green", lty = 3)
lines(bagfc2_err, col = "purple", lty = 4)
legend("bottomleft",legend=c("ARIMA Forecast OOS Error",
                             "Bagging 1 Forecast OOS Error (Parameter and DGP Uncertainty)", 
                             "Bagging 2 Forecast OOS Error (Parameter, DGP and Model Uncertainty)"),
       col=c("red", "green","purple"), lty=c(1,2,3,4), cex=0.6, box.lty=0)
}

VAR

Recall our earlier EDA wherein we saw that housing starts may both lead and lag price growth. Economically, it might be that rapid housing starts signal more supply, which depresses prices. Conversely, starts may be rising because prices are rising, which signal more profitability. This simultaneity will bias coefficient estimates of our univariate specifications. Instead, we might consider a simultaneous system of autoregressive specifications; i.e. a vector auto regression VAR.

Lag Selection

We can rely upon information criterion to guide our choice.

library(vars)
VAR_data<-window(df_adjts[,c("PriceIndexGrowth","HousingStartGrowth")], end = c(2013,12)) 
VARselect(VAR_data, lag.max = 12, type = "const", season = 12)$selection

## AIC(n)  HQ(n)  SC(n) FPE(n) 
##     12      2      2     12

The AIC suggests 12 lags. This is rather high. Note that parameters proliferate quickly in a VAR. We will instead use 2 lags, as indicated by other criterion above.

Fitting a VAR

VAR_est <- VAR(y = VAR_data, p = 2,  type = "const")
summary(VAR_est)

## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: PriceIndexGrowth, HousingStartGrowth 
## Deterministic variables: const 
## Sample size: 321 
## Log Likelihood: -1248.106 
## Roots of the characteristic polynomial:
## 0.9238 0.9238 0.337 0.337
## Call:
## VAR(y = VAR_data, p = 2, type = "const")
## 
## 
## Estimation results for equation PriceIndexGrowth: 
## ================================================= 
## PriceIndexGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const 
## 
##                        Estimate Std. Error t value Pr(>|t|)    
## PriceIndexGrowth.l1    1.564252   0.032263  48.485  < 2e-16 ***
## HousingStartGrowth.l1 -0.005653   0.001218  -4.642 5.07e-06 ***
## PriceIndexGrowth.l2   -0.747722   0.035888 -20.835  < 2e-16 ***
## HousingStartGrowth.l2 -0.003578   0.001251  -2.860  0.00452 ** 
## const                  0.060153   0.014597   4.121 4.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.2525 on 316 degrees of freedom
## Multiple R-Squared: 0.9462,  Adjusted R-squared: 0.9455 
## F-statistic:  1389 on 4 and 316 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation HousingStartGrowth: 
## =================================================== 
## HousingStartGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const 
## 
##                       Estimate Std. Error t value Pr(>|t|)    
## PriceIndexGrowth.l1    0.65615    1.50634   0.436 0.663433    
## HousingStartGrowth.l1 -0.08300    0.05686  -1.460 0.145369    
## PriceIndexGrowth.l2    6.19221    1.67558   3.696 0.000259 ***
## HousingStartGrowth.l2 -0.10001    0.05842  -1.712 0.087892 .  
## const                 -1.13884    0.68152  -1.671 0.095705 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 11.79 on 316 degrees of freedom
## Multiple R-Squared: 0.245,   Adjusted R-squared: 0.2355 
## F-statistic: 25.64 on 4 and 316 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                    PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth            0.06374            -0.6507
## HousingStartGrowth         -0.65074           138.9374
## 
## Correlation matrix of residuals:
##                    PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth             1.0000            -0.2187
## HousingStartGrowth          -0.2187             1.0000

Notice that there are 2 “regression” tables. That’s because we have a system of two equations. Let’s interpret one of the #s from the first table. The second coefficient down is -.005653. That suggests that if housing starts growth rate increased by 1 unit (% since we computed growth rates) TODAY, then price growth (eg return) would fall by .005653 units (% since we computed growth rates) TOMORROW.

tsdisplay(VAR_est$varresult$PriceIndexGrowth$residuals, main = "Residuals of Price Index Growth")

tsdisplay(VAR_est$varresult$HousingStartGrowth$residuals, main = "Residuals of Housing Start Growth")

Post regression diagnostics for each of the equations in our system reveal some periodicity near 12 lags, indicating possible remaining seasonality.

Fit Seasonal VAR

VAR_est2 <- VAR(y = VAR_data, p = 2, season = 12,type = "const") 
summary(VAR_est2)

## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: PriceIndexGrowth, HousingStartGrowth 
## Deterministic variables: const 
## Sample size: 321 
## Log Likelihood: -1048.57 
## Roots of the characteristic polynomial:
## 0.8255 0.4468 0.3949 0.3949
## Call:
## VAR(y = VAR_data, p = 2, type = "const", season = 12L)
## 
## 
## Estimation results for equation PriceIndexGrowth: 
## ================================================= 
## PriceIndexGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const + sd1 + sd2 + sd3 + sd4 + sd5 + sd6 + sd7 + sd8 + sd9 + sd10 + sd11 
## 
##                        Estimate Std. Error t value Pr(>|t|)    
## PriceIndexGrowth.l1    1.285677   0.051556  24.937  < 2e-16 ***
## HousingStartGrowth.l1  0.002678   0.001340   1.998 0.046624 *  
## PriceIndexGrowth.l2   -0.387408   0.051455  -7.529 5.81e-13 ***
## HousingStartGrowth.l2  0.003744   0.001353   2.767 0.006011 ** 
## const                  0.025058   0.012279   2.041 0.042146 *  
## sd1                   -0.215863   0.057597  -3.748 0.000213 ***
## sd2                   -0.601785   0.067963  -8.855  < 2e-16 ***
## sd3                   -1.015351   0.095401 -10.643  < 2e-16 ***
## sd4                   -1.073953   0.110708  -9.701  < 2e-16 ***
## sd5                   -0.969141   0.099566  -9.734  < 2e-16 ***
## sd6                   -0.566908   0.094893  -5.974 6.44e-09 ***
## sd7                   -0.714366   0.078505  -9.100  < 2e-16 ***
## sd8                   -0.415923   0.077112  -5.394 1.39e-07 ***
## sd9                   -0.233209   0.070602  -3.303 0.001070 ** 
## sd10                  -0.188574   0.066287  -2.845 0.004744 ** 
## sd11                  -0.086645   0.059756  -1.450 0.148094    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.1961 on 305 degrees of freedom
## Multiple R-Squared: 0.9687,  Adjusted R-squared: 0.9671 
## F-statistic: 628.9 on 15 and 305 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation HousingStartGrowth: 
## =================================================== 
## HousingStartGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const + sd1 + sd2 + sd3 + sd4 + sd5 + sd6 + sd7 + sd8 + sd9 + sd10 + sd11 
## 
##                       Estimate Std. Error t value Pr(>|t|)    
## PriceIndexGrowth.l1    0.28652    2.17055   0.132 0.895068    
## HousingStartGrowth.l1 -0.40217    0.05643  -7.127 7.43e-12 ***
## PriceIndexGrowth.l2    1.25020    2.16627   0.577 0.564284    
## HousingStartGrowth.l2 -0.16057    0.05698  -2.818 0.005148 ** 
## const                  0.70201    0.51697   1.358 0.175493    
## sd1                   11.10914    2.42487   4.581 6.74e-06 ***
## sd2                   36.35284    2.86130  12.705  < 2e-16 ***
## sd3                   29.01771    4.01645   7.225 4.04e-12 ***
## sd4                   20.56456    4.66087   4.412 1.42e-05 ***
## sd5                   13.67383    4.19180   3.262 0.001232 ** 
## sd6                    7.45428    3.99505   1.866 0.063018 .  
## sd7                    7.48517    3.30511   2.265 0.024231 *  
## sd8                    6.52898    3.24647   2.011 0.045194 *  
## sd9                   10.97460    2.97238   3.692 0.000263 ***
## sd10                  -3.91043    2.79073  -1.401 0.162164    
## sd11                  -8.26059    2.51578  -3.284 0.001144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 8.254 on 305 degrees of freedom
## Multiple R-Squared: 0.6427,  Adjusted R-squared: 0.6251 
## F-statistic: 36.57 on 15 and 305 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                    PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth            0.03844            0.08769
## HousingStartGrowth          0.08769           68.12606
## 
## Correlation matrix of residuals:
##                    PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth            1.00000            0.05419
## HousingStartGrowth          0.05419            1.00000

Recall the impact of possible serial correlation. So, it’s best practice to consider our robust standard errors.

library(lmtest)
library(sandwich)
coeftest(VAR_est2, vcov. = sandwich)

## 
## t test of coefficients:
## 
##                                            Estimate Std. Error t value
## PriceIndexGrowth:(Intercept)              0.0250581  0.0161187  1.5546
## PriceIndexGrowth:PriceIndexGrowth.l1      1.2856775  0.0743862 17.2838
## PriceIndexGrowth:HousingStartGrowth.l1    0.0026776  0.0013815  1.9381
## PriceIndexGrowth:PriceIndexGrowth.l2     -0.3874079  0.0719956 -5.3810
## PriceIndexGrowth:HousingStartGrowth.l2    0.0037442  0.0016233  2.3066
## PriceIndexGrowth:sd1                     -0.2158633  0.0593161 -3.6392
## PriceIndexGrowth:sd2                     -0.6017850  0.0965587 -6.2323
## PriceIndexGrowth:sd3                     -1.0153513  0.1076739 -9.4299
## PriceIndexGrowth:sd4                     -1.0739527  0.1311516 -8.1886
## PriceIndexGrowth:sd5                     -0.9691414  0.1179980 -8.2132
## PriceIndexGrowth:sd6                     -0.5669085  0.1138084 -4.9813
## PriceIndexGrowth:sd7                     -0.7143658  0.0902086 -7.9190
## PriceIndexGrowth:sd8                     -0.4159226  0.0891303 -4.6665
## PriceIndexGrowth:sd9                     -0.2332091  0.0777081 -3.0011
## PriceIndexGrowth:sd10                    -0.1885737  0.0670889 -2.8108
## PriceIndexGrowth:sd11                    -0.0866448  0.0625306 -1.3856
## HousingStartGrowth:(Intercept)            0.7020055  0.5784976  1.2135
## HousingStartGrowth:PriceIndexGrowth.l1    0.2865217  2.4756064  0.1157
## HousingStartGrowth:HousingStartGrowth.l1 -0.4021720  0.0566221 -7.1027
## HousingStartGrowth:PriceIndexGrowth.l2    1.2501990  2.3859033  0.5240
## HousingStartGrowth:HousingStartGrowth.l2 -0.1605687  0.0648869 -2.4746
## HousingStartGrowth:sd1                   11.1091403  2.7827057  3.9922
## HousingStartGrowth:sd2                   36.3528368  3.5435165 10.2590
## HousingStartGrowth:sd3                   29.0177125  4.3264647  6.7070
## HousingStartGrowth:sd4                   20.5645606  5.0209088  4.0958
## HousingStartGrowth:sd5                   13.6738317  4.6677671  2.9294
## HousingStartGrowth:sd6                    7.4542850  4.3179984  1.7263
## HousingStartGrowth:sd7                    7.4851655  3.7126343  2.0161
## HousingStartGrowth:sd8                    6.5289797  3.7005702  1.7643
## HousingStartGrowth:sd9                   10.9746043  3.4085180  3.2198
## HousingStartGrowth:sd10                  -3.9104308  3.0216621 -1.2941
## HousingStartGrowth:sd11                  -8.2605923  3.0019954 -2.7517
##                                           Pr(>|t|)    
## PriceIndexGrowth:(Intercept)             0.1210787    
## PriceIndexGrowth:PriceIndexGrowth.l1     < 2.2e-16 ***
## PriceIndexGrowth:HousingStartGrowth.l1   0.0535292 .  
## PriceIndexGrowth:PriceIndexGrowth.l2     1.479e-07 ***
## PriceIndexGrowth:HousingStartGrowth.l2   0.0217489 *  
## PriceIndexGrowth:sd1                     0.0003211 ***
## PriceIndexGrowth:sd2                     1.526e-09 ***
## PriceIndexGrowth:sd3                     < 2.2e-16 ***
## PriceIndexGrowth:sd4                     7.306e-15 ***
## PriceIndexGrowth:sd5                     6.179e-15 ***
## PriceIndexGrowth:sd6                     1.060e-06 ***
## PriceIndexGrowth:sd7                     4.496e-14 ***
## PriceIndexGrowth:sd8                     4.594e-06 ***
## PriceIndexGrowth:sd9                     0.0029123 ** 
## PriceIndexGrowth:sd10                    0.0052616 ** 
## PriceIndexGrowth:sd11                    0.1668697    
## HousingStartGrowth:(Intercept)           0.2258785    
## HousingStartGrowth:PriceIndexGrowth.l1   0.9079364    
## HousingStartGrowth:HousingStartGrowth.l1 8.658e-12 ***
## HousingStartGrowth:PriceIndexGrowth.l2   0.6006633    
## HousingStartGrowth:HousingStartGrowth.l2 0.0138824 *  
## HousingStartGrowth:sd1                   8.207e-05 ***
## HousingStartGrowth:sd2                   < 2.2e-16 ***
## HousingStartGrowth:sd3                   9.642e-11 ***
## HousingStartGrowth:sd4                   5.396e-05 ***
## HousingStartGrowth:sd5                   0.0036522 ** 
## HousingStartGrowth:sd6                   0.0853009 .  
## HousingStartGrowth:sd7                   0.0446625 *  
## HousingStartGrowth:sd8                   0.0786795 .  
## HousingStartGrowth:sd9                   0.0014213 ** 
## HousingStartGrowth:sd10                  0.1965987    
## HousingStartGrowth:sd11                  0.0062834 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The pvalues from the coefficients above suggest some indication of today’s housing starts to explain tomorrow’s prices, but not vice versa. We need a test of combine the aggregated effects of these forces.

Determining Direction of causality

If Housing Starts today have an impact on Prices tomorrow, then Housing Starts are said to Granger Cause Prices.

causality(VAR_est2, cause = "HousingStartGrowth")$Granger

## 
##  Granger causality H0: HousingStartGrowth do not Granger-cause
##  PriceIndexGrowth
## 
## data:  VAR object VAR_est2
## F-Test = 4.4408, df1 = 2, df2 = 610, p-value = 0.01217

Given the low pvalue we reject the null, suggesting that housing starts have a leading impact on prices.

Let’s examine the other direction.

causality(VAR_est2, cause = "PriceIndexGrowth")$Granger

## 
##  Granger causality H0: PriceIndexGrowth do not Granger-cause
##  HousingStartGrowth
## 
## data:  VAR object VAR_est2
## F-Test = 1.7979, df1 = 2, df2 = 610, p-value = 0.1665

The relatively high pvalue suggest that the causal direction is only one way (housing starts to prices).

Impulse Response Function (IRF)

Another way to understand and use the VAR estimates is via a stress test. For instance, what happens to Prices over the next 2yrs if we get a shock to housing starts today?

irf_PriceIndexGrowth <- irf(VAR_est2, impulse = "HousingStartGrowth",        
                            response = "PriceIndexGrowth",
                            cumulative = TRUE,
                            n.ahead = 20, boot = TRUE)

irf_PriceIndexGrowth$irf$HousingStartGrowth[3]

## [1] 0.07242436

plot(irf_PriceIndexGrowth, ylab = "Price Index Growth", main = "Cumulative Impact of 1unit shock to  Housing Starts")

The IRF suggests that housing starts rise today, prices will rapidly for about 6mths and then the impact begins to slow.

irf_HousingStartGrowth <- irf(VAR_est2, impulse = "PriceIndexGrowth", 
                              cumulative = TRUE, 
                              response = "HousingStartGrowth", 
                              n.ahead = 20, boot = TRUE)
plot(irf_HousingStartGrowth, ylab = "Housing Start Growth", main = "Cumulative Impact from 1 unit shock to Prices")

A shock to prices has little impact for the first few months, but then may start to build over time.

Forecast with VAR

Given the causality findings, we may consider refining the model. For now, we’ll use that estimated structure to predict both Prices and Starts 24-months head.

VAR_mulpred <- predict(VAR_est2, n.ahead = 24)
plot(VAR_mulpred, names = "PriceIndexGrowth", main = "Forecast of Price Growth")

plot(VAR_mulpred, names = "HousingStartGrowth", main = "Forecast of Housing Starts Growth")

Introduction to Time Series

Case Study: Forecasting Housing Prices

Mike Aguilar

2023-03-23

Case Study Background

Intended audience

Accompanying materials

Acknowledgements

Outline

Package details

Explore

Housekeeping

Data

Connect to FRED API

Download Housing Prices

Download Housing Starts

Combine

Check for missing

Clean

Basic EDA

Full sample correlation matrix

Descriptive Stats

Time Series Plots

Seasonality

Prepare the df for analysis

Moving Avg

Seasonal Decomposition

Classical Decomposition

X11 Decomposition

STL Decomposition

Stationarity

Detect Non-Stationarity w/ADF

Detect Non-Stationarity w/ KPSS

Detect Non-Stationarity w/ACF

Remedy w/Differencing

Housing Prices: Stationarity

Prep the df

Time Series EDA

AutoCorrelogram

Cross Correlations

Explain

Regression

Detect Serial Correlation

Visual inspection

Durbin Watson Test

BG Test

LB Test

Remediating Serial Correlation

Rolling Regression

Moving Window

Expanding Window

Forecast

Training and Test Samples

Distributed Lag

Holt Winters

ARMA

Detect ARMA patterns

Fit ARMA

Seasonal ARMA

Auto Detect Lag Length

Forecast Seasonal ARMA

ARMA-GARCH

Visualize GARCH Effects

Test for GARCH

Fitting an ARMA-GARCH

Find approprate ARMA

Fit ARMA-GARCH

1-step Ahead ARMGARCH

NNET

Fit the NNET

Forecast the NNET

Random Forest

Prep data

Assign Train and Test Samples

Fit the Random Forest

Forecast the Random Forest

Multi Step Ahead Forecasts

Background

Forecasts

Forecast evaluation