In this case study we are attempting to forecast house price growth.
We will structure these forecasts using the natural temporal persistence in this series, as well as leveraging the information contained in exogenous drivers such as housing starts.
You should have a working knowledge of 1) R and 2) econometric methods.
You do not need prior knowledge of time series methods.
Theoretical aspects of the topics discussed are relegated to a series of ``lecture notes’’ available at Prof. Aguilar’s website: aguilar-mike.weebly.com.
Contact information: aguilar-mike@outlook.com | https://www.linkedin.com/in/mike-aguilar-econ/
A special thanks to Ziming Huang for wonderful coding and design assistance.
We will detail the case study in 3 broad steps: 1) Explore, 2) Explain, 3) Forecast.
Several packages are used throughout this case study. Although the best practice is to invoke the library call for all packages within housekeeping, we invoke near the code implementation so that we can see when and how each package is used.
Standard best practice to clear workspace:
rm(list=ls()) # clear work space
cat("\014") # clear console
Let’s gather data on housing prices, housing starts, and interest rates. There are undoubtedly many other relevant features, but we will focus on this manageable subset so that we can focus on the time series methods. The specific features are chosen to highlight certain time series techniques, not necessarily their explanatory power.
Our data source will be FRED (https://fred.stlouisfed.org/) from the St. Louis Federal Reserve.
We can obtain our data via FRED’s API using the fredr
package. To do so, you need to obtain a free API key by visiting (https://fred.stlouisfed.org/docs/api/api_key.html).
After you’ve obtain the key, save it to a sample txt file and store in your working directory.
#install.packages("fredr")
library(fredr)
fred_api_key = read.delim("API_Key.txt", header = FALSE, sep="")[[1]]
fredr_set_key(fred_api_key) #Store the API key as an environment variable
In order to download data via FREDR we need the FRED series_id. These are easily obtained by searching directly on FRED’s website.
Price: S&P/Case-Shiller U.S. National Home Price Index (CSUSHPINSA)
CSUSHPINSA<-fredr(
series_id = "CSUSHPINSA",
frequency = "m", # monthly
observation_start = as.Date("1987-01-01"),
observation_end = as.Date("2023-03-01")
)
Housing starts: New Privately-Owned Housing Units Started: Total Units (HOUSTNSA)
HOUSTNSA<-fredr(
series_id = "HOUSTNSA",
frequency = "m", # monthly
observation_start = as.Date("1959-01-01"),
observation_end = as.Date("2023-03-01")
)
HOUSTNSA$value <- HOUSTNSA$value*1000 # Convert raw data "Thousands of Units" to "Units"
Let’s combine the data into a single data frame for ease of analysis.
temp_df <- Reduce(function(x,y) merge(x, y, by="date", all = T), lapply(c("HOUSTNSA","CSUSHPINSA"), function(x) rbind(get(x)[,c(1,3)])))
colnames(temp_df) <- c("Date","HousingStart","PriceIndex") # rename dataframe
head(temp_df)
## Date HousingStart PriceIndex
## 1 1959-01-01 96200 NA
## 2 1959-02-01 99000 NA
## 3 1959-03-01 127700 NA
## 4 1959-04-01 150800 NA
## 5 1959-05-01 152500 NA
## 6 1959-06-01 147800 NA
We can see right away that we don’t have a balanced panel.
summary(is.na(temp_df))
## Date HousingStart PriceIndex
## Mode :logical Mode :logical Mode :logical
## FALSE:770 FALSE:770 FALSE:432
## TRUE :338
We can see the issues are most prolific for Prices
When are these missing values?
idxHousingStarts<-is.na(temp_df$HousingStart)
idxPrices<-is.na(temp_df$PriceIndex)
par(mfrow=c(1,2))
plot(idxPrices,main="Prices")
plot(idxHousingStarts,main="HousingStarts")
Importantly, the missing observations only are at the beginning and end
of the series.
Cleaning time series is slightly different than cleaning cross sectional data. If we have a gap in our data we can’t simply skip over it. We need to consider why it is missing and then possibly interpolate.
Thankfully, our data is missing at the edges of the panel so we can simply trim to make it balanced.
df <- na.omit(temp_df) # remove the rows with NA
head(df)
## Date HousingStart PriceIndex
## 337 1987-01-01 105100 63.734
## 338 1987-02-01 102800 64.134
## 339 1987-03-01 141200 64.469
## 340 1987-04-01 159300 64.973
## 341 1987-05-01 158000 65.547
## 342 1987-06-01 162900 66.218
cor(df[,c(-1)]) # get grid of the date column
## HousingStart PriceIndex
## HousingStart 1.000000000 -0.004503851
## PriceIndex -0.004503851 1.000000000
summary(df)
## Date HousingStart PriceIndex
## Min. :1987-01-01 Min. : 31900 Min. : 63.73
## 1st Qu.:1995-12-24 1st Qu.: 86575 1st Qu.: 81.66
## Median :2004-12-16 Median :112350 Median :140.40
## Mean :2004-12-15 Mean :111177 Mean :138.56
## 3rd Qu.:2013-12-08 3rd Qu.:137525 3rd Qu.:176.09
## Max. :2022-12-01 Max. :197900 Max. :308.36
library(tidyverse)
plot_df <- df %>%
`colnames<-`(c("Date","HousingStart","PriceIndex \n(Jan 2000=100)")) %>%
pivot_longer(c("HousingStart","PriceIndex \n(Jan 2000=100)")) # Convert wide dataframe to long
ggplot(data = plot_df, aes(x = Date,y = value, group = name, color = name)) +
geom_line() +
scale_x_date(date_labels = "%Y (%b)") + # format datetime
facet_grid(name~., scales = "free_y") +
labs(x = "", y = '') +
theme(legend.position = "none")
Several important time series properties are apparent:
We need to address each before we proceed with modeling.
Let’s zoom in on housing starts.
library(ggplot2)
ggplot(data = df, aes(x = Date,y = HousingStart)) +
geom_line() +
scale_x_date(date_labels = "%Y (%b)") + # format datetime
labs(x = "", y = "Housing Start Units")
Be sure to stop and think about why there is temporal persistence and
why that persistence occurs in a seasonal pattern. If we can’t find an
``economic’’ reason, we may be overlooking an important feature.
There are many ways to adjust for these seasonal patterns.
library(xts)
FirstObs=df$Date[1]
FirstObs
## [1] "1987-01-01"
#We could parse the FirstObs chr string, but for our purposes it's easier to input by hand the start date
df_ts <- ts(xts(df[,2:3], order.by = df$Date), frequency = 12, start = c(1987,1))
A simple moving avg might smooth over the seasonal effects.
library(TTR)
SMA24<-SMA(df_ts[,"HousingStart"],n=24) # Create a simple 24 period moving avg using `SMA()`
toplot<-cbind(as.xts(df_ts[,"HousingStart"]),as.xts(SMA24))
colnames(toplot)[1]="Level" # Add column name
colnames(toplot)[2]="24mth MovAvg" # Add column name
plot(toplot,main="Housing Starts",legend.loc="topright", auto.legend=TRUE)
We could decompose the housing starts series into 3 pieces: 1) trend, 2) seasonal, 3) noise. The decomposition can be additive or multiplicative. There are numerous techniques for identifying and extracting the 3 components. Let’s demonstrate a few.
Classical decomposition methods compute the seasonal component by essentially including dummy variables for each periodic unit (e.g. month, quarters, etc..)
library(stats)
decomp<-decompose(df_ts[,"HousingStart"], type="additive")
plot(decomp) # plot components
After the decomposition is implemented we can easily construct a seasonally adjusted series
SA = decomp$x-decomp$seasonal # Raw Minus Seasonal
Compare the SA and NSA
Trend = decomp$trend
library(forecast)
forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
forecast::autolayer(decomp$x-decomp$seasonal, series="Seasonally Adjusted") +
forecast::autolayer(decomp$trend, series="Trend") +
xlab("Year") + ylab("") + labs(color = "") +
ggtitle("Classical Additive Seasonally Adjusted") +
scale_colour_manual(values=c("black","blue","red"), breaks=c("Raw Data","Seasonally Adjusted","Trend")) +
theme(legend.position="bottom")
The X11 approach is a variant of the classical decomposition.
library(seasonal)
x11decomp <- seas(df_ts[,"HousingStart"], x11="")
forecast::autoplot(x11decomp) +
ggtitle("X11 Decomposition")
Compare SA and NSA
forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
autolayer(seasadj(x11decomp), series="Seasonally Adjusted") +
autolayer(trendcycle(x11decomp), series="Trend") +
xlab("Year") + ylab("") + labs(color = "") +
ggtitle("X11 Seasonally Adjusted") +
scale_colour_manual(values=c("black","blue","red"), breaks=c("Raw Data","Seasonally Adjusted","Trend")) +
theme(legend.position="bottom")
STL is an acronym for “Seasonal and Trend decomposition using Loess”, wherein Loess is a method for estimating nonlinear relationships.
stldecomp <- stl(df_ts[,"HousingStart"], s.window="periodic", robust=TRUE)
forecast::autoplot(stldecomp) +
ggtitle("STL Decomposition")
Loess is data intensive method and traditionally two-sided, meaning that
we need data on either side of each point in order to apply the
adjustment.
forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
autolayer(seasadj(stldecomp), series="Seasonally Adjusted") +
autolayer(trendcycle(stldecomp), series="Trend") +
xlab("Year") + ylab("") + labs(color = "") +
ggtitle("STL Seasonally Adjusted") +
scale_colour_manual(values=c("black","blue","red"), breaks=c("Raw Data","Seasonally Adjusted","Trend")) +
theme(legend.position="bottom")
#### Compare the Decomp Methods
forecast::autoplot(df_ts[,"HousingStart"], series="Raw Data") +
autolayer(decomp$x-decomp$seasonal, series="Classical Additive Seasonally Adjusted") +
autolayer(seasadj(x11decomp), series="X11 Seasonally Adjusted") +
autolayer(seasadj(stldecomp), series="STL Seasonally Adjusted") +
xlab("Year") + ylab("") + labs(color = "") +
ggtitle("Seasonally Adjusted") +
scale_colour_manual(values=c("black","darkgreen","steelblue","darkred"),
breaks=c("Raw Data","Classical Additive Seasonally Adjusted",
"X11 Seasonally Adjusted", "STL Seasonally Adjusted")) +
theme(legend.position="bottom") + guides(colour = guide_legend(ncol = 2))
We’ll accomodate for this seasonally adjusted series within the next section.
Most standard time series techniques require stationarity before implementing. Therefore, it’s essential we are able to detect and remedies any violations to this assumption.
There are many forms of stationarity, but for our purposes we want the first and second moments of the DGP to be constant through time.
There are several tests to detect (non) stationarity.
The Augmented Dickey Fuller (ADF) test focuses on a manifestation of non stationarity called a unit root. A case of unit roots is the `Random Walk’ wherein today’s value is equal to yesterday’s + random noise.
Ho: Non Stationary (Unit Root aka Random Walk)
Ha: Stationary
library(tseries)
adf.test(df_ts[,"HousingStart"])
##
## Augmented Dickey-Fuller Test
##
## data: df_ts[, "HousingStart"]
## Dickey-Fuller = -2.3394, Lag order = 7, p-value = 0.4342
## alternative hypothesis: stationary
Notice the high pvalue implies we faily to reject (i.e. we have a non stationary series).
The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test can be used in several ways. We’ll focus on its importance for testing trend stationarity (i.e. is the series stationary after adjusting for its trend).
H0: Trend Stationary
Ha: Not Trend Stationary
kpss.test(df_ts[,"HousingStart"], null="Trend")
##
## KPSS Test for Trend Stationarity
##
## data: df_ts[, "HousingStart"]
## KPSS Trend = 0.49999, Truncation lag parameter = 5, p-value = 0.01
Note the low pvalue implies we reject and those we have a series that is not trend stationary.
Autocorrelation Function (ACF) tells us the correlation of a series with its own past. The Partial Autocorrelation Function (PACF) tells us the autocorrelation of a series with its own past after correlation for the impact of other lags of the series. The visualizes are collectively referred to as the (auto)correlogram.
library(astsa)
acf2(df_ts[,"HousingStart"], main = "ACF and PACF of Housing Starts Units")
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF 0.93 0.84 0.75 0.68 0.64 0.62 0.62 0.64 0.69 0.77 0.83 0.85 0.80
## PACF 0.93 -0.12 -0.10 0.08 0.21 0.06 0.09 0.19 0.37 0.26 0.14 0.01 -0.34
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF 0.72 0.62 0.54 0.50 0.47 0.47 0.48 0.53 0.59 0.64 0.66 0.61
## PACF -0.28 -0.13 -0.09 0.06 -0.05 0.03 -0.11 0.07 0.02 0.00 0.01 -0.08
## [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## ACF 0.52 0.42 0.34 0.30 0.27 0.26 0.27 0.31 0.37 0.42 0.44 0.38
## PACF -0.07 -0.16 -0.04 0.05 0.02 0.01 -0.06 -0.01 0.02 0.07 0.04 -0.14
## [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
## ACF 0.30 0.20 0.13 0.08 0.06 0.05 0.06 0.11 0.17 0.23 0.25
## PACF -0.07 -0.01 0.06 -0.01 0.05 0.04 -0.05 0.06 0.10 0.05 0.07
The ACF depicts a seasonal pattern (notice the wave cresting every 12mhts). The PACF strong persistence near the 1st lags, which might be helpful when we model the process with its own past lags (e.g. ARMA).
A common approach to overcoming non stationarity is by taking the (percent) difference in the series. Often, if a series is non-stationary, then computing the \(\Delta x_{t}=x_{t}-x_{t-1}\) is stationary. If the first difference is not stationary, we can compute a second difference analogously.
Often in our modeling process these growth rates are the objects of interest, which aids our interpretation of any results.
Let’s first difference the Housing Starts series and recompute the ADF test.
FirstDiff <- diff(df_ts[,"HousingStart"]) # compute the first difference in housing
FirstDiff <- na.omit(FirstDiff) #clean up the data
plot(FirstDiff)
adf.test(FirstDiff)
##
## Augmented Dickey-Fuller Test
##
## data: FirstDiff
## Dickey-Fuller = -16.219, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
Notice that the low pvalue suggests we now have a stationary process. This is confirmed by visual inspection of the time series plot.
We can also accomodate for seasonality within our differencing method by taking year-over-year growth rates.
SeasDiffFirstDiff <- diff(FirstDiff,lag = 12)
plot(SeasDiffFirstDiff)
Pro Tip: Converting your series to logs and then computing the difference approximates a growth RATE.
Converting to growth rates is also helpful to standardize our features. So, let’s do that for housing starts.
library("quantmod")
HousingStartGrowth = na.omit(Delt(df_ts[,"HousingStart"]))*100 # convert to growth rate
adf.test(HousingStartGrowth)
##
## Augmented Dickey-Fuller Test
##
## data: HousingStartGrowth
## Dickey-Fuller = -15.212, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
Notice again that the low pvalue indicates stationarity.
Let’s test, and adjust, for stationarity in housing prices
adf.test(df_ts[,"PriceIndex"])
##
## Augmented Dickey-Fuller Test
##
## data: df_ts[, "PriceIndex"]
## Dickey-Fuller = -2.688, Lag order = 7, p-value = 0.2868
## alternative hypothesis: stationary
The high pvalue suggests non-stationarity.
Let’s compute a 1st difference
PriceIndexGrowth = na.omit(Delt(df_ts[,"PriceIndex"]))*100
adf.test(PriceIndexGrowth)
##
## Augmented Dickey-Fuller Test
##
## data: PriceIndexGrowth
## Dickey-Fuller = -3.109, Lag order = 7, p-value = 0.1089
## alternative hypothesis: stationary
Notice the mildly high pvalue suggesting possible non stationarity. That could be due to the oscillation from seasonal patterns.
plot(PriceIndexGrowth)
So let’s adjust that series for seasonality.
PriceIndexGrowth_decomp<-decompose(df_ts[,"PriceIndex"], type="additive")
PriceIndexGrowthSA = PriceIndexGrowth - PriceIndexGrowth_decomp$seasonal
adf.test(PriceIndexGrowthSA)
##
## Augmented Dickey-Fuller Test
##
## data: PriceIndexGrowthSA
## Dickey-Fuller = -3.4693, Lag order = 7, p-value = 0.04553
## alternative hypothesis: stationary
The low pvalue now suggests a stationary series.
Let’s group all of these adjusted series together
df_adj <- data.frame(cbind(HousingStartGrowth, PriceIndexGrowthSA)) %>%
`colnames<-`(c("HousingStartGrowth","PriceIndexGrowth"))
df_adj$Date <- df$Date[c(-1)]
#df_adj<-na.omit(df_adj)
Now that we have adjusted our data series, let’s examine their time series properties.
tsdisplay(df_adj$HousingStartGrowth, main = "Time Series Plot of Housing Start Units Growth")
The spikes in the auto correlogram suggest some short term persistence
that we can incorporate into our modeling strategy later.
tsdisplay(df_adj$PriceIndexGrowth, main = "Time Series Plot of Price Index Growth")
Very little temporal persistence remains in our housing price
series.
Sometimes series lead/lag one another for `economic’ reasons. We can visualize and detect this cross auto correlations with a CCF (Cross Correlation Function).
Here we’ll examine whether growth of housing starts lead or lag price growth.
ccfvalues = ccf(x=df_adj$HousingStartGrowth, y=df_adj$PriceIndexGrowth,
main = "CCF of Housing Start Growth and Price Index Growth")
When lag = -1 we are computing the cor(HousingStart_t-1, Price_t); i.e. Starts LEAD Prices.
When lag = +1 we are computing the cor(HousingStart_t+1,Price_t); i.e. Starts LAG Prices.
Positive values indicate that the series move together. Negative values indicate the series move in opposite directions.
The blue lines provide a 95% CI.
There spikes in the negative lag area are suggestive of possible leading effects for housing starts.
Here is an alternative visualization.
library(astsa)
HSGrowth <- df_adj$HousingStartGrowth
PriceGrowth <- df_adj$PriceIndexGrowth
lag2.plot(series1 = HSGrowth,
series2 = PriceGrowth,
max.lag = 11)
# Note: series1 is the one that gets lagged.
Now that we’ve explore our data’s temporal patterns let’s attempt to explain. We’ll do so with traditional and time series econometric techniques.
Since we are dealing with time series data we need to be careful
about what information we are using and WHEN that occurs. Let’s focus on
a CONTEMPORANEOUS regression, wherein we are using data at time t to
explain our outcome variable (housing prices) at time t.
\[ Price_{t}=\alpha + \beta Starts_{t} +
u_{t}\] This is a traditional regression, but we need to be be
careful that the time series data doesn’t contaminate our residual with
a sickness called serial correlation, wherein the errors exhibit
autocorrelation. If your model is plagued by serial correlation your
estimators typically remain unbiased, but are not efficient.
LinearModel = lm(PriceIndexGrowth ~ HousingStartGrowth, data = df_adj)
summary(LinearModel)
##
## Call:
## lm(formula = PriceIndexGrowth ~ HousingStartGrowth, data = df_adj)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9081 -0.8349 -0.0580 0.8297 3.1626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.335232 0.052149 6.428 3.44e-10 ***
## HousingStartGrowth 0.024754 0.003955 6.259 9.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.081 on 429 degrees of freedom
## Multiple R-squared: 0.08366, Adjusted R-squared: 0.08153
## F-statistic: 39.17 on 1 and 429 DF, p-value: 9.429e-10
There are myriad visual and formal tests for serial correlation. Most operate on the residuals of the model.
Let’s inspect our residuals for troublesome patterns.
checkresiduals(LinearModel)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 387.27, df = 10, p-value < 2.2e-16
Notice the spikes in the residual ACF. These are indicative of some type of omitted temporal pattern, which might suggest the presence of serial correlation.
The Durbin Watson (DW) test asks whether the model residuals are auto correlated.
Ho: No auto correlation in residuals
Ha: Auto correlation in residuals
library(lmtest)
dwtest(formula = LinearModel, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: LinearModel
## DW = 0.36901, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0
The small pvalue suggests there is autocorrelation.
Breusch Godfrey (BG) Test
Ho: Model residuals not autocorrelated
Ha: Model residuals are autocorrelated
bgtest(formula = LinearModel)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: LinearModel
## LM test = 343.45, df = 1, p-value < 2.2e-16
The small pvalue suggests there is autocorrelation.
Ljung Box Test
Ho: Model residuals not autocorrelated
Ha: Model residuals are autocorrelated
Box.test(residuals(LinearModel), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(LinearModel)
## X-squared = 285.85, df = 1, p-value < 2.2e-16
The small pvalue suggests there is autocorrelation.
Eventhough its not needed here, let’s discuss remediation in case you encounter serial correlation in practice.
If we are lucky enough to know the precise form of the serial correlation, then we can include it directly in our modeling via GLS. However, that is seldom the case in practice. So, we rely upon HAC heteroscedasticity and autocorrelation consistent standard errors.
library(sandwich)
NW_VCOV_LinearModel <- NeweyWest(LinearModel, prewhite = F, adjust = T) # The pre-whitening and adjust are set to F, and T respectively to ensure the proper formula and small sample adjustments are made.
# Compute Standard Errors
HAC_err=sqrt(diag(NW_VCOV_LinearModel))
HAC_err
## (Intercept) HousingStartGrowth
## 0.081276725 0.003279163
# Compare with the Original Standard Errors
summary(LinearModel)$coefficient[,2]
## (Intercept) HousingStartGrowth
## 0.052149365 0.003955291
Notice the (slight) differences in the standard errors. If we did indeed have strong serial correlation the differences could be much larger.
An important feature of time series models is coefficient stability. i.e. is the `economic’ process stable through time? We can investigate by running our model over small windows through time.
Let’s run our regression over a fixed window of 30 periods, grab the results, then move the 30 period window forward by one month, rerun the regression, and repeat until the end of the sample.
library(rollRegres)
library(tidyverse)
RollingReg<-roll_regres(PriceIndexGrowth~HousingStartGrowth,
data = df_adj, width = 30,
do_compute = c("sigmas", "r.squareds", "1_step_forecasts"),
do_downdates = TRUE # TRUE if you want a rolling window regressions. Otherwise, an expanding window is used
)
RollingCoef<-as.data.frame(RollingReg$coefs)%>%na.omit()
RollingCoef<-rownames_to_column(RollingCoef, "Date")
RollingCoef$Date<-as.Date(df_adj$Date[-(1:29)])
#tail(RollingCoef,5)
RollingCoef_plotdf <- RollingCoef %>% pivot_longer(c("(Intercept)","HousingStartGrowth"))
ggplot(data = RollingCoef_plotdf, aes(x = Date,y = value, group = name, color = name)) +
geom_line() +
scale_x_date(date_labels = "%Y (%b)") + # format datetime
facet_grid(name~., scales = "free_y") +
labs(x = "", y = '', title = "Rolling Beta (Estimated Window Length = 30, Rolling Window)") +
theme(legend.position = "none")
Notice for the coefficient on housing starts changes through time,
implying that our underlying model may not have been stable during
certain episodes. We would have to remedy through subject matter
expertise by finding the reason behind this instability.
Let’s run our regression over 30 periods and grab the coefficient estimates. Then let’s add one month onto our sample, rerun the regression, grab the coefficients, and repeat.
ExpandReg<-roll_regres(PriceIndexGrowth~HousingStartGrowth, data = df_adj, width = 30,
do_compute = c("sigmas", "r.squareds", "1_step_forecasts"),
do_downdates = FALSE # TRUE if you want a rolling window regressions. Otherwise, an expanding window is used
)
ExpandCoef<-as.data.frame(ExpandReg$coefs)%>%na.omit()
ExpandCoef<-rownames_to_column(ExpandCoef, "Date")
ExpandCoef$Date<-as.Date(df_adj$Date[-(1:29)])
tail(ExpandCoef,5)
## Date (Intercept) HousingStartGrowth
## 398 2022-08-01 0.3438307 0.02446015
## 399 2022-09-01 0.3387951 0.02463122
## 400 2022-10-01 0.3361494 0.02471279
## 401 2022-11-01 0.3350249 0.02477165
## 402 2022-12-01 0.3352316 0.02475433
ExpandCoef_plotdf <- ExpandCoef %>% pivot_longer(c("(Intercept)","HousingStartGrowth"))
ggplot(data = ExpandCoef_plotdf, aes(x = Date,y = value, group = name, color = name)) +
geom_line() +
scale_x_date(date_labels = "%Y (%b)") + # format datetime
facet_grid(name~., scales = "free_y") +
labs(x = "", y = '', title = "Rolling Beta (Initial Estimated Window Length = 30, Expanding Window)") +
theme(legend.position = "none")
Notice how the change in the coefficients around the GFC impact this
expanding window model much longer than the moving window
counterpart.
Now that we’ve explored the data and have explained its movements, we can attempt to forecast. There are myriad models to forecast a time series. Some use exogenous features (e.g. housing starts lags to forecast housing prices), while others use the series’ own temporal persistence (i.e. today is a good predictor for tomorrow). We will explore a few common approaches.
Picking a training and test sample must be taken with extreme care in time series. Unlike cross sectional data, we can’t simply randomly choose some subset for testing since each row of our data frame (month) influences the next (i.e. temporal persistence). As such, we typically order the observations by time (they already are) and chose the most recent “x” observations as testing and all dates prior to train. Note: there are more sophisticated block bootstrapping methods that we won’t cover.
For our example we will use
Train: Jan1987-Dec2013
Test: Jan2014-end of sample
We could create separate dataframes from each subsample, but we will avoid that in order to illustrate some subseting features of the functions we will use.
Let’s adjust the dataframe into a time series object so that we can take advantage of certain time series features.
df_adjts <- ts(df_adj, start = c(1987,02), frequency = 12)
A distributed lag model forecasts a series with past values of some exogenous series (depicted here with one lag). \(Price_{t}=\alpha + \beta Starts_{t-1} + u_{t}\).
An autoregressive distributed lag (ADL) also includes lagged values of Price (depicted here with one lag). \(Price_{t}=\alpha + \beta Starts_{t-1} + \gamma Price_{t-1} + u_{t}\).
We can use the CCF to detect leads/lags in our training sample.
ccf(x=window(df_adjts[,"HousingStartGrowth"], end = c(2013,12)),
y=window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
main = "CCF of Housing Start Growth and Price Index Growth")
This spikes within the negative lag regions suggests that Starts appear to have some predictive power. Notice also that there appears to be a lagging aspect to the starts series as well. We will discuss this simultaneity later.
Given the temporal persistence of the price series, let’s try an ADL model. We’ll demonstrate with 2 lags on price growth and 12 on starts. We can also use information criterion to guide lag selection.
ADL_alldata <- ts.intersect(df_adjts,
PriceIndexGrowthLag1 = stats::lag(df_adjts[,"PriceIndexGrowth"], -1),
PriceIndexGrowthLag2 = stats::lag(df_adjts[,"PriceIndexGrowth"], -2),
HousingStartGrowthLag1 = stats::lag(df_adjts[,"HousingStartGrowth"], -1),
HousingStartGrowthLag2 = stats::lag(df_adjts[,"HousingStartGrowth"], -2),
HousingStartGrowthLag3 = stats::lag(df_adjts[,"HousingStartGrowth"], -3),
HousingStartGrowthLag4 = stats::lag(df_adjts[,"HousingStartGrowth"], -4),
HousingStartGrowthLag9 = stats::lag(df_adjts[,"HousingStartGrowth"], -9),
HousingStartGrowthLag10 = stats::lag(df_adjts[,"HousingStartGrowth"], -10),
HousingStartGrowthLag11 = stats::lag(df_adjts[,"HousingStartGrowth"], -11),
HousingStartGrowthLag12 = stats::lag(df_adjts[,"HousingStartGrowth"], -12)) %>%
`colnames<-`(c(colnames(df_adjts),
"PriceIndexGrowthLag1","PriceIndexGrowthLag2","HousingStartGrowthLag1",
"HousingStartGrowthLag2","HousingStartGrowthLag3","HousingStartGrowthLag4",
"HousingStartGrowthLag9","HousingStartGrowthLag10","HousingStartGrowthLag11","HousingStartGrowthLag12"))
ADL_traindata= window(ADL_alldata, end = c(2013,12))
ADL_testdata= window(ADL_alldata, start = c(2014,1))
### Estimate ADL Model
ADL <- lm(PriceIndexGrowth ~ PriceIndexGrowthLag1+PriceIndexGrowthLag2+
HousingStartGrowthLag1 + HousingStartGrowthLag2 + HousingStartGrowthLag3 + HousingStartGrowthLag4 +
HousingStartGrowthLag9 + HousingStartGrowthLag10 + HousingStartGrowthLag11+HousingStartGrowthLag12,
data = ADL_traindata)
summary(ADL)
##
## Call:
## lm(formula = PriceIndexGrowth ~ PriceIndexGrowthLag1 + PriceIndexGrowthLag2 +
## HousingStartGrowthLag1 + HousingStartGrowthLag2 + HousingStartGrowthLag3 +
## HousingStartGrowthLag4 + HousingStartGrowthLag9 + HousingStartGrowthLag10 +
## HousingStartGrowthLag11 + HousingStartGrowthLag12, data = ADL_traindata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85278 -0.12686 0.00694 0.13383 1.10126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0497163 0.0140624 3.535 0.000471 ***
## PriceIndexGrowthLag1 1.4902808 0.0469509 31.741 < 2e-16 ***
## PriceIndexGrowthLag2 -0.6466255 0.0465640 -13.887 < 2e-16 ***
## HousingStartGrowthLag1 -0.0043010 0.0012655 -3.399 0.000768 ***
## HousingStartGrowthLag2 -0.0047772 0.0012927 -3.695 0.000261 ***
## HousingStartGrowthLag3 -0.0037941 0.0012612 -3.008 0.002850 **
## HousingStartGrowthLag4 0.0024818 0.0012010 2.066 0.039647 *
## HousingStartGrowthLag9 0.0058646 0.0011571 5.068 7.04e-07 ***
## HousingStartGrowthLag10 0.0005399 0.0012257 0.440 0.659906
## HousingStartGrowthLag11 -0.0020908 0.0012364 -1.691 0.091878 .
## HousingStartGrowthLag12 -0.0034716 0.0012442 -2.790 0.005603 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2352 on 300 degrees of freedom
## Multiple R-squared: 0.955, Adjusted R-squared: 0.9535
## F-statistic: 637 on 10 and 300 DF, p-value: < 2.2e-16
acf2(residuals(ADL), main = "ACF and PACF of ADL Model Residuals")
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF -0.07 0.13 -0.08 0.23 0.08 0.05 0.05 0.08 0.09 0.02 0.18 0.37 0.10
## PACF -0.07 0.13 -0.06 0.21 0.13 0.01 0.07 0.05 0.05 0.01 0.16 0.41 0.15
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF 0.05 -0.04 0.07 0.10 -0.01 -0.02 0.09 0.05 0.02 0.12 0.21 0.04
## PACF 0.03 -0.07 -0.16 -0.04 -0.10 -0.10 0.07 -0.01 -0.05 -0.01 0.03 -0.05
## [,26] [,27] [,28]
## ACF -0.04 0.01 0.03
## PACF -0.09 0.07 0.04
The model residuals look nice and clean. Moreover, there appears to be just a bit of temporal persistence in the two month lag.
Let’s compute the predicted value for our first observation in the training sample:
predict(ADL, newdata = head(ADL_testdata,1), interval = "confidence", level=0.95)
## fit lwr upr
## 1 1.182278 1.090282 1.274275
The “fit” is the prediction by the ``fitted’’ model and the lwr and upr indicate the confidence bounds on that forecast.
Let’s compare to the observed value
window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation
## Jan
## 2014 1.46015
Our forecast was outside the upper bound, which is not good. We can quantify the quality of this forecast by various forecast evaluation criterion such as RMSFE (Root Mean Squared Forecast Errors). We’ll do so once we have a few models to compare.
There is a class of time series prediction models that act entirely on the trend-cycle decomposition we highlighted earlier.
plot(decompose(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), type="additive"))
A preliminary diagnostic to determine if this class of model will be
useful for you is whether the random component is very large in
magnitude relative to the trend and seasonal. If so, the model likely
won’t be very good.
Holt Winters is a flexible decomposition model well suited for forecasting. Within the R implementation there are several settings we can tune if desired:
• alpha: the “base value”. Higher alpha puts more weight on the most recent observations.
• beta: the “trend value”. Higher beta means the trend slope is more dependent on recent trend slopes.
• gamma: the “seasonal component”. Higher gamma puts more weighting on the most recent seasonal cycles.
### Fitting with Holt-Winters
HW1 <- HoltWinters(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), seasonal = "additive")
HW2 <- HoltWinters(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), seasonal = "additive",alpha=0.2,beta=0,gamma=0.1) # Custom Holt-Winters (Trendless)
### Plot fitted value of two models
{plot(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
ylab="Price Index Growth") # Visually evaluate the fits
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW2$fitted[,1], lty=2, col="red")
legend("bottomleft",legend=c("Raw Data", "R Fit Holt-Winters","Custom Fit Holt-Winters (Trendless)"),
col=c("black","red", "blue"), lty=c(1,2,2), cex=0.8, box.lty=0)}
Let’s predict one step ahead
predict(HW1, 1, prediction.interval = TRUE, level=0.95)
## fit upr lwr
## Jan 2014 1.443605 1.900272 0.9869381
And compare to the observed value in the testing set
window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation
## Jan
## 2014 1.46015
The confidence band is wider than for the ADL, but the forecast does appear much closer to the observed value.
AutoRegressive Moving Average (ARMA) models are the bedrock of time series modeling. The basic idea is that any series may depend upon its own past (AR), and the errors from that specification may also depend upon their own past (MA).
acf2(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
main = "ACF and PACF of Price Index Growth")
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## ACF 0.87 0.57 0.19 -0.18 -0.45 -0.55 -0.46 -0.21 0.14 0.50 0.78 0.87
## PACF 0.87 -0.81 -0.19 0.13 0.04 0.18 0.29 0.26 0.28 0.25 0.14 -0.13
## [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## ACF 0.76 0.48 0.12 -0.23 -0.48 -0.58 -0.51 -0.27 0.05 0.39 0.65 0.75
## PACF -0.27 0.02 0.04 -0.04 -0.02 -0.14 -0.05 -0.06 -0.02 0.06 0.09 -0.05
## [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF 0.66 0.40 0.07 -0.26 -0.50 -0.61 -0.54 -0.33 -0.03 0.29 0.54 0.65
## PACF -0.08 0.07 0.02 -0.08 0.02 -0.02 -0.02 -0.05 -0.02 0.03 0.02 -0.02
## [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48]
## ACF 0.57 0.34 0.03 -0.29 -0.53 -0.63 -0.58 -0.39 -0.10 0.20 0.44 0.56
## PACF -0.10 -0.01 -0.02 -0.03 0.03 -0.03 -0.04 -0.05 -0.08 0.04 0.07 0.06
The periodic swings in the ACF indicate some type of seasonality. The seasonal wave cresting roughly every twelve months is suggestive of an annual cycle. The spikes in the PACF suggest the possibility of a moving average component.
We can consider a 12month lag.
acf2(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12))%>%diff(lag = 12),
main = "ACF and PACF of Price Index Growth Diff 12", max.lag = 36)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF 0.88 0.75 0.62 0.54 0.46 0.40 0.35 0.31 0.28 0.21 0.14 0.05 0.05
## PACF 0.88 -0.12 -0.10 0.18 -0.08 0.04 0.03 -0.02 0.03 -0.20 -0.03 -0.08 0.30
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF 0.05 0.04 0.05 0.07 0.06 0.04 0.02 0.00 -0.01 -0.01 -0.03 -0.04
## PACF -0.09 -0.11 0.23 -0.03 -0.15 0.04 0.02 -0.01 -0.08 -0.01 -0.08 0.11
## [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## ACF -0.05 -0.03 -0.03 -0.03 -0.01 0.02 0.04 0.05 0.03 -0.01 -0.06
## PACF -0.03 0.00 0.09 0.04 -0.02 0.06 -0.01 -0.02 -0.14 -0.07 -0.16
Notice the seasonal wave of autocorrelations is gone. The remaining ACF decay and the single PACF indicate a remaining MA component.
We can fit arma model by indicating the number of AR lags and how many MA lags. In this case, let’s first an AR(1,0,0) [often written as AR(1)] \(Price_{t} = \alpha + \beta Price_{t-1}+u_{t}\)
ARMA1 <- arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
order=c(1, 0, 0))
summary(ARMA1)
##
## Call:
## arima(x = window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)), order = c(1,
## 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.8818 0.3375
## s.e. 0.0263 0.2378
##
## sigma^2 estimated as 0.2662: log likelihood = -245.33, aic = 496.65
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.008419172 0.515954 0.4503532 4.465093 217.5368 0.9684304
## ACF1
## Training set 0.7449836
An AR(2,0,0) would be \(Price_{t} = \alpha + \beta_{1} Price_{t-1}+\beta_{2}Price_{t-2}+u_{t}\)
An ARMA(1,1) \(Price_{t} = \alpha + \beta Price_{t-1}+\gamma u_{t-1}+u_{t}\)is fit by
ARMA11 <- arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
order=c(1, 0, 1))
summary(ARMA11)
##
## Call:
## arima(x = window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)), order = c(1,
## 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## 0.8382 0.6346 0.3329
## s.e. 0.0311 0.0320 0.2108
##
## sigma^2 estimated as 0.1451: log likelihood = -147.85, aic = 303.7
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.005567083 0.3809118 0.3210698 -61.76058 180.4082 0.6904219
## ACF1
## Training set 0.3422065
The c(p,d,q) elements indicate the AR(p), MA(q), and the number (d) of differences.
Seasonal components can be added by the following where the ordering of the seasonal component follows analogously. ARMA(1,0,0)(1,1,2)12 would be \(Price_{t}=\alpha+\beta_{1} Price_{t-1}+\beta_{2}Price_{t-12}+\gamma_{1}u_{t-12}+\gamma_{2}u_{t-24}+u_{t}\)
ARMA11S <- arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
order=c(1, 0, 0), seasonal = list(order = c(0,1,2), period = 12))
summary(ARMA11S)
##
## Call:
## arima(x = window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12)), order = c(1,
## 0, 0), seasonal = list(order = c(0, 1, 2), period = 12))
##
## Coefficients:
## ar1 sma1 sma2
## 0.9249 -0.4969 -0.1038
## s.e. 0.0219 0.0572 0.0608
##
## sigma^2 estimated as 0.037: log likelihood = 68.29, aic = -128.58
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0005726178 0.1887553 0.1212657 1.169253 94.81513 0.2607673
## ACF1
## Training set 0.1697167
Notice the ARMA results provided myriad evaluation criterion (e.g. aic). We can leverage that information to choose the optimal model form.
ARIMA_auto <- auto.arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), seasonal = TRUE)
summary(ARIMA_auto)
## Series: window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12))
## ARIMA(2,0,2)(0,1,2)[12]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1 sma2
## 0.2946 0.5151 0.8426 0.3384 -0.5532 -0.0982
## s.e. 0.1622 0.1511 0.1566 0.0571 0.0601 0.0622
##
## sigma^2 = 0.03436: log likelihood = 81.95
## AIC=-149.91 AICc=-149.54 BIC=-123.73
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001795185 0.1801295 0.1150071 -24.40756 108.185 0.3594878
## ACF1
## Training set -0.002475849
Here, R found an ARMA(2,0,2) with a 2 month seasonal component.
\(Price_{t}=\alpha+\beta_{1}
Price_{t-1}+\beta_{2}Price_{t-2}+\gamma_{1}u_{t-1}+\gamma_{2}u_{t-2}+\gamma_{t-2}u_{t-12}+\gamma_{4}u_{t-24}+u_{t}\)
Let’s look at the post regression diagnostics
Box.test(residuals(ARIMA_auto), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(ARIMA_auto)
## X-squared = 0.0019984, df = 1, p-value = 0.9643
Recall the Lung Box Ho: No Autocorrelation in the errors. The very high pvalue suggests we fail to reject (ie no remaining serial correlation).
tsdisplay(residuals(ARIMA_auto), main='Model Residuals')
The visual diagnostics look pretty good. There are some annual spikes in the residuals and some issues with the model surrounding the GFC, which is not surprising.
Let’s use this ARMA model to predict one step ahead into the testing sample.
predict(ARIMA_auto, 1, prediction.interval = TRUE, level=0.95)
## $pred
## Jan
## 2014 1.516968
##
## $se
## Jan
## 2014 0.1853686
And recall the observed value:
window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation
## Jan
## 2014 1.46015
The fit is pretty good within relatively small standard error.
One of the important characteristics of our price series is that its variability tends to be autoregressive. By imposing an autoregressive structure onto the residuals of the model we can capture this characteristic.
Let’s visualize the variance of the model residuals.
residualSqr = residuals(ARIMA_auto)^2
tsdisplay(residualSqr, main = "Residual Square")
Notice the persistence in the residual variance. This is indicative of
autoregressive volatility.
We can test formally for (G)ARCH type effects with an Engle LM Test, which determines if the residual variance is indeed related to its own past.
Ho: No ARCH effects (white noise)
Ha: ARCH effects
library(FinTS)
ArchTest(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)), lags = 1, demean = TRUE)
##
## ARCH LM-test; Null hypothesis: no ARCH effects
##
## data: window(df_adjts[, "PriceIndexGrowth"], end = c(2013, 12))
## Chi-squared = 168.18, df = 1, p-value < 2.2e-16
Notice that the small p value suggests we reject the null, implying we have ARCH-type effects.
Augmenting our ARMA model with GARCH effects is easy in R. One limitation is the inability to forecast seasonal ARIMA-GARCH directly. So, we will demonstrate without seasonality.
ARIMA_nonseasonal<-auto.arima(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)),
seasonal = FALSE)
predict(ARIMA_nonseasonal,1, prediction.interval = TRUE, level=0.95)
## $pred
## Jan
## 2014 1.281348
##
## $se
## Jan
## 2014 0.2582665
We’ll use that optimal ARMA structure as the basis for the “mean” specification. We’ll also use the “workhorse” GARCH specification of a GARCH(1,1), which includes one lag of the variance of the errors (GARCH) as well as one lag of the model residuals (ARCH).
library(rugarch)
spec <- ugarchspec(variance.model=list(model="sGARCH", garchOrder=c(1,1)),
mean.model=list(include.mean = T, armaOrder=c(3,2)),
distribution.model="norm")
garch_fit <- ugarchfit(spec = spec, data = window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)))
garch_fit
##
## *---------------------------------*
## * GARCH Model Fit *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model : sGARCH(1,1)
## Mean Model : ARFIMA(3,0,2)
## Distribution : norm
##
## Optimal Parameters
## ------------------------------------
## Estimate Std. Error t value Pr(>|t|)
## mu 0.526849 0.053394 9.86712 0.000000
## ar1 1.086071 0.206627 5.25620 0.000000
## ar2 0.049557 0.350780 0.14128 0.887652
## ar3 -0.532755 0.197247 -2.70095 0.006914
## ma1 0.522348 0.202317 2.58183 0.009828
## ma2 0.166843 0.074011 2.25429 0.024178
## omega 0.000313 0.000462 0.67765 0.497996
## alpha1 0.084036 0.024070 3.49128 0.000481
## beta1 0.914964 0.023651 38.68527 0.000000
##
## Robust Standard Errors:
## Estimate Std. Error t value Pr(>|t|)
## mu 0.526849 0.132759 3.96847 0.000072
## ar1 1.086071 0.237324 4.57632 0.000005
## ar2 0.049557 0.400229 0.12382 0.901457
## ar3 -0.532755 0.229035 -2.32608 0.020014
## ma1 0.522348 0.224112 2.33074 0.019767
## ma2 0.166843 0.078433 2.12719 0.033404
## omega 0.000313 0.000354 0.88490 0.376209
## alpha1 0.084036 0.015948 5.26923 0.000000
## beta1 0.914964 0.014157 64.63200 0.000000
##
## LogLikelihood : -9.24581
##
## Information Criteria
## ------------------------------------
##
## Akaike 0.11298
## Bayes 0.21824
## Shibata 0.11148
## Hannan-Quinn 0.15500
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
## statistic p-value
## Lag[1] 3.429 0.06405
## Lag[2*(p+q)+(p+q)-1][14] 57.483 0.00000
## Lag[4*(p+q)+(p+q)-1][24] 105.432 0.00000
## d.o.f=5
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
## statistic p-value
## Lag[1] 17.57 2.764e-05
## Lag[2*(p+q)+(p+q)-1][5] 17.67 8.445e-05
## Lag[4*(p+q)+(p+q)-1][9] 18.09 6.148e-04
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
## Statistic Shape Scale P-Value
## ARCH Lag[3] 0.06801 0.500 2.000 0.7943
## ARCH Lag[5] 0.16718 1.440 1.667 0.9730
## ARCH Lag[7] 0.52521 2.315 1.543 0.9760
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic: 1.998
## Individual Statistics:
## mu 0.40153
## ar1 0.03089
## ar2 0.12116
## ar3 0.30154
## ma1 0.05907
## ma2 0.16572
## omega 0.22809
## alpha1 0.30350
## beta1 0.25690
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic: 2.1 2.32 2.82
## Individual Statistic: 0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
## t-value prob sig
## Sign Bias 2.868 4.412e-03 ***
## Negative Sign Bias 1.689 9.220e-02 *
## Positive Sign Bias 7.478 7.388e-13 ***
## Joint Effect 64.255 7.241e-14 ***
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
## group statistic p-value(g-1)
## 1 20 24.93 0.16300
## 2 30 31.15 0.35845
## 3 40 44.18 0.26192
## 4 50 70.96 0.02176
##
##
## Elapsed time : 0.1534491
ugarchforecast(fitORspec = garch_fit, n.ahead = 1)
##
## *------------------------------------*
## * GARCH Model Forecast *
## *------------------------------------*
## Model: sGARCH
## Horizon: 1
## Roll Steps: 0
## Out of Sample: 0
##
## 0-roll forecast [T0=Dec 2013]:
## Series Sigma
## T+1 1.358 0.3124
For comparison, the observed testing sample observation was
window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation
## Jan
## 2014 1.46015
Notice the fit is quite good. Moreover, note that the GARCH model also produces a forecast of the variance “sigma” of the forecast.
A feed forward neural network with hidden layers can leverage lagged inputs for forecasting.
set.seed(123) # set random seed so that we can replicate the example
NNET <- nnetar(window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12)))
In the following, h = # of periods ahead to forecast. PI = True computes prediction intervals.
forecast(NNET, h = 1, PI=TRUE)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 1.48943 1.237953 1.737922 1.114619 1.882389
As a reminder
window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))#observation
## Jan
## 2014 1.46015
We get a great forecast with a modestly large confidence interval.
We must use caution when implementing a random forest model with time series data. As you may know, random forests are built by randomly selecting training data, upon which we estimate a regression tree. The process is repeated and then averaged together in an ensemble. The randomly selecting the observations to be used in the training set will disrupt the temporal persistence of the data. We need to ensure that the we are always using training data that occurred prior to the testing data. As such, we use forward-validation rather than k-fold cross validation.
This step of prepping the data is sometimes called time embedding.
lag_order <- 12 # the lags number you want to include in your model
RF_data <- embed(df_adjts[,"PriceIndexGrowth"], lag_order + 1) %>% # embedding matrix
`colnames<-`(c("Raw",paste("Lag",1:lag_order, sep=''))) %>% ts(end = end(df_adjts), frequency = 12)
head(RF_data)
## Raw Lag1 Lag2 Lag3 Lag4 Lag5
## Feb 1988 2.1239480 1.7525175 1.2347540 0.5337933 0.0853447 -0.3179031
## Mar 1988 1.6839198 2.1239480 1.7525175 1.2347540 0.5337933 0.0853447
## Apr 1988 0.9086591 1.6839198 2.1239480 1.7525175 1.2347540 0.5337933
## May 1988 0.2901266 0.9086591 1.6839198 2.1239480 1.7525175 1.2347540
## Jun 1988 -0.3085801 0.2901266 0.9086591 1.6839198 2.1239480 1.7525175
## Jul 1988 -0.3863442 -0.3085801 0.2901266 0.9086591 1.6839198 2.1239480
## Lag6 Lag7 Lag8 Lag9 Lag10 Lag11
## Feb 1988 -0.4963157 -0.3995805 -0.2208604 0.2851656 1.0196571 1.6127703
## Mar 1988 -0.3179031 -0.4963157 -0.3995805 -0.2208604 0.2851656 1.0196571
## Apr 1988 0.0853447 -0.3179031 -0.4963157 -0.3995805 -0.2208604 0.2851656
## May 1988 0.5337933 0.0853447 -0.3179031 -0.4963157 -0.3995805 -0.2208604
## Jun 1988 1.2347540 0.5337933 0.0853447 -0.3179031 -0.4963157 -0.3995805
## Jul 1988 1.7525175 1.2347540 0.5337933 0.0853447 -0.3179031 -0.4963157
## Lag12
## Feb 1988 2.2659993
## Mar 1988 1.6127703
## Apr 1988 1.0196571
## May 1988 0.2851656
## Jun 1988 -0.2208604
## Jul 1988 -0.3995805
RF_traindata <- window(RF_data, end = c(2013, 12))
RF_testdata <- window(RF_data, start = c(2014, 1))
y_train <- RF_traindata[, 1] # the target
X_train <- RF_traindata[, -1] # everything but the target
y_test <- RF_testdata[, 1]
X_test <- RF_testdata[, -1]
library(randomForest)
set.seed(123) # set random seed so that we can replicate the example
RF_fit <- randomForest(X_train, y_train)
predict(RF_fit, head(X_test,1))
## 1
## 1.426869
And a reminder of the testing sample
window(df_adjts[,"PriceIndexGrowth"], start = c(2014,1), end = c(2014,1))
## Jan
## 2014 1.46015
We get a relatively accurate forecast. Unfortunately, a confidence interval isn’t provided natively.
Let’s define Static and Dynamic multi step ahead forecasts.
Notice that static forecasts successively further into the future, which can be difficult. Meanwhile, dynamic forecasts build upno potentially error-prone prior forecasts, which can be equally troublesome.
Notice pure distributed lag models (i.e. \(Y_{t}=\alpha + \beta X_{t-1}+u_{T]\)}) don’t permit dynamic forecasts since they require a forecast of \(X\). Moreover, distributed lag models only permit static forecasts up util the horizon of the \(X\) lag (i.e. one in this case).
Let’s use our training set, which ended in Dec2013 to create a series of 24 static forecasts (ie through the end of 2015).
### ARIMA
ARIMA_mulpred <- ts(forecast(ARIMA_auto, h = 24)$mean, start = c(2014,1), frequency =12)
### ARMA-ARCH
Garch_mulpred <- ts(fitted(ugarchforecast(fitORspec = garch_fit, n.ahead = 24)), start = c(2014,1), frequency =12)
### Holt-Winter
HW1_mulpred <- ts(forecast(HW1, h = 24)$mean, start = c(2014,1), frequency =12)
### NNET
NNET_mulpred <- ts(forecast(NNET, h = 24)$mean, start = c(2014,1), frequency =12)
### Random Forest
# We will need to write a loop to get static multiple step ahead forecast
h <- 24
RF_mulpred <- matrix(NA, h, 1)
X_multest <- head(X_test,1)
for (i in 1:h){
set.seed(123)
RF_mulpred[i] <- predict(RF_fit, X_multest)
X_multest[2:ncol(X_multest)] <- X_multest[-ncol(X_multest)] # update the "regressor" with predicted value
X_multest[1] <- RF_mulpred[i]
}
RF_mulpred <- ts(RF_mulpred, start = c(2014,1), frequency =12)
### Plot Forecast
Test<-window(df_adjts[,"PriceIndexGrowth"], start = c(2014, 1), end=c(2015,12))
mulpreds <- ts.union(ARIMA_mulpred, Garch_mulpred, HW1_mulpred, NNET_mulpred, RF_mulpred,Test) %>%
`colnames<-`(c("ARMA","ARMA-GARCH","Holt Winters", "NNETAR", "Random Forest","TestSample"))
autoplot(mulpreds) +
labs(color ='', y = "Forecast", x = '', title = "Static 24-Step Ahead Forecast")
Let’s compute the out of sample (OOS) forecast errors
errOOS <- ts((window(df_adjts[,"PriceIndexGrowth"], start = c(2014, 1)) - mulpreds[,-6]),
start = c(2014,1), frequency =12)%>%
`colnames<-`(c("ARMA","ARMA-GARCH","Holt Winters", "NNETAR", "Random Forest"))
We can summarize the errors
summary(errOOS)
## ARMA ARMA-GARCH Holt Winters NNETAR
## Min. :-0.73038 Min. :-0.9960 Min. :-0.49944 Min. :-1.2421
## 1st Qu.:-0.42911 1st Qu.:-0.6311 1st Qu.:-0.09299 1st Qu.:-0.7166
## Median :-0.06286 Median :-0.1992 Median : 0.05930 Median :-0.2723
## Mean :-0.14576 Mean :-0.1896 Mean : 0.07614 Mean :-0.2477
## 3rd Qu.: 0.11068 3rd Qu.: 0.2678 3rd Qu.: 0.36356 3rd Qu.: 0.1962
## Max. : 0.41506 Max. : 0.7548 Max. : 0.60372 Max. : 0.8901
## Random Forest
## Min. :-0.89069
## 1st Qu.:-0.62201
## Median :-0.31827
## Mean :-0.35636
## 3rd Qu.:-0.06131
## Max. : 0.10892
Let’s construct the Mean Squared Forecast Error (MSFE).
MSFE<-colMeans(errOOS^2,na.rm = TRUE)
MSFE
## ARMA ARMA-GARCH Holt Winters NNETAR Random Forest
## 0.14387261 0.28120525 0.09217917 0.43127173 0.23429218
Interestingly, in this case forecasting accuracy declines with model complexity.
There are many uses of simulation in a time series context.
Parametric simulation (i.e. Monte Carlo) is often used when you have a good sense of the DGP. Meanwhile, non parametric simulation (i.e. Bootstrapping) relies upon successive draws of the empirical distribution.
In the following we will demonstrate using bootstrapping for construction forecast confidence intervals.
Recall that bootstrapping randomly resamples from the observed dataset in order to create many pseudo samples for analysis. The model of interest is implemented on each pseudo sample. The range of the resulting estimates provides a confidence interval for the estimation on the observed sample.
Unfortunately, random resampling disrupts the temporal persistence of the process. So, we rely upon block bootstrapping. Briefly, suppose your EDA suggests that the series has a temporal persistence of 12 months. Then we resample from the observed data by drawing blocks of 12 successive months. We can stitch these blocks together to generate a a pseudo sample of the desired size.
Note that due to the simulation process the pseudo samples you draw may be different each time. Best practice is to draw several hundred (or thousand) pseudo samples. For illustratives purpose, we’re only drawing for 4 pseudo samples here.
nT = nrow(df_adjts)
Train = window(df_adjts[,"PriceIndexGrowth"], end = c(2013,12))
Test = window(df_adjts[,"PriceIndexGrowth"], start = c(2014, 1))
TrainNum <- round(length(Train))
BootpathNum = 4
BootSamples <- bld.mbb.bootstrap(Train, BootpathNum,block_size=12) %>% data.frame() %>% ts(end= end(Train), frequency=12)
colnames(BootSamples) <- c(lapply(1:(BootpathNum), FUN =function(x){paste("Path",x,sep="")}))
plot(BootSamples, main = "Bootstrapped Samples")
Our goal here is to form a confidence interval around our forecast. We want to avoid making distributional assumptions, so we rely upon bootstraps.
ARMA_fit <- auto.arima(Train)
arimafc <- forecast(ARMA_fit, h=nT-TrainNum, level=95)
arimafc_err <- df_adjts[,"PriceIndexGrowth"] - arimafc$mean
Use that fitted model specification and parameter estimates to construct forecasts for each bootstrap sample.
Bootfcast <- matrix(0, nrow=BootpathNum, ncol=nT-TrainNum)
for(i in seq(BootpathNum)) {
ARMA_refit <- Arima(BootSamples[,i], model=ARMA_fit) # fit model to bootstrapped samples without re-estimating
Bootfcast[i,] <- forecast(ARMA_refit, h = nT-TrainNum)$mean}
We will gather all of the forecasts for Jan2014, and compute the 2.5% and 97.5% quantiles, which will serve to form a 95% confidence interval. Then repeat that for Feb2014, etc…
bootfc <- structure(list(
mean = ts(colMeans(Bootfcast), start = start(Test), frequency=12),
lower = ts(apply(Bootfcast, 2, quantile, prob=0.025), start = start(Test), frequency=12),
upper = ts(apply(Bootfcast, 2, quantile, prob=0.975), start = start(Test), frequency=12),
level = 95),
class="forecast")
{plot(bootfc, xlim = c(1988,2023),main="Bootstrap Forecasts", shadecols = "gray70",fcol = NA)
lines(df_adjts[,"PriceIndexGrowth"], col = "black", lty = 2)
legend("bottomleft",legend=c("Raw Data","Bootstrapped Confidence Interval"),
col=c("black","grey70"), lty=c(2,1), lwd = c(1,4), cex=0.8, box.lty=0)}
Notice that the blue line is the mean of the forecasts on the pseudo
sample. This is often referred to as BAGGING (i.e. bootstrap
aggregation).
The gray region illustrates the range of the 95% forecast interval.
Notice the large spike of the black line outside of the gray near the end of the sample. That implies our series experienced a “rare” event.
Aggregate forecasts across bootstrap sample (aka “bagging”) can be used to improve forecast accuracy by reducing one or more sources of uncertainty
• Parameter Uncertainty
• DGP Uncertainty (i.e. not sure if the DGP will be the same in the future)
• Functional Form Uncertainty (aka model uncertainty)
we can use bagging to reduce parameter uncertainty and uncertainty over instability of the DGP into the out of sample period. We will find the optimal function form via information criterion. We will then fit this functional form to each bootstrap sample, and then forecast. In doing so, the ARMA order remains the same, but the parameter estimates are optimized for each bootstrap path.
ARMA_fit <- auto.arima(Train) # optimal ARIMA is ARIMA(2,0,2)(0,1,2)[12]
bagfcast1 <- matrix(0, nrow=BootpathNum, ncol=nT-TrainNum)
for(i in seq(BootpathNum)) {
ARMA_refit <- Arima(BootSamples[,i],
order=c(2, 0, 2),
seasonal = list(order = c(0,1,2), period = 12)) # fitted ARIMA(2,0,2)(0,1,2)[12] with re-estimating
bagfcast1[i,] <- forecast(ARMA_refit, h = nT-TrainNum)$mean} %>% ts(frequency=12, start = start(Test))
We can use bagging to reduce i) parameter uncertainty, ii) uncertainty over instability of the DGP into the out of sample period, and iii) model uncertainty. We will find the optimal ARIMA structure for each bootstrap path, fit the ARMA model accordingly, and then forecast.
bagfcast2 <- matrix(0, nrow=BootpathNum, ncol=nT-TrainNum)
for(i in seq(BootpathNum)) {
ARMA_refit <- auto.arima(BootSamples[,i]) # fitted optimal arima model with each sample
bagfcast2[i,] <- forecast(ARMA_refit, h = nT-TrainNum)$mean}
Let’s generate bagging forecast errors in the testing period.
bagfc1 <- colMeans(bagfcast1) %>% ts(frequency=12, start = start(Test))
bagfc2 <- colMeans(bagfcast2) %>% ts(frequency=12, start = start(Test))
bagfc1_err <- df_adjts[,"PriceIndexGrowth"] - bagfc1
bagfc2_err <- df_adjts[,"PriceIndexGrowth"] - bagfc2
Now compare the forecast errors built from the optimal arima forecast, bootstrapped forecast, bagging1 and bagging2 forecast errors
{plot(arimafc_err, col = "red", xlab = "", ylab = "OOS Error", main = "ARIMA OOS Forecast Error")
lines(bagfc1_err, col = "green", lty = 3)
lines(bagfc2_err, col = "purple", lty = 4)
legend("bottomleft",legend=c("ARIMA Forecast OOS Error",
"Bagging 1 Forecast OOS Error (Parameter and DGP Uncertainty)",
"Bagging 2 Forecast OOS Error (Parameter, DGP and Model Uncertainty)"),
col=c("red", "green","purple"), lty=c(1,2,3,4), cex=0.6, box.lty=0)
}
Recall our earlier EDA wherein we saw that housing starts may both lead and lag price growth. Economically, it might be that rapid housing starts signal more supply, which depresses prices. Conversely, starts may be rising because prices are rising, which signal more profitability. This simultaneity will bias coefficient estimates of our univariate specifications. Instead, we might consider a simultaneous system of autoregressive specifications; i.e. a vector auto regression VAR.
We can rely upon information criterion to guide our choice.
library(vars)
VAR_data<-window(df_adjts[,c("PriceIndexGrowth","HousingStartGrowth")], end = c(2013,12))
VARselect(VAR_data, lag.max = 12, type = "const", season = 12)$selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 12 2 2 12
The AIC suggests 12 lags. This is rather high. Note that parameters proliferate quickly in a VAR. We will instead use 2 lags, as indicated by other criterion above.
VAR_est <- VAR(y = VAR_data, p = 2, type = "const")
summary(VAR_est)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: PriceIndexGrowth, HousingStartGrowth
## Deterministic variables: const
## Sample size: 321
## Log Likelihood: -1248.106
## Roots of the characteristic polynomial:
## 0.9238 0.9238 0.337 0.337
## Call:
## VAR(y = VAR_data, p = 2, type = "const")
##
##
## Estimation results for equation PriceIndexGrowth:
## =================================================
## PriceIndexGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const
##
## Estimate Std. Error t value Pr(>|t|)
## PriceIndexGrowth.l1 1.564252 0.032263 48.485 < 2e-16 ***
## HousingStartGrowth.l1 -0.005653 0.001218 -4.642 5.07e-06 ***
## PriceIndexGrowth.l2 -0.747722 0.035888 -20.835 < 2e-16 ***
## HousingStartGrowth.l2 -0.003578 0.001251 -2.860 0.00452 **
## const 0.060153 0.014597 4.121 4.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.2525 on 316 degrees of freedom
## Multiple R-Squared: 0.9462, Adjusted R-squared: 0.9455
## F-statistic: 1389 on 4 and 316 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation HousingStartGrowth:
## ===================================================
## HousingStartGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const
##
## Estimate Std. Error t value Pr(>|t|)
## PriceIndexGrowth.l1 0.65615 1.50634 0.436 0.663433
## HousingStartGrowth.l1 -0.08300 0.05686 -1.460 0.145369
## PriceIndexGrowth.l2 6.19221 1.67558 3.696 0.000259 ***
## HousingStartGrowth.l2 -0.10001 0.05842 -1.712 0.087892 .
## const -1.13884 0.68152 -1.671 0.095705 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 11.79 on 316 degrees of freedom
## Multiple R-Squared: 0.245, Adjusted R-squared: 0.2355
## F-statistic: 25.64 on 4 and 316 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth 0.06374 -0.6507
## HousingStartGrowth -0.65074 138.9374
##
## Correlation matrix of residuals:
## PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth 1.0000 -0.2187
## HousingStartGrowth -0.2187 1.0000
Notice that there are 2 “regression” tables. That’s because we have a system of two equations. Let’s interpret one of the #s from the first table. The second coefficient down is -.005653. That suggests that if housing starts growth rate increased by 1 unit (% since we computed growth rates) TODAY, then price growth (eg return) would fall by .005653 units (% since we computed growth rates) TOMORROW.
tsdisplay(VAR_est$varresult$PriceIndexGrowth$residuals, main = "Residuals of Price Index Growth")
tsdisplay(VAR_est$varresult$HousingStartGrowth$residuals, main = "Residuals of Housing Start Growth")
Post regression diagnostics for each of the equations in our system reveal some periodicity near 12 lags, indicating possible remaining seasonality.
VAR_est2 <- VAR(y = VAR_data, p = 2, season = 12,type = "const")
summary(VAR_est2)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: PriceIndexGrowth, HousingStartGrowth
## Deterministic variables: const
## Sample size: 321
## Log Likelihood: -1048.57
## Roots of the characteristic polynomial:
## 0.8255 0.4468 0.3949 0.3949
## Call:
## VAR(y = VAR_data, p = 2, type = "const", season = 12L)
##
##
## Estimation results for equation PriceIndexGrowth:
## =================================================
## PriceIndexGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const + sd1 + sd2 + sd3 + sd4 + sd5 + sd6 + sd7 + sd8 + sd9 + sd10 + sd11
##
## Estimate Std. Error t value Pr(>|t|)
## PriceIndexGrowth.l1 1.285677 0.051556 24.937 < 2e-16 ***
## HousingStartGrowth.l1 0.002678 0.001340 1.998 0.046624 *
## PriceIndexGrowth.l2 -0.387408 0.051455 -7.529 5.81e-13 ***
## HousingStartGrowth.l2 0.003744 0.001353 2.767 0.006011 **
## const 0.025058 0.012279 2.041 0.042146 *
## sd1 -0.215863 0.057597 -3.748 0.000213 ***
## sd2 -0.601785 0.067963 -8.855 < 2e-16 ***
## sd3 -1.015351 0.095401 -10.643 < 2e-16 ***
## sd4 -1.073953 0.110708 -9.701 < 2e-16 ***
## sd5 -0.969141 0.099566 -9.734 < 2e-16 ***
## sd6 -0.566908 0.094893 -5.974 6.44e-09 ***
## sd7 -0.714366 0.078505 -9.100 < 2e-16 ***
## sd8 -0.415923 0.077112 -5.394 1.39e-07 ***
## sd9 -0.233209 0.070602 -3.303 0.001070 **
## sd10 -0.188574 0.066287 -2.845 0.004744 **
## sd11 -0.086645 0.059756 -1.450 0.148094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.1961 on 305 degrees of freedom
## Multiple R-Squared: 0.9687, Adjusted R-squared: 0.9671
## F-statistic: 628.9 on 15 and 305 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation HousingStartGrowth:
## ===================================================
## HousingStartGrowth = PriceIndexGrowth.l1 + HousingStartGrowth.l1 + PriceIndexGrowth.l2 + HousingStartGrowth.l2 + const + sd1 + sd2 + sd3 + sd4 + sd5 + sd6 + sd7 + sd8 + sd9 + sd10 + sd11
##
## Estimate Std. Error t value Pr(>|t|)
## PriceIndexGrowth.l1 0.28652 2.17055 0.132 0.895068
## HousingStartGrowth.l1 -0.40217 0.05643 -7.127 7.43e-12 ***
## PriceIndexGrowth.l2 1.25020 2.16627 0.577 0.564284
## HousingStartGrowth.l2 -0.16057 0.05698 -2.818 0.005148 **
## const 0.70201 0.51697 1.358 0.175493
## sd1 11.10914 2.42487 4.581 6.74e-06 ***
## sd2 36.35284 2.86130 12.705 < 2e-16 ***
## sd3 29.01771 4.01645 7.225 4.04e-12 ***
## sd4 20.56456 4.66087 4.412 1.42e-05 ***
## sd5 13.67383 4.19180 3.262 0.001232 **
## sd6 7.45428 3.99505 1.866 0.063018 .
## sd7 7.48517 3.30511 2.265 0.024231 *
## sd8 6.52898 3.24647 2.011 0.045194 *
## sd9 10.97460 2.97238 3.692 0.000263 ***
## sd10 -3.91043 2.79073 -1.401 0.162164
## sd11 -8.26059 2.51578 -3.284 0.001144 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 8.254 on 305 degrees of freedom
## Multiple R-Squared: 0.6427, Adjusted R-squared: 0.6251
## F-statistic: 36.57 on 15 and 305 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth 0.03844 0.08769
## HousingStartGrowth 0.08769 68.12606
##
## Correlation matrix of residuals:
## PriceIndexGrowth HousingStartGrowth
## PriceIndexGrowth 1.00000 0.05419
## HousingStartGrowth 0.05419 1.00000
Recall the impact of possible serial correlation. So, it’s best practice to consider our robust standard errors.
library(lmtest)
library(sandwich)
coeftest(VAR_est2, vcov. = sandwich)
##
## t test of coefficients:
##
## Estimate Std. Error t value
## PriceIndexGrowth:(Intercept) 0.0250581 0.0161187 1.5546
## PriceIndexGrowth:PriceIndexGrowth.l1 1.2856775 0.0743862 17.2838
## PriceIndexGrowth:HousingStartGrowth.l1 0.0026776 0.0013815 1.9381
## PriceIndexGrowth:PriceIndexGrowth.l2 -0.3874079 0.0719956 -5.3810
## PriceIndexGrowth:HousingStartGrowth.l2 0.0037442 0.0016233 2.3066
## PriceIndexGrowth:sd1 -0.2158633 0.0593161 -3.6392
## PriceIndexGrowth:sd2 -0.6017850 0.0965587 -6.2323
## PriceIndexGrowth:sd3 -1.0153513 0.1076739 -9.4299
## PriceIndexGrowth:sd4 -1.0739527 0.1311516 -8.1886
## PriceIndexGrowth:sd5 -0.9691414 0.1179980 -8.2132
## PriceIndexGrowth:sd6 -0.5669085 0.1138084 -4.9813
## PriceIndexGrowth:sd7 -0.7143658 0.0902086 -7.9190
## PriceIndexGrowth:sd8 -0.4159226 0.0891303 -4.6665
## PriceIndexGrowth:sd9 -0.2332091 0.0777081 -3.0011
## PriceIndexGrowth:sd10 -0.1885737 0.0670889 -2.8108
## PriceIndexGrowth:sd11 -0.0866448 0.0625306 -1.3856
## HousingStartGrowth:(Intercept) 0.7020055 0.5784976 1.2135
## HousingStartGrowth:PriceIndexGrowth.l1 0.2865217 2.4756064 0.1157
## HousingStartGrowth:HousingStartGrowth.l1 -0.4021720 0.0566221 -7.1027
## HousingStartGrowth:PriceIndexGrowth.l2 1.2501990 2.3859033 0.5240
## HousingStartGrowth:HousingStartGrowth.l2 -0.1605687 0.0648869 -2.4746
## HousingStartGrowth:sd1 11.1091403 2.7827057 3.9922
## HousingStartGrowth:sd2 36.3528368 3.5435165 10.2590
## HousingStartGrowth:sd3 29.0177125 4.3264647 6.7070
## HousingStartGrowth:sd4 20.5645606 5.0209088 4.0958
## HousingStartGrowth:sd5 13.6738317 4.6677671 2.9294
## HousingStartGrowth:sd6 7.4542850 4.3179984 1.7263
## HousingStartGrowth:sd7 7.4851655 3.7126343 2.0161
## HousingStartGrowth:sd8 6.5289797 3.7005702 1.7643
## HousingStartGrowth:sd9 10.9746043 3.4085180 3.2198
## HousingStartGrowth:sd10 -3.9104308 3.0216621 -1.2941
## HousingStartGrowth:sd11 -8.2605923 3.0019954 -2.7517
## Pr(>|t|)
## PriceIndexGrowth:(Intercept) 0.1210787
## PriceIndexGrowth:PriceIndexGrowth.l1 < 2.2e-16 ***
## PriceIndexGrowth:HousingStartGrowth.l1 0.0535292 .
## PriceIndexGrowth:PriceIndexGrowth.l2 1.479e-07 ***
## PriceIndexGrowth:HousingStartGrowth.l2 0.0217489 *
## PriceIndexGrowth:sd1 0.0003211 ***
## PriceIndexGrowth:sd2 1.526e-09 ***
## PriceIndexGrowth:sd3 < 2.2e-16 ***
## PriceIndexGrowth:sd4 7.306e-15 ***
## PriceIndexGrowth:sd5 6.179e-15 ***
## PriceIndexGrowth:sd6 1.060e-06 ***
## PriceIndexGrowth:sd7 4.496e-14 ***
## PriceIndexGrowth:sd8 4.594e-06 ***
## PriceIndexGrowth:sd9 0.0029123 **
## PriceIndexGrowth:sd10 0.0052616 **
## PriceIndexGrowth:sd11 0.1668697
## HousingStartGrowth:(Intercept) 0.2258785
## HousingStartGrowth:PriceIndexGrowth.l1 0.9079364
## HousingStartGrowth:HousingStartGrowth.l1 8.658e-12 ***
## HousingStartGrowth:PriceIndexGrowth.l2 0.6006633
## HousingStartGrowth:HousingStartGrowth.l2 0.0138824 *
## HousingStartGrowth:sd1 8.207e-05 ***
## HousingStartGrowth:sd2 < 2.2e-16 ***
## HousingStartGrowth:sd3 9.642e-11 ***
## HousingStartGrowth:sd4 5.396e-05 ***
## HousingStartGrowth:sd5 0.0036522 **
## HousingStartGrowth:sd6 0.0853009 .
## HousingStartGrowth:sd7 0.0446625 *
## HousingStartGrowth:sd8 0.0786795 .
## HousingStartGrowth:sd9 0.0014213 **
## HousingStartGrowth:sd10 0.1965987
## HousingStartGrowth:sd11 0.0062834 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The pvalues from the coefficients above suggest some indication of today’s housing starts to explain tomorrow’s prices, but not vice versa. We need a test of combine the aggregated effects of these forces.
If Housing Starts today have an impact on Prices tomorrow, then Housing Starts are said to Granger Cause Prices.
causality(VAR_est2, cause = "HousingStartGrowth")$Granger
##
## Granger causality H0: HousingStartGrowth do not Granger-cause
## PriceIndexGrowth
##
## data: VAR object VAR_est2
## F-Test = 4.4408, df1 = 2, df2 = 610, p-value = 0.01217
Given the low pvalue we reject the null, suggesting that housing starts have a leading impact on prices.
Let’s examine the other direction.
causality(VAR_est2, cause = "PriceIndexGrowth")$Granger
##
## Granger causality H0: PriceIndexGrowth do not Granger-cause
## HousingStartGrowth
##
## data: VAR object VAR_est2
## F-Test = 1.7979, df1 = 2, df2 = 610, p-value = 0.1665
The relatively high pvalue suggest that the causal direction is only one way (housing starts to prices).
Another way to understand and use the VAR estimates is via a stress test. For instance, what happens to Prices over the next 2yrs if we get a shock to housing starts today?
irf_PriceIndexGrowth <- irf(VAR_est2, impulse = "HousingStartGrowth",
response = "PriceIndexGrowth",
cumulative = TRUE,
n.ahead = 20, boot = TRUE)
irf_PriceIndexGrowth$irf$HousingStartGrowth[3]
## [1] 0.07242436
plot(irf_PriceIndexGrowth, ylab = "Price Index Growth", main = "Cumulative Impact of 1unit shock to Housing Starts")
The IRF suggests that housing starts rise today, prices will rapidly for
about 6mths and then the impact begins to slow.
irf_HousingStartGrowth <- irf(VAR_est2, impulse = "PriceIndexGrowth",
cumulative = TRUE,
response = "HousingStartGrowth",
n.ahead = 20, boot = TRUE)
plot(irf_HousingStartGrowth, ylab = "Housing Start Growth", main = "Cumulative Impact from 1 unit shock to Prices")
A shock to prices has little impact for the first few months, but then may start to build over time.
Given the causality findings, we may consider refining the model. For now, we’ll use that estimated structure to predict both Prices and Starts 24-months head.
VAR_mulpred <- predict(VAR_est2, n.ahead = 24)
plot(VAR_mulpred, names = "PriceIndexGrowth", main = "Forecast of Price Growth")
plot(VAR_mulpred, names = "HousingStartGrowth", main = "Forecast of Housing Starts Growth")