Identify Predictors for Home Price in VT

Objective

This paper models Vermont (VT) home prices using a set of lagged predictors. The primary objective is to develop an estimation model that forecasts the dependent variable (home prices) for the next two years without requiring future forecasts of the predictor variables.

Data

The core data used in this analysis is sourced from Moody’s Analytics. It is important to note that while alternative real estate data sources exist—notably Zillow and FRED:realtor.com. Moody’s has an extensive coverage across economic, financial, and real estate indicators, often providing longer historical time series and proprietary forecast values. Conversely, the public-facing Zillow and realtor.com data typically begin around \(2016\).

The Dependent Variable for the analysis is: the year-over-year percent change of FHOFHOPIQ: The All-Transactions House Price Index (HPI) sourced from the Federal Housing Finance Agency (FHFA), accessed via the FRED database.

Code
# Time Series Machine Learning
library(tidymodels)
library(modeltime)
library(vip)

# EDA
library(DataExplorer)

# Core 
library(tidyverse)
library(timetk)
library(lubridate)
library(readxl)

moodys_tbl <- read_excel("../../../Vermont/docs from Tom/Moodys 0625.xlsx", sheet = "Vermont (Quarterly)", skip = 0, n_max = 98) %>%
    filter(!is.na(`VERMONT - QUARTERLY`))

moodys_us_tbl <- read_excel("../../../Vermont/docs from Tom/Moodys 0625.xlsx", sheet = "US (Quarterly)", skip = 0) %>%
    filter(!is.na(`UNITED STATES - QUARTERLY`))

home_inventory <- read_csv("../00_data/Housing_Inventory_VT_FRED.csv")

# Zillow
# https://www.zillow.com/research/data/
# ZORI (Smoothed): All Homes Plus Multifamily Time Series ($): County
# Downloaded on 10/16/2025
zillow <- read_csv("../00_data/County_zori_uc_sfrcondomfr_sm_month.csv")

1.0 Clean data

The following steps were taken to clean the time series data, induce stationarity, and prepare the final sample for modeling:

  1. Achieving Stationarity: To ensure the time series data met the critical assumption of stationarity necessary for time series modeling, the following transformations were applied:
    • Level or Dollar Variables (e.g., Home Prices, Net Migration): Stationarity was achieved by calculating the year-over-year percent change (\(\% \Delta Y/Y\)). This transformation removes level effects and standardizes volatility.
    • Rate or Percentage Variables (e.g., Unemployment Rate): For variables already expressed as percentages, stationarity was achieved by taking the year-over-year difference (\(Y_t - Y_{t-4}\)).
  2. Sample Truncation and Data Quality: The data set was truncated to begin on 1984-Q1.
    • Justification: Data prior to \(1984-01-01\) was removed because visualization of the Vermont home price series exhibited significantly higher volatility and irregularity during this early period, suggesting low data quality or structural instability that would compromise model estimation.
  3. Exclusion of Forecasted Values: Data for the 2025 Q2 (Second Quarter) and all subsequent periods were excluded from the final estimation sample, as these values represent external forecasts rather than observed historical data.
  4. Standardization: All transformed variables were standardized (centered by the mean and scaled by the standard deviation).
    • Justification: Standardization is a necessary step to ensure all predictors are on the same scale, which prevents any single variable from dominating algorithms that are sensitive to variable magnitude (e.g., Ridge or LASSO regression). This transformation also facilitates easier visual comparison of variable volatility.

Percent variables

Code
moodys_pct_tbl <- moodys_tbl %>% 
    
    # Create a dummy whether it's percent
    mutate(is_percent = if_else(...2 %>% str_detect("%"), "Yes", "No")) %>%
    # select(...2, is_percent) %>%
    
    # Select % 
    filter(is_percent == "Yes") %>%
    
    mutate(across(where(is.numeric), as.character)) %>%

    select(-2) %>%
    pivot_longer(-`VERMONT - QUARTERLY`, names_to = "date") %>%
    pivot_wider(names_from = `VERMONT - QUARTERLY`, values_from = value) %>%
    # Replace "ND" with ""
    mutate(across(everything(), ~str_replace(.,"ND",""))) %>%
    # Convert them to numeric
    mutate(across(-date, as.numeric)) %>%
    mutate(date = date %>% yq()) 

moodys_pct_tbl %>% glimpse()
Rows: 325
Columns: 10
$ date    <date> 1975-01-01, 1975-04-01, 1975-07-01, 1975-10-01, 1976-01-01, 1…
$ FLBR    <dbl> NA, NA, NA, NA, 8.53780, 8.44145, 8.10166, 7.82850, 7.64558, 7…
$ FHVACRQ <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FMBADC  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FMBAFSC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FDEBTC  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.15313, 20.4…
$ FCCDAF  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FCCDAB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FCCDBC  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FCCDTO  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Calculate year-over-year percent change of home price to remove stationarity and seasonality.

Code
moodys_pct_trans_tbl <- moodys_pct_tbl %>%
    
    # Difference mortgage variables to remove stationarity
    tk_augment_differences(-date, .lags = 4)
    
moodys_pct_trans_tbl
# A tibble: 325 × 19
   date        FLBR FHVACRQ FMBADC FMBAFSC FDEBTC FCCDAF FCCDAB FCCDBC FCCDTO
   <date>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 1975-01-01 NA         NA     NA      NA     NA     NA     NA     NA     NA
 2 1975-04-01 NA         NA     NA      NA     NA     NA     NA     NA     NA
 3 1975-07-01 NA         NA     NA      NA     NA     NA     NA     NA     NA
 4 1975-10-01 NA         NA     NA      NA     NA     NA     NA     NA     NA
 5 1976-01-01  8.54      NA     NA      NA     NA     NA     NA     NA     NA
 6 1976-04-01  8.44      NA     NA      NA     NA     NA     NA     NA     NA
 7 1976-07-01  8.10      NA     NA      NA     NA     NA     NA     NA     NA
 8 1976-10-01  7.83      NA     NA      NA     NA     NA     NA     NA     NA
 9 1977-01-01  7.65      NA     NA      NA     NA     NA     NA     NA     NA
10 1977-04-01  7.15      NA     NA      NA     NA     NA     NA     NA     NA
# ℹ 315 more rows
# ℹ 9 more variables: FLBR_lag4_diff1 <dbl>, FHVACRQ_lag4_diff1 <dbl>,
#   FMBADC_lag4_diff1 <dbl>, FMBAFSC_lag4_diff1 <dbl>, FDEBTC_lag4_diff1 <dbl>,
#   FCCDAF_lag4_diff1 <dbl>, FCCDAB_lag4_diff1 <dbl>, FCCDBC_lag4_diff1 <dbl>,
#   FCCDTO_lag4_diff1 <dbl>

Level Variables

Code
moodys_level_tbl <- moodys_tbl %>% 
    
    # Create a dummy whether it's percent
    mutate(is_percent = if_else(...2 %>% str_detect("%"), "Yes", "No")) %>%
    # select(...2, is_percent) %>%
    
    # Select % 
    filter(is_percent != "Yes") %>%
    
    mutate(across(where(is.numeric), as.character)) %>%

    select(-2) %>%
    pivot_longer(-`VERMONT - QUARTERLY`, names_to = "date") %>%
    pivot_wider(names_from = `VERMONT - QUARTERLY`, values_from = value) %>%
    # Replace "ND" with ""
    mutate(across(everything(), ~str_replace(.,"ND",""))) %>%
    # Convert them to numeric
    mutate(across(-date, as.numeric)) %>%
    mutate(date = date %>% yq())

moodys_level_tbl
# A tibble: 325 × 89
   date        RETQ RERMQ RE23Q REMFQ RETUQ RE51Q RETRQ RETTQ RE42Q RERTQ REFIQ
   <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 1975-01-01  161. 0.882  6.47  35.0  5.31  4.54  2.81  25.6  5.27  20.3  7.53
 2 1975-04-01  160. 0.863  7.64  34.0  5.26  4.41  2.84  25.6  5.39  20.2  7.65
 3 1975-07-01  162. 0.879  8.29  33.7  5.19  4.36  2.80  25.9  5.35  20.6  7.61
 4 1975-10-01  165. 0.881  7.99  34.9  5.29  4.38  2.85  26.1  5.37  20.7  7.66
 5 1976-01-01  166. 0.875  6.80  35.4  5.28  4.38  2.81  26.3  5.42  20.9  7.76
 6 1976-04-01  167. 0.873  8.08  35.9  5.32  4.41  2.83  26.2  5.43  20.7  7.76
 7 1976-07-01  170. 0.874  8.52  35.7  5.33  4.33  2.80  26.2  5.58  20.7  7.75
 8 1976-10-01  171. 0.873  8.14  35.8  5.24  4.36  2.72  26.8  5.58  21.2  7.86
 9 1977-01-01  174. 0.886  7.26  37.1  5.48  4.50  2.86  27.2  5.61  21.5  8.14
10 1977-04-01  177. 0.916  8.84  37.7  5.51  4.54  2.87  27.3  5.70  21.6  8.24
# ℹ 315 more rows
# ℹ 77 more variables: REPSQ <dbl>, REEHQ <dbl>, RELHQ <dbl>, RE81Q <dbl>,
#   RE62Q <dbl>, RE5411Q <dbl>, RE5416Q <dbl>, REGVQ <dbl>, RE311Q <dbl>,
#   RE312Q <dbl>, RE313Q <dbl>, RE314Q <dbl>, RE315Q <dbl>, RE316Q <dbl>,
#   RE321Q <dbl>, RE322Q <dbl>, RE323Q <dbl>, RE324Q <dbl>, RE325Q <dbl>,
#   RE326Q <dbl>, RE327Q <dbl>, RE331Q <dbl>, RE332Q <dbl>, RE333Q <dbl>,
#   RE334Q <dbl>, RE335Q <dbl>, RE336Q <dbl>, RE337Q <dbl>, RE339Q <dbl>, …

Calculate year-over-year percent change of home price to remove stationarity and seasonality.

Code
moodys_level_trans_tbl <- moodys_level_tbl %>% 
    
    # Transform home_price to YOY to remove stationarity
    mutate(across(.cols  = -date, 
                  .fns   = ~ ((.x - lag(.x, 4)) / lag(.x, 4)) * 100, 
                  .names = "{.col}_yoy"))
    
moodys_level_trans_tbl
# A tibble: 325 × 177
   date        RETQ RERMQ RE23Q REMFQ RETUQ RE51Q RETRQ RETTQ RE42Q RERTQ REFIQ
   <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 1975-01-01  161. 0.882  6.47  35.0  5.31  4.54  2.81  25.6  5.27  20.3  7.53
 2 1975-04-01  160. 0.863  7.64  34.0  5.26  4.41  2.84  25.6  5.39  20.2  7.65
 3 1975-07-01  162. 0.879  8.29  33.7  5.19  4.36  2.80  25.9  5.35  20.6  7.61
 4 1975-10-01  165. 0.881  7.99  34.9  5.29  4.38  2.85  26.1  5.37  20.7  7.66
 5 1976-01-01  166. 0.875  6.80  35.4  5.28  4.38  2.81  26.3  5.42  20.9  7.76
 6 1976-04-01  167. 0.873  8.08  35.9  5.32  4.41  2.83  26.2  5.43  20.7  7.76
 7 1976-07-01  170. 0.874  8.52  35.7  5.33  4.33  2.80  26.2  5.58  20.7  7.75
 8 1976-10-01  171. 0.873  8.14  35.8  5.24  4.36  2.72  26.8  5.58  21.2  7.86
 9 1977-01-01  174. 0.886  7.26  37.1  5.48  4.50  2.86  27.2  5.61  21.5  8.14
10 1977-04-01  177. 0.916  8.84  37.7  5.51  4.54  2.87  27.3  5.70  21.6  8.24
# ℹ 315 more rows
# ℹ 165 more variables: REPSQ <dbl>, REEHQ <dbl>, RELHQ <dbl>, RE81Q <dbl>,
#   RE62Q <dbl>, RE5411Q <dbl>, RE5416Q <dbl>, REGVQ <dbl>, RE311Q <dbl>,
#   RE312Q <dbl>, RE313Q <dbl>, RE314Q <dbl>, RE315Q <dbl>, RE316Q <dbl>,
#   RE321Q <dbl>, RE322Q <dbl>, RE323Q <dbl>, RE324Q <dbl>, RE325Q <dbl>,
#   RE326Q <dbl>, RE327Q <dbl>, RE331Q <dbl>, RE332Q <dbl>, RE333Q <dbl>,
#   RE334Q <dbl>, RE335Q <dbl>, RE336Q <dbl>, RE337Q <dbl>, RE339Q <dbl>, …

Aggregate

Code
moodys_level_trans_tbl %>%
    plot_time_series(date, FHOFHOPIQ_yoy, 
                     .smooth = FALSE, 
                     .title = "Low Data Quality Before 1984")
Code
moodys_trans_tbl <- left_join(moodys_level_trans_tbl, moodys_pct_trans_tbl) %>%
    
    # Remove original columns
    select(date, matches("(yoy|diff)")) %>%

    # Remove volatile early periods
    filter(date >= as.Date("1984-01-01")) %>%
    
    # Remove forecast
    filter(date <= as.Date("2025-01-01")) %>%

    # Preprocess Target
    mutate(across(-date, standardize_vec))

moodys_trans_tbl %>% glimpse()
Rows: 165
Columns: 98
$ date               <date> 1984-01-01, 1984-04-01, 1984-07-01, 1984-10-01, 19…
$ RETQ_yoy           <dbl> 1.27821943, 1.08100065, 0.77031955, 1.05341952, 0.9…
$ RERMQ_yoy          <dbl> -0.8541096, -0.6732553, -1.2157324, 0.4861364, -0.2…
$ RE23Q_yoy          <dbl> 1.48286074, 0.90985446, 2.00283938, 1.29084967, 1.4…
$ REMFQ_yoy          <dbl> 0.95820725, 0.82999837, 0.64148170, 0.72529339, 1.0…
$ RETUQ_yoy          <dbl> 1.80489186, 1.52977862, 0.33600690, 0.58011631, 0.5…
$ RE51Q_yoy          <dbl> -0.07418295, -0.25015357, 0.51547829, 0.82489738, 0…
$ RETRQ_yoy          <dbl> 1.86663984, 1.65284469, 0.23169559, 0.56714104, 0.6…
$ RETTQ_yoy          <dbl> 1.37503848, 1.64466584, 1.01899052, 1.03417743, 1.8…
$ RE42Q_yoy          <dbl> 0.26054248, 0.85277844, 1.06562545, 1.32309322, 2.2…
$ RERTQ_yoy          <dbl> 1.59731728, 1.75369061, 0.93502902, 0.88403884, 1.6…
$ REFIQ_yoy          <dbl> 1.7096173, 1.3846039, 0.9054173, 1.8517462, 1.17734…
$ REPSQ_yoy          <dbl> 0.62001115, 0.74222649, 0.76830781, 1.24018929, 0.7…
$ REEHQ_yoy          <dbl> -0.12230657, -0.72673176, -0.45009717, -0.30341999,…
$ RELHQ_yoy          <dbl> 1.08641704, 1.06061974, 0.31049362, 0.69107773, 0.0…
$ RE81Q_yoy          <dbl> 1.2396833, 1.5951314, 0.7844142, 1.4973991, 0.25948…
$ RE62Q_yoy          <dbl> -0.535667914, -0.863566031, -0.962198100, -0.792753…
$ RE5411Q_yoy        <dbl> 1.4808237, 1.4808237, 1.4808237, 1.4808237, 1.92239…
$ RE5416Q_yoy        <dbl> 0.26248821, 0.26248821, 0.26248821, 0.26248821, 0.4…
$ REGVQ_yoy          <dbl> 1.0913087, 0.7300023, 0.4616505, 1.0606812, 0.87764…
$ RE311Q_yoy         <dbl> -0.19418192, 0.91600539, 0.28589546, 0.95278291, 0.…
$ RE312Q_yoy         <dbl> -0.2118578, -0.2637435, -0.3018458, -0.2913114, -0.…
$ RE313Q_yoy         <dbl> 0.274056765, 0.218055823, 0.176817754, 0.188065858,…
$ RE314Q_yoy         <dbl> 0.367829383, 0.307398257, 0.263294973, 0.275500988,…
$ RE315Q_yoy         <dbl> 0.321518228, 0.255510336, 0.207078028, 0.220492356,…
$ RE316Q_yoy         <dbl> -0.04284849, -0.06407498, -0.08077466, -0.07560492,…
$ RE321Q_yoy         <dbl> 5.598377e-01, 4.521951e-01, 3.730505e-01, 3.949578e…
$ RE322Q_yoy         <dbl> 1.09683181, 0.92558017, 0.79955069, 0.83435420, 1.1…
$ RE323Q_yoy         <dbl> 1.05862916, 0.91225523, 0.80473146, 0.83443027, 1.0…
$ RE324Q_yoy         <dbl> 0.2755192068, 0.2180437245, 0.1759451144, 0.1876210…
$ RE325Q_yoy         <dbl> 0.216161326, 0.105016323, 0.023448667, 0.046029337,…
$ RE326Q_yoy         <dbl> 0.43087565, 0.30533726, 0.21310480, 0.23854406, 0.4…
$ RE327Q_yoy         <dbl> 0.52850970, 0.39273313, 0.29291878, 0.32051500, 0.5…
$ RE331Q_yoy         <dbl> 0.6436158, 0.5632223, 0.5040266, 0.5203344, 0.66414…
$ RE332Q_yoy         <dbl> 0.37721119, 0.25552785, 0.16622643, 0.19084089, 0.4…
$ RE333Q_yoy         <dbl> 0.77711298, 0.63806822, 0.53593360, 0.56410223, 0.8…
$ RE334Q_yoy         <dbl> 0.81799165, 0.68755270, 0.59170566, 0.61815964, 0.8…
$ RE335Q_yoy         <dbl> 0.59361801, 0.46268543, 0.36627969, 0.39296366, 0.6…
$ RE336Q_yoy         <dbl> 0.490839976, 0.387979045, 0.312364508, 0.333231938,…
$ RE337Q_yoy         <dbl> 0.72280282, 0.59197058, 0.49589228, 0.52240273, 0.7…
$ RE339Q_yoy         <dbl> 0.082805774, -0.007528717, -0.073702295, -0.0554917…
$ REMFDQ_yoy         <dbl> 0.963410553, 0.779878352, 0.645001071, 0.682221791,…
$ REMFNQ_yoy         <dbl> 0.632524258, 0.751566769, 0.457742762, 0.667619955,…
$ REOFFQ_yoy         <dbl> 0.8261388, 0.6460373, 0.5149732, 1.2680680, 0.54631…
$ FLBF_yoy           <dbl> 0.5355560, 0.8327380, 0.6035291, 0.8803294, 1.25576…
$ FLBE_yoy           <dbl> 1.316009370, 1.477419304, 1.168347898, 1.180088125,…
$ FLBU_yoy           <dbl> -0.67226736, -0.68558824, -0.64709914, -0.53245089,…
$ FYPQ_yoy           <dbl> 0.95819998, 1.28220087, 1.93374590, 1.97739135, 1.7…
$ `FYP$Q_yoy`        <dbl> 0.4074392, 0.7196038, 1.5940573, 1.6555489, 1.36033…
$ FYPCONQ_yoy        <dbl> 1.9381502, 1.6949058, 1.9843699, 2.3903886, 1.33139…
$ FYPADJQ_yoy        <dbl> 3.11567423, 2.49503910, 1.50403920, 0.34155201, -0.…
$ FYPDIRQ_yoy        <dbl> 0.51414918, 1.15096251, 1.58907674, 1.68916157, 1.5…
$ FYPTRPQ_yoy        <dbl> -0.33067092, -0.42478225, -0.33700851, -0.27245323,…
$ FYPEWSQ_yoy        <dbl> 1.71699487, 1.26168180, 1.39910647, 1.63117514, 1.3…
$ FYPESSQ_yoy        <dbl> 1.4649746, 1.0162386, 1.1849678, 1.5477971, 1.01045…
$ FYPEPPQ_yoy        <dbl> -0.36284664, 1.13683844, 2.72928208, 1.93513968, 1.…
$ FYPEPPFRQ_yoy      <dbl> -0.838349763, -0.434829224, 0.939756907, 0.85163579…
$ FYPEPPNFQ_yoy      <dbl> 1.14663661, 1.69314458, 2.37438691, 1.39959914, 1.1…
$ FYPNWQ_yoy         <dbl> 0.06720115, 0.81843445, 1.60491230, 1.49502659, 1.3…
$ FYPDPIQ_yoy        <dbl> 1.0704175161, 1.5024714999, 1.8282710959, 1.8197438…
$ FAHEMF_yoy         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FPOPQ_yoy          <dbl> 0.3974774, 0.2983015, 0.2470605, 0.2382590, 0.25927…
$ FPOP0004Q_yoy      <dbl> 1.738631, 1.562097, 1.437477, 1.357751, 1.306224, 1…
$ FPOP0519Q_yoy      <dbl> -0.91239603, -0.82687261, -0.78949312, -0.77854640,…
$ FPOP2024Q_yoy      <dbl> -0.7312329, -0.8433117, -0.9057922, -0.9466412, -1.…
$ FPOP2544Q_yoy      <dbl> 2.0032268, 1.9746379, 1.9384584, 1.9060921, 1.88968…
$ FPOP4564Q_yoy      <dbl> -0.57696216, -0.57456345, -0.56144687, -0.53969774,…
$ FPOP65GQ_yoy       <dbl> -1.1119924, -1.1763855, -1.1521156, -1.0682000, -0.…
$ FHHOLDQ_yoy        <dbl> 0.2695279, 0.6826213, 1.0085714, 1.0967762, 1.08963…
$ FNMQ_yoy           <dbl> -0.022318196, -0.034238162, -0.051980869, -0.225716…
$ FBRQ_yoy           <dbl> 0.83864802, 0.86933972, 0.89246709, 0.90877151, 0.8…
$ FDRQ_yoy           <dbl> -0.47175666, -0.38014421, -0.29216918, -0.21330988,…
$ FRTFSQ_yoy         <dbl> 2.60358879, 2.07684771, 0.73327902, 0.80498429, 0.7…
$ FCCALLQ_yoy        <dbl> 1.25416680, 1.90066122, 2.37255418, 2.56167841, 2.6…
$ FCCREVQ_yoy        <dbl> 3.0563342, 3.5934180, 3.7639159, 3.4417772, 3.73819…
$ FBKPY_yoy          <dbl> 1.19096078, -0.27585093, -0.45294583, -0.81707620, …
$ FHPNR_yoy          <dbl> 2.890759316, 1.272656139, 0.557100127, 0.540916690,…
$ FHPN1_yoy          <dbl> 2.23948622, 1.16642915, 0.89507846, 0.79193803, -0.…
$ FHPNM_yoy          <dbl> 0.30376435, -0.01060779, -0.17056574, -0.15389989, …
$ FHX1_yoy           <dbl> 0.15519951, 0.26334238, 0.08016645, -0.64838504, -0…
$ FHX1MED_yoy        <dbl> -0.3285722, 0.9878435, 0.3475084, -0.3206653, 0.805…
$ FHXAFF_yoy         <dbl> 1.19248108, -0.00448020, -0.01917184, 0.05059087, -…
$ FMOSDQ_yoy         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FMOPDQ_yoy         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FMORDQ_yoy         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FYHHMEDQ_yoy       <dbl> 1.32298567, 1.88167353, 2.23845910, 2.40110631, 2.4…
$ FHOFHOPIQ_yoy      <dbl> -0.12635228, 0.74735933, 0.28871143, 0.09184348, 0.…
$ FGDPQ_yoy          <dbl> 1.7784335, 1.0793785, 0.7091805, 1.3509509, 1.10280…
$ `FGDP$Q_yoy`       <dbl> 1.37146917, 0.42232614, 0.03718233, 0.78747668, 0.7…
$ FLBR_lag4_diff1    <dbl> -1.50051304, -1.43133145, -1.20623914, -0.91678072,…
$ FHVACRQ_lag4_diff1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.3…
$ FMBADC_lag4_diff1  <dbl> 3.47543582, 3.13625357, -1.65971646, -1.94013065, -…
$ FMBAFSC_lag4_diff1 <dbl> 5.11552707, -0.02826542, -0.02826542, -0.02826542, …
$ FDEBTC_lag4_diff1  <dbl> 0.4490928, 0.7476560, 0.9541723, 1.1307768, 1.63890…
$ FCCDAF_lag4_diff1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FCCDAB_lag4_diff1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FCCDBC_lag4_diff1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ FCCDTO_lag4_diff1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Code
# Save Key Params
std_mean    <- 5.53812089201925
std_sd      <- 10.0094837714369

Data Description

Code
# Extract var description
moodys_desc_tbl <- moodys_tbl %>% 
    select(1:2) %>% 
    set_names(c("Var_abb","Var_desc")) %>%
    mutate(Source = "Moodys")
moodys_desc_us_tbl <- moodys_us_tbl %>% 
    select(1:2) %>% 
    set_names(c("Var_abb","Var_desc")) %>%
    filter(Var_desc != "0") %>%
    mutate(Source = "Moodys")

# Find the start date for each column
start_date_tbl <- moodys_trans_tbl %>%
  # Reshape the data from wide to long format
  # The 'date' column is kept, and all other columns are pivoted into
  # a new 'name' and 'value' column.
  pivot_longer(
    cols = -date,
    names_to = "Column_name",
    values_to = "value"
  ) %>%
  # Remove rows where the value is missing
  filter(!is.na(value)) %>%
  # Group the data by the original column names
  group_by(Column_name) %>%
  # For each group, keep only the first row
  # which corresponds to the first non-NA value in the original data.
  slice(1) %>%
  # Ungroup the data to finalize the output
  ungroup() %>%
  # Select only the column name and the corresponding start date
  select(Column_name, Start_date = date) %>%
  mutate(Var_abb = str_remove(Column_name, "(_yoy|_lag4_diff1)"))

moodys_desc_startdate_tbl <- left_join(moodys_desc_tbl, start_date_tbl)
    
moodys_desc_startdate_tbl
# A tibble: 97 × 5
   Var_abb Var_desc                                Source Column_name Start_date
   <chr>   <chr>                                   <chr>  <chr>       <date>    
 1 RETQ    Employment: Total nonfarm payroll, (Th… Moodys RETQ_yoy    1984-01-01
 2 RERMQ   Employment: Natural resources and mini… Moodys RERMQ_yoy   1984-01-01
 3 RE23Q   Employment: Construction, (Ths. #, SA)  Moodys RE23Q_yoy   1984-01-01
 4 REMFQ   Employment: Manufacturing, (Ths. #, SA) Moodys REMFQ_yoy   1984-01-01
 5 RETUQ   Employment: Transportation and utiliti… Moodys RETUQ_yoy   1984-01-01
 6 RE51Q   Employment: Information, (Ths. #, SA)   Moodys RE51Q_yoy   1984-01-01
 7 RETRQ   Employment: Transportation, (Ths., SA)  Moodys RETRQ_yoy   1984-01-01
 8 RETTQ   Employment: Total Trade, (Ths., SA)     Moodys RETTQ_yoy   1984-01-01
 9 RE42Q   Employment: Wholesale trade, (Ths. #, … Moodys RE42Q_yoy   1984-01-01
10 RERTQ   Employment: Retail trade, (Ths. #, SA)  Moodys RERTQ_yoy   1984-01-01
# ℹ 87 more rows
Code
moodys_desc_startdate_tbl %>% count(Start_date, sort = T)
# A tibble: 5 × 2
  Start_date     n
  <date>     <int>
1 1984-01-01    88
2 2006-07-01     4
3 1991-01-01     3
4 1987-01-01     1
5 2002-01-01     1
Code
start_date_after_1984_vec <- moodys_desc_startdate_tbl %>% 
    filter(Start_date > "1984-01-01") %>%
    pull(Column_name)

start_date_after_1984_vec
[1] "FAHEMF_yoy"         "FHVACRQ_lag4_diff1" "FCCDAF_lag4_diff1" 
[4] "FCCDAB_lag4_diff1"  "FCCDBC_lag4_diff1"  "FCCDTO_lag4_diff1" 
[7] "FMOSDQ_yoy"         "FMOPDQ_yoy"         "FMORDQ_yoy"        
Code
moodys_trans_1984_tbl <- moodys_trans_tbl %>%
    pivot_longer(-date) %>%
    # Filter out variables starting after 1984
    filter(!name %in% start_date_after_1984_vec) %>%
    pivot_wider(names_from = name, values_from = value)

moodys_trans_1984_tbl %>% glimpse()
Rows: 165
Columns: 89
$ date               <date> 1984-01-01, 1984-04-01, 1984-07-01, 1984-10-01, 19…
$ RETQ_yoy           <dbl> 1.27821943, 1.08100065, 0.77031955, 1.05341952, 0.9…
$ RERMQ_yoy          <dbl> -0.8541096, -0.6732553, -1.2157324, 0.4861364, -0.2…
$ RE23Q_yoy          <dbl> 1.48286074, 0.90985446, 2.00283938, 1.29084967, 1.4…
$ REMFQ_yoy          <dbl> 0.95820725, 0.82999837, 0.64148170, 0.72529339, 1.0…
$ RETUQ_yoy          <dbl> 1.80489186, 1.52977862, 0.33600690, 0.58011631, 0.5…
$ RE51Q_yoy          <dbl> -0.07418295, -0.25015357, 0.51547829, 0.82489738, 0…
$ RETRQ_yoy          <dbl> 1.86663984, 1.65284469, 0.23169559, 0.56714104, 0.6…
$ RETTQ_yoy          <dbl> 1.37503848, 1.64466584, 1.01899052, 1.03417743, 1.8…
$ RE42Q_yoy          <dbl> 0.26054248, 0.85277844, 1.06562545, 1.32309322, 2.2…
$ RERTQ_yoy          <dbl> 1.59731728, 1.75369061, 0.93502902, 0.88403884, 1.6…
$ REFIQ_yoy          <dbl> 1.7096173, 1.3846039, 0.9054173, 1.8517462, 1.17734…
$ REPSQ_yoy          <dbl> 0.62001115, 0.74222649, 0.76830781, 1.24018929, 0.7…
$ REEHQ_yoy          <dbl> -0.12230657, -0.72673176, -0.45009717, -0.30341999,…
$ RELHQ_yoy          <dbl> 1.08641704, 1.06061974, 0.31049362, 0.69107773, 0.0…
$ RE81Q_yoy          <dbl> 1.2396833, 1.5951314, 0.7844142, 1.4973991, 0.25948…
$ RE62Q_yoy          <dbl> -0.535667914, -0.863566031, -0.962198100, -0.792753…
$ RE5411Q_yoy        <dbl> 1.4808237, 1.4808237, 1.4808237, 1.4808237, 1.92239…
$ RE5416Q_yoy        <dbl> 0.26248821, 0.26248821, 0.26248821, 0.26248821, 0.4…
$ REGVQ_yoy          <dbl> 1.0913087, 0.7300023, 0.4616505, 1.0606812, 0.87764…
$ RE311Q_yoy         <dbl> -0.19418192, 0.91600539, 0.28589546, 0.95278291, 0.…
$ RE312Q_yoy         <dbl> -0.2118578, -0.2637435, -0.3018458, -0.2913114, -0.…
$ RE313Q_yoy         <dbl> 0.274056765, 0.218055823, 0.176817754, 0.188065858,…
$ RE314Q_yoy         <dbl> 0.367829383, 0.307398257, 0.263294973, 0.275500988,…
$ RE315Q_yoy         <dbl> 0.321518228, 0.255510336, 0.207078028, 0.220492356,…
$ RE316Q_yoy         <dbl> -0.04284849, -0.06407498, -0.08077466, -0.07560492,…
$ RE321Q_yoy         <dbl> 5.598377e-01, 4.521951e-01, 3.730505e-01, 3.949578e…
$ RE322Q_yoy         <dbl> 1.09683181, 0.92558017, 0.79955069, 0.83435420, 1.1…
$ RE323Q_yoy         <dbl> 1.05862916, 0.91225523, 0.80473146, 0.83443027, 1.0…
$ RE324Q_yoy         <dbl> 0.2755192068, 0.2180437245, 0.1759451144, 0.1876210…
$ RE325Q_yoy         <dbl> 0.216161326, 0.105016323, 0.023448667, 0.046029337,…
$ RE326Q_yoy         <dbl> 0.43087565, 0.30533726, 0.21310480, 0.23854406, 0.4…
$ RE327Q_yoy         <dbl> 0.52850970, 0.39273313, 0.29291878, 0.32051500, 0.5…
$ RE331Q_yoy         <dbl> 0.6436158, 0.5632223, 0.5040266, 0.5203344, 0.66414…
$ RE332Q_yoy         <dbl> 0.37721119, 0.25552785, 0.16622643, 0.19084089, 0.4…
$ RE333Q_yoy         <dbl> 0.77711298, 0.63806822, 0.53593360, 0.56410223, 0.8…
$ RE334Q_yoy         <dbl> 0.81799165, 0.68755270, 0.59170566, 0.61815964, 0.8…
$ RE335Q_yoy         <dbl> 0.59361801, 0.46268543, 0.36627969, 0.39296366, 0.6…
$ RE336Q_yoy         <dbl> 0.490839976, 0.387979045, 0.312364508, 0.333231938,…
$ RE337Q_yoy         <dbl> 0.72280282, 0.59197058, 0.49589228, 0.52240273, 0.7…
$ RE339Q_yoy         <dbl> 0.082805774, -0.007528717, -0.073702295, -0.0554917…
$ REMFDQ_yoy         <dbl> 0.963410553, 0.779878352, 0.645001071, 0.682221791,…
$ REMFNQ_yoy         <dbl> 0.632524258, 0.751566769, 0.457742762, 0.667619955,…
$ REOFFQ_yoy         <dbl> 0.8261388, 0.6460373, 0.5149732, 1.2680680, 0.54631…
$ FLBF_yoy           <dbl> 0.5355560, 0.8327380, 0.6035291, 0.8803294, 1.25576…
$ FLBE_yoy           <dbl> 1.316009370, 1.477419304, 1.168347898, 1.180088125,…
$ FLBU_yoy           <dbl> -0.67226736, -0.68558824, -0.64709914, -0.53245089,…
$ FYPQ_yoy           <dbl> 0.95819998, 1.28220087, 1.93374590, 1.97739135, 1.7…
$ `FYP$Q_yoy`        <dbl> 0.4074392, 0.7196038, 1.5940573, 1.6555489, 1.36033…
$ FYPCONQ_yoy        <dbl> 1.9381502, 1.6949058, 1.9843699, 2.3903886, 1.33139…
$ FYPADJQ_yoy        <dbl> 3.11567423, 2.49503910, 1.50403920, 0.34155201, -0.…
$ FYPDIRQ_yoy        <dbl> 0.51414918, 1.15096251, 1.58907674, 1.68916157, 1.5…
$ FYPTRPQ_yoy        <dbl> -0.33067092, -0.42478225, -0.33700851, -0.27245323,…
$ FYPEWSQ_yoy        <dbl> 1.71699487, 1.26168180, 1.39910647, 1.63117514, 1.3…
$ FYPESSQ_yoy        <dbl> 1.4649746, 1.0162386, 1.1849678, 1.5477971, 1.01045…
$ FYPEPPQ_yoy        <dbl> -0.36284664, 1.13683844, 2.72928208, 1.93513968, 1.…
$ FYPEPPFRQ_yoy      <dbl> -0.838349763, -0.434829224, 0.939756907, 0.85163579…
$ FYPEPPNFQ_yoy      <dbl> 1.14663661, 1.69314458, 2.37438691, 1.39959914, 1.1…
$ FYPNWQ_yoy         <dbl> 0.06720115, 0.81843445, 1.60491230, 1.49502659, 1.3…
$ FYPDPIQ_yoy        <dbl> 1.0704175161, 1.5024714999, 1.8282710959, 1.8197438…
$ FPOPQ_yoy          <dbl> 0.3974774, 0.2983015, 0.2470605, 0.2382590, 0.25927…
$ FPOP0004Q_yoy      <dbl> 1.738631, 1.562097, 1.437477, 1.357751, 1.306224, 1…
$ FPOP0519Q_yoy      <dbl> -0.91239603, -0.82687261, -0.78949312, -0.77854640,…
$ FPOP2024Q_yoy      <dbl> -0.7312329, -0.8433117, -0.9057922, -0.9466412, -1.…
$ FPOP2544Q_yoy      <dbl> 2.0032268, 1.9746379, 1.9384584, 1.9060921, 1.88968…
$ FPOP4564Q_yoy      <dbl> -0.57696216, -0.57456345, -0.56144687, -0.53969774,…
$ FPOP65GQ_yoy       <dbl> -1.1119924, -1.1763855, -1.1521156, -1.0682000, -0.…
$ FHHOLDQ_yoy        <dbl> 0.2695279, 0.6826213, 1.0085714, 1.0967762, 1.08963…
$ FNMQ_yoy           <dbl> -0.022318196, -0.034238162, -0.051980869, -0.225716…
$ FBRQ_yoy           <dbl> 0.83864802, 0.86933972, 0.89246709, 0.90877151, 0.8…
$ FDRQ_yoy           <dbl> -0.47175666, -0.38014421, -0.29216918, -0.21330988,…
$ FRTFSQ_yoy         <dbl> 2.60358879, 2.07684771, 0.73327902, 0.80498429, 0.7…
$ FCCALLQ_yoy        <dbl> 1.25416680, 1.90066122, 2.37255418, 2.56167841, 2.6…
$ FCCREVQ_yoy        <dbl> 3.0563342, 3.5934180, 3.7639159, 3.4417772, 3.73819…
$ FBKPY_yoy          <dbl> 1.19096078, -0.27585093, -0.45294583, -0.81707620, …
$ FHPNR_yoy          <dbl> 2.890759316, 1.272656139, 0.557100127, 0.540916690,…
$ FHPN1_yoy          <dbl> 2.23948622, 1.16642915, 0.89507846, 0.79193803, -0.…
$ FHPNM_yoy          <dbl> 0.30376435, -0.01060779, -0.17056574, -0.15389989, …
$ FHX1_yoy           <dbl> 0.15519951, 0.26334238, 0.08016645, -0.64838504, -0…
$ FHX1MED_yoy        <dbl> -0.3285722, 0.9878435, 0.3475084, -0.3206653, 0.805…
$ FHXAFF_yoy         <dbl> 1.19248108, -0.00448020, -0.01917184, 0.05059087, -…
$ FYHHMEDQ_yoy       <dbl> 1.32298567, 1.88167353, 2.23845910, 2.40110631, 2.4…
$ FHOFHOPIQ_yoy      <dbl> -0.12635228, 0.74735933, 0.28871143, 0.09184348, 0.…
$ FGDPQ_yoy          <dbl> 1.7784335, 1.0793785, 0.7091805, 1.3509509, 1.10280…
$ `FGDP$Q_yoy`       <dbl> 1.37146917, 0.42232614, 0.03718233, 0.78747668, 0.7…
$ FLBR_lag4_diff1    <dbl> -1.50051304, -1.43133145, -1.20623914, -0.91678072,…
$ FMBADC_lag4_diff1  <dbl> 3.47543582, 3.13625357, -1.65971646, -1.94013065, -…
$ FMBAFSC_lag4_diff1 <dbl> 5.11552707, -0.02826542, -0.02826542, -0.02826542, …
$ FDEBTC_lag4_diff1  <dbl> 0.4490928, 0.7476560, 0.9541723, 1.1307768, 1.63890…

2.0 Explore data

Code
# Create lags
moodys_trans_all_lags_tbl <- moodys_trans_1984_tbl %>%
    
    tk_augment_lags(.value = -date, .lags = 1:10) %>%
    drop_na()

# Calculate correlations
library(corrr)

# Calculate correlations and focus on FHOFHOPIQ_yoy
data_corr_tbl <- moodys_trans_all_lags_tbl %>%       
    select(-date) %>%    # Remove non-numeric columns like 'date'
    correlate() %>%
    focus(FHOFHOPIQ_yoy) %>%
    rename(correlation = FHOFHOPIQ_yoy) %>%
    arrange(-correlation)

data_corr_tbl
# A tibble: 967 × 2
   term               correlation
   <chr>                    <dbl>
 1 FHOFHOPIQ_yoy_lag1       0.956
 2 FHOFHOPIQ_yoy_lag2       0.909
 3 FHOFHOPIQ_yoy_lag3       0.851
 4 FHX1MED_yoy              0.785
 5 FHX1MED_yoy_lag1         0.782
 6 FHX1MED_yoy_lag2         0.768
 7 FHOFHOPIQ_yoy_lag4       0.760
 8 FHX1MED_yoy_lag3         0.725
 9 FHOFHOPIQ_yoy_lag5       0.690
10 FHX1MED_yoy_lag4         0.655
# ℹ 957 more rows
Code
# Label Cor Table
data_corr_labeled_tbl <- data_corr_tbl %>% 
    mutate(Var_abb = str_split_fixed(string = term, pattern = "_", n = 2)[,1]) %>%
    
    left_join(moodys_desc_startdate_tbl) %>%
    
    # Create a new column, lag
    mutate(lag = str_split_fixed(term, "_lag", 2)[,2] %>% as.numeric())

data_corr_labeled_tbl
# A tibble: 967 × 8
   term         correlation Var_abb Var_desc Source Column_name Start_date   lag
   <chr>              <dbl> <chr>   <chr>    <chr>  <chr>       <date>     <dbl>
 1 FHOFHOPIQ_y…       0.956 FHOFHO… FHFA Al… Moodys FHOFHOPIQ_… 1984-01-01     1
 2 FHOFHOPIQ_y…       0.909 FHOFHO… FHFA Al… Moodys FHOFHOPIQ_… 1984-01-01     2
 3 FHOFHOPIQ_y…       0.851 FHOFHO… FHFA Al… Moodys FHOFHOPIQ_… 1984-01-01     3
 4 FHX1MED_yoy        0.785 FHX1MED Existin… Moodys FHX1MED_yoy 1984-01-01    NA
 5 FHX1MED_yoy…       0.782 FHX1MED Existin… Moodys FHX1MED_yoy 1984-01-01     1
 6 FHX1MED_yoy…       0.768 FHX1MED Existin… Moodys FHX1MED_yoy 1984-01-01     2
 7 FHOFHOPIQ_y…       0.760 FHOFHO… FHFA Al… Moodys FHOFHOPIQ_… 1984-01-01     4
 8 FHX1MED_yoy…       0.725 FHX1MED Existin… Moodys FHX1MED_yoy 1984-01-01     3
 9 FHOFHOPIQ_y…       0.690 FHOFHO… FHFA Al… Moodys FHOFHOPIQ_… 1984-01-01     5
10 FHX1MED_yoy…       0.655 FHX1MED Existin… Moodys FHX1MED_yoy 1984-01-01     4
# ℹ 957 more rows
Code
# Identify lagged predictors
corr_top_tbl <- data_corr_labeled_tbl %>% 
    # Filter for strong correlation
    filter(correlation > 0.2) %>%
    select(term, correlation, Var_desc, lag, Start_date) %>%
    
    # Select lag with highest corr per column
    group_by(Var_desc) %>%
    arrange(-correlation) %>%
    slice(1) %>%
    ungroup() %>%
    
    # Remove contemporary predictors
    filter(lag >= 4)

corr_top_tbl %>%
    ggplot(aes(correlation, fct_reorder(Var_desc, correlation))) +
    geom_col(fill = "steelblue") +
    geom_text(aes(label = paste0("lag at ", lag)), hjust = 1, color = "white") +
    labs(title = "VT Home Price Predictors (YOY%) with Lags of 4Q or Longer",
         y = NULL) +
         
    # FIX: Shift the title to the left
    theme(
        # Set horizontal justification (hjust) of the plot title to 0 (left-aligned)
        plot.title = element_text(hjust = 1.2)
    )

Code
# Create a vector of correlated predictors
corr_top_vec <- corr_top_tbl %>% 
    # head(12) %>%
    mutate(Var = str_split_fixed(string = term, pattern = "_lag", n = 2)[,1]) %>%
    distinct(Var) %>% 
    pull()


moodys_trans_corr_tbl <- moodys_trans_1984_tbl %>% 
    
    # Select corr predictors + FHOFHOPIQ_yoy
    pivot_longer(-date) %>% 
    filter(name %in% c("FHOFHOPIQ_yoy", corr_top_vec))  %>%
    mutate(name = name %>% str_remove("_yoy$")) %>%

    # Add var description
    left_join(moodys_desc_tbl, by = c("name" = "Var_abb")) %>%
    
    select(date, Var_desc, value) %>%
    
    pivot_wider(names_from = Var_desc, values_from = value)


# Extract all variable names except date and dependent variable(s)
dep_var_desc_vec <- moodys_desc_tbl %>% filter(Var_abb == "FHOFHOPIQ") %>% pull(Var_desc)

ccf_vars <- moodys_trans_corr_tbl %>%
  select(-date, -all_of(dep_var_desc_vec)) %>%
  colnames()

g <- moodys_trans_corr_tbl %>%
    plot_acf_diagnostics(
        date, `FHFA All Transactions Home Price Index, (Index 1980Q1=100, SA)`,
        .ccf_vars = all_of(ccf_vars),
        .show_ccf_vars_only = TRUE,
        .facet_ncol = 2, 
        .lags = 12, 
        .interactive = FALSE,
        .title = "Identified Predictors Cross Correlation with VT Home Prices (YOY%)" 
    )+
  theme(strip.text = element_text(size = 9))   # Smaller facet title font

plotly::ggplotly(g)

4.0 Prep Data

4.1 Create full dataset

  • Extend to Future Window
  • Add any lags to full dataset
  • Add any external regressors to full dataset

How do I determine? rolling_periods <- c(30, 60, 90)

Code
horizon    <- 4
lag_period <- c(4) # first spike in ACF/PACF beyond the forecasting time horizon

# Create a vector of correlated predictors, lagged 
corr_top_lag_vec <- corr_top_tbl %>% pull(term)

data_prepared_full_tbl <- moodys_trans_1984_tbl %>%
    
    # Add future window
    bind_rows(
        future_frame(.data = ., .date_var = date, .length_out = horizon)
    ) %>%
    
    # Add external lags
    tk_augment_lags(-date, .lags = 1:10) %>% 

    # Select top predictors
    pivot_longer(-date) %>% 
    filter(name %in% c(corr_top_lag_vec, "FHOFHOPIQ_yoy")) %>% 
    pivot_wider(names_from = name, values_from = value) %>%
    
    # Format Columns
    rename_with(.cols = contains("lag"), .fn = ~ str_c("lag_", .))


data_prepared_full_tbl %>%
    pivot_longer(-date) %>%
    plot_time_series(date, value, name, .smooth = FALSE, 
                     .title = "VT Home Price and Lagged Predictors (YOY%; normalized)")
Code
corr_top_tbl %>% knitr::kable()
term correlation Var_desc lag Start_date
FCCREVQ_yoy_lag10 0.2603976 Consumer Credit: Revolving, (Mil. USD, SA) 10 1984-01-01
FCCALLQ_yoy_lag10 0.3493830 Consumer Credit: Total, (Mil. USD, SA) 10 1984-01-01
FHX1_yoy_lag10 0.3949063 Existing Single-Family Home Sales, (Ths. #, SAAR) 10 1984-01-01
FYPDPIQ_yoy_lag5 0.2539926 Income: Disposable Personal, (Mil. USD, SAAR) 5 1984-01-01
FYPEPPFRQ_yoy_lag8 0.2055911 Income: Farm Proprietors, (Mil. USD, SAAR) 8 1984-01-01
FYPEPPNFQ_yoy_lag4 0.3165328 Income: Non-farm Proprietors, (Mil. USD, SAAR) 4 1984-01-01
FYPEPPQ_yoy_lag5 0.3128554 Income: Total Proprietors, (Mil. USD, SAAR) 5 1984-01-01
FYP$Q_yoy_lag6 0.2470075 Real Personal Income, (Mil. Ch. 2017 USD, SAAR) 6 1984-01-01
FPOP2544Q_yoy_lag6 0.3037312 Resident Population: Both sexes - Age 25 to 44, (Ths. #, NSA) 6 1984-01-01
FHPN1_yoy_lag4 0.2436038 Residential Permits: Single-family, (#, SAAR) 4 1984-01-01
FHPNR_yoy_lag10 0.2238866 Residential Permits: Total, (#, SAAR) 10 1984-01-01
Code
data_prepared_full_tbl %>% tail(horizon + 1)
# A tibble: 5 × 13
  date       FHOFHOPIQ_yoy lag_FYPEPPNFQ_yoy_lag4 lag_FHPN1_yoy_lag4
  <date>             <dbl>                  <dbl>              <dbl>
1 2025-01-01         0.291                  0.273              0.683
2 2025-04-01        NA                      0.144             -0.347
3 2025-07-01        NA                     -0.111              0.148
4 2025-10-01        NA                     -0.321              0.445
5 2026-01-01        NA                     -0.347             -0.201
# ℹ 9 more variables: lag_FYPEPPQ_yoy_lag5 <dbl>, lag_FYPDPIQ_yoy_lag5 <dbl>,
#   `lag_FYP$Q_yoy_lag6` <dbl>, lag_FPOP2544Q_yoy_lag6 <dbl>,
#   lag_FYPEPPFRQ_yoy_lag8 <dbl>, lag_FCCALLQ_yoy_lag10 <dbl>,
#   lag_FCCREVQ_yoy_lag10 <dbl>, lag_FHPNR_yoy_lag10 <dbl>,
#   lag_FHX1_yoy_lag10 <dbl>

4.2 Separate into model and forecast data

Code
data_prepared_full_tbl %>% tail(horizon + 1)
# A tibble: 5 × 13
  date       FHOFHOPIQ_yoy lag_FYPEPPNFQ_yoy_lag4 lag_FHPN1_yoy_lag4
  <date>             <dbl>                  <dbl>              <dbl>
1 2025-01-01         0.291                  0.273              0.683
2 2025-04-01        NA                      0.144             -0.347
3 2025-07-01        NA                     -0.111              0.148
4 2025-10-01        NA                     -0.321              0.445
5 2026-01-01        NA                     -0.347             -0.201
# ℹ 9 more variables: lag_FYPEPPQ_yoy_lag5 <dbl>, lag_FYPDPIQ_yoy_lag5 <dbl>,
#   `lag_FYP$Q_yoy_lag6` <dbl>, lag_FPOP2544Q_yoy_lag6 <dbl>,
#   lag_FYPEPPFRQ_yoy_lag8 <dbl>, lag_FCCALLQ_yoy_lag10 <dbl>,
#   lag_FCCREVQ_yoy_lag10 <dbl>, lag_FHPNR_yoy_lag10 <dbl>,
#   lag_FHX1_yoy_lag10 <dbl>
Code
data_prepared_tbl <- data_prepared_full_tbl %>%
    filter(!is.na(FHOFHOPIQ_yoy))
data_prepared_tbl
# A tibble: 165 × 13
   date       FHOFHOPIQ_yoy lag_FYPEPPNFQ_yoy_lag4 lag_FHPN1_yoy_lag4
   <date>             <dbl>                  <dbl>              <dbl>
 1 1984-01-01       -0.126                 NA                 NA     
 2 1984-04-01        0.747                 NA                 NA     
 3 1984-07-01        0.289                 NA                 NA     
 4 1984-10-01        0.0918                NA                 NA     
 5 1985-01-01        0.738                  1.15               2.24  
 6 1985-04-01        0.904                  1.69               1.17  
 7 1985-07-01        0.0307                 2.37               0.895 
 8 1985-10-01        0.523                  1.40               0.792 
 9 1986-01-01        1.13                   1.14              -0.192 
10 1986-04-01        0.759                  0.0744             0.0214
# ℹ 155 more rows
# ℹ 9 more variables: lag_FYPEPPQ_yoy_lag5 <dbl>, lag_FYPDPIQ_yoy_lag5 <dbl>,
#   `lag_FYP$Q_yoy_lag6` <dbl>, lag_FPOP2544Q_yoy_lag6 <dbl>,
#   lag_FYPEPPFRQ_yoy_lag8 <dbl>, lag_FCCALLQ_yoy_lag10 <dbl>,
#   lag_FCCREVQ_yoy_lag10 <dbl>, lag_FHPNR_yoy_lag10 <dbl>,
#   lag_FHX1_yoy_lag10 <dbl>
Code
forecast_tbl <- data_prepared_full_tbl %>%
    filter(is.na(FHOFHOPIQ_yoy))
forecast_tbl
# A tibble: 4 × 13
  date       FHOFHOPIQ_yoy lag_FYPEPPNFQ_yoy_lag4 lag_FHPN1_yoy_lag4
  <date>             <dbl>                  <dbl>              <dbl>
1 2025-04-01            NA                  0.144             -0.347
2 2025-07-01            NA                 -0.111              0.148
3 2025-10-01            NA                 -0.321              0.445
4 2026-01-01            NA                 -0.347             -0.201
# ℹ 9 more variables: lag_FYPEPPQ_yoy_lag5 <dbl>, lag_FYPDPIQ_yoy_lag5 <dbl>,
#   `lag_FYP$Q_yoy_lag6` <dbl>, lag_FPOP2544Q_yoy_lag6 <dbl>,
#   lag_FYPEPPFRQ_yoy_lag8 <dbl>, lag_FCCALLQ_yoy_lag10 <dbl>,
#   lag_FCCREVQ_yoy_lag10 <dbl>, lag_FHPNR_yoy_lag10 <dbl>,
#   lag_FHX1_yoy_lag10 <dbl>

4.3 Split model data into training/test

Code
data_prepared_tbl
# A tibble: 165 × 13
   date       FHOFHOPIQ_yoy lag_FYPEPPNFQ_yoy_lag4 lag_FHPN1_yoy_lag4
   <date>             <dbl>                  <dbl>              <dbl>
 1 1984-01-01       -0.126                 NA                 NA     
 2 1984-04-01        0.747                 NA                 NA     
 3 1984-07-01        0.289                 NA                 NA     
 4 1984-10-01        0.0918                NA                 NA     
 5 1985-01-01        0.738                  1.15               2.24  
 6 1985-04-01        0.904                  1.69               1.17  
 7 1985-07-01        0.0307                 2.37               0.895 
 8 1985-10-01        0.523                  1.40               0.792 
 9 1986-01-01        1.13                   1.14              -0.192 
10 1986-04-01        0.759                  0.0744             0.0214
# ℹ 155 more rows
# ℹ 9 more variables: lag_FYPEPPQ_yoy_lag5 <dbl>, lag_FYPDPIQ_yoy_lag5 <dbl>,
#   `lag_FYP$Q_yoy_lag6` <dbl>, lag_FPOP2544Q_yoy_lag6 <dbl>,
#   lag_FYPEPPFRQ_yoy_lag8 <dbl>, lag_FCCALLQ_yoy_lag10 <dbl>,
#   lag_FCCREVQ_yoy_lag10 <dbl>, lag_FHPNR_yoy_lag10 <dbl>,
#   lag_FHX1_yoy_lag10 <dbl>
Code
splits <- time_series_split(data_prepared_tbl, assess = horizon, cumulative = TRUE)

splits %>%
    tk_time_series_cv_plan() %>%
    plot_time_series_cv_plan(date, FHOFHOPIQ_yoy)

5.0 Specify recipes

  • Time Series Signature - Adds bulk time-based features
  • Spline Transformation to index.num
  • External lags
Code
recipe_spec_base <- recipe(FHOFHOPIQ_yoy ~ ., data = training(splits)) %>%
    step_normalize(all_numeric_predictors())
recipe_spec_base %>% prep() %>% juice() %>% glimpse()
Rows: 161
Columns: 13
$ date                   <date> 1984-01-01, 1984-04-01, 1984-07-01, 1984-10-01…
$ lag_FYPEPPNFQ_yoy_lag4 <dbl> NA, NA, NA, NA, 1.12414863, 1.65832840, 2.32420…
$ lag_FHPN1_yoy_lag4     <dbl> NA, NA, NA, NA, 2.19602669, 1.14601454, 0.88049…
$ lag_FYPEPPQ_yoy_lag5   <dbl> NA, NA, NA, NA, NA, -0.3671872, 1.1002718, 2.65…
$ lag_FYPDPIQ_yoy_lag5   <dbl> NA, NA, NA, NA, NA, 1.07162468, 1.49722693, 1.8…
$ `lag_FYP$Q_yoy_lag6`   <dbl> NA, NA, NA, NA, NA, NA, 0.3804744, 0.6840663, 1…
$ lag_FPOP2544Q_yoy_lag6 <dbl> NA, NA, NA, NA, NA, NA, 1.947724, 1.919953, 1.8…
$ lag_FYPEPPFRQ_yoy_lag8 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, -0.90341016, -0…
$ lag_FCCALLQ_yoy_lag10  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1.20497…
$ lag_FCCREVQ_yoy_lag10  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.94685…
$ lag_FHPNR_yoy_lag10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.83920…
$ lag_FHX1_yoy_lag10     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.12902…
$ FHOFHOPIQ_yoy          <dbl> -0.12635228, 0.74735933, 0.28871143, 0.09184348…

6.0 Specify models

6.1 GLMNet

Code
# - Strengths: Very good for trend
# - Weaknesses: Not as good for complex patterns (i.e. seasonality)

model_spec_glmnet <- linear_reg(
    # Hyperparameter 1: The penalty (lambda)
    # Use 'tune()' to indicate this value will be optimized later
    penalty = tune(),
    
    # Hyperparameter 2: The mixture (alpha)
    # Use 'tune()' to indicate this value will be optimized later (0=Ridge, 1=Lasso)
    mixture = tune()
) %>%
    set_engine("glmnet")


recipe_spec_2_lag <- recipe_spec_base %>%
    step_rm(date) %>% # Remove linear features
    step_naomit(contains("lag"))

recipe_spec_2_lag %>% prep() %>% juice() %>% glimpse()
Rows: 151
Columns: 12
$ lag_FYPEPPNFQ_yoy_lag4 <dbl> -0.36138694, 0.19470376, -2.29994949, 0.0268751…
$ lag_FHPN1_yoy_lag4     <dbl> 0.53304863, 0.54453310, 0.46191413, 0.85928600,…
$ lag_FYPEPPQ_yoy_lag5   <dbl> 0.38926146, -0.11803183, 0.27332754, -2.3247218…
$ lag_FYPDPIQ_yoy_lag5   <dbl> 1.22869048, 0.73076201, 0.53986652, 0.61307754,…
$ `lag_FYP$Q_yoy_lag6`   <dbl> 1.3072017, 0.8066176, 0.5621954, 0.3397425, 0.2…
$ lag_FPOP2544Q_yoy_lag6 <dbl> 1.8374341, 1.8484795, 1.8917189, 1.9486293, 1.9…
$ lag_FYPEPPFRQ_yoy_lag8 <dbl> 1.13832695, 1.03714058, 0.42034672, 0.14563810,…
$ lag_FCCALLQ_yoy_lag10  <dbl> 1.2049717, 1.8397917, 2.3031633, 2.4888723, 2.6…
$ lag_FCCREVQ_yoy_lag10  <dbl> 2.9468519, 3.4644452, 3.6287558, 3.3183074, 3.6…
$ lag_FHPNR_yoy_lag10    <dbl> 2.839201480, 1.257093162, 0.557454761, 0.541631…
$ lag_FHX1_yoy_lag10     <dbl> 0.12902958, 0.24142523, 0.05104580, -0.70615638…
$ FHOFHOPIQ_yoy          <dbl> 1.2176159, 1.6745625, 1.5087107, 1.8329694, 1.6…
Code
# Note: The model specification (model_spec_glmnet) and recipe (recipe_spec_2_lag) are correct for tuning.

# 1. Create a tuning grid for the penalty and mixture
# We'll search across a predefined set of hyperparameters
glmnet_grid <- grid_regular(
    penalty(range = c(-3, 0)), # Log10 range for penalty (0.001 to 1)
    mixture(range = c(0, 1)),  # Range for mixture (Ridge to Lasso)
    levels = 10                # Create 100 combinations (10x10)
)

# 2. Create a 5-fold cross-validation object for tuning
resamples_cv <- vfold_cv(training(splits), v = 5)

# 3. Create the workflow
glmnet_wflow <- workflow() %>%
    add_recipe(recipe_spec_2_lag) %>%
    add_model(model_spec_glmnet)

# 4. TUNE THE MODEL (Replace fit() with tune_grid())
# This fits the model across all combinations in the grid using cross-validation
glmnet_tuned <- glmnet_wflow %>%
    tune_grid(
        resamples = resamples_cv,
        grid = glmnet_grid,
        control = control_grid(verbose = TRUE)
    )

# 5. SELECT THE BEST MODEL PARAMETERS
best_glmnet_params <- glmnet_tuned %>%
    select_best(metric = "rmse") # Selects the combination with the lowest Root Mean Squared Error

# 6. FINALIZE THE WORKFLOW (Insert the best parameters into the blueprint)
final_glmnet_wflow <- glmnet_wflow %>%
    finalize_workflow(best_glmnet_params)

# 7. FIT THE FINAL MODEL (Now that the parameters are fixed, we can use fit() on the full training data)
final_fit_glmnet <- final_glmnet_wflow %>%
    fit(training(splits))

# 8. DISPLAY THE NON-ZERO COEFFICIENTS (Final Step)
est_coeff_tbl <- final_fit_glmnet %>%
    # Pull the final fitted model object
    extract_fit_parsnip() %>%
    # Use tidy() on the final fit
    broom::tidy() %>%
    # Filter to show only predictors where the estimate is not zero (as per Lasso's effect)
    filter(estimate != 0) %>%
    # Arrange by the absolute value of the estimate to see the most important predictors
    arrange(desc(abs(estimate)))  %>%
    
    # Add var description
    filter(term != "(Intercept)") %>%
    head(10) %>%
    mutate(term = term %>% str_remove("^lag_")) %>%
    separate(term, into = c("term", "Lag"), sep = "_yoy_") %>%
    
    # Add var description
    left_join(moodys_desc_tbl, by = c("term" = "Var_abb")) 

est_coeff_tbl %>%
    
    ggplot(aes(estimate, fct_reorder(Var_desc, estimate))) +
    geom_col(fill = "steelblue") +
    geom_text(aes(label = Lag), 
              # Set hjust to 1 for positive bars, and 0 for negative bars
              hjust = ifelse(est_coeff_tbl$estimate > 0, 1, 0),
              color = "white") +
    labs(title = "Estimated Coefficients",
         subtitle = "All variables are in YOY percent",
         y = NULL, x = NULL)

6.2 XGBOOST

Code
# - Strengths: Best for seasonality & complex patterns
# - Weaknesses: 
#   - Cannot predict beyond the maximum/minimum target (e.g. increasing trend)
# - Solution: Model trend separately (if needed). 
#   - Can combine with ARIMA, Linear Regression, Mars, or Prophet
#   - prophet_boost & arima_boost:

library(doParallel) # For parallel processing
library(tictoc)     # For timing the tuning process (optional but helpful)

# 1. REVISE MODEL SPECIFICATION: Replace fixed values with tune()
model_spec_boost_tune <- boost_tree(
    mode = "regression",
    # Hyperparameters to tune:
    mtry = tune(),
    tree_depth = tune(),
    learn_rate = tune(), 
    # Fixed hyperparameters:
    trees = 3000, # Set high, as tuning will use early stopping
    min_n = 2
) %>%
    set_engine("xgboost")

recipe_spec_2 <- recipe_spec_base %>%
    step_rm(date) %>%
    step_naomit(starts_with("lag_"))

# 2. DEFINE THE TUNING GRID
# We use 'dials' functions to create a reasonable grid space for each hyperparameter
xgboost_grid <- grid_regular(
    mtry(range = c(2, 6)),          # Sample 2 to 6 predictors at a split
    tree_depth(range = c(3, 8)),    # Explore depths from shallow to moderate
    learn_rate(range = c(-2, -1)),  # Log10 range for learning rate (0.01 to 0.1)
    levels = 4                      # 4 levels for each parameter = 4^3 = 64 models
)

# 3. CREATE CROSS-VALIDATION RESAMPLES
set.seed(123)
resamples_cv <- vfold_cv(training(splits), v = 5) # 5-fold cross-validation

# 4. CREATE THE WORKFLOW
wflw_tune_xgboost <- workflow() %>%
    add_recipe(recipe_spec_2) %>% # Assuming recipe_spec_2 is correctly defined
    add_model(model_spec_boost_tune)

# 5. EXECUTE THE TUNING PROCESS (REPLACE fit() with tune_grid())
tic() # Start timer
# Set up parallel processing (optional, but highly recommended for speed)
all_cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = all_cores)

set.seed(123)
tune_results <- wflw_tune_xgboost %>%
    tune_grid(
        resamples = resamples_cv,
        grid = xgboost_grid,
        control = control_grid(
            save_pred = TRUE,
            verbose = TRUE,
            # Use a control function to implement early stopping within XGBoost
            extract = function(x) {
                if (x$submodel == 1) {
                    list(
                        # This tells the model to stop training early if performance stops improving
                        # It manages the 'trees' hyperparameter automatically.
                        validation_set = training(splits), 
                        validation_metric = "rmse"
                    )
                } else {
                    NULL
                }
            }
        )
    )
toc() # End timer
192.78 sec elapsed
Code
# 6. SELECT THE BEST MODEL PARAMETERS
best_params <- tune_results %>%
    select_best(metric = "rmse") # Choose parameters minimizing Root Mean Squared Error

# 7. FINALIZE THE WORKFLOW and FIT (The final step)
final_wflw_xgboost <- wflw_tune_xgboost %>%
    finalize_workflow(best_params)

# Fit the final model to the entire training set using the best parameters
final_fit_xgboost <- final_wflw_xgboost %>%
    fit(training(splits))

# 8. DISPLAY RESULTS (Variable Importance using the final model)
# final_fit_xgboost %>%
#     extract_fit_parsnip() %>%
#     vip::vip(num_features = 10, geom = "point")
library(xgboost)
library(vip)

# Extract the fitted xgboost model
xgb_model <- final_fit_xgboost %>%
  extract_fit_parsnip() %>%
  pluck("fit")

# Get variable importance as a tibble
vip_table <- xgb_model %>%
  xgb.importance(model = .) %>%
  as_tibble()

# View the table
vip_table %>%
    mutate(Feature = Feature %>% str_remove("^lag_")) %>%
    separate(Feature, into = c("Feature", "Lag"), sep = "_yoy_") %>%
    
    # Add var description
    left_join(moodys_desc_tbl, by = c("Feature" = "Var_abb")) %>%
    
    ggplot(aes(Gain, fct_reorder(Var_desc, Gain))) +
    geom_col(fill = "steelblue") +
    geom_text(aes(label = Lag), hjust = 1, color = "white") +
    labs(title = "Variable Importance",
         subtitle = "All variables are in YOY percent",
         y = NULL, x = NULL)

6.3 Standard Linear Model

Code
model_spec_lm <- linear_reg() %>%
    set_engine("lm")

recipe_spec_1 <- recipe_spec_base %>%
    step_rm(date) 

recipe_spec_1 %>% prep() %>% juice() %>% glimpse()
Rows: 161
Columns: 12
$ lag_FYPEPPNFQ_yoy_lag4 <dbl> NA, NA, NA, NA, 1.12414863, 1.65832840, 2.32420…
$ lag_FHPN1_yoy_lag4     <dbl> NA, NA, NA, NA, 2.19602669, 1.14601454, 0.88049…
$ lag_FYPEPPQ_yoy_lag5   <dbl> NA, NA, NA, NA, NA, -0.3671872, 1.1002718, 2.65…
$ lag_FYPDPIQ_yoy_lag5   <dbl> NA, NA, NA, NA, NA, 1.07162468, 1.49722693, 1.8…
$ `lag_FYP$Q_yoy_lag6`   <dbl> NA, NA, NA, NA, NA, NA, 0.3804744, 0.6840663, 1…
$ lag_FPOP2544Q_yoy_lag6 <dbl> NA, NA, NA, NA, NA, NA, 1.947724, 1.919953, 1.8…
$ lag_FYPEPPFRQ_yoy_lag8 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, -0.90341016, -0…
$ lag_FCCALLQ_yoy_lag10  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1.20497…
$ lag_FCCREVQ_yoy_lag10  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.94685…
$ lag_FHPNR_yoy_lag10    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.83920…
$ lag_FHX1_yoy_lag10     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.12902…
$ FHOFHOPIQ_yoy          <dbl> -0.12635228, 0.74735933, 0.28871143, 0.09184348…
Code
workflow_fit_lm_1_spline <- workflow() %>%
    add_model(model_spec_lm) %>%
    add_recipe(recipe_spec_1) %>%
    fit(training(splits))

workflow_fit_lm_1_spline %>% 
    pull_workflow_fit() %>%
    pluck("fit") %>%
    summary()

Call:
stats::lm(formula = ..y ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.73477 -0.48710  0.01705  0.44508  2.03691 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.01928    0.06043  -0.319 0.750177    
lag_FYPEPPNFQ_yoy_lag4  0.01909    0.08847   0.216 0.829504    
lag_FHPN1_yoy_lag4      0.09735    0.06555   1.485 0.139779    
lag_FYPEPPQ_yoy_lag5    0.36188    0.08523   4.246 3.96e-05 ***
lag_FYPDPIQ_yoy_lag5    0.17978    0.07778   2.311 0.022281 *  
`lag_FYP$Q_yoy_lag6`   -0.17944    0.07968  -2.252 0.025885 *  
lag_FPOP2544Q_yoy_lag6  0.09603    0.07801   1.231 0.220394    
lag_FYPEPPFRQ_yoy_lag8  0.22072    0.06412   3.442 0.000762 ***
lag_FCCALLQ_yoy_lag10   0.51625    0.10066   5.129 9.61e-07 ***
lag_FCCREVQ_yoy_lag10  -0.19990    0.11606  -1.722 0.087205 .  
lag_FHPNR_yoy_lag10     0.14924    0.06596   2.263 0.025215 *  
lag_FHX1_yoy_lag10      0.38830    0.06830   5.685 7.39e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7387 on 139 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.5212,    Adjusted R-squared:  0.4833 
F-statistic: 13.75 on 11 and 139 DF,  p-value: < 2.2e-16
Code
est_coeff_tbl <- workflow_fit_lm_1_spline %>% 
    pull_workflow_fit() %>%
    pluck("fit") %>%
    tidy() %>%
    
    # Add var description
    filter(term != "(Intercept)") %>%
    arrange(-abs(estimate)) %>%
    head(10) %>%
    mutate(term = term %>% str_remove_all("`")) %>%
    mutate(term = term %>% str_remove("^lag_")) %>%
    separate(term, into = c("term", "Lag"), sep = "_yoy_") %>%
    
    # Add var description
    left_join(moodys_desc_tbl, by = c("term" = "Var_abb")) 

est_coeff_tbl %>%
    
    ggplot(aes(estimate, fct_reorder(Var_desc, estimate))) +
    geom_col(fill = "steelblue") +
    geom_text(aes(label = Lag), 
              # Set hjust to 1 for positive bars, and 0 for negative bars
              hjust = ifelse(est_coeff_tbl$estimate > 0, 1, 0),
              color = "white") +
    labs(title = "Estimated Coefficients",
         subtitle = "All variables are in YOY percent",
         y = NULL, x = NULL)

7.0 Compare model performance

7.1 Modeltime table

Organize

Code
model_tbl <- modeltime_table(
    final_fit_glmnet,
    final_fit_xgboost,
    workflow_fit_lm_1_spline
) %>%
    update_model_description(2, "XGBOOST")

model_tbl
# Modeltime Table
# A tibble: 3 × 3
  .model_id .model     .model_desc
      <int> <list>     <chr>      
1         1 <workflow> GLMNET     
2         2 <workflow> XGBOOST    
3         3 <workflow> LM         

7.2 Calibration

Code
# - Calculates residual model errors on test set
# - Gives us a true prediction error estimate when we model with confidence intervals

calibration_tbl <- model_tbl %>%
    modeltime_calibrate(new_data = testing(splits))

# calibration_tbl %>%
#     slice(1) %>%
#     unnest(.calibration_data)

7.3 Test accuracy

  1. XGBoost is the Strongest Model (Based on \(R^2\))XGBoost has an \(R^2\) of \(0.651\). This means it has captured the underlying signal and variance of the target variable far better than the linear models (GLMNET at \(0.207\) and LM at \(0.214\)).The high \(R^2\) suggests that the complex, non-linear nature of XGBoost was necessary to adequately model the data.
  2. Linear Models (GLMNET and LM) are WeakBoth the regularized linear model (GLMNET) and the standard linear model (LM) have very low \(R^2\) values (around \(0.21\)), indicating they only explain about \(21\%\) of the variance.While GLMNET and LM have lower MAE and RMSE, this is likely because they make simpler, smoother predictions (closer to the mean), which can minimize absolute error while completely failing to capture the variable’s true movements (low \(R^2\)).
  3. All Models Struggle with Percentage Error (MAPE)The MAPE and sMAPE values are all over \(100\%\) (e.g., \(112\%\) to \(190\%\)).Interpretation of High MAPE: MAPE is unstable when the true values of the target variable are close to zero. Since your target variable is likely a Year-over-Year (YoY) percentage change, there are likely periods where the YoY change is close to zero, causing the percentage errors to balloon and rendering MAPE/sMAPE unreliable for comparison.
Code
# - Calculates common accuracy measures
# - MAE, MAPE, MASE, SMAPE, RMSE, R-SQUARED

calibration_tbl %>%
    modeltime_accuracy()
# A tibble: 3 × 9
  .model_id .model_desc .type   mae  mape  mase smape  rmse   rsq
      <int> <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1         1 GLMNET      Test  0.562  113.  1.21  190. 0.647 0.211
2         2 XGBOOST     Test  0.676  130.  1.46  187. 0.799 0.651
3         3 LM          Test  0.567  113.  1.22  190. 0.657 0.214

7.4 Test forecast

Code
# - Visualize the out-of-sample forecast

calibration_tbl %>%
    modeltime_forecast(
        new_data      = testing(splits),
        actual_data   = data_prepared_tbl,
        conf_interval = 0.80
    ) %>%
    
    # Invert Transformation
    mutate(across(.value:.conf_hi, .fns = ~ standardize_inv_vec(
        x    = .,
        mean = std_mean,
        sd   = std_sd
    ))) %>%
    plot_modeltime_forecast(
        .legend_max_width = 60,
        .legend_show = TRUE,
        .conf_interval_show = TRUE,
        .conf_interval_alpha = 0.20,
        .conf_interval_fill = "lightblue",
        .title = "NH Unemployment Rate Forecast"
    )

8.0 Refit

Code
refit_tbl <- calibration_tbl %>%
    modeltime_refit(data = data_prepared_tbl)

9.0 Forecast

Code
# * Final Forecast ----
# - 'new_data' vs 'h'
# - 'actual_data'
# - Preprocessing

refit_tbl %>%
    modeltime_forecast(
        # h = "8 weeks",
        new_data = forecast_tbl,
        actual_data = data_prepared_tbl,
        conf_interval = 0.80
    ) %>%
    
    # Invert Transformation
    mutate(across(.value:.conf_hi, .fns = ~ standardize_inv_vec(
        x    = .,
        mean = std_mean,
        sd   = std_sd
    ))) %>%
    
    plot_modeltime_forecast(
        .legend_max_width = 25,
        .conf_interval_fill = "lightblue",
        .interactive = TRUE,
        .legend_show = TRUE 
    )