Does "soft" data enhance "hard" data forecasts?

Sasidhar Maddipatla
December 5th 2021

What are hard data and soft data?

“Hard” data is obtained from government statistical agencies and other official sources and is scrutinized for insights into the broad economy. “Soft” data is derived from surveys, such as business, consumer confidence, and sentiment surveys, and is used similarly: to infer future business outcomes and performance.

In our business scenario, does adding soft data such as the volatility index and the uncertainty index improve the univariate (employment) forecast?

Import the monthly Federal Reserve Economic Data (FRED) series

  • Monthly employment data (hard data)

  • Monthly fear or volatility index (soft data)

  • Monthly uncertainty index (soft data)

#Employment Data

empl = pdfetch::pdfetch_FRED("PAYEMS")  #Monthly and seasonally adjusted data.

# Fear index data
vix = pdfetch::pdfetch_FRED("VIXCLS")  # the fear (volatility) index, daily
vix.m = xts::to.monthly(vix, indexAt = "yearmon", drop.time = TRUE)
vix.m = vix.m[, 4]   # keep the monthly closing value (4th OHLC column)

# Uncertainty index

uix = pdfetch::pdfetch_FRED("USEPUINDXD")  # the economic policy uncertainty index, daily
uix.m = xts::to.monthly(uix, indexAt = "yearmon", drop.time = TRUE)
uix.m = uix.m[, 4]   # keep the monthly closing value (4th OHLC column)

Convert the data to ts objects and trim them to a common sample

#Convert to TS

empl.ts = ts(empl, start = c(1939,1), end = c(2021, 11), frequency = 12)
vix.m.ts = ts(vix.m, start = c(1990,1), end = c(2021, 12), frequency = 12)
uix.m.ts = ts(uix.m, start = c(1985,1), end = c(2021, 12), frequency = 12)

# Trim all three series to the common window (Jan 1990 through Nov 2021)

empl.ts = window(empl.ts, start = c(1990,1), end = c(2021, 11), frequency = 12)
vix.m.ts = window(vix.m.ts, start = c(1990,1), end = c(2021, 11), frequency = 12)
uix.m.ts = window(uix.m.ts, start = c(1990,1), end = c(2021, 11), frequency = 12)

Visualize the employment data with soft data variables
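The plotting chunk itself is not shown here, so below is a minimal sketch (not the original code) of how the three series can be drawn together with base plot.ts once they share a common window:

# Hedged sketch only: draw each series in its own panel so the trend in
# employment and the spikes in the two indices are easy to compare.
plot(cbind(Employment = empl.ts,
           Volatility = vix.m.ts,
           Uncertainty = uix.m.ts),
     main = "Employment vs. volatility and uncertainty indices")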

[Figure: employment, volatility index, and uncertainty index series]

  • The employment data has a clear trend and is already seasonally adjusted by FRED.
  • The volatility and uncertainty indices appear to be stationary (a quick check is sketched just below).
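One quick way to sanity-check these observations (a sketch, not part of the original analysis) is forecast::ndiffs(), which estimates how many differences a series needs before it looks stationary:

forecast::ndiffs(empl.ts)    # employment: a trending series, so at least 1 is expected
forecast::ndiffs(vix.m.ts)   # volatility index: 0 would support the stationarity claim
forecast::ndiffs(uix.m.ts)   # uncertainty index: 0 would support the stationarity claim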

Split the data into training and test sets

# Packages used from here on (assumed to be installed)
library(forecast)   # auto.arima, nnetar, ndiffs, forecast, accuracy
library(ggplot2)    # autoplot / autolayer
library(TSstudio)   # ts_split
library(magrittr)   # %>%

# Employment data split (the last part of the series becomes the test set)
emp_split = ts_split(empl.ts)
length(emp_split$train); length(emp_split$test)
start(emp_split$train); end(emp_split$train)
start(emp_split$test); end(emp_split$test)

# Volatility and uncertainty variable splits

vix_split = ts_split(vix.m.ts)
length(vix_split$train); length(vix_split$test)
start(vix_split$train); end(vix_split$train)
start(vix_split$test); end(vix_split$test)

usep_split=ts_split(uix.m.ts)
length(usep_split$train);length(usep_split$test)
start(usep_split$train); end(usep_split$train)
start(usep_split$test); end(usep_split$test)

Forecast and visualize the employment data without regressors

  • See the forecast of the employment data without any regressors.
  • Record the RMSE as the baseline.
model_arima = auto.arima(emp_split$train)
accuracy(model_arima)   
                ME RMSE  MAE       MPE   MAPE   MASE    ACF1
Training set -1.34  123 93.8 -0.000791 0.0744 0.0435 -0.0101
arima_fc = forecast(model_arima, h = length(emp_split$test))
autoplot(arima_fc)+autolayer(emp_split$test)

[Figure: ARIMA forecast versus the test set]

Combine the regressors

  • We use the two regressors alongside the employment data to check whether they improve the univariate employment forecast.
  • Because we are using two soft data variables, they are combined into a single matrix and then supplied as the xreg argument to the forecast models.
  • This lets us test how much impact the combination has on the univariate forecast.
mod_data_train = cbind(vix_split$train, usep_split$train)
mod_data_test = cbind(vix_split$test, usep_split$test)
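One optional tweak that is not in the original code: give the combined regressor matrix readable column names so the xreg coefficients in the ARIMAX output are easier to interpret. The names "vix" and "uncertainty" below are arbitrary labels, not anything defined earlier.

colnames(mod_data_train) = c("vix", "uncertainty")  # hypothetical labels, purely for readability
colnames(mod_data_test)  = c("vix", "uncertainty")
head(mod_data_train)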

Ensemble Model:

  • The ensemble model takes the training dataset and the two regressors as arguments.
  • It fits the three forecast models below and then averages their forecasts to produce the final forecast (a rough check of this averaging follows the ensemble forecast plot below):
  • ARIMAX
  • NNETAR
  • TBATS
# models = "ant" selects auto.arima (a), nnetar (n), and tbats (t);
# the xreg matrix is passed only to the ARIMA and NNETAR components
# (TBATS does not accept external regressors).
model_hybrid_x = forecastHybrid::hybridModel(emp_split$train,
                                             a.args = list(xreg = mod_data_train),
                                             n.args = list(xreg = mod_data_train),
                                             models = "ant")
accuracy(model_hybrid_x)
           ME RMSE  MAE    MPE   MAPE   ACF1 Theil's U
Test set 6.88  119 92.8 0.0059 0.0732 0.0259     0.478

Visualize and compare the test set and the ensemble forecast

modelhybrid_fc = forecast(model_hybrid_x, h = length(emp_split$test),
                          xreg = mod_data_test)  # future (test-period) values of the regressors

autoplot(modelhybrid_fc) + autolayer(emp_split$test)

[Figure: ensemble (hybrid) forecast versus the test set]
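As a rough check of the averaging described above, the ensemble point forecast can be rebuilt by hand from the three component forecasts. This sketch assumes forecastHybrid's default equal weights and its component names (auto.arima, nnetar, tbats), neither of which is shown in the original post:

h = length(emp_split$test)
fc_a = forecast(model_hybrid_x$auto.arima, h = h, xreg = mod_data_test)  # ARIMAX component
fc_n = forecast(model_hybrid_x$nnetar, h = h, xreg = mod_data_test)      # NNETAR component
fc_t = forecast(model_hybrid_x$tbats, h = h)                             # TBATS component (no xreg)
manual_mean = (fc_a$mean + fc_n$mean + fc_t$mean) / 3
all.equal(as.numeric(manual_mean), as.numeric(modelhybrid_fc$mean))      # should be TRUE with equal weights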

Visualize the forecast and compare it with the complete employment data

autoplot(empl.ts, col = "darkred") + 
  autolayer(modelhybrid_fc) +autolayer(emp_split$test)

[Figure: ensemble forecast overlaid on the full employment series]

Visualize the mean (average) of the ensemble model forecast

autoplot(empl.ts) + 
  autolayer(modelhybrid_fc$mean)+
  autolayer(emp_split$test)

[Figure: ensemble point forecast (mean) overlaid on the full employment series]

Ensemble Model Interpretations

  • Before adding regressors, the baseline ARIMA forecast has an RMSE of 123.
  • After adding regressors, the ensemble model forecast has an RMSE of 119.
  • That is only a very small improvement in the forecast.
  • The ensemble with regressors performed better than the plain forecast.
  • But the impact of these regressors on the hard data (employment) forecast is close to negligible (see the test-set check sketched below).
  • So we can conclude that, in our business scenario, the regressors add little value to the hard data forecast.
  • Employment does not appear to be influenced by either business volatility or uncertainty.
  • However, we will check a few more forecast models before finalizing the conclusion.
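Note that the RMSEs quoted above come from accuracy() applied to the fitted models, which is essentially an in-sample measure. A hedged sketch of the same comparison on the held-out test set, passing the test data as the second argument to forecast::accuracy():

accuracy(arima_fc, emp_split$test)         # baseline ARIMA: look at the "Test set" RMSE row
accuracy(modelhybrid_fc, emp_split$test)   # ensemble with regressors: same comparison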

Neural-Network Forecast Model

  • We test a neural-network (NNETAR) model without regressors.
  • The data is not de-trended.
model_nnetar = nnetar(emp_split$train)
accuracy(model_nnetar)  
                ME RMSE MAE       MPE   MAPE   MASE  ACF1
Training set 0.231  143 107 0.0000345 0.0841 0.0495 0.438
nnetar_fc = forecast(model_nnetar, h = length(emp_split$test))
autoplot(empl.ts, col = "darkred") + 
  autolayer(nnetar_fc)+autolayer(emp_split$test)

[Figure: NNETAR forecast on the level data versus the test set]

Neural-Network Forecast Model after de-trending the data

  • ndiffs() suggests one difference is needed, so we de-trend by first-differencing the series and refit the model.
ndiffs(empl.ts)
[1] 1
empl1.ts = empl.ts %>% diff()
emp_split1 = ts_split(empl1.ts)

model_nnetar = nnetar(emp_split1$train)
accuracy(model_nnetar)
                 ME RMSE  MAE   MPE MAPE    MASE   ACF1
Training set 0.0767 3.58 1.65 -0.44 4.35 0.00856 -0.254
nnetar_fc = forecast(model_nnetar, h = length(emp_split1$test))
autoplot(empl1.ts, col = "darkred") + 
  autolayer(nnetar_fc)+autolayer(emp_split1$test)

[Figure: NNETAR forecast on the differenced data versus the test set]

Neural-network Forecast Model Interpretations

  • The NNETAR forecast model without regressors and before de-trending has a very high RMSE (about 143 in the output above).
  • This tells us that NNETAR performs no better than the ensemble model on the raw data.
  • But the striking point is that NNETAR performs far better once the data is de-trended and made stationary.
  • NNETAR on the differenced data has the lowest reported RMSE, about 3.58 in the output above.
  • Note, however, that this RMSE is measured on monthly changes rather than levels, so it is not directly comparable to the level-scale RMSEs of the other models; a level-scale comparison is sketched below.
  • Even without regressors, NNETAR on the de-trended data looks like the strongest candidate, provided the comparison is put on a common scale.
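Because the second model is fitted on first differences, its RMSE is measured on monthly changes rather than employment levels. A hedged sketch of how to put the comparison back on the level scale, by cumulating the forecast changes from the last level observed before the test window (this relies on ts_split placing the last observations in the test set, as it does above):

h = length(emp_split1$test)
last_level = as.numeric(empl.ts)[length(empl.ts) - h]         # level just before the test window
level_fc = last_level + cumsum(as.numeric(nnetar_fc$mean))    # rebuild forecast levels from forecast changes
level_act = tail(as.numeric(empl.ts), h)                      # actual levels over the test window
sqrt(mean((level_act - level_fc)^2))                          # level-scale RMSE, comparable to the others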

ARIMA forecast model

model_arima = auto.arima(emp_split$train)
accuracy(model_arima) 
                ME RMSE  MAE       MPE   MAPE   MASE    ACF1
Training set -1.34  123 93.8 -0.000791 0.0744 0.0435 -0.0101
arima_fc = forecast(model_arima, h = length(emp_split$test))
autoplot(arima_fc)+autolayer(emp_split$test)

[Figure: ARIMA forecast versus the test set]

ARIMAX forecast model

model_arimax=auto.arima(emp_split$train,
                        xreg=mod_data_train)
accuracy(model_arimax)  
               ME RMSE  MAE     MPE   MAPE   MASE     ACF1
Training set 10.1  122 93.9 0.00861 0.0745 0.0436 -0.00932
arima_fc = forecast(model_arimax, h = length(emp_split$test),
                    xreg = mod_data_test)  # future (test-period) values of the regressors

autoplot(arima_fc) + autolayer(emp_split$test)

[Figure: ARIMAX forecast versus the test set]

ARIMA and ARIMAX interpretations

  • ARIMA without regressors has an RMSE of 123.
  • ARIMAX with regressors has an RMSE of 122.
  • Adding the regressors to the ARIMA model makes very little to no difference to the employment forecast's performance.
  • ARIMA and ARIMAX clearly give very similar forecasting results; the difference is not meaningful (one way to probe this is sketched below).
  • Both models are slightly less accurate than the ensemble model.
  • This confirms that the soft data adds little value to the employment forecast.
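One hedged way to probe that claim (not part of the original code) is to look at the ARIMAX regressor coefficients relative to their standard errors, both of which are stored in the fitted auto.arima object:

model_arimax$coef                                        # includes the two xreg coefficients
model_arimax$coef / sqrt(diag(model_arimax$var.coef))    # rough z-statistics; |z| well below 2 suggests weak evidence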

Final Conclusion:

  • With an RMSE of 119, the ensemble method with regressors is the best of the level-scale forecast models.
  • But the soft data (volatility and uncertainty indices) delivers very little to zero improvement in the hard data (employment) forecast.

  • NNETAR without regressors on the level data has the highest RMSE (about 143) and is the least effective model.

  • ARIMA without regressors and ARIMAX with regressors differ very little in RMSE (123 vs. 122).
  • This shows that adding the regressors does not add value to the hard data forecast, and neither model outperforms the ensemble model.

Striking point:

  • NNETAR does not perform well when the data has a trend or is otherwise non-stationary.
  • However, when the employment data is de-trended (first-differenced) and made stationary, NNETAR fits it best, with an RMSE of about 3.58 on the differenced scale.
  • Bearing in mind that this RMSE is not directly comparable to the level-scale RMSEs, NNETAR on the stationary series looks stronger than all the other models, including the ensemble method.
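To recap, the RMSEs reported in the accuracy() outputs above can be collected in one place (values copied from those outputs; note that the last row is on a different scale):

data.frame(model = c("ARIMA (no xreg)", "ARIMAX (xreg)", "Ensemble (xreg)",
                     "NNETAR (levels)", "NNETAR (differenced)"),
           RMSE = c(123, 122, 119, 143, 3.58),
           scale = c("levels", "levels", "levels", "levels", "differences"))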

  • It is evident that soft data such as the volatility and uncertainty indices adds little value to the hard data (employment) forecast.
  • Every forecast model tested points to the same conclusion.