2023-05-31

Introduction

Often, we want to explain the relationship between a target variable \(Y\) and a vector of explanatory variables \(X\), or we want to make predictions of \(Y\) based on known values of \(X\).

Shmueli explains the distinction between the two approaches: “to explain or to predict?” (Reference: Shmueli paper).

Let’s summarize it briefly.

Explanatory Modeling

If there exists a true relationship (underlying theory) of the form: \[Y = f(X) + \epsilon\]

where \(\epsilon\) is the unpredictable error term (noise) and \(f\) is the signal, then the way we estimate the function \(\widehat{f}\) changes depending on the objective (explanatory vs predictive).

In explanatory modeling, we seek the best model by minimizing the bias \(B(\widehat{f}) = E(f - \widehat{f})\), in order to obtain the most accurate representation of the underlying theory.

Predictive Modeling

In predictive modeling, we seek the best model by jointly minimizing the bias \(B(\widehat{f})\) and the variance \(V(\widehat{f}) = E\left[\left(\widehat{f}-E(\widehat{f})\right)^2\right]\) of the model.
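
The reason both terms matter for prediction is the usual decomposition of the expected squared prediction error (stated here for completeness, using the definitions above): \[E\left[(Y - \widehat{f})^2\right] = \text{Var}(\epsilon) + B(\widehat{f})^2 + V(\widehat{f})\] so a slightly biased but stable \(\widehat{f}\) can predict better than an unbiased but highly variable one.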

The consequence is that the best predictive model, which produces the best predictions for new values of \(Y\) based on known values of \(X\), could also be theoretically wrong.

Challenges in Real Data Analysis

When analyzing real data, we have to deal with various challenges, such as:

  • Incomplete datasets
  • Missing values
  • Outliers
  • Measurement errors
  • Multicollinearity
  • Skewed target variables
  • Nonlinear and difficult-to-model relationships
  • Sparse data matrices
  • High-dimensional data (\(p \ge n\))
  • And more…

Nonparametric Models and Gradient Boosting

Nonparametric models based on decision trees (Random Forest, Bagging, Gradient Boosting, etc.) have the advantage of being more robust to these issues and, on tabular data, often deliver better predictive performance.

In particular, XGBoost (Extreme Gradient Boosting) is an algorithm that has won several competitions and has gained interest for its robust results even in the presence of many of the aforementioned challenges (Reference: XGBoost paper).

Trade-off: Computational Cost

The price we pay for not imposing prior restrictions on the possible function \(f\) is a higher computational cost.

A single decision tree tries to predict the value of \(Y\) (whether numeric, categorical, multi-output,…) by partitioning the input space to obtain subsets of units that are homogeneous in the target variable.
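
As a minimal sketch (hypothetical simulated data, using the rpart package; not code from this post), a single regression tree recovers exactly this kind of partition:

library(rpart)

set.seed(1)
n  <- 500
x1 <- runif(n, -3, 3)
x2 <- runif(n, -3, 3)
# piecewise-constant signal: regions of (x1, x2) that are homogeneous in y
y  <- ifelse(x1 > 0, 5, -5) + 3 * (x2 > 1) + rnorm(n)

tree <- rpart(y ~ x1 + x2, method = "anova")  # recursive binary splits of the input space
print(tree)                                   # splitting rules and mean of y in each leaf
mean((y - predict(tree))^2)                   # in-sample MSE of the single tree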

XGBoost starts with a single tree and, at each iteration, adds a further tree that corrects the errors of the current ensemble; since it combines multiple weak learners, it is an ensemble method (Reference: XGBoost documentation).
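
In simplified form (ignoring XGBoost's specific regularization), each boosting step updates the ensemble as \[\widehat{f}_m(x) = \widehat{f}_{m-1}(x) + \eta\, t_m(x)\] where \(t_m\) is a small tree fitted to the residual errors (more precisely, the gradient of the loss) of \(\widehat{f}_{m-1}\), and \(\eta\) is the learning rate introduced below.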

Understanding CART: Classification and Regression Trees

XGBoost combines multiple trees…

Example 1: Multicollinearity and Outliers

Consider the following relationship: \[Y = X_1 + 2X_2 + \pi X_3 + \epsilon, \quad \epsilon \sim N(0,1), \quad \text{Cor}(X_1,X_2) \ge 0.9\]

The true parameter vector is \(\beta = (0, 1, 2, \pi, 0)\) (intercept, \(X_1\), \(X_2\), \(X_3\), and a noise regressor \(X_4\)), but the linear model cannot correctly disentangle the marginal effects.

##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  48.8825537 21.8655415  2.2355977 2.771986e-02
## x1          -14.2533499  5.5860329 -2.5516051 1.231871e-02
## x2            0.7730522  5.7825662  0.1336867 8.939331e-01
## x3            2.5928629  0.2073066 12.5073791 8.589942e-22
## x4            2.1198444  1.7963341  1.1800947 2.409091e-01

## Example 1: Collinearity and outliers
## OLS MSE: 334.5319
## XGBoost MSE: 1.661577e-06
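
A minimal sketch of how a comparison like this could be set up in R (hypothetical code, assuming MASS::mvrnorm for the correlated pair and the xgboost package; not the exact script that produced the output above):

library(MASS)
library(xgboost)

set.seed(123)
n     <- 100
Sigma <- matrix(c(1, 0.95, 0.95, 1), nrow = 2)     # Cor(X1, X2) >= 0.9
X12   <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
x1 <- X12[, 1]
x2 <- X12[, 2]
x3 <- rnorm(n)
x4 <- rnorm(n)                                     # no effect on y (true beta = 0)
y  <- x1 + 2 * x2 + pi * x3 + rnorm(n)
idx    <- sample(n, 5)                             # hypothetical outlier contamination
y[idx] <- y[idx] + 100                             # (the exact scheme used above is not shown)

summary(lm(y ~ x1 + x2 + x3 + x4))                 # unstable, distorted OLS estimates

dtrain <- xgb.DMatrix(cbind(x1, x2, x3, x4), label = y)
fit    <- xgb.train(params = list(objective = "reg:squarederror"),
                    data = dtrain, nrounds = 200)
mean((y - predict(fit, dtrain))^2)                 # in-sample MSE of XGBoost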

Example 2: Nonlinear relationships

Consider the following relationship: \[Y = 25\sin^2(X_1) + 10X_2 + 3\sqrt{X_2} + \epsilon, \quad \epsilon \sim N(0,1)\] where \(X_1 \sim U(-2\pi, 2\pi)\) and \(X_2 \sim N(\mu=30,\sigma=2)\).

From the LM summary, we notice that only the parameter of \(X_2\) is correctly estimated.

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 27.3856510 15.8263892  1.730379 8.680976e-02
## x1          -0.1845653  0.1539188 -1.199108 2.334682e-01
## x2          10.0035808  0.5001459 20.001326 7.999447e-36
## x3          -0.3011961  0.2074243 -1.452078 1.497748e-01
## x4           0.4661342  0.2664589  1.749366 8.345723e-02

## Example 2: Nonlinear relationships
## OLS MSE: 69.50064
## XGBoost MSE: 1.187448e-06
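
Again as a hypothetical sketch rather than the exact script used above, data of this kind can be simulated as follows; the \(\sin^2\) term is what a linear fit in \(X_1\) cannot capture:

set.seed(123)
n  <- 100
x1 <- runif(n, -2 * pi, 2 * pi)
x2 <- rnorm(n, mean = 30, sd = 2)
y  <- 25 * sin(x1)^2 + 10 * x2 + 3 * sqrt(x2) + rnorm(n)

# a straight line in x1 averages out the sin^2 oscillation, hence the near-zero estimate
coef(lm(y ~ x1 + x2))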

Overfitting

To correctly assess which model produces the best predictions on new data, we have to test the model on new data points (the so-called test set) that were not used during the training process.

We talk about overfitting whenever the model's performance on the test set is much worse than the performance observed on the training set. An overfitted model cannot generalize well.

Cross-validation methods (train-test splits) should always be used whenever we want to build a predictive model.
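
A minimal sketch of a train-test split with the xgboost R package (hypothetical simulated data; the gap between training and test MSE is exactly the overfitting we want to detect):

library(xgboost)

set.seed(123)
n <- 1000
X <- matrix(rnorm(n * 4), ncol = 4)
y <- 25 * sin(X[, 1])^2 + 10 * X[, 2] + rnorm(n)

train  <- sample(n, 0.8 * n)                       # 80% training, 20% test
dtrain <- xgb.DMatrix(X[train, ], label = y[train])
dtest  <- xgb.DMatrix(X[-train, ], label = y[-train])

fit <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dtrain, nrounds = 500)

mean((y[train] - predict(fit, dtrain))^2)          # training MSE
mean((y[-train] - predict(fit, dtest))^2)          # test MSE: the one that matters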

Be careful with XGBoost, since it can easily overfit. Some important hyperparameters to prevent this are listed below (a sketch of how they are passed follows the list):

  • subsample: default 1; the fraction of training instances randomly sampled to grow each tree.
  • max_depth: default 6; the maximum depth of each tree.
  • colsample_bytree: default 1; the fraction of columns randomly sampled as split candidates for each tree.
  • \(\eta\): default 0.3; the learning rate, i.e. a factor used to scale the contribution of each tree.
  • early_stopping_rounds: default NULL; when set to \(K\), training stops if the evaluation metric on a validation set does not improve for \(K\) consecutive rounds.
  • alpha and lambda: L1 (LASSO) and L2 (ridge) regularization terms on the leaf weights; using both gives an elastic-net-style penalty.
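
A sketch of how these hyperparameters could be passed (hypothetical values; dtrain and dtest are xgb.DMatrix objects as in the split above, and in the R package the validation set used for early stopping is supplied through watchlist):

fit <- xgb.train(
  params = list(
    objective        = "reg:squarederror",
    eta              = 0.05,   # smaller learning rate: each tree contributes less
    max_depth        = 4,      # shallower trees
    subsample        = 0.8,    # row subsampling per tree
    colsample_bytree = 0.8,    # column subsampling per tree
    lambda           = 1,      # L2 penalty on leaf weights
    alpha            = 0.1     # L1 penalty on leaf weights
  ),
  data                  = dtrain,
  nrounds               = 2000,
  watchlist             = list(train = dtrain, eval = dtest),
  early_stopping_rounds = 50    # stop if the eval RMSE does not improve for 50 rounds
)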

  • For now there are not many applications in the economic domain, but its forecasting performance is promising; for example, here is an application to the VAT tax gap: (Reference: VAT Gap).

  • Here you can play with XGBoost and ARIMA models on your favorite time series. This is a preliminary version and it works only for monthly time series from 2002 to 2022. In the next version you will be able to upload any dataset with any frequency and to create new features (such as lagged variables, differenced variables, Box-Cox transformations, etc.) (Reference: Forecast app).

Thank you for your attention.