Intro

My research explores the application of non-parametric methods, specifically tree-based methods like XGBoost, in the field of Economics. I aim to investigate their effectiveness in situations where traditional regression-based methods may fall short. To begin, I will write an R script to simulate five different relationships between an outcome variable \(Y\) and a set of \(k\) predictors \((X_1,...,X_k)\).

Scenario 1: Multicollinearity

In this scenario, the predictors suffer from multicollinearity. I will demonstrate the advantage of non-parametric methods over ordinary least squares (OLS) regression. After generating a dataset with highly correlated predictors, I will apply both OLS regression and a non-parametric method, XGBoost, to model the relationship between \(Y\) and the predictors. Using appropriate evaluation metrics, I will show that the non-parametric method handles multicollinearity better than OLS regression.

# Scenario 1: Multicollinearity
n <- 1000  # Number of observations
k <- 5     # Number of predictors

sigma <- 0.25

# Generate correlated predictors
X <- matrix(rnorm(n * k), nrow = n)
X[, 2] <- X[, 1] + rnorm(n, 0, sigma)
X[, 3] <- X[, 1] + rnorm(n, 0, sigma)
X[, 4] <- X[, 2] + X[, 3] + rnorm(n, 0, sigma)
X[, 5] <- rnorm(n, 0, sigma)

# Generate outcome variable
Y <- -5*X[, 1] + X[, 2] + 2*X[, 3]*X[, 4] + 5*X[,5] + rnorm(n, 0, sigma)

# Perform OLS regression
ols_model <- lm(Y ~ X)
summary(ols_model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.088 -3.511 -2.236  1.356 42.994 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.9782     0.1794  22.179  < 2e-16 ***
## X1           -4.8124     1.0414  -4.621 4.32e-06 ***
## X2            0.6230     1.0113   0.616    0.538    
## X3            0.9238     1.0400   0.888    0.375    
## X4           -0.1727     0.7184  -0.240    0.810    
## X5            4.3041     0.6890   6.247 6.21e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.663 on 994 degrees of freedom
## Multiple R-squared:  0.3094, Adjusted R-squared:  0.3059 
## F-statistic: 89.07 on 5 and 994 DF,  p-value: < 2.2e-16
ols_predictions <- predict(ols_model)

# Perform XGBoost (requires the xgboost package)
library(xgboost)
xgb_data <- xgb.DMatrix(data = as.matrix(X), label = Y)
xgb_params <- list(objective = "reg:squarederror")
xgb_model <- xgboost(data = xgb_data, params = xgb_params, nrounds = 100, verbose=0)
xgb_predictions <- predict(xgb_model, newdata = as.matrix(X))

# Compare performance
mse_ols <- mean((ols_predictions - Y)^2)
mse_xgb <- mean((xgb_predictions - Y)^2)
## Scenario 1: Multicollinearity
## OLS MSE: 31.88163
## XGBoost MSE: 0.004202015

In Scenario 1, where there is multicollinearity among predictors, non-parametric methods like XGBoost tend to handle correlated predictors more effectively than OLS regression. XGBoost’s ability to capture non-linear relationships and interactions between predictors could help alleviate the negative impact of multicollinearity. Therefore, it is reasonable to expect that XGBoost would outperform OLS regression in terms of MSE.

It must be noted, however, that this is not a limitation of parametric methods per se: if the researcher inspects the residuals (or the actual vs. predicted values), they will likely notice a non-random displacement around the identity function. They might then add interaction terms by hand until the right specification is found; in that case, OLS handles parameter estimation very well and recovers the correct marginal effects, as shown below.
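Such a diagnostic could look like the following sketch (these plots were not part of the original output):

# Actual vs. predicted values: points should scatter around the 45-degree line
plot(ols_predictions, Y, xlab = "OLS predicted", ylab = "Actual Y")
abline(0, 1, col = "red")  # identity line

plot(xgb_predictions, Y, xlab = "XGBoost predicted", ylab = "Actual Y")
abline(0, 1, col = "red")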

Still, the example was meant precisely to show that with parametric methods the researcher must be very careful and repeatedly inspect and re-train the model. By contrast, powerful non-parametric methods like XGBoost deliver excellent predictive performance even without any hyper-parameter tuning. The algorithm identifies \(X_1\) (feature 0) as the most relevant predictor, followed by \(X_4\) and \(X_3\), and, residually (compared to the most relevant), \(X_5\) and \(X_2\). This is very interesting, since the marginal effects of \(X_1\) and \(X_5\) on \(Y\) are equal in magnitude (\(\pm5\)), but all the multicollinearity induced in the design matrix is driven by \(X_1\). XGBoost can exploit this and therefore recognizes this predictor as the most important.
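For reference, the corrected OLS specification that produces the output below, together with a feature-importance query for the fitted XGBoost model, could be written as follows (the xgb.importance call is my addition; it was not shown in the original output):

# Re-fit OLS with the omitted interaction term included explicitly
ols_model2 <- lm(Y ~ X + I(X[, 3] * X[, 4]))
summary(ols_model2)
mse_ols2 <- mean((predict(ols_model2) - Y)^2)  # the "New OLS MSE" reported below

# Inspect which predictors XGBoost relies on most
xgb.importance(model = xgb_model)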

## 
## Call:
## lm(formula = Y ~ X + I(X[, 3] * X[, 4]))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83342 -0.16456 -0.01037  0.16597  0.77368 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        -0.011508   0.009549   -1.205    0.228    
## X1                 -4.988097   0.045287 -110.143   <2e-16 ***
## X2                  0.980023   0.043982   22.282   <2e-16 ***
## X3                 -0.007217   0.045244   -0.160    0.873    
## X4                  0.010459   0.031242    0.335    0.738    
## X5                  4.994498   0.029979  166.601   <2e-16 ***
## I(X[, 3] * X[, 4])  1.999788   0.002761  724.301   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2463 on 993 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9987 
## F-statistic: 1.267e+05 on 6 and 993 DF,  p-value: < 2.2e-16
## Scenario 1: Multicollinearity
## New OLS MSE: 0.06023246
## XGBoost MSE: 0.004202015

Scenario 2: Nonlinear Relationship

In this case, the relationship between \(Y\) and the single predictor \(X\) is highly non-linear. To illustrate the advantage of non-parametric methods over regression-based methods, I will combine transcendental functions (the natural logarithm and the sine) to create a non-linear pattern. Again, I will compare the performance of non-parametric methods, like XGBoost, with that of regression-based methods in capturing the underlying non-linear relationship. Evaluation metrics and visualization techniques will be used to highlight the advantages of non-parametric methods.

\[Y_i = \ln(|X_i|) + 10\sin(X_i) + \epsilon_i, \quad \epsilon_i \sim_{iid}N(0,1), \quad X_i \sim_{iid}U(-\pi,\pi)\]

# Scenario 2: Nonlinear Relationship
n <- 1000  # Number of observations
k <- 1     # Number of predictors

# Generate predictors
X <- runif(n, -pi, pi)

sigma <- 1 

# Generate outcome variable with nonlinear relationship
Y <- log(abs(X)) + 10*sin(X) + rnorm(n, 0, sigma)

# True (noise-free) signal, used later for plots and evaluation
signal <- function(x) log(abs(x)) + 10*sin(x)

# Perform OLS regression
ols_model <- lm(Y ~ X)
ols_predictions <- predict(ols_model)

# Perform XGBoost
xgb_data <- xgb.DMatrix(data = as.matrix(X), label = Y)
xgb_params <- list(objective = "reg:squarederror")
xgb_model <- xgboost(data = xgb_data, params = xgb_params, nrounds = 100, verbose=0)
xgb_predictions <- predict(xgb_model, newdata = as.matrix(X))

# Compare performance
mse_ols <- mean((ols_predictions - Y)^2)
mse_xgb <- mean((xgb_predictions - Y)^2)
## Scenario 2: Nonlinear Relationship
## OLS MSE: 21.21811
## XGBoost MSE: 0.2776865

In Scenario 2, where the relationship is highly non-linear, non-parametric methods like XGBoost are specifically designed to handle complex patterns and capture non-linear relationships. Therefore, it is likely that XGBoost would perform better than OLS regression in this scenario as well.

Again, the researcher would likely notice the poorly behaved residuals and could account for the non-linearity by fitting, for instance, polynomial regression models of increasing degree \(d\) and choosing the one that minimizes the residual sum of squares. One might expect the result of this procedure to be comparable with XGBoost’s, but let me stress this again: neither of them captures the true relation between \(Y\) and \(X\), which is unknown to the researcher; both merely approximate it as well as the data allow. What would be the benefit of a hand-tuned parametric best approximation, when one can obtain better results in a non-parametric, semi-automated way? In statistical inference, parametric methods are more powerful when the model is correctly specified; unless we explicitly include as predictors the two transformations of \(X\) (the natural log of its absolute value and its sine), we cannot be in the correctly specified case. No matter how high the degree of the final polynomial, a linear model in polynomial terms cannot capture the singularity at \(X=0\).
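A sketch of such a procedure, fitting orthogonal polynomials up to degree 10 and selecting the degree by in-sample fit (the degree cap is an illustrative choice; since the in-sample RSS never increases with the degree, an information criterion or a hold-out set would be a sounder selection rule in practice):

# Fit polynomial regressions of increasing degree and keep the best in-sample fit
degrees <- 1:10
poly_mse <- sapply(degrees, function(d) {
  mean(resid(lm(Y ~ poly(X, d)))^2)
})
best_d <- degrees[which.min(poly_mse)]
poly_model <- lm(Y ~ poly(X, best_d))
mse_poly <- mean((fitted(poly_model) - Y)^2)  # compare with the "New OLS MSE" below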

## Scenario 2: Nonlinear Relationship
## New OLS MSE: 1.229415
## XGBoost MSE: 0.2776865

Scenario 3: Non-Normal (Asymmetric) Error Term

In the third scenario, the relationship between \(Y\) and the predictor is correctly specified as linear, but the true error term follows a (shifted) Gamma distribution instead of the usual normal distribution. I will examine the impact of this non-normal error term on the performance of traditional regression methods and contrast it with the performance of non-parametric methods. By generating a shifted Gamma-distributed error term, I will compare the accuracy and robustness of regression-based methods, such as OLS, against non-parametric methods like XGBoost.

\[Y_i = 1 + 2X_i + \epsilon_i, \quad X_i \sim_{iid} U(0,1), \quad \epsilon_i \sim_{iid}{\cal{Gamma}}(3,10)-0.5\]

# Scenario 3: Non-Normal Error Term
n <- 1000  # Number of observations
k <- 1     # Number of predictors

# Generate predictors
X <- runif(n)

# Generate outcome variable with gamma-distributed error term
error_term <- rgamma(n, shape = 3, rate = 10) - .5
hist(error_term, freq=FALSE, ylim=c(0,4), main='', xlab=expression(epsilon))

Y <- 1 + 2*X + error_term

# Perform OLS regression
ols_model <- lm(Y ~ X)
ols_predictions <- predict(ols_model)
summary(ols_model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29116 -0.13113 -0.03593  0.09469  0.89909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.80186    0.01103   72.72   <2e-16 ***
## X            1.99265    0.01925  103.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1748 on 998 degrees of freedom
## Multiple R-squared:  0.9148, Adjusted R-squared:  0.9147 
## F-statistic: 1.072e+04 on 1 and 998 DF,  p-value: < 2.2e-16
# Perform XGBoost
xgb_data <- xgb.DMatrix(data = as.matrix(X), label = Y)
xgb_params <- list(objective = "reg:squarederror")
xgb_model <- xgboost(data = xgb_data, params = xgb_params, nrounds = 100, verbose=0)
xgb_predictions <- predict(xgb_model, newdata = as.matrix(X))

# Compare performance
mse_ols <- mean((ols_predictions - Y)^2)
mse_xgb <- mean((xgb_predictions - Y)^2)
## Scenario 3: Non-Normal Error Term
## OLS MSE: 0.030489
## XGBoost MSE: 0.008272935

In Scenario 3, where the error term is non-normally distributed, it is harder to predict a priori how XGBoost and OLS regression will compare without knowing the specific characteristics of the Gamma error distribution. Both methods may be affected differently by the asymmetric errors, so it is unclear which would attain the lower MSE without further analysis.

By inspecting the plots above, though, we can see that XGBoost again does better than the linear model in terms of raw prediction error. In the actual-vs-predicted plots, the XGBoost predictions lie closer to the observed values, so the data cloud hugs the identity line, whereas the linear model’s predictions are more spread around it.

If we look at the true signal, both models are biased downward: both methods estimate the conditional mean \(E[Y\mid X] = 1 + 2X + E[\epsilon] = 0.8 + 2X\), so the non-zero mean of the shifted Gamma errors (\(E[\epsilon]=3/10-0.5=-0.2\)) is absorbed into the fit, which therefore lies below the noise-free signal \(1+2X\) (note the estimated intercept of about \(0.80\) against the true value of \(1\)). By visual inspection, the linear model captures the signal better: the true relation is indeed linear, so once we compare the fits to the systematic part only, rather than to the noisy observations, it is reasonable to expect the linear model to outperform XGBoost.
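For reference, plots of the kind discussed here could be reproduced with a sketch along these lines (colours and layout are illustrative):

# Fitted relations vs. the noise-free signal 1 + 2*X
ord <- order(X)
plot(X, Y, col = "grey", pch = 16, cex = 0.6)
lines(X[ord], 1 + 2 * X[ord], lwd = 2)                      # true signal
lines(X[ord], ols_predictions[ord], col = "blue", lwd = 2)  # OLS fit
lines(X[ord], xgb_predictions[ord], col = "red", lwd = 2)   # XGBoost fit
legend("topleft", legend = c("true signal", "OLS", "XGBoost"),
       col = c("black", "blue", "red"), lwd = 2)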

To confirm the visual intuition numerically, I compare the mean squared errors computed against the de-noised \(Y\) values (i.e., the noise-free signal \(1+2X\), ruling out the error term) and check that the linear model does a slightly better job this time in capturing the signal alone.
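A minimal sketch of that check, comparing each model’s predictions to the noise-free signal:

# MSE against the de-noised outcome (true signal), not the noisy Y
Y_denoised <- 1 + 2 * X
mse_ols_signal <- mean((ols_predictions - Y_denoised)^2)
mse_xgb_signal <- mean((xgb_predictions - Y_denoised)^2)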

## Scenario 3: Non-Normal Error Term
## OLS MSE: 0.04071862
## XGBoost MSE: 0.0513557

Scenario 4: High-Dimensional Data

Consider a scenario with a high-dimensional dataset, where the number of predictors (\(k\)) is at least as large as the number of observations (\(n\)); here \(k = n = 100\), so the design matrix plus intercept has more parameters than observations. In such cases, OLS regression struggles due to the “curse of dimensionality” and overfitting. XGBoost, on the other hand, can handle high-dimensional data effectively by automatically performing feature selection and capturing non-linear relationships. I simulate a high-dimensional dataset with a relatively small number of observations and compare the performance of XGBoost and OLS regression.

# Scenario: High-Dimensional Data
n <- 100  # Number of observations
k <- 100  # Number of predictors (high-dimensional)

# Generate high-dimensional predictors
X <- matrix(rnorm(n * k), nrow = n)

# Generate outcome variable
Y <- 2 * X[, 1] + 3 * X[, 2] - 4 * X[, 3] + rnorm(n)

# Perform OLS regression
ols_model <- lm(Y ~ X)
ols_predictions <- predict(ols_model)


# Perform XGBoost
xgb_data <- xgb.DMatrix(data = as.matrix(X), label = Y)
xgb_params <- list(objective = "reg:squarederror")
xgb_model <- xgboost(data = xgb_data, params = xgb_params, nrounds = 100, verbose = 0)
xgb_predictions <- predict(xgb_model, newdata = as.matrix(X))

# Compare performance
mse_ols <- mean((ols_predictions - Y)^2)
mse_xgb <- mean((xgb_predictions - Y)^2)
## Scenario 4: High-Dimensional Data
## OLS MSE: 2.189406e-27
## XGBoost MSE: 1.236862e-07

##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.2073671        NaN     NaN      NaN
## X1          13.7159455        NaN     NaN      NaN
## X2           0.1598747        NaN     NaN      NaN
## X3          -5.4908404        NaN     NaN      NaN
## X4          -4.3391847        NaN     NaN      NaN
## X5          -2.6132456        NaN     NaN      NaN
## X6           0.4701802        NaN     NaN      NaN
## X7           0.6742521        NaN     NaN      NaN
## X8           5.3953746        NaN     NaN      NaN
## X9          -4.7367775        NaN     NaN      NaN

In Scenario 4, the in-sample OLS results should not be taken at face value: with an intercept plus 100 predictors and only 100 observations, OLS has zero residual degrees of freedom, so it interpolates the data exactly (hence the essentially zero MSE) and the standard errors in the coefficient table above are NaN. The estimated coefficients are also far from the true values (only \(X_1\), \(X_2\) and \(X_3\) enter the true model), which is exactly the overfitting induced by the curse of dimensionality. XGBoost’s ability to handle high-dimensional data, adapt to interactions, and effectively ignore irrelevant predictors makes it a preferable choice over OLS regression in this setting: it is expected to generalize better out of sample by avoiding the overfitting issues associated with OLS regression.
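One way to substantiate the generalization claim, which is not part of the original script, is to hold out part of the data and compare out-of-sample errors; a rough sketch under that assumption:

# Out-of-sample comparison (illustrative; results were not reported above)
set.seed(1)
train_idx <- sample(n, size = 0.7 * n)
df <- as.data.frame(X)   # columns V1, ..., V100
df$Y <- Y

ols_fit <- lm(Y ~ ., data = df[train_idx, ])   # rank-deficient: 101 parameters, 70 observations
ols_test_pred <- predict(ols_fit, newdata = df[-train_idx, ])

xgb_fit <- xgboost(data = xgb.DMatrix(X[train_idx, ], label = Y[train_idx]),
                   params = list(objective = "reg:squarederror"),
                   nrounds = 100, verbose = 0)
xgb_test_pred <- predict(xgb_fit, newdata = X[-train_idx, ])

mean((ols_test_pred - Y[-train_idx])^2)  # OLS out-of-sample MSE
mean((xgb_test_pred - Y[-train_idx])^2)  # XGBoost out-of-sample MSE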

Scenario 5: Outliers and Heteroscedasticity

In this last scenario, I simulate a dataset where an extreme outlier is present (heteroscedasticity could be added as an extension; see the sketch at the end of this section). OLS regression is sensitive to outliers and assumes constant error variance (homoscedasticity), so violations of either condition can lead to distorted parameter estimates and unreliable standard errors. In contrast, XGBoost is more robust to outliers and can capture patterns even when heteroscedasticity is present. I generate data containing an outlier and compare the performance of XGBoost and OLS regression in terms of robustness and accuracy.

# Scenario 5: Outliers and Heteroscedasticity
n <- 100  # Number of observations
k <- 1    # Number of predictors

# Generate predictors
X <- runif(n)

# Generate outcome variable: quadratic signal with homoscedastic noise;
# a single extreme outlier is injected below
Y <- X^2 + 0.5 * X + rnorm(n, mean = 0, sd = 0.2)
Y[25] <- 10  # Introduce an outlier

# Perform OLS regression
ols_model <- lm(Y ~ X)
ols_predictions <- predict(ols_model)

# Perform XGBoost
xgb_data <- xgb.DMatrix(data = as.matrix(X), label = Y)
xgb_params <- list(objective = "reg:squarederror")
xgb_model <- xgboost(data = xgb_data, params = xgb_params, nrounds = 100, verbose = 0)
xgb_predictions <- predict(xgb_model, newdata = as.matrix(X))

# Compare performance
mse_ols <- mean((ols_predictions - Y)^2)
mse_xgb <- mean((xgb_predictions - Y)^2)
## Scenario 5: Outliers and Heteroscedasticity
## OLS MSE: 0.9671339
## XGBoost MSE: 5.214102e-05

Outliers: OLS regression is sensitive to outliers, meaning that even a single extreme data point can have a substantial impact on the estimated coefficients and the overall fit of the model. As a result, the presence of outliers can lead to biased parameter estimates and reduced predictive performance in OLS regression. On the other hand, XGBoost is more robust to outliers due to the nature of its tree-based algorithm. Decision trees are less influenced by individual outliers as they partition the feature space into regions, and the impact of outliers is localized within those regions. This robustness allows XGBoost to better handle outliers and provide more reliable predictions.
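The sensitivity of OLS to the single injected outlier can be checked directly by re-fitting the model without observation 25; a quick sketch:

# Compare OLS coefficients with and without the injected outlier
ols_without_outlier <- lm(Y[-25] ~ X[-25])
coef(ols_model)            # fit including the outlier
coef(ols_without_outlier)  # fit excluding the outlier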

Heteroscedasticity: OLS regression assumes homoscedasticity, meaning that the error term has a constant variance across all levels of the predictors. When heteroscedasticity is present, OLS regression may produce inefficient estimates and unreliable standard errors. The predicted values from OLS may have unequal variability, leading to suboptimal model performance. XGBoost, on the other hand, does not assume constant error variance and can capture patterns in the data even when heteroscedasticity is present. By effectively modeling the nonlinear relationships between predictors and the outcome, XGBoost can better adapt to varying levels of variability in the data, resulting in improved performance compared to OLS regression.
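Since the simulation above keeps the error variance constant and only injects one outlier, heteroscedasticity could be added by letting the error spread grow with the predictor; a minimal sketch of such an extension (the scaling of the standard deviation is an illustrative choice, not taken from the original script):

# Illustrative extension: error standard deviation increasing with X
Y_het <- X^2 + 0.5 * X + rnorm(n, mean = 0, sd = 0.1 + 0.5 * X)
Y_het[25] <- 10  # keep the single outlier

ols_het <- lm(Y_het ~ X)
xgb_het <- xgboost(data = xgb.DMatrix(as.matrix(X), label = Y_het),
                   params = list(objective = "reg:squarederror"),
                   nrounds = 100, verbose = 0)
mean((predict(ols_het) - Y_het)^2)                          # OLS in-sample MSE
mean((predict(xgb_het, newdata = as.matrix(X)) - Y_het)^2)  # XGBoost in-sample MSE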

Overall, it is likely that XGBoost would perform better than OLS regression in this scenario as well.