WMEA 2024

1 Introduction

One of the most widely used data analysis techniques is univariate multiple linear regression. In general, it proposes to mathematically model the relation between a dependent variable and a set of independent variables (or predictors) in order to understand whether the latter predict and/or explain the former. In the modelling process some assumptions should be imposed (Greene 2008).
Data transformations are sometimes applied, being one of those the standardization of the variables. This is done by subtracting the mean from each and then dividing by their respective standard deviation. The estimated coefficients (from now on referred as coefficients) are called standardized coefficients (Bring 1994).
Consider the univariate multiple linear regression model given by \[\begin{align} y &= X\beta + \epsilon, \end{align}\] where \(y\) denotes a \((N \times 1)\) vector of noisy observations, \(N \in \mathbb{N}\), \(\beta\) is a \(((K+1) \times 1)\) vector of unknown parameters or coefficients, \(K \in \mathbb{N}\), \(X\) is a known \((N \times (K+1))\) design matrix and \(\epsilon\) is a \((N \times 1)\) vector of random disturbances (errors), usually assumed to have a conditional expected value of zero and representing spherical disturbances, i.e, \(E\left[ \epsilon | X\right]=0\) and \(E[\epsilon \epsilon'|X]=\sigma^2I\), where \(I\) is a (\(N \times N\)) identity matrix and \(\sigma^2\) is the error variance.

The univariate multiple linear regression model can be written as \[\begin{align} y &= \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \dots + \beta_K x_{K} + \epsilon, \end{align}\] and standardizing \(y\) and \(x_j,j \in \left\lbrace 1,\dots,K\right\rbrace\) the following results are obtained \[\begin{align} % y^* &= f^*(y^*) + \epsilon^* y^* &= X^*b + \epsilon^*\\ y^* &= b_0 + b_1x_1^* + b_2x_2^* + \dots + b_Kx_K^* + \epsilon^*, \end{align}\] where \[\begin{align} y_i^*&=\frac{y_i-\frac{\sum_{i=1}^{N}y_i}{N}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\frac{\sum_{i=1}^{N}y_i}{N}\right)^2}},\\ x_{ji}^*&=\frac{x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}\right)^2}},\\ b_j&=\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}\right)^2}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\frac{\sum_{i=1}^{N}y_i}{N}\right)^2}}\beta_j, \label{stand_coef_1} \end{align}\] with \(j\in \left\lbrace 1,\dots,K\right\rbrace\), and \(b_0=0\). In this formulation, \(b_j\) are called standardized coefficients.

1.1 Ordinary Least Squares

The Ordinary Least Squares (OLS) estimator of \(\beta\) takes the form \[\begin{align} \widehat{\beta}^{OLS}= \begin{bmatrix} \widehat{\beta}_0^{OLS}\\ \widehat{\beta}_1^{OLS}\\ \widehat{\beta}_2^{OLS}\\ \vdots \\ \widehat{\beta}_K^{OLS} \end{bmatrix} &= \underset{\beta}{\operatorname{argmin}} \|y-X\beta\|^2\\ &= \left(X'X\right)^{-1}X'y, \end{align}\] where \(X'\) is the transpose of \(X\).

1.2 Ridge regression

The Ridge regression introduced by Hoerl and Kennard (Hoerl and Kennard 1970) is an estimation procedure to handle collinearity without removing variables from the regression model. By adding a small non-negative constant (ridge or shrinkage parameter) to the diagonal of the correlation matrix of the explanatory variables, it is possible to reduce the variance of the OLS estimator through the introduction of some bias. Although the resulting estimators are biased, the biases are small enough for these estimators to be substantially more precise than the unbiased estimators. The challenge in Ridge regression remains on the selection of the ridge parameter. One straightforward approach is based on simply plotting the coefficients against several possible values for the ridge parameter and inspecting the resulting traces. The Ridge Regression estimator of \(\lambda\) takes the form \[\begin{align} \widehat{\beta}^{ridge}&= \underset{\beta}{\operatorname{argmin}} \|y-X\beta\|^2+\lambda \|\beta\|^2 \\ &=(X'X+\lambda I)^{-1}X'y, \end{align}\] where \(\lambda \geq 0\) denotes the ridge parameter and \(I\) is a \((K \times K)\) identity matrix. Note that when \(\lambda \rightarrow 0\), the Ridge regression estimator approaches the OLS estimator whereas the Ridge regression estimator approaches the zero vector when \(\lambda \rightarrow \infty\). Thus, a trade-off between variance and bias is needed.

1.3 Generalized Maximum Entropy estimator

Golan et al (Golan, Judge, and Miller 1996) generalized the Maximum Entropy formalism (Jaynes 1957) to linear inverse problems with noise, expressed in the previous chapter linear model. The idea is to treat each \(\beta_j\), \(j\in\left\{ 0,\dots,K\right\}\), as a discrete random variable with a compact support and \(2 \leq M < \infty\) possible outcomes, and each \(\epsilon_i\), \(i \in \left\{ 1, \dots, N \right\}\), as a finite and discrete random variable with \(2 \leq J < \infty\) possible outcomes. Assuming that both the unknown parameters and the unknown error terms may be bounded a priori, the linear model can be presented as \[\begin{equation*} Y=XZp + Vw, \end{equation*}\] where \[\begin{equation} \beta=Zp= \left[ \begin{array}{cccc} z'_1 & 0 & \cdots & {0} \\ 0 & z'_2 & \cdots & {0} \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & {z}'_K \end{array}\right] \left[ \begin{array}{c} {p}_1 \\ {p}_2 \\ \vdots\\ {p}_K \end{array}\right], \end{equation}\] with \({Z}\) a \((K \times (K\times M))\) matrix of support values and \({p}\) a \(((K\times M) \times 1)\) vector of unknown probabilities, and \[\begin{equation} \epsilon={Vw}= \left[ \begin{array}{cccc} v'_1 & 0 & \cdots & {0} \\ 0 & v'_2 & \cdots & {0} \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v'_N \end{array}\right] \left[ \begin{array}{c} w_1 \\ w_2 \\ \vdots\\ w_N \end{array}\right], \end{equation}\] with \(V\) a \((N \times (N\times J))\) matrix of support values and \(w\) a \(((N\times J) \times 1)\) vector of unknown probabilities. For the linear regression model, the Generalized Maximum Entropy (GME) estimator is given by \[\begin{equation} \hat{\beta}^{GME}(Z,V) = \underset{p,w}{\operatorname{argmax}} \left\{-p' \ln p - w' \ln w \right\}, \label{Golan631} \end{equation}\] subject to the model constraint \[\begin{equation*} Y=XZp + Vw, \end{equation*}\] and the additivity constraints for \(p\) and \(w\), respectively, \[\begin{equation*} \begin{array}{c} 1_K=(I_K \otimes 1'_M)p,\\ 1_N=(I_N \otimes 1'_J)w, \end{array} \end{equation*}\] where \(\otimes\) represents the Kronecker product, \({1}\) is a column vector of ones with a specific dimension, \({I}\) is an identity matrix with a specific dimension and \({Z}\) and \({V}\) are the matrices of supports, and \({p}>{0}\) and \({w}>{0}\) are probability vectors to be estimated. The GME estimator generates the optimal probability vectors \(\widehat{{p}}\) and \(\widehat{{w}}\) that can be used to form point estimates of the unknown parameters and the unknown random errors. Since the objective function is strictly concave in the interior of the additivity constraint set, a unique solution for the GME estimator is guaranteed if the intersection of the model and the additivity constraint sets is non-empty (Golan, Judge, and Miller 1996). The supports in matrices \({Z}\) and \({V}\) are defined as being closed and bounded intervals within which each parameter or error is restricted to lie, implying that researchers need to provide exogenous information (which, unfortunately, it is not always available). This is considered the main weakness of the GME estimator (Caputo and Paris 2008). Golan et al (Golan, Judge, and Miller 1996) discuss these issues in the case of minimal prior information: for the unknown parameters, the authors recommend the use of wide bounds (this is naturally subjective) for the supports in \({Z}\), without extreme risk consequences; for the unknown errors, the authors suggest the use of the three-sigma rule with a sample scale parameter (Pukelsheim 1994). The number of points \(M\) in the supports is less controversial and are usually used in the literature between 3 and 7 points, since there is likely no significant improvement in the estimation with more points in the supports. The three-sigma rule, considering the standard deviation of the noisy observations and \(J=3\) points in the supports is usually adopted.

2 Setting up

2.1 R software

“R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.” — R software (RCoreTeam 2024)

Follow https://cran.r-project.org/ to install it.

2.2 R Studio

“RStudio is an integrated development environment (IDE) designed to support multiple languages, including both R and Python.” — RStudio

Follow https://posit.co/downloads/ to install it.

2.3 Packages

During the workshop, several R packages will be necessary. Open R Studio and create a new Quarto document.

Insert the following code inside the code chunk.

# install.packages("devtools")
library(devtools)
# install.packages("ggcorrplot")
library(ggcorrplot)
# install.packages("ggpubr")
library(ggpubr)
# install.packages("MLmetrics")
library(MLmetrics)
# install.packages("glmnet")
library(glmnet)
# devtools::install_github("jorgevazcabral/GCE",
#                          build_vignettes = TRUE,
#                          build_manual = TRUE)
library(GCE)
#install.packages("AER")
library(AER)

https://github.com/jorgevazcabral

3 Simulated Data

3.1 Generate Data

n = 100 #Number of individuals
intercept.beta = 0 #Intercept
y.gen.cont.beta = c(3, 6, 9) #Coefficients used for generating y
cont.k = 0 #Number of continuous variables not used for generating y #use also 5
condnumber = 1 #Condition number of the X matrix #use also 10 and 100

simulated_data <-
  gdata_GCE(
    n = n,
    cont.k = cont.k, 
    y.gen.cont.k = length(y.gen.cont.beta),
    y.gen.cont.beta = y.gen.cont.beta,
    intercept.beta = intercept.beta,
    Xgenerator.method = "svd",
    condnumber = condnumber,
    Xgenerator.seed = 28102024,
    error.dist = "normal",
    error.dist.mean = 0,
    error.dist.sd = 1,
    dataframe = TRUE)

simulated_data_coef <- c(intercept.beta,
                   rep(0,cont.k),
                   y.gen.cont.beta)
simulated_data[1:5,]

         X001         X002        X003          y
1 -0.02408009 -0.003041504 -0.16085117 -2.7581023
2  0.14891804 -0.055189659  0.09739407  0.7978739
3 -0.05127719  0.085678733  0.02025297  0.7474960
4  0.15802821 -0.106477674  0.01375576  0.6281851
5  0.07692678 -0.044082368  0.14430848  0.3824326

(simulated_data_cor <- 
  cor(simulated_data,
      method = "pearson"))

             X001         X002         X003         y
X001  1.000000000 -0.001580996 -0.005650525 0.2385828
X002 -0.001580996  1.000000000 -0.011835200 0.3491026
X003 -0.005650525 -0.011835200  1.000000000 0.6020907
y     0.238582834  0.349102573  0.602090689 1.0000000

ggcorrplot::ggcorrplot(simulated_data_cor,
                       lab = TRUE)

3.2 Estimation

3.2.1 OLS

sd_model_OLS <- 
  lm(data = simulated_data,
     y ~ . #A typical model has the form response ~ terms #or y ~ .
     )

summary(sd_model_OLS)


Call:
lm(formula = y ~ ., data = simulated_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.11530 -0.76077 -0.00233  0.76996  2.66429 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01902    0.10688   0.178 0.859116    
X001         3.69581    1.04511   3.536 0.000627 ***
X002         5.44107    1.04650   5.199 1.13e-06 ***
X003         9.44850    1.06666   8.858 4.22e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.045 on 96 degrees of freedom
Multiple R-squared:  0.5483,    Adjusted R-squared:  0.5342 
F-statistic: 38.84 on 3 and 96 DF,  p-value: < 2.2e-16

(sd_model_OLS_coef <- coef(sd_model_OLS))

(Intercept)        X001        X002        X003 
 0.01902268  3.69580700  5.44106587  9.44849741

(sd_model_OLS_RMSE <-
  MLmetrics::RMSE(fitted(sd_model_OLS),
                  simulated_data$y))

[1] 1.023594

(sd_model_OLS_beta_RMSE <-
  MLmetrics::RMSE(sd_model_OLS_coef,
                  simulated_data_coef))

[1] 0.4995164

3.2.1.1 Assumptions

par(mfrow = c(2,2))
plot(sd_model_OLS)

par(mfrow = c(1,1))

p <- sjPlot::plot_model(sd_model_OLS,
                   type = "diag")
ggpubr::ggarrange(p[[1]],
                  p[[2]],
                  p[[3]],
                  p[[4]])

performance::check_model(sd_model_OLS)

#Breusch-Pagan test
# The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model (by default the same explanatory variables are taken as in the main regression model) and rejects if too much of the variance is explained by the additional explanatory variables.

lmtest::bptest(sd_model_OLS)


    studentized Breusch-Pagan test

data:  sd_model_OLS
BP = 3.9348, df = 3, p-value = 0.2686

#Durbin-Watson test

#The Durbin-Watson test has the null hypothesis that the autocorrelation of the disturbances is 0.
lmtest::dwtest(sd_model_OLS)


    Durbin-Watson test

data:  sd_model_OLS
DW = 1.9813, p-value = 0.4626
alternative hypothesis: true autocorrelation is greater than 0

3.2.2 Ridge

sd_model_ridge <- 
  glmnet::cv.glmnet(
    x = as.matrix(simulated_data[,-ncol(simulated_data)]),
    y = simulated_data[,ncol(simulated_data)] ,
    standardize = TRUE,
    alpha = 0,
    lambda = 10^seq(-3,3,by=0.1),
    nfolds = 5,
    type.measure = "default")

plot(sd_model_ridge)

plot(sd_model_ridge$glmnet.fit,
     xvar = "lambda")
abline(v = log(sd_model_ridge$lambda.min),
       lty = 3)
abline(v = log(sd_model_ridge$lambda.1se),
       lty = 3)

coef(sd_model_ridge,
     s = "lambda.min")

4 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept) 0.01305381
X001        3.60013026
X002        5.29956189
X003        9.20560163

coef(sd_model_ridge,
     s = "lambda.1se")

4 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) -0.04940631
X001         2.60046509
X002         3.82309097
X003         6.66244241

sd_model_ridge_coef <-
  t(as.matrix(coef(sd_model_ridge,
                   s = "lambda.min")))

(sd_model_ridge_RMSE <-
  MLmetrics::RMSE(predict(sd_model_ridge,
                          as.matrix(simulated_data[,-ncol(simulated_data)]),
                          s = "lambda.min"),
                  simulated_data$y))

[1] 1.024007

(sd_model_ridge_beta_RMSE <-
  MLmetrics::RMSE(sd_model_ridge_coef,
                  simulated_data_coef))

[1] 0.4725496

3.2.3 GCE

sd_model_GCE <- 
  GCE::gce(data = simulated_data,
           y ~ .,
           int.one.prior = "ridge",
           int.one.method = "adaptive",
           model = TRUE,
           ci.B = 10,
           seed = 28102024
      )

summary(sd_model_GCE)


Call:
GCE::gce(formula = y ~ ., data = simulated_data, model = TRUE, 
    int.one.prior = "ridge", int.one.method = "adaptive", ci.B = 10, 
    seed = 28102024)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.12538 -0.73902  0.02761  0.77810  2.72719 

Coefficients:
              Estimate  LB 95% CI  UB 95% CI
(Intercept) 0.01032578 -0.1109409  0.1965275
X001        3.57685324  2.4163743  5.3751207
X002        5.25854033  3.2613797  6.9218831
X003        9.08505509  7.7768015 10.7058827

Residual standard error: 1.046 on 96 degrees of freedom
Multiple R-squared:  0.5293,    Adjusted R-squared:  0.5146
Choosen support:  2.385
CV-RMSE:  1.085

sd_model_GCE_coef <- coef(sd_model_GCE)

(sd_model_GCE_RMSE <-
  MLmetrics::RMSE(fitted(sd_model_GCE),
                  simulated_data$y))

[1] 1.024434

(sd_model_GCE_beta_RMSE <-
  MLmetrics::RMSE(sd_model_GCE_coef,
                  simulated_data_coef))

[1] 0.4716628

par(mfrow = c(2,2))
plot(sd_model_GCE)

par(mfrow = c(1,1))

3.2.4 All

data.frame(rbind(
  cbind(
    "TRUE" = simulated_data_coef, 
    OLS = sd_model_OLS_coef,
    ridge = as.numeric(sd_model_ridge_coef),
    GCE = sd_model_GCE_coef
  ),
  beta_RMSE = c(
    0,
    sd_model_OLS_beta_RMSE,
    sd_model_ridge_beta_RMSE,
    sd_model_GCE_beta_RMSE
  ),
  RMSE = c(
    0,
    sd_model_OLS_RMSE,
    sd_model_ridge_RMSE,
    sd_model_GCE_RMSE)
))

            TRUE.        OLS      ridge        GCE
(Intercept)     0 0.01902268 0.01305381 0.01032578
X001            3 3.69580700 3.60013026 3.57685324
X002            6 5.44106587 5.29956189 5.25854033
X003            9 9.44849741 9.20560163 9.08505509
beta_RMSE       0 0.49951640 0.47254955 0.47166278
RMSE            0 1.02359386 1.02400709 1.02443366

4 Real Data

which_data <- 1

if (which_data == 1) {
data("Electricity1970")
real_data <- Electricity1970[,c(2:8,1)] 
} else if (which_data == 2) {
 data("HousePrices")
 real_data <- HousePrices[1:200,c(2,3,4,1)] 
}
names(real_data)[ncol(real_data)] <- "y"

set.seed(112233)
ind <- sample(1:nrow(real_data),floor(0.7*nrow(real_data)))
real_data_train <- real_data[ind,]
real_data_test <- real_data[-ind,]

ggcorrplot::ggcorrplot(cor(real_data_train),
                       lab = TRUE)

base::kappa(real_data_train,
            exact = TRUE)

[1] 413156

rd_model_OLS <- 
  lm(data = real_data_train,
     y ~ . )

rd_model_ridge <- 
  glmnet::cv.glmnet(
    x = as.matrix(real_data_train[,-ncol(real_data_train)]),
    y = real_data_train[,ncol(real_data_train)] ,
    standardize = TRUE,
    alpha = 0,
    lambda = 10^seq(-3,3,by=0.1),
    nfolds = 5,
    type.measure = "default")

rd_model_GCE <- 
  GCE::gce(data = real_data_train,
           y ~ .,
           int.one.prior = "ridge",
           int.one.method = "adaptive",
           model = TRUE,
           seed = 28102024
      )

data.frame(rbind(
  cbind(
    OLS = coef(rd_model_OLS),
    ridge = as.numeric(t(as.matrix(coef(rd_model_ridge,
                                        s="lambda.min")))),
    GCE = coef(rd_model_GCE)
  ),
  RMSE = c(
    MLmetrics::RMSE(
      predict(rd_model_OLS,
              real_data_test[,-ncol(real_data_test)]),
      real_data_test$y),
    MLmetrics::RMSE(
      predict(rd_model_ridge,
              as.matrix(real_data_test[,-ncol(real_data_test)]),
              s = "lambda.min"),
      real_data_test$y),
    MLmetrics::RMSE(
      as.matrix(cbind(1,
                      real_data_test[,
                                     -ncol(real_data_test)]
                      )) %*% coef(rd_model_GCE),
                    real_data_test$y))
))

                       OLS         ridge           GCE
(Intercept)  -94.152497765 -93.390257488 -87.989688098
output         0.005981379   0.005981269   0.005942775
labor          0.001819813   0.001819463   0.001902602
laborshare    86.457346064  85.670154159  78.114224413
capital        0.182051999   0.181963293   0.188712885
capitalshare  62.607968148  61.823954598  54.466652331
fuel           1.331647714   1.331951379   1.324672856
fuelshare    -12.698696683 -13.447873664 -18.510285558
RMSE          25.449084436  25.447383873  24.982648524

References

Bring, Johan. 1994. “How to Standardize Regression Coefficients.” The American Statistician 48: 209–13. https://doi.org/10.2307/2684719.

Caputo, Michael R, and Quirino Paris. 2008. “Comparative Statics of the Generalized Maximum Entropy Estimator of the General Linear Model.” European Journal of Operational Research 185: 195–203. https://doi.org/https://doi.org/10.1016/j.ejor.2006.12.031.

Golan, Amos, George G. Judge, and Douglas Miller. 1996. Maximum Entropy Econometrics : Robust Estimation with Limited Data. Wiley.

Greene, W H. 2008. Econometric Analysis. Pearson/Prentice Hall. https://books.google.pt/books?id=b541vgAACAAJ.

Hoerl, Arthur E, and Robert W Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (February): 55–67. https://doi.org/10.1080/00401706.1970.10488634.

Jaynes, E T. 1957. “Information Theory and Statistical Mechanics.” Physical Review 106 (May): 620–30. https://doi.org/10.1103/PhysRev.106.620.

Pukelsheim, Friedrich. 1994. “The Three Sigma Rule.” The American Statistician 48: 88–91. https://doi.org/10.2307/2684253.

RCoreTeam. 2024. “R: A Language and Environment for Statistical Computing.” https://www.R-project.org/.