# install.packages("devtools")
library(devtools)
# install.packages("ggcorrplot")
library(ggcorrplot)
# install.packages("ggpubr")
library(ggpubr)
# install.packages("MLmetrics")
library(MLmetrics)
# install.packages("glmnet")
library(glmnet)
# devtools::install_github("jorgevazcabral/GCE",
# build_vignettes = TRUE,
# build_manual = TRUE)
library(GCE)
#install.packages("AER")
library(AER)1 Introduction
One of the most widely used data analysis techniques is univariate multiple linear regression. In general, it proposes to mathematically model the relation between a dependent variable and a set of independent variables (or predictors) in order to understand whether the latter predict and/or explain the former. In the modelling process some assumptions should be imposed (Greene 2008).
Data transformations are sometimes applied, being one of those the standardization of the variables. This is done by subtracting the mean from each and then dividing by their respective standard deviation. The estimated coefficients (from now on referred as coefficients) are called standardized coefficients (Bring 1994).
Consider the univariate multiple linear regression model given by \[\begin{align}
y &= X\beta + \epsilon,
\end{align}\] where \(y\) denotes a \((N \times 1)\) vector of noisy observations, \(N \in \mathbb{N}\), \(\beta\) is a \(((K+1) \times 1)\) vector of unknown parameters or coefficients, \(K \in \mathbb{N}\), \(X\) is a known \((N \times (K+1))\) design matrix and \(\epsilon\) is a \((N \times 1)\) vector of random disturbances (errors), usually assumed to have a conditional expected value of zero and representing spherical disturbances, i.e, \(E\left[ \epsilon | X\right]=0\) and \(E[\epsilon \epsilon'|X]=\sigma^2I\), where \(I\) is a (\(N \times N\)) identity matrix and \(\sigma^2\) is the error variance.
The univariate multiple linear regression model can be written as \[\begin{align}
y &= \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \dots + \beta_K x_{K} + \epsilon,
\end{align}\] and standardizing \(y\) and \(x_j,j \in \left\lbrace 1,\dots,K\right\rbrace\) the following results are obtained \[\begin{align}
% y^* &= f^*(y^*) + \epsilon^*
y^* &= X^*b + \epsilon^*\\
y^* &= b_0 + b_1x_1^* + b_2x_2^* + \dots + b_Kx_K^* + \epsilon^*,
\end{align}\] where \[\begin{align} y_i^*&=\frac{y_i-\frac{\sum_{i=1}^{N}y_i}{N}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\frac{\sum_{i=1}^{N}y_i}{N}\right)^2}},\\ x_{ji}^*&=\frac{x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}\right)^2}},\\ b_j&=\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}\right)^2}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\frac{\sum_{i=1}^{N}y_i}{N}\right)^2}}\beta_j, \label{stand_coef_1}
\end{align}\] with \(j\in \left\lbrace 1,\dots,K\right\rbrace\), and \(b_0=0\). In this formulation, \(b_j\) are called standardized coefficients.
1.1 Ordinary Least Squares
The Ordinary Least Squares (OLS) estimator of \(\beta\) takes the form \[\begin{align} \widehat{\beta}^{OLS}= \begin{bmatrix} \widehat{\beta}_0^{OLS}\\ \widehat{\beta}_1^{OLS}\\ \widehat{\beta}_2^{OLS}\\ \vdots \\ \widehat{\beta}_K^{OLS} \end{bmatrix} &= \underset{\beta}{\operatorname{argmin}} \|y-X\beta\|^2\\ &= \left(X'X\right)^{-1}X'y, \end{align}\] where \(X'\) is the transpose of \(X\).
1.2 Ridge regression
The Ridge regression introduced by Hoerl and Kennard (Hoerl and Kennard 1970) is an estimation procedure to handle collinearity without removing variables from the regression model. By adding a small non-negative constant (ridge or shrinkage parameter) to the diagonal of the correlation matrix of the explanatory variables, it is possible to reduce the variance of the OLS estimator through the introduction of some bias. Although the resulting estimators are biased, the biases are small enough for these estimators to be substantially more precise than the unbiased estimators. The challenge in Ridge regression remains on the selection of the ridge parameter. One straightforward approach is based on simply plotting the coefficients against several possible values for the ridge parameter and inspecting the resulting traces. The Ridge Regression estimator of \(\lambda\) takes the form \[\begin{align} \widehat{\beta}^{ridge}&= \underset{\beta}{\operatorname{argmin}} \|y-X\beta\|^2+\lambda \|\beta\|^2 \\ &=(X'X+\lambda I)^{-1}X'y, \end{align}\] where \(\lambda \geq 0\) denotes the ridge parameter and \(I\) is a \((K \times K)\) identity matrix. Note that when \(\lambda \rightarrow 0\), the Ridge regression estimator approaches the OLS estimator whereas the Ridge regression estimator approaches the zero vector when \(\lambda \rightarrow \infty\). Thus, a trade-off between variance and bias is needed.
1.3 Generalized Maximum Entropy estimator
Golan et al (Golan, Judge, and Miller 1996) generalized the Maximum Entropy formalism (Jaynes 1957) to linear inverse problems with noise, expressed in the previous chapter linear model. The idea is to treat each \(\beta_j\), \(j\in\left\{ 0,\dots,K\right\}\), as a discrete random variable with a compact support and \(2 \leq M < \infty\) possible outcomes, and each \(\epsilon_i\), \(i \in \left\{ 1, \dots, N \right\}\), as a finite and discrete random variable with \(2 \leq J < \infty\) possible outcomes. Assuming that both the unknown parameters and the unknown error terms may be bounded a priori, the linear model can be presented as \[\begin{equation*} Y=XZp + Vw, \end{equation*}\] where \[\begin{equation} \beta=Zp= \left[ \begin{array}{cccc} z'_1 & 0 & \cdots & {0} \\ 0 & z'_2 & \cdots & {0} \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & {z}'_K \end{array}\right] \left[ \begin{array}{c} {p}_1 \\ {p}_2 \\ \vdots\\ {p}_K \end{array}\right], \end{equation}\] with \({Z}\) a \((K \times (K\times M))\) matrix of support values and \({p}\) a \(((K\times M) \times 1)\) vector of unknown probabilities, and \[\begin{equation} \epsilon={Vw}= \left[ \begin{array}{cccc} v'_1 & 0 & \cdots & {0} \\ 0 & v'_2 & \cdots & {0} \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v'_N \end{array}\right] \left[ \begin{array}{c} w_1 \\ w_2 \\ \vdots\\ w_N \end{array}\right], \end{equation}\] with \(V\) a \((N \times (N\times J))\) matrix of support values and \(w\) a \(((N\times J) \times 1)\) vector of unknown probabilities. For the linear regression model, the Generalized Maximum Entropy (GME) estimator is given by \[\begin{equation} \hat{\beta}^{GME}(Z,V) = \underset{p,w}{\operatorname{argmax}} \left\{-p' \ln p - w' \ln w \right\}, \label{Golan631} \end{equation}\] subject to the model constraint \[\begin{equation*} Y=XZp + Vw, \end{equation*}\] and the additivity constraints for \(p\) and \(w\), respectively, \[\begin{equation*} \begin{array}{c} 1_K=(I_K \otimes 1'_M)p,\\ 1_N=(I_N \otimes 1'_J)w, \end{array} \end{equation*}\] where \(\otimes\) represents the Kronecker product, \({1}\) is a column vector of ones with a specific dimension, \({I}\) is an identity matrix with a specific dimension and \({Z}\) and \({V}\) are the matrices of supports, and \({p}>{0}\) and \({w}>{0}\) are probability vectors to be estimated. The GME estimator generates the optimal probability vectors \(\widehat{{p}}\) and \(\widehat{{w}}\) that can be used to form point estimates of the unknown parameters and the unknown random errors. Since the objective function is strictly concave in the interior of the additivity constraint set, a unique solution for the GME estimator is guaranteed if the intersection of the model and the additivity constraint sets is non-empty (Golan, Judge, and Miller 1996). The supports in matrices \({Z}\) and \({V}\) are defined as being closed and bounded intervals within which each parameter or error is restricted to lie, implying that researchers need to provide exogenous information (which, unfortunately, it is not always available). This is considered the main weakness of the GME estimator (Caputo and Paris 2008). Golan et al (Golan, Judge, and Miller 1996) discuss these issues in the case of minimal prior information: for the unknown parameters, the authors recommend the use of wide bounds (this is naturally subjective) for the supports in \({Z}\), without extreme risk consequences; for the unknown errors, the authors suggest the use of the three-sigma rule with a sample scale parameter (Pukelsheim 1994). The number of points \(M\) in the supports is less controversial and are usually used in the literature between 3 and 7 points, since there is likely no significant improvement in the estimation with more points in the supports. The three-sigma rule, considering the standard deviation of the noisy observations and \(J=3\) points in the supports is usually adopted.
2 Setting up
2.1 R software
“R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.” — R software (RCoreTeam 2024)
Follow https://cran.r-project.org/ to install it.
2.2 R Studio
“RStudio is an integrated development environment (IDE) designed to support multiple languages, including both R and Python.” — RStudio
Follow https://posit.co/downloads/ to install it.
2.3 Packages
During the workshop, several R packages will be necessary. Open R Studio and create a new Quarto document.
Insert the following code inside the code chunk.
https://github.com/jorgevazcabral
3 Simulated Data
3.1 Generate Data
n = 100 #Number of individuals
intercept.beta = 0 #Intercept
y.gen.cont.beta = c(3, 6, 9) #Coefficients used for generating y
cont.k = 0 #Number of continuous variables not used for generating y #use also 5
condnumber = 1 #Condition number of the X matrix #use also 10 and 100
simulated_data <-
gdata_GCE(
n = n,
cont.k = cont.k,
y.gen.cont.k = length(y.gen.cont.beta),
y.gen.cont.beta = y.gen.cont.beta,
intercept.beta = intercept.beta,
Xgenerator.method = "svd",
condnumber = condnumber,
Xgenerator.seed = 28102024,
error.dist = "normal",
error.dist.mean = 0,
error.dist.sd = 1,
dataframe = TRUE) X001 X002 X003 y
1 -0.02408009 -0.003041504 -0.16085117 -2.7581023
2 0.14891804 -0.055189659 0.09739407 0.7978739
3 -0.05127719 0.085678733 0.02025297 0.7474960
4 0.15802821 -0.106477674 0.01375576 0.6281851
5 0.07692678 -0.044082368 0.14430848 0.3824326
(simulated_data_cor <-
cor(simulated_data,
method = "pearson")) X001 X002 X003 y
X001 1.000000000 -0.001580996 -0.005650525 0.2385828
X002 -0.001580996 1.000000000 -0.011835200 0.3491026
X003 -0.005650525 -0.011835200 1.000000000 0.6020907
y 0.238582834 0.349102573 0.602090689 1.0000000
ggcorrplot::ggcorrplot(simulated_data_cor,
lab = TRUE)3.2 Estimation
3.2.1 OLS
sd_model_OLS <-
lm(data = simulated_data,
y ~ . #A typical model has the form response ~ terms #or y ~ .
)summary(sd_model_OLS)
Call:
lm(formula = y ~ ., data = simulated_data)
Residuals:
Min 1Q Median 3Q Max
-2.11530 -0.76077 -0.00233 0.76996 2.66429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01902 0.10688 0.178 0.859116
X001 3.69581 1.04511 3.536 0.000627 ***
X002 5.44107 1.04650 5.199 1.13e-06 ***
X003 9.44850 1.06666 8.858 4.22e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.045 on 96 degrees of freedom
Multiple R-squared: 0.5483, Adjusted R-squared: 0.5342
F-statistic: 38.84 on 3 and 96 DF, p-value: < 2.2e-16
(sd_model_OLS_coef <- coef(sd_model_OLS))(Intercept) X001 X002 X003
0.01902268 3.69580700 5.44106587 9.44849741
[1] 1.023594
(sd_model_OLS_beta_RMSE <-
MLmetrics::RMSE(sd_model_OLS_coef,
simulated_data_coef))[1] 0.4995164
3.2.1.1 Assumptions
p <- sjPlot::plot_model(sd_model_OLS,
type = "diag")
ggpubr::ggarrange(p[[1]],
p[[2]],
p[[3]],
p[[4]])performance::check_model(sd_model_OLS)#Breusch-Pagan test
# The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model (by default the same explanatory variables are taken as in the main regression model) and rejects if too much of the variance is explained by the additional explanatory variables.
lmtest::bptest(sd_model_OLS)
studentized Breusch-Pagan test
data: sd_model_OLS
BP = 3.9348, df = 3, p-value = 0.2686
#Durbin-Watson test
#The Durbin-Watson test has the null hypothesis that the autocorrelation of the disturbances is 0.
lmtest::dwtest(sd_model_OLS)
Durbin-Watson test
data: sd_model_OLS
DW = 1.9813, p-value = 0.4626
alternative hypothesis: true autocorrelation is greater than 0
3.2.2 Ridge
plot(sd_model_ridge)plot(sd_model_ridge$glmnet.fit,
xvar = "lambda")
abline(v = log(sd_model_ridge$lambda.min),
lty = 3)
abline(v = log(sd_model_ridge$lambda.1se),
lty = 3)coef(sd_model_ridge,
s = "lambda.min")4 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 0.01305381
X001 3.60013026
X002 5.29956189
X003 9.20560163
coef(sd_model_ridge,
s = "lambda.1se")4 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -0.04940631
X001 2.60046509
X002 3.82309097
X003 6.66244241
(sd_model_ridge_RMSE <-
MLmetrics::RMSE(predict(sd_model_ridge,
as.matrix(simulated_data[,-ncol(simulated_data)]),
s = "lambda.min"),
simulated_data$y))[1] 1.024007
(sd_model_ridge_beta_RMSE <-
MLmetrics::RMSE(sd_model_ridge_coef,
simulated_data_coef))[1] 0.4725496
3.2.3 GCE
sd_model_GCE <-
GCE::gce(data = simulated_data,
y ~ .,
int.one.prior = "ridge",
int.one.method = "adaptive",
model = TRUE,
ci.B = 10,
seed = 28102024
)summary(sd_model_GCE)
Call:
GCE::gce(formula = y ~ ., data = simulated_data, model = TRUE,
int.one.prior = "ridge", int.one.method = "adaptive", ci.B = 10,
seed = 28102024)
Residuals:
Min 1Q Median 3Q Max
-2.12538 -0.73902 0.02761 0.77810 2.72719
Coefficients:
Estimate LB 95% CI UB 95% CI
(Intercept) 0.01032578 -0.1109409 0.1965275
X001 3.57685324 2.4163743 5.3751207
X002 5.25854033 3.2613797 6.9218831
X003 9.08505509 7.7768015 10.7058827
Residual standard error: 1.046 on 96 degrees of freedom
Multiple R-squared: 0.5293, Adjusted R-squared: 0.5146
Choosen support: 2.385
CV-RMSE: 1.085
sd_model_GCE_coef <- coef(sd_model_GCE) [1] 1.024434
(sd_model_GCE_beta_RMSE <-
MLmetrics::RMSE(sd_model_GCE_coef,
simulated_data_coef))[1] 0.4716628
3.2.4 All
data.frame(rbind(
cbind(
"TRUE" = simulated_data_coef,
OLS = sd_model_OLS_coef,
ridge = as.numeric(sd_model_ridge_coef),
GCE = sd_model_GCE_coef
),
beta_RMSE = c(
0,
sd_model_OLS_beta_RMSE,
sd_model_ridge_beta_RMSE,
sd_model_GCE_beta_RMSE
),
RMSE = c(
0,
sd_model_OLS_RMSE,
sd_model_ridge_RMSE,
sd_model_GCE_RMSE)
)) TRUE. OLS ridge GCE
(Intercept) 0 0.01902268 0.01305381 0.01032578
X001 3 3.69580700 3.60013026 3.57685324
X002 6 5.44106587 5.29956189 5.25854033
X003 9 9.44849741 9.20560163 9.08505509
beta_RMSE 0 0.49951640 0.47254955 0.47166278
RMSE 0 1.02359386 1.02400709 1.02443366
4 Real Data
which_data <- 1
if (which_data == 1) {
data("Electricity1970")
real_data <- Electricity1970[,c(2:8,1)]
} else if (which_data == 2) {
data("HousePrices")
real_data <- HousePrices[1:200,c(2,3,4,1)]
}
names(real_data)[ncol(real_data)] <- "y"
set.seed(112233)
ind <- sample(1:nrow(real_data),floor(0.7*nrow(real_data)))
real_data_train <- real_data[ind,]
real_data_test <- real_data[-ind,]
ggcorrplot::ggcorrplot(cor(real_data_train),
lab = TRUE)base::kappa(real_data_train,
exact = TRUE)[1] 413156
rd_model_OLS <-
lm(data = real_data_train,
y ~ . )
rd_model_ridge <-
glmnet::cv.glmnet(
x = as.matrix(real_data_train[,-ncol(real_data_train)]),
y = real_data_train[,ncol(real_data_train)] ,
standardize = TRUE,
alpha = 0,
lambda = 10^seq(-3,3,by=0.1),
nfolds = 5,
type.measure = "default")
rd_model_GCE <-
GCE::gce(data = real_data_train,
y ~ .,
int.one.prior = "ridge",
int.one.method = "adaptive",
model = TRUE,
seed = 28102024
)data.frame(rbind(
cbind(
OLS = coef(rd_model_OLS),
ridge = as.numeric(t(as.matrix(coef(rd_model_ridge,
s="lambda.min")))),
GCE = coef(rd_model_GCE)
),
RMSE = c(
MLmetrics::RMSE(
predict(rd_model_OLS,
real_data_test[,-ncol(real_data_test)]),
real_data_test$y),
MLmetrics::RMSE(
predict(rd_model_ridge,
as.matrix(real_data_test[,-ncol(real_data_test)]),
s = "lambda.min"),
real_data_test$y),
MLmetrics::RMSE(
as.matrix(cbind(1,
real_data_test[,
-ncol(real_data_test)]
)) %*% coef(rd_model_GCE),
real_data_test$y))
)) OLS ridge GCE
(Intercept) -94.152497765 -93.390257488 -87.989688098
output 0.005981379 0.005981269 0.005942775
labor 0.001819813 0.001819463 0.001902602
laborshare 86.457346064 85.670154159 78.114224413
capital 0.182051999 0.181963293 0.188712885
capitalshare 62.607968148 61.823954598 54.466652331
fuel 1.331647714 1.331951379 1.324672856
fuelshare -12.698696683 -13.447873664 -18.510285558
RMSE 25.449084436 25.447383873 24.982648524