Effect of the collinearity on the stability and statistical properties of OLS estimators

Scoring model

Analyzed data set: generated

Loading the necessary packages…

Project Objectives

The objective of the project is to show how the degree of correlation between predictors affects the shape of the cost function, the residual sum of squares (RSS), in the space of regression coefficients.

Simple linear regression model

The equation of the regression model is:

\[Y_i = a + b X_i + \varepsilon_i\]

The estimated regression line represents the conditional mean of the dependent variable for given values of the independent variables.

\[\hat{y} = a + b x\]

The difference between the estimated value and the observed value represents the estimation error (residual):

\[e_i = y_i - \hat{y}_i\]

The OLS method estimates the regression parameters by minimizing the Residual Sum of Squares (RSS).

Estimation of Regression Parameters

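In the OLS regression model, the parameters are estimated using the Ordinary Least Squares method, which gives the model its name. Under the usual normality assumption, the response is conditionally normal with mean \(\mu_i = a + b x_i\) and constant variance: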

\[ Y \mid X = x_i \sim N(\mu_i,\sigma^2) \]

This method is based on determining the values of the regression coefficients that minimize the sum of squared errors:

\[RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \to \min \]

By minimizing this function, we obtain the estimated values of the regression coefficients:

the slope coefficient

\[\hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]

the intercept (\(\hat{a}\))

\[\hat{a} = \bar{y} - \hat{b}\bar{x}\]

where \(\bar{y}\) is the mean of the response variable and \(\bar{x}\) is the mean of the predictor.

Thus, we obtain the estimated regression line:

\[\hat{y}_i = \hat{a} + \hat{b}x_i\]

which approximates the relationship between the dependent variable and the predictor such that the sum of squared errors is minimized.
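The closed-form expressions above can be checked numerically. The following is a minimal sketch on simulated data. The seed, the sample size and the true coefficients (1 and 2) are illustrative assumptions, not values from this report.

```r
# Simulated data; seed, n and the true coefficients are illustrative assumptions
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Closed-form OLS estimates from the slope and intercept formulas
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# Reference fit with lm()
fit <- lm(y ~ x)

# Residual sum of squares evaluated at the closed-form solution
rss <- sum((y - a_hat - b_hat * x)^2)

print(c(intercept = a_hat, slope = b_hat))
print(coef(fit))
```

Both printed vectors should agree up to floating-point error, and `rss` should equal `sum(resid(fit)^2)`.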

Multiple Linear Regression Model (MRM)

The multiple linear regression model describes the relationship between a dependent variable and a set of predictors.

The regression equation for observation \(i\) is:

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i \]

where:

\(Y_i\) is the dependent variable
\(X_{ij}\) is the value of the \(j\)-th independent variable for observation \(i\)
\(\beta_j\) are the model parameters
\(\varepsilon_i\) is the error term.

The error terms are assumed to follow a normal distribution with mean zero and constant variance (\(\sigma^2 = const.\)).

\[ \varepsilon_i \sim N(0,\sigma^2) \]

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.952377    2.733657

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.9524     0.1980   9.861 0.000000000000000268 ***
## Xx2           2.7337     0.2097  13.034 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7431, Adjusted R-squared:  0.7378 
## F-statistic: 140.3 on 2 and 97 DF,  p-value: < 0.00000000000000022


Matrix form of the model

For a dataset with n observations, the model can be written in matrix form:

\[ \mathbf{y} = X\beta + \varepsilon \]

where:

The vector of the dependent variable is:

\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \]

The predictor matrix is:

\[ X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \]

The first column contains only ones and corresponds to the intercept \(\beta_0\), i.e., the expected value of the dependent variable when all predictors are equal to zero.

The coefficient vector is:

\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \]

The error vector is:

\[ \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]

Estimation of the multiple regression model parameters

The parameters of the multiple regression model are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared errors.

The residual vector is defined as:

\[ e = y - X\beta \]

The objective function to be minimized is:

\[ RSS(\beta) = (y - X\beta)^T (y - X\beta) \]

By differentiating this function with respect to the coefficient vector \(\beta\) and setting it equal to zero, we obtain the normal equations:

\[ X^T X \beta = X^T y \]

If the matrix \(X^T X\) is invertible, the solution for estimating the coefficients is given by:

\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
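As a rough numerical check of this closed-form solution, the sketch below solves the normal equations directly and compares the result with `lm()`. The simulated data (seed, n = 100, true coefficients 5, 2 and 3, error standard deviation 2) are assumptions chosen for illustration only.

```r
# Simulated data; all numeric settings here are illustrative assumptions
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n, sd = 2)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve X'X beta = X'y without forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with lm(), which relies on a QR decomposition internally
fit <- lm(y ~ x1 + x2)
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))
```

Using `solve(A, b)` rather than `solve(A) %*% b` avoids computing the inverse, which is usually the numerically safer choice when \(X^T X\) is close to singular.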

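# Fit the model and keep its summary (the pipe %>% comes from the packages loaded above)
model <- lm(y ~ X) %>% summary()

# Row 1, column 3 of the coefficient table: the t value of the intercept
model$coefficients[1, 3]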

## [1] 2.114195

This relationship shows that the estimation of the coefficients depends on:

the matrix of predictor cross-products \((X^T X)\), and
the cross-products between the predictors and the response variable \((X^T y)\)

This result is important because it allows the analysis of several properties of the regression model, such as:

multicollinearity (when \((X^T X)\) is nearly singular)
eigenvalues of \((X^T X)\)
the condition number of the matrix
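These quantities can be computed directly. The sketch below builds two standardized predictors with an exactly controlled sample correlation r and reports the extreme eigenvalues and the condition number of \(X^T X\). The helper name `cond_xtx`, the seed and the grid of r values are assumptions made for illustration.

```r
set.seed(3)
n  <- 100

# Two standardized base variables that are exactly uncorrelated in the sample
z1    <- as.numeric(scale(rnorm(n)))
noise <- rnorm(n)
z2    <- as.numeric(scale(resid(lm(noise ~ z1))))

# Hypothetical helper: eigenvalues and condition number of X'X for correlation r
cond_xtx <- function(r) {
  x2 <- r * z1 + sqrt(1 - r^2) * z2   # sample correlation with z1 equals r
  X  <- cbind(1, z1, x2)              # design matrix with intercept column
  ev <- eigen(crossprod(X), symmetric = TRUE)$values
  c(r = r, lambda_min = min(ev), lambda_max = max(ev), cond = max(ev) / min(ev))
}

res <- t(sapply(c(0, 0.5, 0.9, 0.99), cond_xtx))
print(round(res, 2))
```

With this construction the eigenvalues of \(X^T X\) are \(n\), \((n-1)(1+r)\) and \((n-1)(1-r)\), so the smallest eigenvalue approaches zero and the condition number grows without bound as r approaches 1.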

This formula is the OLS estimator of the coefficient vector and forms the basis of linear regression analysis.

Distribution of the estimators

The distribution of the estimators changes as a function of the correlation coefficient (r) between predictors, with higher collinearity leading to increased variance and more spread-out estimates.
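One way to see this effect is a small Monte Carlo experiment. The sketch below is not the simulation used in this report. The helper `simulate_b1`, the number of replications, the seed and the true coefficients are all illustrative assumptions.

```r
set.seed(123)

# Hypothetical helper: simulate `reps` datasets with predictor correlation r
# and return the OLS estimate of the x1 coefficient from each fit
simulate_b1 <- function(r, reps = 500, n = 100) {
  replicate(reps, {
    x1 <- rnorm(n)
    x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)  # population correlation with x1 is r
    y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n, sd = 2)
    unname(coef(lm(y ~ x1 + x2))["x1"])
  })
}

# Empirical standard deviation of the estimates for increasing correlation
r_values <- c(0, 0.9, 0.99)
spread   <- sapply(r_values, function(r) sd(simulate_b1(r)))
print(data.frame(r = r_values, sd_b1 = round(spread, 3)))
```

The estimates stay centered near the true value in every scenario, while their spread widens as r increases.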

Higher predictor correlation → higher SE, lower t, higher p

As the correlation (r) between predictors increases, the standard errors of the coefficient estimates grow, resulting in smaller t-values and consequently larger p-values, which reduces the statistical significance of the predictors.
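For the model with an intercept and two predictors, this dependence can be written explicitly. The expression below is the standard textbook result, stated here as a supplement rather than taken from this report:

\[ \operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - r^2)\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}, \qquad j = 1, 2 \]

The factor \(1/(1-r^2)\) is the variance inflation factor (VIF). The standard error therefore grows in proportion to \(1/\sqrt{1-r^2}\), and the statistic \(t_j = \hat{\beta}_j / SE(\hat{\beta}_j)\) shrinks by the same factor. For example, \(r = 0.9\) gives a VIF of about 5.26 (standard errors about 2.29 times larger than with uncorrelated predictors), and \(r = 0.99\) gives a VIF of about 50.25 (standard errors about 7.09 times larger).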

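Higher collinearity: R² increases, overall F-test remains significant

In the series of fits reported in the next section, R-squared rises from about 0.74 to 0.84 and the F-statistic from about 140 to 255 as collinearity among predictors increases, so the overall F-test stays highly significant throughout. Collinearity inflates the standard errors of the individual coefficients, not the overall fit: in the last fit both slope t-tests are non-significant (p = 0.857 and p = 0.377) even though the model as a whole is strongly significant.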
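A series of fits of this kind can be sketched as follows. The construction (fixed error vector, exactly controlled sample correlation), the helper `fit_for_r`, the seed and the true coefficients are assumptions for illustration and need not match the code behind this report.

```r
set.seed(10)
n  <- 100

# Standardized base variables, exactly uncorrelated in the sample
z1    <- as.numeric(scale(rnorm(n)))
noise <- rnorm(n)
z2    <- as.numeric(scale(resid(lm(noise ~ z1))))
e     <- rnorm(n, sd = 2)   # one fixed error vector reused in every fit

# Hypothetical helper: fit the model for correlation r and collect key statistics
fit_for_r <- function(r) {
  x1 <- z1
  x2 <- r * z1 + sqrt(1 - r^2) * z2   # sample correlation with x1 equals r
  y  <- 5 + 2 * x1 + 3 * x2 + e
  s  <- summary(lm(y ~ x1 + x2))
  c(r         = r,
    se_x1     = s$coefficients["x1", "Std. Error"],
    t_x1      = s$coefficients["x1", "t value"],
    p_x1      = s$coefficients["x1", "Pr(>|t|)"],
    sigma     = s$sigma,
    r_squared = s$r.squared,
    f_stat    = unname(s$fstatistic["value"]))
}

results <- t(sapply(c(0, 0.5, 0.9, 0.99, 0.999), fit_for_r))
print(signif(results, 3))
```

Because the predictors span the same column space for every r below 1, the residuals and the residual standard error are identical across fits. Only the standard errors of the individual coefficients change.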

Geometric interpretation and multicollinearity

In what follows, we estimate a series of OLS regression models in order to highlight how the degree of correlation between predictors influences the shape of the ellipses of the objective function (RSS) in the coefficient space.

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.952377    2.733657

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.9524     0.1980   9.861 0.000000000000000268 ***
## Xx2           2.7337     0.2097  13.034 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7431, Adjusted R-squared:  0.7378 
## F-statistic: 140.3 on 2 and 97 DF,  p-value: < 0.00000000000000022

From a geometric perspective, the RSS objective function is quadratic in the regression coefficients, so its level sets form elliptical contours in the coefficient space, with the center of these ellipses representing the OLS solution, i.e., the optimal coefficient values.

Each point corresponds to a combination of regression coefficients, and the OLS method selects the combination that minimizes the sum of squared errors (RSS).
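The RSS surface can be evaluated on a grid of coefficient pairs to make this picture concrete. The sketch below uses simulated data with an assumed predictor correlation of 0.9. The seed, grid range and true coefficients are illustrative choices, and the intercept is held at its OLS estimate so that the surface can be drawn in two dimensions.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # assumed correlation of about 0.9
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n, sd = 2)

fit    <- lm(y ~ x1 + x2)
b0_hat <- unname(coef(fit)[1])
b1_hat <- unname(coef(fit)["x1"])
b2_hat <- unname(coef(fit)["x2"])

# RSS as a function of the two slopes, intercept fixed at its OLS estimate
rss <- function(b1, b2) sum((y - b0_hat - b1 * x1 - b2 * x2)^2)

# Grid of candidate coefficient pairs centered at the OLS solution
b1_grid  <- seq(b1_hat - 2, b1_hat + 2, length.out = 81)
b2_grid  <- seq(b2_hat - 2, b2_hat + 2, length.out = 81)
rss_grid <- outer(b1_grid, b2_grid, Vectorize(rss))

# Location of the smallest RSS on the grid
idx <- arrayInd(which.min(rss_grid), dim(rss_grid))
print(c(b1 = b1_grid[idx[1]], b2 = b2_grid[idx[2]]))
print(c(b1 = b1_hat, b2 = b2_hat))

# Contour plot of the RSS surface with the OLS solution marked
contour(b1_grid, b2_grid, rss_grid, nlevels = 15, xlab = "b1", ylab = "b2")
points(b1_hat, b2_hat, pch = 19)
```

The grid minimum coincides with the OLS estimates, and with correlated predictors the contours are stretched along the direction in which an increase in one slope can be offset by a decrease in the other.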

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.784936    2.855927

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7849     0.1993   8.958   0.0000000000000239 ***
## Xx2           2.8559     0.2094  13.640 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7408, Adjusted R-squared:  0.7355 
## F-statistic: 138.6 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.790427    2.865725

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7904     0.2018   8.872   0.0000000000000366 ***
## Xx2           2.8657     0.2120  13.520 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7521 
## F-statistic: 151.2 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.794573    2.875069

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7946     0.2068   8.678   0.0000000000000953 ***
## Xx2           2.8751     0.2170  13.251 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7716, Adjusted R-squared:  0.7669 
## F-statistic: 163.8 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.797356    2.884303

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7974     0.2148   8.368     0.00000000000044 ***
## Xx2           2.8843     0.2250  12.820 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7801 
## F-statistic: 176.6 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.798604    2.893849

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7986     0.2269   7.927     0.00000000000384 ***
## Xx2           2.8938     0.2371  12.203 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7961, Adjusted R-squared:  0.7919 
## F-statistic: 189.3 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.797866    2.904353

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7979     0.2453   7.329      0.0000000000695 ***
## Xx2           2.9044     0.2556  11.364 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.8065, Adjusted R-squared:  0.8025 
## F-statistic: 202.2 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.794075    2.917036

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7941     0.2746   6.534        0.00000000297 ***
## Xx2           2.9170     0.2849  10.240 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.816,  Adjusted R-squared:  0.8122 
## F-statistic: 215.1 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.784327    2.934923

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7843     0.3268   5.460    0.000000365949147 ***
## Xx2           2.9349     0.3371   8.706    0.000000000000083 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.8247, Adjusted R-squared:  0.8211 
## F-statistic: 228.1 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.756882    2.969856

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7569     0.4504   3.901             0.000177 ***
## Xx2           2.9699     0.4608   6.446        0.00000000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.8326, Adjusted R-squared:  0.8292 
## F-statistic: 241.3 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.714536    3.015728

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 < 0.0000000000000002 ***
## Xx1           1.7145     0.6298   2.722              0.00769 ** 
## Xx2           3.0157     0.6402   4.711           0.00000824 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.8364, Adjusted R-squared:  0.833 
## F-statistic:   248 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    1.529752    3.203237

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 <0.0000000000000002 ***
## Xx1           1.5298     1.3987   1.094              0.2768    
## Xx2           3.2032     1.4091   2.273              0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.8394, Adjusted R-squared:  0.8361 
## F-statistic: 253.5 on 2 and 97 DF,  p-value: < 0.00000000000000022

## y ~ X

## (Intercept)         Xx1         Xx2 
##    5.270131    0.801922    3.931668

## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7460 -1.3215 -0.2489  1.2427  4.1597 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   5.2701     0.1923  27.409 <0.0000000000000002 ***
## Xx1           0.8019     4.4232   0.181               0.857    
## Xx2           3.9317     4.4336   0.887               0.377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.8401, Adjusted R-squared:  0.8368 
## F-statistic: 254.8 on 2 and 97 DF,  p-value: < 0.00000000000000022

As the degree of collinearity increases, the ellipses become more elongated, and the estimates of the regression coefficients become unstable (their variance increases significantly). For this reason, in practice, regularization-based methods, which apply penalties to large coefficients in order to improve the stability of the estimates, or robust regression models are often used.
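As an illustration of the regularization idea, the sketch below computes the ridge estimator \((X^T X + \lambda I)^{-1} X^T y\) by hand on standardized predictors. It is a minimal sketch under stated assumptions: the simulated data, the helper `ridge` and the penalty values are illustrative and are not part of the analysis above.

```r
set.seed(2024)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.99 * x1 + sqrt(1 - 0.99^2) * rnorm(n)   # nearly collinear predictors
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n, sd = 2)

# Standardize predictors and center the response so the intercept is not penalized
Xs <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Hypothetical helper: ridge estimate for a given penalty; lambda = 0 gives OLS
ridge <- function(lambda) {
  drop(solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc)))
}

lambdas <- c(0, 1, 10, 50)
coefs   <- t(sapply(lambdas, ridge))
print(cbind(lambda = lambdas, round(coefs, 3)))

# Condition number of the penalized cross-product matrix for each lambda
cond <- sapply(lambdas, function(l) {
  ev <- eigen(crossprod(Xs) + l * diag(ncol(Xs)), symmetric = TRUE)$values
  max(ev) / min(ev)
})
print(round(cond, 1))
```

Adding \(\lambda\) to every eigenvalue of \(X^T X\) lowers the condition number and shrinks the coefficient vector, at the cost of introducing some bias.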

Conclusions

The degree of correlation between predictors has a significant impact on the shape of the RSS function and the stability of the estimated coefficients. As multicollinearity increases, the RSS contours become more elongated, indicating higher uncertainty in parameter estimation.

Multicollinearity affects the reliability of the regression model by increasing the variance of the estimated coefficients. This makes the model sensitive to small changes in the data and reduces the interpretability of individual predictors.

To address the issues caused by multicollinearity, alternative approaches such as regularization methods (e.g., ridge regression) or robust regression techniques should be considered, as they improve the stability and predictive performance of the model.
