Loading the necessary packages…
The objective of the project is to highlight the effect of the degree of correlation between predictors on the regression coefficient estimators.
The equation of the regression model is:
\[Y_i = a + b X_i + \varepsilon_i\]
The estimated regression line is the conditional mean of the dependent variable for a given value of the independent variable:
\[\hat{y} = a + b x\]
The difference between the observed value and the estimated value is the estimation error:
\[e_i = y_i - \hat{y}_i\]
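The OLS model estimates the regression by minimizing the residual sum of squares (RSS). The regression parameters are estimated with the method of least squares, which gives the ordinary least squares (OLS) method its name. The model assumes that, for each value \(x_i\) of the predictor, the response is normally distributed around the conditional mean \(\mu_i = a + b x_i\) with constant variance \(\sigma^2\):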
\[ Y \mid X = x_i \sim N(\mu_i,\sigma^2) \]
The method determines the values of the regression coefficients that minimize the sum of squared errors:
\[RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \to \min\]
Minimizing this function gives the estimated values of the regression coefficients:
the regression slope, or parameter \(b\):
\[\hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
the intercept, or parameter \(a\):
\[\hat{a} = \bar{y} - \hat{b}\bar{x}\]
where \(\bar{y}\) is the mean of the response variable and \(\bar{x}\) is the mean of the predictor values.
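As a brief sketch of where these two formulas come from (a standard derivation that the text does not spell out), we set the partial derivatives of RSS with respect to \(a\) and \(b\) to zero:

\[\frac{\partial RSS}{\partial a} = -2\sum_{i=1}^{n}(y_i - a - b x_i) = 0, \qquad \frac{\partial RSS}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0\]

The first equation gives \(\hat{a} = \bar{y} - \hat{b}\bar{x}\). Substituting this into the second equation, and using \(\sum_{i=1}^{n}(x_i-\bar{x}) = 0\) and \(\sum_{i=1}^{n}(y_i-\bar{y}) = 0\), we get:

\[\sum_{i=1}^{n} x_i\left[(y_i-\bar{y}) - \hat{b}(x_i-\bar{x})\right] = 0 \quad\Longrightarrow\quad \hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]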
Thus, we obtain the estimated values or the estimated regression line:
\[\hat{y}_i = \hat{a} + \hat{b}x_i\]
which is the line approximating the relationship between the dependent variable and the predictor for which the sum of squared errors is minimal.
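The closed-form estimators above can be illustrated with a minimal Python sketch. The function name `fit_simple_ols` and the toy data are assumptions introduced here for illustration; they are not part of the original analysis.

```python
def fit_simple_ols(x, y):
    """Return (a_hat, b_hat) for y = a + b*x using the closed-form OLS formulas."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Toy data (made up for illustration): the points lie exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

a_hat, b_hat = fit_simple_ols(x, y)
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)
print(a_hat, b_hat, rss)
```

Because the toy points lie exactly on a line, the fitted intercept and slope recover that line and the residual sum of squares is zero up to rounding.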
The multiple linear regression model describes the relationship between the dependent variable and a set of predictors.
The regression model equation for observation \(i\) is:
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i \]
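where \(Y_i\) is the dependent variable, \(X_{i1}, \dots, X_{ip}\) are the predictors, \(\beta_0, \dots, \beta_p\) are the regression coefficients, and \(\varepsilon_i\) is the error term.
The errors follow a normal distribution with mean zero and constant variance \(\sigma^2\):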
\[ \varepsilon_i \sim N(0,\sigma^2) \]
An estimator is the statistical rule by which the regression coefficients are obtained from a sample of size \(n\); it is a random variable constructed to estimate the values of the model parameters. An estimate is the numerical value the estimator takes in a particular simulation. Because many different samples can be drawn from the reference population, the estimated coefficients vary from sample to sample, which is why the regression equation is written in terms of estimators:
\[\hat{\beta} = (X^TX)^{-1}X^Ty\]
This is the formula by which we obtain the coefficients of the regression model in each simulation.
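The matrix formula can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names `solve_linear_system` and `fit_ols` and the toy data are assumptions. Rather than forming the inverse explicitly, the sketch solves the equivalent normal equations \(X^TX\hat{\beta} = X^Ty\) by Gaussian elimination.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (perfect collinearity)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Return [beta0_hat, beta1_hat, ...]; each row of `predictors` is one observation."""
    X = [[1.0] + list(row) for row in predictors]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data (made up for illustration): y = 1 + 2*x1 - 3*x2 with no noise
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta_hat = fit_ols(rows, y)
print(beta_hat)
```

When two predictors are perfectly collinear, \(X^TX\) is singular and the sketch raises an error; with strong but imperfect collinearity the system is still solvable, but the solution becomes very sensitive to noise in \(y\).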
The figure below illustrates the variability of the regression coefficient estimators (\(\hat{\beta}_1,\ \hat{\beta}_2\)), i.e., the distribution of the estimated values, or, alternatively, of the estimation errors (\(e_1,\ e_2\)) across different simulations. Each point in the scatter plot corresponds to one simulation and represents the pair of estimates obtained in that simulation. The plots show how far the estimators deviate from the true parameter values, and they highlight the effect of multicollinearity on the regression coefficient estimates, as well as the dependence between them.
The table below presents the correlation and the covariance of the estimation errors for different values of the correlation coefficient \(r\) between predictors.
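A simulation of this kind can be sketched in Python as follows. This is not the code used to produce the results in this report: the true coefficients, the error standard deviation, the sample size `n_obs`, the number of simulations `n_sims`, and the seed are all hypothetical values chosen for illustration, and the estimation errors are taken as \(\hat{\beta}_j - \beta_j\).

```python
import math
import random


def fit_two_predictors(x1, x2, y):
    """OLS slopes (b1, b2) for y = b0 + b1*x1 + b2*x2, via centered sums."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def pearson(u, v):
    """Sample Pearson correlation between two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def simulate_error_correlation(r, n_obs=50, n_sims=500, seed=1):
    """Correlation between the estimation errors e1 and e2 across simulations."""
    rng = random.Random(seed)
    beta0, beta1, beta2, sigma = 1.0, 2.0, 3.0, 1.0  # hypothetical true values
    e1, e2 = [], []
    for _ in range(n_sims):
        z1 = [rng.gauss(0, 1) for _ in range(n_obs)]
        z2 = [rng.gauss(0, 1) for _ in range(n_obs)]
        x1 = z1
        # x2 has population correlation r with x1
        x2 = [r * a + math.sqrt(1 - r * r) * b for a, b in zip(z1, z2)]
        y = [beta0 + beta1 * a + beta2 * b + rng.gauss(0, sigma)
             for a, b in zip(x1, x2)]
        b1, b2 = fit_two_predictors(x1, x2, y)
        e1.append(b1 - beta1)
        e2.append(b2 - beta2)
    return pearson(e1, e2)


for r in (0.0, 0.5, 0.9):
    print(r, round(simulate_error_correlation(r), 3))
```

Under these assumptions, the printed correlations should come out close to \(-r\), up to simulation noise.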
We observe that there is an inverse relationship between the estimated regression coefficients associated with the predictors, as well as between the estimation errors, for positive values of the correlation coefficient between predictors.
| No. | r | cor_errors | cov_errors |
|---|---|---|---|
| 1 | 0 | -0.0567 | -0.00232 |
| 2 | 0.1 | -0.123 | -0.00508 |
| 3 | 0.2 | -0.222 | -0.00951 |
| 4 | 0.3 | -0.321 | -0.0145 |
| 5 | 0.4 | -0.419 | -0.0206 |
| 6 | 0.5 | -0.517 | -0.0286 |
| 7 | 0.6 | -0.615 | -0.0399 |
| 8 | 0.7 | -0.712 | -0.0581 |
| 9 | 0.8 | -0.808 | -0.0937 |
| 10 | 0.95 | -0.952 | -0.409 |
| 11 | 0.98 | -0.981 | -1.04 |
| 12 | 0.99 | -0.99 | -2.09 |
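The pattern in the table, with cor_errors close to \(-r\), agrees with a standard result for OLS. As a sketch, assume fixed, centered predictors with a common sum of squares \(S\) (a symbol introduced here for illustration) and sample correlation \(r\). The covariance matrix of the two slope estimators is then:

\[Cov\begin{pmatrix}\hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix} = \sigma^2\left[S\begin{pmatrix}1 & r\\ r & 1\end{pmatrix}\right]^{-1} = \frac{\sigma^2}{S(1-r^2)}\begin{pmatrix}1 & -r\\ -r & 1\end{pmatrix}\]

Since the estimation errors differ from the estimators only by constants, it follows that:

\[cor(e_1, e_2) = -r, \qquad cov(e_1, e_2) = -\frac{\sigma^2\, r}{S(1-r^2)}\]

The factor \(1/(1-r^2)\) explains why the covariance grows sharply in magnitude as \(r\) approaches 1, while the correlation stays close to \(-r\).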
The figure shows, for different values of the correlation coefficient between predictors, the dependence between the estimation errors.
It can be observed that there is an inverse relationship between the estimation errors of the coefficients.
To determine whether there is a significant relationship between the estimation errors of the regression coefficients, we tested the significance of the correlation coefficient.
Formulation of the hypotheses:
\(H_0\): \(cov(e_1, e_2) = 0\) or \(cor(e_1, e_2) = 0\)
\(H_1\): \(cov(e_1, e_2) \neq 0\) or \(cor(e_1, e_2) \neq 0\)
| No. | r | cor_test | p_value | hypothesis | decision |
|---|---|---|---|---|---|
| 1 | 0 | -0.0567 | 0.0731 | fail to reject H0 | ns |
| 2 | 0.1 | -0.123 | 0.000101 | reject H0 | s |
| 3 | 0.2 | -0.222 | 1.22e-12 | reject H0 | s |
| 4 | 0.3 | -0.321 | 2.1e-25 | reject H0 | s |
| 5 | 0.4 | -0.419 | 6.78e-44 | reject H0 | s |
| 6 | 0.5 | -0.517 | 1.38e-69 | reject H0 | s |
| 7 | 0.6 | -0.615 | 4.41e-105 | reject H0 | s |
| 8 | 0.7 | -0.712 | 2.06e-155 | reject H0 | s |
| 9 | 0.8 | -0.808 | 0 | reject H0 | s |
| 10 | 0.95 | -0.952 | 0 | reject H0 | s |
| 11 | 0.98 | -0.981 | 0 | reject H0 | s |
| 12 | 0.99 | -0.99 | 0 | reject H0 | s |
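The significance test on a correlation coefficient can be sketched in Python as follows. This is an illustrative approximation, not the procedure used to fill the table: it replaces the exact Student t distribution with a normal approximation, which is reasonable only for large samples, and the number of simulations `n = 1000` is a hypothetical value, since the report does not state it.

```python
import math


def correlation_test(r, n):
    """Return (t, p) for H0: cor = 0, with a normal approximation to the t statistic.

    r is the sample correlation and n the number of paired values.
    """
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 3:
        raise ValueError("n must be at least 3")
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    # Two-sided p-value: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


n = 1000  # hypothetical number of simulations
for r in (-0.0567, -0.123, -0.222):
    t, p = correlation_test(r, n)
    print(r, round(t, 3), p)
```

For very strong correlations the p-value falls below the smallest representable floating-point number, which is one plausible reason why a p-value can be displayed as 0.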
Since the data are generated from a Gaussian distribution and the estimators are linear combinations of the dependent variable, the estimation errors are jointly normal, so zero covariance and independence coincide:
\[cov(e_1, e_2) = 0 \Longleftrightarrow e_1 \text{ and } e_2 \text{ are independent}\]
Therefore, we can test the correlation coefficient to assess independence.
We observe that, whenever the predictors are correlated, the correlation between the estimation errors is significantly different from zero; therefore, the estimation errors of the regression coefficients are not independent. The p-value is largest when the predictors are uncorrelated (0.0731 at \(r = 0\), not significant) and decreases rapidly as the degree of multicollinearity increases.
For negative values of the correlation coefficient between predictors, we observe that the relationship between the estimated regression coefficients is direct.
We observe that the obtained results are similar to those obtained for positive values of the correlation coefficient between predictors.
The values of the correlation coefficient between predictors are negative.
| No. | r | cor_errors | cov_errors |
|---|---|---|---|
| 1 | -0.99 | 0.99 | 2.09 |
| 2 | -0.98 | 0.981 | 1.04 |
| 3 | -0.95 | 0.952 | 0.409 |
| 4 | -0.8 | 0.808 | 0.0937 |
| 5 | -0.7 | 0.712 | 0.0581 |
| 6 | -0.6 | 0.615 | 0.0399 |
| 7 | -0.5 | 0.517 | 0.0286 |
| 8 | -0.4 | 0.419 | 0.0206 |
| 9 | -0.3 | 0.321 | 0.0145 |
| 10 | -0.2 | 0.222 | 0.00951 |
| 11 | -0.1 | 0.123 | 0.00508 |
| 12 | 0 | -0.0567 | -0.00232 |
| No. | r | cor_test | p_value | hypothesis | decision |
|---|---|---|---|---|---|
| 1 | -0.99 | 0.99 | 0 | reject H0 | s |
| 2 | -0.98 | 0.981 | 0 | reject H0 | s |
| 3 | -0.95 | 0.952 | 0 | reject H0 | s |
| 4 | -0.8 | 0.808 | 0 | reject H0 | s |
| 5 | -0.7 | 0.712 | 2.06e-155 | reject H0 | s |
| 6 | -0.6 | 0.615 | 4.41e-105 | reject H0 | s |
| 7 | -0.5 | 0.517 | 1.38e-69 | reject H0 | s |
| 8 | -0.4 | 0.419 | 6.78e-44 | reject H0 | s |
| 9 | -0.3 | 0.321 | 2.1e-25 | reject H0 | s |
| 10 | -0.2 | 0.222 | 1.22e-12 | reject H0 | s |
| 11 | -0.1 | 0.123 | 0.000101 | reject H0 | s |
| 12 | 0 | -0.0567 | 0.0731 | fail to reject H0 | ns |
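The sign reversal for negative correlations can be checked against the theoretical value with a short Python sketch. It assumes standardized predictors, so that the cross-product matrix is proportional to \(\begin{pmatrix}1 & r\\ r & 1\end{pmatrix}\); the function name `theoretical_error_correlation` is introduced here for illustration, and the pairs in `table` are copied from the cor_errors column above.

```python
import math


def theoretical_error_correlation(r):
    """Correlation of the two slope estimators implied by the inverse of [[1, r], [r, 1]]."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    det = 1 - r * r
    inv = [[1 / det, -r / det], [-r / det, 1 / det]]
    return inv[0][1] / math.sqrt(inv[0][0] * inv[1][1])


# (r, cor_errors) pairs copied from the table for negative correlations
table = [
    (-0.99, 0.99), (-0.98, 0.981), (-0.95, 0.952), (-0.8, 0.808),
    (-0.7, 0.712), (-0.6, 0.615), (-0.5, 0.517), (-0.4, 0.419),
    (-0.3, 0.321), (-0.2, 0.222), (-0.1, 0.123), (0, -0.0567),
]

# Gap between the simulated value and the theoretical value for each r
gaps = [abs(cor - theoretical_error_correlation(r)) for r, cor in table]
max_gap = max(gaps)
print(round(max_gap, 4))
```

Under this assumption the theoretical correlation is \(-r\), so a negative correlation between predictors implies a positive correlation between the estimation errors, and the tabulated values stay close to it.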
Multicollinearity has a significant impact on the estimation of regression coefficients, leading to increased variability and instability of the estimators.
There is a statistically significant dependence between the estimation errors of the regression coefficients whenever the predictors are correlated, as indicated by a correlation coefficient significantly different from zero.
The relationship between the two estimated coefficients, and between their estimation errors, depends on the sign and magnitude of the correlation between predictors: it is inverse for positive correlations and direct for negative correlations.