Regression model

Analyzed data set: generated.

Loading the necessary packages…

Project Objectives

The objective of the project is to highlight the effect of the degree of correlation between predictors on the estimators of the regression coefficients.

Simple linear regression model

The equation of the regression model is:

\[Y_i = a + b X_i + \varepsilon_i\]

The estimated regression line gives the conditional mean of the dependent variable for a given value of the independent variable:

\[\hat{y} = a + b x\]

The difference between the observed value and the estimated value represents the estimation error:

\[e_i = y_i - \hat{y}_i\]

The OLS method estimates the regression line by minimizing the residual sum of squares (RSS).

Estimation of regression parameters

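In the OLS regression model, the regression parameters are estimated by the method of least squares, which gives the ordinary least squares (OLS) method its name. The model assumes that, for each value of the predictor, the response is normally distributed around the conditional mean \(\mu_i = a + b x_i\) with constant variance \(\sigma^2\):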

\[ Y \mid X = x_i \sim N(\mu_i,\sigma^2) \]

The method determines the values of the regression coefficients that minimize the sum of squared errors:

\[RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \rightarrow \min\]

Minimizing this function yields the estimated values of the regression coefficients:

the regression slope, or parameter \(b\):

\[\hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]

the intercept, or parameter \(a\):

\[\hat{a} = \bar{y} - \hat{b}\bar{x}\]

where \(\bar{y}\) is the mean of the response variable and \(\bar{x}\) is the mean of the predictor values.

Thus, we obtain the estimated values or the estimated regression line:

\[\hat{y}_i = \hat{a} + \hat{b}x_i\]

which approximates the relationship between the dependent variable and the predictor, and for which the sum of squared errors is minimal.
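The closed-form expressions for the slope and the intercept can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `fit_simple_ols` and the toy data are assumptions introduced here, not part of the original analysis.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (a_hat, b_hat) minimizing the residual sum of squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared deviations of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = s_xy / s_xx
    # Intercept: the fitted line passes through the point (x_bar, y_bar)
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat

# Hypothetical toy data lying exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

a_hat, b_hat = fit_simple_ols(x, y)
# Residuals e_i = y_i - y_hat_i and the residual sum of squares
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)
print(a_hat, b_hat, rss)
```

Because the toy data contain no noise, the fitted line recovers the generating line and the RSS is zero up to rounding; with noisy data the RSS would be positive but still minimal among all straight lines.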


Multiple linear regression model

The multiple linear regression model describes the relationship between the dependent variable and a set of predictors.

Regression model equation for observation \(i\):

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i \]

where:

  • \(Y_i\) is the response variable
  • \(X_{ij}\) represents the independent variables
  • \(\beta_j\) are the model parameters
  • \(\varepsilon_i\) is the error term or residual variable.

The errors follow a normal distribution with mean zero and constant variance \(\sigma^2\).

\[ \varepsilon_i \sim N(0,\sigma^2) \]

An estimator is the statistical rule by which the regression coefficients are obtained from a sample of size \(n\); it is a random variable constructed to estimate the value of the model parameters. An estimate is the numerical value the estimator takes in a given simulation. Because many different samples can be drawn from the reference population, the estimates vary from sample to sample, which is why the regression equation is written in terms of estimators:

\[\hat{\beta}\ =\ {(X^TX)}^{-1}X^Ty\]

This formula gives the coefficients of the regression model in each simulation.
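As an illustration of the matrix formula, the sketch below solves the normal equations \((X^TX)\hat{\beta} = X^Ty\) directly, which is algebraically equivalent to applying \((X^TX)^{-1}\). It uses only the Python standard library so that it stays self-contained; in practice a numerical library routine would normally be preferred. The helper names `solve_linear_system` and `ols` and the toy data are assumptions made for this example.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Estimate beta from the normal equations (X^T X) beta = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2
rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (3.0, 1.0), (4.0, 5.0)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

beta_hat = ols(rows, y)
print(beta_hat)
```

Solving the linear system instead of forming the inverse explicitly is a common design choice, since it tends to be more stable when \(X^TX\) is close to singular, which is exactly the situation that strong multicollinearity produces.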

Each point in the scatter plot corresponds to a simulation and represents the pair of estimates (\(\widehat{\beta_1},\ \widehat{\beta_2}\)) obtained in that simulation. The figure below illustrates the variability of the regression coefficient estimators (i.e., the distribution of the estimated values) (\(\widehat{\beta_1},\ \widehat{\beta_2}\)) or, alternatively, the estimation errors (\(e_1,\ e_2\)) across different simulations. It can be observed how much the estimators deviate from the true parameter values. The graphs highlight the effect of multicollinearity on the regression coefficient estimates, as well as the dependence between them.

Computation of the estimation error, the correlation coefficient, and the covariance

The table below presents the correlation coefficient and the covariance of the estimation errors for different values of the correlation coefficient between predictors.

We observe that there is an inverse relationship between the estimated regression coefficients associated with the predictors, as well as between the estimation errors, for positive values of the correlation coefficient between predictors.
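A short derivation, offered as a sketch, suggests why this relationship is inverse. It assumes the standard OLS result \(Var(\hat{\beta}) = \sigma^2 (X^TX)^{-1}\); the symbols \(s_1\) and \(s_2\) (sample standard deviations of the two predictors) are introduced here only for illustration, and \(r\) is taken as the sample correlation between the predictors. For a model with an intercept and two predictors, the covariance matrix of the two slope estimators is

\[ Var\begin{pmatrix}\hat{\beta}_1 \\ \hat{\beta}_2\end{pmatrix} = \frac{\sigma^2}{(n-1)(1-r^2)} \begin{pmatrix} \dfrac{1}{s_1^2} & -\dfrac{r}{s_1 s_2} \\ -\dfrac{r}{s_1 s_2} & \dfrac{1}{s_2^2} \end{pmatrix} \]

so that

\[ cov(\hat{\beta}_1, \hat{\beta}_2) = -\frac{\sigma^2\, r}{(n-1)\, s_1 s_2\, (1-r^2)}, \qquad cor(\hat{\beta}_1, \hat{\beta}_2) = -r \]

Under these assumptions the correlation between the estimators has the opposite sign to \(r\) and a similar magnitude, while the covariance grows without bound as \(|r|\) approaches 1. Since the estimation errors differ from the estimators only by constants, the same holds for the errors, which agrees with the tabulated values.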

| No. | r | cor_errors | cov_errors |
|---|---|---|---|
| 1 | 0 | -0.0567 | -0.00232 |
| 2 | 0.1 | -0.123 | -0.00508 |
| 3 | 0.2 | -0.222 | -0.00951 |
| 4 | 0.3 | -0.321 | -0.0145 |
| 5 | 0.4 | -0.419 | -0.0206 |
| 6 | 0.5 | -0.517 | -0.0286 |
| 7 | 0.6 | -0.615 | -0.0399 |
| 8 | 0.7 | -0.712 | -0.0581 |
| 9 | 0.8 | -0.808 | -0.0937 |
| 10 | 0.95 | -0.952 | -0.409 |
| 11 | 0.98 | -0.981 | -1.04 |
| 12 | 0.99 | -0.99 | -2.09 |

Graphical representation of the estimation error of the regression coefficients

The figure shows, for different values of the correlation coefficient between predictors, the dependence between the estimation errors.

It can be observed that there is an inverse relationship between the estimation errors of the coefficients.

Testing the correlation coefficient between the estimation errors

To determine whether there is a significant relationship between the estimation errors of the regression coefficients, we tested the correlation coefficient.

Formulation of the hypotheses:

\(H_0\): \(cov(e_1, e_2) = 0\) or \(cor(e_1, e_2) = 0\)

\(H_1\): \(cov(e_1, e_2) \neq 0\) or \(cor(e_1, e_2) \neq 0\)

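| No. | r | cor_test | p_value | hypothesis | decision |
|---|---|---|---|---|---|
| 1 | 0 | -0.0567 | 0.0731 | accept H0 | ns |
| 2 | 0.1 | -0.123 | 0.000101 | reject H0 | s |
| 3 | 0.2 | -0.222 | 1.22e-12 | reject H0 | s |
| 4 | 0.3 | -0.321 | 2.1e-25 | reject H0 | s |
| 5 | 0.4 | -0.419 | 6.78e-44 | reject H0 | s |
| 6 | 0.5 | -0.517 | 1.38e-69 | reject H0 | s |
| 7 | 0.6 | -0.615 | 4.41e-105 | reject H0 | s |
| 8 | 0.7 | -0.712 | 2.06e-155 | reject H0 | s |
| 9 | 0.8 | -0.808 | 0 | reject H0 | s |
| 10 | 0.95 | -0.952 | 0 | reject H0 | s |
| 11 | 0.98 | -0.981 | 0 | reject H0 | s |
| 12 | 0.99 | -0.99 | 0 | reject H0 | s |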

Since the data are generated from a Gaussian distribution and the estimators are linear combinations of the dependent variable, the estimation errors are jointly normally distributed. For jointly normal variables, zero covariance and independence coincide: \[cov(e_1, e_2) = 0 \Longleftrightarrow e_1 \text{ and } e_2 \text{ are independent}\] Therefore, we can test the correlation coefficient to assess independence.
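A test of \(H_0\): \(cor(e_1, e_2) = 0\) can be sketched as below. The text does not say which test or which number of simulations was used, so this example swaps in Fisher's z-transform approximation and assumes 1000 simulations and a 5% significance level; the function name `fisher_z_test` is hypothetical. The sample correlations are taken from the table above.

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z_test(r_hat, n):
    """Two-sided p-value for H0: correlation = 0, via Fisher's z-transform."""
    # Under H0, atanh(r_hat) * sqrt(n - 3) is approximately standard normal
    z = atanh(r_hat) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

n_sims = 1000   # assumed number of simulations (not stated in the text)
alpha = 0.05    # assumed significance level

# Sample correlations between estimation errors, copied from the table
for r_hat in (-0.0567, -0.123, -0.222):
    p_value = fisher_z_test(r_hat, n_sims)
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(r_hat, p_value, decision)
```

Because this is a normal approximation, the p-values it produces may differ slightly from those of an exact t-based test, although the decisions should agree for correlations as far from zero as most of those in the table.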

We observe that, for correlated predictors, the estimation errors of the regression coefficients are not independent. For uncorrelated predictors (\(r = 0\)) the test is not significant (p-value 0.0731); for \(r \geq 0.1\) the p-values are very small and decrease further as the degree of multicollinearity increases.

In these cases the correlation is significantly different from zero; therefore, the estimation errors are dependent.

The effect of multicollinearity on the estimates of the regression model coefficients

For negative values of the correlation coefficient between predictors, we observe that the relationship between the estimated regression coefficients is direct.

The relationship between the estimates of the regression model coefficients

We observe that the obtained results are similar to those obtained for positive values of the correlation coefficient between predictors.

Computation of the estimation error, the correlation coefficient, and the covariance

The values of the correlation coefficient between predictors are negative.

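| No. | r | cor_errors | cov_errors |
|---|---|---|---|
| 1 | -0.99 | 0.99 | 2.09 |
| 2 | -0.98 | 0.981 | 1.04 |
| 3 | -0.95 | 0.952 | 0.409 |
| 4 | -0.8 | 0.808 | 0.0937 |
| 5 | -0.7 | 0.712 | 0.0581 |
| 6 | -0.6 | 0.615 | 0.0399 |
| 7 | -0.5 | 0.517 | 0.0286 |
| 8 | -0.4 | 0.419 | 0.0206 |
| 9 | -0.3 | 0.321 | 0.0145 |
| 10 | -0.2 | 0.222 | 0.00951 |
| 11 | -0.1 | 0.123 | 0.00508 |
| 12 | 0 | -0.0567 | -0.00232 |

Graphical representation of the estimation error of the coefficients

Testing the correlation coefficient between the estimation errors

| No. | r | cor_test | p_value | hypothesis | decision |
|---|---|---|---|---|---|
| 1 | -0.99 | 0.99 | 0 | reject H0 | s |
| 2 | -0.98 | 0.981 | 0 | reject H0 | s |
| 3 | -0.95 | 0.952 | 0 | reject H0 | s |
| 4 | -0.8 | 0.808 | 0 | reject H0 | s |
| 5 | -0.7 | 0.712 | 2.06e-155 | reject H0 | s |
| 6 | -0.6 | 0.615 | 4.41e-105 | reject H0 | s |
| 7 | -0.5 | 0.517 | 1.38e-69 | reject H0 | s |
| 8 | -0.4 | 0.419 | 6.78e-44 | reject H0 | s |
| 9 | -0.3 | 0.321 | 2.1e-25 | reject H0 | s |
| 10 | -0.2 | 0.222 | 1.22e-12 | reject H0 | s |
| 11 | -0.1 | 0.123 | 0.000101 | reject H0 | s |
| 12 | 0 | -0.0567 | 0.0731 | accept H0 | ns |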

Conclusions

Multicollinearity has a significant impact on the estimation of regression coefficients, leading to increased variability and instability of the estimators.

For correlated predictors, there is a statistically significant dependence between the estimation errors of the regression coefficients, as indicated by the correlation coefficient being significantly different from zero; for uncorrelated predictors (\(r = 0\)) the test is not significant.

The relationship between the estimated coefficients, and likewise between their estimation errors, depends on the sign and magnitude of the correlation between predictors: it is inverse for positive correlations and direct for negative correlations.
