Collinearity implies the presence of a near-perfect linear relationship among variables. In econometrics, under multiple regression analysis, it is possible for regressors to be highly correlated. Perfect multicollinearity is the case of an exact linear relationship among two or more regressors. In such a case \(X^{'}X\) is singular and does not yield an inverse. This can be stated as \[\sum_{i=1}^k {\lambda_i X_i}=0\] where not all \(\lambda_i\) are zero. The \(X_i\) are then linear combinations of one another, either row- or column-wise, and \(\hat{\beta}={(X^{'}X)}^{-1} X^{'} Y\) does not exist. Thus, in the presence of perfect multicollinearity among any two or more regressors, estimation of the population parameters is impossible. Less-than-perfect multicollinearity still allows estimation of the population parameters, but such estimates are imprecise and very often bear the wrong sign. The estimated standard errors of such parameter estimates are inflated, which leads to insignificant t-values along with high R-square values (Gujarati, Porter, and Gunasekar 2012).
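As a quick illustration with simulated data (a hypothetical example, not the dataset analysed later), the consequence of perfect collinearity can be seen directly in R: \(X^{'}X\) is singular and lm() reports an NA coefficient for the redundant regressor.

set.seed(1) ### simulated example: x3 is an exact linear combination of x1 and x2
x1 <- rnorm(30); x2 <- rnorm(30)
x3 <- 2*x1 + 3*x2
y  <- 1 + x1 + x2 + rnorm(30)
X  <- cbind(1, x1, x2, x3)
det(crossprod(X))            ### essentially zero: X'X is singular
### solve(crossprod(X))      ### would fail: "system is computationally singular"
coef(lm(y ~ x1 + x2 + x3))   ### lm() drops x3 and reports NA for its coefficient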
The inflated standard errors arise because \({VCOV(\hat{\beta})}=\sigma^{2}{(X^{'}X)}^{-1}\). As \(X^{'}X\) approaches singularity, the elements of \({(X^{'}X)}^{-1}\) grow without bound, so \({VCOV(\hat{\beta})}\) tends to infinity even when the variance of the residual random error term is finite. Consequently, the calculated t-values tend toward zero. Most econometric and statistical programs now remove the variables causing (multi)collinearity and then report the regression results. However, the user of a regression tool must be aware of the diagnostic procedure. The next section outlines the diagnostic tools for detecting multicollinearity using the R software for statistical computing. Thereafter the paper outlines some corrective procedures for multicollinearity, followed by the conclusion. This paper uses the state.x77 dataset from the datasets package (United States Department Of Commerce. Bureau Of The Census 1984) in R software (R Core Team 2021). This dataset contains data on 50 U.S. states on eight variables, namely:
Population: population estimate as of July 1, 1975
Income: per capita income (1974)
Illiteracy: illiteracy (1970, percent of population)
Life Exp: life expectancy in years (1969–71)
Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
HS Grad: percent high-school graduates (1970)
Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
Area: land area in square miles
The following package and commands load the data and render it as an HTML table (Xie, Cheng, and Tan 2021):
#### If the DT package is not already installed, install it using
# install.packages("DT")
library(DT) ### loading the DT package from the package library
datatable(datasets::state.x77) ### using the datatable function to render an HTML table
The hypothesis set here is that the murder rate per 100,000 population depends on population, income, illiteracy, life expectancy, percentage of high-school graduates, number of frosty days, and area of the state. The OLS regression is run with the lm command as demonstrated below:
states<-as.data.frame.array(state.x77) ### converting the state.x77 data from array to data frame
colnames(states)<-c("Population", "Income", "Illiteracy", "LifeExp", "Murder", "HSGrad", "Frost","Area") ### assigning new column names to make them compatible with the olsrr package
states.lm<-lm(Murder~.,data = states)
library(sjPlot)
tab_model(states.lm,show.fstat = TRUE)
Dependent variable: Murder

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 122.18 | 86.08 – 158.28 | <0.001 |
| Population | 0.00 | 0.00 – 0.00 | 0.006 |
| Income | -0.00 | -0.00 – 0.00 | 0.782 |
| Illiteracy | 1.37 | -0.31 – 3.05 | 0.106 |
| LifeExp | -1.65 | -2.17 – -1.14 | <0.001 |
| HSGrad | 0.03 | -0.08 – 0.15 | 0.575 |
| Frost | -0.01 | -0.03 – 0.00 | 0.089 |
| Area | 0.00 | -0.00 – 0.00 | 0.124 |
| Observations | 50 | ||
| R2 / R2 adjusted | 0.808 / 0.776 | ||
anova(states.lm)
## Analysis of Variance Table
##
## Response: Murder
## Df Sum Sq Mean Sq F value Pr(>F)
## Population 1 78.854 78.854 25.8674 8.049e-06 ***
## Income 1 63.507 63.507 20.8328 4.322e-05 ***
## Illiteracy 1 236.196 236.196 77.4817 4.380e-11 ***
## LifeExp 1 139.466 139.466 45.7506 3.166e-08 ***
## HSGrad 1 8.066 8.066 2.6460 0.1113
## Frost 1 6.109 6.109 2.0039 0.1643
## Area 1 7.514 7.514 2.4650 0.1239
## Residuals 42 128.033 3.048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above results it is evident that population and life expectancy are the significant factors affecting murder rates in the U.S. states. Before turning to the diagnostic tools, the zero-order correlation matrix is examined.
tab_corr(states)
| | Population | Income | Illiteracy | LifeExp | Murder | HSGrad | Frost | Area |
|---|---|---|---|---|---|---|---|---|
| Population | | 0.208 | 0.108 | -0.068 | 0.344* | -0.098 | -0.332* | 0.023 |
| Income | 0.208 | | -0.437** | 0.340* | -0.230 | 0.620*** | 0.226 | 0.363** |
| Illiteracy | 0.108 | -0.437** | | -0.588*** | 0.703*** | -0.657*** | -0.672*** | 0.077 |
| LifeExp | -0.068 | 0.340* | -0.588*** | | -0.781*** | 0.582*** | 0.262 | -0.107 |
| Murder | 0.344* | -0.230 | 0.703*** | -0.781*** | | -0.488*** | -0.539*** | 0.228 |
| HSGrad | -0.098 | 0.620*** | -0.657*** | 0.582*** | -0.488*** | | 0.367** | 0.334* |
| Frost | -0.332* | 0.226 | -0.672*** | 0.262 | -0.539*** | 0.367** | | 0.059 |
| Area | 0.023 | 0.363** | 0.077 | -0.107 | 0.228 | 0.334* | 0.059 | |

Computed correlation used pearson-method with listwise-deletion.
High pairwise correlations among regressors are a sufficient but not a necessary indicator of multicollinearity: multicollinearity may be present even when the pairwise correlations are low. For this reason the correlation matrix is not reported as one of the diagnostic tools here. From the correlation table it is evident that life expectancy, a significant predictor of murder rates, is highly correlated with illiteracy, which in turn has a high correlation with the dependent variable. Illiteracy also has a moderate degree of correlation with Frost and HSGrad, apart from LifeExp. This suggests that Illiteracy or LifeExp are variables to be examined further in the diagnostics. The diagnostic tools are demonstrated in the next section.
Variance Inflation Factor (VIF): VIFs are the most commonly used tool to identify the regressors responsible for multicollinearity (Gujarati, Porter, and Gunasekar 2012). \[{VIF_j}=\frac{1}{1-R_j^{2}}\] which equals the \(j\)th diagonal element of the inverse of the correlation matrix of the regressors, where \(R_j^{2}\) is the coefficient of multiple determination from the auxiliary regression of regressor \(X_j\) on the remaining regressors. If \(VIF_j \ge 10\) then regressor \(X_j\) is of serious concern, and if \(4 \le VIF_j < 10\) then the \(j\)th regressor needs to be investigated.
Tolerance (TOL): Tolerance is the proportion of a regressor's variance not explained by the other regressors. It is one minus the coefficient of multiple determination from the auxiliary regression used in calculating VIF. \[Tolerance_j=1-R_j^{2}\] The logic here is that if a high degree of collinearity exists among the regressors, then the other regressors should explain a large amount of the variation present in \(X_j\), leading to a high \(R_j^{2}\) and a low value of tolerance.
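As an illustration of the formulas above, the VIF and tolerance for the Illiteracy regressor can be computed directly from its auxiliary regression; the values should match those reported by ols_vif_tol() below.

aux.lm <- lm(Illiteracy ~ Population + Income + LifeExp + HSGrad + Frost + Area, data = states) ### auxiliary regression of Illiteracy on the remaining regressors
R2j <- summary(aux.lm)$r.squared   ### R_j^2 from the auxiliary regression
c(VIF = 1/(1 - R2j), Tolerance = 1 - R2j)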
library(olsrr)
ols_vif_tol(states.lm)
## Variables Tolerance VIF
## 1 Population 0.7447732 1.342691
## 2 Income 0.5026655 1.989395
## 3 Illiteracy 0.2417821 4.135956
## 4 LifeExp 0.5259200 1.901430
## 5 HSGrad 0.2909280 3.437276
## 6 Frost 0.4213252 2.373463
## 7 Area 0.5914972 1.690625
From this table it can be observed that illiteracy is the only regressor with a VIF value above 4, and it also has the lowest tolerance value. Thus, the illiteracy variable should be examined further.
Condition Index (CI): The condition index is computed from the eigenvalues \(\lambda_j\) of the (scaled) \(X^{'}X\) matrix: \[CI_j=\sqrt{\frac{\lambda_{max}}{\lambda_j}}\]
A condition index value greater than 15 indicates the presence of multicollinearity, and a value greater than 30 indicates severe multicollinearity. Associated with the condition index is the variance-decomposition output, which apportions the variance of the intercept and each regressor across the principal components. For each component whose condition index exceeds 15, one should look for variance proportions above 0.7 on at least two regressors. On the 7th component, illiteracy and HSGrad have variance proportions above 0.6. However, some literature suggests a variance-proportion threshold of 0.9. This indicates that these two variables load very highly on this principal component, meaning that they are highly correlated on that dimension.
datatable(round(ols_eigen_cindex(states.lm),2))
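For readers who want to see where the condition indices come from, a minimal sketch following the Belsley–Kuh–Welsch approach (columns of the model matrix, intercept included, scaled to unit length) is given below; the exact scaling convention used by ols_eigen_cindex may differ slightly.

X  <- model.matrix(states.lm)                               ### model matrix including the intercept column
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  ### columns scaled to unit length
lambda <- eigen(crossprod(Xs))$values                       ### eigenvalues of the scaled cross-product matrix
round(sqrt(max(lambda)/lambda), 2)                          ### condition indices; the largest is the condition number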
ols_correlations(states.lm)
## Correlations
## ---------------------------------------------
## Variable Zero Order Partial Part
## ---------------------------------------------
## Population 0.344 0.409 0.196
## Income -0.230 -0.043 -0.019
## Illiteracy 0.703 0.247 0.111
## LifeExp -0.781 -0.706 -0.436
## HSGrad -0.488 0.087 0.038
## Frost -0.539 -0.260 -0.118
## Area 0.228 0.235 0.106
## ---------------------------------------------
From the above results it is evident that only population and life expectancy have a high degree of relationship with murder rates once the linear effects of the other regressors are removed (partial correlation coefficients). The problematic variable is illiteracy: the other regressors explain a large part of its variation, so its partial correlation coefficient is very low. The same is the case with the Income, Frost and HSGrad variables. Part (semi-partial) correlation coefficients also reveal, to some degree, the importance of each variable in the regression. Once again life expectancy and population emerge as the two most important regressors in the regression.
R2 from Auxiliary Regression: If the \(R_j^{2}\) from the auxiliary regression of \(X_j\) on the rest of the regressors is higher than the \(R^{2}\) from the main regression, then according to Klein (1962) the collinearity is harmful.
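Klein's rule can be checked for the Illiteracy regressor with a couple of lines; this is only a sketch of the comparison described above.

aux.r2  <- summary(lm(Illiteracy ~ Population + Income + LifeExp + HSGrad + Frost + Area, data = states))$r.squared ### auxiliary regression R^2
main.r2 <- summary(states.lm)$r.squared   ### R^2 of the main regression
aux.r2 > main.r2                          ### TRUE would indicate harmful collinearity by Klein's rule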
Farrar and Glauber Test: This test was developed by D.E. Farrar and R.R. Glauber (Farrar and Glauber 1967). It comprises three test statistics: a chi-square, an F and a t statistic. Farrar and Glauber developed the chi-square test for detecting the strength of multicollinearity over the whole set of explanatory variables. The test is based on the fact that under perfect multicollinearity the determinant of the zero-order correlation matrix of the regressors is zero, whereas under orthogonality it equals one. The chi-square test statistic is given by
\[\chi^{2}=-\left[n-1-\frac{1}{6}(2K+5)\right]\log_e\Delta \sim \chi^{2}_{\frac{1}{2}K(K-1)}\]
where \(\Delta\) is the determinant of the zero-order correlation matrix of the data matrix, \(n\) is the sample size and \(K\) is the number of explanatory variables. If the observed value of the chi-square test statistic exceeds the critical value of chi-square at the desired level of significance, we reject the assumption of orthogonality and accept the presence of multicollinearity in the model; otherwise we accept that there is no problem of multicollinearity in the model.
The second test in the Farrar–Glauber procedure is an F test (\(W_i\)) for the location of multicollinearity. For this, the multiple correlation coefficients among the explanatory variables are computed and their statistical significance is tested using an F test. The test statistic is given as
\[F^{*}=\frac {{R^{2}_{X_i.X_1,X_2,...,X_{i-1},X_{i+1},...,X_k}}/{(K-1)}}{{(1-R^{2}_{X_i.X_1,X_2,...,X_{i-1},X_{i+1},...,X_k})}/{(n-K)}}\sim F_{(K-1,\,n-K)}\]
The null hypothesis here is \(R^{2}_{X_i.X_1,X_2,...,X_{i-1},X_{i+1},...,X_k}=0\) against the alternative \(R^{2}_{X_i.X_1,X_2,...,X_{i-1},X_{i+1},...,X_k}\neq 0\).
If the observed value of F exceeds the theoretical value of F with the corresponding degrees of freedom at the desired level of significance, we accept that the variable \(X_i\) is multicollinear. On the other hand, if the observed value of F is less than the theoretical value of F, we accept that the variable \(X_i\) is not multicollinear.
Finally, the Farrar–Glauber procedure concludes with a t-test for the pattern of multicollinearity, which aims at detecting the variables that cause multicollinearity. The partial correlation coefficients among the explanatory variables are computed and their statistical significance is tested with a t test. The null hypothesis is that the partial correlation coefficients are equal to zero.
\[t^{*}= \frac {{r_{X_i X_j.X_1,X_2,...,X_{i-1},X_{i+1},...,X_{j-1},X_{j+1},...,X_k}} \sqrt {n-K}}{\sqrt {1-r^{2}_{X_i X_j.X_1,X_2,...,X_{i-1},X_{i+1},...,X_{j-1},X_{j+1},...,X_k}}} \]
The above test statistic follows the t-distribution with \((n-K)\) degrees of freedom. Thus, if the computed value of the t-statistic is greater than the theoretical value of t with \((n-K)\) degrees of freedom at the desired level of significance, we accept that the variables \(X_i\) and \(X_j\) are responsible for the multicollinearity in the model; otherwise the variables are not the cause of multicollinearity, since their partial correlation coefficient is not statistically significant.
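The Farrar–Glauber chi-square statistic can also be computed by hand from the determinant of the regressor correlation matrix; a sketch is given below, and the same quantity is reported by omcdiag() later in the paper.

Xr <- states[, c("Population","Income","Illiteracy","LifeExp","HSGrad","Frost","Area")] ### regressor columns only
n <- nrow(Xr); K <- ncol(Xr)
Delta <- det(cor(Xr))                       ### determinant of the zero-order correlation matrix
chi2 <- -(n - 1 - (2*K + 5)/6) * log(Delta) ### Farrar-Glauber chi-square statistic
c(chi.square = chi2, df = K*(K - 1)/2)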
Sum of Inverse of Eigenvalues: Chatterjee and Hadi (2006) and Carlson, Dillon, and Goldstein (1986) suggested that a sum of the inverse eigenvalues of \({(X^{'}X)}\), or of its related correlation matrix, greater than or equal to five times the number of predictors indicates the presence of multicollinearity.
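A quick sketch of this measure for the present data is shown below; compare the result with the "Sum of Lambda Inverse" value reported by omcdiag() later.

R <- cor(states[, c("Population","Income","Illiteracy","LifeExp","HSGrad","Frost","Area")]) ### regressor correlation matrix
sum(1/eigen(R)$values)   ### the rule flags values of 5 * 7 = 35 or more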
Other Measures: Other measures include Theil's measure (Theil and Collection 1971), the determinant of the normalized correlation matrix of \({(X^{'}X)}\) (Asteriou and Hall 2007), the Red indicator (Kovàcs, Petres, and Tóth 2006), the Leamer index (Greene 2003), the Corrected VIF (CVIF) (Curto and Pinto 2010), Klein's method (Klein 1977), and the IND1 & IND2 measures (Imdad Ullah, Altaf, and Ahmed 2019). Imdad Ullah, Altaf, and Ahmed (2019) provide an excellent summary of all the measures.
library(mctest)
omcdiag(states.lm)
##
## Call:
## omcdiag(mod = states.lm)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.0466 0
## Farrar Chi-Square: 140.5382 1
## Red Indicator: 0.3753 0
## Sum of Lambda Inverse: 16.8708 0
## Theil's Method: -1.1685 0
## Condition Number: 262.9508 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
imcdiag(states.lm)
##
## Call:
## imcdiag(mod = states.lm)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1 IND2
## Population 1.3427 0.7448 2.4559 3.0157 0.8630 -0.3009 0 0.1039 0.4853
## Income 1.9894 0.5027 7.0907 8.7067 0.7090 -0.4458 0 0.0701 0.9457
## Illiteracy 4.1360 0.2418 22.4743 27.5964 0.4917 -0.9269 0 0.0337 1.4418
## LifeExp 1.9014 0.5259 6.4602 7.9326 0.7252 -0.4261 0 0.0734 0.9015
## HSGrad 3.4373 0.2909 17.4671 21.4480 0.5394 -0.7703 0 0.0406 1.3484
## Frost 2.3735 0.4213 9.8432 12.0865 0.6491 -0.5319 0 0.0588 1.1004
## Area 1.6906 0.5915 4.9495 6.0775 0.7691 -0.3789 0 0.0825 0.7768
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Income , Illiteracy , HSGrad , Frost , Area , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.8083
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
Corrective steps to ameliorate the problem of multicollinearity can be any of the following:
Do nothing: If the extent of multicollinearity is not severe then one can ignore it safely, i.e. tolerate it.
Remove the independent variable causing the problem: Based on the various tools and measures discussed above, once one or more regressors are indicated as the cause of multicollinearity, one can consider removing them from the regression equation. In the case of the regression states.lm considered above, illiteracy was the one regressor found with a moderately high VIF. The following results show the regression without it.
states.lm1<-lm(Murder~Population+Income+LifeExp+HSGrad+Frost+Area,data = states)
tab_model(states.lm1)
Dependent variable: Murder

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 134.35 | 100.84 – 167.86 | <0.001 |
| Population | 0.00 | 0.00 – 0.00 | 0.013 |
| Income | -0.00 | -0.00 – 0.00 | 0.653 |
| LifeExp | -1.75 | -2.27 – -1.24 | <0.001 |
| HSGrad | -0.01 | -0.12 – 0.09 | 0.812 |
| Frost | -0.02 | -0.03 – -0.01 | 0.001 |
| Area | 0.00 | 0.00 – 0.00 | 0.020 |
| Observations | 50 | ||
| R2 / R2 adjusted | 0.796 / 0.767 | ||
ols_vif_tol(states.lm1)
## Variables Tolerance VIF
## 1 Population 0.7701675 1.298419
## 2 Income 0.5088160 1.965347
## 3 LifeExp 0.5561071 1.798215
## 4 HSGrad 0.3746059 2.669472
## 5 Frost 0.7589571 1.317598
## 6 Area 0.7130561 1.402414
The results show that after removing the illiteracy variable, no remaining variable has a VIF greater than 4. Earlier only two regressors, life expectancy and population, were significant, but now Frost and Area are significant too. Thus, removing illiteracy improved the regression output.
Transform the variables: In the states.lm model considered above, income, murder rate and percentage of high-school graduates are all measured per unit of population. It is therefore possible to replace population and area by population density.
states$population_density<-states$Population/states$Area ### creating a population density regressor
states.lm2<-lm(Murder~population_density+Income+Illiteracy+LifeExp+HSGrad+Frost,data = states)
ols_vif_tol(states.lm2)
## Variables Tolerance VIF
## 1 population_density 0.7115184 1.405445
## 2 Income 0.4524071 2.210398
## 3 Illiteracy 0.3000624 3.332640
## 4 LifeExp 0.5340291 1.872557
## 5 HSGrad 0.3263090 3.064580
## 6 Frost 0.5153251 1.940523
Now it can be observed that the coefficient of the regressor income, which hitherto was not significant, is now significant, and none of the variables involved has a VIF greater than 4.
Add more observations: Multicollinearity is more often observed in small cross-section data sets and in time series data. In a small data set it is possible that, due to non-random sampling or sampling from a subset of the population, various regressors are highly correlated, thus causing multicollinearity. Adding more observations sampled from a different subset of the population would reduce the problem. In the case of time series, augmenting the data is not possible, but one can remove the lags of the dependent and independent variables that are causing multicollinearity.
Principal Component Regression: When multicollinearity is not tolerable and the researcher wants to keep all the regressors because of their importance or theoretical underpinnings, it is possible to extract principal components from the regressors, use the principal components instead of the original regressors, and later recover the coefficients of the original regressors from the principal component regression.
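A minimal sketch of principal component regression using base R is given below; retaining four components is purely illustrative, not a recommendation for this dataset.

X <- scale(states[, c("Population","Income","Illiteracy","LifeExp","HSGrad","Frost","Area")]) ### standardized regressors
pca <- prcomp(X, center = FALSE, scale. = FALSE)   ### X is already standardized
pc.scores <- pca$x[, 1:4]                          ### keep the first four components
pcr.fit <- lm(states$Murder ~ pc.scores)           ### regress Murder on the component scores
beta.std <- pca$rotation[, 1:4] %*% coef(pcr.fit)[-1] ### map component coefficients back to the standardized regressors
round(beta.std, 3)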
This article was prepared using the rmarkdown package (Xie, Dervieux, and Riederer 2020; Xie, Allaire, and Grolemund 2018; Allaire et al. 2021) in R. To analyze data and report results, the DT package (Xie, Cheng, and Tan 2021), stats package (R Core Team 2021), olsrr package (Hebbali 2020a), mctest package (Imdadullah, Aslam, and Altaf 2016; Imdad and Aslam 2020), and sjPlot package (Lüdecke 2021) were used. Mathematics support in Rmarkdown was obtained from Pruim (2016). R code support and contents were derived from Ghosh (2017), Hebbali (2020b) and Thondamallu, Sagar, and Veetil (2018).