library(tidyverse)
library(corrplot)
library(car)  # for vif()
library(caret)  # for preProcess()
library(broom)  # for tidy()

Multicollinearity exists when two or more predictors are at least moderately correlated. It causes unstable coefficient estimates and inflated standard errors for the correlated predictors. The usual interpretation of a coefficient (the change in the mean response for each additional unit increase in the predictor, holding all the other predictors constant) breaks down because changing one predictor necessarily changes the others.

Multicollinearity is not always a problem. It has no effect on the model predictions, so it is primarily a concern for explanatory models, not predictive models. Even in explanatory models, you can safely ignore multicollinearity in some cases. For example, if the correlated variables are control variables, their individual effects are of no concern: a study of college graduation rates at public vs. private schools that controls for both SAT and ACT test scores will have very high multicollinearity between the test scores, but that multicollinearity is expected and irrelevant.

Multicollinearity can occur for structural reasons, as when one variable is a transformation of another variable, or for data reasons, as often occurs in observational studies. A model with multicollinearity may have large correlations between predictors, but pairwise correlations are not a reliable way to detect it because a variable may be correlated with a function of several other variables. A model with multicollinearity may also have a significant F-test despite insignificant individual slope t-tests. The most reliable way to detect multicollinearity is with variance inflation factors (VIF). For each predictor k, define its VIF as

\[VIF_k = \frac{1}{1 - R_k^2}\]

where \(R_k^2\) is the \(R^2\) of a linear model regressing predictor k on all the other predictors. It’s called the variance inflation factor because it estimates how much the variance of a coefficient is “inflated” by linear dependence with the other predictors. A VIF of 1.8 means the variance is 80% larger than it would be if that predictor were uncorrelated with the others. VIF ranges from 1 (no inflation, \(R_k^2 = 0\)) upward without bound; e.g., \(R_k^2 = 0.75\) gives \(VIF_k = 1 / (1 - 0.75) = 4\). A common rule of thumb is that a VIF >= 4 warrants investigation, and a VIF >= 10 requires correction.
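Here is a minimal sketch with simulated data (none of it comes from the examples below) showing both effects at once: a near-duplicate predictor inflates the coefficient standard errors, and the VIF formula quantifies the inflation.

# Simulated illustration: x2 is nearly a copy of x1.
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 2 * x1 + 3 * x2 + rnorm(n)
summary(lm(y ~ x1 + x2))$coefficients  # note the large std. errors on x1, x2
r2 <- summary(lm(x1 ~ x2))$r.squared   # R^2 of x1 regressed on x2
1 / (1 - r2)                           # VIF for x1; roughly 100 here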

Example with Data-based Multicollinearity

Dataset bloodpress (see PSU STAT 501) records blood pressure and other potentially related measures for n = 20 patients.

bloodpress <- read_tsv(file = "./Data/Bloodpress.txt")
## Parsed with column specification:
## cols(
##   Pt = col_double(),
##   BP = col_double(),
##   Age = col_double(),
##   Weight = col_double(),
##   BSA = col_double(),
##   Dur = col_double(),
##   Pulse = col_double(),
##   Stress = col_double()
## )
head(bloodpress)
## # A tibble: 6 x 8
##      Pt    BP   Age Weight   BSA   Dur Pulse Stress
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1     1   105    47   85.4  1.75   5.1    63     33
## 2     2   115    49   94.2  2.1    3.8    70     14
## 3     3   116    49   95.3  1.98   8.2    72     10
## 4     4   117    50   94.7  2.01   5.8    73     99
## 5     5   112    51   89.4  1.89   7      72     95
## 6     6   121    48   99.5  2.25   9.3    71     10

A correlation matrix shows the marginal relationships between the response variable BP and the predictors. Blood pressure is strongly related to Weight (r = 0.95) and BSA (r = 0.87), moderately related to Age (r = 0.66) and Pulse (r = 0.72), and weakly related to Dur (r = 0.29) and Stress (r = 0.16). The correlation matrix also shows the marginal relationships among the predictors. Weight is strongly related to BSA (r = 0.88), and Pulse is moderately related to Weight (r = 0.66), Age (r = 0.62) and Stress (r = 0.51). The high correlation among some predictors is an indicator of data-based multicollinearity.

# plot(bloodpress[, -1])  # base-graphics alternative: scatterplot matrix
corrplot(cor(bloodpress[, -1]), method = "number", type = "upper", diag = FALSE)

In the full linear regression, the overall F-test is highly significant, yet the coefficient estimates for Dur, Pulse, and Stress are not significant at the .05 level, the pattern described above.

summary(bloodpress.lm <- lm(BP ~ . - Pt, data = bloodpress))
## 
## Call:
## lm(formula = BP ~ . - Pt, data = bloodpress)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93213 -0.11314  0.03064  0.21834  0.48454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.870476   2.556650  -5.034 0.000229 ***
## Age           0.703259   0.049606  14.177 2.76e-09 ***
## Weight        0.969920   0.063108  15.369 1.02e-09 ***
## BSA           3.776491   1.580151   2.390 0.032694 *  
## Dur           0.068383   0.048441   1.412 0.181534    
## Pulse        -0.084485   0.051609  -1.637 0.125594    
## Stress        0.005572   0.003412   1.633 0.126491    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4072 on 13 degrees of freedom
## Multiple R-squared:  0.9962, Adjusted R-squared:  0.9944 
## F-statistic: 560.6 on 6 and 13 DF,  p-value: 6.395e-15

The VIFs are fairly large. Weight, BSA, and Pulse are inflated by factors >= 4.

round(vif(bloodpress.lm), 2)
##    Age Weight    BSA    Dur  Pulse Stress 
##   1.76   8.42   5.33   1.24   4.41   1.83

Incidentally, here is a manual calculation of VIF for Weight.

bloodpress.lm2 <- lm(Weight ~ . - Pt - BP, data = bloodpress)
r.squared <- summary(bloodpress.lm2)$r.squared
round(1 / (1 - r.squared), 2)
## [1] 8.42
rm(bloodpress.lm2, r.squared)

When multicollinearity is due to data issues, as in this case, either collect additional data under different experimental or observational conditions, or remove one or more of the offending predictors from the regression model. When two variables are highly correlated, remove the one that is more difficult or expensive to collect. In this case, I’ll remove BSA because of its high correlation with Weight (r = 0.88), and because weight is easier to measure than body surface area. Removing BSA from the model reduces the VIFs to acceptable levels.

summary(bloodpress.lm2 <- lm(BP ~ . - Pt - BSA, data = bloodpress))
## 
## Call:
## lm(formula = BP ~ . - Pt - BSA, data = bloodpress)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02600 -0.18526 -0.00077  0.21934  0.72533 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.116781   2.748758  -5.499 7.83e-05 ***
## Age           0.731940   0.055646  13.154 2.85e-09 ***
## Weight        1.098958   0.037773  29.093 6.37e-14 ***
## Dur           0.064105   0.055965   1.145   0.2712    
## Pulse        -0.137444   0.053885  -2.551   0.0231 *  
## Stress        0.007429   0.003841   1.934   0.0736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4708 on 14 degrees of freedom
## Multiple R-squared:  0.9945, Adjusted R-squared:  0.9925 
## F-statistic: 502.5 on 5 and 14 DF,  p-value: 2.835e-15
vif(bloodpress.lm2)
##      Age   Weight      Dur    Pulse   Stress 
## 1.659637 2.256150 1.235620 3.599913 1.739641

Example with Structural Multicollinearity

If the multicollinearity occurs because you are using a polynomial regression model, center the predictor variables. Dataset exerimmun (see PSU STAT 501) records immunoglobin in blood (a measure of immunity) and maximal oxygen uptake (a measure of exercise level) for n = 30 subjects.

exerimmun <- read_tsv(file = "./Data/exerimmun.txt")
## Parsed with column specification:
## cols(
##   igg = col_double(),
##   oxygen = col_double()
## )
head(exerimmun)
## # A tibble: 6 x 2
##     igg oxygen
##   <dbl>  <dbl>
## 1   881   34.6
## 2  1290   45  
## 3  2147   62.3
## 4  1909   58.9
## 5  1282   42.5
## 6  1530   44.3

The scatterplot of igg ~ oxygen shows some curvature, so rather than formulating a simple linear regression function, formulate a quadratic polynomial regression function, \(igg_i = \beta_0 + \beta_1 oxygen_i + \beta_2 oxygen_i^2 + \epsilon_i\), where the error terms are assumed to be independent, normally distributed, and of equal variance.

ggplot(exerimmun, aes(y = igg, x = oxygen)) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE, aes(color = "linear")) +
  geom_smooth(method = lm, formula = y ~ poly(x, 2), se = FALSE, aes(color = "quadratic")) +
  labs(color = "model")

The formulated regression fits the data well (adjusted R-squared = .9331), but the terms oxygen and oxygen^2 are strongly correlated.

summary(exerimmun.lm <- lm(igg ~ poly(oxygen, degree = 2, raw = TRUE), data = exerimmun))
## 
## Call:
## lm(formula = igg ~ poly(oxygen, degree = 2, raw = TRUE), data = exerimmun)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -185.375  -82.129    1.047   66.007  227.377 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                           -1464.4042   411.4012  -3.560
## poly(oxygen, degree = 2, raw = TRUE)1    88.3071    16.4735   5.361
## poly(oxygen, degree = 2, raw = TRUE)2    -0.5362     0.1582  -3.390
##                                       Pr(>|t|)    
## (Intercept)                            0.00140 ** 
## poly(oxygen, degree = 2, raw = TRUE)1 1.16e-05 ***
## poly(oxygen, degree = 2, raw = TRUE)2  0.00217 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.4 on 27 degrees of freedom
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9331 
## F-statistic: 203.2 on 2 and 27 DF,  p-value: < 2.2e-16
vif(lm(igg ~ oxygen + I(oxygen^2), data = exerimmun))
##      oxygen I(oxygen^2) 
##    99.94261    99.94261

Remove the structural multicollinearity by centering the predictor before squaring it.

summary(exerimmun.lm <- lm(igg ~ poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE), data = exerimmun))
## 
## Call:
## lm(formula = igg ~ poly(scale(oxygen, center = TRUE, scale = FALSE), 
##     degree = 2, raw = TRUE), data = exerimmun)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -185.375  -82.129    1.047   66.007  227.377 
## 
## Coefficients:
##                                                                             Estimate
## (Intercept)                                                                1632.1962
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)1   33.9995
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)2   -0.5362
##                                                                            Std. Error
## (Intercept)                                                                   29.3486
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)1     1.6890
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)2     0.1582
##                                                                            t value
## (Intercept)                                                                  55.61
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)1   20.13
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)2   -3.39
##                                                                            Pr(>|t|)
## (Intercept)                                                                 < 2e-16
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)1  < 2e-16
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)2  0.00217
##                                                                               
## (Intercept)                                                                ***
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)1 ***
## poly(scale(oxygen, center = TRUE, scale = FALSE), degree = 2, raw = TRUE)2 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.4 on 27 degrees of freedom
## Multiple R-squared:  0.9377, Adjusted R-squared:  0.9331 
## F-statistic: 203.2 on 2 and 27 DF,  p-value: < 2.2e-16
df <- exerimmun %>%
  mutate(oxygen = scale(oxygen, center = TRUE, scale = FALSE))
vif(lm(igg ~ oxygen + I(oxygen^2), data = df))
##      oxygen I(oxygen^2) 
##    1.050628    1.050628
rm(df)

The estimated coefficient \(\hat{\beta}_0 = 1632\) means a person whose maximal oxygen uptake equals the sample mean of 50.64 ml/kg is predicted to have 1632 mg of immunoglobin in their blood. The estimated coefficient \(\hat{\beta}_1 = 34.0\) means that when a person’s maximal oxygen uptake is near 50.64 ml/kg, immunoglobin is expected to increase by about 34.0 mg for each additional 1 ml/kg of maximal oxygen uptake.
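As a quick check of that interpretation (a sketch using the fitted coefficients above, where exerimmun.lm is the centered quadratic): with centered oxygen \(c\), the fitted curve is \(\hat{\beta}_0 + \hat{\beta}_1 c + \hat{\beta}_2 c^2\), so moving from the mean (\(c = 0\)) to 1 ml/kg above it changes igg by \(\hat{\beta}_1 + \hat{\beta}_2\), slightly less than \(\hat{\beta}_1\) because of the curvature.

b <- coef(exerimmun.lm)  # intercept, linear, and quadratic terms of the centered fit
unname(b[2] + b[3])      # ~33.5 mg over the first ml/kg above the mean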

Predicting with the Model

When using the model for prediction, take care to account for the formulation. In the previous example, the model was formulated with the predictor oxygen centered and entered as a quadratic. If you perform these transformations prior to fitting the model, then the newdata supplied for prediction must contain the same transformations. Transforming variables right in the regression equation appears to sidestep that bookkeeping. Here is the predicted igg when oxygen is 50.64 ml/kg.

predict(exerimmun.lm, newdata = data.frame(oxygen = 50.64), interval = "prediction")
##        fit      lwr      upr
## 1 1632.196 1405.676 1858.716

There are two catches to including the centering in the regression equation. First, the regression equation and summary output become cluttered. Second, and more subtly, the fit above equals the intercept exactly: scale() nested inside poly() does not retain the training mean, so predict() re-centered the new data by its own mean (a single value centers to zero). The prediction only looks right because 50.64 is nearly the sample mean; the correct value, 1632.31, appears below. You can break the transformation into two explicit steps instead.

# center all values
exerimmun.cntr <- scale(exerimmun, center = TRUE, scale = FALSE)
# create new data frame with original y and centered X
exerimmun.cntr_x <- cbind(exerimmun[, 1], oxygen = as.data.frame(exerimmun.cntr)[, -1])
exerimmun.lm2 <- lm(igg ~ poly(oxygen, degree = 2, raw = TRUE), data = exerimmun.cntr_x)
# Use scale() attributes to center new data for prediction.  Need to include y in scale().
exerimmun.new <- scale(data.frame(igg = 0, oxygen = 50.64), attr(exerimmun.cntr, "scaled:center"), scale = FALSE) %>% data.frame()
predict(exerimmun.lm2, newdata = exerimmun.new, interval = "prediction")
##       fit      lwr     upr
## 1 1632.31 1405.789 1858.83

That was a nightmare (in part because I only wanted to center my predictors and not the response). Try caret’s train() with preProcess(), which stores the training-set centering values and applies them to newdata automatically. Note that preProcess = c("center") centers the model-matrix columns oxygen and oxygen^2 after they are formed, so it does not decorrelate them the way squaring a centered oxygen does, but the predictions are unaffected.

exerimmun.lm3 <- train(igg ~ oxygen + I(oxygen^2), 
                       data = exerimmun, 
                       method = "lm", 
                       preProcess = c("center"))
predict(exerimmun.lm3, newdata = data.frame(oxygen = 50.64))
##       1 
## 1632.31
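
One more option, sketched here as an alternative not used above: R’s default orthogonal polynomials (poly() with raw = FALSE) are uncorrelated by construction, and poly() stores its basis coefficients, so predict() transforms new data correctly. The trade-off is that the individual coefficients are no longer directly interpretable.

exerimmun.lm4 <- lm(igg ~ poly(oxygen, degree = 2), data = exerimmun)
round(cor(poly(exerimmun$oxygen, degree = 2)), 10)  # off-diagonals are ~0
predict(exerimmun.lm4, newdata = data.frame(oxygen = 50.64),
        interval = "prediction")  # should reproduce the 1632.31 fit above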

References

Allison, Paul. “When Can You Safely Ignore Multicollinearity?”, Statistical Horizons. https://statisticalhorizons.com/multicollinearity.

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer. http://appliedpredictivemodeling.com/.

Lakshmanan, Swetha. “How, When and Why Should You Normalize / Standardize / Rescale Your Data?” Medium. https://medium.com/@swethalakshmanan14/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff.

Statistics 501 | Regression Methods, Lesson 12.6 “Reducing Structural Multicollinearity”, Penn State University. https://newonlinecourses.science.psu.edu/stat501/lesson/12/12.6.