Multicollinearity

Sameer Mathur

Theory and Example

Regression Diagnostics

---

Multicollinearity

Multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated.

Types of Multicollinearity

There are two types of multicollinearity:

  1. Structural Multicollinearity

  2. Data-based Multicollinearity

Structural Multicollinearity

It is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor \( x^2 \) from the predictor \( x \).

Structural multicollinearity is induced by how the model is specified, that is, by what you have done, not by the data themselves.
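
A small sketch can make this concrete. The values of x below are made up for illustration (they are not from the blood pressure data), and centering is shown only as one common way to see that the correlation comes from how the squared term was built.

```r
# hypothetical predictor values, for illustration only
x <- 1:20
# new predictor created from x
xSq <- x^2
# correlation between x and its square
corRaw <- cor(x, xSq)
corRaw
# center x before squaring, then recompute the correlation
xCentered <- x - mean(x)
corCentered <- cor(xCentered, xCentered^2)
corCentered
```

Here \( x \) and \( x^2 \) are very strongly correlated, while the centered version is essentially uncorrelated because these x values are symmetric about their mean.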

Data-based Multicollinearity

It is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.

Data-based multicollinearity is the more troublesome of the two types. Unfortunately, it is also the type we encounter most often.
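
The contrast between a designed experiment and observational data can be sketched with simulated values. Everything below is hypothetical: the grid levels, the seed, and the assumed linear link between the two simulated predictors are illustrative assumptions, not estimates from any real data.

```r
# designed experiment: every level of x1 is paired with every level of x2
designed <- expand.grid(x1 = 1:5, x2 = 1:4)
corDesigned <- cor(designed$x1, designed$x2)
corDesigned

# observational data (simulated): the second predictor tends to move with the first
set.seed(1)
w <- rnorm(200, mean = 90, sd = 4)
s <- 0.03 * w + rnorm(200, mean = 0, sd = 0.05)
corObserved <- cor(w, s)
corObserved
```

In the balanced design the two predictors are uncorrelated by construction. In the simulated observational data they are strongly correlated, because the analyst could not set them independently.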

Example - Blood Pressure data

The data concern 20 individuals with high blood pressure.

Data Variable Description

  1. blood pressure: (y = BP, in mm Hg)
  2. age: (x1 = Age, in years)
  3. weight: (x2 = Weight, in kg)
  4. body surface area: (x3 = BSA, in sq m)
  5. duration of hypertension: (x4 = Dur, in years)
  6. basal pulse: (x5 = Pulse, in beats per minute)
  7. stress index: (x6 = Stress)

Reading Data

# reading data
bp.df <- read.delim("BloodPressureData.txt")
# attaching data columns of the dataframe
attach(bp.df)
# dimension of the dataframe
dim(bp.df)
[1] 20  8

Descriptive Statistics

# descriptive statistics
library(psych)
describe(bp.df)[, c(1:5, 8:9)]
       vars  n   mean    sd median    min    max
Pt        1 20  10.50  5.92  10.50   1.00  20.00
BP        2 20 114.00  5.43 114.00 105.00 125.00
Age       3 20  48.60  2.50  48.50  45.00  56.00
Weight    4 20  93.09  4.29  94.15  85.40 101.30
BSA       5 20   2.00  0.14   1.98   1.75   2.25
Dur       6 20   6.43  2.15   6.00   2.50  10.20
Pulse     7 20  69.60  3.80  70.00  62.00  76.00
Stress    8 20  53.35 37.09  44.50   8.00  99.00

Scatter Plot Matrix

# basic scatterplot matrix
pairs(~ BP + Weight + BSA + Stress, data = bp.df, 
      main = "Scatter Plot Matrix")

[Figure: scatter plot matrix of BP, Weight, BSA, and Stress]

Correlation Matrix

corVar <- bp.df[c("BP", "Weight", "BSA", "Stress")]
# correlation matrix
corMat <- round(cor(corVar), 2)
corMat
         BP Weight  BSA Stress
BP     1.00   0.95 0.87   0.16
Weight 0.95   1.00 0.88   0.03
BSA    0.87   0.88 1.00   0.02
Stress 0.16   0.03 0.02   1.00

Visualizing Correlation Matrix

# visualizing correlation
# (chart.Correlation expects the raw data, not a correlation matrix)
library(PerformanceAnalytics)
chart.Correlation(corVar, histogram = TRUE, pch = 19)

[Figure: correlation chart of BP, Weight, BSA, and Stress]

Highly Correlated Variables

# highly correlated variables
# (corMat includes the response BP, so BP itself can be flagged)
library(caret)
findCorrelation(corMat, cutoff = 0.75, names = TRUE)
[1] "BP"     "Weight"
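
To see which pairs drive this result, the pairs above the cutoff can be listed directly in base R. This is only a sketch and not how findCorrelation() works internally. The matrix is retyped from the rounded correlation output shown earlier, and highPairs is a name introduced here for illustration.

```r
# correlation matrix retyped from the rounded output shown earlier
vars <- c("BP", "Weight", "BSA", "Stress")
corMat <- matrix(c(1.00, 0.95, 0.87, 0.16,
                   0.95, 1.00, 0.88, 0.03,
                   0.87, 0.88, 1.00, 0.02,
                   0.16, 0.03, 0.02, 1.00),
                 nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.75
# use only the upper triangle so that each pair is listed once
idx <- which(abs(corMat) > cutoff & upper.tri(corMat), arr.ind = TRUE)
# highPairs is a hypothetical helper table: one row per flagged pair
highPairs <- data.frame(var1 = rownames(corMat)[idx[, "row"]],
                        var2 = colnames(corMat)[idx[, "col"]],
                        r    = corMat[idx])
highPairs
```

The flagged pairs involve only BP, Weight, and BSA. Since BP is the response, Weight and BSA form the only predictor pair above the cutoff.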