Sameer Mathur
Theory and Example
Regression Diagnostics
---
Multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated.
There are two types of multicollinearity:
Structural Multicollinearity
Data-based Multicollinearity
Structural multicollinearity is a mathematical artifact caused by creating new predictors from existing ones, such as creating the predictor \( x^2 \) from the predictor \( x \).
In this case, the multicollinearity is induced by what you have done.
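The structural case can be illustrated with a minimal sketch in base R. The predictor values below are hypothetical and chosen only for illustration; they are not part of the blood pressure data. Centering before squaring is shown as one common remedy, not as a step taken in this analysis.

```r
# Hypothetical predictor values, chosen only for illustration
x <- 1:20

# The raw predictor and its square are almost perfectly correlated
rawCor <- cor(x, x^2)
rawCor

# Centering x before squaring removes the linear association.
# It vanishes (up to rounding error) here because these x values
# are symmetric about their mean; in general it is only reduced.
xc <- x - mean(x)
centeredCor <- cor(xc, xc^2)
centeredCor
```

Because the correlation comes from how the new predictor was built, it can be changed by rebuilding the predictor, which is not possible with data-based multicollinearity.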
Data-based multicollinearity is the result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
It is the more troublesome of the two types and, unfortunately, the type we encounter most often.
The data concern 20 individuals with high blood pressure. The variables are:
BP (in mm Hg)
Age (in years)
Weight (in kg)
BSA (in sq m)
Dur (in years)
Pulse (in beats per minute)
Stress
# reading data
bp.df <- read.delim("BloodPressureData.txt")
# attaching data columns of the dataframe
attach(bp.df)
# dimension of the dataframe
dim(bp.df)
[1] 20 8
# descriptive statistics
library(psych)
describe(bp.df)[, c(1:5, 8:9)]
       vars  n   mean    sd median    min    max
Pt        1 20  10.50  5.92  10.50   1.00  20.00
BP        2 20 114.00  5.43 114.00 105.00 125.00
Age       3 20  48.60  2.50  48.50  45.00  56.00
Weight    4 20  93.09  4.29  94.15  85.40 101.30
BSA       5 20   2.00  0.14   1.98   1.75   2.25
Dur       6 20   6.43  2.15   6.00   2.50  10.20
Pulse     7 20  69.60  3.80  70.00  62.00  76.00
Stress    8 20  53.35 37.09  44.50   8.00  99.00
# basic scatterplot matrix
pairs(~ BP + Weight + BSA + Stress, data = bp.df,
main = "Scatter Plot Matrix")
corVar <- bp.df[c("BP", "Weight", "BSA", "Stress")]
# correlation matrix
corMat <- round(cor(corVar), 2)
corMat
         BP Weight  BSA Stress
BP     1.00   0.95 0.87   0.16
Weight 0.95   1.00 0.88   0.03
BSA    0.87   0.88 1.00   0.02
Stress 0.16   0.03 0.02   1.00
# visualizing correlation
library(PerformanceAnalytics)
# chart.Correlation computes the correlations itself, so pass the raw variables
chart.Correlation(corVar, histogram = TRUE, pch = 19)
# highly correlated variables
library(caret)
findCorrelation(corMat, cutoff = 0.75, names = TRUE)
[1] "BP" "Weight"
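To see which pairs lie behind the flagged variables, the rounded correlation matrix printed above can be scanned directly. The sketch below uses base R only and re-enters the matrix by hand from that output; the names `vars`, `idx`, and `highPairs` are introduced here for illustration, and the listing is a plain threshold check rather than a reproduction of how `findCorrelation` chooses which variable to report.

```r
# Correlation matrix re-entered from the rounded output shown above
vars <- c("BP", "Weight", "BSA", "Stress")
corMat <- matrix(c(1.00, 0.95, 0.87, 0.16,
                   0.95, 1.00, 0.88, 0.03,
                   0.87, 0.88, 1.00, 0.02,
                   0.16, 0.03, 0.02, 1.00),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(vars, vars))

# Use only the upper triangle so that each pair is listed once,
# and keep pairs whose absolute correlation exceeds the cutoff
cutoff <- 0.75
idx <- which(upper.tri(corMat) & abs(corMat) > cutoff, arr.ind = TRUE)

highPairs <- data.frame(var1 = rownames(corMat)[idx[, "row"]],
                        var2 = colnames(corMat)[idx[, "col"]],
                        r    = corMat[idx])
highPairs
```

With the values above, the pairs exceeding 0.75 are BP and Weight, BP and BSA, and Weight and BSA; Stress is not strongly correlated with any of the other three variables.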