Regression Modelling for Residual Analysis

Regression Model

In this weekly discussion, each data set was individually selected to build a regression model and conduct a residual analysis. The analysis will indicate the data linearity for unbiased observations.

Heart Failure Prediction Dataset

Heart failure due to cardiovascular diseases (CVDs) is one of the top causation of deaths among individuals. Individuals with CVDs exhibits these risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease which can lead to heart attacks, strokes and death.

This Kaggle data set contains a combination of 5 heart data sets with over 11 common features that is available for research purposes.

Load Required Libraries

library(tidyverse)
library(ggplot2)
library(corrplot)
library(car)
library(randomForest)
library(class)
library(gtools)
library(psych)

Load Data - Multiple Regression (two or more independent variables)

PreProcess

Understanding the data set

#retrieve the number of rows and columns in an object
dim(heart)

## [1] 918  12

#view the header and first few rows of the data frame
head(heart)

##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0

Check the variables numeric summary of the data (minimum, median, mean, and maximum) values of the independent variables and dependent variable:

summary(heart)

##       Age            Sex            ChestPainType        RestingBP    
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
##  Mean   :53.51                                         Mean   :132.4  
##  3rd Qu.:60.00                                         3rd Qu.:140.0  
##  Max.   :77.00                                         Max.   :200.0  
##   Cholesterol      FastingBS       RestingECG            MaxHR      
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
##                     Mean   : 0.8874                      Mean   :0.5534  
##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
##                     Max.   : 6.2000                      Max.   :1.0000

Check data frame for NULL and NA values

#check NULL value of vector
is.null(heart)

## [1] FALSE

any(is.na(heart))

## [1] FALSE

Restructure data set

#Change variable to data type "factor"
heart %>% mutate(Sex = factor(Sex) , 
                 ChestPainType = factor(ChestPainType, levels = c("ASY" ,"NAP", "ATA", "TA")),
                 RestingECG = factor(RestingECG, levels = c("Normal" , "ST", "LVH")), 
                 ExerciseAngina = factor(ExerciseAngina) , 
                 ST_Slope = factor(ST_Slope, levels = c("Down", "Flat", "Up")),
                 HeartDisease = factor(HeartDisease, labels = c("NO", "YES") ), 
                 FastingBS = factor(FastingBS , labels = c("NO","High") )) -> heart_factor

heart_factor %>% mutate_if(is.numeric, scale) %>% filter_if(is.numeric, any_vars(. > -3.29 | . < 3.29 )) -> heart_removed_outlier

Check the data type conversion: character changed to factor for statistical operations

#view the header and first few rows of the data frame
head(heart_factor, 6)

##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289        NO     Normal   172
## 2  49   F           NAP       160         180        NO     Normal   156
## 3  37   M           ATA       130         283        NO         ST    98
## 4  48   F           ASY       138         214        NO     Normal   108
## 5  54   M           NAP       150         195        NO     Normal   122
## 6  39   M           NAP       120         339        NO     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up           NO
## 2              N     1.0     Flat          YES
## 3              N     0.0       Up           NO
## 4              Y     1.5     Flat          YES
## 5              N     0.0       Up           NO
## 6              N     0.0       Up           NO

Provide a numeric summary of the data for the independent variables and the dependent variable:

summary(heart_factor)

##       Age        Sex     ChestPainType   RestingBP      Cholesterol   
##  Min.   :28.00   F:193   ASY:496       Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:47.00   M:725   NAP:203       1st Qu.:120.0   1st Qu.:173.2  
##  Median :54.00           ATA:173       Median :130.0   Median :223.0  
##  Mean   :53.51           TA : 46       Mean   :132.4   Mean   :198.8  
##  3rd Qu.:60.00                         3rd Qu.:140.0   3rd Qu.:267.0  
##  Max.   :77.00                         Max.   :200.0   Max.   :603.0  
##  FastingBS   RestingECG      MaxHR       ExerciseAngina    Oldpeak       
##  NO  :704   Normal:552   Min.   : 60.0   N:547          Min.   :-2.6000  
##  High:214   ST    :178   1st Qu.:120.0   Y:371          1st Qu.: 0.0000  
##             LVH   :188   Median :138.0                  Median : 0.6000  
##                          Mean   :136.8                  Mean   : 0.8874  
##                          3rd Qu.:156.0                  3rd Qu.: 1.5000  
##                          Max.   :202.0                  Max.   : 6.2000  
##  ST_Slope   HeartDisease
##  Down: 63   NO :410     
##  Flat:460   YES:508     
##  Up  :395               
##                         
##                         
##

Checking Assumptions for Linear Regression

The selected dependent variable for the data set is: Age

The Age variable will be studied under the assumption that it depends on the other variables to predict heart failure.

#Separate numerical and categorical variables
heart.numeric <- heart_factor[,c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")]
heart.categ <- heart_factor[,!names(heart_factor) %in% colnames(heart.numeric)]
#heart.factor <- heart_factor[,!names(heart_factor) %in% "HeartDisease"]

#Dependent variable
dependent <- heart$Age

Independence of observations (aka no autocorrelation)

heart_removed_outlier %>% mutate_if(is.factor, as.numeric) %>% cor

##                        Age          Sex ChestPainType    RestingBP Cholesterol
## Age             1.00000000  0.055750099   -0.16589586  0.254399356 -0.09528177
## Sex             0.05575010  1.000000000   -0.16825414  0.005132708 -0.20009232
## ChestPainType  -0.16589586 -0.168254135    1.00000000 -0.022168346  0.13613950
## RestingBP       0.25439936  0.005132708   -0.02216835  1.000000000  0.10089294
## Cholesterol    -0.09528177 -0.200092323    0.13613950  0.100892942  1.00000000
## FastingBS       0.19803907  0.120075988   -0.11670254  0.070193336 -0.26097433
## RestingECG      0.21315196 -0.018343366   -0.03138321  0.097661436  0.11209457
## MaxHR          -0.38204468 -0.189185764    0.34365368 -0.112134997  0.23579240
## ExerciseAngina  0.21579269  0.190664102   -0.41662480  0.155101089 -0.03416587
## Oldpeak         0.25861154  0.105733537   -0.24502682  0.164803043  0.05014811
## ST_Slope       -0.26826399 -0.150692544    0.31747954 -0.075162174  0.11147054
## HeartDisease    0.28203851  0.305444916   -0.47135450  0.107588980 -0.23274064
##                  FastingBS  RestingECG       MaxHR ExerciseAngina     Oldpeak
## Age             0.19803907  0.21315196 -0.38204468     0.21579269  0.25861154
## Sex             0.12007599 -0.01834337 -0.18918576     0.19066410  0.10573354
## ChestPainType  -0.11670254 -0.03138321  0.34365368    -0.41662480 -0.24502682
## RestingBP       0.07019334  0.09766144 -0.11213500     0.15510109  0.16480304
## Cholesterol    -0.26097433  0.11209457  0.23579240    -0.03416587  0.05014811
## FastingBS       1.00000000  0.05070670 -0.13143849     0.06045067  0.05269786
## RestingECG      0.05070670  1.00000000  0.04855228     0.03611881  0.11442795
## MaxHR          -0.13143849  0.04855228  1.00000000    -0.37042487 -0.16069055
## ExerciseAngina  0.06045067  0.03611881 -0.37042487     1.00000000  0.40875250
## Oldpeak         0.05269786  0.11442795 -0.16069055     0.40875250  1.00000000
## ST_Slope       -0.17577434 -0.07880669  0.34341944    -0.42870594 -0.50192127
## HeartDisease    0.26729119  0.06101109 -0.40042077     0.49428199  0.40395072
##                   ST_Slope HeartDisease
## Age            -0.26826399   0.28203851
## Sex            -0.15069254   0.30544492
## ChestPainType   0.31747954  -0.47135450
## RestingBP      -0.07516217   0.10758898
## Cholesterol     0.11147054  -0.23274064
## FastingBS      -0.17577434   0.26729119
## RestingECG     -0.07880669   0.06101109
## MaxHR           0.34341944  -0.40042077
## ExerciseAngina -0.42870594   0.49428199
## Oldpeak        -0.50192127   0.40395072
## ST_Slope        1.00000000  -0.55877071
## HeartDisease   -0.55877071   1.00000000

Graphical display of a correlation matrix

#package "corrplot"
heart2 <- heart_removed_outlier %>% mutate_if(is.factor, as.numeric)
corrplot(cor(heart2), type = "upper", order = "hclust",
           tl.col = "black", tl.srt = 45) #Multi correlation plot

Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color show the correlation coefficients and the corresponding colors.

Independence of Observations Outcome: A negative correlation implies that the two variables vary in opposite directions, and if one variable increases the other decreases and vice versa.

Normality

Test whether the dependent variable follows a normal distribution. The histogram shows that the dependent variable,Age, follows a normal distribution.

heart_removed_outlier %>% 
  select_if(is.numeric) %>% 
  multi.hist(bcol=c("gray"), dcol=c("blue", "red", dlty=c("dotted", "solid"))) #Multi histogram

Linearity

A scatter plot shows the linear relationship between the independent and dependent variable and to see if the distribution of data points follow a straight line.

plot(Age ~ RestingBP, data = heart2,
col = rgb(0,100,0,50, maxColorValue = 255), pch=20)

plot(Age ~ Cholesterol, data = heart2,
col = rgb(0,100,0,50, maxColorValue = 255), pch=20)

Perform the Linear Regression Analysis

A linear relationship test was performed between numerical data: cholesterol, heart disease, resting bp, maxHR, oldpeak and age.

HeartAge.lm<-lm(Age ~ Cholesterol + HeartDisease + RestingBP + MaxHR + Oldpeak, data = heart2)

summary(HeartAge.lm)

## 
## Call:
## lm(formula = Age ~ Cholesterol + HeartDisease + RestingBP + MaxHR + 
##     Oldpeak, data = heart2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2764 -0.5891  0.0063  0.5973  2.9572 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.22604    0.11336  -1.994   0.0464 *  
## Cholesterol  -0.03486    0.03101  -1.124   0.2612    
## HeartDisease  0.14552    0.07053   2.063   0.0394 *  
## RestingBP     0.19163    0.02989   6.411 2.32e-10 ***
## MaxHR        -0.29901    0.03237  -9.237  < 2e-16 ***
## Oldpeak       0.15149    0.03249   4.662 3.59e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8822 on 912 degrees of freedom
## Multiple R-squared:  0.2259, Adjusted R-squared:  0.2217 
## F-statistic: 53.23 on 5 and 912 DF,  p-value: < 2.2e-16

**The estimated effect of cholesterol on age is positive, while the estimated effect of resting bp (blood pressure) is negative. This means that as age increases the cholesterol level decreases, and during an age increase the resting bp increases.

Check for Homoscedasticity

The plots provides visual models to see any bias in the residuals. The red line represents the mean of the residuals and are mostly horizontal and centered around zero.

par(mfrow=c(2,2))
plot(HeartAge.lm)

par(mfrow=c(1,1))

Visualizaton Variation - Linear Regression Function (lm)

The plot shows graphically size of the residual value using a color code (red is longer line to green - smaller line) and size of point. The size of residual is the length of the vertical line from the point to where it meets the regression line. Source:Residual Analysis in Linear Regression

d <- heart2
fit <- lm(Age ~ Sex, data = d) # fit the model
d$predicted <- predict(fit)   # Save the predicted values
d$residuals <- residuals(fit) # Save the residual values
ggplot(d, aes(x = Cholesterol, y = Age)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +     # regression line  
  geom_segment(aes(xend = Age, yend = predicted), alpha = .2) +      # draw line from point to line
  geom_point(aes(color = abs(residuals), size = abs(residuals))) +  # size of the points
  scale_color_continuous(low = "green", high = "red") +             # colour of the points mapped to residual size - green smaller, red larger
  guides(color = FALSE, size = FALSE) +                             # Size legend removed
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## `geom_smooth()` using formula 'y ~ x'

summary(fit)

## 
## Call:
## lm(formula = Age ~ Sex, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7333 -0.7190  0.1291  0.6592  2.4922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.24474    0.14853  -1.648   0.0997 .
## Sex          0.13674    0.08092   1.690   0.0914 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.999 on 916 degrees of freedom
## Multiple R-squared:  0.003108,   Adjusted R-squared:  0.00202 
## F-statistic: 2.856 on 1 and 916 DF,  p-value: 0.09138

Results

Was the linear model appropriate? Why or Why not?

The linear model was not appropriate base on R2 (multiple R-Square) value of 0.2259. The model R2 value is not close to 1 indicating little variation was captured and large amount of variance will not be explained. Therefore, the heart failure data set of 918 observations and 17 variables, is not a good fit.

Summary Observation The F-stats (F = 53.23) is greater than 1, indicating there is a relationship between predictor and response variable. The variable, Cholesterol has no p-value significance and can be removal from the model. There are significant relationships between these frequencies (heart disease, resting bp (blood pressure), and oldpeak) and the frequency of age (p<0 and p<0.01, respectively).

References:

Heart Failure Prediction Dataset

Linear Regression in R | An Easy Step-by-Step Guide

STHDA | Statistical tools for high=throughput data analysis

Regression Modelling for Residual Analysis

Coffy Andrews-Guo

06 April, 2022