In this weekly discussion, each data set was individually selected to build a regression model and conduct a residual analysis. The analysis will indicate the data linearity for unbiased observations.
Heart failure due to cardiovascular diseases (CVDs) is one of the top causation of deaths among individuals. Individuals with CVDs exhibits these risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease which can lead to heart attacks, strokes and death.
This Kaggle data set contains a combination of 5 heart data sets with over 11 common features that is available for research purposes.
library(tidyverse)
library(ggplot2)
library(corrplot)
library(car)
library(randomForest)
library(class)
library(gtools)
library(psych)
#retrieve the number of rows and columns in an object
dim(heart)
## [1] 918 12
#view the header and first few rows of the data frame
head(heart)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0
summary(heart)
## Age Sex ChestPainType RestingBP
## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000
#check NULL value of vector
is.null(heart)
## [1] FALSE
any(is.na(heart))
## [1] FALSE
#Change variable to data type "factor"
heart %>% mutate(Sex = factor(Sex) ,
ChestPainType = factor(ChestPainType, levels = c("ASY" ,"NAP", "ATA", "TA")),
RestingECG = factor(RestingECG, levels = c("Normal" , "ST", "LVH")),
ExerciseAngina = factor(ExerciseAngina) ,
ST_Slope = factor(ST_Slope, levels = c("Down", "Flat", "Up")),
HeartDisease = factor(HeartDisease, labels = c("NO", "YES") ),
FastingBS = factor(FastingBS , labels = c("NO","High") )) -> heart_factor
heart_factor %>% mutate_if(is.numeric, scale) %>% filter_if(is.numeric, any_vars(. > -3.29 | . < 3.29 )) -> heart_removed_outlier
Check the data type conversion: character changed to factor for statistical operations
#view the header and first few rows of the data frame
head(heart_factor, 6)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 NO Normal 172
## 2 49 F NAP 160 180 NO Normal 156
## 3 37 M ATA 130 283 NO ST 98
## 4 48 F ASY 138 214 NO Normal 108
## 5 54 M NAP 150 195 NO Normal 122
## 6 39 M NAP 120 339 NO Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up NO
## 2 N 1.0 Flat YES
## 3 N 0.0 Up NO
## 4 Y 1.5 Flat YES
## 5 N 0.0 Up NO
## 6 N 0.0 Up NO
Provide a numeric summary of the data for the independent variables and the dependent variable:
summary(heart_factor)
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 F:193 ASY:496 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 M:725 NAP:203 1st Qu.:120.0 1st Qu.:173.2
## Median :54.00 ATA:173 Median :130.0 Median :223.0
## Mean :53.51 TA : 46 Mean :132.4 Mean :198.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## NO :704 Normal:552 Min. : 60.0 N:547 Min. :-2.6000
## High:214 ST :178 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## LVH :188 Median :138.0 Median : 0.6000
## Mean :136.8 Mean : 0.8874
## 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Down: 63 NO :410
## Flat:460 YES:508
## Up :395
##
##
##
The selected dependent variable for the data set is: Age
The Age variable will be studied under the assumption that it depends on the other variables to predict heart failure.
#Separate numerical and categorical variables
heart.numeric <- heart_factor[,c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")]
heart.categ <- heart_factor[,!names(heart_factor) %in% colnames(heart.numeric)]
#heart.factor <- heart_factor[,!names(heart_factor) %in% "HeartDisease"]
#Dependent variable
dependent <- heart$Age
heart_removed_outlier %>% mutate_if(is.factor, as.numeric) %>% cor
## Age Sex ChestPainType RestingBP Cholesterol
## Age 1.00000000 0.055750099 -0.16589586 0.254399356 -0.09528177
## Sex 0.05575010 1.000000000 -0.16825414 0.005132708 -0.20009232
## ChestPainType -0.16589586 -0.168254135 1.00000000 -0.022168346 0.13613950
## RestingBP 0.25439936 0.005132708 -0.02216835 1.000000000 0.10089294
## Cholesterol -0.09528177 -0.200092323 0.13613950 0.100892942 1.00000000
## FastingBS 0.19803907 0.120075988 -0.11670254 0.070193336 -0.26097433
## RestingECG 0.21315196 -0.018343366 -0.03138321 0.097661436 0.11209457
## MaxHR -0.38204468 -0.189185764 0.34365368 -0.112134997 0.23579240
## ExerciseAngina 0.21579269 0.190664102 -0.41662480 0.155101089 -0.03416587
## Oldpeak 0.25861154 0.105733537 -0.24502682 0.164803043 0.05014811
## ST_Slope -0.26826399 -0.150692544 0.31747954 -0.075162174 0.11147054
## HeartDisease 0.28203851 0.305444916 -0.47135450 0.107588980 -0.23274064
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## Age 0.19803907 0.21315196 -0.38204468 0.21579269 0.25861154
## Sex 0.12007599 -0.01834337 -0.18918576 0.19066410 0.10573354
## ChestPainType -0.11670254 -0.03138321 0.34365368 -0.41662480 -0.24502682
## RestingBP 0.07019334 0.09766144 -0.11213500 0.15510109 0.16480304
## Cholesterol -0.26097433 0.11209457 0.23579240 -0.03416587 0.05014811
## FastingBS 1.00000000 0.05070670 -0.13143849 0.06045067 0.05269786
## RestingECG 0.05070670 1.00000000 0.04855228 0.03611881 0.11442795
## MaxHR -0.13143849 0.04855228 1.00000000 -0.37042487 -0.16069055
## ExerciseAngina 0.06045067 0.03611881 -0.37042487 1.00000000 0.40875250
## Oldpeak 0.05269786 0.11442795 -0.16069055 0.40875250 1.00000000
## ST_Slope -0.17577434 -0.07880669 0.34341944 -0.42870594 -0.50192127
## HeartDisease 0.26729119 0.06101109 -0.40042077 0.49428199 0.40395072
## ST_Slope HeartDisease
## Age -0.26826399 0.28203851
## Sex -0.15069254 0.30544492
## ChestPainType 0.31747954 -0.47135450
## RestingBP -0.07516217 0.10758898
## Cholesterol 0.11147054 -0.23274064
## FastingBS -0.17577434 0.26729119
## RestingECG -0.07880669 0.06101109
## MaxHR 0.34341944 -0.40042077
## ExerciseAngina -0.42870594 0.49428199
## Oldpeak -0.50192127 0.40395072
## ST_Slope 1.00000000 -0.55877071
## HeartDisease -0.55877071 1.00000000
Graphical display of a correlation matrix
#package "corrplot"
heart2 <- heart_removed_outlier %>% mutate_if(is.factor, as.numeric)
corrplot(cor(heart2), type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45) #Multi correlation plot
Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color show the correlation coefficients and the corresponding colors.
Independence of Observations Outcome: A negative correlation implies that the two variables vary in opposite directions, and if one variable increases the other decreases and vice versa.
Test whether the dependent variable follows a normal distribution. The histogram shows that the dependent variable,Age, follows a normal distribution.
heart_removed_outlier %>%
select_if(is.numeric) %>%
multi.hist(bcol=c("gray"), dcol=c("blue", "red", dlty=c("dotted", "solid"))) #Multi histogram
A scatter plot shows the linear relationship between the independent and dependent variable and to see if the distribution of data points follow a straight line.
plot(Age ~ RestingBP, data = heart2,
col = rgb(0,100,0,50, maxColorValue = 255), pch=20)
plot(Age ~ Cholesterol, data = heart2,
col = rgb(0,100,0,50, maxColorValue = 255), pch=20)
A linear relationship test was performed between numerical data: cholesterol, heart disease, resting bp, maxHR, oldpeak and age.
HeartAge.lm<-lm(Age ~ Cholesterol + HeartDisease + RestingBP + MaxHR + Oldpeak, data = heart2)
summary(HeartAge.lm)
##
## Call:
## lm(formula = Age ~ Cholesterol + HeartDisease + RestingBP + MaxHR +
## Oldpeak, data = heart2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2764 -0.5891 0.0063 0.5973 2.9572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.22604 0.11336 -1.994 0.0464 *
## Cholesterol -0.03486 0.03101 -1.124 0.2612
## HeartDisease 0.14552 0.07053 2.063 0.0394 *
## RestingBP 0.19163 0.02989 6.411 2.32e-10 ***
## MaxHR -0.29901 0.03237 -9.237 < 2e-16 ***
## Oldpeak 0.15149 0.03249 4.662 3.59e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8822 on 912 degrees of freedom
## Multiple R-squared: 0.2259, Adjusted R-squared: 0.2217
## F-statistic: 53.23 on 5 and 912 DF, p-value: < 2.2e-16
**The estimated effect of cholesterol on age is positive, while the estimated effect of resting bp (blood pressure) is negative. This means that as age increases the cholesterol level decreases, and during an age increase the resting bp increases.
The plots provides visual models to see any bias in the residuals. The red line represents the mean of the residuals and are mostly horizontal and centered around zero.
par(mfrow=c(2,2))
plot(HeartAge.lm)
par(mfrow=c(1,1))
The plot shows graphically size of the residual value using a color code (red is longer line to green - smaller line) and size of point. The size of residual is the length of the vertical line from the point to where it meets the regression line. Source:Residual Analysis in Linear Regression
d <- heart2
fit <- lm(Age ~ Sex, data = d) # fit the model
d$predicted <- predict(fit) # Save the predicted values
d$residuals <- residuals(fit) # Save the residual values
ggplot(d, aes(x = Cholesterol, y = Age)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # regression line
geom_segment(aes(xend = Age, yend = predicted), alpha = .2) + # draw line from point to line
geom_point(aes(color = abs(residuals), size = abs(residuals))) + # size of the points
scale_color_continuous(low = "green", high = "red") + # colour of the points mapped to residual size - green smaller, red larger
guides(color = FALSE, size = FALSE) + # Size legend removed
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## `geom_smooth()` using formula 'y ~ x'
summary(fit)
##
## Call:
## lm(formula = Age ~ Sex, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7333 -0.7190 0.1291 0.6592 2.4922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.24474 0.14853 -1.648 0.0997 .
## Sex 0.13674 0.08092 1.690 0.0914 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.999 on 916 degrees of freedom
## Multiple R-squared: 0.003108, Adjusted R-squared: 0.00202
## F-statistic: 2.856 on 1 and 916 DF, p-value: 0.09138
Was the linear model appropriate? Why or Why not?
The linear model was not appropriate base on R2 (multiple R-Square) value of 0.2259. The model R2 value is not close to 1 indicating little variation was captured and large amount of variance will not be explained. Therefore, the heart failure data set of 918 observations and 17 variables, is not a good fit.
Summary Observation The F-stats (F = 53.23) is greater than 1, indicating there is a relationship between predictor and response variable. The variable, Cholesterol has no p-value significance and can be removal from the model. There are significant relationships between these frequencies (heart disease, resting bp (blood pressure), and oldpeak) and the frequency of age (p<0 and p<0.01, respectively).