Cleaning the data:
data <- read.table("C:/Users/Alisa/Downloads/archive/heart.csv", header=TRUE, sep=",", dec=".")
mydata <- data[ , -c(6,7,9,10,11)]
colnames(mydata) <- c("Age", "Gender", "ChestPain", "BloodPressure", "Cholesterol", "MaxHeartRate", "HeartDisease")
mydata$GenderFactor <- factor(mydata$Gender,
levels = c("M","F"),
labels = c("Male", "Female"))
mydata$HeartDiseaseFactor <- factor(mydata$HeartDisease,
levels = c("0","1"),
labels = c("No", "Yes"))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
mydata1 <- sample_n(mydata, 300) #made a sample out of the whole dataframe
head(mydata1,10)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor
## 1 54 M ASY 110 239 126 1 Male Yes
## 2 60 F TA 150 240 171 0 Female No
## 3 54 F ATA 120 230 140 0 Female No
## 4 67 M ASY 120 0 150 1 Male Yes
## 5 53 M ASY 126 0 106 1 Male Yes
## 6 51 M ASY 110 0 92 1 Male Yes
## 7 47 F NAP 130 235 145 0 Female No
## 8 58 M ATA 130 251 110 0 Male No
## 9 55 M ASY 115 0 155 1 Male Yes
## 10 57 M ASY 122 264 100 1 Male Yes
Description:
Unit of observation: Patients with cardiovascular disease Sample size: 918 observations
Variables: Age: age of the patient [years]
Gender: gender of the patient - M = Male - F = Female
ChestPain: chest pain type - TA = Typical Angina - ATA = Atypical Angina - NAP = Non-Anginal Pain - ASY = Asymptomatic
BloodPressure: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mm/dl]
MaxHeartRate: maximum heart rate achieved [Numeric value between 60 and 202]
HeartDisease: output class - 1: Has heart disease - 0: Doesn’t have heart disease
Datasource: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download
How do factors, age of the patient, gender of the patient, resting blood pressure, serum cholesterol, chest pain type and having or not having the heart disease, influence the maximum heart rate achieved?
In the regression model, we will identify what is the relationship between variables in our data:
Dependent variable:MaxHeartRate (maximum heart rate achieved)
Independent variables: Age, Gender, ChestPain (chest pain type), BloodPressure (resting blood pressure), Cholesterol, HeartDisease (1: Has heart disease, 0: Doesn’t have heart disease)
We will define the relationship’s strength and its statistical significance using the regression model, demonstrating how strongly independent factors affect the value of the dependent variable. MaxHeartRate is the dependent variable, and Age, Gender, Chest Pain, Blood Pressure, Cholesterol, and Heart Disease are the independent variables. We chose these variables because, according to theory, there should be a link between them that influences the maximum heart rate.
Graphical presentation of the relationships among variables:
library(car)
scatterplotMatrix(mydata1[ , c(6, 1, 4, 5)],
smooth = FALSE)
From the score from the scatterplot matrix we can see a negative correlation with age and blood pressure and a positive on with cholesterol.
Checking the multicolinearity of variables:
fit <- lm(MaxHeartRate ~ Age + BloodPressure + Cholesterol + GenderFactor + HeartDiseaseFactor + ChestPain,
data=mydata1)
vif(fit)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.242124 1 1.114506
## BloodPressure 1.140342 1 1.067868
## Cholesterol 1.181416 1 1.086929
## GenderFactor 1.166734 1 1.080155
## HeartDiseaseFactor 1.636287 1 1.279175
## ChestPain 1.496910 3 1.069546
There is no high multicolinearity for any of the variables, we see that all numbers are significantly below 5, which is a good indicator. For categorical variable with more than 3 categories we will compare values in last column √5 = 2.24. In our case the variable ChestPain has no problem with multicolinearity since the value is well below 2.24.
mean(vif(fit))
## [1] 1.253444
There is not a strong relationship between the explanatory factors.
Checking for outliers:
mydata1$StdResid <- round(rstandard(fit), 3)
mydata1$CooksD <- round(cooks.distance(fit),3)
hist(mydata1$StdResid,
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
We see that there may be an outlier on the left side, below the value -3.
We will first do the Shapiro test to check if the distribution is normal or not.
H₀: Distribution of the variable is normal
H₁: Distribution of the variable is not normal
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.99254, p-value = 0.1381
The p value is higher than 5%, based on this we cannot reject H₀ at p = 13.8. This means that the distribution is normal.
We analyse Cook’s Distance to see whether we have any units with a big influence.
hist(mydata1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distance")
We see that there might be units with high impact on the right side of the histogram.
Checking the outliers based on numbers:
head(mydata1[order(-mydata1$CooksD),],10)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 81 39 M ATA 190 241 106 0 Male No -2.710
## 193 40 M NAP 106 240 80 0 Male No -3.848
## 144 62 M NAP 160 0 72 1 Male Yes -2.540
## 165 58 F ATA 180 393 110 1 Female Yes -1.765
## 207 61 M TA 142 200 100 1 Male Yes -1.576
## 213 52 M NAP 128 0 180 1 Male Yes 2.506
## 15 66 F ASY 178 228 165 1 Female Yes 1.903
## 252 57 F ATA 130 236 174 1 Female Yes 1.833
## 120 32 M TA 95 0 127 1 Male Yes -0.990
## 189 66 M NAP 110 213 99 0 Male No -1.782
## CooksD
## 81 0.069
## 193 0.052
## 144 0.032
## 165 0.026
## 207 0.023
## 213 0.022
## 15 0.019
## 252 0.017
## 120 0.016
## 189 0.015
head(mydata1[order(-mydata1$StdResid),],10)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 213 52 M NAP 128 0 180 1 Male Yes 2.506
## 115 51 M ASY 140 261 186 0 Male No 2.191
## 4 67 M ASY 120 0 150 1 Male Yes 2.114
## 276 63 M ASY 150 0 154 1 Male Yes 2.060
## 59 51 M ASY 130 0 163 1 Male Yes 2.054
## 96 55 M ATA 110 214 180 0 Male No 1.974
## 15 66 F ASY 178 228 165 1 Female Yes 1.903
## 9 55 M ASY 115 0 155 1 Male Yes 1.868
## 125 63 M ASY 140 0 149 1 Male Yes 1.838
## 215 61 F ASY 130 330 169 1 Female Yes 1.833
## CooksD
## 213 0.022
## 115 0.013
## 4 0.011
## 276 0.010
## 59 0.009
## 96 0.012
## 15 0.019
## 9 0.007
## 125 0.007
## 215 0.011
head(mydata1[order(mydata1$StdResid),],10)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 193 40 M NAP 106 240 80 0 Male No -3.848
## 81 39 M ATA 190 241 106 0 Male No -2.710
## 144 62 M NAP 160 0 72 1 Male Yes -2.540
## 64 60 M ASY 135 0 63 1 Male Yes -2.514
## 176 65 M ASY 145 0 67 1 Male Yes -2.140
## 169 45 M NAP 135 192 110 0 Male No -2.139
## 202 61 M NAP 120 0 80 1 Male Yes -2.057
## 84 52 M ASY 160 331 94 1 Male Yes -1.969
## 156 62 M ASY 115 0 72 1 Male Yes -1.935
## 28 57 M ASY 130 207 96 0 Male No -1.888
## CooksD
## 193 0.052
## 81 0.069
## 144 0.032
## 64 0.012
## 176 0.011
## 169 0.013
## 202 0.015
## 84 0.010
## 156 0.008
## 28 0.009
mydata2 <- mydata1[ -c(81,193,144), ] #removed because of the cooks distance numbers - they had significant differences in comparison to other numbers
head(mydata2[order(-mydata2$CooksD),],10)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 165 58 F ATA 180 393 110 1 Female Yes -1.765
## 207 61 M TA 142 200 100 1 Male Yes -1.576
## 213 52 M NAP 128 0 180 1 Male Yes 2.506
## 15 66 F ASY 178 228 165 1 Female Yes 1.903
## 252 57 F ATA 130 236 174 1 Female Yes 1.833
## 120 32 M TA 95 0 127 1 Male Yes -0.990
## 189 66 M NAP 110 213 99 0 Male No -1.782
## 202 61 M NAP 120 0 80 1 Male Yes -2.057
## 229 51 M TA 125 213 125 0 Male No -1.255
## 166 32 M ASY 118 529 130 1 Male Yes -1.285
## CooksD
## 165 0.026
## 207 0.023
## 213 0.022
## 15 0.019
## 252 0.017
## 120 0.016
## 189 0.015
## 202 0.015
## 229 0.015
## 166 0.014
fit1 <- lm(MaxHeartRate ~ Age + BloodPressure + Cholesterol + GenderFactor + HeartDiseaseFactor + ChestPain,
data=mydata2)
mydata2$StdResid <- round(rstandard(fit1), 3)
mydata2$StdFittedValues <- scale(fit1$fitted.values)
The normality was okay before that is why we check again only the cooks distances:
hist(mydata2$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distance")
We don’t see any gaps so we decided to continue.
Checking if the linearity assumption is fulfilled:
scatterplot(y=mydata2$StdResid, x=mydata2$StdFittedValues,
ylab = "Standardised residuals",
xlab = "Standardised fitted values",
regLine = FALSE,
boxplots = FALSE,
smooth = FALSE)
Based on the scatterplot, where the patterns are random, we can assume that the linearity assumption is met.
We still have to check for heteroskedasticity.
H₀: There is homoskedasticity.
H₁: There is heteroskedasticity.
suppressPackageStartupMessages(library("olsrr"))
olsrr::ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------------
## Response : MaxHeartRate
## Variables: fitted values of MaxHeartRate
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 4.753417
## Prob > Chi2 = 0.02924018
As the p value is smaller than 5% we have to reject H₀ at p = 0.029 and further use lm_robust, using robust standard errors to correct heteroskedasticity.
Test regression coefficients:
H₀: Slope is equal to 0 (variable has no influence)
H₁: Slope is not equal to 0 (variable has an influence)
library(estimatr)
fit2 <- lm_robust(MaxHeartRate ~ Age + BloodPressure + Cholesterol + GenderFactor + HeartDiseaseFactor + ChestPain,
se_type = "HC1",
data = mydata2)
summary(fit2)
##
## Call:
## lm_robust(formula = MaxHeartRate ~ Age + BloodPressure + Cholesterol +
## GenderFactor + HeartDiseaseFactor + ChestPain, data = mydata2,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 166.13081 9.73818 17.060 1.389e-45 146.963787 185.29784 288
## Age -0.92194 0.11970 -7.702 2.164e-13 -1.157537 -0.68635 288
## BloodPressure 0.11013 0.07201 1.529 1.273e-01 -0.031600 0.25186 288
## Cholesterol 0.03179 0.01185 2.681 7.755e-03 0.008454 0.05512 288
## GenderFactorFemale 6.50922 2.75755 2.361 1.892e-02 1.081715 11.93673 288
## HeartDiseaseFactorYes -11.16771 3.01306 -3.706 2.520e-04 -17.098131 -5.23729 288
## ChestPainATA 5.82800 3.65781 1.593 1.122e-01 -1.371438 13.02744 288
## ChestPainNAP 11.07502 3.10145 3.571 4.167e-04 4.970636 17.17941 288
## ChestPainTA 10.56356 4.81895 2.192 2.917e-02 1.078733 20.04838 288
##
## Multiple R-squared: 0.395 , Adjusted R-squared: 0.3782
## F-statistic: 24.78 on 8 and 288 DF, p-value: < 2.2e-16
Age: If the age of a patient increases by one year, then the maximum achieved heart rate decreases by 0.91 beats per minute on average, assuming all other variables remain unchanged (p < 0.001).
Blood Pressure: We cannot say that blood pressure affects maximum achieved heart rate (p = 0.07).
Cholesterol: If the serum cholesterol of a patient increases by one mm/dl, then the maximum achieved heart rate increases by 0.04 beats per minute on average, assuming all other variables remain unchanged (p = 0.003).
Gender: Given the values of all other variables, females have on average 7 beats per minute higher maximum achieved heart rate than males (p = 0.01), assuming all other variables remain unchanged.
HeartDisease: Given the values of all other variables, patients with heart disease have on average 10.8 beats per minute lower maximum achieved heart rate than patients without heart disease. (p < 0.001), assuming all other variables remain unchanged.
Chest pain type:
-ATA: We cannot say that atypical angina chest pain has any effect on maximum achieved heart rate (p = 0.07).
-NAP: Given the values of all other variables, patients with non-anginal chest pain have on average 10.4 beats per minute higher maximum achieved heart rate than patients that are asymptomatic. (p < 0.001), assuming all other variables remain unchanged.
-TA: Given the values of all other variables, patients with typical angina chest pain have on average 10.6 beats per minute higher maximum achieved heart rate than patients that are asymptomatic. (p < 0.001), assuming all other variables remain unchanged.
ChestPain: chest pain type - TA = Typical Angina - ATA = Atypical Angina - NAP = Non-Anginal Pain - ASY = Asymptomatic
Multiple R-squared= 0.4072 –> 40.72 % of variability of the dependent variable maximum heart rate achieved is explained by the variability of the 6 explanatory variables in the model.
F-statistic:
H0: ρ2 (rho squared) = 0
H1: ρ2 > 0
We can reject H0 at p < 0.001 –> There is some relationship between the dependent and at least one explanatory variable.
sqrt(summary(fit2)$r.squared)
## [1] 0.6284874
Linear relationship between maximum heart rate achieved and all 6 explanatory variables is semistrong (0.3 < 0.64 < 0.7).