Cleaning the data:

data <- read.table("C:/Users/Alisa/Downloads/archive/heart.csv", header=TRUE, sep=",", dec=".")

mydata <- data[ , -c(6,7,9,10,11)]
colnames(mydata) <- c("Age", "Gender", "ChestPain", "BloodPressure", "Cholesterol", "MaxHeartRate", "HeartDisease")
mydata$GenderFactor <- factor(mydata$Gender,
                           levels = c("M","F"),
                           labels = c("Male", "Female"))
mydata$HeartDiseaseFactor <- factor(mydata$HeartDisease,
                           levels = c("0","1"),
                           labels = c("No", "Yes"))
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(1)
mydata1 <- sample_n(mydata, 300) #made a sample out of the whole dataframe
head(mydata1,10)
##    Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor
## 1   54      M       ASY           110         239          126            1         Male                Yes
## 2   60      F        TA           150         240          171            0       Female                 No
## 3   54      F       ATA           120         230          140            0       Female                 No
## 4   67      M       ASY           120           0          150            1         Male                Yes
## 5   53      M       ASY           126           0          106            1         Male                Yes
## 6   51      M       ASY           110           0           92            1         Male                Yes
## 7   47      F       NAP           130         235          145            0       Female                 No
## 8   58      M       ATA           130         251          110            0         Male                 No
## 9   55      M       ASY           115           0          155            1         Male                Yes
## 10  57      M       ASY           122         264          100            1         Male                Yes

Description:

Unit of observation: Patients with cardiovascular disease Sample size: 918 observations

Variables: Age: age of the patient [years]

Gender: gender of the patient - M = Male - F = Female

ChestPain: chest pain type - TA = Typical Angina - ATA = Atypical Angina - NAP = Non-Anginal Pain - ASY = Asymptomatic

BloodPressure: resting blood pressure [mm Hg]

Cholesterol: serum cholesterol [mm/dl]

MaxHeartRate: maximum heart rate achieved [Numeric value between 60 and 202]

HeartDisease: output class - 1: Has heart disease - 0: Doesn’t have heart disease

Datasource: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download

Research question

How do factors, age of the patient, gender of the patient, resting blood pressure, serum cholesterol, chest pain type and having or not having the heart disease, influence the maximum heart rate achieved?

Definition of the regression model

In the regression model, we will identify what is the relationship between variables in our data:

Dependent variable:MaxHeartRate (maximum heart rate achieved)

Independent variables: Age, Gender, ChestPain (chest pain type), BloodPressure (resting blood pressure), Cholesterol, HeartDisease (1: Has heart disease, 0: Doesn’t have heart disease)

We will define the relationship’s strength and its statistical significance using the regression model, demonstrating how strongly independent factors affect the value of the dependent variable. MaxHeartRate is the dependent variable, and Age, Gender, Chest Pain, Blood Pressure, Cholesterol, and Heart Disease are the independent variables. We chose these variables because, according to theory, there should be a link between them that influences the maximum heart rate.

Graphical presentation of the relationships among variables:

library(car)
scatterplotMatrix(mydata1[ , c(6, 1, 4, 5)], 
                  smooth = FALSE) 

From the score from the scatterplot matrix we can see a negative correlation with age and blood pressure and a positive on with cholesterol.

Checking the multicolinearity of variables:

fit <- lm(MaxHeartRate ~ Age + BloodPressure + Cholesterol + GenderFactor + HeartDiseaseFactor + ChestPain,
          data=mydata1)
vif(fit)
##                        GVIF Df GVIF^(1/(2*Df))
## Age                1.242124  1        1.114506
## BloodPressure      1.140342  1        1.067868
## Cholesterol        1.181416  1        1.086929
## GenderFactor       1.166734  1        1.080155
## HeartDiseaseFactor 1.636287  1        1.279175
## ChestPain          1.496910  3        1.069546

There is no high multicolinearity for any of the variables, we see that all numbers are significantly below 5, which is a good indicator. For categorical variable with more than 3 categories we will compare values in last column √5 = 2.24. In our case the variable ChestPain has no problem with multicolinearity since the value is well below 2.24.

mean(vif(fit))
## [1] 1.253444

There is not a strong relationship between the explanatory factors.

Checking for outliers:

mydata1$StdResid <- round(rstandard(fit), 3)
mydata1$CooksD <- round(cooks.distance(fit),3)

hist(mydata1$StdResid,
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

We see that there may be an outlier on the left side, below the value -3.

We will first do the Shapiro test to check if the distribution is normal or not.

H₀: Distribution of the variable is normal

H₁: Distribution of the variable is not normal

shapiro.test(mydata1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.99254, p-value = 0.1381

The p value is higher than 5%, based on this we cannot reject H₀ at p = 13.8. This means that the distribution is normal.

We analyse Cook’s Distance to see whether we have any units with a big influence.

hist(mydata1$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distance")

We see that there might be units with high impact on the right side of the histogram.

Checking the outliers based on numbers:

head(mydata1[order(-mydata1$CooksD),],10)
##     Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 81   39      M       ATA           190         241          106            0         Male                 No   -2.710
## 193  40      M       NAP           106         240           80            0         Male                 No   -3.848
## 144  62      M       NAP           160           0           72            1         Male                Yes   -2.540
## 165  58      F       ATA           180         393          110            1       Female                Yes   -1.765
## 207  61      M        TA           142         200          100            1         Male                Yes   -1.576
## 213  52      M       NAP           128           0          180            1         Male                Yes    2.506
## 15   66      F       ASY           178         228          165            1       Female                Yes    1.903
## 252  57      F       ATA           130         236          174            1       Female                Yes    1.833
## 120  32      M        TA            95           0          127            1         Male                Yes   -0.990
## 189  66      M       NAP           110         213           99            0         Male                 No   -1.782
##     CooksD
## 81   0.069
## 193  0.052
## 144  0.032
## 165  0.026
## 207  0.023
## 213  0.022
## 15   0.019
## 252  0.017
## 120  0.016
## 189  0.015
head(mydata1[order(-mydata1$StdResid),],10)
##     Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 213  52      M       NAP           128           0          180            1         Male                Yes    2.506
## 115  51      M       ASY           140         261          186            0         Male                 No    2.191
## 4    67      M       ASY           120           0          150            1         Male                Yes    2.114
## 276  63      M       ASY           150           0          154            1         Male                Yes    2.060
## 59   51      M       ASY           130           0          163            1         Male                Yes    2.054
## 96   55      M       ATA           110         214          180            0         Male                 No    1.974
## 15   66      F       ASY           178         228          165            1       Female                Yes    1.903
## 9    55      M       ASY           115           0          155            1         Male                Yes    1.868
## 125  63      M       ASY           140           0          149            1         Male                Yes    1.838
## 215  61      F       ASY           130         330          169            1       Female                Yes    1.833
##     CooksD
## 213  0.022
## 115  0.013
## 4    0.011
## 276  0.010
## 59   0.009
## 96   0.012
## 15   0.019
## 9    0.007
## 125  0.007
## 215  0.011
head(mydata1[order(mydata1$StdResid),],10)
##     Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 193  40      M       NAP           106         240           80            0         Male                 No   -3.848
## 81   39      M       ATA           190         241          106            0         Male                 No   -2.710
## 144  62      M       NAP           160           0           72            1         Male                Yes   -2.540
## 64   60      M       ASY           135           0           63            1         Male                Yes   -2.514
## 176  65      M       ASY           145           0           67            1         Male                Yes   -2.140
## 169  45      M       NAP           135         192          110            0         Male                 No   -2.139
## 202  61      M       NAP           120           0           80            1         Male                Yes   -2.057
## 84   52      M       ASY           160         331           94            1         Male                Yes   -1.969
## 156  62      M       ASY           115           0           72            1         Male                Yes   -1.935
## 28   57      M       ASY           130         207           96            0         Male                 No   -1.888
##     CooksD
## 193  0.052
## 81   0.069
## 144  0.032
## 64   0.012
## 176  0.011
## 169  0.013
## 202  0.015
## 84   0.010
## 156  0.008
## 28   0.009
mydata2 <- mydata1[ -c(81,193,144), ] #removed because of the cooks distance numbers - they had significant differences in comparison to other numbers
head(mydata2[order(-mydata2$CooksD),],10)
##     Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor StdResid
## 165  58      F       ATA           180         393          110            1       Female                Yes   -1.765
## 207  61      M        TA           142         200          100            1         Male                Yes   -1.576
## 213  52      M       NAP           128           0          180            1         Male                Yes    2.506
## 15   66      F       ASY           178         228          165            1       Female                Yes    1.903
## 252  57      F       ATA           130         236          174            1       Female                Yes    1.833
## 120  32      M        TA            95           0          127            1         Male                Yes   -0.990
## 189  66      M       NAP           110         213           99            0         Male                 No   -1.782
## 202  61      M       NAP           120           0           80            1         Male                Yes   -2.057
## 229  51      M        TA           125         213          125            0         Male                 No   -1.255
## 166  32      M       ASY           118         529          130            1         Male                Yes   -1.285
##     CooksD
## 165  0.026
## 207  0.023
## 213  0.022
## 15   0.019
## 252  0.017
## 120  0.016
## 189  0.015
## 202  0.015
## 229  0.015
## 166  0.014
fit1 <- lm(MaxHeartRate ~ Age + BloodPressure + Cholesterol + GenderFactor + HeartDiseaseFactor + ChestPain,
          data=mydata2)
mydata2$StdResid <- round(rstandard(fit1), 3)
mydata2$StdFittedValues <- scale(fit1$fitted.values)

The normality was okay before that is why we check again only the cooks distances:

hist(mydata2$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distance")

We don’t see any gaps so we decided to continue.

Checking if the linearity assumption is fulfilled:

scatterplot(y=mydata2$StdResid, x=mydata2$StdFittedValues,
            ylab = "Standardised residuals",
            xlab = "Standardised fitted values",
            regLine = FALSE,
            boxplots = FALSE,
            smooth = FALSE)

Based on the scatterplot, where the patterns are random, we can assume that the linearity assumption is met.

We still have to check for heteroskedasticity.

H₀: There is homoskedasticity.

H₁: There is heteroskedasticity.

suppressPackageStartupMessages(library("olsrr"))
olsrr::ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                   
##  ----------------------------------------
##  Response : MaxHeartRate 
##  Variables: fitted values of MaxHeartRate 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    4.753417 
##  Prob > Chi2   =    0.02924018

As the p value is smaller than 5% we have to reject H₀ at p = 0.029 and further use lm_robust, using robust standard errors to correct heteroskedasticity.

Test regression coefficients:

H₀: Slope is equal to 0 (variable has no influence)

H₁: Slope is not equal to 0 (variable has an influence)

library(estimatr)
fit2 <- lm_robust(MaxHeartRate ~ Age + BloodPressure + Cholesterol + GenderFactor + HeartDiseaseFactor + ChestPain,
                  se_type = "HC1",
                  data = mydata2)
summary(fit2)
## 
## Call:
## lm_robust(formula = MaxHeartRate ~ Age + BloodPressure + Cholesterol + 
##     GenderFactor + HeartDiseaseFactor + ChestPain, data = mydata2, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                        Estimate Std. Error t value  Pr(>|t|)   CI Lower  CI Upper  DF
## (Intercept)           166.13081    9.73818  17.060 1.389e-45 146.963787 185.29784 288
## Age                    -0.92194    0.11970  -7.702 2.164e-13  -1.157537  -0.68635 288
## BloodPressure           0.11013    0.07201   1.529 1.273e-01  -0.031600   0.25186 288
## Cholesterol             0.03179    0.01185   2.681 7.755e-03   0.008454   0.05512 288
## GenderFactorFemale      6.50922    2.75755   2.361 1.892e-02   1.081715  11.93673 288
## HeartDiseaseFactorYes -11.16771    3.01306  -3.706 2.520e-04 -17.098131  -5.23729 288
## ChestPainATA            5.82800    3.65781   1.593 1.122e-01  -1.371438  13.02744 288
## ChestPainNAP           11.07502    3.10145   3.571 4.167e-04   4.970636  17.17941 288
## ChestPainTA            10.56356    4.81895   2.192 2.917e-02   1.078733  20.04838 288
## 
## Multiple R-squared:  0.395 , Adjusted R-squared:  0.3782 
## F-statistic: 24.78 on 8 and 288 DF,  p-value: < 2.2e-16

Conclusions based on the analysis

Age: If the age of a patient increases by one year, then the maximum achieved heart rate decreases by 0.91 beats per minute on average, assuming all other variables remain unchanged (p < 0.001).

Blood Pressure: We cannot say that blood pressure affects maximum achieved heart rate (p = 0.07).

Cholesterol: If the serum cholesterol of a patient increases by one mm/dl, then the maximum achieved heart rate increases by 0.04 beats per minute on average, assuming all other variables remain unchanged (p = 0.003).

Gender: Given the values of all other variables, females have on average 7 beats per minute higher maximum achieved heart rate than males (p = 0.01), assuming all other variables remain unchanged.

HeartDisease: Given the values of all other variables, patients with heart disease have on average 10.8 beats per minute lower maximum achieved heart rate than patients without heart disease. (p < 0.001), assuming all other variables remain unchanged.

Chest pain type:

-ATA: We cannot say that atypical angina chest pain has any effect on maximum achieved heart rate (p = 0.07).

-NAP: Given the values of all other variables, patients with non-anginal chest pain have on average 10.4 beats per minute higher maximum achieved heart rate than patients that are asymptomatic. (p < 0.001), assuming all other variables remain unchanged.

-TA: Given the values of all other variables, patients with typical angina chest pain have on average 10.6 beats per minute higher maximum achieved heart rate than patients that are asymptomatic. (p < 0.001), assuming all other variables remain unchanged.

ChestPain: chest pain type - TA = Typical Angina - ATA = Atypical Angina - NAP = Non-Anginal Pain - ASY = Asymptomatic

Multiple R-squared= 0.4072 –> 40.72 % of variability of the dependent variable maximum heart rate achieved is explained by the variability of the 6 explanatory variables in the model.

F-statistic:

H0: ρ2 (rho squared) = 0

H1: ρ2 > 0

We can reject H0 at p < 0.001 –> There is some relationship between the dependent and at least one explanatory variable.

sqrt(summary(fit2)$r.squared)
## [1] 0.6284874

Linear relationship between maximum heart rate achieved and all 6 explanatory variables is semistrong (0.3 < 0.64 < 0.7).