data <- read.table("C:/Users/Alisa/Downloads/archive/heart.csv", header=TRUE, sep=",", dec=".")

mydata <- data[ , -c(6,7,9,10,11)]
colnames(mydata) <- c("Age", "Gender", "ChestPain", "BloodPressure", "Cholesterol", "MaxHeartRate", "HeartDisease")
mydata$GenderFactor <- factor(mydata$Gender,
                           levels = c("M","F"),
                           labels = c("Male", "Female"))
mydata$HeartDiseaseFactor <- factor(mydata$HeartDisease,
                           levels = c("0","1"),
                           labels = c("No", "Yes"))
mydata1 <- drop_na(mydata) #there were none of the values that are not available
head(mydata1)
##   Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor
## 1  40      M       ATA           140         289          172            0         Male                 No
## 2  49      F       NAP           160         180          156            1       Female                Yes
## 3  37      M       ATA           130         283           98            0         Male                 No
## 4  48      F       ASY           138         214          108            1       Female                Yes
## 5  54      M       NAP           150         195          122            0         Male                 No
## 6  39      M       NAP           120         339          170            0         Male                 No

Description:

Unit of observation: A patient Sample size: 918 observations

Variables: Age: age of the patient [years]

Gender: gender of the patient - M = Male - F = Female

ChestPain: chest pain type - TA = Typical Angina - ATA = Atypical Angina - NAP = Non-Anginal Pain - ASY = Asymptomatic

BloodPressure: resting blood pressure [mm Hg]

Cholesterol: serum cholesterol [mm/dl]

MaxHeartRate: maximum heart rate achieved [Numeric value between 60 and 202]

HeartDisease: output class - 1: Has heart disease - 0: Doesn’t have heart disease

Datasource: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download

1. Independent Samples t-test or Wilcox Rank Sum test

Research question:

Is the average maximum heart rate of people who have a heart disease and do not have a heart disease the same?

H₀:μ maximum heart rate of people who have a heart disease - μ maxHeart Rate of people who do not have a heart disease = 0

H₁:μ maximum heart rate of people who have a heart disease - μ maxHeart Rate of people who do not have a heart disease ≠ 0

Checking if the distributions of 2 groups of people are normally distributed:

library(ggplot2)

HeartDisease_no <- ggplot(mydata1[mydata1$HeartDiseaseFactor=="No",  ], aes(x = MaxHeartRate)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 10, color = "white") +
  ggtitle("No Heart disease")

HeartDisease_yes <- ggplot(mydata1[mydata1$HeartDiseaseFactor=="Yes",  ], aes(x = MaxHeartRate)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 10, color = "white") +
  ggtitle("Heart disease")

library(ggpubr)
ggarrange(HeartDisease_no, HeartDisease_yes,
          ncol= 2, nrow = 1)

Checking if the distributions of 2 groups of people are normally distributed:

H₀: Distribution is normal

H₁: Distribution is not normal

suppressPackageStartupMessages(library("rstatix"))
mydata1 %>%
  rstatix::group_by(HeartDiseaseFactor) %>%
  shapiro_test(MaxHeartRate)
## # A tibble: 2 × 4
##   HeartDiseaseFactor variable     statistic         p
##   <fct>              <chr>            <dbl>     <dbl>
## 1 No                 MaxHeartRate     0.982 0.0000624
## 2 Yes                MaxHeartRate     0.996 0.252

By checking the p values, we can reject H₀ with p < 0.001 with people who do not have the disease.There is a 25% chance that the group of people who have the heart disease is normally distributed.

Because the assumption, that scores of both populations are normally distributed, is violated, we continue with the non-parametric Wilcox Rank Sum test.

H₀: Distribution locations are the same.

H₁: Distribution locations are not the same.

wilcox.test(mydata1$MaxHeartRate ~ mydata1$HeartDiseaseFactor,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata1$MaxHeartRate by mydata1$HeartDiseaseFactor
## W = 153090, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

We can reject the H₀ at p < 0.001, which means that the distribution locations are not the same.

suppressPackageStartupMessages(library("effectsize"))
effectsize(wilcox.test(mydata1$MaxHeartRate ~ mydata1$HeartDiseaseFactor,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |       95% CI
## --------------------------------
## 0.47              | [0.41, 0.53]
interpret_rank_biserial(0.47)
## [1] "very large"
## (Rules: funder2019)

Conclusion

Based on the data, we find that people who have a heart disease and the ones who do not have the heart disease differ in the in the average maximum heart rate (p < 0.001). The difference in distribution is very large (r=0.47).

2. ANOVA or Kruskal-Wallis Rank Sum test

Research question:

Is the average maximum heart rate the same for all four groups of chest pain (TA, ATA, NAP, ASY)?

H₀: μ TA = μ ATA = μ NAP = μ ASY

H₁: At least one μ is different

We will first do the Levene’s test to check the homogenity of variables:

H₀: Variances of maximal heart rate are the same in all four groups

H₁: Variances of maximal heart rate are not the same in all four groups

library(car)
leveneTest(mydata1$MaxHeartRate, group = mydata1$ChestPain)
## Warning in leveneTest.default(mydata1$MaxHeartRate, group = mydata1$ChestPain): mydata1$ChestPain coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  1.8809 0.1312
##       914

We can not reject H₀ at p < 0.132 which means that the variances of maximal heart rate are the same in all groups and the assumption of homogenity of variances is met.

We will also check normality with the Shapiro test:

H₀: maximal heart rate is normally distributed

H₁: maximal heart rate is not normally distributed

suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("rstatix"))
mydata1 %>%
  group_by(ChestPain) %>%
  shapiro_test(MaxHeartRate)
## # A tibble: 4 × 4
##   ChestPain variable     statistic        p
##   <chr>     <chr>            <dbl>    <dbl>
## 1 ASY       MaxHeartRate     0.996 0.311   
## 2 ATA       MaxHeartRate     0.982 0.0217  
## 3 NAP       MaxHeartRate     0.975 0.000949
## 4 TA        MaxHeartRate     0.973 0.364

We can confirm the H₀ at the group of people with ASY and TA, but reject H₀ for ATA at p < 0.022 and NAP at p < 0.001. By the results we see that the assumption, that maximal heart rate is normally distributed in all groups, is violated. Based on this we will continue with a non-parametric test (Kruskal-Wallis Rank Sum test).

H₀: location distribution of maximum heart rate is the same in all 4 groups

H₁: location distribution of maximum heart rate is not the same in all 4 groups

kruskal.test(MaxHeartRate ~ ChestPain, 
             data = mydata1)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  MaxHeartRate by ChestPain
## Kruskal-Wallis chi-squared = 126.25, df = 3, p-value < 2.2e-16

Based on the p value which is smaller than 0.001, we can reject H₀.

kruskal_effsize(MaxHeartRate ~ ChestPain, 
                data = mydata1)
## # A tibble: 1 × 5
##   .y.              n effsize method  magnitude
## * <chr>        <int>   <dbl> <chr>   <ord>    
## 1 MaxHeartRate   918   0.135 eta2[H] moderate

There is a moderate effect size (0.135).

groups_nonpar <- wilcox_test(MaxHeartRate ~ ChestPain,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)

groups_nonpar
## # A tibble: 6 × 9
##   .y.          group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>        <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 MaxHeartRate ASY    ATA      496   173    21372  7.58e-23 4.55e-22 ****        
## 2 MaxHeartRate ASY    NAP      496   203    33153  1.3 e-12 7.8 e-12 ****        
## 3 MaxHeartRate ASY    TA       496    46     6370. 7.06e- 7 4.24e- 6 ****        
## 4 MaxHeartRate ATA    NAP      173   203    20126. 1.5 e- 2 8.7 e- 2 ns          
## 5 MaxHeartRate ATA    TA       173    46     4254. 4.71e- 1 1   e+ 0 ns          
## 6 MaxHeartRate NAP    TA       203    46     4264  3.59e- 1 1   e+ 0 ns

Conclusion

Based on the analysis we found that the distribution of maximum heart rate differs for at least one of the chest pain groups (X = 126.25, p < 0.001), the effect size was moderate (eta2 = 0.135). With the Post-Hoc tests we revealed differences for 3 pairs (ASY+ATA, ASY+NAP, ASY+TA) with p < 0.001.

3. TEST OF POPULATION PROPORTIONS

Research question:

Can we confirm that the proportion of people who have the heart disease is more than 0.60?

Assumptions:

  • n * π > 5 -> 918*0.60 = 550.8 > 5

  • n * (1-π) > 5 -> 918*0.40 = 367.2 > 5

Since the assumtions are met, we can move on with the parametric test. We will be using prop test.

H₀: π = 0.60

H₁: π > 0.40

table(mydata1$HeartDiseaseFactor)
## 
##  No Yes 
## 410 508
prop.test(x=508, 
          n=918,
          p=0.60,
          correct=FALSE,
          alternative= "greater")
## 
##  1-sample proportions test without continuity correction
## 
## data:  508 out of 918, null probability 0.6
## X-squared = 8.3145, df = 1, p-value = 0.998
## alternative hypothesis: true p is greater than 0.6
## 95 percent confidence interval:
##  0.5262703 1.0000000
## sample estimates:
##         p 
## 0.5533769

Conclusion

Based on the data we can not reject H₀ at p = 0.998. This means that we can not confirm that the proportion of people who have the heart disease is more than 0.60. The sample estimate in this case is 0.55.