data <- read.table("C:/Users/Alisa/Downloads/archive/heart.csv", header=TRUE, sep=",", dec=".")
mydata <- data[ , -c(6,7,9,10,11)]
colnames(mydata) <- c("Age", "Gender", "ChestPain", "BloodPressure", "Cholesterol", "MaxHeartRate", "HeartDisease")
mydata$GenderFactor <- factor(mydata$Gender,
levels = c("M","F"),
labels = c("Male", "Female"))
mydata$HeartDiseaseFactor <- factor(mydata$HeartDisease,
levels = c("0","1"),
labels = c("No", "Yes"))
mydata1 <- drop_na(mydata) #there were none of the values that are not available
head(mydata1)
## Age Gender ChestPain BloodPressure Cholesterol MaxHeartRate HeartDisease GenderFactor HeartDiseaseFactor
## 1 40 M ATA 140 289 172 0 Male No
## 2 49 F NAP 160 180 156 1 Female Yes
## 3 37 M ATA 130 283 98 0 Male No
## 4 48 F ASY 138 214 108 1 Female Yes
## 5 54 M NAP 150 195 122 0 Male No
## 6 39 M NAP 120 339 170 0 Male No
Description:
Unit of observation: A patient Sample size: 918 observations
Variables: Age: age of the patient [years]
Gender: gender of the patient - M = Male - F = Female
ChestPain: chest pain type - TA = Typical Angina - ATA = Atypical Angina - NAP = Non-Anginal Pain - ASY = Asymptomatic
BloodPressure: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mm/dl]
MaxHeartRate: maximum heart rate achieved [Numeric value between 60 and 202]
HeartDisease: output class - 1: Has heart disease - 0: Doesn’t have heart disease
Datasource: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download
Is the average maximum heart rate of people who have a heart disease and do not have a heart disease the same?
H₀:μ maximum heart rate of people who have a heart disease - μ maxHeart Rate of people who do not have a heart disease = 0
H₁:μ maximum heart rate of people who have a heart disease - μ maxHeart Rate of people who do not have a heart disease ≠ 0
Checking if the distributions of 2 groups of people are normally distributed:
library(ggplot2)
HeartDisease_no <- ggplot(mydata1[mydata1$HeartDiseaseFactor=="No", ], aes(x = MaxHeartRate)) +
theme_linedraw() +
geom_histogram(binwidth = 10, color = "white") +
ggtitle("No Heart disease")
HeartDisease_yes <- ggplot(mydata1[mydata1$HeartDiseaseFactor=="Yes", ], aes(x = MaxHeartRate)) +
theme_linedraw() +
geom_histogram(binwidth = 10, color = "white") +
ggtitle("Heart disease")
library(ggpubr)
ggarrange(HeartDisease_no, HeartDisease_yes,
ncol= 2, nrow = 1)
Checking if the distributions of 2 groups of people are normally distributed:
H₀: Distribution is normal
H₁: Distribution is not normal
suppressPackageStartupMessages(library("rstatix"))
mydata1 %>%
rstatix::group_by(HeartDiseaseFactor) %>%
shapiro_test(MaxHeartRate)
## # A tibble: 2 × 4
## HeartDiseaseFactor variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 No MaxHeartRate 0.982 0.0000624
## 2 Yes MaxHeartRate 0.996 0.252
By checking the p values, we can reject H₀ with p < 0.001 with people who do not have the disease.There is a 25% chance that the group of people who have the heart disease is normally distributed.
Because the assumption, that scores of both populations are normally distributed, is violated, we continue with the non-parametric Wilcox Rank Sum test.
H₀: Distribution locations are the same.
H₁: Distribution locations are not the same.
wilcox.test(mydata1$MaxHeartRate ~ mydata1$HeartDiseaseFactor,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata1$MaxHeartRate by mydata1$HeartDiseaseFactor
## W = 153090, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
We can reject the H₀ at p < 0.001, which means that the distribution locations are not the same.
suppressPackageStartupMessages(library("effectsize"))
effectsize(wilcox.test(mydata1$MaxHeartRate ~ mydata1$HeartDiseaseFactor,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.47 | [0.41, 0.53]
interpret_rank_biserial(0.47)
## [1] "very large"
## (Rules: funder2019)
Based on the data, we find that people who have a heart disease and the ones who do not have the heart disease differ in the in the average maximum heart rate (p < 0.001). The difference in distribution is very large (r=0.47).
Is the average maximum heart rate the same for all four groups of chest pain (TA, ATA, NAP, ASY)?
H₀: μ TA = μ ATA = μ NAP = μ ASY
H₁: At least one μ is different
We will first do the Levene’s test to check the homogenity of variables:
H₀: Variances of maximal heart rate are the same in all four groups
H₁: Variances of maximal heart rate are not the same in all four groups
library(car)
leveneTest(mydata1$MaxHeartRate, group = mydata1$ChestPain)
## Warning in leveneTest.default(mydata1$MaxHeartRate, group = mydata1$ChestPain): mydata1$ChestPain coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.8809 0.1312
## 914
We can not reject H₀ at p < 0.132 which means that the variances of maximal heart rate are the same in all groups and the assumption of homogenity of variances is met.
We will also check normality with the Shapiro test:
H₀: maximal heart rate is normally distributed
H₁: maximal heart rate is not normally distributed
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("rstatix"))
mydata1 %>%
group_by(ChestPain) %>%
shapiro_test(MaxHeartRate)
## # A tibble: 4 × 4
## ChestPain variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 ASY MaxHeartRate 0.996 0.311
## 2 ATA MaxHeartRate 0.982 0.0217
## 3 NAP MaxHeartRate 0.975 0.000949
## 4 TA MaxHeartRate 0.973 0.364
We can confirm the H₀ at the group of people with ASY and TA, but reject H₀ for ATA at p < 0.022 and NAP at p < 0.001. By the results we see that the assumption, that maximal heart rate is normally distributed in all groups, is violated. Based on this we will continue with a non-parametric test (Kruskal-Wallis Rank Sum test).
H₀: location distribution of maximum heart rate is the same in all 4 groups
H₁: location distribution of maximum heart rate is not the same in all 4 groups
kruskal.test(MaxHeartRate ~ ChestPain,
data = mydata1)
##
## Kruskal-Wallis rank sum test
##
## data: MaxHeartRate by ChestPain
## Kruskal-Wallis chi-squared = 126.25, df = 3, p-value < 2.2e-16
Based on the p value which is smaller than 0.001, we can reject H₀.
kruskal_effsize(MaxHeartRate ~ ChestPain,
data = mydata1)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 MaxHeartRate 918 0.135 eta2[H] moderate
There is a moderate effect size (0.135).
groups_nonpar <- wilcox_test(MaxHeartRate ~ ChestPain,
paired = FALSE,
p.adjust.method = "bonferroni",
data = mydata)
groups_nonpar
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 MaxHeartRate ASY ATA 496 173 21372 7.58e-23 4.55e-22 ****
## 2 MaxHeartRate ASY NAP 496 203 33153 1.3 e-12 7.8 e-12 ****
## 3 MaxHeartRate ASY TA 496 46 6370. 7.06e- 7 4.24e- 6 ****
## 4 MaxHeartRate ATA NAP 173 203 20126. 1.5 e- 2 8.7 e- 2 ns
## 5 MaxHeartRate ATA TA 173 46 4254. 4.71e- 1 1 e+ 0 ns
## 6 MaxHeartRate NAP TA 203 46 4264 3.59e- 1 1 e+ 0 ns
Based on the analysis we found that the distribution of maximum heart rate differs for at least one of the chest pain groups (X = 126.25, p < 0.001), the effect size was moderate (eta2 = 0.135). With the Post-Hoc tests we revealed differences for 3 pairs (ASY+ATA, ASY+NAP, ASY+TA) with p < 0.001.
Can we confirm that the proportion of people who have the heart disease is more than 0.60?
Assumptions:
n * π > 5 -> 918*0.60 = 550.8 > 5
n * (1-π) > 5 -> 918*0.40 = 367.2 > 5
Since the assumtions are met, we can move on with the parametric test. We will be using prop test.
H₀: π = 0.60
H₁: π > 0.40
table(mydata1$HeartDiseaseFactor)
##
## No Yes
## 410 508
prop.test(x=508,
n=918,
p=0.60,
correct=FALSE,
alternative= "greater")
##
## 1-sample proportions test without continuity correction
##
## data: 508 out of 918, null probability 0.6
## X-squared = 8.3145, df = 1, p-value = 0.998
## alternative hypothesis: true p is greater than 0.6
## 95 percent confidence interval:
## 0.5262703 1.0000000
## sample estimates:
## p
## 0.5533769
Based on the data we can not reject H₀ at p = 0.998. This means that we can not confirm that the proportion of people who have the heart disease is more than 0.60. The sample estimate in this case is 0.55.