heart <- read.table("C:/Users/Tinkara/Desktop/IMB 2024 R/Bootcamp/heart_disease_uci.csv", header=TRUE, sep=",", dec=".")
### Removing unneeded variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
heart_clean <- heart %>% select(age, sex, cp, trestbps, chol, thalch)
### Cleaning data
# A value of 0 is treated as a missing measurement
heart_clean$trestbps <- ifelse(test = heart_clean$trestbps == 0,
                               yes = NA,
                               no = heart_clean$trestbps)
heart_clean$chol <- ifelse(test = heart_clean$chol == 0,
                           yes = NA,
                           no = heart_clean$chol)
library(tidyr)
heart_clean <- drop_na(heart_clean)
head(heart_clean)
## age sex cp trestbps chol thalch
## 1 63 Male typical angina 145 233 150
## 2 67 Male asymptomatic 160 286 108
## 3 67 Male asymptomatic 120 229 129
## 4 37 Male non-anginal 130 250 187
## 5 41 Female atypical angina 130 204 172
## 6 56 Male atypical angina 120 236 178
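The zero-to-NA recoding above can also be written once for several columns. The following is a minimal base R sketch on a made-up three-row data frame (the values in `toy` are illustrative, not taken from the dataset). It assumes, as the cleaning step does, that a zero in these columns marks a missing measurement.

```r
# Toy data (hypothetical values): zeros stand for missing measurements
toy <- data.frame(
  trestbps = c(145, 0, 120),
  chol     = c(233, 286, 0)
)

# Recode 0 to NA in every listed column
vars <- c("trestbps", "chol")
toy[vars] <- lapply(toy[vars], function(x) replace(x, which(x == 0), NA))

# Keep only complete rows (base R counterpart of tidyr::drop_na)
toy_complete <- toy[complete.cases(toy), ]
toy_complete
```

Looping over a vector of column names keeps the rule in one place, so adding another column later only means extending `vars`.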
Unit of observation: one patient
Sample size after cleaning: 675 patients
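Definition of variables and units of measurement:
age: age of the patient (years)
sex: sex of the patient (Male, Female)
cp: type of chest pain (typical angina, atypical angina, non-anginal, asymptomatic)
trestbps: resting blood pressure on admission to the hospital (mm Hg)
chol: serum cholesterol (mg/dl)
thalch: maximum heart rate achieved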
Source of data: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data
### Data manipulation, creating factors
# Labels are omitted because they would be identical to the levels
heart_clean$sex <- factor(heart_clean$sex,
                          levels = c("Male", "Female"))
heart_clean$cp <- factor(heart_clean$cp,
                         levels = c("typical angina", "atypical angina", "non-anginal", "asymptomatic"))
### Descriptive statistic for numerical variables
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(heart_clean[, c("age", "trestbps", "chol", "thalch")]), 2)
## age trestbps chol thalch
## nbr.val 675.00 675.00 675.00 675.00
## nbr.null 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 28.00 92.00 85.00 69.00
## max 77.00 200.00 603.00 202.00
## range 49.00 108.00 518.00 133.00
## sum 35478.00 89569.00 166828.00 95351.00
## median 54.00 130.00 240.00 142.00
## mean 52.56 132.69 247.15 141.26
## SE.mean 0.36 0.68 2.26 0.96
## CI.mean.0.95 0.71 1.34 4.43 1.88
## var 88.76 314.42 3435.27 620.29
## std.dev 9.42 17.73 58.61 24.91
## coef.var 0.18 0.13 0.24 0.18
The average age of the patients in our sample was 52.56 years.
Half of the patients in our sample had serum cholesterol of at most 240 mg/dl, and the other half had higher serum cholesterol.
The lowest resting blood pressure on admission in the sample was 92 mm Hg, and the highest was 200 mm Hg.
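To see how the rows of the table relate to each other, the sketch below recomputes `SE.mean`, `CI.mean.0.95` and `coef.var` for age from the rounded values printed above. It assumes the confidence-interval row is the half-width based on the t distribution with n - 1 degrees of freedom. Because the inputs are rounded, the results are approximate.

```r
# Values copied from the stat.desc() output for age
n        <- 675
mean_age <- 52.56
sd_age   <- 9.42

se_mean  <- sd_age / sqrt(n)                 # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% CI
coef_var <- sd_age / mean_age                # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
```

Adding and subtracting `ci_half` from the mean gives the 95% confidence interval for the average age, roughly 51.85 to 53.27 years.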
### Descriptive statistics by group
library(psych)
describeBy(heart_clean$trestbps, heart_clean$sex)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 500 133.02 17.48 130 131.98 14.83 92 200 108 0.62 0.5 0.78
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 175 131.75 18.45 130 130.45 14.83 94 200 106 0.76 0.84 1.39
The average resting blood pressure of male patients in our sample was 133.02 mm Hg on admission to the hospital.
We had 175 female patients in our sample.
Do the average serum cholesterol values differ between male and female patients?
describeBy(heart_clean$chol, heart_clean$sex)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 500 243.96 56.53 237 240.53 45.96 85 603 518 1.42 6.09 2.53
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 175 256.28 63.46 248 250.76 59.3 141 564 423 1.18 2.76 4.8
To test this hypothesis, I can use either the independent-samples t-test with Welch correction (parametric test) or the Wilcoxon rank sum test (non-parametric test). First we need to check the assumptions:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(heart_clean, aes(x = chol)) +
geom_histogram(binwidth = 10, colour="gray", fill="darkblue") +
facet_wrap(~sex, ncol = 1) +
ylab("Frequency")
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
heart_clean %>%
group_by(sex) %>%
shapiro_test(chol)
## # A tibble: 2 × 4
## sex variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male chol 0.918 8.22e-16
## 2 Female chol 0.934 3.81e- 7
H0: Serum cholesterol is normally distributed in males. H1: Serum cholesterol is not normally distributed in males.
H0: Serum cholesterol is normally distributed in females. H1: Serum cholesterol is not normally distributed in females.
Based on the p-values, we reject H0 for males (p < 0.001) and for females (p < 0.001). Serum cholesterol is not normally distributed in either group, so the appropriate way to proceed is the non-parametric test.
Because normality is not met, the parametric test is not appropriate here; it is reported below only for comparison.
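The grouped normality check can also be done without `rstatix`. The sketch below is a base R version run on simulated, right-skewed data; the data frame `sim` and its values are hypothetical and only stand in for the real measurements.

```r
set.seed(1)  # simulated data, for illustration only

# Hypothetical right-skewed values in two groups (not the real measurements)
sim <- data.frame(
  sex  = factor(rep(c("Male", "Female"), times = c(200, 100)),
                levels = c("Male", "Female")),
  chol = rlnorm(300, meanlog = 5.4, sdlog = 0.6)
)

# Shapiro-Wilk p-value for each group
p_values <- tapply(sim$chol, sim$sex, function(x) shapiro.test(x)$p.value)
p_values
```

`tapply()` applies the test separately within each level of the factor, which mirrors what `group_by(sex) %>% shapiro_test(chol)` does above.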
Hypothesis:
H0: μ(M) = μ(F)
H1: μ(M) ≠ μ(F)
t.test(heart_clean$chol ~ heart_clean$sex,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: heart_clean$chol by heart_clean$sex
## t = -2.2723, df = 276.64, p-value = 0.02384
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -22.997224 -1.646776
## sample estimates:
## mean in group Male mean in group Female
## 243.958 256.280
We reject H0 at p = 0.024. Mean serum cholesterol differs between males and females; the sample mean is higher for females (256.28 mg/dl) than for males (243.96 mg/dl).
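As a cross-check, the Welch statistic can be rebuilt from the group summaries alone. The sketch below uses the means from the t-test output and the standard deviations and group sizes from the `describeBy()` output, with the Welch-Satterthwaite formula for the degrees of freedom. Since the standard deviations are rounded, small deviations from the printed result are expected.

```r
# Summary statistics copied from the outputs above (chol by sex)
m1 <- 243.958; s1 <- 56.53; n1 <- 500   # Male
m2 <- 256.280; s2 <- 63.46; n2 <- 175   # Female

# Squared standard error of each group mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

t_stat   <- (m1 - m2) / sqrt(v1 + v2)
# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
# Two-sided p-value
p_val    <- 2 * pt(-abs(t_stat), df_welch)

round(c(t = t_stat, df = df_welch, p = p_val), 4)
```

The non-integer degrees of freedom are the visible sign of the Welch correction: the two group variances are not pooled.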
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
cohens_d(heart_clean$chol ~ heart_clean$sex,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## --------------------------
## -0.21 | [-0.38, -0.03]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.21)
## [1] "small"
## (Rules: cohen1988)
The difference in serum cholesterol between males and females is small (d = -0.21).
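A rough hand calculation shows where this value comes from. The sketch assumes that the un-pooled standard deviation is the square root of the average of the two group variances, which is my reading of `pooled_sd = FALSE`; the inputs are the rounded summaries printed earlier.

```r
# Group summaries copied from the outputs above (chol by sex)
m1 <- 243.958; s1 <- 56.53   # Male
m2 <- 256.280; s2 <- 63.46   # Female

# Assumed definition of the un-pooled SD: root of the mean of the two variances
sd_unpooled <- sqrt((s1^2 + s2^2) / 2)

d <- (m1 - m2) / sd_unpooled
round(d, 2)
```

The sign is negative only because males are the first group; the size of the effect is read from the absolute value.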
The Wilcoxon rank sum test is more suitable in this case, because the Shapiro-Wilk test above shows that the assumption of normality is violated.
Hypothesis:
H0: The distribution location of serum cholesterol is the same for males and females.
H1: The distribution location of serum cholesterol is different for males and females.
wilcox.test(heart_clean$chol ~ heart_clean$sex,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: heart_clean$chol by heart_clean$sex
## W = 39415, p-value = 0.05084
## alternative hypothesis: true location shift is not equal to 0
We cannot reject H0 (p = 0.051). We cannot claim that the distribution locations of serum cholesterol differ between men and women.
effectsize(wilcox.test(heart_clean$chol ~ heart_clean$sex,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.10 | [-0.20, 0.00]
interpret_rank_biserial(0.10)
## [1] "small"
## (Rules: funder2019)
The effect size is small at most: the rounded value 0.10 sits on the boundary between "very small" and "small" under these rules.
Based on the sample data, we cannot claim that men and women differ in their serum cholesterol values. The difference between the distribution locations of serum cholesterol for men and women is very small to small (r = -0.10).
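The rank-biserial correlation can be recovered from the test statistic, which helps explain why the label is borderline. The sketch assumes that `W` reported by `wilcox.test()` is the Mann-Whitney U statistic of the first group (Male) and uses the relation r = 2U / (n1 * n2) - 1.

```r
# Values copied from the outputs above
W  <- 39415   # Wilcoxon rank sum statistic
n1 <- 500     # Male
n2 <- 175     # Female

# Rank-biserial correlation from U (assumed equal to W here)
r_rb <- 2 * W / (n1 * n2) - 1
r_rb
```

The unrounded value is just under 0.10 in absolute terms. If the interpretation rule places a cut-off at 0.10, interpreting the rounded value and interpreting the unrounded value can give different labels.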
Is there a correlation between the age and maximum heart rate achieved?
To answer the research question, I will use the Pearson correlation coefficient, because both variables are numeric.
cor(heart_clean$age, heart_clean$thalch,
method = "pearson",
use = "complete.obs")
## [1] -0.3547074
# cor.test() has no `use` argument; missing values were already dropped during cleaning
cor.test(heart_clean$age, heart_clean$thalch,
         method = "pearson")
##
## Pearson's product-moment correlation
##
## data: heart_clean$age and heart_clean$thalch
## t = -9.8418, df = 673, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4189565 -0.2869241
## sample estimates:
## cor
## -0.3547074
We tested:
H0: ρ(age, thalch) = 0
H1: ρ(age, thalch) ≠ 0
We reject H0 at p < 0.001. The linear relationship between age and maximum heart rate achieved is negative and semi-strong (r = -0.35).
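The test statistic and the confidence interval in the output follow directly from r and n. The sketch below recomputes them, using the t transformation of r for the test and the Fisher z transformation for the interval; the inputs are copied from the output above.

```r
# Values copied from the cor.test() output
r <- -0.3547074
n <- 675

# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4)
```

The interval is built on the z scale and transformed back, which is why it is not symmetric around r.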
Does the type of chest pain depend on the gender of the patient?
H0: There is no association between gender and type of chest pain.
H1: There is an association between gender and type of chest pain.
Assumptions for chi square test: the observations are independent and all expected frequencies are above 5 (checked below).
chi_square <- chisq.test(heart_clean$sex, heart_clean$cp,
correct = FALSE)
chi_square
##
## Pearson's Chi-squared test
##
## data: heart_clean$sex and heart_clean$cp
## X-squared = 22.223, df = 3, p-value = 5.863e-05
We reject H0 at p < 0.001. There is an association between gender and type of chest pain.
addmargins(chi_square$observed)
## heart_clean$cp
## heart_clean$sex typical angina atypical angina non-anginal asymptomatic Sum
## Male 28 98 98 276 500
## Female 9 54 50 62 175
## Sum 37 152 148 338 675
The table above shows the observed frequencies.
round(chi_square$expected, 2)
## heart_clean$cp
## heart_clean$sex typical angina atypical angina non-anginal asymptomatic
## Male 27.41 112.59 109.63 250.37
## Female 9.59 39.41 38.37 87.63
The table above shows the expected (theoretical) frequencies. All are above 5, so the assumptions for the chi square test are met.
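Each expected frequency is the row total times the column total divided by the sample size. The sketch below rebuilds the table from the observed counts printed above; the object names `obs` and `expected` are my own.

```r
# Observed counts copied from the table above
obs <- matrix(c(28, 98, 98, 276,
                 9, 54, 50,  62),
              nrow = 2, byrow = TRUE,
              dimnames = list(
                sex = c("Male", "Female"),
                cp  = c("typical angina", "atypical angina",
                        "non-anginal", "asymptomatic")))

# Expected count = row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)

# Assumption check: every expected count should be at least 5
all(expected >= 5)
```

The smallest expected count belongs to the cell with the smallest margins (female patients with typical angina), so that is the cell to watch when the sample is reduced.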
round(chi_square$residuals, 2)
## heart_clean$cp
## heart_clean$sex typical angina atypical angina non-anginal asymptomatic
## Male 0.11 -1.38 -1.11 1.62
## Female -0.19 2.32 1.88 -2.74
effectsize::cramers_v(heart_clean$sex, heart_clean$cp)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.17 | [0.08, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.17)
## [1] "small"
## (Rules: funder2019)
We reject H0 at p < 0.001. Female patients with atypical angina are more frequent than expected (residual 2.32, significant at α = 5%), and female patients with asymptomatic chest pain are less frequent than expected (residual -2.74, significant at α = 1%). The effect size is small (Cramér's V = 0.17).
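The residual-based reading can be reproduced from the observed counts. The sketch below computes Pearson residuals, flags cells beyond the two-sided normal cut-offs for α = 5% and α = 1%, and computes the plain (unadjusted) Cramér's V. The object names are my own, and the unadjusted V is expected to be slightly larger than the bias-adjusted value reported above.

```r
# Observed counts copied from the table above
obs <- matrix(c(28, 98, 98, 276,
                 9, 54, 50,  62),
              nrow = 2, byrow = TRUE,
              dimnames = list(
                sex = c("Male", "Female"),
                cp  = c("typical angina", "atypical angina",
                        "non-anginal", "asymptomatic")))

expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals and the chi-square statistic they add up to
res  <- (obs - expected) / sqrt(expected)
chi2 <- sum(res^2)

# Cells that deviate from independence at alpha = 5% and alpha = 1%
flag_5 <- abs(res) > qnorm(0.975)
flag_1 <- abs(res) > qnorm(0.995)

# Unadjusted Cramer's V (no bias correction)
v <- sqrt(chi2 / (sum(obs) * (min(dim(obs)) - 1)))

round(res, 2)
flag_5
flag_1
round(c(chi2 = chi2, V = v), 3)
```

Only the two female cells named in the conclusion are flagged at 5%, and only the asymptomatic one at 1%.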