1 Data import

heart <- read.table("C:/Users/Tinkara/Desktop/IMB 2024 R/Bootcamp/heart_disease_uci.csv", header=TRUE, sep=",", dec=".")

### Removing unneeded variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
heart_clean <- heart %>% select(age, sex, cp, trestbps, chol, thalch)
### Cleaning data

heart_clean$trestbps <-ifelse(test = heart_clean$trestbps == 0,
                              yes = NA,
                              no = heart_clean$trestbps)

heart_clean$chol <-ifelse(test = heart_clean$chol == 0,
                              yes = NA,
                              no = heart_clean$chol)
library(tidyr)
heart_clean <- drop_na(heart_clean)

head(heart_clean)
##   age    sex              cp trestbps chol thalch
## 1  63   Male  typical angina      145  233    150
## 2  67   Male    asymptomatic      160  286    108
## 3  67   Male    asymptomatic      120  229    129
## 4  37   Male     non-anginal      130  250    187
## 5  41 Female atypical angina      130  204    172
## 6  56   Male atypical angina      120  236    178
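The cleaning step above (recoding zeros as missing, then dropping incomplete rows) can be sketched in base R on a small toy data frame. The values and the names `toy`, `toy_clean` and `n_removed` are hypothetical and only illustrate the pattern; they are not taken from the real data set.

```r
# Toy data frame with hypothetical values (not the real data set)
toy <- data.frame(
  age      = c(63, 67, 41, 56),
  trestbps = c(145, 0, 130, 120),
  chol     = c(233, 286, 0, 236)
)

# A blood pressure or cholesterol value of 0 is not plausible,
# so it is recoded as a missing value
for (v in c("trestbps", "chol")) {
  toy[[v]][which(toy[[v]] == 0)] <- NA
}

# Keep complete rows only (base R counterpart of tidyr::drop_na)
toy_clean <- toy[complete.cases(toy), ]

# Number of rows lost in the cleaning step
n_removed <- nrow(toy) - nrow(toy_clean)
n_removed
```

Counting the removed rows this way makes it easy to report how much of the sample was lost to cleaning.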

2 Data description

Unit of observation: one patient

Sample size: 675 patients (after removing rows with missing values)

Definition of variables and units of measurement:

Source of data: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

### Data manipulation, creating factors

heart_clean$sex <- factor(heart_clean$sex,
                         levels = c ("Male", "Female"),
                         labels = c ("Male","Female"))

heart_clean$cp <- factor(heart_clean$cp,
                         levels = c ("typical angina", "atypical angina", "non-anginal", "asymptomatic"),
                         labels = c ("typical angina", "atypical angina", "non-anginal", "asymptomatic"))
### Descriptive statistics for numerical variables
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
round(stat.desc(heart_clean[, c("age", "trestbps", "chol", "thalch")]), 2)
##                   age trestbps      chol   thalch
## nbr.val        675.00   675.00    675.00   675.00
## nbr.null         0.00     0.00      0.00     0.00
## nbr.na           0.00     0.00      0.00     0.00
## min             28.00    92.00     85.00    69.00
## max             77.00   200.00    603.00   202.00
## range           49.00   108.00    518.00   133.00
## sum          35478.00 89569.00 166828.00 95351.00
## median          54.00   130.00    240.00   142.00
## mean            52.56   132.69    247.15   141.26
## SE.mean          0.36     0.68      2.26     0.96
## CI.mean.0.95     0.71     1.34      4.43     1.88
## var             88.76   314.42   3435.27   620.29
## std.dev          9.42    17.73     58.61    24.91
## coef.var         0.18     0.13      0.24     0.18

The average age of patients in our sample was 52.56 years.

Half of the patients in our sample had serum cholesterol of 240.00 mg/dl or less, and the other half had higher values.

The lowest resting blood pressure on admission in the sample was 92 mm Hg, and the highest was 200 mm Hg.
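The rows `SE.mean` and `CI.mean.0.95` in the table can be reproduced by hand for age. This is a sketch from the rounded values printed above, assuming that `stat.desc` reports the half-width of a t-based confidence interval; the variable names are illustrative.

```r
# Values for age, copied from the stat.desc() output (rounded)
n <- 675      # nbr.val
m <- 52.56    # mean
s <- 9.42     # std.dev

# Standard error of the mean
se <- s / sqrt(n)

# Half-width of the 95% confidence interval for the mean,
# using the t distribution with n - 1 degrees of freedom
ci_half <- qt(0.975, df = n - 1) * se

# Resulting interval for the mean age
ci <- c(lower = m - ci_half, upper = m + ci_half)

round(c(se = se, ci_half = ci_half), 2)
round(ci, 2)
```

Under that assumption, `CI.mean.0.95` is a half-width rather than an interval, so the interval for the mean age is the mean plus or minus that value.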

### Descriptive statistics by group

library(psych)
describeBy(heart_clean$trestbps, heart_clean$sex)
## 
##  Descriptive statistics by group 
## group: Male
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 500 133.02 17.48    130  131.98 14.83  92 200   108 0.62      0.5 0.78
## ------------------------------------------------------------ 
## group: Female
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 175 131.75 18.45    130  130.45 14.83  94 200   106 0.76     0.84 1.39

The average resting blood pressure on admission for male patients in our sample was 133.02 mm Hg.

Our sample included 175 female patients.

3.1 Research question 1

Do average serum cholesterol values differ between male and female patients?

describeBy(heart_clean$chol, heart_clean$sex)
## 
##  Descriptive statistics by group 
## group: Male
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 500 243.96 56.53    237  240.53 45.96  85 603   518 1.42     6.09 2.53
## ------------------------------------------------------------ 
## group: Female
##    vars   n   mean    sd median trimmed  mad min max range skew kurtosis  se
## X1    1 175 256.28 63.46    248  250.76 59.3 141 564   423 1.18     2.76 4.8

To test this hypothesis, I can use either the independent-samples t-test with Welch correction (parametric test) or the Wilcoxon rank sum test (non-parametric test). The choice depends on whether the normality assumption holds, so I check it first:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(heart_clean, aes(x = chol)) +
  geom_histogram(binwidth = 10, colour="gray", fill="darkblue") +
  facet_wrap(~sex, ncol = 1) + 
  ylab("Frequency")

library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
heart_clean %>% 
  group_by(sex) %>%
  shapiro_test(chol)
## # A tibble: 2 × 4
##   sex    variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 Male   chol         0.918 8.22e-16
## 2 Female chol         0.934 3.81e- 7

H0: Serum cholesterol is normally distributed in males.

H1: Serum cholesterol is not normally distributed in males.

H0: Serum cholesterol is normally distributed in females.

H1: Serum cholesterol is not normally distributed in females.

Based on the p-values, we reject H0 for both males and females (p < 0.001 in each group). Serum cholesterol is not normally distributed in either group, so the appropriate way to proceed is the non-parametric test.
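The decision rule used here (run the Shapiro-Wilk test per group, then choose the test family) can be sketched in base R. The data below are simulated and purely hypothetical, and the names `toy`, `p_values` and `use_nonparametric` are illustrative.

```r
set.seed(1)  # make the simulated toy data reproducible

# Simulated, right-skewed values for two groups (hypothetical data)
toy <- data.frame(
  sex  = factor(rep(c("Male", "Female"), times = c(60, 40)),
                levels = c("Male", "Female")),
  chol = rlnorm(100, meanlog = 5.5, sdlog = 0.25)
)

# Shapiro-Wilk p-value within each group
p_values <- tapply(toy$chol, toy$sex,
                   function(v) shapiro.test(v)$p.value)

# If normality is rejected in at least one group,
# prefer the non-parametric test
use_nonparametric <- any(p_values < 0.05)

p_values
use_nonparametric
```

With large samples the Shapiro-Wilk test detects even mild departures from normality, which is one reason to look at the histograms as well.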

Parametric test: t-test with Welch correction

In this case we should not rely on the parametric test, because the normality assumption is not met; it is shown here for comparison.

Hypothesis:

H0: μ(M) = μ(F)

H1: μ(M) ≠ μ(F)

t.test(heart_clean$chol ~ heart_clean$sex,
       var.equal = FALSE,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  heart_clean$chol by heart_clean$sex
## t = -2.2723, df = 276.64, p-value = 0.02384
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -22.997224  -1.646776
## sample estimates:
##   mean in group Male mean in group Female 
##              243.958              256.280

We reject H0 (p = 0.024). The mean serum cholesterol differs between males and females.
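As a cross-check, the Welch statistic and its degrees of freedom can be recomputed from the group summaries printed by `describeBy`. This is a sketch based on rounded means and standard deviations, so it agrees with the `t.test` output only up to rounding; the variable names are illustrative.

```r
# Group summaries for chol, copied from the describeBy() output (rounded)
m1 <- 243.96; s1 <- 56.53; n1 <- 500   # Male
m2 <- 256.28; s2 <- 63.46; n2 <- 175   # Female

# Squared standard errors of the two group means
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic: difference in means over its unpooled standard error
t_stat <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 3)
```

The non-integer degrees of freedom are the visible sign of the Welch correction: the variances are not pooled, so the degrees of freedom are approximated rather than set to n1 + n2 - 2.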

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(heart_clean$chol ~ heart_clean$sex,
         pooled_sd = FALSE)
## Cohen's d |         95% CI
## --------------------------
## -0.21     | [-0.38, -0.03]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.21)
## [1] "small"
## (Rules: cohen1988)

The difference in serum cholesterol between males and females is small (d = -0.21).
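The reported effect size can be approximated from the same group summaries. The sketch below assumes that the un-pooled standard deviation is the square root of the average of the two group variances, which is consistent with the printed value but is an assumption about the internals of `cohens_d`.

```r
# Group summaries for chol, copied from the describeBy() output (rounded)
m1 <- 243.96; s1 <- 56.53   # Male
m2 <- 256.28; s2 <- 63.46   # Female

# Un-pooled SD: square root of the mean of the two variances (assumed formula)
sd_unpooled <- sqrt((s1^2 + s2^2) / 2)

# Cohen's d: standardized difference in means (Male minus Female)
d <- (m1 - m2) / sd_unpooled

round(d, 2)
```

The negative sign only reflects the order of the groups: males have the lower mean, so Male minus Female is negative.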

Non-parametric test: Wilcoxon Rank Sum Test

The Wilcoxon rank sum test is more suitable in this case, because the Shapiro-Wilk tests above show that the normality assumption is violated.

Hypothesis:

H0: The distribution location of serum cholesterol is the same for males and females.

H1: The distribution location of serum cholesterol is different for males and females.

wilcox.test(heart_clean$chol ~ heart_clean$sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  heart_clean$chol by heart_clean$sex
## W = 39415, p-value = 0.05084
## alternative hypothesis: true location shift is not equal to 0

We cannot reject H0 (p = 0.051). We cannot claim that the distribution locations of serum cholesterol differ between males and females.

effectsize(wilcox.test(heart_clean$chol ~ heart_clean$sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.10             | [-0.20, 0.00]
interpret_rank_biserial(0.10)
## [1] "small"
## (Rules: funder2019)

The effect size is small (the rounded value of 0.10 lies at the lower boundary of the "small" category).
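The rank-biserial correlation can be recovered from the `W` statistic and the two group sizes. This is a sketch that uses the standard relation r = 2W / (n1 * n2) - 1, where W counts the (Male, Female) pairs in which the male value is larger, with ties counted as half; the variable names are illustrative.

```r
# Values copied from the wilcox.test() and describeBy() output
W  <- 39415   # Wilcoxon rank sum statistic for the first group (Male)
n1 <- 500     # number of male patients
n2 <- 175     # number of female patients

# Share of pairs in which the male value exceeds the female value
p_male_higher <- W / (n1 * n2)

# Rank-biserial correlation: share of pairs favouring males
# minus share of pairs favouring females
r_rb <- 2 * p_male_higher - 1

round(c(p_male_higher = p_male_higher, r_rb = r_rb), 3)
```

Read this way, a male patient has the higher cholesterol value in about 45% of all male-female pairs, which is close to the 50% expected when the two distributions coincide.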

Answering RQ1

Based on the sample data, we cannot claim that men and women differ in serum cholesterol. The difference between the distribution locations of serum cholesterol for men and women is small at best (r = -0.10).

3.2 Research question 2

Is there a correlation between age and maximum heart rate achieved?

To answer the research question, I use the Pearson correlation coefficient, because both variables are numeric.

cor(heart_clean$age, heart_clean$thalch,
    method = "pearson",
    use = "complete.obs")
## [1] -0.3547074
cor.test(heart_clean$age, heart_clean$thalch,
    method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  heart_clean$age and heart_clean$thalch
## t = -9.8418, df = 673, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4189565 -0.2869241
## sample estimates:
##        cor 
## -0.3547074

We tested:

H0: ρ(age, thalch) = 0

H1: ρ(age, thalch) ≠ 0

We reject H0 (p < 0.001). The linear relationship between age and maximum heart rate achieved is negative and moderate (r = -0.35).
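The test statistic and confidence interval printed by `cor.test` follow from the correlation coefficient and the sample size alone. The sketch below uses the standard t transformation and the Fisher z transformation; the variable names are illustrative.

```r
# Values copied from the cor.test() output
r <- -0.3547074   # Pearson correlation between age and thalch
n <- 675          # number of patients (df = n - 2 = 673)

# t statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z    <- atanh(r)            # transform r to the z scale
se_z <- 1 / sqrt(n - 3)     # standard error on the z scale
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

# Share of variance in one variable linearly explained by the other
r_squared <- r^2

round(c(t = t_stat, lower = ci[1], upper = ci[2], r2 = r_squared), 4)
```

The squared correlation of about 0.13 puts the strength of the relationship in perspective: age linearly accounts for roughly 13% of the variation in maximum heart rate.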

3.3 Research question 3

Does the type of chest pain depend on the gender of the patient?

H0: There is no association between gender and type of chest pain.

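H1: There is an association between gender and type of chest pain.

Assumption for the chi-square test: all expected frequencies should be above 5 (checked with the expected frequencies further down).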

chi_square <- chisq.test(heart_clean$sex, heart_clean$cp,
                         correct = FALSE)

chi_square
## 
##  Pearson's Chi-squared test
## 
## data:  heart_clean$sex and heart_clean$cp
## X-squared = 22.223, df = 3, p-value = 5.863e-05

We reject H0 (p < 0.001). There is an association between gender and type of chest pain.

addmargins(chi_square$observed)
##                heart_clean$cp
## heart_clean$sex typical angina atypical angina non-anginal asymptomatic Sum
##          Male               28              98          98          276 500
##          Female              9              54          50           62 175
##          Sum                37             152         148          338 675

The table above shows the observed frequencies.

round(chi_square$expected, 2)
##                heart_clean$cp
## heart_clean$sex typical angina atypical angina non-anginal asymptomatic
##          Male            27.41          112.59      109.63       250.37
##          Female           9.59           39.41       38.37        87.63

The table above shows the expected (theoretical) frequencies. All are above 5, so the assumption of the chi-square test is met.
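The expected frequencies and the test statistic can be rebuilt from the observed table alone. The sketch below re-enters the observed counts by hand; the object names are illustrative.

```r
# Observed counts, copied from the table of observed frequencies
observed <- matrix(
  c(28, 98, 98, 276,
     9, 54, 50,  62),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("Male", "Female"),
    cp  = c("typical angina", "atypical angina", "non-anginal", "asymptomatic")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Assumption check: every expected count should exceed 5
all_above_5 <- all(expected > 5)

# Pearson chi-square statistic, degrees of freedom and p-value
chi_sq  <- sum((observed - expected)^2 / expected)
df_chi  <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

round(expected, 2)
round(c(chi_sq = chi_sq, df = df_chi, p = p_value), 5)
```

The smallest expected count is the one for females with typical angina (about 9.59), so that cell is the one to watch for the assumption.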

round(chi_square$residuals, 2)
##                heart_clean$cp
## heart_clean$sex typical angina atypical angina non-anginal asymptomatic
##          Male             0.11           -1.38       -1.11         1.62
##          Female          -0.19            2.32        1.88        -2.74
effectsize::cramers_v(heart_clean$sex, heart_clean$cp)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.17              | [0.08, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.17)
## [1] "small"
## (Rules: funder2019)

The effect size is small.

We reject H0 (p < 0.001). Female patients with atypical angina are more frequent than expected (significant at α = 5%), and female patients with asymptomatic chest pain are less frequent than expected (significant at α = 1%). The effect size is small (Cramér's V = 0.17).
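The cell-level conclusions come from comparing the Pearson residuals with normal critical values (about 1.96 for α = 5% and 2.58 for α = 1%). The sketch below rebuilds the residuals from the observed counts and also computes the unadjusted Cramér's V; the output above is labelled as adjusted, so the unadjusted value comes out slightly larger. Object names are illustrative.

```r
# Observed counts, copied from the table of observed frequencies
observed <- matrix(
  c(28, 98, 98, 276,
     9, 54, 50,  62),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("Male", "Female"),
    cp  = c("typical angina", "atypical angina", "non-anginal", "asymptomatic")
  )
)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)

# Two-sided critical values of the standard normal distribution
crit_05 <- qnorm(0.975)   # alpha = 5%
crit_01 <- qnorm(0.995)   # alpha = 1%

# Cells that deviate from independence at each level
flag_05 <- abs(pearson_res) > crit_05
flag_01 <- abs(pearson_res) > crit_01

# Unadjusted Cramer's V from the chi-square statistic
chi_sq  <- sum(pearson_res^2)
v_unadj <- sqrt(chi_sq / (sum(observed) * (min(dim(observed)) - 1)))

round(pearson_res, 2)
flag_05
round(v_unadj, 2)
```

Only the two cells for female patients (atypical angina and asymptomatic) exceed the 5% threshold, which matches the interpretation given in the text.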