The dataset used for this research is "Heart Attack in Japan Youth Vs Adult", created by Akshay Choudhary and retrieved from Kaggle: https://www.kaggle.com/datasets/ashaychoudhary/heart-attack-in-japan-youth-vs-adult
#Importing data.
mydata <- read.table("~/IMB 20242025/MVA/japan_heart_attack_dataset.csv",
header=TRUE,
sep=",",
dec = ".")
mydata <- mydata[, -c(3, 6, 13, 14, 15, 18:32)] #Dropping the columns that are not needed for my analysis.
#Renaming the variables for readability before showing descriptive statistics.
colnames(mydata) [3] <- "Smoking"
colnames(mydata) [4] <- "Diabetes"
colnames(mydata) [5] <- "CholesterolLevel"
colnames(mydata) [6] <- "PhysicalActivity"
colnames(mydata) [7] <- "DietQuality"
colnames(mydata) [8] <- "AlcoholConsumption"
colnames(mydata) [9] <- "StressLevel"
colnames(mydata) [11] <- "FamilyHistory"
colnames(mydata) [12] <- "HeartAttack"
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata)
## ID Age Gender Smoking Diabetes CholesterolLevel PhysicalActivity DietQuality
## 1 1 56 Male Yes No 186.4002 Moderate Poor
## 2 2 69 Male No No 185.1367 Low Good
## 3 3 46 Male Yes No 210.6966 Low Average
## 4 4 32 Female No No 211.1655 Moderate Good
## 5 5 60 Female No No 223.8143 High Good
## 6 6 25 Female No No 220.3400 Low Good
## AlcoholConsumption StressLevel BMI FamilyHistory HeartAttack
## 1 Low 3.644786 33.96135 No No
## 2 Low 3.384056 28.24287 Yes No
## 3 Moderate 3.810911 27.60121 No No
## 4 High 6.014878 23.71729 No No
## 5 High 6.806883 19.77158 No No
## 6 High 8.207233 20.24744 No No
Explanation of the dataset
• Unit of observation: an individual
• Sample size: n=30000
• Age: The age in years.
• Gender: Male or Female (categorical variable)
• Smoking: the individual has a history of smoking - Yes or No (categorical variable)
• Diabetes: the individual has a history of diabetes - Yes or No (categorical variable)
• CholesterolLevel: the level of cholesterol in an individual’s blood, in mg/dL
• PhysicalActivity: Low, Moderate or High (categorical variable)
• DietQuality: Poor, Average or Good (categorical variable)
• AlcoholConsumption: None, Low, Moderate or High (categorical variable)
• StressLevel: measured on a 10-point scale (observed values range from 0 to 10)
• BMI: an individual’s Body Mass Index, calculated as weight/height² (kg/m²)
• FamilyHistory: the individual has a family history of a heart attack - Yes or No (categorical variable)
• HeartAttack: the individual has had a heart attack - Yes or No (categorical variable)
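The categorical variables are read in as character strings, so R lists their categories alphabetically in the frequency tables (High, Low, Moderate). A minimal sketch, using a small made-up vector rather than the real data, of how such a variable could be stored as an ordered factor so that the categories appear in their natural order:

```r
# Toy vector standing in for mydata$PhysicalActivity (values are made up).
activity <- c("Low", "High", "Moderate", "Low", "Moderate", "Moderate")

# Declare the natural order of the categories explicitly.
activity_f <- factor(activity,
                     levels = c("Low", "Moderate", "High"),
                     ordered = TRUE)

# Frequencies are now reported as Low, Moderate, High instead of alphabetically.
activity_counts <- table(activity_f)
print(activity_counts)
```

This step is optional for the tests used here; it only changes how the categories are ordered in tables and plots.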
summary(mydata[, c("Age", "CholesterolLevel", "StressLevel", "BMI")])
## Age CholesterolLevel StressLevel BMI
## Min. :18.00 Min. : 80.02 Min. : 0.000 Min. : 5.58
## 1st Qu.:33.00 1st Qu.:179.55 1st Qu.: 3.644 1st Qu.:21.63
## Median :48.00 Median :199.77 Median : 4.993 Median :24.96
## Mean :48.49 Mean :199.90 Mean : 5.002 Mean :25.00
## 3rd Qu.:64.00 3rd Qu.:220.16 3rd Qu.: 6.353 3rd Qu.:28.36
## Max. :79.00 Max. :336.86 Max. :10.000 Max. :46.10
table(mydata$Gender)
##
## Female Male
## 14933 15067
table(mydata$Smoking)
##
## No Yes
## 21003 8997
table(mydata$Diabetes)
##
## No Yes
## 23903 6097
table(mydata$PhysicalActivity)
##
## High Low Moderate
## 9091 8985 11924
table(mydata$DietQuality)
##
## Average Good Poor
## 11971 12006 6023
table(mydata$AlcoholConsumption)
##
## High Low Moderate None
## 5828 9098 12059 3015
table(mydata$FamilyHistory)
##
## No Yes
## 21064 8936
table(mydata$HeartAttack)
##
## No Yes
## 27036 2964
Explanation of some parameters:
• Age:
Min: The minimum age of individuals from my data set is 18 years.
Max: The maximum age of individuals from my data set is 79 years.
• Cholesterol Level:
Median: Half of the individuals had a cholesterol level below 199.77 mg/dL, and half had a higher one.
• Stress Level:
3rd Quartile: 75 % of the individuals had a stress level of 6.353 or lower, and 25 % had a higher one.
• BMI:
Mean: The average Body Mass Index of the individuals was 25 kg/m².
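The quartile interpretation can be checked directly. A minimal sketch on simulated stress values (the vector `stress` is made up and only illustrates the idea), showing that a quarter of the observations lie above the third quartile:

```r
set.seed(1)
# Simulated stress levels on a 0-10 scale (not the real data).
stress <- runif(1000, min = 0, max = 10)

# Third quartile: the value below which 75 % of the observations fall.
q3 <- quantile(stress, probs = 0.75)

# Share of observations strictly above the third quartile.
share_above <- mean(stress > q3)
share_above
```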
Is there a significant difference in cholesterol levels between individuals with and without a history of a heart attack?
• Null Hypothesis (H₀): There is no difference in mean cholesterol levels between individuals with and without a history of a heart attack.
H₀: μ(HeartAttackYes) = μ(HeartAttackNo)
• Alternative Hypothesis (H₁): There is a significant difference in mean cholesterol levels between the two groups.
H₁: μ(HeartAttackYes) ≠ μ(HeartAttackNo)
Assumptions:
Variable is numeric (Cholesterol level is a numeric variable)
The distribution of the variable is normal in both populations (This will be tested with the Shapiro-Wilk test)
The data must come from two independent populations (Individuals with a heart attack and those without are independent)
Variable has the same variance in both populations (if this assumption is violated, we apply the Welch correction)
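The equal-variance assumption could be checked before deciding on the Welch correction. A minimal sketch, assuming simulated values for two hypothetical groups (not the real data) and using the F test from base R; Levene's test from an add-on package would be a common alternative that is less sensitive to non-normality:

```r
set.seed(42)
# Simulated cholesterol values for two hypothetical groups (not the real data).
chol_no  <- rnorm(266, mean = 200, sd = 30)
chol_yes <- rnorm(34,  mean = 200, sd = 30)

# F test of H0: the two population variances are equal.
f_res <- var.test(chol_no, chol_yes)
print(f_res)

# Decision rule at the 5 % level: the Welch correction is needed if H0 is rejected.
use_welch <- f_res$p.value < 0.05
use_welch
```

Applying the Welch correction regardless of this check is also a defensible choice, because it remains valid when the variances happen to be equal.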
#Since the data set has n = 30,000 observations, I drew a random sample of 300 without replacement. I found this function online, as I'm not sure we covered it in class.
set.seed(123)
sampled_data <- mydata[sample(1:nrow(mydata), size = 300, replace = FALSE), ]
head(sampled_data)
## ID Age Gender Smoking Diabetes CholesterolLevel PhysicalActivity
## 18847 18847 62 Male No No 137.9860 Low
## 18895 18895 58 Female No No 211.7057 Moderate
## 26803 26803 77 Female Yes Yes 195.5574 Moderate
## 25102 25102 53 Female Yes Yes 212.1132 Moderate
## 28867 28867 50 Female No No 170.4875 Moderate
## 2986 2986 30 Male Yes Yes 167.2758 High
## DietQuality AlcoholConsumption StressLevel BMI FamilyHistory
## 18847 Average Moderate 7.306264 29.81270 Yes
## 18895 Good Moderate 2.302482 29.33706 Yes
## 26803 Poor Moderate 6.291841 27.90748 No
## 25102 Good None 7.027272 17.10765 Yes
## 28867 Good Low 3.726472 21.16023 Yes
## 2986 Good Moderate 7.517483 31.11976 No
## HeartAttack
## 18847 No
## 18895 No
## 26803 Yes
## 25102 No
## 28867 No
## 2986 No
nrow(sampled_data)
## [1] 300
sampled_data$ID <- 1:nrow(sampled_data)
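The call to set.seed() is what makes the random sample reproducible. A minimal sketch, assuming a hypothetical vector of 30,000 row indices in place of the real data frame, showing that the same seed returns the same 300 rows and that sampling without replacement never draws a row twice:

```r
# Hypothetical number of rows, matching the size of the data set.
n_total <- 30000

set.seed(123)
first_draw <- sample(seq_len(n_total), size = 300, replace = FALSE)

set.seed(123)
second_draw <- sample(seq_len(n_total), size = 300, replace = FALSE)

# Same seed, same rows.
identical(first_draw, second_draw)

# Without replacement, no index appears more than once.
anyDuplicated(first_draw) == 0
```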
library(ggplot2)
ggplot(mydata, aes(x = CholesterolLevel, fill = HeartAttack)) +
geom_histogram(binwidth = 10, colour = "black", alpha = 0.7, position = "identity") +
scale_fill_manual(values = c("red", "blue"), name = "HeartAttack") +
labs(
x = "Cholesterol Level (mg/dL)",
y = "Frequency",
title = "Distribution of Cholesterol Levels by Heart Attack"
) +
theme_minimal()
The graph shows that cholesterol levels appear approximately normally distributed both for individuals who had a heart attack and for those who did not, with most values lying between 150 and 250 mg/dL. To check this observation formally, we perform the Shapiro-Wilk test of normality.
H₀ (Null Hypothesis): The variable cholesterol level is normally distributed in both groups.
H₁ (Alternative Hypothesis): The normality is violated for at least one group.
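A side note on why a subsample is practical for this test: base R's shapiro.test() only accepts between 3 and 5000 observations, so it could not be run on all 30,000 rows at once. A minimal sketch on simulated data (not the real data set) that shows this limit:

```r
set.seed(1)
# Simulated values, same size as the full data set (not the real data).
big_sample <- rnorm(30000)

# shapiro.test() stops with an error when given more than 5000 observations.
full_attempt <- try(shapiro.test(big_sample), silent = TRUE)
inherits(full_attempt, "try-error")

# A random subsample of 300 is within the allowed range.
sub_result <- shapiro.test(sample(big_sample, 300))
sub_result$p.value
```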
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
sampled_data %>%
group_by(HeartAttack) %>%
shapiro_test(CholesterolLevel)
## # A tibble: 2 × 4
## HeartAttack variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 No CholesterolLevel 0.991 0.110
## 2 Yes CholesterolLevel 0.954 0.159
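We cannot reject H₀, because p > 0.05 for both groups, so there is no evidence that normality is violated and we treat cholesterol level as approximately normally distributed in both groups. As the assumption of normality is met, we continue with the parametric independent samples t-test to compare the mean cholesterol levels between those with a history of a heart attack and those without. We apply the Welch correction (var.equal = FALSE), which does not require equal variances in the two groups.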
t.test(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: sampled_data$CholesterolLevel by sampled_data$HeartAttack
## t = 0.34567, df = 42.541, p-value = 0.7313
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -8.707099 12.308009
## sample estimates:
## mean in group No mean in group Yes
## 198.1406 196.3402
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize::cohens_d(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## 0.06 | [-0.29, 0.41]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.06, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
Based on the sample data we cannot reject H₀, because p > 0.05. We cannot say that there is a statistically significant difference in the mean cholesterol levels between individuals with and without a history of a heart attack; the effect size is tiny, d = 0.06.
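To make the reported quantities less of a black box, the Welch t statistic, its degrees of freedom and Cohen's d with un-pooled SD can be computed by hand. A minimal sketch on simulated values for two hypothetical groups (not the real data); the formula for d assumes the un-pooled SD is the square root of the average of the two group variances:

```r
set.seed(2024)
# Simulated cholesterol values for two hypothetical groups (not the real data).
g_no  <- rnorm(266, mean = 198, sd = 30)
g_yes <- rnorm(34,  mean = 196, sd = 30)

m1 <- mean(g_no);   m2 <- mean(g_yes)
v1 <- var(g_no);    v2 <- var(g_yes)
n1 <- length(g_no); n2 <- length(g_yes)

# Welch t statistic: mean difference divided by its un-pooled standard error.
se     <- sqrt(v1 / n1 + v2 / n2)
t_hand <- (m1 - m2) / se

# Welch-Satterthwaite degrees of freedom.
df_hand <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Cohen's d with un-pooled SD (assumed: average of the two variances).
d_hand <- (m1 - m2) / sqrt((v1 + v2) / 2)

# Compare the hand calculation with t.test().
welch <- t.test(g_no, g_yes, var.equal = FALSE)
c(t_hand = t_hand, t_test = unname(welch$statistic))
c(df_hand = df_hand, df_test = unname(welch$parameter))
d_hand
```

The non-integer degrees of freedom in the Welch output (df = 42.541 in the analysis) come from this Welch-Satterthwaite approximation.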
• Null Hypothesis (H₀): Distribution locations of Cholesterol Level are the same for individuals with and without a heart attack (the median cholesterol levels are equal for the two groups).
• Alternative Hypothesis (H₁): Distribution locations of cholesterol levels are different for the two groups (the median cholesterol levels are not equal for the two groups).
wilcox.test(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: sampled_data$CholesterolLevel by sampled_data$HeartAttack
## W = 4608, p-value = 0.8567
## alternative hypothesis: true location shift is not equal to 0
effectsize(wilcox.test(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.02 | [-0.19, 0.22]
interpret_rank_biserial(0.02)
## [1] "tiny"
## (Rules: funder2019)
Based on the sampled data, we cannot reject the null hypothesis, because p > 0.05; the effect size is again tiny (|r|=0.02). We cannot say there is a statistically significant difference in the distribution of cholesterol levels between individuals with and without a heart attack.
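The rank-biserial effect size can be recovered from the test statistic W. A worked sketch using W from the output and the group sizes of the sample (266 individuals without and 34 with a heart attack, as shown in the contingency table of the smoking analysis), assuming no ties among the cholesterol values:

```r
# Values taken from the analysis.
W  <- 4608   # Wilcoxon rank sum statistic for the first group ("No")
n1 <- 266    # individuals without a heart attack
n2 <- 34     # individuals with a heart attack

# W counts the pairs (one "No", one "Yes") in which the "No" value is larger.
share_no_larger <- W / (n1 * n2)

# Rank-biserial r: share of pairs favouring "No" minus share favouring "Yes".
r_rb <- 2 * share_no_larger - 1
round(r_rb, 2)
```

A value this close to zero means that in roughly half of all possible pairs the individual without a heart attack has the higher cholesterol level, and in the other half the lower one.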
Since the Shapiro-Wilk test showed no violation of normality for cholesterol levels, the parametric independent samples t-test is more appropriate for our analysis: it is more powerful than the non-parametric alternative, the Wilcoxon rank sum test, and it addresses means (which the RQ focuses on).
Is there a significant correlation between stress levels and BMI?
For RQ 2, I will be using the same sample of the data set.
Assumptions:
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(sampled_data, columns = c("StressLevel", "BMI"))
cor(sampled_data$StressLevel, sampled_data$BMI,
method = "pearson",
use = "complete.obs")
## [1] 0.1452456
cor.test(sampled_data$StressLevel, sampled_data$BMI,
method = "pearson") #cor.test() has no "use" argument; incomplete pairs are dropped automatically.
##
## Pearson's product-moment correlation
##
## data: sampled_data$StressLevel and sampled_data$BMI
## t = 2.5342, df = 298, p-value = 0.01178
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03253997 0.25430372
## sample estimates:
## cor
## 0.1452456
Pearson correlation test:
• Null Hypothesis (H₀): There is no correlation between Stress Level and BMI in the population.
H₀: ρ=0
• Alternative Hypothesis (H₁): There is a correlation between Stress Level and BMI in the population.
H₁: ρ ≠0
Conclusion: We can reject the null hypothesis (p = 0.012). We can conclude that there is a statistically significant, weak positive correlation between the stress level and the Body Mass Index of an individual (r = 0.15).
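The test statistic, p-value and confidence interval reported by cor.test() follow directly from r and n. A worked sketch using the values from the output, with the confidence interval obtained through the Fisher z transformation:

```r
# Values taken from the cor.test() output.
r <- 0.1452456
n <- 300

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# 95 % confidence interval through the Fisher z transformation.
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(c(t = t_stat, p = p_val), 4)
round(ci, 3)
```

With n = 300 even a weak correlation of 0.15 gives t above 2.5, which is why the result is statistically significant although the relationship itself is weak.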
Is there a significant association between Smoking History and Heart Attack Occurrence?
• Null Hypothesis (H₀): There is no association between a history of smoking and the occurrence of a heart attack.
• Alternative Hypothesis (H₁): There is an association between a history of smoking and the occurrence of a heart attack.
Assumptions for the Pearson chi-square test, used to analyse the association between two categorical variables:
table_sample <- table(sampled_data$Smoking, sampled_data$HeartAttack)
print(table_sample)
##
## No Yes
## No 183 26
## Yes 83 8
chisq.test(table_sample)$expected #Checking assumptions.
##
## No Yes
## No 185.31333 23.68667
## Yes 80.68667 10.31333
The third assumption is met because no expected frequency is below 5. Therefore, the chi-square test can be performed without Yates’ continuity correction.
results <- chisq.test(sampled_data$Smoking, sampled_data$HeartAttack,
correct = FALSE) #Yates' correction is not applied, because the assumptions about expected frequencies are met.
results
##
## Pearson's Chi-squared test
##
## data: sampled_data$Smoking and sampled_data$HeartAttack
## X-squared = 0.84002, df = 1, p-value = 0.3594
Based on the sample, we cannot reject the null hypothesis (p > 0.05). There is not enough evidence to conclude that there is an association between a history of smoking and the occurrence of a heart attack.
addmargins(results$observed)
## sampled_data$HeartAttack
## sampled_data$Smoking No Yes Sum
## No 183 26 209
## Yes 83 8 91
## Sum 266 34 300
round(results$expected, 2)
## sampled_data$HeartAttack
## sampled_data$Smoking No Yes
## No 185.31 23.69
## Yes 80.69 10.31
The second assumption is met because all expected frequencies are greater than 1, so a nonparametric alternative is not needed.
If there were no association, we would expect 10.31 smokers in the sample to have had a heart attack. In reality, 8 smokers in the sample had a heart attack.
round(results$residuals, 2)
## sampled_data$HeartAttack
## sampled_data$Smoking No Yes
## No -0.17 0.48
## Yes 0.26 -0.72
Since all residuals are smaller than 1.96 in absolute value, no cell shows a statistically significant difference between the observed and the expected frequency.
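The expected frequencies, the residuals and the chi-square statistic can all be reproduced from the observed table. A worked sketch that uses only the four observed counts from the analysis and base R:

```r
# Observed counts (rows: Smoking No/Yes, columns: HeartAttack No/Yes).
observed <- matrix(c(183, 26,
                     83,  8),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Smoking = c("No", "Yes"),
                                   HeartAttack = c("No", "Yes")))

n <- sum(observed)

# Expected count in each cell: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals and the chi-square statistic built from them.
pearson_res <- (observed - expected) / sqrt(expected)
chi_sq      <- sum(pearson_res^2)
p_value     <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(expected, 2)
round(pearson_res, 2)
round(c(chi_sq = chi_sq, p = p_value), 4)
```

The chi-square statistic is the sum of the squared residuals, so cells with residuals close to zero contribute almost nothing to it.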
effectsize::cramers_v(sampled_data$Smoking, sampled_data$HeartAttack)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)
There is a tiny association between Smoking History and Heart Attack Occurrence. This is in line with the Pearson chi-square test, which showed no significant relationship between the two variables.
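The reported value of exactly 0.00 is a consequence of the small-sample adjustment. A worked sketch from the chi-square statistic and the sample size, assuming that the adjusted version follows the bias correction proposed by Bergsma (2013), which is my understanding of what the "(adj.)" label refers to:

```r
# Values taken from the chi-square analysis.
chi_sq <- 0.84002
n      <- 300
n_row  <- 2
n_col  <- 2

# Unadjusted Cramér's V.
phi2  <- chi_sq / n
v_raw <- sqrt(phi2 / min(n_row - 1, n_col - 1))

# Bias-corrected version (assumed: Bergsma, 2013): shrink phi^2 and the
# table dimensions, and set negative values of phi^2 to zero.
phi2_adj  <- max(0, phi2 - (n_row - 1) * (n_col - 1) / (n - 1))
n_row_adj <- n_row - (n_row - 1)^2 / (n - 1)
n_col_adj <- n_col - (n_col - 1)^2 / (n - 1)
v_adj     <- sqrt(phi2_adj / min(n_row_adj - 1, n_col_adj - 1))

round(c(v_raw = v_raw, v_adj = v_adj), 2)
```

The unadjusted value is about 0.05, and the correction pushes it down to zero because the observed chi-square is smaller than what chance alone would be expected to produce in a 2 × 2 table of this size.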