Author: Marek Čech
Date: 19.1.2025
options(repos = c(CRAN = "https://cran.r-project.org"))
install.packages("effsize")
library(effsize)
install.packages("effectsize")
data = read.csv("data1.csv")
## 2. Descriptive Statistics
library(knitr)
kable(head(data[, c("Age", "BMI", "Heart_Rate", "Smoking")]), caption = "Key Variables from the Dataset")
| Age | BMI | Heart_Rate | Smoking |
|---|---|---|---|
| 50 | 15.9 | 76 | False |
| 40 | 27.1 | 82 | False |
| 26 | 27.2 | 71 | False |
| 54 | 26.0 | 74 | True |
| 19 | 27.5 | 67 | False |
| 32 | 17.6 | 60 | True |
### 2.1 Numerical Variables
The unit of observation is one person and the sample size is 50,000. The dataset used for this analysis was downloaded from Kaggle: Heart Attack in Youth vs Adults in Russia.
library(knitr)
summary_table <- data.frame(
Variable = c("Age", "Blood Pressure", "BMI", "Heart Rate"),
Mean = c(mean(data$Age), mean(data$Blood_Pressure), mean(data$BMI), mean(data$Heart_Rate)),
Median = c(median(data$Age), median(data$Blood_Pressure), median(data$BMI), median(data$Heart_Rate)),
SD = c(sd(data$Age), sd(data$Blood_Pressure), sd(data$BMI), sd(data$Heart_Rate)),
Min = c(min(data$Age), min(data$Blood_Pressure), min(data$BMI), min(data$Heart_Rate)),
Max = c(max(data$Age), max(data$Blood_Pressure), max(data$BMI), max(data$Heart_Rate))
)
kable(summary_table, caption = "Summary Statistics for Numerical Variables")
| Variable | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| Age | 35.99182 | 36.00 | 14.110139 | 12.0 | 60.0 |
| Blood Pressure | 120.05864 | 120.05 | 14.975835 | 60.0 | 188.4 |
| BMI | 24.98391 | 25.00 | 5.003784 | 2.9 | 46.1 |
| Heart Rate | 79.98898 | 80.00 | 11.804567 | 60.0 | 100.0 |
The average age of participants is 36 years, with a range from 12 to 60. Blood pressure has a mean of about 120, which is within the normal range for most people. The average BMI is 24.98. Heart rates vary between 60 and 100 bpm, with a mean of 79.99 bpm.
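The summary table above repeats every column name five times. A minimal sketch of a more compact version follows; the data frame `demo` is simulated purely for illustration (it is not the Kaggle data), and the helper name `describe` is my own.

```r
# Simulated stand-in for the real data (values are illustrative only)
set.seed(1)
demo <- data.frame(
  Age            = sample(12:60, 200, replace = TRUE),
  Blood_Pressure = round(rnorm(200, mean = 120, sd = 15), 1),
  BMI            = round(rnorm(200, mean = 25, sd = 5), 1),
  Heart_Rate     = sample(60:100, 200, replace = TRUE)
)

# One function returning all five statistics for a numeric vector
describe <- function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x), Min = min(x), Max = max(x))
}

# sapply() returns a 5 x 4 matrix; transpose so that each variable is a row
summary_matrix <- t(sapply(demo, describe))
summary_table2 <- data.frame(Variable = rownames(summary_matrix),
                             summary_matrix, row.names = NULL)
print(summary_table2)
```

With the real data, `demo` would be replaced by `data[, c("Age", "Blood_Pressure", "BMI", "Heart_Rate")]`, and the result can be passed to kable() as before.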
### 2.2 Categorical Variables
smoking_table <- data.frame(
Smoking = c("Non-Smoker", "Smoker"),
Count = c(sum(data$Smoking == "False"), sum(data$Smoking == "True"))
)
kable(smoking_table, caption = "Frequency of Smoking")
| Smoking | Count |
|---|---|
| Non-Smoker | 35008 |
| Smoker | 14992 |
table(data$Smoking)
##
## False True
## 35008 14992
table(data$Diet)
##
## Healthy Mixed Unhealthy
## 19789 15185 15026
table(data$Heart_Attack)
##
## False True
## 44119 5881
sum(is.na(data))
## [1] 0
I used sum(is.na(data)) to count the missing values, and found that nothing is missing because this dataset had already been cleaned.
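sum(is.na(data)) gives a single total. If the total were not zero, a per-column count would show where the gaps are. A small sketch on a made-up data frame (`demo` and its values are hypothetical):

```r
# Tiny made-up data frame with two missing values
demo <- data.frame(
  Age     = c(50, NA, 26),
  BMI     = c(15.9, 27.1, NA),
  Smoking = c("False", "False", "True")
)

# Total number of missing cells, as in the report
total_missing <- sum(is.na(demo))

# Missing cells counted separately for each column
missing_per_column <- colSums(is.na(demo))

print(total_missing)
print(missing_per_column)
```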
## 3. Research Questions and Hypotheses
### RQ1: Is there a significant difference in heart rate between smokers and non-smokers?
H0: There is no significant difference in heart rate between smokers and non-smokers. H1: There is a significant difference in heart rate between smokers and non-smokers.
I could not run the Shapiro-Wilk test because the dataset is too large (50,000 rows, while R's shapiro.test() accepts at most 5,000 observations), so I check normality with histograms instead. To analyze whether there is a significant difference in heart rate between smokers and non-smokers, I run both a t-test (parametric) and a Wilcoxon test (non-parametric) and compare the results.
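One possible workaround for the size limit, sketched here on simulated values rather than the real heart rates, is to run the test on a random subsample of 5,000 observations. The uniform 60 to 100 distribution below is only an assumption for illustration. With samples this large, formal normality tests tend to reject even for small deviations, so a histogram remains a sensible companion check.

```r
set.seed(42)

# Simulated heart rates: uniform integers from 60 to 100 (illustrative assumption)
heart_rate <- sample(60:100, 20000, replace = TRUE)

# shapiro.test() stops with an error above 5000 observations;
# capture the message instead of aborting the script
too_big <- tryCatch(shapiro.test(heart_rate),
                    error = function(e) conditionMessage(e))
print(too_big)

# Workaround: test a random subsample of 5000 values
sub <- sample(heart_rate, 5000)
sw <- shapiro.test(sub)
print(sw$p.value)
```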
smokers = data[data$Smoking == "True", ]
non_smokers = data[data$Smoking == "False", ]
Here I split the data into smokers and non-smokers.
var(smokers$Heart_Rate)
## [1] 138.1187
var(non_smokers$Heart_Rate)
## [1] 139.852
var.test(smokers$Heart_Rate, non_smokers$Heart_Rate)
##
## F test to compare two variances
##
## data: smokers$Heart_Rate and non_smokers$Heart_Rate
## F = 0.98761, num df = 14991, denom df = 35007, p-value = 0.3677
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9613154 1.0147668
## sample estimates:
## ratio of variances
## 0.9876062
hist(smokers$Heart_Rate,
main = "Heart Rate Distribution of Smokers",
xlab = "Heart Rate (bpm)",
ylab = "Frequency",
col = "skyblue",
border = "blue",
breaks = 15,
xlim = c(50, 120),
ylim = c(0, 5000))
hist(non_smokers$Heart_Rate,
main = "Heart Rate Distribution of Non-Smokers",
xlab = "Heart Rate (bpm)",
ylab = "Frequency",
col = "lightcoral",
border = "darkred",
breaks = 15,
xlim = c(50, 120),
ylim = c(0, 5000))
Assumption checking for the t-test (normality and equality of variances): the F test above gives no evidence of unequal variances (p = 0.3677), but the histograms show that heart rate is not normally distributed, so the parametric test is not appropriate and I move to the Wilcoxon test (non-parametric). I use the rank-sum version because the observations are not paired.
t.test(smokers$Heart_Rate, non_smokers$Heart_Rate, var.equal = TRUE)
##
## Two Sample t-test
##
## data: smokers$Heart_Rate and non_smokers$Heart_Rate
## t = -2.5623, df = 49998, p-value = 0.0104
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.52102939 -0.06939598
## sample estimates:
## mean of x mean of y
## 79.78228 80.07750
The t-test indicates that mean heart rate differs significantly between smokers (79.78 bpm) and non-smokers (80.08 bpm), since p = 0.0104 is below 0.05. However, I considered the t-test only as an initial check and rejected it because the normality assumption is violated.
wilcox_test=wilcox.test(smokers$Heart_Rate, non_smokers$Heart_Rate, paired = FALSE)
wilcox_test
##
## Wilcoxon rank sum test with continuity correction
##
## data: smokers$Heart_Rate and non_smokers$Heart_Rate
## W = 258630206, p-value = 0.01036
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank-sum test was chosen as it is non-parametric and does not assume a normal distribution.
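RQ1 conclusion: The non-parametric Wilcoxon test indicated a significant difference in heart rate between smokers and non-smokers (p = 0.01036 < 0.05), so H0 is rejected. The difference is small, about 0.3 bpm lower on average for smokers, and because the data are observational it shows an association between smoking and heart rate rather than proof that smoking changes heart rate.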
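With 50,000 observations even a tiny difference can reach significance, so an effect size helps to judge its magnitude. The sketch below computes Cohen's d by hand from the group sizes, means and variances printed above. The variable names are my own, and the equal-variance (pooled) form is assumed, matching var.equal = TRUE in the t-test.

```r
# Summary values copied from the output above
n_s <- 14992;    n_ns <- 35008      # group sizes (smokers, non-smokers)
m_s <- 79.78228; m_ns <- 80.07750   # group means (bpm)
v_s <- 138.1187; v_ns <- 139.852    # group variances

# Pooled standard deviation (equal-variance form)
sd_pooled <- sqrt(((n_s - 1) * v_s + (n_ns - 1) * v_ns) / (n_s + n_ns - 2))

# Cohen's d: mean difference expressed in units of the pooled SD
cohens_d <- (m_s - m_ns) / sd_pooled

# The same ingredients reproduce the t statistic reported by t.test()
t_stat <- (m_s - m_ns) / (sd_pooled * sqrt(1 / n_s + 1 / n_ns))

print(round(c(sd_pooled = sd_pooled, cohens_d = cohens_d, t = t_stat), 4))
```

The resulting d is roughly -0.025, far below the conventional 0.2 threshold for a small effect.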
### RQ2: Is there a significant correlation between BMI and Blood Pressure?
H0: There is no significant correlation between BMI and Blood Pressure. H1: There is a significant correlation between BMI and Blood Pressure.
cor.test(data$Blood_Pressure, data$BMI, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: data$Blood_Pressure and data$BMI
## t = -1.0707, df = 49998, p-value = 0.2843
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.013553274 0.003976852
## sample estimates:
## cor
## -0.004788579
library(corrplot)
## corrplot 0.95 loaded
num_vars <-data[, c("Blood_Pressure", "BMI", "Heart_Rate", "Age")]
cor_matrix <-cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle", type = "lower", tl.cex = 1)
plot(data$BMI, data$Blood_Pressure,
main = "Scatterplot of BMI vs Blood Pressure",
xlab = "BMI", ylab = "Blood Pressure", pch = 19)
abline(lm(data$Blood_Pressure ~ data$BMI), col = "cyan")
cor.test(data$Blood_Pressure, data$BMI, method = "spearman")
## Warning in cor.test.default(data$Blood_Pressure, data$BMI, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$Blood_Pressure and data$BMI
## S = 2.0907e+13, p-value = 0.427
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.003552259
RQ2 conclusion: The Pearson correlation coefficient between BMI and Blood Pressure is essentially zero (r = -0.0048) and not statistically significant (p = 0.2843 > 0.05). The Spearman correlation agrees (rho = -0.0036, p = 0.427). We therefore fail to reject H0: this dataset gives no evidence that individuals with higher BMI have higher blood pressure.
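The Pearson output above can be checked by hand, since the test statistic follows from r and the degrees of freedom as t = r * sqrt(df / (1 - r^2)). A short sketch using the printed values (variable names are my own):

```r
r  <- -0.004788579   # Pearson r from the cor.test() output above
df <- 49998          # n - 2

# Test statistic and two-sided p-value for H0: correlation = 0
t_stat  <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance in one variable explained linearly by the other
r_squared <- r^2

print(c(t = t_stat, p = p_value, r_squared = r_squared))
```

The squared correlation is about 0.00002, so BMI accounts for practically none of the variation in blood pressure in this dataset.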
### RQ3: Is there a significant association between diet and getting a heart attack?
H0: There is no association between diet and getting a heart attack. H1: There is an association between diet and getting a heart attack.
heart_attack_factor = factor(data$Heart_Attack)
diet_factor = factor(data$Diet)
chi_results <- chisq.test(heart_attack_factor, diet_factor, correct = TRUE)
chi_results
##
## Pearson's Chi-squared test
##
## data: heart_attack_factor and diet_factor
## X-squared = 1.0572, df = 2, p-value = 0.5894
The p-value (0.5894) is greater than 0.05, so we cannot reject H0. The assumptions of the test are met: 1. The observations are independent of each other. 2. All expected frequencies are greater than 5. (In larger contingency tables, where at least one categorical variable has more than two categories, up to 20% of the expected frequencies may lie between 1 and 5, but this reduces the power of the test.)
chi_results$observed
## diet_factor
## heart_attack_factor Healthy Mixed Unhealthy
## False 17458 13372 13289
## True 2331 1813 1737
chi_results$expected
## diet_factor
## heart_attack_factor Healthy Mixed Unhealthy
## False 17461.418 13398.94 13258.642
## True 2327.582 1786.06 1767.358
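To show where the expected counts and the test statistic come from, the sketch below rebuilds them from the observed table printed above: each expected count is row total times column total divided by n, and the statistic sums (O - E)^2 / E over all cells. The counts are copied from the output; the variable names are my own.

```r
# Observed counts copied from chi_results$observed above
observed <- matrix(c(17458, 13372, 13289,
                      2331,  1813,  1737),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Heart_Attack = c("False", "True"),
                                   Diet = c("Healthy", "Mixed", "Unhealthy")))

n <- sum(observed)

# Expected count per cell under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom and p-value
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(expected, 3))
print(round(c(chi_sq = chi_sq, df = df, p = p_value), 4))
```

The smallest expected count is well above 5, which is the assumption checked for this test.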
residuals=chi_results$stdres
residuals
## diet_factor
## heart_attack_factor Healthy Mixed Unhealthy
## False -0.09702224 -0.81325814 0.91917270
## True 0.09702224 0.81325814 -0.91917270
library(effectsize)
cramers_v(heart_attack_factor, diet_factor)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)
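The output above is labelled "Cramer's V (adj.)", meaning a bias-adjusted estimate, which is why it prints as exactly 0.00. For comparison, the unadjusted textbook formula V = sqrt(X^2 / (n * (k - 1))), with k the smaller of the number of rows and columns, can be evaluated by hand. This is only a sketch with variable names of my own:

```r
chi_sq <- 1.0572     # X-squared from the chisq.test() output above
n      <- 50000      # sample size
k      <- min(2, 3)  # smaller of the number of rows and columns

# Unadjusted Cramér's V
cramers_v_raw <- sqrt(chi_sq / (n * (k - 1)))
print(round(cramers_v_raw, 4))
```

The unadjusted value is about 0.005, which leads to the same interpretation of a negligible effect.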
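observed <- chi_results$observed
expected <- chi_results$expected
# Stack observed on top of expected: the rows become
# observed False, observed True, expected False, expected True
freq_matrix <- rbind(observed, expected)
rownames(freq_matrix) <- c("Observed, no heart attack", "Observed, heart attack",
"Expected, no heart attack", "Expected, heart attack")
# One colour per row so that the legend matches the bars
barplot(freq_matrix, beside = TRUE,
col = c("lightblue", "steelblue", "pink", "firebrick"),
legend.text = TRUE,
main = "Observed vs. Expected Frequencies",
xlab = "Diet Categories",
ylab = "Frequency")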
The observed and expected frequencies for each diet category are also
visually similar, supporting the conclusion that there is no significant
association between diet and heart attack.
Odds ratios were not calculated because a single odds ratio is defined only for a 2x2 contingency table, and this table is 2x3.
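If odds ratios were wanted anyway, one option is to compare each diet category against a reference category, which reduces the problem to a series of 2x2 comparisons. The sketch below uses the observed counts printed above. Choosing "Healthy" as the reference is my assumption, not something the analysis specifies.

```r
# Observed counts copied from chi_results$observed above
observed <- matrix(c(17458, 13372, 13289,
                      2331,  1813,  1737),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Heart_Attack = c("False", "True"),
                                   Diet = c("Healthy", "Mixed", "Unhealthy")))

# Odds of a heart attack within each diet category
odds <- observed["True", ] / observed["False", ]

# Odds ratio of each category relative to the "Healthy" reference
odds_ratio_vs_healthy <- odds / odds["Healthy"]
print(round(odds_ratio_vs_healthy, 3))
```

Both ratios are close to 1 (about 1.02 for Mixed and 0.98 for Unhealthy), in line with the non-significant chi-square result.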
RQ3 conclusion: The p-value from the chi-square test (p = 0.5894) is greater than 0.05. This shows that there is no statistically significant association between diet and getting a heart attack. The effect size, as measured by Cramér’s V, is negligible (0.00), indicating no meaningful relationship between the variables. The standardized residuals (all below 1 in absolute value) also confirm the lack of strong deviations between observed and expected frequencies. Therefore, we fail to reject the null hypothesis.
## 4. Conclusion
This analysis investigated the relationship between smoking and heart rate, BMI and blood pressure, and diet and heart attack occurrence. I found a statistically significant but small difference in heart rate between smokers and non-smokers, no significant correlation between BMI and blood pressure, and no significant association between diet and getting a heart attack.