Hypothesis Testing Assignment

Author: Marek Čech
Date: 19.1.2025

options(repos = c(CRAN = "https://cran.r-project.org"))
install.packages("effsize")

library(effsize)
install.packages("effectsize")

data = read.csv("data1.csv")

## 2. Descriptive Statistics

library(knitr)

kable(head(data[, c("Age", "BMI", "Heart_Rate", "Smoking")]), caption = "Key Variables from the Dataset")

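Key Variables from the Dataset

| Age | BMI | Heart_Rate | Smoking |
|----:|----:|-----------:|:--------|
| 50 | 15.9 | 76 | False |
| 40 | 27.1 | 82 | False |
| 26 | 27.2 | 71 | False |
| 54 | 26.0 | 74 | True |
| 19 | 27.5 | 67 | False |
| 32 | 17.6 | 60 | True |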

### 2.1 Numerical Variables

The unit of observation is one person, and the sample size is 50,000. The dataset used for this analysis was downloaded from Kaggle: "Heart Attack in Youth vs Adults in Russia".

library(knitr)
summary_table <- data.frame(
  Variable = c("Age", "Blood Pressure", "BMI", "Heart Rate"),
  Mean = c(mean(data$Age), mean(data$Blood_Pressure), mean(data$BMI), mean(data$Heart_Rate)),
  Median = c(median(data$Age), median(data$Blood_Pressure), median(data$BMI), median(data$Heart_Rate)),
  SD = c(sd(data$Age), sd(data$Blood_Pressure), sd(data$BMI), sd(data$Heart_Rate)),
  Min = c(min(data$Age), min(data$Blood_Pressure), min(data$BMI), min(data$Heart_Rate)),
  Max = c(max(data$Age), max(data$Blood_Pressure), max(data$BMI), max(data$Heart_Rate))
)
kable(summary_table, caption = "Summary Statistics for Numerical Variables")

Summary Statistics for Numerical Variables

| Variable | Mean | Median | SD | Min | Max |
|:---------|-----:|-------:|---:|----:|----:|
| Age | 35.99182 | 36.00 | 14.110139 | 12.0 | 60.0 |
| Blood Pressure | 120.05864 | 120.05 | 14.975835 | 60.0 | 188.4 |
| BMI | 24.98391 | 25.00 | 5.003784 | 2.9 | 46.1 |
| Heart Rate | 79.98898 | 80.00 | 11.804567 | 60.0 | 100.0 |

The average age of participants is 36 years, with a range from 12 to 60. Mean blood pressure is 120.06, a value in the normal range for most people. The average BMI is 24.98. Heart rates vary between 60 and 100 bpm, with a mean of 79.99 bpm.
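The summary table above repeats the same call for every variable. A more compact way to build it is sketched below. The data frame `demo` is a simulated stand-in (an assumption for illustration, not the Kaggle data); only the column names follow the report.

```r
# Hypothetical stand-in for the real dataset; column names follow the report.
set.seed(1)
demo <- data.frame(
  Age            = sample(12:60, 200, replace = TRUE),
  Blood_Pressure = round(rnorm(200, mean = 120, sd = 15), 1),
  BMI            = round(rnorm(200, mean = 25, sd = 5), 1),
  Heart_Rate     = sample(60:100, 200, replace = TRUE)
)

# One named list of statistics, applied to every column.
stats <- list(Mean = mean, Median = median, SD = sd, Min = min, Max = max)

# Inner sapply: all statistics for one column. Outer sapply: all columns.
# t() turns the result so that variables are rows and statistics are columns.
summary_table <- t(sapply(demo, function(col) sapply(stats, function(f) f(col))))
print(round(summary_table, 2))
```

Adding another variable or statistic then only means adding a column or one entry to `stats`.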

### 2.2 Categorical Variables

smoking_table <- data.frame(
  Smoking = c("Non-Smoker", "Smoker"),
  Count = c(sum(data$Smoking == "False"), sum(data$Smoking == "True"))
)

kable(smoking_table, caption = "Frequency of Smoking")

Frequency of Smoking

| Smoking | Count |
|:--------|------:|
| Non-Smoker | 35008 |
| Smoker | 14992 |

table(data$Smoking)

## 
## False  True 
## 35008 14992

table(data$Diet)

## 
##   Healthy     Mixed Unhealthy 
##     19789     15185     15026

table(data$Heart_Attack)

## 
## False  True 
## 44119  5881

sum(is.na(data))

## [1] 0

I used sum(is.na(data)) to count the missing values and found none, because this dataset had already been cleaned.
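A single total does not show where missing values sit if there are any. A minimal sketch of a per-column check follows, using a small made-up data frame (`demo` and its values are assumptions for illustration only).

```r
# Tiny hypothetical data frame with one missing value in each column.
demo <- data.frame(
  Age     = c(50, NA, 26),
  BMI     = c(15.9, 27.1, NA),
  Smoking = c("False", "True", NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Number of rows with no missing value at all.
complete_rows <- sum(complete.cases(demo))
print(complete_rows)
```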

## 3. Research Questions and Hypotheses
### RQ1: Is there a significant difference in heart rate between smokers and non-smokers?

H0: There is no difference in heart rate between smokers and non-smokers.

H1: There is a difference in heart rate between smokers and non-smokers.

I could not run the Shapiro-Wilk test because the dataset is too large (50,000 observations; R's shapiro.test() accepts at most 5,000), so I check normality with histograms instead. To analyze whether heart rate differs between smokers and non-smokers, I run both a t-test (parametric) and a Wilcoxon test (non-parametric) and compare the results.
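Two common workarounds for the sample-size limit are a Shapiro-Wilk test on a random subsample and a normal Q-Q plot of all values. The sketch below uses simulated heart rates (an assumption: values spread roughly evenly between 60 and 100 bpm), not the real data.

```r
set.seed(42)
# Hypothetical heart-rate values, roughly flat between 60 and 100 bpm.
heart_rate <- sample(60:100, 50000, replace = TRUE)

# shapiro.test() stops with an error for more than 5000 observations.
full_attempt <- try(shapiro.test(heart_rate), silent = TRUE)
print(inherits(full_attempt, "try-error"))

# Workaround 1: test a random subsample of 5000 observations.
sub <- sample(heart_rate, 5000)
sw <- shapiro.test(sub)
print(sw$p.value)

# Workaround 2: a normal Q-Q plot uses every observation.
qqnorm(heart_rate, main = "Normal Q-Q Plot of Heart Rate")
qqline(heart_rate, col = "red")
```

With samples this large, the Shapiro-Wilk test tends to reject even for small departures from normality, so a visual check remains useful alongside it.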

smokers = data[data$Smoking == "True", ]
non_smokers = data[data$Smoking == "False", ]

The data are split into smokers and non-smokers.

var(smokers$Heart_Rate)

## [1] 138.1187

var(non_smokers$Heart_Rate)

## [1] 139.852

var.test(smokers$Heart_Rate, non_smokers$Heart_Rate)

## 
##  F test to compare two variances
## 
## data:  smokers$Heart_Rate and non_smokers$Heart_Rate
## F = 0.98761, num df = 14991, denom df = 35007, p-value = 0.3677
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9613154 1.0147668
## sample estimates:
## ratio of variances 
##          0.9876062

hist(smokers$Heart_Rate,
     main = "Heart Rate Distribution of Smokers",
     xlab = "Heart Rate (bpm)",
     ylab = "Frequency",
     col = "skyblue",
     border = "blue",
     breaks = 15,
     xlim = c(50, 120),
     ylim = c(0, 5000))

hist(non_smokers$Heart_Rate,
     main = "Heart Rate Distribution of Non-Smokers",
     xlab = "Heart Rate (bpm)",
     ylab = "Frequency",
     col = "lightcoral",
     border = "darkred",
     breaks = 15,
     xlim = c(50, 120),
     ylim = c(0, 5000))

Assumption check for the t-test (normality and equality of variances): the F test does not reject equal variances (p = 0.3677), but the histograms show that heart rate is not normally distributed. The parametric test is therefore not appropriate, so I move to the Wilcoxon test (non-parametric). I use the rank-sum version because the observations are not paired.

t.test(smokers$Heart_Rate, non_smokers$Heart_Rate, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  smokers$Heart_Rate and non_smokers$Heart_Rate
## t = -2.5623, df = 49998, p-value = 0.0104
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.52102939 -0.06939598
## sample estimates:
## mean of x mean of y 
##  79.78228  80.07750

The t-test indicates that mean heart rate differs significantly between smokers and non-smokers (p = 0.0104 < 0.05). However, the t-test was only considered initially; I rejected it because the normality assumption is violated.

wilcox_test = wilcox.test(smokers$Heart_Rate, non_smokers$Heart_Rate, paired = FALSE)
wilcox_test

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  smokers$Heart_Rate and non_smokers$Heart_Rate
## W = 258630206, p-value = 0.01036
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon rank-sum test was chosen as it is non-parametric and does not assume a normal distribution.

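RQ1 conclusion: The non-parametric Wilcoxon test indicated a significant difference in heart rate between smokers and non-smokers (p = 0.01036 < 0.05), so H0 is rejected. The difference is small, however: mean heart rate is 79.78 bpm for smokers and 80.08 bpm for non-smokers. This suggests that heart rate is associated with smoking status, although these observational data cannot show that smoking causes the difference.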
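With 50,000 observations a very small difference can be statistically significant, so it helps to look at the size of the effect. The sketch below recomputes two effect sizes from the numbers already printed above (group means, variances, sample sizes, and W). The variable names are mine, and the choice of Cohen's d and the rank-biserial correlation is an assumption about what is useful here, not part of the original analysis.

```r
# Values copied from the output above (smokers = group 1, non-smokers = group 2).
n1 <- 14992;    n2 <- 35008
m1 <- 79.78228; m2 <- 80.07750
v1 <- 138.1187; v2 <- 139.852
W  <- 258630206

# Pooled standard deviation and Cohen's d.
pooled_sd <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
cohens_d  <- (m1 - m2) / pooled_sd

# Sanity check: the equal-variance t statistic rebuilt from the same numbers.
t_stat <- (m1 - m2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))

# W / (n1 * n2) estimates the probability that a random smoker has a higher
# heart rate than a random non-smoker (ties count as one half).
prob_superiority <- W / (n1 * n2)
rank_biserial    <- 2 * prob_superiority - 1

print(round(c(d = cohens_d, t = t_stat,
              p_sup = prob_superiority, r_rb = rank_biserial), 4))
```

Both effect sizes are very close to zero (|d| is about 0.025), which means the significant result corresponds to a very small difference.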

### RQ2: Is there a significant correlation between BMI and Blood Pressure?

H0: There is no correlation between BMI and Blood Pressure.

H1: There is a correlation between BMI and Blood Pressure.

cor.test(data$Blood_Pressure, data$BMI, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  data$Blood_Pressure and data$BMI
## t = -1.0707, df = 49998, p-value = 0.2843
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.013553274  0.003976852
## sample estimates:
##          cor 
## -0.004788579
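The Pearson output can be checked by hand from the correlation coefficient and the sample size alone. This is a sketch using the standard formulas (t statistic for r, Fisher z interval); the variable names are mine.

```r
# Values taken from the cor.test() output above.
r <- -0.004788579
n <- 50000
df <- n - 2

# Test statistic and two-sided p-value for H0: correlation = 0.
t_stat  <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transformation.
ci <- tanh(atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

# Share of variance in one variable explained linearly by the other.
r_squared <- r^2

print(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2], r2 = r_squared))
```

The interval contains zero and r squared is about 0.00002, so the data show essentially no linear relationship between the two variables.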

library(corrplot)

## corrplot 0.95 loaded

num_vars <- data[, c("Blood_Pressure", "BMI", "Heart_Rate", "Age")]
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle", type = "lower", tl.cex = 1)

plot(data$BMI, data$Blood_Pressure, 
     main = "Scatterplot of BMI vs Blood Pressure",
     xlab = "BMI", ylab = "Blood Pressure", pch = 19)
abline(lm(data$Blood_Pressure ~ data$BMI), col = "cyan")

cor.test(data$Blood_Pressure, data$BMI, method = "spearman")

## Warning in cor.test.default(data$Blood_Pressure, data$BMI, method =
## "spearman"): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  data$Blood_Pressure and data$BMI
## S = 2.0907e+13, p-value = 0.427
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##          rho 
## -0.003552259

RQ2 conclusion: The Pearson correlation coefficient between BMI and Blood Pressure is essentially zero (r = -0.005, 95% CI from -0.014 to 0.004) and is not statistically significant (p = 0.2843). The Spearman correlation gives the same picture (rho = -0.004, p = 0.427), so the two methods are consistent. We therefore fail to reject H0: in this dataset there is no evidence that individuals with higher BMI have higher or lower blood pressure.

### RQ3: Is there a significant association between diet and getting a heart attack?

H0: There is no association between diet and getting a heart attack.

H1: There is an association between diet and getting a heart attack.

heart_attack_factor = factor(data$Heart_Attack)
diet_factor = factor(data$Diet)
chi_results <- chisq.test(heart_attack_factor, diet_factor, correct = TRUE)
chi_results

## 
##  Pearson's Chi-squared test
## 
## data:  heart_attack_factor and diet_factor
## X-squared = 1.0572, df = 2, p-value = 0.5894

We cannot reject H0 (p = 0.5894). The assumptions of the chi-square test are met:

1. The observations are independent of each other.
2. All expected frequencies are greater than 5. In larger contingency tables (where at least one categorical variable has more than two categories), up to 20% of the expected frequencies may lie between 1 and 5, but this reduces the power of the test.

chi_results$observed

##                    diet_factor
## heart_attack_factor Healthy Mixed Unhealthy
##               False   17458 13372     13289
##               True     2331  1813      1737

chi_results$expected

##                    diet_factor
## heart_attack_factor   Healthy    Mixed Unhealthy
##               False 17461.418 13398.94 13258.642
##               True   2327.582  1786.06  1767.358
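The expected frequencies and the test statistic can be reproduced from the observed table alone, which also makes the "expected frequencies greater than 5" check explicit. A minimal sketch with the observed counts typed in by hand (variable names are mine):

```r
# Observed counts copied from chi_results$observed above.
observed <- matrix(
  c(17458, 2331, 13372, 1813, 13289, 1737), nrow = 2,
  dimnames = list(Heart_Attack = c("False", "True"),
                  Diet = c("Healthy", "Mixed", "Unhealthy"))
)
n <- sum(observed)

# Expected count under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic, degrees of freedom, and p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(expected, 3))
print(c(chi_sq = chi_sq, df = df, p = p_value))

# Assumption check: every expected count should exceed 5.
print(all(expected > 5))
```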

residuals = chi_results$stdres
residuals

##                    diet_factor
## heart_attack_factor     Healthy       Mixed   Unhealthy
##               False -0.09702224 -0.81325814  0.91917270
##               True   0.09702224  0.81325814 -0.91917270

library(effectsize)
cramers_v(heart_attack_factor, diet_factor)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.00)

## [1] "tiny"
## (Rules: funder2019)
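The reported value of exactly 0.00 comes from the bias-adjusted version of Cramér's V. The sketch below computes the unadjusted value and a bias-corrected value by hand. It assumes that the adjustment is the Bergsma (2013) small-sample correction, which is how I understand the `effectsize` documentation; the variable names are mine.

```r
# Values taken from the chi-square output above.
chi_sq <- 1.0572
n <- 50000
r <- 2   # rows (heart attack: False / True)
k <- 3   # columns (diet categories)

# Unadjusted Cramér's V.
v_raw <- sqrt(chi_sq / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to be the Bergsma 2013 correction):
# phi squared is reduced by its expected value under independence and
# truncated at zero.
phi2     <- chi_sq / n
phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(c(V_unadjusted = v_raw, V_adjusted = v_adj))
```

The unadjusted value is about 0.005, and the correction truncates it to 0 because the observed chi-square is smaller than its expected value under independence.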

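observed <- chi_results$observed
expected <- chi_results$expected

# Stack observed and expected counts (rows: False, True, False, True) and
# label every row, so each bar gets its own colour and legend entry.
freq_matrix <- rbind(observed, expected)
rownames(freq_matrix) <- c("Observed: no heart attack", "Observed: heart attack",
                           "Expected: no heart attack", "Expected: heart attack")

barplot(freq_matrix, beside = TRUE,
        col = c("lightblue", "steelblue", "pink", "firebrick"),
        legend.text = TRUE,
        main = "Observed vs. Expected Frequencies",
        xlab = "Diet Categories",
        ylab = "Frequency")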

The observed and expected frequencies for each diet category are also visually similar, which supports the conclusion that there is no significant association.

Odds ratios were not calculated because a single odds ratio is defined only for a 2x2 contingency table, and this table is 2x3.
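If odds ratios are still of interest, one option is to compare each diet category against a reference category, which gives one 2x2 comparison per category. The sketch below does this with the observed counts typed in by hand. Choosing "Healthy" as the reference is my assumption, and this was not part of the original analysis.

```r
# Observed counts copied from chi_results$observed above.
observed <- matrix(
  c(17458, 2331, 13372, 1813, 13289, 1737), nrow = 2,
  dimnames = list(Heart_Attack = c("False", "True"),
                  Diet = c("Healthy", "Mixed", "Unhealthy"))
)

# Odds of a heart attack within each diet category.
odds <- observed["True", ] / observed["False", ]

# Odds ratio of each category relative to the assumed reference, "Healthy".
or_vs_healthy <- odds / odds["Healthy"]

print(round(odds, 4))
print(round(or_vs_healthy, 3))
```

Both odds ratios are close to 1, which agrees with the non-significant chi-square test.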

RQ3 conclusion: The p-value from the chi-square test (p = 0.5894) is greater than 0.05, so there is no statistically significant association between diet and getting a heart attack. The effect size, as measured by Cramér's V, is negligible (0.00), indicating no meaningful relationship between the variables. The standardized residuals, all smaller than 1 in absolute value, also confirm the lack of strong deviations between observed and expected frequencies. Therefore we fail to reject the null hypothesis.

## 4. Conclusion

This analysis investigated the relationships between smoking and heart rate, BMI and blood pressure, and diet and heart attack occurrence. I found a statistically significant but small difference in heart rate between smokers and non-smokers. BMI is not significantly correlated with blood pressure, and diet does not show a significant association with getting a heart attack.
