HW1 MVA

Data import & Data description & Data manipulation

The dataset used for this research is Heart Attack in Japan Youth Vs Adult, created by Akshay Choudhary. Retrieved from Kaggle: https://www.kaggle.com/datasets/ashaychoudhary/heart-attack-in-japan-youth-vs-adult

#Importing data.
mydata <- read.table("~/IMB 20242025/MVA/japan_heart_attack_dataset.csv", 
                     header=TRUE, 
                     sep=",", 
                     dec = ".")

mydata <- mydata[, -c(3, 6, 13, 14, 15, 18:32)] #Dropping the columns that are not needed for my analysis, because the data set has too many variables.

#I renamed the variables for readability, before showing descriptive statistics.
colnames(mydata) [3] <- "Smoking"
colnames(mydata) [4] <- "Diabetes"
colnames(mydata) [5] <- "CholesterolLevel"
colnames(mydata) [6] <- "PhysicalActivity"
colnames(mydata) [7] <- "DietQuality"
colnames(mydata) [8] <- "AlcoholConsumption"
colnames(mydata) [9] <- "StressLevel"
colnames(mydata) [11] <- "FamilyHistory"
colnames(mydata) [12] <- "HeartAttack"

mydata <- cbind(ID = 1:nrow(mydata), mydata)

head(mydata)

##   ID Age Gender Smoking Diabetes CholesterolLevel PhysicalActivity DietQuality
## 1  1  56   Male     Yes       No         186.4002         Moderate        Poor
## 2  2  69   Male      No       No         185.1367              Low        Good
## 3  3  46   Male     Yes       No         210.6966              Low     Average
## 4  4  32 Female      No       No         211.1655         Moderate        Good
## 5  5  60 Female      No       No         223.8143             High        Good
## 6  6  25 Female      No       No         220.3400              Low        Good
##   AlcoholConsumption StressLevel      BMI FamilyHistory HeartAttack
## 1                Low    3.644786 33.96135            No          No
## 2                Low    3.384056 28.24287           Yes          No
## 3           Moderate    3.810911 27.60121            No          No
## 4               High    6.014878 23.71729            No          No
## 5               High    6.806883 19.77158            No          No
## 6               High    8.207233 20.24744            No          No

Explanation of the dataset

• Unit of observation: an individual

• Sample size: n=30000

• Age: The age in years.

• Gender: Male or Female (categorical variable)

• Smoking: the individual has a history of smoking - Yes or No (categorical variable)

• Diabetes: the individual has a history of diabetes - Yes or No (categorical variable)

• CholesterolLevel: the level of cholesterol in an individual’s blood, in mg/dL

• PhysicalActivity: Low, Moderate or High (categorical variable)

• DietQuality: Poor, Average or Good (categorical variable)

• AlcoholConsumption: Low, Moderate or High (categorical variable)

• StressLevel: measured on a scale of 0-10 (the observed values range from 0 to 10)

• BMI: an individual’s Body Mass Index, calculated as weight/height² (kg/m²)

• FamilyHistory: the individual has a family history of a heart attack - Yes or No (categorical variable)

• HeartAttack: the individual has had a heart attack - Yes or No (categorical variable)

summary(mydata[, c("Age", "CholesterolLevel", "StressLevel", "BMI")])

##       Age        CholesterolLevel  StressLevel          BMI       
##  Min.   :18.00   Min.   : 80.02   Min.   : 0.000   Min.   : 5.58  
##  1st Qu.:33.00   1st Qu.:179.55   1st Qu.: 3.644   1st Qu.:21.63  
##  Median :48.00   Median :199.77   Median : 4.993   Median :24.96  
##  Mean   :48.49   Mean   :199.90   Mean   : 5.002   Mean   :25.00  
##  3rd Qu.:64.00   3rd Qu.:220.16   3rd Qu.: 6.353   3rd Qu.:28.36  
##  Max.   :79.00   Max.   :336.86   Max.   :10.000   Max.   :46.10

table(mydata$Gender)

## 
## Female   Male 
##  14933  15067

table(mydata$Smoking)

## 
##    No   Yes 
## 21003  8997

table(mydata$Diabetes)

## 
##    No   Yes 
## 23903  6097

table(mydata$PhysicalActivity)

## 
##     High      Low Moderate 
##     9091     8985    11924

table(mydata$DietQuality)

## 
## Average    Good    Poor 
##   11971   12006    6023

table(mydata$AlcoholConsumption)

## 
##     High      Low Moderate     None 
##     5828     9098    12059     3015

table(mydata$FamilyHistory)

## 
##    No   Yes 
## 21064  8936

table(mydata$HeartAttack)

## 
##    No   Yes 
## 27036  2964
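
The tables above list the categories alphabetically (High, Low, Moderate), because the columns are stored as plain text. A minimal sketch of how an ordered factor would put the levels in their natural order; the vector `activity` is a small made-up example, not the real data:

```r
# Toy vector standing in for mydata$PhysicalActivity (values are made up).
activity <- c("Moderate", "Low", "High", "Low", "Moderate", "Moderate")

# Declare the natural ordering of the categories explicitly.
activity_f <- factor(activity,
                     levels = c("Low", "Moderate", "High"),
                     ordered = TRUE)

# table() now follows the declared level order instead of the alphabet.
activity_tab <- table(activity_f)
print(activity_tab)
```

The same idea would apply to DietQuality and AlcoholConsumption, with their own level orders.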

Explanation of some sample statistics:

• Age:

Min: The minimum age of individuals from my data set is 18 years.

Max: The maximum age of individuals from my data set is 79 years.

• Cholesterol Level:

Median: Half of the individuals’ cholesterol level was lower than 199.77 mg/dL, and half of the individuals’ cholesterol level was higher.

• Stress Level:

3rd Quartile: 75 % of the individuals had a stress level of 6.353 or lower, and 25 % had a higher stress level.

• BMI:

Mean: The average Body Mass Index of the individuals was 25 kg/m².

Research question 1

Is there a significant difference in cholesterol levels between individuals with and without a history of a heart attack?

• Null Hypothesis (H₀): There is no difference in mean cholesterol levels between individuals with and without a history of a heart attack.

H₀: μ(HeartAttackYes) = μ(HeartAttackNo)

• Alternative Hypothesis (H₁): There is a difference in mean cholesterol levels between the two groups.

H₁: μ(HeartAttackYes) ≠ μ(HeartAttackNo)

Assumptions:

1. The variable is numeric (cholesterol level is a numeric variable).
2. The distribution of the variable is normal in both populations (this will be tested with the Shapiro-Wilk test).
3. The data must come from two independent populations (individuals with a heart attack and those without are independent).
4. The variable has the same variance in both populations (if this assumption is violated, we apply the Welch correction).
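
The equal-variance assumption is not tested explicitly in this report. One possible check, shown here only as a sketch on simulated data (the vectors `chol_no` and `chol_yes` are hypothetical stand-ins, not the real groups), is the F test from base R:

```r
set.seed(1)
# Simulated stand-ins for the two groups (not the real data).
chol_no  <- rnorm(266, mean = 200, sd = 30)
chol_yes <- rnorm(34,  mean = 200, sd = 30)

# F test of H0: the two population variances are equal.
f_res <- var.test(chol_no, chol_yes)
f_res$statistic   # ratio of the two sample variances
f_res$p.value     # a small p-value would argue for the Welch correction
```

The F test is sensitive to departures from normality, so it is only one option; using the Welch correction by default, as done below, avoids having to rely on it.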

#Since the full data set has n = 30,000 observations, I drew a simple random sample of 300 individuals without replacement. I found the sample() function online, as I'm not sure we covered it in class.
set.seed(123)
sampled_data <- mydata[sample(1:nrow(mydata), size = 300, replace = FALSE), ]
head(sampled_data)

##          ID Age Gender Smoking Diabetes CholesterolLevel PhysicalActivity
## 18847 18847  62   Male      No       No         137.9860              Low
## 18895 18895  58 Female      No       No         211.7057         Moderate
## 26803 26803  77 Female     Yes      Yes         195.5574         Moderate
## 25102 25102  53 Female     Yes      Yes         212.1132         Moderate
## 28867 28867  50 Female      No       No         170.4875         Moderate
## 2986   2986  30   Male     Yes      Yes         167.2758             High
##       DietQuality AlcoholConsumption StressLevel      BMI FamilyHistory
## 18847     Average           Moderate    7.306264 29.81270           Yes
## 18895        Good           Moderate    2.302482 29.33706           Yes
## 26803        Poor           Moderate    6.291841 27.90748            No
## 25102        Good               None    7.027272 17.10765           Yes
## 28867        Good                Low    3.726472 21.16023           Yes
## 2986         Good           Moderate    7.517483 31.11976            No
##       HeartAttack
## 18847          No
## 18895          No
## 26803         Yes
## 25102          No
## 28867          No
## 2986           No

nrow(sampled_data)

## [1] 300
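
A short sketch of why `set.seed()` matters here: with the same seed, `sample()` returns the same row indices, so the analysis is reproducible. Only the indices are drawn, so the example runs without the data file; `n_total` and `n_draw` are names introduced for illustration.

```r
n_total <- 30000   # size of the full data set, as reported above
n_draw  <- 300     # size of the random sample

set.seed(123)
idx_a <- sample(1:n_total, size = n_draw, replace = FALSE)

set.seed(123)
idx_b <- sample(1:n_total, size = n_draw, replace = FALSE)

identical(idx_a, idx_b)       # same seed, same rows
anyDuplicated(idx_a) == 0     # replace = FALSE: no row is drawn twice
```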

sampled_data$ID <- 1:nrow(sampled_data)

library(ggplot2)

ggplot(mydata, aes(x = CholesterolLevel, fill = HeartAttack)) +
  geom_histogram(binwidth = 10, colour = "black", alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("red", "blue"), name = "HeartAttack") +
  labs(
    x = "Cholesterol Level (mg/dL)",
    y = "Frequency",
    title = "Distribution of Cholesterol Levels by Heart Attack"
  ) +
  theme_minimal()

The graph shows that the distribution of cholesterol levels for both those who had a heart attack, and those who didn’t, appears approximately normal, with most values centered around 150–250 mg/dL. To confirm this observation, we will perform the Shapiro-Wilk test to assess the normality of the data.

H₀ (Null Hypothesis): The variable cholesterol level is normally distributed in both groups.
H₁ (Alternative Hypothesis): The normality is violated for at least one group.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

sampled_data %>%
  group_by(HeartAttack) %>%
  shapiro_test(CholesterolLevel)

## # A tibble: 2 × 4
##   HeartAttack variable         statistic     p
##   <chr>       <chr>                <dbl> <dbl>
## 1 No          CholesterolLevel     0.991 0.110
## 2 Yes         CholesterolLevel     0.954 0.159

We cannot reject H₀, because p > 0.05 for both groups (p = 0.110 and p = 0.159), so there is no evidence that normality is violated in either group. We therefore treat the assumption of normality as met and continue with the parametric independent samples t-test, to compare the mean cholesterol levels between those with a history of a heart attack and those without.
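
The same group-wise check can also be written with base R only. This is a sketch on simulated data; the data frame `toy` is a hypothetical stand-in for `sampled_data`, so the numbers will differ from the output above.

```r
set.seed(42)
# Simulated stand-in for sampled_data (values are not the real data).
toy <- data.frame(
  HeartAttack      = rep(c("No", "Yes"), times = c(266, 34)),
  CholesterolLevel = rnorm(300, mean = 200, sd = 30)
)

# Run shapiro.test() within each group and keep W and the p-value.
sw <- sapply(split(toy$CholesterolLevel, toy$HeartAttack), function(v) {
  res <- shapiro.test(v)
  c(W = unname(res$statistic), p = res$p.value)
})
round(sw, 3)
```

The result is a small matrix with one column per group, which is convenient when only the statistic and the p-value are needed.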

Independent Samples

t.test(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  sampled_data$CholesterolLevel by sampled_data$HeartAttack
## t = 0.34567, df = 42.541, p-value = 0.7313
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -8.707099 12.308009
## sample estimates:
##  mean in group No mean in group Yes 
##          198.1406          196.3402
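
To make the Welch test less of a black box, the sketch below computes the statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand, and compares them with `t.test()`. The helper `welch_t()` and the simulated vectors `x` and `y` are introduced only for illustration.

```r
# Welch t-test by hand: unequal variances, Satterthwaite degrees of freedom.
welch_t <- function(x, y) {
  se2_x <- var(x) / length(x)   # squared standard error of mean(x)
  se2_y <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
  df <- (se2_x + se2_y)^2 /
    (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))
  p <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  c(t = t_stat, df = df, p = p)
}

set.seed(7)
x <- rnorm(266, mean = 200, sd = 30)   # simulated group "No"
y <- rnorm(34,  mean = 200, sd = 35)   # simulated group "Yes"

by_hand  <- welch_t(x, y)
built_in <- t.test(x, y, var.equal = FALSE)

# The hand computation should agree with t.test().
all.equal(unname(by_hand["t"]), unname(built_in$statistic))
```

The non-integer degrees of freedom in the output above (df = 42.541) come from this Satterthwaite approximation.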

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

effectsize::cohens_d(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
                     pooled_sd = FALSE)

## Cohen's d |        95% CI
## -------------------------
## 0.06      | [-0.29, 0.41]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.06, rules = "sawilowsky2009")

## [1] "tiny"
## (Rules: sawilowsky2009)
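
For reference, a sketch of Cohen's d with un-pooled standard deviations, where the difference in means is divided by the square root of the average of the two group variances. I assume this matches what `pooled_sd = FALSE` does in effectsize; the function `cohens_d_unpooled()` and the toy vectors are made up for illustration.

```r
# Cohen's d with un-pooled SD: mean difference divided by
# the square root of the average of the two group variances.
cohens_d_unpooled <- function(x, y) {
  (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
}

# Toy data: means 3 and 4, both variances 2.5.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 4, 5, 6)

d <- cohens_d_unpooled(x, y)
d   # -1 / sqrt(2.5)
```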

Conclusion

Based on the sample data, we cannot reject H₀ (t = 0.35, df = 42.54, p = 0.731). We cannot say that there is a statistically significant difference in mean cholesterol levels between individuals with and without a history of a heart attack; the effect size is tiny (d = 0.06).

Alternative non-parametric test; Wilcoxon Rank Sum Test

• Null Hypothesis (H₀): Distribution locations of Cholesterol Level are the same for individuals with and without a heart attack (the median cholesterol levels are equal for the two groups).

• Alternative Hypothesis (H₁): Distribution locations of cholesterol levels are different for the two groups (the median cholesterol levels are not equal for the two groups).

wilcox.test(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  sampled_data$CholesterolLevel by sampled_data$HeartAttack
## W = 4608, p-value = 0.8567
## alternative hypothesis: true location shift is not equal to 0

effectsize(wilcox.test(sampled_data$CholesterolLevel ~ sampled_data$HeartAttack,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.02              | [-0.19, 0.22]

interpret_rank_biserial(0.02)

## [1] "tiny"
## (Rules: funder2019)
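
The rank-biserial correlation can be recovered from the reported W statistic with the formula r = 2W / (n₁ · n₂) − 1. The sketch below uses W = 4608 from the output above and the group sizes in the sample (266 without and 34 with a heart attack, as shown in the contingency table for research question 3).

```r
W     <- 4608   # Wilcoxon rank sum statistic reported above
n_no  <- 266    # sampled individuals without a heart attack
n_yes <- 34     # sampled individuals with a heart attack

# Rank-biserial correlation from W.
r_rb <- 2 * W / (n_no * n_yes) - 1
round(r_rb, 2)   # 0.02, matching the effectsize output
```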

Conclusion

Based on the sampled data, we cannot reject the null hypothesis, because p > 0.05; the effect size is again tiny (|r|=0.02). We cannot say there is a statistically significant difference in the distribution of cholesterol levels between individuals with and without a heart attack.

Since the Shapiro-Wilk test showed no violation of normality for cholesterol levels, the parametric independent samples t-test is more appropriate for our analysis: it is more powerful than the alternative non-parametric Wilcoxon rank sum test, and it addresses means, which the research question focuses on.

Research question 2

Is there a significant correlation between stress levels and BMI?

For RQ 2, I will be using the same sample of the data set.

Assumptions:

1. Both variables are numeric (this assumption is met).
2. Normality of the variables (we treat this assumption as met, since the sample is large enough, n = 300).
3. A linear relationship between the variables (checked visually with the scatterplot).

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(sampled_data, columns = c("StressLevel", "BMI"))

cor(sampled_data$StressLevel, sampled_data$BMI,
    method = "pearson",
    use = "complete.obs")

## [1] 0.1452456

cor.test(sampled_data$StressLevel, sampled_data$BMI,
    method = "pearson") #cor.test() has no 'use' argument; it drops incomplete pairs on its own.

## 
##  Pearson's product-moment correlation
## 
## data:  sampled_data$StressLevel and sampled_data$BMI
## t = 2.5342, df = 298, p-value = 0.01178
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03253997 0.25430372
## sample estimates:
##       cor 
## 0.1452456

Pearson correlation test:

• Null Hypothesis (H₀): There is no correlation between Stress Level and BMI in the population.

H₀: ρ=0

• Alternative Hypothesis (H₁): There is a correlation between Stress Level and BMI in the population.

H₁: ρ ≠0

Conclusion: We can reject the null hypothesis (p = 0.012). We can conclude that there is a statistically significant, weak positive correlation between the stress level and the Body Mass Index of an individual (r = 0.15).
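
The test statistic reported by `cor.test()` follows directly from the sample correlation: t = r · √(n − 2) / √(1 − r²), with n − 2 degrees of freedom. A short sketch using the values from the output above:

```r
r <- 0.1452456   # sample correlation reported by cor()
n <- 300         # number of sampled individuals

# t statistic and two-sided p-value for H0: rho = 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_val), 4)
```

This shows why a weak correlation can still be statistically significant: with n = 300 the factor √(n − 2) is large enough to push t past the critical value.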

Research question 3

Is there a significant association between Smoking History and Heart Attack Occurrence?

• Null Hypothesis (H₀): There is no association between a history of smoking and the occurrence of a heart attack.

• Alternative Hypothesis (H₁): There is an association between a history of smoking and the occurrence of a heart attack.

Assumptions for the Pearson Chi-square test, used to analyse the association between two categorical variables:

1. Observations must be independent (this assumption is met).
2. All expected frequencies are greater than 5.
3. In larger contingency tables (at least one categorical variable has more than two categories), up to 20% of the expected frequencies can be between 1 and 5, but this will reduce the power of the test.

table_sample <- table(sampled_data$Smoking, sampled_data$HeartAttack)

print(table_sample)

##      
##        No Yes
##   No  183  26
##   Yes  83   8

chisq.test(table_sample)$expected #Checking assumptions.

##      
##              No      Yes
##   No  185.31333 23.68667
##   Yes  80.68667 10.31333

The second assumption is met because all expected frequencies are greater than 5 (the smallest is 10.31); the third does not apply, since this is a 2×2 table. Therefore, the Chi-square test can be performed without Yates’ continuity correction.

results <- chisq.test(sampled_data$Smoking, sampled_data$HeartAttack,
                      correct = FALSE) #Yates' continuity correction is not applied, because all expected frequencies are greater than 5.

results

## 
##  Pearson's Chi-squared test
## 
## data:  sampled_data$Smoking and sampled_data$HeartAttack
## X-squared = 0.84002, df = 1, p-value = 0.3594
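
The statistic above can be reproduced by hand from the observed counts. The sketch below rebuilds the 2×2 table printed earlier, derives the expected frequencies as (row total × column total) / n, and sums (O − E)² / E over the four cells; the object names are chosen only for this illustration.

```r
# Observed counts from the contingency table (rows: Smoking, columns: HeartAttack).
obs <- matrix(c(183, 83, 26, 8), nrow = 2,
              dimnames = list(Smoking = c("No", "Yes"),
                              HeartAttack = c("No", "Yes")))

# Expected frequencies under independence.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic (no continuity correction) and its p-value.
x2 <- sum((obs - expected)^2 / expected)
p  <- pchisq(x2, df = 1, lower.tail = FALSE)

round(expected, 2)
round(c(X2 = x2, p = p), 4)
```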

Based on the sample, we cannot reject the null hypothesis (p > 0.05). There is not enough evidence to conclude that there is an association between a history of smoking and the occurrence of a heart attack.

addmargins(results$observed)

##                     sampled_data$HeartAttack
## sampled_data$Smoking  No Yes Sum
##                  No  183  26 209
##                  Yes  83   8  91
##                  Sum 266  34 300

round(results$expected, 2)

##                     sampled_data$HeartAttack
## sampled_data$Smoking     No   Yes
##                  No  185.31 23.69
##                  Yes  80.69 10.31

The rounded expected frequencies confirm that all of them are greater than 5, so the assumption about expected frequencies is met and we do not need to switch to an exact test (such as Fisher’s exact test).

If we assume no association, we would expect 10.31 smokers in the sample to have had a heart attack. In reality, 8 smokers in the sample had a heart attack.

round(results$residuals, 2)

##                     sampled_data$HeartAttack
## sampled_data$Smoking    No   Yes
##                  No  -0.17  0.48
##                  Yes  0.26 -0.72

Since all residuals are below 1.96 in absolute value, no cell shows a statistically significant difference between the observed and the expected frequency.
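
The residuals printed above are Pearson residuals, (O − E) / √E. `chisq.test()` also returns standardized residuals in `$stdres`, which are the ones that are approximately standard normal and therefore usually compared with ±1.96. A sketch that reproduces both from the observed counts:

```r
# Observed counts (rows: Smoking, columns: HeartAttack).
obs <- matrix(c(183, 83, 26, 8), nrow = 2,
              dimnames = list(Smoking = c("No", "Yes"),
                              HeartAttack = c("No", "Yes")))

expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals, as returned in results$residuals.
pearson_res <- (obs - expected) / sqrt(expected)
round(pearson_res, 2)

# Standardized residuals from chisq.test(); in a 2x2 table they all
# have the same absolute value, the square root of the X-squared statistic.
std_res <- chisq.test(obs, correct = FALSE)$stdres
round(std_res, 2)
```

Here the standardized residuals are also well below 1.96 in absolute value, so the conclusion does not change.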

effectsize::cramers_v(sampled_data$Smoking, sampled_data$HeartAttack)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.00)

## [1] "tiny"
## (Rules: funder2019)
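
The value of exactly 0.00 comes from the bias-adjusted version of Cramér's V. The sketch below contrasts the unadjusted V = √(χ² / (n · min(r − 1, c − 1))) with a bias-corrected estimator; I assume the correction shown here is the one behind the "(adj.)" output, so treat it as an illustration rather than the package's exact code.

```r
x2 <- 0.84002   # Pearson chi-square statistic reported above
n  <- 300       # sample size
k  <- 2         # number of rows (Smoking: No, Yes)
l  <- 2         # number of columns (HeartAttack: No, Yes)

# Unadjusted Cramer's V.
phi2  <- x2 / n
v_raw <- sqrt(phi2 / min(k - 1, l - 1))

# Bias-corrected version: phi^2 is shrunk and truncated at zero.
phi2_adj <- max(0, phi2 - (k - 1) * (l - 1) / (n - 1))
k_adj    <- k - (k - 1)^2 / (n - 1)
l_adj    <- l - (l - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(k_adj - 1, l_adj - 1))

round(c(unadjusted = v_raw, adjusted = v_adj), 3)
```

Under this assumption the unadjusted value is about 0.05, and the correction truncates it to 0 because χ² / n is smaller than 1 / (n − 1).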

There is a tiny association between Smoking History and Heart Attack Occurrence (adjusted Cramér’s V = 0.00), which aligns with the result of the Pearson Chi-square test, which showed no significant association between the two variables.

Vijola Spasovic

2025-01-16
