mydata <- read.csv("./HW2.csv",
header = TRUE,
sep = ';',
dec = ',')
head(mydata)
## id Age education sex is_smoking totChol BMI heartRate glucose TenYearCHD
## 1 1 36 4 0 NO 212 29.77 72 75 0
## 2 2 46 1 1 YES 250 20.35 88 94 0
## 3 3 50 1 0 YES 233 28.26 68 94 1
## 4 4 64 1 1 YES 241 26.42 70 77 0
## 5 5 61 3 1 NO 272 32.80 85 65 1
## 6 6 61 1 0 NO 238 24.83 75 79 0
Description of the variables:
id: Number of the patient/resident of Framingham
Age: Age of the patient/resident of Framingham (in years)
education: Level of education (level 1 - level 4)
Is_smoking: Whether or not the patient/resident is a current smoker (YES or NO)?
totChol: Total Cholesterol Level in mg/dL for individual
BMI: Body Mass Index (in kg/m2) for individual
heartRate: Heart Rate (in x times) for individual
glucose: Glucose Level in mg/dL for individual
TenYearCHD: Individual’s 10 year risk of coronary heart disease CHD (Yes = 1 or No = 0)
Sample has 148 observations. Unit of observation is a patient/resident of the town of Framingham, Massachusetts. The dataset was found on the Kaggle website (link: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea).
Below, I changed the names of the variables.
colnames(mydata) <- c('ID','Age','Education','Gender','Smoking?','Cholesterol_Level','BMI','Heart_Rate','Glucose_Level','Ten_Year_CHD_Prediction')
head(mydata)
## ID Age Education Gender Smoking? Cholesterol_Level BMI Heart_Rate
## 1 1 36 4 0 NO 212 29.77 72
## 2 2 46 1 1 YES 250 20.35 88
## 3 3 50 1 0 YES 233 28.26 68
## 4 4 64 1 1 YES 241 26.42 70
## 5 5 61 3 1 NO 272 32.80 85
## 6 6 61 1 0 NO 238 24.83 75
## Glucose_Level Ten_Year_CHD_Prediction
## 1 75 0
## 2 94 0
## 3 94 1
## 4 77 0
## 5 65 1
## 6 79 0
Research question: Is the average BMI among given patients different as it is on the internet?
The average BMI for both genders is 26.55. The data about average was found on link (https://www.cdc.gov/nchs/data/nhanes/databriefs/adultweight.pdf). We would like to test whether there is a difference in the BMI between the US average and the average among given patients from Framingham, Massachusetts.
H0: µ (our sample) = 26.55 (internet)
H1: µ (our sample) ≠ 26.55 (internet)
First, we need to test assumptions so we will know which test should we use.
The first assumption is that the variable is numeric and this is true.
The second assumption is that the variable is normally distributed.
library(ggplot2)
ggplot(mydata, aes(x = BMI)) +
geom_histogram(binwidth = 1, colour = "black", fill = "lightblue") +
ylab("Frequency") +
xlab("BMI")
H0: The distribution is normal.
H1: The distribution is not normal.
shapiro.test(mydata$BMI)
##
## Shapiro-Wilk normality test
##
## data: mydata$BMI
## W = 0.925, p-value = 5.257e-07
Based on the given p-value, we reject the null hypothesis. We can accept the H1, which states that the distribution of the variable is not normal.
Due to the violations of assumptions for parametric test, we need to use alternative non-parametric test (Willcoxon signed rank test), which uses median.
The changed hypothesis:
H0: Me (our sample) = 27.55 (internet)
H1: Me (our sample) ≠ 27.55 (internet)
For the comparison, I have found that the median for men and women is 27.55 on the link (https://healthfully.com/76977-median-body-mass-index.html).
median(mydata$BMI)
## [1] 24.77
Half of the observed patients have BMI up to or equal to 24.77 and the other half has higher BMI than 24.77.
wilcox.test(mydata$BMI,
mu = 27.55,
correct = FALSE)
##
## Wilcoxon signed rank test
##
## data: mydata$BMI
## V = 2427.5, p-value = 5.76e-09
## alternative hypothesis: true location is not equal to 27.55
Based on the test above and its p-value, we can reject the null hypothesis. We accept the alternative hypothesis that true location is not equal to 27.55. Furthermore, we can reject the null hypothesis (Me (our sample) = 27.55 (internet)). So, median of BMI found on the internet is higher than the median from our sample.
library(effectsize)
effectsize(wilcox.test(mydata$BMI,
mu = 27.55,
correct = FALSE))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.55 | [-0.67, -0.41]
##
## - Deviation from a difference of 27.55.
interpret_rank_biserial(0.55, rules = 'funder2019')
## [1] "very large"
## (Rules: funder2019)
Based on the sample data, we found that the median of BMI for male and female patients is 24.77 and it is lower than the median of BMI for men and women on the internet (p < 0.05, r = 0.55 - very large effect).
Research question: Is the average BMI among patients different for both genders?
H0: μM = μF -> μM - μF = 0
H1: μM ≠ μF -> μM - μF ≠ 0
mydata$GenderF <- factor(mydata$Gender,
levels = c(0,1),
labels = c('M','F'))
head(mydata)
## ID Age Education Gender Smoking? Cholesterol_Level BMI Heart_Rate
## 1 1 36 4 0 NO 212 29.77 72
## 2 2 46 1 1 YES 250 20.35 88
## 3 3 50 1 0 YES 233 28.26 68
## 4 4 64 1 1 YES 241 26.42 70
## 5 5 61 3 1 NO 272 32.80 85
## 6 6 61 1 0 NO 238 24.83 75
## Glucose_Level Ten_Year_CHD_Prediction GenderF
## 1 75 0 M
## 2 94 0 F
## 3 94 1 M
## 4 77 0 F
## 5 65 1 F
## 6 79 0 M
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:effectsize':
##
## phi
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(mydata$BMI, g = mydata$GenderF)
##
## Descriptive statistics by group
## group: M
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 69 25.92 3.97 26.08 25.67 3.28 18.1 38.42 20.32 0.79 1.32 0.48
## ------------------------------------------------------------
## group: F
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 79 25.15 5.26 24.22 24.51 4.37 16.73 44.09 27.36 1.39 2.5 0.59
First, we need to check if the distribution is normal.
library(ggplot2)
BMI_Female <- ggplot(mydata[mydata$GenderF == "F", ], aes(x = BMI)) +
theme_linedraw() +
geom_histogram() +
ggtitle("BMI for Female Patients") +
ylab('Number of patients')
BMI_Male <- ggplot(mydata[mydata$GenderF == "M", ], aes(x = BMI)) +
theme_linedraw() +
geom_histogram() +
ggtitle("BMI for Male Patients") +
ylab('Number of patients')
library(ggpubr)
ggarrange(BMI_Female, BMI_Male,
ncol= 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, the Shapiro Test will be used to check the normality assumption.
H0: The distribution is normal.
H1: The distribution is not normal.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
##
## cohens_d, eta_squared
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(GenderF) %>%
shapiro_test(BMI)
## # A tibble: 2 × 4
## GenderF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 M BMI 0.949 0.00660
## 2 F BMI 0.895 0.00000816
With Shapiro-Wilk test we can conclude that both given p-values are less than 5%. So that is why we can reject H0 and accept H1 (the distribution is not normal in both populations). The assumptions are not met that is why alternative non-parametric test - Wilcoxon Rank Sum Test.
H0: Distribution locations are the same.
H1: Distribution locations are different.
wilcox.test(mydata$BMI ~ mydata$GenderF,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = 'two.sided')
##
## Wilcoxon rank sum test
##
## data: mydata$BMI by mydata$GenderF
## W = 3204.5, p-value = 0.06559
## alternative hypothesis: true location shift is not equal to 0
We cannot reject H0 (Distribution locations are the same.) at the given p-value because it is more than 5%.
library(effectsize)
effectsize(wilcox.test(mydata$BMI ~ mydata$GenderF,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = 'two.sided'))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.18 | [-0.01, 0.35]
interpret_rank_biserial(0.18)
## [1] "small"
## (Rules: funder2019)
Based on the sample data, we find that male and female patients do not differ in the average BMI (p > 0.05) - male patients have slightly higher average BMI than female patients (but the difference is really low), the difference in distribution is small (r = 0.18).
Research question: Is the average BMI among patients different for different education levels?
H0: μ1 = μ2 = μ3 = μ4
H1: At least one μ is different.
(Information about education level I found on this link: https://www.unipage.net/en/education_system_usa)
mydata$EducationF <- factor(mydata$Education,
levels = c(1,2,3,4),
labels = c('Preschool education','Primary education', 'Secondary education', 'Higher education'))
head(mydata)
## ID Age Education Gender Smoking? Cholesterol_Level BMI Heart_Rate
## 1 1 36 4 0 NO 212 29.77 72
## 2 2 46 1 1 YES 250 20.35 88
## 3 3 50 1 0 YES 233 28.26 68
## 4 4 64 1 1 YES 241 26.42 70
## 5 5 61 3 1 NO 272 32.80 85
## 6 6 61 1 0 NO 238 24.83 75
## Glucose_Level Ten_Year_CHD_Prediction GenderF EducationF
## 1 75 0 M Higher education
## 2 94 0 F Preschool education
## 3 94 1 M Preschool education
## 4 77 0 F Preschool education
## 5 65 1 F Secondary education
## 6 79 0 M Preschool education
library(psych)
describe(mydata$BMI)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 148 25.51 4.7 24.77 25.05 4.3 16.73 44.09 27.36 1.2 2.42 0.39
describeBy(x = mydata$BMI, group = mydata$EducationF)
##
## Descriptive statistics by group
## group: Preschool education
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 66 26.37 5.6 25.74 25.71 3.84 17.38 44.09 26.71 1.22 1.48 0.69
## ------------------------------------------------------------
## group: Primary education
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 41 24.59 3.61 24.63 24.54 3.62 18.01 31.78 13.77 0.04 -0.84 0.56
## ------------------------------------------------------------
## group: Secondary education
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 28 24.65 3.93 24.04 24.34 3.76 16.73 35.58 18.85 0.76 0.68 0.74
## ------------------------------------------------------------
## group: Higher education
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 13 25.93 3.7 26.18 26 5.32 20.34 30.8 10.46 -0.04 -1.74 1.03
The assumption of homogeneity of variances will be now tested with Levene test.
H0: Variances of BMI are the same in all groups.
H1: Variances of BMI are not the same in all groups.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(mydata$BMI, group = mydata$EducationF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.3099 0.2735
## 144
We cannot reject H0 because the given p-value is higher than 5%. This means that variances are the same in all groups and homogeneity is met.
Next, we will check if we have normal distribution.
H0: The distribution is normal.
H1: The distribution is not normal.
mydata %>%
group_by(EducationF) %>%
shapiro_test(BMI)
## # A tibble: 4 × 4
## EducationF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Preschool education BMI 0.896 0.0000447
## 2 Primary education BMI 0.975 0.493
## 3 Secondary education BMI 0.938 0.0994
## 4 Higher education BMI 0.909 0.178
We can see that the distribution is not normal in all four populations and we will use non-parametric test - > Kruskal-Wallis Rank Sum Test.
H0: Distribution locations of BMI are the same.
H1: At least one distribution location of BMI is different.
kruskal.test(BMI ~ EducationF,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: BMI by EducationF
## Kruskal-Wallis chi-squared = 3.1118, df = 3, p-value = 0.3747
We cannot reject H0 because we have the given p-value above 5%. We assume that the distribution locations are the same for all education levels. Based on the sample data, we find that the education levels do not differ in the BMI at the given p-value 0.3747.
kruskal_effsize(BMI ~ EducationF,
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 BMI 148 0.000777 eta2[H] small
groups_nonpar <- wilcox_test(BMI ~ EducationF,
paired = FALSE,
p.adjust.method = "bonferroni",
data = mydata)
groups_nonpar
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 stati…¹ p p.adj p.adj…²
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 BMI Preschool education Primary edu… 66 41 1554. 0.2 1 ns
## 2 BMI Preschool education Secondary e… 66 28 1084. 0.189 1 ns
## 3 BMI Preschool education Higher educ… 66 13 403 0.736 1 ns
## 4 BMI Primary education Secondary e… 41 28 582 0.927 1 ns
## 5 BMI Primary education Higher educ… 41 13 218 0.336 1 ns
## 6 BMI Secondary education Higher educ… 28 13 140 0.249 1 ns
## # … with abbreviated variable names ¹statistic, ²p.adj.signif
pwc <- mydata %>%
wilcox_test(BMI ~ EducationF,
paired = FALSE,
p.adjust.method = "bonferroni")
Kruskal_results <- kruskal_test(BMI ~ EducationF,
data = mydata)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
ggboxplot(mydata, x = "EducationF", y = "BMI", add = "point", ylim=c(10, 70)) +
stat_pvalue_manual(pwc, hide.ns = FALSE, y.position = c(47, 52, 57)) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = EducationF), color = "darkred",
position = position_dodge(width = 0.8)) +
stat_summary(fun = median, colour = "darkred",
position = position_dodge(width = 0.8),
geom = "text", vjust = -0.5, hjust = -1,
aes(label = round(after_stat(y), digits = 2), group = EducationF)) +
labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
caption = get_pwc_label(pwc))
Based on the sample data, we did not find statistically significant relationship between education levels and BMI (χ2=3.11, p = 0.38, effect size is small).