mydata <- read.csv("./HW2.csv",
                    header = TRUE,
                    sep = ';',
                    dec = ',')

head(mydata)
##   id Age education sex is_smoking totChol   BMI heartRate glucose TenYearCHD
## 1  1  36         4   0         NO     212 29.77        72      75          0
## 2  2  46         1   1        YES     250 20.35        88      94          0
## 3  3  50         1   0        YES     233 28.26        68      94          1
## 4  4  64         1   1        YES     241 26.42        70      77          0
## 5  5  61         3   1         NO     272 32.80        85      65          1
## 6  6  61         1   0         NO     238 24.83        75      79          0

Description of the variables:

id: Number of the patient/resident of Framingham

Age: Age of the patient/resident of Framingham (in years)

education: Level of education (level 1 - level 4)

Is_smoking: Whether or not the patient/resident is a current smoker (YES or NO)?

totChol: Total Cholesterol Level in mg/dL for individual

BMI: Body Mass Index (in kg/m2) for individual

heartRate: Heart Rate (in x times) for individual

glucose: Glucose Level in mg/dL for individual

TenYearCHD: Individual’s 10 year risk of coronary heart disease CHD (Yes = 1 or No = 0)

Sample has 148 observations. Unit of observation is a patient/resident of the town of Framingham, Massachusetts. The dataset was found on the Kaggle website (link: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea).

Below, I changed the names of the variables.

colnames(mydata) <- c('ID','Age','Education','Gender','Smoking?','Cholesterol_Level','BMI','Heart_Rate','Glucose_Level','Ten_Year_CHD_Prediction')

head(mydata)
##   ID Age Education Gender Smoking? Cholesterol_Level   BMI Heart_Rate
## 1  1  36         4      0       NO               212 29.77         72
## 2  2  46         1      1      YES               250 20.35         88
## 3  3  50         1      0      YES               233 28.26         68
## 4  4  64         1      1      YES               241 26.42         70
## 5  5  61         3      1       NO               272 32.80         85
## 6  6  61         1      0       NO               238 24.83         75
##   Glucose_Level Ten_Year_CHD_Prediction
## 1            75                       0
## 2            94                       0
## 3            94                       1
## 4            77                       0
## 5            65                       1
## 6            79                       0

Hypothesis about the population arithmetic mean

Research question: Is the average BMI among given patients different as it is on the internet?

The average BMI for both genders is 26.55. The data about average was found on link (https://www.cdc.gov/nchs/data/nhanes/databriefs/adultweight.pdf). We would like to test whether there is a difference in the BMI between the US average and the average among given patients from Framingham, Massachusetts.

H0: µ (our sample) = 26.55 (internet)

H1: µ (our sample) ≠ 26.55 (internet)

First, we need to test assumptions so we will know which test should we use.

The first assumption is that the variable is numeric and this is true.

The second assumption is that the variable is normally distributed.

library(ggplot2)
ggplot(mydata, aes(x = BMI)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "lightblue") + 
  ylab("Frequency") + 
  xlab("BMI")

H0: The distribution is normal.

H1: The distribution is not normal.

shapiro.test(mydata$BMI)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$BMI
## W = 0.925, p-value = 5.257e-07

Based on the given p-value, we reject the null hypothesis. We can accept the H1, which states that the distribution of the variable is not normal.

Due to the violations of assumptions for parametric test, we need to use alternative non-parametric test (Willcoxon signed rank test), which uses median.

The changed hypothesis:

H0: Me (our sample) = 27.55 (internet)

H1: Me (our sample) ≠ 27.55 (internet)

For the comparison, I have found that the median for men and women is 27.55 on the link (https://healthfully.com/76977-median-body-mass-index.html).

median(mydata$BMI)
## [1] 24.77

Half of the observed patients have BMI up to or equal to 24.77 and the other half has higher BMI than 24.77.

wilcox.test(mydata$BMI,
            mu = 27.55,
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  mydata$BMI
## V = 2427.5, p-value = 5.76e-09
## alternative hypothesis: true location is not equal to 27.55

Based on the test above and its p-value, we can reject the null hypothesis. We accept the alternative hypothesis that true location is not equal to 27.55. Furthermore, we can reject the null hypothesis (Me (our sample) = 27.55 (internet)). So, median of BMI found on the internet is higher than the median from our sample.

library(effectsize)
effectsize(wilcox.test(mydata$BMI,
                       mu = 27.55,
                       correct = FALSE))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.55             | [-0.67, -0.41]
## 
## - Deviation from a difference of 27.55.
interpret_rank_biserial(0.55, rules = 'funder2019')
## [1] "very large"
## (Rules: funder2019)

Based on the sample data, we found that the median of BMI for male and female patients is 24.77 and it is lower than the median of BMI for men and women on the internet (p < 0.05, r = 0.55 - very large effect).

Hypothesis about the difference between two population arithmetic means (independent samples)

Research question: Is the average BMI among patients different for both genders?

H0: μM = μF -> μM - μF = 0

H1: μM ≠ μF -> μM - μF ≠ 0

mydata$GenderF <- factor(mydata$Gender,
                         levels = c(0,1),
                         labels = c('M','F'))
head(mydata)
##   ID Age Education Gender Smoking? Cholesterol_Level   BMI Heart_Rate
## 1  1  36         4      0       NO               212 29.77         72
## 2  2  46         1      1      YES               250 20.35         88
## 3  3  50         1      0      YES               233 28.26         68
## 4  4  64         1      1      YES               241 26.42         70
## 5  5  61         3      1       NO               272 32.80         85
## 6  6  61         1      0       NO               238 24.83         75
##   Glucose_Level Ten_Year_CHD_Prediction GenderF
## 1            75                       0       M
## 2            94                       0       F
## 3            94                       1       M
## 4            77                       0       F
## 5            65                       1       F
## 6            79                       0       M
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:effectsize':
## 
##     phi
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describeBy(mydata$BMI, g = mydata$GenderF)
## 
##  Descriptive statistics by group 
## group: M
##    vars  n  mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 69 25.92 3.97  26.08   25.67 3.28 18.1 38.42 20.32 0.79     1.32 0.48
## ------------------------------------------------------------ 
## group: F
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 79 25.15 5.26  24.22   24.51 4.37 16.73 44.09 27.36 1.39      2.5 0.59

First, we need to check if the distribution is normal.

library(ggplot2)
BMI_Female <- ggplot(mydata[mydata$GenderF == "F",  ], aes(x = BMI)) +
  theme_linedraw() + 
  geom_histogram() +
  ggtitle("BMI for Female Patients") +
  ylab('Number of patients')

BMI_Male <- ggplot(mydata[mydata$GenderF == "M",  ], aes(x = BMI)) +
  theme_linedraw() + 
  geom_histogram() +
  ggtitle("BMI for Male Patients") +
  ylab('Number of patients')

library(ggpubr)
ggarrange(BMI_Female, BMI_Male,
          ncol= 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, the Shapiro Test will be used to check the normality assumption.

H0: The distribution is normal.

H1: The distribution is not normal.

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(GenderF) %>%
  shapiro_test(BMI)
## # A tibble: 2 × 4
##   GenderF variable statistic          p
##   <fct>   <chr>        <dbl>      <dbl>
## 1 M       BMI          0.949 0.00660   
## 2 F       BMI          0.895 0.00000816

With Shapiro-Wilk test we can conclude that both given p-values are less than 5%. So that is why we can reject H0 and accept H1 (the distribution is not normal in both populations). The assumptions are not met that is why alternative non-parametric test - Wilcoxon Rank Sum Test.

H0: Distribution locations are the same.

H1: Distribution locations are different.

wilcox.test(mydata$BMI ~ mydata$GenderF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = 'two.sided')
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$BMI by mydata$GenderF
## W = 3204.5, p-value = 0.06559
## alternative hypothesis: true location shift is not equal to 0

We cannot reject H0 (Distribution locations are the same.) at the given p-value because it is more than 5%.

library(effectsize)
effectsize(wilcox.test(mydata$BMI ~ mydata$GenderF,                                  
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = 'two.sided'))
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.18              | [-0.01, 0.35]
interpret_rank_biserial(0.18)
## [1] "small"
## (Rules: funder2019)

Based on the sample data, we find that male and female patients do not differ in the average BMI (p > 0.05) - male patients have slightly higher average BMI than female patients (but the difference is really low), the difference in distribution is small (r = 0.18).

Hypothesis about the equality of three or more population arithmetic means for independent samples

Research question: Is the average BMI among patients different for different education levels?

H0: μ1 = μ2 = μ3 = μ4

H1: At least one μ is different.

(Information about education level I found on this link: https://www.unipage.net/en/education_system_usa)

mydata$EducationF <- factor(mydata$Education,
                           levels = c(1,2,3,4),
                           labels = c('Preschool education','Primary education', 'Secondary education', 'Higher education'))

head(mydata)
##   ID Age Education Gender Smoking? Cholesterol_Level   BMI Heart_Rate
## 1  1  36         4      0       NO               212 29.77         72
## 2  2  46         1      1      YES               250 20.35         88
## 3  3  50         1      0      YES               233 28.26         68
## 4  4  64         1      1      YES               241 26.42         70
## 5  5  61         3      1       NO               272 32.80         85
## 6  6  61         1      0       NO               238 24.83         75
##   Glucose_Level Ten_Year_CHD_Prediction GenderF          EducationF
## 1            75                       0       M    Higher education
## 2            94                       0       F Preschool education
## 3            94                       1       M Preschool education
## 4            77                       0       F Preschool education
## 5            65                       1       F Secondary education
## 6            79                       0       M Preschool education
library(psych)
describe(mydata$BMI)
##    vars   n  mean  sd median trimmed mad   min   max range skew kurtosis   se
## X1    1 148 25.51 4.7  24.77   25.05 4.3 16.73 44.09 27.36  1.2     2.42 0.39
describeBy(x = mydata$BMI, group = mydata$EducationF)
## 
##  Descriptive statistics by group 
## group: Preschool education
##    vars  n  mean  sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 66 26.37 5.6  25.74   25.71 3.84 17.38 44.09 26.71 1.22     1.48 0.69
## ------------------------------------------------------------ 
## group: Primary education
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 41 24.59 3.61  24.63   24.54 3.62 18.01 31.78 13.77 0.04    -0.84 0.56
## ------------------------------------------------------------ 
## group: Secondary education
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 28 24.65 3.93  24.04   24.34 3.76 16.73 35.58 18.85 0.76     0.68 0.74
## ------------------------------------------------------------ 
## group: Higher education
##    vars  n  mean  sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 13 25.93 3.7  26.18      26 5.32 20.34 30.8 10.46 -0.04    -1.74 1.03

The assumption of homogeneity of variances will be now tested with Levene test.

H0: Variances of BMI are the same in all groups.

H1: Variances of BMI are not the same in all groups.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
leveneTest(mydata$BMI, group = mydata$EducationF)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  1.3099 0.2735
##       144

We cannot reject H0 because the given p-value is higher than 5%. This means that variances are the same in all groups and homogeneity is met.

Next, we will check if we have normal distribution.

H0: The distribution is normal.

H1: The distribution is not normal.

mydata %>%
  group_by(EducationF) %>%
  shapiro_test(BMI)
## # A tibble: 4 × 4
##   EducationF          variable statistic         p
##   <fct>               <chr>        <dbl>     <dbl>
## 1 Preschool education BMI          0.896 0.0000447
## 2 Primary education   BMI          0.975 0.493    
## 3 Secondary education BMI          0.938 0.0994   
## 4 Higher education    BMI          0.909 0.178

We can see that the distribution is not normal in all four populations and we will use non-parametric test - > Kruskal-Wallis Rank Sum Test.

H0: Distribution locations of BMI are the same.

H1: At least one distribution location of BMI is different.

kruskal.test(BMI ~ EducationF,
             data = mydata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  BMI by EducationF
## Kruskal-Wallis chi-squared = 3.1118, df = 3, p-value = 0.3747

We cannot reject H0 because we have the given p-value above 5%. We assume that the distribution locations are the same for all education levels. Based on the sample data, we find that the education levels do not differ in the BMI at the given p-value 0.3747.

kruskal_effsize(BMI ~ EducationF,
                data = mydata)
## # A tibble: 1 × 5
##   .y.       n  effsize method  magnitude
## * <chr> <int>    <dbl> <chr>   <ord>    
## 1 BMI     148 0.000777 eta2[H] small
groups_nonpar <- wilcox_test(BMI ~ EducationF,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)

groups_nonpar
## # A tibble: 6 × 9
##   .y.   group1              group2          n1    n2 stati…¹     p p.adj p.adj…²
## * <chr> <chr>               <chr>        <int> <int>   <dbl> <dbl> <dbl> <chr>  
## 1 BMI   Preschool education Primary edu…    66    41   1554. 0.2       1 ns     
## 2 BMI   Preschool education Secondary e…    66    28   1084. 0.189     1 ns     
## 3 BMI   Preschool education Higher educ…    66    13    403  0.736     1 ns     
## 4 BMI   Primary education   Secondary e…    41    28    582  0.927     1 ns     
## 5 BMI   Primary education   Higher educ…    41    13    218  0.336     1 ns     
## 6 BMI   Secondary education Higher educ…    28    13    140  0.249     1 ns     
## # … with abbreviated variable names ¹​statistic, ²​p.adj.signif
pwc <- mydata %>%
  wilcox_test(BMI ~ EducationF, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(BMI ~ EducationF, 
                                data = mydata)

pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

ggboxplot(mydata, x = "EducationF", y = "BMI", add = "point", ylim=c(10, 70)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE, y.position = c(47, 52, 57)) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = EducationF), color = "darkred",
               position = position_dodge(width = 0.8)) +
  stat_summary(fun = median, colour = "darkred", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = EducationF)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc)) 

Based on the sample data, we did not find statistically significant relationship between education levels and BMI (χ2=3.11, p = 0.38, effect size is small).