hmdata <- read.csv("~/IMB/R/HW2/oasis_cross-sectional.csv")
head(hmdata)

##              ID M.F Hand Age Educ SES MMSE CDR eTIV  nWBV   ASF Delay
## 1 OAS1_0001_MR1   F    R  74    2   3   29 0.0 1344 0.743 1.306   N/A
## 2 OAS1_0002_MR1   F    R  55    4   1   29 0.0 1147 0.810 1.531   N/A
## 3 OAS1_0003_MR1   F    R  73    4   3   27 0.5 1454 0.708 1.207   N/A
## 4 OAS1_0004_MR1   M    R  28   NA  NA   NA  NA 1588 0.803 1.105   N/A
## 5 OAS1_0005_MR1   M    R  18   NA  NA   NA  NA 1737 0.848 1.010   N/A
## 6 OAS1_0006_MR1   F    R  24   NA  NA   NA  NA 1131 0.862 1.551   N/A

The unit of observation is an individual, while the original sample size was 436. Since the targeted population are people above 40 years old, the cleaned data consists of 262 subjects. The data was obtained from OASIS cross-sectional on brain imaging.

Variable description:

ID: Identification of a subject
M.F: Gender (Male/Female)
Hand: Dominant hand (left/right)
Age: Age in years
Educ: Education level ( 1: less than high school grad., 2: high school grad., 3: some college, 4: college grad., 5: beyond college)
SES - Socioeconomic status (1-5)
MMSE - Mini Mental State Examination
CDR - Clinical Dementia Rating (0 = normal, 0.5 = very mild dementia, 1 = mild dementia, 2 = moderate dementia, 3 = severe dementia)
eTIV - Estimated Total Intracranial Volume
nWBV - Normalize Whole Brain Volume
ASF Atlas Scaling Factor
Delay - the number of days between two medical visits.

The purpose of this study is to observe the difference between demented and non-demented people above 40 years old through 3 different types of hypothesis testing. The research questions are the following:

What is the difference in the arithmetic means of whole brain volume between demented and non-demented people?

How do the arithmetic means of five populations differ?

What is the relationship between the gender and dementia groups?

As mentioned the targeted age is above 40, which was filtered as well as the irrelevant variables for the research were removed.

hmdata <- hmdata[hmdata$Age >= 40, ]

hmdata <- subset(hmdata, select = -c(Hand, Delay))

colnames(hmdata)[colnames(hmdata)== "M.F"] <- "Gender"

Next the missing values per column/variable need to be determined and imputated based on median. Median was used based on the research of other N/A values for similar data sets on brain function.

colSums(is.na(hmdata))

##     ID Gender    Age   Educ    SES   MMSE    CDR   eTIV   nWBV    ASF 
##      0      0      0     29     48     29     29      0      0      0

hmdata$Educ <- impute(hmdata$Educ, median)

hmdata$SES <- impute(hmdata$SES, median)

hmdata$MMSE <- impute(hmdata$MMSE, median)

hmdata$CDR <- impute(hmdata$CDR, median)

head(hmdata)

##               ID Gender Age Educ SES MMSE CDR eTIV  nWBV   ASF
## 1  OAS1_0001_MR1      F  74    2   3   29 0.0 1344 0.743 1.306
## 2  OAS1_0002_MR1      F  55    4   1   29 0.0 1147 0.810 1.531
## 3  OAS1_0003_MR1      F  73    4   3   27 0.5 1454 0.708 1.207
## 9  OAS1_0010_MR1      M  74    5   2   30 0.0 1636 0.689 1.073
## 10 OAS1_0011_MR1      F  52    3   2   30 0.0 1321 0.827 1.329
## 12 OAS1_0013_MR1      F  81    5   2   30 0.0 1664 0.679 1.055

colSums(is.na(hmdata))

##     ID Gender    Age   Educ    SES   MMSE    CDR   eTIV   nWBV    ASF 
##      0      0      0      0      0      0      0      0      0      0

hmdata$Group <- ifelse(hmdata$CDR > 0.4, "Demented", "Nondemented")

Hypothesis about the difference between two arithmetic means

H0: The arithmetic means of dementia groups and whole brain volume are equal.

H1: The arithmetic means of dementia groups and whole brain volume differ.

describeBy(hmdata$nWBV, g=hmdata$Group)

## 
##  Descriptive statistics by group 
## group: Demented
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.72 0.04   0.72    0.72 0.04 0.64 0.8  0.15 -0.03    -0.73  0
## ------------------------------------------------------------ 
## group: Nondemented
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 162 0.78 0.05   0.78    0.78 0.05 0.64 0.86  0.22 -0.5     -0.5  0

Nondemented <- ggplot(hmdata[hmdata$Group == "Nondemented", ], aes (x = nWBV)) +
  theme_linedraw()+
  geom_histogram(fill = "darkseagreen", colour = "darkseagreen4")+
  ggtitle("Nondemented")

Demented <- ggplot(hmdata[hmdata$Group == "Demented", ], aes (x = nWBV)) +
  theme_linedraw()+
  geom_histogram(fill = "darkseagreen", colour = "darkseagreen4")+
  ggtitle("Demented")

ggarrange(Nondemented, Demented,
          ncol = 2, nrow = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To test whether the normality assumption the following test is performed:

hmdata %>% 
  group_by(Group) %>% 
  shapiro_test(nWBV)

## # A tibble: 2 × 4
##   Group       variable statistic        p
##   <chr>       <chr>        <dbl>    <dbl>
## 1 Demented    nWBV         0.988 0.496   
## 2 Nondemented nWBV         0.967 0.000597

The test showed that the data from nondemented population is not normally distributed, which violates the normality assumption stating that both data should be normally distributed, meaning that a non-parametric alternative needs to be used. Resulting in Wilcoxon rank sum test is performed. The hypothesis is the following:

H0: Distribution locations of demented and non-demented are the same.

H1: distribution locations of demented and non-demented are different.

wilcox.test(hmdata$nWBV ~ hmdata$Group,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  hmdata$nWBV by hmdata$Group
## W = 2908, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The distribution location of normalized whole brain volume is not equal between demented and nondemented based on the p-value.

effectsize(wilcox.test(hmdata$nWBV ~ hmdata$Group,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.64             | [-0.72, -0.55]

interpret_rank_biserial(0.64)

## [1] "very large"
## (Rules: funder2019)

Based on the sample data, we find that demented and nondemented differ in the normalized whole brain volume (𝑝 < 0.001). Nondemented brain volume is larger, which is shown by the very large difference in distribution (𝑟 = 0.64).

Hypothesis about the equality of five population arithmetic means

In this section the equality of five population arithmetic means for independent sampling will be tested. The independent variable is the level of education, where there are 5 different (1: less than high school grad., 2: high school grad., 3: some college, 4: college grad., 5: beyond college). The dependent variable is the normalized whole brain volume.

The hypothesis is the following:

H0: µ1 = µ2 = µ3 = µ4 = µ5

H1: At least one µ is different

hmdata$EducF <- factor(hmdata$Educ,
                       levels = c(1,2,3,4,5),
                       labels = c("Less than highschool", "High school grad.", "Some college", "College grad.", "Beyond college"))

hmdata %>%
  group_by(EducF)%>%
  count()

## # A tibble: 5 × 2
## # Groups:   EducF [5]
##   EducF                    n
##   <fct>                <int>
## 1 Less than highschool    23
## 2 High school grad.       64
## 3 Some college            75
## 4 College grad.           49
## 5 Beyond college          51

describeBy(x = hmdata$eTIV, group = hmdata$EducF)

## 
##  Descriptive statistics by group 
## group: Less than highschool
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 23 1471.04 139.97   1461 1460.32 139.36 1295 1750   455 0.62    -0.95
##       se
## X1 29.19
## ------------------------------------------------------------ 
## group: High school grad.
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 64 1428.17 130.96 1445.5 1424.15 111.94 1161 1762   601 0.27     0.01
##       se
## X1 16.37
## ------------------------------------------------------------ 
## group: Some college
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 75 1444.79 163.28   1411 1431.72 124.54 1154 1992   838  0.9     1.05
##       se
## X1 18.85
## ------------------------------------------------------------ 
## group: College grad.
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 49 1471.78 173.18   1454 1461.29 160.12 1123 1891   768 0.58    -0.05
##       se
## X1 24.74
## ------------------------------------------------------------ 
## group: Beyond college
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 51 1507.76 182.14   1534 1507.32 226.84 1174 1911   737 0.02    -0.93
##      se
## X1 25.5

First the assumptions need to be checked, beginning with the variance of the groups with levene test for homogeneity of variance.

H0: The variance of the whole brain volume among all groups is equal.

H1: The variance of the whole brain volume among groups is different (at least two of them differ).

leveneTest(hmdata$nWBV, group = hmdata$EducF)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.3359 0.8536
##       257

Based on the p-value of 0.8536, the null hypothesis is accepted, showing no violation of the first assumption.

Next the normality assumption is tested.

H0: All sample groups are normally distributed.

H1: At least one sample group is not normally distributed.

hmdata %>%
  group_by(EducF) %>% 
  shapiro_test(nWBV)

## # A tibble: 5 × 4
##   EducF                variable statistic       p
##   <fct>                <chr>        <dbl>   <dbl>
## 1 Less than highschool nWBV         0.930 0.112  
## 2 High school grad.    nWBV         0.985 0.635  
## 3 Some college         nWBV         0.949 0.00435
## 4 College grad.        nWBV         0.953 0.0480 
## 5 Beyond college       nWBV         0.963 0.117

Since two groups (some college and college grad.) are not normally distributed, Kruskal-Wallis Rank Sum Test is used in continuation.

kruskal.test(nWBV ~ EducF,
             data = hmdata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  nWBV by EducF
## Kruskal-Wallis chi-squared = 32.384, df = 4, p-value = 1.597e-06

kruskal_effsize(nWBV ~ EducF,
                data = hmdata)

## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 nWBV    262   0.110 eta2[H] moderate

Groups_nonpar <- wilcox_test(nWBV ~ EducF,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = hmdata)
Groups_nonpar

## # A tibble: 10 × 9
##    .y.   group1               group2    n1    n2 stati…¹       p   p.adj p.adj…²
##  * <chr> <chr>                <chr>  <int> <int>   <dbl>   <dbl>   <dbl> <chr>  
##  1 nWBV  Less than highschool High …    23    64    471  1.1 e-2 1.09e-1 ns     
##  2 nWBV  Less than highschool Some …    23    75    296. 2.04e-6 2.04e-5 ****   
##  3 nWBV  Less than highschool Colle…    23    49    282. 7.04e-4 7   e-3 **     
##  4 nWBV  Less than highschool Beyon…    23    51    330. 3   e-3 2.8 e-2 *      
##  5 nWBV  High school grad.    Some …    64    75   1442. 5.25e-5 5.25e-4 ***    
##  6 nWBV  High school grad.    Colle…    64    49   1349  2.05e-1 1   e+0 ns     
##  7 nWBV  High school grad.    Beyon…    64    51   1461  3.37e-1 1   e+0 ns     
##  8 nWBV  Some college         Colle…    75    49   2294. 2   e-2 1.99e-1 ns     
##  9 nWBV  Some college         Beyon…    75    51   2461  6   e-3 6.4 e-2 ns     
## 10 nWBV  College grad.        Beyon…    49    51   1310. 6.82e-1 1   e+0 ns     
## # … with abbreviated variable names ¹statistic, ²p.adj.signif

For easier visualisation a graph is made to represent the Kruskal-Wallis results:

pwc <- hmdata %>%
  wilcox_test(nWBV ~ EducF, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(nWBV ~ EducF, 
                                data = hmdata)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)
ggboxplot(hmdata, x = "EducF", y = "nWBV", add = "point", ylim=c(0.62, 1)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = EducF), color = "darkred",
               position = position_dodge(width = 0.8))+
  theme(axis.text.x=element_text(size=10))+
  stat_summary(fun = median, colour = "darkred", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = EducF)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

It was found that distribution of whole brain volume differs for least one education level (𝜒2 = 32.4, 𝑝 < 0.001), while the effect size was moderate (𝜂2 = 0.11). Post-Hoc tests revealed differences for four groups, namely less than high school with some college, college grad and beyond college as well as high school grad and having some college. This shows that whole brain volume is related to the level of education.

Hypothesis about association between two categorical variables

hmdata1 <- subset(hmdata, select = c(Gender, Group))
head(hmdata1)

##    Gender       Group
## 1       F Nondemented
## 2       F Nondemented
## 3       F    Demented
## 9       M Nondemented
## 10      F Nondemented
## 12      F Nondemented

H0 : There is no association between gender and dementia status.

H1 : There is association between gender and dementia status.

results <- chisq.test(hmdata1$Gender, hmdata1$Group,
                      correct = TRUE)
results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hmdata1$Gender and hmdata1$Group
## X-squared = 3.4641, df = 1, p-value = 0.06271

addmargins(results$observed)

##               hmdata1$Group
## hmdata1$Gender Demented Nondemented Sum
##            F         59         115 174
##            M         41          47  88
##            Sum      100         162 262

round(results$expected, 2)

##               hmdata1$Group
## hmdata1$Gender Demented Nondemented
##              F    66.41      107.59
##              M    33.59       54.41

round(results$residuals, 2)

##               hmdata1$Group
## hmdata1$Gender Demented Nondemented
##              F    -0.91        0.71
##              M     1.28       -1.00

The assumptions are not violated. The null hypothesis is rejected at p < 0.062.

We cannot say that the number of either male or female show a significant difference than expected (+-1.96).

addmargins(round(prop.table(results$observed), 3))

##               hmdata1$Group
## hmdata1$Gender Demented Nondemented   Sum
##            F      0.225       0.439 0.664
##            M      0.156       0.179 0.335
##            Sum    0.381       0.618 0.999

addmargins(round(prop.table(results$observed, 1), 3), 2)

##               hmdata1$Group
## hmdata1$Gender Demented Nondemented   Sum
##              F    0.339       0.661 1.000
##              M    0.466       0.534 1.000

addmargins(round(prop.table(results$observed, 2), 3), 1)

##               hmdata1$Group
## hmdata1$Gender Demented Nondemented
##            F       0.59        0.71
##            M       0.41        0.29
##            Sum     1.00        1.00

Out of all females tested, 33.9% were diagnosed with dementia. Out of all the people that were diagnosed with dementia, 59% were female and 41% were male.

effectsize::cramers_v(hmdata$Gender, hmdata$Group)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.11              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.11)

## [1] "small"
## (Rules: funder2019)

There is a small effect between gender and dementia group.

Homework 2

Hana Murovec

2023-01-11

Hypothesis about the difference between two arithmetic means

Hypothesis about the equality of five population arithmetic means

Hypothesis about association between two categorical variables