hmdata <- read.csv("~/IMB/R/HW2/oasis_cross-sectional.csv")
head(hmdata)
## ID M.F Hand Age Educ SES MMSE CDR eTIV nWBV ASF Delay
## 1 OAS1_0001_MR1 F R 74 2 3 29 0.0 1344 0.743 1.306 N/A
## 2 OAS1_0002_MR1 F R 55 4 1 29 0.0 1147 0.810 1.531 N/A
## 3 OAS1_0003_MR1 F R 73 4 3 27 0.5 1454 0.708 1.207 N/A
## 4 OAS1_0004_MR1 M R 28 NA NA NA NA 1588 0.803 1.105 N/A
## 5 OAS1_0005_MR1 M R 18 NA NA NA NA 1737 0.848 1.010 N/A
## 6 OAS1_0006_MR1 F R 24 NA NA NA NA 1131 0.862 1.551 N/A
The unit of observation is an individual, while the original sample size was 436. Since the targeted population are people above 40 years old, the cleaned data consists of 262 subjects. The data was obtained from OASIS cross-sectional on brain imaging.
Variable description:
The purpose of this study is to observe the difference between demented and non-demented people above 40 years old through 3 different types of hypothesis testing. The research questions are the following:
What is the difference in the arithmetic means of whole brain volume between demented and non-demented people?
How do the arithmetic means of five populations differ?
What is the relationship between the gender and dementia groups?
As mentioned the targeted age is above 40, which was filtered as well as the irrelevant variables for the research were removed.
hmdata <- hmdata[hmdata$Age >= 40, ]
hmdata <- subset(hmdata, select = -c(Hand, Delay))
colnames(hmdata)[colnames(hmdata)== "M.F"] <- "Gender"
Next the missing values per column/variable need to be determined and imputated based on median. Median was used based on the research of other N/A values for similar data sets on brain function.
colSums(is.na(hmdata))
## ID Gender Age Educ SES MMSE CDR eTIV nWBV ASF
## 0 0 0 29 48 29 29 0 0 0
hmdata$Educ <- impute(hmdata$Educ, median)
hmdata$SES <- impute(hmdata$SES, median)
hmdata$MMSE <- impute(hmdata$MMSE, median)
hmdata$CDR <- impute(hmdata$CDR, median)
head(hmdata)
## ID Gender Age Educ SES MMSE CDR eTIV nWBV ASF
## 1 OAS1_0001_MR1 F 74 2 3 29 0.0 1344 0.743 1.306
## 2 OAS1_0002_MR1 F 55 4 1 29 0.0 1147 0.810 1.531
## 3 OAS1_0003_MR1 F 73 4 3 27 0.5 1454 0.708 1.207
## 9 OAS1_0010_MR1 M 74 5 2 30 0.0 1636 0.689 1.073
## 10 OAS1_0011_MR1 F 52 3 2 30 0.0 1321 0.827 1.329
## 12 OAS1_0013_MR1 F 81 5 2 30 0.0 1664 0.679 1.055
colSums(is.na(hmdata))
## ID Gender Age Educ SES MMSE CDR eTIV nWBV ASF
## 0 0 0 0 0 0 0 0 0 0
hmdata$Group <- ifelse(hmdata$CDR > 0.4, "Demented", "Nondemented")
H0: The arithmetic means of dementia groups and whole brain volume are equal.
H1: The arithmetic means of dementia groups and whole brain volume differ.
describeBy(hmdata$nWBV, g=hmdata$Group)
##
## Descriptive statistics by group
## group: Demented
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.72 0.04 0.72 0.72 0.04 0.64 0.8 0.15 -0.03 -0.73 0
## ------------------------------------------------------------
## group: Nondemented
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 162 0.78 0.05 0.78 0.78 0.05 0.64 0.86 0.22 -0.5 -0.5 0
Nondemented <- ggplot(hmdata[hmdata$Group == "Nondemented", ], aes (x = nWBV)) +
theme_linedraw()+
geom_histogram(fill = "darkseagreen", colour = "darkseagreen4")+
ggtitle("Nondemented")
Demented <- ggplot(hmdata[hmdata$Group == "Demented", ], aes (x = nWBV)) +
theme_linedraw()+
geom_histogram(fill = "darkseagreen", colour = "darkseagreen4")+
ggtitle("Demented")
ggarrange(Nondemented, Demented,
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To test whether the normality assumption the following test is performed:
hmdata %>%
group_by(Group) %>%
shapiro_test(nWBV)
## # A tibble: 2 × 4
## Group variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Demented nWBV 0.988 0.496
## 2 Nondemented nWBV 0.967 0.000597
The test showed that the data from nondemented population is not normally distributed, which violates the normality assumption stating that both data should be normally distributed, meaning that a non-parametric alternative needs to be used. Resulting in Wilcoxon rank sum test is performed. The hypothesis is the following:
H0: Distribution locations of demented and non-demented are the same.
H1: distribution locations of demented and non-demented are different.
wilcox.test(hmdata$nWBV ~ hmdata$Group,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: hmdata$nWBV by hmdata$Group
## W = 2908, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The distribution location of normalized whole brain volume is not equal between demented and nondemented based on the p-value.
effectsize(wilcox.test(hmdata$nWBV ~ hmdata$Group,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.64 | [-0.72, -0.55]
interpret_rank_biserial(0.64)
## [1] "very large"
## (Rules: funder2019)
Based on the sample data, we find that demented and nondemented differ in the normalized whole brain volume (𝑝 < 0.001). Nondemented brain volume is larger, which is shown by the very large difference in distribution (𝑟 = 0.64).
In this section the equality of five population arithmetic means for independent sampling will be tested. The independent variable is the level of education, where there are 5 different (1: less than high school grad., 2: high school grad., 3: some college, 4: college grad., 5: beyond college). The dependent variable is the normalized whole brain volume.
The hypothesis is the following:
H0: µ1 = µ2 = µ3 = µ4 = µ5
H1: At least one µ is different
hmdata$EducF <- factor(hmdata$Educ,
levels = c(1,2,3,4,5),
labels = c("Less than highschool", "High school grad.", "Some college", "College grad.", "Beyond college"))
hmdata %>%
group_by(EducF)%>%
count()
## # A tibble: 5 × 2
## # Groups: EducF [5]
## EducF n
## <fct> <int>
## 1 Less than highschool 23
## 2 High school grad. 64
## 3 Some college 75
## 4 College grad. 49
## 5 Beyond college 51
describeBy(x = hmdata$eTIV, group = hmdata$EducF)
##
## Descriptive statistics by group
## group: Less than highschool
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 23 1471.04 139.97 1461 1460.32 139.36 1295 1750 455 0.62 -0.95
## se
## X1 29.19
## ------------------------------------------------------------
## group: High school grad.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 64 1428.17 130.96 1445.5 1424.15 111.94 1161 1762 601 0.27 0.01
## se
## X1 16.37
## ------------------------------------------------------------
## group: Some college
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 75 1444.79 163.28 1411 1431.72 124.54 1154 1992 838 0.9 1.05
## se
## X1 18.85
## ------------------------------------------------------------
## group: College grad.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 49 1471.78 173.18 1454 1461.29 160.12 1123 1891 768 0.58 -0.05
## se
## X1 24.74
## ------------------------------------------------------------
## group: Beyond college
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 51 1507.76 182.14 1534 1507.32 226.84 1174 1911 737 0.02 -0.93
## se
## X1 25.5
First the assumptions need to be checked, beginning with the variance of the groups with levene test for homogeneity of variance.
H0: The variance of the whole brain volume among all groups is equal.
H1: The variance of the whole brain volume among groups is different (at least two of them differ).
leveneTest(hmdata$nWBV, group = hmdata$EducF)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.3359 0.8536
## 257
Based on the p-value of 0.8536, the null hypothesis is accepted, showing no violation of the first assumption.
Next the normality assumption is tested.
H0: All sample groups are normally distributed.
H1: At least one sample group is not normally distributed.
hmdata %>%
group_by(EducF) %>%
shapiro_test(nWBV)
## # A tibble: 5 × 4
## EducF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Less than highschool nWBV 0.930 0.112
## 2 High school grad. nWBV 0.985 0.635
## 3 Some college nWBV 0.949 0.00435
## 4 College grad. nWBV 0.953 0.0480
## 5 Beyond college nWBV 0.963 0.117
Since two groups (some college and college grad.) are not normally distributed, Kruskal-Wallis Rank Sum Test is used in continuation.
kruskal.test(nWBV ~ EducF,
data = hmdata)
##
## Kruskal-Wallis rank sum test
##
## data: nWBV by EducF
## Kruskal-Wallis chi-squared = 32.384, df = 4, p-value = 1.597e-06
kruskal_effsize(nWBV ~ EducF,
data = hmdata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 nWBV 262 0.110 eta2[H] moderate
Groups_nonpar <- wilcox_test(nWBV ~ EducF,
paired = FALSE,
p.adjust.method = "bonferroni",
data = hmdata)
Groups_nonpar
## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 stati…¹ p p.adj p.adj…²
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 nWBV Less than highschool High … 23 64 471 1.1 e-2 1.09e-1 ns
## 2 nWBV Less than highschool Some … 23 75 296. 2.04e-6 2.04e-5 ****
## 3 nWBV Less than highschool Colle… 23 49 282. 7.04e-4 7 e-3 **
## 4 nWBV Less than highschool Beyon… 23 51 330. 3 e-3 2.8 e-2 *
## 5 nWBV High school grad. Some … 64 75 1442. 5.25e-5 5.25e-4 ***
## 6 nWBV High school grad. Colle… 64 49 1349 2.05e-1 1 e+0 ns
## 7 nWBV High school grad. Beyon… 64 51 1461 3.37e-1 1 e+0 ns
## 8 nWBV Some college Colle… 75 49 2294. 2 e-2 1.99e-1 ns
## 9 nWBV Some college Beyon… 75 51 2461 6 e-3 6.4 e-2 ns
## 10 nWBV College grad. Beyon… 49 51 1310. 6.82e-1 1 e+0 ns
## # … with abbreviated variable names ¹statistic, ²p.adj.signif
For easier visualisation a graph is made to represent the Kruskal-Wallis results:
pwc <- hmdata %>%
wilcox_test(nWBV ~ EducF,
paired = FALSE,
p.adjust.method = "bonferroni")
Kruskal_results <- kruskal_test(nWBV ~ EducF,
data = hmdata)
library(rstatix)
pwc <- pwc %>%
add_y_position(fun = "median", step.increase = 0.35)
library(ggpubr)
ggboxplot(hmdata, x = "EducF", y = "nWBV", add = "point", ylim=c(0.62, 1)) +
stat_pvalue_manual(pwc, hide.ns = FALSE) +
stat_summary(fun = median, geom = "point", shape = 16, size = 4,
aes(group = EducF), color = "darkred",
position = position_dodge(width = 0.8))+
theme(axis.text.x=element_text(size=10))+
stat_summary(fun = median, colour = "darkred",
position = position_dodge(width = 0.8),
geom = "text", vjust = -0.5, hjust = -1,
aes(label = round(after_stat(y), digits = 2), group = EducF)) +
labs(subtitle = get_test_label(Kruskal_results, detailed = TRUE),
caption = get_pwc_label(pwc))
It was found that distribution of whole brain volume differs for least one education level (𝜒2 = 32.4, 𝑝 < 0.001), while the effect size was moderate (𝜂2 = 0.11). Post-Hoc tests revealed differences for four groups, namely less than high school with some college, college grad and beyond college as well as high school grad and having some college. This shows that whole brain volume is related to the level of education.
hmdata1 <- subset(hmdata, select = c(Gender, Group))
head(hmdata1)
## Gender Group
## 1 F Nondemented
## 2 F Nondemented
## 3 F Demented
## 9 M Nondemented
## 10 F Nondemented
## 12 F Nondemented
H0 : There is no association between gender and dementia status.
H1 : There is association between gender and dementia status.
results <- chisq.test(hmdata1$Gender, hmdata1$Group,
correct = TRUE)
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hmdata1$Gender and hmdata1$Group
## X-squared = 3.4641, df = 1, p-value = 0.06271
addmargins(results$observed)
## hmdata1$Group
## hmdata1$Gender Demented Nondemented Sum
## F 59 115 174
## M 41 47 88
## Sum 100 162 262
round(results$expected, 2)
## hmdata1$Group
## hmdata1$Gender Demented Nondemented
## F 66.41 107.59
## M 33.59 54.41
round(results$residuals, 2)
## hmdata1$Group
## hmdata1$Gender Demented Nondemented
## F -0.91 0.71
## M 1.28 -1.00
The assumptions are not violated. The null hypothesis is rejected at p < 0.062.
We cannot say that the number of either male or female show a significant difference than expected (+-1.96).
addmargins(round(prop.table(results$observed), 3))
## hmdata1$Group
## hmdata1$Gender Demented Nondemented Sum
## F 0.225 0.439 0.664
## M 0.156 0.179 0.335
## Sum 0.381 0.618 0.999
addmargins(round(prop.table(results$observed, 1), 3), 2)
## hmdata1$Group
## hmdata1$Gender Demented Nondemented Sum
## F 0.339 0.661 1.000
## M 0.466 0.534 1.000
addmargins(round(prop.table(results$observed, 2), 3), 1)
## hmdata1$Group
## hmdata1$Gender Demented Nondemented
## F 0.59 0.71
## M 0.41 0.29
## Sum 1.00 1.00
Out of all females tested, 33.9% were diagnosed with dementia. Out of all the people that were diagnosed with dementia, 59% were female and 41% were male.
effectsize::cramers_v(hmdata$Gender, hmdata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.11 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.11)
## [1] "small"
## (Rules: funder2019)
There is a small effect between gender and dementia group.