Analysis of variance
Before begin..
Let’s load the SBP dataset.
dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")
head(dataset_sbp)
## SEX BTH_G SBP DBP FBS DIS BMI
## 1 1 1 116 78 94 4 16.6
## 2 1 1 100 60 79 4 22.3
## 3 1 1 100 60 87 4 21.9
## 4 1 1 111 70 72 4 20.2
## 5 1 1 120 80 98 4 20.0
## 6 1 1 115 79 95 4 23.1
Making a new variable hypertension
dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 |
dataset_sbp$DBP > 80,
T,
F)
dataset_sbp$history_of_hypertension <- ifelse(dataset_sbp$DIS == 1 |
dataset_sbp$DIS == 2,
T,
F)
dataset_sbp$DIS <- factor(dataset_sbp$DIS,
levels = c(1,2,3,4),
labels = c("Hypertension and diabetes",
"Hypertension",
"Diabetes",
"No history"))
set.seed(1)
dataset_sbp_small <- subset(dataset_sbp, row.names(dataset_sbp) %in% sample(x = 1:1000000, size = 1000))
DIS groups
ggplot(dataset_sbp, aes(y = SBP, x = DIS)) +
geom_boxplot() +
xlab("History of hypertension or diabetes") +
ylab("SBP (mmHg)") +
theme_classic(base_size = 12)
T-test?
DIS group 1 vs 4
t.test(dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP,
dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP,
var.equal = F)
##
## Welch Two Sample t-test
##
## data: dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP
## t = 169.59, df = 59802, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.18067 11.44213
## sample estimates:
## mean of x mean of y
## 130.5640 119.2526
DIS group 2 vs 4
t.test(dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP,
dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP,
var.equal = F)
##
## Welch Two Sample t-test
##
## data: dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP
## t = 282.46, df = 225561, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.22032 11.37713
## sample estimates:
## mean of x mean of y
## 130.5513 119.2526
DIS group 3 vs 4
t.test(dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP,
dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP,
var.equal = F)
##
## Welch Two Sample t-test
##
## data: dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP
## t = 60.351, df = 48165, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.937403 4.201738
## sample estimates:
## mean of x mean of y
## 123.3221 119.2526
DIS group 1 vs 2
t.test(dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP,
dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP,
var.equal = F)
##
## Welch Two Sample t-test
##
## data: dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP
## t = 0.16998, df = 90286, p-value = 0.865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1334474 0.1587919
## sample estimates:
## mean of x mean of y
## 130.5640 130.5513
….
Do you think it is reasonable?
Linear model?
lm(data = dataset_sbp, SBP ~ DIS) %>%
summary
##
## Call:
## lm(formula = SBP ~ DIS, data = dataset_sbp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.564 -9.253 -0.253 9.747 70.747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.56397 0.05974 2185.532 <2e-16 ***
## DISHypertension -0.01267 0.06884 -0.184 0.854
## DISDiabetes -7.24183 0.08938 -81.022 <2e-16 ***
## DISNo history -11.31140 0.06186 -182.866 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.8 on 999996 degrees of freedom
## Multiple R-squared: 0.1013, Adjusted R-squared: 0.1013
## F-statistic: 3.756e+04 on 3 and 999996 DF, p-value: < 2.2e-16
It shows how different whether the group difference estimates with beta values (as well as CIs and p-values).
But can we assess wheter the DIS
is a significant factor
affecting SBP?
Here comes the analysis of variance.
Analysis of variance
Let’s say we are comparing multipel groups
Making a new dataset
midterm <- data.frame(midterm = c(82, 83, 97, 83, 78, 68, 38, 59, 55),
major = rep(c("BE", "MBE", "CHEM"), times=1, each=3))
mean_sim <- 10
std_sim <- 5
lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)
ggplot(data = data.frame(u = c(lcb, ucb)),
mapping = aes(x = u)) +
stat_function(mapping = aes(colour = "Group 1"),
fun = dnorm,
args = list(mean = 20,
sd = std_sim)) +
stat_function(mapping = aes(colour = "Group 2"),
fun = dnorm,
args = list(mean = 40,
sd = (std_sim))) +
stat_function(mapping = aes(colour = "Group 3"),
fun = dnorm,
args = list(mean = 15,
sd = (std_sim))) +
scale_colour_manual(values = c("red", "blue", "green")) +
labs(x = "Values",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
Variance of group means
mean_sim <- 10
std_sim <- 5
lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)
ggplot(data = data.frame(u = c(lcb, ucb)),
mapping = aes(x = u)) +
stat_function(mapping = aes(colour = "Group 1"),
fun = dnorm,
args = list(mean = 20,
sd = std_sim)) +
stat_function(mapping = aes(colour = "Group 2"),
fun = dnorm,
args = list(mean = 40,
sd = (std_sim))) +
stat_function(mapping = aes(colour = "Group 3"),
fun = dnorm,
args = list(mean = 15,
sd = (std_sim))) +
geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
scale_colour_manual(values = c("grey", "grey", "grey")) +
geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
labs(x = "Values",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
# Uncertainty - variance within groups
with smaller variance within group
mean_sim <- 10
std_sim <- 5
lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)
ggplot(data = data.frame(u = c(lcb, ucb)),
mapping = aes(x = u)) +
stat_function(mapping = aes(colour = "Group 1"),
fun = dnorm,
args = list(mean = 20,
sd = 0.4*std_sim)) +
stat_function(mapping = aes(colour = "Group 2"),
fun = dnorm,
args = list(mean = 40,
sd = 0.4*(std_sim))) +
stat_function(mapping = aes(colour = "Group 3"),
fun = dnorm,
args = list(mean = 15,
sd = 0.4*(std_sim))) +
geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
scale_colour_manual(values = c("grey", "grey", "grey")) +
geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
labs(x = "Values",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
Extremely smalle variance
mean_sim <- 10
std_sim <- 5
lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)
ggplot(data = data.frame(u = c(lcb, ucb)),
mapping = aes(x = u)) +
stat_function(mapping = aes(colour = "Group 1"),
fun = dnorm,
args = list(mean = 20,
sd = 0.01*std_sim)) +
stat_function(mapping = aes(colour = "Group 2"),
fun = dnorm,
args = list(mean = 40,
sd = 0.01*(std_sim))) +
stat_function(mapping = aes(colour = "Group 3"),
fun = dnorm,
args = list(mean = 15,
sd = 0.01*(std_sim))) +
geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
scale_colour_manual(values = c("grey", "grey", "grey")) +
geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
labs(x = "Values",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
With extremly high sd
mean_sim <- 10
std_sim <- 5
lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)
ggplot(data = data.frame(u = c(lcb, ucb)),
mapping = aes(x = u)) +
stat_function(mapping = aes(colour = "Group 1"),
fun = dnorm,
args = list(mean = 20,
sd = 10*std_sim)) +
stat_function(mapping = aes(colour = "Group 2"),
fun = dnorm,
args = list(mean = 40,
sd = 10*(std_sim))) +
stat_function(mapping = aes(colour = "Group 3"),
fun = dnorm,
args = list(mean = 15,
sd = 10*(std_sim))) +
geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
scale_colour_manual(values = c("grey", "grey", "grey")) +
geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
ylim(c(0,0.01)) +
labs(x = "Values",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
Looking at extremecases clearifies what analysis of variance is testing.
Midterm dataset
ggplot(data = midterm) +
stat_function(mapping = aes(colour = "BE"),
fun = dnorm,
args = list(mean = midterm %>% filter(major == "BE") %>% .$midterm %>% mean,
sd = midterm %>% filter(major == "BE") %>% .$midterm %>% sd)) +
stat_function(mapping = aes(colour = "MBE"),
fun = dnorm,
args = list(mean = midterm %>% filter(major == "MBE") %>% .$midterm %>% mean,
sd = midterm %>% filter(major == "MBE") %>% .$midterm %>% sd)) +
stat_function(mapping = aes(colour = "CHEM"),
fun = dnorm,
args = list(mean = midterm %>% filter(major == "CHEM") %>% .$midterm %>% mean,
sd = midterm %>% filter(major == "CHEM") %>% .$midterm %>% sd)) +
scale_color_brewer(type = "qual", palette = 2) +
xlim(c(0, 100)) +
labs(x = "Midterm score",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
SBP dataset
In histogram, the data can be shown as
ggplot(data = dataset_sbp) +
stat_function(mapping = aes(colour = "History of hypertension and diabetes"),
fun = dnorm,
args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP %>% mean,
sd = dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP %>% sd)) +
stat_function(mapping = aes(colour = "History of hypertension"),
fun = dnorm,
args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP %>% mean,
sd = dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP %>% sd)) +
stat_function(mapping = aes(colour = "History of diabetes"),
fun = dnorm,
args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP %>% mean,
sd = dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP %>% sd)) +
stat_function(mapping = aes(colour = "No history"),
fun = dnorm,
args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP %>% mean,
sd = dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP %>% sd)) +
scale_color_brewer(type = "qual", palette = 2) +
xlim(c(100,150)) +
labs(x = "SBP (mmHg)",
y = "Densities",
col = "Groups") +
theme_classic(base_size = 12)
ANOVA in R
aov()
function can calculate the result of analysis of
variance
aov(midterm ~ major, data = midterm) %>%
summary
## Df Sum Sq Mean Sq F value Pr(>F)
## major 2 2124 1062.1 12.59 0.00712 **
## Residuals 6 506 84.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov(SBP ~ DIS, data = dataset_sbp_small) %>%
summary
## Df Sum Sq Mean Sq F value Pr(>F)
## DIS 3 29932 9977 53.54 <2e-16 ***
## Residuals 996 185607 186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Else, we can implement anova into effect modifications
aov(SBP ~ SEX + BTH_G * BMI + FBS + DIS * DBP, data = dataset_sbp_small) %>%
summary
## Df Sum Sq Mean Sq F value Pr(>F)
## SEX 1 6475 6475 82.630 < 2e-16 ***
## BTH_G 1 23618 23618 301.394 < 2e-16 ***
## BMI 1 10826 10826 138.150 < 2e-16 ***
## FBS 1 2905 2905 37.073 1.63e-09 ***
## DIS 3 8318 2773 35.383 < 2e-16 ***
## DBP 1 85682 85682 1093.431 < 2e-16 ***
## BTH_G:BMI 1 0 0 0.000 0.985
## DIS:DBP 3 374 125 1.593 0.190
## Residuals 987 77342 78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(SBP ~ SEX + BTH_G * BMI + FBS + DIS * DBP, data = dataset_sbp_small) %>%
summary
##
## Call:
## lm(formula = SBP ~ SEX + BTH_G * BMI + FBS + DIS * DBP, data = dataset_sbp_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.193 -5.601 -0.520 5.510 38.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.345205 11.941720 4.132 3.90e-05 ***
## SEX -0.976183 0.589594 -1.656 0.0981 .
## BTH_G 0.256344 0.292661 0.876 0.3813
## BMI 0.245572 0.187394 1.310 0.1903
## FBS 0.019089 0.012977 1.471 0.1416
## DISHypertension -4.553308 12.265896 -0.371 0.7106
## DISDiabetes -1.391342 17.500850 -0.080 0.9366
## DISNo history -17.968618 11.307352 -1.589 0.1124
## DBP 0.859173 0.134571 6.385 2.64e-10 ***
## BTH_G:BMI 0.001867 0.012488 0.149 0.8812
## DISHypertension:DBP 0.054381 0.150584 0.361 0.7181
## DISDiabetes:DBP -0.033645 0.224297 -0.150 0.8808
## DISNo history:DBP 0.178389 0.138652 1.287 0.1985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.852 on 987 degrees of freedom
## Multiple R-squared: 0.6412, Adjusted R-squared: 0.6368
## F-statistic: 147 on 12 and 987 DF, p-value: < 2.2e-16
aov(SBP ~ BTH_G * SEX * DIS * DBP, data = dataset_sbp_small) %>%
summary
## Df Sum Sq Mean Sq F value Pr(>F)
## BTH_G 1 22804 22804 293.300 < 2e-16 ***
## SEX 1 7289 7289 93.744 < 2e-16 ***
## DIS 3 11307 3769 48.475 < 2e-16 ***
## DBP 1 95484 95484 1228.103 < 2e-16 ***
## BTH_G:SEX 1 854 854 10.990 0.00095 ***
## BTH_G:DIS 3 700 233 3.001 0.02973 *
## SEX:DIS 3 453 151 1.942 0.12116
## BTH_G:DBP 1 157 157 2.024 0.15514
## SEX:DBP 1 29 29 0.376 0.53987
## DIS:DBP 3 137 46 0.588 0.62266
## BTH_G:SEX:DIS 3 136 45 0.584 0.62536
## BTH_G:SEX:DBP 1 0 0 0.000 0.98243
## BTH_G:DIS:DBP 3 343 114 1.469 0.22155
## SEX:DIS:DBP 3 364 121 1.561 0.19722
## BTH_G:SEX:DIS:DBP 3 220 73 0.945 0.41823
## Residuals 968 75261 78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova()
function accepts linear model outputs
lm(SBP ~ BTH_G * SEX * DIS * DBP, data = dataset_sbp_small) %>%
summary
##
## Call:
## lm(formula = SBP ~ BTH_G * SEX * DIS * DBP, data = dataset_sbp_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.013 -5.341 -0.617 5.572 43.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -201.34069 205.73745 -0.979 0.3280
## BTH_G 14.12147 9.50847 1.485 0.1378
## SEX 90.26545 144.25017 0.626 0.5316
## DISHypertension 288.35627 216.54324 1.332 0.1833
## DISDiabetes 178.64106 332.71491 0.537 0.5914
## DISNo history 229.88453 206.55369 1.113 0.2660
## DBP 4.02042 2.45536 1.637 0.1019
## BTH_G:SEX -5.37119 6.18981 -0.868 0.3857
## BTH_G:DISHypertension -16.87641 10.12132 -1.667 0.0958 .
## BTH_G:DISDiabetes -13.99269 16.88817 -0.829 0.4076
## BTH_G:DISNo history -12.67335 9.61234 -1.318 0.1877
## SEX:DISHypertension -133.96369 152.76266 -0.877 0.3807
## SEX:DISDiabetes -58.21450 205.47704 -0.283 0.7770
## SEX:DISNo history -83.80291 144.69323 -0.579 0.5626
## BTH_G:DBP -0.16922 0.11451 -1.478 0.1398
## SEX:DBP -1.14763 1.74774 -0.657 0.5116
## DISHypertension:DBP -3.59142 2.58840 -1.388 0.1656
## DISDiabetes:DBP -1.93448 4.10562 -0.471 0.6376
## DISNo history:DBP -2.76757 2.46747 -1.122 0.2623
## BTH_G:SEX:DISHypertension 8.28844 6.64491 1.247 0.2126
## BTH_G:SEX:DISDiabetes 6.19111 9.77716 0.633 0.5267
## BTH_G:SEX:DISNo history 4.53109 6.24898 0.725 0.4686
## BTH_G:SEX:DBP 0.06734 0.07496 0.898 0.3692
## BTH_G:DISHypertension:DBP 0.20757 0.12216 1.699 0.0896 .
## BTH_G:DISDiabetes:DBP 0.15526 0.21062 0.737 0.4612
## BTH_G:DISNo history:DBP 0.15002 0.11603 1.293 0.1963
## SEX:DISHypertension:DBP 1.66182 1.85315 0.897 0.3701
## SEX:DISDiabetes:DBP 0.60939 2.55070 0.239 0.8112
## SEX:DISNo history:DBP 0.99992 1.75449 0.570 0.5689
## BTH_G:SEX:DISHypertension:DBP -0.10115 0.08066 -1.254 0.2102
## BTH_G:SEX:DISDiabetes:DBP -0.06943 0.12222 -0.568 0.5701
## BTH_G:SEX:DISNo history:DBP -0.05381 0.07584 -0.709 0.4782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.818 on 968 degrees of freedom
## Multiple R-squared: 0.6508, Adjusted R-squared: 0.6396
## F-statistic: 58.2 on 31 and 968 DF, p-value: < 2.2e-16
https://rpubs.com/rslbliss/r_mlm_ws
https://methods101.com.au/docs/soci832_11_3_multilevel_models_week_11/
Dataset is students from 10 handpicked schools, representing a subset of students and schools from a US survey of eight-grade students at 1000 schools (800 public 200 private).
There are quite a lot of variables in the dataset, but we are going start by using just three:
Bibliography
## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.