COD_week12_2_MGK_BTE3207

Minsik Kim

2023-11-15

Analysis of variance

Before begin..

Let’s load the SBP dataset.

dataset_sbp <- read.csv(file = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics/dataset/sbp_dataset_korea_2013-2014.csv")



head(dataset_sbp)

##   SEX BTH_G SBP DBP FBS DIS  BMI
## 1   1     1 116  78  94   4 16.6
## 2   1     1 100  60  79   4 22.3
## 3   1     1 100  60  87   4 21.9
## 4   1     1 111  70  72   4 20.2
## 5   1     1 120  80  98   4 20.0
## 6   1     1 115  79  95   4 23.1

Making a new variable hypertension

dataset_sbp$hypertension <- ifelse(dataset_sbp$SBP > 130 |
                                           dataset_sbp$DBP > 80,
                                   T,
                                   F)


dataset_sbp$history_of_hypertension <- ifelse(dataset_sbp$DIS == 1 |
                                           dataset_sbp$DIS == 2,
                                   T,
                                   F)

dataset_sbp$DIS <- factor(dataset_sbp$DIS, 
                          levels = c(1,2,3,4),
                          labels = c("Hypertension and diabetes",
                                     "Hypertension",
                                     "Diabetes",
                                     "No history"))

set.seed(1)

dataset_sbp_small <- subset(dataset_sbp, row.names(dataset_sbp) %in% sample(x = 1:1000000, size = 1000))

DIS groups

ggplot(dataset_sbp, aes(y = SBP, x = DIS)) +
        geom_boxplot() +
        xlab("History of hypertension or diabetes") +
        ylab("SBP (mmHg)") +
        theme_classic(base_size = 12)

T-test?

DIS group 1 vs 4

t.test(dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP,
       dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP,
       var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP
## t = 169.59, df = 59802, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  11.18067 11.44213
## sample estimates:
## mean of x mean of y 
##  130.5640  119.2526

DIS group 2 vs 4

t.test(dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP,
       dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP,
       var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP
## t = 282.46, df = 225561, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  11.22032 11.37713
## sample estimates:
## mean of x mean of y 
##  130.5513  119.2526

DIS group 3 vs 4

t.test(dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP,
       dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP,
       var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP
## t = 60.351, df = 48165, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.937403 4.201738
## sample estimates:
## mean of x mean of y 
##  123.3221  119.2526

DIS group 1 vs 2

t.test(dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP,
       dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP,
       var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP and dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP
## t = 0.16998, df = 90286, p-value = 0.865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1334474  0.1587919
## sample estimates:
## mean of x mean of y 
##  130.5640  130.5513

….

Do you think it is reasonable?

Linear model?

lm(data = dataset_sbp, SBP ~ DIS) %>% 
        summary

## 
## Call:
## lm(formula = SBP ~ DIS, data = dataset_sbp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.564  -9.253  -0.253   9.747  70.747 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     130.56397    0.05974 2185.532   <2e-16 ***
## DISHypertension  -0.01267    0.06884   -0.184    0.854    
## DISDiabetes      -7.24183    0.08938  -81.022   <2e-16 ***
## DISNo history   -11.31140    0.06186 -182.866   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.8 on 999996 degrees of freedom
## Multiple R-squared:  0.1013, Adjusted R-squared:  0.1013 
## F-statistic: 3.756e+04 on 3 and 999996 DF,  p-value: < 2.2e-16

It shows how different whether the group difference estimates with beta values (as well as CIs and p-values).

But can we assess wheter the DIS is a significant factor affecting SBP?

Here comes the analysis of variance.

Analysis of variance

Let’s say we are comparing multipel groups

Making a new dataset

midterm <- data.frame(midterm = c(82, 83, 97, 83, 78, 68, 38, 59, 55),
           major = rep(c("BE", "MBE", "CHEM"), times=1, each=3))

mean_sim <- 10
std_sim <- 5

lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)

ggplot(data = data.frame(u = c(lcb, ucb)),
       mapping = aes(x = u)) +
  stat_function(mapping = aes(colour = "Group 1"),
                fun = dnorm,
                args = list(mean = 20,
                            sd = std_sim)) +
  stat_function(mapping = aes(colour = "Group 2"),
                fun = dnorm,
                args = list(mean = 40,
                            sd = (std_sim))) +
        stat_function(mapping = aes(colour = "Group 3"),
                fun = dnorm,
                args = list(mean = 15,
                            sd = (std_sim))) +
  scale_colour_manual(values = c("red", "blue", "green")) +
  labs(x = "Values",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

Variance of group means

mean_sim <- 10
std_sim <- 5

lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)

ggplot(data = data.frame(u = c(lcb, ucb)),
       mapping = aes(x = u)) +
  stat_function(mapping = aes(colour = "Group 1"),
                fun = dnorm,
                args = list(mean = 20,
                            sd = std_sim)) +
  stat_function(mapping = aes(colour = "Group 2"),
                fun = dnorm,
                args = list(mean = 40,
                            sd = (std_sim))) +
        stat_function(mapping = aes(colour = "Group 3"),
                fun = dnorm,
                args = list(mean = 15,
                            sd = (std_sim))) +
        geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
        scale_colour_manual(values = c("grey", "grey", "grey")) +
        geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
  labs(x = "Values",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

# Uncertainty - variance within groups

with smaller variance within group

mean_sim <- 10
std_sim <- 5

lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)

ggplot(data = data.frame(u = c(lcb, ucb)),
       mapping = aes(x = u)) +
  stat_function(mapping = aes(colour = "Group 1"),
                fun = dnorm,
                args = list(mean = 20,
                            sd = 0.4*std_sim)) +
  stat_function(mapping = aes(colour = "Group 2"),
                fun = dnorm,
                args = list(mean = 40,
                            sd = 0.4*(std_sim))) +
        stat_function(mapping = aes(colour = "Group 3"),
                fun = dnorm,
                args = list(mean = 15,
                            sd = 0.4*(std_sim))) +
        geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
        scale_colour_manual(values = c("grey", "grey", "grey")) +
        geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
  labs(x = "Values",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

Extremely smalle variance

mean_sim <- 10
std_sim <- 5

lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)

ggplot(data = data.frame(u = c(lcb, ucb)),
       mapping = aes(x = u)) +
  stat_function(mapping = aes(colour = "Group 1"),
                fun = dnorm,
                args = list(mean = 20,
                            sd = 0.01*std_sim)) +
  stat_function(mapping = aes(colour = "Group 2"),
                fun = dnorm,
                args = list(mean = 40,
                            sd = 0.01*(std_sim))) +
        stat_function(mapping = aes(colour = "Group 3"),
                fun = dnorm,
                args = list(mean = 15,
                            sd = 0.01*(std_sim))) +
        geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
        scale_colour_manual(values = c("grey", "grey", "grey")) +
        geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
  labs(x = "Values",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

With extremly high sd

mean_sim <- 10
std_sim <- 5

lcb <- ((mean_sim - (3 * std_sim)) - 5)
ucb <- (((2 * mean_sim) + (3 * (2 * std_sim))) + 5)

ggplot(data = data.frame(u = c(lcb, ucb)),
       mapping = aes(x = u)) +
  stat_function(mapping = aes(colour = "Group 1"),
                fun = dnorm,
                args = list(mean = 20,
                            sd = 10*std_sim)) +
  stat_function(mapping = aes(colour = "Group 2"),
                fun = dnorm,
                args = list(mean = 40,
                            sd = 10*(std_sim))) +
        stat_function(mapping = aes(colour = "Group 3"),
                fun = dnorm,
                args = list(mean = 15,
                            sd = 10*(std_sim))) +
        geom_vline(xintercept = c(20, 40, 15), color= c("red", "blue", "green")) +
        scale_colour_manual(values = c("grey", "grey", "grey")) +
        geom_vline(xintercept = mean(c(15, 40, 20)), col = "Black", size = 2) +
        ylim(c(0,0.01)) +
  labs(x = "Values",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

Looking at extremecases clearifies what analysis of variance is testing.

Midterm dataset

ggplot(data = midterm) +
  stat_function(mapping = aes(colour = "BE"),
                fun = dnorm,
                args = list(mean = midterm %>% filter(major == "BE") %>% .$midterm %>% mean,
                            sd = midterm %>% filter(major == "BE") %>% .$midterm %>% sd)) +
  stat_function(mapping = aes(colour = "MBE"),
                fun = dnorm,
                args = list(mean = midterm %>% filter(major == "MBE") %>% .$midterm %>% mean,
                            sd = midterm %>% filter(major == "MBE") %>% .$midterm %>% sd)) +
        stat_function(mapping = aes(colour = "CHEM"),
                fun = dnorm,
                args = list(mean = midterm %>% filter(major == "CHEM") %>% .$midterm %>% mean,
                            sd = midterm %>% filter(major == "CHEM") %>% .$midterm %>% sd)) +
  scale_color_brewer(type = "qual", palette = 2) +
        xlim(c(0, 100)) +
  labs(x = "Midterm score",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

SBP dataset

In histogram, the data can be shown as

ggplot(data = dataset_sbp) +
  stat_function(mapping = aes(colour = "History of hypertension and diabetes"),
                fun = dnorm,
                args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP %>% mean,
                            sd = dataset_sbp %>% filter(as.numeric(DIS) == 1) %>% .$SBP %>% sd)) +
  stat_function(mapping = aes(colour = "History of hypertension"),
                fun = dnorm,
                args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP %>% mean,
                            sd = dataset_sbp %>% filter(as.numeric(DIS) == 2) %>% .$SBP %>% sd)) +
        stat_function(mapping = aes(colour = "History of diabetes"),
                fun = dnorm,
                args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP %>% mean,
                            sd = dataset_sbp %>% filter(as.numeric(DIS) == 3) %>% .$SBP %>% sd)) +
        stat_function(mapping = aes(colour = "No history"),
                fun = dnorm,
                args = list(mean = dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP %>% mean,
                            sd = dataset_sbp %>% filter(as.numeric(DIS) == 4) %>% .$SBP %>% sd)) +
  scale_color_brewer(type = "qual", palette = 2) +
        xlim(c(100,150)) +
  labs(x = "SBP (mmHg)",
       y = "Densities",
       col = "Groups") +
        theme_classic(base_size = 12)

ANOVA in R

aov() function can calculate the result of analysis of variance

aov(midterm ~ major, data = midterm) %>%
        summary

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## major        2   2124  1062.1   12.59 0.00712 **
## Residuals    6    506    84.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

aov(SBP ~ DIS, data = dataset_sbp_small) %>%
        summary

##              Df Sum Sq Mean Sq F value Pr(>F)    
## DIS           3  29932    9977   53.54 <2e-16 ***
## Residuals   996 185607     186                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Else, we can implement anova into effect modifications

aov(SBP ~ SEX + BTH_G * BMI + FBS + DIS * DBP, data = dataset_sbp_small) %>%
        summary

##              Df Sum Sq Mean Sq  F value   Pr(>F)    
## SEX           1   6475    6475   82.630  < 2e-16 ***
## BTH_G         1  23618   23618  301.394  < 2e-16 ***
## BMI           1  10826   10826  138.150  < 2e-16 ***
## FBS           1   2905    2905   37.073 1.63e-09 ***
## DIS           3   8318    2773   35.383  < 2e-16 ***
## DBP           1  85682   85682 1093.431  < 2e-16 ***
## BTH_G:BMI     1      0       0    0.000    0.985    
## DIS:DBP       3    374     125    1.593    0.190    
## Residuals   987  77342      78                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lm(SBP ~ SEX + BTH_G * BMI + FBS + DIS * DBP, data = dataset_sbp_small) %>%
        summary

## 
## Call:
## lm(formula = SBP ~ SEX + BTH_G * BMI + FBS + DIS * DBP, data = dataset_sbp_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.193  -5.601  -0.520   5.510  38.301 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          49.345205  11.941720   4.132 3.90e-05 ***
## SEX                  -0.976183   0.589594  -1.656   0.0981 .  
## BTH_G                 0.256344   0.292661   0.876   0.3813    
## BMI                   0.245572   0.187394   1.310   0.1903    
## FBS                   0.019089   0.012977   1.471   0.1416    
## DISHypertension      -4.553308  12.265896  -0.371   0.7106    
## DISDiabetes          -1.391342  17.500850  -0.080   0.9366    
## DISNo history       -17.968618  11.307352  -1.589   0.1124    
## DBP                   0.859173   0.134571   6.385 2.64e-10 ***
## BTH_G:BMI             0.001867   0.012488   0.149   0.8812    
## DISHypertension:DBP   0.054381   0.150584   0.361   0.7181    
## DISDiabetes:DBP      -0.033645   0.224297  -0.150   0.8808    
## DISNo history:DBP     0.178389   0.138652   1.287   0.1985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.852 on 987 degrees of freedom
## Multiple R-squared:  0.6412, Adjusted R-squared:  0.6368 
## F-statistic:   147 on 12 and 987 DF,  p-value: < 2.2e-16

aov(SBP ~  BTH_G * SEX * DIS * DBP, data = dataset_sbp_small) %>%
        summary

##                    Df Sum Sq Mean Sq  F value  Pr(>F)    
## BTH_G               1  22804   22804  293.300 < 2e-16 ***
## SEX                 1   7289    7289   93.744 < 2e-16 ***
## DIS                 3  11307    3769   48.475 < 2e-16 ***
## DBP                 1  95484   95484 1228.103 < 2e-16 ***
## BTH_G:SEX           1    854     854   10.990 0.00095 ***
## BTH_G:DIS           3    700     233    3.001 0.02973 *  
## SEX:DIS             3    453     151    1.942 0.12116    
## BTH_G:DBP           1    157     157    2.024 0.15514    
## SEX:DBP             1     29      29    0.376 0.53987    
## DIS:DBP             3    137      46    0.588 0.62266    
## BTH_G:SEX:DIS       3    136      45    0.584 0.62536    
## BTH_G:SEX:DBP       1      0       0    0.000 0.98243    
## BTH_G:DIS:DBP       3    343     114    1.469 0.22155    
## SEX:DIS:DBP         3    364     121    1.561 0.19722    
## BTH_G:SEX:DIS:DBP   3    220      73    0.945 0.41823    
## Residuals         968  75261      78                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova() function accepts linear model outputs

lm(SBP ~ BTH_G * SEX * DIS * DBP, data = dataset_sbp_small) %>% 
        summary

## 
## Call:
## lm(formula = SBP ~ BTH_G * SEX * DIS * DBP, data = dataset_sbp_small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.013  -5.341  -0.617   5.572  43.134 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                   -201.34069  205.73745  -0.979   0.3280  
## BTH_G                           14.12147    9.50847   1.485   0.1378  
## SEX                             90.26545  144.25017   0.626   0.5316  
## DISHypertension                288.35627  216.54324   1.332   0.1833  
## DISDiabetes                    178.64106  332.71491   0.537   0.5914  
## DISNo history                  229.88453  206.55369   1.113   0.2660  
## DBP                              4.02042    2.45536   1.637   0.1019  
## BTH_G:SEX                       -5.37119    6.18981  -0.868   0.3857  
## BTH_G:DISHypertension          -16.87641   10.12132  -1.667   0.0958 .
## BTH_G:DISDiabetes              -13.99269   16.88817  -0.829   0.4076  
## BTH_G:DISNo history            -12.67335    9.61234  -1.318   0.1877  
## SEX:DISHypertension           -133.96369  152.76266  -0.877   0.3807  
## SEX:DISDiabetes                -58.21450  205.47704  -0.283   0.7770  
## SEX:DISNo history              -83.80291  144.69323  -0.579   0.5626  
## BTH_G:DBP                       -0.16922    0.11451  -1.478   0.1398  
## SEX:DBP                         -1.14763    1.74774  -0.657   0.5116  
## DISHypertension:DBP             -3.59142    2.58840  -1.388   0.1656  
## DISDiabetes:DBP                 -1.93448    4.10562  -0.471   0.6376  
## DISNo history:DBP               -2.76757    2.46747  -1.122   0.2623  
## BTH_G:SEX:DISHypertension        8.28844    6.64491   1.247   0.2126  
## BTH_G:SEX:DISDiabetes            6.19111    9.77716   0.633   0.5267  
## BTH_G:SEX:DISNo history          4.53109    6.24898   0.725   0.4686  
## BTH_G:SEX:DBP                    0.06734    0.07496   0.898   0.3692  
## BTH_G:DISHypertension:DBP        0.20757    0.12216   1.699   0.0896 .
## BTH_G:DISDiabetes:DBP            0.15526    0.21062   0.737   0.4612  
## BTH_G:DISNo history:DBP          0.15002    0.11603   1.293   0.1963  
## SEX:DISHypertension:DBP          1.66182    1.85315   0.897   0.3701  
## SEX:DISDiabetes:DBP              0.60939    2.55070   0.239   0.8112  
## SEX:DISNo history:DBP            0.99992    1.75449   0.570   0.5689  
## BTH_G:SEX:DISHypertension:DBP   -0.10115    0.08066  -1.254   0.2102  
## BTH_G:SEX:DISDiabetes:DBP       -0.06943    0.12222  -0.568   0.5701  
## BTH_G:SEX:DISNo history:DBP     -0.05381    0.07584  -0.709   0.4782  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.818 on 968 degrees of freedom
## Multiple R-squared:  0.6508, Adjusted R-squared:  0.6396 
## F-statistic:  58.2 on 31 and 968 DF,  p-value: < 2.2e-16

https://stats.idre.ucla.edu/

https://rpubs.com/rslbliss/r_mlm_ws

https://methods101.com.au/docs/soci832_11_3_multilevel_models_week_11/

Dataset is students from 10 handpicked schools, representing a subset of students and schools from a US survey of eight-grade students at 1000 schools (800 public 200 private).

There are quite a lot of variables in the dataset, but we are going start by using just three:

Bibliography

## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.