COD_20231107_MGK_IBS7048_week11_statistical_microbiome_analysis

Minsik Kim

2023-11-08

This analysis pipe noted https://github.com/minsiksudo/BTE3207_Advanced_Biostatistics.

Roading the data

from previous lecture.. we generated a phyloseq object. it can be saved to hard disk using the below command.

saveRDS(ps, "phyloseq_example.rds")

And it can be loaded to a new R session like below!

getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/scripts/IBS7048_Advanced_metagenomics"
library(phyloseq)
library(tidyverse)
library(microbiome)
library(vegan)
#source("https://raw.githubusercontent.com/joey711/phyloseq/master/inst/scripts/installer.R",
#       local = TRUE)


phyloseq <- readRDS("/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/scripts/IBS7048_Advanced_metagenomics/phyloseq_example.rds")
phyloseq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 225 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 225 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 225 reference sequences ]

Basics of statistical tests

MPG dataset

MPG dataset is basic data installed in ggplot

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Double-check the data type.

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

t.test fucntion will conduct basic t-test

Paired t-test

t.test(mpg$cty,
       mpg$hwy,
       paired = T)
## 
##  Paired t-test
## 
## data:  mpg$cty and mpg$hwy
## t = -44.492, df = 233, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -6.872628 -6.289765
## sample estimates:
## mean difference 
##       -6.581197

Unpaired t-test

city_mpg_2seater <- mpg %>% filter(class == "2seater") %>% .$cty
city_mpg_midsize <- mpg %>% filter(class == "midsize") %>% .$cty

city_mpg_2seater
## [1] 16 15 16 15 15
city_mpg_midsize
##  [1] 15 17 16 19 22 18 18 17 18 18 21 21 18 18 19 23 23 19 19 18 19 19 18 16 17
## [26] 18 16 21 21 21 21 18 18 19 21 18 19 21 16 18 17
t.test(city_mpg_2seater,
       city_mpg_midsize,
       paired = F,
       var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  city_mpg_2seater and city_mpg_midsize
## t = -8.5965, df = 20.862, p-value = 2.69e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.168314 -2.543882
## sample estimates:
## mean of x mean of y 
##   15.4000   18.7561

Linear model

Univariate linear model

Boxplot of categorical x, continuous y

boxplot(data = mpg, cty ~ class)

Linear model

lm(data = mpg, 
       cty ~ class)
## 
## Call:
## lm(formula = cty ~ class, data = mpg)
## 
## Coefficients:
##     (Intercept)     classcompact     classmidsize     classminivan  
##         15.4000           4.7277           3.3561           0.4182  
##     classpickup  classsubcompact         classsuv  
##         -2.4000           4.9714          -1.9000

Linear model - the output can be stored

lm_result_mpg <- 
        lm(data = mpg, 
           cty ~ class)

lm_result_mpg
## 
## Call:
## lm(formula = cty ~ class, data = mpg)
## 
## Coefficients:
##     (Intercept)     classcompact     classmidsize     classminivan  
##         15.4000           4.7277           3.3561           0.4182  
##     classpickup  classsubcompact         classsuv  
##         -2.4000           4.9714          -1.9000

The result of lm is a list having multiple caclulated factors

lm_result_mpg %>% names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"

Summary of a linear model

summary(lm_result_mpg)
## 
## Call:
## lm(formula = cty ~ class, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3714 -1.7561 -0.1277  1.0000 14.6286 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      15.4000     1.3024  11.824  < 2e-16 ***
## classcompact      4.7277     1.3699   3.451 0.000666 ***
## classmidsize      3.3561     1.3796   2.433 0.015759 *  
## classminivan      0.4182     1.5708   0.266 0.790307    
## classpickup      -2.4000     1.3976  -1.717 0.087304 .  
## classsubcompact   4.9714     1.3923   3.571 0.000435 ***
## classsuv         -1.9000     1.3539  -1.403 0.161884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.912 on 227 degrees of freedom
## Multiple R-squared:  0.5438, Adjusted R-squared:  0.5317 
## F-statistic:  45.1 on 6 and 227 DF,  p-value: < 2.2e-16

This result will show you how different data is. (by the factor label of class)

You have 1. Estimate (beta) 2. Standard errors 3. p-values 4. R-squared values

ANOVA

Analysis of variance for linear model can be done to this linear model result.

anova(lm_result_mpg)
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## class       6 2295.0  382.51  45.099 < 2.2e-16 ***
## Residuals 227 1925.3    8.48                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How can we test different multiple associations?

ggplot(mpg,
       aes(y = cty, x = class, fill = as.factor(year))) +
        geom_boxplot()

Anova - Multivariate

lm(data = mpg,
        formula = hwy ~ class + as.factor(year)) %>%
        anova
## Analysis of Variance Table
## 
## Response: hwy
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## class             6 5683.2  947.21 83.4157 <2e-16 ***
## as.factor(year)   1   12.1   12.15  1.0699 0.3021    
## Residuals       226 2566.3   11.36                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

linear model - Multivariate

Multiple linear regression can be don by just adding a new variable to formula

lm(data = mpg,
        formula = hwy ~ class + as.factor(year)) %>%
        summary
## 
## Call:
## lm(formula = hwy ~ class + as.factor(year), data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3908 -1.6435 -0.3427  1.1141 16.0659 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          24.5260     1.5301  16.029  < 2e-16 ***
## classcompact          3.5581     1.5862   2.243   0.0259 *  
## classmidsize          2.5328     1.5967   1.586   0.1141    
## classminivan         -2.3699     1.8186  -1.303   0.1939    
## classpickup          -7.8825     1.6176  -4.873 2.07e-06 ***
## classsubcompact       3.4081     1.6123   2.114   0.0356 *  
## classsuv             -6.6400     1.5669  -4.238 3.29e-05 ***
## as.factor(year)2008   0.4567     0.4416   1.034   0.3021    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.37 on 226 degrees of freedom
## Multiple R-squared:  0.6894, Adjusted R-squared:  0.6798 
## F-statistic: 71.65 on 7 and 226 DF,  p-value: < 2.2e-16

Statistical tests for alpha diversity

sample_data_with_alpha <-
        phyloseq %>%
        sample_data() %>%
        merge(., #Merge function will do the same thing as vloopup at excel
              microbiome::alpha(phyloseq),
              by = 0) %>%
        column_to_rownames("Row.names") %>%
        sample_data

sample_data(phyloseq) <- sample_data_with_alpha

Alpha diversity boxplots

sample_data_with_alpha %>%
        ggplot(aes(x = When, y = observed)) +
        geom_boxplot() +
        xlab("Species richness") +
        theme_classic(base_size = 15)

Statistical tests on alpha diversity (linear model)

sample_data_with_alpha %>%
        data.frame() %>% 
        lm(data = ., observed ~ When) %>%
        summary
## 
## Call:
## lm(formula = observed ~ When, data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.00 -15.05  -0.10  11.50  41.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   91.000      7.418  12.268 7.18e-10 ***
## WhenLate     -15.900     10.225  -1.555    0.138    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.25 on 17 degrees of freedom
## Multiple R-squared:  0.1245, Adjusted R-squared:  0.07303 
## F-statistic: 2.418 on 1 and 17 DF,  p-value: 0.1384

The difference between those two group - is not statistically significant

How about other outcomes?

sample_data_with_alpha %>%
        data.frame() %>% 
        lm(data = ., observed ~ evenness_simpson) %>%
        summary
## 
## Call:
## lm(formula = observed ~ evenness_simpson, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.386 -13.908   2.837  14.264  32.095 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        134.78      17.58   7.668 6.47e-07 ***
## evenness_simpson  -207.81      67.83  -3.063  0.00703 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.09 on 17 degrees of freedom
## Multiple R-squared:  0.3557, Adjusted R-squared:  0.3178 
## F-statistic: 9.385 on 1 and 17 DF,  p-value: 0.007033

The evenness difference between those two group - is statistically significant

sample_data_with_alpha %>%
        data.frame() %>% 
        lm(data = ., observed ~ dominance_core_abundance) %>%
        summary
## 
## Call:
## lm(formula = observed ~ dominance_core_abundance, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.181 -12.114  -4.527   2.548  37.443 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                273.61      45.13   6.062 1.27e-05 ***
## dominance_core_abundance  -223.44      52.62  -4.247 0.000544 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.57 on 17 degrees of freedom
## Multiple R-squared:  0.5147, Adjusted R-squared:  0.4862 
## F-statistic: 18.03 on 1 and 17 DF,  p-value: 0.0005441

Core abundance was also statistically different.

How can we say this difference is not because of deeper sequencing?

sample_data_with_alpha$final_reads <- phyloseq %>% sample_sums

sample_data_with_alpha %>%
        data.frame() %>% 
        ggplot(aes(x = final_reads, y = observed)) +
        geom_point()

Linear model of this association

(As deeper sequencing identified more taxa)

lm_eqn <- function(m){
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
         list(a = format(unname(coef(m)[1]), digits = 2),
              b = format(unname(coef(m)[2]), digits = 2),
             r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq));
}


ggplot(data = sample_data_with_alpha, aes(x = final_reads, y = observed)) +
        geom_smooth(method = "lm", 
                    se = T, 
                    color = "red", 
                    linetype = "dashed"
                    ) +  # Plot regression slope
        geom_point() +
        theme_classic( base_family = "serif", base_size = 20) +
        geom_label(x = 12000, y = 60,
                  label = lm_eqn(lm(data = data.frame(sample_data_with_alpha), observed ~ final_reads)),
                  parse = TRUE, 
                  family='serif',
                  size = 8, label.size = 0, color = "red")#, fill = alpha(c("white"),1)) 

lm(data = data.frame(sample_data_with_alpha), observed ~ final_reads) %>% summary
## 
## Call:
## lm(formula = observed ~ final_reads, data = data.frame(sample_data_with_alpha))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.01 -11.19  -1.72  11.47  23.08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.217e+01  6.728e+00   7.754 5.57e-07 ***
## final_reads 4.896e-03  9.353e-04   5.235 6.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.72 on 17 degrees of freedom
## Multiple R-squared:  0.6171, Adjusted R-squared:  0.5946 
## F-statistic:  27.4 on 1 and 17 DF,  p-value: 6.727e-05

There is significant correlation !

ggplot(data = sample_data_with_alpha, aes(x = final_reads, y = evenness_simpson)) +
        geom_smooth(method = "lm", 
                    se = T, 
                    color = "red", 
                    linetype = "dashed"
                    ) +  # Plot regression slope
        geom_point(aes(col = When)) +
        theme_classic( base_family = "serif", base_size = 20) +
        geom_label(x = 12000, y = 60,
                  label = lm_eqn(lm(data = data.frame(sample_data_with_alpha), evenness_simpson ~ final_reads)),
                  parse = TRUE, 
                  family='serif',
                  size = 8, label.size = 0, color = "red")#, fill = alpha(c("white"),1)) 

Even for evenness, this association seem to have a significant term in general.

Multiple linear regression

lm(data = data.frame(sample_data_with_alpha),
   evenness_simpson ~ final_reads + When) %>% summary
## 
## Call:
## lm(formula = evenness_simpson ~ final_reads + When, data = data.frame(sample_data_with_alpha))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05258 -0.02686  0.00104  0.01807  0.09423 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.247e-01  2.111e-02  15.378 5.26e-11 ***
## final_reads -1.406e-05  2.540e-06  -5.537 4.51e-05 ***
## WhenLate     2.616e-02  1.836e-02   1.424    0.174    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03989 on 16 degrees of freedom
## Multiple R-squared:  0.6786, Adjusted R-squared:  0.6385 
## F-statistic: 16.89 on 2 and 16 DF,  p-value: 0.0001138

The association we ‘obsereved’ was not because of experimental condition, but due to different final reads.

lm(data = data.frame(sample_data_with_alpha),
   dominance_core_abundance ~ final_reads + When) %>% summary
## 
## Call:
## lm(formula = dominance_core_abundance ~ final_reads + When, data = data.frame(sample_data_with_alpha))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.095190 -0.038735  0.005034  0.027147  0.126734 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.263e-01  3.278e-02  25.209 2.63e-14 ***
## final_reads -2.703e-06  3.943e-06  -0.686  0.50283    
## WhenLate     8.595e-02  2.851e-02   3.015  0.00822 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06193 on 16 degrees of freedom
## Multiple R-squared:  0.3811, Adjusted R-squared:  0.3038 
## F-statistic: 4.927 on 2 and 16 DF,  p-value: 0.02152

However, the change of dominance was due to experimental condition.

So, the TRUE change between the samples from early and late phase, was Dominance.

How would it change the association for beta diversity?

Ordination

Principal Coordinates Analysis with Bray-Curtis dist

phyloseq_rel <- phyloseq %>% transform_sample_counts(., function(x) {x/sum(x)})
sample_data(phyloseq_rel) <- sample_data_with_alpha

plot_ordination(physeq = phyloseq_rel,
                ordination = ordinate(phyloseq_rel, "PCoA", "bray"),
                col = "When"
                ) +
        geom_point(size = 8)

These two groups seems different. Is it really different, by sequencing depth or the sampling phase?

plot_ordination(physeq = phyloseq_rel,
                ordination = ordinate(phyloseq_rel, "PCoA", "bray"),
                col = "final_reads"
                ) +
        geom_point(size = 8) +
        scale_color_gradient(low = "blue", high = "red") 

Symbols and colors at the sample time

plot_ordination(physeq = phyloseq_rel,
                ordination = ordinate(phyloseq_rel, "PCoA", "bray"),
                shape = "When",
                col = "final_reads"
                ) +
        geom_point(size = 8) +
        scale_color_gradient(low = "blue", high = "red") +
        stat_ellipse(type = "norm", linetype = 2) +
        stat_ellipse(type = "t") +
        theme_classic() 

        #scale_color_brewer(type = "qual")

PERMANOVA

As betadiversity indices have multiple distances at the same time, the calculation must accompany multiple analyses, at a random order (permutation).

This is permutational anova (PERMANOVA).

Univariate permutational anova (PERMANOVA).

Based on sampling point

adonis(formula = distance(phyloseq, method = "bray") ~ When,
       data = data.frame(sample_data_with_alpha),
       permutations = 10000) %>% 
        .$aov.tab
## Permutation: free
## Number of permutations: 10000
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## When       1   0.50229 0.50229  5.5789 0.24709  3e-04 ***
## Residuals 17   1.53057 0.09003         0.75291           
## Total     18   2.03286                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There was difference. R-squared shows that the 25% of variance in beta-diversity was explained by When variable.

Multivariate anlaysis

adonis(formula = distance(phyloseq, method = "bray") ~ final_reads + When,
       data = data.frame(sample_data_with_alpha),
       permutations = 10000) %>% 
        .$aov.tab
## Permutation: free
## Number of permutations: 10000
## 
## Terms added sequentially (first to last)
## 
##             Df SumsOfSqs MeanSqs F.Model      R2    Pr(>F)    
## final_reads  1   0.65112 0.65112 11.7547 0.32030 9.999e-05 ***
## When         1   0.49546 0.49546  8.9445 0.24373 9.999e-05 ***
## Residuals   16   0.88628 0.05539         0.43598              
## Total       18   2.03286                 1.00000              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There was difference. R-squared shows that the 24% of variance in beta-diversity was explained by When variable, and 32% was explained by Fian reads.

–> 1% of variance in beta-diversity was ADJUSTED by sequencing depth.

Bibliography

## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.
## Generation in R. R package version 1.45, <https://yihui.org/knitr/>. Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963 Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
## Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). rmarkdown: Dynamic Documents for R. R package version 2.25, <https://github.com/rstudio/rmarkdown>. Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, <https://bookdown.org/yihui/rmarkdown>. Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook_. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, <https://bookdown.org/yihui/rmarkdown-cookbook>.
## 'rmarkdown' Documents. R package version 1.0.4, <https://CRAN.R-project.org/package=rmdformats>.