This analysis pipe noted https://github.com/minsiksudo/BTE3207_Advanced_Biostatistics.
Roading the data
from previous lecture.. we generated a phyloseq object. it can be saved to hard disk using the below command.
saveRDS(ps, "phyloseq_example.rds")
And it can be loaded to a new R session like below!
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/scripts/IBS7048_Advanced_metagenomics"
library(phyloseq)
library(tidyverse)
library(microbiome)
library(vegan)
#source("https://raw.githubusercontent.com/joey711/phyloseq/master/inst/scripts/installer.R",
# local = TRUE)
phyloseq <- readRDS("/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/scripts/IBS7048_Advanced_metagenomics/phyloseq_example.rds")
phyloseq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 225 taxa and 19 samples ]
## sample_data() Sample Data: [ 19 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 225 taxa by 6 taxonomic ranks ]
## refseq() DNAStringSet: [ 225 reference sequences ]
Basics of statistical tests
MPG dataset
MPG dataset is basic data installed in ggplot
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
Double-check the data type.
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
t.test fucntion will conduct basic t-test
Paired t-test
t.test(mpg$cty,
mpg$hwy,
paired = T)
##
## Paired t-test
##
## data: mpg$cty and mpg$hwy
## t = -44.492, df = 233, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -6.872628 -6.289765
## sample estimates:
## mean difference
## -6.581197
Unpaired t-test
city_mpg_2seater <- mpg %>% filter(class == "2seater") %>% .$cty
city_mpg_midsize <- mpg %>% filter(class == "midsize") %>% .$cty
city_mpg_2seater
## [1] 16 15 16 15 15
city_mpg_midsize
## [1] 15 17 16 19 22 18 18 17 18 18 21 21 18 18 19 23 23 19 19 18 19 19 18 16 17
## [26] 18 16 21 21 21 21 18 18 19 21 18 19 21 16 18 17
t.test(city_mpg_2seater,
city_mpg_midsize,
paired = F,
var.equal = F)
##
## Welch Two Sample t-test
##
## data: city_mpg_2seater and city_mpg_midsize
## t = -8.5965, df = 20.862, p-value = 2.69e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.168314 -2.543882
## sample estimates:
## mean of x mean of y
## 15.4000 18.7561
Linear model
Univariate linear model
Boxplot of categorical x, continuous y
boxplot(data = mpg, cty ~ class)
Linear model
lm(data = mpg,
cty ~ class)
##
## Call:
## lm(formula = cty ~ class, data = mpg)
##
## Coefficients:
## (Intercept) classcompact classmidsize classminivan
## 15.4000 4.7277 3.3561 0.4182
## classpickup classsubcompact classsuv
## -2.4000 4.9714 -1.9000
Linear model - the output can be stored
lm_result_mpg <-
lm(data = mpg,
cty ~ class)
lm_result_mpg
##
## Call:
## lm(formula = cty ~ class, data = mpg)
##
## Coefficients:
## (Intercept) classcompact classmidsize classminivan
## 15.4000 4.7277 3.3561 0.4182
## classpickup classsubcompact classsuv
## -2.4000 4.9714 -1.9000
The result of lm is a list having multiple caclulated factors
lm_result_mpg %>% names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
Summary of a linear model
summary(lm_result_mpg)
##
## Call:
## lm(formula = cty ~ class, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3714 -1.7561 -0.1277 1.0000 14.6286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4000 1.3024 11.824 < 2e-16 ***
## classcompact 4.7277 1.3699 3.451 0.000666 ***
## classmidsize 3.3561 1.3796 2.433 0.015759 *
## classminivan 0.4182 1.5708 0.266 0.790307
## classpickup -2.4000 1.3976 -1.717 0.087304 .
## classsubcompact 4.9714 1.3923 3.571 0.000435 ***
## classsuv -1.9000 1.3539 -1.403 0.161884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.912 on 227 degrees of freedom
## Multiple R-squared: 0.5438, Adjusted R-squared: 0.5317
## F-statistic: 45.1 on 6 and 227 DF, p-value: < 2.2e-16
This result will show you how different data is. (by the factor label
of class
)
You have 1. Estimate (beta) 2. Standard errors 3. p-values 4. R-squared values
ANOVA
Analysis of variance for linear model can be done to this linear model result.
anova(lm_result_mpg)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## class 6 2295.0 382.51 45.099 < 2.2e-16 ***
## Residuals 227 1925.3 8.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can we test different multiple associations?
ggplot(mpg,
aes(y = cty, x = class, fill = as.factor(year))) +
geom_boxplot()
Anova - Multivariate
lm(data = mpg,
formula = hwy ~ class + as.factor(year)) %>%
anova
## Analysis of Variance Table
##
## Response: hwy
## Df Sum Sq Mean Sq F value Pr(>F)
## class 6 5683.2 947.21 83.4157 <2e-16 ***
## as.factor(year) 1 12.1 12.15 1.0699 0.3021
## Residuals 226 2566.3 11.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linear model - Multivariate
Multiple linear regression can be don by just adding a new variable to formula
lm(data = mpg,
formula = hwy ~ class + as.factor(year)) %>%
summary
##
## Call:
## lm(formula = hwy ~ class + as.factor(year), data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3908 -1.6435 -0.3427 1.1141 16.0659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5260 1.5301 16.029 < 2e-16 ***
## classcompact 3.5581 1.5862 2.243 0.0259 *
## classmidsize 2.5328 1.5967 1.586 0.1141
## classminivan -2.3699 1.8186 -1.303 0.1939
## classpickup -7.8825 1.6176 -4.873 2.07e-06 ***
## classsubcompact 3.4081 1.6123 2.114 0.0356 *
## classsuv -6.6400 1.5669 -4.238 3.29e-05 ***
## as.factor(year)2008 0.4567 0.4416 1.034 0.3021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.37 on 226 degrees of freedom
## Multiple R-squared: 0.6894, Adjusted R-squared: 0.6798
## F-statistic: 71.65 on 7 and 226 DF, p-value: < 2.2e-16
Statistical tests for alpha diversity
sample_data_with_alpha <-
phyloseq %>%
sample_data() %>%
merge(., #Merge function will do the same thing as vloopup at excel
microbiome::alpha(phyloseq),
by = 0) %>%
column_to_rownames("Row.names") %>%
sample_data
sample_data(phyloseq) <- sample_data_with_alpha
Alpha diversity boxplots
sample_data_with_alpha %>%
ggplot(aes(x = When, y = observed)) +
geom_boxplot() +
xlab("Species richness") +
theme_classic(base_size = 15)
Statistical tests on alpha diversity (linear model)
sample_data_with_alpha %>%
data.frame() %>%
lm(data = ., observed ~ When) %>%
summary
##
## Call:
## lm(formula = observed ~ When, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.00 -15.05 -0.10 11.50 41.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.000 7.418 12.268 7.18e-10 ***
## WhenLate -15.900 10.225 -1.555 0.138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.25 on 17 degrees of freedom
## Multiple R-squared: 0.1245, Adjusted R-squared: 0.07303
## F-statistic: 2.418 on 1 and 17 DF, p-value: 0.1384
The difference between those two group - is not statistically significant
How about other outcomes?
sample_data_with_alpha %>%
data.frame() %>%
lm(data = ., observed ~ evenness_simpson) %>%
summary
##
## Call:
## lm(formula = observed ~ evenness_simpson, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.386 -13.908 2.837 14.264 32.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.78 17.58 7.668 6.47e-07 ***
## evenness_simpson -207.81 67.83 -3.063 0.00703 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.09 on 17 degrees of freedom
## Multiple R-squared: 0.3557, Adjusted R-squared: 0.3178
## F-statistic: 9.385 on 1 and 17 DF, p-value: 0.007033
The evenness difference between those two group - is statistically significant
sample_data_with_alpha %>%
data.frame() %>%
lm(data = ., observed ~ dominance_core_abundance) %>%
summary
##
## Call:
## lm(formula = observed ~ dominance_core_abundance, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.181 -12.114 -4.527 2.548 37.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 273.61 45.13 6.062 1.27e-05 ***
## dominance_core_abundance -223.44 52.62 -4.247 0.000544 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.57 on 17 degrees of freedom
## Multiple R-squared: 0.5147, Adjusted R-squared: 0.4862
## F-statistic: 18.03 on 1 and 17 DF, p-value: 0.0005441
Core abundance was also statistically different.
How can we say this difference is not because of deeper sequencing?
sample_data_with_alpha$final_reads <- phyloseq %>% sample_sums
sample_data_with_alpha %>%
data.frame() %>%
ggplot(aes(x = final_reads, y = observed)) +
geom_point()
Linear model of this association
(As deeper sequencing identified more taxa)
lm_eqn <- function(m){
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(unname(coef(m)[1]), digits = 2),
b = format(unname(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
ggplot(data = sample_data_with_alpha, aes(x = final_reads, y = observed)) +
geom_smooth(method = "lm",
se = T,
color = "red",
linetype = "dashed"
) + # Plot regression slope
geom_point() +
theme_classic( base_family = "serif", base_size = 20) +
geom_label(x = 12000, y = 60,
label = lm_eqn(lm(data = data.frame(sample_data_with_alpha), observed ~ final_reads)),
parse = TRUE,
family='serif',
size = 8, label.size = 0, color = "red")#, fill = alpha(c("white"),1))
lm(data = data.frame(sample_data_with_alpha), observed ~ final_reads) %>% summary
##
## Call:
## lm(formula = observed ~ final_reads, data = data.frame(sample_data_with_alpha))
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.01 -11.19 -1.72 11.47 23.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.217e+01 6.728e+00 7.754 5.57e-07 ***
## final_reads 4.896e-03 9.353e-04 5.235 6.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.72 on 17 degrees of freedom
## Multiple R-squared: 0.6171, Adjusted R-squared: 0.5946
## F-statistic: 27.4 on 1 and 17 DF, p-value: 6.727e-05
There is significant correlation !
ggplot(data = sample_data_with_alpha, aes(x = final_reads, y = evenness_simpson)) +
geom_smooth(method = "lm",
se = T,
color = "red",
linetype = "dashed"
) + # Plot regression slope
geom_point(aes(col = When)) +
theme_classic( base_family = "serif", base_size = 20) +
geom_label(x = 12000, y = 60,
label = lm_eqn(lm(data = data.frame(sample_data_with_alpha), evenness_simpson ~ final_reads)),
parse = TRUE,
family='serif',
size = 8, label.size = 0, color = "red")#, fill = alpha(c("white"),1))
Even for evenness, this association seem to have a significant term in general.
Multiple linear regression
lm(data = data.frame(sample_data_with_alpha),
evenness_simpson ~ final_reads + When) %>% summary
##
## Call:
## lm(formula = evenness_simpson ~ final_reads + When, data = data.frame(sample_data_with_alpha))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05258 -0.02686 0.00104 0.01807 0.09423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.247e-01 2.111e-02 15.378 5.26e-11 ***
## final_reads -1.406e-05 2.540e-06 -5.537 4.51e-05 ***
## WhenLate 2.616e-02 1.836e-02 1.424 0.174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03989 on 16 degrees of freedom
## Multiple R-squared: 0.6786, Adjusted R-squared: 0.6385
## F-statistic: 16.89 on 2 and 16 DF, p-value: 0.0001138
The association we ‘obsereved’ was not because of experimental condition, but due to different final reads.
lm(data = data.frame(sample_data_with_alpha),
dominance_core_abundance ~ final_reads + When) %>% summary
##
## Call:
## lm(formula = dominance_core_abundance ~ final_reads + When, data = data.frame(sample_data_with_alpha))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.095190 -0.038735 0.005034 0.027147 0.126734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.263e-01 3.278e-02 25.209 2.63e-14 ***
## final_reads -2.703e-06 3.943e-06 -0.686 0.50283
## WhenLate 8.595e-02 2.851e-02 3.015 0.00822 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06193 on 16 degrees of freedom
## Multiple R-squared: 0.3811, Adjusted R-squared: 0.3038
## F-statistic: 4.927 on 2 and 16 DF, p-value: 0.02152
However, the change of dominance was due to experimental condition.
So, the TRUE change between the samples from early and late
phase, was Dominance
.
How would it change the association for beta diversity?
Ordination
Principal Coordinates Analysis with Bray-Curtis dist
phyloseq_rel <- phyloseq %>% transform_sample_counts(., function(x) {x/sum(x)})
sample_data(phyloseq_rel) <- sample_data_with_alpha
plot_ordination(physeq = phyloseq_rel,
ordination = ordinate(phyloseq_rel, "PCoA", "bray"),
col = "When"
) +
geom_point(size = 8)
These two groups seems different. Is it really different, by sequencing depth or the sampling phase?
plot_ordination(physeq = phyloseq_rel,
ordination = ordinate(phyloseq_rel, "PCoA", "bray"),
col = "final_reads"
) +
geom_point(size = 8) +
scale_color_gradient(low = "blue", high = "red")
Symbols and colors at the sample time
plot_ordination(physeq = phyloseq_rel,
ordination = ordinate(phyloseq_rel, "PCoA", "bray"),
shape = "When",
col = "final_reads"
) +
geom_point(size = 8) +
scale_color_gradient(low = "blue", high = "red") +
stat_ellipse(type = "norm", linetype = 2) +
stat_ellipse(type = "t") +
theme_classic()
#scale_color_brewer(type = "qual")
PERMANOVA
As betadiversity indices have multiple distances at the same time, the calculation must accompany multiple analyses, at a random order (permutation).
This is permutational anova (PERMANOVA).
Univariate permutational anova (PERMANOVA).
Based on sampling point
adonis(formula = distance(phyloseq, method = "bray") ~ When,
data = data.frame(sample_data_with_alpha),
permutations = 10000) %>%
.$aov.tab
## Permutation: free
## Number of permutations: 10000
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## When 1 0.50229 0.50229 5.5789 0.24709 3e-04 ***
## Residuals 17 1.53057 0.09003 0.75291
## Total 18 2.03286 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There was difference. R-squared shows that the 25% of variance in
beta-diversity was explained by When
variable.
Multivariate anlaysis
adonis(formula = distance(phyloseq, method = "bray") ~ final_reads + When,
data = data.frame(sample_data_with_alpha),
permutations = 10000) %>%
.$aov.tab
## Permutation: free
## Number of permutations: 10000
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## final_reads 1 0.65112 0.65112 11.7547 0.32030 9.999e-05 ***
## When 1 0.49546 0.49546 8.9445 0.24373 9.999e-05 ***
## Residuals 16 0.88628 0.05539 0.43598
## Total 18 2.03286 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There was difference. R-squared shows that the 24% of variance in
beta-diversity was explained by When
variable, and 32% was
explained by Fian reads
.
–> 1% of variance in beta-diversity was ADJUSTED by sequencing depth.
Bibliography
## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.
## Generation in R. R package version 1.45, <https://yihui.org/knitr/>. Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963 Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
## Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). rmarkdown: Dynamic Documents for R. R package version 2.25, <https://github.com/rstudio/rmarkdown>. Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, <https://bookdown.org/yihui/rmarkdown>. Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook_. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, <https://bookdown.org/yihui/rmarkdown-cookbook>.
## 'rmarkdown' Documents. R package version 1.0.4, <https://CRAN.R-project.org/package=rmdformats>.