The goal of this is to demonstrate alpha diversity analyses that could be performed on metagenomic data. This script was used to generate some results that are present in one of my published articles.
For these analyses, I mainly use the amazing phyloseq package.
library("phyloseq")
library("ggplot2")
library("dplyr")
library("ggpubr")
I start by loading three files:
head(otumat)[,1:6]
## yy1 yy2 yy3 yy4 yy5 yy6
## Acidobacterium 67 0 9 33 0 10
## Candidatus_Koribacter 496 0 23 64 0 2
## Granulicella 262 2 48 192 0 5
## Terriglobus 558 0 19 86 0 8
## Chloracidobacterium 23 0 3 16 4 9
## Candidatus_Solibacter 526 0 16 76 0 19
head(taxmat)
## Phylum Genus
## 1 Acidobacteria Acidobacterium
## 2 Acidobacteria Candidatus_Koribacter
## 3 Acidobacteria Granulicella
## 4 Acidobacteria Terriglobus
## 5 Acidobacteria Chloracidobacterium
## 6 Acidobacteria Candidatus_Solibacter
head(metadata)
## Group
## yy1 YY
## yy2 YY
## yy3 YY
## yy4 YY
## yy5 YY
## yy6 YY
For this script to work, the tax_table and the otu_table row names must be identical and in the same order. So I use the command below just to be sure that the Genera in the tax_table are exactly the same as the row names of the otu_table.
setdiff(rownames(otumat),taxmat$Genus)
## character(0)
Since the result is 0, I know there’s no difference, then I can use the data in the Genus column of taxmat as row names. Also, make sure the taxmat is a matrix.
rownames(taxmat) <- taxmat$Genus
taxmat = as.matrix(taxmat)
Now create the phyloseq object by combining the three files imported previously.
OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
sampledata = sample_data(metadata)
physeq1 = phyloseq(OTU, TAX, sampledata)
physeq1
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 850 taxa and 162 samples ]
## sample_data() Sample Data: [ 162 samples by 1 sample variables ]
## tax_table() Taxonomy Table: [ 850 taxa by 2 taxonomic ranks ]
Arranging the metadata factors so that the groups will always be listed in the given order:
sample_data(physeq1)$Group <- factor((sample_data(physeq1)$Group), levels=c("YY","VV","MM","TT","CA","NN","SS"))
Phyloseq has the function “Estimate Richness” which returns the results of some alpha diversity estimates for each metagenome:
richness <- estimate_richness(physeq1)
head(richness)
## Observed Chao1 se.chao1 ACE se.ACE Shannon Simpson
## yy1 831 837.1818 4.171555 838.8740 9.701057 2.881380 0.7596974
## yy2 562 648.2500 21.068072 646.7016 12.412318 2.292813 0.8051032
## yy3 824 838.1304 7.375090 831.4718 13.203409 2.772338 0.8312823
## yy4 830 846.5000 9.297206 838.8370 10.317675 1.290877 0.3830826
## yy5 473 524.3830 16.106509 519.7690 11.254276 2.627071 0.8160624
## yy6 740 752.3333 5.603758 758.0869 12.736843 2.102588 0.7548794
## InvSimpson Fisher
## yy1 4.161419 84.18912
## yy2 5.130921 74.45343
## yy3 5.927059 90.89269
## yy4 1.620963 76.97193
## yy5 5.436626 60.96582
## yy6 4.079624 77.92174
The function “Plot Richness” generates dot plots showing the results given by “Estimate Richness”:
plot_richness(physeq1)
Since I’m analyzing many samples, a dot plot is not very informative, so I decided to generate a boxplot with metagenomes being arranged as groups. You can use any measure given in “Estimate Richness”, but for this example, I’m using the Shannon index:
plot_richness(physeq1, x="Group", measures="Shannon", color = "Group")+
geom_boxplot(alpha=0.6)+
theme(legend.position="none", axis.text.x=element_text(angle=45,hjust=1,vjust=1,size=12))
It’s also possible to add comparisons between pairs of groups in the plot. Here I’m using the function “stat_compare_means” from the ggpubr package. Wilcoxon was the test used for comparison.
a_my_comparisons <- list( c("YY", "MM"), c("CA", "TT"), c("NN", "SS"))
symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns"))
plot_richness(physeq1, x="Group", measures="Shannon", color = "Group")+
geom_boxplot(alpha=0.6)+
theme(legend.position="none", axis.text.x=element_text(angle=45,hjust=1,vjust=1,size=12))+
stat_compare_means(method = "wilcox.test", comparisons = a_my_comparisons, label = "p.signif", symnum.args = symnum.args)
A histogram can be created to check if the Shannon values estimated for the metagenomes are normally distributed:
hist(richness$Shannon, main="Shannon index", xlab="")
Since it is, we can run anova tests and check if the variable Group impacts the Shannon diversity:
anova.sh = aov(richness$Shannon ~ sample_data(physeq1)$Group)
summary(anova.sh)
## Df Sum Sq Mean Sq F value Pr(>F)
## sample_data(physeq1)$Group 6 124.82 20.804 50.52 <2e-16 ***
## Residuals 155 63.83 0.412
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the anova tests results, it’s possible to compute the Tukey Honest Significant Differences:
TukeyHSD(anova.sh)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = richness$Shannon ~ sample_data(physeq1)$Group)
##
## $`sample_data(physeq1)$Group`
## diff lwr upr p adj
## VV-YY 0.37317998 -0.46615308 1.2125131 0.8376996
## MM-YY 0.74818610 0.11716760 1.3792046 0.0092785
## TT-YY 1.39816943 0.65565198 2.1406869 0.0000017
## CA-YY 0.04885664 -0.53797865 0.6356919 0.9999799
## NN-YY -0.15205043 -0.79400594 0.4899051 0.9919818
## SS-YY -1.43588324 -2.00909498 -0.8626715 0.0000000
## MM-VV 0.37500611 -0.40767600 1.1576882 0.7840224
## TT-VV 1.02498945 0.14992423 1.9000547 0.0106835
## CA-VV -0.32432335 -1.07184098 0.4231943 0.8528290
## NN-VV -0.52523041 -1.31675668 0.2662959 0.4300572
## SS-VV -1.80906322 -2.54593407 -1.0721924 0.0000000
## TT-MM 0.64998333 -0.02783927 1.3278059 0.0694716
## CA-MM -0.69932946 -1.20181026 -0.1968487 0.0010245
## NN-MM -0.90023653 -1.46611477 -0.3343583 0.0000918
## SS-MM -2.18406934 -2.67057011 -1.6975686 0.0000000
## CA-TT -1.34931279 -1.98620743 -0.7124182 0.0000001
## NN-TT -1.55021986 -2.23823585 -0.8622039 0.0000000
## SS-TT -2.83405267 -3.45841701 -2.2096883 0.0000000
## NN-CA -0.20090707 -0.71705577 0.3152416 0.9068569
## SS-CA -1.48473987 -1.91237793 -1.0571018 0.0000000
## SS-NN -1.28383281 -1.78443797 -0.7832276 0.0000000
For Non-normally distributed data, we can use Kruskal-Wallis Rank Sum Test:
kruskal.test(richness$Shannon ~ sample_data(physeq1)$Group)
##
## Kruskal-Wallis rank sum test
##
## data: richness$Shannon by sample_data(physeq1)$Group
## Kruskal-Wallis chi-squared = 104.32, df = 6, p-value < 2.2e-16
We can also get a list with the p-values resulted of the Wilcoxon Tests considering each pair of groups:
pairwise.wilcox.test(richness$Shannon, sample_data(physeq1)$Group, p.adj = "bonf")
##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: richness$Shannon and sample_data(physeq1)$Group
##
## YY VV MM TT CA NN
## VV 1.0000 - - - - -
## MM 0.0283 1.0000 - - - -
## TT 2.9e-05 0.0453 0.6080 - - -
## CA 1.0000 1.0000 0.0030 7.1e-07 - -
## NN 1.0000 1.0000 0.0012 1.5e-06 1.0000 -
## SS 7.5e-09 2.5e-06 7.2e-12 7.5e-11 1.9e-14 1.9e-09
##
## P value adjustment method: bonferroni