The goal of this is to demonstrate alpha diversity analyses that could be performed on metagenomic data. This script was used to generate some results that are present in one of my published articles.

For these analyses, I mainly use the amazing phyloseq package.

library("phyloseq")
library("ggplot2")
library("dplyr")
library("ggpubr")

I start by loading three files:

  1. The “otu_table” which, in my case, is called “otumat”. To get this file, I ran Kraken Taxonomic Analysis followed by the kraken-mpa-report program for all my samples and then I merged the results using the script merge_metaphlan_tables.py provided by the MetaPhlAn2 Program. Later, I filtered the result to only have information regarding bacterial genera.
head(otumat)[,1:6]
##                       yy1 yy2 yy3 yy4 yy5 yy6
## Acidobacterium         67   0   9  33   0  10
## Candidatus_Koribacter 496   0  23  64   0   2
## Granulicella          262   2  48 192   0   5
## Terriglobus           558   0  19  86   0   8
## Chloracidobacterium    23   0   3  16   4   9
## Candidatus_Solibacter 526   0  16  76   0  19
  1. The “tax_table” which, in my case, is called “taxmat”. I created this file by filtering and transforming the merged table explained above so that I only had the phyla and genera information.
head(taxmat)
##          Phylum                 Genus
## 1 Acidobacteria        Acidobacterium
## 2 Acidobacteria Candidatus_Koribacter
## 3 Acidobacteria          Granulicella
## 4 Acidobacteria           Terriglobus
## 5 Acidobacteria   Chloracidobacterium
## 6 Acidobacteria Candidatus_Solibacter
  1. The “sample_data” which, in my case, is called “metadata”. This tab-delimited file is just metagenomes’ names and metadata informations.
head(metadata)
##     Group
## yy1    YY
## yy2    YY
## yy3    YY
## yy4    YY
## yy5    YY
## yy6    YY

For this script to work, the tax_table and the otu_table row names must be identical and in the same order. So I use the command below just to be sure that the Genera in the tax_table are exactly the same as the row names of the otu_table.

setdiff(rownames(otumat),taxmat$Genus)
## character(0)

Since the result is 0, I know there’s no difference, then I can use the data in the Genus column of taxmat as row names. Also, make sure the taxmat is a matrix.

rownames(taxmat) <- taxmat$Genus
taxmat =  as.matrix(taxmat)

Now create the phyloseq object by combining the three files imported previously.

OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
sampledata = sample_data(metadata)

physeq1 = phyloseq(OTU, TAX, sampledata)

physeq1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 850 taxa and 162 samples ]
## sample_data() Sample Data:       [ 162 samples by 1 sample variables ]
## tax_table()   Taxonomy Table:    [ 850 taxa by 2 taxonomic ranks ]

Arranging the metadata factors so that the groups will always be listed in the given order:

sample_data(physeq1)$Group <- factor((sample_data(physeq1)$Group), levels=c("YY","VV","MM","TT","CA","NN","SS"))

Alpha Diversity

Phyloseq has the function “Estimate Richness” which returns the results of some alpha diversity estimates for each metagenome:

richness <- estimate_richness(physeq1)
head(richness)
##     Observed    Chao1  se.chao1      ACE    se.ACE  Shannon   Simpson
## yy1      831 837.1818  4.171555 838.8740  9.701057 2.881380 0.7596974
## yy2      562 648.2500 21.068072 646.7016 12.412318 2.292813 0.8051032
## yy3      824 838.1304  7.375090 831.4718 13.203409 2.772338 0.8312823
## yy4      830 846.5000  9.297206 838.8370 10.317675 1.290877 0.3830826
## yy5      473 524.3830 16.106509 519.7690 11.254276 2.627071 0.8160624
## yy6      740 752.3333  5.603758 758.0869 12.736843 2.102588 0.7548794
##     InvSimpson   Fisher
## yy1   4.161419 84.18912
## yy2   5.130921 74.45343
## yy3   5.927059 90.89269
## yy4   1.620963 76.97193
## yy5   5.436626 60.96582
## yy6   4.079624 77.92174

The function “Plot Richness” generates dot plots showing the results given by “Estimate Richness”:

plot_richness(physeq1)

Since I’m analyzing many samples, a dot plot is not very informative, so I decided to generate a boxplot with metagenomes being arranged as groups. You can use any measure given in “Estimate Richness”, but for this example, I’m using the Shannon index:

plot_richness(physeq1, x="Group", measures="Shannon", color = "Group")+
  geom_boxplot(alpha=0.6)+ 
  theme(legend.position="none", axis.text.x=element_text(angle=45,hjust=1,vjust=1,size=12))

It’s also possible to add comparisons between pairs of groups in the plot. Here I’m using the function “stat_compare_means” from the ggpubr package. Wilcoxon was the test used for comparison.

a_my_comparisons <- list( c("YY", "MM"), c("CA", "TT"), c("NN", "SS"))
symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns"))

plot_richness(physeq1, x="Group", measures="Shannon", color = "Group")+
  geom_boxplot(alpha=0.6)+ 
  theme(legend.position="none", axis.text.x=element_text(angle=45,hjust=1,vjust=1,size=12))+
  stat_compare_means(method = "wilcox.test", comparisons = a_my_comparisons, label = "p.signif", symnum.args = symnum.args)

A histogram can be created to check if the Shannon values estimated for the metagenomes are normally distributed:

hist(richness$Shannon, main="Shannon index", xlab="")

Since it is, we can run anova tests and check if the variable Group impacts the Shannon diversity:

anova.sh = aov(richness$Shannon ~ sample_data(physeq1)$Group)
summary(anova.sh)
##                             Df Sum Sq Mean Sq F value Pr(>F)    
## sample_data(physeq1)$Group   6 124.82  20.804   50.52 <2e-16 ***
## Residuals                  155  63.83   0.412                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the anova tests results, it’s possible to compute the Tukey Honest Significant Differences:

TukeyHSD(anova.sh)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = richness$Shannon ~ sample_data(physeq1)$Group)
## 
## $`sample_data(physeq1)$Group`
##              diff         lwr        upr     p adj
## VV-YY  0.37317998 -0.46615308  1.2125131 0.8376996
## MM-YY  0.74818610  0.11716760  1.3792046 0.0092785
## TT-YY  1.39816943  0.65565198  2.1406869 0.0000017
## CA-YY  0.04885664 -0.53797865  0.6356919 0.9999799
## NN-YY -0.15205043 -0.79400594  0.4899051 0.9919818
## SS-YY -1.43588324 -2.00909498 -0.8626715 0.0000000
## MM-VV  0.37500611 -0.40767600  1.1576882 0.7840224
## TT-VV  1.02498945  0.14992423  1.9000547 0.0106835
## CA-VV -0.32432335 -1.07184098  0.4231943 0.8528290
## NN-VV -0.52523041 -1.31675668  0.2662959 0.4300572
## SS-VV -1.80906322 -2.54593407 -1.0721924 0.0000000
## TT-MM  0.64998333 -0.02783927  1.3278059 0.0694716
## CA-MM -0.69932946 -1.20181026 -0.1968487 0.0010245
## NN-MM -0.90023653 -1.46611477 -0.3343583 0.0000918
## SS-MM -2.18406934 -2.67057011 -1.6975686 0.0000000
## CA-TT -1.34931279 -1.98620743 -0.7124182 0.0000001
## NN-TT -1.55021986 -2.23823585 -0.8622039 0.0000000
## SS-TT -2.83405267 -3.45841701 -2.2096883 0.0000000
## NN-CA -0.20090707 -0.71705577  0.3152416 0.9068569
## SS-CA -1.48473987 -1.91237793 -1.0571018 0.0000000
## SS-NN -1.28383281 -1.78443797 -0.7832276 0.0000000

For Non-normally distributed data, we can use Kruskal-Wallis Rank Sum Test:

kruskal.test(richness$Shannon ~ sample_data(physeq1)$Group)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  richness$Shannon by sample_data(physeq1)$Group
## Kruskal-Wallis chi-squared = 104.32, df = 6, p-value < 2.2e-16

We can also get a list with the p-values resulted of the Wilcoxon Tests considering each pair of groups:

pairwise.wilcox.test(richness$Shannon, sample_data(physeq1)$Group, p.adj = "bonf")
## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  richness$Shannon and sample_data(physeq1)$Group 
## 
##    YY      VV      MM      TT      CA      NN     
## VV 1.0000  -       -       -       -       -      
## MM 0.0283  1.0000  -       -       -       -      
## TT 2.9e-05 0.0453  0.6080  -       -       -      
## CA 1.0000  1.0000  0.0030  7.1e-07 -       -      
## NN 1.0000  1.0000  0.0012  1.5e-06 1.0000  -      
## SS 7.5e-09 2.5e-06 7.2e-12 7.5e-11 1.9e-14 1.9e-09
## 
## P value adjustment method: bonferroni