Taxonomy, Alpha and beta diversity

Load the libraries

As usual, the first step for each R script is to load the necessary libraries we are going to use in the next lesson. What we will need is:

vegan Vegan is a R package specifically designed for community ecology. It contains a lot of useful functions.
ggplot2 ggplot is a package used to make beautiful plots
ggpubr add functionalities to ggplot
reshape2 helps with table manipulation
RColorBrewer A collection of color palettes.

# load libraries
library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
if(!require("ggpubr")){      # an if condition, if ggpubr is not installed:
  install.packages("ggpubr") #install ggpubr
}

## Loading required package: ggpubr

if(!require("reshape2")){      # an if condition, if package is not installed:
  install.packages("reshape2") #install package
}

## Loading required package: reshape2

if(!require("RColorBrewer")){      # an if condition, if package is not installed:
  install.packages("RColorBrewer") #install package
}

## Loading required package: RColorBrewer

Set the working directory

As usual we need to set the working directory. This is the default directory from where this R script will save and load files. Be careful, I will set it to MY working directory, but you need to change it to YOURS. I set is as the working directory where the dada2 outputs were saved from the exercise before.

setwd("/Users/gabri/Desktop/lesson_microbiome/")

Load the files

Now we need to load the files that we will use in the exercise. We need the ASV count table and the Taxonomy table we generated in the dada2 exercise before.

ASVtab = readRDS("asv_table.RDS")
TAXtab = readRDS("taxa_table_silva.RDS")
TAXtab = as.data.frame(TAXtab)

Furthermore, we also need a mapping file. This is a file that can link the names of our samples to an experimental design. As is the exercise before, this script downloads it automatically from a shared folder, but you can also download it manually and read it in.

options(timeout = 3000) 
url = "https://arizona.box.com/shared/static/drprxffagltlx9j76i4venb2wh90vrzu.csv" # This is the internet address where the file is stored, do not change it!
destfile = "mapping_file.csv" # Here you have to specify the folder and FILE NAME  where you want the file to be stored!
download.file(url , destfile)
metadata = read.csv("mapping_file.csv")

This file contains three columns, the name of the sample, the diet and the weight. Thirst thing we can do, if diet had an impact on the weight of the mice.

#use basic r plots
boxplot(Weight_g ~ Diet, data = metadata) # plot shows huge differece

# Now we can use ggplot to make the graph look prettier
ggplot(data = metadata, aes(x = Diet, y = Weight_g, color = Diet)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter() +
  ylab("Weight (g)") + 
  theme_bw()

# The difference looks clearly significant, but in case, let's do a t-test test
t.test(Weight_g ~ Diet, data = metadata)

## 
##  Welch Two Sample t-test
## 
## data:  Weight_g by Diet
## t = 36.413, df = 6.8565, p-value = 4.197e-09
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
##  14.48915 16.51085
## sample estimates:
## mean in group High_fat   mean in group Normal 
##                  45.84                  30.34

Sounds, good, we can start using this mapping file to expore the effect of Diet on the microbial communities. The first thing we have to do, is to check if the name of the samples is in the same order as in our ASVtab, otherwise we would have to reorganize it.

rownames(ASVtab)

##  [1] "control_1"  "control_2"  "control_3"  "control_4"  "control_5" 
##  [6] "high_fat_1" "high_fat_2" "high_fat_3" "high_fat_4" "high_fat_5"

metadata$Name == rownames(ASVtab) ## Check if the metadata and ASVbtable are in the same order

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

It looks good!

Let’s get an overview of our sequencig run

Usually, when presenting this type of data, it is important to give an overview on it. This usually consists in the number of reads (max and min), number of ASVs per sample, and a quick overview of the taxonomy of the reads (at phylum level).

sort(rowSums(ASVtab)) # this command sorts the number of reads ( rowsums)

## high_fat_5 high_fat_1 high_fat_3  control_2  control_5  control_3  control_4 
##      81657      93403     132438     150009     165807     180093     186988 
## high_fat_4  control_1 high_fat_2 
##     214240     236131     292883

So, the number of reads is included between 81657 to 292883 reads.Now let’s see how many ASVs are per each sample.

sort(rowSums(ASVtab != 0)) # this command sorts the number of ASVs ( rowsums of non zero cells)

## high_fat_5 high_fat_1 high_fat_3 high_fat_2 high_fat_4  control_3  control_5 
##        247        251        290        305        307        354        374 
##  control_2  control_4  control_1 
##        384        417        424

So, the number of reads is included between 247 to 424 ASVs per sample.

Taxonomic overview

Ok, lets see what are the most abundant phyla that compose all reads.

phylum_tab = aggregate(t(ASVtab) ~ TAXtab$Phylum, FUN = "sum") ## sum all the ASV togheter in the same sample
phylum_tab1 = phylum_tab[,2:11]
rownames(phylum_tab) = phylum_tab$`TAXtab$Phylum`
total_ab_phyla = rowSums(phylum_tab1)/sum(rowSums(phylum_tab1))
total_ab_phyla = total_ab_phyla*100
names(total_ab_phyla) = phylum_tab$`TAXtab$Phylum`
total_ab_phyla

##   Acidobacteriota  Actinobacteriota      Bacteroidota     Crenarchaeota 
##      4.614555e-04      7.364253e-01      3.671294e+01      1.153639e-04 
##     Cyanobacteria  Desulfobacterota        Firmicutes   Gemmatimonadota 
##      1.127682e-01      1.901081e+00      5.831517e+01      1.153639e-04 
##   Patescibacteria    Proteobacteria Verrucomicrobiota 
##      8.981077e-02      2.087452e+00      4.366523e-02

So Firmicutes is the most abundant phylum with 58.31 percent of the reads, followed by Bacteroidota (36.71 %), Proteobacteria (2.09 %) and Desulfobacterota (1.9 %).

Let’s make a staked barplot.

phylum_tab_melted = melt(phylum_tab, value = phylum_tab$`TAXtab$Phylum`) ## This melt the table

## Using TAXtab$Phylum as id variables

colnames(phylum_tab_melted)[1:2] = c("Phylum", "Sample")
phylum_tab_melted$group = metadata$Diet[match(phylum_tab_melted$Sample, metadata$Name)]
ggplot(phylum_tab_melted,aes(x =Sample, y = value, fill = Phylum))+
    geom_bar(position="fill", stat="identity")+
    scale_fill_brewer(palette = "Paired") + 
    facet_wrap(~group, scales = "free") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ylab("Relative abundance")

Alpha diversity calculation

Do you remember what we said before about the potential impact of alpha diversity of richness? Probably our reads have different sequencing depth. Let’s have a look.

rowSums(ASVtab)

##  control_1  control_2  control_3  control_4  control_5 high_fat_1 high_fat_2 
##     236131     150009     180093     186988     165807      93403     292883 
## high_fat_3 high_fat_4 high_fat_5 
##     132438     214240      81657

There is a difference between the sample with the lowest amount of reads (high_fat_5) and the sample with the highest amount of reads (high_fat_2). So, to avoid problems connected to differences in sequencing depth, we go ahead and rarefy the table.

We choose a sequencing depth of 80000, since it is a very similar depth to the samples with the lowest amount of reads.

ASVtab.rar = rrarefy(ASVtab, 80000)
rowSums(ASVtab.rar) # Worked like a dream

##  control_1  control_2  control_3  control_4  control_5 high_fat_1 high_fat_2 
##      80000      80000      80000      80000      80000      80000      80000 
## high_fat_3 high_fat_4 high_fat_5 
##      80000      80000      80000

Ok now we have our rarefied table, let’s calculate:

ASV richness (the number of ASVs)
Shannon H’ (a diversity index)
Pielou’s evenness (Shannon H’/log(richness))

metadata$richness = specnumber(ASVtab.rar) # calculate ASV richness
metadata$shannon = diversity(ASVtab.rar, index = "shannon")
metadata$evenness = metadata$shannon / log(metadata$richness)

Perfect, now let’s make three boxplots We can just take the values from the previous boxplot(about weight) and change it to the value we want to plot

# Plot richness
a = ggplot(data = metadata, aes(x = Diet, y = richness, color = Diet)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter() +
  ylab("ASV richness") + 
  theme_bw()       

#plot Shannon H'
b = ggplot(data = metadata, aes(x = Diet, y = shannon, color = Diet)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter() +
  ylab("Shannon H'") + 
  theme_bw()

#plot Evenness'
c = ggplot(data = metadata, aes(x = Diet, y = evenness, color = Diet)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_jitter() +
  ylab("Pielou's E") + 
  theme_bw()

ggarrange(a,b,c, nrow = 1, common.legend = TRUE)

For richness and diversity indexes Diet was much higher, so communities were richer and more balanced with standard diet rather than high fat diet. Evenness did not look significantly different between the two groups. Although the patterns are very very clear, we can still test them with Anovas.

### Anova tests
t.test(richness ~ Diet, data = metadata)

## 
##  Welch Two Sample t-test
## 
## data:  richness by Diet
## t = -6.4151, df = 7.8833, p-value = 0.0002192
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
##  -150.18724  -70.61276
## sample estimates:
## mean in group High_fat   mean in group Normal 
##                  276.6                  387.0

t.test(shannon ~ Diet, data = metadata)

## 
##  Welch Two Sample t-test
## 
## data:  shannon by Diet
## t = -2.8353, df = 6.5277, p-value = 0.02715
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
##  -0.50564256 -0.04207058
## sample estimates:
## mean in group High_fat   mean in group Normal 
##               4.321898               4.595754

t.test(evenness ~ Diet, data = metadata) # all signficant

## 
##  Welch Two Sample t-test
## 
## data:  evenness by Diet
## t = -0.16788, df = 7.4226, p-value = 0.8712
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
##  -0.03324020  0.02878558
## sample estimates:
## mean in group High_fat   mean in group Normal 
##              0.7691949              0.7714222

Richness and Shannon are signifcant while evenness is not.

Beta diversity

Let’s move on to beta diversity. We have already seen that the communities look different from at the phylum level. Let’s see if this also reflects in patterns of beta diversity. We are first going to calculate bray curtis dissimilarities on a rarefied table. Beware that there are many methods to normalize the sample counts before calculating beta-diversities (cumulative sum scaling, arcsin transformation and many others). If the patterns are very clear, they will appear with the majority of methods, otherwise test some and see what is the difference. There is no real perfect correct way, as every dataset is different.

Then we can make an PCoA ordination plot.

# calculate bray curtis distances
bc_dist = vegdist(ASVtab.rar, method = "bray")
PCOA = cmdscale(bc_dist, eig = TRUE, k = 2)
eigenvectors = PCOA$eig / sum(PCOA$eig)
eigenvectors

##  [1]  5.928289e-01  1.731930e-01  9.226475e-02  6.649667e-02  3.329752e-02
##  [6]  1.839461e-02  1.121584e-02  6.649765e-03  5.658935e-03 -1.808381e-17

plot(PCOA$points)

metadata$MDS1 = PCOA$points[,1]
metadata$MDS2 = PCOA$points[,2]
ggplot(metadata, aes(MDS1, MDS2, color = Diet))+
   geom_point(size = 3)+
   theme_bw() + 
   xlab("MDS1[59.4%]")+
   ylab("MDS2{17.2%]")

We see a strong division between the two condition on the first axis of the ordination. Now we are going to test it with a PERMANOVA

perm = adonis2(bc_dist~ Diet, data = metadata)
perm # It is signficant and it explains 59 percent of the variation in our data.

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = bc_dist ~ Diet, data = metadata)
##          Df SumOfSqs      R2      F Pr(>F)   
## Diet      1  1.03905 0.58661 11.352  0.009 **
## Residual  8  0.73222 0.41339                 
## Total     9  1.77127 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is significant and explains 59 percent of variation in our data.