As usual, the first step for each R script is to load the necessary libraries we are going to use in the next lesson. What we will need is:
# load libraries
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
if(!require("ggpubr")){ # an if condition, if ggpubr is not installed:
install.packages("ggpubr") #install ggpubr
}
## Loading required package: ggpubr
if(!require("reshape2")){ # an if condition, if package is not installed:
install.packages("reshape2") #install package
}
## Loading required package: reshape2
if(!require("RColorBrewer")){ # an if condition, if package is not installed:
install.packages("RColorBrewer") #install package
}
## Loading required package: RColorBrewer
As usual we need to set the working directory. This is the default directory from where this R script will save and load files. Be careful, I will set it to MY working directory, but you need to change it to YOURS. I set is as the working directory where the dada2 outputs were saved from the exercise before.
setwd("/Users/gabri/Desktop/lesson_microbiome/")
Now we need to load the files that we will use in the exercise. We need the ASV count table and the Taxonomy table we generated in the dada2 exercise before.
ASVtab = readRDS("asv_table.RDS")
TAXtab = readRDS("taxa_table_silva.RDS")
TAXtab = as.data.frame(TAXtab)
Furthermore, we also need a mapping file. This is a file that can link the names of our samples to an experimental design. As is the exercise before, this script downloads it automatically from a shared folder, but you can also download it manually and read it in.
options(timeout = 3000)
url = "https://arizona.box.com/shared/static/drprxffagltlx9j76i4venb2wh90vrzu.csv" # This is the internet address where the file is stored, do not change it!
destfile = "mapping_file.csv" # Here you have to specify the folder and FILE NAME where you want the file to be stored!
download.file(url , destfile)
metadata = read.csv("mapping_file.csv")
This file contains three columns, the name of the sample, the diet and the weight. Thirst thing we can do, if diet had an impact on the weight of the mice.
#use basic r plots
boxplot(Weight_g ~ Diet, data = metadata) # plot shows huge differece
# Now we can use ggplot to make the graph look prettier
ggplot(data = metadata, aes(x = Diet, y = Weight_g, color = Diet)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter() +
ylab("Weight (g)") +
theme_bw()
# The difference looks clearly significant, but in case, let's do a t-test test
t.test(Weight_g ~ Diet, data = metadata)
##
## Welch Two Sample t-test
##
## data: Weight_g by Diet
## t = 36.413, df = 6.8565, p-value = 4.197e-09
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
## 14.48915 16.51085
## sample estimates:
## mean in group High_fat mean in group Normal
## 45.84 30.34
Sounds, good, we can start using this mapping file to expore the effect of Diet on the microbial communities. The first thing we have to do, is to check if the name of the samples is in the same order as in our ASVtab, otherwise we would have to reorganize it.
rownames(ASVtab)
## [1] "control_1" "control_2" "control_3" "control_4" "control_5"
## [6] "high_fat_1" "high_fat_2" "high_fat_3" "high_fat_4" "high_fat_5"
metadata$Name == rownames(ASVtab) ## Check if the metadata and ASVbtable are in the same order
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
It looks good!
Usually, when presenting this type of data, it is important to give an overview on it. This usually consists in the number of reads (max and min), number of ASVs per sample, and a quick overview of the taxonomy of the reads (at phylum level).
sort(rowSums(ASVtab)) # this command sorts the number of reads ( rowsums)
## high_fat_5 high_fat_1 high_fat_3 control_2 control_5 control_3 control_4
## 81657 93403 132438 150009 165807 180093 186988
## high_fat_4 control_1 high_fat_2
## 214240 236131 292883
So, the number of reads is included between 81657 to 292883 reads.Now let’s see how many ASVs are per each sample.
sort(rowSums(ASVtab != 0)) # this command sorts the number of ASVs ( rowsums of non zero cells)
## high_fat_5 high_fat_1 high_fat_3 high_fat_2 high_fat_4 control_3 control_5
## 247 251 290 305 307 354 374
## control_2 control_4 control_1
## 384 417 424
So, the number of reads is included between 247 to 424 ASVs per sample.
Ok, lets see what are the most abundant phyla that compose all reads.
phylum_tab = aggregate(t(ASVtab) ~ TAXtab$Phylum, FUN = "sum") ## sum all the ASV togheter in the same sample
phylum_tab1 = phylum_tab[,2:11]
rownames(phylum_tab) = phylum_tab$`TAXtab$Phylum`
total_ab_phyla = rowSums(phylum_tab1)/sum(rowSums(phylum_tab1))
total_ab_phyla = total_ab_phyla*100
names(total_ab_phyla) = phylum_tab$`TAXtab$Phylum`
total_ab_phyla
## Acidobacteriota Actinobacteriota Bacteroidota Crenarchaeota
## 4.614555e-04 7.364253e-01 3.671294e+01 1.153639e-04
## Cyanobacteria Desulfobacterota Firmicutes Gemmatimonadota
## 1.127682e-01 1.901081e+00 5.831517e+01 1.153639e-04
## Patescibacteria Proteobacteria Verrucomicrobiota
## 8.981077e-02 2.087452e+00 4.366523e-02
So Firmicutes is the most abundant phylum with 58.31 percent of the reads, followed by Bacteroidota (36.71 %), Proteobacteria (2.09 %) and Desulfobacterota (1.9 %).
Let’s make a staked barplot.
phylum_tab_melted = melt(phylum_tab, value = phylum_tab$`TAXtab$Phylum`) ## This melt the table
## Using TAXtab$Phylum as id variables
colnames(phylum_tab_melted)[1:2] = c("Phylum", "Sample")
phylum_tab_melted$group = metadata$Diet[match(phylum_tab_melted$Sample, metadata$Name)]
ggplot(phylum_tab_melted,aes(x =Sample, y = value, fill = Phylum))+
geom_bar(position="fill", stat="identity")+
scale_fill_brewer(palette = "Paired") +
facet_wrap(~group, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Relative abundance")
Do you remember what we said before about the potential impact of alpha diversity of richness? Probably our reads have different sequencing depth. Let’s have a look.
rowSums(ASVtab)
## control_1 control_2 control_3 control_4 control_5 high_fat_1 high_fat_2
## 236131 150009 180093 186988 165807 93403 292883
## high_fat_3 high_fat_4 high_fat_5
## 132438 214240 81657
There is a difference between the sample with the lowest amount of reads (high_fat_5) and the sample with the highest amount of reads (high_fat_2). So, to avoid problems connected to differences in sequencing depth, we go ahead and rarefy the table.
We choose a sequencing depth of 80000, since it is a very similar depth to the samples with the lowest amount of reads.
ASVtab.rar = rrarefy(ASVtab, 80000)
rowSums(ASVtab.rar) # Worked like a dream
## control_1 control_2 control_3 control_4 control_5 high_fat_1 high_fat_2
## 80000 80000 80000 80000 80000 80000 80000
## high_fat_3 high_fat_4 high_fat_5
## 80000 80000 80000
Ok now we have our rarefied table, let’s calculate:
metadata$richness = specnumber(ASVtab.rar) # calculate ASV richness
metadata$shannon = diversity(ASVtab.rar, index = "shannon")
metadata$evenness = metadata$shannon / log(metadata$richness)
Perfect, now let’s make three boxplots We can just take the values from the previous boxplot(about weight) and change it to the value we want to plot
# Plot richness
a = ggplot(data = metadata, aes(x = Diet, y = richness, color = Diet)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter() +
ylab("ASV richness") +
theme_bw()
#plot Shannon H'
b = ggplot(data = metadata, aes(x = Diet, y = shannon, color = Diet)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter() +
ylab("Shannon H'") +
theme_bw()
#plot Evenness'
c = ggplot(data = metadata, aes(x = Diet, y = evenness, color = Diet)) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
geom_jitter() +
ylab("Pielou's E") +
theme_bw()
ggarrange(a,b,c, nrow = 1, common.legend = TRUE)
For richness and diversity indexes Diet was much higher, so communities
were richer and more balanced with standard diet rather than high fat
diet. Evenness did not look significantly different between the two
groups. Although the patterns are very very clear, we can still test
them with Anovas.
### Anova tests
t.test(richness ~ Diet, data = metadata)
##
## Welch Two Sample t-test
##
## data: richness by Diet
## t = -6.4151, df = 7.8833, p-value = 0.0002192
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
## -150.18724 -70.61276
## sample estimates:
## mean in group High_fat mean in group Normal
## 276.6 387.0
t.test(shannon ~ Diet, data = metadata)
##
## Welch Two Sample t-test
##
## data: shannon by Diet
## t = -2.8353, df = 6.5277, p-value = 0.02715
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
## -0.50564256 -0.04207058
## sample estimates:
## mean in group High_fat mean in group Normal
## 4.321898 4.595754
t.test(evenness ~ Diet, data = metadata) # all signficant
##
## Welch Two Sample t-test
##
## data: evenness by Diet
## t = -0.16788, df = 7.4226, p-value = 0.8712
## alternative hypothesis: true difference in means between group High_fat and group Normal is not equal to 0
## 95 percent confidence interval:
## -0.03324020 0.02878558
## sample estimates:
## mean in group High_fat mean in group Normal
## 0.7691949 0.7714222
Richness and Shannon are signifcant while evenness is not.
Let’s move on to beta diversity. We have already seen that the communities look different from at the phylum level. Let’s see if this also reflects in patterns of beta diversity. We are first going to calculate bray curtis dissimilarities on a rarefied table. Beware that there are many methods to normalize the sample counts before calculating beta-diversities (cumulative sum scaling, arcsin transformation and many others). If the patterns are very clear, they will appear with the majority of methods, otherwise test some and see what is the difference. There is no real perfect correct way, as every dataset is different.
Then we can make an PCoA ordination plot.
# calculate bray curtis distances
bc_dist = vegdist(ASVtab.rar, method = "bray")
PCOA = cmdscale(bc_dist, eig = TRUE, k = 2)
eigenvectors = PCOA$eig / sum(PCOA$eig)
eigenvectors
## [1] 5.928289e-01 1.731930e-01 9.226475e-02 6.649667e-02 3.329752e-02
## [6] 1.839461e-02 1.121584e-02 6.649765e-03 5.658935e-03 -1.808381e-17
plot(PCOA$points)
metadata$MDS1 = PCOA$points[,1]
metadata$MDS2 = PCOA$points[,2]
ggplot(metadata, aes(MDS1, MDS2, color = Diet))+
geom_point(size = 3)+
theme_bw() +
xlab("MDS1[59.4%]")+
ylab("MDS2{17.2%]")
We see a strong division between the two condition on the first axis of
the ordination. Now we are going to test it with a PERMANOVA
perm = adonis2(bc_dist~ Diet, data = metadata)
perm # It is signficant and it explains 59 percent of the variation in our data.
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = bc_dist ~ Diet, data = metadata)
## Df SumOfSqs R2 F Pr(>F)
## Diet 1 1.03905 0.58661 11.352 0.009 **
## Residual 8 0.73222 0.41339
## Total 9 1.77127 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is significant and explains 59 percent of variation in our data.