Load the libraries

As is customary, the first step when working with each R script is to load the required libraries that will be utilized in the upcoming lesson. The libraries we will need are as follows

  1. ggplot2. ggplot is a package used to make beautiful plots
  2. ggpubr. add functionalities to ggplot
  3. reshape2. helps with table manipulation
  4. RColorBrewer. A collection of color palettes.
# load libraries
library(ggplot2)
if(!require("ggpubr")){      # an if condition, if ggpubr is not installed:
  install.packages("ggpubr") #install ggpubr
}
## Loading required package: ggpubr
if(!require("reshape2")){      # an if condition, if ggpubr is not installed:
  install.packages("reshape2") #install ggpubr
}
## Loading required package: reshape2
if(!require("RColorBrewer")){      # an if condition, if ggpubr is not installed:
  install.packages("RColorBrewer") #install ggpubr
}
## Loading required package: RColorBrewer

Set the working directory

As is typical, we need to set the working directory. This is the default directory from which the R script will save and load files. Please note that while I have set it to MY working directory, you should adjust it to match YOURS. Specifically, I have set it to the working directory where the outputs from the previous exercise using dada2 were saved.

setwd("/Users/gabri/Desktop/lesson_microbiome/Exercise_dada2")

Load the files

The next step is to load the files that we will be using in this exercise. Specifically, we require both the ASV count table and the taxonomy table that we generated in the previous exercise using dada2.

ASVtab = readRDS("asv_table.RDS")
TAXtab = readRDS("taxa_table_silva.RDS")
TAXtab = as.data.frame(TAXtab)

As with the previous exercise, the mapping file will be downloaded automatically from a shared folder.”

metadata = read.csv("mapping_file.csv")
metadata$Diet = factor(metadata$Diet)

First part, normalization

The first step in our script will be to normalize or scale our dataset in order to reduce issues caused by discrepancies in sequencing depth. As previously noted, there are many different methods to achieve this goal. In this exercise, we will be using the simplest method, which involves converting the table into relative abundances.

ASVtab_rel = sweep(ASVtab, 1, rowSums(ASVtab), FUN = "/")
rowSums(ASVtab_rel) # now we have an ASVtable of realtive abundances.
##  control_1  control_2  control_3  control_4  control_5 high_fat_1 high_fat_2 
##          1          1          1          1          1          1          1 
## high_fat_3 high_fat_4 high_fat_5 
##          1          1          1

The subsequent step will involve filtering out ASVs with low abundance, and only retaining those that are present in large numbers. This is necessary because rare ASVs tend to have more noise (random fluctuations in the number of reads), which can make the analysis less reliable. Additionally, many low-abundance ASVs may only be present in a few samples. For this exercise, we will only consider ASVs that were detected in at least four samples.

non_zero_count <- colSums(ASVtab_rel != 0) # Count non-zero values per column
selected_cols <- names(non_zero_count[non_zero_count > 3]) # Get column names with more than 3 non-zero values
ASVtab_rel_filt <- ASVtab_rel[, selected_cols]# Filter the table for selected columns
dim(ASVtab_rel_filt) ### Se the number of rows and columns
## [1]  10 387

After filtering, we are left with 387 ASVs. If desired, we could further reduce their number by selecting only the 100 most abundant ASVs. However, for the purposes of this exercise, we will proceed with the full set of 387 ASVs.

#top100 = sort(colSums(ASVtab_rel_filt), decreasing = TRUE)#[1:100]
#ASVtab_rel_filt <- ASVtab_rel#[, names(top100)]
dim(ASVtab_rel_filt)
## [1]  10 387

The next step is to construct a function containing a loop that will compute the necessary statistics for each ASV. However, before we proceed, we need to address the issue of potentially having logarithms of zero. When this occurs, our calculation will fail because the computer assigns an infinite value to the log of 0. To circumvent this issue, we will add a ‘pseudo-count’ to each value in the dataframe.

There are various ways to add a pseudo-count, such as simply adding a value of 1 to each count in the table before converting it into relative abundances, or adding a decimal value of 1. In our case, we will determine the lowest value present in our relative abundance table and use it as a putative detection limit for our study. We will subtract one-tenth of this value and use it as the pseudo-count.

### we find the lowest number in our dataset adn multiply it by 0.9)
pseudocount = min(ASVtab_rel[ASVtab_rel > 0])*0.9
rowSums(ASVtab_rel_filt) #check rowsums before
##  control_1  control_2  control_3  control_4  control_5 high_fat_1 high_fat_2 
##  0.9114093  0.9719817  0.9887169  0.8802169  0.9104984  0.8991146  0.8824104 
## high_fat_3 high_fat_4 high_fat_5 
##  0.8755493  0.8818941  0.8708133
ASVtab_rel_filt = ASVtab_rel_filt + pseudocount # add the pseudocount to the table
rowSums(ASVtab_rel_filt) #row sums did not change much
##  control_1  control_2  control_3  control_4  control_5 high_fat_1 high_fat_2 
##  0.9125986  0.9731709  0.9899062  0.8814061  0.9116876  0.9003038  0.8835996 
## high_fat_3 high_fat_4 high_fat_5 
##  0.8767385  0.8830833  0.8720025
results = data.frame()#create an empty dataframe
for ( i in 1:ncol(ASVtab_rel_filt)) #loop along the table
    {
      WT = wilcox.test(ASVtab_rel_filt[,i] ~ metadata$Diet, exact=FALSE)    # Perform a wilcoxon test
      a  = ASVtab_rel_filt[6:10, i]     # Condition1
      b  = ASVtab_rel_filt[1:5, i]    # Condition2
      l2cf = log2(median(a)/median(b))                # Calculate log2fold change
      ### calculate CI
          B <- 500 # number of bootstrap replicates
          n <- length(a) # length of the vectors to be bootstrapped
          med_y <- rep(NA, B) #empty vector to be filled
          med_x <- rep(NA, B) #empy vector to be filled
          log2_fc <- rep(NA, B) # empty vector to be filled
          for (c in 1:B) {    ## CI loop 
                              ind <- sample(n, replace=TRUE) #randomly sample a vector of length n
                              med_y[c] <- median(a[ind]) #calculate median of a 
                              med_x[c] <- median(b[ind]) #calculate median of b
                              log2_fc[c] <- log2(med_y[c]/med_x[c]) #Calculate log2fc
                            } # end of the loop
          lower_ci <- quantile(log2_fc, 0.025) #extract 0.025 CIs
          upper_ci <- quantile(log2_fc, 0.975) #extract 0.975 CI
          sum_abundance = sum(ASVtab_rel_filt[,i]) # calculate the total abundance of the ASV across the dataset
         partial_results = c(l2cf, lower_ci, upper_ci, WT$statistic, WT$p.value, sum_abundance)  # set up a partial results vector
         results =rbind(results, partial_results)        # add it as the last row of the dataframe
      }
colnames(results) = c("l2fc", "lower_CI", "upper_CI", "Wilcox_W","p.value", "Sum abundance") # rename the columns 
results$p_adj = p.adjust(results$p.value, method = "fdr") # 

Now we have all we need for plotting the results, let’s first do a vulcano plot.

results$association = ifelse(results$p_adj < 0.05 & results$l2fc > 0, "High fat", "Not singificant")
results$association = ifelse(results$p_adj < 0.05 & results$l2fc < 0, "Normal",results$association)

ggplot(results, aes(x = as.numeric(l2fc), y = -log10(as.numeric(p.value)), color = association)) +
  geom_point() +
  labs(x = "Log2 Fold Change", y = "-log10(p-value)",
       title = "Volcano Plot") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray") +
  theme_bw()

Let’s see detailed results by making a lollypop plot. before, we assign the names to each sequence, by combining a uniue numerical ID (ASV1, ASV2, ASV3) and the genus it belongs. In case the genus is not assigned, we will retrieve

name_tmp = sprintf("ASV%s",seq(1:nrow(TAXtab))) 
#now let's add some information to this. For example, we want to add the genus, and if not available, the last taxonomic information available.
TAXtab$name =  TAXtab$Genus
TAXtab$name = ifelse(is.na(TAXtab$name), TAXtab$Family, TAXtab$name)
TAXtab$name = ifelse(is.na(TAXtab$name), TAXtab$Order, TAXtab$name)
TAXtab$name = ifelse(is.na(TAXtab$name), TAXtab$Class, TAXtab$name)
TAXtab$name = ifelse(is.na(TAXtab$name), TAXtab$Phylum, TAXtab$name)
unique(is.na(TAXtab$name)) # Ok, we have all names
## [1] FALSE
## now let us apply a the prefix to each one of them
TAXtab$name = paste(name_tmp, TAXtab$name, sep = "-")
TAXtab$name[1:10] # ok it looks good.
##  [1] "ASV1-Lactococcus"                  "ASV2-Alistipes"                   
##  [3] "ASV3-Clostridia UCG-014"           "ASV4-Faecalibaculum"              
##  [5] "ASV5-Ligilactobacillus"            "ASV6-Ileibacterium"               
##  [7] "ASV7-Muribaculaceae"               "ASV8-RF39"                        
##  [9] "ASV9-Alloprevotella"               "ASV10-Rikenellaceae RC9 gut group"
unique(results$ASV == colnames(ASVtab_rel_filt))
## logical(0)
results$name = TAXtab$name[match(colnames(ASVtab_rel_filt), colnames(ASVtab_rel_filt))]

Let’s see how many signficant ASVs we have. If they are still too many, we filter them and plot only the top 50 most abundant.

results_sig = results[results$p_adj < 0.05,]
table(results_sig$association)
## 
## High fat   Normal 
##       59      141
### We still have 200 (141 and 59) differential abundant ASVs
### I wanna only plot the first 50 most across the entire study, otherwise I will end up with a huge plot
results_sig = results_sig[order(results_sig$`Sum abundance`, decreasing = TRUE),]
results_sig = results_sig[1:25,]
results_sig$name = factor(results_sig$name, levels = rev(results_sig$name))
a = ggplot(results_sig, aes(l2fc, name, color = association ))+
  geom_point() +theme_bw() +
  geom_segment( aes(x=0, xend=l2fc, y=name, yend=name, color = association)) +
  geom_vline(xintercept = 0, size=0.3) +xlab("log2FoldChange") +ylab(NULL)# +
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
  #theme(legend.position="bottom")
a

Looks good. Let’s look make some boxplots to put besides it.

AASVtab_sig = data.frame(ASVtab_rel_filt[,as.numeric(rownames(results_sig))])
colnames(AASVtab_sig) = results_sig$name
AASVtab_sig$Diet = metadata$Diet
melted = reshape2::melt(AASVtab_sig, values = "Diet")
## Using Diet as id variables
melted$variable = factor(melted$variable, levels = rev(results_sig$name))
b = ggplot(melted, aes(value, variable, fill = Diet)) +
  geom_boxplot(outlier.size = 0.5) + theme_bw() + xlab("Relative abundance") +
    theme( axis.text.y=element_blank()) + ylab(NULL)

b 

ggarrange(a,b, ncol = 2, nrow = 1,align = "h", common.legend = TRUE, widths = c(1.85, 0.9))

In this lesson we viewed a simple way to indentify differentially abundant taxa in a dataset. As we said in the slides, there are multiple tools developed for differential abundance analysis that can be adapted to a moltitude of experimental designs. Some of the include:

  1. ALDEx2.
  2. ANCOM-BC.
  3. DEseq2.
  4. MaAslinn2.
  5. LefSE.
  6. metagenomeSeq.

…and others… Explore all different tools and try apply them on your dataset. Be critical with their results.