Introduction to microbial data analysis

#Dataset used generated by He et al.

The following RMD file is able to take a dataset containing ASV abundance data, metadata, taxonomy information, and metabolomics data to generate graphics and conduct multivariate analysis.

First, move the R file you’ll be doing analysis in in the same folder as where your data is stored on your computer.

Now import the necessary libraries.

library("phyloseq")
library("ggplot2")      # Needed to generate graphs.
library("readxl")       # To import the data from an Excel file.
library("dplyr")        # Allows us to filter and reformat data frames.
library("tibble")       # Needed for converting column names to row names.

Read-in your data from their respective files. You can copy the path of the files you’ll need from your file explorer by right clicking and selecting “Copy file as path.”

taxa <- read.csv("HeSlupsky/asv_abundances/feature_table.tsv",sep='\t',header = TRUE, check.names = F,row.names = 1)
taxonomy <- read.csv("HeSlupsky/taxonomy_abundances/taxonomy.tsv",sep="\t")
mtb <- read.csv("HeSlupsky/metabolomics/mtb.tsv",sep='\t',header=TRUE,row.names = 1,check.names = F)
mtb.map <- read.csv("HeSlupsky/metabolomics/mtb.map.tsv",sep='\t',header=TRUE,check.names = F)

Now you need to make the column names for your taxa match the names from your ASV abundances. In our case, we just removed the date information in the string, your data will need a different adjustment, if any. The split function allows you to remove any text in a string you don’t want present.

colnames(taxa) <- 
sapply(colnames(taxa), function(cname) unlist(strsplit(cname,split="12021."))[2] )

These lines define the taxonomy table by ranks. If there are fewer than 7 ranks, it fills in ‘NA’.

taxranks <- sapply(taxonomy$Taxon, function (taxa) unlist(strsplit(taxa, split = ';')))
TAX <- matrix(NA, nrow=length(taxonomy$Taxon), ncol = 7)
colnames(TAX) <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")

Now we make the row names for the taxonomy table the taxonomy feature names, so they now fully match the ASV abundance table.

for (row.id in 1:nrow(TAX)){TAX[row.id,1:length(taxranks[[row.id]])] <- taxranks[[row.id]]}
rownames(TAX) <- taxonomy$Feature.ID

Next we create a phyloseq object called ‘mydata’ using our metadata and taxa.

metadata <- read.csv("HeSlupsky/metadata.tsv",header=TRUE,sep="\t")
rownames(metadata) <- metadata$Sample # assign rows with sample names
mydata <- phyloseq(otu_table(taxa, taxa_are_rows = TRUE), 
                   tax_table(TAX), sample_data(metadata))

Now we can keep only the data we want to analyze. For us, we are selecting only the 2 month-old patient samples. The subset you want to analyze must be a variable in your metadata.

Next, we normalize the number of reads in each sample using median sequencing depth. We chose to filter out ASVs with less than 20% of the total reads per sample, but you can adjust as is neccessary.

Now we can start plotting. Let’s start with a heatmap. We use the Bray-Curtis method to calculate dissimilarity, a commonly used statistical method for datasets like microbiome data.

plot_heatmap(mydata_abund, method = "NMDS", distance = "bray")

Now we can graph a bar plot. We separate based on genus, but you can chose any of the taxa ranks we created earlier.

#Bar Plot
plot_bar(mydata_abund, fill = "genus") + 
  geom_bar(aes(color=genus, fill=genus), stat="identity", position="stack")

And here we do multivariate ordination, again using the bray-curtis method. We generate the ordination data by creating an object called ‘mydata.ord’

mydata.ord <- ordinate(mydata, "NMDS", "bray")

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.2658998 
## Run 1 stress 0.2877911 
## Run 2 stress 0.2668965 
## Run 3 stress 0.2724969 
## Run 4 stress 0.2685011 
## Run 5 stress 0.2755811 
## Run 6 stress 0.2677693 
## Run 7 stress 0.2738416 
## Run 8 stress 0.2816317 
## Run 9 stress 0.2825174 
## Run 10 stress 0.2714409 
## Run 11 stress 0.2734688 
## Run 12 stress 0.276857 
## Run 13 stress 0.2857445 
## Run 14 stress 0.2841584 
## Run 15 stress 0.2709783 
## Run 16 stress 0.2735445 
## Run 17 stress 0.2678149 
## Run 18 stress 0.2669257 
## Run 19 stress 0.2762585 
## Run 20 stress 0.2769973 
## *** Best solution was not repeated -- monoMDS stopping criteria:
##      2: no. of iterations >= maxit
##     18: stress ratio > sratmax

plot_ordination(mydata, mydata.ord, type="samples", color="diet", shape="Gender", title="Samples") + geom_point(size=3)

Introduction to microbial data analysis

Jordan Jackson

2023-05