Overview

In this analysis we are looking at the relative abundance of microbiome time series samples from a single subject. The relative abundance table gives us the “percentage” for each microbe detected in the sample (so for each sample the sum of relative abundances of all microbes should add to 1.0). The relative abundance table is available for 5 taxonimc levels, but here we start with the genus, followed by family, followed by order. The general method that we use here is PCA. For a given abundance table, we compute the top 2 PC’s and visualize the data in the PC space. We label the samples that are in PC space with their date/sampleID. This can visually tell us if there is any signal in the data. After viewing the distribution in PC space, we want to know which microbes are driving the top 2 PC’s. For this, we plot the top 2 PC loadings: that is, the weight that PCA assigns to a microbe for the PC space. A higher weight (positve or negative), indicates that the microbe contributes more the PC’s. We select the top microbes (microbes with largest loadings) and visualize how their abundance changes over the time series.

In summary the steps are as follows:
- Compute PCA on the relative abundance table and visualize how samples change in the top 2 PC space
- Visualize the loadings of the microbes for the top 2 PC and select microbes with highest magnitude loadings
- Visualize the relative abundance of the highest miagnitude microbes as time series

library(reshape)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:graphics':
## 
##     layout

Multi-Dimensional Scaling (MDS)

Justine Debelius provided this dataset and also provided a distance matrix of the samples using the unifrac distance metric. Since this distance matrix is already provided, we can use the Multi-Dimensional Scaling technique with the powerful unifrac distance metric. This gives us a more powerful represntaiton for testing to see if any signal is in the data, but not clear way of determining the microbes that are important.

df = read.delim("~/Documents/AG-LS/data/long_time/beta_diversity/unweighted_unifrac_dm.txt", header = TRUE, stringsAsFactors = FALSE)

dist.matrix = as.matrix(df[,-1])
row.names(dist.matrix) = df$X

fit <- cmdscale(dist.matrix,eig=TRUE, k=2) # k is the number of dim

res = data.frame(x = fit$points[,1], y = fit$points[,2])
dates = sapply(row.names(dist.matrix), FUN = function(x) strsplit(x, split = "LS.")[[1]][2]) 
res$sample.dates = as.Date(dates, format = "%m.%d.%Y")

add.stage = function(df.x){
  df.x$stage = 0
  df.x$stage[which(df.x$sample.dates > as.Date("2014-01-01"))] = 1
  df.x$stage = as.factor(df.x$stage)
  return(df.x)
}

res$stage = 0
res$stage[which(res$sample.dates > as.Date("2014-01-01"))] = 1
res$stage = as.factor(res$stage)
ggplot(res, aes(x = x, y = y, label = sample.dates, colour = stage)) + geom_text(size = 8) -> p
#plotly_POST(p, filename = "MDS-unifrac", fileopt = "overwrite")
ggplotly(p)

We seee a very clear separation between the two stages when LS had several weight loss (stage 0) followed by weight recovery (stage 1)

PCA

Perform PCA on each relative abundance table for each taxonomic level Below is a helper function for comptuing PCA

pca.computer = function(df.x, take.log = TRUE, imputing.val = 1e-6){
  numeric.df = df.x[,-c(1:3)]
  if (take.log == TRUE){
    numeric.df = log(imputing.val + numeric.df)
  }
  pca = prcomp(numeric.df)
  pca.res = cbind(as.Date(df.x$sample.date), as.data.frame(pca$x))
  names(pca.res)[1] = "sample.dates"
  pca.loadings = as.data.frame(pca$rotation)
  pca.loadings$microbe = row.names(pca.loadings)
  return(list(add.stage(pca.res), pca.loadings))
}


top.microbes = function(pc.loadings, num.pcs = 6, additional.microbe = NULL, df.x){
  loadings.mag = apply(as.matrix(pc.loadings[,c(1:2)]), 1, FUN = function(x) norm(x, type = "2"))
  top.loadings = names(loadings.mag[order(loadings.mag, decreasing = TRUE)[c(1:num.pcs)]])
  top.microbe.df = df.x[,c("sample.date", top.loadings, additional.microbe)]
  top.microbe.m = melt(top.microbe.df, idvar = "sample.date")
  return(top.microbe.m)
}

Perform PCA on genus relative abundance

df.genus = read.csv("~/Documents/AG-LS/data/long_time_cleaned/AG-LS-L6.csv", header = TRUE, stringsAsFactors = FALSE)
pca.genus = pca.computer(df.genus)
genus.proj = pca.genus[[1]]
genus.loadings = pca.genus[[2]]
ggplot(genus.proj, aes(x = PC1, y = PC2, label = sample.dates, colour = stage)) + 
  geom_text(size = 8) + ggtitle("AG-LS PCA Genus") -> p
ggplotly(p)

Using PCA on the relative abundance of genus we see a similar separation of stage 0 and stage 1. Plot the loadings for the genus PCA to figure out what are the most important genus.

ggplot(genus.loadings, aes(x = PC1, y = PC2, label = microbe)) + 
  geom_text(size = 8) + ggtitle("Genus PC loadings")-> p
#plotly_POST(p, filename = "PCA-genus-no-imputation", fileopt = "overwrite")
ggplotly(p)

Plot time series for top genus based on PC loadings

top.genus.m = top.microbes(genus.loadings, df.x = df.genus)
## Using sample.date as id variables
ggplot(top.genus.m, aes(x = sample.date, y = value)) + facet_wrap(~variable, ncol = 1, scales = "free_y") + 
  geom_point() + xlab("") + ylab("") + ggtitle("Top Genus time series")-> p
ggplotly(p)

Perform PCA on family relative abundance

df.family = read.csv("~/Documents/AG-LS/data/long_time_cleaned/AG-LS-L5.csv", header = TRUE, stringsAsFactors = FALSE)
pca.family = pca.computer(df.family, take.log = FALSE)
family.proj = pca.family[[1]]
family.loadings = pca.family[[2]]
ggplot(family.proj, aes(x = PC1, y = PC2, label = sample.dates, colour = stage)) + 
  geom_text(size = 8) + ggtitle("AG-LS PCA Family with log transform<br>(adding 1e-6 to all entries)") -> p
ggplotly(p)

Plot the loadings for the family PCA

ggplot(family.loadings, aes(x = PC1, y = PC2, label = microbe)) + 
  geom_text(size = 8) + ggtitle("Family PC loadings") -> p
#plotly_POST(p, filename = "PCA-family-loadings", fileopt = "overwrite")
ggplotly(p)

Plot time series for top genus based on PC loadings

top.family.m = top.microbes(family.loadings, additional.microbe = "f__Peptococcaceae", df.x = df.family)
## Using sample.date as id variables
ggplot(top.family.m, aes(x = sample.date, y = value)) + facet_wrap(~variable, ncol = 1, scales = "free_y") + 
  geom_point() + xlab("") + ylab("")-> p
#plotly_POST(p, filename = "Top-family-time-series", fileopt = "overwrite")
ggplotly(p)

Perform PCA on order relative abundance

df.order = read.csv("~/Documents/AG-LS/data/long_time_cleaned/AG-LS-L4.csv", header = TRUE, stringsAsFactors = FALSE)
pca.order = pca.computer(df.order, take.log = FALSE)
order.proj = pca.order[[1]]
order.loadings = pca.order[[2]]
ggplot(order.proj, aes(x = PC1, y = PC2, label = sample.dates, colour = stage)) + 
  geom_text(size = 8) + ggtitle("AG-LS PCA Order with log transform<br>(adding 1e-6 to all entries)") -> p
#plotly_POST(p, filename = "PCA-order-imputation", fileopt = "overwrite")
ggplotly(p)

Plot the loadings for the order PCA

ggplot(order.loadings, aes(x = PC1, y = PC2, label = microbe)) + 
  geom_text(size = 8) + ggtitle("Order PC loadings") -> p
ggplotly(p)

Plot time series for top genus based on PC loadings

top.order.m = top.microbes(order.loadings, df.x = df.order)
## Using sample.date as id variables
ggplot(top.order.m, aes(x = sample.date, y = value)) + facet_wrap(~variable, ncol = 1, scales = "free_y") + 
  geom_point() + xlab("") + ylab("")-> p
ggplotly(p)