Introduction to Microbial Data

source('https://gist.githubusercontent.com/sw1/d7db73c62c1abdae6671d039ccee9298/raw/a2ed1fbc2d2f24c1d838d9ddb9871cb5c819512f/stem_data.R')

## Warning in install.packages(c("tidyverse", "StanHeaders",
## "themetagenomics", : installation of package 'themetagenomics' had non-zero
## exit status

## Warning in install.packages(c("tidyverse", "StanHeaders",
## "themetagenomics", : installation of package 'huge' had non-zero exit
## status

## Bioconductor version 3.5 (BiocInstaller 1.26.0), ?biocLite for help

## BioC_mirror: https://bioconductor.org

## Using Bioconductor 3.5 (BiocInstaller 1.26.0), R 3.4.1 (2017-06-30).

## Installing package(s) 'Biostrings'

## Warning in install.packages(pkgs = doing, lib = lib, ...): installation of
## package 'Biostrings' had non-zero exit status

source('https://gist.githubusercontent.com/sw1/d7db73c62c1abdae6671d039ccee9298/raw/528b7bccfa985c2458d377e760638c71f16e05f8/stem_functions.R')

Where do we start?

Say we swabbed someone’s skin. How would we go about finding out what microbes make up that person’s “skin microbiome”? Well, that swab now contains an incredibly large number of cells from hudreds to thousands of different bacterial species. Our next step would be to break open, or “lyse,” these cells and collect all of the DNA. This is where things get complicated. Once we collect this DNA, we end up with fragments of DNA from various DNA strands, and we lack any knowledge about which microbe these fragments came from. We simply have fragment after fragment of unlabeled DNA. How can we find out whether a given fragment initially belonged to, say, E. coli or S. aureus?

Alignment

Let’s see what one of these sequences looks like:

seq

##                                                    sample 
## "ACGTCGTGCGCGGCGGTTCCCACACATACAACCGCGTGATACGCGATATGAGCGA"

It’s just a long sequence of nucleotides, As, Cs, Gs, and Ts. This particular sequence is 55 nucleotides long. If we wanted to know what species of bacteria this sequence likely came from, we have to compare this sequence to sequences with known origin. So the game is pretty simple, we take our unknown sample sequence and compare it to, say, a DNA sequence from E. coli, and then another sequence from S. aureus, and so on. We then see which sequence was the best match. That best match will tell us the likely origin of our sequence.

Now the question is, how do we go about making that comparison? How do we compare two different sequences of DNA? Well, we do something called DNA “alignment.” The simplest approach is to compare the nucleotide at position 1 for one of the sequences to the nucleotide at position 1 for another sequence. If it’s a match, we score it a 1; if they’re different, we score it a 0. We then sum up the score for each position to the end.

Let’s try some examples:

hamming()

When we do real DNA alignment, the methods are a little more sophisticated, so we don’t go into detail, but we can still use some commands to use them. Remember our sequence from our sample?

seq

##                                                    sample 
## "ACGTCGTGCGCGGCGGTTCCCACACATACAACCGCGTGATACGCGATATGAGCGA"

Let’s compare it to 3 other sequences:

seq1

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]    90 GCAAGCGTTATCCGGATTTAC...GGGGCTCAACCCCGGGACTG ecoli

seq2

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]    90 CCGGGCGTTATCCGGATTTAT...GAAGCTCAACGGCTGACTTG staph

seq3

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]    90 ACAAGCGTTGTCCGGAATTAC...TGGGCTCAACCCATGAACTG strep

Notice the names on the far right? Sequence 1 is from E. coli, sequence 2 is from Staph, and sequence 3 is from Strep. Let’s compare our sequence to E. coli using a command that does alignment for us.

pairwiseAlignment(seq,seq1)

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] ACGT-CGTGCGCGGCGGTTCCCACACATACA-...-CGATATGA-------------------GCGA 
## subject: [1] GCAAGCGTTATCCGGATTTACTGGGTGTAAAG...CTGATGTGAAAGCCCGGGGCTCAACCCCGGGA 
## score: -288.0297

This function aligns our sequnce (“pattern”) to the E. coli sequence (“subject”) and gives us a score telling us how good the alignment is. Remember the goal here. We want to find which of the 3 bacteria is the best match for our sequence. Your job is then to replace seq1 in the command above with the other two sequences, get the score, and then choose which bacteria our sequence most likely originated from.

Scaling Up

Now, back to that swab. What we learned so far is how to make comparisons between unknown DNA fragments and known DNA fragments. This gives us a way of figuring out what’s in our sample. For each fragment, we can compare it to a set of known fragments, find the best match, and write down which species it most likely originated from. After doing this to all of our fragments from our sample, we can count the number of species that were present in our sample. Then, we can repeat the entire process for more samples. The end result is a bunch of counts. These counts are often presented like so:

DAVID$ABUND[1:5,1:5]

##           00001 00002 00003 00004 00005
## ERR531441  7232 44702     0  4562 17174
## ERR531442  1807     0   756  5386  3899
## ERR531443 20396 19382     0  9155 10203
## ERR531444 14325   686     0   394 11702
## ERR531445  3534  6751 13093  4323  3533

What we see here are counts for a specific sample (the row names, like ERR531441) and specific bacterial species (the column names, like 00001). So, when we counted the bacteria in sample ERR531441, we found 7232 DNA fragments that matched bacterial species 00001, 44702 fragments for species 00002, and so on. If you want to see what the actual bacteria species are, we can find that here:

DAVID$TAX[1:5,1:7]

##       Kingdom    Phylum          Class         Order          
## 00001 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## 00002 "Bacteria" "Firmicutes"    "Clostridia"  "Clostridiales"
## 00003 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
## 00004 "Bacteria" "Firmicutes"    "Clostridia"  "Clostridiales"
## 00005 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
##       Family            Genus              Species      
## 00001 "Bacteroidaceae"  "Bacteroides"      NA           
## 00002 "Ruminococcaceae" "Faecalibacterium" NA           
## 00003 "Bacteroidaceae"  "Bacteroides"      "plebeius"   
## 00004 "Ruminococcaceae" "Faecalibacterium" "prausnitzii"
## 00005 "Bacteroidaceae"  "Bacteroides"      NA

Microbiome Analysis

That data you just saw is from a real study. The researchers followed two individuals daily for nearly a year. Each day, they profiled the microbes in their gut and mouth.

Subject B had food poisoning at one point in the study. You should be able to locate the day this occured by plotting the data. If you run the following function, you’ll see a bar plot showing how different gut Phyla change over time in subject B. You can change “Phylum” to any of the following: Class, Order, Family, or Genus. Plotting different taxonomic levels may help you identify the day food poisoning occured:

plot_subjectB("Phylum")

Now, let’s look at subject A’s mouth microbiome. Compare and contrast the mouth microbiome of subject A to the gut microbiome of subject B:

plot_subjectA("Phylum")

Clustering Taxa

vis(EFFECTS,type='taxa')
vis(EFFECTS,type='continuous')
vis(EFFECTS,type='binary')