A tutorial for basic R operations and alpha diversity analysis

load in the data object and explore it

Here I am using the data from the bacterial communities in response to fungicide treatments. This is a phyloseq object and saved in RDS format. To load it in to R, use the readRDS() function. I moved this data into the current directory for our convenience.

epi16S <- readRDS("C:/Users/Xiaoping/OneDrive - Virginia Tech/Xiaoping Li/Location_impacts/Gloria/bsthit_epi_clean_1000_updateDec14.rds")

Using class() to check that it is a phyloseq object.

A phyloseq object usually contains a count table (otu_table), a taxonomy table(tax_table), a metadata table (sample_data), and can also include tree, sequences, etc. Check the official page for more: here

class(epi16S)
## [1] "phyloseq"
## attr(,"package")
## [1] "phyloseq"

This data has an otu table, a sample data, a taxonomy table and a tree, you can use functions, like otu_table(), tax_table, sample_data() to extract the corresponding data

epi16S
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4645 taxa and 110 samples ]
## sample_data() Sample Data:       [ 110 samples by 18 sample variables ]
## tax_table()   Taxonomy Table:    [ 4645 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4645 tips and 4644 internal nodes ]
otu16S = otu_table(epi16S)
class(otu16S) = "matrix"
## Warning in class(otu16S) = "matrix": Setting class(x) to "matrix" sets attribute
## to NULL; result will no longer be an S4 object
tax16S = as.matrix(tax_table(epi16S))
meta16S = as.data.frame(sample_data(epi16S))

otu table looks like this (Showing first 10 rows and columns)—row: taxa, column: sample in the metadata, each cell has the sequencing abundance (how many reads assigned to that taxon)

print(otu16S[1:2, 1:10])
##                          A0NC1_barcode01 A0NC2_barcode33 A0NC3_barcode01
## OTU905:g__Mycoplasma                   7              13              11
## OTU981:c__Planctomycetia               1               3               1
##                          A1NC1_barcode02 A1NC2_barcode34 A1NC3_barcode02
## OTU905:g__Mycoplasma                   6               2               9
## OTU981:c__Planctomycetia               0               0               0
##                          A1NC4_barcode34 A2NC1_barcode03 A2NC2_barcode35
## OTU905:g__Mycoplasma                   4               3               1
## OTU981:c__Planctomycetia               0               2               0
##                          A2NC3_barcode03
## OTU905:g__Mycoplasma                   3
## OTU981:c__Planctomycetia               5

tax table looks like this, showing the lineage (7 levels):

print(tax16S[1:5, 1:7])
## Taxonomy Table:     [5 taxa by 7 taxonomic ranks]:
##                                      Domain     Phylum          
## OTU905:g__Mycoplasma                 "Bacteria" "Tenericutes"   
## OTU981:c__Planctomycetia             "Bacteria" "Planctomycetes"
## OTU770:Corynebacterium matruchotii   "Bacteria" "Actinobacteria"
## OTU457:Desulfobacula toluolica       "Bacteria" "Proteobacteria"
## OTU2343:Luteimicrobium xylanilyticum "Bacteria" "Actinobacteria"
##                                      Class                 Order              
## OTU905:g__Mycoplasma                 "Mollicutes"          "Mycoplasmatales"  
## OTU981:c__Planctomycetia             "Planctomycetia"      "c__Planctomycetia"
## OTU770:Corynebacterium matruchotii   "Actinomycetia"       "Corynebacteriales"
## OTU457:Desulfobacula toluolica       "Deltaproteobacteria" "Desulfobacterales"
## OTU2343:Luteimicrobium xylanilyticum "Actinomycetia"       "Micrococcales"    
##                                      Family               Genus              
## OTU905:g__Mycoplasma                 "Mycoplasmataceae"   "Mycoplasma"       
## OTU981:c__Planctomycetia             "c__Planctomycetia"  "c__Planctomycetia"
## OTU770:Corynebacterium matruchotii   "Corynebacteriaceae" "Corynebacterium"  
## OTU457:Desulfobacula toluolica       "Desulfobacteraceae" "Desulfobacula"    
## OTU2343:Luteimicrobium xylanilyticum "o__Micrococcales"   "Luteimicrobium"   
##                                      Species                       
## OTU905:g__Mycoplasma                 "g__Mycoplasma"               
## OTU981:c__Planctomycetia             "c__Planctomycetia"           
## OTU770:Corynebacterium matruchotii   "Corynebacterium matruchotii" 
## OTU457:Desulfobacula toluolica       "Desulfobacula toluolica"     
## OTU2343:Luteimicrobium xylanilyticum "Luteimicrobium xylanilyticum"

metadata looks like this, showing what samples have been taken, when, and other related information:

print(meta16S[1:5, 1:10])
##                  Barcodes Original_label Amplicon Seq_run Flow_cell Fungicide
## A0NC1_barcode01 barcode01          A0NC1      16S       1         1        NT
## A0NC2_barcode33 barcode33          A0NC2      16S       2         1        NT
## A0NC3_barcode01 barcode01          A0NC3      16S       3         2        NT
## A1NC1_barcode02 barcode02          A1NC1      16S       1         1        NT
## A1NC2_barcode34 barcode34          A1NC2      16S       2         1        NT
##                 Fungicide_code time Month Season
## A0NC1_barcode01              A    0   May     ES
## A0NC2_barcode33              A    0   May     ES
## A0NC3_barcode01              A    0   May     ES
## A1NC1_barcode02              A    1   May     ES
## A1NC2_barcode34              A    1   May     ES

Now, analyzing alpha diversity

first, check sequencing depth with rarefaction curve using the vegan package

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
# need to transpose first
rarecurve(t(otu16S), label=F)

or you can use ampvis2 package, so the plot would be ggplot2 object for easier manipulation

# ampvis2 is restrict on the tax_table, need to keep to 7 rankings
tax16S = as.matrix(tax_table(epi16S))
tax16S = tax16S[, 1:7]
tax_table(epi16S) = tax16S

amp.epi = phyloseq_to_ampvis2(epi16S)


amp_rarecurve(amp.epi, color_by = "funlab", stepsize = 100) +
    geom_vline(aes(xintercept = median(sample_sums(epi16S))), linetype="dashed", color="darkgreen") +
    scale_color_brewer(palette = "Set2",labels=c("Nontreated", "Daconil", "Banner Maxx", "Concert II")) +
    labs(color = "Fungicides", y="Number of observed OTUs") +
    theme(text = element_text(size=12, color="black"),
          axis.title.x = element_text(size=12, color="black", margin = margin(t=10)),
          axis.title.y = element_text(size=12, color="black", margin = margin(r=10)),
          axis.text.x = element_text(size=12, color="black"),
          axis.text.y = element_text(size=12, color="black"))

calculate good’s coverage using the phyloseq_coverage function from the metagMisc package, this is library sample based, you need to combine this with metadata for summarizing the coverage according to other factors

meta16S.epi = data.frame(sample_data(epi16S))


goods.epi = phyloseq_coverage(epi16S, correct_singletons = T)
rownames(goods.epi) = goods.epi$SampleID

epi.coverage = merge(goods.epi, meta16S.epi, by = "row.names")

head(epi.coverage)
##         Row.names        SampleID SampleCoverage  Barcodes Original_label
## 1 A0NC1_barcode01 A0NC1_barcode01      0.9889398 barcode01          A0NC1
## 2 A0NC2_barcode33 A0NC2_barcode33      0.9856085 barcode33          A0NC2
## 3 A0NC3_barcode01 A0NC3_barcode01      0.9968621 barcode01          A0NC3
## 4 A1NC1_barcode02 A1NC1_barcode02      0.9803316 barcode02          A1NC1
## 5 A1NC2_barcode34 A1NC2_barcode34      0.9728516 barcode34          A1NC2
## 6 A1NC3_barcode02 A1NC3_barcode02      0.9937704 barcode02          A1NC3
##   Amplicon Seq_run Flow_cell Fungicide Fungicide_code time Month Season
## 1      16S       1         1        NT              A    0   May     ES
## 2      16S       2         1        NT              A    0   May     ES
## 3      16S       3         2        NT              A    0   May     ES
## 4      16S       1         1        NT              A    1   May     ES
## 5      16S       2         1        NT              A    1   May     ES
## 6      16S       3         2        NT              A    1   May     ES
##        Date time_label Rep Disease funlab    funlab2 Spl_time   Season_lab
## 1 5/26/2021        pre   1      10     NT Nontreated        0 Early summer
## 2 5/26/2021        pre   2      10     NT Nontreated        0 Early summer
## 3 5/26/2021        pre   3       0     NT Nontreated        0 Early summer
## 4 5/27/2021        1dp   1      10     NT Nontreated        1 Early summer
## 5 5/27/2021        1dp   2      10     NT Nontreated        1 Early summer
## 6 5/27/2021        1dp   3       0     NT Nontreated        1 Early summer

for example, summarizing the coverage by season and fungicide

# here
epi.coverage %>%
  group_by(Season, funlab) %>%
  summarise(mean_coverage = mean(SampleCoverage), stderr = sd(SampleCoverage)/sqrt(n()))
## `summarise()` has grouped output by 'Season'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   Season [2]
##   Season funlab mean_coverage   stderr
##   <fct>  <fct>          <dbl>    <dbl>
## 1 ES     NT             0.985 0.00267 
## 2 ES     DL             0.990 0.00184 
## 3 ES     BM             0.990 0.00182 
## 4 ES     Crt2           0.991 0.00166 
## 5 EF     NT             0.995 0.00146 
## 6 EF     DL             0.975 0.0105  
## 7 EF     BM             0.978 0.00934 
## 8 EF     Crt2           0.994 0.000869

calculate alpha diversity indices, phyloseq has the function to just do that.

  • please review what is observed richness, shannon index, simpsons, Chao1, ACE

  • which of those only measure richness, and which of those measure evenness, and which measure diversity (richness and evenness)

  • if the sequencing depth allows, we can rarefy the data first (IMPORTANT: need to find an appropriate depth where it retains a reasonable amount of sequencing reads while keeping most of the samples)

In this case, the minimal depth of data is 1,669, I will just use this one as an example, the parameter ‘rngseed’ can be set as any number, next time you run it, you will get the exact result (because it involves random sampling so if you don’t set it, everytime it will produce a slightly different output)

min(sample_sums(epi16S)) #1669
## [1] 1669
epi_rare = phyloseq::rarefy_even_depth(epi16S, rngseed = 123, replace = F)
## `set.seed(123)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(123); .Random.seed` for the full vector
## ...
## 1360OTUs were removed because they are no longer 
## present in any sample after random subsampling
## ...

After rarefaction, we can estimate richness and diversity

div_epi16S = phyloseq::estimate_richness(epi_rare, measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson"))


head(div_epi16S)
##                 Observed    Chao1 se.chao1      ACE   se.ACE  Shannon   Simpson
## A0NC1_barcode01      287 609.6364 71.50498 648.9985 16.54194 4.126091 0.9561377
## A0NC2_barcode33      329 665.0185 69.02996 724.7856 17.25638 4.113913 0.9232927
## A0NC3_barcode01      300 686.2174 82.04864 785.7976 19.12988 3.685522 0.9021501
## A1NC1_barcode02      235 483.1081 60.75085 525.0204 14.11728 3.788541 0.9424328
## A1NC2_barcode34      301 640.3750 72.44460 687.7792 17.08671 3.793132 0.9158557
## A1NC3_barcode02      226 578.8333 88.30718 620.3770 17.27779 3.140454 0.8685044
##                 InvSimpson
## A0NC1_barcode01  22.798643
## A0NC2_barcode33  13.036561
## A0NC3_barcode01  10.219729
## A1NC1_barcode02  17.370997
## A1NC2_barcode34  11.884350
## A1NC3_barcode02   7.604818

ANOVA

  • we need to combine the diversity calculation with the metadata that contains the factors we want to compare, we will test on the observed richness here
mrg.epi = merge(meta16S.epi, div_epi16S, by.x="row.names", by.y="row.names")
head(mrg.epi)
##         Row.names  Barcodes Original_label Amplicon Seq_run Flow_cell Fungicide
## 1 A0NC1_barcode01 barcode01          A0NC1      16S       1         1        NT
## 2 A0NC2_barcode33 barcode33          A0NC2      16S       2         1        NT
## 3 A0NC3_barcode01 barcode01          A0NC3      16S       3         2        NT
## 4 A1NC1_barcode02 barcode02          A1NC1      16S       1         1        NT
## 5 A1NC2_barcode34 barcode34          A1NC2      16S       2         1        NT
## 6 A1NC3_barcode02 barcode02          A1NC3      16S       3         2        NT
##   Fungicide_code time Month Season      Date time_label Rep Disease funlab
## 1              A    0   May     ES 5/26/2021        pre   1      10     NT
## 2              A    0   May     ES 5/26/2021        pre   2      10     NT
## 3              A    0   May     ES 5/26/2021        pre   3       0     NT
## 4              A    1   May     ES 5/27/2021        1dp   1      10     NT
## 5              A    1   May     ES 5/27/2021        1dp   2      10     NT
## 6              A    1   May     ES 5/27/2021        1dp   3       0     NT
##      funlab2 Spl_time   Season_lab Observed    Chao1 se.chao1      ACE   se.ACE
## 1 Nontreated        0 Early summer      287 609.6364 71.50498 648.9985 16.54194
## 2 Nontreated        0 Early summer      329 665.0185 69.02996 724.7856 17.25638
## 3 Nontreated        0 Early summer      300 686.2174 82.04864 785.7976 19.12988
## 4 Nontreated        1 Early summer      235 483.1081 60.75085 525.0204 14.11728
## 5 Nontreated        1 Early summer      301 640.3750 72.44460 687.7792 17.08671
## 6 Nontreated        1 Early summer      226 578.8333 88.30718 620.3770 17.27779
##    Shannon   Simpson InvSimpson
## 1 4.126091 0.9561377  22.798643
## 2 4.113913 0.9232927  13.036561
## 3 3.685522 0.9021501  10.219729
## 4 3.788541 0.9424328  17.370997
## 5 3.793132 0.9158557  11.884350
## 6 3.140454 0.8685044   7.604818
epi_adiv2 = mrg.epi %>%
  tidyr::gather(key = "measure", value = "value", Observed:InvSimpson)


# select "Observed" for observed richness, or "Shannon" for shannon index, etc. 
epi.df = epi_adiv2 %>%
    filter(measure == "Observed")

aov.epi = epi.df %>%
  aov(value ~ funlab*time_label*Season, data=.) 

# see anova result
summary(aov.epi)
##                          Df Sum Sq Mean Sq F value   Pr(>F)    
## funlab                    3 313670  104557  33.272 5.94e-14 ***
## time_label                3 450034  150011  47.736  < 2e-16 ***
## Season                    1 103742  103742  33.013 1.69e-07 ***
## funlab:time_label         9  27732    3081   0.981  0.46276    
## funlab:Season             3  51295   17098   5.441  0.00189 ** 
## time_label:Season         3 114678   38226  12.164 1.30e-06 ***
## funlab:time_label:Season  9  24578    2731   0.869  0.55631    
## Residuals                78 245114    3142                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1