Updated January 19, 2018

Online version available at http://rpubs.com/maddieSC/R_SOP_UCR_Jan_2018

Tips for this workshop

If you have any issues in R, type ??command into the console where command is the function you are having issues with and a help page will come up. To search for a command, type ?command.
Lines starting with # are comments that are for the reader’s benefit. These lines are not code and do not need to be entered into the console. They are also not evaluated by R and can be copied into your script with the code.
GREEN boxes contain code that you can copy and paste to run on your machine.

#GREEN box

WHITE boxes contain sample output of this code, and nothing will happen if you try to copy it into your console.
```
#WHITE box
```
ORANGE boxes contain R code which you should NOT execute today, either because it takes forever or because we do not have the necessary input to run it.

#ORANGE box

Basic R code you may find useful:
1. Matrices/data frames are designated by [ , ] where it is [rows, columns]
2. | is or
3. & is and

Consejos para este taller

Si tiene problemas en R, escriba ?? command en la consola dondecommand es la función con la que está teniendo problemas y aparecerá una página de ayuda. Para buscar un command, escriba?command.
Las líneas que comienzan con # son comentarios que son para el beneficio del lector. Estas líneas no son código y no necesitan ingresarse en la consola. Tampoco son evaluados por R y pueden copiarse en su secuencia de comandos con el código.
Los cuadros VERDE contienen código que puede copiar y pegar para ejecutar en su máquina.
Los cuadros BLANCOS contienen ejemplos de salida de este código, y nada sucederá si intentas copiarlo en tu consola.
Los cuadros NARANJA contienen un código R que NO deberías ejecutar hoy, ya sea porque lleva mucho tiempo o porque no tenemos la entrada necesaria para ejecutarlo.
Código R básico que puede serle útil:
1. Las matrices / marcos de datos se designan por [,] donde es [filas, columnas] segundo. | es o do. & es y

Introduction

Written for R v3.4.3 in RStudio v1.1.383

Goal

The goal of this tutorial is to demonstrate basic analyses of microbiota data to determine if and how communities differ by variables of interest. In general, this pipeline can be used for any microbiota data set that has been clustered into operational taxonomic units (OTUs).

This tutorial assumes some basic statistical knowledge. Please consider if your data fit the assumptions of each test (normality? equal sampling? Etc.). If you are not familiar with statistics at this level, we strongly recommend collaborating with someone who is. The incorrect use of statistics is a pervasive and serious problem in the sciences so don’t become part of the problem! That said, this is an introductory tutorial and there are many, many further analyses that can be done with microbiota data. Hopefully, this is just the start for your data!

Data

The data used here were created using 2x250 bp amplicon sequencing of the bacterial V4 region of the 16S rRNA gene on the Illumina MiSeq platform. The full data set is in Dill-McFarland et al. Sci Rep 7: 40864. Here, we will use a subset of samples. Specifically, we will be correlating the fecal bacterial microbiota of 8 dairy calves at different ages (2 weeks, 8 weeks, 1 year) to variables like weight gain (average daily gain in kg, ADGKG) and gastrointestinal short chain fatty acids (SCFA).

Files

We will use the following files created using the Microbiota Processing in mothur: Standard Operating Procedure (SOP).

example.final.an.unique_list.0.03.norm.shared (OTU table)
example.final.an.unique_list.0.03.cons.taxonomy (Taxonomy of OTUs)

We will also be using tab-delimited metadata and SCFA files created in Excel. The metadata includes our metadata (like age and ADGKG) as well as alpha-diversity metrics from example.final.an.unique_list.0.03.norm.groups.summary calculated in mothur. The SCFA table is the mM concentrations of different SCFAs in rumen (stomach) liquids from 1-year-old animals.

example.metadata.txt
example.SCFA.txt

Finally, we will be loading a number of custom scripts from Steinberger_scripts and a pre-calculated OTU tree NJ.tree.RData. The information for creating this tree is provided in this tutorial.

Introducción

Escrito para R v3.4.3 en RStudio v1.1.383

Gol

El objetivo de este tutorial es demostrar análisis básicos de datos de microbiota para determinar si las comunidades difieren según las variables de interés y de qué manera. En general, esta canalización se puede usar para cualquier conjunto de datos de microbiota que se haya agrupado en unidades taxonómicas operativas (OTU).

Este tutorial asume algunos conocimientos estadísticos básicos. Por favor, considere si sus datos se ajustan a las suposiciones de cada prueba (normalidad, igual muestreo, etc.). Si no está familiarizado con las estadísticas en este nivel, le recomendamos colaborar con alguien que sí lo esté. ¡El uso incorrecto de las estadísticas es un problema generalizado y serio en las ciencias, así que no se convierta en parte del problema! Dicho esto, este es un tutorial introductorio y hay muchos, muchos análisis adicionales que se pueden hacer con datos de microbiota. ¡Con suerte, esto es solo el comienzo de tus datos!

Datos

Los datos utilizados aquí se crearon usando la secuencia de amplicón de 2x250 pb de la región V4 bacteriana del gen 16S rRNA en la plataforma Illumina MiSeq. El conjunto completo de datos se encuentra en [Dill-McFarland * et al *. Sci Rep 7: 40864] (https://www.ncbi.nlm.nih.gov/pubmed/28098248). Aquí, usaremos un subconjunto de muestras. Específicamente, correlacionaremos la microbiota bacteriana fecal de 8 terneros lecheros a diferentes edades (2 semanas, 8 semanas, 1 año) a variables como ganancia de peso (ganancia diaria promedio en kg, ADGKG) y ácidos grasos de cadena corta gastrointestinales (AGCC) .

Archivos

Usaremos los siguientes archivos creados usando el [Microbiota Processing in mothur, UCR Workshop 2018] (http://rpubs.com/maddieSC/mothur_SOP_UCR_Jan_2018).

example.final.an.unique_list.0.03.norm.shared (tabla OTU)
example.final.an.unique_list.0.03.cons.taxonomy (Taxonomía de OTU)

También utilizaremos metadatos delimitados por tabuladores y archivos SCFA creados en Excel. Los metadatos incluyen nuestros metadatos (como la edad y ADGKG), así como las métricas de diversidad alfa de example.final.an.unique_list.0.03.norm.groups.summary calculadas en mothur. La tabla SCFA es la concentración mM de diferentes AGCC en líquidos del rumen (estómago) de animales de 1 año de edad.

example.metadata.txt
example.SCFA.txt

Finalmente, cargaremos una cantidad de scripts personalizados de Steinberger_scripts y un árbol OTU precalculadoNJ.tree.RData. La información para crear este árbol se proporciona en este tutorial.

Get set up

Download and install

Base R: http://cran.mtu.edu/

NOTE: If you need to update to the most recent version of R on Windows you can do so using the installr package. Instructions here. For OSX and Ubuntu, download from CRAN using the link above.

RStudio: https://www.rstudio.com/products/rstudio/download3/
Packages: Open RStudio on your computer. If you have not already downloaded these packages, go to the lower right quadrant of your screen and open the Package tab. Click “install” and search for the package you want to download.
- ape
- dplyr
- ggplot2
- gplots
- lme4
- miLineage
- phangorn
- plotly
- tidyr
- vegan
- VennDiagram
- DESeq2 (DESeq2 is not on CRAN, so we have to call it manually. See below.)
- phyloseq (phyloseq is not on CRAN, so we have to call it manually. See below.)

Copy and paste the following into your console.

source("https://bioconductor.org/biocLite.R")

## Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help

biocLite("phyloseq")

## BioC_mirror: https://bioconductor.org

## Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).

## Installing package(s) 'phyloseq'

## package 'phyloseq' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Madison\AppData\Local\Temp\Rtmp0UMEsk\downloaded_packages

## installation path not writeable, unable to update packages: MASS, rpart

## Old packages: 'digest', 'GenomicRanges', 'hexbin', 'Hmisc', 'mgcv',
##   'pillar', 'RCurl', 'viridis'

source("https://bioconductor.org/biocLite.R")

## Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help

biocLite("DESeq2")

## BioC_mirror: https://bioconductor.org

## Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).

## Installing package(s) 'DESeq2'

## package 'DESeq2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Madison\AppData\Local\Temp\Rtmp0UMEsk\downloaded_packages

## installation path not writeable, unable to update packages: MASS, rpart

## Old packages: 'digest', 'GenomicRanges', 'hexbin', 'Hmisc', 'mgcv',
##   'pillar', 'RCurl', 'viridis'

Note: If you are having trouble installing packages, turn off your computer’s firewall temporarily.

Configurar

Descargar e instalar

Base R: http://cran.mtu.edu/

** NOTA: ** Si necesita actualizar a la versión más reciente de R en Windows, puede hacerlo usando el paquete installr. Instrucciones [aquí.] (Https://www.r-statistics.com/2013/03/updating-r-f-on-windows-using-the-installr-package/) Para OSX y Ubuntu, descargue de CRAN utilizando el enlace de arriba.

RStudio: https://www.rstudio.com/products/rstudio/download3/
Paquetes: abra RStudio en su computadora. Si aún no ha descargado estos paquetes, vaya al cuadrante inferior derecho de la pantalla y abra la pestaña Paquete. Haga clic en “descargar” y busque el paquete que desea descargar.
- ape
- dplyr
- ggplot2
- gplots
- lme4
- miLineage
- phangorn
- plotly
- tidyr
- vegan
- VennDiagram
- DESeq2 (DESeq2 no está en CRAN, por lo que debemos llamarlo manualmente. Consulte a continuación).
- phyloseq (phyloseq no está en CRAN, por lo que debemos llamarlo manualmente. Consulte a continuación.)

Copie y pegue lo siguiente en su consola.

** Nota **: si tiene problemas para instalar paquetes, apague el firewall de su computadora temporalmente.

Organization

All of our analyses will be organized into a “Project”.

Make a new project by selecting File->New project. Select “New Directory” and “Empty Project”. Name the project “Microbiota_Analysis_UCR” and save the project to your Desktop. Place all of your files for this analysis in the folder created on the Desktop

Create a new R script (File->New file->R script) to save your code. This file will automatically be saved in the project folder.

Now your screen should look like this

Upper left: Where you type and save the code you want to run.
Upper right: Files you load into and create in R. To view one, click on it and it will open in the upper left pane.
Lower left: The console. Where commands and outputs run (similar to the one mothur window).
Lower right: Variable. Explore the different tabs.

Load Packages

The library command tells R to open the package you want to use. You need to do this every time you open R.

#Analyses of Phylogenetics and Evolution package. Required for tree calculations to be used with phyloseq
library(ape)

#This package will help analyze "differential expression" in the microbiota alongside phyloseq
library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following object is masked from 'package:base':
## 
##     apply

#This package will also help us more easily manipulate our data
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:matrixStats':
## 
##     count

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union

## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Graphing package used in phyloseq. To edit the default setting of a plot, you need to use functions in this package.
library(ggplot2)

#This package is used to calculate and plot Venn diagrams as well as heatmaps
library(gplots)

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:IRanges':
## 
##     space

## The following object is masked from 'package:S4Vectors':
## 
##     space

## The following object is masked from 'package:stats':
## 
##     lowess

#Linear mixed-effects models like repeated measures analysis
library(lme4)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:S4Vectors':
## 
##     expand

#Associations of specific lineages to continuous covariates of interest
library(miLineage)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: data.table

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift

## The following object is masked from 'package:GenomicRanges':
## 
##     shift

## The following object is masked from 'package:IRanges':
## 
##     shift

## The following objects are masked from 'package:S4Vectors':
## 
##     first, second

## Loading required package: geepack

#used to read in mothur-formatted files
library(phangorn)

#The phyloseq package seeks to address issues with multiple microbiome analysis packages by providing a set of functions that internally manage the organizing, linking, storing, and analyzing of phylogenetic sequencing data. In general, this package is used for UniFrac analyses.
library(phyloseq)

## 
## Attaching package: 'phyloseq'

## The following object is masked from 'package:SummarizedExperiment':
## 
##     distance

## The following object is masked from 'package:Biobase':
## 
##     sampleNames

## The following object is masked from 'package:GenomicRanges':
## 
##     distance

## The following object is masked from 'package:IRanges':
## 
##     distance

#A package to create interactive web graphics of use in 3D plots
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:IRanges':
## 
##     slice

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

#This package will help us more easily manipulate our data, which are matrices
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:Matrix':
## 
##     expand

## The following object is masked from 'package:S4Vectors':
## 
##     expand

#The vegan package provides tools for descriptive community ecology. It has most basic functions of diversity analysis, community ordination and dissimilarity analysis. In general, this package is used for Bray-Curtis and Jaccard analyses.
library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.4-5

## 
## Attaching package: 'vegan'

## The following objects are masked from 'package:phangorn':
## 
##     diversity, treedist

#Pretty Venn disgrams
library(VennDiagram)

## Loading required package: grid

## Loading required package: futile.logger

## 
## Attaching package: 'VennDiagram'

## The following object is masked from 'package:ape':
## 
##     rotate

Organización

Todos nuestros análisis se organizarán en un “Proyecto”.

Haga un nuevo proyecto seleccionando Archivo-> Nuevo proyecto. Seleccione “Nuevo directorio” y “Proyecto vacío”. Nombre el proyecto “Microbiota_Analysis_UCR” y guarde el proyecto en su escritorio. Coloque todos sus archivos para este análisis en la carpeta creada en el escritorio

Cree un nuevo guión R (Archivo-> Nuevo archivo-> Guión R) para guardar su código. Este archivo se guardará automáticamente en la carpeta del proyecto.

Ahora su pantalla debería verse así

Arriba a la izquierda: donde escribe y guarda el código que desea ejecutar.
Arriba a la derecha: archivos en los que carga y crea en R. Para ver uno, haga clic en él y se abrirá en el panel superior izquierdo.
Más abajo a la izquierda: la consola. Donde se ejecutan comandos y salidas (similar a la ventana de mothur).
Abajo a la derecha: Variable. Explore las diferentes pestañas.

Paquetes

El comando library le dice a R que abra el paquete que desea usar. Tienes que hacer esto cada vez que abras R.

Load Data

In the code, the text before = is what the file will be called in R. Make this short but unique as this is how you will tell R to use this file in later commands.

header: tells R that the first row is column names, not data
row.names: tells R that the first column is row names, not data
sep: tells R that the data are tab-delimited. If you had a comma-delimited file, you would us sep=","

#OTU table (shared file)
OTU = read.table("example.final.an.unique_list.0.03.norm.shared", header=TRUE, sep="\t")

#Taxonomy of each OTU
tax = read.table("example.final.an.unique_list.0.03.cons.taxonomy", header=TRUE, sep="\t")

#Metadata. Since we made this in Excel, not mothur, we can use the "row.names" modifier to automatically name the rows by the values in the first column (sample names)
meta = read.table("example.metadata.txt", header=TRUE, row.names=1, sep="\t")

#SCFA data
SCFA = read.table("example.SCFA.txt", header=TRUE, row.names=1, sep="\t")

Cargar datos

En el código, el texto antes = es a lo que se llamará el archivo en R. Hazlo breve pero único ya que así le dirás a R que use este archivo en los comandos posteriores.

encabezado: le dice a R que la primera fila son nombres de columna, no datos
row.names: le dice a R que la primera columna es nombres de filas, no datos
sep: le dice a R que los datos están delimitados por tabuladores. Si tuviera un archivo delimitado por comas, nos lo haría sep =", "

Data manipulation

Clean up the data

You can look at your data by clicking on it in the upper-right quadrant “Environment”

There are several unneeded columns and incorrect formatting in the tables as they were output by mothur. We will now fix them.

Manipulación de datos

Limpiar los datos

Puede ver sus datos haciendo clic en el cuadrante superior derecho “Entorno”

Hay varias columnas innecesarias y un formato incorrecto en las tablas, ya que fueron generadas por mothur. Ahora los arreglaremos.

OTU table

We need to use the “Group” column as the row names so that it will match our metadata

row.names(OTU) = OTU$Group

We then need to remove the “label”, “numOTUs”, and “Group” columns as they are not OTU counts like the rest of the table

OTU.clean = OTU[,-which(names(OTU) %in% c("label", "numOtus", "Group"))]

Taxonomy table

For the taxonomy table, we name the rows by the OTU #

row.names(tax) = tax$OTU

Remove all the OTUs that don’t occur in our OTU.clean data set

tax.clean = tax[row.names(tax) %in% colnames(OTU.clean),]

We then need to separate the “taxonomy” column so that each level (i.e. Domain, Phylum, etc) is in it’s own column. We do this with a special command “separate” from the tidyr package

tax.clean = separate(tax.clean, Taxonomy, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain"), sep=";")

Finally, we remove the “Size” and “Strain” columns as well as “OTU” since these are now the row names

tax.clean = tax.clean[,-which(names(tax.clean) %in% c("Size", "Strain", "OTU"))]

Metadata and SCFA tables

These tables do not require any modification since I created them in Excel exactly as I need them for this R analysis.

tabla OTU

Necesitamos usar la columna “Group” como los nombres de las filas para que coincida con nuestros metadatos

Entonces necesitamos eliminar las columnas “label”, “numOTUs” y “Group” ya que no son conteos OTU como el resto de la tabla

Tabla de taxonomía

Para la tabla de taxonomía, nombramos las filas por OTU #

Elimine todas las OTU que no aparecen en nuestro conjunto de datos OTU.clean

Luego, debemos separar la columna “taxonomía” para que cada nivel (* i.e. * Dominio, Filo, etc.) esté en su propia columna. Hacemos esto con un comando especial “separate” del paquete tidyr

Finalmente, eliminamos las columnas “Size” y “Strain”, así como “OTU” ya que estos son los nombres de las filas

Metadatos y tablas SCFA

Estas tablas no requieren ninguna modificación ya que las creé en Excel exactamente como las necesito para este análisis R.

Order the data

To make viewing and using the data easier, we will make sure our tables have samples (rows) in the same order. Since OTU.clean, meta, and SCFA have sample names as row names, we order by these.

OTU.clean = OTU.clean[order(row.names(OTU.clean)),]
meta = meta[order(row.names(meta)),]
SCFA = SCFA[order(row.names(SCFA)),]

Our taxonomy table is already in order from OTU1 to OTUN so we do not need to order it.

Ordene los datos

Para facilitar la visualización y el uso de los datos, nos aseguraremos de que nuestras tablas tengan muestras (filas) en el mismo orden. Como OTU.clean, meta y SCFA tienen nombres de muestra como nombres de fila, ordenamos por estos.

Nuestra tabla de taxonomía ya está en orden desde OTU1 a OTUN, por lo que no es necesario solicitarla

Further data manipulation

Several other manipulations may need to be made to data for various visualizations and analyses. It is not possible to go over everything that may be needed, but here are some examples that come to mind.

Más manipulación de datos

Varias otras manipulaciones pueden necesitar hacerse a los datos para varias visualizaciones y análisis. No es posible repasar todo lo que pueda ser necesario, pero aquí hay algunos ejemplos que vienen a la mente.

Subsetting data

It may be desirable to subset data by sample type. If you subset your matrix, you must subset your metadata table. It is CRUCIAL that your samples are in the same order. R is not smart enough to try to match sample names with these commands. We will use the which function in base R to subset data frames. Notice that I have to provide BOTH what rows I want to keep (those which show that AgeGroup is 2 weeks) AND what columns I want to keep (blank to the RIGHT of the comma within the brackets means keep ALL columns).

meta.2wk <- meta[which(meta$AgeGroup=="2w"),]
OTU.clean.2wk <- OTU.clean[which(meta$AgeGroup=="2w"),]

View(meta.2wk)

Or, to subset to a group of samples to the same samples in another file, we can use the value-matching function %in%. The names are a little different between the tables so we also add “.F” to the SCFA names to make them match.

OTU.SCFA = OTU.clean[row.names(OTU.clean) %in% paste(row.names(SCFA), ".F", sep=""),]
meta.SCFA = meta[row.names(meta) %in% paste(row.names(SCFA), ".F", sep=""),]

View(meta.SCFA)

Datos de subconjuntos

Puede ser deseable subconjunto de datos por tipo de muestra. Si subconjunta su matriz, debe subconjuntar su tabla de metadatos. Es ** CRUCIAL ** que sus muestras estén en el mismo orden. R no es lo suficientemente inteligente como para tratar de unir nombres de muestra con estos comandos. Utilizaremos la función which en la base R para subconjuntos de marcos de datos. Tenga en cuenta que tengo que proporcionar AMBAS filas que quiero mantener (aquellas que muestran que AgeGroup es de 2 semanas) Y qué columnas quiero mantener (en blanco a la DERECHA de la coma entre paréntesis significa mantener TODAS las columnas).

O bien, para subconjuntar un grupo de muestras a las mismas muestras en otro archivo, podemos usar la función de ajuste de valor %in%. Los nombres son un poco diferentes entre las tablas, por lo que también agregamos “.F” a los nombres SCFA para que coincidan.

Remove rare taxa

You may also want to remove rare taxa from your OTU table. These may add noise to a large dataset which you may not want. This can be accomplished with which and the apply functions. Nothing to the left of the comman means we want to keep all rows. Then to the right of the comma, we want to only keep rows which meet our conditional expression. For this conditional, we want to apply the max function to columns (2) of the data frame OTU.clean, and return TRUE for those greater than 10. This will remove all OTUs which do not have a count greater than 10 in at least one sample. Other functions, such as average across groups or total abundance in data set, can also be applied.

OTU.clean.abund=OTU.clean[,which(apply(OTU.clean,2,max)>10)]

#number of columns before
ncol(OTU.clean)

## [1] 5002

#number of columns after
ncol(OTU.clean.abund)

## [1] 909

Eliminar taxones raros

También es posible que desee eliminar taxones raros de su tabla OTU. Estos pueden agregar ruido a un gran conjunto de datos que puede que no desee. Esto se puede lograr con las funciones which yapply. Nada a la izquierda del comandante significa que queremos mantener todas las filas. Luego, a la derecha de la coma, solo queremos mantener las filas which cumplen con nuestra expresión condicional. Para este condicional, queremos aplicar la función max a las columnas (2) del marco de datos OTU.clean, y devolver TRUE para aquellos mayores que 10. Esto eliminará todas las OTU which que no tienen un recuento mayor que 10 en al menos una muestra. También se pueden aplicar otras funciones, como el promedio entre los grupos o la abundancia total en el conjunto de datos.

Convert OTU counts to proportions

Lastly, we are going to learn how to convert our OTU table of count data to an OTU table of relative abundance. In short, we want to divide all cells in a row by the total of that row. We can accomplish this using the sweep function. We want to sweep all rows (1) of the OTU.clean data frame, dividing (/) each value by the rowSum.

OTU.clean.relabund <- sweep(OTU.clean,1,rowSums(OTU.clean),"/")

#row sums before
rowSums(OTU.clean)

## 5017.1yr.F  5017.2w.F  5017.8w.F 5020.1yr.F  5020.2w.F  5020.8w.F 
##      17342      17477      17607      17560      17460      17443 
## 5026.1yr.F  5026.2w.F  5026.8w.F 5031.1yr.F  5031.2w.F  5031.8w.F 
##      17583      17515      17483      17565      17518      17405 
## 5037.1yr.F  5037.2w.F  5037.8w.F 5041.1yr.F  5041.2w.F  5041.8w.F 
##      17614      17477      17449      17402      17504      17482 
## 5045.1yr.F  5045.2w.F  5045.8w.F 5053.1yr.F  5053.2w.F  5053.8w.F 
##      17338      17504      17471      17463      17488      17493

#row sums after
rowSums(OTU.clean.relabund)

## 5017.1yr.F  5017.2w.F  5017.8w.F 5020.1yr.F  5020.2w.F  5020.8w.F 
##          1          1          1          1          1          1 
## 5026.1yr.F  5026.2w.F  5026.8w.F 5031.1yr.F  5031.2w.F  5031.8w.F 
##          1          1          1          1          1          1 
## 5037.1yr.F  5037.2w.F  5037.8w.F 5041.1yr.F  5041.2w.F  5041.8w.F 
##          1          1          1          1          1          1 
## 5045.1yr.F  5045.2w.F  5045.8w.F 5053.1yr.F  5053.2w.F  5053.8w.F 
##          1          1          1          1          1          1

Convertir OTU cuenta en proporciones

Por último, vamos a aprender cómo convertir nuestra tabla OTU de datos de conteo a una tabla de abundancia relativa de OTU. En resumen, queremos dividir todas las celdas en una fila por el total de esa fila. Podemos lograr esto usando la función sweep. Queremos barrer todas las filas (1) del marco de datos OTU.clean, dividiendo (/) cada valor por el rowSum.

Set seed

We will be running some processes that rely on the random number generater. To make your analysis reproducible, we set the random seed.

set.seed(8765)

Establecer semilla

Vamos a ejecutar algunos procesos que dependen del generador de números aleatorios. Para que su análisis sea reproducible, establecemos la semilla aleatoria.

Alpha-diversity

What are these metrics?

Alpha diversity describes the diversity in a single sample. These values are generally unitless, so are only useful for comparing between samples or groups of samples. There are two major categories of alpha diversity metrics that you will find.

Richness describes the number of species in an environment. The most basic richness value is “observed species” (literal count of the number of taxa present). This is denoted as “sobs” in mothur output. Other values can be estimated based on the number of rare taxa, and attempt to mathematically estimate how many we may be missing. Chao1 Richness Estimate (“chao” in mothur) is one of these. the ACE Index is another. Which one you use doesn’t particularly matter. Preferences are field- and lab- and researcher-dependent. Chao1 estimates tend to be lower than ACE estimates, but both will be greater than observed species.

Diversity indices consider both sample richness and sample evenness. If you have two samples with 10 sequences and 2 species each, they will have the same richness. However, if sample 1 has five of species A and five of species B, while sample 2 has nine of species A and one of species B, then sample 1 is more even, and this will be reflected in a higher value for diversity. Shannon’s Diversity Index/Shannon-Wiener Index/Shannon entropy (shannon in mothur) and the Simpson Index/Simpson Dominance Index (simpson in mothur) are two of the major estimators of sample diversity. Shannon is always a positive value running from 0 to log(# of species). Higher values for Shannon indicate greater diversity. Simpson runs from 0 to 1, and is weird in that LOWER numbers indicate greater diversity. For this reason, the inverse of the Simpson index is often used. Another option is phylogenetic diversity, which takes into account the sum of all branch lengths for the phylogenetic tree of OTUs contained in that sample. We are not going to use any phylogenetic diversity metrics today, but if you want to calculate these, you can do so using the picante package, here and here.

This image illustrates richness vs. diversity. Both forests have the same richness (4 tree species) but Community 1 has much more even distribution of the 4 species while Community 2 is dominated by tree species A. This makes Community 1 more diverse than Community 2.

Alpha diversity

¿Qué son estas métricas?

La diversidad alfa describe la diversidad en una sola muestra. Estos valores generalmente no tienen unidades, por lo que solo son útiles para comparar muestras o grupos de muestras. Hay dos categorías principales de métricas de diversidad alfa que encontrará.

** Riqueza ** describe la cantidad de especies en un entorno. El valor de riqueza más básico es “especies observadas” (conteo literal de la cantidad de taxones presentes). Esto se denota como “sollozos” en la salida de mothur. Se pueden estimar otros valores basados en la cantidad de taxones raros e intentar estimar matemáticamente cuántos nos podemos perder. El estimado de riqueza de Chao1 (“chao” en mothur) es uno de estos. el índice ACE es otro. Cuál de ellas uses no importa en particular. Las preferencias dependen del campo, del laboratorio y del investigador. Las estimaciones de Chao1 tienden a ser más bajas que las estimaciones de ACE, pero ambas serán mayores que las especies observadas.

** Los índices de diversidad ** consideran tanto la riqueza de la muestra como la uniformidad de la muestra. Si tienes dos muestras con 10 secuencias y 2 especies cada una, tendrán la misma riqueza. Sin embargo, si la muestra 1 tiene cinco de la especie A y cinco de la especie B, mientras que la muestra 2 tiene nueve de la especie A y una de la especie B, la muestra 1 es más pareja, y esto se reflejará en un mayor valor para la diversidad. El índice de diversidad de Shannon / índice de Shannon-Wiener / entropía de Shannon (shannon in mothur) y el índice Simpson / índice de dominancia de Simpson (simpson in mothur) son dos de los principales estimadores de la diversidad de muestras. Shannon siempre tiene un valor positivo que va de 0 a log (# de especies). Los valores más altos para Shannon indican una mayor diversidad. Simpson corre de 0 a 1, y es raro porque los números INFERIORES indican una mayor diversidad. Por esta razón, a menudo se usa el inverso del índice de Simpson. Otra opción es la diversidad filogenética, que tiene en cuenta la suma de todas las longitudes de ramificación para el árbol filogenético de las OTU contenidas en esa muestra. No vamos a utilizar ninguna métrica de diversidad filogenética hoy, pero si desea calcular estos, puede hacerlo utilizando el paquete picante, [aquí] (https://cran.r-project.org/web/packages /picante/picante.pdf) y [aquí.] (https://daijiang.name/en/2014/05/04/notes-func-phylo-book-1/)

Esta imagen ilustra riqueza vs. diversidad. Ambos bosques tienen la misma riqueza (4 especies de árboles) pero la Comunidad 1 tiene una distribución mucho más uniforme de las 4 especies, mientras que la Comunidad 2 está dominada por la especie arbórea A. Esto hace que la Comunidad 1 sea más diversa que la Comunidad 2.

Explore alpha metrics

NOTE: We will be using alpha diversity metrics which we calculated in mothur, as described in the previous workshop. However, it is possible to calculate these metrics in R using the phyloseq or vegan packages as well. Here is how that can be done. We will be talking more about use of phyloseq later, but here is a block of code demonstrating how to make a phyloseq object and generate alpha diversity statistics in R with phyloseq, followed by vegan.

Explore las métricas alfa

** NOTA: ** Usaremos métricas de diversidad alfa que calculamos en mothur, como se describió en el taller anterior. Sin embargo, es posible calcular estas métricas en R usando los paquetes phyloseq ovegan también. Aquí es cómo se puede hacer eso. Hablaremos más sobre el uso de phyloseq más tarde, pero aquí hay un bloque de código que demuestra cómo hacer un objetophyloseq y generar estadísticas de diversidad alfa en R con phyloseq, seguido devegan.

`phyloseq` calculations

#create a phyloseq object
OTU.physeq = otu_table(as.matrix(OTU.clean), taxa_are_rows=FALSE)
tax.physeq = tax_table(as.matrix(tax.clean))
meta.physeq = sample_data(meta)

physeq.alpha = phyloseq(OTU.physeq, tax.physeq, meta.physeq)

#generate alpha diversity column
sample_data(physeq.alpha)$shannon.physeq <- estimate_richness(physeq.alpha, measures="Shannon")

phyloseq also allows you to easily plot alpha diversity, both by sample and by group.

plot_richness(physeq.alpha, measures="Shannon")

plot_richness(physeq.alpha, "AgeGroup", measures="Shannon")

Making pretty plots is also possible, but we will get into that later when we get into phyloseq a bit more towards the end of today.

`vegan` calculations

meta$shannon.vegan <- diversity(OTU.clean, index="shannon")

Compare methods

Now, lets plot these three versions of shannon vs. sample index

#view and compare to mothur calculated Shannon Index
par(mfrow=c(3,1))
plot(x=c(1:nsamples(physeq.alpha)), y=meta$shannon, main="Shannon diversity (mothur)", xlab="")
plot(x=c(1:nsamples(physeq.alpha)), y=sample_data(physeq.alpha)$shannon.physeq$Shannon, main="Shannon diversity (phyloseq)", xlab="")
plot(x=c(1:nsamples(physeq.alpha)), y=meta$shannon.vegan, main="Shannon diversity (vegan)", xlab="")

Suffice to say, they are very similar.

phyloseq también te permite trazar fácilmente la diversidad alfa, tanto por muestra como por grupo.

Hacer tramas bonitas también es posible, pero entraremos en eso más adelante cuando entremos phyloseq un poco más hacia el final de hoy.

Ahora, vamos a trazar estas tres versiones del índice shannon vs. sample

Basta decir que son muy similares.

Alpha statistics

Now we will start to look at our data generated in mothur. We will first start with alpha-diversity and richness. Let’s plot some common ones here.

#Create 2x2 plot environment so that we can see all 4 metrics at once. 
par(mfrow = c(2, 2))

#Then plot each metric.
hist(meta$shannon, main="Shannon diversity", xlab="", breaks=10)
hist(meta$simpson, main="Simpson diversity", xlab="", breaks=10)
hist(meta$chao, main="Chao richness", xlab="", breaks=15)
hist(meta$ace, main="ACE richness", xlab="", breaks=15)

You want the data to be roughly normal so that you can run ANOVA or t-tests. If it is not normally distributed, you will need to consider non-parametric tests such as Kruskal-Wallis.

Here, we see that none of the data are normally distributed. This occurs with the subset but not the full data set because I’ve specifically selected samples with divergent alpha metrics. In general, you will see roughly normal data for Shannon’s diversity as well as most richness metrics. Simpson’s diversity, on the other hand, is usually skewed as seen here.

So most will use inverse Simpson (1/Simpson) instead. This not only increases normalcy but also makes the output more logical as a higher inverse Simpson value corresponds to higher diversity.

Let’s look at inverse Simpson instead.

#Create 2x2 plot environment 
par(mfrow = c(2, 2))

#Plots
hist(meta$shannon, main="Shannon diversity", xlab="", breaks=10)
hist(1/meta$simpson, main="Inverse Simpson diversity", xlab="", breaks=10)
hist(meta$chao, main="Chao richness", xlab="", breaks=15)
hist(meta$ace, main="ACE richness", xlab="", breaks=15)

Now we see a bimodal distribution for Simpson similar to the richness metrics.

To test for normalcy statistically, we can run the Shapiro-Wilk test of normality.

shapiro.test(meta$shannon)

## 
##  Shapiro-Wilk normality test
## 
## data:  meta$shannon
## W = 0.91511, p-value = 0.0456

shapiro.test(1/meta$simpson)

## 
##  Shapiro-Wilk normality test
## 
## data:  1/meta$simpson
## W = 0.74821, p-value = 4.69e-05

shapiro.test(meta$chao)

## 
##  Shapiro-Wilk normality test
## 
## data:  meta$chao
## W = 0.80636, p-value = 0.0003749

shapiro.test(meta$ace)

## 
##  Shapiro-Wilk normality test
## 
## data:  meta$ace
## W = 0.83017, p-value = 0.0009573

We see that, as expected from the graphs, none are normal.

However, our sample size is small and normalcy tests are very sensitive for small data-sets. In fact, you can run Shapiro-Wilk on a list of 50 values randomly sampled from the R-generated normal distribution and find that they are not normal (even though we know that they are!)

So, what does this mean for our purposes? Well, we should run statistical tests that don’t assume our data is normal, because we don’t have any evidence (graphs, Shapiro-Wilk) that it is normal. Theoretically. In reality, if your data is roughly hill-shaped, not heavily skewed, and not bimodal, most tests are robust enough to handle it. But, when in doubt, use a test that does not assume normality (nonparametric). None of these metrics are normal-looking enough that I would use a parametric test. However, we are going to show both parametric and nonparametric examples for demonstration purposes.

Overall, for alpha-diversity:

ANOVA, t-test, or general linear models with the normal distribution are used when the data is roughly normal
Kruskal-Wallis, Wilcoxon rank sum test, or general linear models with another distribution are used when the data is not normal

Our main variables of interest are

Categorical: AgeGroup: 2w, 8w, 1yr
Continuous: ADGKG: 0.05-1.56 kg gained per day (average daily gain kg)

Estadísticas Alfa

Ahora comenzaremos a ver nuestros datos generados en mothur. Primero comenzaremos con alfadiversidad y riqueza. Vamos a trazar algunos comunes aquí.

Desea que los datos sean más o menos normales para que pueda ejecutar ANOVA o pruebas t. Si no se distribuye normalmente, tendrá que considerar pruebas no paramétricas como Kruskal-Wallis.

Aquí, vemos que ninguno de los datos se distribuye normalmente. Esto ocurre con el subconjunto pero no con el conjunto de datos completo porque he seleccionado muestras específicamente con métricas alfa divergentes. En general, verá datos prácticamente normales para la diversidad de Shannon, así como para la mayoría de las métricas de riqueza. La diversidad de Simpson, por otro lado, suele estar sesgada, como se ve aquí.

Entonces la mayoría usará Simpson inversa (1 / Simpson) en su lugar. Esto no solo aumenta la normalidad sino que también hace que el resultado sea más lógico ya que un valor de Simpson inverso más alto corresponde a una mayor diversidad.

Miremos a Simpson inverso en su lugar.

Ahora vemos una distribución bimodal para Simpson similar a las métricas de riqueza.

Para evaluar estadísticamente la normalidad, podemos ejecutar la prueba de normalidad de Shapiro-Wilk.

Vemos que, como se espera de los gráficos, ninguno es normal.

Sin embargo, nuestro tamaño de muestra es pequeño y las pruebas de normalidad son muy sensibles para pequeños conjuntos de datos. De hecho, puede ejecutar Shapiro-Wilk en una lista de 50 valores aleatoriamente muestreados de la distribución normal generada por R y descubrir que no son normales (¡aunque sabemos que sí lo están!)

Entonces, ¿qué significa esto para nuestros propósitos? Bueno, deberíamos ejecutar pruebas estadísticas que no supongan que nuestros datos son normales, porque no tenemos ninguna evidencia (gráficos, Shapiro-Wilk) de que sea normal. Teóricamente En realidad, si sus datos son aproximadamente en forma de colina, no demasiado sesgados y no bimodales, la mayoría de las pruebas son lo suficientemente robustas como para manejarlo. Pero, en caso de duda, use una prueba que no asuma la normalidad (no paramétrica). Ninguna de estas métricas es lo suficientemente normal como para usar una prueba paramétrica. Sin embargo, vamos a mostrar ejemplos paramétricos y no paramétricos para fines de demostración.

En general, para la diversidad alfa:

ANOVA, t-test o modelos lineales generales con la distribución normal se utilizan cuando los datos son aproximadamente normales
Kruskal-Wallis, prueba de suma de rangos de Wilcoxon, o modelos lineales generales con otra distribución se usan cuando los datos no son normales

Nuestras principales variables de interés son

Categórico: AgeGroup: 2w, 8w, 1yr
Continuo: ADGKG: 0.05-1.56 kg ganado por día (ganancia diaria media kg)

Categorical variables

Now that we know which tests can be used, let’s run them.

Normally distributed metrics

Since it’s the closest to normalcy, we will use Shannon’s diversity as an example. First, we will test age, which is a categorical variable with more than 2 levels. Thus, we run ANOVA. If age were only two levels, we could run a t-test

Does age impact the Shannon diversity of the fecal microbiota?

#Run the ANOVA and save it as an object
aov.shannon.age = aov(shannon ~ AgeGroup, data=meta)
#Call for the summary of that ANOVA, which will include P-values
summary(aov.shannon.age)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## AgeGroup     2  42.98  21.489   103.4 1.35e-11 ***
## Residuals   21   4.36   0.208                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To do all the pairwise comparisons between groups and correct for multiple comparisons, we run Tukey’s honest significance test of our ANOVA.

TukeyHSD(aov.shannon.age)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = shannon ~ AgeGroup, data = meta)
## 
## $AgeGroup
##             diff        lwr       upr   p adj
## 2w-1yr -3.270063 -3.8446230 -2.695503 0.0e+00
## 8w-1yr -1.830903 -2.4054628 -1.256342 2.0e-07
## 8w-2w   1.439160  0.8646001  2.013720 8.5e-06

We clearly see that all age groups have significantly different diversity. When we plot the data, we see that diversity increases as the animals age.

#Re-order the groups because the default is 1yr-2w-8w
meta$AgeGroup.ord = factor(meta$AgeGroup, c("2w","8w","1yr"))
#Return the plot area to 1x1
par(mfrow = c(1, 1))
#Plot
boxplot(shannon ~ AgeGroup.ord, data=meta, ylab="Shannon's diversity")

Non-normally distributed metrics

We will use Chao’s richness estimate here. Since age is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA). If we have only two levels, we would run Wilcoxon rank sum test (non-parametric equivalent of t-test)

kruskal.test(chao ~ AgeGroup, data=meta)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  chao by AgeGroup
## Kruskal-Wallis chi-squared = 19.28, df = 2, p-value = 6.507e-05

We can test pairwise within the age groups with Wilcoxon Rank Sum Tests. This test has a slightly different syntax than our other tests

pairwise.wilcox.test(meta$chao, meta$AgeGroup, p.adjust.method="fdr")

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  meta$chao and meta$AgeGroup 
## 
##    1yr     2w     
## 2w 0.00023 -      
## 8w 0.00023 0.00186
## 
## P value adjustment method: fdr

Like diversity, we see that richness also increases with age.

#Create 1x1 plot environment
par(mfrow = c(1, 1))
#Plot
boxplot(chao ~ AgeGroup.ord, data=meta, ylab="Chao richness")

Variables categóricas

Ahora que sabemos qué pruebas se pueden usar, vamos a ejecutarlas.

** Métrica distribuida normalmente **

Como es lo más cercano a la normalidad, utilizaremos ** la diversidad de Shannon ** como ejemplo. Primero, probaremos la edad, que es una variable categórica con más de 2 niveles. Por lo tanto, ejecutamos ANOVA. Si la edad fuera solo dos niveles, podríamos ejecutar una prueba t

¿La edad afecta la diversidad de Shannon de la microbiota fecal?

Para hacer todas las comparaciones por parejas entre grupos y corregir comparaciones múltiples, ejecutamos la prueba de significación honesta de Tukey de nuestro ANOVA.

Vemos claramente que todos los grupos de edad tienen una diversidad significativamente diferente. Cuando trazamos los datos, vemos que la diversidad aumenta a medida que los animales envejecen.

** Métrica no distribuida normalmente **

Utilizaremos ** la estimación de riqueza de Chao ** aquí. Como la edad es categórica, utilizamos Kruskal-Wallis (equivalente no paramétrico de ANOVA). Si solo tenemos dos niveles, correríamos la prueba de suma de rangos de Wilcoxon (equivalente no paramétrico de la prueba t)

Podemos probar en parejas dentro de los grupos de edad con las pruebas de suma de rangos de Wilcoxon. Esta prueba tiene una sintaxis ligeramente diferente a nuestras otras pruebas

Al igual que la diversidad, vemos que la riqueza también aumenta con la edad.

Continuous variables

For continuous variables, we use general linear models, specifying the distribution that best fits our data.

Normally distributed metrics

Since ADG is a continuous variable, we run a general linear model. We will again use Shannon’s diversity as our roughly normal metric. The default of glm and lm is the normal distribution so we don’t have to specify anything.

Does ADG impact the Shannon diversity of the fecal microbiota?

glm.shannon.ADG = glm(shannon ~ ADGKG, data=meta)
summary(glm.shannon.ADG)

## 
## Call:
## glm(formula = shannon ~ ADGKG, data = meta)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.49110  -1.11216  -0.01749   1.53658   1.84728  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.62565    1.01390   3.576  0.00169 **
## ADGKG       -0.03407    0.97805  -0.035  0.97253   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.151815)
## 
##     Null deviance: 47.343  on 23  degrees of freedom
## Residual deviance: 47.340  on 22  degrees of freedom
## AIC: 90.412
## 
## Number of Fisher Scoring iterations: 2

The output let’s us know that the intercept of our model is significantly different from 0 but our slope (e.g. our variable of interest) is not. This makes sense when we look at the data.

plot(shannon ~ ADGKG, data=meta)
#Add the glm best fit line
abline(glm.shannon.ADG)

Non-normally distributed metrics

We will again use a general linear model for our non-normally distributed metric Chao. However, this time, we change the distribution from normal to something that fits the data better.

But which distribution should we choose? In statistics, there is no one “best” model. There are only good and better models. We will use the plot function to compare two models and pick the better one.

First, the Gaussian (normal) distribution, which we already know is a bad fit.

gaussian.chao.ADG = glm(chao ~ ADGKG, data=meta, family="gaussian")
par(mfrow = c(1,2))
plot(gaussian.chao.ADG, which=c(1,2))

Quasipoisson (log) distribution

qp.chao.ADG = glm(chao ~ ADGKG, data=meta, family="quasipoisson")
par(mfrow = c(1,2))
plot(qp.chao.ADG, which=c(1,2))

What we’re looking for is no pattern in the Residuals vs. Fitted graph (“stars in the sky”), which shows that we picked a good distribution family to fit our data. We also want our residuals to be normally distributed, which is shown by most/all of the points falling on the line in the Normal Q-Q plot.

While it’s still not perfect, the quasipoisson fits much better with residuals on the order of 30 whereas gaussian was on the order of 600. So, we will use quasipoisson and see that ADG does not to correlate to Chao richness.

summary(qp.chao.ADG)

## 
## Call:
## glm(formula = chao ~ ADGKG, family = "quasipoisson", data = meta)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -24.36  -17.05  -10.66   18.81   26.91  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.4528     0.5561  11.605 7.54e-11 ***
## ADGKG        -0.1859     0.5438  -0.342    0.736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 374.2485)
## 
##     Null deviance: 8117.2  on 23  degrees of freedom
## Residual deviance: 8074.4  on 22  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Plotting this we see that, indeed, there is not signficant correlation between Chao and ADG.

#Return the plot area to 1x1
par(mfrow = c(1, 1))
#Plot
plot(log(chao) ~ ADGKG, data=meta, ylab="ln(Chao's richness)")
abline(qp.chao.ADG)

Variables continuas

Para las variables continuas, usamos modelos lineales generales, especificando la distribución que mejor se ajusta a nuestros datos.

** Métrica distribuida normalmente **

Como ADG es una variable continua, ejecutamos un modelo lineal general. Nuevamente usaremos la diversidad de Shannon como nuestra métrica más o menos normal. El valor predeterminado de glm ylm es la distribución normal, por lo que no es necesario que especifique nada.

¿El ADG impacta la diversidad de Shannon de la microbiota fecal?

La salida nos permite saber que la intersección de nuestro modelo es significativamente diferente de 0, pero nuestra pendiente (* por ejemplo * nuestra variable de interés) no lo es. Esto tiene sentido cuando miramos los datos.

Nuevamente usaremos un modelo lineal general para nuestra métrica Chao no distribuida normalmente. Sin embargo, esta vez, cambiamos la distribución de normal a algo que se ajusta mejor a los datos.

Pero, ¿qué distribución deberíamos elegir? En estadística, no hay un “mejor” modelo. Solo hay modelos buenos y mejores. Utilizaremos la función plot para comparar dos modelos y elegir el mejor.

En primer lugar, la distribución Gaussiana (normal), que ya sabemos que es un mal ajuste.

Distribución de Quasipoisson (log)

Lo que estamos buscando no es un patrón en el gráfico Residuals vs. Fitted (“estrellas en el cielo”), que muestra que elegimos una buena familia de distribución para que se ajuste a nuestros datos. También queremos que nuestros residuos se distribuyan normalmente, lo que se muestra por la mayoría / todos los puntos que caen en la línea en el gráfico Q-Q Normal.

Si bien aún no es perfecto, el quasipoisson encaja mucho mejor con residuos del orden de 30, mientras que gaussiano era del orden de 600. Por lo tanto, utilizaremos quasipoisson y veremos que ADG no se correlaciona con la riqueza de Chao.

Trazando esto vemos que, de hecho, no hay una correlación significativa entre Chao y ADG.

Mixed models

Our two variables may not be fully independent and therefore, running them in two separate tests may not be correct. That is to say, age may impact ADG. In fact, I know this is the case because calves (2w, 8w) gain weight more quickly than heifers (1yr).

Think about your variables and what they mean “in the real world.” Logically combine them into as few ANOVA tests as possible. In the end, it’s better to test a meaningless interaction than not test a meaningful one.

We can test if the interaction of age and ADG impacts diversity with a model that includes both of our variables. The * symbol is a shortcut for models. A*B is equivalent to A + B + A:B

aov.shannon.all = aov(shannon ~ AgeGroup*ADGKG, data=meta)
summary(aov.shannon.all)

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## AgeGroup        2  42.98  21.489  95.472 2.61e-10 ***
## ADGKG           1   0.05   0.054   0.239    0.631    
## AgeGroup:ADGKG  2   0.26   0.130   0.576    0.572    
## Residuals      18   4.05   0.225                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the interaction of age and ADG doesn’t significantly impact Shannon diversity, So we should remove that variable to simplify our model. If you had many interaction terms, you would step-wise remove the one with the highest P-value until you had the simplest model with only individual variables and significant interaction terms.

aov.shannon.all2 = aov(shannon ~ AgeGroup+ADGKG, data=meta)
summary(aov.shannon.all2)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## AgeGroup     2  42.98  21.489   99.70 3.96e-11 ***
## ADGKG        1   0.05   0.054    0.25    0.623    
## Residuals   20   4.31   0.216                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overall, the ANOVA test tells us that only age impacts Shannon diversity but it does not tell us which age groups differ from one another. If all of our variables were categorical, we could run TukeyHSD like we did with age only.

TukeyHSD(aov.shannon.all)

## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## ADGKG

## Warning in replications(paste("~", xx), data = mf): non-factors ignored:
## AgeGroup, ADGKG

## Warning in TukeyHSD.aov(aov.shannon.all): 'which' specified some non-
## factors which will be dropped

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = shannon ~ AgeGroup * ADGKG, data = meta)
## 
## $AgeGroup
##             diff       lwr       upr    p adj
## 2w-1yr -3.270063 -3.875469 -2.664657 0.00e+00
## 8w-1yr -1.830903 -2.436309 -1.225496 1.20e-06
## 8w-2w   1.439160  0.833754  2.044567 2.81e-05

However, you will see that we don’t get any data from ADG since it is continuous. There is an error denoting this as “non-factors ignored: ADGKG”

So, we should have run our test as a glm since we have at least one continuous variable. First, we will still include the interaction variable to see that type of output.

glm.shannon.all = glm(shannon ~ AgeGroup*ADGKG, data=meta)
summary(glm.shannon.all)

## 
## Call:
## glm(formula = shannon ~ AgeGroup * ADGKG, data = meta)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0301  -0.2468   0.0894   0.1572   0.7624  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        5.7123     2.5928   2.203   0.0409 *
## AgeGroup2w        -3.3969     2.6197  -1.297   0.2111  
## AgeGroup8w        -2.9610     2.7554  -1.075   0.2967  
## ADGKG             -0.4481     2.7599  -0.162   0.8728  
## AgeGroup2w:ADGKG   0.1228     2.7848   0.044   0.9653  
## AgeGroup8w:ADGKG   1.0750     2.8763   0.374   0.7130  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.22508)
## 
##     Null deviance: 47.3425  on 23  degrees of freedom
## Residual deviance:  4.0514  on 18  degrees of freedom
## AIC: 39.413
## 
## Number of Fisher Scoring iterations: 2

Now this output is saying the same thing as ANOVA but in a more complicated way. The function automatically picks a reference group for categorical variables (in this case, 1yr) to compare all other groups to. Let’s go through each line

(Intercept) - This is whether or not the y-intercept is 0. A significant P-value indicates that the intercept is not 0, and we wouldn’t expect it to be for any alpha-diversity metric since 0 means nothing is there
AgeGroup2w - the difference between Shannon when Age = 2w vs. 1yr (the same as testing “shannon ~ AgeGroup” and only looking at the 2w-1yr pairwise comparison)
AgeGroup8w - the same as 2w but now looking at only the 8w-1yr comparison
ADGKG - the slope of Shannon to ADGKG (the same as testing “shannon ~ ADGKG”)
AgeGroup2w:ADGKG - the difference in slope of shannon ~ ADG between ages 2w and 1yr
AgeGroup8w:ADGKG - the difference in slope of shannon ~ ADG between ages 8w and 1yr

As we saw in ANOVA, none of the interaction terms are significant so we remove them.

glm.shannon.all2 = glm(shannon ~ AgeGroup+ADGKG, data=meta)
summary(glm.shannon.all2)

## 
## Call:
## glm(formula = shannon ~ AgeGroup + ADGKG, data = meta)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.95299  -0.25858   0.07643   0.30409   0.74487  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4459     0.3487  15.619 1.14e-12 ***
## AgeGroup2w   -3.2760     0.2324 -14.094 7.55e-12 ***
## AgeGroup8w   -1.7989     0.2408  -7.471 3.30e-07 ***
## ADGKG        -0.1639     0.3281  -0.500    0.623    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2155447)
## 
##     Null deviance: 47.3425  on 23  degrees of freedom
## Residual deviance:  4.3109  on 20  degrees of freedom
## AIC: 36.903
## 
## Number of Fisher Scoring iterations: 2

Note: The full glm model with the interaction term included did not show age as significant. When we remove the interaction term, age is significant. This is why you should remove non-significant interactions terms as they can the mask main effects of individual variables.

We can run a similar test with non-normal data like Chao.

qp.chao.all = glm(chao ~ AgeGroup*ADGKG, data=meta, family="quasipoisson")
summary(qp.chao.all)

## 
## Call:
## glm(formula = chao ~ AgeGroup * ADGKG, family = "quasipoisson", 
##     data = meta)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -7.774  -3.430  -0.140   3.692   5.277  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.99825    0.71122   9.840 1.14e-08 ***
## AgeGroup2w       -1.61539    0.75272  -2.146   0.0458 *  
## AgeGroup8w       -2.24498    0.86846  -2.585   0.0187 *  
## ADGKG             0.01751    0.75699   0.023   0.9818    
## AgeGroup2w:ADGKG -0.42295    0.80094  -0.528   0.6039    
## AgeGroup8w:ADGKG  0.86269    0.86550   0.997   0.3321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 18.86331)
## 
##     Null deviance: 8117.2  on 23  degrees of freedom
## Residual deviance:  348.5  on 18  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Remove the non-significant interaction.

qp.chao.all2 = glm(chao ~ AgeGroup+ADGKG, data=meta, family="quasipoisson")
summary(qp.chao.all2)

## 
## Call:
## glm(formula = chao ~ AgeGroup + ADGKG, family = "quasipoisson", 
##     data = meta)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -7.783  -3.452  -1.378   3.744   8.184  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.03944    0.23567  29.870  < 2e-16 ***
## AgeGroup2w  -1.98090    0.14862 -13.329 2.08e-11 ***
## AgeGroup8w  -1.24286    0.11926 -10.422 1.57e-09 ***
## ADGKG       -0.02643    0.24530  -0.108    0.915    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 23.74583)
## 
##     Null deviance: 8117.20  on 23  degrees of freedom
## Residual deviance:  476.31  on 20  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Modelos mixtos

Nuestras dos variables pueden no ser completamente independientes y, por lo tanto, ejecutarlas en dos pruebas separadas puede no ser correcto. Es decir, la edad puede afectar a ADG. De hecho, sé que este es el caso porque los terneros (2w, 8w) ganan peso más rápido que las vaquillas (1 año).

Piensa en tus variables y en lo que significan “en el mundo real”. Lógicamente, combínelos en el menor número de pruebas de ANOVA que sea posible. Al final, es mejor probar una interacción sin sentido que no probar una significativa.

Podemos probar si la interacción de edad y ADG impacta la diversidad con un modelo que incluye nuestras dos variables. El símbolo * es un atajo para los modelos. A * B es equivalente a A + B + A: B

Podemos ver que la interacción de edad y ADG no afecta significativamente la diversidad de Shannon, por lo que deberíamos eliminar esa variable para simplificar nuestro modelo. Si tuviera muchos términos de interacción, eliminaría paso a paso el que tuviera el mayor valor P hasta que tuviera el modelo más simple con solo variables individuales y términos de interacción significativos.

En general, la prueba ANOVA nos dice que solo la edad afecta la diversidad de Shannon, pero no nos dice qué grupos de edad difieren entre sí. Si todas nuestras variables fueran categóricas, podríamos ejecutar TukeyHSD como lo hicimos solo con la edad.

Sin embargo, verá que no obtenemos datos de ADG ya que es continuo. Hay un error que denota esto como “factores no ignorados: ADGKG”

Entonces, deberíamos haber ejecutado nuestra prueba como un glm ya que tenemos al menos una variable continua. Primero, seguiremos incluyendo la variable de interacción para ver ese tipo de salida.

Ahora esta salida dice lo mismo que ANOVA pero de una manera más complicada. La función selecciona automáticamente un grupo de referencia para las variables categóricas (en este caso, 1 año) para comparar todos los otros grupos. Repasemos cada línea

(Interceptar) - Esto es si la intersección y es 0. Un valor P significativo indica que la intersección no es 0, y no esperaríamos que sea para cualquier métrica de diversidad alfa ya que 0 significa que nada es ahí
AgeGroup2w: la diferencia entre Shannon cuando Age = 2w contra 1yr (lo mismo que probar “shannon ~ AgeGroup” y solo mirando la comparación pairw 2w-1yr)
AgeGroup8w: lo mismo que 2w, pero ahora solo se observa la comparación de 8w-1yr
ADGKG - la pendiente de Shannon a ADGKG (lo mismo que probar “shannon ~ ADGKG”)
AgeGroup2w: ADGKG - la diferencia en la pendiente de shannon ~ ADG entre edades de 2w y 1yr
AgeGroup8w: ADGKG - la diferencia en la pendiente de shannon ~ ADG entre edades 8w y 1yr

Como vimos en ANOVA, ninguno de los términos de interacción es significativo, así que los eliminamos.

** Nota **: el modelo completo de glm con el término de interacción incluido no mostró la edad como significativa. Cuando eliminamos el término de interacción, la edad es significativa. Esta es la razón por la que debe eliminar los términos de interacciones no significativas, ya que pueden ocultar los efectos principales de las variables individuales.

Podemos ejecutar una prueba similar con datos no normales como Chao.

Eliminar la interacción no significativa.

Repeated measure

Another thing to consider with this data is the fact that we sampled the same animals over time. So, we have a repeated measures design. There are a number of ways to do repeated measures in R. I personally like the lme4 package used here.

We add the repeated measure component by adding a random effect for the individual animals with (1|Animal) in the lmer function.

rm.shannon.all = lmer(shannon ~ AgeGroup+ADGKG + (1|Animal), data=meta)
summary(rm.shannon.all)

## Linear mixed model fit by REML ['lmerMod']
## Formula: shannon ~ AgeGroup + ADGKG + (1 | Animal)
##    Data: meta
## 
## REML criterion at convergence: 32.4
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.83117 -0.45932  0.09539  0.49972  1.53368 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Animal   (Intercept) 0.03793  0.1948  
##  Residual             0.17819  0.4221  
## Number of obs: 24, groups:  Animal, 8
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   5.3906     0.3520  15.313
## AgeGroup2w   -3.2739     0.2114 -15.486
## AgeGroup8w   -1.8104     0.2208  -8.201
## ADGKG        -0.1049     0.3321  -0.316
## 
## Correlation of Fixed Effects:
##            (Intr) AgGrp2 AgGrp8
## AgeGroup2w -0.350              
## AgeGroup8w -0.027  0.461       
## ADGKG      -0.884  0.057 -0.293

We see that very little of the variance in the data is explained by the animal random effects (0.03793). So we actually don’t need to include repeated measures in our final model, but it was necessary to check!

From all of this, we can conclude that the fecal microbiota increases in diversity and richness as dairy cows age. Animal growth as measured by ADG does not correlate with fecal community diversity or richness.

Medida repetida

Otra cosa a considerar con estos datos es el hecho de que probamos los mismos animales con el tiempo. Entonces, tenemos un diseño de medidas repetidas. Hay varias maneras de hacer medidas repetidas en R. Personalmente me gusta el paquete lme4 utilizado aquí.

Agregamos el componente de medida repetida agregando un efecto aleatorio para los animales individuales con (1 | Animal) en la función lmer.

Vemos que muy poca de la varianza en los datos se explica por los efectos aleatorios animales (0.03793). Entonces, realmente no necesitamos incluir medidas repetidas en nuestro modelo final, ¡pero era necesario verificarlo!

** De todo esto, podemos concluir que la microbiota fecal aumenta en diversidad y riqueza a medida que las vacas lecheras envejecen. El crecimiento de los animales medido por ADG no se correlaciona con la diversidad o riqueza de la comunidad fecal. **

Beta diversity

Beta-diversity is between sample diversity. It is how different every sample is from every other sample. Thus, each sample has more than one value. Some metrics take abundance into account (i.e. diversity: Bray-Curtis, weighted UniFrac) and some only calculate based on presence-absence (i.e. richness: Jaccard, unweighted UniFrac). The UniFrac metrics are unique in that they take into account phylogenetic information. So if two samples have a completely different set of OTUs, but they are closely phylogenetically related, the distance between these samples in UniFrac space would be lower than if the two sets of OTUs were distantly related.

Beta-diversity appears like the following (completely made-up numbers)

.	sample1	sample2	sample3	…
sample1	0	0.345	0.194	…
sample2	0.345	0	0.987	…
sample3	0.194	0.987	0	…
…	…	…	…	…

Beta diversity

La beta-diversidad se encuentra entre la diversidad de muestras. Es cuán diferente es cada muestra de cada otra muestra. Por lo tanto, cada muestra tiene más de un valor. Algunas métricas tienen en cuenta la abundancia (* i.e. * diversity: Bray-Curtis, UniFrac ponderado) y algunas solo calculan en función de la presencia-ausencia (* i.e. * richness: Jaccard, UniFrac no ponderado). Las métricas de UniFrac son únicas ya que tienen en cuenta la información filogenética. Entonces, si dos muestras tienen un conjunto completamente diferente de OTU, pero están estrechamente relacionadas filogenéticamente, la distancia entre estas muestras en el espacio UniFrac sería menor que si los dos conjuntos de OTU estuvieran relacionados de forma lejana.

Beta-diversity aparece como los siguientes (números completamente inventados)

Visualization

The best way to visualize beta-diversity, or how different samples are from each other, is by non-metric multidimensional scaling (nMDS). This is similar to principle coordinate analysis or PCA/PCoA if you’ve heard of that, only nMDS is more statistically robust with multiple iterations in the form of the trymax part of the command.

Each symbol on an nMDS plot represents the total microbial community of that sample. Symbols closer together have more similar microbiotas while those farther apart have less similar.

Visualización

La mejor forma de visualizar la diversidad beta, o la diferencia de muestras entre sí, es mediante escalas multidimensionales no métricas (nMDS). Esto es similar al análisis de coordenadas principales o PCA / PCoA, si ha oído hablar de eso, solo nMDS es más robusto estadísticamente con múltiples iteraciones en la forma de la parte trymax del comando.

Cada símbolo en un diagrama nMDS representa la comunidad microbiana total de esa muestra. Los símbolos más cercanos tienen microbiotas más similares, mientras que los más alejados tienen menos similares.

OTU-based metrics

There are two main type of beta-diversity measures. These OTU-based metrics treat every OTU as a separate entity without taking taxonomy into account. The distance between Prevotella OTU1 and Prevotella OTU2 is equivalent to the distance between Prevotella OTU1 and Bacteroides OTU1.

Métricas basadas en OTU

Hay dos tipos principales de medidas de diversidad beta. Estas métricas basadas en OTU tratan a cada OTU como una entidad separada sin tener en cuenta la taxonomía. La distancia entre * Prevotella * OTU1 y * Prevotella * OTU2 es equivalente a la distancia entre * Prevotella * OTU1 y * Bacteroides * OTU1.

Ordination scatterplots

First, we calculate the nMDS values for a 2-axis k=2 graph using the OTU-based Bray-Curtis metric that takes into account both the presence/absence and abundance of OTUs in your samples (i.e. diversity). This uses the metaMDS function from the package vegan.

BC.nmds = metaMDS(OTU.clean, distance="bray", k=2, trymax=1000)

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.06208161 
## Run 1 stress 0.06210668 
## ... Procrustes: rmse 0.001636313  max resid 0.005662513 
## ... Similar to previous best
## Run 2 stress 0.06208261 
## ... Procrustes: rmse 0.0008174643  max resid 0.00186259 
## ... Similar to previous best
## Run 3 stress 0.06208133 
## ... New best solution
## ... Procrustes: rmse 0.000495613  max resid 0.001143981 
## ... Similar to previous best
## Run 4 stress 0.06208228 
## ... Procrustes: rmse 0.0002768028  max resid 0.0006083455 
## ... Similar to previous best
## Run 5 stress 0.06208254 
## ... Procrustes: rmse 0.0003377152  max resid 0.0007457908 
## ... Similar to previous best
## Run 6 stress 0.06208233 
## ... Procrustes: rmse 0.000285801  max resid 0.000626649 
## ... Similar to previous best
## Run 7 stress 0.06210685 
## ... Procrustes: rmse 0.001453303  max resid 0.005539077 
## ... Similar to previous best
## Run 8 stress 0.062104 
## ... Procrustes: rmse 0.001430176  max resid 0.005147467 
## ... Similar to previous best
## Run 9 stress 0.06208351 
## ... Procrustes: rmse 0.0005018534  max resid 0.00111944 
## ... Similar to previous best
## Run 10 stress 0.06208269 
## ... Procrustes: rmse 0.0003614257  max resid 0.0008024269 
## ... Similar to previous best
## Run 11 stress 0.06208154 
## ... Procrustes: rmse 0.0004861021  max resid 0.001120926 
## ... Similar to previous best
## Run 12 stress 0.06212707 
## ... Procrustes: rmse 0.001859292  max resid 0.005339963 
## ... Similar to previous best
## Run 13 stress 0.3702005 
## Run 14 stress 0.06210406 
## ... Procrustes: rmse 0.001425256  max resid 0.00512563 
## ... Similar to previous best
## Run 15 stress 0.06208142 
## ... Procrustes: rmse 3.189023e-05  max resid 6.612762e-05 
## ... Similar to previous best
## Run 16 stress 0.06210429 
## ... Procrustes: rmse 0.001578454  max resid 0.005195898 
## ... Similar to previous best
## Run 17 stress 0.06210796 
## ... Procrustes: rmse 0.00155285  max resid 0.005626229 
## ... Similar to previous best
## Run 18 stress 0.06208191 
## ... Procrustes: rmse 0.0001981339  max resid 0.0004391198 
## ... Similar to previous best
## Run 19 stress 0.06208168 
## ... Procrustes: rmse 0.0001331311  max resid 0.000291077 
## ... Similar to previous best
## Run 20 stress 0.06210592 
## ... Procrustes: rmse 0.001396183  max resid 0.005412384 
## ... Similar to previous best
## *** Solution reached

We see that we reached a convergent solution around 20 iterations and our stress is very low (0.06), meaning that 2-axis are sufficient to view the data. Generally, we are looking for stress values below 0.2, and are very happy with stress values at or around 0.1. If your stress values are high, you may need to add more axes to the ordination. Plotting stress vs. number of axes will reveal a “break point” after which stress does not decrease much, as seen in this figure.

The goal is to use the minimum number of axes for which the stress is reasonably low.

So, since our stress is fine, we will plot the nMDS with different colors for your different groups of interest. We will use colors for our three ages

NOTE: It is important to be mindful when selecting colors for publication/presentation that you are using a palate that is accessible to everyone. Be mindful of colorblindness! Here is a neat resource for choosing colors:http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3. R can accept lots of strings as color names (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf), but the R Color Brewer 2 allows you to find hex values and get a lot more precise, as well as limit yourself to colorblind-safe options!

par(mfrow = c(1, 1))
#Create a blank plot for the nmds
plot(BC.nmds, type="n", main="Bray-Curtis")
#Add the points colored by age
points(BC.nmds, display="sites", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])
#Add a legend
legend(-5.5, 2.5, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)

This will create a plot in the lower right quadrant. If you want to get fancy, type ?plot in the console to see other ways to modify the plot function.

A similar thing can be done for the Jaccard metric, which only takes into account presence/absence (i.e. richness).

J.nmds = metaMDS(OTU.clean, distance="jaccard", k=2, trymax=1000)

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.0620818 
## Run 1 stress 0.06208178 
## ... New best solution
## ... Procrustes: rmse 0.0007016851  max resid 0.001623036 
## ... Similar to previous best
## Run 2 stress 0.06210633 
## ... Procrustes: rmse 0.001409348  max resid 0.005467011 
## ... Similar to previous best
## Run 3 stress 0.06210745 
## ... Procrustes: rmse 0.001470069  max resid 0.00557513 
## ... Similar to previous best
## Run 4 stress 0.06208144 
## ... New best solution
## ... Procrustes: rmse 0.0001309513  max resid 0.0002717662 
## ... Similar to previous best
## Run 5 stress 0.06208156 
## ... Procrustes: rmse 5.349512e-05  max resid 0.0001195792 
## ... Similar to previous best
## Run 6 stress 0.06208137 
## ... New best solution
## ... Procrustes: rmse 2.027381e-05  max resid 4.710602e-05 
## ... Similar to previous best
## Run 7 stress 0.06208345 
## ... Procrustes: rmse 0.0004560942  max resid 0.001010311 
## ... Similar to previous best
## Run 8 stress 0.06210681 
## ... Procrustes: rmse 0.001448074  max resid 0.005531499 
## ... Similar to previous best
## Run 9 stress 0.06208334 
## ... Procrustes: rmse 0.0004470347  max resid 0.000984174 
## ... Similar to previous best
## Run 10 stress 0.06208155 
## ... Procrustes: rmse 7.705878e-05  max resid 0.0001651192 
## ... Similar to previous best
## Run 11 stress 0.06208217 
## ... Procrustes: rmse 0.0002412108  max resid 0.0005340427 
## ... Similar to previous best
## Run 12 stress 0.06210429 
## ... Procrustes: rmse 0.001420012  max resid 0.005133791 
## ... Similar to previous best
## Run 13 stress 0.06208263 
## ... Procrustes: rmse 0.0002884997  max resid 0.0006395557 
## ... Similar to previous best
## Run 14 stress 0.06208166 
## ... Procrustes: rmse 0.0001135875  max resid 0.0002424163 
## ... Similar to previous best
## Run 15 stress 0.06210651 
## ... Procrustes: rmse 0.001438738  max resid 0.005503184 
## ... Similar to previous best
## Run 16 stress 0.06208137 
## ... New best solution
## ... Procrustes: rmse 6.516686e-05  max resid 0.0001605969 
## ... Similar to previous best
## Run 17 stress 0.06208244 
## ... Procrustes: rmse 0.0002976643  max resid 0.0007159927 
## ... Similar to previous best
## Run 18 stress 0.06208222 
## ... Procrustes: rmse 0.0002618419  max resid 0.0006358936 
## ... Similar to previous best
## Run 19 stress 0.06208197 
## ... Procrustes: rmse 0.000208525  max resid 0.0005678922 
## ... Similar to previous best
## Run 20 stress 0.0620832 
## ... Procrustes: rmse 0.0004189108  max resid 0.0009707012 
## ... Similar to previous best
## *** Solution reached

plot(J.nmds, type="n", main="Jaccard")
points(J.nmds, display="sites", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])
legend(-3, 1.5, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)

You see that the values are very different for Jaccard but the pattern of points is very similar to Bray-Curtis. This is because Jaccard is a transformation of Bray-Curtis with J = 2BC/(1+BC)

Ordination scatterplots

Primero, calculamos los valores de nMDS para un gráfico de 2 ejes “k = 2” usando la métrica de Bray-Curtis basada en OTU que toma en cuenta tanto la presencia / ausencia como la abundancia de OTU en sus muestras (* i.e. * diversity). Esto usa la función metaMDS del paquetevegan.

Vemos que alcanzamos una solución convergente de alrededor de 20 iteraciones y nuestro estrés es muy bajo (0.06), lo que significa que los 2 ejes son suficientes para ver los datos. En general, estamos buscando valores de estrés por debajo de 0.2, y estamos muy contentos con los valores de estrés en o alrededor de 0.1. Si sus valores de estrés son altos, es posible que deba agregar más ejes a la ordenación. Trazar la tensión en función del número de ejes revelará un “punto de ruptura” después del cual el estrés no disminuirá mucho, como se ve en esta figura.

El objetivo es usar la cantidad mínima de ejes para los cuales el estrés es razonablemente bajo.

Entonces, dado que nuestro estrés está bien, tramaremos el nMDS con diferentes colores para sus diferentes grupos de interés. Usaremos colores para nuestras tres edades

** NOTA: ** Es importante tener en cuenta al seleccionar los colores para la publicación / presentación que está usando un paladar accesible para todos. ¡Sea consciente del daltonismo! Aquí hay un recurso perfecto para elegir colores: http: //colorbrewer2.org/#type=sequential&scheme=BuGn&n=3. R puede aceptar muchas cadenas como nombres de colores (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf), pero la R Color Brewer 2 le permite encontrar valores hexadecimales y obtener mucha más precisión , ¡y también se limita a las opciones de daltonismo seguro!

Esto creará una trama en el cuadrante inferior derecho. Si quieres ser elegante, escribe ? Plot en la consola para ver otras formas de modificar la función de trazado.

Se puede hacer algo similar con la métrica de Jaccard, que solo tiene en cuenta la presencia / ausencia (* i.e. * riqueza).

Verá que los valores son muy diferentes para Jaccard, pero el patrón de puntos es muy similar al de Bray-Curtis. Esto es porque Jaccard es una transformación de Bray-Curtis con J = 2BC / (1 + BC)

Ellipses

You can also plot standard error (se) ellipses for your nMDS data instead of showing all of the individual points. Here, we will plot standard error ellipses for the Bray-Curtis metric using ordiellipse from vegan.

Code courtesy of Madison Cox.

plot(BC.nmds, type="n", main="Bray-Curtis")
legend(-5.5, 2.5, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)

#Add an ellipse for 2w
ordiellipse(BC.nmds, groups=meta$AgeGroup, display="sites", kind="se", label=FALSE, col="green", draw="polygon", alpha=200, show.groups = c("2w"), border=FALSE)

#Add an ellipse for 8w
ordiellipse(BC.nmds, groups=meta$AgeGroup, display="sites", kind="se", label=FALSE, col="red", draw="polygon", alpha=200, show.groups = c("8w"), border=FALSE)

#Add an ellipse for 1yr
ordiellipse(BC.nmds, groups=meta$AgeGroup, display="sites", kind="se", label=FALSE, col="blue", draw="polygon", alpha=200, show.groups = c("1yr"), border=FALSE)

We clearly see in both the dot and ellipse plots that age significantly impacts the overall structure (Bray-Curtis) and composition (Jaccard) of the fecal bacterial microbiota.

Elipses

También puede trazar puntos suspensivos estándar (se) para sus datos nMDS en lugar de mostrar todos los puntos individuales. Aquí, trazaremos elipses de error estándar para la métrica de Bray-Curtis usando ordiellipse devegan.

Código cortesía de Madison Cox.

Vemos claramente tanto en las gráficas de punto como de elipse que la edad impacta significativamente la estructura general (Bray-Curtis) y la composición (Jaccard) de la microbiota bacteriana fecal.

3D plots

If your stress is high (like over 0.3) for your metaMDS calculation, you probably need to increase to 3 axes k=3. Graphing a 3D plot is much more complicated, and there are a number of packages that could be used. Here, we will use one option from the plotly package to visualize a 3D Bray-Curtis plot.

#Calculate the Bray-Curtis nMDS for 3-axis
BC.nmds.3D = metaMDS(OTU.clean, distance="bray", k=3, trymax=1000)

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.04686346 
## Run 1 stress 0.04741659 
## Run 2 stress 0.04673425 
## ... New best solution
## ... Procrustes: rmse 0.01073904  max resid 0.0344814 
## Run 3 stress 0.05061835 
## Run 4 stress 0.04740131 
## Run 5 stress 0.04984642 
## Run 6 stress 0.04747801 
## Run 7 stress 0.05226505 
## Run 8 stress 0.05295437 
## Run 9 stress 0.04741387 
## Run 10 stress 0.0457586 
## ... New best solution
## ... Procrustes: rmse 0.03868237  max resid 0.1296728 
## Run 11 stress 0.05094992 
## Run 12 stress 0.04719303 
## Run 13 stress 0.05012352 
## Run 14 stress 0.04750204 
## Run 15 stress 0.0479423 
## Run 16 stress 0.04579561 
## ... Procrustes: rmse 0.004692476  max resid 0.01495666 
## Run 17 stress 0.05069634 
## Run 18 stress 0.0485804 
## Run 19 stress 0.05058189 
## Run 20 stress 0.04859459 
## Run 21 stress 0.04996713 
## Run 22 stress 0.04740079 
## Run 23 stress 0.04747632 
## Run 24 stress 0.04675455 
## Run 25 stress 0.04747574 
## Run 26 stress 0.0486171 
## Run 27 stress 0.04575823 
## ... New best solution
## ... Procrustes: rmse 0.0005374711  max resid 0.0008831403 
## ... Similar to previous best
## *** Solution reached

Extract x-y-z values for this nMDS

BCxyz = scores(BC.nmds.3D, display="sites")
#This is a table that looks like 
BCxyz

##                 NMDS1       NMDS2        NMDS3
## 5017.1yr.F -4.7973931  0.33029806 -0.211481225
## 5017.2w.F   3.1867260  0.06208276  1.484970505
## 5017.8w.F   1.0614871 -2.13025264 -1.218243774
## 5020.1yr.F -4.7579235  0.24440345 -0.002888360
## 5020.2w.F   3.4979230 -1.00981047  1.015200903
## 5020.8w.F   1.5897780 -1.93435391  0.464128291
## 5026.1yr.F -4.7720517  0.20611823  0.214815994
## 5026.2w.F   3.3976411  1.10010056 -0.616957559
## 5026.8w.F   3.1483050  2.07715934  1.478767471
## 5031.1yr.F -4.8021402  0.44250394  0.202447638
## 5031.2w.F   3.3537430  0.48376070 -1.490408346
## 5031.8w.F   0.8577869 -1.64300786  0.250766536
## 5037.1yr.F -4.8522745  0.48898068 -0.004218580
## 5037.2w.F   3.6593056  0.26886383 -0.507062657
## 5037.8w.F   3.1326413 -0.82210579 -0.024946820
## 5041.1yr.F -4.7724198  0.28335210  0.060469429
## 5041.2w.F   3.1661815  2.43615798 -1.218459457
## 5041.8w.F   1.0947996 -2.58325770 -0.236659085
## 5045.1yr.F -4.7522029  0.16444286  0.004405471
## 5045.2w.F   1.5110480  3.11956405 -0.469494555
## 5045.8w.F   1.4900615 -2.17087166 -0.450930039
## 5053.1yr.F -4.8259682  0.39929033 -0.016428020
## 5053.2w.F   3.2932453  2.30299477  0.813801957
## 5053.8w.F   0.8917011 -2.11641360  0.478404284

Plot the xyz coordinates and color by age

plot_ly(x=BCxyz[,1], y=BCxyz[,2], z=BCxyz[,3], type="scatter3d", mode="markers", color=meta$AgeGroup, colors=c("blue", "green", "red"))

Note: Since 3D plots are difficult to interpret in printed journal articles, many authors choose to create two separate 2D plots to show the 3D data like so.

par(mfrow=c(1,2))
#Axis 1 and 2 (x and y)
plot(BCxyz[,1], BCxyz[,2], main="Bray-Curtis 1:2", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])
legend(-5.4, 3, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
#Axis 1 and 3 (x and z)
plot(BCxyz[,1], BCxyz[,3], main="Bray-Curtis 1:3", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])

Parcelas 3D

Si tu estrés es alto (como más de 0.3) para tu cálculo de ‘metaMDS’, probablemente necesites aumentar a 3 ejes ’k = 3. Graficar una trama en 3D es mucho más complicado, y hay una serie de paquetes que podrían ser utilizados. Aquí, usaremos una opción del paqueteplotly` para visualizar un trazado 3D Bray-Curtis.

Extraiga los valores de x-y-z para este nMDS

Dibuja las coordenadas xyz y colorea por edad

** Nota **: Debido a que las tramas 3D son difíciles de interpretar en artículos impresos de revistas, muchos autores eligen crear dos trazados 2D separados para mostrar los datos 3D de esa manera.

Phylogentic-based metrics

The most common of this type of beta-diversity metrics is UniFrac. The strength of UniFrac over Bray-Curtis or Jaccard is that it takes into account phylogenetic relationships of the species present in the microbiota. Thus, samples with different OTUs from the same genus will be more similar by UniFrac that those with OTUs from different genera. The weakness is that UniFrac is more sensitive to low abundance OTUs and those that a very phylogenetically distant.

Your choice will depend on how much you personally feel phylogenetic relationships vs. sensitively matter in your data.

Just as above, UniFrac can be plotted as an nMDS. You just need to use a different R package, and thus, slightly different commands.

Métricas basadas en filogenia El más común de este tipo de métrica de diversidad beta es UniFrac. La fuerza de UniFrac sobre Bray-Curtis o Jaccard es que toma en cuenta las relaciones filogenéticas de las especies presentes en la microbiota. Por lo tanto, las muestras con diferentes OTU del mismo género serán más similares por UniFrac que aquellas con OTU de diferentes géneros. La debilidad es que UniFrac es más sensible a las OTU de baja abundancia y aquellas que son muy filogenéticamente distantes.

Su elección dependerá de cuánto siente personalmente las relaciones filogenéticas frente a la materia sensible en sus datos.

Al igual que arriba, UniFrac se puede trazar como un nMDS. Solo necesita usar un paquete R diferente, y por lo tanto, comandos ligeramente diferentes.

Create physeq object

To start, you must make a phyloseq object which includes the OTU.clean, meta, and tax.clean data. We tell R which tables are each type.

OTU.UF = otu_table(as.matrix(OTU.clean), taxa_are_rows=FALSE)
tax.UF = tax_table(as.matrix(tax.clean))
meta.UF = sample_data(meta)

We then merge these into an object of class phyloseq. This object now holds all of our taxonomic information, all of our metadata, and all of our OTU counts by sample.

physeq = phyloseq(OTU.UF, tax.UF, meta.UF)

To add the phylogenetic component to UniFrac, we calculate a rooted phylogenetic tree of our OTUs. This takes a long time so we have provided the tree for you.

However, if we were to calculate a tree, first, we import a distance matrix created from representative sequences of our OTUs. We would use phangorn to read the file as it was created in mothur as seen under “Trees of OTUs” here.

DO NOT RUN THIS

dist.mat = import_mothur_dist("clean_repFasta.phylip.dist")

We would then calculate a rooted neighbor-joining tree from the distance matrix using the ape package.

DO NOT RUN THIS

NJ.tree = bionj(dist.mat)

Instead, we have pre-calculated this tree and you can load is with

load("NJ.tree.Rdata")

Then, add this tree to your physeq object. This object will be what is used in UniFrac calculations.

physeq.tree = merge_phyloseq(physeq, NJ.tree)

We can look at this object and see its components.

physeq.tree

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5002 taxa and 24 samples ]
## sample_data() Sample Data:       [ 24 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 5002 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5002 tips and 5000 internal nodes ]

Crear objeto physeq

Para comenzar, debe hacer un objeto phyloseq que incluya los datos OTU.clean, meta e tax.clean. Le decimos a R qué tablas son de cada tipo.

Luego fusionamos estos en un objeto de clase phyloseq. Este objeto ahora contiene toda nuestra información taxonómica, todos nuestros metadatos y todos los conteos de OTU por muestra.

Para agregar el componente filogenético a UniFrac, calculamos un árbol filogenético enraizado de nuestras OTU. Esto lleva mucho tiempo, por lo que hemos proporcionado el árbol para usted.

Sin embargo, si tuviéramos que calcular un árbol, primero, importamos una matriz de distancia creada a partir de secuencias representativas de nuestras OTU. Usaríamos phangorn para leer el archivo tal como fue creado en mothur como se ve en" Árboles de OTUs “[aquí] (http://rpubs.com/maddieSC/mothur_SOP_UCR_Jan_2018).

** NO EJECUTE ESTO **

Entonces, calcularíamos un árbol arraigado de unión de vecinos a partir de la matriz de distancia utilizando el paquete ape.

** NO EJECUTE ESTO **

En cambio, hemos precalculado este árbol y puede cargarlo con

Luego, agregue este árbol a su objeto physeq. Este objeto será lo que se usa en los cálculos de UniFrac.

Podemos mirar este objeto y ver sus componentes.

Ordination scatterplots

Calculate weighted UniFrac (i.e. diversity) distances and ordinate into an nMDS. We specify weighted with weighted=TRUE.

wUF.ordu = ordinate(physeq.tree, method="NMDS", distance="unifrac", weighted=TRUE)

## Warning in UniFrac(physeq, ...): Randomly assigning root as -- Otu00062 --
## in the phylogenetic tree in the data you provided.

## Run 0 stress 0.0864543 
## Run 1 stress 0.08645377 
## ... New best solution
## ... Procrustes: rmse 0.0001213931  max resid 0.0003141587 
## ... Similar to previous best
## Run 2 stress 0.1335727 
## Run 3 stress 0.1463023 
## Run 4 stress 0.08645329 
## ... New best solution
## ... Procrustes: rmse 0.0007206919  max resid 0.001920389 
## ... Similar to previous best
## Run 5 stress 0.1270238 
## Run 6 stress 0.1157455 
## Run 7 stress 0.1143571 
## Run 8 stress 0.1317677 
## Run 9 stress 0.08645345 
## ... Procrustes: rmse 5.804039e-05  max resid 0.0001620988 
## ... Similar to previous best
## Run 10 stress 0.08808605 
## Run 11 stress 0.08645348 
## ... Procrustes: rmse 0.000642139  max resid 0.001706552 
## ... Similar to previous best
## Run 12 stress 0.1157451 
## Run 13 stress 0.0864534 
## ... Procrustes: rmse 4.051435e-05  max resid 0.0001125382 
## ... Similar to previous best
## Run 14 stress 0.1143564 
## Run 15 stress 0.08659435 
## ... Procrustes: rmse 0.004251655  max resid 0.01804703 
## Run 16 stress 0.1295296 
## Run 17 stress 0.0864538 
## ... Procrustes: rmse 0.000161137  max resid 0.0004585026 
## ... Similar to previous best
## Run 18 stress 0.1347981 
## Run 19 stress 0.08645297 
## ... New best solution
## ... Procrustes: rmse 0.0003657154  max resid 0.0008934259 
## ... Similar to previous best
## Run 20 stress 0.08808625 
## *** Solution reached

You can plot UniFrac nMDS using the basic plot function as we’ve done before.

par(mfrow=c(1,1))
plot(wUF.ordu, type="n", main="Weighted UniFrac")

## Warning in ordiplot(x, choices = choices, type = type, display = display, :
## Species scores not available

points(wUF.ordu, pch=20, display="sites", col=c("blue", "green", "red")[meta$AgeGroup])
legend(0.3,0.15, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)

But let’s also look at the ggplot2 package. This package is incredibly powerful and can be customized in many ways. This document and this website have many helpful tips.

plot_ordination(physeq.tree, wUF.ordu, type="sites", color="AgeGroup") + 
  scale_colour_manual(values=c("2w"="green", "8w"="red", "1yr"="blue")) + 
  theme_bw() + 
  ggtitle("Weighted UniFrac")

Unweighted UniFrac (i.e. richness) can be visualized in the same way. We specify unweighted with weighted=FALSE.

uwUF.ordu = ordinate(physeq.tree, method="NMDS", distance="unifrac", weighted=FALSE)

## Warning in UniFrac(physeq, ...): Randomly assigning root as -- Otu00541 --
## in the phylogenetic tree in the data you provided.

## Run 0 stress 9.695153e-05 
## Run 1 stress 9.657832e-05 
## ... New best solution
## ... Procrustes: rmse 7.750783e-05  max resid 0.0002776914 
## ... Similar to previous best
## Run 2 stress 9.871795e-05 
## ... Procrustes: rmse 8.086551e-05  max resid 0.0002819207 
## ... Similar to previous best
## Run 3 stress 9.488623e-05 
## ... New best solution
## ... Procrustes: rmse 7.261501e-05  max resid 0.0002642816 
## ... Similar to previous best
## Run 4 stress 9.862006e-05 
## ... Procrustes: rmse 1.701217e-05  max resid 5.025527e-05 
## ... Similar to previous best
## Run 5 stress 9.806631e-05 
## ... Procrustes: rmse 0.0001070473  max resid 0.0002353732 
## ... Similar to previous best
## Run 6 stress 9.757454e-05 
## ... Procrustes: rmse 3.985665e-05  max resid 0.0001388531 
## ... Similar to previous best
## Run 7 stress 9.826177e-05 
## ... Procrustes: rmse 9.722135e-05  max resid 0.0002191936 
## ... Similar to previous best
## Run 8 stress 9.695708e-05 
## ... Procrustes: rmse 7.448687e-05  max resid 0.0002751687 
## ... Similar to previous best
## Run 9 stress 9.907648e-05 
## ... Procrustes: rmse 9.310993e-05  max resid 0.0002388289 
## ... Similar to previous best
## Run 10 stress 9.984534e-05 
## ... Procrustes: rmse 3.384419e-05  max resid 0.0001260377 
## ... Similar to previous best
## Run 11 stress 9.684607e-05 
## ... Procrustes: rmse 0.0001319037  max resid 0.0003356478 
## ... Similar to previous best
## Run 12 stress 9.69891e-05 
## ... Procrustes: rmse 8.404145e-06  max resid 2.447679e-05 
## ... Similar to previous best
## Run 13 stress 0.0002969569 
## ... Procrustes: rmse 0.0003866364  max resid 0.0006715474 
## ... Similar to previous best
## Run 14 stress 9.723199e-05 
## ... Procrustes: rmse 3.731826e-05  max resid 0.0001336343 
## ... Similar to previous best
## Run 15 stress 9.99257e-05 
## ... Procrustes: rmse 0.0001270356  max resid 0.0003614341 
## ... Similar to previous best
## Run 16 stress 9.955355e-05 
## ... Procrustes: rmse 6.056256e-05  max resid 0.0001673759 
## ... Similar to previous best
## Run 17 stress 9.589429e-05 
## ... Procrustes: rmse 1.686683e-05  max resid 4.596185e-05 
## ... Similar to previous best
## Run 18 stress 9.633493e-05 
## ... Procrustes: rmse 3.660483e-05  max resid 0.0001324208 
## ... Similar to previous best
## Run 19 stress 9.921893e-05 
## ... Procrustes: rmse 1.085938e-05  max resid 1.669484e-05 
## ... Similar to previous best
## Run 20 stress 9.637055e-05 
## ... Procrustes: rmse 6.450683e-05  max resid 0.0001970587 
## ... Similar to previous best
## *** Solution reached

## Warning in metaMDS(ps.dist): Stress is (nearly) zero - you may have
## insufficient data

plot_ordination(physeq.tree, uwUF.ordu, type="sites", color="AgeGroup") + 
  scale_colour_manual(values=c("2w"="green", "8w"="red", "1yr"="blue")) + 
  theme_bw() + 
  ggtitle("Unweighted UniFrac")

Ordination scatterplots

Calcule distancias ponderadas de UniFrac (* i.e. * diversity) y ordinate en un nMDS. Especificamos ponderado con weighted = TRUE.

Puede trazar UniFrac nMDS usando la función plot básica como lo hemos hecho antes.

Pero también miremos el paquete ggplot2. Este paquete es increíblemente poderoso y se puede personalizar de muchas maneras. [Este documento] (https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) y [este sitio web] (http://kbroman.org/Workshop_DataCarpNSBE/ggplot2 .html) tiene muchos consejos útiles.

UniFrac no ponderado (* i.e. * riqueza) se puede visualizar de la misma manera. Especificamos no ponderado con weighted = FALSE.

Ellipses

Ellipses can be plotted instead of points as well. With the basic plot function:

plot(wUF.ordu, type="n", main="Weighted UniFrac")

## Warning in ordiplot(x, choices = choices, type = type, display = display, :
## Species scores not available

legend(0.3, 0.15, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)

#Add an ellipse for 2w
ordiellipse(wUF.ordu, groups=meta$AgeGroup, display="sites", kind="se",  label=FALSE, col="green", draw="polygon", alpha=200, show.groups = c("2w"), border=FALSE)

#Add an ellipse for 8w
ordiellipse(wUF.ordu, groups=meta$AgeGroup, display="sites", kind="se",  label=FALSE, col="red", draw="polygon", alpha=200, show.groups = c("8w"), border=FALSE)

#Add an ellipse for 1yr
ordiellipse(wUF.ordu, groups=meta$AgeGroup, display="sites", kind="se",  label=FALSE, col="blue", draw="polygon", alpha=200, show.groups = c("1yr"), border=FALSE)

We can also plot ellipses in ggplot2. However, these ellipses are not the exact same at the standard error ellipses used with OTU-based metrics as they use different underlying calculations. However, they get at the same question of standard error for groups of points on an nMDS.

We plot ellipses with ggplot2 by adding the stat_ellipse function to our plot.

plot_ordination(physeq.tree, wUF.ordu, type="sites", color="AgeGroup") + 
  scale_colour_manual(values=c("2w"="green", "8w"="red", "1yr"="blue")) + 
  theme_bw() + 
  stat_ellipse() + 
  ggtitle("Weighted UniFrac")

Elipses

Los elipses se pueden trazar en lugar de puntos también. Con la función básica de la trama:

También podemos trazar elipsis en ggplot2. Sin embargo, estas elipses no son exactamente iguales en las elipses de error estándar que se usan con las métricas basadas en OTU, ya que utilizan diferentes cálculos subyacentes. Sin embargo, abordan la misma cuestión del error estándar para grupos de puntos en un nMDS.

Trazamos las elipsis con ggplot2 agregando la funciónstat_ellipse a nuestra gráfica.

3D plots

3D UniFrac ordinations are not currently supported by phyloseq. We see that our ordinations only include 2 dimensions.

wUF.ordu

## 
## Call:
## metaMDS(comm = ps.dist) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     ps.dist 
## Distance: user supplied 
## 
## Dimensions: 2 
## Stress:     0.08645297 
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation 
## Species: scores missing

But we can instead calculate UniFrac distances using UniFrac and ordinating for 3-axes with metaMDS.

wUF.dist = UniFrac(physeq.tree, weighted=TRUE, normalized=TRUE)

## Warning in UniFrac(physeq.tree, weighted = TRUE, normalized = TRUE):
## Randomly assigning root as -- Otu03194 -- in the phylogenetic tree in the
## data you provided.

wUF.nmds.3D = metaMDS(wUF.dist, method="NMDS", k=3)

## Run 0 stress 0.04217486 
## Run 1 stress 0.05952471 
## Run 2 stress 0.05952709 
## Run 3 stress 0.042174 
## ... New best solution
## ... Procrustes: rmse 0.0003317483  max resid 0.0007893038 
## ... Similar to previous best
## Run 4 stress 0.04217542 
## ... Procrustes: rmse 0.0005403913  max resid 0.0014387 
## ... Similar to previous best
## Run 5 stress 0.0421741 
## ... Procrustes: rmse 0.0001810271  max resid 0.000555628 
## ... Similar to previous best
## Run 6 stress 0.05952602 
## Run 7 stress 0.04217451 
## ... Procrustes: rmse 0.0003976044  max resid 0.001227917 
## ... Similar to previous best
## Run 8 stress 0.06815104 
## Run 9 stress 0.05952564 
## Run 10 stress 0.04217457 
## ... Procrustes: rmse 0.0004479109  max resid 0.001435945 
## ... Similar to previous best
## Run 11 stress 0.04217428 
## ... Procrustes: rmse 0.0003207273  max resid 0.0009212836 
## ... Similar to previous best
## Run 12 stress 0.04217476 
## ... Procrustes: rmse 0.0004904995  max resid 0.001357519 
## ... Similar to previous best
## Run 13 stress 0.04217443 
## ... Procrustes: rmse 0.0003308483  max resid 0.0008748533 
## ... Similar to previous best
## Run 14 stress 0.04217414 
## ... Procrustes: rmse 0.0002102509  max resid 0.000611423 
## ... Similar to previous best
## Run 15 stress 0.04217491 
## ... Procrustes: rmse 0.0005257634  max resid 0.001791904 
## ... Similar to previous best
## Run 16 stress 0.04217454 
## ... Procrustes: rmse 0.0003986916  max resid 0.001121447 
## ... Similar to previous best
## Run 17 stress 0.04217553 
## ... Procrustes: rmse 0.0004447142  max resid 0.001546131 
## ... Similar to previous best
## Run 18 stress 0.04217399 
## ... New best solution
## ... Procrustes: rmse 0.0001824097  max resid 0.0005684325 
## ... Similar to previous best
## Run 19 stress 0.04217406 
## ... Procrustes: rmse 7.68744e-05  max resid 0.0001772352 
## ... Similar to previous best
## Run 20 stress 0.04217417 
## ... Procrustes: rmse 0.0001240512  max resid 0.0002862878 
## ... Similar to previous best
## *** Solution reached

Then, similar to what we did with Bray-Curtis/Jaccard, we pull out the xyz values and plot with plotly.

wUFxyz = scores(wUF.nmds.3D, display="sites")
#This is a table that looks like 
wUFxyz

##                  NMDS1        NMDS2       NMDS3
## 5017.1yr.F -0.19591424  0.107765310  0.07968290
## 5017.2w.F   0.40329083  0.187040546 -0.11891085
## 5017.8w.F  -0.06738145  0.046058811 -0.21927277
## 5020.1yr.F -0.21311918  0.100813200  0.06833139
## 5020.2w.F  -0.02918765 -0.163606283 -0.02929884
## 5020.8w.F   0.03375300  0.054503745 -0.09099989
## 5026.1yr.F -0.22482781  0.066613100  0.05594134
## 5026.2w.F   0.13241677 -0.217029557  0.08745439
## 5026.8w.F   0.38996273  0.135464299  0.24011205
## 5031.1yr.F -0.19996967  0.080398029  0.09445703
## 5031.2w.F   0.19084848 -0.256852240  0.01563640
## 5031.8w.F  -0.13587208 -0.042300350 -0.02591350
## 5037.1yr.F -0.21800838  0.076413856  0.07189119
## 5037.2w.F   0.05187202 -0.120151694 -0.04223782
## 5037.8w.F   0.14227112 -0.115591151 -0.01897721
## 5041.1yr.F -0.20911338  0.081709200  0.07441520
## 5041.2w.F   0.27813371 -0.237693762  0.03647625
## 5041.8w.F  -0.13928666 -0.001531998 -0.18656755
## 5045.1yr.F -0.23328251  0.051043269  0.06274834
## 5045.2w.F   0.49259170  0.294540193 -0.14634317
## 5045.8w.F  -0.16902451 -0.126094687 -0.13841874
## 5053.1yr.F -0.21539833  0.077884489  0.08008741
## 5053.2w.F   0.27502987 -0.030380383  0.17559141
## 5053.8w.F  -0.13978439 -0.049015941 -0.12588496

plot_ly(x=wUFxyz[,1], y=wUFxyz[,2], z=wUFxyz[,3], type="scatter3d", mode="markers", color=meta$AgeGroup, colors=c("blue", "green", "red"))

Parcelas 3D

Las ordenaciones 3D UniFrac actualmente no son compatibles con phyloseq. Vemos que nuestras ordenaciones solo incluyen 2 dimensiones.

Pero podemos calcular las distancias de UniFrac usando UniFrac y ordenando 3-axes conmetaMDS.

Entonces, de manera similar a lo que hicimos con Bray-Curtis / Jaccard, sacamos los valores xyz y trazamos con plotly.

Vectors for continuous variables

While it is easy to visualize categorical groups with coloring in nMDS, it is difficult to achieve the same effect with continuous variables. Instead, we can fit these variables as a vector on our nMDS plots.

To do this, we first fit the variables to our distances using the envfit function in vegan. You can do Bray-Curtis, Jaccard, weighted or unweighted UniFrac. Here, we will demonstrate with Bray-Curtis and weighted UniFrac.

fit.BC = envfit(BC.nmds, meta) 
fit.BC

## 
## ***VECTORS
## 
##                  NMDS1    NMDS2     r2 Pr(>r)    
## AgeExact      -0.99887 -0.04744 0.9765  0.001 ***
## ADGKG          0.12503  0.99215 0.0770  0.444    
## chao          -0.98567  0.16868 0.9599  0.001 ***
## shannon       -0.69400  0.71997 0.9469  0.001 ***
## simpson        0.42087 -0.90712 0.7353  0.001 ***
## ace           -0.99746  0.07129 0.9078  0.001 ***
## shannon.vegan -0.71229  0.70188 0.9591  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
## 
## ***FACTORS:
## 
## Centroids:
##                   NMDS1   NMDS2
## Animalcow5017   -0.1841  0.5449
## Animalcow5020    0.0059  0.6577
## Animalcow5026    0.4243 -0.8826
## Animalcow5031   -0.2442  0.1175
## Animalcow5037    0.4946 -0.0566
## Animalcow5041    0.0500 -0.0290
## Animalcow5045   -0.1374 -0.3384
## Animalcow5053   -0.4090 -0.0134
## AgeGroup1yr     -4.4470 -0.1800
## AgeGroup2w       2.5047 -1.0509
## AgeGroup8w       1.9422  1.2309
## AgeGroup.ord2w   2.5047 -1.0509
## AgeGroup.ord8w   1.9422  1.2309
## AgeGroup.ord1yr -4.4470 -0.1800
## 
## Goodness of fit:
##                  r2 Pr(>r)    
## Animal       0.0248  0.997    
## AgeGroup     0.9134  0.001 ***
## AgeGroup.ord 0.9134  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

We see that it has automatically fit every variable in our meta table.

The simplest way around this is to just ask envfit to run on only the variables you want.

fit.BC = envfit(BC.nmds, meta[,c("AgeGroup", "ADGKG")])
fit.BC

## 
## ***VECTORS
## 
##         NMDS1   NMDS2    r2 Pr(>r)
## ADGKG 0.12503 0.99215 0.077  0.452
## Permutation: free
## Number of permutations: 999
## 
## ***FACTORS:
## 
## Centroids:
##               NMDS1   NMDS2
## AgeGroup1yr -4.4470 -0.1800
## AgeGroup2w   2.5047 -1.0509
## AgeGroup8w   1.9422  1.2309
## 
## Goodness of fit:
##              r2 Pr(>r)    
## AgeGroup 0.9134  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

We repeat for weighted UniFrac

fit.wUF = envfit(wUF.ordu, meta[,c("AgeGroup", "ADGKG")])
fit.wUF

## 
## ***VECTORS
## 
##          NMDS1    NMDS2     r2 Pr(>r)
## ADGKG -0.17846  0.98395 0.0398   0.66
## Permutation: free
## Number of permutations: 999
## 
## ***FACTORS:
## 
## Centroids:
##               NMDS1   NMDS2
## AgeGroup1yr -0.1076 -0.0834
## AgeGroup2w   0.1432  0.0322
## AgeGroup8w  -0.0356  0.0511
## 
## Goodness of fit:
##              r2 Pr(>r)    
## AgeGroup 0.5588  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

For categorical variables, envfit will label the centroid of the data for each group in the nMDS with that group’s name. For continuous variables, it adds an arrow in the direction from smallest to largest value.

Note: The P-values for variables in envfit are not equivalent to the P-values for our ANOVA/Kruskal/GLM tests. Instead, envfit P-values tell you how well the arrow or centroids fit the x-y data of the nMDS, not the underlying distance matrix. In general, if your nMDS is a good representation of the data (low stress value) and the variable was significant in its appropriate ANOVA/Kruskal/GLM test, the fitted arrow/centroids will also be significant. And if your nMDS is a good representation of the data and the variable was not significant, the fitted arrow/centroids will also not be significant. We see this type of result here, but this will not always be the case.

However, if your nMDS stress was borderline or not great and/or your variable was borderline significant or not, you may see divergent results for the arrow/centroid. This does not mean that the result you got in ANOVA/Kruskal/GLM was invalid. It just means that it’s difficult to visualize this result as a simple arrow or centroids on a 2D plot. Regardless, non-significant variables in envfit that you know are signficant in other tests may still be represented on an nMDS as a visual aid.

Thus, we plot our 2D nMDS colored by age with an arrow for the ADG variable even though that arrow was not significant. Since the ADG variable was also not significant in GLM, we probably won’t use these plot in a publication, but it is good practice.

For Bray-Curtis:

plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, pch=20, display="sites", col=c("blue", "green", "red")[meta$AgeGroup])
legend(-6, 2, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
#Add fitted variables
plot(fit.BC, col="black")

You could also ask it to only plot variables with a fit P-value < 0.05. So we would only see the centroids

plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, pch=20, display="sites", col=c("blue", "green", "red")[meta$AgeGroup])
legend(-6, 2, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
#Add fitted variables
plot(fit.BC, col="black", p.max=0.05)

Weighted UniFrac

plot(wUF.ordu, type="n", main="Weighted UniFrac")

## Warning in ordiplot(x, choices = choices, type = type, display = display, :
## Species scores not available

points(wUF.ordu, pch=20, display="sites", col=c("blue", "green", "red")[meta$AgeGroup])
legend(.3,.15, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
#Add fitted variables
plot(fit.wUF, col="black")

You could also fit your OTU.clean table to the nMDS to add arrow(s) for specific OTUs within the plot. OTU arrows that, say, go in the same direction as an age group centroid tend to increase in abundance in that age group. The opposite direction would indicate that an OTU decreases in abundance in that age group.

Fitting all OTUs would take awhile so we will only fit the first 10 in our table.

fit.BC.OTU = envfit(BC.nmds, OTU.clean[,1:10])
fit.BC.OTU

## 
## ***VECTORS
## 
##             NMDS1    NMDS2     r2 Pr(>r)    
## Otu00001  0.71738 -0.69668 0.2478  0.033 *  
## Otu00002  0.46984 -0.88275 0.2109  0.057 .  
## Otu00003  0.25719 -0.96636 0.2503  0.021 *  
## Otu00004  0.25006  0.96823 0.2738  0.030 *  
## Otu00005  0.15473  0.98796 0.2910  0.003 ** 
## Otu00006 -0.96867  0.24837 0.6743  0.001 ***
## Otu00007  0.17991 -0.98368 0.2488  0.009 ** 
## Otu00008  0.40157  0.91583 0.3108  0.016 *  
## Otu00009  0.26275 -0.96487 0.1894  0.062 .  
## Otu00010  0.33868 -0.94090 0.1552  0.078 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

#We will only plot significant arrows in this case
plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, pch=20, display="sites", col=c("blue", "green", "red")[meta$AgeGroup])
legend(-6, -1.1, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
#Add fitted variables
plot(fit.BC.OTU, col="black", p.max=0.05)

You could also think about plotting higher taxonomic levels like summed genera or family groups of OTUs.

#Extract all OTUs within the genus Ruminococcus
OTU.Rumino = OTU.clean[,tax.clean$Genus == "g__Ruminococcus"]
#Sum the abundances of the Ruminococcaceae OTUs into one variable (column)
OTU.Rumino$Rumino.sum = rowSums(OTU.Rumino)

#Fit the new Ruminococcaceae group
fit.BC.Rumino = envfit(BC.nmds, OTU.Rumino$Rumino.sum)
fit.BC.Rumino

## 
## ***VECTORS
## 
##         NMDS1    NMDS2     r2 Pr(>r)    
## [1,] -0.14506  0.98942 0.6621  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

#Plot
plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, pch=20, display="sites", col=c("blue", "green", "red")[meta$AgeGroup])
legend(-6, -1.1, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
#Add fitted variables
plot(fit.BC.Rumino, col="black", labels=c("Ruminococcus"))

Finally, you could input a matrix of environmental variables and use the vegan::bioenv function to identify those which are appropriate to use in plotting. This requires your OTU table or distance matrix, a matrix of continuous environmental variables (like our short-chain fatty acid abundances). Since we only have SCFA data for 4 samples today, we will use the subsetted OTU table “OTU.SCFA”.

bioenv.demo <- bioenv(comm=OTU.SCFA, env=SCFA, method="kendall", index="bray")
summary(bioenv.demo)

##                                                                   size
## Isobutyrate                                                          1
## Isobutyrate iVal.2MB                                                 2
## Formate Isobutyrate iVal.2MB                                         3
## Formate Isobutyrate Butyrate iVal.2MB                                4
## Formate Acetate Isobutyrate iVal.2MB Valerate                        5
## Formate Acetate Isobutyrate Butyrate iVal.2MB Valerate               6
## Formate Acetate Propionate Isobutyrate Butyrate iVal.2MB Valerate    7
##                                                                   correlation
## Isobutyrate                                                            0.6000
## Isobutyrate iVal.2MB                                                   0.6000
## Formate Isobutyrate iVal.2MB                                           0.3333
## Formate Isobutyrate Butyrate iVal.2MB                                  0.3333
## Formate Acetate Isobutyrate iVal.2MB Valerate                          0.3333
## Formate Acetate Isobutyrate Butyrate iVal.2MB Valerate                 0.3333
## Formate Acetate Propionate Isobutyrate Butyrate iVal.2MB Valerate      0.2000

Isobutyrate has a fairly high correlation coefficient! Let’s make a new nMDS plot with these 4 samples and see what the vector looks like. I am using an artificially low “trymax” here because I know that the ordination will not converge with only 4 data points, and I don’t want to print 1000 lines of output for you to scroll through like I did in an earlier draft of this document. It is also going to yell at us about not having enough data, but we are going to plot it anyway!

#calculate nMDS
BC.nmds.SCFA <- metaMDS(OTU.SCFA, distance = "bray", k=2, trymax=20)

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0 
## Run 1 stress 0 
## ... Procrustes: rmse 0.06519934  max resid 0.101584 
## Run 2 stress 0 
## ... Procrustes: rmse 0.09068648  max resid 0.1500857 
## Run 3 stress 0 
## ... Procrustes: rmse 0.1482016  max resid 0.17969 
## Run 4 stress 0 
## ... Procrustes: rmse 0.02937013  max resid 0.03419048 
## Run 5 stress 0 
## ... Procrustes: rmse 0.3170615  max resid 0.4415728 
## Run 6 stress 0 
## ... Procrustes: rmse 0.1735384  max resid 0.2186006 
## Run 7 stress 0 
## ... Procrustes: rmse 0.1204908  max resid 0.1396258 
## Run 8 stress 0 
## ... Procrustes: rmse 0.2192493  max resid 0.2889815 
## Run 9 stress 0 
## ... Procrustes: rmse 0.3408676  max resid 0.4577456 
## Run 10 stress 0 
## ... Procrustes: rmse 0.1430815  max resid 0.2345267 
## Run 11 stress 0 
## ... Procrustes: rmse 0.1104757  max resid 0.1265018 
## Run 12 stress 0 
## ... Procrustes: rmse 0.3143317  max resid 0.3447919 
## Run 13 stress 0 
## ... Procrustes: rmse 0.1243163  max resid 0.2083983 
## Run 14 stress 0 
## ... Procrustes: rmse 0.145857  max resid 0.2204875 
## Run 15 stress 0 
## ... Procrustes: rmse 0.2610969  max resid 0.2742735 
## Run 16 stress 0 
## ... Procrustes: rmse 0.1147179  max resid 0.1906836 
## Run 17 stress 0 
## ... Procrustes: rmse 0.3527324  max resid 0.4773902 
## Run 18 stress 0 
## ... Procrustes: rmse 0.1062363  max resid 0.1223868 
## Run 19 stress 0 
## ... Procrustes: rmse 0.1635943  max resid 0.2164571 
## Run 20 stress 0 
## ... Procrustes: rmse 0.02790187  max resid 0.03344752 
## *** No convergence -- monoMDS stopping criteria:
##     20: stress < smin

## Warning in metaMDS(OTU.SCFA, distance = "bray", k = 2, trymax = 20): Stress
## is (nearly) zero - you may have insufficient data

## Warning in postMDS(out$points, dis, plot = max(0, plot - 1), ...): skipping
## half-change scaling: too few points below threshold

#calculate fitted variable
fit.BC.IsoB = envfit(BC.nmds.SCFA, SCFA$Isobutyrate)

## 'nperm' >= set of all permutations: complete enumeration.

## Set of permutations < 'minperm'. Generating entire set.

## 'nperm' >= set of all permutations: complete enumeration.

## Set of permutations < 'minperm'. Generating entire set.

#plot nMDS
plot(BC.nmds.SCFA, type="n", main="Bray-Curtis")
points(BC.nmds.SCFA, pch=20, display="sites", col=c("blue"))

#add fitted variable
plot(fit.BC.IsoB, col="black", labels=c("Isobutyrate"))

Vectores para variables continuas

Si bien es fácil visualizar grupos categóricos con coloración en nMDS, es difícil lograr el mismo efecto con variables continuas. En cambio, podemos ajustar estas variables como un vector en nuestros gráficos nMDS.

Para hacer esto, primero ajustamos las variables a nuestras distancias usando la función envfit envegan. Puede hacer Bray-Curtis, Jaccard, UniFrac ponderado o no ponderado. Aquí, demostraremos con Bray-Curtis y UniFrac ponderado.

Vemos que se ajusta automáticamente a todas las variables en nuestra meta tabla.

La forma más sencilla de evitar esto es pedirle a envfit que ejecute solo las variables que desee.

Repetimos para UniFrac ponderado

Para las variables categóricas, envfit etiquetará el centroide de los datos para cada grupo en el nMDS con el nombre de ese grupo. Para variables continuas, agrega una flecha en la dirección del valor más pequeño al más grande.

** Nota **: Los valores P para variables en envfit no son equivalentes a los valores P para nuestras pruebas ANOVA / Kruskal / GLM. En cambio, los valores P de envfit le indican qué tan bien la flecha o los centroides se ajustan a los datos * x-y del nMDS , no a la matriz de distancia subyacente. En general, si su nMDS es una buena representación de los datos (bajo valor de estrés) y la variable fue significativa en su prueba ANOVA / Kruskal / GLM apropiada, la flecha / centroides ajustados también serán significativos. Y si su nMDS es una buena representación de los datos y la variable no * significativa, la flecha / centroides ajustados tampoco * serán * significativos. Vemos este tipo de resultados aquí, pero este no siempre será el caso.

Sin embargo, si su estrés nMDS era dudoso o no grande y / o su variable era significativa o no significativa, puede ver resultados divergentes para la flecha / centroide. Esto no significa que el resultado obtenido en ANOVA / Kruskal / GLM no fue válido. Simplemente significa que es difícil visualizar este resultado como una simple flecha o centroides en una gráfica 2D. Independientemente, las variables no significativas en el “entorno” que usted sabe que son significativas en otras pruebas aún pueden estar representadas en un nMDS como una ayuda visual.

Por lo tanto, graficamos nuestro nMDS 2D coloreado por edad con una flecha para la variable ADG a pesar de que esa flecha no fue significativa. Dado que la variable ADG tampoco fue significativa en GLM, probablemente no utilicemos estos gráficos en una publicación, pero es una buena práctica.

Para Bray-Curtis:

También podría pedirle que solo trace las variables con un valor de P ajustado <0.05. Entonces solo veríamos los centroides

UniFrac ponderado

También podría ajustar su tabla OTU.clean al nMDS para agregar flechas para unidades OTU específicas dentro de la gráfica. Las flechas OTU que, por ejemplo, van en la misma dirección que un centroide de grupo de edad tienden a aumentar en abundancia en ese grupo de edad. La dirección opuesta indicaría que una OTU disminuye en abundancia en ese grupo de edad.

Ajustar todas las OTU llevaría un tiempo, por lo que solo se ajustarán a los primeros 10 en nuestra tabla.

También podría pensar en trazar niveles taxonómicos superiores como géneros sumados o grupos familiares de OTU.

Finalmente, puede ingresar una matriz de variables ambientales y usar la función vegan :: bioenv para identificar aquellas que son apropiadas para el trazado. Esto requiere su tabla OTU o matriz de distancia, una matriz de variables ambientales continuas (como nuestras abundancias de ácidos grasos de cadena corta). Como solo tenemos datos de SCFA para 4 muestras hoy, utilizaremos la tabla OTU subconjuntada “OTU.SCFA”.

¡El isobutirato tiene un coeficiente de correlación bastante alto! Hagamos un nuevo diagrama de nMDS con estas 4 muestras y veamos cómo se ve el vector. Estoy usando un “trymax” artificialmente bajo aquí porque sé que la ordenación no convergerá con solo 4 puntos de datos, y no quiero imprimir 1000 líneas de salida para que pueda desplazarse como lo hice en un borrador anterior de este documento. También nos gritará que no tenemos suficientes datos, ¡pero lo vamos a trazar de todos modos!

Statistically test beta-diversity

While nMDS gives us a visual of beta-diversity, it does not test for statistical differences. We do this with permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). These test whether the overall microbial community differs by your variable of interest.

You can run them with Bray-Curtis, Jaccard, weighted or unweighted UniFrac to answer different questions. For example, if your variable is significant for Bray-Curtis/weighted UniFrac but not Jaccard/unweighted UniFrac, this means your groups tend to have the same OTUs (richness) but different abundances of those OTUs (diversity). When variables are signficant for Bray-Curtis/Jaccard but not UniFrac, this indicates that your samples have different specific OTUs but similar taxa. Like group 1 has a lot of Prevotella OTU1 and group 2 has a lot of Prevotella OTU2, but they are both Prevotella so UniFrac treats them as being very similar.

Estadísticamente prueba beta-diversity Si bien nMDS nos da una idea de la diversidad beta, no prueba las diferencias estadísticas. Hacemos esto con análisis de varianza permutacionales (PERMANOVA) o análisis de similitud (ANOSIM). Estos evalúan si la comunidad microbiana en general difiere según su variable de interés.

Puede ejecutarlos con Bray-Curtis, Jaccard, UniFrac ponderado o no ponderado para responder diferentes preguntas. Por ejemplo, si su variable es significativa para Bray-Curtis / UniFrac ponderado, pero no para Jaccard / UniFrac no ponderado, esto significa que sus grupos tienden a tener las mismas OTU (riqueza) pero diferentes abundancias de esas OTU (diversidad). Cuando las variables son significativas para Bray-Curtis / Jaccard pero no para UniFrac, esto indica que sus muestras tienen diferentes OTU específicas pero taxa similares. Al igual que el grupo 1 tiene una gran cantidad de * Prevotella * OTU1 y el grupo 2 tiene una gran cantidad de * Prevotella * OTU2, pero ambas son * Prevotella * por lo que UniFrac las trata como muy similares.

PERMANOVA

For Bray-Curtis or Jaccard, we use the vegan package to calculate distances and run PERMANOVA. As with ANOVA/glm of alpha-diversity, we want to include all variables that could interact in one model.

Note: adonis cannot handle or account for NA or blanks in your data. Subset to only samples with complete metadata before running vegdist if these exist.

#Calculate distance and save as a matrix
BC.dist=vegdist(OTU.clean, distance="bray")
#Run PERMANOVA on distances.
adonis(BC.dist ~ AgeGroup*ADGKG, data = meta, permutations = 1000)

## 
## Call:
## adonis(formula = BC.dist ~ AgeGroup * ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##                Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup        2    3.9720 1.98600  8.0116 0.44481 0.000999 ***
## ADGKG           1    0.1979 0.19791  0.7984 0.02216 0.615385    
## AgeGroup:ADGKG  2    0.2976 0.14881  0.6003 0.03333 0.938062    
## Residuals      18    4.4620 0.24789         0.49969             
## Total          23    8.9296                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similarly for Jaccard

J.dist=vegdist(OTU.clean, distance="jaccard")
adonis(J.dist ~ AgeGroup*ADGKG, data = meta, permutations = 1000)

## 
## Call:
## adonis(formula = J.dist ~ AgeGroup * ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##                Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup        2    3.9720 1.98600  8.0116 0.44481 0.000999 ***
## ADGKG           1    0.1979 0.19791  0.7984 0.02216 0.626374    
## AgeGroup:ADGKG  2    0.2976 0.14881  0.6003 0.03333 0.911089    
## Residuals      18    4.4620 0.24789         0.49969             
## Total          23    8.9296                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the interaction is not significant so we remove it.

adonis(BC.dist ~ AgeGroup+ADGKG, data = meta, permutations = 1000)

## 
## Call:
## adonis(formula = BC.dist ~ AgeGroup + ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup   2    3.9720 1.98600  8.3451 0.44481 0.000999 ***
## ADGKG      1    0.1979 0.19791  0.8316 0.02216 0.557443    
## Residuals 20    4.7597 0.23798         0.53302             
## Total     23    8.9296                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

adonis(J.dist ~ AgeGroup+ADGKG, data = meta, permutations = 1000)

## 
## Call:
## adonis(formula = J.dist ~ AgeGroup + ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup   2    3.9720 1.98600  8.3451 0.44481 0.000999 ***
## ADGKG      1    0.1979 0.19791  0.8316 0.02216 0.582418    
## Residuals 20    4.7597 0.23798         0.53302             
## Total     23    8.9296                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For UniFrac, we use the phyloseq package to calculate distances and then vegan to run PERMANOVA.

wUF.dist = UniFrac(physeq.tree, weighted=TRUE, normalized=TRUE)

## Warning in UniFrac(physeq.tree, weighted = TRUE, normalized = TRUE):
## Randomly assigning root as -- Otu01526 -- in the phylogenetic tree in the
## data you provided.

adonis(wUF.dist ~ AgeGroup*ADGKG, data=meta, permutations = 1000)

## 
## Call:
## adonis(formula = wUF.dist ~ AgeGroup * ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##                Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup        2   1.20097 0.60048  7.1369 0.41665 0.000999 ***
## ADGKG           1   0.06947 0.06947  0.8256 0.02410 0.523477    
## AgeGroup:ADGKG  2   0.09753 0.04876  0.5796 0.03384 0.874126    
## Residuals      18   1.51449 0.08414         0.52542             
## Total          23   2.88245                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

uwUF.dist = UniFrac(physeq.tree, weighted=FALSE, normalized=TRUE)

## Warning in UniFrac(physeq.tree, weighted = FALSE, normalized = TRUE):
## Randomly assigning root as -- Otu00627 -- in the phylogenetic tree in the
## data you provided.

adonis(uwUF.dist ~ AgeGroup*ADGKG, data=meta, permutations = 1000)

## 
## Call:
## adonis(formula = uwUF.dist ~ AgeGroup * ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##                Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup        2    3.4896 1.74480  9.1317 0.46919 0.000999 ***
## ADGKG           1    0.2419 0.24185  1.2658 0.03252 0.215784    
## AgeGroup:ADGKG  2    0.2667 0.13337  0.6980 0.03586 0.822178    
## Residuals      18    3.4393 0.19107         0.46242             
## Total          23    7.4374                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remove non-significant interaction term

adonis(wUF.dist ~ AgeGroup+ADGKG, data=meta, permutations = 1000)

## 
## Call:
## adonis(formula = wUF.dist ~ AgeGroup + ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup   2   1.20097 0.60048  7.4501 0.41665 0.000999 ***
## ADGKG      1   0.06947 0.06947  0.8619 0.02410 0.501499    
## Residuals 20   1.61202 0.08060         0.55925             
## Total     23   2.88245                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

adonis(uwUF.dist ~ AgeGroup+ADGKG, data=meta, permutations = 1000)

## 
## Call:
## adonis(formula = uwUF.dist ~ AgeGroup + ADGKG, data = meta, permutations = 1000) 
## 
## Permutation: free
## Number of permutations: 1000
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2   Pr(>F)    
## AgeGroup   2    3.4896 1.74480  9.4161 0.46919 0.000999 ***
## ADGKG      1    0.2419 0.24185  1.3052 0.03252 0.211788    
## Residuals 20    3.7060 0.18530         0.49829             
## Total     23    7.4374                 1.00000             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NOTE: PERMANOVA, like all statistical tests, has certain assumptions about sampling variability. While it does not assume normality, it DOES assume equal beta dispersion between groups. SO, if your betadisper test (see below) shows that your groups have significantly different dispersions, you should use ANOSIM rather than PERMANOVA to look at mean differences between groups in beta diversity space.

PERMANOVA

Para Bray-Curtis o Jaccard, usamos el paquete vegan para calcular distancias y ejecutar PERMANOVA. Al igual que con ANOVA / glm de alfa-diversity, queremos incluir todas las variables que podrían interactuar en un modelo.

** Nota **: adonis no puede manejar ni dar cuenta de NA o espacios en blanco en sus datos. Subconjunto a solo muestras con metadatos completos antes de ejecutar vegdist si estos existen.

Del mismo modo para Jaccard

Vemos que la interacción no es significativa, así que la eliminamos.

Para UniFrac, utilizamos el paquete phyloseq para calcular distancias y luegovegan para ejecutar PERMANOVA.

Eliminar término de interacción no significativo

** NOTA: ** PERMANOVA, como todas las pruebas estadísticas, tiene ciertas suposiciones sobre la variabilidad del muestreo. Si bien no asume la normalidad, SÍ supone una dispersión beta igual entre los grupos. ASÍ, si su prueba betadisper (ver a continuación) muestra que sus grupos tienen dispersiones significativamente diferentes, debe usar ANOSIM en lugar de PERMANOVA para ver las diferencias de medias entre los grupos en el espacio de diversidad beta.

ANOSIM

If you have very different group sizes or different group beta dispersions, you may consider analysis of similarities (ANOSIM) instead of PERMANOVA. This test does not assume equal group variances. However, it only allows simple 1 variable models with no interactions and can only be used for categorical (AgeGroup), not continuous (ADG) variables.

For example, Bray-Curtis:

anosim(BC.dist, meta$AgeGroup, permutations = 1000)

## 
## Call:
## anosim(dat = BC.dist, grouping = meta$AgeGroup, permutations = 1000) 
## Dissimilarity: bray 
## 
## ANOSIM statistic R: 0.8467 
##       Significance: 0.000999 
## 
## Permutation: free
## Number of permutations: 1000

Overall, from the nMDS of various beta-diversity metrics (OTU- and phylogenetic-based) and statistical analyses, it is clear that age significantly impacts the fecal microbiota of dairy cows.

ANOSIM

Si tiene tamaños de grupo muy diferentes o diferentes dispersiones beta de grupo, puede considerar el análisis de similitudes (ANOSIM) en lugar de PERMANOVA. Esta prueba no supone variaciones de grupos iguales. Sin embargo, solo permite modelos simples de 1 variable sin interacciones y solo se puede usar para variables categóricas (AgeGroup), no continuas (ADG).

Por ejemplo, Bray-Curtis:

** En general, del nMDS de varias métricas de diversidad beta (OTU y basadas en filogenia) y análisis estadísticos, está claro que la edad tiene un impacto significativo en la microbiota fecal de las vacas lecheras. **

2D variables

These analyses are for comparing the microbiota to metadata that cannot fit in a single column and therefore, must be represented as a matrix of its own. For example, PERMANOVA can only tell you that the microbiota differs according to a single short chain fatty acid (SCFA), but other tests can tell you that the microbiota differs according to the overall SCFA profile. This section is also useful for comparing data if you have multiple OTU tables, like for bacteria, archaea, and fungi.

Mantel from vegan tests if two distance matrices co-vary e.g. does the data in matrix 1 change in the same way as the data in matrix 2. Like PERMANOVA, this test only tells you that the overall data co-vary, not which specific OTUs or SCFAs matter. You can only compare samples were you have both types of data so we must use the subsetted SCFA OTU table we made earlier.

We then calculate distance matrices separately for each matrix. It is not necessary to do Bray-Curtis, Jaccard and UniFrac here since our SCFA data does not have any taxonomy to it.

dist1 = vegdist(OTU.SCFA)
dist2 = vegdist(SCFA)

Run a Mantel test comparing the 2 matrices.

mantel(dist1, dist2, permutations=100)

## 'nperm' >= set of all permutations: complete enumeration.

## Set of permutations < 'minperm'. Generating entire set.

## 
## Mantel statistic based on Pearson's product-moment correlation 
## 
## Call:
## mantel(xdis = dist1, ydis = dist2, permutations = 100) 
## 
## Mantel statistic r: -0.02423 
##       Significance: 0.54167 
## 
## Upper quantiles of permutations (null model):
##   90%   95% 97.5%   99% 
## 0.540 0.552 0.596 0.629 
## Permutation: free
## Number of permutations: 23

We see that the overall OTU table and SCFA tables do not co-vary.

You can also run Mantel on 3 matrices at once like so

Do not run as we do not have 3 matrices here

mantel.partial(dist1, dist2, dist3, permutations=100)

Variables 2D

Estos análisis son para comparar la microbiota con los metadatos que no caben en una sola columna y, por lo tanto, deben representarse como una matriz propia. Por ejemplo, PERMANOVA solo puede decirle que la microbiota difiere de acuerdo con un único ácido graso de cadena corta (SCFA), pero otras pruebas pueden indicarle que la microbiota difiere de acuerdo con el perfil general de SCFA. Esta sección también es útil para comparar datos si tiene múltiples tablas OTU, como bacterias, arqueas y hongos.

Mantel de las pruebasvegan si dos matrices de distancia co-varían * ej. * ¿Los datos en la matriz 1 cambian de la misma manera que los datos en la matriz 2. Al igual que PERMANOVA, esta prueba solo indica que la información general co- varían, no lo que importan las OTU o SCFAs. Solo puede comparar muestras si tiene ambos tipos de datos, por lo que debemos usar la tabla subconjuntada SCFA OTU que hicimos anteriormente.

Luego calculamos las matrices de distancia por separado para cada matriz. No es necesario hacer aquí a Bray-Curtis, Jaccard y UniFrac ya que nuestros datos de SCFA no tienen ninguna taxonomía.

Ejecute una prueba de Mantel comparando las 2 matrices.

Vemos que la tabla OTU global y las tablas SCFA no co-varían.

También puede ejecutar Mantel en 3 matrices a la vez como tal

** No ejecute ya que no tenemos 3 matrices aquí **

Beta dispersion

Sometimes it will be clear from nMDS that one group tends to vary more (be more spread out) than another group. You can test this statistically with multivariate homogeneity of group dispersion (variances).

Here is an example for Bray-Curtis. We use the same distance matrix we calculated for PERMANOVA/ANOSIM

Calculate dispersion (variances) within each group.

disp.age = betadisper(BC.dist, meta$AgeGroup)

Perform an ANOVA-like test to determine if the variances differ by groups.

permutest(disp.age, pairwise=TRUE, permutations=1000)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 1000
## 
## Response: Distances
##           Df  Sum Sq  Mean Sq     F N.Perm   Pr(>F)    
## Groups     2 0.47459 0.237293 30.93   1000 0.000999 ***
## Residuals 21 0.16111 0.007672                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Pairwise comparisons:
## (Observed p-value below diagonal, permuted p-value above diagonal)
##            1yr         2w     8w
## 1yr            9.9900e-04 0.0010
## 2w  4.8556e-06            0.7632
## 8w  1.2886e-06 7.7206e-01

Combining this with our plot,

plot(BC.nmds, type="n", main="Bray-Curtis")
legend(.6,-2, legend=c("2w","8w","1yr"), col=c("green","red","blue"), pch=20)
ordiellipse(BC.nmds, groups=meta$AgeGroup, display="sites", kind="se",  label=FALSE, col="green", draw="polygon", alpha=200, show.groups = c("2w"), border=FALSE)
ordiellipse(BC.nmds, groups=meta$AgeGroup, display="sites", kind="se",  label=FALSE, col="red", draw="polygon", alpha=200, show.groups = c("8w"), border=FALSE)
ordiellipse(BC.nmds, groups=meta$AgeGroup, display="sites", kind="se",  label=FALSE, col="blue", draw="polygon", alpha=200, show.groups = c("1yr"), border=FALSE)

we see that 2 week and 8 week calves have similar variability in their fecal microbiotas but that both 2- and 8-week calves have more variable fecal microbiotas than 1-year heifers.

NOTE: PERMANOVA, like all statistical tests, has certain assumptions about sampling variability. While it does not assume normality, it DOES assume equal beta dispersion between groups. SO, if your betadisper test shows that your groups have significantly different dispersions, you should use ANOSIM rather than PERMANOVA to look at mean differences between groups in beta diversity space.

Dispersión Beta

A veces, de nMDS quedará claro que un grupo tiende a variar más (estar más disperso) que otro grupo. Puede probar esto estadísticamente con la homogeneidad multivariante de dispersión de grupo (varianzas).

Aquí hay un ejemplo para Bray-Curtis. Usamos la misma matriz de distancia que calculamos para PERMANOVA / ANOSIM

Calcule la dispersión (varianzas) dentro de cada grupo.

Realice una prueba tipo ANOVA para determinar si las varianzas difieren según los grupos.

Combinando esto con nuestra trama,

vemos que las terneras de 2 semanas y 8 semanas tienen una variabilidad similar en sus microbiotas fecales, pero que las terneras de 2 y 8 semanas tienen microbiotas fecales más variables que las novillas de 1 año.

** NOTA: ** PERMANOVA, como todas las pruebas estadísticas, tiene ciertas suposiciones sobre la variabilidad del muestreo. Si bien no asume la normalidad, SÍ supone una dispersión beta igual entre los grupos. ASÍ, si su prueba de Betadisper muestra que sus grupos tienen dispersiones significativamente diferentes, debe usar ANOSIM en lugar de PERMANOVA para ver las diferencias de medias entre los grupos en el espacio de diversidad beta.

OTUs that differ by

Categorical variables

Just because the overall microbiota does or does not differ between age groups, does not mean specific OTUs do or don’t differ by age. However, it is inadvisable to just test all OTUs in your data set against all variables of interest. Since you are running multiple similar tests, you need to apply a false discovery rate (fdr) correctios and correcting across all OTUs (5002 in this data set) will most likely result in no significant results after fdr correction. Also, you don’t want to look at over 5000 P-values, do you?

There are a number of way to decrease the number of OTUs you’re looking at

Don’t use OTUs. Add together genus or family groups and test if all or some of these taxa differ across variables of interest
Apply an abundance cutoff such as only looking at OTUs/taxa that are at least 1% abundance in at least one sample
Apply a frequency cutoff such as only looking at OTUs/taxa that occur in at least 50% of samples
Combine 2 and 3

However, some of these methods are somewhat arbitrary. How do you pick an abundance or frequency cutoff? What if a low abundant OTU is of interest? And what if you are interested in possible species-level differences (OTUs) so high taxonomic levels aren’t useful?

OTUs que difieren por

Variables categóricas

El hecho de que la microbiota general difiera o no entre los grupos de edad no significa que las OTU específicas difieran o no según la edad. Sin embargo, no es aconsejable probar todas las OTU en su conjunto de datos contra todas las variables de interés. Dado que está ejecutando varias pruebas similares, debe aplicar una tasa de descubrimiento falso (fdr) correctios y la corrección en todas las OTU (5002 en este conjunto de datos) probablemente no dará resultados significativos después de la corrección fdr. Además, no quiere mirar más de 5000 valores P, ¿verdad?

Hay varias maneras de disminuir el número de OTU que estás viendo

No use OTUs. Agregue el género o los grupos familiares y pruebe si todos o algunos de estos taxones difieren entre las variables de interés
Aplique un límite de abundancia, como solo mirar OTUs / taxones que son al menos 1% de abundancia en al menos una muestra
Aplique un límite de frecuencia, como solo observar las OTU / taxa que ocurren en al menos el 50% de las muestras
Combina 2 y 3

Sin embargo, algunos de estos métodos son algo arbitrarios. ¿Cómo se elige una abundancia o corte de frecuencia? ¿Qué pasa si una OTU baja abundante es de interés? ¿Y qué sucede si le interesan las posibles diferencias a nivel de especie (OTU) por lo que los niveles taxonómicos altos no son útiles?

SIMPER

So, one way to non-arbitrarily select OTUs/taxa of interest is similarity percentages (SIMPER). SIMPER identifies the OTUs that most contribute to beta-diversity measures. These OTUs are the most abundant and/or most variable OTUs in the data set. Note: SIMPER outputs all pairwise comparisons (A-B, B-C, A-C, etc.) and thus, only works for categorical variables.

SIMPER’s output is a list of OTUs which cumulatively explain 70%+ of the variation between each comparison. The numbers below the OTUs are cumulative, so to get each OTU’s contribution, you must subtract the previous OTU’s value.

For example

simper(OTU.clean, meta$AgeGroup, permutations=100)

## cumulative contributions of most influential species:
## 
## $`1yr_2w`
##  Otu00002  Otu00001  Otu00003  Otu00007  Otu00011  Otu00006  Otu00009 
## 0.0983761 0.1627191 0.2225335 0.2657879 0.2982889 0.3271508 0.3514210 
##  Otu00014  Otu00022  Otu00018  Otu00012  Otu00016  Otu00004  Otu00021 
## 0.3660756 0.3793171 0.3924608 0.4048922 0.4171422 0.4283988 0.4385280 
##  Otu00008  Otu00025  Otu00028  Otu00023  Otu00037  Otu00013  Otu00035 
## 0.4479076 0.4565849 0.4646081 0.4723795 0.4790690 0.4857141 0.4920793 
##  Otu00055  Otu00030  Otu00036  Otu00040  Otu00042  Otu00010  Otu00049 
## 0.4983615 0.5045449 0.5106265 0.5166717 0.5226378 0.5274331 0.5321886 
##  Otu00046  Otu00033  Otu00031  Otu00081  Otu00051  Otu00064  Otu00056 
## 0.5368030 0.5413764 0.5458188 0.5500936 0.5543565 0.5582465 0.5620674 
##  Otu00032  Otu00052  Otu00062  Otu00026  Otu00020  Otu00074  Otu00069 
## 0.5657989 0.5695078 0.5730822 0.5765920 0.5799406 0.5831741 0.5864067 
##  Otu00066  Otu00077  Otu00148  Otu00073  Otu00067  Otu00065  Otu00076 
## 0.5895953 0.5927428 0.5958511 0.5989588 0.6020549 0.6051241 0.6081334 
##  Otu00075  Otu00091  Otu00048  Otu00097  Otu00068  Otu00050  Otu00084 
## 0.6111073 0.6140400 0.6169121 0.6196512 0.6223697 0.6250661 0.6277023 
##  Otu00100  Otu00019  Otu00063  Otu00039  Otu00086  Otu00071  Otu00101 
## 0.6303356 0.6329664 0.6355752 0.6381709 0.6406744 0.6431362 0.6455850 
##  Otu00089  Otu00096  Otu00095  Otu00108  Otu00088  Otu00103  Otu00094 
## 0.6480310 0.6504700 0.6528884 0.6553007 0.6576757 0.6600472 0.6624184 
##  Otu00098  Otu00116  Otu00090  Otu00105  Otu00104  Otu00099  Otu00059 
## 0.6647575 0.6670589 0.6693444 0.6716046 0.6738590 0.6760506 0.6781917 
##  Otu00106  Otu00115  Otu00102  Otu00110  Otu00119  Otu00118  Otu00034 
## 0.6803196 0.6824245 0.6844633 0.6865021 0.6884972 0.6904775 0.6924261 
##  Otu00114  Otu00093  Otu00124  Otu00045 
## 0.6943714 0.6962690 0.6981558 0.7000319 
## 
## $`1yr_8w`
##   Otu00001   Otu00005   Otu00006   Otu00004   Otu00010   Otu00017 
## 0.03765603 0.07335078 0.10010930 0.12226268 0.14087762 0.15688502 
##   Otu00008   Otu00009   Otu00015   Otu00018   Otu00016   Otu00014 
## 0.17205091 0.18718833 0.20107546 0.21456235 0.22713556 0.23964967 
##   Otu00029   Otu00019   Otu00021   Otu00025   Otu00024   Otu00037 
## 0.25102468 0.26162658 0.27202671 0.28093293 0.28829315 0.29516652 
##   Otu00035   Otu00044   Otu00055   Otu00027   Otu00036   Otu00040 
## 0.30170335 0.30821052 0.31465848 0.32109529 0.32733731 0.33354206 
##   Otu00042   Otu00020   Otu00013   Otu00041   Otu00003   Otu00043 
## 0.33966556 0.34564370 0.35158279 0.35717451 0.36261926 0.36799345 
##   Otu00038   Otu00026   Otu00034   Otu00049   Otu00070   Otu00046 
## 0.37334038 0.37836130 0.38334135 0.38822230 0.39310161 0.39783775 
##   Otu00012   Otu00058   Otu00011   Otu00051   Otu00054   Otu00045 
## 0.40234701 0.40670755 0.41102172 0.41521298 0.41939306 0.42353985 
##   Otu00047   Otu00064   Otu00056   Otu00052   Otu00048   Otu00002 
## 0.42764688 0.43163954 0.43556497 0.43937178 0.44313291 0.44683135 
##   Otu00062   Otu00031   Otu00057   Otu00061   Otu00053   Otu00074 
## 0.45050368 0.45405112 0.45759807 0.46109474 0.46455875 0.46787762 
##   Otu00069   Otu00066   Otu00077   Otu00073   Otu00067   Otu00079 
## 0.47119548 0.47447192 0.47770248 0.48089214 0.48406988 0.48721802 
##   Otu00083   Otu00078   Otu00076   Otu00075   Otu00091   Otu00121 
## 0.49033806 0.49342871 0.49651735 0.49956976 0.50257978 0.50549547 
##   Otu00097   Otu00092   Otu00032   Otu00084   Otu00129   Otu00050 
## 0.50830678 0.51111612 0.51389884 0.51660098 0.51922111 0.52181856 
##   Otu00100   Otu00101   Otu00096   Otu00108   Otu00095   Otu00086 
## 0.52434751 0.52686095 0.52936793 0.53184756 0.53429667 0.53674109 
##   Otu00089   Otu00088   Otu00103   Otu00094   Otu00098   Otu00116 
## 0.53918547 0.54162316 0.54405719 0.54649097 0.54889172 0.55125394 
##   Otu00105   Otu00104   Otu00143   Otu00123   Otu00082   Otu00039 
## 0.55357747 0.55589135 0.55819397 0.56049152 0.56278380 0.56503978 
##   Otu00099   Otu00130   Otu00090   Otu00106   Otu00107   Otu00115 
## 0.56728918 0.56953083 0.57176616 0.57395024 0.57611979 0.57828018 
##   Otu00087   Otu00153   Otu00102   Otu00110   Otu00119   Otu00118 
## 0.58042631 0.58252590 0.58461849 0.58671108 0.58875879 0.59079874 
##   Otu00022   Otu00072   Otu00080   Otu00093   Otu00124   Otu00112 
## 0.59281824 0.59481609 0.59678509 0.59873275 0.60067308 0.60260107 
##   Otu00122   Otu00131   Otu00132   Otu00134   Otu00128   Otu00125 
## 0.60450552 0.60639869 0.60828362 0.61014314 0.61199594 0.61383412 
##   Otu00133   Otu00159   Otu00139   Otu00127   Otu00114   Otu00137 
## 0.61566158 0.61747930 0.61928689 0.62106367 0.62282385 0.62455846 
##   Otu00136   Otu00194   Otu00138   Otu00144   Otu00142   Otu00135 
## 0.62629042 0.62801571 0.62974033 0.63143945 0.63312281 0.63480281 
##   Otu00147   Otu00120   Otu00188   Otu00126   Otu00028   Otu00211 
## 0.63647550 0.63814069 0.63980299 0.64140642 0.64300322 0.64457174 
##   Otu00154   Otu00146   Otu00173   Otu00156   Otu00158   Otu00157 
## 0.64612078 0.64764950 0.64917769 0.65068721 0.65217234 0.65364696 
##   Otu00060   Otu00168   Otu00140   Otu00163   Otu00171   Otu00113 
## 0.65508066 0.65651008 0.65793253 0.65931862 0.66069801 0.66207484 
##   Otu00178   Otu00200   Otu00165   Otu00170   Otu00164   Otu00187 
## 0.66344999 0.66480785 0.66616041 0.66748648 0.66881018 0.67012189 
##   Otu00151   Otu00213   Otu00149   Otu00183   Otu00192   Otu00167 
## 0.67141176 0.67269928 0.67397558 0.67525135 0.67652371 0.67778788 
##   Otu00177   Otu00181   Otu00180   Otu00236   Otu00186   Otu00199 
## 0.67904574 0.68029263 0.68151160 0.68272731 0.68393783 0.68512983 
##   Otu00253   Otu00150   Otu00204   Otu00169   Otu00218   Otu00189 
## 0.68632029 0.68750539 0.68867418 0.68982822 0.69097221 0.69210846 
##   Otu00182   Otu00184   Otu00226   Otu00270   Otu00172   Otu00225 
## 0.69323878 0.69436709 0.69548866 0.69660494 0.69770318 0.69878699 
##   Otu00185   Otu00203 
## 0.69986670 0.70093653 
## 
## $`2w_8w`
##  Otu00002  Otu00001  Otu00003  Otu00007  Otu00009  Otu00005  Otu00011 
## 0.1101390 0.1804133 0.2466786 0.2952479 0.3351854 0.3745198 0.4100899 
##  Otu00004  Otu00010  Otu00017  Otu00008  Otu00012  Otu00015  Otu00022 
## 0.4397781 0.4641945 0.4818672 0.4987872 0.5154942 0.5307997 0.5454777 
##  Otu00029  Otu00013  Otu00019  Otu00020  Otu00028  Otu00006  Otu00023 
## 0.5580145 0.5704325 0.5824230 0.5910912 0.5996473 0.6081657 0.6166261 
##  Otu00024  Otu00027  Otu00031  Otu00044  Otu00030  Otu00041  Otu00043 
## 0.6247348 0.6322130 0.6396626 0.6468237 0.6539027 0.6600291 0.6659522 
##  Otu00038  Otu00032  Otu00026  Otu00070  Otu00033  Otu00034  Otu00047 
## 0.6718453 0.6776585 0.6834157 0.6887933 0.6940870 0.6992933 0.7044391

We see a number of OTUs that may differ between 1 or more age comparisons. However, these are just the OTUs that most contribute to Bray-Curtis measures between our age groups. They are not necessarily significantly different.

To test significance, we compare the relative abundance of an OTU across our age groups with Kruskal-Wallis (OTU abundance is never normally distributed, trust me). For example, OTU1 occurs in all SIMPER age comparisons and does, in fact, significantly differ by age.

kruskal.test(OTU.clean$Otu00001 ~ meta$AgeGroup)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  OTU.clean$Otu00001 by meta$AgeGroup
## Kruskal-Wallis chi-squared = 15.994, df = 2, p-value = 0.0003364

In contrast, OTU17 occurs in SIMPER but does not actually significantly differ by age group

kruskal.test(OTU.clean$Otu00017 ~ meta$AgeGroup)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  OTU.clean$Otu00017 by meta$AgeGroup
## Kruskal-Wallis chi-squared = 4.9767, df = 2, p-value = 0.08305

Note: These P-values have not been corrected from false discovery rate (fdr) yet.

Now, it would be very tedious to individually test every variable of interest in SIMPER and then test every SIMPER OTU in Kruskal-Wallis. So, Andrew Steinberger (Suen lab) has written two scripts to simplify both SIMPER and Kruskal-Wallis of SIMPER OTUs. The latest versions can be found on his GitHub page and we have provided them for this workshop in /Steinberger_scripts

Disclaimer Andrew has provided these scripts out of the goodness of his heart and provides no guarentee that they will work for your exact data set or with new versions of R/RStudio/vegan. You may contact him through GitHub with issues or errors, but it is not his job to troubleshoot for you. He may or may not address your concerns in an updated version of the scripts at a later time.

The use of these scripts are as follows (from Steinberger GitHub with some modifications)

simper_pretty.R

This script is meant to rapidly perform the SIMPER function from the R package vegan for all comparisons of interest in a data set. Inputs are OTU and metadata tables, and the output is a .csv. User can tailor contents of .csv by setting perc_cutoff, low_cutoff, and low_val. This function can also handle taxonomic levels instead of OTU, but currently only select formats are compatible. Requires installation of the R package vegan.

Usage:

simper.pretty(x, metrics, c(‘interesting’), perc_cutoff=0.5, low_cutoff = ‘y’, low_val=0.01, ‘output_name’)

Inputs:

x: OTU table
metrics: metadata table
interesting: a list of the column headers for the columns of interest in the metrics file. e.g. c(‘int1’,‘int2’,‘int3’)
perc_cutoff: % cutoff for output OTUs, as decimal (i.e. write 50% as 0.5), larger % increases number OTUs in output.
low_cutoff: ‘y’ if want to REMOVE OTUs that contribute less than 1%
low_val: set value of low cutoff (0.01), ignored if low_cutoff=‘n’.
output_name: the name that is appended to the output filename “_clean_simper.csv“.

R_krusk.R

This script takes the output .csv of simper_pretty.R, and the OTU/metadata/taxonomy tables, and performs the non-parametric Kruskal-Wallis rank-sum test on each OTU in the .csv file. Output is a .csv file containing the same contents of simper.pretty output with the following info: p-value, fdr corrected p-value, OTU taxonomic classification (if applicable), mean rel. abund and std dev of otu/tax_lvl in group 1 of comparison, and mean rel. abund and std dev of otu/tax_lvl in group 2 of comparison. Requires installation of R packages vegan and dplyr.

Usage:

kruskal.pretty(x, metrics, csv, c(‘interesting’), ‘output_name’, taxonomy)

Inputs:

x: OTU table
metrics: metadata table
csv: output from simper.pretty, must be imported as data.frame. e.g. csv= data.frame(read.csv(“PATH to name_clean_simper.csv”))
interesting: a list of the column headers for the columns of interest in the metrics file, should be same as simper.pretty inputs. e.g. c(‘int1’,‘int2’,‘int3’)
output_name= the name that is appended to the output filename “_krusk_simper.csv“.
taxonomy: The .taxonomy file output from classify.otu command in mothur. This is the UNALTERED tax file, not tax.clean (optional)

First, we load these functions into R.

source("Steinberger_scripts/simper_pretty.r")
source("Steinberger_scripts/R_krusk.r")

Then, we apply them to our data. We will ask for all SIMPER OTUs (perc_cutoff = 1, meaning up to cumulative 100%) but cutoff any OTUs that individually contribute less than 1% to SIMPER (low_val=0.01). You may want to consider different cutoffs for your data.

simper.pretty(OTU.clean, meta, c('AgeGroup'), perc_cutoff=1, low_cutoff = 'y', low_val=0.01, 'Age')

simper.results = data.frame(read.csv("Age_clean_simper.csv"))
kruskal.pretty(OTU.clean, meta, simper.results, c('AgeGroup'), 'Age', tax)

If we import the Kruskal-Wallis back into R and select only OTUs there were significantly different after fdr correction (fdr_krusk_p.val)…

#Import
KW.results = data.frame(read.csv("Age_krusk_simper.csv"))
#Remove non-significant
KW.results.signif = KW.results[KW.results$fdr_krusk_p.val < 0.05,]
#Order by OTU#
KW.results.signif = KW.results.signif[with(KW.results.signif, order(OTU)),]
head(KW.results.signif)

##     X Comparison     SIMPER      OTU  krusk_p.val fdr_krusk_p.val
## 2   2     1yr_2w 0.06434298 Otu00001 0.0004510953     0.001383359
## 15 15     1yr_8w 0.03765603 Otu00001 0.0004510953     0.001383359
## 1   1     1yr_2w 0.09837610 Otu00002 0.0004510953     0.001383359
## 30 30      2w_8w 0.11013903 Otu00002 0.0208625823     0.029989962
## 3   3     1yr_2w 0.05981442 Otu00003 0.0003310658     0.001383359
## 32 32      2w_8w 0.06626526 Otu00003 0.0356919001     0.044373714
##                                                                                                                   Taxonomy
## 2          k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__prausnitzii;
## 15         k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__prausnitzii;
## 1          k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__prausnitzii;
## 30         k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__prausnitzii;
## 3  k__Bacteria;p__Actinobacteria;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__aerofaciens;
## 32 k__Bacteria;p__Actinobacteria;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella;s__aerofaciens;
##    Left.mean.abund   Left.stdev Right.mean.abund Right.stdev
## 2     7.109140e-06 2.010768e-05      0.128370197  0.16351829
## 15    7.109140e-06 2.010768e-05      0.073292635  0.09803742
## 1     7.118451e-06 2.013402e-05      0.196185324  0.23796423
## 30    1.961853e-01 2.379642e-01      0.007205221  0.01601067
## 3     0.000000e+00 0.000000e+00      0.119333403  0.18000346
## 32    1.193334e-01 1.800035e-01      0.010598818  0.02126522

we see a number of OTU that significantly differ by age group.

Looking at OTU1 as relative abundance

#Calculate abundance
abund = OTU.clean/rowSums(OTU.clean)*100
#plot
boxplot(abund$Otu00001 ~ meta$AgeGroup.ord, ylab="% Relative abundance", main="OTU1")

and using the P-values in KW.results.signif, we can say that OTU1 is significantly less abundant in 1yr animals compared to either 2w or 8w calves.

SIMPER

Entonces, una forma de seleccionar OTU / taxa de interés no arbitrariamente es el porcentaje de similitud (SIMPER). SIMPER identifica las OTU que más contribuyen a las medidas de diversidad beta. Estas OTU son las OTU más abundantes y / o más variables en el conjunto de datos. ** Nota **: SIMPER genera todas las comparaciones por parejas (A-B, B-C, A-C, etc.) y, por lo tanto, solo funciona para variables categóricas.

La salida de SIMPER es una lista de OTU que explica acumulativamente el 70% de la variación entre cada comparación. Los números debajo de las OTU son ** acumulativos **, por lo que para obtener la contribución de cada OTU, debe restar el valor de la OTU anterior.

Por ejemplo

Vemos una cantidad de OTU que pueden diferir entre 1 o más comparaciones por edad. Sin embargo, estas son solo las OTU que más contribuyen a las medidas de Bray-Curtis entre nuestros grupos de edad. * No son necesariamente significativamente diferentes. *

Para probar la significación, comparamos la abundancia relativa de una OTU a través de nuestros grupos de edad con Kruskal-Wallis (la abundancia de OTU nunca se distribuye normalmente, créanme). Por ejemplo, OTU1 ocurre en todas las comparaciones de edades SIMPER y, de hecho, difiere significativamente según la edad.

Por el contrario, OTU17 se produce en SIMPER pero en realidad no difiere significativamente por grupo de edad

** Nota **: estos valores P aún no se han corregido de la tasa de descubrimiento falso (fdr).

Ahora, sería muy tedioso probar individualmente cada variable de interés en SIMPER y luego probar cada OTU SIMPER en Kruskal-Wallis. Entonces, Andrew Steinberger (laboratorio de Suen) ha escrito dos guiones para simplificar tanto SIMPER como Kruskal-Wallis de SIMPER OTU. Las últimas versiones se pueden encontrar en su [página de GitHub] (https://github.com/asteinberger9/seq_scripts) y las hemos proporcionado para este taller en /Steinberger_scripts

** Descargo de responsabilidad ** * Andrew ha proporcionado estos guiones por la bondad de su corazón y no garantiza que funcionen para su conjunto de datos exacto o con las nuevas versiones de R / RStudio / vegan. Puede contactarlo a través de GitHub con problemas o errores, pero no es su trabajo resolverlo por usted. Puede o no abordar sus inquietudes en una versión actualizada de los guiones en un momento posterior. *

El uso de estos scripts es el siguiente (de Steinberger GitHub con algunas modificaciones)

** simper_pretty.R **

Este script está destinado a realizar rápidamente la función SIMPER del paquete Rvegan para todas las comparaciones de interés en un conjunto de datos. Las entradas son OTU y tablas de metadatos, y el resultado es un .csv. El usuario puede adaptar los contenidos de .csv configurando perc_cutoff, low_cutoff y low_val. Esta función también puede manejar niveles taxonómicos en lugar de OTU, pero actualmente solo los formatos seleccionados son compatibles. Requiere la instalación del paquete R vegan.

Uso:

simper.pretty (x, metrics, c (‘interesting’), perc_cutoff = 0.5, low_cutoff = ‘y’, low_val = 0.01, ‘output_name’)

Insumos:

x: tabla OTU
metrics: tabla de metadatos
interesting: una lista de los encabezados de columna para las columnas de interés en el archivo de medidas. p.ej. c (‘int1’, ‘int2’, ‘int3’)
perc_cutoff:% de corte para las OTU de salida, como decimal (es decir, escribir 50% como 0.5), el mayor% aumenta el número de OTU en la salida.
low_cutoff: ‘y’ si quiere eliminar las OTU que contribuyen con menos del 1%
low_val: establece el valor de corte bajo (0.01), ignorado si low_cutoff = ‘n’.
output_name: el nombre que se agrega al nombre de archivo de salida “_clean_simper.csv“.

** R_krusk.R **

Este script toma el resultado .csv de simper_pretty.R, y las tablas OTU / metadata / taxonomy, y realiza la prueba de suma de rango Kruskal-Wallis no paramétrica en cada OTU en el archivo .csv. La salida es un archivo .csv que contiene los mismos contenidos de la salida simper.pretty con la siguiente información: valor p, valor p corregido fdr, clasificación taxonómica OTU (si corresponde), significa rel. abund y std dev de otu / tax_lvl en el grupo 1 de comparación, y media rel. abund y std dev de otu / tax_lvl en el grupo 2 de comparación. Requiere la instalación de paquetes R vegan ydplyr.

Uso:

kruskal.pretty (x, metrics, csv, c (‘interesting’), ‘output_name’, taxonomy)

Insumos:

x: tabla OTU
metrics: tabla de metadatos
csv: salida de simper.pretty, se debe importar como data.frame. p.ej. csv = data.frame (read.csv (“RUTA para nombrar_clean_simper.csv”))
interesting: una lista de los encabezados de columna para las columnas de interés en el archivo de métricas debe ser igual a las entradas simper.pretty. p.ej. c (‘int1’, ‘int2’, ‘int3’)
output_name = el nombre que se agrega al nombre de archivo de salida “_krusk_simper.csv“.
taxonomy: la salida del archivo .taxonomy del comando classify.otu en mothur. Este es el archivo de impuestos UNALTERED, no tax.clean (opcional)

Primero, cargamos estas funciones en R.

Luego, los aplicamos a nuestros datos. Pediremos todas las OTU SIMPER (perc_cutoff = 1, que significa hasta el 100% acumulativo) pero cortará cualquier OTU que contribuya individualmente con menos de 1% a SIMPER (low_val = 0.01). Es posible que desee considerar diferentes puntos de corte para sus datos.

Si importamos Kruskal-Wallis nuevamente a R y seleccionamos solo OTUs, hubo una diferencia significativamente diferente después de la corrección de fdr (fdr_krusk_p.val) …

vemos una cantidad de OTU que difieren significativamente por grupo de edad.

Mirando a OTU1 como abundancia relativa

y usando los valores P en KW.results.signif, podemos decir que OTU1 es significativamente menos abundante en animales de 1yr en comparación con terneros de 2w u 8w.

`DESeq2`

Another way to find specific OTUs of interest which differ between categorical explanatory variables is by using the DESeq function in DESeq2. DESeq2 was built for RNASeq data. However, it turns out that 16S rRNA community sequencing data is VERY similar in structure to RNASeq data. There are a lot of zeroes, severely skewed distributions, and overall the OTUs end up looking a lot like genes. So, we can use this tool to investigate “differential expression” of specific OTUs between samples!

DESeq2 partners up with phyloseq to make this happen. The first thing we need to do is to convert our phyloseq object to a DESeq2 object. For these functions, we do not need a phylogenetic tree. So we will use our phyloseq object “physeq.” We will also let it know what we want to use as our categorical explanatory variable, in this case “AgeGroup.”

pds <- phyloseq_to_deseq2(physeq, ~AgeGroup)

## converting counts to integer mode

Now try to run the DESeq function.

pds = DESeq(pds)

Uh-oh! There is some eccentricities with the count data in some cases. We need to make a small adjustment in order to move forward. I found this solution online.

gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(pds), 1, gm_mean)

pds = estimateSizeFactors(pds, geoMeans = geoMeans)

Now try the function again.

pds = DESeq(pds)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## -- replacing outliers and refitting for 177 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)

## estimating dispersions

## fitting model and testing

Lets pull out the results in a readable form.

res <- results(pds, cooksCutoff = FALSE)
alpha=0.05
sigtab <- res[which(res$padj < alpha), ]
sigtab <- cbind(as(sigtab,"data.frame"), as(tax_table(physeq)[rownames(sigtab), ], "matrix"))
head(sigtab,20)

##            baseMean log2FoldChange     lfcSE       stat       pvalue
## Otu00001 1351.52291      12.110218 1.1673301  10.374288 3.246241e-25
## Otu00002 1781.31269       8.693493 1.4977166   5.804498 6.455906e-09
## Otu00003 1030.32823       9.754449 1.5630938   6.240476 4.362421e-10
## Otu00004  301.35487      24.840982 1.6830476  14.759525 2.671572e-49
## Otu00005  322.06325      12.548884 1.9003212   6.603559 4.014013e-11
## Otu00006  321.59330      -3.406751 1.4890166  -2.287920 2.214219e-02
## Otu00007  733.60949       7.332177 1.7166759   4.271148 1.944692e-05
## Otu00008  260.32785      11.292505 1.1940792   9.457082 3.166565e-21
## Otu00010   81.97638       4.585691 1.3196858   3.474835 5.111675e-04
## Otu00011  626.66561       9.593566 1.5172907   6.322827 2.568213e-10
## Otu00012   57.06341       9.733015 1.8270167   5.327272 9.969888e-08
## Otu00013  131.26953      10.675331 2.3500653   4.542568 5.557316e-06
## Otu00014  164.65307      -3.302411 1.0495452  -3.146516 1.652283e-03
## Otu00015   28.72467      23.658752 3.6070268   6.559073 5.414335e-11
## Otu00016  125.97247     -11.301999 0.9824394 -11.504016 1.259168e-30
## Otu00017   14.81136      22.145142 3.1820906   6.959306 3.419530e-12
## Otu00018  134.94822     -11.039986 0.9015639 -12.245373 1.778583e-34
## Otu00019  168.03339      11.652341 1.9069478   6.110466 9.934038e-10
## Otu00020  122.82284      10.758421 1.7329540   6.208140 5.361553e-10
## Otu00021  101.76592     -10.993366 0.9367950 -11.735082 8.424472e-32
##                  padj      Domain            Phylum             Class
## Otu00001 2.216297e-23 k__Bacteria     p__Firmicutes     c__Clostridia
## Otu00002 4.747030e-08 k__Bacteria     p__Firmicutes     c__Clostridia
## Otu00003 3.923566e-09 k__Bacteria p__Actinobacteria c__Coriobacteriia
## Otu00004 4.012701e-46 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00005 4.157964e-10 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00006 4.031220e-02 k__Bacteria     p__Firmicutes     c__Clostridia
## Otu00007 7.567168e-05 k__Bacteria p__Actinobacteria c__Actinobacteria
## Otu00008 1.160044e-19 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00010 1.459646e-03 k__Bacteria     p__Firmicutes        c__Bacilli
## Otu00011 2.410910e-09 k__Bacteria     p__Firmicutes     c__Clostridia
## Otu00012 6.265595e-07 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00013 2.447827e-05 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00014 4.143121e-03 k__Bacteria     p__Firmicutes     c__Clostridia
## Otu00015 5.532198e-10 k__Bacteria   p__Spirochaetes   c__Spirochaetes
## Otu00016 1.747688e-28 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00017 4.280112e-11 k__Bacteria p__Actinobacteria c__Coriobacteriia
## Otu00018 4.452385e-32 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00019 8.243605e-09 k__Bacteria     p__Firmicutes     c__Clostridia
## Otu00020 4.737090e-09 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
## Otu00021 1.807651e-29 k__Bacteria  p__Bacteroidetes    c__Bacteroidia
##                         Order                        Family
## Otu00001     o__Clostridiales            f__Ruminococcaceae
## Otu00002     o__Clostridiales            f__Ruminococcaceae
## Otu00003  o__Coriobacteriales          f__Coriobacteriaceae
## Otu00004     o__Bacteroidales             f__Bacteroidaceae
## Otu00005     o__Bacteroidales                      f__S24-7
## Otu00006     o__Clostridiales            f__Ruminococcaceae
## Otu00007 o__Bifidobacteriales         f__Bifidobacteriaceae
## Otu00008     o__Bacteroidales             f__Bacteroidaceae
## Otu00010   o__Lactobacillales           f__Lactobacillaceae
## Otu00011     o__Clostridiales            f__Ruminococcaceae
## Otu00012     o__Bacteroidales             f__Bacteroidaceae
## Otu00013     o__Bacteroidales             f__Bacteroidaceae
## Otu00014     o__Clostridiales            f__Ruminococcaceae
## Otu00015    o__Spirochaetales            f__Spirochaetaceae
## Otu00016     o__Bacteroidales o__Bacteroidales_unclassified
## Otu00017  o__Coriobacteriales          f__Coriobacteriaceae
## Otu00018     o__Bacteroidales             f__Bacteroidaceae
## Otu00019     o__Clostridiales            f__Veillonellaceae
## Otu00020     o__Bacteroidales             f__Prevotellaceae
## Otu00021     o__Bacteroidales       f__[Paraprevotellaceae]
##                                    Genus
## Otu00001             g__Faecalibacterium
## Otu00002             g__Faecalibacterium
## Otu00003                  g__Collinsella
## Otu00004                  g__Bacteroides
## Otu00005           f__S24-7_unclassified
## Otu00006 f__Ruminococcaceae_unclassified
## Otu00007              g__Bifidobacterium
## Otu00008                  g__Bacteroides
## Otu00010                g__Lactobacillus
## Otu00011               g__Butyricicoccus
## Otu00012                  g__Bacteroides
## Otu00013                  g__Bacteroides
## Otu00014 f__Ruminococcaceae_unclassified
## Otu00015                    g__Treponema
## Otu00016   o__Bacteroidales_unclassified
## Otu00017                    g__Olsenella
## Otu00018                       g__5-7N15
## Otu00019        g__Phascolarctobacterium
## Otu00020                   g__Prevotella
## Otu00021                        g__CF231
##                                        Species
## Otu00001                        s__prausnitzii
## Otu00002                        s__prausnitzii
## Otu00003                        s__aerofaciens
## Otu00004                          s__uniformis
## Otu00005                 f__S24-7_unclassified
## Otu00006       f__Ruminococcaceae_unclassified
## Otu00007       g__Bifidobacterium_unclassified
## Otu00008           g__Bacteroides_unclassified
## Otu00010                            s__reuteri
## Otu00011                      s__pullicaecorum
## Otu00012           g__Bacteroides_unclassified
## Otu00013           g__Bacteroides_unclassified
## Otu00014       f__Ruminococcaceae_unclassified
## Otu00015             g__Treponema_unclassified
## Otu00016         o__Bacteroidales_unclassified
## Otu00017             g__Olsenella_unclassified
## Otu00018                g__5-7N15_unclassified
## Otu00019 g__Phascolarctobacterium_unclassified
## Otu00020                              s__copri
## Otu00021                 g__CF231_unclassified

If you take a look at this data frame, you will see that it contains all OTUs significant under this model at p < 0.05. You should look at the “padj” column, which has the Benjamini-Hochberg corrected p-values. You can then use these as a starting point for anova testing, etc. to determine differences between groups.

DESeq2

Otra forma de encontrar OTU específicas de interés que difieren entre variables explicativas categóricas es mediante el uso de la función DESeq enDESeq2. DESeq2 fue creado para datos RNASeq. Sin embargo, resulta que los datos de secuenciación de comunidad 16S rRNA son MUY similares en estructura a los datos RNASeq. Hay muchos ceros, distribuciones muy sesgadas y, en general, las OTU terminan pareciéndose mucho a los genes. Entonces, podemos usar esta herramienta para investigar la “expresión diferencial” de OTU específicas entre muestras.

DESeq2 se asocia conphyloseq para que esto suceda. Lo primero que debemos hacer es convertir nuestro objeto phyloseq en un objetoDESeq2. Para estas funciones, no necesitamos un árbol filogenético. Entonces usaremos nuestro objeto phyloseq" physeq “. También le haremos saber lo que queremos usar como nuestra variable explicativa categórica, en este caso”AgeGroup“.

Ahora intenta ejecutar la función DESeq.

¡Oh oh! Hay algunas excentricidades con los datos de recuento en algunos casos. Necesitamos hacer un pequeño ajuste para seguir adelante. Encontré esta solución en línea.

Ahora intenta la función nuevamente.

Vamos a sacar los resultados en una forma legible. Si echas un vistazo a este marco de datos, verás que contiene todas las OTU significativas bajo este modelo en p <0.05. Debería mirar la columna “padj”, que tiene los valores de p corregidos de Benjamini-Hochberg. A continuación, puede utilizarlos como punto de partida para la prueba anova, etc., para determinar las diferencias entre grupos.

Continuous variables

For continuous variables, historically there has been no simple test like SIMPER to pull out OTUs likely to differ across your variable. You could run linear models glm of the OTU abundances with different distributions family= similar to what we did with Chao richness. However, OTU abundance data is not normal nor does it fit well with other standard distributions due to its many zeros. So, you will need to test a number of distributions and transformations of the data to find a suitable model.

Variables continuas

Para variables continuas, históricamente no ha habido una prueba simple como SIMPER para extraer las OTU que probablemente difieran en su variable. Podría ejecutar modelos lineales glm de las abundancias OTU con diferentes distribucionesfamily =similar a lo que hicimos con la riqueza de Chao. Sin embargo, los datos de abundancia OTU no son normales ni se ajustan bien a otras distribuciones estándar debido a sus muchos ceros. Por lo tanto, deberá probar una serie de distribuciones y transformaciones de los datos para encontrar un modelo adecuado.

`miLineage`

But to generate higher-level candidate taxa, we can use a brand new R package which was developed recently here at UW-Madison! It is called miLineage, and was written by Dr. Zheng-Zheng Tang in the Biostatistics and Medical Informatics department. I went to her talk recently, and decided to include some of the things I learned in this tutorial. Keep in mind that this information is brand new to me as well, so there may be caveats and pieces of information that I am not understanding clearly or not expressing well. As with all of these methods, make sure that you read the documentation and understand what is needed for your own data before you begin. Here is a link to the documentation for miLineage: https://cran.r-project.org/web/packages/miLineage/miLineage.pdf. Here is a link to the publication describing the method: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408811/.

So, what is this method trying to accomplish? Often with micribiota data, we have vectors of continuous variables (like “ADGKG” in this example data set) which we expect to covary with specific OTUs. The problem is that most statistical methods which look for covariance between matrices make assumptions which do not apply to microbiota data. Most problematic are the non-normal distributions of abundance of taxa at high resolution (genus, species, and OTU level) and the large number of zeroes which exist in these data sets. In order to identify taxa which covary with a continuous variable of interest, we have to use a method which does not make assumptions about distribution and which can deal with a large number of zeroes. Enter the QCAT test from miLineage!

The package contains two versions of the test. The first is the standard QCAT. This is appropriate for a data matrix which is less than 30% zeroes. This will almost never be the case for OTU-resolution microbiota data. So I am going to focus on the second, the QCAT_GEE.

Inputs:

OTU: OTU table FORMATTED AS MATRIX
X: continuous variables of interest AND confounding variables. These are the variables we expect to covary with our OTU abundances FORMATTED AS MATRIX
Z: continuous variables of interest AND confounding variables which we expect to be related to OTU presence/absence FORMATTED AS MATRIX
X.index: The column index for the continuous variable(s) of interest to abundance
Z.index: The column index for the continuous variable(s) of interest to presence/absence
Tax: Taxonomy table, like the one we used for our phyloseq object, except with the column headers as “Rank1”, “Rank2”…etc. FORMATTED AS MATRIX
fdr.alpha: the significance level we are interested in (usually 0.05)
min.depth: cutoff for evening sequencing depth. Use 0 for a normalized/subsampled OTU table.
n.resample: number of times to resample. Allows for calculation of accurate p-values

In order to use this package, we are going to have to change the format of several of our input files to meet the requirements of the package.

First, we need to change the column headers of our taxonomy table. We have domain through species level assignments, so we have 7 “Ranks”. Let’s rename the column headers now.

tax.clean.miL <- tax.clean
colnames(tax.clean.miL)<-c("Rank1","Rank2","Rank3","Rank4","Rank5","Rank6","Rank7")

Next, we have to create the data frame of covariates of interest. We are going to try to test the covariation of age at time of sampling with our OTU table.

covar.miL <- cbind.data.frame(meta$AgeExact)
rownames(covar.miL) <- rownames(meta)
colnames(covar.miL) <- c("Age")

Finally, we need to reformat all our inputs as matrices.

OTU.miL <- as.matrix(OTU.clean)
tax.clean.miL <- as.matrix(tax.clean.miL)
covar.miL <- as.matrix(covar.miL)

Now we can run the test.

QCAT_GEE(OTU=OTU.miL, Tax=tax.clean.miL, X=covar.miL, X.index = 1, Z=covar.miL, Z.index=1,  fdr.alpha=0.05, n.resample = 100)

## $lineage.pval
## $lineage.pval$`Two-Part`
##            g__[Clostridium] g__[Eubacterium] g__[Ruminococcus]
## Asymptotic      0.004569263       0.01954218       0.001720479
## Resampling      0.009900990       0.00990099       0.049504950
##            g__Acinetobacter g__Akkermansia g__Alistipes g__Bacillus
## Asymptotic       0.35714159      0.0000000   0.10785530    1.000000
## Resampling       0.02970297      0.1782178   0.02970297    0.950495
##            g__Bacteroides  g__Blautia g__Brevibacterium g__Bulleidia
## Asymptotic     0.14864905 0.002170536         0.5521478    0.1271411
## Resampling     0.00990099 0.009900990         0.1584158    0.1584158
##            g__Clostridium g__Corynebacterium g__Desulfovibrio  g__Dorea
## Asymptotic     0.67199500          0.1109366       0.00000000 0.0000000
## Resampling     0.02970297          0.1584158       0.00990099 0.2772277
##            g__Enterococcus g__Lachnospira g__Lactobacillus g__Ochrobactrum
## Asymptotic      0.06842293      0.5805491       0.04895534       0.5988383
## Resampling      0.04950495      0.5148515       0.00990099       0.4950495
##            g__Olsenella g__Parabacteroides g__Prevotella
## Asymptotic   0.05349178         0.01455372    0.04299606
## Resampling   0.01980198         0.00990099    0.02970297
##            g__Pseudobutyrivibrio g__Psychrobacter g__Ruminococcus
## Asymptotic             0.4719227        1.0000000      0.04938857
## Resampling             0.4653465        0.5940594      0.00990099
##            g__Sharpea g__Streptococcus g__Streptomyces
## Asymptotic  0.0000000       0.04354151      0.51533959
## Resampling  0.4059406       0.04950495      0.03960396
##            f__[Mogibacteriaceae] f__[Odoribacteraceae]
## Asymptotic            0.00449175           0.004612202
## Resampling            0.00990099           0.009900990
##            f__[Paraprevotellaceae] f__[Tissierellaceae] f__[Weeksellaceae]
## Asymptotic             0.004625551            0.5944970         0.35714159
## Resampling             0.019801980            0.7722772         0.02970297
##            f__Acetobacteraceae f__Actinomycetaceae f__Aerococcaceae
## Asymptotic         0.001035808          0.08486326       0.86329539
## Resampling         0.009900990          0.01980198       0.01980198
##            f__Alcaligenaceae f__Anaeroplasmataceae f__Bacillaceae
## Asymptotic        0.14148585             1.0000000     0.15144531
## Resampling        0.00990099             0.8910891     0.01980198
##            f__Bacteroidaceae f__Bifidobacteriaceae f__Christensenellaceae
## Asymptotic       0.009873207            0.00151534            0.005743916
## Resampling       0.009900990            0.36633663            0.059405941
##            f__Clostridiaceae f__Comamonadaceae f__Coriobacteriaceae
## Asymptotic       0.001194624         0.3972573           0.02460370
## Resampling       0.009900990         0.9108911           0.00990099
##            f__Desulfovibrionaceae f__Elusimicrobiaceae
## Asymptotic             0.03304490           0.00000000
## Resampling             0.03960396           0.03960396
##            f__Enterobacteriaceae f__Erysipelotrichaceae f__Eubacteriaceae
## Asymptotic             0.1966620             0.23121681         0.0000000
## Resampling             0.2178218             0.00990099         0.1485149
##            f__Hyphomicrobiaceae f__Lachnospiraceae f__Lactobacillaceae
## Asymptotic            1.0000000         0.11485285          0.00000000
## Resampling            0.7227723         0.00990099          0.03960396
##            f__Microbacteriaceae f__Micrococcaceae f__Moraxellaceae
## Asymptotic            0.3829978         0.5521478       0.06889424
## Resampling            0.1188119         0.2772277       0.00990099
##            f__Nocardioidaceae f__Paenibacillaceae f__Peptostreptococcaceae
## Asymptotic          0.8539274           0.8642819              0.008423761
## Resampling          0.5643564           0.8514851              0.009900990
##            f__Phyllobacteriaceae f__Planococcaceae f__Porphyromonadaceae
## Asymptotic            0.94221664        0.21210283           0.004208917
## Resampling            0.01980198        0.02970297           0.009900990
##            f__Rikenellaceae f__Ruminococcaceae f__Spirochaetaceae
## Asymptotic      0.001528526         0.01919360         0.02206889
## Resampling      0.009900990         0.00990099         0.02970297
##            f__Staphylococcaceae f__Streptococcaceae
## Asymptotic            0.0000000          0.08696424
## Resampling            0.2178218          0.10891089
##            f__Streptosporangiaceae f__Succinivibrionaceae
## Asymptotic              0.51590020             0.03328806
## Resampling              0.08910891             0.01980198
##            f__Veillonellaceae f__Verrucomicrobiaceae f__Victivallaceae
## Asymptotic         0.13823016            0.004675057        0.00000000
## Resampling         0.00990099            0.009900990        0.02970297
##            o__Actinomycetales o__Bacillales o__Bacteroidales
## Asymptotic          0.9504619    0.19306680       0.04755229
## Resampling          0.1683168    0.01980198       0.00990099
##            o__Burkholderiales o__Campylobacterales o__Clostridiales
## Asymptotic        0.005942546            0.0000000       0.05937160
## Resampling        0.009900990            0.1386139       0.00990099
##            o__Desulfovibrionales o__Flavobacteriales o__Lactobacillales
## Asymptotic            0.05489269           0.3571125         0.26995391
## Resampling            0.05940594           0.1683168         0.03960396
##            o__Rhizobiales c__Actinobacteria c__Alphaproteobacteria
## Asymptotic      0.8346765       0.005451505             0.02125006
## Resampling      0.2277228       0.009900990             0.00990099
##              c__Bacilli c__Betaproteobacteria c__Clostridia
## Asymptotic 0.0005424236           0.009961925  7.122467e-05
## Resampling 0.0099009901           0.009900990  9.900990e-03
##            c__Deltaproteobacteria c__Gammaproteobacteria c__Mollicutes
## Asymptotic            0.003890433             0.20811412    0.02199880
## Resampling            0.049504950             0.02970297    0.00990099
##            c__Planctomycetia c__Spirochaetes p__Actinobacteria
## Asymptotic       0.006132873      0.02797527         0.3307077
## Resampling       0.039603960      0.00990099         0.2772277
##            p__Bacteroidetes p__Chloroflexi p__Cyanobacteria p__Firmicutes
## Asymptotic      0.002083447     0.02231054        1.0000000   0.001715826
## Resampling      0.009900990     0.00990099        0.5049505   0.009900990
##            p__Proteobacteria p__Tenericutes p__Verrucomicrobia k__Bacteria
## Asymptotic       0.002667928    0.002925308         0.03498893  0.67840177
## Resampling       0.009900990    0.009900990         0.00990099  0.00990099
## 
## $lineage.pval$`Zero-Part`
##            g__[Clostridium] g__[Eubacterium] g__[Ruminococcus]
## Asymptotic      0.001165278      0.007167919       0.001415603
## Resampling      0.009900990      0.009900990       0.009900990
##            g__Acinetobacter g__Akkermansia g__Alistipes g__Bacillus
## Asymptotic       0.19863745   0.0003462648   0.07248904  0.05022569
## Resampling       0.02970297   0.0099009901   0.02970297  0.00990099
##            g__Bacteroides   g__Blautia g__Brevibacterium g__Bulleidia
## Asymptotic     0.05274186 0.0003389958         0.3501410   0.08358944
## Resampling     0.00990099 0.0099009901         0.1584158   0.12871287
##            g__Clostridium g__Corynebacterium g__Desulfovibrio  g__Dorea
## Asymptotic     0.55270505          0.1954330       0.06925496 0.3278224
## Resampling     0.01980198          0.2178218       0.06930693 0.7821782
##            g__Enterococcus g__Lachnospira g__Lactobacillus g__Ochrobactrum
## Asymptotic      0.03628707      0.3751310       0.02009388       0.5988383
## Resampling      0.02970297      0.5148515       0.00990099       0.4950495
##            g__Olsenella g__Parabacteroides g__Prevotella
## Asymptotic   0.03957429         0.01108327    0.01848719
## Resampling   0.01980198         0.00990099    0.01980198
##            g__Pseudobutyrivibrio g__Psychrobacter g__Ruminococcus
## Asymptotic             0.2838397       0.06909904      0.10361675
## Resampling             0.4554455       0.02970297      0.01980198
##            g__Sharpea g__Streptococcus g__Streptomyces
## Asymptotic 0.01613610       0.01723143      0.31896989
## Resampling 0.00990099       0.00990099      0.03960396
##            f__[Mogibacteriaceae] f__[Odoribacteraceae]
## Asymptotic            0.00116733            0.00201078
## Resampling            0.00990099            0.00990099
##            f__[Paraprevotellaceae] f__[Tissierellaceae] f__[Weeksellaceae]
## Asymptotic              0.01862173            0.3877217         0.19863745
## Resampling              0.01980198            0.7722772         0.02970297
##            f__Acetobacteraceae f__Actinomycetaceae f__Aerococcaceae
## Asymptotic         0.000346429          0.06832263       0.52067256
## Resampling         0.009900990          0.01980198       0.01980198
##            f__Alcaligenaceae f__Anaeroplasmataceae f__Bacillaceae
## Asymptotic        0.03143024          0.0003464849     0.09519867
## Resampling        0.00990099          0.0099009901     0.01980198
##            f__Bacteroidaceae f__Bifidobacteriaceae f__Christensenellaceae
## Asymptotic       0.003233547          0.0004561791            0.003251744
## Resampling       0.009900990          0.0099009901            0.009900990
##            f__Clostridiaceae f__Comamonadaceae f__Coriobacteriaceae
## Asymptotic       0.003062604         0.3972573           0.02842446
## Resampling       0.009900990         0.9108911           0.00990099
##            f__Desulfovibrionaceae f__Elusimicrobiaceae
## Asymptotic             0.03452509          0.006692442
## Resampling             0.06930693          0.009900990
##            f__Enterobacteriaceae f__Erysipelotrichaceae f__Eubacteriaceae
## Asymptotic             0.1368552             0.24456005        0.06310121
## Resampling             0.1386139             0.01980198        0.02970297
##            f__Hyphomicrobiaceae f__Lachnospiraceae f__Lactobacillaceae
## Asymptotic           0.31858682         0.14162364          0.02246477
## Resampling           0.00990099         0.00990099          0.00990099
##            f__Microbacteriaceae f__Micrococcaceae f__Moraxellaceae
## Asymptotic            0.2169164         0.3501410       0.05308433
## Resampling            0.1188119         0.1485149       0.00990099
##            f__Nocardioidaceae f__Paenibacillaceae f__Peptostreptococcaceae
## Asymptotic          0.5796432           0.5957630               0.00398969
## Resampling          0.6534653           0.8514851               0.00990099
##            f__Phyllobacteriaceae f__Planococcaceae f__Porphyromonadaceae
## Asymptotic            0.68300562        0.09296751           0.001164506
## Resampling            0.01980198        0.04950495           0.009900990
##            f__Rikenellaceae f__Ruminococcaceae f__Spirochaetaceae
## Asymptotic      0.009526076         0.06577425         0.01336323
## Resampling      0.009900990         0.00990099         0.02970297
##            f__Staphylococcaceae f__Streptococcaceae
## Asymptotic           0.35801408          0.04868761
## Resampling           0.01980198          0.04950495
##            f__Streptosporangiaceae f__Succinivibrionaceae
## Asymptotic              0.31943482             0.02454269
## Resampling              0.08910891             0.02970297
##            f__Veillonellaceae f__Verrucomicrobiaceae f__Victivallaceae
## Asymptotic         0.24699410            0.001167139       0.002977769
## Resampling         0.00990099            0.009900990       0.009900990
##            o__Actinomycetales o__Bacillales o__Bacteroidales
## Asymptotic          0.7735962    0.08275569       0.11427760
## Resampling          0.1089109    0.00990099       0.00990099
##            o__Burkholderiales o__Campylobacterales o__Clostridiales
## Asymptotic        0.002655129          0.008340176       0.10757615
## Resampling        0.009900990          0.009900990       0.00990099
##            o__Desulfovibrionales o__Flavobacteriales o__Lactobacillales
## Asymptotic            0.04001428          0.19861716         0.11650896
## Resampling            0.04950495          0.01980198         0.01980198
##            o__Rhizobiales c__Actinobacteria c__Alphaproteobacteria
## Asymptotic      0.5972896       0.003997343            0.006852977
## Resampling      0.2079208       0.009900990            0.009900990
##             c__Bacilli c__Betaproteobacteria c__Clostridia
## Asymptotic 0.001168276           0.007071254   0.002221954
## Resampling 0.009900990           0.009900990   0.009900990
##            c__Deltaproteobacteria c__Gammaproteobacteria c__Mollicutes
## Asymptotic            0.001165913             0.12762062   0.006976663
## Resampling            0.009900990             0.04950495   0.009900990
##            c__Planctomycetia c__Spirochaetes p__Actinobacteria
## Asymptotic       0.002344937      0.02019466         0.3382921
## Resampling       0.009900990      0.00990099         0.5643564
##            p__Bacteroidetes p__Chloroflexi p__Cyanobacteria p__Firmicutes
## Asymptotic       0.13070902    0.008237659       0.01635592    0.02068341
## Resampling       0.04950495    0.009900990       0.00990099    0.01980198
##            p__Proteobacteria p__Tenericutes p__Verrucomicrobia k__Bacteria
## Asymptotic        0.01397988    0.001160613        0.007775314  0.54614266
## Resampling        0.00990099    0.009900990        0.009900990  0.00990099
## 
## $lineage.pval$`Positive-Part`
##            g__[Clostridium] g__[Eubacterium] g__[Ruminococcus]
## Asymptotic        0.6001005        0.4496360         0.1577982
## Resampling        0.5643564        0.8118812         0.8613861
##            g__Acinetobacter g__Akkermansia g__Alistipes g__Bacillus
## Asymptotic                1      0.0000000    0.3611369           1
## Resampling                1      0.2079208    0.2673267           1
##            g__Bacteroides g__Blautia g__Brevibacterium g__Bulleidia
## Asymptotic      0.5928876  0.6833100                 1    0.3907661
## Resampling      0.4950495  0.8910891                 1    0.7722772
##            g__Clostridium g__Corynebacterium g__Desulfovibrio  g__Dorea
## Asymptotic     0.65000366         0.09732678        0.0000000 0.0000000
## Resampling     0.03960396         0.00990099        0.1188119 0.2772277
##            g__Enterococcus g__Lachnospira g__Lactobacillus g__Ochrobactrum
## Asymptotic       0.4222870              1        0.4504195              NA
## Resampling       0.1287129              1        0.2475248              NA
##            g__Olsenella g__Parabacteroides g__Prevotella
## Asymptotic    0.2813894          0.2176530     0.4494146
## Resampling    0.6039604          0.4257426     0.7128713
##            g__Pseudobutyrivibrio g__Psychrobacter g__Ruminococcus
## Asymptotic             1.0000000        1.0000000      0.10230395
## Resampling             0.8514851        0.7623762      0.00990099
##            g__Sharpea g__Streptococcus g__Streptomyces
## Asymptotic  0.0000000        0.9785369               1
## Resampling  0.5148515        0.9504950               1
##            f__[Mogibacteriaceae] f__[Odoribacteraceae]
## Asymptotic             0.5869457             0.4413195
## Resampling             0.6930693             0.5742574
##            f__[Paraprevotellaceae] f__[Tissierellaceae] f__[Weeksellaceae]
## Asymptotic              0.03444085                    1                  1
## Resampling              0.00990099                    1                  1
##            f__Acetobacteraceae f__Actinomycetaceae f__Aerococcaceae
## Asymptotic           0.6128130           0.2620467                1
## Resampling           0.8910891           0.1980198                1
##            f__Alcaligenaceae f__Anaeroplasmataceae f__Bacillaceae
## Asymptotic         0.9535927                     1      0.4222116
## Resampling         0.9504950                     1      0.1881188
##            f__Bacteroidaceae f__Bifidobacteriaceae f__Christensenellaceae
## Asymptotic         0.3989748             1.0000000              0.2979867
## Resampling         0.6336634             0.6237624              0.7821782
##            f__Clostridiaceae f__Comamonadaceae f__Coriobacteriaceae
## Asymptotic         0.0478007                NA           0.16639576
## Resampling         0.1287129                NA           0.02970297
##            f__Desulfovibrionaceae f__Elusimicrobiaceae
## Asymptotic              0.1571071            0.0000000
## Resampling              0.3366337            0.1980198
##            f__Enterobacteriaceae f__Erysipelotrichaceae f__Eubacteriaceae
## Asymptotic             0.4048576              0.3162231         0.0000000
## Resampling             0.8415842              0.1584158         0.1584158
##            f__Hyphomicrobiaceae f__Lachnospiraceae f__Lactobacillaceae
## Asymptotic            1.0000000         0.22095345           0.0000000
## Resampling            0.7227723         0.00990099           0.2079208
##            f__Microbacteriaceae f__Micrococcaceae f__Moraxellaceae
## Asymptotic                    1         1.0000000        0.2684904
## Resampling                    1         0.8316832        0.8217822
##            f__Nocardioidaceae f__Paenibacillaceae f__Peptostreptococcaceae
## Asymptotic          1.0000000                   1               0.31381120
## Resampling          0.7920792                   1               0.02970297
##            f__Phyllobacteriaceae f__Planococcaceae f__Porphyromonadaceae
## Asymptotic                     1         0.6498228             0.5447921
## Resampling                     1         0.4158416             0.6831683
##            f__Rikenellaceae f__Ruminococcaceae f__Spirochaetaceae
## Asymptotic       0.01383238         0.05775747          0.3193898
## Resampling       0.00990099         0.00990099          0.6831683
##            f__Staphylococcaceae f__Streptococcaceae
## Asymptotic            0.0000000           0.4688150
## Resampling            0.3267327           0.8910891
##            f__Streptosporangiaceae f__Succinivibrionaceae
## Asymptotic                       1              0.2562988
## Resampling                       1              0.1584158
##            f__Veillonellaceae f__Verrucomicrobiaceae f__Victivallaceae
## Asymptotic         0.16065355              0.6155877        0.00000000
## Resampling         0.00990099              0.3861386        0.04950495
##            o__Actinomycetales o__Bacillales o__Bacteroidales
## Asymptotic          0.9356310     0.6102820       0.09759225
## Resampling          0.5544554     0.4356436       0.00990099
##            o__Burkholderiales o__Campylobacterales o__Clostridiales
## Asymptotic          0.3419841            0.0000000       0.13320572
## Resampling          0.5445545            0.2277228       0.00990099
##            o__Desulfovibrionales o__Flavobacteriales o__Lactobacillales
## Asymptotic             0.2795826           1.0000000          0.6659518
## Resampling             0.4554455           0.7326733          0.4158416
##            o__Rhizobiales c__Actinobacteria c__Alphaproteobacteria
## Asymptotic      0.8554734         0.1992961              0.4771617
## Resampling      0.9603960         0.3762376              0.4455446
##            c__Bacilli c__Betaproteobacteria c__Clostridia
## Asymptotic 0.05022718             0.1611726   0.001800228
## Resampling 0.00990099             0.1683168   0.009900990
##            c__Deltaproteobacteria c__Gammaproteobacteria c__Mollicutes
## Asymptotic              0.4957157              0.4747018     0.4861390
## Resampling              0.1089109              0.2673267     0.3168317
##            c__Planctomycetia c__Spirochaetes p__Actinobacteria
## Asymptotic         0.4053812      0.25531120         0.3173305
## Resampling         0.9504950      0.04950495         0.4059406
##            p__Bacteroidetes p__Chloroflexi p__Cyanobacteria p__Firmicutes
## Asymptotic      0.001769568              1        1.0000000   0.007687737
## Resampling      0.009900990              1        0.5841584   0.009900990
##            p__Proteobacteria p__Tenericutes p__Verrucomicrobia k__Bacteria
## Asymptotic        0.02674659      0.3559259          0.7468447  0.66204052
## Resampling        0.00990099      0.4059406          0.2871287  0.00990099
## 
## 
## $sig.lineage
## $sig.lineage$`Two-Part`
##  [1] "g__[Clostridium]"         "g__[Eubacterium]"        
##  [3] "g__Bacteroides"           "g__Blautia"              
##  [5] "g__Desulfovibrio"         "g__Lactobacillus"        
##  [7] "g__Olsenella"             "g__Parabacteroides"      
##  [9] "g__Ruminococcus"          "f__[Mogibacteriaceae]"   
## [11] "f__[Odoribacteraceae]"    "f__[Paraprevotellaceae]" 
## [13] "f__Acetobacteraceae"      "f__Actinomycetaceae"     
## [15] "f__Aerococcaceae"         "f__Alcaligenaceae"       
## [17] "f__Bacillaceae"           "f__Bacteroidaceae"       
## [19] "f__Clostridiaceae"        "f__Coriobacteriaceae"    
## [21] "f__Erysipelotrichaceae"   "f__Lachnospiraceae"      
## [23] "f__Moraxellaceae"         "f__Peptostreptococcaceae"
## [25] "f__Phyllobacteriaceae"    "f__Porphyromonadaceae"   
## [27] "f__Rikenellaceae"         "f__Ruminococcaceae"      
## [29] "f__Succinivibrionaceae"   "f__Veillonellaceae"      
## [31] "f__Verrucomicrobiaceae"   "o__Bacillales"           
## [33] "o__Bacteroidales"         "o__Burkholderiales"      
## [35] "o__Clostridiales"         "c__Actinobacteria"       
## [37] "c__Alphaproteobacteria"   "c__Bacilli"              
## [39] "c__Betaproteobacteria"    "c__Clostridia"           
## [41] "c__Mollicutes"            "c__Spirochaetes"         
## [43] "p__Bacteroidetes"         "p__Chloroflexi"          
## [45] "p__Firmicutes"            "p__Proteobacteria"       
## [47] "p__Tenericutes"           "p__Verrucomicrobia"      
## [49] "k__Bacteria"             
## 
## $sig.lineage$`Zero-Part`
##  [1] "g__[Clostridium]"         "g__[Eubacterium]"        
##  [3] "g__[Ruminococcus]"        "g__Acinetobacter"        
##  [5] "g__Akkermansia"           "g__Alistipes"            
##  [7] "g__Bacillus"              "g__Bacteroides"          
##  [9] "g__Blautia"               "g__Clostridium"          
## [11] "g__Enterococcus"          "g__Lactobacillus"        
## [13] "g__Olsenella"             "g__Parabacteroides"      
## [15] "g__Prevotella"            "g__Psychrobacter"        
## [17] "g__Ruminococcus"          "g__Sharpea"              
## [19] "g__Streptococcus"         "f__[Mogibacteriaceae]"   
## [21] "f__[Odoribacteraceae]"    "f__[Paraprevotellaceae]" 
## [23] "f__[Weeksellaceae]"       "f__Acetobacteraceae"     
## [25] "f__Actinomycetaceae"      "f__Aerococcaceae"        
## [27] "f__Alcaligenaceae"        "f__Anaeroplasmataceae"   
## [29] "f__Bacillaceae"           "f__Bacteroidaceae"       
## [31] "f__Bifidobacteriaceae"    "f__Christensenellaceae"  
## [33] "f__Clostridiaceae"        "f__Coriobacteriaceae"    
## [35] "f__Elusimicrobiaceae"     "f__Erysipelotrichaceae"  
## [37] "f__Eubacteriaceae"        "f__Hyphomicrobiaceae"    
## [39] "f__Lachnospiraceae"       "f__Lactobacillaceae"     
## [41] "f__Moraxellaceae"         "f__Peptostreptococcaceae"
## [43] "f__Phyllobacteriaceae"    "f__Porphyromonadaceae"   
## [45] "f__Rikenellaceae"         "f__Ruminococcaceae"      
## [47] "f__Spirochaetaceae"       "f__Staphylococcaceae"    
## [49] "f__Succinivibrionaceae"   "f__Veillonellaceae"      
## [51] "f__Verrucomicrobiaceae"   "f__Victivallaceae"       
## [53] "o__Bacillales"            "o__Bacteroidales"        
## [55] "o__Burkholderiales"       "o__Campylobacterales"    
## [57] "o__Clostridiales"         "o__Flavobacteriales"     
## [59] "o__Lactobacillales"       "c__Actinobacteria"       
## [61] "c__Alphaproteobacteria"   "c__Bacilli"              
## [63] "c__Betaproteobacteria"    "c__Clostridia"           
## [65] "c__Deltaproteobacteria"   "c__Mollicutes"           
## [67] "c__Planctomycetia"        "c__Spirochaetes"         
## [69] "p__Chloroflexi"           "p__Cyanobacteria"        
## [71] "p__Firmicutes"            "p__Proteobacteria"       
## [73] "p__Tenericutes"           "p__Verrucomicrobia"      
## [75] "k__Bacteria"             
## 
## $sig.lineage$`Positive-Part`
## character(0)
## 
## 
## $global.pval
##      Simes_Two-Part     Simes_Zero-Part Simes_Positive-Part 
##          0.04395604          0.03305785          0.12252475

This spits out a lot of information. First, a list of taxa at ascending taxonomic resolution with two p-values associated with each one. It is important that you only consider the more accurate “Resampling” p-value when using the QCAT_GEE test. Next, it prints the list of significant lineages, again in descending taxonomic resolution, for three sets of tests. The first is the two-part test, which takes into account both abundance and presence/absence. The second is the zero-part test, which looks only at presence/absence. The third is the positive part test, which removes any lineages which have zeroes associated with any sample and looks at abundance. Because the rumen environment changes so drastically from birth to one year, we do not expect to see any lineages which persist all the way through. So it is not suprising that we do not see any significant lineages in the positive-part test. But this gives us a good list of candidate taxa which we can statistically test.

Let’s plot the abundance vs. age for phylum Bacteroidetes, implicated in the two-part test. and generate a linear model.

OTU.bacteriodetes = OTU.clean[,tax.clean$Phylum == "p__Bacteroidetes"]
OTU.bacteriodetes$bacteriodetes.sum = rowSums(OTU.bacteriodetes)

plot(x=meta$AgeExact, y=OTU.bacteriodetes$bacteriodetes.sum, pch=15)

glm.bacteriodetes <- glm(OTU.bacteriodetes$bacteriodetes.sum ~ meta$AgeExact)
summary(glm.bacteriodetes)

## 
## Call:
## glm(formula = OTU.bacteriodetes$bacteriodetes.sum ~ meta$AgeExact)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4336.9  -2396.0   -450.1   2705.3   6465.8  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4062.812    908.731   4.471 0.000191 ***
## meta$AgeExact    8.680      4.262   2.037 0.053893 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 10795274)
## 
##     Null deviance: 282275853  on 23  degrees of freedom
## Residual deviance: 237496021  on 22  degrees of freedom
## AIC: 460.69
## 
## Number of Fisher Scoring iterations: 2

abline(glm.bacteriodetes)

We can see that this is marginally significant with this test. Realistically, the structure of this data set isn’t TRULY continuous, so this method isn’t very appropriate as we applied it. But hopefully it was an illustrative example.

miLineage

¡Pero para generar taxones candidatos de nivel superior, podemos usar un nuevo paquete R que se desarrolló recientemente aquí en UW-Madison! Se llama miLineage y fue escrito por el Dr. Zheng-Zheng Tang en el departamento de Bioestadística e Informática Médica. Fui a su charla recientemente y decidí incluir algunas de las cosas que aprendí en este tutorial. Tenga en cuenta que esta información también es nueva para mí, por lo que puede haber advertencias y datos que no entiendo claramente o que no se expresan bien. Al igual que con todos estos métodos, asegúrese de leer la documentación y comprender qué se necesita para sus propios datos antes de comenzar. Aquí hay un enlace a la documentación para miLineage: https://cran.r-project.org/web/packages/miLineage/miLineage.pdf. Aquí hay un enlace a la publicación que describe el método: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408811/.

Entonces, ¿qué intenta este método lograr? A menudo, con los datos de micribiota, tenemos vectores de variables continuas (como “ADGKG” en este conjunto de datos de ejemplo) que esperamos covariar con OTU específicas. El problema es que la mayoría de los métodos estadísticos que buscan la covarianza entre las matrices hacen suposiciones que no se aplican a los datos de la microbiota. Los más problemáticos son las distribuciones no normales de la abundancia de taxones a alta resolución (género, especie y nivel OTU) y la gran cantidad de ceros que existen en estos conjuntos de datos. Para identificar taxones que covarian con una variable continua de interés, debemos usar un método que no haga suposiciones sobre la distribución y que pueda tratar con una gran cantidad de ceros. Ingrese la prueba QCAT demiLineage!

El paquete contiene dos versiones de la prueba. El primero es el estándar QCAT. Esto es apropiado para una matriz de datos que es menos del 30% de ceros. Esto ** casi nunca ** será el caso para los datos de microbiota de resolución OTU. Así que me voy a centrar en el segundo, el QCAT_GEE.

Insumos:

OTU: tabla OTU ** FORMATTED COMO MATRIZ **
X: variables continuas de interés Y variables de confusión. Estas son las variables que esperamos que covary con nuestras abundancias OTU ** FORMATEADO COMO MATRIZ **
Z: variables continuas de interés Y variables de confusión que esperamos que estén relacionadas con la presencia / ausencia de OTU ** FORMATTED COMO MATRIZ **
X.índex: el índice de columna para la (s) variable (s) continua (s) de interés para la abundancia
Z.index: el índice de columna para la (s) variable (s) continua (s) de interés para presencia / ausencia
Impuesto: tabla de taxonomía, como la que usamos para nuestro objeto phyloseq, excepto con los encabezados de columna como" Rank1 “,” Rank2 “… etc. ** FORMATEADO COMO MATRIZ **
fdr.alpha: el nivel de significancia que nos interesa (generalmente 0.05)
min.depth: punto de corte para la profundidad de la secuencia de la noche. Use 0 para una tabla OTU normalizada / submuestreada.
n.resample: número de veces para volver a muestrear. Permite el cálculo de valores p precisos

Para usar este paquete, vamos a tener que cambiar el formato de varios de nuestros archivos de entrada para cumplir con los requisitos del paquete.

Primero, necesitamos cambiar los encabezados de columna de nuestra tabla de taxonomía. Tenemos dominio a través de asignaciones de nivel de especie, por lo que tenemos 7 “Rangos”. Cambiemos el nombre de los encabezados de columna ahora.

A continuación, tenemos que crear el marco de datos de las covariables de interés. Vamos a tratar de probar la covariación de la edad en el momento del muestreo con nuestra tabla OTU.

Finalmente, necesitamos volver a formatear todas nuestras entradas como matrices.

Ahora podemos ejecutar la prueba.

Esto escupe mucha información. Primero, una lista de taxones en resolución taxonómica ascendente con dos valores de p asociados con cada uno. ** Es importante que solo considere el valor p de “remuestreo” más preciso al usar la prueba ** QCAT_GEE **. A continuación, imprime la lista de linajes significativos, nuevamente en resolución taxonómica descendente, para tres conjuntos de pruebas. El primero es la prueba de dos partes, que tiene en cuenta tanto la abundancia como la presencia / ausencia. El segundo es la prueba de cero partes, que solo se ve en presencia / ausencia. El tercero es la prueba de la parte positiva, que elimina cualquier linaje que tenga ceros asociados con cualquier muestra y analiza la abundancia. Debido a que el entorno del rumen cambia tan drásticamente desde el nacimiento hasta un año, no esperamos ver ningún linaje que persista hasta el final. Entonces no es sorprendente que no veamos linajes significativos en la prueba de parte positiva. Pero esto nos da una buena lista de taxones candidatos que podemos evaluar estadísticamente.

Vamos a trazar la abundancia contra la edad de los bacterióideos phylum, implicados en la prueba de dos partes. y generar un modelo lineal.

Podemos ver que esto es marginalmente significativo con esta prueba. Siendo realistas, la estructura de este conjunto de datos no es VERDADERAMENTE continua, por lo que este método no es muy apropiado ya que lo aplicamos. Pero afortunadamente fue un ejemplo ilustrativo.

`DESeq2`

Here it is again! Turns out, unlike SIMPER, DESeq2 can take BOTH categorical AND continuous explanatory variables. Because we have gone through it before, I am going to put this example in one big block of code. It is essentially the same, but we are going to use “ADGKG” instead of “AgeGroup” when making our DESeq2 object.

pds2 <- phyloseq_to_deseq2(physeq, ~ADGKG)

## converting counts to integer mode

gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(pds2), 1, gm_mean)
pds2 = estimateSizeFactors(pds2, geoMeans = geoMeans)
pds2 = DESeq(pds2)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

res2 <- results(pds2, cooksCutoff = FALSE)
alpha=0.05
sigtab2 <- res2[which(res2$padj < alpha), ]
sigtab2 <- cbind(as(sigtab2,"data.frame"), as(tax_table(physeq)[rownames(sigtab2), ], "matrix"))
head(sigtab2,20)

##           baseMean log2FoldChange    lfcSE      stat       pvalue
## Otu00016 125.97247      -16.15317 3.792250 -4.259520 2.048662e-05
## Otu00018 134.94822      -14.71987 3.616761 -4.069906 4.703219e-05
## Otu00021 101.76592      -12.89218 3.582755 -3.598399 3.201821e-04
## Otu00025  89.63720      -17.89727 3.739762 -4.785669 1.704185e-06
## Otu00027  99.18663       17.29546 3.777278  4.578817 4.676134e-06
## Otu00029 181.46933       13.30460 4.518822  2.944262 3.237258e-03
## Otu00035  65.52297      -15.04962 3.493084 -4.308405 1.644358e-05
## Otu00036  61.32749      -11.15129 3.678517 -3.031463 2.433717e-03
## Otu00037  68.14580      -13.45810 3.473143 -3.874903 1.066670e-04
## Otu00038  54.76796      -27.24449 4.785732 -5.692857 1.249310e-08
## Otu00040  61.95809      -13.96074 3.663646 -3.810614 1.386220e-04
## Otu00042  60.82039      -13.17175 3.656834 -3.601954 3.158343e-04
## Otu00043  76.60115       29.87132 4.828328  6.186680 6.144462e-10
## Otu00046  47.31414      -15.54166 3.610132 -4.305013 1.669763e-05
## Otu00048  45.53499      -11.94130 2.731062 -4.372403 1.228862e-05
## Otu00049  48.34865      -12.02429 3.619255 -3.322312 8.927493e-04
## Otu00052  38.16086      -14.60505 3.563384 -4.098647 4.155731e-05
## Otu00055  68.13640      -16.32464 4.761462 -3.428493 6.069430e-04
## Otu00056  39.55129      -14.19564 3.405694 -4.168207 3.070044e-05
## Otu00058  52.44649       26.90874 4.519817  5.953502 2.624644e-09
##                  padj      Domain           Phylum
## Otu00016 4.562928e-04 k__Bacteria p__Bacteroidetes
## Otu00018 8.642164e-04 k__Bacteria p__Bacteroidetes
## Otu00021 3.857932e-03 k__Bacteria p__Bacteroidetes
## Otu00025 6.592504e-05 k__Bacteria p__Bacteroidetes
## Otu00027 1.494330e-04 k__Bacteria    p__Firmicutes
## Otu00029 2.832601e-02 k__Bacteria  p__Spirochaetes
## Otu00035 3.835236e-04 k__Bacteria p__Bacteroidetes
## Otu00036 2.264281e-02 k__Bacteria p__Bacteroidetes
## Otu00037 1.668091e-03 k__Bacteria    p__Firmicutes
## Otu00038 1.530404e-06 k__Bacteria p__Bacteroidetes
## Otu00040 1.997788e-03 k__Bacteria    p__Firmicutes
## Otu00042 3.857932e-03 k__Bacteria    p__Firmicutes
## Otu00043 1.505393e-07 k__Bacteria    p__Firmicutes
## Otu00046 3.835236e-04 k__Bacteria p__Bacteroidetes
## Otu00048 3.225764e-04 k__Bacteria    p__Firmicutes
## Otu00049 9.373867e-03 k__Bacteria p__Bacteroidetes
## Otu00052 7.831955e-04 k__Bacteria    p__Firmicutes
## Otu00055 6.970361e-03 k__Bacteria p__Bacteroidetes
## Otu00056 5.938112e-04 k__Bacteria p__Bacteroidetes
## Otu00058 4.822783e-07 k__Bacteria p__Bacteroidetes
##                                  Class                         Order
## Otu00016                c__Bacteroidia              o__Bacteroidales
## Otu00018                c__Bacteroidia              o__Bacteroidales
## Otu00021                c__Bacteroidia              o__Bacteroidales
## Otu00025                c__Bacteroidia              o__Bacteroidales
## Otu00027            c__Erysipelotrichi         o__Erysipelotrichales
## Otu00029               c__Spirochaetes             o__Spirochaetales
## Otu00035                c__Bacteroidia              o__Bacteroidales
## Otu00036                c__Bacteroidia              o__Bacteroidales
## Otu00037                 c__Clostridia              o__Clostridiales
## Otu00038                c__Bacteroidia              o__Bacteroidales
## Otu00040                 c__Clostridia              o__Clostridiales
## Otu00042                 c__Clostridia              o__Clostridiales
## Otu00043                 c__Clostridia              o__Clostridiales
## Otu00046                c__Bacteroidia              o__Bacteroidales
## Otu00048                 c__Clostridia              o__Clostridiales
## Otu00049 p__Bacteroidetes_unclassified p__Bacteroidetes_unclassified
## Otu00052                 c__Clostridia              o__Clostridiales
## Otu00055                c__Bacteroidia              o__Bacteroidales
## Otu00056                c__Bacteroidia              o__Bacteroidales
## Otu00058                c__Bacteroidia              o__Bacteroidales
##                                 Family                           Genus
## Otu00016 o__Bacteroidales_unclassified   o__Bacteroidales_unclassified
## Otu00018             f__Bacteroidaceae                       g__5-7N15
## Otu00021       f__[Paraprevotellaceae]                        g__CF231
## Otu00025       f__[Paraprevotellaceae]                        g__CF231
## Otu00027        f__Erysipelotrichaceae                      g__Sharpea
## Otu00029            f__Spirochaetaceae                    g__Treponema
## Otu00035             f__Bacteroidaceae                       g__5-7N15
## Otu00036       f__[Paraprevotellaceae]                        g__CF231
## Otu00037            f__Ruminococcaceae f__Ruminococcaceae_unclassified
## Otu00038                      f__S24-7           f__S24-7_unclassified
## Otu00040            f__Veillonellaceae        g__Phascolarctobacterium
## Otu00042            f__Ruminococcaceae f__Ruminococcaceae_unclassified
## Otu00043      f__Peptostreptococcaceae                g__[Clostridium]
## Otu00046             f__Bacteroidaceae                       g__5-7N15
## Otu00048            f__Lachnospiraceae f__Lachnospiraceae_unclassified
## Otu00049 p__Bacteroidetes_unclassified   p__Bacteroidetes_unclassified
## Otu00052            f__Ruminococcaceae f__Ruminococcaceae_unclassified
## Otu00055       f__[Paraprevotellaceae]                        g__CF231
## Otu00056             f__Bacteroidaceae                       g__5-7N15
## Otu00058             f__Prevotellaceae                   g__Prevotella
##                                        Species
## Otu00016         o__Bacteroidales_unclassified
## Otu00018                g__5-7N15_unclassified
## Otu00021                 g__CF231_unclassified
## Otu00025                 g__CF231_unclassified
## Otu00027                         s__azabuensis
## Otu00029             g__Treponema_unclassified
## Otu00035                g__5-7N15_unclassified
## Otu00036                 g__CF231_unclassified
## Otu00037       f__Ruminococcaceae_unclassified
## Otu00038                 f__S24-7_unclassified
## Otu00040 g__Phascolarctobacterium_unclassified
## Otu00042       f__Ruminococcaceae_unclassified
## Otu00043                       s__bifermentans
## Otu00046                g__5-7N15_unclassified
## Otu00048       f__Lachnospiraceae_unclassified
## Otu00049         p__Bacteroidetes_unclassified
## Otu00052       f__Ruminococcaceae_unclassified
## Otu00055                 g__CF231_unclassified
## Otu00056                g__5-7N15_unclassified
## Otu00058            g__Prevotella_unclassified

Again, this provides some candidates for further exploration.

DESeq2

¡Aquí está de nuevo! Resulta que, a diferencia de SIMPER, DESeq2 puede tomar tanto variables categóricas como variables explicativas continuas. Debido a que ya lo hemos analizado, voy a poner este ejemplo en un gran bloque de código. Es esencialmente lo mismo, pero vamos a usar “ADGKG” en lugar de “AgeGroup” al hacer nuestro objeto DESeq2.

Nuevamente, esto proporciona algunos candidatos para una mayor exploración.

Correlations

So, you can also approach continuous variables as correlations. Generally, only strong correlations (r > 0.5 or r < -0.5) should be reported and if you have a lot that fall into the “strong” category, you can up the cut off, say, to r > 0.75 or r < -0.75. There are many correlation options. I like Kendall-Tau because it does not assume linearity or normality. Type ??cor in the R console to learn others that are available.

Also, consider options to decrease the number of OTUs tested or you will be dealing with a huge table. Like only ones at >X% abundance? Only ones found in SIMPER and/or KW analyses of other important variables?

Here, we will correlate ADG to OTUs with at least 5% relative abundance in at least one sample in our data set.

#Remember we calculated abundance before with
#abund = OTU.clean/rowSums(OTU.clean)*100

#Subset OTUs to abundance cutoff
OTU.abund = OTU.clean[, apply(abund, MARGIN=2, function(x) any(x > 5))]

cor.kendall = cor(OTU.abund, meta$ADGKG, method = "kendall")
cor.kendall

##                  [,1]
## Otu00001  0.189852125
## Otu00002  0.211764129
## Otu00003  0.027397313
## Otu00004  0.275867615
## Otu00005  0.165056323
## Otu00006 -0.114462240
## Otu00007  0.143930930
## Otu00008  0.211764129
## Otu00009 -0.177517901
## Otu00010  0.176299258
## Otu00011  0.208334326
## Otu00012  0.017236256
## Otu00013  0.269669049
## Otu00015  0.018077538
## Otu00016 -0.257293680
## Otu00017  0.284293111
## Otu00019  0.172479145
## Otu00020  0.102188122
## Otu00022 -0.034040152
## Otu00023  0.004106646
## Otu00024  0.073416202
## Otu00027  0.412640807
## Otu00029  0.076924424
## Otu00030 -0.077670805
## Otu00031  0.286002668
## Otu00038 -0.271163072
## Otu00041  0.125193349
## Otu00043  0.189645652
## Otu00044  0.239065695
## Otu00053 -0.217652255
## Otu00055 -0.112428004
## Otu00070 -0.037317590

In this case, we don’t see any strong correlations. However, if we did, we could use those OTUs as our list of ones that are of interest to check for significance with glm.

Next, we will correlate SCFAs with OTUs with at least 1% relative abundance in at least one sample in our data set. We will use only samples for which we also have SCFA data.

#Calculate abundances
abund.SCFA = OTU.SCFA/rowSums(OTU.SCFA)*100

#Subset OTUs to abundance cutoff
OTU.SCFA.abund = OTU.SCFA[, apply(abund.SCFA, MARGIN=2, function(x) any(x > 1))]

cor.kendall = cor(OTU.SCFA.abund, SCFA, method = "kendall")
cor.kendall

##             Formate    Acetate Propionate Isobutyrate   Butyrate
## Otu00006  0.0000000  0.1825742  0.1825742   0.1825742  0.1825742
## Otu00014  0.1825742  0.3333333  0.3333333   0.0000000  0.3333333
## Otu00016 -0.1825742 -0.3333333 -0.3333333  -0.6666667 -0.3333333
## Otu00018 -0.1825742 -0.3333333 -0.3333333  -0.6666667 -0.3333333
## Otu00021 -0.9128709 -0.6666667 -0.6666667  -0.3333333 -0.6666667
## Otu00025  0.9128709  0.6666667  0.6666667   0.3333333  0.6666667
## Otu00035 -0.5477226 -0.6666667 -0.6666667  -1.0000000 -0.6666667
## Otu00036 -0.5477226 -0.6666667 -0.6666667  -0.3333333 -0.6666667
## Otu00037 -0.1825742  0.0000000  0.0000000   0.3333333  0.0000000
## Otu00040 -0.5477226 -0.6666667 -0.6666667  -1.0000000 -0.6666667
## Otu00042  0.1825742  0.3333333  0.3333333   0.0000000  0.3333333
## Otu00046 -0.1825742 -0.3333333 -0.3333333  -0.6666667 -0.3333333
## Otu00049 -0.1825742 -0.3333333 -0.3333333   0.0000000 -0.3333333
## Otu00051  0.5477226  0.3333333  0.3333333   0.6666667  0.3333333
## Otu00052 -0.5477226 -0.6666667 -0.6666667  -1.0000000 -0.6666667
## Otu00056 -0.1825742 -0.3333333 -0.3333333  -0.6666667 -0.3333333
## Otu00064 -0.5477226 -0.3333333 -0.3333333  -0.6666667 -0.3333333
## Otu00066 -0.5477226 -0.6666667 -0.6666667  -1.0000000 -0.6666667
## Otu00067  0.1825742  0.0000000  0.0000000   0.3333333  0.0000000
## Otu00069  0.5477226  0.3333333  0.3333333   0.6666667  0.3333333
## Otu00074  0.5477226  0.6666667  0.6666667   0.3333333  0.6666667
## Otu00077  0.1825742  0.3333333  0.3333333   0.6666667  0.3333333
## Otu00088  0.1825742  0.0000000  0.0000000  -0.3333333  0.0000000
## Otu00089  0.1825742  0.0000000  0.0000000  -0.3333333  0.0000000
## Otu00097 -0.1825742  0.0000000  0.0000000   0.3333333  0.0000000
## Otu00100 -0.1825742  0.0000000  0.0000000   0.3333333  0.0000000
## Otu00113 -0.5477226 -0.6666667 -0.6666667  -0.3333333 -0.6666667
## Otu00192  0.5477226  0.6666667  0.6666667   1.0000000  0.6666667
## Otu00295  0.2581989  0.2357023  0.2357023   0.7071068  0.2357023
##            iVal.2MB   Valerate
## Otu00006 -0.1825742  0.1825742
## Otu00014 -0.3333333  0.0000000
## Otu00016 -0.3333333 -0.6666667
## Otu00018 -0.3333333 -0.6666667
## Otu00021 -0.6666667 -0.3333333
## Otu00025  0.6666667  0.3333333
## Otu00035 -0.6666667 -1.0000000
## Otu00036  0.0000000 -0.3333333
## Otu00037  0.0000000  0.3333333
## Otu00040 -0.6666667 -1.0000000
## Otu00042 -0.3333333  0.0000000
## Otu00046 -0.3333333 -0.6666667
## Otu00049  0.3333333  0.0000000
## Otu00051  1.0000000  0.6666667
## Otu00052 -0.6666667 -1.0000000
## Otu00056 -0.3333333 -0.6666667
## Otu00064 -1.0000000 -0.6666667
## Otu00066 -0.6666667 -1.0000000
## Otu00067  0.6666667  0.3333333
## Otu00069  1.0000000  0.6666667
## Otu00074  0.0000000  0.3333333
## Otu00077  0.3333333  0.6666667
## Otu00088  0.0000000 -0.3333333
## Otu00089  0.0000000 -0.3333333
## Otu00097  0.0000000  0.3333333
## Otu00100  0.0000000  0.3333333
## Otu00113  0.0000000 -0.3333333
## Otu00192  0.6666667  1.0000000
## Otu00295  0.7071068  0.7071068

If the data table is too large to view in R, you can write it to a table in your project folder.

write.table(cor.kendall, file = "cor_kendall.csv", sep = ",")

We see that some OTUs strongly correlation with a SCFAs. For example, Otu00021 and Otu00025 with Formate

par(mfrow = c(1, 2))
plot(abund.SCFA$Otu00021 ~ SCFA$Formate, xlab="Formate (mM)", ylab="Relative abundance, %", main="OTU21")
plot(abund.SCFA$Otu00025 ~ SCFA$Formate, xlab="Formate (mM)", ylab="Relative abundance, %", main="OTU25")

Clearly we don’t have enough data points to make strong conclusions here and the correlations are being driven by one animal with very high formate. However, we could further test the list of OTUs that correlate strongly with SCFAs. We will assume a normal distribution here, but you should assess your models with plot() to make sure they are a good fit.

OTU21.Formate = glm(OTU.SCFA$Otu00021 ~ SCFA$Formate)
summary(OTU21.Formate)

## 
## Call:
## glm(formula = OTU.SCFA$Otu00021 ~ SCFA$Formate)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -56.173   96.253  -46.747    6.668  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    357.75      51.46   6.952   0.0201 *
## SCFA$Formate  -540.02     201.13  -2.685   0.1152  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7324.907)
## 
##     Null deviance: 67454  on 3  degrees of freedom
## Residual deviance: 14650  on 2  degrees of freedom
## AIC: 50.175
## 
## Number of Fisher Scoring iterations: 2

OTU25.Formate = glm(OTU.SCFA$Otu00025 ~ SCFA$Formate)
summary(OTU25.Formate)

## 
## Call:
## glm(formula = OTU.SCFA$Otu00025 ~ SCFA$Formate)
## 
## Deviance Residuals: 
##        1         2         3         4  
##  127.727  -118.783     6.217   -15.162  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    219.78      74.49   2.951   0.0982 .
## SCFA$Formate   721.00     291.12   2.477   0.1316  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 15346.04)
## 
##     Null deviance: 124819  on 3  degrees of freedom
## Residual deviance:  30692  on 2  degrees of freedom
## AIC: 53.133
## 
## Number of Fisher Scoring iterations: 2

So, we see that these two OTUs do not significantly differ with Formate concentration even though they had very strong Kendall correlations. This is similar to OTUs occuring in SIMPER that do not hold up to subsequent Kruskal-Wallis testing.

Correlaciones

Por lo tanto, también puede abordar variables continuas como correlaciones. Generalmente, solo se deben reportar las correlaciones fuertes (r> 0.5 o r <-0.5) y si tiene mucho que cae en la categoría “fuerte”, puede subir el límite, por ejemplo, a r> 0.75 o r <- 0.75. Hay muchas opciones de correlación. Me gusta Kendall-Tau porque no supone linealidad o normalidad. Escriba? Cor en la consola R para conocer otros que están disponibles.

Además, considere opciones para disminuir el número de OTU probadas o tratará con una mesa enorme. ¿Al igual que solo en> X% de abundancia? ¿Solo los que se encuentran en SIMPER y / o KW análisis de otras variables importantes?

Aquí, correlacionaremos ADG con OTU con al menos 5% de abundancia relativa en al menos una muestra en nuestro conjunto de datos.

En este caso, no vemos ninguna correlación fuerte. Sin embargo, si lo hiciéramos, podríamos usar esas OTU como nuestra lista de las que son de interés para verificar la importancia de glm.

A continuación, correlacionaremos los SCFA con OTU con al menos 1% de abundancia relativa en al menos una muestra en nuestro conjunto de datos. Utilizaremos solo muestras para las cuales también tengamos datos de SCFA.

Si la tabla de datos es demasiado grande para visualizarla en R, puede escribirla en una tabla de la carpeta de proyectos.

Vemos que algunas OTU tienen una fuerte correlación con un SCFA. Por ejemplo, Otu00021 y Otu00025 con formiato

Claramente, no tenemos suficientes puntos de datos para sacar conclusiones sólidas aquí y las correlaciones están siendo impulsadas por un animal con formiato muy alto. Sin embargo, podríamos probar más la lista de OTU que se correlacionan fuertemente con los SCFA. Aquí asumiremos una distribución normal, pero debe evaluar sus modelos con plot () para asegurarse de que encajen bien.

Entonces, vemos que estas dos OTU no difieren significativamente con la concentración de formiato a pesar de que tenían correlaciones de Kendall muy fuertes. Esto es similar a las OTU que ocurren en SIMPER que no se sostienen en las pruebas subsiguientes de Kruskal-Wallis.

Working in `phyloseq`

phyloseq is an incredibly powerful package, and we will not have time to go over everything that it is able to do. There is abundant documentation and tutorials online. Here is a link to the phyloseq bible. But I would like to go over some of the ways you can manipulate data in phyloseq which might come in handy. We will make a new phyloseq object first, using our previously generated phyloseq-formatted input files. The object is much smaller without the tree, so I am not going to include it.

physeq.demo <- phyloseq(OTU.UF, meta.UF, tax.UF)

Here are some accessors to take a look at your phyloseq object.

#View one of the data frames within the object
View(sample_data(physeq.demo))

#number of taxa
ntaxa(physeq.demo)

## [1] 5002

#number of samples
nsamples(physeq.demo)

## [1] 24

#sample names
sample_names(physeq.demo)

##  [1] "5017.1yr.F" "5017.2w.F"  "5017.8w.F"  "5020.1yr.F" "5020.2w.F" 
##  [6] "5020.8w.F"  "5026.1yr.F" "5026.2w.F"  "5026.8w.F"  "5031.1yr.F"
## [11] "5031.2w.F"  "5031.8w.F"  "5037.1yr.F" "5037.2w.F"  "5037.8w.F" 
## [16] "5041.1yr.F" "5041.2w.F"  "5041.8w.F"  "5045.1yr.F" "5045.2w.F" 
## [21] "5045.8w.F"  "5053.1yr.F" "5053.2w.F"  "5053.8w.F"

#taxon levels
rank_names(physeq.demo)

## [1] "Domain"  "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

#metadata variables
sample_variables(physeq.demo)

##  [1] "Animal"        "AgeGroup"      "AgeExact"      "ADGKG"        
##  [5] "chao"          "shannon"       "simpson"       "ace"          
##  [9] "shannon.vegan" "AgeGroup.ord"

#specific metadata column: AgeGroup
sample_data(physeq.demo)$AgeGroup

##  [1] 1yr 2w  8w  1yr 2w  8w  1yr 2w  8w  1yr 2w  8w  1yr 2w  8w  1yr 2w 
## [18] 8w  1yr 2w  8w  1yr 2w  8w 
## Levels: 1yr 2w 8w

Something we might be interested in doing is subsetting our OTU matrix within phyloseq. This is accomplished by rarefy_even_depth. This is dependent on the random seed we set earlier for reproducible results, and will also remove all OTUs/samples which are now blank because of the rarefaction.

physeq.demo.rarefy <- rarefy_even_depth(physeq.demo, sample.size = 5000)

## You set `rngseed` to FALSE. Make sure you've set & recorded
##  the random seed of your session for reproducibility.
## See `?set.seed`

## ...

## 1946OTUs were removed because they are no longer 
## present in any sample after random subsampling

## ...

#number of taxa and samples before
ntaxa(physeq.demo)

## [1] 5002

nsamples(physeq.demo)

## [1] 24

#number of taxa and samples after
ntaxa(physeq.demo.rarefy)

## [1] 3056

nsamples(physeq.demo.rarefy)

## [1] 24

Alternately, you can perform normalization (like we did in mothur) using this function, courtesy of Michelle Berry and the Denef lab at the University of Michigan.

#the function
scale_reads <- function(physeq,n){
  physeq.scale <- transform_sample_counts(physeq, function(x) {round((n*x/sum(x)))})
  #otu_table(physeq.scale) = round(otu_table(physeq.scale))
  physeq.scale = prune_taxa(taxa_sums(physeq.scale) > 0, physeq.scale)
  return(physeq.scale)
}

#normalize our phyloseq object to 5000 using this function
physeq.demo.norm <- scale_reads(physeq.demo, 5000)

#normalize our phyloseq object to the minimum observed in the samples using this function
physeq.demo.norm2 <- scale_reads(physeq.demo,min(sample_sums(physeq.demo)))

#number of taxa and reads/sample before
ntaxa(physeq.demo)

## [1] 5002

rowSums(otu_table(physeq.demo))

## 5017.1yr.F  5017.2w.F  5017.8w.F 5020.1yr.F  5020.2w.F  5020.8w.F 
##      17342      17477      17607      17560      17460      17443 
## 5026.1yr.F  5026.2w.F  5026.8w.F 5031.1yr.F  5031.2w.F  5031.8w.F 
##      17583      17515      17483      17565      17518      17405 
## 5037.1yr.F  5037.2w.F  5037.8w.F 5041.1yr.F  5041.2w.F  5041.8w.F 
##      17614      17477      17449      17402      17504      17482 
## 5045.1yr.F  5045.2w.F  5045.8w.F 5053.1yr.F  5053.2w.F  5053.8w.F 
##      17338      17504      17471      17463      17488      17493

#number of taxa and reads/sample after
ntaxa(physeq.demo.norm)

## [1] 2596

ntaxa(physeq.demo.norm2)

## [1] 5002

rowSums(otu_table(physeq.demo.norm))

## 5017.1yr.F  5017.2w.F  5017.8w.F 5020.1yr.F  5020.2w.F  5020.8w.F 
##       4867       4985       4957       4899       4984       4987 
## 5026.1yr.F  5026.2w.F  5026.8w.F 5031.1yr.F  5031.2w.F  5031.8w.F 
##       4926       4975       4989       4854       4982       4975 
## 5037.1yr.F  5037.2w.F  5037.8w.F 5041.1yr.F  5041.2w.F  5041.8w.F 
##       4958       4978       4984       4890       4994       4975 
## 5045.1yr.F  5045.2w.F  5045.8w.F 5053.1yr.F  5053.2w.F  5053.8w.F 
##       4907       4975       4973       4894       4986       4952

rowSums(otu_table(physeq.demo.norm2))

## 5017.1yr.F  5017.2w.F  5017.8w.F 5020.1yr.F  5020.2w.F  5020.8w.F 
##      17342      17347      17369      17418      17341      17358 
## 5026.1yr.F  5026.2w.F  5026.8w.F 5031.1yr.F  5031.2w.F  5031.8w.F 
##      17409      17344      17344      17405      17343      17356 
## 5037.1yr.F  5037.2w.F  5037.8w.F 5041.1yr.F  5041.2w.F  5041.8w.F 
##      17423      17346      17349      17377      17341      17357 
## 5045.1yr.F  5045.2w.F  5045.8w.F 5053.1yr.F  5053.2w.F  5053.8w.F 
##      17338      17342      17347      17391      17343      17357

We may also want to merge samples by group for plotting purposes. Say, if you want to make a stacked bar graph comparing phylum relative abundance between age groups. This is accomplished by merge_samples, followed by a transformation to relative abundance.

#merge samples
physeq.demo.merged <- merge_samples(physeq.demo, "AgeGroup")

#samples before
nsamples(physeq.demo)

## [1] 24

sample_names(physeq.demo)

##  [1] "5017.1yr.F" "5017.2w.F"  "5017.8w.F"  "5020.1yr.F" "5020.2w.F" 
##  [6] "5020.8w.F"  "5026.1yr.F" "5026.2w.F"  "5026.8w.F"  "5031.1yr.F"
## [11] "5031.2w.F"  "5031.8w.F"  "5037.1yr.F" "5037.2w.F"  "5037.8w.F" 
## [16] "5041.1yr.F" "5041.2w.F"  "5041.8w.F"  "5045.1yr.F" "5045.2w.F" 
## [21] "5045.8w.F"  "5053.1yr.F" "5053.2w.F"  "5053.8w.F"

#samples after
nsamples(physeq.demo.merged)

## [1] 3

sample_names(physeq.demo.merged)

## [1] "1yr" "2w"  "8w"

#convert counts to proportions
physeq.demo.merged.relabund <- transform_sample_counts(physeq.demo.merged, function(x) x/sum(x))

#row sums before
rowSums(otu_table(physeq.demo.merged))

##    1yr     2w     8w 
## 139867 139943 139833

#row sums after
rowSums(otu_table(physeq.demo.merged.relabund))

## 1yr  2w  8w 
##   1   1   1

We can also get rid of low-abundance OTUs in phyloseq. Sometimes, rare species add unnecessary noise to our data, and removal may give a better idea of broad-scale patterns going on. This removal can be accomplished with

#remove OTUs whose mean value per sample is less than 1
physeq.demo.abundmean <- filter_taxa(physeq.demo, function(x) mean(x) > 1, TRUE)

#remove OTUs whose max value per sample is less than 10
physeq.demo.abundcount <- prune_taxa(taxa_sums(physeq.demo)>10,physeq.demo)

#taxa before
ntaxa(physeq.demo)

## [1] 5002

#taxa after
ntaxa(physeq.demo.abundmean)

## [1] 995

ntaxa(physeq.demo.abundcount)

## [1] 1499

It may also be desirable to remove samples or subset a dataset to a specific group of samples.

#remove samples by name
physeq.demo.removesamp <- prune_samples(sample_names(physeq.demo)!="5017.1yr.F", physeq.demo)

#remove samples by sequence count lower than 17400
physeq.demo.removelow <- prune_samples(sample_sums(physeq.demo)>=17400, physeq.demo)

#subset to only 2wk samples
physeq.demo.2wk <- subset_samples(physeq.demo, AgeGroup=="2w")

#samples before
nsamples(physeq.demo)

## [1] 24

#samples after
nsamples(physeq.demo.removesamp)

## [1] 23

nsamples(physeq.demo.removelow)

## [1] 22

nsamples(physeq.demo.2wk)

## [1] 8

Finally, you may want to subset a dataset to a specific taxon.

#subset to only Bacteriodetes
physeq.demo.Bacteriodetes <- subset_taxa(physeq.demo, Phylum=="p__Bacteroidetes")

#taxa before
ntaxa(physeq.demo)

## [1] 5002

#taxa after
ntaxa(physeq.demo.Bacteriodetes)

## [1] 901

Trabajando en phyloseq

phyloseq es un paquete increíblemente poderoso, y no tendremos tiempo de repasar todo lo que es capaz de hacer. Hay abundante documentación y tutoriales en línea. Aquí hay un enlace al phyloseq [biblia.] (Http://joey711.github.io/phyloseq-demo/) Pero me gustaría repasar algunas de las maneras en que puedes manipular datos enphyloseq que podría ser útil. Primero haremos un nuevo objeto phyloseq, usando nuestros archivos de entrada formateadosphyloseq previamente generados. El objeto es mucho más pequeño sin el árbol, por lo que no voy a incluirlo.

Aquí hay algunos accesorios para echar un vistazo a su objeto phyloseq.

Algo que podríamos interesarnos es subdividir nuestra matriz OTU dentro de phyloseq. Esto se logra mediante rarefy_even_depth. Esto depende ** de la semilla aleatoria que establezcamos antes para obtener resultados reproducibles **, y también eliminará todas las OTU / muestras que ahora están en blanco debido a la rarefacción.

Alternativamente, puede realizar la normalización (como lo hicimos en mothur) usando esta función, cortesía de [Michelle Berry] (https://github.com/DenefLab/MicrobeMiseq/blob/master/R/miseqR.R) y el laboratorio Denef en la Universidad de Michigan.

También es posible que deseemos fusionar muestras por grupo para fines de trazado. Digamos, si quiere hacer un gráfico de barras apiladas que compare la abundancia relativa del phylum entre los grupos de edad. Esto se logra mediante merge_samples, seguido de una transformación a abundancia relativa.

También podemos deshacernos de las OTU de baja abundancia en phyloseq. A veces, las especies raras agregan ruido innecesario a nuestros datos, y la eliminación puede dar una mejor idea de los patrones a gran escala. Esta eliminación se puede lograr con

También puede ser conveniente eliminar muestras o subconjuntos de un conjunto de datos a un grupo específico de muestras.

Finalmente, es posible que desee subconjuntar un conjunto de datos a un taxón específico.

Other visualizations

There are a TON of visualizations you can do in phyloseq. Google it. But we will go over a select few, plus some bonus stuff outside of phyloseq, after covering the basics of the ggplot2 package which is used in phyloseq to generate graphics.

`ggplot2` syntax

ggplot2 lets you get a lot pickier about figures than does the base R plotting utility. To make a pretty graph, you can give ggplot2 the following: + data + aesthetics (positions, colors, shapes, sample groupings, etc) + geometric object (what kind of plot you want) …among other things.

Data comes first, then, within the parentheses for “aesthetics” is anything “variable,” or anything that differs from point to point, like position or color (if coloring by sample type). Following this is anything constant, like a point size of 2 for all points.

Here is the basic format for a scatter plot:

plot <- ggplot(data, aes(x=var1, y=var2, color=colVar), size=2)

If we wanted to add another element, say a trendline, we use “+”

trendline <- predict(lm(var2~var1, data=data))
plot + geom_line(aes(y=trendline))

Let’s try it with some real data. Lets plot Chao richness vs. ADGKG for our samples. coloring by AgeGroup and with a trendline for the linear model.

#make basic plot
ggplot.demo <- ggplot(meta, aes(x=meta$chao, y=meta$ADGKG), size=2)

#calculate trendline for linear model
trendline.demo <- predict(lm(ADGKG~chao, data=meta))

#plot with points colored by AgeGroup and a black trendline
ggplot.demo + geom_point(aes(color=AgeGroup), size=3) + geom_line(aes(y=trendline.demo), color="black")

You can also color according to continuous variables and get a continuous scale. Let’s try coloring with AgeExact instead of AgeGroup. Also, for fun, let’s make the points huge ampersands

#plot with points colored by AgeExact and a black trendline
ggplot.demo + geom_point(aes(color=AgeExact), size = 20, shape = "&") + geom_line(aes(y=trendline.demo), color="black")

See? So much you can do.

We will get practice with a few other types of plots as well, but that should get us started. There is a TON more ggplot2 can do, spend some time playing with it and some time on Google. But trust us when we say that once you learn the syntax, it is both MUCH more beautiful and MUCH easier to customize than base R graphics.

Otras visualizaciones

Hay una TONELADA de visualizaciones que puedes hacer en phyloseq. Buscalo en Google. Pero revisaremos unos pocos seleccionados, además de algunas cosas de bonificación fuera de phyloseq, después de cubrir los conceptos básicos del paqueteggplot2 que se utiliza en phyloseq para generar gráficos.

ggplot2 sintaxis ggplot2 te permite ser mucho más exigente con las figuras que la utilidad de trazado base R. Para hacer un gráfico bonito, puede darle a ggplot2 lo siguiente: + datos + estética (posiciones, colores, formas, agrupaciones de muestras, etc.) + objeto geométrico (qué tipo de trama quieres) …entre otras cosas.

Los datos son lo primero, luego, dentro de los paréntesis, “estética” es cualquier cosa “variable”, o cualquier cosa que difiera de un punto a otro, como posición o color (si colorea por tipo de muestra). Siguiendo esto hay algo constante, como un tamaño de punto de 2 para todos los puntos.

Aquí está el formato básico para un diagrama de dispersión:

Si quisiéramos agregar otro elemento, digamos una línea de tendencia, usamos “+”

Probémoslo con algunos datos reales. Vamos a trazar la riqueza de Chao vs. ADGKG para nuestras muestras. colorear por AgeGroup y con una línea de tendencia para el modelo lineal.

También puede colorear de acuerdo con variables continuas y obtener una escala continua. Probemos colorear con AgeExact en lugar de AgeGroup. Además, para divertirnos, hagamos que los puntos sean grandes símbolos

¿Ver? Tanto que puedes hacer

También obtendremos práctica con algunos otros tipos de tramas, pero eso debería ayudarnos a comenzar. Hay un TON más de lo que ggplot2 puede hacer, pasar un tiempo jugando con él y pasar un tiempo en Google. Pero créenos cuando decimos que una vez que aprendes la sintaxis, es tanto MUCHO más bello y MUCHO más fácil de personalizar que los gráficos de base R.

Bar charts

The phyloseq object we created with our OTU, meta, tax, and tree data (physeq.tree) can also be used in a number of other plot functions in the phyloseq / ggplot2 packages.

Let’s explore some of the bar chart options. First, we’ll make the classic additive bar chart for phyla in our samples

plot_bar(physeq.tree, fill="Phylum")

We can simplify by grouping our samples by age group

plot_bar(physeq.tree, x="AgeGroup", fill="Phylum")

And removing the lines between OTUs in the bars

plot_bar(physeq.tree, x="AgeGroup", fill="Phylum") + geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack")

And only showing the top 5 most abundant phyla

#Sort the Phyla by abundance and pick the top 5
top5P.names = sort(tapply(taxa_sums(physeq.tree), tax_table(physeq.tree)[, "Phylum"], sum), TRUE)[1:5]
#Cut down the physeq.tree data to only the top 10 Phyla
top5P = subset_taxa(physeq.tree, Phylum %in% names(top5P.names))
#Plot
plot_bar(top5P, x="AgeGroup", fill="Phylum") + geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack")

There are many more options within ggplot2 to alter this figure. This document has many helpful tips.

Another way to simplify these bar plots is to not show all OTUs for one sample in one bar. We can do this with facet_grid

plot_bar(top5P, x="AgeGroup", fill="Phylum", facet_grid = ~Phylum) + geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack")

And you can break it down at any taxonomic level and color by any other level.

Gráfica de barras

El objeto phyloseq que creamos con nuestros datos de OTU, meta, impuestos y árbol (physeq.tree) también se puede usar en varias otras funciones de trazado en los paquetesphyloseq / ggplot2.

Exploremos algunas de las opciones de gráfico de barras. Primero, crearemos el gráfico de barras aditivo clásico para phyla en nuestras muestras

Podemos simplificar agrupando nuestras muestras por grupo de edad

Y eliminando las líneas entre OTUs en las barras

Y solo mostrando los 5 mejores phyla más abundantes

Hay muchas más opciones dentro de ggplot2 para alterar esta figura. [Este documento] (https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) tiene muchos consejos útiles.

Otra forma de simplificar estos gráficos de barras es no mostrar todas las OTU para una muestra en una barra. Podemos hacer esto con facet_grid

Y puede descomponerlo en cualquier nivel y color taxonómico en cualquier otro nivel.

Trees

We can also plot phylogenetic trees and label/modify them by our variables of interest.

Let’s look at the genus Prevotella in our data. We want to subset down to just this genus or else our plot would be too cluttered to read.

Subset by genus

prevotella = subset_taxa(physeq.tree, Genus == "g__Prevotella")

We can see that this worked by comparing the number of taxa in our subset and our original data

physeq.tree

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5002 taxa and 24 samples ]
## sample_data() Sample Data:       [ 24 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 5002 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 5002 tips and 5000 internal nodes ]

prevotella

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 106 taxa and 24 samples ]
## sample_data() Sample Data:       [ 24 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 106 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 106 tips and 105 internal nodes ]

We can plot these OTUs on a tree.

plot_tree(prevotella, plot.margin = 0.5, ladderize = TRUE)

In the figure, each OTU is represented by the end branch of the tree. How many samples that OTU occurs in is represented by the black dots.

Let’s make this figure a little more useful and add 1) Colors to the dots for our age groups, 2) Size to the dots to show OTU abundance, and 3) Species level labels for the OTUs

plot_tree(prevotella, color = "AgeGroup", label.tips = "Species", size = "abundance", plot.margin = 0.5, ladderize = TRUE)

Already it’s a little difficult to read. You can view a larger page by clicking “Zoom” above the figure. Or export the figure as a PDF and save as a full page size, 9.5x11.

There are even more customizable options in this figure. Type ?plot_tree into the console to see the help page explaining all the options.

Árboles

También podemos trazar árboles filogenéticos y etiquetarlos / modificarlos según nuestras variables de interés.

Veamos el género * Prevotella * en nuestros datos. Queremos subconjuntar a solo este género o de lo contrario nuestra trama sería demasiado abarrotada para leer.

Subconjunto por género

Podemos ver que esto funcionó al comparar el número de taxones en nuestro subconjunto y nuestros datos originales

Podemos trazar estas OTU en un árbol.

En la figura, cada OTU está representada por la rama final del árbol. Cuántas muestras en que ocurre la OTU están representadas por los puntos negros.

Hagamos que esta figura sea un poco más útil y agregue 1) Colores a los puntos para nuestros grupos de edad, 2) Tamaño a los puntos para mostrar la abundancia de OTU, y 3) Etiquetas de nivel de especies para las OTU

Ya es un poco difícil de leer. Puede ver una página más grande haciendo clic en “Zoom” encima de la figura. O bien, exporte la figura como PDF y guárdela como tamaño de página completa, 9.5x11.

Hay incluso más opciones personalizables en esta figura. Escriba? Plot_tree en la consola para ver la página de ayuda que explica todas las opciones.

Heat maps

There are some good options in both phyloseq and gplots to make heatmaps. We will go through phyloseq but know that the same things could be done in gplots with code specific to that package.

Mapas de calor

Hay algunas buenas opciones en phyloseq ygplots para hacer heatmaps. Pasaremos por phyloseq pero sabemos que las mismas cosas se podrían hacer engplots con un código específico para ese paquete.

OTUs

We’re going to just look at the 20 most abundant OTUs to make it more readable.

#Sort the OTUs by abundance and pick the top 20
top20OTU.names = names(sort(taxa_sums(physeq.tree), TRUE)[1:20])
#Cut down the physeq.tree data to only the top 10 Phyla
top20OTU = prune_taxa(top20OTU.names, physeq.tree)

We now see that we only have 20 taxa

top20OTU

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20 taxa and 24 samples ]
## sample_data() Sample Data:       [ 24 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 20 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 20 tips and 19 internal nodes ]

First, you can make a heatmap of OTU abundance across all samples

plot_heatmap(top20OTU)

## Warning: Transformation introduced infinite values in discrete y-axis

And grouped by our age groups

plot_heatmap(top20OTU, sample.label="AgeGroup", sample.order="AgeGroup")

## Warning: Transformation introduced infinite values in discrete y-axis

We can label the OTU taxa

plot_heatmap(top20OTU, sample.label="AgeGroup", sample.order="AgeGroup", taxa.label="Genus")

## Warning: Transformation introduced infinite values in discrete y-axis

And group OTUs within the same Phyla

plot_heatmap(top20OTU, sample.label="AgeGroup", sample.order="AgeGroup", taxa.label="Genus", taxa.order="Phylum")

## Warning: Transformation introduced infinite values in discrete y-axis

We can also change the colors (white -> purple), including the 0s/NAs (grey).

plot_heatmap(top20OTU, sample.label="AgeGroup", sample.order="AgeGroup", taxa.label="Genus", taxa.order="Phylum", low="white", high="purple", na.value="grey")

## Warning: Transformation introduced infinite values in discrete y-axis

You can also have R automatically group your OTUs and samples by beta-diversity. This may yield the most easily interpreted heatmap but if you have a specific research question that is better addressed by your own ordering (like our age groups above), you should stick with that. We’ll show Bray-Curtis as an example. Other options are

bray
jaccard
wunifrac
uwunifrac

plot_heatmap(top20OTU, "NMDS", "bray", title="Bray-Curtis")

## Warning: Transformation introduced infinite values in discrete y-axis

OTUs

Solo vamos a ver las 20 OTU más abundantes para hacerlo más legible.

Ahora vemos que solo tenemos 20 taxones

En primer lugar, puede hacer un mapa de calor de la abundancia de OTU en todas las muestras

Y agrupados por nuestros grupos de edad

Podemos etiquetar los taxones OTU

Y OTU grupales dentro del mismo Phyla

También podemos cambiar los colores (blanco -> violeta), incluidos los 0s / NA (gris).

También puede hacer que R agrupe automáticamente sus OTU y muestras por beta-diversity. Esto puede producir el mapa de calor interpretado más fácilmente, pero si tiene una pregunta de investigación específica que se solucione mejor con su propio pedido (como nuestros grupos de edad anteriores), debe apegarse a eso. Le mostraremos a Bray-Curtis como un ejemplo. Otras opciones son

bray
jaccard
wunifrac
uwunifrac

Beta-diversity

The other common use for heatmaps is to show distances between samples (i.e. beta-diversity) similar to what is shown in nMDS. We have all of the same metric options as we did for nMDS.

We do not want to use the plot_heatmap function from phyloseq because it requires the input of a physeq object. Instead, we can use our distance matrices as inputs for a gplots command. This command will automatically group samples by similarity (trees)

#Bray-Curtis
heatmap.2(as.matrix(BC.dist))

#UniFrac
heatmap.2(as.matrix(wUF.dist))

You could also change the colors

#Rainbow colors
rc <- rainbow(nrow(as.matrix(BC.dist)), start=0, end=0.9)
heatmap.2(as.matrix(BC.dist), col=rc)

As always, for further customization, explore with ?heatmap.2

El otro uso común para heatmaps es mostrar las distancias entre las muestras (* i.e. * beta-diversity) similar a lo que se muestra en nMDS. Tenemos todas las mismas opciones métricas que usamos para nMDS.

No queremos utilizar la función plot_heatmap dephyloseq porque requiere la entrada de un objeto physeq. En cambio, podemos usar nuestras matrices de distancia como entradas para un comando gplots. Este comando agrupará muestras automáticamente por similitud (árboles)

También puedes cambiar los colores

Como siempre, para una mayor personalización, explore con ?heatmap.2

Venn diagrams

Venn diagram of three samples: 5017.2w.F, 5017.8w.F, and 5017.1yr.F

Create a list of OTUs that occur (count > 0) in each sample.

We select for the row by name with OTU.clean[“name”,]
We select the columns with a value >0 with OTU.clean[,apply()]

OTU.5017.2w = colnames(OTU.clean["5017.2w.F", apply(OTU.clean["5017.2w.F",], MARGIN=2, function(x) any(x >0))])

OTU.5017.8w = colnames(OTU.clean["5017.8w.F", apply(OTU.clean["5017.8w.F",], MARGIN=2, function(x) any(x >0))])

OTU.5017.1yr = colnames(OTU.clean["5017.1yr.F",apply(OTU.clean["5017.1yr.F",], MARGIN=2, function(x) any(x >0))])

We can then use these lists of OTUs to plot a Venn diagram with venn() from the gplots package

par(mfrow=c(1,1))
venn(list(OTU.5017.2w, OTU.5017.8w, OTU.5017.1yr))

We can also do this for our age groups by selecting all samples where meta$AgeGroup = 2w, 8w, or 1yr

OTU.2w = colnames(OTU.clean[meta$AgeGroup == "2w", apply(OTU.clean[meta$AgeGroup == "2w",], MARGIN=2, function(x) any(x >0))])

OTU.8w = colnames(OTU.clean[meta$AgeGroup == "8w", apply(OTU.clean[meta$AgeGroup == "8w",], MARGIN=2, function(x) any(x >0))])

OTU.1yr = colnames(OTU.clean[meta$AgeGroup == "1yr", apply(OTU.clean[meta$AgeGroup == "1yr",], MARGIN=2, function(x) any(x >0))])

And plot

par(mfrow=c(1,1))
venn(list(OTU.2w, OTU.8w, OTU.1yr))

These are not the prettiest Venns, but they are the quickest way to calculate the values within a Venn.

Once you have these, you can use the VennDiagram package for more pretty graphing options. For example, the age groups venns would be

plot.new()
par(mfrow=c(1,1), mar=c(0,0,0,0))

draw.triple.venn(area1 = 385+58+71+320, area2 = 801+190+320+71, area3 = 3177+190+58+71, n12 = 320+71, n23 = 190+71, n13 = 58+71, n123 = 71, category = c("2w", "8w", "1yr"), lty = "blank", fill = c("green", "red", "blue"))

## (polygon[GRID.polygon.1572], polygon[GRID.polygon.1573], polygon[GRID.polygon.1574], polygon[GRID.polygon.1575], polygon[GRID.polygon.1576], polygon[GRID.polygon.1577], text[GRID.text.1578], text[GRID.text.1579], text[GRID.text.1580], text[GRID.text.1581], text[GRID.text.1582], text[GRID.text.1583], text[GRID.text.1584], text[GRID.text.1585], text[GRID.text.1586], text[GRID.text.1587])

Or we can export the OTU lists and make Venns with this online tool http://bioinformatics.psb.ugent.be/webtools/Venn/. This tool is handy in that is gives you the list of OTUs within the Venn sections so that you can see which specific bacteria are shared.

write.table(OTU.2w, "OTU.2w.csv", sep=",", row.names=FALSE, col.names=FALSE)
write.table(OTU.8w, "OTU.8w.csv", sep=",", row.names=FALSE, col.names=FALSE)
write.table(OTU.1yr, "OTU.1yr.csv", sep=",", row.names=FALSE, col.names=FALSE)

Diagramas de Venn

Diagrama de Venn de tres muestras: 5017.2w.F, 5017.8w.F y 5017.1yr.F

Crea una lista de OTU que ocurren (cuenta> 0) en cada muestra.

Seleccionamos para la fila por nombre con * OTU.clean [“name”,] *
Seleccionamos las columnas con un valor> 0 con * OTU.clean [, apply ()] *

Entonces podemos usar estas listas de OTUs para trazar un diagrama de Venn con venn () del paquete gplots

También podemos hacer esto para nuestros grupos de edad seleccionando todas las muestras donde meta $ AgeGroup = 2w, 8w, o 1yr

Y trama

Estos no son los Venns más bonitos, pero son la forma más rápida de calcular los valores dentro de un Venn.

Una vez que tenga esto, puede usar el paquete VennDiagram para obtener más opciones de gráficos. Por ejemplo, los venns de los grupos de edad serían

O podemos exportar las listas de OTU y crear Venns con esta herramienta en línea http://bioinformatics.psb.ugent.be/webtools/Venn/. Esta herramienta es útil porque le da la lista de OTU dentro de las secciones de Venn para que pueda ver qué bacterias específicas se comparten.

Networks

OTUs

You can plot the distances between OTUs as a network. It would be an unreadable mess to plot all the OTUs in our data set, so we will just use the smaller prevotella data set.

plot_net(prevotella, color="Species", type="taxa")

For co-occurrence networks of OTUs, I recommend Gephi or Cytoscape. Thus far, I have not found an R package comparable to these other programs.

Beta-diversity

You can also plot beta-diversity as a network where the edges (lines) are the distances between samples. All metrics we’ve used here are supported (bray, jaccard, wunifrac, uwunifrac)

plot_net(physeq.tree, color="AgeGroup", distance="bray")

Redes

OTUs

Puede trazar las distancias entre OTU como una red. Sería un desastre ilegible trazar todas las OTU en nuestro conjunto de datos, por lo que solo utilizaremos el conjunto de datos prevotella más pequeño.

Para redes de OTU simultáneas, recomiendo [Gephi] (https://gephi.org/) o [Cytoscape] (http://www.cytoscape.org/). Hasta ahora, no he encontrado un paquete R comparable a estos otros programas.

Beta-diversity

También puede trazar la diversidad beta como una red donde los bordes (líneas) son las distancias entre las muestras. Todas las métricas que hemos usado aquí son compatibles (rebuzno, jaccard, wunifrac, uwunifrac)

Publication figures

Annotating figures

If you want to add stars/letters to indicate significance or denote regions of a plot, you can do so using ggplot2 and cowplot. Here is a link to the cowplot documentation: https://cran.r-project.org/web/packages/cowplot/cowplot.pdf. Alternatively, you can do so post-export using GIMP, Photoshop, Illustrator, MS Paint, KidPix, crayons, etc. like I do. It is much easier and gives more control and fewer laptops thrown across the room because your asterisks are slightly out of alignment and you don’t know why.

Publicaciones

Anotación de cifras

Si desea agregar estrellas / letras para indicar el significado o denotar regiones de un trazado, puede hacerlo usando ggplot2 ycowplot. Aquí hay un enlace a la documentación cowplot: https://cran.r-project.org/web/packages/cowplot/cowplot.pdf. Alternativamente, puede hacerlo después de la exportación usando GIMP, Photoshop, Illustrator, MS Paint, KidPix, crayones, etc. como yo. Es mucho más fácil y le da más control y menos laptops lanzadas a través de la sala porque sus asteriscos están ligeramente desalineados y usted no sabe por qué.

Exporting figures

Once you have a figure you want to include in a publication, there are a number of ways to export it out of R. You can use the Export functionality on the menu within the Plots window, but this often does not result in high enough resolution.

Ideally, you want to save in PostScript (.ps) or PDF (.pdf) formats because they are vector-based, meaning they are not any specific dpi and do not get blurry when zoomed in. Other formats (PNG, JPG, BMP, TIFF) are pixel-based formats (little square dots) and can become jagged when zoomed in.

If you have issues getting a specific font to work, try installing and loading the package extrafont.

PostScript

Here, we will use postscript to export as a .ps. This function uses

width, height: in inches unless otherwise specified with units=
horizontal: TRUE = landscape, FALSE = portrait
colormodel: RGB, CMYK, and others
family: Font to be used within figures

Then we add layout if we have more than one plot within the overall figure.

matrix:
- A list of how many figures there are. For 2, it is c(1,2). For 4, it is c(1,2,3,4)
- Then the number of rows, columns the figures should be oriented in
widths: A list of scalars of how large each figure should be in width.
heights: A list of scalars of how large each figure should be in height.

postscript("Fig1.ps", width = 7, height = 3, horizontal = FALSE, colormodel = "rgb", family = "ArialMT")
 
layout(matrix(c(1,2), 1, 2), widths=c(3,2), heights=c(1,1))
 
plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, display="sites", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])
 
boxplot(shannon ~ AgeGroup.ord, data=meta, main="Diversity", ylab="Shannon's diversity", col=c("green", "red", "blue"))
 
dev.off()

## png 
##   2

To open the resulting .ps file:

Open it directly in Adobe Illustrator (vectors are preserved)
On a Mac, double-clicking on it will convert it automatically into a PDF and will open automatically into Preview.
On Windows, it depends on how “file associations” are set-up. Typically the file would need some transformation on a “standard” Windows computer before it can be used. If Adobe software is installed, it could run via Distiller to convert the .ps to a PDF.

PDF

To export directly to a PDF, we will use pdf

pdf("Fig1.pdf", width = 7, height = 3,  colormodel = "rgb", family = "ArialMT")
 
layout(matrix(c(1,2), 1, 2), widths=c(3,2), heights=c(1,1))
 
plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, display="sites", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])
 
boxplot(shannon ~ AgeGroup.ord, data=meta, main="Diversity", ylab="Shannon's diversity", col=c("green", "red", "blue"))
 
dev.off()

## png 
##   2

PNG

PNG is pixel-based so it may get blurry if not at high enough resolution. The exact resolution can be specified by giving the dpi in res=

png("Fig1.png", width = 7, height = 3, units='in', res=300)
 
layout(matrix(c(1,2), 1, 2), widths=c(3,2), heights=c(1,1))
 
plot(BC.nmds, type="n", main="Bray-Curtis")
points(BC.nmds, display="sites", pch=20, col=c("blue", "green", "red")[meta$AgeGroup])
 
boxplot(shannon ~ AgeGroup.ord, data=meta, main="Diversity", ylab="Shannon's diversity", col=c("green", "red", "blue"))
 
dev.off()

## png 
##   2

Exportar figuras

Una vez que tiene una figura que desea incluir en una publicación, hay varias formas de exportarla desde R. Puede usar la funcionalidad Export en el menú dentro de la ventana de Trazado, pero esto a menudo no da como resultado una alta suficiente resolución

Idealmente, desea guardar en formatos PostScript (.ps) o PDF (.pdf) porque están basados en vectores, lo que significa que no tienen ningún dpi específico y no se borran al acercarse. Otros formatos (PNG, JPG, BMP) , TIFF) son formatos basados en píxeles (pequeños puntos cuadrados) y pueden hacerse irregulares cuando se amplía.

Si tiene problemas para que funcione una fuente específica, intente instalar y cargar el paquete extrafont.

Postscript

Aquí, usaremos postscript para exportar como.ps. Esta función usa

ancho, alto: en pulgadas a menos que se especifique lo contrario con unidades =
horizontal: TRUE = paisaje, FALSE = retrato
colormodel: RGB, CMYK y otros
familia: fuente para ser utilizada dentro de las figuras

Luego agregamos layout si tenemos más de un gráfico dentro de la figura global.

matriz:
- Una lista de cuántas figuras hay. Para 2, es c (1,2). Para 4, es c (1,2,3,4)
- Luego, el número de filas, columnas, las figuras deben estar orientadas en
anchos: una lista de escalares de cuán grande debe ser cada figura de ancho.
alturas: una lista de escalares de cuán grande debe ser cada figura en altura.

Para abrir el archivo .ps resultante:

Ábralo directamente en Adobe Illustrator (se conservan vectores)
En una Mac, al hacer doble clic en ella se convertirá automáticamente en PDF y se abrirá automáticamente en Vista previa.
En Windows, depende de cómo se configuren las “asociaciones de archivos”. Normalmente, el archivo necesitaría alguna transformación en una computadora con Windows “estándar” antes de que pueda ser utilizada. Si el software de Adobe está instalado, podría ejecutarse a través de Distiller para convertir .ps a PDF.

PDF

Para exportar directamente a un PDF, usaremos pdf

PNG

PNG está basado en píxeles, por lo que puede verse borroso si no tiene una resolución lo suficientemente alta. La resolución exacta se puede especificar dando los ppp en res =

Microbiota analysis in R, UCR Workshop 2018

Kim Dill-McFarland and Madison Cox

January 16 & 19, 2018

Tips for this workshop

Introduction

Goal

Data

Files

Get set up

Download and install

Organization

Load Packages

Load Data

Data manipulation

Clean up the data

OTU table

Taxonomy table

Metadata and SCFA tables

Order the data

Further data manipulation

Subsetting data

Remove rare taxa

Convert OTU counts to proportions

Set seed

Alpha-diversity

What are these metrics?

Explore alpha metrics

phyloseq calculations

vegan calculations

Compare methods

Alpha statistics

Categorical variables

Continuous variables

Mixed models

Repeated measure

Beta diversity

Visualization

OTU-based metrics

Ordination scatterplots

Ellipses

3D plots

Phylogentic-based metrics

Create physeq object

Ordination scatterplots

Ellipses

3D plots

Vectors for continuous variables

Statistically test beta-diversity

PERMANOVA

ANOSIM

2D variables

Beta dispersion

OTUs that differ by

Categorical variables

SIMPER

DESeq2

Continuous variables

miLineage

DESeq2

Correlations

Working in phyloseq

Other visualizations

ggplot2 syntax

Bar charts

Trees

Heat maps

OTUs

Beta-diversity

Venn diagrams

Networks

OTUs

Beta-diversity

Publication figures

Annotating figures

Exporting figures

PostScript

PDF

PNG

`phyloseq` calculations

`vegan` calculations

`DESeq2`

`miLineage`

`DESeq2`

Working in `phyloseq`

`ggplot2` syntax