Background

This is to understand and reproduce the results presented in “Potential of fecal microbiota for early-stage detection of colorectal cancer” by Zeller et al. (2014) (Link to the paper). The current test before colonoscopy to test for colorectal cancer is the Fecal occult blood test (FOBT). Could analysis of the fecal microbiota be the better test? It might be that particular species have a striking effect as is the case with ulcers and Helicobacter pylori. Are there so differences in microbial abundances in patients with cancer and without? Zeller et al. analysed fecal microbiome data from different patient populations and countries.

French population Study

Background of the data

Fecal metagenomes from 156 participants in France. Subsequently they underwent colonoscopy, so the diagnosis is known. This metadata is given in Supplementary Information Table S1 and Dataset S1). The abundance table is given in Supplementary Dataset S3. The first problem, the data did only include 141 participants, because all participants with large adenomas were kicked out, adenomas are precursors of cancer. The patients with small adenomas were put in the control group for later analyses. Unfortunately, this made it impossible to fully reproduce the results of their initial diversity analysis, which included all adenoma patients.

Diversity (and richness) differences between the groups

Calculating Shannon Diversity

Diversity is a measure to see how many different species (OTUs) are in the smaples and how even they are distributed. Diversity gets bigger when there are more species and when the evennes increases. The shannon diversity of a sample in which pi stands for the abundance of OTUi is simply \(-sum(p_{i}*log(p_{i})\).

AbundTable <- read.csv2("SuppData3Abundance.csv")
# The Data was downloaded and then saved as a csv

# to correct for the headers
con <- file("SuppData3Abundance.csv")
open(con)
Headers <- readLines(con, n = 1)
close(con)
Headers <- strsplit(Headers, split = ';')
Headers[[1]][1] = "OTU"
Headers <- Headers[[1]]

names(AbundTable) <- Headers

They used the vegan package to calculate the shannon diversity. And despite the simple equation this is probably the easiest solution

require(vegan)
## Loading required package: vegan
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.2-1
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
require(xtable)
## Loading required package: xtable

To use the diversity function the table should only contain the samples and the OTU names should be rownames or colnames

RNames <- as.character(AbundTable$OTU)
AbundTable <- AbundTable[,2:ncol(AbundTable)]
row.names(AbundTable) <- RNames
# diversity() works by default on rows as samples so I transpose the AbundTable
TAbundTable <- t(AbundTable)
class(TAbundTable) # so note that the Transpose changed it into a matrix
## [1] "matrix"
# just because I like dplyr and dataframes better
TAbundTable <- tbl_df(data.frame(TAbundTable))

# calculate the diversity
ShannonDiversity <- diversity(TAbundTable, index = "shannon")

head(ShannonDiversity)
## CCIS27304052ST-3-0 CCIS15794887ST-4-0 CCIS74726977ST-3-0 
##           1.853916           1.537729           1.179761 
## CCIS16561622ST-4-0 CCIS79210440ST-3-0 CCIS82507866ST-3-0 
##           1.920942           1.781650           2.055223

So for each sample (given with the sample ID) there is a shannon diversity, the higher the richer and evener the sample.

Establishing the groups, cancer, adenoma, control

Loading of the patient data

PatientData <- read.csv2("Pop. F Subject Metadata-Table 1.csv")

There is the mentioned problem, that the abundance table did not contain the smaples of the patients with large adenomas, so I also only kept the patient data that was used

PatientData <- PatientData[PatientData$Sample.ID %in% row.names(TAbundTable),]

# see the three left over diagnoses, large adenoma is gone
unique(PatientData$Diagnosis)
## [1] Normal        Small adenoma Cancer       
## Levels: Cancer Large adenoma Normal Small adenoma

As mentioned, this is unfortunate, since they used in the first figures all samples, my adenoma group is thus not the one they used. Defining the groups.

IDsControlEarly <- PatientData$Sample.ID[PatientData$Diagnosis == 'Normal']
IDsCRCEarly <- PatientData$Sample.ID[PatientData$Diagnosis == 'Cancer']
IDAdenomaEarly <- PatientData$Sample.ID[PatientData$Diagnosis %in% c('Small adenoma',
                                                                     'Large adenoma')]

ShannonDivControls <- ShannonDiversity[names(ShannonDiversity) %in% IDsControlEarly]
ShannonDivCRC <- ShannonDiversity[names(ShannonDiversity) %in% IDsCRCEarly]
ShannonDivAdenoma <- ShannonDiversity[names(ShannonDiversity) %in% IDAdenomaEarly]

Checking for differences in diversity of the groups with Kruskal-Wallis test

To be honest, not yet sure, why here Kruskal-Wallis.

ShannonDivList = list(ShannonDivControls, ShannonDivAdenoma, ShannonDivCRC)

kruskal.test(ShannonDivList) # so I get an even higher p-value but that is because of the wrong adenoma group.
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ShannonDivList
## Kruskal-Wallis chi-squared = 0.7124, df = 2, p-value = 0.7003

Plot of Shannon Diversity similar to Fig S1D

Since ggplot needs a data.frame

ShDiv <- data.frame(Diversity = ShannonDiversity)
ShDiv$Group = ""
ShDiv$Group[names(ShannonDiversity) %in% IDsControlEarly] = "Control"
ShDiv$Group[names(ShannonDiversity) %in% IDsCRCEarly] = "CRC"
ShDiv$Group[names(ShannonDiversity) %in% IDAdenomaEarly] = "Adenoma"
# to get the same order of the levels
ShDiv$Group <- relevel(factor(ShDiv$Group), ref = "Control")

The plot comparable to their Figure S1D.

Tr <- ggplot(ShDiv, aes(x = factor(Group), y = Diversity))
Tr + geom_boxplot(notch = TRUE) +
  xlab("") + 
  ylab("ShannonDiversity") +
  theme_bw(12)

Calculating and comparing species richness separately (Fig S1E)

Note that they just counted all the species that were present (i.e. abundance not 0). They included the UNMAPPED

Richness <- rowSums(TAbundTable != 0)
# Richness <- specnumber(TAbundTable) # results in the same result!
# THEY USED THIS specnumber, ok
RichnessControls <- Richness[names(Richness) %in% IDsControlEarly]
RichnessCRC <- Richness[names(Richness) %in% IDsCRCEarly]
RichnessAdenoma <- Richness[names(Richness) %in% IDAdenomaEarly]

kruskal.test(list(RichnessControls,RichnessAdenoma,RichnessCRC))
## 
##  Kruskal-Wallis rank sum test
## 
## data:  list(RichnessControls, RichnessAdenoma, RichnessCRC)
## Kruskal-Wallis chi-squared = 0.3642, df = 2, p-value = 0.8335
# clearly not significant.
ShDiv$Richness <- Richness

Tr <- ggplot(ShDiv, aes(x = factor(Group), y = Richness))
Tr + geom_boxplot(notch = TRUE) +
  xlab("") + 
  ylab("Species richness") +
  theme_bw(12)

Differences in the abundance of specific taxa between the groups (Fig S2)

An individual example

Since I only have the abundances of certain taxa, I assume species, I can only compare these. They first to phylum level comparisons, i.e. to check whether there are more or less Firmicutes and so on in the cancer than in the controls. For this I probably had to sum up the abundances of all species that belong to the different phyla. This I do not do, I just want to show how the Wilcoxon test is done, and why they used FDR, see Fig S2:

“Significance was determined using FDR-corrected pair-wise Wilcoxon tests with a cutoff of 0.1 on the adjusted p-values”

So why False Discovery Rate correction. You always compare the same samples, you have here 61 control samples, and 53 CRC samples. Let’s think in phylum level you first check if Fusobacteria are different. Then in the same samples if Proteobacteria are different and so on. That is exactly we tested jelly beans vs acne (abundances), now let’s find a correlation, none, check green jelly beans, then red and so on. (Statistical inference class coursera)

The species they found to be most different between cancer and neoplasia-free ( control) was: Fusobacterium nucleatum subsp. vincentii [1482]. I try to repeat what they did.

#I did not directly find that entry in my colnames of TAbundTable, so I searched
# for all with "Fusobacterium"
colnames(TAbundTable)[grep("Fusobacterium",colnames(TAbundTable))]
##  [1] "Fusobacterium.gonidiaformans..1476." 
##  [2] "unnamed.Fusobacterium.sp..D12..1477."
##  [3] "Fusobacterium.periodonticum..1478."  
##  [4] "Fusobacterium.nucleatum..1479."      
##  [5] "Fusobacterium.nucleatum..1480."      
##  [6] "unclassified.Fusobacterium..1481."   
##  [7] "unclassified.Fusobacterium..1482."   
##  [8] "Fusobacterium.ulcerans..1483."       
##  [9] "Fusobacterium.varium..1484."         
## [10] "Fusobacterium.mortiferum..1485."

you see that there is only a unclassified Fusobacterium [1482]. I decided to do a simple Wilcoxon on that one.

Fuso1482 <- select(TAbundTable, contains("1482"))
names(Fuso1482)[1] = 'Value'
Fuso1482 <- select(TAbundTable, contains("1482"))

Fuso1482$Groups = ""
Fuso1482$Groups[row.names(Fuso1482) %in% IDsControlEarly] = "Control"
Fuso1482$Groups[row.names(Fuso1482) %in% IDsCRCEarly] = "CRC"
# leave the adenoma class out
Fuso1482 <- filter(Fuso1482, Groups != "")
Fuso1482$Groups <-factor(Fuso1482$Groups)
names(Fuso1482)[1] = 'Values'

Lets first see the data

par(mfrow = c(1,2))
boxplot(Values ~ Groups, data = Fuso1482)
boxplot(Values ~ Groups, data = Fuso1482, ylim = c(0, 0.00005))

So you see the species is almost not present in the control, correspondingly the wilcoxon test shows high significance (two sided here)

wilcox.test(Values ~ Groups, data = Fuso1482)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Values by Groups
## W = 2404, p-value = 4.41e-08
## alternative hypothesis: true location shift is not equal to 0

The p-value is lower than what they published but I have not yet done the FDR correction.

Doing Wilcoxon on all taxa with FDR correction

Note they do an abundance filtering first which I did not do here

Wilcoxon on all OTUs

DFWilcox <-TAbundTable
DFWilcox$Groups = ""
DFWilcox$Groups[row.names(DFWilcox) %in% IDsControlEarly] = "Control"
DFWilcox$Groups[row.names(DFWilcox) %in% IDsCRCEarly] = "CRC"
# leave the adenoma class out
DFWilcox <- filter(DFWilcox, Groups != "")
DFWilcox$Groups <-factor(DFWilcox$Groups)

MeansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MeansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0Controls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0CRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
pValues <- vector(mode='numeric', length = ncol(DFWilcox)-1)

for (i in 1:(ncol(DFWilcox)-1)){
        CurrentData <- select(DFWilcox, i, Groups)
        names(CurrentData)[1] <- "Value"
        MeansControls[i] <- mean(CurrentData$Value[CurrentData$Groups == 'Control'])
        MeansCRCs[i] <- mean(CurrentData$Value[CurrentData$Groups == 'CRC'])
        MediansControls[i] <- median(CurrentData$Value[CurrentData$Groups == 'Control'])
        MediansCRCs[i] <- median(CurrentData$Value[CurrentData$Groups == 'CRC'])
        CurrentDatauden0 <- CurrentData[CurrentData$Value !=0,]
        Meansuden0Controls[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'Control'])
        Meansuden0CRCs[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'CRC'])
        if (dim(CurrentDatauden0)[1] == 0){
                pValues[i] <- NaN
        } else {
                pValues[i] <- wilcox.test(Value~Groups, data = CurrentData)$p.value
        }
}

MeansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MeansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0Controls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0CRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
pValues <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZeroProportionControl <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZeroProportionCRC <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZerosControl <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZerosCRC <- vector(mode='numeric', length = ncol(DFWilcox)-1)

for (i in 1:(ncol(DFWilcox)-1)){
        CurrentData <- select(DFWilcox, i, Groups)
        names(CurrentData)[1] <- "Value"
        MeansControls[i] <- mean(CurrentData$Value[CurrentData$Groups == 'Control'])
        MeansCRCs[i] <- mean(CurrentData$Value[CurrentData$Groups == 'CRC'])
        MediansControls[i] <- median(CurrentData$Value[CurrentData$Groups == 'Control'])
        MediansCRCs[i] <- median(CurrentData$Value[CurrentData$Groups == 'CRC'])
        CurrentDatauden0 <- CurrentData[CurrentData$Value !=0,]
        Meansuden0Controls[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'Control'])
        Meansuden0CRCs[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'CRC'])
        ZeroProportionControl[i] <- 
                sum(CurrentData$Value[CurrentData$Groups == 'Control'] == 0)/
                length(CurrentData$Value[CurrentData$Groups == 'Control'])
        ZeroProportionCRC[i] <- 
                sum(CurrentData$Value[CurrentData$Groups == 'CRC'] == 0)/
                length(CurrentData$Value[CurrentData$Groups == 'CRC'])
        ZerosControl[i] <- sum(CurrentData$Value[CurrentData$Groups == 'Control'] == 0)
        ZerosCRC[i] <- sum(CurrentData$Value[CurrentData$Groups == 'CRC'] == 0)
        if (dim(CurrentDatauden0)[1] == 0){
                pValues[i] <- NaN
        } else {
                pValues[i] <- wilcox.test(Value~Groups, data = CurrentData)$p.value
        }
}
# Note, there were 61 control samples and 53 CRC samples! 

Correcting the pValues with FDR and presenting the OTUs with significant differences

## adjusting the p-values
PValuesBH <- p.adjust(pValues,method="BH")
PValuesFDR <- p.adjust(pValues,method="fdr")

# they are identical, see
identical(PValuesBH,PValuesFDR)
## [1] TRUE

Arranging the data in a data.frame and finding the significant ones and ordering them

Sign0k1Cutoff <- vector(mode = 'character', length = length(PValuesFDR))

Sign0k1Cutoff[PValuesFDR<.1] <- 'yes'
Sign0k1Cutoff[!PValuesFDR<.1] <- 'no'
Sign0k1Cutoff[is.na(PValuesFDR)] <- 'no'


SummaryWilcoxon <- data.frame(Sign0k1Cutoff, PValuesFDR,pValues, ZerosControl, 
                              PerControl = 100*(1-ZeroProportionControl), ZerosCRC,
                              PerCRC = 100*(1-ZeroProportionCRC), MeansControls,MeansCRCs,
                         Meansuden0Controls, Meansuden0CRCs,
                         MediansControls, MediansCRCs) 

SummaryWilcoxon <- tbl_df(SummaryWilcoxon)


# hadley wickham says row.names sucks, and all dplyr ignores them, so
SummaryWilcoxon$OTUs <- row.names(AbundTable)
SummaryWilcoxon <- select(SummaryWilcoxon, OTUs, 1:13)



SignificantSpecies <- filter(SummaryWilcoxon, Sign0k1Cutoff == 'yes')
# I get only 16 significant species
SignificantSpecies$CRCtoControl = ""

SignificantSpecies <- mutate(SignificantSpecies, RatioMeans = MeansCRCs/MeansControls)
SignificantSpecies$CRCtoControl[SignificantSpecies$RatioMeans < 1] = "reduced"
SignificantSpecies$CRCtoControl[SignificantSpecies$RatioMeans > 1] = "increased"

SignificantSpecies <- select(SignificantSpecies, 1, 15, 3:14, 15)

SignificantSpecies <- arrange(SignificantSpecies, PValuesFDR)

Here the significant species I found in a table to be compared with Figure S2.

print(xtable(as.data.frame(SignificantSpecies[,1:8]),
             digits = c(0,0,0,5,5,0,1,0,1)), comment = F, type = 'html')
OTUs CRCtoControl PValuesFDR pValues ZerosControl PerControl ZerosCRC PerCRC
1 unclassified Fusobacterium [1482] increased 0.00003 0.00000 55 9.8 24 54.7
2 unclassified Fusobacterium [1481] increased 0.00019 0.00000 51 16.4 23 56.6
3 Fusobacterium nucleatum [1479] increased 0.00165 0.00001 59 3.3 34 35.8
4 Pseudoflavonifractor capillosus [1579] increased 0.00273 0.00001 2 96.7 0 100.0
5 Fusobacterium nucleatum [1480] increased 0.00835 0.00006 56 8.2 33 37.7
6 Porphyromonas asaccharolytica [1056] increased 0.02480 0.00020 57 6.6 35 34.0
7 unnamed Ruminococcaceae bacterium D16 [1580] increased 0.04463 0.00043 2 96.7 2 96.2
8 butyrate-producing bacterium [1595] reduced 0.04463 0.00048 1 98.4 2 96.2
9 unnamed Ruminococcus sp. SR1/5 [1621] reduced 0.04463 0.00053 0 100.0 1 98.1
10 Eubacterium hallii [1597] reduced 0.04604 0.00061 0 100.0 4 92.5
11 Prevotella nigrescens [1069] increased 0.05113 0.00088 60 1.6 42 20.8
12 Eubacterium eligens [1627] reduced 0.05113 0.00077 2 96.7 6 88.7
13 Eikenella corrodens [448] increased 0.05113 0.00087 61 0.0 44 17.0
14 Peptostreptococcus stomatis [1530] increased 0.05289 0.00103 28 54.1 18 66.0
15 unnamed Ruminococcus sp. 5_1_39BFAA [1620] reduced 0.05289 0.00105 0 100.0 1 98.1
16 Leptotrichia hofstadii [1488] increased 0.08468 0.00179 61 0.0 45 15.1
# as.data.frame was needed because dplyr breaks with xtable

So I find many that they also find and in the same direction (increased, reduced), however, my pvalues are a bit bigger. Sometimes they are different in order, e.g. 1580 vs 1621. I find 448 they don’t, and they find quite many I do not. The reason for this must the the filtering they did.