This is to understand and reproduce the results presented in “Potential of fecal microbiota for early-stage detection of colorectal cancer” by Zeller et al. (2014) (Link to the paper). The current test before colonoscopy to test for colorectal cancer is the Fecal occult blood test (FOBT). Could analysis of the fecal microbiota be the better test? It might be that particular species have a striking effect as is the case with ulcers and Helicobacter pylori. Are there so differences in microbial abundances in patients with cancer and without? Zeller et al. analysed fecal microbiome data from different patient populations and countries.
Fecal metagenomes from 156 participants in France. Subsequently they underwent colonoscopy, so the diagnosis is known. This metadata is given in Supplementary Information Table S1 and Dataset S1). The abundance table is given in Supplementary Dataset S3. The first problem, the data did only include 141 participants, because all participants with large adenomas were kicked out, adenomas are precursors of cancer. The patients with small adenomas were put in the control group for later analyses. Unfortunately, this made it impossible to fully reproduce the results of their initial diversity analysis, which included all adenoma patients.
Diversity is a measure to see how many different species (OTUs) are in the smaples and how even they are distributed. Diversity gets bigger when there are more species and when the evennes increases. The shannon diversity of a sample in which pi stands for the abundance of OTUi is simply \(-sum(p_{i}*log(p_{i})\).
AbundTable <- read.csv2("SuppData3Abundance.csv")
# The Data was downloaded and then saved as a csv
# to correct for the headers
con <- file("SuppData3Abundance.csv")
open(con)
Headers <- readLines(con, n = 1)
close(con)
Headers <- strsplit(Headers, split = ';')
Headers[[1]][1] = "OTU"
Headers <- Headers[[1]]
names(AbundTable) <- Headers
They used the vegan package to calculate the shannon diversity. And despite the simple equation this is probably the easiest solution
require(vegan)
## Loading required package: vegan
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.2-1
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
require(xtable)
## Loading required package: xtable
To use the diversity function the table should only contain the samples and the OTU names should be rownames or colnames
RNames <- as.character(AbundTable$OTU)
AbundTable <- AbundTable[,2:ncol(AbundTable)]
row.names(AbundTable) <- RNames
# diversity() works by default on rows as samples so I transpose the AbundTable
TAbundTable <- t(AbundTable)
class(TAbundTable) # so note that the Transpose changed it into a matrix
## [1] "matrix"
# just because I like dplyr and dataframes better
TAbundTable <- tbl_df(data.frame(TAbundTable))
# calculate the diversity
ShannonDiversity <- diversity(TAbundTable, index = "shannon")
head(ShannonDiversity)
## CCIS27304052ST-3-0 CCIS15794887ST-4-0 CCIS74726977ST-3-0
## 1.853916 1.537729 1.179761
## CCIS16561622ST-4-0 CCIS79210440ST-3-0 CCIS82507866ST-3-0
## 1.920942 1.781650 2.055223
So for each sample (given with the sample ID) there is a shannon diversity, the higher the richer and evener the sample.
Loading of the patient data
PatientData <- read.csv2("Pop. F Subject Metadata-Table 1.csv")
There is the mentioned problem, that the abundance table did not contain the smaples of the patients with large adenomas, so I also only kept the patient data that was used
PatientData <- PatientData[PatientData$Sample.ID %in% row.names(TAbundTable),]
# see the three left over diagnoses, large adenoma is gone
unique(PatientData$Diagnosis)
## [1] Normal Small adenoma Cancer
## Levels: Cancer Large adenoma Normal Small adenoma
As mentioned, this is unfortunate, since they used in the first figures all samples, my adenoma group is thus not the one they used. Defining the groups.
IDsControlEarly <- PatientData$Sample.ID[PatientData$Diagnosis == 'Normal']
IDsCRCEarly <- PatientData$Sample.ID[PatientData$Diagnosis == 'Cancer']
IDAdenomaEarly <- PatientData$Sample.ID[PatientData$Diagnosis %in% c('Small adenoma',
'Large adenoma')]
ShannonDivControls <- ShannonDiversity[names(ShannonDiversity) %in% IDsControlEarly]
ShannonDivCRC <- ShannonDiversity[names(ShannonDiversity) %in% IDsCRCEarly]
ShannonDivAdenoma <- ShannonDiversity[names(ShannonDiversity) %in% IDAdenomaEarly]
To be honest, not yet sure, why here Kruskal-Wallis.
ShannonDivList = list(ShannonDivControls, ShannonDivAdenoma, ShannonDivCRC)
kruskal.test(ShannonDivList) # so I get an even higher p-value but that is because of the wrong adenoma group.
##
## Kruskal-Wallis rank sum test
##
## data: ShannonDivList
## Kruskal-Wallis chi-squared = 0.7124, df = 2, p-value = 0.7003
Since ggplot needs a data.frame
ShDiv <- data.frame(Diversity = ShannonDiversity)
ShDiv$Group = ""
ShDiv$Group[names(ShannonDiversity) %in% IDsControlEarly] = "Control"
ShDiv$Group[names(ShannonDiversity) %in% IDsCRCEarly] = "CRC"
ShDiv$Group[names(ShannonDiversity) %in% IDAdenomaEarly] = "Adenoma"
# to get the same order of the levels
ShDiv$Group <- relevel(factor(ShDiv$Group), ref = "Control")
The plot comparable to their Figure S1D.
Tr <- ggplot(ShDiv, aes(x = factor(Group), y = Diversity))
Tr + geom_boxplot(notch = TRUE) +
xlab("") +
ylab("ShannonDiversity") +
theme_bw(12)
Note that they just counted all the species that were present (i.e. abundance not 0). They included the UNMAPPED
Richness <- rowSums(TAbundTable != 0)
# Richness <- specnumber(TAbundTable) # results in the same result!
# THEY USED THIS specnumber, ok
RichnessControls <- Richness[names(Richness) %in% IDsControlEarly]
RichnessCRC <- Richness[names(Richness) %in% IDsCRCEarly]
RichnessAdenoma <- Richness[names(Richness) %in% IDAdenomaEarly]
kruskal.test(list(RichnessControls,RichnessAdenoma,RichnessCRC))
##
## Kruskal-Wallis rank sum test
##
## data: list(RichnessControls, RichnessAdenoma, RichnessCRC)
## Kruskal-Wallis chi-squared = 0.3642, df = 2, p-value = 0.8335
# clearly not significant.
ShDiv$Richness <- Richness
Tr <- ggplot(ShDiv, aes(x = factor(Group), y = Richness))
Tr + geom_boxplot(notch = TRUE) +
xlab("") +
ylab("Species richness") +
theme_bw(12)
Since I only have the abundances of certain taxa, I assume species, I can only compare these. They first to phylum level comparisons, i.e. to check whether there are more or less Firmicutes and so on in the cancer than in the controls. For this I probably had to sum up the abundances of all species that belong to the different phyla. This I do not do, I just want to show how the Wilcoxon test is done, and why they used FDR, see Fig S2:
“Significance was determined using FDR-corrected pair-wise Wilcoxon tests with a cutoff of 0.1 on the adjusted p-values”
So why False Discovery Rate correction. You always compare the same samples, you have here 61 control samples, and 53 CRC samples. Let’s think in phylum level you first check if Fusobacteria are different. Then in the same samples if Proteobacteria are different and so on. That is exactly we tested jelly beans vs acne (abundances), now let’s find a correlation, none, check green jelly beans, then red and so on. (Statistical inference class coursera)
The species they found to be most different between cancer and neoplasia-free ( control) was: Fusobacterium nucleatum subsp. vincentii [1482]. I try to repeat what they did.
#I did not directly find that entry in my colnames of TAbundTable, so I searched
# for all with "Fusobacterium"
colnames(TAbundTable)[grep("Fusobacterium",colnames(TAbundTable))]
## [1] "Fusobacterium.gonidiaformans..1476."
## [2] "unnamed.Fusobacterium.sp..D12..1477."
## [3] "Fusobacterium.periodonticum..1478."
## [4] "Fusobacterium.nucleatum..1479."
## [5] "Fusobacterium.nucleatum..1480."
## [6] "unclassified.Fusobacterium..1481."
## [7] "unclassified.Fusobacterium..1482."
## [8] "Fusobacterium.ulcerans..1483."
## [9] "Fusobacterium.varium..1484."
## [10] "Fusobacterium.mortiferum..1485."
you see that there is only a unclassified Fusobacterium [1482]. I decided to do a simple Wilcoxon on that one.
Fuso1482 <- select(TAbundTable, contains("1482"))
names(Fuso1482)[1] = 'Value'
Fuso1482 <- select(TAbundTable, contains("1482"))
Fuso1482$Groups = ""
Fuso1482$Groups[row.names(Fuso1482) %in% IDsControlEarly] = "Control"
Fuso1482$Groups[row.names(Fuso1482) %in% IDsCRCEarly] = "CRC"
# leave the adenoma class out
Fuso1482 <- filter(Fuso1482, Groups != "")
Fuso1482$Groups <-factor(Fuso1482$Groups)
names(Fuso1482)[1] = 'Values'
Lets first see the data
par(mfrow = c(1,2))
boxplot(Values ~ Groups, data = Fuso1482)
boxplot(Values ~ Groups, data = Fuso1482, ylim = c(0, 0.00005))
So you see the species is almost not present in the control, correspondingly the wilcoxon test shows high significance (two sided here)
wilcox.test(Values ~ Groups, data = Fuso1482)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Values by Groups
## W = 2404, p-value = 4.41e-08
## alternative hypothesis: true location shift is not equal to 0
The p-value is lower than what they published but I have not yet done the FDR correction.
Note they do an abundance filtering first which I did not do here
DFWilcox <-TAbundTable
DFWilcox$Groups = ""
DFWilcox$Groups[row.names(DFWilcox) %in% IDsControlEarly] = "Control"
DFWilcox$Groups[row.names(DFWilcox) %in% IDsCRCEarly] = "CRC"
# leave the adenoma class out
DFWilcox <- filter(DFWilcox, Groups != "")
DFWilcox$Groups <-factor(DFWilcox$Groups)
MeansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MeansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0Controls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0CRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
pValues <- vector(mode='numeric', length = ncol(DFWilcox)-1)
for (i in 1:(ncol(DFWilcox)-1)){
CurrentData <- select(DFWilcox, i, Groups)
names(CurrentData)[1] <- "Value"
MeansControls[i] <- mean(CurrentData$Value[CurrentData$Groups == 'Control'])
MeansCRCs[i] <- mean(CurrentData$Value[CurrentData$Groups == 'CRC'])
MediansControls[i] <- median(CurrentData$Value[CurrentData$Groups == 'Control'])
MediansCRCs[i] <- median(CurrentData$Value[CurrentData$Groups == 'CRC'])
CurrentDatauden0 <- CurrentData[CurrentData$Value !=0,]
Meansuden0Controls[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'Control'])
Meansuden0CRCs[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'CRC'])
if (dim(CurrentDatauden0)[1] == 0){
pValues[i] <- NaN
} else {
pValues[i] <- wilcox.test(Value~Groups, data = CurrentData)$p.value
}
}
MeansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MeansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0Controls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
Meansuden0CRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansControls <- vector(mode='numeric', length = ncol(DFWilcox)-1)
MediansCRCs <- vector(mode='numeric', length = ncol(DFWilcox)-1)
pValues <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZeroProportionControl <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZeroProportionCRC <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZerosControl <- vector(mode='numeric', length = ncol(DFWilcox)-1)
ZerosCRC <- vector(mode='numeric', length = ncol(DFWilcox)-1)
for (i in 1:(ncol(DFWilcox)-1)){
CurrentData <- select(DFWilcox, i, Groups)
names(CurrentData)[1] <- "Value"
MeansControls[i] <- mean(CurrentData$Value[CurrentData$Groups == 'Control'])
MeansCRCs[i] <- mean(CurrentData$Value[CurrentData$Groups == 'CRC'])
MediansControls[i] <- median(CurrentData$Value[CurrentData$Groups == 'Control'])
MediansCRCs[i] <- median(CurrentData$Value[CurrentData$Groups == 'CRC'])
CurrentDatauden0 <- CurrentData[CurrentData$Value !=0,]
Meansuden0Controls[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'Control'])
Meansuden0CRCs[i] <- mean(CurrentDatauden0$Value[CurrentDatauden0$Groups == 'CRC'])
ZeroProportionControl[i] <-
sum(CurrentData$Value[CurrentData$Groups == 'Control'] == 0)/
length(CurrentData$Value[CurrentData$Groups == 'Control'])
ZeroProportionCRC[i] <-
sum(CurrentData$Value[CurrentData$Groups == 'CRC'] == 0)/
length(CurrentData$Value[CurrentData$Groups == 'CRC'])
ZerosControl[i] <- sum(CurrentData$Value[CurrentData$Groups == 'Control'] == 0)
ZerosCRC[i] <- sum(CurrentData$Value[CurrentData$Groups == 'CRC'] == 0)
if (dim(CurrentDatauden0)[1] == 0){
pValues[i] <- NaN
} else {
pValues[i] <- wilcox.test(Value~Groups, data = CurrentData)$p.value
}
}
# Note, there were 61 control samples and 53 CRC samples!
## adjusting the p-values
PValuesBH <- p.adjust(pValues,method="BH")
PValuesFDR <- p.adjust(pValues,method="fdr")
# they are identical, see
identical(PValuesBH,PValuesFDR)
## [1] TRUE
Arranging the data in a data.frame and finding the significant ones and ordering them
Sign0k1Cutoff <- vector(mode = 'character', length = length(PValuesFDR))
Sign0k1Cutoff[PValuesFDR<.1] <- 'yes'
Sign0k1Cutoff[!PValuesFDR<.1] <- 'no'
Sign0k1Cutoff[is.na(PValuesFDR)] <- 'no'
SummaryWilcoxon <- data.frame(Sign0k1Cutoff, PValuesFDR,pValues, ZerosControl,
PerControl = 100*(1-ZeroProportionControl), ZerosCRC,
PerCRC = 100*(1-ZeroProportionCRC), MeansControls,MeansCRCs,
Meansuden0Controls, Meansuden0CRCs,
MediansControls, MediansCRCs)
SummaryWilcoxon <- tbl_df(SummaryWilcoxon)
# hadley wickham says row.names sucks, and all dplyr ignores them, so
SummaryWilcoxon$OTUs <- row.names(AbundTable)
SummaryWilcoxon <- select(SummaryWilcoxon, OTUs, 1:13)
SignificantSpecies <- filter(SummaryWilcoxon, Sign0k1Cutoff == 'yes')
# I get only 16 significant species
SignificantSpecies$CRCtoControl = ""
SignificantSpecies <- mutate(SignificantSpecies, RatioMeans = MeansCRCs/MeansControls)
SignificantSpecies$CRCtoControl[SignificantSpecies$RatioMeans < 1] = "reduced"
SignificantSpecies$CRCtoControl[SignificantSpecies$RatioMeans > 1] = "increased"
SignificantSpecies <- select(SignificantSpecies, 1, 15, 3:14, 15)
SignificantSpecies <- arrange(SignificantSpecies, PValuesFDR)
Here the significant species I found in a table to be compared with Figure S2.
print(xtable(as.data.frame(SignificantSpecies[,1:8]),
digits = c(0,0,0,5,5,0,1,0,1)), comment = F, type = 'html')
| OTUs | CRCtoControl | PValuesFDR | pValues | ZerosControl | PerControl | ZerosCRC | PerCRC | |
|---|---|---|---|---|---|---|---|---|
| 1 | unclassified Fusobacterium [1482] | increased | 0.00003 | 0.00000 | 55 | 9.8 | 24 | 54.7 |
| 2 | unclassified Fusobacterium [1481] | increased | 0.00019 | 0.00000 | 51 | 16.4 | 23 | 56.6 |
| 3 | Fusobacterium nucleatum [1479] | increased | 0.00165 | 0.00001 | 59 | 3.3 | 34 | 35.8 |
| 4 | Pseudoflavonifractor capillosus [1579] | increased | 0.00273 | 0.00001 | 2 | 96.7 | 0 | 100.0 |
| 5 | Fusobacterium nucleatum [1480] | increased | 0.00835 | 0.00006 | 56 | 8.2 | 33 | 37.7 |
| 6 | Porphyromonas asaccharolytica [1056] | increased | 0.02480 | 0.00020 | 57 | 6.6 | 35 | 34.0 |
| 7 | unnamed Ruminococcaceae bacterium D16 [1580] | increased | 0.04463 | 0.00043 | 2 | 96.7 | 2 | 96.2 |
| 8 | butyrate-producing bacterium [1595] | reduced | 0.04463 | 0.00048 | 1 | 98.4 | 2 | 96.2 |
| 9 | unnamed Ruminococcus sp. SR1/5 [1621] | reduced | 0.04463 | 0.00053 | 0 | 100.0 | 1 | 98.1 |
| 10 | Eubacterium hallii [1597] | reduced | 0.04604 | 0.00061 | 0 | 100.0 | 4 | 92.5 |
| 11 | Prevotella nigrescens [1069] | increased | 0.05113 | 0.00088 | 60 | 1.6 | 42 | 20.8 |
| 12 | Eubacterium eligens [1627] | reduced | 0.05113 | 0.00077 | 2 | 96.7 | 6 | 88.7 |
| 13 | Eikenella corrodens [448] | increased | 0.05113 | 0.00087 | 61 | 0.0 | 44 | 17.0 |
| 14 | Peptostreptococcus stomatis [1530] | increased | 0.05289 | 0.00103 | 28 | 54.1 | 18 | 66.0 |
| 15 | unnamed Ruminococcus sp. 5_1_39BFAA [1620] | reduced | 0.05289 | 0.00105 | 0 | 100.0 | 1 | 98.1 |
| 16 | Leptotrichia hofstadii [1488] | increased | 0.08468 | 0.00179 | 61 | 0.0 | 45 | 15.1 |
# as.data.frame was needed because dplyr breaks with xtable
So I find many that they also find and in the same direction (increased, reduced), however, my pvalues are a bit bigger. Sometimes they are different in order, e.g. 1580 vs 1621. I find 448 they don’t, and they find quite many I do not. The reason for this must the the filtering they did.