It is about associations of data. Specifically here, the associations between gene expressions. Measures for association are correlation or as suggested here proportionality. They suggest that for the relative data we are working with proportionality is the better measure of association than correlation.
A key sentence from the supplementary information of the paper is:
in the absence of any other information or assumptions, correlations between relative values tell us nothing about relationships between the absolute values from which they were derived. We stress in the absence of any other information or assumptions
So of course it is about understanding their reasoning why correlation is shitty, but key is to always think, do we not have valid information or assumptions that make the critique points weaker??
I change their example which they reproduced from Section 1.7 of Aitchison’s to microbiome. Two scientists get the same fecal samples (3 individuals). There are in fact 4 OTUs in those samples, but scientist B only detects 3 while scientist A detects all four
ScientistA.RelAbundances <- 0.1 * data.frame(
OTU1 =c(1, 2, 3),
OTU2=c(2, 1, 3),
OTU3 =c(1, 1, 2),
OTU4 =c(6, 6, 2))
# So each row is a sample.
# these are the relative abundances, Lets say in total 1000 microbes,
# then the Total Abundances would be
TotalAbundances <- 1000*ScientistA.RelAbundances
# Obviously scientist B has a serious detection problem, anyway, his total
# abundance Table would look like
ScientistB.TotalAbundances <- TotalAbundances[,-4]
ScientistB.RelAbundances <- ScientistB.TotalAbundances/rowSums(ScientistB.TotalAbundances)
# So there is of course already here a real problem for scientist B, he gets
# completely different relative abundances, you should not miss key OTUs,
# but the critical point is
cor(ScientistA.RelAbundances)
## OTU1 OTU2 OTU3 OTU4
## OTU1 1.0000000 0.5000000 0.8660254 -0.8660254
## OTU2 0.5000000 1.0000000 0.8660254 -0.8660254
## OTU3 0.8660254 0.8660254 1.0000000 -1.0000000
## OTU4 -0.8660254 -0.8660254 -1.0000000 1.0000000
cor(ScientistB.RelAbundances)
## OTU1 OTU2 OTU3
## OTU1 1 -1 0
## OTU2 -1 1 0
## OTU3 0 0 1
So same samples, same detection precision, but because scientist B does not include one OTU the correlation between OTU1 and OTU2 (real 0.5) is -1 for him!! Correlation is thus not subcompositionally coherent, if you would use the OTU1/OTU2 ratio it would be, see
ScientistA.RelAbundances$OTU1/ScientistA.RelAbundances$OTU2
## [1] 0.5 2.0 1.0
ScientistB.RelAbundances$OTU1/ScientistB.RelAbundances$OTU2
## [1] 0.5 2.0 1.0
So as long as we work with scale invariant functions, or equivalently express all our statements about compositions in terms of ratios, we shall be subcompositionally coherent.
How much is that a problem for us?
The idea here is: You have three independent RV X, Y, Z. They should as such not correlate, but X/Z to Y/Z do.
set.seed(123)
X <- rnorm(1000, mean = 10, sd = 1)
Y <- rnorm(1000, mean = 10, sd = 1)
Z <- rnorm(1000, mean = 30, sd = 3)
cor(X,Y)
## [1] 0.08647944
cor(X/Z, Y/Z)
## [1] 0.540498
The explanation is easy, best see the plot in Figure S1. The ones that were divided by high numbers will now correlate with each other. NOTE that Z was clearly bigger than X and Y and that is also the reason why you do get this spurious correlation.
So is that a microbiome problem?
…at least in the abscence of other information or assumptions. This is the impressive Figure 1. Saying: if you have a clear correlation between the relative abundances of two OTUs, you know nothing about their absolute abundances. I reproduce this with their data.
df <- data.frame(x0=seq(0.3,0.6,by=0.05), y0=seq(0.4, 0.25, by=-0.025))
df <- transform(df, rep=1:nrow(df), slope=y0/x0, intercept=0)
M <- -1.5
C <- 1.75
df <- transform(df, x1=(C - intercept)/(slope - M))
df <- transform(df, y1=M*x1 + C)
M <- 2.5
C <- -2.5
df <- transform(df, x2=(C - intercept)/(slope - M))
df <- transform(df, y2=M*x2 + C)
M <- 0
C <- c(1.254, 1.163, 1.091, 1.069, 1.110, 1.300, 1.051)
df <- transform(df, x3=(C - intercept)/(slope - M))
df <- transform(df, y3=M*x3 + C)
dfm <- melt(df[,grep("x|y|rep", names(df))], id="rep")
dfm <- cbind(dfm, colsplit(dfm$variable, pattern = "", names=c("coordinate", "set")))
# I had to add the pattern here
dfc <- dcast(dfm, rep + set ~ coordinate)
# I had to change to dcast here
dfc$set <- factor(dfc$set, labels=paste("set", 0:3))
dfc$set <- factor(
dfc$set,
labels=c("relative", "absolute 1", "absolute 2", "absolute 3")
)
So first, here is their plot:
ggplot(data=dfc, aes(x=x, y=y, group=set)) +
geom_abline(aes(slope=-1, intercept=1), linetype=2) +
geom_abline(aes(slope=slope, intercept=intercept), data=df, colour="white") +
geom_line(aes(colour=set), linetype=3) +
geom_point(aes(fill=set), size=4, shape=21, colour="white") +
scale_colour_brewer(name="Set of points", palette="Set1") +
scale_fill_brewer( name="Set of points", palette="Set1") +
coord_equal(xlim=c(-0.02,3), ylim=c(-0.02,3)) +
labs(y=expression(mRNA[2]), x=expression(mRNA[1])) +
theme(legend.position=c(0.02, 0.98),
legend.justification = c(0, 1))
Indeed, all of the absolute abundances could result in the relative abundances shown in red. Note this is an extreme example, where each sample consists of exactly 2 OTUs (x, and y). So when the relative abundance of x changes it has to be fully compensated by y. Maybe the most striking example here is absolute 2, x and y are in a positive relation ship (correlation = 1), i.e. in each sample where there is more x there is also more y, while the correlation of the relative abundances is -1. This example I illustrate a bit more clearly.
# the thing is I get actually different relative abundances, but ok, they are the same for all three conditions.
dfSelf <- select(df, x0, y0, x1, y1, x2, y2, x3, y3)
dfSelf$Sample <- 1:7
colnames(dfSelf)[3:8] <- c('OTU1_Exp1', 'OTU2_Exp1', 'OTU1_Exp2', 'OTU2_Exp2', 'OTU1_Exp3', 'OTU2_Exp3')
dfSelf_long <- gather(dfSelf, OTU_Exp, absoluteAbundance, -x0, -y0, - Sample )
dfSelf_long <- separate(dfSelf_long, OTU_Exp, c("OTU", "Exp"), sep = "_")
dfSelf_long <- group_by(dfSelf_long, Exp, Sample)
dfSelf_long <- mutate(dfSelf_long, totalAbundance = sum(absoluteAbundance),
relativeAbundance = absoluteAbundance/totalAbundance)
# So I get in fact different relative abundances, but key is, they are the same.
dfcSelf <- filter(dfc, set %in% c("relative", "absolute 2"))
dfcSelf$rep <- as.character(dfcSelf$rep)
colnames(dfcSelf)[1:2] <- c('Sample', 'Abundance_Type')
dfcSelf$Group <- 'Control'
dfcSelf$Group[7:14] <- 'Disease'
ggplot(data=dfcSelf, aes(x=x, y=y, group=Abundance_Type)) +
geom_abline(aes(slope=slope, intercept=intercept), data=df, colour= cbPalette[1]) +
geom_point(aes(colour = Sample), size=7, shape=19) +
scale_colour_manual(values = c(cbPalette[2:8])) +
geom_point(aes(shape = Abundance_Type), size=4.5) +
scale_shape_manual(values = c(17, 18)) +
#scale_fill_brewer( name="Set of points", palette="Set1") +
theme_bw(18) +
xlab('OTU1') +
ylab('OTU2')
So here it gets clear. In sample 1 both OTU1 and OTU2 have high absolute abundances, so the total abundance is high. It goes down from sample 1 to sample 7. The decrease is faster for OTU2 than for OTU1, so although (look at x axis) the absolute abundance of OTU1 decreases from sample 1 to sample 7, its relative abundance increases!
This problem of missing proportionality between relative and absolute abundance whenn the total absolute abundance is not constant, is not only a problem for correlations between relative abundances of OTUs it is already a problem for interpreting the relative abundance levels of one OTU.
Imagine the seven samples where from a control and a disease group.
ggplot(data=dfcSelf, aes(x=x, y=y, group=Abundance_Type)) +
geom_abline(aes(slope=slope, intercept=intercept), data=df, colour= cbPalette[1]) +
geom_point(aes(colour = Group), size=7, shape=19) +
scale_colour_manual(values = c(cbPalette[2:3])) +
geom_point(aes(shape = Abundance_Type), size=4.5) +
scale_shape_manual(values = c(17, 18)) +
#scale_fill_brewer( name="Set of points", palette="Set1") +
theme_bw(18) +
xlab('OTU1') +
ylab('OTU2')
Let’s see only OTU1:
dfcSelf2 <- select(dfcSelf, 2:3, 5)
dfcSelf3 <- select(dfcSelf, 1:3, 5) %>% spread(Abundance_Type, x)
set.seed(121) # to get teh same jitter!
Tr <- ggplot(dfcSelf2, aes(x = Abundance_Type, y = x, fill = Group)) +
geom_boxplot(alpha = 1/2,
outlier.colour = NA) +
geom_jitter(aes(color = factor(Group)),
position=position_jitterdodge(dodge.width=0.8), size = 3) +
xlab("Abundance Type") +
ylab("Abundance OTU1") +
scale_colour_manual(values = c("#E69F00", "#56B4E9"),
name = "") +
scale_fill_manual(values = c("#E69F00", "#56B4E9"),
name = "") +
theme_bw(18)
Tr
a simple t.test would show that the relative abundance of OTU1 is lower in the control than in the Disease
t.test(dfcSelf3$relative[1:3], dfcSelf3$relative[4:7])
##
## Welch Two Sample t-test
##
## data: dfcSelf3$relative[1:3] and dfcSelf3$relative[4:7]
## t = -4.0415, df = 4.9592, p-value = 0.01008
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.28658557 -0.06341443
## sample estimates:
## mean of x mean of y
## 0.350 0.525
So that the relative abundance is not necessarily proportional to the absolute abundance poses a clear problem already to the interpreation of the abundances of a single OTU in different groups. In their example for yeast mRNAs they had 1399 mRNAs whose absolute abundance decreased from 0 to 3 hours after N starvation, and whose relative abundances increased.
Of course this poses a problem for correlations between relative abundances of OTUs.
He just showed, that if the relative abundances of two OTUs are correlated, this must not hold true for their absolute abundances. Now he wants to demonstrate that proporitonality is different: Proof that if the relative abundances are proportional, the same holds true for the absolute abundances.
df0 <- data.frame(x=seq(0.3,0.5,by=0.05))
slope <- 0.3/0.5
df0 <- transform(df0, rep=factor(1:nrow(df0)), y=x*slope, set="relative")
df1 <- transform(df0, x=x+0.5, rep=sample(rep), set="absolute1")
df2 <- transform(df0, x=3*x+0.4, rep=sample(rep), set="absolute2")
df3 <- transform(df0, x=8*(x-0.3)^1.5+2.2, rep=sample(rep), set="absolute3")
df <- rbind(df0, df1, df2, df3)
df <- transform(df, y=x*slope)
# so only after this also all the absolute are proportional
ggplot(data=df, aes(x=x, y=y)) +
geom_abline(aes(slope=-1, intercept=1), linetype=2) +
geom_abline(aes(slope=slope, intercept=0), colour="white") +
geom_point(aes(color=set, shape=rep), size=4) +
scale_color_brewer(name="Set of points", palette="Set1") +
scale_shape_manual(name="Sample number", values=c(15, 16, 17, 1, 2)) +
coord_equal(xlim=c(-0.02,3), ylim=c(-0.02,3)) +
labs(y=expression(mRNA[2]), x=expression(mRNA[1])) +
theme(legend.position=c(0.02, 0.98),
legend.justification = c(0, 1),
legend.box="horizontal", legend.box.just="top")
rm(df, df0, df1, df2, df3, slope)
I have to say I do not understand this visual proof. He wants to proof that when relative abundances are proportional also the absolute abundances were. This is indeed pretty obvious, if \(x_{i}, y_{i}\) are the relative abundances of X and Y at sample i and it generally holds that \(\frac{y_{i}}{x_{i}} = constant\), then with \(t_{i}\) being the total abundance in sample i \(\frac{t_{i}*y_{i}}{t_{i}*x_{i}} = constant\) since the total abundances cross always out.
An own example for proportional. Let’s say x and y are now absolute abundances, and they stay totally unchanged. But the total abundance changes over samples.
set.seed(123)
df <- data.frame(x = c(1, 1, 1, 1, 1), y = c(1.5,1.5,1.5,1.5,1.5), total = sample(2.5:7, 5), set = 'absolute', sample = 1:5)
df1 <- transform(df, x = x/total, y = y/total, set = 'relative')
df0 <- rbind(df, df1)
ggplot(df0, aes(x = x, y = y)) +
geom_point(aes(color=set, shape=factor(sample)), size=4) +
theme_bw(18)
So while the absolute abundances are always the same, the relative abundances get stretched, but they remain proportional. I made all the absolute abundances the same, and then calculated relative abundances for different total abundances, but of course it would work the other way around.
But how about inversely proportional?? In my opinion, if relative abundances are inversely proportional, it is not sure that the same holds true for the absolute abundances
\[x_{i} * y_{i} = constant\] \[t_{i}^{2} * x_{i} * y_{i} = NOT constant\] because it depends on the total abundances. Let’s try an example
set.seed(123)
df <- data.frame(x = seq(0.1, by = 0.1, to =0.5))
df <- mutate(df, y = 0.05/x, set = 'relative', total = sample(2.5:7, 5),
sample = 1:5)
df1 <- transform(df, x = x*total, y = y*total, set = 'absolute')
df0 <- rbind(df, df1)
ggplot(df0, aes(x = x, y = y)) +
geom_point(aes(color=set, shape=factor(sample)), size=4) +
theme_bw(18)
So you have the same problem here for inversely proportional as for negative correlation
How can I so detect in relative abundance data OTUs that are really negatively associated?
df0$y/df0$x
## [1] 5.0000000 1.2500000 0.5555556 0.3125000 0.2000000 5.0000000 1.2500000
## [8] 0.5555556 0.3125000 0.2000000