S2 Why does compositional data need special treatment?

S2.1 Correlation is not subcompositionally coherent

I change their example which they reproduced from Section 1.7 of Aitchison’s to microbiome. Two scientists get the same fecal samples (3 individuals). There are in fact 4 OTUs in those samples, but scientist B only detects 3 while scientist A detects all four

ScientistA.RelAbundances <- 0.1 * data.frame(
        OTU1  =c(1, 2, 3),
  OTU2=c(2, 1, 3),
 OTU3  =c(1, 1, 2),
  OTU4    =c(6, 6, 2))
# So each row is a sample. 
# these are the relative abundances, Lets say in total 1000 microbes,
# then the Total Abundances would be
TotalAbundances <- 1000*ScientistA.RelAbundances

# Obviously scientist B has a serious detection problem, anyway, his total
# abundance Table would look like

ScientistB.TotalAbundances <- TotalAbundances[,-4]

ScientistB.RelAbundances <- ScientistB.TotalAbundances/rowSums(ScientistB.TotalAbundances)

# So there is of course already here a real problem for scientist B, he gets
# completely different relative abundances, you should not miss key OTUs,
# but the critical point is

cor(ScientistA.RelAbundances)

##            OTU1       OTU2       OTU3       OTU4
## OTU1  1.0000000  0.5000000  0.8660254 -0.8660254
## OTU2  0.5000000  1.0000000  0.8660254 -0.8660254
## OTU3  0.8660254  0.8660254  1.0000000 -1.0000000
## OTU4 -0.8660254 -0.8660254 -1.0000000  1.0000000

cor(ScientistB.RelAbundances)

##      OTU1 OTU2 OTU3
## OTU1    1   -1    0
## OTU2   -1    1    0
## OTU3    0    0    1

So same samples, same detection precision, but because scientist B does not include one OTU the correlation between OTU1 and OTU2 (real 0.5) is -1 for him!! Correlation is thus not subcompositionally coherent, if you would use the OTU1/OTU2 ratio it would be, see

ScientistA.RelAbundances$OTU1/ScientistA.RelAbundances$OTU2

## [1] 0.5 2.0 1.0

ScientistB.RelAbundances$OTU1/ScientistB.RelAbundances$OTU2

## [1] 0.5 2.0 1.0

So as long as we work with scale invariant functions, or equivalently express all our statements about compositions in terms of ratios, we shall be subcompositionally coherent.

How much is that a problem for us?

we consider always the same OTUs for all samples. Of course we could miss some unknowns which could vary between samples, but at least it is not that they affect the relative abundances in one sample and not in the other. But if we would not include the unmapped for calculating the relative abundances, I think we had the subcompositional consistence problem. If we include the unmapped we might however end with the spurious correlation problem shown next.

S2.2 Spurious correlation

The idea here is: You have three independent RV X, Y, Z. They should as such not correlate, but X/Z to Y/Z do.

set.seed(123)
X <- rnorm(1000, mean = 10, sd = 1)
Y <- rnorm(1000, mean = 10, sd = 1)
Z <- rnorm(1000, mean = 30, sd = 3)

cor(X,Y)

## [1] 0.08647944

cor(X/Z, Y/Z)

## [1] 0.540498

The explanation is easy, best see the plot in Figure S1. The ones that were divided by high numbers will now correlate with each other. NOTE that Z was clearly bigger than X and Y and that is also the reason why you do get this spurious correlation.

So is that a microbiome problem?

I think we have to think about the unmapped fraction here, if the total counts are highest for unmapped, for the relative data you divide by it, so you would all OTUs make more correlative with each other
so I think that speaks again clearly for Paul’s method

S2.3 Correlations between rel. abundances tell us absolutely nothing.

…at least in the abscence of other information or assumptions. This is the impressive Figure 1. Saying: if you have a clear correlation between the relative abundances of two OTUs, you know nothing about their absolute abundances. I reproduce this with their data.

df <- data.frame(x0=seq(0.3,0.6,by=0.05), y0=seq(0.4, 0.25, by=-0.025))
df <- transform(df, rep=1:nrow(df), slope=y0/x0, intercept=0)
M  <- -1.5
C  <-  1.75
df <- transform(df, x1=(C - intercept)/(slope - M))
df <- transform(df, y1=M*x1 + C)
M  <-  2.5
C  <- -2.5
df <- transform(df, x2=(C - intercept)/(slope - M))
df <- transform(df, y2=M*x2 + C)
M  <- 0
C  <- c(1.254, 1.163, 1.091, 1.069, 1.110, 1.300, 1.051)
df <- transform(df, x3=(C  - intercept)/(slope - M))
df <- transform(df, y3=M*x3 + C)

dfm <- melt(df[,grep("x|y|rep", names(df))], id="rep")
dfm <- cbind(dfm, colsplit(dfm$variable, pattern = "", names=c("coordinate", "set")))
# I had to add the pattern here
dfc <- dcast(dfm, rep  + set ~ coordinate)
# I had to change to dcast here
dfc$set <- factor(dfc$set, labels=paste("set", 0:3))
dfc$set <- factor(
  dfc$set, 
  labels=c("relative", "absolute 1", "absolute 2", "absolute 3")
  )

So first, here is their plot:

ggplot(data=dfc, aes(x=x, y=y, group=set)) + 
  geom_abline(aes(slope=-1, intercept=1), linetype=2) + 
  geom_abline(aes(slope=slope, intercept=intercept), data=df, colour="white") + 
  geom_line(aes(colour=set), linetype=3) + 
  geom_point(aes(fill=set), size=4, shape=21, colour="white") + 
  scale_colour_brewer(name="Set of points", palette="Set1") + 
  scale_fill_brewer(  name="Set of points", palette="Set1") + 
  coord_equal(xlim=c(-0.02,3), ylim=c(-0.02,3)) + 
  labs(y=expression(mRNA[2]), x=expression(mRNA[1])) + 
  theme(legend.position=c(0.02, 0.98),
        legend.justification = c(0, 1))

Indeed, all of the absolute abundances could result in the relative abundances shown in red. Note this is an extreme example, where each sample consists of exactly 2 OTUs (x, and y). So when the relative abundance of x changes it has to be fully compensated by y. Maybe the most striking example here is absolute 2, x and y are in a positive relation ship (correlation = 1), i.e. in each sample where there is more x there is also more y, while the correlation of the relative abundances is -1. This example I illustrate a bit more clearly.

# the thing is I get actually different relative abundances, but ok, they are the same for all three conditions.
dfSelf <- select(df, x0, y0, x1, y1, x2, y2, x3, y3)
dfSelf$Sample <- 1:7
colnames(dfSelf)[3:8] <- c('OTU1_Exp1', 'OTU2_Exp1', 'OTU1_Exp2', 'OTU2_Exp2', 'OTU1_Exp3', 'OTU2_Exp3')
dfSelf_long <- gather(dfSelf, OTU_Exp, absoluteAbundance, -x0, -y0, - Sample )
dfSelf_long <- separate(dfSelf_long, OTU_Exp, c("OTU", "Exp"), sep = "_")
dfSelf_long <- group_by(dfSelf_long, Exp, Sample)
dfSelf_long <- mutate(dfSelf_long, totalAbundance = sum(absoluteAbundance),
                      relativeAbundance = absoluteAbundance/totalAbundance)
# So I get in fact different relative abundances, but key is, they are the same.

dfcSelf <- filter(dfc, set %in% c("relative", "absolute 2"))
dfcSelf$rep <- as.character(dfcSelf$rep)
colnames(dfcSelf)[1:2] <- c('Sample', 'Abundance_Type') 
dfcSelf$Group <- 'Control'
dfcSelf$Group[7:14] <- 'Disease'

ggplot(data=dfcSelf, aes(x=x, y=y, group=Abundance_Type)) +  
  geom_abline(aes(slope=slope, intercept=intercept), data=df, colour= cbPalette[1]) + 
  geom_point(aes(colour = Sample), size=7, shape=19) + 
  scale_colour_manual(values = c(cbPalette[2:8])) +
  geom_point(aes(shape = Abundance_Type), size=4.5) + 
  scale_shape_manual(values = c(17, 18)) + 
  #scale_fill_brewer(  name="Set of points", palette="Set1") +
  theme_bw(18) +
        xlab('OTU1') +
        ylab('OTU2')

So here it gets clear. In sample 1 both OTU1 and OTU2 have high absolute abundances, so the total abundance is high. It goes down from sample 1 to sample 7. The decrease is faster for OTU2 than for OTU1, so although (look at x axis) the absolute abundance of OTU1 decreases from sample 1 to sample 7, its relative abundance increases!

This behaviour that OTUs relative abundance can behave different from the absolute abundance is directly due to differences in total absolute abundance between the samples
If the total absolute abundance is constant, the relative abundance of a OTU is always \(\propto\) its absolute abundance. From: \[ \frac{a1}{t} = r1, \frac{a2}{t} = r2\] directly follows that \[ \frac{a1}{r1} = \frac{a2}{r2} = t >>> a \propto r\]

This problem of missing proportionality between relative and absolute abundance whenn the total absolute abundance is not constant, is not only a problem for correlations between relative abundances of OTUs it is already a problem for interpreting the relative abundance levels of one OTU.

Imagine the seven samples where from a control and a disease group.

ggplot(data=dfcSelf, aes(x=x, y=y, group=Abundance_Type)) +  
  geom_abline(aes(slope=slope, intercept=intercept), data=df, colour= cbPalette[1]) + 
  geom_point(aes(colour = Group), size=7, shape=19) + 
  scale_colour_manual(values = c(cbPalette[2:3])) +
  geom_point(aes(shape = Abundance_Type), size=4.5) + 
  scale_shape_manual(values = c(17, 18)) + 
  #scale_fill_brewer(  name="Set of points", palette="Set1") +
  theme_bw(18) +
        xlab('OTU1') +
        ylab('OTU2')

Let’s see only OTU1:

dfcSelf2 <- select(dfcSelf, 2:3, 5)
dfcSelf3 <- select(dfcSelf, 1:3, 5) %>% spread(Abundance_Type, x)

set.seed(121) # to get teh same jitter!
Tr <- ggplot(dfcSelf2, aes(x = Abundance_Type, y = x, fill = Group)) + 
        geom_boxplot(alpha = 1/2,
                  outlier.colour = NA) +
        geom_jitter(aes(color = factor(Group)),
                    position=position_jitterdodge(dodge.width=0.8), size = 3) +
        xlab("Abundance Type") +
        ylab("Abundance OTU1") + 
        scale_colour_manual(values = c("#E69F00", "#56B4E9"),
                            name = "") +
        scale_fill_manual(values = c("#E69F00", "#56B4E9"),
                          name = "") +
        theme_bw(18)

Tr

a simple t.test would show that the relative abundance of OTU1 is lower in the control than in the Disease

t.test(dfcSelf3$relative[1:3], dfcSelf3$relative[4:7])

## 
##  Welch Two Sample t-test
## 
## data:  dfcSelf3$relative[1:3] and dfcSelf3$relative[4:7]
## t = -4.0415, df = 4.9592, p-value = 0.01008
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.28658557 -0.06341443
## sample estimates:
## mean of x mean of y 
##     0.350     0.525

So that the relative abundance is not necessarily proportional to the absolute abundance poses a clear problem already to the interpreation of the abundances of a single OTU in different groups. In their example for yeast mRNAs they had 1399 mRNAs whose absolute abundance decreased from 0 to 3 hours after N starvation, and whose relative abundances increased.

Of course this poses a problem for correlations between relative abundances of OTUs.

He just showed, that if the relative abundances of two OTUs are correlated, this must not hold true for their absolute abundances. Now he wants to demonstrate that proporitonality is different: Proof that if the relative abundances are proportional, the same holds true for the absolute abundances.

df0 <- data.frame(x=seq(0.3,0.5,by=0.05))
slope <- 0.3/0.5
df0 <- transform(df0, rep=factor(1:nrow(df0)), y=x*slope, set="relative")
df1 <- transform(df0, x=x+0.5, rep=sample(rep), set="absolute1")
df2 <- transform(df0, x=3*x+0.4, rep=sample(rep), set="absolute2")
df3 <- transform(df0, x=8*(x-0.3)^1.5+2.2, rep=sample(rep), set="absolute3")
df  <- rbind(df0, df1, df2, df3)
df  <- transform(df, y=x*slope)
# so only after this also all the absolute are proportional
ggplot(data=df, aes(x=x, y=y)) +  
  geom_abline(aes(slope=-1, intercept=1), linetype=2) + 
  geom_abline(aes(slope=slope, intercept=0), colour="white") + 
  geom_point(aes(color=set, shape=rep),  size=4) + 
  scale_color_brewer(name="Set of points", palette="Set1") + 
  scale_shape_manual(name="Sample number", values=c(15, 16, 17, 1, 2)) + 
  coord_equal(xlim=c(-0.02,3), ylim=c(-0.02,3)) + 
  labs(y=expression(mRNA[2]), x=expression(mRNA[1])) + 
  theme(legend.position=c(0.02, 0.98),
        legend.justification = c(0, 1),
        legend.box="horizontal", legend.box.just="top")

rm(df, df0, df1, df2, df3, slope)

I have to say I do not understand this visual proof. He wants to proof that when relative abundances are proportional also the absolute abundances were. This is indeed pretty obvious, if \(x_{i}, y_{i}\) are the relative abundances of X and Y at sample i and it generally holds that \(\frac{y_{i}}{x_{i}} = constant\), then with \(t_{i}\) being the total abundance in sample i \(\frac{t_{i}*y_{i}}{t_{i}*x_{i}} = constant\) since the total abundances cross always out.

An own example for proportional. Let’s say x and y are now absolute abundances, and they stay totally unchanged. But the total abundance changes over samples.

set.seed(123)
df <- data.frame(x = c(1, 1, 1, 1, 1), y = c(1.5,1.5,1.5,1.5,1.5), total = sample(2.5:7, 5), set = 'absolute', sample = 1:5)
df1 <- transform(df, x = x/total, y = y/total, set = 'relative')
df0 <- rbind(df, df1)

ggplot(df0, aes(x = x, y = y)) +
          geom_point(aes(color=set, shape=factor(sample)),  size=4) +
        theme_bw(18)

So while the absolute abundances are always the same, the relative abundances get stretched, but they remain proportional. I made all the absolute abundances the same, and then calculated relative abundances for different total abundances, but of course it would work the other way around.

But how about inversely proportional?? In my opinion, if relative abundances are inversely proportional, it is not sure that the same holds true for the absolute abundances

\[x_{i} * y_{i} = constant\] \[t_{i}^{2} * x_{i} * y_{i} = NOT constant\] because it depends on the total abundances. Let’s try an example

set.seed(123)
df <- data.frame(x = seq(0.1, by = 0.1, to =0.5))
df <- mutate(df, y = 0.05/x, set = 'relative', total = sample(2.5:7, 5), 
             sample = 1:5)
df1 <- transform(df, x = x*total, y = y*total, set = 'absolute')
df0 <- rbind(df, df1)

ggplot(df0, aes(x = x, y = y)) +
          geom_point(aes(color=set, shape=factor(sample)),  size=4) +
        theme_bw(18)

So you have the same problem here for inversely proportional as for negative correlation

How can I so detect in relative abundance data OTUs that are really negatively associated?

I assume ratios again NOTE, proportionality hin oder her, it is the ratios that remain unchanged

df0$y/df0$x

##  [1] 5.0000000 1.2500000 0.5555556 0.3125000 0.2000000 5.0000000 1.2500000
##  [8] 0.5555556 0.3125000 0.2000000

LovelletAl_Own

Thorsten Brach

5 May 2015

Background and Nomenclature from the paper

S2 Why does compositional data need special treatment?

S2.1 Correlation is not subcompositionally coherent

S2.2 Spurious correlation

S2.3 Correlations between rel. abundances tell us absolutely nothing.