This is a follow up of

so if you have questions, start going through that one.

Lovell_2015 (Proportionality: A Valid Alternative to Correlation for Relative Data) ended with \(\theta\) as a measure of the strength of proportionality between two RV.

Background and Questions

\(\theta\) which he calls “goodness-of-fit to proportionality statistic” can be used to assess the extent to which a pair of random variables are proportional. It is directly related to logratio variance which was suggested as a proportionality measure by Aitchison_1986

\[ var(log(Xr/Yr)) = var(log(Xr)) * \Big(1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* cor(log(Xr),log(Yr))\Big) \]

\[ \theta(log(Xr),log(Yr)) = 1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* |cor(log(Xr),log(Yr))| \]

Indeed \(\theta\) is 0 for both directly and inversely proportional RV.

Note from \[ var(log(Xr/Yr)) = var(log(Xr)) * \theta(log(Xr),log(Yr)) \]

follows that if he did not use the \(|r|\) there was no need to calculate \(\beta\) and \(|r|\) for calculating \(\theta\) since:

\[ \theta(log(Xr),log(Yr))_{\text{NOTabsolute}} = var(log(Xr/Yr))/var(log(Xr)) \]

But see below \(\theta(log(Xr),log(Yr))_{\text{NOTabsolute}}\) is not interpretable, i.e. it results in high values for inversely correlated RV!!

What are questions I want to answer here?

Tests on different data

The example data from 150516_StatisticalInferenceARegressionBasics

Xr is directly proportional to Yr, and inversely proportional to Yr2. It is perfectly negatively correlated with Yr1 but not proportional.

NOTE To ensure that all pairs of RV are on the same scale, one has to use the centered logratios (clr) instead of just log.

\[ clr(X) = \Big( log(\frac{x_{1}}{GM(X)}), ..., log(\frac{x_{D}}{GM(X)}) \Big)\]

So you take the logarithms of the components of a RV (NOT the Composition of a sample, the values of one OTU over all samples) after dividing them by the geometric mean (\(GM\)) (which is the nth root of the product of all components (http://en.wikipedia.org/wiki/Geometric_mean)).

set.seed(123)
Xr <- rnorm(10, mean = 0.2, sd = .1)
Yr <- 1.2 * Xr
Yr1 <- .4 - Xr
Yr2 <- 1/Xr*.01
dfr <- data.frame (Xr = Xr, Yr = Yr, Yr1 = Yr1, Yr2 = Yr2)
# could be an t_Abundance table with 4 OTUs and 10 Samples
dfr2 <- dfr # dfr2 will be the data arranged
dfr2 <- arrange(dfr2, Xr)
dfr2$Sample <- LETTERS[1:10]
# dfr2 now contains the ordered relative abundances of the RVs
dfr3 <- transmute(dfr2, Sample = Sample, Xr = log(Xr), Yr = log(Yr), Yr1 = log(Yr1), Yr2 = log(Yr2))
dfr4 <- transmute(dfr2, Sample = Sample, Xr = log(Xr/(prod(Xr)^(1/length(Xr)))), Yr = log(Yr/(prod(Yr)^(1/length(Yr)))), Yr1 = log(Yr1/(prod(Yr1)^(1/length(Yr1)))), Yr2 = log(Yr2/(prod(Yr2)^(1/length(Yr2)))))
# so dfr2 to 4 contain now the relative abundances, the log() and the clr, respectively
dfr2_long <- gather(dfr2, OTU_or_mRNA, value, - Sample)
dfr3_long <- gather(dfr3, OTU_or_mRNA, value, - Sample)
dfr4_long <- gather(dfr4, OTU_or_mRNA, value, - Sample)
# just to show that the sum of the clr is practically zero for all of the RVs
# Why? Because if you divide Xr by its geometric mean = XrbyGM, then prod(XrbyGM)
# = 1. You take the log of x1 , x2 and so on, you sum this is log(x1) + log(x2)
# and so on, which is log(x1*x2...), which is log(1), so 0
colSums(dfr4[-1])
##           Xr           Yr          Yr1          Yr2 
## 5.342948e-16 1.020017e-15 8.916479e-16 1.741662e-15

To show the linear relationships between Xr and the other RVs.

So you see: in relative Abundance Xr and Yr1 (linearly correlated) form a straight line but not Yr2 (inversely proportional). After taking the log or clr it is the opposite.

## Source: local data frame [8 x 4]
## 
##                  Method           Yr        Yr1        Yr2
## 1                   Cor 1.000000e+00 -1.0000000 -0.8536159
## 2                CorLog 1.000000e+00 -0.8633439 -1.0000000
## 3           LogRatioVar 3.423875e-33  1.5750614  0.9291085
## 4                 theta 0.000000e+00  0.9401858  0.0000000
## 5     theta_NOTabsolute 1.474048e-32  6.7809575  4.0000000
## 6       LogRatioVar_clr 2.191280e-33  1.5750614  0.9291085
## 7             theta_clr 2.220446e-16  0.3286706  0.0000000
## 8 theta_clr_NOTabsolute 9.433904e-33  6.7809575  4.0000000

So these 4 RVs (OTUs) illustrate the first problem with \(\theta\), particularly when using 0.05 as a cutoff: (Note also taking clr has only an influence on theta not logratio variance, and only on theta when taken with the absolute)

  • Yr1 is not proportional with Xr, though \(Xr + Yr1 = constant\), so they are prefectly negatively correlated, and it would be nice to detect them.
  • using the centered logratio helped to make the \(\theta\) value of Xr Yr1 smaller, but it is far away from his cut off.

Problems Lovell mentions self

  • the treatment of zeros, see reference 22
  • count data is not continuous and thus not purely relative, here he links to reference 24