This is a follow up of
so if you have questions, start going through that one.
Lovell_2015 (Proportionality: A Valid Alternative to Correlation for Relative Data) ended with \(\theta\) as a measure of the strength of proportionality between two RV.
\(\theta\) which he calls “goodness-of-fit to proportionality statistic” can be used to assess the extent to which a pair of random variables are proportional. It is directly related to logratio variance which was suggested as a proportionality measure by Aitchison_1986
\[ var(log(Xr/Yr)) = var(log(Xr)) * \Big(1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* cor(log(Xr),log(Yr))\Big) \]
\[ \theta(log(Xr),log(Yr)) = 1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* |cor(log(Xr),log(Yr))| \]
Indeed \(\theta\) is 0 for both directly and inversely proportional RV.
Note from \[ var(log(Xr/Yr)) = var(log(Xr)) * \theta(log(Xr),log(Yr)) \]
follows that if he did not use the \(|r|\) there was no need to calculate \(\beta\) and \(|r|\) for calculating \(\theta\) since:
\[ \theta(log(Xr),log(Yr))_{\text{NOTabsolute}} = var(log(Xr/Yr))/var(log(Xr)) \]
But see below \(\theta(log(Xr),log(Yr))_{\text{NOTabsolute}}\) is not interpretable, i.e. it results in high values for inversely correlated RV!!
What are questions I want to answer here?
The example data from 150516_StatisticalInferenceARegressionBasics
Xr is directly proportional to Yr, and inversely proportional to Yr2. It is perfectly negatively correlated with Yr1 but not proportional.
NOTE To ensure that all pairs of RV are on the same scale, one has to use the centered logratios (clr) instead of just log.
\[ clr(X) = \Big( log(\frac{x_{1}}{GM(X)}), ..., log(\frac{x_{D}}{GM(X)}) \Big)\]
So you take the logarithms of the components of a RV (NOT the Composition of a sample, the values of one OTU over all samples) after dividing them by the geometric mean (\(GM\)) (which is the nth root of the product of all components (http://en.wikipedia.org/wiki/Geometric_mean)).
set.seed(123)
Xr <- rnorm(10, mean = 0.2, sd = .1)
Yr <- 1.2 * Xr
Yr1 <- .4 - Xr
Yr2 <- 1/Xr*.01
dfr <- data.frame (Xr = Xr, Yr = Yr, Yr1 = Yr1, Yr2 = Yr2)
# could be an t_Abundance table with 4 OTUs and 10 Samples
dfr2 <- dfr # dfr2 will be the data arranged
dfr2 <- arrange(dfr2, Xr)
dfr2$Sample <- LETTERS[1:10]
# dfr2 now contains the ordered relative abundances of the RVs
dfr3 <- transmute(dfr2, Sample = Sample, Xr = log(Xr), Yr = log(Yr), Yr1 = log(Yr1), Yr2 = log(Yr2))
dfr4 <- transmute(dfr2, Sample = Sample, Xr = log(Xr/(prod(Xr)^(1/length(Xr)))), Yr = log(Yr/(prod(Yr)^(1/length(Yr)))), Yr1 = log(Yr1/(prod(Yr1)^(1/length(Yr1)))), Yr2 = log(Yr2/(prod(Yr2)^(1/length(Yr2)))))
# so dfr2 to 4 contain now the relative abundances, the log() and the clr, respectively
dfr2_long <- gather(dfr2, OTU_or_mRNA, value, - Sample)
dfr3_long <- gather(dfr3, OTU_or_mRNA, value, - Sample)
dfr4_long <- gather(dfr4, OTU_or_mRNA, value, - Sample)
# just to show that the sum of the clr is practically zero for all of the RVs
# Why? Because if you divide Xr by its geometric mean = XrbyGM, then prod(XrbyGM)
# = 1. You take the log of x1 , x2 and so on, you sum this is log(x1) + log(x2)
# and so on, which is log(x1*x2...), which is log(1), so 0
colSums(dfr4[-1])
## Xr Yr Yr1 Yr2
## 5.342948e-16 1.020017e-15 8.916479e-16 1.741662e-15
To show the linear relationships between Xr and the other RVs.
So you see: in relative Abundance Xr and Yr1 (linearly correlated) form a straight line but not Yr2 (inversely proportional). After taking the log or clr it is the opposite.
## Source: local data frame [8 x 4]
##
## Method Yr Yr1 Yr2
## 1 Cor 1.000000e+00 -1.0000000 -0.8536159
## 2 CorLog 1.000000e+00 -0.8633439 -1.0000000
## 3 LogRatioVar 3.423875e-33 1.5750614 0.9291085
## 4 theta 0.000000e+00 0.9401858 0.0000000
## 5 theta_NOTabsolute 1.474048e-32 6.7809575 4.0000000
## 6 LogRatioVar_clr 2.191280e-33 1.5750614 0.9291085
## 7 theta_clr 2.220446e-16 0.3286706 0.0000000
## 8 theta_clr_NOTabsolute 9.433904e-33 6.7809575 4.0000000
So these 4 RVs (OTUs) illustrate the first problem with \(\theta\), particularly when using 0.05 as a cutoff: (Note also taking clr has only an influence on theta not logratio variance, and only on theta when taken with the absolute)