Many of the explanations here are also from their older paper “Have you got things in proportion?” (You have it on your Mendeley, or you can get it from Lovell’s ResearchGate page; it is not on PubMed.)
In general, this paper is about appropriate measures of association between two quantities (random variables, RV), such as expression levels of mRNAs or abundances of OTUs, when you only have relative data. Lovell goes to great lengths to stress that correlation is a bad measure of association if you only have relative data of the two quantities without any knowledge of the absolute abundances. The simple problem is that the relative abundance of an OTU is only proportional (equal up to a constant ratio, see below) to its absolute abundance if the total abundance of all OTUs remains constant. This is clearly not guaranteed for OTUs; it was often assumed for mRNA expression, but there too it seems to be nonsense, e.g. starvation results in clearly lower overall mRNA levels.
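A minimal sketch of why closure is dangerous (my own toy example, not from the paper): two independent quantities become perfectly negatively correlated once you divide by their total.
set.seed(1)
otu1 <- rnorm(20, mean = 100, sd = 30) # absolute abundance of OTU 1
otu2 <- rnorm(20, mean = 100, sd = 30) # independent of otu1
total <- otu1 + otu2
cor(otu1, otu2) # some small sample correlation between the absolute counts
cor(otu1/total, otu2/total) # exactly -1: the two relative abundances must sum to 1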
Sticking to the X and Y from above:
set.seed(123)
X <- sample(1:10, 10)
Y <- 2 + 1.5*X # so a RV Y totally correlated with X
n <- length(X)
cor(X,Y) # Pearson correlation by default
## [1] 1
X/Y
## [1] 0.4615385 0.5714286 0.5000000 0.5600000 0.5454545 0.2857143 0.5882353
## [8] 0.5806452 0.4000000 0.5263158
So note, X and Y are perfectly correlated, but they are not proportional.
Note also already here that taking the logarithm ruins the perfect linear relationship of X and Y (which are not proportional because an intercept \(\neq 0\) was used):
cor(X,Y)
## [1] 1
cor(log(X),log(Y))
## [1] 0.9951201
Since we, like the paper, are concerned with relative abundances, I define new RVs Xr, Yr, Yr1, and Yr2 here:
set.seed(123)
Xr <- rnorm(10, mean = 0.2, sd = .1)
Yr <- 1.2 * Xr
Yr1 <- .4 - Xr
Yr2 <- 1/Xr*.01
dfr <- data.frame (Xr = Xr, Yr = Yr, Yr1 = Yr1, Yr2 = Yr2)
dfr2 <- dfr # needed later
dfr_long <- gather(dfr, Vector, Value, -Xr)
See below for the plot and the correlations. First, the plot showing how Xr relates to the Yrs:
Tr <- ggplot(dfr_long, aes(x = Xr, y = Value, colour = Vector)) +
geom_point(size = 5) +
scale_colour_manual(values = cbPalette[2:4]) +
theme_bw(14)
Tr
and the correlations
cor(Xr, Yr)
## [1] 1
cor(Xr, Yr1)
## [1] -1
cor(Xr, Yr2)
## [1] -0.8536159
Again: Xr and Yr1 are perfectly negatively correlated but not proportional. Xr and Yr are perfectly positively correlated and proportional. Xr and Yr2 are inversely proportional, but not in a linear relationship.
But look at the logs. The plot showing how log(Xr) relates to the logs of the Yrs:
dfr_long <- mutate(dfr_long, Log_Xr = log(Xr), Log_value = log(Value))
Tr <- ggplot(dfr_long, aes(x = Log_Xr, y = Log_value, colour = Vector)) +
geom_point(size = 5) +
scale_colour_manual(values = cbPalette[2:4]) +
theme_bw(14)
Tr
So now Xr and Yr2 are in a linear relationship, but Xr and Yr1 no longer are, which is of course also reflected in the correlations.
cor(log(Xr), log(Yr))
## [1] 1
cor(log(Xr), log(Yr1))
## [1] -0.8633439
cor(log(Xr), log(Yr2))
## [1] -1
To see the numbers of Xr, Yr, Yr1, Yr2, and some ratios all at once:
dfr <- mutate(dfr,
              Ratio_YrXr = Yr/Xr, Ratio_Yr1Xr = Yr1/Xr, Ratio_Yr2Xr = Yr2/Xr,
              Yr2timesXr = Yr2*Xr,
              LogRatio_YrXr = log(Yr/Xr), LogRatio_Yr1Xr = log(Yr1/Xr), LogRatio_Yr2Xr = log(Yr2/Xr))
dfr
## Xr Yr Yr1 Yr2 Ratio_YrXr Ratio_Yr1Xr
## 1 0.14395244 0.17274292 0.25604756 0.06946739 1.2 1.77869561
## 2 0.17698225 0.21237870 0.22301775 0.05650284 1.2 1.26011364
## 3 0.35587083 0.42704500 0.04412917 0.02810008 1.2 0.12400333
## 4 0.20705084 0.24846101 0.19294916 0.04829732 1.2 0.93189268
## 5 0.21292877 0.25551453 0.18707123 0.04696406 1.2 0.87856246
## 6 0.37150650 0.44580780 0.02849350 0.02691743 1.2 0.07669718
## 7 0.24609162 0.29530994 0.15390838 0.04063527 1.2 0.62541089
## 8 0.07349388 0.08819265 0.32650612 0.13606576 1.2 4.44263031
## 9 0.13131471 0.15757766 0.26868529 0.07615293 1.2 2.04611711
## 10 0.15543380 0.18652056 0.24456620 0.06433607 1.2 1.57344279
## Ratio_Yr2Xr Yr2timesXr LogRatio_YrXr LogRatio_Yr1Xr LogRatio_Yr2Xr
## 1 0.48257183 0.01 0.1823216 0.57588029 -0.7286255
## 2 0.31925710 0.01 0.1823216 0.23120191 -1.1417585
## 3 0.07896147 0.01 0.1823216 -2.08744685 -2.5387953
## 4 0.23326308 0.01 0.1823216 -0.07053762 -1.4555884
## 5 0.22056231 0.01 0.1823216 -0.12946828 -1.5115751
## 6 0.07245480 0.01 0.1823216 -2.56789032 -2.6247923
## 7 0.16512254 0.01 0.1823216 -0.46934642 -1.8010674
## 8 1.85138904 0.01 0.1823216 1.49124661 0.6159362
## 9 0.57992684 0.01 0.1823216 0.71594391 -0.5448533
## 10 0.41391299 0.01 0.1823216 0.45326608 -0.8820995
Again, you have two random variables (RV) X and Y that represent the relative abundances of two OTUs or the expression levels of two mRNAs in different samples. As examples I have Xr, which I relate to Yr, Yr1, and Yr2. With a length of 10, they cover 10 samples, let's say A to J.
So here is a better way of plotting the data:
dfr2 <- arrange(dfr2, Xr)
dfr2$Sample <- LETTERS[1:10]
dfr2_long <- gather(dfr2, OTU_or_mRNA, value, -Sample)
dfr2_long <- mutate(dfr2_long, Log_value = log(value))
# the definitions of Tr and Trr seem to be missing here; presumably something like:
Tr <- ggplot(dfr2_long, aes(x = Sample, y = value, colour = OTU_or_mRNA)) +
  geom_point(size = 5) + theme_bw(14)
Trr <- ggplot(dfr2_long, aes(x = Sample, y = Log_value, colour = OTU_or_mRNA)) +
  geom_point(size = 5) + theme_bw(14)
Tr
Trr
Note how the differences between Xr and Yr become equal on the log scale.
Aitchison (I still have to read his own paper) suggested the logratio variance to assess whether X and Y are proportional: \[ var(\log(X/Y))\]
Lovell writes that, clearly, when X and Y are proportional to each other, the logratio variance should be 0.
var(log(Xr/Yr))
## [1] 3.423875e-33
var(log(Xr/Yr1))
## [1] 1.575061
var(log(Xr/Yr2))
## [1] 0.9291085
var(log(X/Y))
## [1] 0.05069733
So remember: X and Y, Xr and Yr, and Xr and Yr1 are perfectly correlated. Xr and Yr are proportional, Xr and Yr2 are inversely proportional. For the proportional pair it is 0. Lovell says that for the others it does not have an interpretable scale, i.e. you cannot conclude that a smaller logratio variance indicates more nearly proportional RVs.
I see three general problems with this logratio variance
Let’s see what Lovell can do about these problems. He says the logratio variance is hard to interpret when X and Y are not exactly proportional, that is, what is a large and what is a small value? They claim they have found a way to factorise the logratio variance so that it is more interpretable.
The equations used for this factorisation are actually rather easy.
It relies on an approach to bivariate line fitting, the standardised major axis (SMA). According to Warton et al. (2006) (which I should read), it should be better than least squares regression at finding the right slope, because least squares would underestimate the slope. Remember, the least squares estimate of the slope of the random variable Yr on Xr is \[\hat{\beta}(Xr,Yr) = cor(Xr,Yr)*sd(Yr)/sd(Xr)\]
Let's see if that works for Xr, Yr:
cor(Xr,Yr) * sd(Yr)/sd(Xr)
## [1] 1.2
Indeed.
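As a quick cross-check (my addition, not in the original notes), lm() recovers the same slope:
coef(lm(Yr ~ Xr)) # intercept ~0 and slope 1.2, matching the formula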
Now SMA uses the following: \[\hat{\beta}(Xr,Yr) = sign(cov(Xr,Yr))*sd(Yr)/sd(Xr)\] sign(x) simply gives 1 when x is positive, -1 when negative, and 0 when 0. So even if the correlation is not perfect, as in the case of X and Y2, where cor(X,Y2) is about 0.98, this equation would use 1 because the covariance is positive. Well, if you have a correlation of 0.2, using 1 is quite some boost of the slope. Anyway.
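To make the SMA formula concrete, here is a small helper (my own wrapper around the formula quoted above, not from the paper):
sma_slope <- function(x, y) sign(cov(x, y)) * sd(y) / sd(x)
sma_slope(Xr, Yr) # 1.2, same as least squares, because the correlation is perfect
sma_slope(Xr, Yr1) # -1; the sign comes only from the covariance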
Lovell then just uses log(Yr) and log(Xr), and squares the slope; note that \(sign(cov(Xr,Yr))^{2}\) is always 1, so they get \[\hat{\beta}^{2}(log(Xr),log(Yr)) = \frac{var(log(Yr))}{var(log(Xr))}\]
To understand what happens here, I made another excursus into the tricks of taking the log. If you are not sure whether you should skip it, here is the key message:
If X and Y are proportional, so \(\frac{X}{Y} = constant1\), then \(log(X) - log(Y) = constant2\), meaning that the slope of log(Y) vs log(X), i.e. \(\frac{\Delta{log(Y)}}{\Delta{log(X)}}\), is 1. So both the classical slope from least squares and the one presented here will be 1 for perfectly proportional Xr and Yr.
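A one-line check of this key message (my addition): since Yr = 1.2*Xr exactly, regressing log(Yr) on log(Xr) gives slope 1 and intercept log(1.2).
coef(lm(log(Yr) ~ log(Xr))) # intercept log(1.2) = 0.1823216, slope 1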
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Begin Excursus into logs and slopes
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Here it might be good to also look at some plots in 150516_LogarithmAExponentialBasics
First, again just Xr vs Yr. Remember, the real slope is 1.2. It is a perfect correlation and proportionality because Yr = 1.2*Xr, so both the least squares slope cor(Xr,Yr)*sd(Yr)/sd(Xr) and SMA's sign(cov(Xr,Yr))*sd(Yr)/sd(Xr) give this 1.2. But what happens with this slope when you take the logarithm?
# I stick to Xr and Yr which are really proportional
dfr_Select <- select(dfr, Xr, Yr)
dfr_Select <- mutate(dfr_Select, Log_Xr = log(Xr), Log_Yr = log(Yr))
Tr <- ggplot(dfr_Select, aes(x = Xr, y = Yr)) +
geom_point(size = 5, colour = cbPalette[2]) +
geom_abline(intercept = 0, slope = 1.2) +
theme_bw(14)
Tr
Now they go to the logs, and they suggest for the slope
Slope <- sqrt(var(dfr_Select$Log_Yr)/var(dfr_Select$Log_Xr))
Slope
## [1] 1
Tr <- ggplot(dfr_Select, aes(x = Log_Xr, y = Log_Yr)) +
geom_point(size = 5, colour = cbPalette[2]) +
geom_abline(intercept = 0, slope = 1) +
theme_bw(14)
Tr
So taking the logarithm has two critical effects: the variances of log(Xr) and log(Yr) become equal, and the slope drops from 1.2 to 1.
Both are easy to explain when thinking about the logarithm and its rules.
First, the variance:
var(Yr)/var(Xr)
## [1] 1.44
# this was clear because var(a*X) = a^2*var(X)
# since Yr = 1.2*Xr the variance of Yr is 1.44*var(Xr)
var(log(Yr))/var(log(Xr))
## [1] 1
So how come the variance difference disappears? Because the multiplicative constant becomes an additive one: \( var(log(Yr)) = var(log(1.2*Xr)) = var(log(1.2) + log(Xr)) = var(log(Xr)) \), since adding a constant does not change the variance.
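A quick numerical confirmation of this (my addition):
var(log(1.2) + log(Xr)) # identical to var(log(Xr)); the additive constant drops out
var(log(Xr))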
And now the slope:
dfr_Select2 <- select(dfr_Select, Xr, Yr)
dfr_Select2 <- arrange(dfr_Select2, Xr)
dfr_Select2 <- mutate(dfr_Select2, No = 1:10)
dfr_Select2 <- gather(dfr_Select2, RV, value, -No)
dfr_Select3 <- select(dfr_Select, Log_Xr, Log_Yr)
dfr_Select3 <- arrange(dfr_Select3, Log_Xr)
dfr_Select3 <- mutate(dfr_Select3, No = 1:10)
dfr_Select3 <- gather(dfr_Select3, RV, value, -No)
Tr <- ggplot(dfr_Select2, aes(x = No, y = value, colour = RV)) +
geom_point(size = 5) +
scale_colour_manual(values = cbPalette[2:3]) +
theme_bw(14)
Trr <- ggplot(dfr_Select3, aes(x = No, y = value, colour = RV)) +
geom_point(size = 5) +
scale_colour_manual(values = cbPalette[2:3]) +
theme_bw(14)
Tr
Note the differences between Xr and Yr (orange and blue dots) get bigger and bigger
Trr
The differences between log(Xr) and log(Yr) are always the same:
sort(Yr) - sort(Xr)
## [1] 0.01469878 0.02626294 0.02879049 0.03108676 0.03539645 0.04141017
## [7] 0.04258575 0.04921832 0.07117417 0.07430130
sort(log(Yr)) - sort(log(Xr))
## [1] 0.1823216 0.1823216 0.1823216 0.1823216 0.1823216 0.1823216 0.1823216
## [8] 0.1823216 0.1823216 0.1823216
Also this is easy to explain: maybe look again at an exponential plot in 150516_LogarithmAExponentialBasics:
We know \[ Yr = aXr\] \[ Yr - Xr = aXr - Xr = Xr(a-1)\] The difference between Yr and Xr consequently gets bigger for bigger Xr. \[ log(Yr) - log(Xr) = log(aXr) - log(Xr) = log(a) + log(Xr) - log(Xr) = log(a)\]
The difference between log(Yr) and log(Xr) is consequently a constant, namely log(a), in our case log(1.2).
If Y is directly proportional to X (Y = aX with a the proportionality constant), then log(Y) - log(X) is constant, namely log(a).
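Indeed, log(1.2) reproduces the constant difference printed above (my one-line check):
log(1.2)
## [1] 0.1823216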
What does that mean for the slope of log(Y) vs log(X)? I illustrate this with a simpler example and the diff() function, but in principle it should be clear that the slope will always be 1. What defines a slope between x and y?
X and Y are in a linear relationship if \(\Delta{y}/\Delta{x}\) is constant, which is the slope
# the simple example for exponential, log, and slopes
x <- 1:5
y <- 2*x # so the proportionality constant a is 2
exdf <- data.frame(x=x, y =y)
exdf <- mutate(exdf, exp_x = 2^x, exp_y = 2^y, log_x = log(x)/log(2), log_y = log(y)/log(2)) # for simplicity I use log to base 2 here!!!
Tr <- ggplot(exdf, aes(x = x, y = y)) +
geom_point(size = 5, colour = cbPalette[2]) +
theme_bw(14)
Trr <- ggplot(exdf, aes(x = exp_x, y = exp_y)) +
geom_point(size = 5, colour = cbPalette[2]) +
theme_bw(14)
Trrr <- ggplot(exdf, aes(x = log_x, y = log_y)) +
geom_point(size = 5, colour = cbPalette[2]) +
theme_bw(14)
exdf
## x y exp_x exp_y log_x log_y
## 1 1 2 2 4 0.000000 1.000000
## 2 2 4 4 16 1.000000 2.000000
## 3 3 6 8 64 1.584963 2.584963
## 4 4 8 16 256 2.000000 3.000000
## 5 5 10 32 1024 2.321928 3.321928
Tr
note \(\Delta{y}/\Delta{x}\) is 2
Trrr
note \(\Delta{log(y)}/\Delta{log(x)}\) is 1
Trr
\(\Delta{2^{y}}/\Delta{2^{x}}\) is not constant, so you see the exponential ruins the linear relationship.
To calculate \(\Delta{y}/\Delta{x}\) the diff() function is handy.
exdf2 <- data.frame(diff_x = diff(exdf$x), diff_y = diff(exdf$y), diff_exp_x = diff(exdf$exp_x), diff_exp_y = diff(exdf$exp_y), diff_log_x = diff(exdf$log_x), diff_log_y = diff(exdf$log_y))
exdf2 <- mutate(exdf2, Deltay_Deltax = diff_y/diff_x, Deltaexpy_Deltaexpx = diff_exp_y/diff_exp_x, Deltalogy_Deltalogx = diff_log_y/diff_log_x)
exdf2
## diff_x diff_y diff_exp_x diff_exp_y diff_log_x diff_log_y Deltay_Deltax
## 1 1 2 2 12 1.0000000 1.0000000 2
## 2 1 2 4 48 0.5849625 0.5849625 2
## 3 1 2 8 192 0.4150375 0.4150375 2
## 4 1 2 16 768 0.3219281 0.3219281 2
## Deltaexpy_Deltaexpx Deltalogy_Deltalogx
## 1 6 1
## 2 12 1
## 3 24 1
## 4 48 1
Here just the numbers. So, by taking the logarithm, directly proportional X and Y end up in a linear relationship with slope 1, independent of the proportionality constant.
So if X and Y are proportional, log(Y) and log(X) are in a linear relationship with slope \(\Delta{log(Y)}/\Delta{log(X)} = 1\)
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
End Excursus into logs and slopes
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
So the excursus showed that Lovell's estimate for the slope
\[\hat{\beta}^{2}(log(Xr),log(Yr)) = \frac{var(log(Yr))}{var(log(Xr))}\]
will always be 1 for directly proportional Yr and Xr.
Then in point 2.2 of their paper they consider the squared sample correlation coefficient or coefficient of determination \[ r^{2}(Yr,Xr) = cor(Yr,Xr)^{2} = \frac{cov(Yr,Xr)^{2}}{sd(Yr)^{2}sd(Xr)^{2}} \]
Note (see the regression models Coursera course): if you have a linear regression, you have the total variation \(\sum(Y_i - mean(Y))^{2}\), the regression variation \(\sum(\hat{Y_i} - mean(Y))^{2}\) (so the variation of the Y on the line), and the residual variation \(\sum(Y_i - \hat{Y_i})^{2}\). \(r^{2}\) is also the ratio of the regression variation to the total variation, i.e. how much variation of Y can be explained by the linear relationship with X.
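To see this ratio once on our numbers (my addition, using log(Xr) and log(Yr1)):
fit <- lm(log(Yr1) ~ log(Xr))
RegVar <- sum((fitted(fit) - mean(log(Yr1)))^2) # regression variation
TotVar <- sum((log(Yr1) - mean(log(Yr1)))^2) # total variation
RegVar/TotVar # equals the squared correlation, (-0.8633439)^2 from above
cor(log(Xr), log(Yr1))^2 # the same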
Written with the logs again:
\[ r^{2}(log(Yr),log(Xr)) = \frac{cov(log(Yr),log(Xr))^{2}}{var(log(Yr))var(log(Xr))} \]
Now in 2.3 they use this \(r^{2}\) to re-express the logratio variance suggested by Aitchison:
\[ var(log(Xr/Yr)) = var(log(Xr) - log(Yr))\]
That was just the logarithm rule: the log of a ratio is the difference of the logs.
\[ var(log(Xr) - log(Yr)) = var(log(Xr)) + var(log(Yr)) - 2*cov(log(Xr),log(Yr)) \]
This has nothing to do with the logarithm; it holds in general. Here is an adapted version of the proof (https://www.youtube.com/watch?v=k9do8J-w9hY):
Remember the variance of a RV is \[ Var(W) = E[(W - E[W])^{2}] = E[W^{2}] - (E[W])^{2} \]
Critical is that the expectation of a sum is the sum of the expectations: \[ E[X+Y] = E[X] + E[Y] \]
So let W be X-Y: \[ Var(X-Y) = E[((X-Y) - (E[X-Y]))^{2}] \]
\[ Var(X-Y) = E[((X-Y) - (E[X]-E[Y]))^{2}] \]
\[ Var(X-Y) = E[(X-Y -E[X] + E[Y])^{2}] \]
\[ Var(X-Y) = E[((X-E[X]) - (Y - E[Y]))^{2}] \]
\[ Var(X-Y) = E[(X-E[X])^{2} - 2*(X-E[X])(Y - E[Y]) + (Y - E[Y])^{2}] \]
\[ Var(X-Y) = E[(X-E[X])^{2}] - 2*E[(X-E[X])(Y - E[Y])] + E[(Y - E[Y])^{2}] \]
\[ Var(X-Y) = Var(X) + Var(Y) - 2*cov(X,Y) \]
Coming back: as just shown, you can indeed write Aitchison's logratio variance as
\[ var(log(Xr/Yr)) = var(log(Xr)) + var(log(Yr)) - 2*cov(log(Xr),log(Yr)) \]
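Checking this identity numerically on Xr and Yr1 (my addition):
var(log(Xr/Yr1))
## [1] 1.575061
var(log(Xr)) + var(log(Yr1)) - 2*cov(log(Xr), log(Yr1))
## [1] 1.575061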
In the following they want to combine their equations. For easier notation they use
\[ A = var(log(Xr)) ~~~ B = var(log(Yr)) ~~~ C = cov(log(Xr),log(Yr)) \]
Then their estimate for the slope between log(Yr) and log(Xr) becomes
\[\hat{\beta}^{2}(log(Xr),log(Yr)) = \frac{B}{A}\]
The logratio variance is
\[ var(log(Xr/Yr)) = A + B - 2C \]
And the squared correlation \(cor(log(Yr),log(Xr))^{2}\) becomes
\[ r^{2}(log(Yr),log(Xr)) = \frac{C^{2}}{A*B} \]
In the following, I abbreviate \[ r^{2}(log(Yr),log(Xr)) = r^{2} ~~~ \hat{\beta}^{2}(log(Xr),log(Yr)) = \hat{\beta}^{2} \]
\[ \hat{\beta}^{2} = \frac{B}{A} \Rightarrow B = \hat{\beta}^{2}*A \] Note that this beta is just the square of an estimated slope, which is now used to relate the variance of log(Yr) to var(log(Xr)).
\[ C^{2} = r^{2} *A*B \]
replacing the B
\[ C^{2} = r^{2} *A^{2}*\hat{\beta}^{2} \]
Up to here everything was easy. It was actually from the “Have you got things in proportion?” paper.
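To keep track of the A, B, C bookkeeping, here they are once in R for Xr and Yr1 (my addition):
A <- var(log(Xr)); B <- var(log(Yr1)); C <- cov(log(Xr), log(Yr1))
B/A # beta^2
C^2/(A*B) # r^2
cor(log(Xr), log(Yr1))^2 # the same r^2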
But now he takes the square root. In this older paper he gives
\[ C = sign(\hat{\beta})*A*|r\hat{\beta}| \]
In the actually published paper (“Proportionality:…”) the sign term disappears, so I go with that paper now. The obvious result of taking the square root is:
\[ C = r *A*\hat{\beta} \]
And we should stick with this easy solution.
Now it is just a matter of replacing B and C in the equation for the logratio variance:
\[ var(log(Xr/Yr)) = A + B - 2C \] \[ var(log(Xr/Yr)) = A + A\hat{\beta}^{2} - 2*r *A*\hat{\beta} \] \[ var(log(Xr/Yr)) = A * (1 + \hat{\beta}^{2} - 2*r*\hat{\beta}) \]
remember A is var(log(Xr))
\[ var(log(Xr/Yr)) = var(log(Xr)) * (1 + \hat{\beta}^{2} - 2*r*\hat{\beta}) \]
So indeed the logratio variance is now factorised into two terms: var(log(Xr)) and the factor \((1 + \hat{\beta}^{2} - 2*r*\hat{\beta})\), which (with \(|r|\) instead of \(r\)) is what Lovell calls \(\theta\).
And I think I know why he takes the absolute value of r, although that is arithmetically not correct here. See below.
First, some general remarks on \(\theta\).
To show the issue with the absolute \(|r|\), let's write the equation in full again and test it on the example RVs (Xr, Yr, Yr1, Yr2).
\[ var(log(Xr/Yr)) = var(log(Xr)) * (1 + \hat{\beta}^{2} - 2*\hat{\beta}*r) \]
\[ var(log(Xr/Yr)) = var(log(Xr)) * \Big(1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* \frac{cov(log(Xr),log(Yr))}{\sqrt{var(log(Xr))*var(log(Yr))}}\Big) \]
I wrote it like this because he did so. Interestingly, here he did not use the absolute value of r, which is the correlation, so a simpler way to write it is
\[ var(log(Xr/Yr)) = var(log(Xr)) * \Big(1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* cor(log(Xr),log(Yr))\Big) \]
Only then, and probably just for his definition of \(\theta\), does he change to:
\[ var(log(Xr/Yr)) = var(log(Xr)) * \Big(1 + \frac{var(log(Yr))}{var(log(Xr))} - 2*\sqrt{\frac{var(log(Yr))}{var(log(Xr))}}* |cor(log(Xr),log(Yr))|\Big) \]
So let's test the arithmetic directly. First, the real logratio variances again:
data.frame(ToYr = var(log(Xr/Yr)), ToYr1 =var(log(Xr/Yr1)), ToYr2 = var(log(Xr/Yr2)))
## ToYr ToYr1 ToYr2
## 1 3.423875e-33 1.575061 0.9291085
The equation with the normal correlation (not absolute)
ToYr <- var(log(Xr)) * ( 1 + var(log(Yr))/var(log(Xr)) - 2 * sqrt( var(log(Yr))/var(log(Xr)) ) * ( cor( log(Xr),log(Yr) ) ) )
ToYr1 <- var(log(Xr)) * ( 1 + var(log(Yr1))/var(log(Xr)) - 2 * sqrt( var(log(Yr1))/var(log(Xr)) ) * ( cor( log(Xr),log(Yr1) ) ) )
ToYr2 <- var(log(Xr)) * ( 1 + var(log(Yr2))/var(log(Xr)) - 2 * sqrt( var(log(Yr2))/var(log(Xr)) ) * ( cor( log(Xr),log(Yr2) ) ) )
data.frame(ToYr = ToYr, ToYr1 =ToYr1, ToYr2 = ToYr2)
## ToYr ToYr1 ToYr2
## 1 0 1.575061 0.9291085
So these transformations were correct; this equation works. But what if we use abs(cor), as he suggests?
ToYr <- var(log(Xr)) * ( 1 + var(log(Yr))/var(log(Xr)) - 2 * sqrt( var(log(Yr))/var(log(Xr)) ) * abs( cor( log(Xr),log(Yr) ) ) )
ToYr1 <- var(log(Xr)) * ( 1 + var(log(Yr1))/var(log(Xr)) - 2 * sqrt( var(log(Yr1))/var(log(Xr)) ) * abs( cor( log(Xr),log(Yr1) ) ) )
ToYr2 <- var(log(Xr)) * ( 1 + var(log(Yr2))/var(log(Xr)) - 2 * sqrt( var(log(Yr2))/var(log(Xr)) ) * abs( cor( log(Xr),log(Yr2) ) ) )
data.frame(ToYr = ToYr, ToYr1 =ToYr1, ToYr2 = ToYr2)
## ToYr ToYr1 ToYr2
## 1 0 0.2183837 0
So when you use the absolute value of the correlation, the calculated logratio variance is no longer correct. But there might be a good reason he does it; let's just look at \(\theta\).
ThetaYr <- 1 + var(log(Yr))/var(log(Xr)) - 2 * sqrt( var(log(Yr))/var(log(Xr)) ) * ( cor( log(Xr),log(Yr) ) )
ThetaYr1 <- 1 + var(log(Yr1))/var(log(Xr)) - 2 * sqrt( var(log(Yr1))/var(log(Xr)) ) * ( cor( log(Xr),log(Yr1) ) )
ThetaYr2 <- 1 + var(log(Yr2))/var(log(Xr)) - 2 * sqrt( var(log(Yr2))/var(log(Xr)) ) * ( cor( log(Xr),log(Yr2) ) )
ThetaAYr <- 1 + var(log(Yr))/var(log(Xr)) - 2 * sqrt( var(log(Yr))/var(log(Xr)) ) * abs( cor( log(Xr),log(Yr) ) )
ThetaAYr1 <- 1 + var(log(Yr1))/var(log(Xr)) - 2 * sqrt( var(log(Yr1))/var(log(Xr)) ) * abs( cor( log(Xr),log(Yr1) ) )
ThetaAYr2 <- 1 + var(log(Yr2))/var(log(Xr)) - 2 * sqrt( var(log(Yr2))/var(log(Xr)) ) * abs( cor( log(Xr),log(Yr2) ) )
data.frame(Method = c('normal_r', 'absolute_r'), ThetaYr = c(ThetaYr,ThetaAYr), ThetaYr1 = c(ThetaYr1,ThetaAYr1), ThetaYr2 = c(ThetaYr2,ThetaAYr2))
## Method ThetaYr ThetaYr1 ThetaYr2
## 1 normal_r 0 6.7809575 4
## 2 absolute_r 0 0.9401858 0
So note: by using the absolute value of the correlation, he achieves that \(\theta\) is also 0 for inversely proportional RVs such as Xr and Yr2. On top of that, the \(\theta\) for Yr1 (perfectly negatively correlated) is pretty small (see his paper, where thetas can go up to 7 and more).
Here is the proof why his \(\theta\) is 0 for inversely proportional RVs such as Xr and Yr2.
First, \(\hat{\beta}^{2}\) is 1 because var(log(Yr2)) equals var(log(Xr)):
\[ Xr*Yr2 = a ~~~ a = constant\] \[ Yr2 = a*\frac{1}{Xr}\] \[ log(Yr2) = log\Big(a*\frac{1}{Xr}\Big) = log(a) - log(Xr)\] and since \(a\) is constant, \[ var(log(Yr2)) = var(log(a) - log(Xr)) = var(-1*log(Xr)) = (-1)^{2}*var(log(Xr)) = var(log(Xr))\]
So both for X and Y being proportional or inversely proportional: \[\hat{\beta}^{2} = \frac{var(log(Y))}{var(log(X))} = 1\]
Leaving \[ \theta(log(X),log(Y)) = 1 + 1 - 2*1*|cor(log(X),log(Y))| = 2 - 2*|cor(log(X),log(Y))| \] And since \(|cor(log(X),log(Y))| = 1\) both for directly and inversely proportional RVs, \(\theta\) is 0 in both cases.
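A numerical confirmation of the two ingredients (my addition):
var(log(Yr2)) - var(log(Xr)) # ~0 (up to floating point), so beta^2 = 1
cor(log(Xr), log(Yr2)) # -1, so |r| = 1 and theta = 0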
I decided at this point to start a new document: 150520_DisucssionLovell_2015
So \(\theta\) is 0 for both directly and inversely proportional RVs. He claims it is well suited to identifying proportional RVs.
What are the questions to answer here?