This document is motivated by the desire to analyse data that carry only relative information.
The first section highlights one of the pitfalls in ignoring the need to treat this kind of data with care, and shows that the trusty correlation coefficient is not at all trusty or appropriate as a measure of association for relative data.
Proportionality is an appropriate measure of association for relative data, and the second section of this document explores testing the hypothesis that two variables are proportional. It also highlights why this too must be approached with care.
To illustrate the phenomenon of spurious correlation (a term coined by Karl Pearson in 1896) let's look at a year's sales (in thousands) of three different kinds of books (Cookbooks
, Thrillers
and Mysteries
) across a number of book stores (nstores
):
n.stores <- 250
sales.df <- data.frame(Cookbooks=rnorm(n.stores, mean=10, sd=1),
Thrillers=rnorm(n.stores, mean=20, sd=1),
Mysteries=rnorm(n.stores, mean=30, sd=3))
Our sales data are just random numbers sampled from Normal distrubitions with different means and standard deviations.
Spurious correlation can arise when we form ratios of random variables, so lets do that:
sales.df <- transform(sales.df,
Cookbooks.over.Mysteries= Cookbooks/Mysteries,
Thrillers.over.Mysteries= Thrillers/Mysteries)
Let's plot the data pairwise to reassure ourselves visually of the lack of relationship between Cookbooks
, Thrillers
and Mysteries
Now let's look at what happens when we start working with ratios of these random, and statistically independent variables. Specifically, Cookbooks/Mysteries
and Thrillers/Mysteries
:
The distribution of points has a definite “tilt” to it and we can see the pairwise correlation indicates that something is going on even though all three variables are independent.
If we use a bit of colour, we can see that this “tilt” is created by Mysteries
being in the denominator of both variables:
…and this phenomenon is what Karl Pearson named spurious correlation.
Clearly, correlation is not a good measure of association for these relative data because a value of rho == 0.62
suggests that there is a relationship between Cookbooks/Mysteries
and Thrillers/Mysteries
even though all three variables are independent.
So, if correlation is not an appropriate measure of association for data carrying only relative information, what is?
Aitchison (1986) proposed the variance of the logratios of two variables \( x \) and \( y \): \[ {\mathrm Var}(\log x/y). \] If \( x \) and \( y \) are always proportional, the logarithm of their ratio will be constant, and the variance of the logarithm will be zero.
When there is some variability in the relationship between \( x \) and \( y \) we need a means to test whether the probability that a proportional relationship exists between them. We suggest the following approach for testing proportionality, after Warton et al.
To test for proportionality of \( x \) and \( y \) we test to see whether the slope of \( \log x \) versus \( \log y \) is significantly different from 1. Warton et al. describe a resampling approach to this page 290, which amounts to
Let's try this with our Cookbooks
and Thrillers
log.x <- log(sales.df$Cookbooks)
log.y <- log(sales.df$Thrillers)
log.xy <- log.x + log.y
log.x.on.y <- log.x - log.y
test.statistic <- cor(log.xy, log.x.on.y)^2
permutations <- 10000
p.value <- sum(replicate(permutations, cor(log.xy, sample(log.x.on.y))^2 > test.statistic))/permutations
This gives a p-value of 0.0000
and we reject the hypothesis that the slope is 1, i.e., Cookbooks
and Thrillers
don't appear to be proportional.
Just to be clear, the null hypothesis \( H_0 \) is that the slope of \( \log x \) versus \( \log y \) is 1. So, the probability of obtaining a value of the test statistic more extreme than we observed assuming the slope really is 1 is 0
.
When we try the same with Cookbooks/Mysteries
and Thrillers/Mysteries
, we get a p-value of 0.0000
and, again, we reject the hypothesis that Cookbooks
and Thrillers
are proportional.
Now let's see what happens when we apply the test to a pair of variables that are proportional, though with some variability in their relationship. We create these “roughly proportional” variables by multiplying Cookbooks
with two different vectors of noise:
log.x <- log(sales.df$Cookbooks * rnorm(n.stores, 1, 0.01))
log.y <- log(sales.df$Cookbooks * rnorm(n.stores, 5, 0.01))
This time, the test gives a p-value of 0.2056
, and we fail to reject the hypothesis that the variables are proportional.
This all seems pretty good, but there are a few traps for the unwary. First, it is important to remember that we are testing the statistical significance of a relationship, not effect size. We can see this if we apply the test to Cookbooks
and Cookbooks
raised to the power of 1.001:
x <- jitter(sales.df$Cookbooks)
y <- jitter(sales.df$Cookbooks^1.001)
Let's plot these two and add the line of equality in green:
Despite what our eyes tell us, the resampling test gives a p-value of 0.0000
telling us to reject the hypothesis that “Cookbooks
” and “Cookbooks
raised to the power of 1.001” are proportional. And so it should — they're not! However, I think it is pretty natural (at least for non-statisticians) to be a little bit surprised by this.
To increase our confidence that we haven't made a programming error, we check that using the smatr
package's slope.test
which says that, even though the slope of the relationship between log x
and log y
(0.9967
) is near as dammit to 1, we still reject the hypothesis that they are proportional with a p-value of 0.0000
.
The second trap for the unwary comes when we apply the slope test to a pair of statistically independent variables with the same standard deviation:
x <- rnorm(100)
y <- rnorm(100)
result <- slope.test(x, y)
unlist(result)
## r p test.value b ci1 ci2
## 0.01447 0.88635 1.00000 -1.01453 -0.83191 -1.23723
In this case, the p-value of 0.8864
means we fail to reject the hypothesis that the slope of the x-y pairs is 1. This is, in part, to do with the standardised major axis estimate of slope: \( {\mathrm sign}(s_{xy}) s_y/s_x \) where \( s_{xy} \) is the sample covariance, and \( s_x \) and \( s_y \) the sample standard deviations of \( x \) and \( y \). This issue goes away when the standard deviations of x and y differ:
x <- rnorm(100)
y <- rnorm(100, sd = 1.5)
result <- slope.test(x, y)
unlist(result)
## r p test.value b ci1 ci2
## -5.404e-01 6.507e-09 1.000e+00 5.463e-01 4.476e-01 6.666e-01
Now, the p-value of 0.0000
suggests we reject the hypothesis that x and y have a slope of 1.