On spurious correlation and testing for proportionality

by David Lovell (CSIRO)

This document is motivated by the desire to analyse data that carry only relative information.

The first section highlights one of the pitfalls in ignoring the need to treat this kind of data with care, and shows that the trusty correlation coefficient is not at all trusty or appropriate as a measure of association for relative data.

Proportionality is an appropriate measure of association for relative data, and the second section of this document explores testing the hypothesis that two variables are proportional. It also highlights why this too must be approached with care.

Spurious correlation

To illustrate the phenomenon of spurious correlation (a term coined by Karl Pearson in 1896) let's look at a year's sales (in thousands) of three different kinds of books (Cookbooks, Thrillers and Mysteries) across a number of book stores (nstores):

n.stores <- 250
sales.df <- data.frame(Cookbooks=rnorm(n.stores, mean=10, sd=1),
                       Thrillers=rnorm(n.stores, mean=20, sd=1),
                       Mysteries=rnorm(n.stores, mean=30, sd=3))

Our sales data are just random numbers sampled from Normal distrubitions with different means and standard deviations.

Spurious correlation can arise when we form ratios of random variables, so lets do that:

sales.df <- transform(sales.df,
                      Cookbooks.over.Mysteries= Cookbooks/Mysteries,
                      Thrillers.over.Mysteries= Thrillers/Mysteries)

Let's plot the data pairwise to reassure ourselves visually of the lack of relationship between Cookbooks, Thrillers and Mysteries

plot of chunk plot-pairs plot of chunk plot-pairs plot of chunk plot-pairs

Now let's look at what happens when we start working with ratios of these random, and statistically independent variables. Specifically, Cookbooks/Mysteries and Thrillers/Mysteries:

plot of chunk plot-spurious1

The distribution of points has a definite “tilt” to it and we can see the pairwise correlation indicates that something is going on even though all three variables are independent.

If we use a bit of colour, we can see that this “tilt” is created by Mysteries being in the denominator of both variables:

plot of chunk plot-spurious2

…and this phenomenon is what Karl Pearson named spurious correlation.

Clearly, correlation is not a good measure of association for these relative data because a value of rho == 0.62 suggests that there is a relationship between Cookbooks/Mysteries and Thrillers/Mysteries even though all three variables are independent.

So, if correlation is not an appropriate measure of association for data carrying only relative information, what is?

Aitchison (1986) proposed the variance of the logratios of two variables \( x \) and \( y \): \[ {\mathrm Var}(\log x/y). \] If \( x \) and \( y \) are always proportional, the logarithm of their ratio will be constant, and the variance of the logarithm will be zero.

When there is some variability in the relationship between \( x \) and \( y \) we need a means to test whether the probability that a proportional relationship exists between them. We suggest the following approach for testing proportionality, after Warton et al.

Testing for proportionality

To test for proportionality of \( x \) and \( y \) we test to see whether the slope of \( \log x \) versus \( \log y \) is significantly different from 1. Warton et al. describe a resampling approach to this page 290, which amounts to

  1. Constructing the residual plot \( (\log y - 1\cdot\log x) \) versus \( (\log y + 1\cdot\log x) \).
    In other words, plot \( \log x/y \) against \( \log xy \). See the link to the variance of logratios?
  2. Calculate the test statistic \( r_{rf}(b)^2 = \rho(\log x/y, \log xy)^2 \)
  3. Accumulate that statistic for permuted versions of that data
  4. Form the p-value as the proportion of times in which the “permuted test statistic” exceeds the test statistic that was actually obeserved

Let's try this with our Cookbooks and Thrillers

log.x          <- log(sales.df$Cookbooks)
log.y          <- log(sales.df$Thrillers)
log.xy         <- log.x + log.y
log.x.on.y     <- log.x - log.y

test.statistic <- cor(log.xy, log.x.on.y)^2
permutations   <- 10000
p.value <- sum(replicate(permutations, cor(log.xy, sample(log.x.on.y))^2 > test.statistic))/permutations

This gives a p-value of 0.0000 and we reject the hypothesis that the slope is 1, i.e., Cookbooks and Thrillers don't appear to be proportional.

Just to be clear, the null hypothesis \( H_0 \) is that the slope of \( \log x \) versus \( \log y \) is 1. So, the probability of obtaining a value of the test statistic more extreme than we observed assuming the slope really is 1 is 0.

When we try the same with Cookbooks/Mysteries and Thrillers/Mysteries, we get a p-value of 0.0000 and, again, we reject the hypothesis that Cookbooks and Thrillers are proportional.

Now let's see what happens when we apply the test to a pair of variables that are proportional, though with some variability in their relationship. We create these “roughly proportional” variables by multiplying Cookbooks with two different vectors of noise:

log.x <- log(sales.df$Cookbooks * rnorm(n.stores, 1, 0.01))
log.y <- log(sales.df$Cookbooks * rnorm(n.stores, 5, 0.01))

This time, the test gives a p-value of 0.2056, and we fail to reject the hypothesis that the variables are proportional.

Traps for the unwary

This all seems pretty good, but there are a few traps for the unwary. First, it is important to remember that we are testing the statistical significance of a relationship, not effect size. We can see this if we apply the test to Cookbooks and Cookbooks raised to the power of 1.001:

x <- jitter(sales.df$Cookbooks)
y <- jitter(sales.df$Cookbooks^1.001)

Let's plot these two and add the line of equality in green:

plot of chunk slope-test4-1

Despite what our eyes tell us, the resampling test gives a p-value of 0.0000 telling us to reject the hypothesis that “Cookbooks” and “Cookbooks raised to the power of 1.001” are proportional. And so it should — they're not! However, I think it is pretty natural (at least for non-statisticians) to be a little bit surprised by this.

To increase our confidence that we haven't made a programming error, we check that using the smatr package's slope.test which says that, even though the slope of the relationship between log x and log y (0.9967) is near as dammit to 1, we still reject the hypothesis that they are proportional with a p-value of 0.0000.

The second trap for the unwary comes when we apply the slope test to a pair of statistically independent variables with the same standard deviation:

x <- rnorm(100)
y <- rnorm(100)
result <- slope.test(x, y)
##          r          p test.value          b        ci1        ci2 
##    0.01447    0.88635    1.00000   -1.01453   -0.83191   -1.23723

In this case, the p-value of 0.8864 means we fail to reject the hypothesis that the slope of the x-y pairs is 1. This is, in part, to do with the standardised major axis estimate of slope: \( {\mathrm sign}(s_{xy}) s_y/s_x \) where \( s_{xy} \) is the sample covariance, and \( s_x \) and \( s_y \) the sample standard deviations of \( x \) and \( y \). This issue goes away when the standard deviations of x and y differ:

x <- rnorm(100)
y <- rnorm(100, sd = 1.5)
result <- slope.test(x, y)
##          r          p test.value          b        ci1        ci2 
## -5.404e-01  6.507e-09  1.000e+00  5.463e-01  4.476e-01  6.666e-01

Now, the p-value of 0.0000 suggests we reject the hypothesis that x and y have a slope of 1.