log of means or mean of logs?

Consider a variable that is not Normally distributed (left-skewed, right-skewed, lognormal etc.). This kind of variable is quite common in ecological data. Species traits, soil nutrient contents, abundance, you name it. We could simulate such data using rlnorm():

# 10000 draws with log-scale mean 5 and log-scale SD 1
x <- rlnorm(10000, 5, 1)
par(mfrow=c(1,2), mar=c(4.5,4,2,0.5))
# left panel: raw scale; right panel: log scale
hist(x, main = "", breaks = 100, col = "grey", border = "grey")
hist(log(x), main = "", breaks = 50, col = "grey", border = "grey")
abline(v = log(mean(x)), col = "red", lwd = 2)   # log of mean
abline(v = mean(log(x)), col = "blue", lwd = 2)  # mean of logs

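Of course, the histogram of log(x) centers at about 5, which was the specified mean on the log-scale. mean(log(x)) (blue line) recovers it, returning 5.0074422, whereas log(mean(x)) (red line) returns 5.5152763, well to the right of the center. That is no accident: for a lognormal variable, mean(x) estimates exp(meanlog + sdlog^2/2), so log(mean(x)) targets 5 + 1/2 = 5.5 rather than 5. This seems trivial, but to be honest I have been incorrectly doing log(mean(x)) many times! (see example below)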
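A quick way to check both numbers against what a lognormal variable should give (meanlog for mean(log(x)), and meanlog + sdlog^2/2 for log(mean(x))). This is only a sketch: the seed, the larger sample size and the variable names are my own assumptions, not part of the simulation above.

```r
set.seed(1)                       # assumed seed, for reproducibility only
meanlog <- 5
sdlog   <- 1
x <- rlnorm(1e5, meanlog, sdlog)  # larger sample than above (assumption)

mean_of_logs <- mean(log(x))      # estimates meanlog
log_of_mean  <- log(mean(x))      # estimates meanlog + sdlog^2 / 2

c(mean_of_logs = mean_of_logs,
  log_of_mean  = log_of_mean,
  expected_gap = sdlog^2 / 2,
  observed_gap = log_of_mean - mean_of_logs)
```

The observed gap should land close to the expected gap of 0.5, up to sampling noise.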

Intuitively, we would expect such a discrepancy to increase with increasing standard deviation of the lognormal distribution, because a greater SD increases the chance of drawing a very large x value, which pulls mean(x) further towards larger values. This can be easily demonstrated with simulation:

n <- 100
SDs <- seq(0.01, 5, length.out = n)
# named `gap` rather than `diff` to avoid masking base::diff()
gap <- numeric(n)
for (i in seq_len(n)) {
  # same log-scale mean, increasing log-scale SD
  x <- rlnorm(10000, 5, SDs[i])
  mean.log <- mean(log(x))
  log.mean <- log(mean(x))
  gap[i] <- log.mean - mean.log
}
plot(SDs, gap, 
     ylab = "log(mean(x)) - mean(log(x))", 
     xlab = "Standard deviation of lognormal distribution")

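Clearly, the discrepancy grows rapidly with increasing SD of the lognormal distribution. In theory the growth is quadratic rather than exponential: for a lognormal variable, the log of the true mean exceeds the log-scale mean by SD^2/2. This means that, when the goal is to summarise the typical value of a skewed variable on the log scale, log-transforming before taking the mean is the safer order. This makes sense because the mean describes the center of a roughly symmetric distribution well, but is dragged towards the tail of a skewed one.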
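The simulated gap can be compared against sdlog^2/2, the value expected for a lognormal distribution. A minimal sketch, assuming a seed and a narrower SD range than the simulation above (both my own choices). At larger SDs a sample of 10000 tends to miss the extreme values that dominate the true mean, so the simulated points can fall below the curve there.

```r
set.seed(42)                            # assumed seed
sds <- seq(0.01, 2, length.out = 50)    # narrower range (assumption)

# gap between log of mean and mean of logs for each SD
gap <- vapply(sds, function(s) {
  x <- rlnorm(10000, 5, s)
  log(mean(x)) - mean(log(x))
}, numeric(1))

plot(sds, gap,
     xlab = "Standard deviation of lognormal distribution",
     ylab = "log(mean(x)) - mean(log(x))")
curve(x^2 / 2, add = TRUE, col = "red", lwd = 2)  # expected gap
```

The gap is never negative for any sample of positive values, which is Jensen's inequality applied to the concave log function.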

A common mistake in managing ecological data happens when summarising group means. Consider summarising individual-level trait values to the species level, or quadrat-level environmental values to the plot level. Don’t know about you, but I often stupidly do mean(quadrat values) and then log(plot average) (i.e. I only examine the histogram after taking the mean)!

\facepalm
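To make the quadrat-to-plot scenario concrete, here is a sketch with made-up data. The design (20 plots of 25 quadrats), the variable names such as soil_n, and all parameter values are hypothetical and chosen only for illustration.

```r
set.seed(7)                                   # assumed seed
n_plots    <- 20                              # hypothetical design
n_quadrats <- 25
plot_id    <- rep(seq_len(n_plots), each = n_quadrats)

# hypothetical log-scale mean for each plot
plot_mu <- rnorm(n_plots, 5, 0.5)
# quadrat-level values, lognormal within each plot
soil_n  <- rlnorm(n_plots * n_quadrats, plot_mu[plot_id], 1)

log_of_mean <- log(tapply(soil_n, plot_id, mean))  # mean first, then log
mean_of_log <- tapply(log(soil_n), plot_id, mean)  # log first, then mean

summary(log_of_mean - mean_of_log)
```

Every plot-level log(mean) sits at or above the corresponding mean(log), so averaging first and transforming later shifts all plots upwards, and by different amounts from plot to plot.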


Hao Ran Lai