A quick look at XKCD and power laws

A recent poster on /r/math posted about XKCD comic repostings following power laws, and I perhaps unfairly pointed to a paper by Clauset, Shalizi, and Newman which, essentially, tells people to stop fitting power laws to everything. (Shalizi has an argumentative web-log post about it). This is why we don't let mathematicians do statistics.

Was I being unfair? Maybe, maybe not. The poster chose a coefficient by looking at a log-log plot and reported an $ R^2 $, so something was done wrong. But perhaps the data really are a power law instead of, say, a log-normal. Fortunately, there's an R package for this, poweRlaw.

Fitting this stuff is fairly easy, and the specific comparison that we're interested in (log normal vs power law) comes straight from the documentation. I will therefore be somewhat terse, as I would otherwise simply be repeating what the documentation states. It's really rather well-written and thorough and the package is well-designed. If you're very curious, I highly recommend trying it yourself with this code and playing around with the various options the documentation illustrates.

The author of the original image conveniently gave a link to the original data here. After loading it in:

xkcdcounts = xkcd$Count

summary(xkcdcounts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.0     3.0    11.2     9.0   676.0

library(MASS)
truehist(xkcdcounts[xkcdcounts > 10])

plot of chunk unnamed-chunk-2

First, fit the power law and log-normal distributions. Note that, to do this the principled way, you need to fit the power law to only the tail of the distribution, so a proper cutoff point needs to be chosen (the R package does this based on a Kolmogorov-Smirnov test - read the paper for more details). To compare the log-normal and power law distributions, they should use the same cutoff. The original author used an implicit cutoff of $ 10 $.


library("poweRlaw")

model_power = displ$new(xkcdcounts)
est = estimate_xmin(model_power)
model_power$setXmin(est)

model_lognormal = dislnorm$new(xkcdcounts)
model_lognormal$setXmin(est)
est = estimate_pars(model_lognormal)
model_lognormal$setPars(est)
model_power$pars

## [1] 2.292

model_power$xmin

## [1] 18

model_lognormal$pars

## [1] -2.463  2.287

model_power is the power law model, the model_power$pars gives the $ \alpha $ parameter (ie, the power of the power law). model_lognormal is the log-normal model and model_lognormal$pars gives the $ \mu $ and $ \sigma^2 $ parameters, respectively.

plot(model_power)
lines(model_power, col=2)
lines(model_lognormal, col = 4)
legend(50,.5, # places a legend at the appropriate place 
       c("power law","log-normal"), # puts text in the legend

lty=c(1,1), # gives the legend appropriate symbols (lines)

lwd=c(2.5,2.5),col=c(2,4))

plot of chunk unnamed-chunk-4

These both look essentially the same in terms of how well they fit the data.

We can test a couple things. First, whether a power law is a reasonable fit for the data. This can be done using the bootstrap. Second, whether the power law model fits better than the log normal model.

bs_p = bootstrap_p(model_power)

This is testing against the null hypothesis that the data are generated from a power law distribution. This yields a $ p $-value of 0.62, so we fail to reject the null hypothesis. Great. Now to compare the power law distribution to the log-normal:

compare = compare_distributions(model_power, model_lognormal)

See the documentation for what this does, but the general idea is that, if one distribution is a better fit for the data than the other, it will pick it out. From the documentation:

Vuong’s test, which a likelihood ratio test for model selection using the Kullback-Leibler criteria. The test statistic, R , is the ratio of the log-likelihoods of the data between the two competing models.

This yields a test-statistic of -0.6572 and a $ p $-value of 0.511. This suggests the power law is a slightly better fit to the data than a log-normal, but the difference is not significant. This suggests there is no reason to favor one model over the other.

Conclusion

A power law is an adequate description of the data when fitted correctly. However, it does not perform better than, say, a log-normal distribution fitted in the same way, so it's not particularly meaningful to say that they are distributed according to a power law. Further, the data need to be over several orders of magnitude for a power law to be meaningful, and, since the optimal model starts at 18, we're only going over 1.6 decades of data (ie, $ \log_{10}676/18 $), which is not enough at all to conclude there's a power law even if the model were somehow distinguished from other models.