As part of my final project for the MA in Data Journalism at Birmingham City University, I have been exploring the use of probability and probability distributions in media stories (more about that at the end of this article). That is how I came up with the idea of using the Poisson distribution for a story about the most dangerous cities for cyclists, or rather the probability of having an accident in each area.
I contacted Carlos Gil Bellosta (here is his blog) to make sure I was going in the right direction. Then, I looked for the data I needed here.
As the purpose was to learn how to use this model for a journalistic story, I took only the 15 areas from the West Midlands. I also looked up the population size for each of the selected local authorities.
casualties_wm <- read.csv("casualties_cyclist.csv", sep = ";")
# Strip the thousands separator so the casualty counts are numeric
casualties_wm$All_casualties_16 <- as.numeric(gsub(",", "", casualties_wm$All_casualties_16))
class(casualties_wm$All_casualties_16)
## [1] "numeric"
# Look at the distinct rows: one per local authority
unique(casualties_wm)
There are 15 areas. I have information regarding the number of casualties in 2016, the population in 2016, the rate of casualties, and an average of the casualties in each city between 2010 and 2014.
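For reference, these are the columns the rest of the code relies on (a quick structural check, assuming the column names used later in this post):
# Columns used throughout: the 2016 counts and population, the 2016 rate
# per million, and the 2010-2014 average per million
str(casualties_wm[, c("Local_Authority", "All_casualties_16", "Pop_16",
                      "casualties_16_million", "average_million")])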
If I had told this story three months ago, I would have done:
library(ggplot2)
ggplot(data = casualties_wm, aes(x = reorder(Local_Authority, casualties_16_million), y = casualties_16_million)) +
  geom_bar(stat = "identity") +
  labs(title = "The worst towns for cyclists in the West Midlands",
       y = "Casualties per million", x = NULL,
       caption = "Source: Department of Transport") +
  theme(axis.text.x = element_text(angle = 20))
That is, I would have considered only the 2016 rate: a single, ‘precise’ value.
But what if that year was an outlier? What about the uncertainty and variability in the rate? What about talking about risk as the probability of having a casualty, given the information we have from the past? And what about taking the population size into account?
No, it’s not the best choice.
Two of the primary examples statisticians use when explaining the Poisson distribution are cancer deaths and car accidents. So cyclist casualties seemed a reasonable fit for this model: the events are independent, there is no fixed upper limit on how many can occur, they happen randomly within a time interval, and the average rate stays roughly the same from interval to interval.
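To make ‘risk as a probability’ concrete: once we settle on a rate, the Poisson distribution hands us probabilities directly. A minimal sketch with entirely hypothetical numbers (a rate of 200 casualties per million per year in a town of 250,000 people):
# Hypothetical example, not from the dataset
lambda <- 200 * 250000 / 1e6   # expected casualties in a year: 50
dpois(55, lambda)              # probability of exactly 55 casualties
1 - ppois(60, lambda)          # probability of more than 60 casualties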
What I hadn’t taken into account was the “multiple comparisons problem” discussed in this paper that Carlos sent me.
One of its examples was quite similar to my idea, but the paper does not give much detail about the code used (no, it wasn’t written for journalists). So, after a couple of attempts, I wrote to Carlos again asking for help, and he got back to me with two models to tell my story.
library(lme4)
library(lattice)
library(effects)
glmer(All_casualties_16 ~ (1|Local_Authority) + offset(log(Pop_16)), family = poisson, data = casualties_wm)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: All_casualties_16 ~ (1 | Local_Authority) + offset(log(Pop_16))
## Data: casualties_wm
## AIC BIC logLik deviance df.resid
## 142.4280 143.8441 -69.2140 138.4280 13
## Random effects:
## Groups Name Std.Dev.
## Local_Authority (Intercept) 0.2421
## Number of obs: 15, groups: Local_Authority, 15
## Fixed Effects:
## (Intercept)
## -8.484
“Generalized linear mixed model fit by maximum likelihood,” what is this?
A Generalised Linear Model (GLM) is a regression model for cases where the response variable (the cyclist casualties) does not follow a normal distribution but one from the exponential family (Poisson, in my case).
A “Generalized linear mixed model” adds random effects to that, in this case one for each local authority.
The function is similar to the regression one that I had previously explored.
“All_casualties_16” is my dependent variable. So, the next argument is the independent. But, why the number 1 in (1|Local_Authority)?
“The number 1 is the independent variable,” Carlos explained by email, “so (1|x) means that there are variations in the independent variable which depend on ‘x’. In other words, for each x there is a deviation from the mean,” he added.
The glmer function also ‘pools’ those deviations, to avoid extreme values and to get more stable estimates.
The next argument is ‘offset’, which “is used for a covariate with known slope”, as Mervyn Thomas explained in this online answer. “This might arise in situations where you are correcting the number of events for an estimate of population size.”
The family argument refers to the ‘type’ of distribution. And, lastly, our dataset.
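To see what the random effect and the pooling actually change, it can help to compare with the no-pooling alternative: a plain Poisson GLM with a separate fixed effect per local authority. This is only a sketch for comparison, not one of Carlos’s models:
# No-pooling comparison: one fixed effect per local authority,
# same Poisson family and population offset as the mixed model
glm_no_pooling <- glm(All_casualties_16 ~ Local_Authority + offset(log(Pop_16)),
                      family = poisson, data = casualties_wm)
# With one observation per authority this just reproduces the raw 2016 rates,
# whereas glmer() shrinks each authority's estimate towards the overall mean
fitted(glm_no_pooling) / casualties_wm$Pop_16 * 1e6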
modelo <- glmer(All_casualties_16 ~ (1|Local_Authority) + offset(log(Pop_16)), family = poisson, data = casualties_wm)
The model estimates the average casualty rate across all the local authorities, and with `ranef()` we can access each local authority’s estimated deviation from that overall mean. On the model’s log scale, this works like a separate intercept adjustment for each local authority. These values will be needed later.
ranef(modelo)
## $Local_Authority
## (Intercept)
## Birmingham 0.14848343
## Coventry 0.29432925
## Dudley -0.45295910
## Herefordshire, County of 0.13292146
## Sandwell -0.41409304
## Shropshire -0.06216858
## Solihull 0.06640681
## Staffordshire -0.01055372
## Stoke-on-Trent 0.03643748
## Telford and Wrekin -0.09548825
## Walsall -0.04966910
## Warwickshire 0.37233359
## West Midlands -0.03084169
## Wolverhampton 0.19341904
## Worcestershire -0.06418722
randoms <- ranef(modelo, condVar = TRUE)$Local_Authority
variances <- as.numeric(attr(randoms, "postVar")) #We'll use the variances for estimating the confidence intervals.
Then, he creates a new data frame with the estimated mean casualties (on the log scale) for each local authority: the deviation of each local authority plus the average across all local authorities.
res <- data.frame(local_authority = rownames(randoms), mean_effect = randoms$`(Intercept)`+coef(summary(modelo))[,"Estimate"])
coef(summary(modelo))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.484382 0.06981866 -121.5203 0
After that, he adds the lower and upper limits of the estimate for each local authority, two standard deviations either side.
res$lower <- res$mean_effect - 2* sqrt(variances) # 2 standard deviation
res$upper <- res$mean_effect + 2* sqrt(variances)
The estimates are on the log scale (the Poisson model uses a log link; Poisson belongs to the exponential family), so exp() transforms them back into rates.
exp(res$mean_effect)
## [1] 0.0002397538 0.0002773994 0.0001313900 0.0002360516 0.0001365971
## [6] 0.0001942140 0.0002208615 0.0002045015 0.0002143406 0.0001878494
## [11] 0.0001966568 0.0002999040 0.0002003944 0.0002507730 0.0001938223
res$mean_effect <- exp(res$mean_effect)*1e6 # Per million inhabitants.
res$lower <- exp(res$lower)*1e6
res$upper <- exp(res$upper)*1e6
res$local_authority <- reorder(res$local_authority, res$mean_effect, mean)
ggplot(data = res, aes(x = local_authority, y = mean_effect)) +
  geom_point() +
  geom_errorbar(width = .1, aes(ymin = lower, ymax = upper), col = "blue") +
  labs(title = "Dangerous cities for cyclists in the West Midlands",
       y = "Annual casualties per million", x = NULL,
       caption = "Source: Department of Transport") +
  theme(axis.text.x = element_text(angle = 20))
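Before moving on to the second model, a quick look at that pooling in the numbers: comparing the raw 2016 rates with the model’s estimates shows the extreme local authorities being pulled towards the overall mean (a sketch, assuming the res data frame and the casualties_16_million column from above):
# Shrinkage check: raw 2016 rate vs the partially pooled estimate
idx <- match(res$local_authority, casualties_wm$Local_Authority)
shrinkage_glmm <- data.frame(local_authority = res$local_authority,
                             raw_2016 = casualties_wm$casualties_16_million[idx],
                             pooled   = round(res$mean_effect, 1))
shrinkage_glmm[order(shrinkage_glmm$raw_2016), ]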
For the second model, he uses Stan, which is a probabilistic programming language for Bayesian statistical inference.
Worth digging a bit into Bayesian inference to understand the basic concepts of prior and posterior probability. I will translate the example Carlos gave me.
“You are thinking of buying a mobile phone with some specific characteristics, but you don’t know how much it would cost. You estimate that it could cost between 100 and 800 euros, which is much the same as saying you don’t have a clue. That is your prior belief.
Then you do some research: somebody bought a new phone for X euros, a friend saw a similar one in China for X euros, and so on. Your estimate narrows down to between 250 and 350 euros. That is what you “learn” from the data. You don’t have the exact value, but your new estimate is better. The more data you have, the more accurate your (posterior) estimate is."
And that is what Stan does.
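Before getting to the code, here is that prior-to-posterior idea in miniature in R, using the textbook Gamma-Poisson conjugate pair (this is not Carlos’s model, just an illustration with made-up numbers): a vague prior on a casualty rate becomes sharper each time we feed it another year of counts.
# Toy Bayesian update: Gamma prior on a Poisson rate (hypothetical numbers)
prior_shape <- 2
prior_rate  <- 0.01            # prior mean rate = 2 / 0.01 = 200 per million
obs <- c(180, 210, 195)        # three years of counts for a town of 1 million people
post_shape <- prior_shape + sum(obs)     # conjugate update of the shape
post_rate  <- prior_rate + length(obs)   # one (million person-years) of exposure per year
post_shape / post_rate                   # posterior mean rate: about 195 per million
qgamma(c(0.025, 0.975), post_shape, post_rate)  # 95% credible interval for the rate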
library(rstan)
// contents of src.stan
data {
  int<lower=1> n;
  int casualties[n];
  real pop[n];
  real my_mean;
  real my_sd;
}
parameters {
  real<lower = 0> rate[n];
}
model {
  // prior (regularization): using historical data to assess the regularization amount
  rate ~ normal(my_mean, my_sd);
  // model: each casualty count is a Poisson-distributed value with its own rate
  for (i in 1:n)
    casualties[i] ~ poisson(rate[i] * pop[i] / 1e6);
}
First, the Stan program (saved as src.stan): the data and parameters blocks declare the elements the model will use, and the model itself is specified at the bottom.
The loop says that the number of casualties in each local authority follows a Poisson distribution, and what we need to calculate is the rate.
for (i in 1:n) casualties[i] ~ poisson(rate[i] * pop[i] / 1e6)
However, this rate should stay reasonably close to the historical one. The normal() prior takes the mean and standard deviation of the historical information in the dataset: the average casualties between 2010 and 2014.
rate ~ normal(my_mean, my_sd)
#preparing the data
#number of casualties, population, average of the last years - explanatory variables
stan_data <- list(n=nrow(casualties_wm), casualties= casualties_wm$All_casualties_16, pop=casualties_wm$Pop_16, my_mean=mean(casualties_wm$average_million), my_sd= sd(casualties_wm$average_million))
Time to use the function stan(). This function estimates the value that we don’t know (the rate) given the data that we have.
fit <- stan(file = "src.stan", data = stan_data, chains = 1, iter = 12000, warmup = 2000, thin = 10)
##
## SAMPLING FOR MODEL 'src' NOW (CHAIN 1).
##
## Gradient evaluation took 1.6e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 12000 [ 0%] (Warmup)
## Iteration: 1200 / 12000 [ 10%] (Warmup)
## Iteration: 2001 / 12000 [ 16%] (Sampling)
## Iteration: 3200 / 12000 [ 26%] (Sampling)
## Iteration: 4400 / 12000 [ 36%] (Sampling)
## Iteration: 5600 / 12000 [ 46%] (Sampling)
## Iteration: 6800 / 12000 [ 56%] (Sampling)
## Iteration: 8000 / 12000 [ 66%] (Sampling)
## Iteration: 9200 / 12000 [ 76%] (Sampling)
## Iteration: 10400 / 12000 [ 86%] (Sampling)
## Iteration: 11600 / 12000 [ 96%] (Sampling)
## Iteration: 12000 / 12000 [100%] (Sampling)
##
## Elapsed Time: 0.124903 seconds (Warm-up)
## 0.385149 seconds (Sampling)
## 0.510052 seconds (Total)
The arguments mean:
File: that is the document where we have established the method before.
Data: the list with the explanatory variables.
Chains: I had to look this up. A Markov chain is a sequence of possible events in which the probability of each event depends only on the one before it; this argument sets how many of these chains the sampler runs (running more than one also helps to check convergence, via the Rhat column in the summary).
Iter: the number of iterations for each chain.
Warmup: I couldn’t find what this argument was for, and Carlos explained to me that it is used to “ignore the first values, which tend to be noisy and inconsistent.” That makes sense even with a basic knowledge of statistics, given ideas such as the law of large numbers or the random variation you get from small samples.
Thin: keep one draw out of every n iterations.
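Putting those arguments together: with iter = 12000, warmup = 2000 and thin = 10, the sampler keeps (12000 - 2000) / 10 = 1000 draws per chain, which is why the n_eff column in the summary below never goes above 1000.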
summary(fit)
## $summary
## mean se_mean sd 2.5% 25% 50%
## rate[1] 200.5782 0.17207441 5.441471 190.55606 196.9610 200.1393
## rate[2] 239.0139 0.49172490 13.523380 213.32265 230.1473 238.6921
## rate[3] 269.2933 0.72345735 22.877730 227.63033 252.7730 269.3564
## rate[4] 127.0279 0.65963056 19.121269 92.28167 113.7127 126.3972
## rate[5] 235.2729 0.94737389 27.234106 182.87845 216.6853 234.3042
## rate[6] 134.1343 0.61366491 19.386406 96.74174 121.1608 133.3523
## rate[7] 196.3266 0.68690779 21.721931 155.90232 181.4397 195.2358
## rate[8] 222.6154 0.84138275 26.606859 170.70388 205.1610 222.5776
## rate[9] 205.6729 0.54575156 14.892129 177.45844 195.4199 205.2271
## rate[10] 216.9816 0.78085822 24.692905 168.44584 199.6083 216.7503
## rate[11] 191.3364 0.90631511 26.837176 139.37468 173.1210 189.5405
## rate[12] 200.0942 0.72872718 23.044377 155.08998 184.0476 199.7375
## rate[13] 290.7651 0.60616731 19.168693 253.75901 278.9077 291.0934
## rate[14] 247.6109 0.79026942 24.990513 201.36138 230.3554 247.5467
## rate[15] 195.5846 0.55237545 16.829843 164.12420 184.3754 194.9208
## lp__ 12664.4730 0.08763882 2.771383 12658.10767 12662.9508 12664.8191
## 75% 97.5% n_eff Rhat
## rate[1] 204.2374 211.8207 1000.0000 1.0010153
## rate[2] 247.6457 267.8199 756.3557 0.9991980
## rate[3] 283.2044 317.1828 1000.0000 0.9990019
## rate[4] 139.0973 168.1891 840.2952 0.9990357
## rate[5] 253.0380 288.2436 826.3869 0.9990166
## rate[6] 146.9854 172.1853 998.0035 1.0002369
## rate[7] 210.6141 239.2975 1000.0000 0.9993131
## rate[8] 240.2450 276.1301 1000.0000 0.9990454
## rate[9] 215.1348 236.2019 744.6010 0.9994411
## rate[10] 232.4402 265.3505 1000.0000 1.0020450
## rate[11] 210.1585 245.9571 876.8296 1.0005146
## rate[12] 215.7745 247.0494 1000.0000 0.9992488
## rate[13] 303.4819 327.6961 1000.0000 1.0053586
## rate[14] 263.8301 297.1549 1000.0000 0.9991519
## rate[15] 206.2135 230.5540 928.3065 0.9991230
## lp__ 12666.4750 12668.8967 1000.0000 1.0010794
##
## $c_summary
## , , chains = chain:1
##
## stats
## parameter mean sd 2.5% 25% 50%
## rate[1] 200.5782 5.441471 190.55606 196.9610 200.1393
## rate[2] 239.0139 13.523380 213.32265 230.1473 238.6921
## rate[3] 269.2933 22.877730 227.63033 252.7730 269.3564
## rate[4] 127.0279 19.121269 92.28167 113.7127 126.3972
## rate[5] 235.2729 27.234106 182.87845 216.6853 234.3042
## rate[6] 134.1343 19.386406 96.74174 121.1608 133.3523
## rate[7] 196.3266 21.721931 155.90232 181.4397 195.2358
## rate[8] 222.6154 26.606859 170.70388 205.1610 222.5776
## rate[9] 205.6729 14.892129 177.45844 195.4199 205.2271
## rate[10] 216.9816 24.692905 168.44584 199.6083 216.7503
## rate[11] 191.3364 26.837176 139.37468 173.1210 189.5405
## rate[12] 200.0942 23.044377 155.08998 184.0476 199.7375
## rate[13] 290.7651 19.168693 253.75901 278.9077 291.0934
## rate[14] 247.6109 24.990513 201.36138 230.3554 247.5467
## rate[15] 195.5846 16.829843 164.12420 184.3754 194.9208
## lp__ 12664.4730 2.771383 12658.10767 12662.9508 12664.8191
## stats
## parameter 75% 97.5%
## rate[1] 204.2374 211.8207
## rate[2] 247.6457 267.8199
## rate[3] 283.2044 317.1828
## rate[4] 139.0973 168.1891
## rate[5] 253.0380 288.2436
## rate[6] 146.9854 172.1853
## rate[7] 210.6141 239.2975
## rate[8] 240.2450 276.1301
## rate[9] 215.1348 236.2019
## rate[10] 232.4402 265.3505
## rate[11] 210.1585 245.9571
## rate[12] 215.7745 247.0494
## rate[13] 303.4819 327.6961
## rate[14] 263.8301 297.1549
## rate[15] 206.2135 230.5540
## lp__ 12666.4750 12668.8967
Storing the rates and the CI for the chart:
res2 <- as.data.frame(summary(fit)$summary[1:nrow(casualties_wm), c("mean", "2.5%", "97.5%")])
res2$local_authority <- casualties_wm$Local_Authority
res2$local_authority <- reorder(res2$local_authority, res2$mean, mean)
colnames(res2) <- c("mean", "lower", "upper", "local_authority")
ggplot(res2, aes(x=local_authority, y = mean)) +
geom_point(alpha=0.5) +
geom_errorbar(width=.1, aes(ymin=lower, ymax=upper), colour="blue") +
labs(x=NULL, y= "Annual casualties per million", title="The most dangerous areas for cyclists in the West Midlands",
caption="Source: Department of Transport") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The blue lines show the uncertainty: 19 times out of 20, the number of casualties per million will fall between the lower and upper limits. Some intervals are wider than others because of the population size: the smaller the population, the greater the variation in the estimated mean.
The dots are the values estimated by the model (the means), which are similar but not equal to the raw rates because of the regularization (a new word and technique for my list).
“Regularization is one of the main processes in modern statistics, and it is related to an ‘incredulity’ towards extreme values,” Carlos told me. So this process ‘punishes’ those extreme values, pulling them towards the mean.
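A quick way to see that regularization in the numbers is to line up the raw 2016 rates with the posterior means (a sketch; the rows of res2 follow the order of casualties_wm because that is how the data were passed to Stan):
# How far each posterior mean was pulled away from the raw 2016 rate
shrinkage_stan <- data.frame(local_authority = res2$local_authority,
                             raw_2016 = casualties_wm$casualties_16_million,
                             posterior_mean = round(res2$mean, 1))
shrinkage_stan$pulled_by <- round(shrinkage_stan$posterior_mean - shrinkage_stan$raw_2016, 1)
shrinkage_stan[order(-abs(shrinkage_stan$pulled_by)), ]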
In several papers, David Spiegelhalter has suggested using funnel plots, which plot an indicator against a measure of its precision with control limits, to avoid the “spurious ranking” of caterpillar plots.
asfunnelplot <- data.frame(LA=casualties_wm$Local_Authority,
casualties=(casualties_wm$casualties_16_million/1000000),
population=casualties_wm$Pop_16
)
asfunnelplot$SE <- sqrt((asfunnelplot$casualties*(1-asfunnelplot$casualties))/asfunnelplot$population) # approximate standard error of each local authority's rate
casualties.fem <- weighted.mean(asfunnelplot$casualties, 1/asfunnelplot$SE^2)
## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
pop.seq <- seq(10000, max(asfunnelplot$population), 10000)
ll95 <-casualties.fem - 1.96 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
ul95 <- casualties.fem + 1.96 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
ll999 <- casualties.fem - 3.29 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
ul999 <- casualties.fem + 3.29 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
CI <- data.frame(ll95, ul95, ll999, ul999, pop.seq, casualties.fem)
## draw plot
funnel <- ggplot(aes(x = population, y = casualties), data = asfunnelplot) +
geom_point(shape = 1, size = 2) +
geom_line(aes(x = pop.seq, y = ll95, color = "steelblue3"), data = CI, color = "steelblue3") +
geom_line(aes(x = pop.seq, y = ul95, color = "steelblue3"), data = CI, color = "steelblue3")+
geom_line(aes(x = pop.seq, y = ll999, color = "steelblue1"), linetype = "dashed", data = CI, color = "steelblue1") +
geom_line(aes(x = pop.seq, y = ul999, color = "steelblue1"), linetype = "dashed", data = CI, color = "steelblue1") +
  geom_hline(aes(yintercept = casualties.fem, color = "olivedrab"), data = CI, color = "olivedrab") +
  xlim(0, 1150000) +
  scale_y_continuous(labels = function(y) y*100000) +
  labs(title = "", x = "Local authority population size", y = "") +
  theme_bw()
funnel
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 520 rows containing missing values (geom_path).
## Warning: Removed 520 rows containing missing values (geom_path).
## Warning: Removed 520 rows containing missing values (geom_path).
## Warning: Removed 520 rows containing missing values (geom_path).
But funnel plots make multiple comparisons difficult, and they do not avoid the problem of extreme values.
“There is no formal allowance for ‘regression to the mean’, in which extreme outcomes are expected to tend towards the population mean simply because a contributing factor to their ‘extremeness’ is likely to be a run of good or bad luck. This could be formally taken into account by fitting a random-effects model,” Spiegelhalter says in “Funnel plots for comparing institutional performance.”
Hence, for my story I would use the random-effects models above, which estimate the casualty rate using the probability distribution this type of event follows, which have mechanisms to avoid extreme values, and which calculate and show the variation around each estimated mean (the uncertainty).
“Scientists deal with uncertainty by invoking probability,” wrote Victor Cohn in the magazine Significance, adding: “to be statistically significant, and not just the result of pure chance, the same result must appear again and again. When it does, that’s reliability.”
More specifically, of the two models developed, I would lean towards the one built with Stan, because it takes the historical average into account and relates it to the new estimated rate for each local authority.
Cohn, V. (1999) How to help reporters tell the truth, Of Significance, Oct 22nd, pp. 9-13. Available at: http://www.statlit.org/pdf/1999-Cohn-Significance.pdf [Accessed August 14]
Department of Statistics Online Programs, Pennsylvania State University (n.d.) Introduction to Generalized Linear Models, STAT 504. Available at: https://onlinecourses.science.psu.edu/stat504/node/216/ [Accessed August 14]
Gelman, A., Hill, J. and Yajima, M. (2012) Why we (usually) don’t have to worry about multiple comparisons, Journal of Research on Educational Effectiveness, Vol. 5, pp. 189-211, DOI: 10.1080/19345747.2011.618213 [Accessed August 15]
Hertzog, L. (2017) Interpreting random effects in linear mixed-effect models, Biologyforfun [blog], April 3rd. Available at: https://biologyforfun.wordpress.com/2017/04/03/interpreting-random-effects-in-linear-mixed-effect-models/ [Accessed August 14]
Iosif Moraru, R. (2014) How to assess risk: as a measure of probability or possibility?, ResearchGate. Available at: https://www.researchgate.net/post/How_to_assess_risk_as_a_measure_of_probability_or_possibility [Accessed August 14]
Thomas, M. (2013) What is the role of an offset term in modelling a GLM?, ResearchGate. Available at: https://www.researchgate.net/post/What_is_the_role_of_an_offset_term_in_modelling_a_GLM [Accessed August 14]
NSS (2016) Bayesian Statistics explained to Beginners in Simple English, Analytics Vidhya, Jun 20. Available at: https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/ [Accessed August 14]
Spiegelhalter, D. (2005) Handling over-dispersion of performance indicators, BMJ Quality & Safety, Vol. 14, pp. 347-351. Available at: http://dx.doi.org/10.1136/qshc.2005.013755 [Accessed August 14]
Spiegelhalter, D. (2005) Funnel plots for comparing institutional performance, Statistics in Medicine, Vol. 24, pp. 1185-1202.
StatsToDo (n.d.) Poisson Distribution: Explained. Available at: https://www.statstodo.com/Poisson_Exp.php [Accessed August 14]