As part of my final project for the MA in Data Journalism at Birmingham City University, I have been exploring the use of probability and probability distributions in media stories (more about that at the end of this article). That is how I came up with the idea of using the Poisson distribution for a story about the most dangerous cities for cyclists, or rather the probability of having an accident in each area.
I contacted Carlos Gil Bellosta (here is his blog) to make sure I was going in the right direction. Then, I looked for the data I needed here.
As the purpose was to learn how to use this model for a journalistic story, I took only the 15 areas from the West Midlands. I also looked up the population size for each of the selected local authorities.
casualties_wm <- read.csv("casualties_cyclist.csv", sep = ";")
# Strip the thousands separator so the casualty counts are numeric
casualties_wm$All_casualties_16 <- as.numeric(gsub(",", "", casualties_wm$All_casualties_16))
class(casualties_wm$All_casualties_16)
## [1] "numeric"
# Look at the distinct rows: one per local authority
unique(casualties_wm)
There are 15 areas. I have information regarding the number of casualties in 2016, the population in 2016, the rate of casualties, and an average of the casualties in each city between 2010 and 2014.
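For reference, these are the columns the rest of the code relies on (a quick structural check, assuming the column names used later in this post):
# Columns used throughout: the 2016 counts and population, the 2016 rate
# per million, and the 2010-2014 average per million
str(casualties_wm[, c("Local_Authority", "All_casualties_16", "Pop_16",
                      "casualties_16_million", "average_million")])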
If I had told this story three months ago, I would have done:
library(ggplot2)
ggplot(data = casualties_wm, aes(x = reorder(Local_Authority, casualties_16_million), y = casualties_16_million)) +
  geom_bar(stat = "identity") +
  labs(title = "The worst towns for cyclists in the West Midlands",
       y = "Casualties per million", x = NULL,
       caption = "Source: Department of Transport") +
  theme(axis.text.x = element_text(angle = 20))
That is, I would have considered only the 2016 rate: a single, ‘precise’ value.
But what if that year was an outlier? What about the uncertainty and variability in the rate? What about talking about risk as the probability of having a casualty, given the information we have from the past? And what about taking the population size into account?
No, it’s not the best choice.
Two of the primary examples statisticians use when explaining the Poisson distribution are cancer deaths and car accidents. So cyclist casualties seemed a reasonable fit for this model: the events are independent, there is no fixed upper limit on how many can occur, they happen randomly within a time interval, and the average rate stays roughly the same from interval to interval.
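To make ‘risk as a probability’ concrete: once we settle on a rate, the Poisson distribution hands us probabilities directly. A minimal sketch with entirely hypothetical numbers (a rate of 200 casualties per million per year in a town of 250,000 people):
# Hypothetical example, not from the dataset
lambda <- 200 * 250000 / 1e6   # expected casualties in a year: 50
dpois(55, lambda)              # probability of exactly 55 casualties
1 - ppois(60, lambda)          # probability of more than 60 casualties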
What I hadn’t taken into account was the “multiple comparisons problem” discussed in this paper that Carlos sent me.
One of its examples was quite similar to my idea, but the paper does not give much detail about the code used (no, it wasn’t written for journalists). So, after a couple of attempts, I wrote to Carlos again asking for help, and he got back to me with two models to tell my story.
library(lme4)
library(lattice)
library(effects)
glmer(All_casualties_16 ~ (1|Local_Authority) + offset(log(Pop_16)), family = poisson, data = casualties_wm)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: All_casualties_16 ~ (1 | Local_Authority) + offset(log(Pop_16))
## Data: casualties_wm
## AIC BIC logLik deviance df.resid
## 142.4280 143.8441 -69.2140 138.4280 13
## Random effects:
## Groups Name Std.Dev.
## Local_Authority (Intercept) 0.2421
## Number of obs: 15, groups: Local_Authority, 15
## Fixed Effects:
## (Intercept)
## -8.484
“Generalized linear mixed model fit by maximum likelihood,” what is this?
A Generalised Linear Model (GLM) is a regression model for cases where the response variable (the cyclist casualties) does not follow a normal distribution but one from the exponential family (Poisson, in my case).
A “Generalized linear mixed model” adds random effects to that, in this case one for each local authority.
The function is similar to the regression one that I had previously explored.
“All_casualties_16” is my dependent variable. So, the next argument is the independent. But, why the number 1 in (1|Local_Authority)?
“The number 1 is the independent variable,” Carlos explained by email, “so (1|x) means that there are variations in the independent variable which depend on ‘x’. In other words, for each x there is a deviation from the mean,” he added.
The glmer function also ‘pools’ those deviations, to avoid extreme values and to get more stable estimates.
The next argument is ‘offset’, which “is used for a covariate with known slope”, as Mervyn Thomas explained in this online answer. “This might arise in situations where you are correcting the number of events for an estimate of population size.”
The family argument refers to the ‘type’ of distribution. And, lastly, our dataset.
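To see what the random effect and the pooling actually change, it can help to compare with the no-pooling alternative: a plain Poisson GLM with a separate fixed effect per local authority. This is only a sketch for comparison, not one of Carlos’s models:
# No-pooling comparison: one fixed effect per local authority,
# same Poisson family and population offset as the mixed model
glm_no_pooling <- glm(All_casualties_16 ~ Local_Authority + offset(log(Pop_16)),
                      family = poisson, data = casualties_wm)
# With one observation per authority this just reproduces the raw 2016 rates,
# whereas glmer() shrinks each authority's estimate towards the overall mean
fitted(glm_no_pooling) / casualties_wm$Pop_16 * 1e6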
modelo <- glmer(All_casualties_16 ~ (1|Local_Authority) + offset(log(Pop_16)), family = poisson, data = casualties_wm)
The model estimates the average casualty rate across all the local authorities, and with `ranef()` we can access each local authority’s estimated deviation from that overall mean. On the model’s log scale, this works like a separate intercept adjustment for each local authority. These values will be needed later.
ranef(modelo)
## $Local_Authority
## (Intercept)
## Birmingham 0.14848343
## Coventry 0.29432925
## Dudley -0.45295910
## Herefordshire, County of 0.13292146
## Sandwell -0.41409304
## Shropshire -0.06216858
## Solihull 0.06640681
## Staffordshire -0.01055372
## Stoke-on-Trent 0.03643748
## Telford and Wrekin -0.09548825
## Walsall -0.04966910
## Warwickshire 0.37233359
## West Midlands -0.03084169
## Wolverhampton 0.19341904
## Worcestershire -0.06418722
randoms <- ranef(modelo, condVar = TRUE)$Local_Authority
variances <- as.numeric(attr(randoms, "postVar")) #We'll use the variances for estimating the confidence intervals.
Then, he creates a new data frame with the estimated mean casualties (on the log scale) for each local authority: the deviation of each local authority plus the average across all local authorities.
res <- data.frame(local_authority = rownames(randoms), mean_effect = randoms$`(Intercept)`+coef(summary(modelo))[,"Estimate"])
coef(summary(modelo))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.484382 0.06981866 -121.5203 0
After that, he adds the lower and upper limits of the estimate for each local authority, two standard deviations either side.
res$lower <- res$mean_effect - 2* sqrt(variances) # 2 standard deviation
res$upper <- res$mean_effect + 2* sqrt(variances)
The estimates are on the log scale (the Poisson model uses a log link; Poisson belongs to the exponential family), so exp() transforms them back into rates.
exp(res$mean_effect)
## [1] 0.0002397538 0.0002773994 0.0001313900 0.0002360516 0.0001365971
## [6] 0.0001942140 0.0002208615 0.0002045015 0.0002143406 0.0001878494
## [11] 0.0001966568 0.0002999040 0.0002003944 0.0002507730 0.0001938223
res$mean_effect <- exp(res$mean_effect)*1e6 # Per million inhabitants.
res$lower <- exp(res$lower)*1e6
res$upper <- exp(res$upper)*1e6
res$local_authority <- reorder(res$local_authority, res$mean_effect, mean)
ggplot(data = res, aes(x = local_authority, y = mean_effect)) +
  geom_point() +
  geom_errorbar(width = .1, aes(ymin = lower, ymax = upper), col = "blue") +
  labs(title = "Dangerous cities for cyclists in the West Midlands",
       y = "Annual casualties per million", x = NULL,
       caption = "Source: Department of Transport") +
  theme(axis.text.x = element_text(angle = 20))
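Before moving on to the second model, a quick look at that pooling in the numbers: comparing the raw 2016 rates with the model’s estimates shows the extreme local authorities being pulled towards the overall mean (a sketch, assuming the res data frame and the casualties_16_million column from above):
# Shrinkage check: raw 2016 rate vs the partially pooled estimate
idx <- match(res$local_authority, casualties_wm$Local_Authority)
shrinkage_glmm <- data.frame(local_authority = res$local_authority,
                             raw_2016 = casualties_wm$casualties_16_million[idx],
                             pooled   = round(res$mean_effect, 1))
shrinkage_glmm[order(shrinkage_glmm$raw_2016), ]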
For the second model, he uses Stan, which is a probabilistic programming language for Bayesian statistical inference.
Worth digging a bit into Bayesian inference to understand the basic concepts of prior and posterior probability. I will translate the example Carlos gave me.
“You are thinking of buying a mobile phone with some specific characteristics, but you don’t know how much it would cost. You estimate that it could cost between 100 and 800 euros, which is much the same as saying you don’t have a clue. That is your prior belief.
Then you do some research: somebody bought a new phone for X euros, a friend saw a similar one in China for X euros, and so on. Your estimate narrows down to between 250 and 350 euros. That is what you “learn” from the data. You don’t have the exact value, but your new estimate is better. The more data you have, the more accurate your (posterior) estimate is."
And that is what Stan does.
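Before getting to the code, here is that prior-to-posterior idea in miniature in R, using the textbook Gamma-Poisson conjugate pair (this is not Carlos’s model, just an illustration with made-up numbers): a vague prior on a casualty rate becomes sharper each time we feed it another year of counts.
# Toy Bayesian update: Gamma prior on a Poisson rate (hypothetical numbers)
prior_shape <- 2
prior_rate  <- 0.01            # prior mean rate = 2 / 0.01 = 200 per million
obs <- c(180, 210, 195)        # three years of counts for a town of 1 million people
post_shape <- prior_shape + sum(obs)     # conjugate update of the shape
post_rate  <- prior_rate + length(obs)   # one (million person-years) of exposure per year
post_shape / post_rate                   # posterior mean rate: about 195 per million
qgamma(c(0.025, 0.975), post_shape, post_rate)  # 95% credible interval for the rate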
library(rstan)
// contents of src.stan
data {
  int<lower=1> n;
  int casualties[n];
  real pop[n];
  real my_mean;
  real my_sd;
}
parameters {
  real<lower = 0> rate[n];
}
model {
  // prior (regularization): using historical data to assess the regularization amount
  rate ~ normal(my_mean, my_sd);
  // model: each casualty count is a Poisson-distributed value with its own rate
  for (i in 1:n)
    casualties[i] ~ poisson(rate[i] * pop[i] / 1e6);
}
First, the Stan program (saved as src.stan): the data and parameters blocks declare the elements the model will use, and the model itself is specified at the bottom.
The loop says that the number of casualties in each local authority follows a Poisson distribution, and what we need to calculate is the rate.
for (i in 1:n) casualties[i] ~ poisson(rate[i] * pop[i] / 1e6)
However, this rate should stay reasonably close to the historical one. The normal() prior takes the mean and standard deviation of the historical information in the dataset: the average casualties between 2010 and 2014.
rate ~ normal(my_mean, my_sd)
#preparing the data
#number of casualties, population, average of the last years - explanatory variables
stan_data <- list(n=nrow(casualties_wm), casualties= casualties_wm$All_casualties_16, pop=casualties_wm$Pop_16, my_mean=mean(casualties_wm$average_million), my_sd= sd(casualties_wm$average_million))
Time to use the function stan(). This function estimates the value that we don’t know (the rate) given the data that we have.
fit <- stan(file = "src.stan", data = stan_data, chains = 1, iter = 12000, warmup = 2000, thin = 10)
##
## SAMPLING FOR MODEL 'src' NOW (CHAIN 1).
##
## Gradient evaluation took 1.6e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 12000 [ 0%] (Warmup)
## Iteration: 1200 / 12000 [ 10%] (Warmup)
## Iteration: 2001 / 12000 [ 16%] (Sampling)
## Iteration: 3200 / 12000 [ 26%] (Sampling)
## Iteration: 4400 / 12000 [ 36%] (Sampling)
## Iteration: 5600 / 12000 [ 46%] (Sampling)
## Iteration: 6800 / 12000 [ 56%] (Sampling)
## Iteration: 8000 / 12000 [ 66%] (Sampling)
## Iteration: 9200 / 12000 [ 76%] (Sampling)
## Iteration: 10400 / 12000 [ 86%] (Sampling)
## Iteration: 11600 / 12000 [ 96%] (Sampling)
## Iteration: 12000 / 12000 [100%] (Sampling)
##
## Elapsed Time: 0.124903 seconds (Warm-up)
## 0.385149 seconds (Sampling)
## 0.510052 seconds (Total)
The arguments mean:
File: that is the document where we have established the method before.
Data: the list with the explanatory variables.
Chains: I had to look this up. A Markov chain is a sequence of possible events in which the probability of each event depends only on the one before it; this argument sets how many of these chains the sampler runs (running more than one also helps to check convergence, via the Rhat column in the summary).
Iter: the number of iterations for each chain.
Warmup: I couldn’t find what this argument was for, and Carlos explained to me that it is used to “ignore the first values, which tend to be noisy and inconsistent.” That makes sense even with a basic knowledge of statistics, given ideas such as the law of large numbers or the random variation you get from small samples.
Thin: keep one draw out of every n iterations.
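Putting those arguments together: with iter = 12000, warmup = 2000 and thin = 10, the sampler keeps (12000 - 2000) / 10 = 1000 draws per chain, which is why the n_eff column in the summary below never goes above 1000.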
summary(fit)
## $summary
## mean se_mean sd 2.5% 25% 50%
## rate[1] 200.5782 0.17207441 5.441471 190.55606 196.9610 200.1393
## rate[2] 239.0139 0.49172490 13.523380 213.32265 230.1473 238.6921
## rate[3] 269.2933 0.72345735 22.877730 227.63033 252.7730 269.3564
## rate[4] 127.0279 0.65963056 19.121269 92.28167 113.7127 126.3972
## rate[5] 235.2729 0.94737389 27.234106 182.87845 216.6853 234.3042
## rate[6] 134.1343 0.61366491 19.386406 96.74174 121.1608 133.3523
## rate[7] 196.3266 0.68690779 21.721931 155.90232 181.4397 195.2358
## rate[8] 222.6154 0.84138275 26.606859 170.70388 205.1610 222.5776
## rate[9] 205.6729 0.54575156 14.892129 177.45844 195.4199 205.2271
## rate[10] 216.9816 0.78085822 24.692905 168.44584 199.6083 216.7503
## rate[11] 191.3364 0.90631511 26.837176 139.37468 173.1210 189.5405
## rate[12] 200.0942 0.72872718 23.044377 155.08998 184.0476 199.7375
## rate[13] 290.7651 0.60616731 19.168693 253.75901 278.9077 291.0934
## rate[14] 247.6109 0.79026942 24.990513 201.36138 230.3554 247.5467
## rate[15] 195.5846 0.55237545 16.829843 164.12420 184.3754 194.9208
## lp__ 12664.4730 0.08763882 2.771383 12658.10767 12662.9508 12664.8191
## 75% 97.5% n_eff Rhat
## rate[1] 204.2374 211.8207 1000.0000 1.0010153
## rate[2] 247.6457 267.8199 756.3557 0.9991980
## rate[3] 283.2044 317.1828 1000.0000 0.9990019
## rate[4] 139.0973 168.1891 840.2952 0.9990357
## rate[5] 253.0380 288.2436 826.3869 0.9990166
## rate[6] 146.9854 172.1853 998.0035 1.0002369
## rate[7] 210.6141 239.2975 1000.0000 0.9993131
## rate[8] 240.2450 276.1301 1000.0000 0.9990454
## rate[9] 215.1348 236.2019 744.6010 0.9994411
## rate[10] 232.4402 265.3505 1000.0000 1.0020450
## rate[11] 210.1585 245.9571 876.8296 1.0005146
## rate[12] 215.7745 247.0494 1000.0000 0.9992488
## rate[13] 303.4819 327.6961 1000.0000 1.0053586
## rate[14] 263.8301 297.1549 1000.0000 0.9991519
## rate[15] 206.2135 230.5540 928.3065 0.9991230
## lp__ 12666.4750 12668.8967 1000.0000 1.0010794
##
## $c_summary
## , , chains = chain:1
##
## stats
## parameter mean sd 2.5% 25% 50%
## rate[1] 200.5782 5.441471 190.55606 196.9610 200.1393
## rate[2] 239.0139 13.523380 213.32265 230.1473 238.6921
## rate[3] 269.2933 22.877730 227.63033 252.7730 269.3564
## rate[4] 127.0279 19.121269 92.28167 113.7127 126.3972
## rate[5] 235.2729 27.234106 182.87845 216.6853 234.3042
## rate[6] 134.1343 19.386406 96.74174 121.1608 133.3523
## rate[7] 196.3266 21.721931 155.90232 181.4397 195.2358
## rate[8] 222.6154 26.606859 170.70388 205.1610 222.5776
## rate[9] 205.6729 14.892129 177.45844 195.4199 205.2271
## rate[10] 216.9816 24.692905 168.44584 199.6083 216.7503
## rate[11] 191.3364 26.837176 139.37468 173.1210 189.5405
## rate[12] 200.0942 23.044377 155.08998 184.0476 199.7375
## rate[13] 290.7651 19.168693 253.75901 278.9077 291.0934
## rate[14] 247.6109 24.990513 201.36138 230.3554 247.5467
## rate[15] 195.5846 16.829843 164.12420 184.3754 194.9208
## lp__ 12664.4730 2.771383 12658.10767 12662.9508 12664.8191
## stats
## parameter 75% 97.5%
## rate[1] 204.2374 211.8207
## rate[2] 247.6457 267.8199
## rate[3] 283.2044 317.1828
## rate[4] 139.0973 168.1891
## rate[5] 253.0380 288.2436
## rate[6] 146.9854 172.1853
## rate[7] 210.6141 239.2975
## rate[8] 240.2450 276.1301
## rate[9] 215.1348 236.2019
## rate[10] 232.4402 265.3505
## rate[11] 210.1585 245.9571
## rate[12] 215.7745 247.0494
## rate[13] 303.4819 327.6961
## rate[14] 263.8301 297.1549
## rate[15] 206.2135 230.5540
## lp__ 12666.4750 12668.8967
Storing the rates and the CI for the chart:
res2 <- as.data.frame(summary(fit)$summary[1:nrow(casualties_wm), c("mean", "2.5%", "97.5%")])
res2$local_authority <- casualties_wm$Local_Authority
res2$local_authority <- reorder(res2$local_authority, res2$mean, mean)
colnames(res2) <- c("mean", "lower", "upper", "local_authority")
ggplot(res2, aes(x=local_authority, y = mean)) +
geom_point(alpha=0.5) +
geom_errorbar(width=.1, aes(ymin=lower, ymax=upper), colour="blue") +
labs(x=NULL, y= "Annual casualties per million", title="The most dangerous areas for cyclists in the West Midlands",
caption="Source: Department of Transport") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The blue lines show the uncertainty: 19 times out of 20, the number of casualties per million will fall between the lower and upper limits. Some intervals are wider than others because of the population size: the smaller the population, the greater the variation in the estimated mean.
The dots are the values estimated by the model (the means), which are similar but not equal to the raw rates because of the regularization (a new word and technique for my list).
“Regularization is one of the main processes in modern statistics, and it is related to an ‘incredulity’ towards extreme values,” Carlos told me. So this process ‘punishes’ those extreme values, pulling them towards the mean.
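A quick way to see that regularization in the numbers is to line up the raw 2016 rates with the posterior means (a sketch; the rows of res2 follow the order of casualties_wm because that is how the data were passed to Stan):
# How far each posterior mean was pulled away from the raw 2016 rate
shrinkage_stan <- data.frame(local_authority = res2$local_authority,
                             raw_2016 = casualties_wm$casualties_16_million,
                             posterior_mean = round(res2$mean, 1))
shrinkage_stan$pulled_by <- round(shrinkage_stan$posterior_mean - shrinkage_stan$raw_2016, 1)
shrinkage_stan[order(-abs(shrinkage_stan$pulled_by)), ]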
In several papers, David Spiegelhalter has suggested using funnel plots, which plot an indicator against a measure of its precision with control limits, to avoid the “spurious ranking” of caterpillar plots.
asfunnelplot <- data.frame(LA=casualties_wm$Local_Authority,
casualties=(casualties_wm$casualties_16_million/1000000),
population=casualties_wm$Pop_16
)
asfunnelplot$SE <- sqrt((asfunnelplot$casualties*(1-asfunnelplot$casualties))/asfunnelplot$population) # approximate standard error of each local authority's rate
casualties.fem <- weighted.mean(asfunnelplot$casualties, 1/asfunnelplot$SE^2)
## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
pop.seq <- seq(10000, max(asfunnelplot$population), 10000)
ll95 <-casualties.fem - 1.96 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
ul95 <- casualties.fem + 1.96 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
ll999 <- casualties.fem - 3.29 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
ul999 <- casualties.fem + 3.29 * sqrt((casualties.fem*(1-casualties.fem)) / (pop.seq))
CI <- data.frame(ll95, ul95, ll999, ul999, pop.seq, casualties.fem)
## draw plot
funnel <- ggplot(aes(x = population, y = casualties), data = asfunnelplot) +
geom_point(shape = 1, size = 2) +
geom_line(aes(x = pop.seq, y = ll95, color = "steelblue3"), data = CI, color = "steelblue3") +
geom_line(aes(x = pop.seq, y = ul95, color = "steelblue3"), data = CI, color = "steelblue3")+
geom_line(aes(x = pop.seq, y = ll999, color = "steelblue1"), linetype = "dashed", data = CI, color = "steelblue1") +
geom_line(aes(x = pop.seq, y = ul999, color = "steelblue1"), linetype = "dashed", data = CI, color = "steelblue1") +
  geom_hline(aes(yintercept = casualties.fem, color = "olivedrab"), data = CI, color = "olivedrab") +
  xlim(0, 1150000) +
  scale_y_continuous(labels = function(y) y*100000) +
  labs(title = "", x = "Local authority population size", y = "") +
  theme_bw()
funnel
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 520 rows containing missing values (geom_path).
## Warning: Removed 520 rows containing missing values (geom_path).
## Warning: Removed 520 rows containing missing values (geom_path).
## Warning: Removed 520 rows containing missing values (geom_path).
But funnel plots make multiple comparisons difficult, and they do not avoid the problem of extreme values.
“There is no formal allowance for ‘regression to the mean’, in which extreme outcomes are expected to tend towards the population mean simply because a contributing factor to their ‘extremeness’ is likely to be a run of good or bad luck. This could be formally taken into account by fitting a random-effects model,” Spiegelhalter says in “Funnel plots for comparing institutional performance.”
Hence, for my story I would use the random-effects models above, which estimate the casualty rate using the probability distribution this type of event follows, which have mechanisms to avoid extreme values, and which calculate and show the variation around each estimated mean (the uncertainty).
“Scientists deal with uncertainty by invoking probability,” wrote Victor Cohn in the magazine Significance, adding: “to be statistically significant, and not just the result of pure chance, the same result must appear again and again. When it does, that’s reliability.”
More specifically, of the two models developed, I would lean towards the one built with Stan, because it takes the historical average into account and relates it to the new estimated rate for each local authority.
Cohn, V. (1999) How to help reporters tell the truth, Of Significance, Oct 22nd, pp. 9-13. Available at: http://www.statlit.org/pdf/1999-Cohn-Significance.pdf [Accessed August 14]
Department of Statistics Online Programs, Pennsylvania State University (n.d.) Introduction to Generalized Linear Models, STAT 504. Available at: https://onlinecourses.science.psu.edu/stat504/node/216/ [Accessed August 14]
Gelman, A., Hill, J. and Yajima, M. (2012) Why we (usually) don’t have to worry about multiple comparisons, Journal of Research on Educational Effectiveness, Vol. 5, pp. 189-211, DOI: 10.1080/19345747.2011.618213 [Accessed August 15]
Hertzog, L. (2017) Interpreting random effects in linear mixed-effect models, Biologyforfun [blog], April 3rd. Available at: https://biologyforfun.wordpress.com/2017/04/03/interpreting-random-effects-in-linear-mixed-effect-models/ [Accessed August 14]
Iosif Moraru, R. (2014) How to assess risk: as a measure of probability or possibility?, ResearchGate. Available at: https://www.researchgate.net/post/How_to_assess_risk_as_a_measure_of_probability_or_possibility [Accessed August 14]
Thomas, M. (2013) What is the role of an offset term in modelling a GLM?, ResearchGate. Available at: https://www.researchgate.net/post/What_is_the_role_of_an_offset_term_in_modelling_a_GLM [Accessed August 14]
NSS (2016) Bayesian Statistics explained to Beginners in Simple English, Analytics Vidhya, Jun 20. Available at: https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/ [Accessed August 14]
Spiegelhalter, D. (2005) Handling over-dispersion of performance indicators, BMJ Quality & Safety, Vol. 14, pp. 347-351. Available at: http://dx.doi.org/10.1136/qshc.2005.013755 [Accessed August 14]
Spiegelhalter, D. (2005) Funnel plots for comparing institutional performance, Statistics in Medicine, Vol. 24, pp. 1185-1202.
StatsToDo (n.d.) Poisson Distribution: Explained. Available at: https://www.statstodo.com/Poisson_Exp.php [Accessed August 14]