I was interested in the extent to which the data could be used to support the null hypothesis that qualification has no impact on league improvement. Obviously with p-values we cannot accept a null hypothesis, but there are ways to assess the evidence in favour of the null using Bayesian analysis. I approach this in three ways: Bayes factors, Bayesian parameter estimation, and Bayesian modelling.
Bayes factors assess the degree of support for the null model compared to an alternative model, given the data. This can be computed directly from the t-value. The Bayes factor for your first t-test (t = 0.499) was \(BF_{01}\) = 2.578 and for the second t-test (t = 0.762) was \(BF_{01}\) = 2.908.
The Bayes factors suggest that the data are about three times more likely under the null model of no difference than under the alternative model. Bayesians would consider this “anecdotal” support for the null hypothesis.
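For reference, a default Bayes factor of this kind can be computed directly from the t statistic and the group sizes; the sketch below follows the standard JZS (Cauchy-prior) formulation of Rouder et al. (2009). The group sizes of 14 per group are my assumption about the samples behind these tests, so the output may not reproduce the values above exactly.

```python
import numpy as np
from scipy import integrate

def jzs_bf01(t, n1, n2=None, r=np.sqrt(2) / 2):
    """BF01 for a default (JZS / Cauchy-prior) t-test."""
    if n2 is None:                      # one-sample or paired design
        n_eff, df = n1, n1 - 1
    else:                               # independent-samples design
        n_eff, df = n1 * n2 / (n1 + n2), n1 + n2 - 2

    # Marginal likelihood of t under H0 (effect size delta = 0)
    like_h0 = (1 + t**2 / df) ** (-(df + 1) / 2)

    # Under H1, delta ~ Cauchy(0, r), written as a normal scale mixture with
    # g ~ Inverse-Gamma(1/2, r^2/2); integrate the t likelihood over g.
    def integrand(g):
        prior_g = (r / np.sqrt(2 * np.pi)) * g ** -1.5 * np.exp(-r**2 / (2 * g))
        like_g = ((1 + n_eff * g) ** -0.5
                  * (1 + t**2 / ((1 + n_eff * g) * df)) ** (-(df + 1) / 2))
        return like_g * prior_g

    like_h1, _ = integrate.quad(integrand, 0, np.inf)
    return like_h0 / like_h1

# Assumed group sizes of 14 per group (an assumption, not stated alongside the tests)
print(jzs_bf01(0.499, 14, 14))
print(jzs_bf01(0.762, 14, 14))
```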
An alternative analysis is Bayesian parameter estimation, where we computationally sample from the posterior distribution of the parameter of interest. Here our parameter of interest is the mean difference in league improvement between qualifiers and those who missed. This analysis therefore presents a histogram of our belief in how plausible each value of this parameter is across a range of values.
The analysis here uses default (i.e., non-subjective) priors, as outlined by Kruschke (2013).
This histogram represents the Bayesian posterior distribution of belief in plausible values for the true difference of improvement in league positions between qualifiers and “just-misses”, using your data as a sample. As can be seen, the mean difference is small, with a 95% Credible Interval (illustrated by the horizontal black line) ranging from negative to positive values. Unlike a standard NHST confidence interval, this Credible Interval can be interpreted as the range of plausible values of the true parameter. As this includes zero (i.e., no difference in qualifier performance), we have some evidence that no-difference is a plausible outcome.
For the second comparison, the pattern is the same: the mean difference is again small, and the 95% Credible Interval (again the horizontal black line) spans negative to positive values and includes zero, so no-difference remains a plausible outcome here too.
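For readers who want to run this kind of two-group estimation themselves, below is a minimal sketch of a Kruschke-style model in PyMC. The data arrays are random placeholders standing in for the league-position changes of the two groups, and the priors only approximate Kruschke’s defaults in spirit, so treat it as a template rather than the exact model used here.

```python
import numpy as np
import pymc as pm
import arviz as az

# Placeholder data: change in league position for each group (purely made up
# so the sketch runs; swap in the real improvement scores).
rng = np.random.default_rng(1)
qualified = rng.normal(-1.0, 3.0, 14)
missed = rng.normal(-0.5, 3.0, 14)

pooled = np.concatenate([qualified, missed])

with pm.Model() as best_model:
    # Broad priors on each group's mean and spread, loosely following
    # Kruschke's (2013) defaults
    group_mean = pm.Normal("group_mean", mu=pooled.mean(),
                           sigma=pooled.std() * 10, shape=2)
    group_sd = pm.Uniform("group_sd", lower=pooled.std() / 100,
                          upper=pooled.std() * 100, shape=2)
    nu = pm.Exponential("nu_minus_one", 1 / 29.0) + 1  # heavy-tailed likelihood

    pm.StudentT("qualified_obs", nu=nu, mu=group_mean[0], sigma=group_sd[0],
                observed=qualified)
    pm.StudentT("missed_obs", nu=nu, mu=group_mean[1], sigma=group_sd[1],
                observed=missed)

    # The quantity of interest: the difference in group means
    pm.Deterministic("mean_diff", group_mean[0] - group_mean[1])
    trace = pm.sample(2000, tune=2000, target_accept=0.9)

# Posterior mean and 95% credible (HDI) interval for the difference
print(az.summary(trace, var_names=["mean_diff"], hdi_prob=0.95))
```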
One concern I had when thinking about how to model these data is that different observers hold different prior beliefs about the effect of qualification on subsequent performance. This is problematic because, if the data are equivocal, both parties could (rightly) come to quite different conclusions.
One advantage of Bayesian analysis is that we can model our prior beliefs explicitly, and then show how those beliefs should change in light of incoming data. Two observers with different priors can see the same data and update their beliefs independently. If the data are convincing, their posterior beliefs should end up similar despite their different priors.
So, in this section, what I wanted to do was formally model the prior beliefs of the pundits (“Qualification leads to poorer performance”) against those of a skeptic (“Qualification makes no difference”). Additionally, in this section I was able to take into account regression to the mean, which is a likely candidate for explaining some of the data. Specifically, the data mostly concern well-performing teams toward the top of the table; these teams have more room to move down the table next year than up.
To model these data, we need several things:

- A model of how the data are generated (the likelihood); here, a binomial model of how many teams improve their league position.
- A formal statement of each observer’s prior beliefs about the rate of improvement for qualified and just-missed teams.
- Bayes’ rule, to turn the prior and the likelihood into posterior beliefs.

From this, we can generate posterior beliefs for both parties and see whether they differ!
The binomial model is used to model binary outcome data and to estimate the “rate” at which something occurs. For example, imagine flipping a coin 100 times. A coin will have a “rate” (between 0 and 1) at which it generates the outcome “heads”. For a fair coin, this rate is 0.5.
The binomial model allows us to estimate what the likely value for the rate is given a certain number of “successes” out of a total number of observations. As the rate for an outcome is constrained to be between 0 and 1, we can use the binomial model to generate a “likelihood function”, which shows us how “likely” each possible value of the rate is given a fixed set of observations.
Here is a plot of the plausible values for rate (denoted \(\theta\)) given 100 flips of a coin, and 50 observed heads:
This plot is straightforward to interpret: values for \(\theta\) around 0.5 are most plausible given the observed data. Note that there is uncertainty here: 0.48 is probably just as likely for the true value of \(\theta\) as 0.52 is, but a value of 0.2 is practically impossible given the observed data.
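A likelihood curve like this takes only a few lines to compute: evaluate the binomial probability of the observed data at a grid of candidate rates. The grid resolution and plotting details below are just one way of doing it.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 501)                 # candidate values for the rate
likelihood = stats.binom.pmf(50, 100, theta)   # probability of 50 heads in 100 flips

plt.plot(theta, likelihood)
plt.xlabel(r"$\theta$ (rate of heads)")
plt.ylabel("Likelihood")
plt.title("Likelihood of the rate given 50 heads in 100 flips")
plt.show()
```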
I modelled your data in a similar way to the coin-flipping example. I used the number of teams that improved their position the next year instead of the “number of heads”, and the total number of teams instead of the “total number of flips”. In your data, 6/14 teams who qualified the year before improved their position, and 6/14 teams who just missed the year before improved their position. The likelihoods for \(\theta\) for each set of data are below:
For both sets of teams, the best estimate for \(\theta\) is 0.28, suggesting that the rate of improvement is poor for both sets of teams. At this point, you could conclude that the rate of improvement is the same for those who qualify compared to those who do not qualify, in contrast to the pundits’ claims. However, this doesn’t take into consideration a couple of important points:

- Different observers hold different prior beliefs about these rates; in particular, the pundits expect qualified teams to improve at a lower rate than teams who just missed.
- Regression to the mean: both groups sit near the top of the table, so both should be expected to improve at a rate below 0.5 regardless of qualification.
Let’s specify what the prior beliefs are for the pundits and the skeptics. Within these prior beliefs we should take into account that both sets of observers should expect teams (whether they qualified or missed) to do worse the next year due to regression to the mean. So, both sets of observers should expect \(\theta\) to be less than 0.5. But the pundits expect this to be lower for the qualified teams, whereas the skeptics expect \(\theta\) to be the same for qualified teams and missed teams.
Here are some proposed prior beliefs for both sets of observers:
Note that the pundits expect qualified teams to do worse, but the skeptics expect identical rates. Both pundits and skeptics expect overall low rates due to regression to the mean.
Now let’s update these prior beliefs based on the data (via the likelihood). Bayesian analysis formalises how prior beliefs change into posterior beliefs based on observed data. The plots below show how the prior beliefs change for both the pundits and skeptics when they both observe the data from the qualified and the missed teams:
From this we can note a few interesting things:

- Both observers’ beliefs move towards the data: each posterior is a compromise between the prior and the likelihood.
- The skeptics end up with essentially identical posterior beliefs for the qualified and the just-missed teams.
- The pundits still believe that qualified teams improve at a lower rate than just-missed teams; the data are not strong enough to overturn their prior.
This last point is best demonstrated by just plotting the posterior distributions for pundits and skeptics. To emphasise: these plots show what each side believes NOW, given their prior beliefs and the observed data.
Thus, more data are needed to convince the pundits that European qualification doesn’t influence league positioning.
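To make the prior-to-posterior updating step concrete, here is a minimal sketch that assumes each observer’s prior can be written as a Beta distribution (the actual priors behind the plots above may have been specified differently, and the Beta parameters below are illustrative guesses). With a Beta prior and binomial data the posterior is again a Beta distribution, so the update itself is a one-line calculation.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 501)

# Illustrative Beta priors (parameters are guesses, not the ones used above):
# skeptics expect the same modest rate for both groups; pundits expect
# qualified teams to improve at a lower rate than just-missed teams.
priors = {
    "Skeptic, qualified": (8, 12),
    "Skeptic, missed":    (8, 12),
    "Pundit, qualified":  (4, 16),
    "Pundit, missed":     (8, 12),
}

# Observed data: 6 of 14 teams in each group improved their position
improved, total = 6, 14

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)
for ax, (label, (a, b)) in zip(axes.flat, priors.items()):
    prior = stats.beta.pdf(theta, a, b)
    # Conjugate update: Beta(a, b) prior + k successes in n trials
    # -> Beta(a + k, b + n - k) posterior
    posterior = stats.beta.pdf(theta, a + improved, b + total - improved)
    ax.plot(theta, prior, "--", label="prior")
    ax.plot(theta, posterior, "-", label="posterior")
    ax.set_title(label)
    ax.legend()
fig.supxlabel(r"$\theta$ (rate of improvement)")
plt.tight_layout()
plt.show()
```

With only 14 observations per group, each posterior stays fairly close to its prior, which is why pundits and skeptics can still disagree after seeing the same data.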
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603.