Bradley and Terry (1952) modeled paired comparison data, such as comparisons of pairs of products by consumers or outcomes of sporting contests between two players. The data is binary, with each paired comparison having a single “winner.” This case study provides simulated data and demonstrates inference for contestant abilities and ranks, as well as prediction of the outcomes of future contests.
We will show how the Bradley-Terry likelihood may be combined with a prior to produce a Bayesian model whose parameters can be estimated from observed data using full Bayesian inference (so-called “Bayesian inversion”). This in turn allows inference for future matches.
After introducing a Bayesian Bradley-Terry model, we replace the simple fixed population prior with a hierarchical model in which we jointly estimate the population variation in ability along with the ability parameters. This allows tighter estimation of ability parameters and reduces uncertainty in future, unobserved outcome predictions.
Finally, we will introduce a Bradley-Terry model for team contests, where each team consists of multiple players. The assumption will be that each player has an ability and the team’s ability is additive on the log odds scale.
Like all good probability stories, this one is generative. Nevertheless, we will begin historically with the likelihood introduced by Bradley and Terry (1952) and discussed earlier by Zermelo (1929). The observed data is the outcome of matches between players and the (latent) parameters are the player abilities; we postpone for now the model of the abilities and concentrate on the model of match outcomes.
We will suppose that there are \(K\) players, each with an ability \(\alpha_k \in \mathbb{R}\). The probability that contestant \(i\) beats contestant \(j\) is given by1 The log odds function \[\mathrm{logit}:(0, 1) \rightarrow (-\infty, \infty)\] is defined by \[\mathrm{logit}(u) = \log \left( \frac{u}{1 - u} \right).\] Its inverse, \[\mathrm{logit}^{-1}:(-\infty, \infty) \rightarrow (0, 1),\] is given by \[\mathrm{logit}^{-1}(v) = \frac{1}{1 + \exp(-v)} = \frac{\exp(v)}{1 + \exp(v)}.\]
\[ \mathrm{Pr}\left[ i \mbox{ beats } j \right] \ = \ \mathrm{logit}^{-1}(\alpha_i - \alpha_j). \]
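For example, a player whose ability is one unit higher than their opponent’s wins with probability \(\mathrm{logit}^{-1}(1) \approx 0.73\). This is easy to verify numerically (the inverse logit function here repeats the definition used in the simulation code below).
inv_logit <- function(u) 1 / (1 + exp(-u))
inv_logit(1.0 - 0.0)  # win probability with a one-unit ability advantage: 0.731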
Because the data is “incomplete” (i.e., not every pair of players need play each other), the easiest way to encode it is in long form. The flexibility of the long form also allows multiple matches between the same two players; when that happens, the results are assumed to be independent. For example,
\[ \begin{array}{c|cc|c} n & \mathrm{player}^0 & \mathrm{player}^1 & y \\ \hline 1 & 1 & 2 & 1 \\ 2 & 1 & 13 & 0 \\ 3 & 1 & 5 & 0 \\ 4 & 2 & 9 & 0 \\ 5 & 3 & 4 & 1 \\ 6 & 2 & 9 & 1 \end{array} \]
The first column, labeled \(n\), is the match index. With \(N\) matches, \(n \in 1, 2, \ldots, N\). The next two columns indicate which players participated in the match. The last column is the result \(y_n \in \{ 0, 1 \}\), indicating which player won the match. For example, the fourth row (\(n = 4\)) records a match between player 2 (\(\mathrm{player}^0 = 2\)) and player 9 (\(\mathrm{player}^1 = 9\)) in which player 2 won (\(y = 0\)).
In order to simulate data, we need to assume a distribution over the ability parameters. For simplicity, and because we are not assuming anything about the structure of the data, we will assume they are distributed standard normal (mean zero, standard deviation one).2 More elaborate models would be appropriate in some cases; for example, if we knew that the population was made up of two groups, amateurs and professionals, a mixture prior would be more appropriate.
# map (-infinity, infinity) -> (0, 1)
inv_logit <- function(u) 1 / (1 + exp(-u))
# map vector to vector with sum of zero
center <- function(u) u - sum(u) / length(u)
# parameters
K <- 50
alpha <- center(rnorm(K))
# observations
N <- K^2 # may not have this many matches in practice
player1 <- rep(NA, N)
player0 <- rep(NA, N)
for (n in 1:N) {
players <- sample(1:K, 2)
player0[n] <- players[1]
player1[n] <- players[2]
}
log_odds_player1 <- alpha[player1] - alpha[player0]
prob_win_player1 <- inv_logit(log_odds_player1)
y <- rbinom(N, 1, prob_win_player1)
Rendering the first few rows of the data frame shows that it is in the same form as the long-form table shown in the previous section.
df <- data.frame(player0 = player0, player1 = player1, y = y)
head(df)
player0 player1 y
1 42 13 0
2 1 8 0
3 16 47 0
4 24 1 1
5 23 7 0
6 7 9 0
Suppose we have data in the appropriate long form. With RStan, we can write the model out directly in terms of its log likelihood. The first part of the program encodes the data, beginning with the constant number of players K and matches N. Then there are three parallel arrays of integers indicating the players in each match and the outcome.
data {
int<lower = 0> K;                     // players
int<lower = 0> N;                     // matches
int<lower = 1, upper = K> player1[N]; // player 1 for match n
int<lower = 1, upper = K> player0[N]; // player 0 for match n
int<lower = 0, upper = 1> y[N];       // winner of match n
}
parameters {
vector[K] alpha; // ability for player k
}
model {
y ~ bernoulli_logit(alpha[player1] - alpha[player0]);
}
The parameters are coded as a \(K\)-vector, and the likelihood is coded as defined above. The vectorized sampling statement is equivalent to the following less efficient loop form.3 The loop form illustrates how the likelihood function is defined: \[p(y \mid \alpha) = \prod_{n=1}^N \mathsf{Bernoulli}\!\left(y_n \ \bigg| \ \mathrm{logit}^{-1}\!\left(\alpha_{\mathrm{player1}[n]} - \alpha_{\mathrm{player0}[n]}\right)\right).\]
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[player1[n]] - alpha[player0[n]]);
The distribution bernoulli_logit is the Bernoulli distribution with its parameter on the logit (log odds) scale, where for \(y \in \{ 0, 1 \}\) and \(\theta \in (0, 1)\),
\[ \mathsf{Bernoulli}(y \mid \theta) \ = \ \begin{cases} \theta & \mbox{if} \ \ y = 1 \\ 1 - \theta & \mbox{if} \ \ y = 0, \end{cases} \] and for \(\alpha \in (-\infty, \infty)\), \[ \mathsf{BernoulliLogit}(y \mid \alpha) \ = \ \mathsf{Bernoulli}(y \mid \mathrm{logit}^{-1}(\alpha)). \]
Building in the link function makes the arithmetic more stable and the code less error prone. Thus rather than writing
y ~ bernoulli(inv_logit(u));
we encourage users to use the logit (log odds) parameterized version.
y ~ bernoulli_logit(u);
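To see why this matters numerically, consider an extreme log odds value. The following R sketch (an illustration only, not Stan’s actual implementation) shows the naive formulation underflowing to \(-\infty\) while an algebraically equivalent form remains finite.
v <- -800                # extreme log odds for an observed win (y = 1)
log(1 / (1 + exp(-v)))   # naive: the inverse logit underflows to 0, so the log is -Inf
v - log1p(exp(v))        # stable rewrite of log inverse-logit for v < 0: returns -800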
All that we need to do to fit the model with Stan is pack the data into a list, compile the model using stan_model, and then find the maximum likelihood estimate \(\theta^*\), that is, the parameter values that maximize the probability of the observed match outcomes.
\[ \begin{array}{rcl} \theta^* & = & \mathrm{argmax}_{\theta} \ p(y \mid \theta) \\ & = & \mathrm{argmax}_{\theta} \ \prod_{n=1}^N \mathsf{Bernoulli}\!\left(y_n \mid \mathrm{logit}^{-1}\!\left(\alpha_{\mathrm{player1}[n]} - \alpha_{\mathrm{player0}[n]}\right)\right) \end{array} \]
mle_model_data <-
list(K = K, N = N, player0 = player0, player1 = player1, y = y)
mle_model <- stan_model("individual-uniform.stan")
mle_model_estimates <- optimizing(mle_model, data = mle_model_data)
We can now see how well we recovered the parameters by plotting the maximum likelihood estimates against the true values, which takes some wrangling to wrench out of the fit object.
Fit of true value (horizontal axis) versus maximum likelihood estimate (vertical axis). The nearly linear relationship shows that maximum likelihood estimation does a good job of recovering the parameters given this number of players and matches and this scale of abilities.
alpha_star <- mle_model_estimates$par[paste("alpha[", 1:K, "]", sep="")]
mle_fit_plot <-
ggplot(data.frame(alpha = alpha, alpha_star = alpha_star),
aes(x = alpha, y = alpha_star)) +
geom_abline(slope = 1, intercept = 0, color="green", size = 2) +
geom_point(size = 2) +
ggtheme_tufte()
mle_fit_plot
Players (or other items) may be ranked using the Bradley-Terry model. In Stan, functions of parameters, such as rankings, may be coded in the generated quantities block, so we only need to add the following lines to the end of our original program to define the ranking.
generated quantities {
int<lower=1, upper=K> ranked[K] = sort_indices_desc(alpha);
}
This returns the sorting of the indexes in descending order.
ranked_players <-
mle_model_estimates$par[paste("ranked[", 1:K, "]",
sep="")]
print(ranked_players, digits=0)
ranked[1] ranked[2] ranked[3] ranked[4] ranked[5] ranked[6]
50 1 48 11 37 14
ranked[7] ranked[8] ranked[9] ranked[10] ranked[11] ranked[12]
49 24 43 6 8 33
ranked[13] ranked[14] ranked[15] ranked[16] ranked[17] ranked[18]
39 38 32 35 16 25
ranked[19] ranked[20] ranked[21] ranked[22] ranked[23] ranked[24]
26 47 28 3 23 15
ranked[25] ranked[26] ranked[27] ranked[28] ranked[29] ranked[30]
2 36 40 30 5 29
ranked[31] ranked[32] ranked[33] ranked[34] ranked[35] ranked[36]
44 27 34 20 9 21
ranked[37] ranked[38] ranked[39] ranked[40] ranked[41] ranked[42]
13 18 12 42 19 4
ranked[43] ranked[44] ranked[45] ranked[46] ranked[47] ranked[48]
10 31 17 46 41 22
ranked[49] ranked[50]
7 45
The value of ranked[1] is the index of the top-ranked player, ranked[2] the index of the second-ranked player, and so on.
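As a sanity check, Stan’s sort_indices_desc() corresponds to R’s order() with decreasing = TRUE, so the same ranking can be recovered directly from the estimates alpha_star extracted earlier.
ranked_in_r <- order(alpha_star, decreasing = TRUE)  # indices sorted by descending ability
all(ranked_players == ranked_in_r)                   # should be TRUE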
To convert our simple likelihood into a proper Bayesian model, we need a prior for the ability parameters. Such a prior will characterize the population of players in terms of the distribution of their abilities.
We follow Leonard (1977) in laying out simple non-conjugate priors for the ability parameters.4 This is in some sense cheating because it guarantees the model is well specified in the sense of exactly matching the data-generating process. In reality, models are almost always approximations and thus not perfectly well specified for the data sets they model.
\[ \alpha_k \sim \mathsf{Normal}(0, 1) \]
The form of the data does not change. The declaration of the parameters is simplified, and the abilities are no longer centered by construction.
parameters {
vector[K] alpha; // ability for player k
}
model {
alpha ~ normal(0, 1);
y ~ bernoulli_logit(alpha[player1] - alpha[player0]);
}
Instead of hard centering, the normal prior with location zero implicitly centers the parameters around zero by assigning higher density near zero. The unit scale of the normal prior indicates how much variation there is in player ability.
individual_model <- stan_model("individual.stan")
individual_posterior <- sampling(individual_model, data = mle_model_data)
print(individual_posterior, "alpha", probs=c(0.05, 0.5, 0.95))
Inference for Stan model: individual.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 5% 50% 95% n_eff Rhat
alpha[1] 2.45 0.01 0.35 1.88 2.44 3.04 2299 1
alpha[2] -0.09 0.01 0.26 -0.52 -0.09 0.33 1142 1
alpha[3] -0.06 0.01 0.25 -0.48 -0.05 0.34 1039 1
alpha[4] -1.00 0.01 0.26 -1.43 -1.00 -0.58 1171 1
alpha[5] -0.27 0.01 0.26 -0.70 -0.27 0.17 1326 1
alpha[6] 0.88 0.01 0.28 0.42 0.88 1.32 1233 1
alpha[7] -1.95 0.01 0.31 -2.45 -1.94 -1.45 1339 1
alpha[8] 0.85 0.01 0.28 0.40 0.85 1.32 1580 1
alpha[9] -0.57 0.01 0.25 -0.99 -0.56 -0.16 1055 1
alpha[10] -1.02 0.01 0.27 -1.48 -1.02 -0.59 1217 1
alpha[11] 1.42 0.01 0.31 0.93 1.41 1.92 1731 1
alpha[12] -0.82 0.01 0.25 -1.22 -0.82 -0.40 1076 1
alpha[13] -0.70 0.01 0.26 -1.13 -0.70 -0.28 1134 1
alpha[14] 1.26 0.01 0.29 0.78 1.27 1.73 1626 1
alpha[15] -0.07 0.01 0.27 -0.52 -0.08 0.37 1355 1
alpha[16] 0.52 0.01 0.27 0.06 0.52 0.95 1117 1
alpha[17] -1.11 0.01 0.27 -1.55 -1.11 -0.67 1256 1
alpha[18] -0.80 0.01 0.27 -1.24 -0.79 -0.36 1330 1
alpha[19] -0.99 0.01 0.25 -1.40 -0.98 -0.57 1074 1
alpha[20] -0.50 0.01 0.26 -0.92 -0.50 -0.07 1273 1
alpha[21] -0.70 0.01 0.25 -1.11 -0.70 -0.29 1032 1
alpha[22] -1.52 0.01 0.27 -1.97 -1.52 -1.07 1442 1
alpha[23] -0.06 0.01 0.25 -0.47 -0.06 0.34 1131 1
alpha[24] 1.03 0.01 0.27 0.59 1.02 1.47 1261 1
alpha[25] 0.37 0.01 0.26 -0.04 0.37 0.80 1144 1
alpha[26] 0.34 0.01 0.25 -0.06 0.34 0.77 1162 1
alpha[27] -0.34 0.01 0.26 -0.76 -0.34 0.06 1197 1
alpha[28] 0.09 0.01 0.25 -0.32 0.09 0.50 1060 1
alpha[29] -0.28 0.01 0.24 -0.67 -0.28 0.12 1091 1
alpha[30] -0.22 0.01 0.25 -0.63 -0.22 0.18 1119 1
alpha[31] -1.08 0.01 0.27 -1.51 -1.08 -0.64 1222 1
alpha[32] 0.59 0.01 0.26 0.16 0.60 1.03 1129 1
alpha[33] 0.79 0.01 0.25 0.38 0.78 1.19 1139 1
alpha[34] -0.41 0.01 0.25 -0.81 -0.42 -0.01 1186 1
alpha[35] 0.52 0.01 0.25 0.10 0.52 0.91 1018 1
alpha[36] -0.10 0.01 0.25 -0.50 -0.10 0.31 947 1
alpha[37] 1.29 0.01 0.29 0.81 1.28 1.78 1467 1
alpha[38] 0.67 0.01 0.28 0.21 0.66 1.13 1296 1
alpha[39] 0.68 0.01 0.26 0.26 0.68 1.10 1206 1
alpha[40] -0.15 0.01 0.25 -0.57 -0.15 0.27 1003 1
alpha[41] -1.39 0.01 0.26 -1.84 -1.38 -0.96 1311 1
alpha[42] -0.91 0.01 0.26 -1.34 -0.91 -0.48 1324 1
alpha[43] 0.95 0.01 0.28 0.49 0.95 1.41 1429 1
alpha[44] -0.31 0.01 0.25 -0.70 -0.31 0.10 1027 1
alpha[45] -2.07 0.01 0.31 -2.58 -2.06 -1.56 1394 1
alpha[46] -1.21 0.01 0.27 -1.65 -1.21 -0.78 1332 1
alpha[47] 0.11 0.01 0.23 -0.27 0.11 0.49 1057 1
alpha[48] 1.64 0.01 0.30 1.15 1.64 2.14 1565 1
alpha[49] 1.14 0.01 0.28 0.69 1.14 1.60 1357 1
alpha[50] 2.50 0.01 0.36 1.94 2.48 3.10 1974 1
Samples were drawn using NUTS(diag_e) at Sun Dec 15 16:35:32 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
For the posterior fit object, we are taking draws \(\alpha^{(m)}\) from the posterior5 Not independently, but by using Markov chain Monte Carlo.
\[ p(\alpha | y) \propto p(y | \alpha) \, p(\alpha). \]
To calculate Bayesian estimates, we take posterior means, which are guaranteed to minimize expected square error when the model is well specified.
\[ \begin{array}{rcl} \hat{\alpha}_k & = & \displaystyle \mathbb{E}\left[\alpha_k \mid y \, \right] \\[4pt] & = & \displaystyle \int_{-\infty}^{\infty} \alpha_k \ p(\alpha_k \mid y) \ \mathrm{d}\alpha_k \\[4pt] & \approx & \displaystyle \frac{1}{M} \sum_{m=1}^M \alpha^{(m)}_k \end{array} \]
This is an example of full Bayesian inference, which is nearly always based on calculating conditional expectations of quantities of interest over the posterior. The second line defining the expectation shows the general form—a weighted average of the quantity of interest, \(\alpha_k\), over the posterior distribution \(p(\alpha_k | y)\). This weighted average is calculated by Markov chain Monte Carlo (MCMC) using an average of the posterior draws.
The extract() function in RStan returns a structure from which the draws \(\alpha^{(m)}\) may be extracted. These are easily converted into posterior means.6 They can be extracted even more easily using utility functions in RStan; this long form is just for pedagogical purposes.
alpha_draws <- extract(individual_posterior)$alpha  # one row per draw, one column per player
alpha_hat <- rep(NA, K)
for (k in 1:K)
  alpha_hat[k] <- mean(alpha_draws[ , k])
These can then be scatterplotted against the true values just as the maximum likelihood was previously.
Fit of true value (horizontal axis) versus Bayesian (posterior mean) estimate (vertical axis). The nearly linear relationship shows that the Bayesian estimates also do a good job of fitting this data.
bayes_fit_plot <-
ggplot(data.frame(alpha = alpha, alpha_hat = alpha_hat),
aes(x = alpha, y = alpha_hat)) +
geom_abline(slope = 1, intercept = 0, color="green", size = 2) +
geom_point(size = 2) +
ggtheme_tufte()
bayes_fit_plot
With this much data, the Bayesian estimates are very similar to the maximum likelihood estimates.
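The posterior draws also let us predict future matches, as promised in the introduction. The event probability that one player beats another in a new match is estimated by averaging the win probability over the posterior draws. Here is a sketch for a hypothetical future match between players 1 and 2 (any pair would do).
draws <- extract(individual_posterior)$alpha  # one row per draw, one column per player
mean(inv_logit(draws[, 1] - draws[, 2]))      # estimate of Pr[player 1 beats player 2 | y]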
Because posterior quantities are calculated with uncertainty, it does not make sense to return a single answer for the ranking. Instead, we have a probabilistic notion of ranking, which can be coded in Stan as follows.
generated quantities {
int<lower=1, upper=K> ranking[K]; // rank of player ability
{
int ranked_index[K] = sort_indices_desc(alpha);
for (k in 1:K)
ranking[ranked_index[k]] = k;
}
}
With this definition, ranking[k] holds the rank of player k (rather than the index of the player at rank k, as before). This allows full Bayesian inference for posterior uncertainty in the ranks.
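Even without rerunning the model, the same posterior rank uncertainty may be computed in R from the ability draws. The following sketch tabulates the posterior distribution over ranks for player 1.
draws <- extract(individual_posterior)$alpha
rank_draws <- apply(draws, 1, function(a) rank(-a))  # K x M matrix; rank 1 is the highest ability
table(rank_draws[1, ]) / ncol(rank_draws)            # posterior probability of each rank for player 1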
With player abilities \(\alpha_k\) centered around zero and the total predictor being additive in player abilities, this model has no intercept term. This is because there is symmetry between the player identified as player 0 and the player identified as player 1. If these identifiers are assigned randomly, the expected difference \(\alpha_i - \alpha_j\) is zero.
In situations where there is a distinction between the players assigned as player 0 and as player 1, it will be necessary to introduce an intercept term to model the advantage of being player 1 (which may be negative and thus actually a disadvantage). This intercept could model the advantage of playing the white pieces in chess and moving first, or the home-court advantage in basketball games. For this case study, we stick to random assignment and assume there is no advantage to player 0 or player 1 in any match.7 Technically, random assignment ensures the prior predictive expectation of each \(y_n\) is one half, i.e., zero on the log odds scale. Such an assumption is easily tested.
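For concreteness, here is a sketch of how such an intercept would enter the simulation; the parameter gamma and its value 0.3 are hypothetical (see the exercises at the end for the full modeling problem).
gamma <- 0.3  # hypothetical advantage for player 1 on the log odds scale
prob_home <- inv_logit(gamma + alpha[player1] - alpha[player0])
y_home <- rbinom(N, 1, prob_home)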
Rather than hard-coding the prior for abilities, we can use a hierarchical prior to estimate the amount of variation in the population at the same time as we estimate the parameters themselves.
We introduce a new parameter \(\sigma > 0\) for the scale of variation in the population. We then use the scale \(\sigma\) in the prior for the population parameters,
\[
\alpha_k \sim \mathsf{Normal}(0, \sigma).
\] We then need a hyperprior on \(\sigma\) to complete the model; we will somewhat arbitrarily choose a lognormal prior with scale 0.5.
\[
\sigma \sim \mathsf{LogNormal}(0, \, 0.5).
\] When there are large numbers of players, player ability estimates and the estimate of the scale of variation should not be very sensitive to the choice of prior scale (see the exercises).
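To get a sense of what this hyperprior implies, we can compute its quantiles in R; roughly 90% of the prior mass for \(\sigma\) falls between 0.44 and 2.27, with a median of 1.
qlnorm(c(0.05, 0.5, 0.95), meanlog = 0, sdlog = 0.5)  # approximately 0.44 1.00 2.27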
If there is not much variation among the abilities of the players in the population, \(\sigma\) will be estimated near zero; if the abilities vary widely, \(\sigma\) will be estimated to be large. In the limit as \(\sigma \rightarrow 0\), the model reverts to a complete pooling model in which every player has the same ability (zero). In the other limit, as \(\sigma \rightarrow \infty\), the model reverts to a no pooling model, in which the population of player abilities provides no information to help estimate an individual player’s ability.
We add a parameter sigma for \(\sigma\). It is important to include the lower bound of zero, because scales must be positive.8 If parameters are not appropriately constrained to their support, sampling may fail to initialize or may devolve into inefficient rejection behavior. Constraints on data and generated quantities are just for error checking.
parameters {
real<lower = 0> sigma; // scale of ability variation
vector[K] alpha; // ability for player k
}
model {
sigma ~ lognormal(0, 0.5); // boundary avoiding
alpha ~ normal(0, sigma); // hierarchical
y ~ bernoulli_logit(alpha[player1] - alpha[player0]);
}
The hyperprior on sigma is new, and the prior on alpha now uses sigma as its scale parameter.
The model is fit just as before.
model_hier <- stan_model("individual-hierarchical.stan")
posterior_hier <- sampling(model_hier, data = mle_model_data)
The fit can be examined as before for convergence violations.
print(posterior_hier, c("sigma", "alpha"), probs=c(0.05, 0.5, 0.95))
Inference for Stan model: individual-hierarchical.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 5% 50% 95% n_eff Rhat
sigma 1.07 0.00 0.12 0.89 1.06 1.27 3406 1
alpha[1] 2.48 0.01 0.36 1.92 2.47 3.12 1207 1
alpha[2] -0.09 0.01 0.26 -0.53 -0.08 0.33 570 1
alpha[3] -0.06 0.01 0.25 -0.47 -0.06 0.35 495 1
alpha[4] -1.00 0.01 0.27 -1.45 -1.00 -0.57 644 1
alpha[5] -0.27 0.01 0.27 -0.72 -0.27 0.17 636 1
alpha[6] 0.88 0.01 0.28 0.42 0.89 1.35 694 1
alpha[7] -1.96 0.01 0.31 -2.50 -1.95 -1.46 965 1
alpha[8] 0.87 0.01 0.29 0.38 0.87 1.36 703 1
alpha[9] -0.57 0.01 0.26 -1.00 -0.58 -0.15 584 1
alpha[10] -1.02 0.01 0.29 -1.51 -1.01 -0.57 870 1
alpha[11] 1.43 0.01 0.32 0.90 1.43 1.96 1065 1
alpha[12] -0.83 0.01 0.26 -1.25 -0.83 -0.39 597 1
alpha[13] -0.71 0.01 0.26 -1.12 -0.71 -0.27 609 1
alpha[14] 1.27 0.01 0.29 0.79 1.26 1.75 746 1
alpha[15] -0.07 0.01 0.27 -0.51 -0.07 0.37 683 1
alpha[16] 0.52 0.01 0.27 0.08 0.52 0.97 561 1
alpha[17] -1.12 0.01 0.29 -1.60 -1.12 -0.66 675 1
alpha[18] -0.80 0.01 0.28 -1.26 -0.80 -0.36 591 1
alpha[19] -1.00 0.01 0.26 -1.44 -0.99 -0.59 572 1
alpha[20] -0.50 0.01 0.27 -0.95 -0.50 -0.05 553 1
alpha[21] -0.70 0.01 0.26 -1.12 -0.69 -0.28 446 1
alpha[22] -1.53 0.01 0.28 -1.98 -1.52 -1.08 778 1
alpha[23] -0.06 0.01 0.26 -0.49 -0.06 0.35 568 1
alpha[24] 1.03 0.01 0.28 0.57 1.03 1.48 707 1
alpha[25] 0.38 0.01 0.26 -0.05 0.37 0.81 587 1
alpha[26] 0.34 0.01 0.25 -0.08 0.34 0.76 506 1
alpha[27] -0.35 0.01 0.27 -0.79 -0.35 0.09 550 1
alpha[28] 0.09 0.01 0.26 -0.34 0.08 0.51 514 1
alpha[29] -0.29 0.01 0.25 -0.70 -0.29 0.12 498 1
alpha[30] -0.23 0.01 0.26 -0.65 -0.22 0.20 543 1
alpha[31] -1.09 0.01 0.27 -1.54 -1.09 -0.64 605 1
alpha[32] 0.59 0.01 0.27 0.15 0.60 1.04 629 1
alpha[33] 0.80 0.01 0.26 0.38 0.79 1.23 544 1
alpha[34] -0.41 0.01 0.26 -0.84 -0.41 -0.01 509 1
alpha[35] 0.53 0.01 0.26 0.11 0.52 0.96 529 1
alpha[36] -0.10 0.01 0.26 -0.52 -0.10 0.33 646 1
alpha[37] 1.30 0.01 0.30 0.80 1.30 1.80 754 1
alpha[38] 0.67 0.01 0.28 0.22 0.67 1.13 743 1
alpha[39] 0.68 0.01 0.27 0.24 0.67 1.12 727 1
alpha[40] -0.15 0.01 0.26 -0.57 -0.15 0.28 561 1
alpha[41] -1.40 0.01 0.28 -1.88 -1.40 -0.95 664 1
alpha[42] -0.92 0.01 0.27 -1.37 -0.91 -0.47 708 1
alpha[43] 0.95 0.01 0.29 0.48 0.95 1.45 772 1
alpha[44] -0.30 0.01 0.25 -0.71 -0.31 0.11 617 1
alpha[45] -2.08 0.01 0.33 -2.65 -2.07 -1.55 964 1
alpha[46] -1.23 0.01 0.28 -1.68 -1.23 -0.77 721 1
alpha[47] 0.11 0.01 0.24 -0.30 0.12 0.49 441 1
alpha[48] 1.65 0.01 0.32 1.14 1.64 2.19 935 1
alpha[49] 1.15 0.01 0.29 0.68 1.15 1.62 646 1
alpha[50] 2.53 0.01 0.38 1.91 2.52 3.16 1444 1
Samples were drawn using NUTS(diag_e) at Sun Dec 15 16:35:42 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The \(\hat{R}\) values are all near 1 and the effective sample sizes are reasonable, so we have no reason to distrust the fit.
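Because the estimated scale is close to the unit scale that was fixed in the earlier model, the hierarchical posterior means should be nearly identical to the fixed-prior ones here; with fewer matches or a badly chosen fixed scale, the hierarchical prior would adapt instead. A quick comparison (assuming alpha_hat still holds the posterior means computed above for the fixed-prior model):
alpha_hat_hier <- colMeans(extract(posterior_hier)$alpha)
max(abs(alpha_hat_hier - alpha_hat))  # expect a small maximum discrepancy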
Many extensions to the Bradley-Terry model are possible, as it is really nothing more than a multi-intercept logistic regression. We will consider a team-based model in which two teams, each made up of several players, compete in a match; equivalently, two baskets of consumer goods may be compared to one another. The point of such a model is to infer the abilities of the players (the goods in the baskets) from the performance of the teams (the baskets) in which they participate.
As before, each player is still modeled as having an underlying ability on the log odds scale. The ability of a team is the sum of the abilities of its players. This makes sense when the contributions of players are independent, but it is only an approximation when there are interaction effects (such as synergies or interference among participants on a team).
This model will be most appropriate in situations where team membership varies. Many games among teams with the same sets of players will not be identified. That is, there will be no way to infer the relative contributions of the players on a team if the players only ever play with each other. This model could be applied to sports in which the players participating vary during a game, as long as the segments are comparable in terms of variance.9 For example, continuous games like football (soccer, not American), hockey, or basketball might be divided into one-minute segments, with a winner assigned based on the scoring during that interval; for most of these sports, a mechanism for accounting for ties would be necessary.
The data for the team Bradley-Terry model needs to indicate which players are on the teams playing each other. This will perhaps be easiest to see through a concrete simulation of a complete data set.
K <- 50 # players
J <- 3 # players / team
N <- K^2 * J # matches
sigma <- 1 # scale of player ability variation
alpha <- center(rnorm(K, 0, sigma)) # player abilities, centered log odds
team0 <- matrix(NA, N, J) # players on team 0
team1 <- matrix(NA, N, J) # players on team 1
for (n in 1:N) {
players <- sample(1:K, 2 * J) # 1:K w/o replacement for distinct teams
team0[n, 1:J] <- players[1:J]
team1[n, 1:J] <- players[(J + 1) : (2 * J)]
}
y <- rep(NA, N)
for (n in 1:N) {
alpha_team0 <- sum(alpha[team0[n, 1:J]])
alpha_team1 <- sum(alpha[team1[n, 1:J]])
y[n] <- rbinom(1, 1, inv_logit(alpha_team1 - alpha_team0))
}
team_data <- list(K = K, J = J, N = N, team0 = team0, team1 = team1, y = y)
There are \(K\) players in total, with \(J\) players on each team, and a total of \(N\) matches. Here we use quite a few matches to make it easy to visualize the model fit (as it will be close to the true values). We set the population variation of abilities at \(\sigma = 1\), and then generate the vector of player abilities \(\alpha \sim \mathsf{Normal}(0, \sigma)\). We then generate the teams by sampling \(J\) players for each team, without replacement, from among the \(K\) players. Then \(y\) is sampled using a binomial random number generator given the sum of team one’s abilities minus the sum of team zero’s.
Coding up that generative process in Stan leads to the following model.
data {
int<lower = 0> K; // players
int<lower = 0> J; // players per team
int<lower = 0> N; // matches
int<lower = 1, upper = K> team0[N, J]; // team 0 players
int<lower = 1, upper = K> team1[N, J]; // team 1 players
int<lower = 0, upper = 1> y[N]; // winner
}
parameters {
vector[K] alpha;
real<lower = 0> sigma;
}
model {
sigma ~ lognormal(0, 0.5); // zero avoiding, weakly informative
alpha ~ normal(0, sigma); // hierarchical, zero centered
for (n in 1:N) // additive Bradley-Terry model
y[n] ~ bernoulli_logit(sum(alpha[team1[n]]) - sum(alpha[team0[n]]));
}
generated quantities {
int<lower=1, upper=K> ranking[K]; // rank of player ability
{
int ranked_index[K] = sort_indices_desc(alpha);
for (k in 1:K)
ranking[ranked_index[k]] = k;
}
}
team_model <- stan_model("team.stan")
team_posterior <- sampling(team_model, data = team_data, init=0.5)
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
print(team_posterior, c("sigma", "alpha"), probs=c(0.05, 0.5, 0.95))
Inference for Stan model: team.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 5% 50% 95% n_eff Rhat
sigma 0.99 0.01 0.10 0.85 0.99 1.17 253 1
alpha[1] 0.53 0.01 0.17 0.25 0.53 0.81 185 1
alpha[2] 0.42 0.01 0.17 0.13 0.42 0.68 227 1
alpha[3] 1.70 0.01 0.17 1.42 1.71 1.98 214 1
alpha[4] -0.02 0.01 0.17 -0.30 -0.02 0.25 203 1
alpha[5] 0.01 0.01 0.17 -0.27 0.02 0.29 197 1
alpha[6] 0.17 0.01 0.17 -0.11 0.17 0.44 203 1
alpha[7] -0.02 0.01 0.17 -0.31 -0.02 0.26 206 1
alpha[8] -1.46 0.01 0.17 -1.75 -1.46 -1.19 205 1
alpha[9] -0.29 0.01 0.17 -0.57 -0.29 -0.01 192 1
alpha[10] 1.00 0.01 0.17 0.71 1.00 1.28 207 1
alpha[11] -0.86 0.01 0.17 -1.14 -0.86 -0.59 213 1
alpha[12] 0.29 0.01 0.16 0.02 0.30 0.56 198 1
alpha[13] -0.39 0.01 0.17 -0.67 -0.38 -0.12 196 1
alpha[14] 0.31 0.01 0.17 0.02 0.31 0.59 202 1
alpha[15] -0.91 0.01 0.17 -1.19 -0.91 -0.64 204 1
alpha[16] -0.32 0.01 0.16 -0.59 -0.31 -0.05 195 1
alpha[17] 1.23 0.01 0.17 0.95 1.24 1.50 216 1
alpha[18] -0.83 0.01 0.17 -1.10 -0.82 -0.55 205 1
alpha[19] 1.75 0.01 0.17 1.47 1.75 2.03 213 1
alpha[20] -0.19 0.01 0.17 -0.47 -0.19 0.08 212 1
alpha[21] -2.28 0.01 0.18 -2.58 -2.27 -1.99 237 1
alpha[22] -0.60 0.01 0.17 -0.87 -0.59 -0.33 205 1
alpha[23] -0.03 0.01 0.16 -0.31 -0.03 0.24 188 1
alpha[24] -0.41 0.01 0.17 -0.70 -0.41 -0.14 212 1
alpha[25] -1.31 0.01 0.17 -1.60 -1.31 -1.04 218 1
alpha[26] 1.47 0.01 0.17 1.18 1.47 1.74 196 1
alpha[27] 1.02 0.01 0.17 0.74 1.02 1.29 197 1
alpha[28] 0.33 0.01 0.17 0.04 0.33 0.60 198 1
alpha[29] -0.03 0.01 0.16 -0.32 -0.03 0.23 199 1
alpha[30] -1.62 0.01 0.17 -1.90 -1.61 -1.35 207 1
alpha[31] 0.81 0.01 0.17 0.53 0.81 1.09 214 1
alpha[32] 0.05 0.01 0.17 -0.23 0.06 0.33 202 1
alpha[33] -0.30 0.01 0.17 -0.58 -0.29 -0.03 194 1
alpha[34] -0.07 0.01 0.17 -0.35 -0.07 0.20 201 1
alpha[35] 0.18 0.01 0.17 -0.10 0.18 0.45 202 1
alpha[36] -2.12 0.01 0.18 -2.41 -2.11 -1.82 234 1
alpha[37] -0.22 0.01 0.17 -0.50 -0.22 0.05 198 1
alpha[38] 1.13 0.01 0.17 0.84 1.14 1.40 208 1
alpha[39] 1.29 0.01 0.17 1.01 1.29 1.56 220 1
alpha[40] -0.17 0.01 0.17 -0.46 -0.17 0.10 208 1
alpha[41] -1.29 0.01 0.17 -1.58 -1.29 -1.03 215 1
alpha[42] -0.33 0.01 0.17 -0.62 -0.33 -0.06 196 1
alpha[43] -1.28 0.01 0.17 -1.58 -1.28 -1.01 222 1
alpha[44] -0.42 0.01 0.17 -0.69 -0.42 -0.15 213 1
alpha[45] -0.65 0.01 0.17 -0.93 -0.64 -0.38 213 1
alpha[46] 0.06 0.01 0.17 -0.22 0.06 0.33 198 1
alpha[47] 1.44 0.01 0.17 1.16 1.44 1.72 209 1
alpha[48] 1.80 0.01 0.17 1.50 1.80 2.09 242 1
alpha[49] 0.66 0.01 0.17 0.39 0.66 0.94 215 1
alpha[50] 1.53 0.01 0.17 1.25 1.53 1.80 204 1
Samples were drawn using NUTS(diag_e) at Sun Dec 15 16:38:50 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
As before, we will pull out the Bayesian parameter estimates.
alpha_draws <- extract(team_posterior)$alpha  # one row per draw, one column per player
alpha_hat <- rep(NA, K)
for (k in 1:K)
  alpha_hat[k] <- mean(alpha_draws[ , k])
and plot against the simulated values.
Fit of true value (horizontal axis) versus Bayesian (posterior mean) estimate (vertical axis) for team-based Bradley-Terry model. As with the previous model, inference for true abilities is very good.
team_bayes_fit_plot <-
ggplot(data.frame(alpha = alpha, alpha_hat = alpha_hat),
aes(x = alpha, y = alpha_hat)) +
geom_abline(slope = 1, intercept = 0, color="green", size = 2) +
geom_point(size = 2) +
ggtheme_tufte()
team_bayes_fit_plot
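As with the individual model, the posterior draws support prediction of future team matches. Here is a sketch for a hypothetical match in which players 1, 2, and 3 form team 0 and players 4, 5, and 6 form team 1 (the choice of teams is arbitrary).
draws <- extract(team_posterior)$alpha
log_odds_team1 <- rowSums(draws[, 4:6]) - rowSums(draws[, 1:3])
mean(inv_logit(log_odds_team1))  # estimate of Pr[team 1 beats team 0 | y]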
Explore the sensitivity of the prior by contrasting fits with priors on the scale parameter of \(\sigma \sim \mathsf{LogNormal}(0, 1)\) and \(\sigma \sim \mathsf{LogNormal}(0, 0.25)\). Then consider a simple half-Cauchy prior \(\sigma \sim \mathsf{Cauchy}(0, 1)\) and a half-normal prior \(\sigma \sim \mathsf{Normal}(0, 1)\).
Consider an extended Bradley-Terry model in which there is a “home-field advantage.” This may literally be an advantage for the sports team playing on its home field (stadium, arena, etc.), or it may be something like the advantage of playing the white pieces in chess. Assume that player 1 is the “home” player in the data coding and that there is a global intercept term representing this advantage on the log odds scale. Simulate data and fit it as in the examples above. Is the home-field advantage recovered? Given that we know home-field advantages are positive, does it make sense to constrain the home-field advantage parameter to be positive? If you do, and the effect is fairly small and consistent with zero in the posterior, what happens to its estimate with the constraint versus without it (assuming the same prior)?
Bradley, Ralph Allan and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4):324–345.
Leonard, T. 1977. An alternative Bayesian approach to the Bradley-Terry model for paired comparisons. Biometrics 33(1):121–132.
Zermelo, E. 1929. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift 29(1):436–460.
All of the source code, data, text, and images for this case study are available on GitHub at
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tufte_0.5 rstan_2.19.2 StanHeaders_2.19.0
[4] reshape_0.8.8 knitr_1.23 gridExtra_2.3
[7] ggplot2_3.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 highr_0.8 pillar_1.4.0
[4] compiler_3.6.0 plyr_1.8.4 prettyunits_1.0.2
[7] tools_3.6.0 pkgbuild_1.0.6 digest_0.6.19
[10] evaluate_0.13 tibble_2.1.1 gtable_0.3.0
[13] pkgconfig_2.0.2 rlang_0.4.2 cli_1.1.0
[16] parallel_3.6.0 yaml_2.2.0 xfun_0.7
[19] loo_2.1.0 withr_2.1.2 dplyr_0.8.1
[22] stringr_1.4.0 stats4_3.6.0 grid_3.6.0
[25] tidyselect_0.2.5 inline_0.3.15 glue_1.3.1
[28] R6_2.4.0 processx_3.4.1 rmarkdown_1.13
[31] callr_3.3.2 purrr_0.3.2 magrittr_1.5
[34] codetools_0.2-16 matrixStats_0.55.0 ps_1.3.0
[37] scales_1.0.0 htmltools_0.4.0 assertthat_0.2.1
[40] colorspace_1.4-1 labeling_0.3 stringi_1.4.3
[43] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4
Code © 2017–2018, Trustees of Columbia University in New York, licensed under BSD-3.
Text © 2017–2018, Bob Carpenter, licensed under CC-BY-NC 4.0.