Special thanks to Alfie Kirkpatrick for all his hard work setting up and maintaining the Backgammon Hub website, without which the data for this project would not exist. This article is intended for multiple audiences with varying levels of technical knowledge of statistics and experience with Backgammon (the two are possibly negatively related…). I have done my best to explain everything as I go, so that a fully engaged reader should be able to follow the main ideas, though some complex specifics are left for particular audiences.
This article concerns using existing computer analysis software for the game of Backgammon to calculate 2 standard metrics, Performance Rating (PR) for skill and EMG luck (EMG equity change on the dice roll) for luck, and using statistical methods to test for dependencies between them. In particular, the article analyses a sample of 317 matches played online on Backgammon Hub, applying multiple statistical tests to the luck distributions conditioned on PR, searching for evidence of bias in the luck. The data was framed so that for each match the luck values for the 2 players were considered in terms of which player had the better PR in that match (i.e. the luck of the player with the better/worse PR in the match, rather than the players’ respective average PRs). It was found to high statistical confidence (each hypothesis test used a significance level of 1%; the lowest p-value was 1 in 31500) that the player who played less accurately receives favourable luck on average, in both an absolute and a relative sense. It was also found that luck was the dominant factor in determining match outcome even for 13 point matches, and that the low PR luck and high PR luck were non-linearly dependent on each other in a complex, polarised way. The “Summary and interpretations” section gives a full explanation of the results of the data analysis.
The inspiration for this project came from a chat I had with a Backgammon player after an online match on Backgammon Hub. Based on anecdotal experience from many years of playing Backgammon, and mostly unaware of the ways Backgammon software now enables an objective post-match review of performance and luck, he was convinced that players who achieve a lower (better) PR tend to get better luck with the dice. I explained to him that the dice could not possibly respond to the skill levels of the players, since Backgammon Hub is completely transparent about how the dice rolls are predetermined by a pseudo-random number generator using seeds the players pick randomly at the start¹. That said, after hearing his opinion I could not entirely reject it: based on my own anecdotal observations of post-match summary PR and luck, I felt there was at least some truth to it, that they were not entirely independent. The conversation set me theorising about plausible ways that, even with completely fair dice, the computer-evaluated luck metrics could still be biased or skewed by more accurate playing strategy. However, I acknowledged that having any real confidence in these ideas would require statistical testing, and that Backgammon Hub could provide all the data I needed. I will now explain some technical prerequisites and the theories I had starting this project.
For readers who are unaware of the rules of Backgammon, I recommend reading the rules on Wikipedia² or watching Backgammon Galaxy’s tutorial video³ before reading further. Although Backgammon is a dice game, there is still a high degree of skill based on how the players choose to allocate the dice roll movements among their 15 checkers, the risk-rewards they take, and whether or not to double the stakes or to drop in response to a double. The modern Backgammon AI engines are extremely well optimised for solving these decision problems, finding the moves that optimise the statistically expected value of the game position by thinking ahead several moves across all of the possible dice rolls, under the assumptions that the numbers rolled on both dice follow independent discrete uniform distributions and that the opponent plays just as well in response.
In match play, the winner is the first player to reach a pre-agreed match score, playing as many games of Backgammon as that takes. Due to the statistical nature of this structure and some peculiarities of Backgammon, match play introduces considerable non-linearities between equity/points and Match Winning Chances (MWC), which motivates the EMG normalisation. In order to compare positions for analysis, the AI uses a metric called Equivalent to Money Game equity (EMG), which evaluates the relative expected value of positions in a normalised way, comparable to the expected equity if the games were instead played as disconnected money games, with the points themselves representing the quantity to be optimised rather than progress towards winning a match. EMG converts a MWC value into an equity value through linear interpolation and extrapolation using a Match Equity Table (MET), which is numerically computed through Monte Carlo simulation. The MET contains the MWC values for the players at the start of a game for all combinations of match scores. EMG takes the 2 MWC values for the 2 outcomes of either player winning the current game as a single game (1 times the current cube level), and uses these 2 points to form a line for converting MWC values to base point values⁴. The MWC itself is computed by the AI using the MET along with a sophisticated neural network that determines the winning, gammon and backgammon chances for the 2 players in the current position, taking into account the effect of the doubling cube on future turns.
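As a rough sketch of that conversion, assuming hypothetical MET lookups have already produced the MWC values for the player winning or losing the current game as a single game (this is a simplification, not GNUbg’s actual implementation):

#Minimal sketch of the EMG conversion: linearly map MWC so that losing the
#current game as a single game (1 times the cube level) gives -1 and winning
#it gives +1. mwc_win and mwc_lose are assumed to come from MET lookups.
emg_from_mwc <- function(mwc, mwc_win, mwc_lose) {
  -1 + 2 * (mwc - mwc_lose) / (mwc_win - mwc_lose)
}
emg_from_mwc(mwc = 0.60, mwc_win = 0.72, mwc_lose = 0.38) #illustrative values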
Although these AI engines do not rely on EMG to decide on the best move (the MWC that they calculate is sufficient), EMG is the preferred way for players to receive quantitative feedback from the AI on the size of mistakes and swings of luck in their games. The change in EMG before and after each roll defines the “Luck” of the dice roll, and summing these across a match gives the total luck for the match. Similarly, the difference between the EMG value of the best decision and the one that was played measures the size of a mistake in terms of “equity lost”. This can be summed across a whole match to give the total equity lost, divided by the total number of decisions made by the player (legally forced moves do not count as decisions) to give the average equity lost per decision, and then multiplied by 500 to give a Performance Rating (PR). The lower the PR, the better they played and the higher the skill level of the player. PR is the standard way of expressing the objective skill level of a Backgammon player since it does not significantly vary with match length, with the exception of 1 point matches. In general, a player will have a 1 point match PR half that of their PR in matches longer than 1 point. This is due to how the presence of the cube changes the game in 2 main ways. Firstly, the capacity for large mistakes is much greater for cube plays than for checker plays. However, even if a player is equally skilled with the cube as with the checkers, so that the average equity lost per decision is the same for checker plays as for cube plays, their 1 point match PR would still be half their longer match PR. The reason is that within a single game with the cube in effect, the same mistake counts for twice as much EMG because the equity curve is compressed between the take points, resulting in a steeper gradient in equity as winning chances change. The take points are roughly at 25% and 75% winning chances, which gives a range of winning chances before double-pass half that of the equivalent cube-less range from 0% to 100% winning chances.
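As a toy illustration of the PR arithmetic (the per-decision equity losses here are invented; in reality they come from the engine’s analysis of each unforced decision):

#Toy PR calculation from invented per-decision EMG equity losses;
#legally forced moves are excluded before this point
equity_lost <- c(0.000, 0.032, 0.000, 0.115, 0.000, 0.008)
500 * sum(equity_lost) / length(equity_lost) #PR: average loss per decision x 500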
Two important aspects of Backgammon checker play strategy are diversification and duplication. Diversification is when a player moves their checkers so as to maximise the number of dice roll combinations that they can make good use of on subsequent rolls; duplication is the reverse, applied to the opponent: arranging one’s checkers so that the opponent needs the same dice numbers to achieve 2 different things. Because of these concepts, players often get the sense while playing that they are “making their own luck” by creating opportunities for themselves to get lucky. While this is in a sense true, as far as the EMG computer analysis is concerned the expectation of luck (the theoretical mean as opposed to the observed sample average) is always zero (fair) for any position, regardless of how well the players are playing. This follows from the fact that the MWC for the position on the roll can be considered as a weighted average of the MWC values for each dice roll outcome based on the dice probabilities, so essentially by definition the mean change in MWC is zero. EMG is a linear function of MWC, so the same reasoning about expectations applies to it too. This does not mean that diversification and duplication are strategically ineffective; they shift the mean value of the later positions, and therefore also of the current position, upwards, but this also shifts up the baseline from which Luck is measured. I understood this before starting the project, so I knew that the skill levels of the players could not influence the mean Luck; however, I thought that the ways in which they played could potentially influence the median Luck by manipulating the skew in the value of dice outcomes through diversification and duplication.
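This zero-expectation argument can be made concrete with a toy calculation: however the post-roll equities are arranged (the values below are random placeholders), defining the pre-roll equity as their probability-weighted average forces the mean luck to cancel exactly.

#Toy demonstration that luck has zero expectation by construction. The 21
#distinct rolls: 15 non-doubles (prob 2/36 each) and 6 doubles (1/36 each).
set.seed(1)
p_roll <- c(rep(2/36, 15), rep(1/36, 6))
post_EMG <- rnorm(21) #hypothetical best-play equities after each roll
pre_EMG <- sum(p_roll * post_EMG) #pre-roll equity is their weighted average
sum(p_roll * (post_EMG - pre_EMG)) #expected luck: zero up to floating point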
The data for this project was all collected from matches played on the Backgammon Hub website; however, due to access limitations as a player rather than as an admin, the data had to be aggregated from multiple sources, with the data group for each source being a systematic but effectively random and representative sample of its respective population. For each data source, data was collected in reverse chronological order, with “outlier” or invalid matches filtered out manually. My criterion for a match to be a valid data point was that the match had to reach a natural conclusion; timeouts, disconnections and premature resignations were regarded as invalid as they could greatly skew the results of how PR, luck, and outcome are related. Considering the rarity (<1%) of these cases in the population of all matches played on Backgammon Hub, filtering them out cannot introduce major artefacts, and only strengthens the validity of the conclusions by ensuring the data is valid. The data groups were as follows: 106 of my own matches, 106 of Alex Zamanian’s matches (who I believe was my strongest opponent, much better than me for sure), 42 matches sampled from the population of all games played on Backgammon Hub (this data was kindly provided to me by Alfie Kirkpatrick), and 63 long matches (match lengths 9-13) systematically collected in chunks from the profiles of several of my opponents.
Due to the way the data was sampled in groups, it cannot be regarded as a perfectly representative sample of matches played on Backgammon Hub as a whole. This is especially true with respect to the match length and low PR data fields, since many of the data sources focus on matches played by advanced/expert level players and their preferred match lengths. This will perhaps also affect the data fields describing the match outcome in terms of whether the winner had better PR or luck, since these will be determined by the skill difference of the players. That said, the recorded luck values can still be considered a random and representative sample of luck since, under the initial assumptions for this statistical investigation (the null hypothesis H0), the luck is a random variable independent of PR. Even under the alternative hypothesis H1, that luck and PR are somehow related, luck can still be considered a random variable jointly distributed with the PRs, and the data represents a random and representative sample of luck values for particular PR combinations. There is no plausible explanation for how the nature of the different data sources could influence the luck in a way that is not reflected by the PRs: the dice cannot innately behave differently depending on who is playing⁵, and any psychological aspect would be reflected in how it influences PR.
In total the data represented 317 Backgammon matches with match lengths ranging from 1 to 13 (odd numbers only, which is standard practice for Backgammon matches). For each match the following data fields were recorded: match length, low PR, high PR, total luck for the player with the lower PR, total luck for the player with the higher PR, whether or not the winner had the better PR, and whether or not the winner had the better total luck. The computer analysis values for PR and total luck were calculated on Backgammon Hub using the “GNU Backgammon” software, which uses EMG metrics⁶ as described in “Theory and initial hypothesis”.
Before discussing the results of the analysis, it is first important to justify the statistical modelling by describing the probability maths a little more formally at a high level of abstraction. \[ \begin{aligned} p_i'&=p_i*a_i\\ p_{i+1}&=p_i'*C_i(p_i',D_i) \quad \text{until } MWC(p_{i+1})=1\\ EMG(p_i')&=\mathbb{E}_{D_i}(EMG(p_i'*c_{best}(p_i',D_i)))\\ L_i(p_i')&=EMG(p_i'*c_{best}(p_i',D_i))-EMG(p_i')\\ W \cap B&=\emptyset\\ L_W&=\sum_{i \in W}L_i\\ L_B&=\sum_{i \in B}L_i \end{aligned} \]
Here \(p_i\) abstractly represents the \(i\)th position in the match (board state, match scores, and cube state) and \(p_0\) is the starting position of the match. \(a_i\) represents the cube actions for both players on the \(i\)th turn of the match, modifying (represented by the \(*\) operator) position \(p_i\) to make a post-cube-action position \(p_i'\). This potentially raises the cube level or ends the current game with a double-pass (in which case the checker plays for that turn can be ignored and \(L_i=0\)). The cube and checker actions must be separated in this way because the luck random variables are defined for dice rolls given the cube actions (the luck values for particular dice rolls in the same board position can be different at different cube levels). \(C_i(p_i',D_i)\) is the checker decision made on the \(i\)th turn, which modifies the post-cube-action position \(p_i'\) to make \(p_{i+1}\), until a player wins the match. \(L_i(p_i')\) is the luck value for the \(i\)th roll: the change in EMG when the best move \(c_{best}\) is played out of the set of all possible moves specified by \(p_i'\) and \(D_i\). These \(L_i\)’s are discrete random variables, real valued functions of the random dice outcomes \(D_i\) parameterised by \(p_i'\). Although \(\mathbb{E}_{D_i}(L_i(p_i'))=0\), \(p_i'\) is a product of all previous dice rolls and player decisions, and this is how dependencies between the different \(L_i\) values may be introduced and how player skill levels may be able to change the shape of the distributions by directing the progression of \(p_i\).
The turns for the players with the white/black checkers are expressed as disjoint sets \(W\) and \(B\) of values of \(i\) such that the \(i\)th turn was made by that player colour. The turns alternate within each game of the match, but the player who starts each game is determined fairly and at random, which makes this notation necessary, although it can be thought of as even/odd turns. The total luck across all the turns of the white/black player is then \(L_W\) or \(L_B\) respectively. A game in a match with the cube can be over quickly from a double-pass in under 10 turns, but typically each game lasts for 30 turns per player or more. Given the number of \(L_i\) values summed together, the Central Limit Theorem (CLT) should be valid and it is reasonable to assume that \(L_W\) and \(L_B\) are approximately normally distributed. Regardless of whether the \(L_i\) random variables are independent or not, \(\mathbb{E}(L_{W})=\sum_{i \in W} \mathbb{E}(L_{i}) =0\), and similarly for \(L_B\), so we would expect the means of these normal distributions to be 0. The labelling of \(L_W\) and \(L_B\) is arbitrary, but after observing the match they can be relabelled as “low PR luck” and “high PR luck” depending on which player played more accurately⁷, essentially conditioning on PR. These are the main random variables of interest in this investigation. Although the expectations of \(L_W\) and \(L_B\) are both 0, conditioning on PR can make the expectation non-zero if the performance of the players is dependent on the luck (or perceived luck) on previous rolls in the match, which seems likely in human games. If the skill levels of the players were fixed, for example 2 Backgammon AIs of potentially different skill levels playing against each other, then I would expect that conditioning on PR would not be able to select for non-zero expected luck.
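A quick simulation illustrates the CLT intuition; the per-roll luck distribution here is invented (a quiet component plus rare large zero-mean swings), not fitted to real positions.

#Illustrative CLT check: totals of 60 invented per-roll luck values,
#mixing quiet rolls with rare hit-or-miss swings, still look normal
set.seed(7)
match_luck <- replicate(5000, {
  quiet <- rnorm(60, mean = 0, sd = 0.05)
  swing <- rbinom(60, size = 1, prob = 0.1) * sample(c(-0.4, 0.4), 60, replace = TRUE)
  sum(quiet + swing)
})
qqnorm(match_luck); qqline(match_luck) #points fall close to a straight line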
An important point about the EMG normalisation is that it ensures a form of approximate homoscedasticity (equal variance) of the \(L_i\) random variables across different games in a match. The same dice outcome for the same board state can have a higher luck associated with it when measured in terms of MWC, depending on the scores and how far away the players are from winning the match. The EMG normalisation of luck expresses the change in MWC in a way that is comparable to how it changes the expected base⁸ points to be won in the current game, while still reflecting how match context modifies the value of different game outcomes and therefore the luck of the dice roll. This ensures that the variances of the \(L_i(p_i')\) random variables have roughly similar magnitude for the same position across different match point scenarios. However, it is important to acknowledge that the variance of \(L_i(p_i')\) varies enormously with the board state of \(p_i'\); it is common for the distribution of \(L_i(p_i')\) to be highly volatile and polarised (do you hit or miss?).
In any case, treating the \(L_i\)’s as approximately uncorrelated, the variances of \(L_W\) and \(L_B\) are equal to the sums of the variances of their respective \(L_i\)’s. Therefore the average variance of these luck totals should be approximately proportional to the average variance of \(L_i\) across all kinds of positions⁹, multiplied by the average number of moves for a player in a game, multiplied by the average number of games in a match. This makes the variance approximately proportional to the match length, so the standard deviation and spread of the luck totals should be approximately proportional to the square root of the match length.
##
## Table of match length counts:
##
##   1   3   5   7   9  11  13
##  27  61 106  48  25  25  25
The data collected mainly consists of 3, 5 and 7 point matches, with 25 matches for each long match format (9, 11 and 13). This limits the power of the analysis when focusing on the long match lengths individually, but a meaningful analysis of possible relationships between PR and luck for long matches is still possible; taken together, the long match data gives a sense of how trends continue to longer match lengths.
Figure 2 uses sample deciles (the values at increments of 10% through the data in sorted order) to define the bin intervals. On first inspection, the above histograms appear to show that both distributions have similar bell shapes with central average positions close to 0, which is expected for fair luck. Some notable features of the graphs, most likely just the result of random fluctuation and probably not worth reading too much into, are that the extreme luck values, both maximum and minimum, are higher for high PR luck than for low PR luck. These are extreme values but not outliers, as they are still valid data points. There is also a small dip at the middle of the low PR Luck distribution; this apparent bi-modal polarisation could be explained by more skilled players taking bigger risk-reward plays that go either very well or terribly, but just because you can come up with a compelling explanation for anything in hindsight does not mean it is correct, especially for a minor feature of a graph.
On closer inspection, however, we can see that each decile for high PR luck is higher than the corresponding decile for low PR luck, suggesting that the player in the match who played less well fairly consistently gets slightly better luck than the player who played best (NB: this is in general, not necessarily in a pair-wise sense for each match in turn). This requires more formal statistical testing to avoid over-reacting to patterns spotted in randomness, but it is still good to note how this visually depicts the effect size. If there is a real effect between Luck and PR, it appears very small relative to the realistic range of Luck values.
Given that 9 statistical tests are used in this investigation, a significance level of 1% will be used to account for the problem of multiple inferences. Assuming the null hypothesis is true for every test, the probability that at least one test incorrectly rejects the null hypothesis is 8.6%. In other words, we can be more than 90% confident that all statistically significant results rejecting the null hypothesis are valid. This is necessary to ensure reliability of results, but the 1% significance level also limits the sensitivity of the tests to small deviations from the initial assumptions. So it is not valid to take a non-significant result as strongly confirming no deviation from the null hypothesis, only that there is no large deviation.
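For reference, the 8.6% figure is simply the family-wise error rate for 9 independent tests at the 1% level:

#Probability that at least one of 9 independent tests falsely rejects
#its null hypothesis at the 1% significance level
1 - (1 - 0.01)^9
## [1] 0.08648275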
#Computing the number of times that the player with the lower PR in a match also
#had better luck than their opponent and testing this statistic with a
#binomial distribution to see if it is more or less often than expected for zero bias.
better_luck_count <- sum(BGHub$low_PR_luck > BGHub$high_PR_luck)
cat("empirical probability:", better_luck_count/nrow(BGHub),
"\n99% confidence interval under H0:\n",
qbinom(p = c(0.005, 0.995), size = nrow(BGHub), prob = 0.5)/nrow(BGHub))
## empirical probability: 0.4574132
## 99% confidence interval under H0:
## 0.4290221 0.5709779
Considering the pairs of luck values for each match, it was observed that the player with the lower PR had better luck only ~45.7% of the time. However, this is within the 99% confidence interval of what would be expected for fair luck, so the 2 tailed test fails to reject the null hypothesis that each player is just as likely to get the better luck regardless of PR. The width of the confidence interval shows that the test is likely to detect deviations from 50% larger than about 7%, and would likely miss smaller deviations.
The next 2 statistical tests consider the sample means for luck, which we assume to be normally distributed. These sample means are divided by the standard error to give approximately t-distributed test statistics. Although each match luck observation is approximately normally distributed, they are not identically distributed, since the variance will be approximately proportional to the match length. This means that using a chi square model to account for the uncertainty in the standard error is not exact, but the approximation is very good and it is still better to model this variation than not. If anything, it overestimates the p-value.
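#Testing whether the mean low PR luck deviates significantly below 0
#(one tailed t test, lower tail)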
low_mean <- mean(BGHub$low_PR_luck)
low_PR_z <- mean(BGHub$low_PR_luck) * sqrt( nrow(BGHub) ) / sd(BGHub$low_PR_luck)
cat("low PR Luck sample mean:", low_mean,
"\np-value:", pt(q = low_PR_z, df = nrow(BGHub)-1))
## low PR Luck sample mean: -0.1074921
## p-value: 0.1697268
The mean low PR luck is very close to 0 as expected. The p-value is not low enough to imply that the slight negative deviation from 0 is statistically significant.
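#Testing whether the mean high PR luck deviates significantly above 0
#(one tailed t test, upper tail)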
high_mean <- mean(BGHub$high_PR_luck)
high_PR_z <- mean(BGHub$high_PR_luck) * sqrt( nrow(BGHub) ) / sd(BGHub$high_PR_luck)
cat("high PR Luck sample mean:", high_mean,
"\np-value:", pt(q = high_PR_z, df = nrow(BGHub)-1, lower.tail = FALSE))
## high PR Luck sample mean: 0.4630063
## p-value: 3.174541e-05
The mean high PR luck is greater than 0 with an extremely low p-value, statistically significant at the 1% level. The null hypothesis is therefore rejected: the player with the worse PR in the match receives significantly favourable luck on average! The significance of this result is pretty much indisputable; under the assumption of fair luck, you would need to repeat the experiment around 31500 times with new data before expecting to observe this by sheer coincidence.
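#Sample median and quartile skew (a robust measure of asymmetry) for low PR luck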
Q <- quantile(BGHub$low_PR_luck, probs = c(0.25, 0.5, 0.75), type = 6)
cat("low PR Luck sample median:", Q[2],
"\nlow PR Luck quartile skew: ", (Q[3] + Q[1] - 2 * Q[2])/(Q[3] - Q[1]))
## low PR Luck sample median: -0.01
## low PR Luck quartile skew: -0.09544247
Similarly to the mean low PR luck, the median low PR luck is also very close to 0. Based on the quartiles the luck appears very slightly negatively skewed.
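#Sample median and quartile skew for high PR luck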
Q <- quantile(BGHub$high_PR_luck, probs = c(0.25, 0.5, 0.75), type = 6)
cat("high PR Luck sample median:", Q[2],
"\nhigh PR Luck quartile skew: ", (Q[3] + Q[1] - 2 * Q[2])/(Q[3] - Q[1]))
## high PR Luck sample median: 0.34
## high PR Luck quartile skew: 0.04483431
The median high PR luck is also greater than 0 by a similar magnitude to the mean luck, although slightly lower. Based on the quartiles the luck appears very slightly positively skewed.
Figures 3 and 4 visualise 95% confidence intervals for the probabilities of winning given either better luck or better PR, displaying trends as match length changes. The confidence intervals are based on a Bayesian beta distribution posterior with a naive uniform prior, which is a suitable model for an estimator of a probability parameter given binomial data. Quantiles were then computed from this model to give the median estimate and confidence interval. The figures show that in general the player with the better luck wins about 90% of the time: almost all of the time for 1 and 3 point matches, with a slight negative trend as match length increases, although even at long match lengths it is still about 80%. On the other hand, the equivalent probability estimates for how often the player with the better PR wins are consistently much lower, at about 50% or 60%, and remain fairly constant as match length increases. Due to the way the data was collected, it is not possible to read much into these probabilities, since a systematic dependency between match length and PR difference may have been introduced by the sampling methodology. The most meaningful conclusion to draw is that in practice luck, not skill, is by far the dominant factor in deciding match outcome, even up to 13 point matches.
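The interval construction itself is straightforward in R; the win count below is purely illustrative, not a figure taken from the data.

#Sketch of the Beta posterior intervals behind figures 3 and 4: with a
#uniform Beta(1, 1) prior and w wins out of n matches, the posterior for
#the win probability is Beta(w + 1, n - w + 1)
w <- 24; n <- 27 #illustrative counts
qbeta(p = c(0.025, 0.5, 0.975), shape1 = w + 1, shape2 = n - w + 1)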
To enable comparison of luck across different match lengths, it is useful to transform the data to a standardised form so that the variance of standardised luck is constant across match lengths. I defined the standardised forms of the low and high PR luck quantities by subtracting the respective sample means (averaging across all match lengths for low and high PR respectively) and dividing by the square root of the match length (see the “Statistical modelling” section). Under the null hypothesis that there is no difference in mean luck across the different match lengths, this constructs independent and identically distributed normal random variables describing the deviation of the observed luck from the average for the low/high PR player, with the size of the deviation viewed in the context of the match length.
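The standardisation was implemented along these lines (a reconstruction; the column names match the R output shown later):

#Standardising luck: subtract the respective overall sample mean, then
#divide by the square root of the match length
BGHub$low_PR_luck_std <- (BGHub$low_PR_luck - mean(BGHub$low_PR_luck)) /
  sqrt(BGHub$match_length)
BGHub$high_PR_luck_std <- (BGHub$high_PR_luck - mean(BGHub$high_PR_luck)) /
  sqrt(BGHub$match_length)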
Figures 5 and 6 demonstrate that after standardising the luck values for match length, the quantiles largely match what would be expected for a Normal distribution for both high and low PR Luck.
Figures 7-10 display scaled Chi Square 95% confidence intervals for the variance of luck, visually verifying that the variances were inconsistent across match lengths but consistent after standardising.
Figures 11 and 12 show that there are no apparent differences in location or spread of standardised luck (for either high or low PR luck) across different match lengths. Just to be sure, an ANOVA test is a good way to test this formally since each category has equal variance.
#Testing if the luck biases are significantly different across different match lengths
#with ANOVA of standardised luck
low_PR_luck_aov <- aov(low_PR_luck_std ~ as.factor(match_length), data = BGHub)
summary(low_PR_luck_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(match_length) 6 4.87 0.8112 1.194 0.309
## Residuals 310 210.70 0.6797
high_PR_luck_aov <- aov(high_PR_luck_std ~ as.factor(match_length), data = BGHub)
summary(high_PR_luck_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(match_length) 6 6.99 1.1652 1.771 0.105
## Residuals 310 203.93 0.6578
There is no significant variation in average standardised low PR luck between match length formats relative to the variation in luck values within groups. The same is true for standardised high PR luck. This implies a lack of major inconsistencies in the relationship between PR and luck across different match lengths. Due to the unbalanced number of matches per match length, the power of this test to detect deviations in the group averages is weaker for the longer match lengths, where there is less data.
Thank you for making it through the previous sections; I have saved the most surprising and counter-intuitive results for this last section, so it should be worth it! I stumbled across this while looking for ways to transform and reduce the dimensions of the data into meaningful factors for luck and skill, aiming to derive a general model for the luck threshold for winning or losing based on skill difference and match length; more on that shortly.
As far as luck is concerned, what is important in determining the outcome of a Backgammon match is not whether a player had good luck with their dice, but whether their luck was better than their opponent’s. This motivates looking at the difference of the standardised luck variables, so I define the Relative Standard Luck (RSL) as the standardised low PR luck minus the standardised high PR luck. The distribution of RSL reveals an unusual dependence between high/low PR luck.
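In code this is just the pairwise difference of the standardised columns (the name luck_std_dif matches the linear model output shown later):

#Relative Standard Luck (RSL): standardised low PR luck minus
#standardised high PR luck, computed pairwise for each match
BGHub$luck_std_dif <- BGHub$low_PR_luck_std - BGHub$high_PR_luck_std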
Figures 13 and 14 show the distributions of high/low PR luck, which are clearly unimodal around 0 and, as shown previously in figures 5 and 6, are normally distributed. Figure 15 plots the distribution of RSL, the pairwise difference, for all the data, which is visibly bimodal, and figure 17 shows it is definitely not normally distributed. This is very surprising, because when 2 independent normal random variables are subtracted from each other, the result is another normal random variable with mean equal to the difference of the 2 component means and variance equal to the sum of the 2 variances. More generally, for bivariate normal data with possible dependence from linear correlations between the 2 components, any linear combination of the 2 components will also be normally distributed. The observed distribution of RSL therefore shows that low and high PR luck are non-linearly dependent¹⁰ on each other! The distribution appears to resemble the superposition of 2 Gaussian peaks of equal height and spread positioned at ±1.
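For contrast, a quick simulation shows what the difference of 2 independent normal variables looks like; unlike the observed RSL, it is unimodal.

#The difference of 2 independent normals is itself normal and unimodal,
#unlike the bimodal RSL in figure 15
set.seed(3)
d <- rnorm(1e5) - rnorm(1e5)
hist(d, breaks = 60) #a single bell curve centred at 0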
Figure 16 breaks down the RSL data by match length into single (1), short (3-7), and long (9-13) matches and plots the distribution for each case. There are signs of the same bimodal behaviour in each case, especially extreme for single games, although it is hard to evaluate this objectively from the graph because each category has a different number of matches and there are only 27 data points for the single games. Despite this small sample size, it is still possible to carry out a statistical test of whether the underlying distribution RSL is sampled from is the same for each case!
The Kolmogorov-Smirnov test considers the maximum difference between the empirical cumulative distribution functions of 2 samples, and uses this as a test statistic to calculate a p-value under the null hypothesis that the 2 samples were drawn from the same distribution. To compare all 3 categories, it is only necessary to compare 2 of them against the third as a reference; I used the short matches as the reference group since it had the most data points (215), which ensured the greatest statistical power. There are various ways the match lengths could have been grouped for these tests, including pairwise testing of each match length combination, but this methodology keeps the number of inferences low and the sample sizes high for good statistical power. From playing experience I expected minimal differences in RSL distributions between match lengths within the same category, but thought it plausible that the RSL distribution could be quite different for 1 point matches, given that they are played without the cube.
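The 3 groups were constructed along these lines (a reconstruction consistent with the variable names in the test output below):

#Grouping RSL values by match length category
single <- BGHub$luck_std_dif[BGHub$match_length == 1]
short <- BGHub$luck_std_dif[BGHub$match_length %in% c(3, 5, 7)]
long <- BGHub$luck_std_dif[BGHub$match_length %in% c(9, 11, 13)]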
ks.test(single, short, alternative = "two.sided")
##
## Exact two-sample Kolmogorov-Smirnov test
##
## data: single and short
## D = 0.40637, p-value = 0.000415
## alternative hypothesis: two-sided
ks.test(short, long, alternative = "two.sided")
## Warning in ks.test.default(short, long, alternative = "two.sided"): p-value
## will be approximate in the presence of ties
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: short and long
## D = 0.094574, p-value = 0.7025
## alternative hypothesis: two-sided
The Kolmogorov-Smirnov test between the single and short data has a low p-value of 0.04%, which is significant at the 1% level. This confirms that the RSL for single games has a different distribution than for short matches, and the left panel of figure 16 indicates the qualitative differences in the distribution (more strongly concentrated around the 2 modes). The p-value for the test comparing short and long matches was not below the significance level, so the test fails to identify significant differences between their distributions. Taken together, this strongly suggests (though was not directly tested) that single and long matches also have different RSL distributions.
Note the warning message about ties in the data for the test comparing short and long RSL. This indicates that some of the RSL values in the data are identical, which potentially makes some of the modelling assumptions of the test less valid. More concerning, I thought this could indicate that some of the matches were duplicated in the data, but after inspecting the matches that had identical RSL values I found that they were distinct matches: the PR and luck values were different, but the difference in the luck was exactly the same. Furthermore, there were only a handful of matches with identical RSL values, and an RSL value was never repeated more than once. The luck values were recorded to 2 decimal places (the precision displayed on Backgammon Hub at the time of data collection) and taking the difference does compress values into a narrower range, so these factors together make it plausible for some values to be repeated rarely, but it is still highly unusual. Despite the matches being different, the luck was effectively the same in terms of RSL.
To continue the dimension reduction, we wish to find a single “skill factor” statistic that most accurately reflects how the difference in skill of the players influences the match outcome. The difference in PR is a great place to start, but if the skill factor is to be used alongside RSL to predict match outcome, then it must also account for how much the PR difference counts for at different match lengths. In general we expect that as match length increases, the probability that the better player wins increases, and the definition of skill factor should reflect this. Since PR is the average equity lost per decision and the number of decisions will be proportional to the match length, multiplying the PR difference by the match length¹¹ should relate to the total equity lost. This is a useful definition of skill factor since PR can be measured from short matches and then extrapolated to longer match lengths for making predictions about winning chances.
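The skill factor and the model summarised below were presumably computed along these lines; the column names high_PR and low_PR are assumptions based on the recorded data fields.

#Skill factor: PR difference times match length, relating to the total
#equity difference between the players over the match
BGHub$skill_factor <- (BGHub$high_PR - BGHub$low_PR) * BGHub$match_length
RSL_lm <- lm(luck_std_dif ~ skill_factor, data = BGHub)
summary(RSL_lm)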
##
## Call:
## lm(formula = luck_std_dif ~ skill_factor, data = BGHub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.66199 -1.00909 0.06251 0.98580 2.17968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.326584 0.094351 3.461 0.000612 ***
## skill_factor -0.006655 0.001781 -3.737 0.000221 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.124 on 315 degrees of freedom
## Multiple R-squared: 0.04245, Adjusted R-squared: 0.03941
## F-statistic: 13.97 on 1 and 315 DF, p-value: 0.000221
Figure 18 plots the data for RSL and skill factor, showing in blue the matches won by the player with the better luck and in red the matches they lost. The green line was plotted by inspection in an attempt to represent the threshold RSL, the critical value above which the player with the better PR wins and below which they lose. The triangular area above the line but below the intercept represents cases where the more skilled player has worse luck but still manages to win the match by playing better. Due to the overlap of the clusters it is impossible to find a line that perfectly separates the wins and losses, but it is still highly accurate, with only 11 out of 317 data points on the wrong side of the line, 3 of which were outliers (represented with red crosses) in the sense that the player with the better PR also had better luck yet somehow lost!
I inspected these outlier matches more closely to see what happened, downloading the match files and looking through the moves with GNU Backgammon to confirm it had indeed happened and wasn’t a mistake in data entry. The cause in each case was that, when luck was expressed in terms of EMG, player A (who had played with the better PR) was more lucky than player B, but when the luck was instead expressed in terms of change in MWC, player B was more lucky than player A. This is a drawback of the EMG normalisation: it rescales the MWC changes of mistakes and good rolls to be of roughly the same magnitude, which inevitably departs from describing what mattered most for determining the match outcome. Thankfully we can tell from the data that this distortion only rarely creates outliers, and some qualitative observations of the 3 matches give a vague sense of causes and risk factors. The common factors were as follows:
Coming back to figure 18, the black dashed trend line is a least squares line of best fit describing how the average RSL changes with skill factor. The summary of this fitted linear model is shown below figure 18 and confirms that the fit is statistically significant at the 1% level, with a p-value below 1 in 1000. This is surprising, as it shows that RSL, which removes the previously identified constant offset in mean luck when the luck is standardised, is negatively correlated with skill factor. The gradient of the trend line is only slightly greater than the gradient of the threshold line, within one standard error of the gradient parameter estimate in the linear model. This raises doubts about whether the size of the MWC advantage actually increases significantly with skill factor. At the very least, it shows the advantage does not increase as much as we might expect if luck and PR were independent.
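The residual comparison below was presumably set up along these lines, splitting the model residuals at the median skill factor (the variable names follow the test output):

#Comparing RSL residual distributions between low and high skill factor halves
res <- residuals(RSL_lm)
low_skill_res <- res[BGHub$skill_factor < median(BGHub$skill_factor)]
high_skill_res <- res[BGHub$skill_factor >= median(BGHub$skill_factor)]
ks.test(low_skill_res, high_skill_res, alternative = "two.sided")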
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: low_skill_res and high_skill_res
## D = 0.10545, p-value = 0.3416
## alternative hypothesis: two-sided
Figures 19 and 20 show the distributions of the RSL residuals, the differences between the observed values and the fitted values from the linear model, splitting the data for all match lengths into 2 groups based on whether the skill factor was less than or greater than the median skill factor. Both have a bimodal shape similar to figure 16, with the effect more pronounced for the low skill factor group, most likely because this includes all of the single game matches. Despite this, the Kolmogorov-Smirnov test comparing these distributions fails to reject the null hypothesis that they come from the same distribution, implying that even with the highly bimodal single game data removed there would still be no identifiable difference in the distribution. This test was an attempt to check whether the residuals are IID across different values of skill factor, but it probably has quite low power, since any difference in the distributions would be one in which the means and variances remain reasonably consistent. Ideally I would also test whether the RSL values/residuals for matches longer than 1 point with very low skill factor (in the same range as the single games) show the same extremely bimodal distribution as the single games, which could imply that the bimodal behaviour is caused by very low skill factor in general rather than by the single game format alone; however, there is not enough data to test this with any decent power.
The low and high PR luck distributions were shown to be normally distributed with variances proportional to the match length. Although low PR luck had a mean close to 0, high PR luck had a sample mean of 0.463, significantly greater than 0 with a p-value of 1 in 31500. ANOVA tests of standardised luck implied that this mean offset was consistent across different match lengths. This showed that the luck was in some way statistically dependent on PR, with the player who played less accurately receiving favourable luck on average. Furthermore, the Relative Standard Luck (RSL) distribution showed that the luck values for the 2 players were non-linearly dependent on each other, polarising the luck difference. Single games were the most strongly polarised, and a Kolmogorov-Smirnov test showed that the RSL values for single games came from a different distribution than matches with the cube. The linear regression in figure 18 showed that RSL was negatively correlated with skill factor, a combined measure of PR difference times match length.
These statistical results disprove the hypothesis that the luck follows well behaved normal distributions, so what do they say about the big picture of what is happening instead? My interpretation is that, alongside some details of the psychological and inherent nature of mistakes in Backgammon¹², a statistical subtlety of the stochastic process of the Backgammon match connects many of these results: the fact that the match must end with a winner. The termination of the match once a player accumulates enough MWC, rather than after a fixed number of games, statistically selects for matches terminated by volatile luck, which can have major implications. Similar to discussions in articles by Douglas Zare [6, 7], the “net skill” and “net luck” completely describe the outcome of the match, and since the outcome can only take discrete values, the possible values for “net luck” are discrete and strictly related to “net skill” via a negative linear relationship. In the context of matches these net values relate to MWC, so considering how this translates to EMG, the distortions effectively add some random noise¹³, making the relationship no longer exact. This easily explains the bimodal distribution of RSL, and also why RSL is especially bimodal for 1 point matches, since \(EMG=MWC\) in that case. The termination condition combined with the nature of volatility in Backgammon may also result in inherently favourable luck for less skilled players, and explain why RSL trends negatively with skill factor. A crude model for this stochastic process would be MWC starting at 0.5 and evolving by Brownian motion in discrete time (moderate luck), with drift representing the skill difference, plus large discrete shocks to the MWC occurring at a constant average rate (great rolls), with the process terminating when MWC reaches 0 or 1. Relating this to EMG would be complicated but possible using thresholds from a MET.
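A minimal simulation of that crude model might look as follows; every parameter value is an illustrative guess rather than a quantity fitted to the data.

#Toy simulation of the crude model: MWC starts at 0.5, diffuses with drift
#toward the stronger player, takes occasional large jumps ("great rolls"),
#and the match terminates when MWC reaches 0 or 1
simulate_match <- function(drift = 0.002, sigma = 0.02,
                           shock_rate = 0.05, shock_size = 0.15) {
  mwc <- 0.5
  while (mwc > 0 && mwc < 1) {
    step <- drift + rnorm(1, mean = 0, sd = sigma)
    if (runif(1) < shock_rate) {
      step <- step + sample(c(-1, 1), 1) * shock_size #a great roll for someone
    }
    mwc <- min(max(mwc + step, 0), 1)
  }
  mwc #1 if the favoured (stronger) player won, 0 otherwise
}
set.seed(42)
mean(replicate(10000, simulate_match())) #favoured player's win rate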
Considering the MWC or EMG equity fluctuating over the course of a match between 2 players of fixed skill levels: the better player is more likely to gain a points lead, but the doubling cube then begins to favour the point trailer, making equity more volatile and skewed in the trailer’s favour, potentially explaining why RSL is negatively correlated with skill factor. The better player does not require great rolls to win the match, so even with moderate luck they will end the game quickly, giving themselves less time to get great rolls. The match lasts longer if it starts out with the weaker player having moderately favourable luck, so great rolls for either player are more likely to occur in matches where the weaker player starts with good luck. When adding up the total luck for the match, this suggests that the weaker player’s best luck will tend to be higher than the better player’s best luck, and similarly for their worst luck, which is consistent with figure 2.
This project only begins to delve into the statistics and probability of luck in Backgammon. It would be interesting to use Backgammon AI as a control for an absence of psychological effects to determine to what extent the measured effects are innate to the game or external. I also have plans to build on this work in optimising the FIBS Backgammon rating system [8] as a final year project for my undergraduate degree at UCL.
In writing this article I have tried to keep copies of references in footnotes for quick access, but for completeness here is a list of references:
[1] Blog post by Alfie Kirkpatrick, Backgammon Hub admin: https://ukbgf.com/online-dice-random-or-not/
[2] Backgammon rules on Wikipedia: https://en.wikipedia.org/wiki/Backgammon#Rules
[3] Backgammon Galaxy tutorial video: https://youtu.be/_hCUrQSGqTI?si=ySIsFwL4Yf_jO5qW
[4] Jeremy Bagai, The Fortuitous Press, “E.M.G Woes”: https://www.fortuitouspress.com/emg
[5] Christian Anthon, GNU Backgammon Manual, sections 7.4.3 “Statistics” and 10.4 “Equities Explained” explain how PR, Luck and equities are evaluated by GNUbg: https://www.gnu.org/software/gnubg/manual/gnubg.pdf
[6] Douglas Zare, “Hedging Toward Skill”: https://www.bkgm.com/articles/Zare/HedgingTowardSkill.html
[7] Douglas Zare, “A Measure of Luck”: https://www.bkgm.com/articles/Zare/AMeasureOfLuck.html
[8] Kevin Bastian, “The FIBS rating system”: https://bkgm.com/articles/McCool/ratings.html
1. Blog post by Alfie Kirkpatrick, Backgammon Hub admin: https://ukbgf.com/online-dice-random-or-not/ [1]↩︎
2. Backgammon rules on Wikipedia: https://en.wikipedia.org/wiki/Backgammon#Rules [2]↩︎
3. Backgammon Galaxy tutorial video: https://youtu.be/_hCUrQSGqTI?si=ySIsFwL4Yf_jO5qW [3]↩︎
4. Jeremy Bagai, The Fortuitous Press, “E.M.G Woes”: https://www.fortuitouspress.com/emg [4]↩︎
5. Although random variation will still be observed in samples, so some players will happen to be more lucky than others despite there being no difference in the mechanism behind how their luck was generated.↩︎
6. Christian Anthon, GNU Backgammon Manual, sections 7.4.3 “Statistics” and 10.4 “Equities Explained” explain how PR, Luck and equities are evaluated by GNUbg: https://www.gnu.org/software/gnubg/manual/gnubg.pdf [5]↩︎
7. How much worse their cube decisions \(a_i\) and checker decisions \(C_i(p_i',D_i)\) were than the best actions. Note that each \(a_i\) in which a player doubles involves cube decisions from BOTH players, since the player on roll doubles and the other player takes or passes.↩︎
8. Before applying the doubling cube points multiplier.↩︎
9. As meaningful as this is, given the wildly varying volatility and polarisation.↩︎
10. As in, they are dependent in a non-linear way, not that they are not linearly dependent (although this is also true). For those interested in the correlation as a descriptive statistic, the value was 0.03, although this is not relevant to the discussion since it has already been shown that they are dependent!↩︎
11. I experimented with instead multiplying by the square root of the match length, which is motivated by the form of the Backgammon rating system, although the results were very similar and the threshold line in figure 18 separated wins and losses marginally less effectively.↩︎
12. The biggest mistakes occur when failing to make the most of great dice, and it is generally hardest to make the most of doubles (which tend to be good rolls if you can use them!) given the large number of options. There will also be psychological aspects to mistakes: for instance, if a player has great dice they may start playing more lazily, or alternatively feel more tense and less sure of how far to adjust strategy based on a large score lead. These factors partially explain why the mean high PR luck is greater than 0, good luck causing lower performance.↩︎
13. Relating to how much EMG distorts things for the particular positions, which can be thought of as pseudorandom, or separate to purely considering the match in terms of MWC.↩︎