Cricket, with its long-standing tradition of statistical analysis, has long relied on the batting average, defined as the total runs scored divided by the number of dismissals, to evaluate a batter’s performance. This metric, while simple and intuitive, often fails to capture the full extent of a player’s contribution, especially in Test cricket, where innings can span many sessions and involve varied scoring patterns. In this study, we focus on the top 25 run-getters in Test cricket during the 2010s (1st January 2010 to 31st December 2019) and explore an alternative measure, the Expected Runs per Dismissal (ERPD), which provides a more nuanced assessment of batting performance.
The batting average, despite its widespread use, has several limitations:
Sensitivity to not-out innings: Batters who remain not out receive an artificial boost to their average, since the runs from these innings are counted but the innings are not counted as dismissals. This can overestimate a batter’s effectiveness, particularly for those who tend to bat for long durations.
Loss of distributional information: By aggregating performance into a single number, the batting average ignores the variability and distribution of scores. Two batters with similar averages might have vastly different scoring patterns: one might be highly consistent, while another might have sporadic high scores offset by low scores.
Neglect of survival dynamics: The batting average does not consider how long a batter survives in an innings before being dismissed. In Test cricket, the ability to occupy the crease and build an innings is as crucial as the runs scored, and survival analysis offers a framework to incorporate this dimension.
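The first two limitations can be made concrete with a small R sketch. All scores below are hypothetical and chosen only for illustration, and `batting_average` is a helper name introduced here rather than part of the original analysis.

```r
# Hypothetical scores for illustration only (not real match data)
scores  <- c(12, 45, 30, 8, 60, 25)
not_out <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Batting average: total runs divided by the number of dismissals
batting_average <- function(scores, not_out) {
  sum(scores) / sum(!not_out)
}

# Same runs, but two of the six innings are not out
avg_with_notouts <- batting_average(scores, not_out)
# Same runs if every innings had ended in a dismissal
avg_all_out <- batting_average(scores, rep(FALSE, length(scores)))

# Two hypothetical batters, always dismissed: equal averages, different spread
consistent <- c(38, 42, 40, 41, 39, 40)
erratic    <- c(0, 5, 200, 3, 30, 2)

cat("Average with two not-outs:", avg_with_notouts, "\n")
cat("Average if all innings were dismissals:", avg_all_out, "\n")
cat("Means:", mean(consistent), mean(erratic), "\n")
cat("Standard deviations:", sd(consistent), sd(erratic), "\n")
```

The two not-out innings lift the average from 30 to 45 without a single extra run being scored, and the two always-dismissed batters share an average of 40 despite very different consistency.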
Given these drawbacks, researchers and practitioners have proposed alternative metrics such as APM (Average Player Multiplier), Expected Average, BEREX (Bernoulli Run Expectation) and RAAR (Runs Above Average Replacement) as potential improvements over the traditional average. These measures, however, often still summarize performance without fully accounting for the survival aspect of batting.
The Expected Runs per Dismissal (ERPD) is introduced as a more robust measure that overcomes many of the limitations of the batting average. ERPD is defined as the expected number of runs a batter scores before being dismissed, taking into account the complete distribution of scores.
Mathematically, ERPD is expressed as:
\[ \text{ERPD} = E[X \mid \text{Dismissal}] = \int_0^\infty x f(x) \, dx \]
where \(X\) denotes the runs scored in an innings and \(f(x)\) is the probability density function of \(X\).
Goodness-of-fit testing with tools such as the Kolmogorov-Smirnov test, together with a comparison of the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), showed that the lognormal distribution offers the best fit for the runs scored by batters in Test innings that ended in a dismissal.
The following output (batter under study: Alastair Cook) shows the lognormal distribution being selected as the best fit among a few well-known distributions.
Goodness-of-fit statistics
Statistic | 1-mle-norm | 2-mle-exp | 3-mle-gamma | 4-mle-lnorm |
---|---|---|---|---|
Kolmogorov-Smirnov statistic | 0.204965 | 0.1268075 | 0.09541033 | 0.05958898 |
Cramer-von Mises statistic | 2.407679 | 0.5559498 | 0.25110920 | 0.10153055 |
Anderson-Darling statistic | 13.453451 | 2.7504675 | 1.27300494 | 0.80366991 |
Goodness-of-fit criteria
Criterion | 1-mle-norm | 2-mle-exp | 3-mle-gamma | 4-mle-lnorm |
---|---|---|---|---|
Akaike's Information Criterion | 1968.408 | 1749.770 | 1748.639 | 1746.379 |
Bayesian Information Criterion | 1974.838 | 1752.985 | 1755.069 | 1752.809 |
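As a sanity check, the information criteria reported for the lognormal fit can be reproduced from its log-likelihood of -871.1894 using the standard definitions AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L. The sketch below assumes k = 2 estimated parameters and n = 184 fitted scores. The sample size is not stated in the text; it is inferred here from Cook's 201 innings less 11 not-outs and 6 ducks, on the assumption that ducks were excluded from the fit as in the worked example for Murali Vijay.

```r
# Log-likelihood reported for the lognormal fit to Alastair Cook's dismissed innings
loglik <- -871.1894

# Number of estimated parameters: meanlog and sdlog
k <- 2

# Assumption (not stated in the text): 201 innings - 11 not-outs - 6 ducks
n <- 184

# Standard definitions of the two information criteria
aic <- 2 * k - 2 * loglik
bic <- k * log(n) - 2 * loglik

cat("AIC:", round(aic, 3), "\n")
cat("BIC:", round(bic, 3), "\n")
```

Under these assumptions both values agree with the lognormal column of the criteria table to three decimal places.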
The lognormal model is particularly attractive because it naturally accommodates the skewness observed in cricket scores, with many low to moderate scores and a long right tail capturing the occasional big innings.
Fitting of the distribution ' lnorm ' by maximum likelihood
Parameters :
estimate Std. Error
meanlog 3.061100 0.09510433
sdlog 1.290058 0.06724874
Loglikelihood: -871.1894 AIC: 1746.379 BIC: 1752.809
Correlation matrix:
meanlog sdlog
meanlog 1.000000e+00 1.454202e-09
sdlog 1.454202e-09 1.000000e+00
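Under the lognormal model, the logarithm of the score is normally distributed, with \(\mu\) and \(\sigma\) corresponding to the fitted meanlog and sdlog parameters: \[ \ln(X) \sim N(\mu, \sigma^2) \]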
Accordingly, the expected value is given by: \[ E[X] = \exp\left(\mu + \frac{\sigma^2}{2} \right) \]
This forms the basis for the ERPD calculation for innings in which a batter has been dismissed.
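The closed-form mean can be checked against the defining integral \(\int_0^\infty x f(x)\,dx\) numerically. The sketch below plugs in the meanlog and sdlog estimates reported for Alastair Cook; the variable names are illustrative and the check is not part of the original analysis.

```r
# Lognormal parameters reported for Alastair Cook's dismissed innings
mu    <- 3.061100
sigma <- 1.290058

# Closed-form mean of a lognormal distribution
erpd_closed <- exp(mu + sigma^2 / 2)

# Same quantity from the defining integral of x * f(x) over (0, Inf)
erpd_numeric <- integrate(
  function(x) x * dlnorm(x, meanlog = mu, sdlog = sigma),
  lower = 0, upper = Inf
)$value

cat("Closed form:", erpd_closed, "\n")
cat("Numerical integral:", erpd_numeric, "\n")
```

The two values should agree up to the tolerance of the numerical integration, which confirms that the exponential formula is simply the integral evaluated for the lognormal density.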
Furthermore, survival analysis concepts are applied to extend ERPD to innings in which a batter has remained not out. If we denote the survival function by:
\[ S(x) = P(X > x) = 1 - F(x), \]
where \(F(x)\) is the cumulative distribution function of \(X\), the conditional expectation of runs given that a batter has already scored \(x\) runs can be expressed as:
\[ E[X \mid X > x] = \frac{\int_x^\infty t f(t) \, dt}{S(x)} = \frac{\int_x^\infty t f(t) \, dt}{\int_x^\infty f(t) \, dt} \]
For innings in which a batter has not been dismissed, \(x\) is a right-censored observation and \(E[X \mid X > x]\) serves as an estimate of the score the batter would have reached had they hypothetically batted on until dismissal. By incorporating these techniques into the ERPD framework, we obtain a metric that not only reflects the central tendency of a batter’s scoring but also incorporates their survival ability and the inherent risk of dismissal.
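This conditional expectation is closely related to the mean residual life from survival analysis. For a non-negative score \(X\) with finite mean, a standard identity (stated here for reference, with \(m(x)\) a notation introduced for this note rather than taken from the original) gives:

\[ m(x) = E[X - x \mid X > x] = \frac{\int_x^\infty S(t) \, dt}{S(x)}, \qquad E[X \mid X > x] = x + m(x) \]

In words, the estimated final score of a not-out innings is the runs already made plus the expected number of further runs.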
Consider Murali Vijay’s scores in Test cricket from 1st January 2010 to 31st December 2019 as the data of interest. His ERPD for innings in which he was dismissed has been derived as follows.
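library(fitdistrplus)  # provides fitdist() and gofstat()
# Load the innings-by-innings scores (columns used below: Out, Notout)
data <- read.csv("//Users//rhitankarsmacbook//Library//Mobile Documents//com~apple~CloudDocs/Cricket Stats//ERPD//Murali_Vijay.csv")
# Scores from innings that ended in a dismissal
out_values <- data$Out
# Count ducks separately: the lognormal distribution is defined only for positive values
# Note: any NA in Out propagates through this comparison and is counted here
duck_values <- out_values[out_values == 0]
duck_innings <- length(duck_values)
# Keep the positive, non-missing scores for fitting
out_values <- out_values[out_values > 0]
out_values <- out_values[!is.na(out_values)]
# Histogram of the scores with a kernel density overlay
hist(out_values, probability = TRUE, breaks = 20, col = "lightblue", main = "Histogram of out values")
lines(density(out_values), col = "red", lwd = 2)
# Fit the candidate distributions by maximum likelihood
fit_norm <- fitdist(out_values, "norm")
fit_exp <- fitdist(out_values, "exp")
fit_gamma <- fitdist(out_values, "gamma")
fit_lognorm <- fitdist(out_values, "lnorm")
# Compare goodness-of-fit statistics and information criteria (AIC, BIC)
gofstat(list(fit_norm, fit_exp, fit_gamma, fit_lognorm))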
Goodness-of-fit statistics
Statistic | 1-mle-norm | 2-mle-exp | 3-mle-gamma | 4-mle-lnorm |
---|---|---|---|---|
Kolmogorov-Smirnov statistic | 0.2122892 | 0.1464157 | 0.1241796 | 0.08092825 |
Cramer-von Mises statistic | 1.3611348 | 0.4167276 | 0.2853081 | 0.09442460 |
Anderson-Darling statistic | 7.6805112 | 2.2910722 | 1.6777359 | 0.66486409 |
Goodness-of-fit criteria
Criterion | 1-mle-norm | 2-mle-exp | 3-mle-gamma | 4-mle-lnorm |
---|---|---|---|---|
Akaike's Information Criterion | 975.0590 | 877.7462 | 879.1151 | 873.2391 |
Bayesian Information Criterion | 980.1242 | 880.2788 | 884.1803 | 878.3043 |
The lognormal fit has the lowest Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics as well as the lowest AIC and BIC values, which suggests that it is the best fit for Vijay’s scores in innings where he was dismissed.
Fitting of the distribution ' lnorm ' by maximum likelihood
Parameters :
estimate Std. Error
meanlog 3.062178 0.12567095
sdlog 1.211927 0.08886251
Loglikelihood: -434.6195 AIC: 873.2391 BIC: 878.3043
Correlation matrix:
meanlog sdlog
meanlog 1.000000e+00 -4.760964e-10
sdlog -4.760964e-10 1.000000e+00
lognorm_mu <- fit_lognorm$estimate["meanlog"]
lognorm_sigma <- fit_lognorm$estimate["sdlog"]
ERPD_out <- exp(lognorm_mu + (lognorm_sigma^2) / 2)
cat("ERPD for out values (excluding ducks) :", ERPD_out, "\n")
ERPD for out values (excluding ducks) : 44.54775
For the remaining right-censored data, i.e. innings in which Vijay remained not out, we calculate the conditional expectation \(E[X \mid X > x]\), that is, the runs already scored plus the Mean Residual Life (MRL), for each knock to obtain the corresponding ERPD.
notout_values <- data$Notout[!is.na(data$Notout)]
mu <- fit_lognorm$estimate["meanlog"]
sigma <- fit_lognorm$estimate["sdlog"]
ERPD_notout <- function(x, mu, sigma) {
  # Compute survival function S(x) = 1 - CDF(x)
  Sx <- 1 - plnorm(x, meanlog = mu, sdlog = sigma)
  # Integrate t * f(t) from x to infinity (numerator of E[X | X > x])
  Ex_given_x <- integrate(function(t) t * dlnorm(t, meanlog = mu, sdlog = sigma),
                          lower = x, upper = Inf)$value
  # Conditional expectation of the final score given that x runs have been made
  return(Ex_given_x / Sx)
}
ERPD_notout_values <- sapply(notout_values, ERPD_notout, mu = mu, sigma = sigma)
cat("ERPD for notout values:", ERPD_notout_values, "\n")
ERPD for notout values: 90.71932
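For the lognormal model, the numerical integration can be cross-checked against a closed form, since \(\int_x^\infty t f(t)\,dt = \exp(\mu + \sigma^2/2)\,\Phi\big((\mu + \sigma^2 - \ln x)/\sigma\big)\) and \(S(x) = \Phi\big((\mu - \ln x)/\sigma\big)\), where \(\Phi\) is the standard normal CDF. The sketch below uses the parameters reported for Vijay and a hypothetical not-out score of 50; the function names are illustrative and not part of the original analysis.

```r
# Lognormal parameters reported for Murali Vijay's dismissed innings
mu    <- 3.062178
sigma <- 1.211927

# E[X | X > x] by numerical integration, as in the text
erpd_notout_numeric <- function(x, mu, sigma) {
  numerator <- integrate(function(t) t * dlnorm(t, meanlog = mu, sdlog = sigma),
                         lower = x, upper = Inf)$value
  numerator / plnorm(x, meanlog = mu, sdlog = sigma, lower.tail = FALSE)
}

# E[X | X > x] from the lognormal closed form (pnorm is the standard normal CDF)
erpd_notout_closed <- function(x, mu, sigma) {
  exp(mu + sigma^2 / 2) *
    pnorm((mu + sigma^2 - log(x)) / sigma) /
    pnorm((mu - log(x)) / sigma)
}

x <- 50  # hypothetical not-out score, for illustration only
numeric_value <- erpd_notout_numeric(x, mu, sigma)
closed_value  <- erpd_notout_closed(x, mu, sigma)

cat("Numerical:", numeric_value, "\n")
cat("Closed form:", closed_value, "\n")
```

The closed form avoids any dependence on the convergence of `integrate()`, which may be convenient when many not-out innings have to be processed.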
Finally, the overall ERPD of Murali Vijay is calculated as a weighted average over all innings, with each duck contributing zero runs.
num_out <- length(out_values)
num_notout <- length(ERPD_notout_values)
total_innings <- num_out + num_notout + duck_innings
overall_ERPD <- (ERPD_out * num_out + sum(ERPD_notout_values)) / total_innings
cat("Overall ERPD:", overall_ERPD, "\n")
Overall ERPD: 40.70827
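Written out, the calculation above is a weighted average over three groups of innings. In the notation below (introduced here for illustration only), \(n_{\text{out}}\) is the number of dismissed innings with a positive score, \(n_{\text{duck}}\) the number of ducks, and \(x_1, \dots, x_m\) the \(m\) not-out scores:

\[ \text{ERPD}_{\text{overall}} = \frac{n_{\text{out}} \cdot \text{ERPD}_{\text{out}} + \sum_{i=1}^{m} E[X \mid X > x_i] + n_{\text{duck}} \cdot 0}{n_{\text{out}} + m + n_{\text{duck}}} \]

The ducks therefore enter only through the denominator, pulling the overall figure below the ERPD of the positive dismissed scores unless the not-out innings compensate.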
The study considers the top 25 run-getters in Test cricket during the 2010s (1st January 2010 to 31st December 2019). They can be ranked by their batting averages in this period as follows.
Player | Inns | Not Outs | Runs | Average | Strike Rate | 100s | 50s | Ducks | 4s | 6s |
---|---|---|---|---|---|---|---|---|---|---|
SPD Smith (AUS) | 130 | 16 | 7164 | 62.84 | 55.59 | 26 | 28 | 4 | 795 | 42 |
KC Sangakkara (SL) | 86 | 7 | 4851 | 61.40 | 52.22 | 17 | 20 | 7 | 514 | 29 |
AB de Villiers (SA) | 98 | 10 | 5059 | 57.48 | 55.65 | 13 | 27 | 7 | 573 | 45 |
V Kohli (IND) | 141 | 10 | 7202 | 54.97 | 57.81 | 27 | 22 | 10 | 805 | 22 |
Younis Khan (PAK) | 101 | 12 | 4839 | 54.37 | 50.47 | 18 | 12 | 7 | 444 | 46 |
KS Williamson (NZ) | 137 | 13 | 6379 | 51.44 | 51.55 | 21 | 31 | 9 | 694 | 14 |
Misbah-ul-Haq (PAK) | 101 | 17 | 4225 | 50.29 | 46.15 | 8 | 35 | 6 | 405 | 73 |
HM Amla (SA) | 146 | 12 | 6695 | 49.96 | 50.48 | 21 | 27 | 7 | 834 | 12 |
CA Pujara (IND) | 124 | 8 | 5740 | 49.48 | 46.69 | 18 | 24 | 7 | 682 | 14 |
MJ Clarke (AUS) | 107 | 10 | 4717 | 48.62 | 58.37 | 16 | 10 | 4 | 556 | 22 |
JE Root (ENG) | 164 | 12 | 7359 | 48.41 | 54.37 | 17 | 45 | 8 | 829 | 20 |
DA Warner (AUS) | 153 | 6 | 7088 | 48.21 | 73.04 | 23 | 30 | 9 | 840 | 56 |
LRPL Taylor (NZ) | 133 | 19 | 5486 | 48.12 | 60.12 | 15 | 25 | 14 | 637 | 35 |
AN Cook (ENG) | 201 | 11 | 8818 | 46.41 | 46.93 | 23 | 37 | 6 | 1010 | 9 |
IR Bell (ENG) | 114 | 15 | 4436 | 44.80 | 48.88 | 13 | 25 | 7 | 539 | 25 |
AD Mathews (SL) | 140 | 20 | 5325 | 44.37 | 48.54 | 9 | 32 | 2 | 566 | 52 |
BB McCullum (NZ) | 95 | 5 | 3979 | 44.21 | 66.39 | 9 | 16 | 6 | 471 | 79 |
AM Rahane (IND) | 105 | 11 | 4112 | 43.74 | 50.65 | 11 | 22 | 6 | 463 | 29 |
Azhar Ali (PAK) | 146 | 8 | 5885 | 42.64 | 41.82 | 16 | 31 | 14 | 549 | 16 |
LD Chandimal (SL) | 100 | 7 | 3846 | 41.35 | 49.01 | 11 | 18 | 4 | 411 | 22 |
F du Plessis (SA) | 106 | 14 | 3799 | 41.29 | 45.98 | 9 | 21 | 9 | 466 | 20 |
Asad Shafiq (PAK) | 122 | 6 | 4528 | 39.03 | 48.61 | 12 | 26 | 13 | 498 | 29 |
M Vijay (IND) | 102 | 1 | 3821 | 37.83 | 45.78 | 12 | 14 | 8 | 450 | 32 |
FDM Karunaratne (SL) | 124 | 4 | 4421 | 36.84 | 49.04 | 9 | 24 | 12 | 442 | 8 |
JM Bairstow (ENG) | 123 | 7 | 4030 | 34.74 | 55.07 | 6 | 21 | 10 | 473 | 26 |
However, if the batters are ranked according to their ERPD values, the table changes significantly.
Batter | Runs | Average | ERPD |
---|---|---|---|
Kumar Sangakkara | 4851 | 61.40 | 81.51080 |
Steve Smith | 7164 | 62.84 | 75.98680 |
Kane Williamson | 6379 | 51.44 | 68.89218 |
Younis Khan | 4839 | 54.37 | 66.92585 |
Virat Kohli | 7202 | 54.97 | 66.37127 |
Joe Root | 7359 | 48.41 | 63.03840 |
Hashim Amla | 6695 | 49.96 | 62.53119 |
Cheteshwar Pujara | 5740 | 49.48 | 61.60445 |
AB de Villiers | 5059 | 57.48 | 61.13865 |
Ian Bell | 4436 | 44.80 | 59.87259 |
Misbah-ul-Haq | 4225 | 50.29 | 59.24620 |
David Warner | 7088 | 48.21 | 58.51765 |
Michael Clarke | 4717 | 48.62 | 57.08995 |
Ajinkya Rahane | 4112 | 43.74 | 56.06439 |
Alastair Cook | 8818 | 46.41 | 56.04557 |
Ross Taylor | 5486 | 48.12 | 55.89375 |
Angelo Mathews | 5325 | 44.37 | 53.81018 |
Azhar Ali | 5885 | 42.64 | 53.28048 |
Faf du Plessis | 3799 | 41.29 | 51.65885 |
Brendon McCullum | 3979 | 44.21 | 49.02250 |
Dinesh Chandimal | 3846 | 41.35 | 49.02013 |
Dimuth Karunaratne | 4421 | 36.84 | 42.34182 |
Asad Shafiq | 4528 | 39.03 | 41.01311 |
Murali Vijay | 3821 | 37.83 | 40.70827 |
Jonny Bairstow | 4030 | 34.74 | 36.51510 |
Note that, although broadly comparable, Expected Runs per Dismissal is not a direct equivalent of the batting average. Kumar Sangakkara’s ERPD should not be read as an expectation that he would score 81.5108 runs every time he came out to bat during the 2010s. Batters remaining not out on extremely high scores (a common case when declarations follow the completion of specific milestones) are a major reason why ERPD is generally higher than the batting average in Test cricket. The abrupt rise in ERPD caused by such scenarios can be controlled by a slight modification of the working formula of the MRL.
Brian Lara’s 400* against England in 2004 remains the highest individual Test score to date. A deeper look into history shows that there have been only 6 instances of batters crossing the 350-run mark in Test cricket. Further investigation gives a count of only 8 batters scoring more than 335 runs in the 148-year history of the game, none of them occurring after 2006.
Based on this fact, suppose the ERPD for unbeaten knocks is calculated with the upper limit of integration capped at 335: \[ E[X \mid X > x] = \frac{\int_x^{335} t f(t) \, dt}{S(x)} \] The resulting list of Adjusted ERPD (AERPD) values, based on the batters’ performances in Test cricket during the 2010s, is given by:
Batter | Runs | Average | ERPD | AERPD | Shift | MOA |
---|---|---|---|---|---|---|
Kumar Sangakkara | 4851 | 61.40 | 81.51080 | 73.42050 | -8.09030 | 1.1957736 |
Steve Smith | 7164 | 62.84 | 75.98680 | 65.24929 | -10.73751 | 1.0383401 |
Virat Kohli | 7202 | 54.97 | 66.37127 | 59.97174 | -6.39953 | 1.0909904 |
Kane Williamson | 6379 | 51.44 | 68.89218 | 57.99365 | -10.89853 | 1.1274038 |
Joe Root | 7359 | 48.41 | 63.03840 | 56.25753 | -6.78087 | 1.1621056 |
Cheteshwar Pujara | 5740 | 49.48 | 61.60445 | 55.43618 | -6.16827 | 1.1203755 |
AB de Villiers | 5059 | 57.48 | 61.13865 | 55.13152 | -6.00713 | 0.9591427 |
Younis Khan | 4839 | 54.37 | 66.92585 | 54.72924 | -12.19661 | 1.0066073 |
David Warner | 7088 | 48.21 | 58.51765 | 52.72440 | -5.79325 | 1.0936403 |
Misbah-ul-Haq | 4225 | 50.29 | 59.24620 | 52.51657 | -6.72963 | 1.0442746 |
Hashim Amla | 6695 | 49.96 | 62.53119 | 52.07122 | -10.45997 | 1.0422582 |
Ian Bell | 4436 | 44.80 | 59.87259 | 51.56033 | -8.31226 | 1.1509002 |
Ajinkya Rahane | 4112 | 43.74 | 56.06439 | 51.02417 | -5.04022 | 1.1665334 |
Alastair Cook | 8818 | 46.41 | 56.04557 | 50.79877 | -5.24680 | 1.0945652 |
Ross Taylor | 5486 | 48.12 | 55.89375 | 49.43804 | -6.45571 | 1.0273907 |
Angelo Mathews | 5325 | 44.37 | 53.81018 | 49.19461 | -4.61557 | 1.1087359 |
Brendon McCullum | 3979 | 44.21 | 49.02250 | 47.59190 | -1.43060 | 1.0764963 |
Azhar Ali | 5885 | 42.64 | 53.28048 | 46.69366 | -6.58682 | 1.0950671 |
Faf du Plessis | 3799 | 41.29 | 51.65885 | 46.42369 | -5.23516 | 1.1243325 |
Michael Clarke | 4717 | 48.62 | 57.08995 | 45.31727 | -11.77268 | 0.9320705 |
Dinesh Chandimal | 3846 | 41.35 | 49.02013 | 43.84775 | -5.17238 | 1.0604051 |
Dimuth Karunaratne | 4421 | 36.84 | 42.34182 | 40.96594 | -1.37588 | 1.1119962 |
Murali Vijay | 3821 | 37.83 | 40.70827 | 40.55762 | -0.15065 | 1.0721020 |
Asad Shafiq | 4528 | 39.03 | 41.01311 | 39.87997 | -1.13314 | 1.0217774 |
Jonny Bairstow | 4030 | 34.74 | 36.51510 | 35.24685 | -1.26825 | 1.0145898 |
In the above table, Shift is the difference between AERPD and ERPD, and MOA denotes the Multiplier on Average, which quantifies the change in a batter’s AERPD relative to their original batting average \((= \frac{AERPD}{Average})\). Only AB de Villiers and Michael Clarke have an AERPD below their batting average (MOA below 1), so that the AERPD underestimates their batting averages. This is a rare case that arises primarily from a more heavily right-skewed distribution of scores than that of the other batters.
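The two derived columns are simple to reproduce. The sketch below recomputes Shift and MOA for three rows copied from the table; the data frame name `tab` is illustrative.

```r
# Three rows copied from the AERPD table
tab <- data.frame(
  Batter  = c("Kumar Sangakkara", "AB de Villiers", "Michael Clarke"),
  Average = c(61.40, 57.48, 48.62),
  ERPD    = c(81.51080, 61.13865, 57.08995),
  AERPD   = c(73.42050, 55.13152, 45.31727)
)

# Shift: change from ERPD to AERPD caused by capping the integral
tab$Shift <- tab$AERPD - tab$ERPD

# MOA: adjusted ERPD expressed as a multiple of the batting average
tab$MOA <- tab$AERPD / tab$Average

print(tab)
```

An MOA above 1 means the AERPD exceeds the batting average, while the values below 1 for de Villiers and Clarke mark the two exceptions discussed above.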
Note that the adjusted ERPD values overshoot the batting averages by a considerably smaller margin than the unadjusted ones. Major shifts, such as Hashim Amla and Younis Khan dropping down while Virat Kohli, Cheteshwar Pujara and AB de Villiers rise up the rankings, are evident from the AERPD table. Kumar Sangakkara’s AERPD still remains well ahead of everyone else in the list, though. The Sri Lankan maverick was quite extraordinary after all!
Although neither ERPD nor AERPD can be claimed to be a perfect metric for judging a batter’s ability, each provides a different perspective from the oversimplified batting average, which has been in use for ages. To conclude, this project is a simple attempt to work in the spirit of George E. P. Box’s famous quote: “All models are wrong, but some are useful.”
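The size of the reduction for a single unbeaten innings can be gauged with a short sketch. It assumes that only the upper limit of the numerator integral changes (335 instead of infinity) while the survival function in the denominator is left untruncated, as the adjusted formula is written. The parameters are those reported for Vijay, the not-out score of 50 is hypothetical, and `aerpd_notout` is a name introduced here.

```r
# Lognormal parameters reported for Murali Vijay's dismissed innings
mu    <- 3.062178
sigma <- 1.211927

# Conditional expected score with the numerator integral truncated at `cap`.
# The denominator is the untruncated survival function S(x).
aerpd_notout <- function(x, mu, sigma, cap = 335) {
  stopifnot(x < cap)  # the truncated integral is only meaningful below the cap
  numerator <- integrate(function(t) t * dlnorm(t, meanlog = mu, sdlog = sigma),
                         lower = x, upper = cap)$value
  numerator / plnorm(x, meanlog = mu, sdlog = sigma, lower.tail = FALSE)
}

x <- 50  # hypothetical not-out score, for illustration only
capped   <- aerpd_notout(x, mu, sigma)             # upper limit 335
uncapped <- aerpd_notout(x, mu, sigma, cap = Inf)  # original ERPD formula

cat("Capped at 335:", capped, "\n")
cat("Uncapped:", uncapped, "\n")
```

Because the denominator is not renormalized, the capped value is always below the uncapped one, and the gap is the contribution of the lognormal tail beyond 335 runs.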