1 Introduction

Cricket, with its long-standing tradition of statistical analysis, has traditionally relied on the batting average, defined as the total runs scored divided by the number of dismissals, to evaluate a batter’s performance. This metric, while simple and intuitive, often fails to capture the true range of a player’s contribution, especially in Test cricket, where innings can span many sessions and involve varied scoring patterns. In this study, we focus on the top 25 run-getters in Test cricket during the 2010s decade (1st January 2010 to 31st December 2019) and explore an alternative measure, the Expected Runs per Dismissal (ERPD), which provides a more nuanced assessment of batting performance.

2 Drawbacks of batting average

The batting average, despite its widespread use, has several limitations.

  1. Sensitivity to not-out innings: Batters who remain not out receive an artificial boost to their average, since the runs from these innings are added to the total while no dismissal is counted. This can result in overestimating a batter’s effectiveness, particularly for those who tend to bat for long durations.

  2. Loss of distributional information: By aggregating performance into a single number, the batting average ignores the variability and distribution of scores. Two batters with similar averages might have vastly different scoring patterns—one might be highly consistent, while another might have sporadic high scores offset by low scores.

  3. Neglect of survival dynamics: Batting average does not consider how long a batter survives in an innings before being dismissed. In the context of Test cricket, the ability to occupy the crease and build an innings is as crucial as the runs scored, and survival analysis offers a framework to incorporate this dimension.
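The first two drawbacks can be illustrated with a small R sketch. The scores below are hypothetical and chosen only for illustration, and `batting_average` is a helper name introduced here; neither is part of the original analysis.

```r
# Hypothetical scores for illustration only (not taken from the study's data).
batting_average <- function(runs, dismissed) {
  # Total runs divided by the number of dismissals; a not-out innings adds
  # runs to the numerator but nothing to the denominator.
  sum(runs) / sum(dismissed)
}

# Batter A: consistent scores, dismissed in every innings.
runs_a <- c(38, 42, 40, 41, 39)
out_a  <- rep(TRUE, 5)

# Batter B: one big score among failures, dismissed in every innings.
runs_b <- c(2, 5, 180, 8, 5)
out_b  <- rep(TRUE, 5)

avg_a <- batting_average(runs_a, out_a)
avg_b <- batting_average(runs_b, out_b)

# Same average, very different spread of scores.
sd_a <- sd(runs_a)
sd_b <- sd(runs_b)

# Batter C: the same scores as Batter A, but the last innings was not out.
out_c <- c(TRUE, TRUE, TRUE, TRUE, FALSE)
avg_c <- batting_average(runs_a, out_c)

cat("Average A:", avg_a, " Average B:", avg_b, " Average C:", avg_c, "\n")
cat("SD A:", sd_a, " SD B:", sd_b, "\n")
```

Batters A and B share an average of 40 despite very different scoring patterns, while a single not-out innings lifts Batter C's average to 50 from the same runs.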

Given these drawbacks, alternative metrics have been proposed by researchers and practitioners. For example, metrics such as APM (Average Player Multiplier), Expected Average, BEREX (Bernoulli Run Expectation) and RAAR (Runs Above Average Replacement) have been suggested as potential improvements over the traditional average. These measures, however, often still summarize performance without fully accounting for the survival aspect of batting.

3 Expected Runs per Dismissal (ERPD)

3.1 Introduction to ERPD

The Expected Runs per Dismissal (ERPD) is introduced as a more robust measure that overcomes many of the limitations of the batting average. ERPD is defined as the expected number of runs a batter scores before being dismissed, taking into account the complete distribution of scores.

Mathematically, ERPD is expressed as:

\[ \text{ERPD} = E[X \mid \text{Dismissal}] = \int_0^\infty x f(x) \, dx \]

where \(X\) denotes the runs scored in an innings and \(f(x)\) is the probability density function of \(X\).

3.2 Distributional assumption: Lognormal fit

Candidate distributions were compared using goodness-of-fit statistics such as the Kolmogorov-Smirnov statistic and information criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The lognormal distribution was found to offer the best fit for the runs scored by batters in Test innings that ended in dismissal.

The following output (batter under study: Alastair Cook) shows the lognormal distribution being selected as the best fit among a few well-known distributions (normal, exponential, gamma and lognormal).

Goodness-of-fit statistics
                             1-mle-norm 2-mle-exp 3-mle-gamma 4-mle-lnorm
Kolmogorov-Smirnov statistic   0.204965 0.1268075  0.09541033  0.05958898
Cramer-von Mises statistic     2.407679 0.5559498  0.25110920  0.10153055
Anderson-Darling statistic    13.453451 2.7504675  1.27300494  0.80366991

Goodness-of-fit criteria
                               1-mle-norm 2-mle-exp 3-mle-gamma 4-mle-lnorm
Akaike's Information Criterion   1968.408  1749.770    1748.639    1746.379
Bayesian Information Criterion   1974.838  1752.985    1755.069    1752.809

The lognormal model is particularly attractive because it naturally accommodates the skewness observed in cricket scores, with many low to moderate scores and a long right tail capturing the occasional big innings.

Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
        estimate Std. Error
meanlog 3.061100 0.09510433
sdlog   1.290058 0.06724874
Loglikelihood:  -871.1894   AIC:  1746.379   BIC:  1752.809 
Correlation matrix:
             meanlog        sdlog
meanlog 1.000000e+00 1.454202e-09
sdlog   1.454202e-09 1.000000e+00

\[ \ln(X) \sim N(\mu, \sigma^2) \]

Accordingly, the expected value is given by: \[ E[X] = \exp\left(\mu + \frac{\sigma^2}{2} \right) \]

This forms the basis for the ERPD calculation for innings in which a batter has been dismissed.
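As a worked example, plugging the fitted values printed above for Alastair Cook (meanlog 3.061100, sdlog 1.290058) into this formula gives, approximately:

\[ E[X] = \exp\left(3.061100 + \frac{1.290058^2}{2}\right) \approx \exp(3.8932) \approx 49.07 \]

This figure covers only the innings in which he was dismissed (assuming, as in the worked calculation of Section 3.4, that ducks were excluded before fitting); it is not the overall ERPD reported later.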

3.3 Implementing Survival Analysis for Not-out Innings

Furthermore, survival analysis concepts are applied to extend ERPD to innings in which a batter has remained not out. If we denote the survival function by:

\[ S(x) = P(X > x) = 1 - F(x), \]

where \(F(x)\) is the cumulative distribution function of \(X\), the conditional expectation of runs given that a batter has already scored \(x\) runs can be expressed as:

\[ E[X \mid X > x] = \frac{\int_x^\infty t f(t) \, dt}{S(x)} = \frac{\int_x^\infty t f(t) \, dt}{\int_x^\infty f(t) \, dt} \]

For innings in which a batter hasn’t been dismissed, the score \(x\) is a right-censored observation, and \(E[X \mid X > x]\) serves as an estimate of the score the batter would have made had the innings continued until dismissal. By incorporating these techniques into the ERPD framework, we achieve a metric that not only reflects the central tendency of a batter’s scoring but also incorporates their survival ability and the inherent risk of dismissal.
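The link to the Mean Residual Life (MRL) of survival analysis can be made explicit. Writing \(m(x)\) for the MRL at score \(x\) (a symbol introduced here for illustration), the standard identity for a non-negative random variable with finite mean is:

\[ m(x) = E[X - x \mid X > x] = \frac{\int_x^\infty S(t) \, dt}{S(x)}, \qquad E[X \mid X > x] = x + m(x) \]

In words, the estimate for a not-out innings is the score already made plus the expected number of further runs.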

3.4 Calculating ERPD for a batter

Consider Murali Vijay’s scores in Test cricket from 1st January 2010 to 31st December 2019 as the data of interest. His ERPD for innings in which he was dismissed has been derived as follows.

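library(fitdistrplus)  # provides fitdist() and gofstat()

data <- read.csv("//Users//rhitankarsmacbook//Library//Mobile Documents//com~apple~CloudDocs/Cricket Stats//ERPD//Murali_Vijay.csv")

# Scores from innings that ended in dismissal; drop missing entries first
# so that they are not counted as ducks below
out_values <- data$Out
out_values <- out_values[!is.na(out_values)]

# Ducks (dismissals for 0) are counted separately because the lognormal
# distribution is only defined for positive values
duck_innings <- sum(out_values == 0)
out_values <- out_values[out_values > 0]

hist(out_values, probability = TRUE, breaks = 20, col = "lightblue", main = "Histogram of out values")
lines(density(out_values), col = "red", lwd = 2)

# Fit the candidate distributions by maximum likelihood
fit_norm <- fitdist(out_values, "norm")
fit_exp <- fitdist(out_values, "exp")
fit_gamma <- fitdist(out_values, "gamma")
fit_lognorm <- fitdist(out_values, "lnorm")

# Compare goodness-of-fit statistics and information criteria (AIC, BIC)
gofstat(list(fit_norm, fit_exp, fit_gamma, fit_lognorm))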
Goodness-of-fit statistics
                             1-mle-norm 2-mle-exp 3-mle-gamma 4-mle-lnorm
Kolmogorov-Smirnov statistic  0.2122892 0.1464157   0.1241796  0.08092825
Cramer-von Mises statistic    1.3611348 0.4167276   0.2853081  0.09442460
Anderson-Darling statistic    7.6805112 2.2910722   1.6777359  0.66486409

Goodness-of-fit criteria
                               1-mle-norm 2-mle-exp 3-mle-gamma 4-mle-lnorm
Akaike's Information Criterion   975.0590  877.7462    879.1151    873.2391
Bayesian Information Criterion   980.1242  880.2788    884.1803    878.3043

The lognormal fit has the lowest Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics as well as the lowest AIC and BIC values, which suggests that it is the best fit for Vijay’s scores in innings where he was dismissed.

summary(fit_lognorm)
Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
        estimate Std. Error
meanlog 3.062178 0.12567095
sdlog   1.211927 0.08886251
Loglikelihood:  -434.6195   AIC:  873.2391   BIC:  878.3043 
Correlation matrix:
              meanlog         sdlog
meanlog  1.000000e+00 -4.760964e-10
sdlog   -4.760964e-10  1.000000e+00
lognorm_mu <- fit_lognorm$estimate["meanlog"]
lognorm_sigma <- fit_lognorm$estimate["sdlog"]
ERPD_out <- exp(lognorm_mu + (lognorm_sigma^2) / 2)

cat("ERPD for out values (excluding ducks) :", ERPD_out, "\n")
ERPD for out values (excluding ducks) : 44.54775 

For the remaining right-censored data, i.e. innings in which Vijay was not dismissed, we calculate the conditional expectation \(E[X \mid X > x]\), which is based on the Mean Residual Life (MRL), for each knock to obtain the corresponding ERPD.

notout_values <- data$Notout[!is.na(data$Notout)]
mu <- fit_lognorm$estimate["meanlog"] 
sigma <- fit_lognorm$estimate["sdlog"] 

ERPD_notout <- function(x, mu, sigma) {
  # Survival function S(x) = P(X > x) = 1 - CDF(x)
  Sx <- plnorm(x, meanlog = mu, sdlog = sigma, lower.tail = FALSE)

  # Partial expectation: integral of t * f(t) from x to infinity
  partial_expectation <- integrate(function(t) t * dlnorm(t, meanlog = mu, sdlog = sigma),
                                   lower = x, upper = Inf)$value

  # Conditional expectation E[X | X > x]
  partial_expectation / Sx
}

ERPD_notout_values <- sapply(notout_values, ERPD_notout, mu = mu, sigma = sigma)
cat("ERPD for notout values:", ERPD_notout_values, "\n")
ERPD for notout values: 90.71932 
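For the lognormal distribution the conditional expectation also has a standard closed form, which gives a quick check on the numerical integration. The sketch below is not part of the original analysis: the parameters are copied from the fit summary above, while `x_star` and the two helper function names are hypothetical and introduced only for illustration.

```r
# Fitted parameters copied from the summary(fit_lognorm) output above.
mu    <- 3.062178
sigma <- 1.211927

# Numerical version: partial expectation divided by the survival function.
erpd_notout_numeric <- function(x, mu, sigma) {
  partial <- integrate(function(t) t * dlnorm(t, meanlog = mu, sdlog = sigma),
                       lower = x, upper = Inf)$value
  partial / plnorm(x, meanlog = mu, sdlog = sigma, lower.tail = FALSE)
}

# Closed form for the lognormal (standard result, valid for x > 0):
# E[X | X > x] = exp(mu + sigma^2 / 2) *
#                pnorm((mu + sigma^2 - log(x)) / sigma) /
#                pnorm((mu - log(x)) / sigma)
erpd_notout_closed <- function(x, mu, sigma) {
  exp(mu + sigma^2 / 2) *
    pnorm((mu + sigma^2 - log(x)) / sigma) /
    pnorm((mu - log(x)) / sigma)
}

x_star <- 30  # hypothetical not-out score, chosen only for illustration
numeric_value <- erpd_notout_numeric(x_star, mu, sigma)
closed_value  <- erpd_notout_closed(x_star, mu, sigma)
cat("Numerical:", numeric_value, " Closed form:", closed_value, "\n")
```

The two values should agree up to the tolerance of `integrate()`, and both necessarily exceed the not-out score itself.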

Finally, the overall ERPD of Murali Vijay is calculated as follows.

num_out <- length(out_values)
num_notout <- length(ERPD_notout_values)

total_innings <- num_out + num_notout + duck_innings
overall_ERPD <- (ERPD_out * num_out + sum(ERPD_notout_values)) / total_innings
cat("Overall ERPD:", overall_ERPD, "\n")
Overall ERPD: 40.70827 
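In symbols, the calculation above amounts to the following, where \(n_{\text{out}}\) is the number of dismissals for a positive score, \(n_{\text{duck}}\) the number of ducks, \(n_{\text{no}}\) the number of not-out innings and \(x_i\) the not-out scores (notation introduced here for illustration):

\[ \text{ERPD}_{\text{overall}} = \frac{n_{\text{out}} \, \exp\left(\mu + \frac{\sigma^2}{2}\right) + \sum_{i=1}^{n_{\text{no}}} E[X \mid X > x_i]}{n_{\text{out}} + n_{\text{no}} + n_{\text{duck}}} \]

Ducks contribute nothing to the numerator but are counted in the denominator.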

4 Ranking the best Test batters of 2010s decade using ERPD

The study considers the top 25 run-getters in Test cricket during the 2010s decade (1st January 2010 to 31st December 2019). One can rank them by virtue of their batting averages in this period as follows.

Batters ranked by Averages in Test cricket (2010-2019)
Player                Inns  Notouts  Runs  Average  Strike Rate  100s  50s  Ducks    4s  6s
SPD Smith (AUS)        130       16  7164    62.84        55.59    26   28      4   795  42
KC Sangakkara (SL)      86        7  4851    61.40        52.22    17   20      7   514  29
AB de Villiers (SA)     98       10  5059    57.48        55.65    13   27      7   573  45
V Kohli (IND)          141       10  7202    54.97        57.81    27   22     10   805  22
Younis Khan (PAK)      101       12  4839    54.37        50.47    18   12      7   444  46
KS Williamson (NZ)     137       13  6379    51.44        51.55    21   31      9   694  14
Misbah-ul-Haq (PAK)    101       17  4225    50.29        46.15     8   35      6   405  73
HM Amla (SA)           146       12  6695    49.96        50.48    21   27      7   834  12
CA Pujara (IND)        124        8  5740    49.48        46.69    18   24      7   682  14
MJ Clarke (AUS)        107       10  4717    48.62        58.37    16   10      4   556  22
JE Root (ENG)          164       12  7359    48.41        54.37    17   45      8   829  20
DA Warner (AUS)        153        6  7088    48.21        73.04    23   30      9   840  56
LRPL Taylor (NZ)       133       19  5486    48.12        60.12    15   25     14   637  35
AN Cook (ENG)          201       11  8818    46.41        46.93    23   37      6  1010   9
IR Bell (ENG)          114       15  4436    44.80        48.88    13   25      7   539  25
AD Mathews (SL)        140       20  5325    44.37        48.54     9   32      2   566  52
BB McCullum (NZ)        95        5  3979    44.21        66.39     9   16      6   471  79
AM Rahane (IND)        105       11  4112    43.74        50.65    11   22      6   463  29
Azhar Ali (PAK)        146        8  5885    42.64        41.82    16   31     14   549  16
LD Chandimal (SL)      100        7  3846    41.35        49.01    11   18      4   411  22
F du Plessis (SA)      106       14  3799    41.29        45.98     9   21      9   466  20
Asad Shafiq (PAK)      122        6  4528    39.03        48.61    12   26     13   498  29
M Vijay (IND)          102        1  3821    37.83        45.78    12   14      8   450  32
FDM Karunaratne (SL)   124        4  4421    36.84        49.04     9   24     12   442   8
JM Bairstow (ENG)      123        7  4030    34.74        55.07     6   21     10   473  26

However, if the batters are ranked according to their ERPD values, the table shows significant changes.

Batters ranked by ERPD in Test cricket (2010-2019)
Batter Runs Average ERPD
Kumar Sangakkara 4851 61.40 81.51080
Steve Smith 7164 62.84 75.98680
Kane Williamson 6379 51.44 68.89218
Younis Khan 4839 54.37 66.92585
Virat Kohli 7202 54.97 66.37127
Joe Root 7359 48.41 63.03840
Hashim Amla 6695 49.96 62.53119
Cheteshwar Pujara 5740 49.48 61.60445
AB de Villiers 5059 57.48 61.13865
Ian Bell 4436 44.80 59.87259
Misbah-ul-Haq 4225 50.29 59.24620
David Warner 7088 48.21 58.51765
Michael Clarke 4717 48.62 57.08995
Ajinkya Rahane 4112 43.74 56.06439
Alastair Cook 8818 46.41 56.04557
Ross Taylor 5486 48.12 55.89375
Angelo Mathews 5325 44.37 53.81018
Azhar Ali 5885 42.64 53.28048
Faf du Plessis 3799 41.29 51.65885
Brendon McCullum 3979 44.21 49.02250
Dinesh Chandimal 3846 41.35 49.02013
Dimuth Karunaratne 4421 36.84 42.34182
Asad Shafiq 4528 39.03 41.01311
Murali Vijay 3821 37.83 40.70827
Jonny Bairstow 4030 34.74 36.51510
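The change in order can be reproduced for a handful of batters with a short R sketch. The five rows below are copied from the tables above, while the data frame and column names (`batters`, `Rank_Average`, `Rank_ERPD`) are introduced here for illustration only.

```r
# A few rows copied from the tables above (not the full list of 25 batters).
batters <- data.frame(
  Batter  = c("Steve Smith", "Kumar Sangakkara", "AB de Villiers",
              "Kane Williamson", "Ian Bell"),
  Average = c(62.84, 61.40, 57.48, 51.44, 44.80),
  ERPD    = c(75.98680, 81.51080, 61.13865, 68.89218, 59.87259),
  stringsAsFactors = FALSE
)

# Rank 1 = highest value; rank() on the negated column gives descending ranks
# within this five-batter subset.
batters$Rank_Average <- rank(-batters$Average)
batters$Rank_ERPD    <- rank(-batters$ERPD)

# Sort by ERPD, highest first, to see how the order differs from the average.
ranked <- batters[order(-batters$ERPD), ]
print(ranked, row.names = FALSE)
```

Within this subset, Sangakkara overtakes Smith and Williamson overtakes de Villiers once the batters are ordered by ERPD rather than by average.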

Note that, although somewhat comparable, Expected Runs per Dismissal is not a direct equivalent of the batting average. Kumar Sangakkara’s ERPD should not be read as an expectation that he would score 81.51 runs every time he came out to bat during the 2010s. Batters remaining not out on extremely high scores (a common occurrence when an innings is declared after a specific milestone is reached) are a major reason why ERPD is generally higher than the batting average in Test cricket. The abrupt rise in ERPD caused by such innings can be controlled by a slight modification to the working formula of the MRL.

Brian Lara’s 400* against England in 2004 remains the highest individual Test score to date. A deeper look into history shows that there have been only 6 instances of a batter crossing the 350-run mark in Test cricket. Further investigation gives a count of only 8 batters scoring more than 335 runs in the 148-year history of the game, none of them after 2006.

Based on this fact, if the ERPD for unbeaten knocks is calculated with the upper limit of integration capped at 335, \[ E[X \mid X > x] = \frac{\int_x^{335} t f(t) \, dt}{S(x)} \] then the resulting list of Adjusted ERPD (AERPD) values of batters, based on their performance in Test cricket during the 2010s decade, is given by:

Batters ranked by Adjusted ERPD in Test cricket (2010-2019)
Batter Runs Average ERPD AERPD Shift MOA
Kumar Sangakkara 4851 61.40 81.51080 73.42050 -8.09030 1.1957736
Steve Smith 7164 62.84 75.98680 65.24929 -10.73751 1.0383401
Virat Kohli 7202 54.97 66.37127 59.97174 -6.39953 1.0909904
Kane Williamson 6379 51.44 68.89218 57.99365 -10.89853 1.1274038
Joe Root 7359 48.41 63.03840 56.25753 -6.78087 1.1621056
Cheteshwar Pujara 5740 49.48 61.60445 55.43618 -6.16827 1.1203755
AB de Villiers 5059 57.48 61.13865 55.13152 -6.00713 0.9591427
Younis Khan 4839 54.37 66.92585 54.72924 -12.19661 1.0066073
David Warner 7088 48.21 58.51765 52.72440 -5.79325 1.0936403
Misbah-ul-Haq 4225 50.29 59.24620 52.51657 -6.72963 1.0442746
Hashim Amla 6695 49.96 62.53119 52.07122 -10.45997 1.0422582
Ian Bell 4436 44.80 59.87259 51.56033 -8.31226 1.1509002
Ajinkya Rahane 4112 43.74 56.06439 51.02417 -5.04022 1.1665334
Alastair Cook 8818 46.41 56.04557 50.79877 -5.24680 1.0945652
Ross Taylor 5486 48.12 55.89375 49.43804 -6.45571 1.0273907
Angelo Mathews 5325 44.37 53.81018 49.19461 -4.61557 1.1087359
Brendon McCullum 3979 44.21 49.02250 47.59190 -1.43060 1.0764963
Azhar Ali 5885 42.64 53.28048 46.69366 -6.58682 1.0950671
Faf du Plessis 3799 41.29 51.65885 46.42369 -5.23516 1.1243325
Michael Clarke 4717 48.62 57.08995 45.31727 -11.77268 0.9320705
Dinesh Chandimal 3846 41.35 49.02013 43.84775 -5.17238 1.0604051
Dimuth Karunaratne 4421 36.84 42.34182 40.96594 -1.37588 1.1119962
Murali Vijay 3821 37.83 40.70827 40.55762 -0.15065 1.0721020
Asad Shafiq 4528 39.03 41.01311 39.87997 -1.13314 1.0217774
Jonny Bairstow 4030 34.74 36.51510 35.24685 -1.26825 1.0145898

In the above table, Shift denotes the change from ERPD to AERPD \((= AERPD - ERPD)\), and MOA denotes the Multiplier on Average, which compares a batter’s AERPD with their original batting average \((= \frac{AERPD}{Average})\). Only AB de Villiers and Michael Clarke have an MOA below 1, meaning that their AERPD underestimates their batting average. This rare case arises primarily from a more heavily right-skewed distribution of scores than those of the other batters.
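For instance, using the values in the table, the multipliers for Kumar Sangakkara and Michael Clarke work out as:

\[ \text{MOA}_{\text{Sangakkara}} = \frac{73.42050}{61.40} \approx 1.1958, \qquad \text{MOA}_{\text{Clarke}} = \frac{45.31727}{48.62} \approx 0.9321 \]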

Note that the adjusted ERPD values substantially reduce the overestimation of a batter’s average score. Major shifts are evident from the AERPD table: Hashim Amla and Younis Khan drop down the rankings, while Virat Kohli, Cheteshwar Pujara and AB de Villiers rise. Kumar Sangakkara’s AERPD nevertheless remains well ahead of everyone else’s in the list. The Sri Lankan maverick was quite extraordinary after all!
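The effect of capping the integral at 335 can be sketched in R as follows. This is a minimal illustration rather than the code used to produce the table: the parameters are copied from the `summary(fit_lognorm)` output in Section 3.4, while `erpd_notout_capped`, `cap` and the not-out score `x_star` are hypothetical names and values introduced here.

```r
# Fitted parameters copied from the summary(fit_lognorm) output in Section 3.4.
mu    <- 3.062178
sigma <- 1.211927
cap   <- 335  # upper limit of integration used for the adjusted ERPD

# Partial expectation of the lognormal between x and `upper`, divided by the
# (uncapped) survival function S(x), as in the adjusted formula.
# Assumes x < upper; the formula is not meaningful otherwise.
erpd_notout_capped <- function(x, mu, sigma, upper = Inf) {
  partial <- integrate(function(t) t * dlnorm(t, meanlog = mu, sdlog = sigma),
                       lower = x, upper = upper)$value
  partial / plnorm(x, meanlog = mu, sdlog = sigma, lower.tail = FALSE)
}

x_star <- 30  # hypothetical not-out score, chosen only for illustration
unadjusted <- erpd_notout_capped(x_star, mu, sigma)               # upper = Inf
adjusted   <- erpd_notout_capped(x_star, mu, sigma, upper = cap)  # upper = 335
cat("Unadjusted:", unadjusted, " Adjusted:", adjusted, "\n")
```

Because the integrand is positive, dropping the tail beyond 335 can only lower the estimate, which is why every Shift value in the table is negative.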

Although neither ERPD nor AERPD can be claimed to be a perfect measure of a batter’s ability, each offers a different perspective from the oversimplified batting average that has been in use for ages. To conclude, this project is a simple attempt to work in the spirit of George E. P. Box’s famous quote: “All models are wrong, but some are useful.”