T20 International Cricket: Runs By Country

Author

Joe McCune

Libraries Used In This Project

I have written this code so that each function call specifies its source package. So there is no need to load any package using library(). But if you would like to know which libraries have been used, they are listed below.

Libraries Included in tidyverse Suite of Packages:

  • dplyr

  • forcats

  • ggplot2

  • lubridate

  • tidyr

Libraries Not Included in tidyverse:

  • broom

  • countrycode

  • cricketdata

  • gt

  • scales

Getting the Data

This is how I scraped and saved the data:

  1. I requested the data on all runs scored (“batting”) by all men’s T20 International cricket national teams.
  2. I then saved this data as the file “all_runs”.

It can take repeated attempts and considerable time to finish this process. Other code snippets in this document have been set to execute, but not this one.

t20i_runs <- cricketdata::fetch_cricinfo("t20", "men", "batting", "innings", country = NULL)
# saveRDS() returns NULL invisibly, so its result is not assigned
saveRDS(t20i_runs, "all_runs")
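
Because the download is slow and can fail, it may help to skip the request whenever a saved copy already exists. The following is a minimal sketch of that idea and is not part of the original workflow. The helper load_or_fetch() and its arguments are hypothetical names, and the fetch step is passed in as a function so that the sketch runs without a network connection.

```r
# Hypothetical helper: read a cached .rds file if present,
# otherwise call fetch_fn(), save its result, and return it.
load_or_fetch <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fn()
  saveRDS(result, path)
  result
}

# Stand-in for the slow download, used here only for illustration.
fake_fetch <- function() {
  data.frame(Country = c("Australia", "CAY"), Runs = c(172L, 150L))
}

cache_file <- tempfile(fileext = ".rds")

# First call: no cache yet, so fake_fetch() runs and the result is saved.
first_call <- load_or_fetch(cache_file, fake_fetch)

# Second call: the cache exists, so the fetch function is never called.
second_call <- load_or_fetch(cache_file, function() stop("should not be called"))
```

In the real project, fetch_fn could wrap the cricketdata::fetch_cricinfo() call shown above and path could be "all_runs".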

The saved data frame can be used to extract the data on which countries have scored how many runs.

It records how many runs each batter from each country scored in each game. This makes for over 70,000 rows, which can easily be totaled per country. The data also includes a lot of information on individual player performance that we don’t need for this project.

One immediate cause for concern is the “Country” column, which shows the full names of some countries and abbreviations for others. This is inconsistent.

t20i_runs_20250628 <- readRDS("all_runs")

dplyr::glimpse(t20i_runs_20250628)
Rows: 70,169
Columns: 14
$ Date          <date> 2018-07-03, 2019-02-23, 2013-08-29, 2024-12-14, 2016-09…
$ Player        <chr> "AJ Finch ", "Hazratullah Zazai ", "AJ Finch ", "YSD Sen…
$ Country       <chr> "Australia", "Afghanistan", "Australia", "CAY", "Austral…
$ Runs          <int> 172, 162, 156, 150, 145, 144, 137, 137, 137, 135, 135, 1…
$ NotOut        <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE…
$ Minutes       <dbl> NA, NA, 70, NA, NA, NA, NA, NA, 94, NA, 93, NA, NA, NA, …
$ BallsFaced    <int> 76, 62, 63, 67, 65, 41, 50, 49, 62, 62, 54, 68, 73, 43, …
$ Fours         <int> 16, 11, 11, 7, 14, 6, 8, 7, 5, 11, 7, 8, 15, 7, 15, 5, 5…
$ Sixes         <int> 10, 16, 14, 13, 9, 18, 12, 15, 16, 10, 13, 12, 6, 15, 6,…
$ StrikeRate    <dbl> 226.3158, 261.2903, 247.6190, 223.8806, 223.0769, 351.21…
$ Innings       <int> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Participation <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "…
$ Opposition    <chr> "Zimbabwe", "Ireland", "England", "Brazil", "Sri Lanka",…
$ Ground        <chr> "Harare", "Dehradun", "Southampton", "Buenos Aires", "Pa…

Cleaning the Data

The data frame is organized by date, but I especially want to look at runs per year. So the next step is to split the dates into three separate columns, so that I can group runs by year.

t20i_prepared <- t20i_runs_20250628 |> 
  dplyr::mutate(Year=lubridate::year(Date),
         Month=lubridate::month(Date),
         Day=lubridate::day(Date)
  ) 

dplyr::glimpse(t20i_prepared)
Rows: 70,169
Columns: 17
$ Date          <date> 2018-07-03, 2019-02-23, 2013-08-29, 2024-12-14, 2016-09…
$ Player        <chr> "AJ Finch ", "Hazratullah Zazai ", "AJ Finch ", "YSD Sen…
$ Country       <chr> "Australia", "Afghanistan", "Australia", "CAY", "Austral…
$ Runs          <int> 172, 162, 156, 150, 145, 144, 137, 137, 137, 135, 135, 1…
$ NotOut        <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE…
$ Minutes       <dbl> NA, NA, 70, NA, NA, NA, NA, NA, 94, NA, 93, NA, NA, NA, …
$ BallsFaced    <int> 76, 62, 63, 67, 65, 41, 50, 49, 62, 62, 54, 68, 73, 43, …
$ Fours         <int> 16, 11, 11, 7, 14, 6, 8, 7, 5, 11, 7, 8, 15, 7, 15, 5, 5…
$ Sixes         <int> 10, 16, 14, 13, 9, 18, 12, 15, 16, 10, 13, 12, 6, 15, 6,…
$ StrikeRate    <dbl> 226.3158, 261.2903, 247.6190, 223.8806, 223.0769, 351.21…
$ Innings       <int> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Participation <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "…
$ Opposition    <chr> "Zimbabwe", "Ireland", "England", "Brazil", "Sri Lanka",…
$ Ground        <chr> "Harare", "Dehradun", "Southampton", "Buenos Aires", "Pa…
$ Year          <dbl> 2018, 2019, 2013, 2024, 2016, 2024, 2023, 2022, 2024, 20…
$ Month         <dbl> 7, 2, 8, 12, 9, 6, 9, 6, 1, 2, 2, 2, 4, 10, 7, 8, 9, 7, …
$ Day           <int> 3, 23, 29, 14, 6, 17, 27, 5, 17, 29, 2, 15, 18, 23, 25, …

Listing the unique values used in the “Country” column shows even more cause for concern. There are 108 entries, including obvious errors like “MWest Indies”, and puzzling abbreviations like “GUE” and “STHEL”.

raw_countries <- as.data.frame(table(t20i_prepared$Country)) 
countries_heading <- raw_countries |> 
  dplyr::rename(Countries=Var1)

countries_only <- countries_heading[, "Countries", drop=FALSE] 

gt::gt(countries_only)
Countries
Afghanistan
Arg
Australia
Aut
BAN
Belg
BER
Bhm
BHR
BHU
Blz
BOT
BRA
BUL
CAM
Canada
CAY
Chile
CHN
CIV
COK
CRC
CRT
CYP
CZK-R
DEN
England
ESP
EST
Falk
Fiji
FIN
Fran
GER
GHA
GIBR
GMB
GRC
GUE
Hong Kong
HUN
ICC World XI
INA
India
IOM
Iran
Ireland
ISR
ITA
Japan
JER
Kenya
KSA
KUW
LES
LUX
Mali
MAS
MDV
Mex
MLT
MNG
MOZ
MWest Indies
MYAN
Namibia
NED
NEP
New Zealand
NGA
NOR
OMA
Pakistan
PAN
Papau New Guinea
Peru
PHI
PORT
QAT
ROM
RWN
Samoa
Scotland
SEY
SGP
SKOR
South Africa
SRB
Sri Lanka
Sri LankaE
STHEL
SUI
Sur
SVN
SWA
SWE
SWZ
TAN
TCL
THA
TKY
UGA
United Arab Emirates
United States of America
VAN
West Indies
World XI
Zimbabwe
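
A more direct way to list the distinct labels is to combine unique() and sort(). This is a sketch on a toy vector, not the code used for the table above. The vector country_column is a hypothetical stand-in for t20i_prepared$Country, using the first four values from the glimpse() output.

```r
# Toy vector standing in for t20i_prepared$Country.
country_column <- c("Australia", "Afghanistan", "Australia", "CAY")

# unique() drops repeats; sort() puts the labels in alphabetical order.
distinct_countries <- sort(unique(country_column))
distinct_countries

# length() gives the number of distinct labels.
n_labels <- length(distinct_countries)
n_labels
```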

Closer examination reveals that not all of these “national” teams represent countries. The ICC World XI is actually a team of international all-stars. “GUE” is an abbreviation for Guernsey, a British Crown Dependency, and “STHEL” represents St Helena, a British Overseas Territory. Many readers would find these abbreviations unintelligible.

It would be possible, but tedious, to write a line of code for each one of these many abbreviations, converting them to the full name for each country.

Some of these three-letter country abbreviations, such as GHA for “Ghana”, are part of the ISO 3166-1 alpha-3 standard of three-character country codes. The countrycode package can convert these abbreviations to full names.

I arrived at the following method. I write a conditional statement that applies only to strings of exactly three characters and looks each one up in the ISO standard. If it finds a match, it supplies the full name of the country. If it finds no match, I specify that it returns the word “not”. I also add the condition that the abbreviation is replaced only if the result is more than three letters long, so when the lookup returns “not”, no change is made.

This code converts quite a few of the abbreviations, but many are unchanged.

t20i_prepared <- t20i_runs_20250628 |> 
  dplyr::mutate(Year=lubridate::year(Date),
         Month=lubridate::month(Date),
         Day=lubridate::day(Date)
  ) |> 
  dplyr::mutate(Country = dplyr::case_when(
    nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"),
    TRUE ~ Country  # Keep other values unchanged
  ))

iso_countries <- as.data.frame(table(t20i_prepared$Country)) 
iso_heading <- iso_countries |> 
  dplyr::rename(Countries=Var1)

iso_countries_only <- iso_heading[, "Countries", drop=FALSE] 

gt::gt(iso_countries_only)
Countries
Afghanistan
Argentina
Australia
Austria
Bahrain
BAN
Belg
Belize
BER
Bhm
BHU
BOT
Brazil
BUL
CAM
Canada
CAY
Chile
China
Cook Islands
Côte d’Ivoire
CRC
CRT
Cyprus
CZK-R
DEN
England
Estonia
Eswatini
Falk
Fiji
Finland
Fran
Gambia
GER
Ghana
GIBR
Greece
GUE
Hong Kong
Hungary
ICC World XI
INA
India
IOM
Iran
Ireland
Israel
Italy
Japan
JER
Kenya
KSA
KUW
LES
Luxembourg
Maldives
Mali
Malta
MAS
Mexico
Mongolia
Mozambique
MWest Indies
MYAN
Namibia
NED
NEP
New Zealand
Nigeria
Norway
OMA
Pakistan
Panama
Papau New Guinea
Peru
PHI
PORT
Qatar
ROM
RWN
Samoa
Scotland
Serbia
SEY
Singapore
SKOR
Slovenia
South Africa
Spain
Sri Lanka
Sri LankaE
STHEL
SUI
Suriname
SWA
Sweden
TAN
TCL
Thailand
TKY
Uganda
United Arab Emirates
United States of America
VAN
West Indies
World XI
Zimbabwe
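
The rule used in the code above (replace a label only when it is exactly three characters long and a match is found) can be illustrated without the countrycode package. In this sketch, toy_lookup is a hypothetical three-entry stand-in for the ISO table, and expand_code() is an illustrative helper, not a countrycode function.

```r
# Hypothetical miniature lookup table standing in for the ISO alpha-3 list.
toy_lookup <- c(ARG = "Argentina", GHA = "Ghana", HUN = "Hungary")

# Illustrative helper: replace a label only if it is exactly three
# characters long AND appears in the lookup; otherwise keep it unchanged.
expand_code <- function(labels, lookup) {
  matched <- nchar(labels) == 3 & labels %in% names(lookup)
  labels[matched] <- unname(lookup[labels[matched]])
  labels
}

team_labels <- c("GHA", "GUE", "India", "STHEL", "ARG")
expanded <- expand_code(team_labels, toy_lookup)
expanded
```

“GUE” is three characters long but has no entry in the toy lookup, so it passes through unchanged, just as unmatched abbreviations do in the table above.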

It turned out that some of the three-letter abbreviations that had not been converted matched those used by the International Olympic Committee. The countrycode package also includes these abbreviations, so I added another line of code checking for abbreviations matching the Olympic standard.

This converts quite a few more abbreviations.

t20i_prepared <- t20i_runs_20250628 |> 
  dplyr::mutate(Year=lubridate::year(Date),
         Month=lubridate::month(Date),
         Day=lubridate::day(Date)
  ) |> 
  dplyr::mutate(Country = dplyr::case_when(
    nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"),
    nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"),
    TRUE ~ Country  # Keep other values unchanged
  ))

olympic_countries <- as.data.frame(table(t20i_prepared$Country)) 
olympic_heading <- olympic_countries |> 
  dplyr::rename(Countries=Var1)

olympic_countries_only <- olympic_heading[, "Countries", drop=FALSE] 

gt::gt(olympic_countries_only)
Countries
Afghanistan
Argentina
Australia
Austria
Bahrain
Bangladesh
Belg
Belize
Bermuda
Bhm
Bhutan
Botswana
Brazil
Bulgaria
Cambodia
Canada
Cayman Islands
Chile
China
Cook Islands
Costa Rica
Côte d’Ivoire
CRT
Cyprus
CZK-R
Denmark
England
Estonia
Eswatini
Falk
Fiji
Finland
Fran
Gambia
Germany
Ghana
GIBR
Greece
GUE
Hong Kong
Hungary
ICC World XI
India
Indonesia
IOM
Iran
Ireland
Israel
Italy
Japan
JER
Kenya
Kuwait
Lesotho
Luxembourg
Malaysia
Maldives
Mali
Malta
Mexico
Mongolia
Mozambique
MWest Indies
MYAN
Namibia
Nepal
Netherlands
New Zealand
Nigeria
Norway
Oman
Pakistan
Panama
Papau New Guinea
Peru
Philippines
PORT
Qatar
ROM
RWN
Samoa
Saudi Arabia
Scotland
Serbia
Seychelles
Singapore
SKOR
Slovenia
South Africa
Spain
Sri Lanka
Sri LankaE
STHEL
Suriname
SWA
Sweden
Switzerland
Tanzania
TCL
Thailand
TKY
Uganda
United Arab Emirates
United States of America
Vanuatu
West Indies
World XI
Zimbabwe
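
To see which labels still look like abbreviations after a pass like this, one rough heuristic is to flag entries written entirely in capital letters and hyphens. This is only a sketch under that assumption. It would miss mixed-case leftovers such as “Belg” or “Fran” and misspellings such as “Sri LankaE”, so the list still needs reading by eye. The vector remaining_labels is a hypothetical sample taken from the table above.

```r
# Hypothetical sample of labels taken from the table above.
remaining_labels <- c("Belg", "CRT", "CZK-R", "Denmark", "GUE",
                      "ICC World XI", "STHEL")

# Flag labels made up only of capital letters and hyphens.
looks_like_code <- grepl("^[A-Z-]+$", remaining_labels)

still_abbreviated <- remaining_labels[looks_like_code]
still_abbreviated
```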

I handle the remaining abbreviations and misspellings by simply adding a line of code for each one.

t20i_prepared <- t20i_runs_20250628 |> 
  dplyr::mutate(Year=lubridate::year(Date),
         Month=lubridate::month(Date),
         Day=lubridate::day(Date)
  ) |> 
  dplyr::mutate(Country = dplyr::case_when(
    nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"),
    nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"),
    Country == "MWest Indies" ~ "West Indies",
    Country == "Sri LankaE" ~ "Sri Lanka",
    Country == "World XI" ~ "ICC World XI",
    Country == "Bhm" ~ "Bahamas",
    Country == "TCL" ~ "Turks and Caicos Islands",
    Country == "SWA" ~ "Eswatini",
    Country == "SKOR" ~ "South Korea",
    Country == "Belg" ~ "Belgium",
    Country == "CRT" ~ "Croatia",
    Country == "CZK-R" ~ "Czechia",
    Country == "Falk" ~ "Falkland Islands",
    Country == "Fran" ~ "France",
    Country == "GIBR" ~ "Gibraltar",
    Country == "GUE" ~ "Guernsey",
    Country == "IOM" ~ "Isle of Man",
    Country == "JER" ~ "Jersey",
    Country == "MYAN" ~ "Myanmar",
    Country == "PORT" ~ "Portugal",
    Country == "ROM" ~ "Romania",
    Country == "RWN" ~ "Rwanda",
    Country == "STHEL" ~ "St Helena",
    Country == "TKY" ~ "Türkiye",
    TRUE ~ Country  # Keep other values unchanged
  ))
finished_countries <- as.data.frame(table(t20i_prepared$Country)) 
finished_heading <- finished_countries |> 
  dplyr::rename(Countries=Var1)

finished_countries_only <- finished_heading[, "Countries", drop=FALSE] 

gt::gt(finished_countries_only)
Countries
Afghanistan
Argentina
Australia
Austria
Bahamas
Bahrain
Bangladesh
Belgium
Belize
Bermuda
Bhutan
Botswana
Brazil
Bulgaria
Cambodia
Canada
Cayman Islands
Chile
China
Cook Islands
Costa Rica
Côte d’Ivoire
Croatia
Cyprus
Czechia
Denmark
England
Estonia
Eswatini
Falkland Islands
Fiji
Finland
France
Gambia
Germany
Ghana
Gibraltar
Greece
Guernsey
Hong Kong
Hungary
ICC World XI
India
Indonesia
Iran
Ireland
Isle of Man
Israel
Italy
Japan
Jersey
Kenya
Kuwait
Lesotho
Luxembourg
Malaysia
Maldives
Mali
Malta
Mexico
Mongolia
Mozambique
Myanmar
Namibia
Nepal
Netherlands
New Zealand
Nigeria
Norway
Oman
Pakistan
Panama
Papau New Guinea
Peru
Philippines
Portugal
Qatar
Romania
Rwanda
Samoa
Saudi Arabia
Scotland
Serbia
Seychelles
Singapore
Slovenia
South Africa
South Korea
Spain
Sri Lanka
St Helena
Suriname
Sweden
Switzerland
Tanzania
Thailand
Türkiye
Turks and Caicos Islands
Uganda
United Arab Emirates
United States of America
Vanuatu
West Indies
Zimbabwe

Plotting International Runs Over Time

The t20i_prepared data frame can now be used to plot how many T20 International runs have been scored over time.

At first I thought I might do this by month, but there are many months in which no matches were held. I then thought of analyzing by quarter, but a couple of quarters are missing as well.

This is quite understandable, since T20 International games only started in 2005 as an experimental and controversial new version of the game of cricket.
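
One way to check for gaps like these is to compare the months that appear in the data with a complete calendar of months. The sketch below uses base R only. The vector match_dates is illustrative and stands in for the Date column of t20i_prepared.

```r
# Illustrative match dates standing in for t20i_prepared$Date.
match_dates <- as.Date(c("2005-02-17", "2005-06-13", "2005-10-21", "2006-01-09"))

# Year-month labels that actually occur in the data.
observed <- unique(format(match_dates, "%Y-%m"))

# Every calendar month between the first and last match.
first_month <- as.Date(format(min(match_dates), "%Y-%m-01"))
last_month  <- as.Date(format(max(match_dates), "%Y-%m-01"))
all_months  <- format(seq(first_month, last_month, by = "month"), "%Y-%m")

# Months in the full calendar that never appear in the data.
missing_months <- setdiff(all_months, observed)
missing_months
```

The same approach would work for quarters by building the labels from the year and quarter instead of the year and month.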

I settled on doing the analysis by year, which is also easier to follow than a monthly or quarterly analysis. So I used the summarize() function to convert the t20i_prepared data frame to the total_per_year data frame, which shows the total number of T20 International runs scored per year.

total_per_year <- t20i_prepared |> 
  dplyr::arrange(Year, Month, Day) |> 
  dplyr::summarize(
    yearly_totals= sum(Runs, na.rm = TRUE), 
    .by=c(Year)
  )

dplyr::glimpse(total_per_year)
Rows: 21
Columns: 2
$ Year          <dbl> 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 20…
$ yearly_totals <int> 864, 2405, 10383, 6113, 13220, 17236, 5659, 21423, 14560…

I then used the ts() function to create a time series based on this data. A time series is a collection of observations of well-defined data items obtained through repeated measurements over time (definition taken from https://otexts.com/fpp2/ts-objects.html).

t20i_to_2025 <- ts(total_per_year, start=2005)

plot.ts(t20i_to_2025[,2], plot.type="single")

This plot is misleading: it creates the impression that there is a sudden drop-off in T20 International runs in 2025. Since 2025 is still in progress, it cannot reasonably be compared with previous years.

Doing the plot again so that it ends with 2024 makes more sense.

t20i_to_2024 <- ts(total_per_year, start=2005, end=2024)
plot.ts(t20i_to_2024[,2], plot.type="single")
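
An existing time series can also be trimmed with window() rather than rebuilt with ts(). This is a sketch of that alternative on made-up numbers, where the last value stands in for an incomplete year. The object names are hypothetical.

```r
# Illustrative yearly totals; the 2024 value stands in for an incomplete year.
yearly_runs <- ts(c(100, 250, 400, 120), start = 2021)

# window() keeps only the observations up to the chosen end point.
complete_years <- window(yearly_runs, end = 2023)

# The years retained after trimming.
time(complete_years)
```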

plot.ts() produces a very basic plot. I use the tidy() function to convert the relevant data to a tibble, so it can be plotted using the ggplot() function. Just a few basic additions make the plot more readable.

t20i_to_2024_tib <- broom::tidy(t20i_to_2024[,2])

ggplot_to_2024 <- ggplot2::ggplot(
  t20i_to_2024_tib,
  ggplot2::aes(x=index, y=value)
  ) +
  ggplot2::geom_line() +
  ggplot2::scale_x_continuous(limits=c(2005, 2024), breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::labs(title = "Total T20 International Runs Per Year", x = "Year", y = "Runs Scored", caption = "Source: ESPNcricinfo")
  
ggplot_to_2024

As might be expected, dramatic growth is followed by a sharp drop-off during the pandemic in 2020. This then leads to an artificial surge after restrictions are lifted, presumably because teams are trying to make up for lost time. After a slight slump, the total surges dramatically again leading up to 2024.

Plotting Runs By Country

I used the summarize() function again to get the sum of runs for each country by each year. We can then use this information to plot runs by country instead of by year. I also summarized the number of players per country per year. See the section further below where I analyze the data on the number of players.

country_by_year <- t20i_prepared |> 
  dplyr::arrange(Year, Month, Day) |> 
  dplyr::summarize(
    country_annual_runs=sum(Runs, na.rm = TRUE),
    player_totals=dplyr::n_distinct(Player, na.rm=TRUE),
    .by=c(Country, Year)
  ) |> 
  dplyr::filter(Year != 2025)

country_by_year
# A tibble: 642 × 4
   Country       Year country_annual_runs player_totals
   <chr>        <dbl>               <int>         <int>
 1 Australia     2005                 275            13
 2 New Zealand   2005                 291            16
 3 England       2005                 173            11
 4 South Africa  2005                 125            12
 5 Australia     2006                 385            15
 6 South Africa  2006                 410            22
 7 New Zealand   2006                 374            17
 8 West Indies   2006                 116            11
 9 England       2006                 281            16
10 Sri Lanka     2006                 313            16
# ℹ 632 more rows

It will also be useful to get the overall total of runs scored by each country. This can then be used to determine which countries have contributed most.

country_sum <- t20i_prepared |> 
  dplyr::arrange(Year, Month, Day) |>
  dplyr::filter(Year != 2025) |> 
  dplyr::summarize(
    country_totals=sum(Runs, na.rm = TRUE), 
    .by=c(Country)
  ) |> 
  dplyr::arrange(desc(country_totals))

country_sum
# A tibble: 102 × 2
   Country      country_totals
   <chr>                 <int>
 1 India                 36541
 2 Pakistan              35192
 3 West Indies           33847
 4 New Zealand           31754
 5 Sri Lanka             30547
 6 Australia             29788
 7 England               29423
 8 South Africa          28357
 9 Bangladesh            23822
10 Ireland               21975
# ℹ 92 more rows

I could have simply summarized the progress of the top ten countries. But there is very little difference between country 10, Ireland, and country 11, Zimbabwe.

There is a sudden drop between country twelve, Afghanistan, and the next country, the United Arab Emirates: Afghanistan has over 3,000 more runs. So I decided to plot the top twelve countries.

Many of the remaining teams have not actually played that many games and many no longer exist.

So next I want to convey how great the contribution is of the top twelve teams compared to the remaining 90 teams. I have reversed the order of the columns to make it easier to follow.

sum_top_12 <- sum(country_sum[1:12, "country_totals"])
sum_remainder <- sum(country_sum[13:nrow(country_sum), "country_totals"])

top_vs_rest <- data.frame(
  Category = c("Top 12 Teams", "Teams Ranked 13 to 102"),
  Total = c(sum_top_12, sum_remainder)
)

top_12_compare <- ggplot2::ggplot(top_vs_rest, ggplot2::aes(x = forcats::fct_rev(Category), y = Total)) +
  ggplot2::geom_col(width = 0.5) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::labs(title = "12 Top Scoring Teams Versus Remaining 90 Teams", x = "Team Groups", y = "Runs Scored", caption = "Source: ESPNcricinfo")

top_12_compare

I then filtered the runs of each of the top twelve teams and showed the total yearly runs of each team by facet. I ordered the facets from greatest to least by total runs scored, using factor levels. I could not find a way to calculate the levels, so I typed out the names of the countries in the order they should appear. It would be much better for R to calculate which countries are the current top 12, but I could not find a way to do that.

top_12 <- country_by_year |> 
  dplyr::filter(Country %in% c(
    "India", "Pakistan", "West Indies", "New Zealand", "Sri Lanka", "Australia",
    "England", "South Africa", "Bangladesh", "Ireland", "Zimbabwe", "Afghanistan"
  ))

top_12_plot <- ggplot2::ggplot(top_12, ggplot2::aes(x=Year, y=country_annual_runs)) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(limits=c(2005, 2024), breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 0.5, hjust = 1)) +
  ggplot2::facet_wrap(~factor(Country, levels=c(
    "India", "Pakistan", "West Indies", "New Zealand", "Sri Lanka", "Australia",
    "England", "South Africa", "Bangladesh", "Ireland", "Zimbabwe", "Afghanistan"
  ))) +
  ggplot2::theme(panel.spacing = ggplot2::unit(1, "lines")) +
  ggplot2::labs(title = "Top Twelve T20 Countries: Runs Per Year", x = "Year", y = "Runs Scored", caption = "Source: ESPNcricinfo") +
  ggplot2::theme(axis.text.x = ggplot2::element_text(vjust = 1))

top_12_plot
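
One possible way to calculate the ranking instead of typing it out is to take the country names from a table that is already sorted by total runs. The following is a sketch on toy data, not a tested replacement for the code above. The names toy_country_sum, toy_yearly, and top_n are hypothetical, and the totals are the first five rows of country_sum shown earlier.

```r
# Toy totals standing in for country_sum (first five rows of the table above).
toy_country_sum <- data.frame(
  Country = c("India", "Pakistan", "West Indies", "New Zealand", "Sri Lanka"),
  country_totals = c(36541, 35192, 33847, 31754, 30547)
)

top_n <- 3  # the document uses 12; 3 keeps the toy example short

# Sort by total (largest first) and keep the first top_n country names.
ranked <- toy_country_sum[order(toy_country_sum$country_totals, decreasing = TRUE), ]
top_names <- head(ranked$Country, top_n)

# Toy per-year rows standing in for country_by_year.
toy_yearly <- data.frame(
  Country = c("Sri Lanka", "India", "Pakistan", "West Indies"),
  Year = c(2024, 2024, 2024, 2024)
)

# The same vector serves as the filter and as the factor levels for faceting.
top_rows <- toy_yearly[toy_yearly$Country %in% top_names, ]
top_rows$Country <- factor(top_rows$Country, levels = top_names)
levels(top_rows$Country)
```

Under this approach, country_sum$Country[1:12] could in principle replace the typed-out names in both dplyr::filter() and ggplot2::facet_wrap(), since country_sum is already arranged in descending order.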

Plotting Runs By Continent

With over 100 countries, any attempt to plot each country individually would be overwhelming.

I considered different ways of grouping the countries by region and showing the runs scored by each region.

There are arguably five areas where all of the more significant teams are concentrated: South Asia, Western Europe, Southern Africa, Oceania, and North America.

In addition, there are “minnow teams” in a surprising number of other far-flung countries and territories.

I thought it could be useful to plot total runs per continent. The only drawback is deciding how to categorize one of the more significant teams, the West Indies. It is a single team whose members come from many Caribbean islands, considered to be part of North America, and from Guyana, a country on mainland South America.

I decided to try categorizing all national teams by continent using the countrycode package.

First I simply duplicated the Country column.

add_continents <- country_by_year |> 
  dplyr::mutate(Continent=Country,
         .before=Country) 

add_continents
# A tibble: 642 × 5
   Continent    Country       Year country_annual_runs player_totals
   <chr>        <chr>        <dbl>               <int>         <int>
 1 Australia    Australia     2005                 275            13
 2 New Zealand  New Zealand   2005                 291            16
 3 England      England       2005                 173            11
 4 South Africa South Africa  2005                 125            12
 5 Australia    Australia     2006                 385            15
 6 South Africa South Africa  2006                 410            22
 7 New Zealand  New Zealand   2006                 374            17
 8 West Indies  West Indies   2006                 116            11
 9 England      England       2006                 281            16
10 Sri Lanka    Sri Lanka     2006                 313            16
# ℹ 632 more rows

I then used countrycode to convert most of the team names to the name of the continent that country or territory belongs to.

add_continents <- country_by_year |> 
  dplyr::mutate(Continent=Country,
         .before=Country) |> 
  dplyr::mutate(Continent=dplyr::case_when(
    nchar(countrycode::countrycode(Continent, origin='country.name', destination='continent', nomatch="not"))>3 ~ countrycode::countrycode(
      Continent, origin='country.name', destination='continent', nomatch="not"),
    TRUE ~ Continent  # Keep other values unchanged
    )) 

add_continents
# A tibble: 642 × 5
   Continent   Country       Year country_annual_runs player_totals
   <chr>       <chr>        <dbl>               <int>         <int>
 1 Oceania     Australia     2005                 275            13
 2 Oceania     New Zealand   2005                 291            16
 3 England     England       2005                 173            11
 4 Africa      South Africa  2005                 125            12
 5 Oceania     Australia     2006                 385            15
 6 Africa      South Africa  2006                 410            22
 7 Oceania     New Zealand   2006                 374            17
 8 West Indies West Indies   2006                 116            11
 9 England     England       2006                 281            16
10 Asia        Sri Lanka     2006                 313            16
# ℹ 632 more rows

I then just had to write a few lines of code to convert the three teams that were not recognized by countrycode. England and Scotland were not recognized because they both form part of the UK, and the West Indies was not recognized because it is an umbrella term.

It turned out that the countrycode package does not distinguish between North and South America, combining them into a single continent, “Americas”. I think this is because of the many islands and regions that have strong associations with both North and South America. This meant there was no need to assign the West Indies to either North or South America; it could simply be considered part of the Americas. Similarly, there was no need to consider Team USA as belonging to a different region than the West Indies. This seemed very appropriate.

I filtered out results for the ICC World XI. This is a team of international all-stars that has not played many games and may not ever be convened again. These runs should not be assigned to a particular region.

add_continents <- country_by_year |> 
  dplyr::mutate(Continent=Country,
         .before=Country) |> 
  dplyr::mutate(Continent=dplyr::case_when(
    nchar(countrycode::countrycode(Continent, origin='country.name', destination='continent', nomatch="not"))>3 ~ countrycode::countrycode(
      Continent, origin='country.name', destination='continent', nomatch="not"),
    Continent == "West Indies" ~ "Americas",
    Continent == "England" ~ "Europe",
    Continent == "Scotland" ~ "Europe",
    TRUE ~ Continent  # Keep other values unchanged
    )) |> 
  dplyr::filter(Continent!="ICC World XI")

add_continents
# A tibble: 640 × 5
   Continent Country       Year country_annual_runs player_totals
   <chr>     <chr>        <dbl>               <int>         <int>
 1 Oceania   Australia     2005                 275            13
 2 Oceania   New Zealand   2005                 291            16
 3 Europe    England       2005                 173            11
 4 Africa    South Africa  2005                 125            12
 5 Oceania   Australia     2006                 385            15
 6 Africa    South Africa  2006                 410            22
 7 Oceania   New Zealand   2006                 374            17
 8 Americas  West Indies   2006                 116            11
 9 Europe    England       2006                 281            16
10 Asia      Sri Lanka     2006                 313            16
# ℹ 630 more rows
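
As an alternative to the extra case_when() lines, the countrycode() function accepts a custom_match argument, a named vector of hand-written matches that take precedence over the built-in ones. The sketch below assumes the countrycode package is installed. The names team_names and team_overrides are hypothetical, and this is not the method used above.

```r
team_names <- c("Australia", "Sri Lanka", "England", "Scotland", "West Indies")

# Teams that are not recognised as countries, mapped to continents by hand.
team_overrides <- c(
  "England"     = "Europe",
  "Scotland"    = "Europe",
  "West Indies" = "Americas"
)

continents <- countrycode::countrycode(
  team_names,
  origin = "country.name",
  destination = "continent",
  custom_match = team_overrides,  # hand-written matches override the defaults
  warn = FALSE                    # suppress warnings about unmatched names
)

data.frame(Country = team_names, Continent = continents)
```

With the default nomatch setting, a team with no match and no override (such as the ICC World XI) comes back as NA, so it could be dropped by filtering on !is.na(Continent).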

Having grouped all of the teams by continent, I then created continent_by_year to provide the total number of runs per continent per year. I also summarized the annual number of players per continent. (See the next section for analysis of player numbers.)

continent_by_year <- add_continents |> 
  dplyr::summarize(
    annual_runs=sum(country_annual_runs, na.rm = TRUE),
    annual_players=sum(player_totals, na.rm = TRUE),
    # annual_teams=dplyr::n_distinct(Country, na.rm=TRUE),
    .by=c(Continent, Year)
  ) 

continent_by_year
# A tibble: 98 × 4
   Continent  Year annual_runs annual_players
   <chr>     <dbl>       <int>          <int>
 1 Oceania    2005         566             29
 2 Europe     2005         173             11
 3 Africa     2005         125             12
 4 Oceania    2006         759             32
 5 Africa     2006         528             33
 6 Americas   2006         116             11
 7 Europe     2006         281             16
 8 Asia       2006         721             49
 9 Oceania    2007        2424             39
10 Europe     2007        1354             37
# ℹ 88 more rows

I then used continent_sum to rank each continent by total overall runs scored in descending order.

continent_sum <- add_continents |> 
  dplyr::summarize(
    sum_annual_runs=sum(country_annual_runs, na.rm = TRUE), 
    .by=c(Continent)
  ) |> 
  dplyr::arrange(desc(sum_annual_runs))

continent_sum
# A tibble: 5 × 2
  Continent sum_annual_runs
  <chr>               <int>
1 Asia               289714
2 Europe             198321
3 Africa             133298
4 Oceania             80284
5 Americas            70930

I was then able to plot the runs per continent using facets. Once again I had to use levels to determine the order in which the continents appeared, although I would have preferred to use R to automatically calculate the rank of each continent.

continent_plot <- ggplot2::ggplot(continent_by_year, ggplot2::aes(x=Year, y=annual_runs)) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(limits=c(2005, 2024), breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::facet_grid(~ factor(Continent, levels=c(
    "Asia", "Europe", "Africa", "Oceania", "Americas"
  ))) +
  ggplot2::theme(panel.spacing = ggplot2::unit(1, "lines")) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 0.5, hjust = 1)) +
  ggplot2::labs(title = "T20 International: Runs By Continent", x = "Year", y = "Runs Scored", caption = "Source: ESPNcricinfo") +
   ggplot2::theme(axis.text.x = ggplot2::element_text(vjust = 1))

continent_plot
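
The facet order can also be derived from the data with base R's reorder(), which sorts factor levels by a summary of a second variable. This is a sketch of that technique, not a tested change to the plot above. The data frame toy_continent_by_year is a hypothetical name holding only the first eight rows of continent_by_year shown earlier, so the resulting order differs from the full ranking.

```r
# First eight rows of continent_by_year, copied from the output above.
toy_continent_by_year <- data.frame(
  Continent = c("Oceania", "Europe", "Africa", "Oceania",
                "Africa", "Americas", "Europe", "Asia"),
  annual_runs = c(566, 173, 125, 759, 528, 116, 281, 721)
)

# reorder() sums the second argument within each continent and sorts the
# levels in increasing order; negating the runs puts the largest total first.
ordered_continent <- reorder(
  toy_continent_by_year$Continent,
  -toy_continent_by_year$annual_runs,
  FUN = sum
)

levels(ordered_continent)
```

A factor built this way could be passed to ggplot2::facet_grid() in place of the hand-typed levels.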

Runs Per Continent Compared With Number of Players

Now that we have analyzed the number of runs, we can start to ask “chicken and egg” questions. Are there secondary factors that influence the number of runs? Our data includes data on the number of players. Do continents with more players generate more runs overall?

It’s difficult to compare the total numbers of runs and players directly. The run totals are many times bigger than the player totals. Let’s use Oceania as an example.

compare_oceania <- continent_by_year |> 
  dplyr::filter(Continent=="Oceania")

compare_oceania
# A tibble: 20 × 4
   Continent  Year annual_runs annual_players
   <chr>     <dbl>       <int>          <int>
 1 Oceania    2005         566             29
 2 Oceania    2006         759             32
 3 Oceania    2007        2424             39
 4 Oceania    2008         871             36
 5 Oceania    2009        2869             49
 6 Oceania    2010        3791             43
 7 Oceania    2011        1164             32
 8 Oceania    2012        4125             49
 9 Oceania    2013        1993             43
10 Oceania    2014        3006             51
11 Oceania    2015        1125             42
12 Oceania    2016        3781             57
13 Oceania    2017        2259             57
14 Oceania    2018        4586             45
15 Oceania    2019        7150             84
16 Oceania    2020        3119             42
17 Oceania    2021        6867             71
18 Oceania    2022       11426            115
19 Oceania    2023        7524            111
20 Oceania    2024       10879            128

Next we can try to plot totals of runs and players on a single plot. To do this, we first need to convert the data to long format. We will no longer have a wide format, where there is a single row for each year. Instead each value category will have a separate row for each year.

long_oceania_runs <- compare_oceania |> 
  tidyr::pivot_longer(
    cols=c("annual_runs", "annual_players"),
    names_to="value_categories",
    values_to="annual_figures"
  )

long_oceania_runs
# A tibble: 40 × 4
   Continent  Year value_categories annual_figures
   <chr>     <dbl> <chr>                     <int>
 1 Oceania    2005 annual_runs                 566
 2 Oceania    2005 annual_players               29
 3 Oceania    2006 annual_runs                 759
 4 Oceania    2006 annual_players               32
 5 Oceania    2007 annual_runs                2424
 6 Oceania    2007 annual_players               39
 7 Oceania    2008 annual_runs                 871
 8 Oceania    2008 annual_players               36
 9 Oceania    2009 annual_runs                2869
10 Oceania    2009 annual_players               49
# ℹ 30 more rows

We can now plot the two value categories on the same axes. But the plot doesn’t really make sense. The run totals are so much bigger than the player totals that the player line is flattened near the bottom of the plot. It is difficult to meaningfully compare these categories in this way.

long_oceania_runs_plot <- ggplot2::ggplot(
  long_oceania_runs,
  ggplot2::aes(x = Year,
               y = annual_figures,
               col = value_categories
               )
  ) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
  ggplot2::labs(
    x = "Year", 
    y = "Annual Totals",
    col = "Growth Metrics"
    )

long_oceania_runs_plot

One solution might be to provide a table or spreadsheet that simply shows the raw numbers. But there is another way to compare these different values.

We can plot these widely divergent figures logarithmically. Instead of using a Y axis where equal distances represent equal differences, we will use a Y axis where equal distances represent equal ratios, so each step up the axis multiplies the value by the same factor. We can then meaningfully compare the number of players with the number of runs, because the small player totals are no longer squashed against the bottom of the plot and the rises and falls of both lines are visible on the same axis.
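To make the idea concrete, here is a small sketch using base R’s log10(). The Oceania figures for 2024 are taken from the table above; the variable names are only illustrative and are not used elsewhere in this project.

```r
# Equal steps on a log10 axis correspond to equal ratios, not equal differences.
powers_of_ten <- c(10, 100, 1000)
log_positions <- log10(powers_of_ten)
log_positions
# [1] 1 2 3   (each tenfold increase moves one unit up the axis)

# Oceania 2024 figures from the table above (illustrative variable names).
oceania_2024 <- c(annual_players = 128, annual_runs = 10879)

# On a linear axis the two points are 10,751 units apart.
linear_gap <- unname(oceania_2024["annual_runs"] - oceania_2024["annual_players"])

# On a log10 axis they are fewer than two units apart,
# so both lines fit comfortably on one plot.
log_gap <- unname(diff(log10(oceania_2024)))
```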

There are different bases that can be used for a logarithmic scale, but we don’t need to do a deep dive into the details. Let’s take long_oceania_runs_plot and change the Y axis so that both growth metrics are plotted on a log 10 (base-10) scale. For more about logarithms in general and log 10 in particular, see here.

Then we can plot the results. This shows an overall correlation between the number of runs and the number of players. Both factors usually either increase or decrease together each year, even if the proportions are somewhat different. But there are also exceptions. The factors diverge in 2010 and 2018, when runs rose while the number of players fell, and in 2017, when runs fell while the number of players stayed the same.

oceania_log_plot <- ggplot2::ggplot(
  long_oceania_runs,
  ggplot2::aes(x = Year,
               y = annual_figures,
               col = value_categories
               )
  ) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(
    trans = "log10",
    labels = scales::comma
    ) +
  ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
  ggplot2::labs(
    title="Oceania: Comparative Increase of Runs and Number of Players", 
    x = "Year", 
    y = "Annual Totals",
    col = "Growth Metrics",
    caption = "Source: ESPNcricinfo"
  )

oceania_log_plot
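The divergent years can also be checked numerically rather than by eye. The sketch below is one possible approach, not the method used to produce the plot. It rebuilds the Oceania figures from the compare_oceania table in a stand-alone data frame (the names oceania_check, runs_direction and players_direction are made up for illustration) and flags the years in which runs and players did not move in the same direction compared with the previous year. It also computes a Pearson correlation on the log 10 values, which is an assumed choice of measure.

```r
# Oceania figures copied from the compare_oceania table above.
oceania_check <- data.frame(
  Year = 2005:2024,
  annual_runs = c(566, 759, 2424, 871, 2869, 3791, 1164, 4125, 1993, 3006,
                  1125, 3781, 2259, 4586, 7150, 3119, 6867, 11426, 7524, 10879),
  annual_players = c(29, 32, 39, 36, 49, 43, 32, 49, 43, 51,
                     42, 57, 57, 45, 84, 42, 71, 115, 111, 128)
)

divergent_years <- oceania_check |>
  dplyr::mutate(
    # +1 = rose, -1 = fell, 0 = unchanged versus the previous year
    runs_direction = sign(annual_runs - dplyr::lag(annual_runs)),
    players_direction = sign(annual_players - dplyr::lag(annual_players))
  ) |>
  # The first year has no previous year (NA), so filter() drops it.
  dplyr::filter(runs_direction != players_direction) |>
  dplyr::pull(Year)

divergent_years
# [1] 2010 2017 2018

# Pearson correlation of the log10-transformed series (an assumed measure).
log_correlation <- cor(
  log10(oceania_check$annual_runs),
  log10(oceania_check$annual_players)
)
log_correlation
```

The same check could be run on any of the other continents by swapping in their figures.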

Let’s try this with Asia. First we need to take the data and make it long.

compare_asia <- continent_by_year |> 
  dplyr::filter(Continent=="Asia") |> 
  tidyr::pivot_longer(
    cols=c("annual_runs", "annual_players"),
    names_to="value_categories",
    values_to="annual_figures"
  )

compare_asia
# A tibble: 38 × 4
   Continent  Year value_categories annual_figures
   <chr>     <dbl> <chr>                     <int>
 1 Asia       2006 annual_runs                 721
 2 Asia       2006 annual_players               49
 3 Asia       2007 annual_runs                4224
 4 Asia       2007 annual_players               61
 5 Asia       2008 annual_runs                1406
 6 Asia       2008 annual_players               60
 7 Asia       2009 annual_runs                5402
 8 Asia       2009 annual_players               75
 9 Asia       2010 annual_runs                5805
10 Asia       2010 annual_players               97
# ℹ 28 more rows

We see an overall correlation again, but the factors diverge in 2018 and 2023.

asia_log_plot <- ggplot2::ggplot(
  compare_asia,
  ggplot2::aes(x = Year,
               y = annual_figures,
               col = value_categories
               )
  ) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = c(2006, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(
    trans = "log10",
    labels = scales::comma
  ) +
  ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
  ggplot2::labs(
    title="Asia: Comparative Increase of Runs and Number of Players", 
    x = "Year", 
    y = "Annual Totals",
    col = "Growth Metrics",
    caption = "Source: ESPNcricinfo"
  )

asia_log_plot

Now we can analyze the results for Africa.

compare_africa <- continent_by_year |> 
  dplyr::filter(Continent=="Africa") |> 
  tidyr::pivot_longer(
    cols=c("annual_runs", "annual_players"),
    names_to="value_categories",
    values_to="annual_figures"
  )

compare_africa
# A tibble: 40 × 4
   Continent  Year value_categories annual_figures
   <chr>     <dbl> <chr>                     <int>
 1 Africa     2005 annual_runs                 125
 2 Africa     2005 annual_players               12
 3 Africa     2006 annual_runs                 528
 4 Africa     2006 annual_players               33
 5 Africa     2007 annual_runs                1627
 6 Africa     2007 annual_players               49
 7 Africa     2008 annual_runs                1088
 8 Africa     2008 annual_players               42
 9 Africa     2009 annual_runs                1800
10 Africa     2009 annual_players               25
# ℹ 30 more rows

Again we see an overall correlation, but with divergences in 2009 (runs rose from 1,088 to 1,800 while the number of players fell from 42 to 25) and 2016.

africa_log_plot <- ggplot2::ggplot(
  compare_africa,
  ggplot2::aes(x = Year,
               y = annual_figures,
               col = value_categories
               )
  ) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(
    trans = "log10",
    labels = scales::comma
    ) +
  ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
  ggplot2::labs(
    title="Africa: Comparative Increase of Runs and Number of Players", 
    x = "Year", 
    y = "Annual Totals",
    col = "Growth Metrics",
    caption = "Source: ESPNcricinfo"
  )

africa_log_plot

Now we can take a look at the results from Europe.

compare_europe <- continent_by_year |> 
  dplyr::filter(Continent=="Europe") |> 
  tidyr::pivot_longer(
    cols=c("annual_runs", "annual_players"),
    names_to="value_categories",
    values_to="annual_figures"
  )

compare_europe
# A tibble: 40 × 4
   Continent  Year value_categories annual_figures
   <chr>     <dbl> <chr>                     <int>
 1 Europe     2005 annual_runs                 173
 2 Europe     2005 annual_players               11
 3 Europe     2006 annual_runs                 281
 4 Europe     2006 annual_players               16
 5 Europe     2007 annual_runs                1354
 6 Europe     2007 annual_players               37
 7 Europe     2008 annual_runs                1429
 8 Europe     2008 annual_players               51
 9 Europe     2009 annual_runs                2150
10 Europe     2009 annual_players               63
# ℹ 30 more rows

There is an overall correlation, with divergences in 2010, 2013, and 2014.

europe_log_plot <- ggplot2::ggplot(
  compare_europe,
  ggplot2::aes(x = Year,
               y = annual_figures,
               col = value_categories
               )
  ) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(
    trans = "log10",
    labels = scales::comma
    ) +
  ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
  ggplot2::labs(
    title="Europe: Comparative Increase of Runs and Number of Players", 
    x = "Year", 
    y = "Annual Totals",
    col = "Growth Metrics",
    caption = "Source: ESPNcricinfo"
  )

europe_log_plot

And lastly, the Americas.

compare_americas <- continent_by_year |> 
  dplyr::filter(Continent=="Americas") |> 
  tidyr::pivot_longer(
    cols=c("annual_runs", "annual_players"),
    names_to="value_categories",
    values_to="annual_figures"
  )

compare_americas
# A tibble: 38 × 4
   Continent  Year value_categories annual_figures
   <chr>     <dbl> <chr>                     <int>
 1 Americas   2006 annual_runs                 116
 2 Americas   2006 annual_players               11
 3 Americas   2007 annual_runs                 754
 4 Americas   2007 annual_players               17
 5 Americas   2008 annual_runs                1319
 6 Americas   2008 annual_players               57
 7 Americas   2009 annual_runs                 999
 8 Americas   2009 annual_players               23
 9 Americas   2010 annual_runs                1686
10 Americas   2010 annual_players               37
# ℹ 28 more rows

There is an overall correlation, with divergences in 2013, 2016, and 2023.

americas_log_plot <- ggplot2::ggplot(
  compare_americas,
  ggplot2::aes(x = Year,
               y = annual_figures,
               col = value_categories
               )
  ) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = c(2006, 2010, 2015, 2020, 2024)) +
  ggplot2::scale_y_continuous(
    trans = "log10",
    labels = scales::comma
  ) +
  ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
  ggplot2::labs(
    title="Americas: Comparative Increase of Runs and Number of Players", 
    x = "Year", 
    y = "Annual Totals",
    col = "Growth Metrics",
    caption = "Source: ESPNcricinfo"
  )

americas_log_plot
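The five continent plots above repeat the same steps with only the continent name changed. As a possible refactor, those steps could be wrapped in a helper function. The sketch below is an assumption, not part of the original analysis: plot_continent_log() and demo_by_year are hypothetical names, and demo_by_year is a small stand-in for continent_by_year built from the 2005 to 2009 rows for Oceania and Africa printed above. It uses ggplot2::scale_y_log10(), which is ggplot2’s shorthand for a continuous Y axis with a log 10 transformation.

```r
# Stand-in for continent_by_year, using rows printed earlier in this section.
demo_by_year <- data.frame(
  Continent = rep(c("Oceania", "Africa"), each = 5),
  Year = rep(2005:2009, times = 2),
  annual_runs = c(566, 759, 2424, 871, 2869, 125, 528, 1627, 1088, 1800),
  annual_players = c(29, 32, 39, 36, 49, 12, 33, 49, 42, 25)
)

# Hypothetical helper: filter one continent, pivot to long format,
# and draw runs and players on a shared log10 Y axis.
plot_continent_log <- function(data, continent) {
  long_data <- data |>
    dplyr::filter(Continent == continent) |>
    tidyr::pivot_longer(
      cols = c("annual_runs", "annual_players"),
      names_to = "value_categories",
      values_to = "annual_figures"
    )

  ggplot2::ggplot(
    long_data,
    ggplot2::aes(x = Year, y = annual_figures, col = value_categories)
  ) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::scale_y_log10(labels = scales::comma) +
    # Labels follow the alphabetical order of the categories:
    # annual_players first, then annual_runs.
    ggplot2::scale_color_discrete(
      labels = c("Number of Players", "Runs Scored")
    ) +
    ggplot2::labs(
      title = paste0(continent, ": Comparative Increase of Runs and Number of Players"),
      x = "Year",
      y = "Annual Totals",
      col = "Growth Metrics",
      caption = "Source: ESPNcricinfo"
    )
}

africa_demo_plot <- plot_continent_log(demo_by_year, "Africa")
africa_demo_plot
```

With the full data, a call such as plot_continent_log(continent_by_year, "Asia") would be expected to produce a plot equivalent to asia_log_plot, apart from the X axis breaks, which this sketch leaves to ggplot2’s defaults.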

This little experiment seems to confirm what common sense might tell us. Total player numbers are not an absolute predictor of runs scored.

Runs scored are probably determined by a combination of these overlapping factors:

  • Talent pipeline

  • Financial resources

  • Culture of excellence (success breeding success)

  • Overall popularity of cricket in a particular region

However, the number of players is easy to quantify, and it can serve as a rough proxy for these harder-to-measure influences.