T20 International Cricket: Runs By Country
Libraries Used In This Project
I have written this code so that every function call names its source package (for example, dplyr::mutate()). There is therefore no need to load any package using library(). If you would like to know which libraries are used, they are listed below.
Libraries Included in tidyverse Suite of Packages:
dplyr
forcats
ggplot2
lubridate
tidyr
Libraries Not Included in tidyverse:
broom
countrycode
cricketdata
gt
scales
Getting the Data
This is how I scraped the data and then saved it:
- I requested the data on all runs scored (“batting”) by all men’s national T20 cricket teams.
- I then saved this data as the file “all_runs”.
t20i_runs <- cricketdata::fetch_cricinfo("t20", "men", "batting", "innings", country = NULL)
saveRDS(t20i_runs, "all_runs")
It can take repeated attempts and considerable time to finish this process. Other code snippets in this document have been set to execute, but not this one.
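Because the download is slow and can fail, one common pattern is to cache the result on disk and only fetch again when the file is missing. The sketch below is not part of the original project: cached_fetch() is a hypothetical helper name, and the fetch step is passed in as a function so that the caching logic can be tried out without contacting ESPNcricinfo.

```r
# Hypothetical helper (not from the original project): return the cached
# object if the file exists, otherwise run fetch_fn(), save it, and return it.
cached_fetch <- function(path, fetch_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fn()
  saveRDS(result, path)
  result
}

# Demonstration with a stand-in fetch function and a temporary file.
demo_path <- tempfile(fileext = ".rds")
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1  # count how often the "download" really runs
  data.frame(Country = c("Australia", "CAY"), Runs = c(172L, 150L))
}

first  <- cached_fetch(demo_path, fake_fetch)  # runs fake_fetch() and saves
second <- cached_fetch(demo_path, fake_fetch)  # reads the saved file instead
```

In this project the call might look like cached_fetch("all_runs", function() cricketdata::fetch_cricinfo("t20", "men", "batting", "innings")), so the slow request only happens when no saved copy exists.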
The saved data frame can be used to extract the data on which countries have scored how many runs.
It records how many runs each batter from each country scored in each game. This makes for over 70,000 rows, which can easily be totaled per country. It also includes a lot of information on how well each individual player did that we don’t really need for this project.
One immediate cause for concern is the “Country” column, which shows the full names of some countries and abbreviations for others. This is inconsistent.
t20i_runs_20250628 <- readRDS("all_runs")
dplyr::glimpse(t20i_runs_20250628)
Rows: 70,169
Columns: 14
$ Date <date> 2018-07-03, 2019-02-23, 2013-08-29, 2024-12-14, 2016-09…
$ Player <chr> "AJ Finch ", "Hazratullah Zazai ", "AJ Finch ", "YSD Sen…
$ Country <chr> "Australia", "Afghanistan", "Australia", "CAY", "Austral…
$ Runs <int> 172, 162, 156, 150, 145, 144, 137, 137, 137, 135, 135, 1…
$ NotOut <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE…
$ Minutes <dbl> NA, NA, 70, NA, NA, NA, NA, NA, 94, NA, 93, NA, NA, NA, …
$ BallsFaced <int> 76, 62, 63, 67, 65, 41, 50, 49, 62, 62, 54, 68, 73, 43, …
$ Fours <int> 16, 11, 11, 7, 14, 6, 8, 7, 5, 11, 7, 8, 15, 7, 15, 5, 5…
$ Sixes <int> 10, 16, 14, 13, 9, 18, 12, 15, 16, 10, 13, 12, 6, 15, 6,…
$ StrikeRate <dbl> 226.3158, 261.2903, 247.6190, 223.8806, 223.0769, 351.21…
$ Innings <int> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Participation <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "…
$ Opposition <chr> "Zimbabwe", "Ireland", "England", "Brazil", "Sri Lanka",…
$ Ground <chr> "Harare", "Dehradun", "Southampton", "Buenos Aires", "Pa…
Cleaning the Data
The data frame is organized by date, but I want especially to look at runs per year. So the next thing to do is to convert the dates to three separate columns, so that I can group runs by year.
t20i_prepared <- t20i_runs_20250628 |>
dplyr::mutate(Year=lubridate::year(Date),
Month=lubridate::month(Date),
Day=lubridate::day(Date)
)
dplyr::glimpse(t20i_prepared)
Rows: 70,169
Columns: 17
$ Date <date> 2018-07-03, 2019-02-23, 2013-08-29, 2024-12-14, 2016-09…
$ Player <chr> "AJ Finch ", "Hazratullah Zazai ", "AJ Finch ", "YSD Sen…
$ Country <chr> "Australia", "Afghanistan", "Australia", "CAY", "Austral…
$ Runs <int> 172, 162, 156, 150, 145, 144, 137, 137, 137, 135, 135, 1…
$ NotOut <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE…
$ Minutes <dbl> NA, NA, 70, NA, NA, NA, NA, NA, 94, NA, 93, NA, NA, NA, …
$ BallsFaced <int> 76, 62, 63, 67, 65, 41, 50, 49, 62, 62, 54, 68, 73, 43, …
$ Fours <int> 16, 11, 11, 7, 14, 6, 8, 7, 5, 11, 7, 8, 15, 7, 15, 5, 5…
$ Sixes <int> 10, 16, 14, 13, 9, 18, 12, 15, 16, 10, 13, 12, 6, 15, 6,…
$ StrikeRate <dbl> 226.3158, 261.2903, 247.6190, 223.8806, 223.0769, 351.21…
$ Innings <int> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Participation <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "…
$ Opposition <chr> "Zimbabwe", "Ireland", "England", "Brazil", "Sri Lanka",…
$ Ground <chr> "Harare", "Dehradun", "Southampton", "Buenos Aires", "Pa…
$ Year <dbl> 2018, 2019, 2013, 2024, 2016, 2024, 2023, 2022, 2024, 20…
$ Month <dbl> 7, 2, 8, 12, 9, 6, 9, 6, 1, 2, 2, 2, 4, 10, 7, 8, 9, 7, …
$ Day <int> 3, 23, 29, 14, 6, 17, 27, 5, 17, 29, 2, 15, 18, 23, 25, …
Listing the unique values used in the “Country” column shows even more cause for concern. There are 108 entries, including obvious misspellings like “MWest Indies” and puzzling abbreviations like “GUE” and “STHEL”.
raw_countries <- as.data.frame(table(t20i_prepared$Country))
countries_heading <- raw_countries |>
dplyr::rename(Countries=Var1)
countries_only <- countries_heading[, "Countries", drop=FALSE]
gt::gt(countries_only)
| Countries |
|---|
| Afghanistan |
| Arg |
| Australia |
| Aut |
| BAN |
| Belg |
| BER |
| Bhm |
| BHR |
| BHU |
| Blz |
| BOT |
| BRA |
| BUL |
| CAM |
| Canada |
| CAY |
| Chile |
| CHN |
| CIV |
| COK |
| CRC |
| CRT |
| CYP |
| CZK-R |
| DEN |
| England |
| ESP |
| EST |
| Falk |
| Fiji |
| FIN |
| Fran |
| GER |
| GHA |
| GIBR |
| GMB |
| GRC |
| GUE |
| Hong Kong |
| HUN |
| ICC World XI |
| INA |
| India |
| IOM |
| Iran |
| Ireland |
| ISR |
| ITA |
| Japan |
| JER |
| Kenya |
| KSA |
| KUW |
| LES |
| LUX |
| Mali |
| MAS |
| MDV |
| Mex |
| MLT |
| MNG |
| MOZ |
| MWest Indies |
| MYAN |
| Namibia |
| NED |
| NEP |
| New Zealand |
| NGA |
| NOR |
| OMA |
| Pakistan |
| PAN |
| Papau New Guinea |
| Peru |
| PHI |
| PORT |
| QAT |
| ROM |
| RWN |
| Samoa |
| Scotland |
| SEY |
| SGP |
| SKOR |
| South Africa |
| SRB |
| Sri Lanka |
| Sri LankaE |
| STHEL |
| SUI |
| Sur |
| SVN |
| SWA |
| SWE |
| SWZ |
| TAN |
| TCL |
| THA |
| TKY |
| UGA |
| United Arab Emirates |
| United States of America |
| VAN |
| West Indies |
| World XI |
| Zimbabwe |
Closer examination reveals that not all of these “national” teams represent countries. The ICC World XI is actually a team of international all-stars. “GUE” is an abbreviation for Guernsey, a British Crown Dependency, and “STHEL” represents St Helena, a British Overseas Territory. Many readers would find these abbreviations unintelligible.
It would be possible, but tedious, to write a line of code for each one of these many abbreviations, converting them to the full name for each country.
Some of these three-letter abbreviations, such as GHA for “Ghana”, are part of the ISO 3166-1 alpha-3 standard for country codes. The countrycode package can convert these codes to full country names.
I arrived at the following method. I wrote a conditional statement that requires a string of exactly three characters and then looks it up in the ISO three-character standard. If it finds a match, it supplies the full name of the country. If it finds no match, I specify that it returns the word “not”. I also added the condition that the abbreviation is replaced only if the result is more than three characters long, so when no match is found and the word “not” is returned, no change is made.
This code converts quite a few of the abbreviations, but many are unchanged.
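The “replace only when a match is found” logic can be seen in isolation with a toy example. The two lookup tables below are tiny hand-written stand-ins for real code lists (they are not the countrycode data), and convert_or_keep() is a hypothetical helper name. Treating codes as case-insensitive is an assumption made for this sketch.

```r
# Toy lookup tables: hand-written stand-ins, NOT the countrycode data.
iso_like    <- c(GHA = "Ghana", BRA = "Brazil")
other_codes <- c(NED = "Netherlands")

# Hypothetical helper: convert three-character codes found in `lookup`
# and leave every other value exactly as it was.
convert_or_keep <- function(x, lookup) {
  matched <- unname(lookup[toupper(x)])      # NA where the code is unknown
  use_match <- nchar(x) == 3 & !is.na(matched)
  ifelse(use_match, matched, x)
}

teams <- c("GHA", "India", "NED", "GUE", "Bra")
step1 <- convert_or_keep(teams, iso_like)     # first standard
step2 <- convert_or_keep(step1, other_codes)  # second standard, same logic
step2
# "Ghana" "India" "Netherlands" "GUE" "Brazil"
```

Values that match neither table, such as “GUE”, pass through unchanged and still need a manual rule.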
#| label: add__iso_countries
t20i_prepared <- t20i_runs_20250628 |>
dplyr::mutate(Year=lubridate::year(Date),
Month=lubridate::month(Date),
Day=lubridate::day(Date)
) |>
dplyr::mutate(Country = dplyr::case_when(
nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"),
TRUE ~ Country # Keep other values unchanged
))
iso_countries <- as.data.frame(table(t20i_prepared$Country))
iso_heading <- iso_countries |>
dplyr::rename(Countries=Var1)
iso_countries_only <- iso_heading[, "Countries", drop=FALSE]
gt::gt(iso_countries_only)
| Countries |
|---|
| Afghanistan |
| Argentina |
| Australia |
| Austria |
| Bahrain |
| BAN |
| Belg |
| Belize |
| BER |
| Bhm |
| BHU |
| BOT |
| Brazil |
| BUL |
| CAM |
| Canada |
| CAY |
| Chile |
| China |
| Cook Islands |
| Côte d’Ivoire |
| CRC |
| CRT |
| Cyprus |
| CZK-R |
| DEN |
| England |
| Estonia |
| Eswatini |
| Falk |
| Fiji |
| Finland |
| Fran |
| Gambia |
| GER |
| Ghana |
| GIBR |
| Greece |
| GUE |
| Hong Kong |
| Hungary |
| ICC World XI |
| INA |
| India |
| IOM |
| Iran |
| Ireland |
| Israel |
| Italy |
| Japan |
| JER |
| Kenya |
| KSA |
| KUW |
| LES |
| Luxembourg |
| Maldives |
| Mali |
| Malta |
| MAS |
| Mexico |
| Mongolia |
| Mozambique |
| MWest Indies |
| MYAN |
| Namibia |
| NED |
| NEP |
| New Zealand |
| Nigeria |
| Norway |
| OMA |
| Pakistan |
| Panama |
| Papau New Guinea |
| Peru |
| PHI |
| PORT |
| Qatar |
| ROM |
| RWN |
| Samoa |
| Scotland |
| Serbia |
| SEY |
| Singapore |
| SKOR |
| Slovenia |
| South Africa |
| Spain |
| Sri Lanka |
| Sri LankaE |
| STHEL |
| SUI |
| Suriname |
| SWA |
| Sweden |
| TAN |
| TCL |
| Thailand |
| TKY |
| Uganda |
| United Arab Emirates |
| United States of America |
| VAN |
| West Indies |
| World XI |
| Zimbabwe |
It turned out some of the three letter abbreviations that had not been converted matched those used by the International Olympic Committee. The countrycode package also includes these abbreviations, so I add another line of code checking for abbreviations matching the Olympic standard.
This converts quite a few more abbreviations.
t20i_prepared <- t20i_runs_20250628 |>
dplyr::mutate(Year=lubridate::year(Date),
Month=lubridate::month(Date),
Day=lubridate::day(Date)
) |>
dplyr::mutate(Country = dplyr::case_when(
nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"),
nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"),
TRUE ~ Country # Keep other values unchanged
))
olympic_countries <- as.data.frame(table(t20i_prepared$Country))
olympic_heading <- olympic_countries |>
dplyr::rename(Countries=Var1)
olympic_countries_only <- olympic_heading[, "Countries", drop=FALSE]
gt::gt(olympic_countries_only)
| Countries |
|---|
| Afghanistan |
| Argentina |
| Australia |
| Austria |
| Bahrain |
| Bangladesh |
| Belg |
| Belize |
| Bermuda |
| Bhm |
| Bhutan |
| Botswana |
| Brazil |
| Bulgaria |
| Cambodia |
| Canada |
| Cayman Islands |
| Chile |
| China |
| Cook Islands |
| Costa Rica |
| Côte d’Ivoire |
| CRT |
| Cyprus |
| CZK-R |
| Denmark |
| England |
| Estonia |
| Eswatini |
| Falk |
| Fiji |
| Finland |
| Fran |
| Gambia |
| Germany |
| Ghana |
| GIBR |
| Greece |
| GUE |
| Hong Kong |
| Hungary |
| ICC World XI |
| India |
| Indonesia |
| IOM |
| Iran |
| Ireland |
| Israel |
| Italy |
| Japan |
| JER |
| Kenya |
| Kuwait |
| Lesotho |
| Luxembourg |
| Malaysia |
| Maldives |
| Mali |
| Malta |
| Mexico |
| Mongolia |
| Mozambique |
| MWest Indies |
| MYAN |
| Namibia |
| Nepal |
| Netherlands |
| New Zealand |
| Nigeria |
| Norway |
| Oman |
| Pakistan |
| Panama |
| Papau New Guinea |
| Peru |
| Philippines |
| PORT |
| Qatar |
| ROM |
| RWN |
| Samoa |
| Saudi Arabia |
| Scotland |
| Serbia |
| Seychelles |
| Singapore |
| SKOR |
| Slovenia |
| South Africa |
| Spain |
| Sri Lanka |
| Sri LankaE |
| STHEL |
| Suriname |
| SWA |
| Sweden |
| Switzerland |
| Tanzania |
| TCL |
| Thailand |
| TKY |
| Uganda |
| United Arab Emirates |
| United States of America |
| Vanuatu |
| West Indies |
| World XI |
| Zimbabwe |
I handle the remaining abbreviations and misspellings by simply adding a line of code for each one.
t20i_prepared <- t20i_runs_20250628 |>
dplyr::mutate(Year=lubridate::year(Date),
Month=lubridate::month(Date),
Day=lubridate::day(Date)
) |>
dplyr::mutate(Country = dplyr::case_when(
nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='iso3c', destination='country.name', nomatch="not"),
nchar(Country)==3 & nchar(countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"))>3 ~ countrycode::countrycode(Country, origin='ioc', destination='country.name', nomatch="not"),
Country == "MWest Indies" ~ "West Indies",
Country == "Sri LankaE" ~ "Sri Lanka",
Country == "World XI" ~ "ICC World XI",
Country == "Bhm" ~ "Bahamas",
Country == "TCL" ~ "Turks and Caicos Islands",
Country == "SWA" ~ "Eswatini",
Country == "SKOR" ~ "South Korea",
Country == "Belg" ~ "Belgium",
Country == "CRT" ~ "Croatia",
Country == "CZK-R" ~ "Czechia",
Country == "Falk" ~ "Falkland Islands",
Country == "Fran" ~ "France",
Country == "GIBR" ~ "Gibraltar",
Country == "GUE" ~ "Guernsey",
Country == "IOM" ~ "Isle of Man",
Country == "JER" ~ "Jersey",
Country == "MYAN" ~ "Myanmar",
Country == "PORT" ~ "Portugal",
Country == "ROM" ~ "Romania",
Country == "RWN" ~ "Rwanda",
Country == "STHEL" ~ "St Helena",
Country == "TKY" ~ "Türkiye",
TRUE ~ Country # Keep other values unchanged
))
finished_countries <- as.data.frame(table(t20i_prepared$Country))
finished_heading <- finished_countries |>
dplyr::rename(Countries=Var1)
finished_countries_only <- finished_heading[, "Countries", drop=FALSE]
gt::gt(finished_countries_only)
| Countries |
|---|
| Afghanistan |
| Argentina |
| Australia |
| Austria |
| Bahamas |
| Bahrain |
| Bangladesh |
| Belgium |
| Belize |
| Bermuda |
| Bhutan |
| Botswana |
| Brazil |
| Bulgaria |
| Cambodia |
| Canada |
| Cayman Islands |
| Chile |
| China |
| Cook Islands |
| Costa Rica |
| Côte d’Ivoire |
| Croatia |
| Cyprus |
| Czechia |
| Denmark |
| England |
| Estonia |
| Eswatini |
| Falkland Islands |
| Fiji |
| Finland |
| France |
| Gambia |
| Germany |
| Ghana |
| Gibraltar |
| Greece |
| Guernsey |
| Hong Kong |
| Hungary |
| ICC World XI |
| India |
| Indonesia |
| Iran |
| Ireland |
| Isle of Man |
| Israel |
| Italy |
| Japan |
| Jersey |
| Kenya |
| Kuwait |
| Lesotho |
| Luxembourg |
| Malaysia |
| Maldives |
| Mali |
| Malta |
| Mexico |
| Mongolia |
| Mozambique |
| Myanmar |
| Namibia |
| Nepal |
| Netherlands |
| New Zealand |
| Nigeria |
| Norway |
| Oman |
| Pakistan |
| Panama |
| Papau New Guinea |
| Peru |
| Philippines |
| Portugal |
| Qatar |
| Romania |
| Rwanda |
| Samoa |
| Saudi Arabia |
| Scotland |
| Serbia |
| Seychelles |
| Singapore |
| Slovenia |
| South Africa |
| South Korea |
| Spain |
| Sri Lanka |
| St Helena |
| Suriname |
| Sweden |
| Switzerland |
| Tanzania |
| Thailand |
| Türkiye |
| Turks and Caicos Islands |
| Uganda |
| United Arab Emirates |
| United States of America |
| Vanuatu |
| West Indies |
| Zimbabwe |
Plotting International Runs Over Time
The t20i_prepared data frame can now be used to plot how many T20 International runs have been scored over time.
At first I thought I might do this by month but there are many months where no matches were held. I then thought of analyzing by quarter, but there were a couple of quarters that are missing as well.
This is quite understandable, since T20 International games only started in 2005 as an experimental and controversial new version of the game of cricket.
I settled on doing the analysis by year. This also has the advantage of being easier to follow and understand than a monthly or quarterly analysis anyway. So I used the summarize() function to convert the t20i_prepared data frame to the total_per_year data frame, which shows the total number of T20 International runs scored per year.
total_per_year <- t20i_prepared |>
dplyr::arrange(Year, Month, Day) |>
dplyr::summarize(
yearly_totals= sum(Runs, na.rm = TRUE),
.by=c(Year)
)
dplyr::glimpse(total_per_year)
Rows: 21
Columns: 2
$ Year <dbl> 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 20…
$ yearly_totals <int> 864, 2405, 10383, 6113, 13220, 17236, 5659, 21423, 14560…
I then used the ts() function to create a time series based on this data. A time series is a collection of observations of well-defined data items obtained through repeated measurements over time (definition taken from https://otexts.com/fpp2/ts-objects.html).
t20i_to_2025 <- ts(total_per_year, start=2005)
plot.ts(t20i_to_2025[,2], plot.type="single")
This plot is misleading: it creates the impression that there is a sudden drop-off in T20 runs in the year 2025. Since 2025 is still in progress, it cannot reasonably be compared with previous years.
Doing the plot again so that it ends with 2024 makes more sense.
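Besides passing end to ts(), a series can also be trimmed after it has been built by using window() from base R’s stats package. The sketch below uses the first nine yearly totals from the glimpse() output above, so its final year (2013) simply plays the role of an incomplete year that should be dropped.

```r
# First nine yearly totals from the glimpse() output above (2005 to 2013).
yearly_totals <- c(864, 2405, 10383, 6113, 13220, 17236, 5659, 21423, 14560)
runs_ts <- ts(yearly_totals, start = 2005)

# Drop the final year, as one would drop a year that is still in progress.
last_full_year <- end(runs_ts)[1] - 1
trimmed <- window(runs_ts, end = last_full_year)

time(trimmed)  # the remaining years, 2005 to 2012
```

The advantage of computing last_full_year rather than typing it is that the same code keeps working when the data is refreshed in a later year.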
t20i_to_2024 <- ts(total_per_year, start=2005, end=2024)
plot.ts(t20i_to_2024[,2], plot.type="single")
plot.ts() produces a very basic plot. I use the tidy() function from the broom package to convert the relevant data to a tibble, so that it can be plotted using the ggplot() function. Just a few basic additions make the plot more readable.
t20i_to_2024_tib <- broom::tidy(t20i_to_2024[,2])
ggplot_to_2024 <- ggplot2::ggplot(
t20i_to_2024_tib,
ggplot2::aes(x=index, y=value)
) +
ggplot2::geom_line() +
ggplot2::scale_x_continuous(limits=c(2005, 2024), breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(labels = scales::comma) +
ggplot2::labs(title = "Total T20 International Runs Per Year", x = "Year", y = "Runs Scored", caption = "Source: ESPNcricinfo")
ggplot_to_2024
As might be expected, dramatic growth is followed by a sharp drop-off during the pandemic in 2020. This then leads to an artificial surge after restrictions are lifted, presumably because teams are trying to make up for lost time. After a slight slump, the yearly total then surges dramatically again leading up to 2024.
Plotting Runs By Country
I used the summarize() function again to get the sum of runs for each country by each year. We can then use this information to plot runs by country instead of by year. I also summarized the number of players per country per year. See the section further below where I analyze the data on the number of players.
country_by_year <- t20i_prepared |>
dplyr::arrange(Year, Month, Day) |>
dplyr::summarize(
country_annual_runs=sum(Runs, na.rm = TRUE),
player_totals=dplyr::n_distinct(Player, na.rm=TRUE),
.by=c(Country, Year)
) |>
dplyr::filter(Year != 2025)
country_by_year
# A tibble: 642 × 4
Country Year country_annual_runs player_totals
<chr> <dbl> <int> <int>
1 Australia 2005 275 13
2 New Zealand 2005 291 16
3 England 2005 173 11
4 South Africa 2005 125 12
5 Australia 2006 385 15
6 South Africa 2006 410 22
7 New Zealand 2006 374 17
8 West Indies 2006 116 11
9 England 2006 281 16
10 Sri Lanka 2006 313 16
# ℹ 632 more rows
It will also be useful to get the overall total of runs scored by each country. This can then be used to determine which countries have contributed most.
country_sum <- t20i_prepared |>
dplyr::arrange(Year, Month, Day) |>
dplyr::filter(Year != 2025) |>
dplyr::summarize(
country_totals=sum(Runs, na.rm = TRUE),
.by=c(Country)
) |>
dplyr::arrange(desc(country_totals))
country_sum
# A tibble: 102 × 2
Country country_totals
<chr> <int>
1 India 36541
2 Pakistan 35192
3 West Indies 33847
4 New Zealand 31754
5 Sri Lanka 30547
6 Australia 29788
7 England 29423
8 South Africa 28357
9 Bangladesh 23822
10 Ireland 21975
# ℹ 92 more rows
I could have simply summarized the progress of the top ten countries. But there is very little difference between country 10, Ireland, and country 11, Zimbabwe.
There is, however, a sudden drop between country twelve, Afghanistan, and country thirteen, the United Arab Emirates: Afghanistan has over 3,000 more runs. So I decided to plot the top twelve countries.
Many of the remaining teams have not actually played that many games and many no longer exist.
So next I want to convey how great the contribution of the top twelve teams is compared to that of the remaining 90 teams. I have reversed the order of the columns to make it easier to follow.
sum_top_12 <- sum(country_sum[1:12, "country_totals"])
sum_remainder <- sum(country_sum[13:nrow(country_sum), "country_totals"])
top_vs_rest <- data.frame(
Category = c("Top 12 Teams", "Teams Ranked 13 to 102"),
Total = c(sum_top_12, sum_remainder)
)
top_12_compare <- ggplot2::ggplot(top_vs_rest, ggplot2::aes(x = forcats::fct_rev(Category), y = Total)) +
ggplot2::geom_col(width = 0.5) +
ggplot2::scale_y_continuous(labels = scales::comma) +
ggplot2::labs(title = "12 Top Scoring Teams Versus Remaining 90 Teams", x = "Team Groups", y = "Runs Scored", caption = "Source: ESPNcricinfo")
top_12_compare
I then filtered the runs of each of the top twelve teams and showed the total yearly runs of each team by facet. I ordered the facets from greatest to least total runs scored by setting factor levels. I could not find a way to calculate the levels, so I typed out the names of the countries in the order they should appear. Obviously it would be much better for R to calculate which countries are the current top 12.
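One possible way to avoid typing the names by hand is offered here as a sketch, not as the method used in this project. If a totals table is sorted in descending order, its first rows give the top teams, and that vector of names can serve both as the filter and as the factor levels. The example uses only the top five totals printed earlier, deliberately shuffled, and a hypothetical top_n_names() helper.

```r
# Top five totals copied from the country_sum output above, shuffled on purpose.
country_sum_demo <- data.frame(
  Country = c("Pakistan", "India", "Sri Lanka", "West Indies", "New Zealand"),
  country_totals = c(35192, 36541, 30547, 33847, 31754),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n highest-scoring teams, best first.
top_n_names <- function(totals, n) {
  ranked <- totals[order(totals$country_totals, decreasing = TRUE), ]
  head(ranked$Country, n)
}

top_3 <- top_n_names(country_sum_demo, 3)
top_3
# "India" "Pakistan" "West Indies"

# The same vector can then be used as factor levels, e.g. to order facets.
ordered_country <- factor(c("West Indies", "India", "Pakistan"), levels = top_3)
levels(ordered_country)
```

Since country_sum is already arranged from highest to lowest, the project’s equivalent could be as short as top_12_names <- head(country_sum$Country, 12), followed by dplyr::filter(Country %in% top_12_names) and factor(Country, levels = top_12_names). The same idea would apply to continent_sum$Continent when ordering the continent facets.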
top_12 <- country_by_year |>
dplyr::filter(Country %in% c(
"India", "Pakistan", "West Indies", "New Zealand", "Sri Lanka", "Australia",
"England", "South Africa", "Bangladesh", "Ireland", "Zimbabwe", "Afghanistan"
))
top_12_plot <- ggplot2::ggplot(top_12, ggplot2::aes(x=Year, y=country_annual_runs)) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(limits=c(2005, 2024), breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(labels = scales::comma) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 0.5, hjust = 1)) +
ggplot2::facet_wrap(~factor(Country, levels=c(
"India", "Pakistan", "West Indies", "New Zealand", "Sri Lanka", "Australia",
"England", "South Africa", "Bangladesh", "Ireland", "Zimbabwe", "Afghanistan"
))) +
ggplot2::theme(panel.spacing = ggplot2::unit(1, "lines")) +
ggplot2::labs(title = "Top Twelve T20 Countries: Runs Per Year", x = "Year", y = "Runs Scored", caption = "Source: ESPNcricinfo") +
ggplot2::theme(axis.text.x = ggplot2::element_text(vjust = 1))
top_12_plot
Plotting Runs By Continent
With over 100 countries, any attempt to plot each country individually would be overwhelming.
I considered different ways of grouping the countries by region and showing the runs scored by each region.
The more significant teams are arguably concentrated in five areas: South Asia, Western Europe, Southern Africa, Oceania, and North America.
In addition, there are “minnow teams” in a surprising number of other far-flung countries and territories.
I thought it could be useful to plot total runs per continent. The only drawback is deciding how to categorize one of the more significant teams, the West Indies. It is a single team whose members come from many Caribbean islands, considered to be part of North America, and from Guyana, a country on mainland South America.
I decided to try categorizing all national teams by continent using the countrycode package.
First I simply duplicated the Country column.
add_continents <- country_by_year |>
dplyr::mutate(Continent=Country,
.before=Country)
add_continents
# A tibble: 642 × 5
Continent Country Year country_annual_runs player_totals
<chr> <chr> <dbl> <int> <int>
1 Australia Australia 2005 275 13
2 New Zealand New Zealand 2005 291 16
3 England England 2005 173 11
4 South Africa South Africa 2005 125 12
5 Australia Australia 2006 385 15
6 South Africa South Africa 2006 410 22
7 New Zealand New Zealand 2006 374 17
8 West Indies West Indies 2006 116 11
9 England England 2006 281 16
10 Sri Lanka Sri Lanka 2006 313 16
# ℹ 632 more rows
I then used countrycode to convert most of the team names to the name of the continent that country or territory belongs to.
add_continents <- country_by_year |>
dplyr::mutate(Continent=Country,
.before=Country) |>
dplyr::mutate(Continent=dplyr::case_when(
nchar(countrycode::countrycode(Continent, origin='country.name', destination='continent', nomatch="not"))>3 ~ countrycode::countrycode(
Continent, origin='country.name', destination='continent', nomatch="not"),
TRUE ~ Continent # Keep other values unchanged
))
add_continents
# A tibble: 642 × 5
Continent Country Year country_annual_runs player_totals
<chr> <chr> <dbl> <int> <int>
1 Oceania Australia 2005 275 13
2 Oceania New Zealand 2005 291 16
3 England England 2005 173 11
4 Africa South Africa 2005 125 12
5 Oceania Australia 2006 385 15
6 Africa South Africa 2006 410 22
7 Oceania New Zealand 2006 374 17
8 West Indies West Indies 2006 116 11
9 England England 2006 281 16
10 Asia Sri Lanka 2006 313 16
# ℹ 632 more rows
I then just had to write a few lines of code to convert the three teams that were not recognized by countrycode. England and Scotland were not recognized because they both form part of the UK, and the West Indies was not recognized because it is an umbrella term.
It turned out that the countrycode package does not distinguish between North and South America, combining them into the single term “Americas”. I think this is because many islands and regions have strong associations with both North and South America. This meant there was no need to assign the West Indies to either North or South America; it could simply be considered part of the Americas. Similarly, there was no need to treat Team USA as belonging to a different region than the West Indies. This seemed very appropriate.
I filtered out results for the ICC World XI. This is a team of international all-stars that has not played many games and may not ever be convened again. These runs should not be assigned to a particular region.
add_continents <- country_by_year |>
dplyr::mutate(Continent=Country,
.before=Country) |>
dplyr::mutate(Continent=dplyr::case_when(
nchar(countrycode::countrycode(Continent, origin='country.name', destination='continent', nomatch="not"))>3 ~ countrycode::countrycode(
Continent, origin='country.name', destination='continent', nomatch="not"),
Continent == "West Indies" ~ "Americas",
Continent == "England" ~ "Europe",
Continent == "Scotland" ~ "Europe",
TRUE ~ Continent # Keep other values unchanged
)) |>
dplyr::filter(Continent!="ICC World XI")
add_continents
# A tibble: 640 × 5
Continent Country Year country_annual_runs player_totals
<chr> <chr> <dbl> <int> <int>
1 Oceania Australia 2005 275 13
2 Oceania New Zealand 2005 291 16
3 Europe England 2005 173 11
4 Africa South Africa 2005 125 12
5 Oceania Australia 2006 385 15
6 Africa South Africa 2006 410 22
7 Oceania New Zealand 2006 374 17
8 Americas West Indies 2006 116 11
9 Europe England 2006 281 16
10 Asia Sri Lanka 2006 313 16
# ℹ 630 more rows
Having grouped all of the teams by continent, I then created continent_by_year to provide the total number of runs per continent per year. I also summarized the annual number of players per continent. (See the next section for analysis of player numbers.)
continent_by_year <- add_continents |>
dplyr::summarize(
annual_runs=sum(country_annual_runs, na.rm = TRUE),
annual_players=sum(player_totals, na.rm = TRUE),
# annual_teams=dplyr::n_distinct(Country, na.rm=TRUE),
.by=c(Continent, Year)
)
continent_by_year
# A tibble: 98 × 4
Continent Year annual_runs annual_players
<chr> <dbl> <int> <int>
1 Oceania 2005 566 29
2 Europe 2005 173 11
3 Africa 2005 125 12
4 Oceania 2006 759 32
5 Africa 2006 528 33
6 Americas 2006 116 11
7 Europe 2006 281 16
8 Asia 2006 721 49
9 Oceania 2007 2424 39
10 Europe 2007 1354 37
# ℹ 88 more rows
I then used continent_sum to rank each continent by total overall runs scored in descending order.
continent_sum <- add_continents |>
dplyr::summarize(
sum_annual_runs=sum(country_annual_runs, na.rm = TRUE),
.by=c(Continent)
) |>
dplyr::arrange(desc(sum_annual_runs))
continent_sum
# A tibble: 5 × 2
Continent sum_annual_runs
<chr> <int>
1 Asia 289714
2 Europe 198321
3 Africa 133298
4 Oceania 80284
5 Americas 70930
I was then able to plot the runs per continent using facets. Once again I had to use levels to determine the order in which the continents appeared, although I would have preferred to use R to automatically calculate the rank of each continent.
continent_plot <- ggplot2::ggplot(continent_by_year, ggplot2::aes(x=Year, y=annual_runs)) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(limits=c(2005, 2024), breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(labels = scales::comma) +
ggplot2::facet_grid(~ factor(Continent, levels=c(
"Asia", "Europe", "Africa", "Oceania", "Americas"
))) +
ggplot2::theme(panel.spacing = ggplot2::unit(1, "lines")) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 0.5, hjust = 1)) +
ggplot2::labs(title = "T20 International: Runs By Continent", x = "Year", y = "Runs Scored", caption = "Source: ESPNcricinfo") +
ggplot2::theme(axis.text.x = ggplot2::element_text(vjust = 1))
continent_plot
Runs Per Continent Compared With Number of Players
Now that we have analyzed the number of runs, we can start to ask “chicken and egg” questions. Are there secondary factors that influence the number of runs? Our data includes data on the number of players. Do continents with more players generate more runs overall?
It’s difficult to compare the total numbers of runs and players directly. The run totals are far larger than the player totals, by a factor of roughly 20 to 100 in the Oceania figures below. Let’s use Oceania as an example.
compare_oceania <- continent_by_year |>
dplyr::filter(Continent=="Oceania")
compare_oceania
# A tibble: 20 × 4
Continent Year annual_runs annual_players
<chr> <dbl> <int> <int>
1 Oceania 2005 566 29
2 Oceania 2006 759 32
3 Oceania 2007 2424 39
4 Oceania 2008 871 36
5 Oceania 2009 2869 49
6 Oceania 2010 3791 43
7 Oceania 2011 1164 32
8 Oceania 2012 4125 49
9 Oceania 2013 1993 43
10 Oceania 2014 3006 51
11 Oceania 2015 1125 42
12 Oceania 2016 3781 57
13 Oceania 2017 2259 57
14 Oceania 2018 4586 45
15 Oceania 2019 7150 84
16 Oceania 2020 3119 42
17 Oceania 2021 6867 71
18 Oceania 2022 11426 115
19 Oceania 2023 7524 111
20 Oceania 2024 10879 128
Next we can try to plot totals of runs and players on a single plot. To do this, we first need to convert the data to long format. We will no longer have a wide format, where there is a single row for each year. Instead each value category will have a separate row for each year.
long_oceania_runs <- compare_oceania |>
tidyr::pivot_longer(
cols=c("annual_runs", "annual_players"),
names_to="value_categories",
values_to="annual_figures"
)
long_oceania_runs
# A tibble: 40 × 4
Continent Year value_categories annual_figures
<chr> <dbl> <chr> <int>
1 Oceania 2005 annual_runs 566
2 Oceania 2005 annual_players 29
3 Oceania 2006 annual_runs 759
4 Oceania 2006 annual_players 32
5 Oceania 2007 annual_runs 2424
6 Oceania 2007 annual_players 39
7 Oceania 2008 annual_runs 871
8 Oceania 2008 annual_players 36
9 Oceania 2009 annual_runs 2869
10 Oceania 2009 annual_players 49
# ℹ 30 more rows
We can now plot the two value categories side-by-side. But the plot doesn’t really make sense. The run totals are so much larger than the player totals that changes in player numbers are barely visible on the same axis. It is difficult to meaningfully compare these categories in this way.
long_oceania_runs_plot <-ggplot2::ggplot(long_oceania_runs,
ggplot2::aes(x = Year,
y = annual_figures,
col = value_categories)) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(labels = scales::comma) +
ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
ggplot2::labs(
x = "Year",
y = "Annual Totals",
col = "Growth Metrics"
)
long_oceania_runs_plot
One solution might be to provide a table or spreadsheet that simply shows the raw numbers. But there is another way to compare these different values.
We can plot these widely divergent figures on a logarithmic scale. Instead of a Y axis that increases by equal steps, we use a Y axis on which each equal step represents multiplication by the same factor. We can then meaningfully compare the number of players with the number of runs, because equal proportional changes in either measure cover the same vertical distance.
There are different bases that can be used to plot these figures logarithmically, but we don’t need to do a deep dive into the details. Let’s take long_oceania_runs_plot and change the Y axis so that both growth metrics are plotted on a log 10 scale. For more about logarithms in general and log 10 in particular, see here.
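To see why a log scale helps, it can be useful to look at the arithmetic directly. The short sketch below takes Oceania’s 2005 figures from the table above and shows that, after a log 10 transform, doubling either quantity moves it by the same distance, log10(2) or about 0.30, however large the starting value is.

```r
runs_2005    <- 566
players_2005 <- 29

# On a linear axis, doubling moves runs by 566 units but players by only 29.
linear_shift <- c(runs    = 2 * runs_2005 - runs_2005,
                  players = 2 * players_2005 - players_2005)

# On a log10 axis, doubling moves both by the same amount: log10(2).
log_shift <- c(runs    = log10(2 * runs_2005) - log10(runs_2005),
               players = log10(2 * players_2005) - log10(players_2005))

linear_shift
log_shift
```

This is what lets two series of very different sizes share one axis: the plot compares proportional growth, not absolute differences.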
Then we can plot the results. This shows an overall correlation between the number of runs and the number of players. Both metrics usually increase or decrease together each year, even if the proportions are somewhat different. But there are also exceptions. The metrics diverge in 2010, 2017, and 2018.
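Divergences can also be checked numerically rather than by eye. The sketch below is one possible approach, assuming that "diverge" means one metric rose while the other fell compared with the previous year. It uses only the first five Oceania years shown in the output above, and the names oceania_sample, oceania_changes, and diverges are hypothetical.

```r
# First five Oceania years, copied from the long_oceania_runs output above
oceania_sample <- data.frame(
  Year           = 2005:2009,
  annual_runs    = c(566, 759, 2424, 871, 2869),
  annual_players = c(29, 32, 39, 36, 49)
)

# Year-over-year change in each metric (NA for the first year)
oceania_changes <- oceania_sample |>
  dplyr::mutate(
    runs_change    = annual_runs - dplyr::lag(annual_runs),
    players_change = annual_players - dplyr::lag(annual_players),
    # Assumed definition: TRUE when one metric rose while the other fell
    diverges       = sign(runs_change) * sign(players_change) < 0
  )

oceania_changes
```

None of these five sample years is flagged, which is consistent with the divergences for Oceania falling in later years. Running the same steps on the full wide data for a continent would list every year in which the two metrics moved in opposite directions.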
oceania_log_plot <- ggplot2::ggplot(
long_oceania_runs,
ggplot2::aes(x = Year,
y = annual_figures,
col = value_categories
)
) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(
trans = "log10",
labels = scales::comma
) +
ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
ggplot2::labs(
title="Oceania: Comparative Increase of Runs and Number of Players",
x = "Year",
y = "Annual Totals",
col = "Growth Metrics",
caption = "Source: ESPNcricinfo"
)
oceania_log_plot
Let’s try this with Asia. First we need to take the data and make it long.
compare_asia <- continent_by_year |>
dplyr::filter(Continent=="Asia") |>
tidyr::pivot_longer(
cols=c("annual_runs", "annual_players"),
names_to="value_categories",
values_to="annual_figures"
)
compare_asia
# A tibble: 38 × 4
Continent Year value_categories annual_figures
<chr> <dbl> <chr> <int>
1 Asia 2006 annual_runs 721
2 Asia 2006 annual_players 49
3 Asia 2007 annual_runs 4224
4 Asia 2007 annual_players 61
5 Asia 2008 annual_runs 1406
6 Asia 2008 annual_players 60
7 Asia 2009 annual_runs 5402
8 Asia 2009 annual_players 75
9 Asia 2010 annual_runs 5805
10 Asia 2010 annual_players 97
# ℹ 28 more rows
We see an overall correlation again, but the metrics diverge in 2018 and 2023.
asia_log_plot <- ggplot2::ggplot(compare_asia,
ggplot2::aes(x = Year,
y = annual_figures,
col = value_categories)) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(breaks = c(2006, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(
trans = "log10",
labels = scales::comma
) +
ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
ggplot2::labs(
title="Asia: Comparative Increase of Runs and Number of Players",
x = "Year",
y = "Annual Totals",
col = "Growth Metrics",
caption = "Source: ESPNcricinfo"
)
asia_log_plot
Now we can analyze the results for Africa.
compare_africa <- continent_by_year |>
dplyr::filter(Continent=="Africa") |>
tidyr::pivot_longer(
cols=c("annual_runs", "annual_players"),
names_to="value_categories",
values_to="annual_figures"
)
compare_africa
# A tibble: 40 × 4
Continent Year value_categories annual_figures
<chr> <dbl> <chr> <int>
1 Africa 2005 annual_runs 125
2 Africa 2005 annual_players 12
3 Africa 2006 annual_runs 528
4 Africa 2006 annual_players 33
5 Africa 2007 annual_runs 1627
6 Africa 2007 annual_players 49
7 Africa 2008 annual_runs 1088
8 Africa 2008 annual_players 42
9 Africa 2009 annual_runs 1800
10 Africa 2009 annual_players 25
# ℹ 30 more rows
Again we see an overall correlation, but with divergences in 2009 (runs rose from 1,088 to 1,800 while the number of players fell from 42 to 25) and 2016.
africa_log_plot <- ggplot2::ggplot(
compare_africa,
ggplot2::aes(x = Year,
y = annual_figures,
col = value_categories
)
) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(
trans = "log10",
labels = scales::comma
) +
ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
ggplot2::labs(
title="Africa: Comparative Increase of Runs and Number of Players",
x = "Year",
y = "Annual Totals",
col = "Growth Metrics",
caption = "Source: ESPNcricinfo"
)
africa_log_plot
Now we can take a look at the results from Europe.
compare_europe <- continent_by_year |>
dplyr::filter(Continent=="Europe") |>
tidyr::pivot_longer(
cols=c("annual_runs", "annual_players"),
names_to="value_categories",
values_to="annual_figures"
)
compare_europe
# A tibble: 40 × 4
Continent Year value_categories annual_figures
<chr> <dbl> <chr> <int>
1 Europe 2005 annual_runs 173
2 Europe 2005 annual_players 11
3 Europe 2006 annual_runs 281
4 Europe 2006 annual_players 16
5 Europe 2007 annual_runs 1354
6 Europe 2007 annual_players 37
7 Europe 2008 annual_runs 1429
8 Europe 2008 annual_players 51
9 Europe 2009 annual_runs 2150
10 Europe 2009 annual_players 63
# ℹ 30 more rows
There is an overall correlation, with divergences in 2010, 2013, and 2014.
europe_log_plot <- ggplot2::ggplot(
compare_europe,
ggplot2::aes(x = Year,
y = annual_figures,
col = value_categories
)
) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(breaks = c(2005, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(
trans = "log10",
labels = scales::comma
) +
ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
ggplot2::labs(
title="Europe: Comparative Increase of Runs and Number of Players",
x = "Year",
y = "Annual Totals",
col = "Growth Metrics",
caption = "Source: ESPNcricinfo"
)
europe_log_plot
And lastly, the Americas.
compare_americas <- continent_by_year |>
dplyr::filter(Continent=="Americas") |>
tidyr::pivot_longer(
cols=c("annual_runs", "annual_players"),
names_to="value_categories",
values_to="annual_figures"
)
compare_americas
# A tibble: 38 × 4
Continent Year value_categories annual_figures
<chr> <dbl> <chr> <int>
1 Americas 2006 annual_runs 116
2 Americas 2006 annual_players 11
3 Americas 2007 annual_runs 754
4 Americas 2007 annual_players 17
5 Americas 2008 annual_runs 1319
6 Americas 2008 annual_players 57
7 Americas 2009 annual_runs 999
8 Americas 2009 annual_players 23
9 Americas 2010 annual_runs 1686
10 Americas 2010 annual_players 37
# ℹ 28 more rows
There is an overall correlation, with divergences in 2013, 2016, and 2023.
americas_log_plot <- ggplot2::ggplot(compare_americas,
ggplot2::aes(x = Year,
y = annual_figures,
col = value_categories)) +
ggplot2::geom_line() +
ggplot2::geom_point() +
ggplot2::scale_x_continuous(breaks = c(2006, 2010, 2015, 2020, 2024)) +
ggplot2::scale_y_continuous(
trans = "log10",
labels = scales::comma
) +
ggplot2::scale_color_discrete(labels=c("Number of Players", "Runs Scored" )) +
ggplot2::labs(
title="Americas: Comparative Increase of Runs and Number of Players",
x = "Year",
y = "Annual Totals",
col = "Growth Metrics",
caption = "Source: ESPNcricinfo"
)
americas_log_plot
This little experiment seems to confirm what common sense might tell us. Total player numbers are not an absolute predictor of runs scored.
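How closely the two metrics track each other could also be put into a single number per continent. The following is only a sketch on a small sample: it uses the first five years of the Oceania and Africa outputs shown earlier, and the names sample_years, sample_cor, and log_correlation are hypothetical. Using the Pearson correlation of the log10 values is my own choice for illustration, not a method used in the original analysis.

```r
# First five years for two continents, copied from the outputs above
sample_years <- data.frame(
  Continent      = rep(c("Oceania", "Africa"), each = 5),
  Year           = rep(2005:2009, times = 2),
  annual_runs    = c(566, 759, 2424, 871, 2869, 125, 528, 1627, 1088, 1800),
  annual_players = c(29, 32, 39, 36, 49, 12, 33, 49, 42, 25)
)

# Pearson correlation of the log10 values, computed separately per continent
sample_cor <- sample_years |>
  dplyr::group_by(Continent) |>
  dplyr::summarise(
    log_correlation = cor(log10(annual_runs), log10(annual_players)),
    .groups = "drop"
  )

sample_cor
```

A value close to 1 would mean the two lines rise and fall together almost perfectly, while a lower value would reflect years in which they diverge. With only five years per continent, these numbers are illustrative at best.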
Runs scored are probably determined by a combination of these overlapping factors:
Talent pipeline
Financial resources
Culture of excellence (success breeding success)
Overall popularity of cricket in a particular region
However, the number of players is an easily quantifiable factor that serves as a good proxy for these harder-to-measure influences.
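As a side note on the code itself, the five continent sections above repeat the same pivot-and-plot steps, so they could be wrapped in one function. The sketch below assumes that continent_by_year has the columns Continent, Year, annual_runs, and annual_players, as the outputs above suggest. The function name plot_continent_log, its arguments, and the toy data frame are hypothetical, and the per-continent X axis breaks are left out for brevity.

```r
# Hypothetical helper: pivot one continent to long form and draw the log10 plot
plot_continent_log <- function(data, continent_name) {
  long_data <- data |>
    dplyr::filter(Continent == continent_name) |>
    tidyr::pivot_longer(
      cols = c("annual_runs", "annual_players"),
      names_to = "value_categories",
      values_to = "annual_figures"
    )

  ggplot2::ggplot(
    long_data,
    ggplot2::aes(x = Year, y = annual_figures, col = value_categories)
  ) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::scale_y_continuous(trans = "log10", labels = scales::comma) +
    ggplot2::scale_color_discrete(labels = c("Number of Players", "Runs Scored")) +
    ggplot2::labs(
      title = paste0(continent_name, ": Comparative Increase of Runs and Number of Players"),
      x = "Year",
      y = "Annual Totals",
      col = "Growth Metrics",
      caption = "Source: ESPNcricinfo"
    )
}

# Toy input with the assumed columns (values from the Europe output above)
toy_continent_by_year <- data.frame(
  Continent      = "Europe",
  Year           = 2005:2009,
  annual_runs    = c(173L, 281L, 1354L, 1429L, 2150L),
  annual_players = c(11L, 16L, 37L, 51L, 63L)
)

europe_toy_plot <- plot_continent_log(toy_continent_by_year, "Europe")
```

With the real data, a call such as plot_continent_log(continent_by_year, "Asia") would stand in for the separate pivot and plot blocks for that continent.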