Producing charts for ‘The Changing Fortunes of the Richest Countries in Grand Slam Tennis’
By: Dr. Chris Martin
Tools / packages used: R, R Markdown,
ggplot2, tidyverse (inc. dplyr and
tidyr), plotly.
Techniques used: exploratory data analysis,
functional programming (purrr package), data visualisation,
data cleaning/reshaping/manipulation.
Chart types used: area chart, line chart, bar chart,
stacked bar chart, small multiples, heatmap, ridge chart, interactive
charts (with ggplotly).
Source data: To produce the charts, I needed data on the women’s and men’s singles entrants for each Grand Slam tournament since 1990. This came from the excellent Tennis Abstract.
This notebook produces the static data visualisations which features in my data storytelling project: The Changing Fortunes of the Richest Countries in Grand Slam Tennis. You can read the full story on my website. This notebook uses data processed, clean and explored in another notebook.
A note on my data visualistion workflow
The chart produced in this notebook are ‘skeletons’ with fairly minimal styling, but all the key structural components in places. The chart are exported from this notebook as svgs. These can are then edited - adding textures, photos, annotations etc. - using graphic design software to create the final versions.
Some of the charts in this notebook made into the final data story, others where experiments that ultimately did no lead to a final chart. Some charts in this notebook are missing axis labels, titles etc. this is because it made the charts easier to edit further down the line in graphic design software.
Setting up the notebook
# import packages
library(tidyverse) # for data manipulation and viz
library(knitr) # for formatting tables
library(kableExtra) # for formatting tables
# set default theme for exploratory plots
theme_set(theme_light()) # using a minimal theme to make it easier to edit
# the plots in graphic design software later on
# set default R markdown chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# output the first lines of a dataframe in a nice format
scrollable_table <- function(df){
df %>%
kable("html") %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px",
fixed_thead = list(enabled = T, background = "grey90"))
}Reading in the data
The data is read in from csvs produced in
data_clean.Rmd, a lot of the data preparation was done in
that notebook.
gs_first_round_gdp <- read_csv("../data/results_gdp.csv") %>%
#just look up until covid pandemic (as will have distorted things)
filter(year < 2020)
# check data looks as expected
gs_first_round_gdp %>%
head() %>%
kable()| year | tourney_name | tour | name | id | ioc | country | gdp_per_capita | iso |
|---|---|---|---|---|---|---|---|---|
| 1990 | Australian Open | atp | Jim Pugh | 101004 | USA | United States | 40436.94 | USA |
| 1990 | Australian Open | atp | Ivan Lendl | 100656 | USA | United States | 40436.94 | USA |
| 1990 | Australian Open | atp | Cyril Suk | 101327 | CZE | Czechia | 23585.18 | CZE |
| 1990 | Australian Open | atp | Tomas Carbonell | 101507 | ESP | Spain | 27543.92 | ESP |
| 1990 | Australian Open | atp | Michael Brown B395 | 101895 | AUS | Australia | 31016.42 | AUS |
| 1990 | Australian Open | atp | Karel Novacek | 101120 | CZE | Czechia | 23585.18 | CZE |
gs_entries_by_country <- read_csv("../data/gs_entries_by_country.csv") %>%
# just look up until covid pandemic (as will have distorted things)
filter(year < 2020)
# check data looks as expected
gs_entries_by_country %>%
head() %>%
kable()| year | tourney_name | country_code | country | gdp_per_capita | num_first_rd | income_decile | top_20_perc |
|---|---|---|---|---|---|---|---|
| 1990 | Australian Open | USA | United States | 40436.94 | 55 | 10 | TRUE |
| 1990 | Australian Open | AUS | Australia | 31016.42 | 36 | 9 | TRUE |
| 1990 | Australian Open | FRA | France | 33732.02 | 20 | 9 | TRUE |
| 1990 | Australian Open | GER | Germany | 36699.48 | 20 | 9 | TRUE |
| 1990 | Australian Open | SWE | Sweden | 34156.82 | 14 | 9 | TRUE |
| 1990 | Australian Open | CZE | Czechia | 23585.18 | 10 | 8 | FALSE |
How have the richest 20% of countries performed at Grand Slams (1990 - 2019)?
This section focuses on how well the richest 20% of countries performed at Grand Slams (1990 - 2019). The metric used for a country’s performance is how many players they had appearing in the first round of Grand Slam tennis tournaments.
The overall trend
The performance of the richest countries declined 1900 to 2008, and then picked up again a little.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
plot_df <- gs_entries_by_country %>%
# looked performance for two groups of countries
# top 20% richest countries and the rest
mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc)) %>%
group_by(year, top_20_perc) %>%
summarise(num_first_rd = sum(num_first_rd)) %>%
ungroup() %>%
# calculate proportions from counts
group_by(year) %>%
mutate(perc_first_round = num_first_rd / sum(num_first_rd)) %>%
ungroup()
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df) +
# core chart
geom_area(aes(year, perc_first_round, fill = top_20_perc)) +
# format axis
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
# tidy up presentation
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25))
p # ----------------------------------------------------------------------------
# Export the plot for editing
# ----------------------------------------------------------------------------
ggsave("../images/all_gs.svg", units = "mm", width = 600, height = 325)Is the overall trend driven by the composition of 20% richest countries changing over time?
When I saw showing early versions of the visualisation to members of the target audience, a question came up. Are the trends indentified a result of the countries in the top 20% richest changing? In this section of the notebook I reshape so I can produce a chart that shows that this isn’t drivnig the trend.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# count number of first round appearances per country per year
top_twenty_countries <- gs_entries_by_country %>%
filter(top_20_perc) %>%
count(year, country, wt = num_first_rd) %>%
arrange(year, desc(n)) %>%
group_by(country) %>%
mutate(country_ave_n = mean(n)) %>%
ungroup()
# average number of appearances per country each year (1990 - 2019)
country_ave_ns <- top_twenty_countries %>%
distinct(country, country_ave_n)
# Count how many time the countries appear in the top twenty percent
top_twenty_counts <- top_twenty_countries %>%
count(country) %>%
left_join(country_ave_ns) %>%
arrange(desc(n), desc(country_ave_n)) %>%
rename(years_top_20 = n)
# this will be useful later for ordering the countriesin the plot
levels <- top_twenty_counts$country
# create a grid to see where countries have moved out of top 20 perc
grid <- expand_grid(year = unique(top_twenty_countries$year),
country = unique(top_twenty_countries$country))
# identify implicitly missing data
plot_df <- grid %>%
left_join(top_twenty_countries)
# identify countries in bottom 80%
bottom_80_countries_by_year <- gs_entries_by_country %>%
distinct(year, country, income_decile) %>%
filter(income_decile < 9)
# function for checking if a country is the bottom 80% in a given year
check_bottom_80 <- function(year, country){
selector <- bottom_80_countries_by_year$year == year &
bottom_80_countries_by_year$country == country
res <- bottom_80_countries_by_year[selector, ]
if(nrow(res) == 0){
return(FALSE)
}
else {
return(res[[1,"income_decile"]] < 9)
}
}
# quick test of check_bottom_80()
check_bottom_80(1990, "Nigeria")## [1] TRUE
# update plotting dataframe with variable recording if a country is the bottom 80%
# in a given year
plot_df_1 <- plot_df %>%
mutate(in_bottom_80 = map2_lgl(year, country, ~check_bottom_80(.x,.y)),
n = if_else(is.na(n) & in_bottom_80,
-1, n),
n = replace_na(n, 0),
bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 10, 50, 100, Inf)))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df_1) +
# core plot
geom_tile(aes(x = year,
y = factor(country, levels = rev(levels)),
fill = bin_n),
colour = "#F8F7F7") +
# plot text
labs(x= NULL, y = NULL) +
guides(fill = guide_legend(reverse=TRUE)) +
# colours
scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7',
'#9f9dce', '#6e70b6', '#35469d')) +
# simplify plot for editting
scale_x_continuous(position = "top") +
coord_equal() +
theme_minimal()
p# output plot for editing
ggsave("../images/image_4.svg")
# a quick test to confirm that the plot makes sense
gs_entries_by_country %>%
filter(country == "Israel") %>%
distinct(year, income_decile) %>%
scrollable_table()| year | income_decile |
|---|---|
| 1990 | 8 |
| 1991 | 8 |
| 1992 | 8 |
| 1993 | 8 |
| 1994 | 8 |
| 1995 | 8 |
| 1996 | 8 |
| 1997 | 8 |
| 1998 | 8 |
| 1999 | 8 |
| 2000 | 8 |
| 2001 | 8 |
| 2002 | 8 |
| 2003 | 8 |
| 2004 | 8 |
| 2005 | 8 |
| 2006 | 8 |
| 2007 | 8 |
| 2008 | 8 |
| 2009 | 8 |
| 2010 | 8 |
| 2011 | 8 |
| 2012 | 8 |
| 2013 | 9 |
| 2014 | 9 |
| 2015 | 9 |
| 2016 | 9 |
| 2017 | 9 |
| 2018 | 9 |
The decline of the richest countries
Having established the overall trend is not an artifact of countries moving in and out the richest 20%, this section looks in detail at decline of the richest countries performance 1990-2008.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# simplify representation of missing data
gs_entries_by_country_clean <- gs_entries_by_country %>%
mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc))
# count first round appearance for all countries by year
all_gs_entries_by_country <- gs_entries_by_country_clean %>%
group_by(year, country) %>%
summarise(num_first_rd_year = sum(num_first_rd)) %>%
ungroup() %>%
left_join(distinct(gs_entries_by_country_clean, country,
year, top_20_perc)) %>%
group_by(year) %>%
mutate(perc_first_round = num_first_rd_year / sum(num_first_rd_year)) %>%
ungroup()
# output to check counts make sense for top 20 vs bottom 80
all_gs_entries_by_country %>%
group_by(year, top_20_perc) %>%
summarise(perc_first_round = sum(perc_first_round)) %>%
scrollable_table()| year | top_20_perc | perc_first_round |
|---|---|---|
| 1990 | FALSE | 0.2183575 |
| 1990 | TRUE | 0.7816425 |
| 1991 | FALSE | 0.2222222 |
| 1991 | TRUE | 0.7777778 |
| 1992 | FALSE | 0.2280702 |
| 1992 | TRUE | 0.7719298 |
| 1993 | FALSE | 0.2400389 |
| 1993 | TRUE | 0.7599611 |
| 1994 | FALSE | 0.2725509 |
| 1994 | TRUE | 0.7274491 |
| 1995 | FALSE | 0.2575316 |
| 1995 | TRUE | 0.7424684 |
| 1996 | FALSE | 0.2810078 |
| 1996 | TRUE | 0.7189922 |
| 1997 | FALSE | 0.3071705 |
| 1997 | TRUE | 0.6928295 |
| 1998 | FALSE | 0.3166023 |
| 1998 | TRUE | 0.6833977 |
| 1999 | FALSE | 0.3496638 |
| 1999 | TRUE | 0.6503362 |
| 2000 | FALSE | 0.3563107 |
| 2000 | TRUE | 0.6436893 |
| 2001 | FALSE | 0.3578337 |
| 2001 | TRUE | 0.6421663 |
| 2002 | FALSE | 0.3972868 |
| 2002 | TRUE | 0.6027132 |
| 2003 | FALSE | 0.4046467 |
| 2003 | TRUE | 0.5953533 |
| 2004 | FALSE | 0.4078695 |
| 2004 | TRUE | 0.5921305 |
| 2005 | FALSE | 0.4181818 |
| 2005 | TRUE | 0.5818182 |
| 2006 | FALSE | 0.4306358 |
| 2006 | TRUE | 0.5693642 |
| 2007 | FALSE | 0.4324324 |
| 2007 | TRUE | 0.5675676 |
| 2008 | FALSE | 0.4613900 |
| 2008 | TRUE | 0.5386100 |
| 2009 | FALSE | 0.4434698 |
| 2009 | TRUE | 0.5565302 |
| 2010 | FALSE | 0.4472656 |
| 2010 | TRUE | 0.5527344 |
| 2011 | FALSE | 0.4469328 |
| 2011 | TRUE | 0.5530672 |
| 2012 | FALSE | 0.4316406 |
| 2012 | TRUE | 0.5683594 |
| 2013 | FALSE | 0.4208984 |
| 2013 | TRUE | 0.5791016 |
| 2014 | FALSE | 0.4287109 |
| 2014 | TRUE | 0.5712891 |
| 2015 | FALSE | 0.4345703 |
| 2015 | TRUE | 0.5654297 |
| 2016 | FALSE | 0.4394531 |
| 2016 | TRUE | 0.5605469 |
| 2017 | FALSE | 0.4003906 |
| 2017 | TRUE | 0.5996094 |
| 2018 | FALSE | 0.3857422 |
| 2018 | TRUE | 0.6142578 |
| 2019 | FALSE | 0.3984375 |
| 2019 | TRUE | 0.6015625 |
# focus on top 20 entries
gs_entries_top_20_perc <- all_gs_entries_by_country %>%
filter(top_20_perc)
# identify implicitly missing data
grid <- expand_grid(year = unique(gs_entries_top_20_perc$year),
country = unique(gs_entries_top_20_perc$country))
# create variables needed for plotting
plot_df <- grid %>%
left_join(gs_entries_top_20_perc) %>%
mutate(perc_first_round = replace_na(perc_first_round, 0)) %>%
filter(year <= 2008) %>%
mutate(is_usa = country == "United States")
# calculate change in performance (per country)
# compares number of first round appearance 1990 to 2008
# for grouping, ordering and colouring countries in the area chart
change_df <- plot_df %>%
filter(year == max(year) | year == min(year)) %>%
select(year, country, perc_first_round) %>%
pivot_wider(names_from = year,
values_from = perc_first_round,
values_fill = 0) %>%
mutate(change = `2008` - `1990`,
fall = change < 0,
change_bin = cut(change,
breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>%
arrange(change)
# add in change in performance to dataframe for plotting
plot_df_1 <- plot_df %>%
left_join(change_df) %>%
group_by(country) %>%
mutate(ave_num_first_round = mean(num_first_rd_year, na.rm = TRUE)) %>%
ungroup() %>%
arrange(fall, desc(change)) %>%
mutate(country = factor(country, levels = unique(country)))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df_1) +
# core plot
geom_area(aes(year, perc_first_round,
group = country,
fill = change_bin),
colour = "grey80", size = 0.2) +
# colours
scale_fill_manual(values = c("#c94a54", "#efb3aa",
"#fffff0", "#aeabcd")) +
# format axis
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
breaks = c(seq(0,0.8,0.1)),
expand = expansion(mult = c(0, .1))) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2008)) +
# simplify plot for editting
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25))
p# output plot for editting
ggsave("../images/image_5.svg", units = "mm", width = 525, height = 350)
# output plot as interactive for inspection
plotly::ggplotly(p)# ----------------------------------------------------------------------------
# Get the information needed to annotate the plot
# ----------------------------------------------------------------------------
levels(plot_df_1$change_bin)## [1] "(-Inf,-0.1]" "(-0.1,-0.02]" "(-0.02,0.02]" "(0.02,0.1]" "(0.1, Inf]"
# get data points for plot annotation
plot_df_1 %>%
filter(country %in% c("United States", "Australia", "Sweden")) %>%
group_by(year) %>%
summarise(perc_first_round_tot = sum(perc_first_round)) %>%
scrollable_table()| year | perc_first_round_tot |
|---|---|
| 1990 | 0.3951691 |
| 1991 | 0.3508772 |
| 1992 | 0.3255361 |
| 1993 | 0.3323615 |
| 1994 | 0.3064985 |
| 1995 | 0.2847425 |
| 1996 | 0.2635659 |
| 1997 | 0.2364341 |
| 1998 | 0.2326255 |
| 1999 | 0.2199808 |
| 2000 | 0.2165049 |
| 2001 | 0.1953578 |
| 2002 | 0.1841085 |
| 2003 | 0.1955470 |
| 2004 | 0.1794626 |
| 2005 | 0.1655502 |
| 2006 | 0.1512524 |
| 2007 | 0.1563707 |
| 2008 | 0.1389961 |
plot_df_1 %>%
filter(country == "United States") %>%
group_by(year) %>%
summarise(perc_first_round_tot = sum(perc_first_round)) %>%
scrollable_table()| year | perc_first_round_tot |
|---|---|
| 1990 | 0.2531401 |
| 1991 | 0.2261209 |
| 1992 | 0.2183236 |
| 1993 | 0.2157434 |
| 1994 | 0.1920466 |
| 1995 | 0.1827017 |
| 1996 | 0.1511628 |
| 1997 | 0.1395349 |
| 1998 | 0.1341699 |
| 1999 | 0.1325648 |
| 2000 | 0.1281553 |
| 2001 | 0.1179884 |
| 2002 | 0.1220930 |
| 2003 | 0.1316554 |
| 2004 | 0.1199616 |
| 2005 | 0.1090909 |
| 2006 | 0.1001927 |
| 2007 | 0.0994208 |
| 2008 | 0.0907336 |
# get players from a country in a given year (for annotation)
get_players <- function(country_str, year_int){
gs_first_round_gdp %>%
filter(country == country_str & year == year_int) %>%
distinct(name)
}
get_players("United States", 1990) %>%
scrollable_table()| name |
|---|
| Jim Pugh |
| Ivan Lendl |
| Tim Wilkison |
| Todd Witsken |
| Glenn Layendecker |
| Jimmy Brown |
| John McEnroe |
| Dan Goldie |
| Leif Shiras |
| Richey Reneberg |
| Ronald Agenor |
| Tim Mayotte |
| Pete Sampras |
| Aaron Krickstein |
| David Wheaton |
| Jimmy Arias |
| Jim Courier |
| Paul Chamberlin |
| Paul Annacone |
| Kelly Jones |
| Scott Davis |
| Lawson Duncan |
| Jay Berger |
| Andre Agassi |
| Michael Chang |
| Richard Matuszewski |
| Derrick Rostagno |
| Malivai Washington |
| Jim Grabb |
| Joey Rive |
| Bryan Shelton |
| David Pate |
| Michael Robertson |
| Brian Garrow |
| Brad Pearce |
| Rick Leach |
| Jeff Tarango |
| Kevin Curren |
| Brad Gilbert |
| Ken Flach |
| Martin Blackman |
| David Witt |
| Ivan Baron |
| Chris Garner |
| Tommy Ho |
| Patrick McEnroe |
| Robert Seguso |
| Steve Bryan |
| Todd Martin |
| Carrie Cunningham |
| Erika De Lone |
| Kimberly Kessaris |
| Shaun Stafford |
| Elise Burgin |
| Ronni Reis |
| Patty Fendick |
| Audra Keller |
| Stacey Martin |
| Pam Shriver |
| Lori Mcneil |
| Louise Allen |
| Jill Smoller |
| Mary Joe Fernandez |
| Robin White |
| Kathy Rinaldi Stunkel |
| Marianne Werdel Witmeyer |
| Ann Henricksson |
| Donna Faber |
| Andrea Leand |
| Camille Benjamin |
| Rosalyn Fairbank |
| Beverly Bowes |
| Heather Ludloff |
| Katrina Adams |
| Zina Garrison |
| Cammy Macgregor |
| Laxmi Poruri |
| Tami Whitlinger Jones |
| Terry Phelps |
| Gigi Fernandez |
| Halle Cioffi |
| Ann Grossman |
| Amy Frazier |
| Jennifer Santrock |
| Hu Na |
| Susan Sloane Lundy |
| Linda Wild |
| Betsy Nagelsen |
| Gretchen Magers |
| Jennifer Capriati |
| Monica Seles |
| Meredith Mcgrath |
| Kathy Jordan |
| Wendy White |
| Mareen Louie Harper |
| Mary Lou Piatek |
| Anna Ivan |
| Anne Smith |
| Martina Navratilova |
| Andrea Farley |
| Lisa Raymond |
| Chanda Rubin |
| Caroline Kuhlman |
| Sandy Collins |
| Eleni Rossides |
| Debbie Graham |
get_players("United States", 2008) %>%
scrollable_table()| name |
|---|
| Sam Querrey |
| Vincent Spadea |
| Donald Young |
| Robby Ginepri |
| Scoville Jenkins |
| Mardy Fish |
| James Blake |
| Bobby Reynolds |
| Wayne Odesnik |
| John Isner |
| Kevin Kim |
| Andy Roddick |
| Ryler Deheart |
| Brendan Evans |
| Ryan Sweeting |
| Robert Kendrick |
| Austin Krajicek |
| Sam Warburg |
| Michael Russell |
| Rajeev Ram |
| Julie Ditty |
| Jill Craybas |
| Lindsay Davenport |
| Ashley Harkleroad |
| Lilia Osterloh |
| Madison Brengle |
| Serena Williams |
| Venus Williams |
| Meilen Tu |
| Vania King |
| Laura Granville |
| Bethanie Mattek Sands |
| Melanie Oudin |
| Kristie Ahn |
| Gail Brodsky |
| Jamea Jackson |
| Alexa Glatch |
| Asia Muhammad |
| Ahsha Rolle |
| Shenay Perry |
| Coco Vandeweghe |
The mini resurgence of the richest countries
After 2008 there is an improvement in the performance of the richest countries. This chunk of code produces a plot that focusses in on this trend.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# focus on 2009 onwards and find implicitly missing data
plot_df <- grid %>%
left_join(gs_entries_top_20_perc) %>%
mutate(perc_first_round = replace_na(perc_first_round, 0)) %>%
filter(year > 2008)
# calculate change in performance of coutries between 2009 and 2019
change_df <- plot_df %>%
filter(year == max(year) | year == min(year)) %>%
select(year, country, perc_first_round) %>%
pivot_wider(names_from = year,
values_from = perc_first_round,
values_fill = 0) %>%
mutate(change = `2019` - `2009`,
fall = change < 0,
# create bins for grouping countries
change_bin = cut(change,
breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>%
arrange(change)
# define countries of particular interest for in plot
countries_of_int <- c("United States", "France", "Sweden", "Australia", "Spain")
# group all the other (not of specific interest) countries
plot_df_1 <- plot_df %>%
left_join(change_df) %>%
mutate(country = if_else(country %in% countries_of_int,
country,
"Other")) %>%
group_by(country, year) %>%
summarise(perc_first_round = sum(perc_first_round)) %>%
ungroup()
# define order of the facets within the plot
facet_order <- c("United States", "Australia", "Sweden", "France", "Spain", "Other")
# create annotation layer for plot
annotation_df <- plot_df_1 %>%
mutate(label = round(perc_first_round * 100, 1),
num_appearances = round(perc_first_round * 256)) %>%
filter(year == max(plot_df_1$year)|
year == min(plot_df_1$year))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df_1,
aes(year, perc_first_round)) +
# core plot
geom_area(aes(group = country),
colour = "grey80", size = 0.2) +
# annotation
ggrepel::geom_text_repel(data = annotation_df,
mapping = aes(label = num_appearances)) +
# create small multiples
facet_wrap(~factor(country, levels = facet_order))
pThe most succesful countries in the richest 20%
This section focuses in on the most successful countries: I had noticed that the USA, France and Spain had been very prominent around 2008. The aim here is to see which countries was the most successful at any given point in time.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
gs_first_round_gdp %>%
count(year, country, iso) %>%
group_by(year) %>%
mutate(perc_appear = n / sum(n)) %>%
filter(iso %in% c("USA", "FRA", "ESP")) %>%
# --------------------------------------------------------------------------
# Produce the plot
# --------------------------------------------------------------------------
ggplot() +
geom_line(aes(year, perc_appear, colour = country))Below I try an alternative approach, where all the other coutries are added into the plot two. To see if this feels like it adds some useful context. It doesn’t look like it does.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
country_group_counts <- gs_first_round_gdp %>%
count(year, country, iso) %>%
mutate(colour = if_else(
iso %in% c("USA", "FRA", "ESP"), iso, "other"
))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
country_group_counts %>%
ggplot() +
# core plot
geom_line(aes(year, n, group = country, colour = colour)) +
# format axis
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
# tidy up plot for editting
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) # export plot for editting
ggsave("../images/image_7.svg", units = "mm", width = 525, height = 350)For context I wanted to take a look at the other countries that had been in the top 3 over time.
# identify and explore the top 3 countries in each year
top_3 <- gs_first_round_gdp %>%
count(year, country) %>%
arrange(year, desc(n)) %>%
group_by(year) %>%
mutate(rank = rank(-n)) %>%
ungroup() %>%
filter(rank == 1 | rank == 2 | rank == 3)
top_3 %>%
scrollable_table()| year | country | n | rank |
|---|---|---|---|
| 1990 | United States | 262 | 1 |
| 1990 | Australia | 88 | 2 |
| 1990 | France | 85 | 3 |
| 1991 | United States | 232 | 1 |
| 1991 | France | 83 | 2 |
| 1991 | Germany | 74 | 3 |
| 1992 | United States | 224 | 1 |
| 1992 | France | 94 | 2 |
| 1992 | Germany | 85 | 3 |
| 1993 | United States | 222 | 1 |
| 1993 | Germany | 86 | 2 |
| 1993 | France | 85 | 3 |
| 1994 | United States | 198 | 1 |
| 1994 | Germany | 81 | 2 |
| 1994 | France | 78 | 3 |
| 1995 | United States | 188 | 1 |
| 1995 | Germany | 92 | 2 |
| 1995 | France | 89 | 3 |
| 1996 | United States | 156 | 1 |
| 1996 | France | 77 | 2 |
| 1996 | Australia | 71 | 3 |
| 1997 | United States | 144 | 1 |
| 1997 | Spain | 84 | 2 |
| 1997 | France | 83 | 3 |
| 1998 | United States | 139 | 1 |
| 1998 | Spain | 92 | 2 |
| 1998 | France | 91 | 3 |
| 1999 | United States | 138 | 1 |
| 1999 | France | 90 | 2 |
| 1999 | Spain | 80 | 3 |
| 2000 | United States | 132 | 1 |
| 2000 | France | 95 | 2 |
| 2000 | Spain | 78 | 3 |
| 2001 | United States | 122 | 1 |
| 2001 | Spain | 112 | 2 |
| 2001 | France | 87 | 3 |
| 2002 | United States | 126 | 1 |
| 2002 | Spain | 96 | 2 |
| 2002 | France | 90 | 3 |
| 2003 | United States | 136 | 1 |
| 2003 | Spain | 95 | 2 |
| 2003 | France | 86 | 3 |
| 2004 | United States | 125 | 1 |
| 2004 | France | 103 | 2 |
| 2004 | Spain | 102 | 3 |
| 2005 | United States | 114 | 1 |
| 2005 | France | 100 | 2 |
| 2005 | Spain | 93 | 3 |
| 2006 | United States | 104 | 1 |
| 2006 | France | 101 | 2 |
| 2006 | Spain | 80 | 3 |
| 2007 | France | 117 | 1 |
| 2007 | United States | 103 | 2 |
| 2007 | Russia | 90 | 3 |
| 2008 | France | 119 | 1 |
| 2008 | United States | 94 | 2 |
| 2008 | Russia | 84 | 3 |
| 2009 | France | 112 | 1 |
| 2009 | United States | 85 | 2 |
| 2010 | Russia | 85 | 3 |
| 2011 | France | 95 | 1 |
| 2011 | United States | 85 | 2 |
| 2011 | Russia | 82 | 3 |
| 2012 | United States | 95 | 1 |
| 2012 | France | 92 | 2 |
| 2012 | Spain | 78 | 3 |
| 2013 | United States | 103 | 1 |
| 2013 | France | 91 | 2 |
| 2013 | Spain | 86 | 3 |
| 2014 | United States | 96 | 1 |
| 2014 | France | 87 | 2 |
| 2014 | Spain | 74 | 3 |
| 2015 | United States | 106 | 1 |
| 2015 | France | 74 | 2 |
| 2016 | United States | 125 | 1 |
| 2016 | France | 81 | 2 |
| 2016 | Spain | 68 | 3 |
| 2017 | United States | 125 | 1 |
| 2017 | France | 84 | 2 |
| 2017 | Russia | 74 | 3 |
| 2018 | United States | 133 | 1 |
| 2018 | France | 76 | 2 |
| 2018 | Germany | 64 | 3 |
| 2019 | United States | 125 | 1 |
| 2019 | France | 83 | 2 |
| 2019 | Russia | 62 | 3 |
Which Countries outside the top twenty percent were most successful?
Looking beyond the top 20%, I wanted to see which other countries had been successful.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# count first round appearance for countries outside top 20%
gs_entries_the_other_80 <- gs_entries_by_country %>%
filter(!top_20_perc) %>%
count(year, country, wt = num_first_rd)
# identify implicitly missing data
grid <- expand_grid(country = unique(gs_entries_the_other_80$country),
year = unique(gs_entries_the_other_80$year))
# where a country did no appear for a given year explicitly record
# zero first round appearances
plot_df <- grid %>%
left_join(gs_entries_the_other_80) %>%
mutate(n = replace_na(n, 0))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df,
aes(year, n, group = country)) +
geom_line()
p# output as interactive for exploration
plotly::ggplotly(p)As Czechia was a sucessful country, I wanted to check if it was referred to by any other names in the dataset.
gs_entries_by_country %>%
filter(str_detect(str_to_lower(country), "cz")) %>%
distinct(country) %>%
kable()| country |
|---|
| Czechia |
So, now I can focus in on the three most successful countries (outisde the top 20% richest countries).
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
# focus on top 3 countries outside the richest 20%
countries_of_int <- c("Russia", "Argentina", "Czechia")
plot_df %>%
mutate(colour = if_else(country %in% countries_of_int,
country, "other")) %>%
# Czechia moves into the top 20% in 2017 so we don't want the line
# to continue
filter(!(country == "Czechia" & year >= 2017)) %>%
ggplot(aes(year, n,
colour = colour,
group = country)) +
# core plot
geom_line() +
# colours
scale_colour_manual(values = c("#A7BCD6", "#35469D",
"grey95", "#C94A54")) +
# format axis
scale_y_continuous(expand = c(0,0)) +
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
# tidy up plots for editting
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) # export for editting
ggsave("../images/image_8.svg", units = "mm", width = 525, height = 350)The poorest 50% of countries
In the last part of the data story I flip things on there head, and focus on the poorest 50% of countries. This allowed me to explore the persistent challenges faced by players from poorer countries in making Grand Slam appearance. This section of the notebook works through producing some chart which show these persistent challenges.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
bottom_50_perc <- gs_entries_by_country %>%
filter(income_decile <= 5)
# ----------------------------------------------------------------------------
# Create the plot
# ----------------------------------------------------------------------------
bottom_50_perc %>%
# only want to count each country once per year
distinct(year, country) %>%
count(year) %>%
ggplot() +
# core plot
geom_col(aes(year, n), width = 0.8) +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
# format axis
scale_x_continuous(expand = c(0,0),
breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
coord_cartesian(clip = 'off') +
# tidy up plot for editing
labs(x = NULL, y = NULL) +
theme(legend.position = "none",
plot.margin = margin(25,25,25,25),
panel.grid.minor = element_blank()) # export plot for editting
ggsave("../images/image_9.svg", units = "mm", width = 525, height = 350)
# output data points for annotation
bottom_50_perc %>%
distinct(year, country) %>%
filter(year == 1990 | year == 2013) %>%
kable()| year | country |
|---|---|
| 1990 | India |
| 1990 | Nigeria |
| 1990 | Peru |
| 2013 | China |
| 2013 | Uzbekistan |
| 2013 | India |
| 2013 | Georgia |
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# identify the countries from bottom 50 percent with most appearances in first round 1990 - 2019
top_n <- 5
top_n_countries <- bottom_50_perc %>%
count(country) %>%
slice_max(order_by = n, n = top_n) %>%
.$country
bottom_50_perc %>%
distinct(year,country) %>%
filter(year == max(year)) %>%
.$country## [1] "Ukraine" "South Africa" "Tunisia" "India" "Uzbekistan"
## [6] "Moldova" "Bolivia"
plot_df <- bottom_50_perc %>%
count(year, country) %>%
# group all countries outside top n together as other
mutate(country = if_else(country %in% top_n_countries,
country,
"Other")) %>%
group_by(year, country) %>%
summarise(n = sum(n))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
# total number of first round appearances by bottom 50% countries
ggplot(plot_df) +
geom_col(aes(year, n, fill = country))# output for use in story text
bottom_50_perc %>%
distinct(year, country) %>%
filter(year == 2003) %>%
kable()| year | country |
|---|---|
| 2003 | Belarus |
| 2003 | Morocco |
| 2003 | Indonesia |
| 2003 | Philippines |
| 2003 | Peru |
| 2003 | Armenia |
| 2003 | Ecuador |
| 2003 | Georgia |
| 2003 | Paraguay |
| 2003 | Uzbekistan |
| 2003 | Zimbabwe |
| 2003 | Colombia |
| 2003 | Madagascar |
| 2003 | Ukraine |
| 2003 | Bosnia and Herzegovina |
| 2003 | China |
| 2003 | Tunisia |
The penultimate chunk of code in this notebook, looks at if countries have moved in and out of the poorest 50% of countries. it is similar to the code chunck and chart above, looking at if countries had moved in and out of the richest 20%. This chart ultimately didn’t make into the data story.
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
# count first round appearances by bottom 50% countries
bottom_50_perc_country_counts <- bottom_50_perc %>%
count(year, country, wt = num_first_rd) %>%
# average the counts
group_by(country) %>%
mutate(country_ave_n = mean(n, na.rm = TRUE)) %>%
ungroup()
# look at how many times the countries appear in the bottom fifty percent
bottom_50_summary <- bottom_50_perc_country_counts %>%
count(country) %>%
rename(total_n = n) %>%
left_join(distinct(
bottom_50_perc_country_counts,
country,
country_ave_n
)) %>%
arrange(desc(country_ave_n))
# create a grid to see where countries have moved out of bottom 50 perc
grid <- expand_grid(year = unique(bottom_50_perc_country_counts$year),
country = unique(bottom_50_perc_country_counts$country))
plot_df <- grid %>%
left_join(bottom_50_perc_country_counts)
# top 50% countries for comparison to bottom 50% countries
top_50_countries_by_year <- gs_entries_by_country %>%
distinct(year, country, income_decile) %>%
filter(income_decile > 5)
# check if a country is bottom 50 in a given year
# returns a logical
check_top_50 <- function(year, country){
selector <- top_50_countries_by_year$year == year &
top_50_countries_by_year$country == country
res <- top_50_countries_by_year[selector, ]
if(nrow(res) == 0){
return(FALSE)
}
else {
return(res[[1,"income_decile"]] > 5)
}
}
# a quick function test
check_top_50(1991, "United States")## [1] TRUE
# run check on if countries are in the bottom 50% for full dataframe
plot_df_1 <- plot_df %>%
mutate(in_top_50 = map2_lgl(year, country, ~check_top_50(.x,.y)),
n = if_else(is.na(n) & in_top_50,
-1, n),
n = replace_na(n, 0),
bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 5, 10, 20, Inf)))
# for ordering countries in the plot
levels <- rev(bottom_50_summary$country)
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
ggplot(plot_df_1) +
# core plot
geom_tile(aes(x = year,
y = factor(country, levels = levels),
fill = bin_n),
colour = "#E7E4E5") +
# colours
scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7',
'#9f9dce', '#6e70b6', '#35469d')) +
# tidy up plot ahead of editting
labs(x= NULL, y = NULL) +
coord_equal() +
theme_minimal() Finally, a chart that didn’t make it into the final data story. It tries to show the distribution of the number of appearances in the Grand Slam first round over the years.
library(ggridges)
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
# the top 10 most successful bottom 50 countries
bottom_50_countries <- bottom_50_summary %>%
slice_head(n = 10) %>%
.$country
gs_first_round_gdp %>%
filter(country %in% bottom_50_countries) %>%
ggplot() +
# core plot
ggridges::geom_density_ridges(mapping = aes(year,
factor(country, levels = levels),
height = stat(density)),
stat = "density")