Producing charts for ‘The Changing Fortunes of the Richest Countries in Grand Slam Tennis’

By: Dr. Chris Martin

Tools / packages used: R, R Markdown, ggplot2, tidyverse (inc. dplyr and tidyr), plotly.

Techniques used: exploratory data analysis, functional programming (purrr package), data visualisation, data cleaning/reshaping/manipulation.

Chart types used: area chart, line chart, bar chart, stacked bar chart, small multiples, heatmap, ridge chart, interactive charts (with ggplotly).

Source data: To produce the charts, I needed data on the women’s and men’s singles entrants for each Grand Slam tournament since 1990. This came from the excellent Tennis Abstract.

This notebook produces the static data visualisations which features in my data storytelling project: The Changing Fortunes of the Richest Countries in Grand Slam Tennis. You can read the full story on my website. This notebook uses data processed, clean and explored in another notebook.

A note on my data visualistion workflow

The chart produced in this notebook are ‘skeletons’ with fairly minimal styling, but all the key structural components in places. The chart are exported from this notebook as svgs. These can are then edited - adding textures, photos, annotations etc. - using graphic design software to create the final versions.

Some of the charts in this notebook made into the final data story, others where experiments that ultimately did no lead to a final chart. Some charts in this notebook are missing axis labels, titles etc. this is because it made the charts easier to edit further down the line in graphic design software.

Setting up the notebook

# import packages
library(tidyverse)  # for data manipulation and viz
library(knitr)      # for formatting tables
library(kableExtra) # for formatting tables

# set default theme for exploratory plots
theme_set(theme_light())  # using a minimal theme to make it easier to edit 
                          # the plots in graphic design software later on

# set default R markdown chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

# output the first lines of a dataframe in a nice format
scrollable_table <- function(df){
  df %>% 
    kable("html") %>% 
    kable_styling() %>%
    scroll_box(width = "100%", height = "200px",
               fixed_thead = list(enabled = T, background = "grey90"))
}

Reading in the data

The data is read in from csvs produced in data_clean.Rmd, a lot of the data preparation was done in that notebook.

gs_first_round_gdp <- read_csv("../data/results_gdp.csv") %>% 
  #just look up until covid pandemic  (as will have distorted things)
  filter(year < 2020)

# check data looks as expected
gs_first_round_gdp %>% 
  head() %>% 
  kable()
year tourney_name tour name id ioc country gdp_per_capita iso
1990 Australian Open atp Jim Pugh 101004 USA United States 40436.94 USA
1990 Australian Open atp Ivan Lendl 100656 USA United States 40436.94 USA
1990 Australian Open atp Cyril Suk 101327 CZE Czechia 23585.18 CZE
1990 Australian Open atp Tomas Carbonell 101507 ESP Spain 27543.92 ESP
1990 Australian Open atp Michael Brown B395 101895 AUS Australia 31016.42 AUS
1990 Australian Open atp Karel Novacek 101120 CZE Czechia 23585.18 CZE
gs_entries_by_country <- read_csv("../data/gs_entries_by_country.csv") %>% 
  # just look up until covid pandemic  (as will have distorted things)
  filter(year < 2020)

# check data looks as expected
gs_entries_by_country %>% 
  head() %>% 
  kable()
year tourney_name country_code country gdp_per_capita num_first_rd income_decile top_20_perc
1990 Australian Open USA United States 40436.94 55 10 TRUE
1990 Australian Open AUS Australia 31016.42 36 9 TRUE
1990 Australian Open FRA France 33732.02 20 9 TRUE
1990 Australian Open GER Germany 36699.48 20 9 TRUE
1990 Australian Open SWE Sweden 34156.82 14 9 TRUE
1990 Australian Open CZE Czechia 23585.18 10 8 FALSE

How have the richest 20% of countries performed at Grand Slams (1990 - 2019)?

This section focuses on how well the richest 20% of countries performed at Grand Slams (1990 - 2019). The metric used for a country’s performance is how many players they had appearing in the first round of Grand Slam tennis tournaments.

The overall trend

The performance of the richest countries declined 1900 to 2008, and then picked up again a little.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
plot_df <- gs_entries_by_country %>% 
  
  # looked performance for two groups of countries
  # top 20% richest countries and the rest
  mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc)) %>% 
  group_by(year, top_20_perc) %>% 
  summarise(num_first_rd = sum(num_first_rd)) %>% 
  ungroup() %>%
  
  # calculate proportions from counts
  group_by(year) %>% 
  mutate(perc_first_round = num_first_rd / sum(num_first_rd)) %>% 
  ungroup()

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------  
p <-  ggplot(plot_df) +
  
  # core chart
  geom_area(aes(year, perc_first_round, fill = top_20_perc)) +
  
  # format axis
    scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
                       expand = c(0,0)) +
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
  
  # tidy up presentation
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.margin = margin(25,25,25,25)) 

p 

# ----------------------------------------------------------------------------
# Export the plot for editing
# ----------------------------------------------------------------------------
ggsave("../images/all_gs.svg", units = "mm", width = 600, height = 325)

Is the overall trend driven by the composition of 20% richest countries changing over time?

When I saw showing early versions of the visualisation to members of the target audience, a question came up. Are the trends indentified a result of the countries in the top 20% richest changing? In this section of the notebook I reshape so I can produce a chart that shows that this isn’t drivnig the trend.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------

# count number of first round appearances per country per year
top_twenty_countries <- gs_entries_by_country %>% 
  filter(top_20_perc) %>% 
  count(year, country, wt = num_first_rd) %>% 
  arrange(year, desc(n)) %>% 
  group_by(country) %>% 
  mutate(country_ave_n = mean(n)) %>% 
  ungroup()

# average number of appearances per country each year (1990 - 2019)
country_ave_ns <- top_twenty_countries %>% 
  distinct(country, country_ave_n)
  
# Count how many time the countries appear in the top twenty percent
top_twenty_counts <- top_twenty_countries %>% 
  count(country) %>% 
  left_join(country_ave_ns) %>% 
  arrange(desc(n), desc(country_ave_n)) %>% 
  rename(years_top_20 = n)
  

# this will be useful later for ordering the countriesin the plot
levels <- top_twenty_counts$country

# create a grid to see where countries have moved out of top 20 perc
grid <-  expand_grid(year = unique(top_twenty_countries$year),
                     country = unique(top_twenty_countries$country))

# identify implicitly missing data
plot_df <- grid %>% 
  left_join(top_twenty_countries)

# identify countries in bottom 80%
bottom_80_countries_by_year <- gs_entries_by_country %>% 
  distinct(year, country, income_decile) %>% 
  filter(income_decile < 9)

# function for checking if a country is the bottom 80% in a given year
check_bottom_80 <- function(year, country){

  selector <- bottom_80_countries_by_year$year == year & 
              bottom_80_countries_by_year$country == country
  
  res <- bottom_80_countries_by_year[selector, ]
  
  if(nrow(res) == 0){
    return(FALSE)
    }
  else {
    return(res[[1,"income_decile"]] < 9)
  }
}

# quick test of check_bottom_80()
check_bottom_80(1990, "Nigeria")
## [1] TRUE
# update plotting dataframe with variable recording if a country is the bottom 80% 
# in a given year
plot_df_1 <- plot_df %>% 
  mutate(in_bottom_80 = map2_lgl(year, country, ~check_bottom_80(.x,.y)),
         n = if_else(is.na(n) & in_bottom_80,
                     -1, n),
         n = replace_na(n, 0),
         bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 10, 50, 100, Inf)))  

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

p <- ggplot(plot_df_1) +
  
      # core plot
      geom_tile(aes(x = year, 
                    y = factor(country, levels = rev(levels)),
                    fill = bin_n),
                colour = "#F8F7F7") +
      
      # plot text
      labs(x= NULL, y = NULL) +
      guides(fill = guide_legend(reverse=TRUE)) +
      
      # colours
      scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', 
                                   '#9f9dce', '#6e70b6', '#35469d')) +
      # simplify plot for editting
      scale_x_continuous(position = "top") +
      coord_equal() +  
      theme_minimal()
  
p

# output plot for editing
ggsave("../images/image_4.svg")

# a quick test to confirm that the plot makes sense
gs_entries_by_country %>% 
  filter(country == "Israel") %>% 
  distinct(year, income_decile) %>% 
  scrollable_table()
year income_decile
1990 8
1991 8
1992 8
1993 8
1994 8
1995 8
1996 8
1997 8
1998 8
1999 8
2000 8
2001 8
2002 8
2003 8
2004 8
2005 8
2006 8
2007 8
2008 8
2009 8
2010 8
2011 8
2012 8
2013 9
2014 9
2015 9
2016 9
2017 9
2018 9

The decline of the richest countries

Having established the overall trend is not an artifact of countries moving in and out the richest 20%, this section looks in detail at decline of the richest countries performance 1990-2008.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------

# simplify representation of missing data
gs_entries_by_country_clean <- gs_entries_by_country %>% 
  mutate(top_20_perc = if_else(is.na(top_20_perc), FALSE, top_20_perc))

# count first round appearance for all countries by year
all_gs_entries_by_country <- gs_entries_by_country_clean %>% 
  
  group_by(year, country) %>% 
  summarise(num_first_rd_year = sum(num_first_rd)) %>% 
  ungroup() %>% 
  
  left_join(distinct(gs_entries_by_country_clean, country, 
                     year, top_20_perc)) %>%

  group_by(year) %>%
  mutate(perc_first_round = num_first_rd_year / sum(num_first_rd_year)) %>%
  ungroup()

# output to check counts make sense for top 20 vs bottom 80
all_gs_entries_by_country %>% 
  group_by(year, top_20_perc) %>% 
  summarise(perc_first_round = sum(perc_first_round)) %>% 
  scrollable_table()
year top_20_perc perc_first_round
1990 FALSE 0.2183575
1990 TRUE 0.7816425
1991 FALSE 0.2222222
1991 TRUE 0.7777778
1992 FALSE 0.2280702
1992 TRUE 0.7719298
1993 FALSE 0.2400389
1993 TRUE 0.7599611
1994 FALSE 0.2725509
1994 TRUE 0.7274491
1995 FALSE 0.2575316
1995 TRUE 0.7424684
1996 FALSE 0.2810078
1996 TRUE 0.7189922
1997 FALSE 0.3071705
1997 TRUE 0.6928295
1998 FALSE 0.3166023
1998 TRUE 0.6833977
1999 FALSE 0.3496638
1999 TRUE 0.6503362
2000 FALSE 0.3563107
2000 TRUE 0.6436893
2001 FALSE 0.3578337
2001 TRUE 0.6421663
2002 FALSE 0.3972868
2002 TRUE 0.6027132
2003 FALSE 0.4046467
2003 TRUE 0.5953533
2004 FALSE 0.4078695
2004 TRUE 0.5921305
2005 FALSE 0.4181818
2005 TRUE 0.5818182
2006 FALSE 0.4306358
2006 TRUE 0.5693642
2007 FALSE 0.4324324
2007 TRUE 0.5675676
2008 FALSE 0.4613900
2008 TRUE 0.5386100
2009 FALSE 0.4434698
2009 TRUE 0.5565302
2010 FALSE 0.4472656
2010 TRUE 0.5527344
2011 FALSE 0.4469328
2011 TRUE 0.5530672
2012 FALSE 0.4316406
2012 TRUE 0.5683594
2013 FALSE 0.4208984
2013 TRUE 0.5791016
2014 FALSE 0.4287109
2014 TRUE 0.5712891
2015 FALSE 0.4345703
2015 TRUE 0.5654297
2016 FALSE 0.4394531
2016 TRUE 0.5605469
2017 FALSE 0.4003906
2017 TRUE 0.5996094
2018 FALSE 0.3857422
2018 TRUE 0.6142578
2019 FALSE 0.3984375
2019 TRUE 0.6015625
# focus on top 20 entries
gs_entries_top_20_perc <- all_gs_entries_by_country %>% 
  filter(top_20_perc)

# identify implicitly missing data
grid <- expand_grid(year = unique(gs_entries_top_20_perc$year),
                    country = unique(gs_entries_top_20_perc$country))
  
# create variables needed for plotting
plot_df <- grid %>% 
  left_join(gs_entries_top_20_perc) %>% 
  mutate(perc_first_round = replace_na(perc_first_round, 0)) %>% 
  filter(year <= 2008) %>% 
  mutate(is_usa = country == "United States")

# calculate change in performance (per country) 
# compares number of first round appearance 1990 to 2008
# for grouping, ordering and colouring countries in the area chart
change_df <- plot_df %>% 
  
  filter(year == max(year) | year == min(year)) %>% 
  select(year, country, perc_first_round) %>% 
  
  pivot_wider(names_from = year,
              values_from = perc_first_round,
              values_fill = 0) %>% 
  mutate(change = `2008` - `1990`,
         fall = change < 0,
         change_bin = cut(change, 
                          breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>% 
  arrange(change)

# add in change in performance to dataframe for plotting
plot_df_1 <- plot_df %>% 
  
  left_join(change_df) %>% 
  
  group_by(country) %>% 
  mutate(ave_num_first_round = mean(num_first_rd_year, na.rm = TRUE)) %>% 
  ungroup() %>% 

  arrange(fall, desc(change)) %>% 
  mutate(country = factor(country, levels = unique(country)))

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

p <- ggplot(plot_df_1) +
  
  # core plot
      geom_area(aes(year, perc_first_round,
                  group = country,
                  fill = change_bin),
              colour = "grey80", size = 0.2) +
    
      # colours
      scale_fill_manual(values = c("#c94a54", "#efb3aa", 
                                   "#fffff0", "#aeabcd")) +
      # format axis
      scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
                         breaks = c(seq(0,0.8,0.1)),
                         expand = expansion(mult = c(0, .1))) +
      scale_x_continuous(expand = c(0,0), 
                         breaks=c(1990,1995,2000,2005,2008)) +
    
      # simplify plot for editting
      labs(x = NULL, y = NULL) +
      theme(legend.position = "none",
            plot.margin = margin(25,25,25,25)) 


p

# output plot for editting
ggsave("../images/image_5.svg", units = "mm", width = 525, height = 350)

# output plot as interactive for inspection
plotly::ggplotly(p)
# ----------------------------------------------------------------------------
# Get the information needed to annotate the plot
# ----------------------------------------------------------------------------

levels(plot_df_1$change_bin)
## [1] "(-Inf,-0.1]"  "(-0.1,-0.02]" "(-0.02,0.02]" "(0.02,0.1]"   "(0.1, Inf]"
# get data points for plot annotation
plot_df_1 %>% 
  filter(country %in% c("United States", "Australia", "Sweden")) %>% 
  group_by(year) %>% 
  summarise(perc_first_round_tot = sum(perc_first_round)) %>% 
  scrollable_table()
year perc_first_round_tot
1990 0.3951691
1991 0.3508772
1992 0.3255361
1993 0.3323615
1994 0.3064985
1995 0.2847425
1996 0.2635659
1997 0.2364341
1998 0.2326255
1999 0.2199808
2000 0.2165049
2001 0.1953578
2002 0.1841085
2003 0.1955470
2004 0.1794626
2005 0.1655502
2006 0.1512524
2007 0.1563707
2008 0.1389961
plot_df_1 %>% 
  filter(country == "United States") %>% 
  group_by(year) %>% 
  summarise(perc_first_round_tot = sum(perc_first_round)) %>% 
  scrollable_table()
year perc_first_round_tot
1990 0.2531401
1991 0.2261209
1992 0.2183236
1993 0.2157434
1994 0.1920466
1995 0.1827017
1996 0.1511628
1997 0.1395349
1998 0.1341699
1999 0.1325648
2000 0.1281553
2001 0.1179884
2002 0.1220930
2003 0.1316554
2004 0.1199616
2005 0.1090909
2006 0.1001927
2007 0.0994208
2008 0.0907336
# get players from a country in a given year (for annotation)
get_players <- function(country_str, year_int){
  gs_first_round_gdp %>% 
  filter(country == country_str & year == year_int) %>% 
  distinct(name)
}

get_players("United States", 1990) %>% 
  scrollable_table()
name
Jim Pugh
Ivan Lendl
Tim Wilkison
Todd Witsken
Glenn Layendecker
Jimmy Brown
John McEnroe
Dan Goldie
Leif Shiras
Richey Reneberg
Ronald Agenor
Tim Mayotte
Pete Sampras
Aaron Krickstein
David Wheaton
Jimmy Arias
Jim Courier
Paul Chamberlin
Paul Annacone
Kelly Jones
Scott Davis
Lawson Duncan
Jay Berger
Andre Agassi
Michael Chang
Richard Matuszewski
Derrick Rostagno
Malivai Washington
Jim Grabb
Joey Rive
Bryan Shelton
David Pate
Michael Robertson
Brian Garrow
Brad Pearce
Rick Leach
Jeff Tarango
Kevin Curren
Brad Gilbert
Ken Flach
Martin Blackman
David Witt
Ivan Baron
Chris Garner
Tommy Ho
Patrick McEnroe
Robert Seguso
Steve Bryan
Todd Martin
Carrie Cunningham
Erika De Lone
Kimberly Kessaris
Shaun Stafford
Elise Burgin
Ronni Reis
Patty Fendick
Audra Keller
Stacey Martin
Pam Shriver
Lori Mcneil
Louise Allen
Jill Smoller
Mary Joe Fernandez
Robin White
Kathy Rinaldi Stunkel
Marianne Werdel Witmeyer
Ann Henricksson
Donna Faber
Andrea Leand
Camille Benjamin
Rosalyn Fairbank
Beverly Bowes
Heather Ludloff
Katrina Adams
Zina Garrison
Cammy Macgregor
Laxmi Poruri
Tami Whitlinger Jones
Terry Phelps
Gigi Fernandez
Halle Cioffi
Ann Grossman
Amy Frazier
Jennifer Santrock
Hu Na
Susan Sloane Lundy
Linda Wild
Betsy Nagelsen
Gretchen Magers
Jennifer Capriati
Monica Seles
Meredith Mcgrath
Kathy Jordan
Wendy White
Mareen Louie Harper
Mary Lou Piatek
Anna Ivan
Anne Smith
Martina Navratilova
Andrea Farley
Lisa Raymond
Chanda Rubin
Caroline Kuhlman
Sandy Collins
Eleni Rossides
Debbie Graham
get_players("United States", 2008) %>% 
  scrollable_table()
name
Sam Querrey
Vincent Spadea
Donald Young
Robby Ginepri
Scoville Jenkins
Mardy Fish
James Blake
Bobby Reynolds
Wayne Odesnik
John Isner
Kevin Kim
Andy Roddick
Ryler Deheart
Brendan Evans
Ryan Sweeting
Robert Kendrick
Austin Krajicek
Sam Warburg
Michael Russell
Rajeev Ram
Julie Ditty
Jill Craybas
Lindsay Davenport
Ashley Harkleroad
Lilia Osterloh
Madison Brengle
Serena Williams
Venus Williams
Meilen Tu
Vania King
Laura Granville
Bethanie Mattek Sands
Melanie Oudin
Kristie Ahn
Gail Brodsky
Jamea Jackson
Alexa Glatch
Asia Muhammad
Ahsha Rolle
Shenay Perry
Coco Vandeweghe

The mini resurgence of the richest countries

After 2008 there is an improvement in the performance of the richest countries. This chunk of code produces a plot that focusses in on this trend.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------

# focus on 2009 onwards and find implicitly missing data
plot_df <- grid %>% 
  
  left_join(gs_entries_top_20_perc) %>% 
  mutate(perc_first_round = replace_na(perc_first_round, 0)) %>% 
  filter(year > 2008)

# calculate change in performance of coutries between 2009 and 2019
change_df <- plot_df %>% 
  
  filter(year == max(year) | year == min(year)) %>% 
  select(year, country, perc_first_round) %>% 
  
  pivot_wider(names_from = year,
              values_from = perc_first_round,
              values_fill = 0) %>% 
  
  mutate(change = `2019` - `2009`,
         fall = change < 0,
         # create bins for grouping countries
         change_bin = cut(change, 
                          breaks = c(-Inf, -0.1, -0.02, 0.02, 0.1, Inf))) %>% 
  arrange(change)

# define countries of particular interest for in plot
countries_of_int <-  c("United States", "France", "Sweden", "Australia", "Spain")

# group all the other (not of specific interest) countries
plot_df_1 <- plot_df %>% 
  
  left_join(change_df) %>% 
  
  mutate(country = if_else(country %in% countries_of_int, 
                           country, 
                           "Other")) %>% 
  
  group_by(country, year) %>%
  summarise(perc_first_round = sum(perc_first_round)) %>% 
  ungroup()
  
# define order of the facets within the plot
facet_order <- c("United States", "Australia", "Sweden", "France", "Spain", "Other")

# create annotation layer for plot
annotation_df <- plot_df_1 %>% 
  mutate(label = round(perc_first_round * 100, 1),
         num_appearances = round(perc_first_round * 256)) %>% 
  filter(year == max(plot_df_1$year)|
           year == min(plot_df_1$year))
# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

p <- ggplot(plot_df_1,
            aes(year, perc_first_round)) +
  
      # core plot
      geom_area(aes(group = country),
                colour = "grey80", size = 0.2) +
  
      # annotation
      ggrepel::geom_text_repel(data = annotation_df,
                               mapping = aes(label = num_appearances)) +
      
      # create small multiples
      facet_wrap(~factor(country, levels = facet_order))

p

The most succesful countries in the richest 20%

This section focuses in on the most successful countries: I had noticed that the USA, France and Spain had been very prominent around 2008. The aim here is to see which countries was the most successful at any given point in time.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
gs_first_round_gdp %>% 

  count(year, country, iso) %>% 
  group_by(year) %>% 
  mutate(perc_appear = n / sum(n)) %>% 
  
  filter(iso %in% c("USA", "FRA", "ESP")) %>% 
  
  # --------------------------------------------------------------------------
  # Produce the plot
  # --------------------------------------------------------------------------
  ggplot() +
  geom_line(aes(year, perc_appear, colour = country))

Below I try an alternative approach, where all the other coutries are added into the plot two. To see if this feels like it adds some useful context. It doesn’t look like it does.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
country_group_counts <- gs_first_round_gdp %>% 
  count(year, country, iso) %>% 
  mutate(colour = if_else(
    iso %in% c("USA", "FRA", "ESP"), iso, "other"
  )) 

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
country_group_counts %>% 
  
  ggplot() +
  
    # core plot
    geom_line(aes(year, n, group = country, colour = colour)) +
    
    # format axis 
    scale_y_continuous(expand = c(0,0)) +
    scale_x_continuous(expand = c(0,0), 
                         breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
    
    # tidy up plot for editting
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.margin = margin(25,25,25,25),
          panel.grid.minor = element_blank()) 

# export plot for editting
ggsave("../images/image_7.svg", units = "mm", width = 525, height = 350)

For context I wanted to take a look at the other countries that had been in the top 3 over time.

# identify and explore the top 3 countries in each year
top_3 <- gs_first_round_gdp %>% 
  count(year, country) %>% 
  arrange(year, desc(n)) %>% 
  group_by(year) %>% 
  mutate(rank = rank(-n)) %>% 
  ungroup() %>% 
  
  filter(rank == 1 | rank == 2 | rank == 3)

top_3 %>% 
  scrollable_table()
year country n rank
1990 United States 262 1
1990 Australia 88 2
1990 France 85 3
1991 United States 232 1
1991 France 83 2
1991 Germany 74 3
1992 United States 224 1
1992 France 94 2
1992 Germany 85 3
1993 United States 222 1
1993 Germany 86 2
1993 France 85 3
1994 United States 198 1
1994 Germany 81 2
1994 France 78 3
1995 United States 188 1
1995 Germany 92 2
1995 France 89 3
1996 United States 156 1
1996 France 77 2
1996 Australia 71 3
1997 United States 144 1
1997 Spain 84 2
1997 France 83 3
1998 United States 139 1
1998 Spain 92 2
1998 France 91 3
1999 United States 138 1
1999 France 90 2
1999 Spain 80 3
2000 United States 132 1
2000 France 95 2
2000 Spain 78 3
2001 United States 122 1
2001 Spain 112 2
2001 France 87 3
2002 United States 126 1
2002 Spain 96 2
2002 France 90 3
2003 United States 136 1
2003 Spain 95 2
2003 France 86 3
2004 United States 125 1
2004 France 103 2
2004 Spain 102 3
2005 United States 114 1
2005 France 100 2
2005 Spain 93 3
2006 United States 104 1
2006 France 101 2
2006 Spain 80 3
2007 France 117 1
2007 United States 103 2
2007 Russia 90 3
2008 France 119 1
2008 United States 94 2
2008 Russia 84 3
2009 France 112 1
2009 United States 85 2
2010 Russia 85 3
2011 France 95 1
2011 United States 85 2
2011 Russia 82 3
2012 United States 95 1
2012 France 92 2
2012 Spain 78 3
2013 United States 103 1
2013 France 91 2
2013 Spain 86 3
2014 United States 96 1
2014 France 87 2
2014 Spain 74 3
2015 United States 106 1
2015 France 74 2
2016 United States 125 1
2016 France 81 2
2016 Spain 68 3
2017 United States 125 1
2017 France 84 2
2017 Russia 74 3
2018 United States 133 1
2018 France 76 2
2018 Germany 64 3
2019 United States 125 1
2019 France 83 2
2019 Russia 62 3

Which Countries outside the top twenty percent were most successful?

Looking beyond the top 20%, I wanted to see which other countries had been successful.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------

# count first round appearance for countries outside top 20%
gs_entries_the_other_80 <- gs_entries_by_country %>% 
  filter(!top_20_perc) %>% 
  count(year, country, wt = num_first_rd)

# identify implicitly missing data
grid <- expand_grid(country = unique(gs_entries_the_other_80$country),
                    year = unique(gs_entries_the_other_80$year))

# where a country did no appear for a given year explicitly record 
# zero first round appearances
plot_df <- grid %>%
  left_join(gs_entries_the_other_80) %>%
  mutate(n = replace_na(n, 0))

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------
p <- ggplot(plot_df,
       aes(year, n, group = country)) +
  geom_line()

p

# output as interactive for exploration
plotly::ggplotly(p)

As Czechia was a sucessful country, I wanted to check if it was referred to by any other names in the dataset.

gs_entries_by_country %>% 
  filter(str_detect(str_to_lower(country), "cz")) %>% 
  distinct(country) %>% 
  kable()
country
Czechia

So, now I can focus in on the three most successful countries (outisde the top 20% richest countries).

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

# focus on top 3 countries outside the richest 20%
countries_of_int <- c("Russia", "Argentina", "Czechia")

plot_df %>% 
  
  mutate(colour = if_else(country %in% countries_of_int,
                          country, "other")) %>% 
  
  # Czechia moves into the top 20% in 2017 so we don't want the line
  # to continue
  filter(!(country == "Czechia" & year >= 2017)) %>% 
  
  ggplot(aes(year, n, 
             colour = colour, 
             group = country)) +
    
    # core plot
    geom_line() +
  
    # colours
    scale_colour_manual(values = c("#A7BCD6", "#35469D", 
                                   "grey95", "#C94A54")) +
  
    # format axis
    scale_y_continuous(expand = c(0,0)) +
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
    
    # tidy up plots for editting
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.margin = margin(25,25,25,25),
          panel.grid.minor = element_blank()) 

# export for editting
ggsave("../images/image_8.svg", units = "mm", width = 525, height = 350)

The poorest 50% of countries

In the last part of the data story I flip things on there head, and focus on the poorest 50% of countries. This allowed me to explore the persistent challenges faced by players from poorer countries in making Grand Slam appearance. This section of the notebook works through producing some chart which show these persistent challenges.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------
bottom_50_perc <- gs_entries_by_country %>% 
  filter(income_decile <= 5)

# ----------------------------------------------------------------------------
# Create the plot
# ----------------------------------------------------------------------------

bottom_50_perc %>% 
  
  # only want to count each country once per year
  distinct(year, country) %>% 
  count(year) %>% 
  
  
  ggplot() +
  
    # core plot
    geom_col(aes(year, n), width = 0.8) +
            scale_y_continuous(expand = expansion(mult = c(0, .1))) +
    
    # format axis
    scale_x_continuous(expand = c(0,0), 
                       breaks=c(1990,1995,2000,2005,2010,2015,2019)) +
    coord_cartesian(clip = 'off') +
    
    # tidy up plot for editing
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
        plot.margin = margin(25,25,25,25),
        panel.grid.minor = element_blank()) 

# export plot for editting
ggsave("../images/image_9.svg", units = "mm", width = 525, height = 350)

# output data points for annotation
bottom_50_perc %>% 
  distinct(year, country) %>% 
  filter(year == 1990 | year == 2013) %>% 
  kable()
year country
1990 India
1990 Nigeria
1990 Peru
2013 China
2013 Uzbekistan
2013 India
2013 Georgia
# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------

# identify the countries from bottom 50 percent with most appearances in first round 1990 - 2019
top_n <- 5

top_n_countries <- bottom_50_perc %>% 
  count(country) %>% 
  slice_max(order_by = n, n = top_n) %>% 
  .$country

bottom_50_perc %>% 
  distinct(year,country) %>% 
  filter(year == max(year)) %>% 
  .$country
## [1] "Ukraine"      "South Africa" "Tunisia"      "India"        "Uzbekistan"  
## [6] "Moldova"      "Bolivia"
plot_df <- bottom_50_perc %>% 
  count(year, country) %>% 
  
  # group all countries outside top n together as other
  mutate(country = if_else(country %in% top_n_countries,
                           country,
                           "Other")) %>% 
  group_by(year, country) %>% 
  summarise(n = sum(n))

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

# total number of first round appearances by bottom 50% countries
ggplot(plot_df) +
  geom_col(aes(year, n, fill = country))

# output for use in story text
bottom_50_perc %>% 
  distinct(year, country) %>% 
  filter(year == 2003) %>% 
  kable()
year country
2003 Belarus
2003 Morocco
2003 Indonesia
2003 Philippines
2003 Peru
2003 Armenia
2003 Ecuador
2003 Georgia
2003 Paraguay
2003 Uzbekistan
2003 Zimbabwe
2003 Colombia
2003 Madagascar
2003 Ukraine
2003 Bosnia and Herzegovina
2003 China
2003 Tunisia

The penultimate chunk of code in this notebook, looks at if countries have moved in and out of the poorest 50% of countries. it is similar to the code chunck and chart above, looking at if countries had moved in and out of the richest 20%. This chart ultimately didn’t make into the data story.

# ----------------------------------------------------------------------------
# Reshape data for plotting
# ----------------------------------------------------------------------------

# count first round appearances by bottom 50% countries
bottom_50_perc_country_counts <- bottom_50_perc %>% 
  
  count(year, country, wt = num_first_rd) %>% 
  
  # average the counts
  group_by(country) %>% 
  mutate(country_ave_n = mean(n, na.rm = TRUE)) %>% 
  ungroup()


# look at how many times the countries appear in the bottom fifty percent
bottom_50_summary <- bottom_50_perc_country_counts %>% 
  count(country) %>% 
  rename(total_n = n) %>% 
  left_join(distinct(
              bottom_50_perc_country_counts,
              country,
              country_ave_n
              )) %>% 
  arrange(desc(country_ave_n))


  
# create a grid to see where countries have moved out of bottom 50 perc
grid <-  expand_grid(year = unique(bottom_50_perc_country_counts$year),
                     country = unique(bottom_50_perc_country_counts$country))
plot_df <- grid %>% 
  left_join(bottom_50_perc_country_counts)

# top 50% countries for comparison to bottom 50% countries
top_50_countries_by_year <- gs_entries_by_country %>% 
  distinct(year, country, income_decile) %>% 
  filter(income_decile > 5)

# check if a country is bottom 50 in a given year
# returns a logical
check_top_50 <- function(year, country){

  selector <- top_50_countries_by_year$year == year & 
              top_50_countries_by_year$country == country
  
  res <- top_50_countries_by_year[selector, ]
  
  if(nrow(res) == 0){
    return(FALSE)
    }
  else {
    return(res[[1,"income_decile"]] > 5)
  }
}

# a quick function test
check_top_50(1991, "United States")
## [1] TRUE
# run check on if countries are in the bottom 50% for full dataframe
plot_df_1 <- plot_df %>% 
  mutate(in_top_50 = map2_lgl(year, country, ~check_top_50(.x,.y)),
         n = if_else(is.na(n) & in_top_50,
                     -1, n),
         n = replace_na(n, 0),
         bin_n = cut(n, breaks = c(-Inf, -1e-10,0,1e10, 5, 10, 20, Inf)))


# for ordering countries in the plot
levels <- rev(bottom_50_summary$country)

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

ggplot(plot_df_1) +
  
  # core plot
  geom_tile(aes(x = year, 
                y = factor(country, levels = levels),
                fill = bin_n),
            colour = "#E7E4E5") +
  
    # colours  
    scale_fill_manual(values = c("#E7E4E5", "white", '#cfcde7', 
                                 '#9f9dce', '#6e70b6', '#35469d')) +
    
    # tidy up plot ahead of editting
    labs(x= NULL, y = NULL) +
    coord_equal() +
    theme_minimal() 

Finally, a chart that didn’t make it into the final data story. It tries to show the distribution of the number of appearances in the Grand Slam first round over the years.

library(ggridges)

# ----------------------------------------------------------------------------
# Produce the plot
# ----------------------------------------------------------------------------

# the top 10 most successful bottom 50 countries
bottom_50_countries <- bottom_50_summary %>% 
  slice_head(n = 10) %>% 
  .$country

gs_first_round_gdp %>% 
  
  filter(country %in% bottom_50_countries) %>% 

  ggplot() +
  
    # core plot
    ggridges::geom_density_ridges(mapping = aes(year, 
                                      factor(country, levels = levels),
                                      height = stat(density)),
                                  stat = "density")