Introduction

In this exercise, we work with a list of Grand Slam men’s singles champions. The idea of using this dataset came from a post by Peter Kowalchuk on the Week 5 Discussion Board. The source of the dataset can be found at: https://en.wikipedia.org/wiki/List_of_Grand_Slam_men%27s_singles_champions

In analyzing the data, I was interested in addressing three questions:

Create a CSV file that includes all the information in the dataset

The list of Grand Slam men’s singles champions is available on the Wikipedia website. I copied and saved the data as a CSV file on GitHub. The data include the men’s singles champion by year for each Grand Slam tournament.

Read the dataset and use tidyr and dplyr to tidy and transform the data

First, let’s read in the data from the CSV file stored on GitHub.

library(tidyverse)
library(knitr)

# read the CSV file from GitHub and add column names
file <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/grandslam.csv"
raw <- read_csv(file)
names(raw) <- c("Year", "Australian", "French", "Wimbledon", "US")

str(raw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    144 obs. of  5 variables:
##  $ Year      : int  1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 ...
##  $ Australian: chr  "started in 1905" "tournament not created" "tournament not created" "tournament not created" ...
##  $ French    : chr  "started in 1891" "tournament not created" "tournament not created" "tournament not created" ...
##  $ Wimbledon : chr  "<U+00A0>GBR:<U+00A0>Spencer Gore" "<U+00A0>Frank Hadow" "<U+00A0>John Hartley<U+00A0>(1/2)" "<U+00A0>John Hartley<U+00A0>(2/2)" ...
##  $ US        : chr  "started in 1881" "tournament not created" "tournament not created" "tournament not created" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 5
##   .. ..$ Year                   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ <U+00A0>Australian Open: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ <U+00A0>French Open    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ <U+00A0>Wimbledon      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ <U+00A0>US Open        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
tail(raw, 20)
## # A tibble: 20 x 5
##     Year Australian          French        Wimbledon          US          
##    <int> <chr>               <chr>         <chr>              <chr>       
##  1  1999 "\xa0Yevgeny Kafel~ "\xa0Andre A~ "\xa0Pete Sampras~ "\xa0Andre ~
##  2  2000 "\xa0Andre Agassi\~ "\xa0Gustavo~ "\xa0Pete Sampras~ "\xa0Marat ~
##  3  2001 "\xa0Andre Agassi\~ "\xa0Gustavo~ "\xa0Goran Ivani\~ "\xa0Lleyto~
##  4  2002 "\xa0Thomas Johans~ "\xa0Albert ~ "\xa0Lleyton Hewi~ "\xa0Pete S~
##  5  2003 "\xa0Andre Agassi\~ "\xa0Juan Ca~ "\xa0\xa0SUI:\xa0~ "\xa0Andy R~
##  6  2004 "\xa0Roger Federer~ "\xa0Gast\xf~ "\xa0Roger Federe~ "\xa0Roger ~
##  7  2005 "\xa0Marat Safin\x~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Roger ~
##  8  2006 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Roger ~
##  9  2007 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Roger ~
## 10  2008 "\xa0SRB:\xa0Novak~ "\xa0Rafael ~ "\xa0Rafael Nadal~ "\xa0Roger ~
## 11  2009 "\xa0Rafael Nadal\~ "\xa0Roger F~ "\xa0Roger Federe~ "\xa0Juan M~
## 12  2010 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Rafael Nadal~ "\xa0Rafael~
## 13  2011 "\xa0Novak Djokovi~ "\xa0Rafael ~ "\xa0Novak Djokov~ "\xa0Novak ~
## 14  2012 "\xa0Novak Djokovi~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Andy M~
## 15  2013 "\xa0Novak Djokovi~ "\xa0Rafael ~ "\xa0Andy Murray\~ "\xa0Rafael~
## 16  2014 "\xa0Stan Wawrinka~ "\xa0Rafael ~ "\xa0Novak Djokov~ "\xa0Marin ~
## 17  2015 "\xa0Novak Djokovi~ "\xa0Stan Wa~ "\xa0Novak Djokov~ "\xa0Novak ~
## 18  2016 "\xa0Novak Djokovi~ "\xa0Novak D~ "\xa0Andy Murray\~ "\xa0Stan W~
## 19  2017 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Rafael~
## 20  2018 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Novak Djokov~ "\xa0Novak ~

Next let’s gather the tournament data into separate observation rows by year, so that the remaining columns will be year, tournament, and player.

df1 <- raw %>% 
    # gather champions data into tournament and player columns
    gather("Tournament", "Player", 2:5) 
tail(df1, 20)
## # A tibble: 20 x 3
##     Year Tournament Player                         
##    <int> <chr>      <chr>                          
##  1  1999 US         "\xa0Andre Agassi\xa0(5/8)"    
##  2  2000 US         "\xa0Marat Safin\xa0(1/2)"     
##  3  2001 US         "\xa0Lleyton Hewitt\xa0(1/2)"  
##  4  2002 US         "\xa0Pete Sampras\xa0(14/14)"  
##  5  2003 US         "\xa0Andy Roddick"             
##  6  2004 US         "\xa0Roger Federer\xa0(4/20)"  
##  7  2005 US         "\xa0Roger Federer\xa0(6/20)"  
##  8  2006 US         "\xa0Roger Federer\xa0(9/20)"  
##  9  2007 US         "\xa0Roger Federer\xa0(12/20)" 
## 10  2008 US         "\xa0Roger Federer\xa0(13/20)" 
## 11  2009 US         "\xa0Juan Mart\xedn del Potro" 
## 12  2010 US         "\xa0Rafael Nadal\xa0(9/17)"   
## 13  2011 US         "\xa0Novak Djokovic\xa0(4/14)" 
## 14  2012 US         "\xa0Andy Murray\xa0(1/3)"     
## 15  2013 US         "\xa0Rafael Nadal\xa0(13/17)"  
## 16  2014 US         "\xa0Marin ?ili?"              
## 17  2015 US         "\xa0Novak Djokovic\xa0(10/14)"
## 18  2016 US         "\xa0Stan Wawrinka\xa0(3/3)"   
## 19  2017 US         "\xa0Rafael Nadal\xa0(16/17)"  
## 20  2018 US         "\xa0Novak Djokovic\xa0(14/14)"

There are several problems with the player names that we need to fix, in order to create a tidy data frame:

# basic data cleaning for player names
# (a) replace with NA the following entries:
#       - starting with "tournament" or "Tournament"
#       - starting with "started", "Started"
#       - starting with "world", "World"
#       - starting with "France", "Open", "Amateur"
# (b) remove html tags and other special characters, etc. (anything not alphanum, space or punctuation)
# (c) remove ending (m/n) and asterisks

# regular expressions for str_replace_all function
repl <- c("^[Tt]ournament.*" = NA_character_,
          "^[Ss]tarted.*" = NA_character_,
          "^[Ww]orld.*" = NA_character_,
          "^France.*" = NA_character_,
          "^Open.*" = NA_character_,
          "^Amateur.*" = NA_character_,
          "[^[:alnum:]\\s[:punct:]]" = "",
          "(\\(.+\\))*(\\*)*$" = ""
          )
df1$Player <- df1$Player %>% str_replace_all(repl)
tail(df1, 20)
## # A tibble: 20 x 3
##     Year Tournament Player              
##    <int> <chr>      <chr>               
##  1  1999 US         Andre Agassi        
##  2  2000 US         Marat Safin         
##  3  2001 US         Lleyton Hewitt      
##  4  2002 US         Pete Sampras        
##  5  2003 US         Andy Roddick        
##  6  2004 US         Roger Federer       
##  7  2005 US         Roger Federer       
##  8  2006 US         Roger Federer       
##  9  2007 US         Roger Federer       
## 10  2008 US         Roger Federer       
## 11  2009 US         Juan Martn del Potro
## 12  2010 US         Rafael Nadal        
## 13  2011 US         Novak Djokovic      
## 14  2012 US         Andy Murray         
## 15  2013 US         Rafael Nadal        
## 16  2014 US         Marin ?ili?         
## 17  2015 US         Novak Djokovic      
## 18  2016 US         Stan Wawrinka       
## 19  2017 US         Rafael Nadal        
## 20  2018 US         Novak Djokovic

This is looking much better now. However, there’s still a problem, in that country information is included for some players the first time the player appears on the list. This country information, if it’s complete for all players, may be useful. Let’s separate out the country information into a separate table, and then join the available country data with the player names.

# check player names with country information
df1$Player %>% str_subset(":")
##  [1] "AUS:Rodney Heath"         "NZL:Anthony Wilding"     
##  [3] "PER:Alex Olmedo"          "RSA:Johan Kriek"         
##  [5] "GER: Boris Becker"        "CZE:Petr Korda"          
##  [7] "SRB:Novak Djokovic"       "FRA:Jean Schopfer"       
##  [9] "GER:Gottfried von Cramm"  "GER: Gottfried von Cramm"
## [11] "HUN:Jzsef Asbth"          "EGY:Jaroslav Drobn"      
## [13] "SWE:Sven Davidson"        "ITA:Nicola Pietrangeli"  
## [15] "ESP:Manuel Santana"       "TCH:Jan Kode"            
## [17] "ARG:Guillermo Vilas"      "ECU:Andrs Gmez"          
## [19] "AUT:Thomas Muster"        "RUS:Yevgeny Kafelnikov"  
## [21] "BRA:Gustavo Kuerten"      "GBR:Spencer Gore"        
## [23] "EGY: Jaroslav Drobn"      "FRG:Boris Becker"        
## [25] "NED:Richard Krajicek"     "SUI:Roger Federer"       
## [27] "USA:Richard Sears"        "MEX:Rafael Osuna"        
## [29] "ROU:Ilie N?stase"
# put country info at end of string, then separate player name and country
df1$Player <- df1$Player %>% str_replace("^(\\w+)(\\s*:\\s*)(.+)", "\\3:\\1")
# temp will store temporary player & country info
temp <- str_split(df1$Player, ":", simplify = TRUE)
# assign just player name back to player column
df1$Player <- temp[ , 1]
# how many country entries?
sum(temp[ , 2] != "", na.rm = TRUE)
## [1] 29
# create player-country table
df_country <- as_tibble(temp[temp[ , 2] !="", ])
df_country
## # A tibble: 29 x 2
##    V1                  V2   
##    <chr>               <chr>
##  1 Rodney Heath        AUS  
##  2 Anthony Wilding     NZL  
##  3 Alex Olmedo         PER  
##  4 Johan Kriek         RSA  
##  5 Boris Becker        GER  
##  6 Petr Korda          CZE  
##  7 Novak Djokovic      SRB  
##  8 Jean Schopfer       FRA  
##  9 Gottfried von Cramm GER  
## 10 Gottfried von Cramm GER  
## # ... with 19 more rows
# add country info to all players using left join
df2 <- df1 %>% left_join(df_country, by = c("Player" = "V1"))
tail(df2, 20)
## # A tibble: 20 x 4
##     Year Tournament Player               V2   
##    <int> <chr>      <chr>                <chr>
##  1  1999 US         Andre Agassi         <NA> 
##  2  2000 US         Marat Safin          <NA> 
##  3  2001 US         Lleyton Hewitt       <NA> 
##  4  2002 US         Pete Sampras         <NA> 
##  5  2003 US         Andy Roddick         <NA> 
##  6  2004 US         Roger Federer        SUI  
##  7  2005 US         Roger Federer        SUI  
##  8  2006 US         Roger Federer        SUI  
##  9  2007 US         Roger Federer        SUI  
## 10  2008 US         Roger Federer        SUI  
## 11  2009 US         Juan Martn del Potro <NA> 
## 12  2010 US         Rafael Nadal         <NA> 
## 13  2011 US         Novak Djokovic       SRB  
## 14  2012 US         Andy Murray          <NA> 
## 15  2013 US         Rafael Nadal         <NA> 
## 16  2014 US         Marin ?ili?          <NA> 
## 17  2015 US         Novak Djokovic       SRB  
## 18  2016 US         Stan Wawrinka        <NA> 
## 19  2017 US         Rafael Nadal         <NA> 
## 20  2018 US         Novak Djokovic       SRB

Too bad, this is disappointing! It turns out that the country information is not consistently available in the table for all players. Let’s load another table of player names and countries of origin for Grand Slam winners from the Open period, which I created from the original dataset on Wikipedia. Then we can use this country information to join with the Grand Slam data frame, at least for players during the Open period (from 1968 on).

# read the CSV file from GitHub and add column names
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/player_country.csv"
raw1 <- read_csv(file1, col_names = FALSE)
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character()
## )
names(raw1) <- c("Player", "Country")
# remove special characters from player names
repl1 <- c("[^[:alnum:]\\s[:punct:]]" = "")
raw1$Player <- raw1$Player %>% str_replace_all(repl1) 
(df_country1 <- raw1)
## # A tibble: 54 x 2
##    Player          Country
##    <chr>           <chr>  
##  1 Adriano Panatta ITA    
##  2 Albert Costa    ESP    
##  3 Andre Agassi    USA    
##  4 Andrs Gimeno    ESP    
##  5 Andrs Gmez      ECU    
##  6 Andy Murray     GBR    
##  7 Andy Roddick    USA    
##  8 Arthur Ashe     USA    
##  9 Bjrn Borg       SWE    
## 10 Boris Becker    GER    
## # ... with 44 more rows
# add country info to all players using left join, also add flag for Open era
df3 <- df1 %>% left_join(df_country1) %>% mutate(Open_Era = ifelse(Year >= 1968, "Y", "N"))
## Joining, by = "Player"
tail(df3, 20)
## # A tibble: 20 x 5
##     Year Tournament Player               Country Open_Era
##    <int> <chr>      <chr>                <chr>   <chr>   
##  1  1999 US         Andre Agassi         USA     Y       
##  2  2000 US         Marat Safin          RUS     Y       
##  3  2001 US         Lleyton Hewitt       AUS     Y       
##  4  2002 US         Pete Sampras         USA     Y       
##  5  2003 US         Andy Roddick         USA     Y       
##  6  2004 US         Roger Federer        SUI     Y       
##  7  2005 US         Roger Federer        SUI     Y       
##  8  2006 US         Roger Federer        SUI     Y       
##  9  2007 US         Roger Federer        SUI     Y       
## 10  2008 US         Roger Federer        SUI     Y       
## 11  2009 US         Juan Martn del Potro ARG     Y       
## 12  2010 US         Rafael Nadal         ESP     Y       
## 13  2011 US         Novak Djokovic       SRB     Y       
## 14  2012 US         Andy Murray          GBR     Y       
## 15  2013 US         Rafael Nadal         ESP     Y       
## 16  2014 US         Marin ?ili?          CRO     Y       
## 17  2015 US         Novak Djokovic       SRB     Y       
## 18  2016 US         Stan Wawrinka        SUI     Y       
## 19  2017 US         Rafael Nadal         ESP     Y       
## 20  2018 US         Novak Djokovic       SRB     Y

Use the tidy data to address the questions

Now that we have a tidy data frame, we can analyze the data to answer the questions.

Q1: Which players have won the most Grand Slam men’s singles titles, overall and in the Open era?

From the tables and charts below:

  • Overall, the top 5 players by number of Grand Slam singles titles are:
    • Roger Feder (20)
    • Rafel Nadal (17)
    • Novak Djokovic (14)
    • Pete Sampras (14)
    • Roy Emerson (12)
  • In the Open era (from 1968 on), the top 5 players by number of Grand Slam singles titles are:
    • Roger Feder
    • Rafel Nadal
    • Novak Djokovic
    • Pete Sampras
    • Bjorn Borg
    I.e., the top 4 players remain the same as in the overall rankings, while Bjorn Borg moves into the 5th spot.
# most Grand Slam titles - overall
df_plot1 <- df3 %>% filter(!is.na(Player)) %>% group_by(Player) %>% summarize(Total = n()) %>% arrange(desc(Total)) %>% 
    filter(Total >= 5) 
kable(df_plot1, caption = "Players with 5 or More Grand Slam Titles - Overall")
Players with 5 or More Grand Slam Titles - Overall
Player Total
Roger Federer 20
Rafael Nadal 17
Novak Djokovic 14
Pete Sampras 14
Roy Emerson 12
Bjrn Borg 11
Rod Laver 11
Bill Tilden 10
Andre Agassi 8
Fred Perry 8
Henri Cochet 8
Ivan Lendl 8
Jimmy Connors 8
Ken Rosewall 8
Max Decugis 8
John McEnroe 7
John Newcombe 7
Mats Wilander 7
Ren Lacoste 7
Richard Sears 7
William Larned 7
William Renshaw 7
Anthony Wilding 6
Boris Becker 6
Don Budge 6
Jack Crawford 6
Laurence Doherty 6
Stefan Edberg 6
Frank Sedgman 5
Jean Borotra 5
Tony Trabert 5
ggplot(df_plot1) + geom_bar(aes(x = reorder(Player, Total), y = Total, fill = Total), stat = "identity") + coord_flip() + 
    labs(y = "Total Grand Slam Titles", x = "Player", title = "Players with 5 or More Men's Grand Slam Titles - Overall") 

# most Grand Slam titles - Open Era
df_plot2 <- df3 %>% filter(!is.na(Player), Open_Era == "Y") %>% group_by(Player) %>% summarize(Total = n()) %>% arrange(desc(Total)) %>% 
    filter(Total >= 5) 
kable(df_plot2, caption = "Players with 5 or More Grand Slam Titles - Open Era")
Players with 5 or More Grand Slam Titles - Open Era
Player Total
Roger Federer 20
Rafael Nadal 17
Novak Djokovic 14
Pete Sampras 14
Bjrn Borg 11
Andre Agassi 8
Ivan Lendl 8
Jimmy Connors 8
John McEnroe 7
Mats Wilander 7
Boris Becker 6
Stefan Edberg 6
John Newcombe 5
Rod Laver 5
ggplot(df_plot2) + geom_bar(aes(x = reorder(Player, Total), y = Total, fill = Total), stat = "identity") + coord_flip() + 
    labs(y = "Total Grand Slam Titles", x = "Player", title = "Players with 5 or More Men's Grand Slam Titles - Open Era") 

Q2: Who are the top players by tournament in the Open era?

The top 3 players in the Open era who have won the most Grand Slam singles titles, by tournament, are the following:

  • Australian Open
    • Novak Djokovic (6)
    • Roger Federer (6)
    • Andre Agassi (4)
  • French Open
    • Rafael Nadal (11)
    • Bjorn Borg (6)
    • Gustavo Kuerten, Ivan Lendl, Mats Wilander (3)
  • US Open
    • Jimmy Connors (5)
    • Pete Sampras (5)
    • Roger Federer (5)
  • Wimbledon
    • Roger Federer (8)
    • Pete Sampras (7)
    • Bjorn Borg (5)
# most Grand Slam titles in Open Era, by tournament
df_plot3 <- df3 %>% filter(!is.na(Player), Open_Era == "Y") %>% group_by(Player, Tournament) %>% summarize(Total = n()) %>% 
    arrange(Tournament, desc(Total)) %>% filter(Total >= 3) 
kable(df_plot3, caption = "Players with 3 or More Titles by Tournament - Open Era")
Players with 3 or More Titles by Tournament - Open Era
Player Tournament Total
Novak Djokovic Australian 6
Roger Federer Australian 6
Andre Agassi Australian 4
Mats Wilander Australian 3
Rafael Nadal French 11
Bjrn Borg French 6
Gustavo Kuerten French 3
Ivan Lendl French 3
Mats Wilander French 3
Jimmy Connors US 5
Pete Sampras US 5
Roger Federer US 5
John McEnroe US 4
Ivan Lendl US 3
Novak Djokovic US 3
Rafael Nadal US 3
Roger Federer Wimbledon 8
Pete Sampras Wimbledon 7
Bjrn Borg Wimbledon 5
Novak Djokovic Wimbledon 4
Boris Becker Wimbledon 3
John McEnroe Wimbledon 3
ggplot(df_plot3) + geom_bar(aes(x = Player, y = Total, fill = Tournament), stat = "identity") + coord_flip() + 
    labs(y = "Tournament Titles", x = "Player", title = "Top Champions in the Open Era, by Grand Slam Tournament") + facet_wrap(~ Tournament)

Q3: Which country is most represented by Grand Slam men’s singles winners in the Open era?

In the Open era, the top countries represented with the most men’s singles Grand Slam titles are:

  • USA (50)
  • Sweden (25)
  • Spain (24)
  • Switzerland (23)
  • Austria (19)
df_plot4 <- df3 %>% filter(!is.na(Player), Open_Era == "Y") %>% group_by(Country) %>% summarize(Total = n()) %>% arrange(desc(Total)) %>% 
    filter(Total >= 5) 
kable(df_plot4, caption = "Countries Represented with 5 or More Grand Slam Titles - Modern Era")
Countries Represented with 5 or More Grand Slam Titles - Modern Era
Country Total
USA 50
SWE 25
ESP 24
SUI 23
AUS 19
SRB 14
CZE 12
GER 7
ARG 6
ggplot(df_plot4) + geom_bar(aes(x = reorder(Country, Total), y = Total, fill = Total), stat = "identity") + coord_flip() + 
    labs(y = "Total Grand Slam Titles", x = "Country", title = "Countries Represented with 5 or More Men's Grand Slam Titles - Open Era")