In this exercise, we work with a list of Grand Slam men’s singles champions. The idea of using this dataset came from a post by Peter Kowalchuk on the Week 5 Discussion Board. The source of the dataset can be found at: https://en.wikipedia.org/wiki/List_of_Grand_Slam_men%27s_singles_champions
In analyzing the data, I was interested in addressing three questions:
Which players have won the most Grand Slam men’s singles titles, overall and in the Open era?
Who are the top players by tournament in the Open era?
Which country is most represented by Grand Slam men’s singles winners in the Open era?
The list of Grand Slam men’s singles champions is available on the Wikipedia website. I copied and saved the data as a CSV file on GitHub. The data include the men’s singles champion by year for each Grand Slam tournament.
First, let’s read in the data from the CSV file stored on GitHub.
library(tidyverse)
library(knitr)
# read the CSV file from GitHub and add column names
file <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/grandslam.csv"
raw <- read_csv(file)
names(raw) <- c("Year", "Australian", "French", "Wimbledon", "US")
str(raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 144 obs. of 5 variables:
## $ Year : int 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 ...
## $ Australian: chr "started in 1905" "tournament not created" "tournament not created" "tournament not created" ...
## $ French : chr "started in 1891" "tournament not created" "tournament not created" "tournament not created" ...
## $ Wimbledon : chr "<U+00A0>GBR:<U+00A0>Spencer Gore" "<U+00A0>Frank Hadow" "<U+00A0>John Hartley<U+00A0>(1/2)" "<U+00A0>John Hartley<U+00A0>(2/2)" ...
## $ US : chr "started in 1881" "tournament not created" "tournament not created" "tournament not created" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ Year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ <U+00A0>Australian Open: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ <U+00A0>French Open : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ <U+00A0>Wimbledon : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ <U+00A0>US Open : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
tail(raw, 20)
## # A tibble: 20 x 5
## Year Australian French Wimbledon US
## <int> <chr> <chr> <chr> <chr>
## 1 1999 "\xa0Yevgeny Kafel~ "\xa0Andre A~ "\xa0Pete Sampras~ "\xa0Andre ~
## 2 2000 "\xa0Andre Agassi\~ "\xa0Gustavo~ "\xa0Pete Sampras~ "\xa0Marat ~
## 3 2001 "\xa0Andre Agassi\~ "\xa0Gustavo~ "\xa0Goran Ivani\~ "\xa0Lleyto~
## 4 2002 "\xa0Thomas Johans~ "\xa0Albert ~ "\xa0Lleyton Hewi~ "\xa0Pete S~
## 5 2003 "\xa0Andre Agassi\~ "\xa0Juan Ca~ "\xa0\xa0SUI:\xa0~ "\xa0Andy R~
## 6 2004 "\xa0Roger Federer~ "\xa0Gast\xf~ "\xa0Roger Federe~ "\xa0Roger ~
## 7 2005 "\xa0Marat Safin\x~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Roger ~
## 8 2006 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Roger ~
## 9 2007 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Roger ~
## 10 2008 "\xa0SRB:\xa0Novak~ "\xa0Rafael ~ "\xa0Rafael Nadal~ "\xa0Roger ~
## 11 2009 "\xa0Rafael Nadal\~ "\xa0Roger F~ "\xa0Roger Federe~ "\xa0Juan M~
## 12 2010 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Rafael Nadal~ "\xa0Rafael~
## 13 2011 "\xa0Novak Djokovi~ "\xa0Rafael ~ "\xa0Novak Djokov~ "\xa0Novak ~
## 14 2012 "\xa0Novak Djokovi~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Andy M~
## 15 2013 "\xa0Novak Djokovi~ "\xa0Rafael ~ "\xa0Andy Murray\~ "\xa0Rafael~
## 16 2014 "\xa0Stan Wawrinka~ "\xa0Rafael ~ "\xa0Novak Djokov~ "\xa0Marin ~
## 17 2015 "\xa0Novak Djokovi~ "\xa0Stan Wa~ "\xa0Novak Djokov~ "\xa0Novak ~
## 18 2016 "\xa0Novak Djokovi~ "\xa0Novak D~ "\xa0Andy Murray\~ "\xa0Stan W~
## 19 2017 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Roger Federe~ "\xa0Rafael~
## 20 2018 "\xa0Roger Federer~ "\xa0Rafael ~ "\xa0Novak Djokov~ "\xa0Novak ~
Next let’s gather the tournament data into separate observation rows by year, so that the remaining columns will be year, tournament, and player.
df1 <- raw %>%
# gather champions data into tournament and player columns
gather("Tournament", "Player", 2:5)
tail(df1, 20)
## # A tibble: 20 x 3
## Year Tournament Player
## <int> <chr> <chr>
## 1 1999 US "\xa0Andre Agassi\xa0(5/8)"
## 2 2000 US "\xa0Marat Safin\xa0(1/2)"
## 3 2001 US "\xa0Lleyton Hewitt\xa0(1/2)"
## 4 2002 US "\xa0Pete Sampras\xa0(14/14)"
## 5 2003 US "\xa0Andy Roddick"
## 6 2004 US "\xa0Roger Federer\xa0(4/20)"
## 7 2005 US "\xa0Roger Federer\xa0(6/20)"
## 8 2006 US "\xa0Roger Federer\xa0(9/20)"
## 9 2007 US "\xa0Roger Federer\xa0(12/20)"
## 10 2008 US "\xa0Roger Federer\xa0(13/20)"
## 11 2009 US "\xa0Juan Mart\xedn del Potro"
## 12 2010 US "\xa0Rafael Nadal\xa0(9/17)"
## 13 2011 US "\xa0Novak Djokovic\xa0(4/14)"
## 14 2012 US "\xa0Andy Murray\xa0(1/3)"
## 15 2013 US "\xa0Rafael Nadal\xa0(13/17)"
## 16 2014 US "\xa0Marin ?ili?"
## 17 2015 US "\xa0Novak Djokovic\xa0(10/14)"
## 18 2016 US "\xa0Stan Wawrinka\xa0(3/3)"
## 19 2017 US "\xa0Rafael Nadal\xa0(16/17)"
## 20 2018 US "\xa0Novak Djokovic\xa0(14/14)"
There are several problems with the player names that we need to fix, in order to create a tidy data frame:
Missing data should be represented by NAs, instead of explanations such as “started in 1905”, “tournament not created”, “World War I”, etc.
HTML tags (for national flags) should be removed
Other special characters such as footnotes, asterisks, etc. should be removed.
# basic data cleaning for player names
# (a) replace with NA the following entries:
# - starting with "tournament" or "Tournament"
# - starting with "started", "Started"
# - starting with "world", "World"
# - starting with "France", "Open", "Amateur"
# (b) remove html tags and other special characters, etc. (anything not alphanum, space or punctuation)
# (c) remove ending (m/n) and asterisks
# regular expressions for str_replace_all function
repl <- c("^[Tt]ournament.*" = NA_character_,
"^[Ss]tarted.*" = NA_character_,
"^[Ww]orld.*" = NA_character_,
"^France.*" = NA_character_,
"^Open.*" = NA_character_,
"^Amateur.*" = NA_character_,
"[^[:alnum:]\\s[:punct:]]" = "",
"(\\(.+\\))*(\\*)*$" = ""
)
df1$Player <- df1$Player %>% str_replace_all(repl)
tail(df1, 20)
## # A tibble: 20 x 3
## Year Tournament Player
## <int> <chr> <chr>
## 1 1999 US Andre Agassi
## 2 2000 US Marat Safin
## 3 2001 US Lleyton Hewitt
## 4 2002 US Pete Sampras
## 5 2003 US Andy Roddick
## 6 2004 US Roger Federer
## 7 2005 US Roger Federer
## 8 2006 US Roger Federer
## 9 2007 US Roger Federer
## 10 2008 US Roger Federer
## 11 2009 US Juan Martn del Potro
## 12 2010 US Rafael Nadal
## 13 2011 US Novak Djokovic
## 14 2012 US Andy Murray
## 15 2013 US Rafael Nadal
## 16 2014 US Marin ?ili?
## 17 2015 US Novak Djokovic
## 18 2016 US Stan Wawrinka
## 19 2017 US Rafael Nadal
## 20 2018 US Novak Djokovic
This is looking much better now. However, there’s still a problem, in that country information is included for some players the first time the player appears on the list. This country information, if it’s complete for all players, may be useful. Let’s separate out the country information into a separate table, and then join the available country data with the player names.
# check player names with country information
df1$Player %>% str_subset(":")
## [1] "AUS:Rodney Heath" "NZL:Anthony Wilding"
## [3] "PER:Alex Olmedo" "RSA:Johan Kriek"
## [5] "GER: Boris Becker" "CZE:Petr Korda"
## [7] "SRB:Novak Djokovic" "FRA:Jean Schopfer"
## [9] "GER:Gottfried von Cramm" "GER: Gottfried von Cramm"
## [11] "HUN:Jzsef Asbth" "EGY:Jaroslav Drobn"
## [13] "SWE:Sven Davidson" "ITA:Nicola Pietrangeli"
## [15] "ESP:Manuel Santana" "TCH:Jan Kode"
## [17] "ARG:Guillermo Vilas" "ECU:Andrs Gmez"
## [19] "AUT:Thomas Muster" "RUS:Yevgeny Kafelnikov"
## [21] "BRA:Gustavo Kuerten" "GBR:Spencer Gore"
## [23] "EGY: Jaroslav Drobn" "FRG:Boris Becker"
## [25] "NED:Richard Krajicek" "SUI:Roger Federer"
## [27] "USA:Richard Sears" "MEX:Rafael Osuna"
## [29] "ROU:Ilie N?stase"
# put country info at end of string, then separate player name and country
df1$Player <- df1$Player %>% str_replace("^(\\w+)(\\s*:\\s*)(.+)", "\\3:\\1")
# temp will store temporary player & country info
temp <- str_split(df1$Player, ":", simplify = TRUE)
# assign just player name back to player column
df1$Player <- temp[ , 1]
# how many country entries?
sum(temp[ , 2] != "", na.rm = TRUE)
## [1] 29
# create player-country table
df_country <- as_tibble(temp[temp[ , 2] !="", ])
df_country
## # A tibble: 29 x 2
## V1 V2
## <chr> <chr>
## 1 Rodney Heath AUS
## 2 Anthony Wilding NZL
## 3 Alex Olmedo PER
## 4 Johan Kriek RSA
## 5 Boris Becker GER
## 6 Petr Korda CZE
## 7 Novak Djokovic SRB
## 8 Jean Schopfer FRA
## 9 Gottfried von Cramm GER
## 10 Gottfried von Cramm GER
## # ... with 19 more rows
# add country info to all players using left join
df2 <- df1 %>% left_join(df_country, by = c("Player" = "V1"))
tail(df2, 20)
## # A tibble: 20 x 4
## Year Tournament Player V2
## <int> <chr> <chr> <chr>
## 1 1999 US Andre Agassi <NA>
## 2 2000 US Marat Safin <NA>
## 3 2001 US Lleyton Hewitt <NA>
## 4 2002 US Pete Sampras <NA>
## 5 2003 US Andy Roddick <NA>
## 6 2004 US Roger Federer SUI
## 7 2005 US Roger Federer SUI
## 8 2006 US Roger Federer SUI
## 9 2007 US Roger Federer SUI
## 10 2008 US Roger Federer SUI
## 11 2009 US Juan Martn del Potro <NA>
## 12 2010 US Rafael Nadal <NA>
## 13 2011 US Novak Djokovic SRB
## 14 2012 US Andy Murray <NA>
## 15 2013 US Rafael Nadal <NA>
## 16 2014 US Marin ?ili? <NA>
## 17 2015 US Novak Djokovic SRB
## 18 2016 US Stan Wawrinka <NA>
## 19 2017 US Rafael Nadal <NA>
## 20 2018 US Novak Djokovic SRB
Too bad, this is disappointing! It turns out that the country information is not consistently available in the table for all players. Let’s load another table of player names and countries of origin for Grand Slam winners from the Open period, which I created from the original dataset on Wikipedia. Then we can use this country information to join with the Grand Slam data frame, at least for players during the Open period (from 1968 on).
# read the CSV file from GitHub and add column names
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/player_country.csv"
raw1 <- read_csv(file1, col_names = FALSE)
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character()
## )
names(raw1) <- c("Player", "Country")
# remove special characters from player names
repl1 <- c("[^[:alnum:]\\s[:punct:]]" = "")
raw1$Player <- raw1$Player %>% str_replace_all(repl1)
(df_country1 <- raw1)
## # A tibble: 54 x 2
## Player Country
## <chr> <chr>
## 1 Adriano Panatta ITA
## 2 Albert Costa ESP
## 3 Andre Agassi USA
## 4 Andrs Gimeno ESP
## 5 Andrs Gmez ECU
## 6 Andy Murray GBR
## 7 Andy Roddick USA
## 8 Arthur Ashe USA
## 9 Bjrn Borg SWE
## 10 Boris Becker GER
## # ... with 44 more rows
# add country info to all players using left join, also add flag for Open era
df3 <- df1 %>% left_join(df_country1) %>% mutate(Open_Era = ifelse(Year >= 1968, "Y", "N"))
## Joining, by = "Player"
tail(df3, 20)
## # A tibble: 20 x 5
## Year Tournament Player Country Open_Era
## <int> <chr> <chr> <chr> <chr>
## 1 1999 US Andre Agassi USA Y
## 2 2000 US Marat Safin RUS Y
## 3 2001 US Lleyton Hewitt AUS Y
## 4 2002 US Pete Sampras USA Y
## 5 2003 US Andy Roddick USA Y
## 6 2004 US Roger Federer SUI Y
## 7 2005 US Roger Federer SUI Y
## 8 2006 US Roger Federer SUI Y
## 9 2007 US Roger Federer SUI Y
## 10 2008 US Roger Federer SUI Y
## 11 2009 US Juan Martn del Potro ARG Y
## 12 2010 US Rafael Nadal ESP Y
## 13 2011 US Novak Djokovic SRB Y
## 14 2012 US Andy Murray GBR Y
## 15 2013 US Rafael Nadal ESP Y
## 16 2014 US Marin ?ili? CRO Y
## 17 2015 US Novak Djokovic SRB Y
## 18 2016 US Stan Wawrinka SUI Y
## 19 2017 US Rafael Nadal ESP Y
## 20 2018 US Novak Djokovic SRB Y
Now that we have a tidy data frame, we can analyze the data to answer the questions.
From the tables and charts below:
# most Grand Slam titles - overall
df_plot1 <- df3 %>% filter(!is.na(Player)) %>% group_by(Player) %>% summarize(Total = n()) %>% arrange(desc(Total)) %>%
filter(Total >= 5)
kable(df_plot1, caption = "Players with 5 or More Grand Slam Titles - Overall")
| Player | Total |
|---|---|
| Roger Federer | 20 |
| Rafael Nadal | 17 |
| Novak Djokovic | 14 |
| Pete Sampras | 14 |
| Roy Emerson | 12 |
| Bjrn Borg | 11 |
| Rod Laver | 11 |
| Bill Tilden | 10 |
| Andre Agassi | 8 |
| Fred Perry | 8 |
| Henri Cochet | 8 |
| Ivan Lendl | 8 |
| Jimmy Connors | 8 |
| Ken Rosewall | 8 |
| Max Decugis | 8 |
| John McEnroe | 7 |
| John Newcombe | 7 |
| Mats Wilander | 7 |
| Ren Lacoste | 7 |
| Richard Sears | 7 |
| William Larned | 7 |
| William Renshaw | 7 |
| Anthony Wilding | 6 |
| Boris Becker | 6 |
| Don Budge | 6 |
| Jack Crawford | 6 |
| Laurence Doherty | 6 |
| Stefan Edberg | 6 |
| Frank Sedgman | 5 |
| Jean Borotra | 5 |
| Tony Trabert | 5 |
ggplot(df_plot1) + geom_bar(aes(x = reorder(Player, Total), y = Total, fill = Total), stat = "identity") + coord_flip() +
labs(y = "Total Grand Slam Titles", x = "Player", title = "Players with 5 or More Men's Grand Slam Titles - Overall")
# most Grand Slam titles - Open Era
df_plot2 <- df3 %>% filter(!is.na(Player), Open_Era == "Y") %>% group_by(Player) %>% summarize(Total = n()) %>% arrange(desc(Total)) %>%
filter(Total >= 5)
kable(df_plot2, caption = "Players with 5 or More Grand Slam Titles - Open Era")
| Player | Total |
|---|---|
| Roger Federer | 20 |
| Rafael Nadal | 17 |
| Novak Djokovic | 14 |
| Pete Sampras | 14 |
| Bjrn Borg | 11 |
| Andre Agassi | 8 |
| Ivan Lendl | 8 |
| Jimmy Connors | 8 |
| John McEnroe | 7 |
| Mats Wilander | 7 |
| Boris Becker | 6 |
| Stefan Edberg | 6 |
| John Newcombe | 5 |
| Rod Laver | 5 |
ggplot(df_plot2) + geom_bar(aes(x = reorder(Player, Total), y = Total, fill = Total), stat = "identity") + coord_flip() +
labs(y = "Total Grand Slam Titles", x = "Player", title = "Players with 5 or More Men's Grand Slam Titles - Open Era")
The top 3 players in the Open era who have won the most Grand Slam singles titles, by tournament, are the following:
# most Grand Slam titles in Open Era, by tournament
df_plot3 <- df3 %>% filter(!is.na(Player), Open_Era == "Y") %>% group_by(Player, Tournament) %>% summarize(Total = n()) %>%
arrange(Tournament, desc(Total)) %>% filter(Total >= 3)
kable(df_plot3, caption = "Players with 3 or More Titles by Tournament - Open Era")
| Player | Tournament | Total |
|---|---|---|
| Novak Djokovic | Australian | 6 |
| Roger Federer | Australian | 6 |
| Andre Agassi | Australian | 4 |
| Mats Wilander | Australian | 3 |
| Rafael Nadal | French | 11 |
| Bjrn Borg | French | 6 |
| Gustavo Kuerten | French | 3 |
| Ivan Lendl | French | 3 |
| Mats Wilander | French | 3 |
| Jimmy Connors | US | 5 |
| Pete Sampras | US | 5 |
| Roger Federer | US | 5 |
| John McEnroe | US | 4 |
| Ivan Lendl | US | 3 |
| Novak Djokovic | US | 3 |
| Rafael Nadal | US | 3 |
| Roger Federer | Wimbledon | 8 |
| Pete Sampras | Wimbledon | 7 |
| Bjrn Borg | Wimbledon | 5 |
| Novak Djokovic | Wimbledon | 4 |
| Boris Becker | Wimbledon | 3 |
| John McEnroe | Wimbledon | 3 |
ggplot(df_plot3) + geom_bar(aes(x = Player, y = Total, fill = Tournament), stat = "identity") + coord_flip() +
labs(y = "Tournament Titles", x = "Player", title = "Top Champions in the Open Era, by Grand Slam Tournament") + facet_wrap(~ Tournament)
In the Open era, the top countries represented with the most men’s singles Grand Slam titles are:
df_plot4 <- df3 %>% filter(!is.na(Player), Open_Era == "Y") %>% group_by(Country) %>% summarize(Total = n()) %>% arrange(desc(Total)) %>%
filter(Total >= 5)
kable(df_plot4, caption = "Countries Represented with 5 or More Grand Slam Titles - Modern Era")
| Country | Total |
|---|---|
| USA | 50 |
| SWE | 25 |
| ESP | 24 |
| SUI | 23 |
| AUS | 19 |
| SRB | 14 |
| CZE | 12 |
| GER | 7 |
| ARG | 6 |
ggplot(df_plot4) + geom_bar(aes(x = reorder(Country, Total), y = Total, fill = Total), stat = "identity") + coord_flip() +
labs(y = "Total Grand Slam Titles", x = "Country", title = "Countries Represented with 5 or More Men's Grand Slam Titles - Open Era")