I replicated a blog post by Giorgio Garziano called “Visualizing Tennis Grand Slam Winners Performances”. Working through the article showed me a new world of data visualization. More importantly, it was geared towards one of my significant passions in life: the game of tennis. The post also introduced me to a whole new world of data manipulation, gave me more insight, and made me want to learn the many different ways you can manipulate data in R and RStudio.
This post explores the winner of every major tennis Grand Slam tournament from 1877 through the 2017 Australian Open, the latest final in the data set. The data set includes every Grand Slam winner and runner-up. The plots, graphs, and diagrams in this post let a reader visualize who won which tournament in which year, and which players won a specific tournament multiple times.
The following packages were used in this R blog replication for data manipulation and plotting.
library(ggplot2)
library(gplots)
library(RColorBrewer)
library(dplyr)
library(knitr)
library(timelineS)
library(circlize)
library(fmsb)
The next chunk loads the same R libraries with their startup messages suppressed and imports the Tennis Grand Slam Winners dataset.
library_toload <- c("dplyr", "knitr", "ggplot2", "gplots", "RColorBrewer", "timelineS", "circlize", "fmsb")
invisible(lapply(library_toload, function(x) {suppressPackageStartupMessages(library(x, character.only=TRUE))}))
url_file <- "https://datascienceplus.com/wp-content/uploads/2017/04/tennis-grand-slam-winners.txt"
slam_win <- read.delim(url(url_file), sep="\t", stringsAsFactors = FALSE)
kable(head(slam_win, 20))
| YEAR | TOURNAMENT | WINNER | RUNNER.UP |
|---|---|---|---|
| 2017 | Australian Open | Roger Federer | Rafael Nadal |
| 2016 | U.S. Open | Stan Wawrinka | Novak Djokovic |
| 2016 | Wimbledon | Andy Murray | Milos Raonic |
| 2016 | French Open | Novak Djokovic | Andy Murray |
| 2016 | Australian Open | Novak Djokovic | Andy Murray |
| 2015 | U.S. Open | Novak Djokovic | Roger Federer |
| 2015 | Wimbledon | Novak Djokovic | Roger Federer |
| 2015 | French Open | Stan Wawrinka | Novak Djokovic |
| 2015 | Australian Open | Novak Djokovic | Andy Murray |
| 2014 | U.S. Open | Marin Cilic | Kei Nishikori |
| 2014 | Wimbledon | Novak Djokovic | Roger Federer |
| 2014 | French Open | Rafael Nadal | Novak Djokovic |
| 2014 | Australian Open | Stan Wawrinka | Rafael Nadal |
| 2013 | U.S. Open | Rafael Nadal | Novak Djokovic |
| 2013 | Wimbledon | Andy Murray | Novak Djokovic |
| 2013 | French Open | Rafael Nadal | David Ferrer |
| 2013 | Australian Open | Novak Djokovic | Andy Murray |
| 2012 | U.S. Open | Andy Murray | Novak Djokovic |
| 2012 | Wimbledon | Roger Federer | Andy Murray |
| 2012 | French Open | Rafael Nadal | Novak Djokovic |
The author adjusted the TOURNAMENT column so that every Australian Open entry uses the same name.
slam_win[grep("Australian Open", slam_win$TOURNAMENT), "TOURNAMENT"] = "Australian Open"
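The pattern-based assignment above can be tried on a small vector. The variant labels below are hypothetical stand-ins, not values taken from the dataset; the sketch only shows how grep() returns the positions that contain the pattern so that they can be overwritten with one canonical name.

```r
# Hypothetical tournament labels; the variant spellings are invented for illustration.
tournament <- c("Australian Open (Jan)", "Wimbledon", "Australian Open (Dec)", "French Open")

# grep() returns the indices of the elements that contain the pattern.
idx <- grep("Australian Open", tournament)

# Overwrite every matching element with a single canonical name.
tournament[idx] <- "Australian Open"

print(idx)
print(unique(tournament))
```

Because grep() matches substrings, any label that merely contains "Australian Open" is rewritten, which is what makes the one-line fix work.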
To build a publication-ready table that pairs each champion’s name with his number of Grand Slam wins, we group the data by winner and count the wins for each.
slam_top_chart = slam_win %>% group_by(WINNER) %>% summarise(NUM_WINS=n()) %>% arrange(desc(NUM_WINS))
kable(head(slam_top_chart, 40))
| WINNER | NUM_WINS |
|---|---|
| Roger Federer | 18 |
| Pete Sampras | 14 |
| Rafael Nadal | 14 |
| Novak Djokovic | 12 |
| Roy Emerson | 12 |
| Bjorn Borg | 11 |
| Rod Laver | 11 |
| William T. Tilden | 10 |
| Andre Agassi | 8 |
| Fred Perry | 8 |
| Henri Cochet | 8 |
| Ivan Lendl | 8 |
| Jimmy Connors | 8 |
| Ken Rosewall | 8 |
| Max Decugis | 8 |
| William A. Larned | 8 |
| John McEnroe | 7 |
| John Newcombe | 7 |
| Mats Wilander | 7 |
| Rene Lacoste | 7 |
| Richard D. Sears | 7 |
| William Renshaw | 7 |
| Boris Becker | 6 |
| Donald Budge | 6 |
| Stefan Edberg | 6 |
| Frank Sedgman | 5 |
| Jack Crawford | 5 |
| Jean Borotra | 5 |
| Laurie Doherty | 5 |
| Tony Trabert | 5 |
| Andre Vacherot | 4 |
| Anthony Wilding | 4 |
| Ashley J. Cooper | 4 |
| Frank Parker | 4 |
| Guillermo Vilas | 4 |
| Jim Courier | 4 |
| Lewis Hoad | 4 |
| Manuel Santana | 4 |
| Pat O’Hara Wood | 4 |
| Paul Ayme | 4 |
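As a cross-check that does not depend on dplyr, the same kind of count can be sketched with base R. This is only an illustration, not the original author's method; it uses the winners from the first five rows of the dataset preview, and the name top_chart is made up for the example.

```r
# Winners of the five most recent finals in the dataset preview.
winner <- c("Roger Federer", "Stan Wawrinka", "Andy Murray",
            "Novak Djokovic", "Novak Djokovic")

# table() counts occurrences per name; sort() puts the largest count first.
win_counts <- sort(table(winner), decreasing = TRUE)

# Convert to a two-column data frame comparable to slam_top_chart.
top_chart <- data.frame(WINNER = names(win_counts),
                        NUM_WINS = as.integer(win_counts),
                        stringsAsFactors = FALSE)
print(top_chart)
```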
A bar plot works well for visualizing this data. It provides a way to compare each player’s total across all Grand Slam tournaments. The bar plot below includes only the champions who have won at least four Grand Slam tournaments.
slam_top_chart$WINNER <- factor(slam_top_chart$WINNER, levels = slam_top_chart$WINNER[order(slam_top_chart$NUM_WINS)])
top_winners_gt4 = slam_top_chart %>% filter(NUM_WINS >= 4)
the_colours = c("#FF4000FF", "#FF8000FF", "#FFFF00FF", "#80FF00FF",
"#00FF00FF", "#00FF80FF", "#00FFFFFF", "#0080FFFF",
"#FF00FFFF", "#000000FF", "#0000FFFF")
ggplot(data=top_winners_gt4, aes(x=WINNER, y=NUM_WINS, fill=NUM_WINS)) +
  geom_bar(stat='identity') + coord_flip() + guides(fill="none") + # "none" hides the fill legend
  scale_fill_gradientn(colours = the_colours)
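The first line of the code above reorders the factor levels of WINNER, and that is what controls the order of the bars. A minimal sketch with three champions from the summary table shows the difference between the default alphabetical levels and levels ordered by win count; the data frame name chart is made up for the example.

```r
# Three champions and their win totals, taken from the summary table.
chart <- data.frame(WINNER = c("Roger Federer", "Rafael Nadal", "Roy Emerson"),
                    NUM_WINS = c(18, 14, 12),
                    stringsAsFactors = FALSE)

# Without explicit levels, factor() sorts the names alphabetically.
default_levels <- levels(factor(chart$WINNER))

# Ordering the levels by NUM_WINS makes the plot axis follow the win count.
chart$WINNER <- factor(chart$WINNER, levels = chart$WINNER[order(chart$NUM_WINS)])

print(default_levels)
print(levels(chart$WINNER))
```

With coord_flip(), the last level is drawn at the top, so the player with the most wins ends up as the top bar.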
To compare the champions’ performance at each specific Grand Slam, we group the data by both tournament and winner.
slam_top_chart_by_trn = slam_win %>% filter(WINNER %in% top_winners_gt4$WINNER) %>% group_by(TOURNAMENT, WINNER) %>% summarise(NUM_WINS=n()) %>% arrange(desc(NUM_WINS))
slam_top_chart_by_trn$NUM_WINS <- factor(slam_top_chart_by_trn$NUM_WINS)
kable(head(slam_top_chart_by_trn, 10))
| TOURNAMENT | WINNER | NUM_WINS |
|---|---|---|
| French Open | Rafael Nadal | 9 |
| French Open | Max Decugis | 8 |
| U.S. Open | William A. Larned | 8 |
| U.S. Open | Richard D. Sears | 7 |
| U.S. Open | William T. Tilden | 7 |
| Wimbledon | Pete Sampras | 7 |
| Wimbledon | Roger Federer | 7 |
| Wimbledon | William Renshaw | 7 |
| Australian Open | Novak Djokovic | 6 |
| Australian Open | Roy Emerson | 6 |
The next bar plot shows, for each Grand Slam tournament, how many times each of these players won it.
ggplot(data=slam_top_chart_by_trn, aes(x=WINNER, y=NUM_WINS, fill=NUM_WINS)) +
  geom_bar(stat='identity') + coord_flip() + guides(fill="none") + scale_y_discrete() +
  facet_grid(. ~ TOURNAMENT)
A heatmap is essentially a table that uses colors to represent data. Heatmaps are useful for picking out specific data points, such as high and low values, and they help identify patterns and locate outliers. Here we use a heatmap to highlight how many times each pair of players met in a Grand Slam final, restricted to the first 50 rows of the dataset (the most recent finals). The matrix below reports the count of each pair’s match-ups.
tl_rec <- 1:50
winner_runnerup <- slam_win[tl_rec, c("WINNER", "RUNNER.UP")]
winner_runnerup_names <- unique(c(winner_runnerup[,1], winner_runnerup[,2]))
match_matrix <- matrix(0, nrow=length(winner_runnerup_names),
ncol=length(winner_runnerup_names))
colnames(match_matrix) <- winner_runnerup_names
rownames(match_matrix) <- winner_runnerup_names
for(i in 1:nrow(winner_runnerup)) {
winner <- winner_runnerup[i, "WINNER"]
runner_up <- winner_runnerup[i, "RUNNER.UP"]
r <- which(rownames(match_matrix) == winner)
c <- which(colnames(match_matrix) == runner_up)
match_matrix[r,c] <- match_matrix[r,c] + 1
match_matrix[c,r] <- match_matrix[c,r] + 1
}
diag(match_matrix) <- NA
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
col_breaks = c(seq(0, 0.99, length=100), # for green
seq(1, 5, length=100), # for yellow
seq(5.01, 10, length=100)) # for red
heatmap.2(match_matrix,
cellnote = match_matrix, # same data set for cell labels
main = "Tennis Grand Slam Champions - Finals Match Heatmap", # heat map title
notecol = "black", # change font color of cell labels to black
density.info = "none", # turns off density plot inside color legend
trace = "none", # turns off trace lines inside the heat map
margins = c(12,9), # widens margins around plot
col= my_palette, # use on color palette defined earlier
breaks = col_breaks, # enable color transition at specified limits
dendrogram = "none", # do not draw any dendrogram
Colv = "NA")
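The matrix-building loop above can be followed on a handful of rows. The sketch below uses the four 2016 finals from the dataset preview and indexes the matrix by player name instead of by which(); the names finals, players and m are made up for the example.

```r
# The four 2016 finals from the dataset preview (winner, runner-up).
finals <- data.frame(
  WINNER    = c("Stan Wawrinka", "Andy Murray", "Novak Djokovic", "Novak Djokovic"),
  RUNNER.UP = c("Novak Djokovic", "Milos Raonic", "Andy Murray", "Andy Murray"),
  stringsAsFactors = FALSE
)

# One row and one column per player who appears in any final.
players <- unique(c(finals$WINNER, finals$RUNNER.UP))
m <- matrix(0, nrow = length(players), ncol = length(players),
            dimnames = list(players, players))

# Count each final in both directions so the matrix is symmetric.
for (i in seq_len(nrow(finals))) {
  w <- finals$WINNER[i]
  r <- finals$RUNNER.UP[i]
  m[w, r] <- m[w, r] + 1
  m[r, w] <- m[r, w] + 1
}

print(m)
```

Each final adds one to two mirrored cells, so the matrix records how often two players met regardless of who won.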
A dendrogram is a tree diagram that displays the groups formed by hierarchical clustering. The following table displays the top 20 champions ordered from most to fewest wins. The champions are clustered on their number of wins, and the groups are obtained by cutting the tree at a height of two wins.
ch_n <- 1:20
kable(slam_top_chart[ch_n,])
| WINNER | NUM_WINS |
|---|---|
| Roger Federer | 18 |
| Pete Sampras | 14 |
| Rafael Nadal | 14 |
| Novak Djokovic | 12 |
| Roy Emerson | 12 |
| Bjorn Borg | 11 |
| Rod Laver | 11 |
| William T. Tilden | 10 |
| Andre Agassi | 8 |
| Fred Perry | 8 |
| Henri Cochet | 8 |
| Ivan Lendl | 8 |
| Jimmy Connors | 8 |
| Ken Rosewall | 8 |
| Max Decugis | 8 |
| William A. Larned | 8 |
| John McEnroe | 7 |
| John Newcombe | 7 |
| Mats Wilander | 7 |
| Rene Lacoste | 7 |
wins <- slam_top_chart[ch_n, -1]
d_wins <- dist(wins, method = "euclidean")
hclust_fit <- hclust(d_wins)
h_value <- 2
groups <- cutree(hclust_fit, h = h_value)
plot(hclust_fit, labels = slam_top_chart$WINNER[ch_n], main = "Champions Dendrogram")
rect.hclust(hclust_fit, h = h_value, border = "blue")
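To see what cutting the tree at a height of two means, here is a small sketch on the win totals of the eight most successful champions from the table above. It relies on hclust()'s default complete linkage, under which the merge height of a cluster equals the largest distance between any two of its members; the variable names are made up for the example.

```r
# Win totals of the eight most successful champions from the table above.
wins <- c(18, 14, 14, 12, 12, 11, 11, 10)

# hclust() uses complete linkage by default, so the merge height of a cluster
# is the largest pairwise distance among its members.
fit <- hclust(dist(wins, method = "euclidean"))

# Cutting at height 2 keeps together only players whose totals differ by at most 2.
groups <- cutree(fit, h = 2)

# Spread of win totals inside each group.
spread <- tapply(wins, groups, function(x) max(x) - min(x))
print(groups)
print(spread)
```

In this subset, the player with 18 wins stays alone, the two players with 14 wins form a pair, and the players with 10 to 12 wins share a group.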
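A timeline plot is a good choice to illustrate the sequence of champions’ wins. The plot displays each winner’s name along with the tournament, placed at the tournament’s date, for the first 20 rows of the dataset. Because the dataset records only the year, the helper function below assigns each tournament a fixed approximate date.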
year_to_date_trnm <- function(the_year, the_trnm) {
the_date <- NULL
if (the_trnm == "Australian Open") {
the_date <- (paste(the_year, "-01-31", sep=""))
} else if (the_trnm == "French Open") {
the_date <- (paste(the_year, "-06-15", sep=""))
} else if (the_trnm == "Wimbledon") {
the_date <- (paste(the_year, "-07-15", sep=""))
} else if (the_trnm == "U.S. Open") {
the_date <- (paste(the_year, "-09-07", sep=""))
}
the_date
}
slam_win$YEAR_DATE <- as.Date(mapply(year_to_date_trnm, slam_win$YEAR, slam_win$TOURNAMENT), format="%Y-%m-%d")
tl_rec <- 1:20
timelineS(slam_win[tl_rec, c("WINNER", "YEAR_DATE")], line.color = "red", scale.font = 3,
scale = "month", scale.format = "%Y", label.cex = 0.7, buffer.days = 100,
labels = paste(slam_win[tl_rec, "WINNER"], slam_win[tl_rec, "TOURNAMENT"]))
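As an alternative to the if/else chain combined with mapply(), the same mapping could be sketched with a named lookup vector, which handles whole columns in one call. The helper year_to_date_vec and the vector trn_suffix are hypothetical names introduced here; the month-day values are the same approximations used in the function above.

```r
# Approximate month-day suffix per tournament, same values as in the function above.
trn_suffix <- c("Australian Open" = "-01-31",
                "French Open"     = "-06-15",
                "Wimbledon"       = "-07-15",
                "U.S. Open"       = "-09-07")

# Hypothetical helper: indexing the named vector handles whole columns at once.
year_to_date_vec <- function(the_year, the_trnm) {
  as.Date(paste0(the_year, trn_suffix[the_trnm]), format = "%Y-%m-%d")
}

dates <- year_to_date_vec(c(2017, 2016), c("Australian Open", "U.S. Open"))
print(dates)
```

A tournament name that is missing from the lookup produces NA rather than an error, so it may be worth checking the result with anyNA() before plotting.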
The next step filters the data set down to the champions with more than ten Grand Slam tournament wins. Note that head() prints only the first six rows, so Rod Laver (11 wins) also passes the filter even though he is missing from the table below.
top_winners_gt10 = slam_top_chart %>% filter(NUM_WINS > 10)
kable(head(top_winners_gt10))
| WINNER | NUM_WINS |
|---|---|
| Roger Federer | 18 |
| Pete Sampras | 14 |
| Rafael Nadal | 14 |
| Novak Djokovic | 12 |
| Roy Emerson | 12 |
| Bjorn Borg | 11 |
Furthermore, we group the dataset by tournament and winner and compute the total number of wins per tournament.
slam_win_cnt = inner_join(slam_win, top_winners_gt10) %>% select(TOURNAMENT, WINNER) %>%
group_by(WINNER, TOURNAMENT) %>% summarise(NUM_WINS = n()) %>% arrange(TOURNAMENT, desc(NUM_WINS))
## Warning: Column `WINNER` joining character vector and factor, coercing into
## character vector
kable(slam_win_cnt)
| WINNER | TOURNAMENT | NUM_WINS |
|---|---|---|
| Novak Djokovic | Australian Open | 6 |
| Roy Emerson | Australian Open | 6 |
| Roger Federer | Australian Open | 5 |
| Rod Laver | Australian Open | 3 |
| Pete Sampras | Australian Open | 2 |
| Rafael Nadal | Australian Open | 1 |
| Rafael Nadal | French Open | 9 |
| Bjorn Borg | French Open | 6 |
| Rod Laver | French Open | 2 |
| Roy Emerson | French Open | 2 |
| Novak Djokovic | French Open | 1 |
| Roger Federer | French Open | 1 |
| Pete Sampras | U.S. Open | 5 |
| Roger Federer | U.S. Open | 5 |
| Novak Djokovic | U.S. Open | 2 |
| Rafael Nadal | U.S. Open | 2 |
| Rod Laver | U.S. Open | 2 |
| Roy Emerson | U.S. Open | 2 |
| Pete Sampras | Wimbledon | 7 |
| Roger Federer | Wimbledon | 7 |
| Bjorn Borg | Wimbledon | 5 |
| Rod Laver | Wimbledon | 4 |
| Novak Djokovic | Wimbledon | 3 |
| Rafael Nadal | Wimbledon | 2 |
| Roy Emerson | Wimbledon | 2 |
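The warning printed by the join above arises because WINNER is a character column in slam_win but was turned into a factor in slam_top_chart when the bar plot levels were set. One way to make the key types agree is to convert the factor back to character before joining. The sketch below uses base R merge() as a dependency-free stand-in for inner_join(), with a few toy rows taken from the tables above; the names finals, top and joined are made up for the example.

```r
# Toy stand-ins: WINNER is character on one side and a factor on the other.
finals <- data.frame(WINNER = c("Roger Federer", "Stan Wawrinka", "Rafael Nadal"),
                     TOURNAMENT = c("Australian Open", "U.S. Open", "French Open"),
                     stringsAsFactors = FALSE)
top <- data.frame(WINNER = factor(c("Roger Federer", "Rafael Nadal")),
                  NUM_WINS = c(18, 14))

# Converting the factor back to character gives both key columns the same type.
top$WINNER <- as.character(top$WINNER)

# merge() keeps only the rows whose key appears in both data frames.
joined <- merge(finals, top, by = "WINNER")
print(joined)
```

The same as.character() step could be applied to top_winners_gt10$WINNER before calling inner_join().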
This output is then passed to a chord diagram, which draws a link between each champion and each tournament he won, with the width of the link reflecting the number of wins. It also allows us to see the connections between the players and what they share in common, such as the tournaments where several of them have won.
chordDiagram(slam_win_cnt)
Linking a player’s strengths and weaknesses to specific tournaments calls for a view of that player’s performance in each of the Grand Slam tournaments. A radar plot is recommended to visualize such data. It lets us explore a specific player’s performance and see which tournament he tends to win the most.
champion_radar_plot <- function(df, champion_name) {
slam_win_cnt_chp = df %>% filter(WINNER == champion_name)
chp_num_wins <- slam_win_cnt_chp$NUM_WINS
l <- length(chp_num_wins)
max_v <- 10 # choosing the same maximum value for all champions
chp_df <- data.frame(rbind(max = rep(max_v, l), min = rep(0, l), chp_num_wins))
colnames(chp_df) <- slam_win_cnt_chp$TOURNAMENT
seg_n <- max_v
radarchart(chp_df, axistype = 1, caxislabels = seq(0, max_v, 1), seg = seg_n,
centerzero = TRUE, pcol = rgb(0.2, 0.5, 0.5, 0.9) , pfcol = rgb(0.2, 0.5, 0.5, 0.3),
plwd = 1, cglcol = "grey", cglty = 1, axislabcol = "blue",
vlcex = 0.8, calcex = 0.7, title = champion_name)
}
champion_radar_plot(slam_win_cnt, "Roger Federer")
champion_radar_plot(slam_win_cnt, "Rafael Nadal")
champion_radar_plot(slam_win_cnt, "Novak Djokovic")
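By default, radarchart() from fmsb reads the first two rows of its input as the maximum and minimum of each axis, which is why the function above binds a max row and a min row ahead of the win counts. The sketch below builds that layout for Roger Federer from the per-tournament table, without drawing the plot; federer_wins is a name made up for the example.

```r
# Roger Federer's wins per tournament, taken from the table above.
federer_wins <- c("Australian Open" = 5, "French Open" = 1,
                  "U.S. Open" = 5, "Wimbledon" = 7)
max_v <- 10  # same axis maximum as in champion_radar_plot()

# Row 1 = axis maximum, row 2 = axis minimum, row 3 = the values to draw.
chp_df <- data.frame(rbind(max  = rep(max_v, length(federer_wins)),
                           min  = rep(0, length(federer_wins)),
                           wins = federer_wins))
colnames(chp_df) <- names(federer_wins)

print(chp_df)
```

Using the same max_v for every champion keeps the axes comparable across the three radar plots.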
This extra credit project gave me immense insight into how to use R Markdown and into the different ways to manipulate data in RStudio. This experience was more than fun! The coding was a bit frustrating at times, but it was all the more rewarding to see the final finished product.