Tennis Grand Slam Winners

Motivation

I replicated a blog post by Giorgio Garziano called “Visualizing Tennis Grand Slam Winners Performances”. Working through this article showed me a new world of data visualization. More importantly, it was geared towards one of my significant passions in life: the game of tennis. The post also introduced me to a whole new world of data manipulation, and it intrigued me to learn the many different ways you can manipulate data in R and RStudio.

Blog Summary

This post explores the winner of every major tennis Grand Slam tournament from 1877 through the 2017 Australian Open, the most recent final in the data set. The data set includes the winner and runner-up of every final. The plots, graphs, and diagrams in this post let someone visualize who won which tournament in which year, and which players won a specific tournament multiple times.

Packages

The following packages were used in this R blog replication for data manipulation and visualization.

library(ggplot2) 
library(gplots)
library(RColorBrewer) 
library(dplyr) 
library(knitr) 
library(timelineS) 
library(circlize)
library(fmsb)

Analysis

These chunks of code enable us to load R libraries and import the Tennis Grand Slam Winners dataset.

library_toload <- c("dplyr", "knitr", "ggplot2", "gplots", "RColorBrewer", "timelineS", "circlize", "fmsb")
invisible(lapply(library_toload, function(x) {suppressPackageStartupMessages(library(x, character.only=TRUE))}))

url_file <- "https://datascienceplus.com/wp-content/uploads/2017/04/tennis-grand-slam-winners.txt"
slam_win <- read.delim(url(url_file), sep="\t", stringsAsFactors = FALSE)
kable(head(slam_win, 20))

YEAR	TOURNAMENT	WINNER	RUNNER.UP
2017	Australian Open	Roger Federer	Rafael Nadal
2016	U.S. Open	Stan Wawrinka	Novak Djokovic
2016	Wimbledon	Andy Murray	Milos Raonic
2016	French Open	Novak Djokovic	Andy Murray
2016	Australian Open	Novak Djokovic	Andy Murray
2015	U.S. Open	Novak Djokovic	Roger Federer
2015	Wimbledon	Novak Djokovic	Roger Federer
2015	French Open	Stan Wawrinka	Novak Djokovic
2015	Australian Open	Novak Djokovic	Andy Murray
2014	U.S. Open	Marin Cilic	Kei Nishikori
2014	Wimbledon	Novak Djokovic	Roger Federer
2014	French Open	Rafael Nadal	Novak Djokovic
2014	Australian Open	Stan Wawrinka	Rafael Nadal
2013	U.S. Open	Rafael Nadal	Novak Djokovic
2013	Wimbledon	Andy Murray	Novak Djokovic
2013	French Open	Rafael Nadal	David Ferrer
2013	Australian Open	Novak Djokovic	Andy Murray
2012	U.S. Open	Andy Murray	Novak Djokovic
2012	Wimbledon	Roger Federer	Andy Murray
2012	French Open	Rafael Nadal	Novak Djokovic

The author made an adjustment to the tournament column, which was needed so that every Australian Open entry uses the same name.

slam_win[grep("Australian Open", slam_win$TOURNAMENT), "TOURNAMENT"] = "Australian Open"
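To see what this assignment does, here is a minimal sketch on a toy data frame. The variant labels "Australian Open (Jan)" and "Australian Open (Dec)" are assumptions made up for illustration, not values confirmed from the data file.

```r
# Toy data: the variant tournament labels below are hypothetical.
toy <- data.frame(
  YEAR = c(1977, 1977, 1978),
  TOURNAMENT = c("Australian Open (Jan)", "Australian Open (Dec)", "Wimbledon"),
  stringsAsFactors = FALSE
)

# grep() returns the indices of the rows whose label contains the pattern;
# fixed = TRUE matches the text literally instead of as a regular expression.
rows <- grep("Australian Open", toy$TOURNAMENT, fixed = TRUE)

# Overwrite every matching label with one common name.
toy[rows, "TOURNAMENT"] <- "Australian Open"

unique(toy$TOURNAMENT)
```

Rows that do not match the pattern, such as Wimbledon here, are left untouched.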

Barplot

To build a publication-ready table that pairs each champion’s name with his number of Tennis Grand Slam wins, the first step in this replication is to group the rows by winner and count the wins.

slam_top_chart = slam_win %>% group_by(WINNER) %>% summarise(NUM_WINS=n()) %>% arrange(desc(NUM_WINS))
kable(head(slam_top_chart, 40))

WINNER	NUM_WINS
Roger Federer	18
Pete Sampras	14
Rafael Nadal	14
Novak Djokovic	12
Roy Emerson	12
Bjorn Borg	11
Rod Laver	11
William T. Tilden	10
Andre Agassi	8
Fred Perry	8
Henri Cochet	8
Ivan Lendl	8
Jimmy Connors	8
Ken Rosewall	8
Max Decugis	8
William A. Larned	8
John McEnroe	7
John Newcombe	7
Mats Wilander	7
Rene Lacoste	7
Richard D. Sears	7
William Renshaw	7
Boris Becker	6
Donald Budge	6
Stefan Edberg	6
Frank Sedgman	5
Jack Crawford	5
Jean Borotra	5
Laurie Doherty	5
Tony Trabert	5
Andre Vacherot	4
Anthony Wilding	4
Ashley J. Cooper	4
Frank Parker	4
Guillermo Vilas	4
Jim Courier	4
Lewis Hoad	4
Manuel Santana	4
Pat O’Hara Wood	4
Paul Ayme	4
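The group-and-count step above can be cross-checked in base R. This is only a sketch on made-up rows (the player names are placeholders), showing that counting rows per winner and sorting in decreasing order gives a table of the same shape as slam_top_chart.

```r
# Toy data with placeholder names; only the WINNER column matters here.
toy <- data.frame(
  WINNER = c("Player A", "Player B", "Player A", "Player C", "Player A", "Player B"),
  stringsAsFactors = FALSE
)

# table() counts the rows per winner; sort() puts the largest counts first.
counts <- sort(table(toy$WINNER), decreasing = TRUE)

# Convert to a two-column data frame shaped like slam_top_chart.
toy_chart <- data.frame(
  WINNER = names(counts),
  NUM_WINS = as.integer(counts),
  stringsAsFactors = FALSE
)
toy_chart
```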

A barplot is a good way to visualize such data. Here it gives a way to compare the champions by their total number of Grand Slam wins. The barplot shown only includes the champions who have won at least four Tennis Grand Slam tournaments.

slam_top_chart$WINNER <- factor(slam_top_chart$WINNER, levels = slam_top_chart$WINNER[order(slam_top_chart$NUM_WINS)])
top_winners_gt4 = slam_top_chart %>% filter(NUM_WINS >= 4)
the_colours = c("#FF4000FF", "#FF8000FF", "#FFFF00FF", "#80FF00FF",
                "#00FF00FF", "#00FF80FF", "#00FFFFFF", "#0080FFFF",
                "#FF00FFFF", "#000000FF", "#0000FFFF")
ggplot(data=top_winners_gt4, aes(x=WINNER, y=NUM_WINS, fill=NUM_WINS)) +
  geom_bar(stat='identity') + coord_flip() + guides(fill="none") +  # "none" hides the fill legend
  scale_fill_gradientn(colours = the_colours)
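The factor() call at the start of the chunk above is what controls the order of the bars. As a small sketch using three win totals from the table, this shows the difference between the default alphabetical levels and levels ordered by number of wins.

```r
# Three champions and their totals, taken from the table above.
chart <- data.frame(
  WINNER = c("Roy Emerson", "Andre Agassi", "Roger Federer"),
  NUM_WINS = c(12, 8, 18),
  stringsAsFactors = FALSE
)

# Without explicit levels, factor() sorts the names alphabetically.
default_levels <- levels(factor(chart$WINNER))
default_levels

# Ordering the levels by NUM_WINS makes the axis follow the win counts;
# after coord_flip() the last level is drawn at the top of the plot.
ordered_winner <- factor(chart$WINNER, levels = chart$WINNER[order(chart$NUM_WINS)])
levels(ordered_winner)
```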

To compare the champions’ performance at each specific Grand Slam, it is easier to group the data by tournament and winner.

slam_top_chart_by_trn = slam_win %>% filter(WINNER %in% top_winners_gt4$WINNER) %>% group_by(TOURNAMENT, WINNER) %>% summarise(NUM_WINS=n()) %>% arrange(desc(NUM_WINS))
slam_top_chart_by_trn$NUM_WINS <- factor(slam_top_chart_by_trn$NUM_WINS)
kable(head(slam_top_chart_by_trn, 10))

TOURNAMENT	WINNER	NUM_WINS
French Open	Rafael Nadal	9
French Open	Max Decugis	8
U.S. Open	William A. Larned	8
U.S. Open	Richard D. Sears	7
U.S. Open	William T. Tilden	7
Wimbledon	Pete Sampras	7
Wimbledon	Roger Federer	7
Wimbledon	William Renshaw	7
Australian Open	Novak Djokovic	6
Australian Open	Roy Emerson	6

The next barplot visualizes this data, showing how many times each player has won a specific Grand Slam tournament.

ggplot(data=slam_top_chart_by_trn, aes(x=WINNER, y=NUM_WINS, fill=NUM_WINS)) +
  geom_bar(stat='identity') + coord_flip() + guides(fill="none") + scale_y_discrete() +
  facet_grid(. ~ TOURNAMENT)  # one panel per tournament

Heatmap

A heatmap is essentially a table that uses color to represent values. Heatmaps make it easy to pick out high and low values, identify patterns, and locate outliers. Here we use a heatmap to highlight how many times each pair of players met in a Grand Slam final. The matrix below reports the count of each pair’s matchups, based on the first 50 rows of the data set (the most recent finals).

tl_rec <- 1:50
winner_runnerup <- slam_win[tl_rec, c("WINNER", "RUNNER.UP")]
winner_runnerup_names <- unique(c(winner_runnerup[,1], winner_runnerup[,2]))

match_matrix <- matrix(0, nrow=length(winner_runnerup_names),
                       ncol=length(winner_runnerup_names))
colnames(match_matrix) <- winner_runnerup_names
rownames(match_matrix) <- winner_runnerup_names

for(i in 1:nrow(winner_runnerup)) {
  winner <- winner_runnerup[i, "WINNER"]
  runner_up <- winner_runnerup[i, "RUNNER.UP"]
  r <- which(rownames(match_matrix) == winner)
  c <- which(colnames(match_matrix) == runner_up)
  
  match_matrix[r,c] <- match_matrix[r,c] + 1
  match_matrix[c,r] <- match_matrix[c,r] + 1
}

diag(match_matrix) <- NA

my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
col_breaks = c(seq(0, 0.99, length=100),  # for green
               seq(1, 5, length=100),     # for yellow
               seq(5.01, 10, length=100)) # for red
heatmap.2(match_matrix,
          cellnote = match_matrix,  # same data set for cell labels
          main = "Tennis Grand Slam Champions - Finals Match Heatmap", # heat map title
          notecol = "black",      # change font color of cell labels to black
          density.info = "none",  # turns off density plot inside color legend
          trace = "none",         # turns off trace lines inside the heat map
          margins = c(12,9),     # widens margins around plot
          col= my_palette,       # use the color palette defined earlier
          breaks = col_breaks,    # enable color transition at specified limits
          dendrogram = "none",     # draw no dendrograms
          Colv = "NA")
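The pairwise counting done by the loop above can also be sketched with table(). This is an alternative formulation, not the original author's method. It uses four finals copied from the 2015 and 2016 rows of the table shown earlier.

```r
# Four finals taken from the 2015-2016 rows of the data shown earlier.
finals <- data.frame(
  WINNER    = c("Novak Djokovic", "Novak Djokovic", "Stan Wawrinka", "Novak Djokovic"),
  RUNNER.UP = c("Andy Murray", "Roger Federer", "Novak Djokovic", "Andy Murray"),
  stringsAsFactors = FALSE
)

players <- unique(c(finals$WINNER, finals$RUNNER.UP))

# Using the same factor levels for both columns gives a square table.
wins <- table(
  factor(finals$WINNER, levels = players),
  factor(finals$RUNNER.UP, levels = players)
)

# Adding the transpose counts each final for both players,
# which mirrors the two increments inside the loop.
meetings <- unclass(wins + t(wins))
diag(meetings) <- NA

meetings["Novak Djokovic", "Andy Murray"]
```

The resulting matrix is symmetric, so either orientation of a pair gives the same count.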

Dendrogram plot

A dendrogram is a tree diagram that displays the groups formed by hierarchical clustering. The following table lists the top 20 champions, ordered from most to fewest wins. The champions are clustered by their number of wins, and the groups are obtained by cutting the tree at a height of two wins.

ch_n <- 1:20
kable(slam_top_chart[ch_n,])

WINNER	NUM_WINS
Roger Federer	18
Pete Sampras	14
Rafael Nadal	14
Novak Djokovic	12
Roy Emerson	12
Bjorn Borg	11
Rod Laver	11
William T. Tilden	10
Andre Agassi	8
Fred Perry	8
Henri Cochet	8
Ivan Lendl	8
Jimmy Connors	8
Ken Rosewall	8
Max Decugis	8
William A. Larned	8
John McEnroe	7
John Newcombe	7
Mats Wilander	7
Rene Lacoste	7

wins <- slam_top_chart[ch_n, -1]
d_wins <- dist(wins, method = "euclidean")
hclust_fit <- hclust(d_wins)
h_value <- 2
groups <- cutree(hclust_fit, h = h_value)
plot(hclust_fit, labels = slam_top_chart$WINNER[ch_n], main = "Champions Dendrogram")
rect.hclust(hclust_fit, h = h_value, border = "blue")
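To make the effect of cutting at height 2 concrete, here is a small sketch on six win totals from the table above. The shortened player labels are only for readability.

```r
# Win totals for six champions, taken from the table above.
wins <- c(Federer = 18, Sampras = 14, Nadal = 14, Djokovic = 12, Borg = 11, Agassi = 8)

# hclust() defaults to complete linkage, so the height of a cluster is the
# largest pairwise distance between its members.
fit <- hclust(dist(wins))

# After cutting at height 2, every cluster contains only players
# whose totals differ by at most 2.
groups <- cutree(fit, h = 2)
split(names(wins), groups)
```

Note that two players whose totals differ by exactly 2 are not guaranteed to share a group: with complete linkage, every member of the merged cluster has to be within the cut height.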

Timeline plot

The timeline plot is a good choice for illustrating the champions’ win sequence. The plot displays each winner’s name along with the tournament, placed at a nominal date for that tournament in the given year.

year_to_date_trnm <- function(the_year, the_trnm) {
  the_date <- NULL
  if (the_trnm == "Australian Open") {
    the_date <- (paste(the_year, "-01-31", sep=""))
  } else if (the_trnm == "French Open") {
    the_date <- (paste(the_year, "-06-15", sep=""))
  } else if (the_trnm == "Wimbledon") {
    the_date <- (paste(the_year, "-07-15", sep=""))
  } else if (the_trnm == "U.S. Open") {
    the_date <- (paste(the_year, "-09-07", sep=""))
  }
  the_date
}

slam_win$YEAR_DATE <- as.Date(mapply(year_to_date_trnm, slam_win$YEAR, slam_win$TOURNAMENT), format="%Y-%m-%d")
tl_rec <- 1:20
timelineS(slam_win[tl_rec, c("WINNER", "YEAR_DATE")], line.color = "red", scale.font = 3,
          scale = "month", scale.format = "%Y", label.cex = 0.7, buffer.days = 100,
          labels = paste(slam_win[tl_rec, "WINNER"], slam_win[tl_rec, "TOURNAMENT"]))
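As a possible alternative to the if/else chain and mapply(), the date lookup can be sketched with a named vector. The helper name year_to_date_vec is hypothetical. The month-day suffixes are copied from year_to_date_trnm() above.

```r
# Nominal month-day suffix per tournament, copied from year_to_date_trnm().
trnm_suffix <- c(
  "Australian Open" = "-01-31",
  "French Open"     = "-06-15",
  "Wimbledon"       = "-07-15",
  "U.S. Open"       = "-09-07"
)

# Hypothetical helper: indexing the named vector by tournament name
# handles whole columns at once, so mapply() is not needed.
year_to_date_vec <- function(the_year, the_trnm) {
  as.Date(paste0(the_year, trnm_suffix[the_trnm]))
}

dates <- year_to_date_vec(c(2017, 2016), c("Australian Open", "U.S. Open"))
dates
```

This sketch assumes every tournament name is one of the four listed. An unrecognized name would need extra handling.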

Chord Diagram

The next step is to filter the data set down to the champions with more than ten Grand Slam tournament wins.

top_winners_gt10 = slam_top_chart %>% filter(NUM_WINS > 10)
kable(head(top_winners_gt10))

WINNER	NUM_WINS
Roger Federer	18
Pete Sampras	14
Rafael Nadal	14
Novak Djokovic	12
Roy Emerson	12
Bjorn Borg	11

Furthermore, we group the data set by winner and tournament and compute the total number of wins per tournament.

slam_win_cnt = inner_join(slam_win, top_winners_gt10) %>% select(TOURNAMENT, WINNER) %>%
  group_by(WINNER, TOURNAMENT) %>% summarise(NUM_WINS = n()) %>% arrange(TOURNAMENT, desc(NUM_WINS))

## Warning: Column `WINNER` joining character vector and factor, coercing into
## character vector

kable(slam_win_cnt)

WINNER	TOURNAMENT	NUM_WINS
Novak Djokovic	Australian Open	6
Roy Emerson	Australian Open	6
Roger Federer	Australian Open	5
Rod Laver	Australian Open	3
Pete Sampras	Australian Open	2
Rafael Nadal	Australian Open	1
Rafael Nadal	French Open	9
Bjorn Borg	French Open	6
Rod Laver	French Open	2
Roy Emerson	French Open	2
Novak Djokovic	French Open	1
Roger Federer	French Open	1
Pete Sampras	U.S. Open	5
Roger Federer	U.S. Open	5
Novak Djokovic	U.S. Open	2
Rafael Nadal	U.S. Open	2
Rod Laver	U.S. Open	2
Roy Emerson	U.S. Open	2
Pete Sampras	Wimbledon	7
Roger Federer	Wimbledon	7
Bjorn Borg	Wimbledon	5
Rod Laver	Wimbledon	4
Novak Djokovic	Wimbledon	3
Rafael Nadal	Wimbledon	2
Roy Emerson	Wimbledon	2

This output is then passed to a chord diagram, which links each player to the tournaments he has won. It lets us see the relationships between players and tournaments, and which tournaments the players have in common.

chordDiagram(slam_win_cnt)
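One way to read the chord diagram is as a player-by-tournament matrix, where each nonzero cell is one link. The sketch below builds that matrix in base R for two champions, with counts copied from the table above. It does not call chordDiagram() itself.

```r
# Per-tournament wins for two champions, copied from the table above.
cnt <- data.frame(
  WINNER = rep(c("Roger Federer", "Rafael Nadal"), each = 4),
  TOURNAMENT = rep(c("Australian Open", "French Open", "U.S. Open", "Wimbledon"), times = 2),
  NUM_WINS = c(5, 1, 5, 7, 1, 9, 2, 2),
  stringsAsFactors = FALSE
)

# xtabs() pivots the long table into a player-by-tournament matrix;
# each nonzero cell corresponds to one link in the diagram.
link_matrix <- xtabs(NUM_WINS ~ WINNER + TOURNAMENT, data = cnt)
link_matrix

# Row sums give the total number of wins attached to each player.
rowSums(link_matrix)
```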

Radar Plot

Comparing a player’s results across the four tournaments shows where his strengths and weaknesses lie. A radar plot is well suited to visualizing such data: it shows a specific player’s performance in each of the Tennis Grand Slam tournaments and which tournament he tends to win the most.

Roger Federer

champion_radar_plot <- function(df, champion_name) {
  slam_win_cnt_chp = df %>% filter(WINNER == champion_name)
  chp_num_wins <- slam_win_cnt_chp$NUM_WINS
  l <- length(chp_num_wins)
  max_v <- 10 # choosing the same maximum value for all champions
  chp_df <- data.frame(rbind(max = rep(max_v, l), min = rep(0, l), chp_num_wins))
  colnames(chp_df) <- slam_win_cnt_chp$TOURNAMENT
  seg_n <- max_v
  radarchart(chp_df, axistype = 1, caxislabels = seq(0, max_v, 1), seg = seg_n,
             centerzero = TRUE, pcol = rgb(0.2, 0.5, 0.5, 0.9) , pfcol = rgb(0.2, 0.5, 0.5, 0.3),
             plwd = 1, cglcol = "grey", cglty = 1, axislabcol = "blue",
             vlcex = 0.8, calcex = 0.7, title = champion_name)
  
}

champion_radar_plot(slam_win_cnt, "Roger Federer")
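The data frame that champion_radar_plot() hands to radarchart() has a specific layout: the first row holds the axis maximum, the second row the minimum, and the remaining rows the data. Here is a sketch of that layout for Roger Federer, with counts copied from the table above and without drawing the plot.

```r
# Roger Federer's per-tournament wins, copied from the table above.
num_wins <- c("Australian Open" = 5, "French Open" = 1, "U.S. Open" = 5, "Wimbledon" = 7)
max_v <- 10  # same axis maximum as in champion_radar_plot()
l <- length(num_wins)

# Row 1 holds the axis maximum, row 2 the minimum, row 3 the data.
chp_df <- as.data.frame(
  rbind(max = rep(max_v, l), min = rep(0, l), wins = unname(num_wins))
)
colnames(chp_df) <- names(num_wins)
chp_df
```

Using the same max_v for every champion keeps the radar plots on a common scale, so the shapes can be compared directly.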

Rafael Nadal

champion_radar_plot(slam_win_cnt, "Rafael Nadal")

Novak Djokovic

champion_radar_plot(slam_win_cnt, "Novak Djokovic")

Project Conclusion

This extra credit project gave me a great deal of insight into how to use R Markdown and into the different ways to manipulate data in RStudio. This experience was more than fun! At times the coding was a bit frustrating, but it was nonetheless rewarding to see the final finished product.