Motivation

I replicated a blog post by Giorgio Garziano called “Visualizing Tennis Grand Slam Winners Performances”. Working through this article showed me a new world of data visualization. More importantly, it was geared towards one of my significant passions in life: the game of tennis. The post also introduced me to a whole new world of data manipulation, and it intrigued me enough to learn many of the different ways you can manipulate data in R and RStudio.

Blog Summary

This post explores the winner of every major tennis Grand Slam from 1877 to the present day. The data set includes every Grand Slam winner and runner-up. The plots, graphs, and diagrams in this post let someone visualize who won which tournament in what year, and which players won a specific tournament multiple times.

Packages

The following packages were used in this R blog replication for data manipulation and visualization.

library(ggplot2) 
library(gplots)
library(RColorBrewer) 
library(dplyr) 
library(knitr) 
library(timelineS) 
library(circlize)
library(fmsb)

Analysis

These chunks of code enable us to load R libraries and import the Tennis Grand Slam Winners dataset.

library_toload <- c("dplyr", "knitr", "ggplot2", "gplots", "RColorBrewer", "timelineS", "circlize", "fmsb")
invisible(lapply(library_toload, function(x) {suppressPackageStartupMessages(library(x, character.only=TRUE))}))

url_file <- "https://datascienceplus.com/wp-content/uploads/2017/04/tennis-grand-slam-winners.txt"
slam_win <- read.delim(url(url_file), sep="\t", stringsAsFactors = FALSE)
kable(head(slam_win, 20))
| YEAR | TOURNAMENT | WINNER | RUNNER.UP |
|------|------------|--------|-----------|
| 2017 | Australian Open | Roger Federer | Rafael Nadal |
| 2016 | U.S. Open | Stan Wawrinka | Novak Djokovic |
| 2016 | Wimbledon | Andy Murray | Milos Raonic |
| 2016 | French Open | Novak Djokovic | Andy Murray |
| 2016 | Australian Open | Novak Djokovic | Andy Murray |
| 2015 | U.S. Open | Novak Djokovic | Roger Federer |
| 2015 | Wimbledon | Novak Djokovic | Roger Federer |
| 2015 | French Open | Stan Wawrinka | Novak Djokovic |
| 2015 | Australian Open | Novak Djokovic | Andy Murray |
| 2014 | U.S. Open | Marin Cilic | Kei Nishikori |
| 2014 | Wimbledon | Novak Djokovic | Roger Federer |
| 2014 | French Open | Rafael Nadal | Novak Djokovic |
| 2014 | Australian Open | Stan Wawrinka | Rafael Nadal |
| 2013 | U.S. Open | Rafael Nadal | Novak Djokovic |
| 2013 | Wimbledon | Andy Murray | Novak Djokovic |
| 2013 | French Open | Rafael Nadal | David Ferrer |
| 2013 | Australian Open | Novak Djokovic | Andy Murray |
| 2012 | U.S. Open | Andy Murray | Novak Djokovic |
| 2012 | Wimbledon | Roger Federer | Andy Murray |
| 2012 | French Open | Rafael Nadal | Novak Djokovic |

The author made an adjustment to the TOURNAMENT column, which was needed so that every Australian Open entry uses the same name.

slam_win[grep("Australian Open", slam_win$TOURNAMENT), "TOURNAMENT"] = "Australian Open"
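To see what that one-liner does, here is a minimal sketch on a toy data frame. The variant labels "Australian Open (Jan)" and "Australian Open (Dec)" are my own assumption for illustration; I did not check which spellings the real file actually contains.

```r
# Toy data with hypothetical variant spellings of the same tournament
toy <- data.frame(
  YEAR = c(1977, 1977, 1976),
  TOURNAMENT = c("Australian Open (Jan)", "Australian Open (Dec)", "Wimbledon"),
  stringsAsFactors = FALSE
)

# grep() returns the indices of the rows whose TOURNAMENT contains the pattern
rows <- grep("Australian Open", toy$TOURNAMENT)

# Overwrite only those rows with the single canonical name
toy[rows, "TOURNAMENT"] <- "Australian Open"

unique(toy$TOURNAMENT)
```

Rows that do not match the pattern, such as Wimbledon here, are left untouched.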

Barplot

To compute a publication-ready table in which each champion’s name is associated with his number of Tennis Grand Slam wins, the first step of this replication is to group the rows by winner and count the wins in each group.

slam_top_chart = slam_win %>% group_by(WINNER) %>% summarise(NUM_WINS=n()) %>% arrange(desc(NUM_WINS))
kable(head(slam_top_chart, 40))
| WINNER | NUM_WINS |
|--------|----------|
| Roger Federer | 18 |
| Pete Sampras | 14 |
| Rafael Nadal | 14 |
| Novak Djokovic | 12 |
| Roy Emerson | 12 |
| Bjorn Borg | 11 |
| Rod Laver | 11 |
| William T. Tilden | 10 |
| Andre Agassi | 8 |
| Fred Perry | 8 |
| Henri Cochet | 8 |
| Ivan Lendl | 8 |
| Jimmy Connors | 8 |
| Ken Rosewall | 8 |
| Max Decugis | 8 |
| William A. Larned | 8 |
| John McEnroe | 7 |
| John Newcombe | 7 |
| Mats Wilander | 7 |
| Rene Lacoste | 7 |
| Richard D. Sears | 7 |
| William Renshaw | 7 |
| Boris Becker | 6 |
| Donald Budge | 6 |
| Stefan Edberg | 6 |
| Frank Sedgman | 5 |
| Jack Crawford | 5 |
| Jean Borotra | 5 |
| Laurie Doherty | 5 |
| Tony Trabert | 5 |
| Andre Vacherot | 4 |
| Anthony Wilding | 4 |
| Ashley J. Cooper | 4 |
| Frank Parker | 4 |
| Guillermo Vilas | 4 |
| Jim Courier | 4 |
| Lewis Hoad | 4 |
| Manuel Santana | 4 |
| Pat O’Hara Wood | 4 |
| Paul Ayme | 4 |
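The group-and-count step can also be sketched in base R, which helped me understand what the dplyr pipeline is doing. This is only an illustration on made-up player names, not the code from the original post.

```r
# Toy vector of winners (hypothetical names), one entry per final
toy_winners <- c("Player A", "Player B", "Player A",
                 "Player C", "Player A", "Player B")

# table() counts the occurrences of each name; sort() puts the largest first
win_counts <- sort(table(toy_winners), decreasing = TRUE)

# Rebuild a two-column data frame shaped like slam_top_chart
toy_chart <- data.frame(
  WINNER = names(win_counts),
  NUM_WINS = as.integer(win_counts),
  stringsAsFactors = FALSE
)

toy_chart
```

The dplyr version reads more naturally once more grouping columns are involved, which is probably why the author chose it.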

A barplot is a good way to visualize such data. Here it offers a way to compare each player’s overall performance across the Grand Slam tournaments. The barplot shown only includes the champions who have won at least four Tennis Grand Slam tournaments.

slam_top_chart$WINNER <- factor(slam_top_chart$WINNER, levels = slam_top_chart$WINNER[order(slam_top_chart$NUM_WINS)])
top_winners_gt4 = slam_top_chart %>% filter(NUM_WINS >= 4)
the_colours = c("#FF4000FF", "#FF8000FF", "#FFFF00FF", "#80FF00FF",
                "#00FF00FF", "#00FF80FF", "#00FFFFFF", "#0080FFFF",
                "#FF00FFFF", "#000000FF", "#0000FFFF")
ggplot(data=top_winners_gt4, aes(x=WINNER, y=NUM_WINS, fill=NUM_WINS)) +
  geom_bar(stat='identity') + coord_flip() + guides(fill="none") +
  scale_fill_gradientn(colours = the_colours)
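The factor() call at the start of that chunk is what controls the order of the bars: ggplot2 places the values of a discrete axis in factor level order rather than in row order. A small sketch with made-up names shows how the levels end up sorted by number of wins.

```r
# Toy chart (hypothetical names and counts), deliberately not sorted
toy_chart <- data.frame(
  WINNER = c("Player A", "Player B", "Player C"),
  NUM_WINS = c(3, 7, 5),
  stringsAsFactors = FALSE
)

# order() gives the row positions from fewest to most wins;
# using the names in that order as levels fixes the plotting order
toy_chart$WINNER <- factor(
  toy_chart$WINNER,
  levels = toy_chart$WINNER[order(toy_chart$NUM_WINS)]
)

levels(toy_chart$WINNER)
```

With coord_flip() the first level is drawn at the bottom, so the player with the most wins ends up at the top of the chart.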

To compare the champions’ performance at each specific Grand Slam, it is easier to group the data by tournament and winner.

slam_top_chart_by_trn = slam_win %>% filter(WINNER %in% top_winners_gt4$WINNER) %>% group_by(TOURNAMENT, WINNER) %>% summarise(NUM_WINS=n()) %>% arrange(desc(NUM_WINS))
slam_top_chart_by_trn$NUM_WINS <- factor(slam_top_chart_by_trn$NUM_WINS)
kable(head(slam_top_chart_by_trn, 10))
| TOURNAMENT | WINNER | NUM_WINS |
|------------|--------|----------|
| French Open | Rafael Nadal | 9 |
| French Open | Max Decugis | 8 |
| U.S. Open | William A. Larned | 8 |
| U.S. Open | Richard D. Sears | 7 |
| U.S. Open | William T. Tilden | 7 |
| Wimbledon | Pete Sampras | 7 |
| Wimbledon | Roger Federer | 7 |
| Wimbledon | William Renshaw | 7 |
| Australian Open | Novak Djokovic | 6 |
| Australian Open | Roy Emerson | 6 |

The next barplot visualizes this data, showing how many times each player won each specific Grand Slam tournament.

ggplot(data=slam_top_chart_by_trn, aes(x=WINNER, y=NUM_WINS, fill=NUM_WINS)) +
  geom_bar(stat='identity') + coord_flip() + guides(fill="none") + scale_y_discrete() +
  facet_grid(. ~ TOURNAMENT)

Heatmap

A heatmap is essentially a table that uses colors to represent data. Heatmaps are extremely useful for picking out specific data points, such as high and low values, and they help in identifying patterns and locating outliers. Here we use a heatmap to highlight how many times the champions met each other in Grand Slam tournament finals, looking at the first 50 records of the data set. The matrix below reports the count of each pair’s match-ups.

tl_rec <- 1:50
winner_runnerup <- slam_win[tl_rec, c("WINNER", "RUNNER.UP")]
winner_runnerup_names <- unique(c(winner_runnerup[,1], winner_runnerup[,2]))

match_matrix <- matrix(0, nrow=length(winner_runnerup_names),
                       ncol=length(winner_runnerup_names))
colnames(match_matrix) <- winner_runnerup_names
rownames(match_matrix) <- winner_runnerup_names

for(i in 1:nrow(winner_runnerup)) {
  winner <- winner_runnerup[i, "WINNER"]
  runner_up <- winner_runnerup[i, "RUNNER.UP"]
  r <- which(rownames(match_matrix) == winner)
  c <- which(colnames(match_matrix) == runner_up)
  
  match_matrix[r,c] <- match_matrix[r,c] + 1
  match_matrix[c,r] <- match_matrix[c,r] + 1
}

diag(match_matrix) <- NA
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
col_breaks = c(seq(0, 0.99, length=100),  # for green
               seq(1, 5, length=100),     # for yellow
               seq(5.01, 10, length=100)) # for red
heatmap.2(match_matrix,
          cellnote = match_matrix,  # same data set for cell labels
          main = "Tennis Grand Slam Champions - Finals Match Heatmap", # heat map title
          notecol = "black",      # change font color of cell labels to black
          density.info = "none",  # turns off density plot inside color legend
          trace = "none",         # turns off trace lines inside the heat map
          margins = c(12,9),     # widens margins around plot
          col= my_palette,       # use the color palette defined earlier
          breaks = col_breaks,    # enable color transition at specified limits
          dendrogram = "none",     # do not draw any dendrogram
          Colv = NA)               # do not reorder the columns
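The loop that fills match_matrix is the core of this section, so here is a minimal sketch of the same counting idea on four made-up finals. The player names "A", "B", and "C" are placeholders, and I index the matrix by name instead of using which(), which is just my own shortcut.

```r
# Toy finals (hypothetical players): A and B met twice, A and C met twice
finals <- data.frame(
  WINNER = c("A", "B", "A", "C"),
  RUNNER.UP = c("B", "A", "C", "A"),
  stringsAsFactors = FALSE
)

players <- unique(c(finals$WINNER, finals$RUNNER.UP))

# Square matrix of zeros with the players as both row and column names
m <- matrix(0, nrow = length(players), ncol = length(players),
            dimnames = list(players, players))

for (i in seq_len(nrow(finals))) {
  w <- finals$WINNER[i]
  r <- finals$RUNNER.UP[i]
  # Count the match-up in both directions so the matrix stays symmetric
  m[w, r] <- m[w, r] + 1
  m[r, w] <- m[r, w] + 1
}

m
```

Because each final is counted in both directions, a cell tells you how often two players met in a final, regardless of who won.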

Dendrogram plot

A dendrogram is essentially a tree diagram that displays the groups formed by hierarchical clustering. The following table displays the top 20 champions, ordered from the greatest to the least number of wins. The champions are clustered on their number of wins, and the groups are determined by cutting the tree at a height of two wins.

ch_n <- 1:20
kable(slam_top_chart[ch_n,])
| WINNER | NUM_WINS |
|--------|----------|
| Roger Federer | 18 |
| Pete Sampras | 14 |
| Rafael Nadal | 14 |
| Novak Djokovic | 12 |
| Roy Emerson | 12 |
| Bjorn Borg | 11 |
| Rod Laver | 11 |
| William T. Tilden | 10 |
| Andre Agassi | 8 |
| Fred Perry | 8 |
| Henri Cochet | 8 |
| Ivan Lendl | 8 |
| Jimmy Connors | 8 |
| Ken Rosewall | 8 |
| Max Decugis | 8 |
| William A. Larned | 8 |
| John McEnroe | 7 |
| John Newcombe | 7 |
| Mats Wilander | 7 |
| Rene Lacoste | 7 |

wins <- slam_top_chart[ch_n, -1]
d_wins <- dist(wins, method = "euclidean")
hclust_fit <- hclust(d_wins)
h_value <- 2
groups <- cutree(hclust_fit, h = h_value)
plot(hclust_fit, labels = slam_top_chart$WINNER[ch_n], main = "Champions Dendrogram")
rect.hclust(hclust_fit, h = h_value, border = "blue") 
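To get a feel for what cutting the tree at a height of two does, I tried the same calls on a short toy vector of win counts. The values are picked by me to keep the example free of ties. With complete linkage, which is the hclust() default, a cut at h = 2 keeps together only values that differ by at most two.

```r
# Toy win counts (chosen for illustration, not the full top 20)
wins_toy <- c(18, 14, 14, 11, 11, 10, 8, 8)

# Euclidean distances between the counts, then complete-linkage clustering
fit <- hclust(dist(wins_toy, method = "euclidean"))

# Cut the tree at height 2: every merge above that height is undone
groups_toy <- cutree(fit, h = 2)

# Show which win counts ended up in the same group
split(wins_toy, groups_toy)
```

Here 11 and 10 share a group because they are one win apart, while 10 and 8 stay separate because joining them would put 11 and 8 in the same group, three wins apart.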

Timeline plot

The timeline plot is a good choice for illustrating the champions’ win sequence. The plot displays each winner’s name along with the tournament, placed at the date assigned to that tournament.

year_to_date_trnm <- function(the_year, the_trnm) {
  the_date <- NULL
  if (the_trnm == "Australian Open") {
    the_date <- (paste(the_year, "-01-31", sep=""))
  } else if (the_trnm == "French Open") {
    the_date <- (paste(the_year, "-06-15", sep=""))
  } else if (the_trnm == "Wimbledon") {
    the_date <- (paste(the_year, "-07-15", sep=""))
  } else if (the_trnm == "U.S. Open") {
    the_date <- (paste(the_year, "-09-07", sep=""))
  }
  the_date
}

slam_win$YEAR_DATE <- as.Date(mapply(year_to_date_trnm, slam_win$YEAR, slam_win$TOURNAMENT), format="%Y-%m-%d")
tl_rec <- 1:20
timelineS(slam_win[tl_rec, c("WINNER", "YEAR_DATE")], line.color = "red", scale.font = 3,
          scale = "month", scale.format = "%Y", label.cex = 0.7, buffer.days = 100,
          labels = paste(slam_win[tl_rec, "WINNER"], slam_win[tl_rec, "TOURNAMENT"]))
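While replicating this, I noticed the if/else chain could also be written as a named lookup vector. The sketch below is my own alternative, not the author's code; the name year_to_date_vec is made up, and the month-day values are the same approximate dates used in the function above.

```r
# Month-day suffix for each tournament, copied from year_to_date_trnm
trnm_day <- c("Australian Open" = "-01-31",
              "French Open"     = "-06-15",
              "Wimbledon"       = "-07-15",
              "U.S. Open"       = "-09-07")

# Vectorized helper (hypothetical name): looks up each suffix by
# tournament name and pastes it onto the year
year_to_date_vec <- function(years, trnms) {
  as.Date(paste0(years, unname(trnm_day[trnms])), format = "%Y-%m-%d")
}

year_to_date_vec(c(2016, 2017), c("Wimbledon", "Australian Open"))
```

Since the lookup works on whole vectors, this version does not need mapply(). A tournament name that is missing from the lookup comes back as NA.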

Chord Diagram

The next step is to filter the data set down to the champions with more than ten Grand Slam tournament wins.

top_winners_gt10 = slam_top_chart %>% filter(NUM_WINS > 10)
kable(head(top_winners_gt10))
| WINNER | NUM_WINS |
|--------|----------|
| Roger Federer | 18 |
| Pete Sampras | 14 |
| Rafael Nadal | 14 |
| Novak Djokovic | 12 |
| Roy Emerson | 12 |
| Bjorn Borg | 11 |

Furthermore, we group the dataset by tournament and winner and compute the total number of wins per tournament.

slam_win_cnt = inner_join(slam_win, top_winners_gt10) %>% select(TOURNAMENT, WINNER) %>%
  group_by(WINNER, TOURNAMENT) %>% summarise(NUM_WINS = n()) %>% arrange(TOURNAMENT, desc(NUM_WINS))
## Warning: Column `WINNER` joining character vector and factor, coercing into
## character vector
kable(slam_win_cnt)
| WINNER | TOURNAMENT | NUM_WINS |
|--------|------------|----------|
| Novak Djokovic | Australian Open | 6 |
| Roy Emerson | Australian Open | 6 |
| Roger Federer | Australian Open | 5 |
| Rod Laver | Australian Open | 3 |
| Pete Sampras | Australian Open | 2 |
| Rafael Nadal | Australian Open | 1 |
| Rafael Nadal | French Open | 9 |
| Bjorn Borg | French Open | 6 |
| Rod Laver | French Open | 2 |
| Roy Emerson | French Open | 2 |
| Novak Djokovic | French Open | 1 |
| Roger Federer | French Open | 1 |
| Pete Sampras | U.S. Open | 5 |
| Roger Federer | U.S. Open | 5 |
| Novak Djokovic | U.S. Open | 2 |
| Rafael Nadal | U.S. Open | 2 |
| Rod Laver | U.S. Open | 2 |
| Roy Emerson | U.S. Open | 2 |
| Pete Sampras | Wimbledon | 7 |
| Roger Federer | Wimbledon | 7 |
| Bjorn Borg | Wimbledon | 5 |
| Rod Laver | Wimbledon | 4 |
| Novak Djokovic | Wimbledon | 3 |
| Rafael Nadal | Wimbledon | 2 |
| Roy Emerson | Wimbledon | 2 |

This output is then put into a chord diagram, which lets us see the relationships between players and tournaments. It also allows us to see which tournaments the players have in common.

chordDiagram(slam_win_cnt)
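The data frame passed to chordDiagram() is in a long "from, to, value" layout: each row is one link, and NUM_WINS sets its width. To picture the same links as a grid, I sketched how the long table maps to a player-by-tournament matrix in base R. The rows below are toy values with made-up player names, and the sketch does not call circlize at all.

```r
# Toy long-format counts (hypothetical players)
toy_cnt <- data.frame(
  WINNER = c("Player A", "Player A", "Player B"),
  TOURNAMENT = c("Wimbledon", "U.S. Open", "Wimbledon"),
  NUM_WINS = c(7, 5, 3),
  stringsAsFactors = FALSE
)

# xtabs() sums NUM_WINS for every player/tournament pair;
# pairs that never occur get a zero
adj <- xtabs(NUM_WINS ~ WINNER + TOURNAMENT, data = toy_cnt)

# Drop the xtabs class to get a plain numeric matrix with the same names
adj_mat <- matrix(as.numeric(adj), nrow = nrow(adj), dimnames = dimnames(adj))

adj_mat
```

Each non-zero cell of this matrix corresponds to one ribbon in the chord diagram.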

Radar Plot

Seeing a player’s performance in each of the Tennis Grand Slam tournaments reveals that player’s strengths and weaknesses. A radar plot is recommended to visualize such data. It allows someone to explore a specific player’s performance and see which tournament that player tends to win the most.

Roger Federer

champion_radar_plot <- function(df, champion_name) {
  slam_win_cnt_chp = df %>% filter(WINNER == champion_name)
  chp_num_wins <- slam_win_cnt_chp$NUM_WINS
  l <- length(chp_num_wins)
  max_v <- 10 # choosing the same maximum value for all champions
  chp_df <- data.frame(rbind(max = rep(max_v, l), min = rep(0, l), chp_num_wins))
  colnames(chp_df) <- slam_win_cnt_chp$TOURNAMENT
  seg_n <- max_v
  radarchart(chp_df, axistype = 1, caxislabels = seq(0, max_v, 1), seg = seg_n,
             centerzero = TRUE, pcol = rgb(0.2, 0.5, 0.5, 0.9) , pfcol = rgb(0.2, 0.5, 0.5, 0.3),
             plwd = 1, cglcol = "grey", cglty = 1, axislabcol = "blue",
             vlcex = 0.8, calcex = 0.7, title = champion_name)
  
}

champion_radar_plot(slam_win_cnt, "Roger Federer")

Rafael Nadal

champion_radar_plot(slam_win_cnt, "Rafael Nadal")

Novak Djokovic

champion_radar_plot(slam_win_cnt, "Novak Djokovic")
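The part of champion_radar_plot that took me longest to understand is the data frame it hands to radarchart(): the first row holds the axis maximum, the second row the minimum, and only the third row holds the actual win counts. The sketch below builds just that data frame, without plotting. The helper name build_radar_df is my own, and the counts are Roger Federer's from the table above.

```r
# Hypothetical helper: builds the max/min/data layout used for the radar plot
build_radar_df <- function(num_wins, tournaments, max_v = 10) {
  l <- length(num_wins)
  df <- data.frame(rbind(max = rep(max_v, l),   # row 1: upper limit of every axis
                         min = rep(0, l),       # row 2: lower limit of every axis
                         wins = num_wins))      # row 3: the values to draw
  colnames(df) <- tournaments
  df
}

federer_df <- build_radar_df(
  c(5, 1, 5, 7),
  c("Australian Open", "French Open", "U.S. Open", "Wimbledon")
)

federer_df
```

Keeping max_v the same for every champion means the three radar plots share one scale, so their shapes can be compared directly.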

Project Conclusion

This extra credit project allowed me to gain immense insight into how to use R Markdown and explore the different ways to manipulate data in RStudio. This experience was more than fun! At times the coding was a bit frustrating, but it was nonetheless rewarding to see the final finished product.