We are going to look at the deuce package for R and at some of the analyses its datasets make possible. The deuce package contains data on professional tennis, and its purpose is to let users carry out a variety of sports analyses.
The deuce package can be installed from GitHub with the code below. Once the package is installed, load the following libraries as well:
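# Install deuce from GitHub (devtools provides install_github)
library(devtools)
install_github("skoval/deuce")
# Show the code in the knitted output
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(htmlTable)
library(deuce)
# dplyr provides %>%, group_by() and filter(), which are used below
library(dplyr)
library(ggplot2)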
The following table provides the names and summary descriptions of each dataset in the package:
| Name | Description |
|---|---|
| atp_elo | ATP Elo Ratings |
| atp_importance | ATP Point Importance |
| atp_matches | ATP Playing Activity |
| atp_odds | ATP Match Odds |
| atp_odds_match_lookup | ATP Match Odds Lookup Table |
| atp_players | Biographic Details of ATP Players |
| atp_rankings | Rankings of ATP Players |
| atp_tournaments | ATP Tournaments |
| gs_point_by_point | Grand Slam Point by Point Match Data |
| mcp_points | Detail Point-by-Point Tennis Matches |
| point_by_point | Point by Point Match Data |
| wta_elo | WTA Elo Ratings |
| wta_importance | WTA Point Importance |
| wta_matches | WTA Playing Activity |
| wta_odds | WTA Match Odds |
| wta_odds_match_lookup | WTA Match Odds Lookup Table |
| wta_players | Biographic Details of WTA Players |
| wta_rankings | Rankings of WTA Players |
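A listing like the one above can also be pulled from an installed package with base R's `data()` function. The sketch below is only an illustration: `list_datasets` is a hypothetical helper name, not part of deuce, and the call on deuce is guarded so the code still runs when the package is not installed.

```r
# Hypothetical helper: return the names and titles of the datasets in a package
list_datasets <- function(pkg) {
  info <- data(package = pkg)$results  # character matrix with "Item" and "Title" columns
  data.frame(Name = info[, "Item"], Description = info[, "Title"])
}

# Base R's own "datasets" package is always available, so this call runs anywhere
head(list_datasets("datasets"))

# The same call for deuce, skipped when deuce is not installed
if (requireNamespace("deuce", quietly = TRUE)) {
  print(list_datasets("deuce"))
}
```

The titles returned for deuce depend on the installed version, so they may not match the table exactly.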
Let us take a look at a few of the analyses we can do with this package.
One question comes up in almost every sport: which player is the Greatest of All Time (GOAT)? In tennis this debate has been going on for decades. Everyone has a different opinion, so let us consider one way to approach the discussion. We are going to use Elo ratings to measure a player’s ability, because an Elo rating puts a number on a player’s strength over time relative to the opponents they have faced.
We are going to use a player’s peak Elo, the highest rating they reached in their career, as the measure of their greatest achievement. We will use the atp_elo dataset and the overall rating, which covers all the surfaces they have played on (grass, hard, clay).
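To make the idea of a rating "relative to the opponents faced" concrete, here is a minimal sketch of the textbook Elo update. It is an assumption for illustration only: the function names are hypothetical, `k = 32` is an arbitrary choice, and the ratings shipped in deuce may have been computed with a different variant.

```r
# Expected score (win probability) of player A against player B
# under the standard Elo model
elo_expected <- function(rating_a, rating_b) {
  1 / (1 + 10^((rating_b - rating_a) / 400))
}

# Player A's new rating after one match; k = 32 is an illustrative
# choice, not necessarily the value behind the deuce ratings
elo_update <- function(rating_a, rating_b, won, k = 32) {
  rating_a + k * (as.numeric(won) - elo_expected(rating_a, rating_b))
}

# A 2000-rated favourite against an 1800-rated opponent
elo_expected(2000, 1800)             # win probability of the favourite
elo_update(2000, 1800, won = TRUE)   # small gain for an expected win
elo_update(2000, 1800, won = FALSE)  # larger loss for an upset
```

Beating a stronger opponent moves the rating more than beating a weaker one, which is why a peak Elo reflects the quality of the opposition as well as the number of wins.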
# Load the dataset
data("atp_elo")
# Get the peak Elo rating of every player in the Open Era
peak_atp_elo <- atp_elo %>%
  group_by(player_name) %>%
  dplyr::summarise(
    peak.elo = max(overall_elo, na.rm = TRUE)
  )
# Keep the 10 players with the highest peak Elo ratings
peak_atp_elo <- peak_atp_elo[order(peak_atp_elo$peak.elo, decreasing = TRUE), ][1:10, ]
# Order the factor levels by peak Elo so that the plot is sorted
peak_atp_elo$player_name <- factor(peak_atp_elo$player_name, levels = peak_atp_elo$player_name[order(peak_atp_elo$peak.elo)], ordered = TRUE)
# Plot the data
peak_atp_elo %>%
  ggplot(aes(y = peak.elo, x = player_name)) +
  geom_point(size = 2, col = "red") +
  theme(legend.position = "none") +
  scale_y_continuous("Career Peak Elo") +
  scale_x_discrete("") +
  coord_flip()
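The group, summarise and rank steps can be checked on a small made-up data frame before running them on the full dataset. In this sketch the column names `player_name` and `overall_elo` mirror the code above, the values are invented, and `slice_max()` (dplyr 1.0.0 or later) stands in for the `order()` and `[1:10, ]` indexing.

```r
library(dplyr)

# Invented ratings for three hypothetical players
toy_elo <- data.frame(
  player_name = c("Player A", "Player A", "Player B", "Player B", "Player C"),
  overall_elo = c(2100, 2250, 2300, NA, 1900)
)

# Peak rating per player, then the two highest peaks
toy_peak <- toy_elo %>%
  group_by(player_name) %>%
  summarise(peak.elo = max(overall_elo, na.rm = TRUE), .groups = "drop") %>%
  slice_max(peak.elo, n = 2, with_ties = FALSE)

toy_peak
```

Player B's missing rating is ignored because of `na.rm = TRUE`, so Player B ranks first on the remaining value of 2300.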
In this exploration, we look at rally length trends by estimating the average rally length in men’s and women’s tennis from 2000 to 2019. Rally length is the number of shots played in a point. We are going to use the mcp_points dataset, which contains detailed point-by-point match data.
# Load the dataset
data("mcp_points")
# Derive the year and tour from match_id; a rally count of 0
# (such as a double fault) is counted as 1 shot
mcp_points <- mcp_points %>%
  dplyr::mutate(
    year = as.numeric(substr(match_id, 1, 4)),
    ATP = ifelse(grepl("[0-9]-M-", match_id), "ATP", "WTA"),
    rallyCount = as.numeric(ifelse(rallyCount == 0, 1, rallyCount))
  ) %>%
  dplyr::filter(year >= 2000, !is.na(rallyCount))
# Plot a smoothed trend of rally length by year
# ATP refers to men's tennis and WTA refers to women's tennis
mcp_points %>%
  ggplot(aes(y = rallyCount, x = year, fill = ATP, colour = ATP)) +
  geom_smooth(alpha = 0.3) +
  scale_y_continuous("Rally Length", breaks = scales::pretty_breaks(n = 10)) +
  scale_x_continuous("", breaks = scales::pretty_breaks(n = 10)) +
  expand_limits(y = 1) +
  theme(legend.position = "bottom", legend.direction = "horizontal") +
  ggtitle("Rally Length Trends")
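The smoothed curves can be complemented by a plain table of mean rally length per tour and year. The sketch below applies the same `match_id` parsing to a few invented rows. The IDs only follow the pattern the code above assumes (year in the first four characters, `-M-` marking a men's match), and `tour`, `mean_rally` and `n_points` are names chosen for this example.

```r
library(dplyr)

# Invented points; real match_id values in mcp_points will differ
toy_points <- data.frame(
  match_id = c("20150607-M-Example-A", "20150607-M-Example-A",
               "20180908-W-Example-B", "20180908-W-Example-B"),
  rallyCount = c(0, 5, 3, 7)
)

toy_summary <- toy_points %>%
  mutate(
    year = as.numeric(substr(match_id, 1, 4)),                 # first four characters
    tour = ifelse(grepl("[0-9]-M-", match_id), "ATP", "WTA"),  # "-M-" marks a men's match
    rallyCount = ifelse(rallyCount == 0, 1, rallyCount)        # count a 0 as one shot
  ) %>%
  group_by(tour, year) %>%
  summarise(mean_rally = mean(rallyCount), n_points = n(), .groups = "drop")

toy_summary
```

Running the same `group_by()` and `summarise()` on the filtered `mcp_points` data would give the yearly averages behind the plot, along with the number of points each average is based on.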
Many other analyses are possible with further exploration of the datasets in the deuce package. The package contains detailed data from the professional tennis tours and can prove to be really useful.