1. Introduction

In this post we look at the deuce package for R and some of the analyses it makes possible. The package contains data on professional tennis and is intended to support a variety of sports analyses.

2. Package Installation

The deuce package can be installed from GitHub with the code below. Once it is installed, load the following libraries as well:

library(devtools)
install_github("skoval/deuce")
knitr::opts_chunk$set(echo = TRUE)

library(knitr)
library(htmlTable)
library(deuce)
library(dplyr)    # provides %>%, group_by() and filter() used in the explorations
library(ggplot2)

3. Datasets included

The following table provides the names and summary descriptions of each dataset in the package:

Name                    Description
atp_elo                 ATP Elo Ratings
atp_importance          ATP Point Importance
atp_matches             ATP Playing Activity
atp_odds                ATP Match Odds
atp_odds_match_lookup   ATP Match Odds Lookup Table
atp_players             Biographic Details of ATP Players
atp_rankings            Rankings of ATP Players
atp_tournaments         ATP Tournaments
gs_point_by_point       Grand Slam Point by Point Match Data
mcp_points              Detail Point-by-Point Tennis Matches
point_by_point          Point by Point Match Data
wta_elo                 WTA Elo Ratings
wta_importance          WTA Point Importance
wta_matches             WTA Playing Activity
wta_odds                WTA Match Odds
wta_odds_match_lookup   WTA Match Odds Lookup Table
wta_players             Biographic Details of WTA Players
wta_rankings            Rankings of WTA Players
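
The same kind of listing can be pulled from any installed package with base R's data() function. The sketch below wraps that call in a helper; the name list_datasets is hypothetical and not part of deuce. It is demonstrated on the built-in datasets package so that it runs without deuce installed.

```r
# Hypothetical helper (not part of deuce): list the datasets a package ships.
list_datasets <- function(pkg) {
  res <- data(package = pkg)$results   # character matrix with Item and Title columns
  data.frame(
    Name = unname(res[, "Item"]),
    Description = unname(res[, "Title"]),
    stringsAsFactors = FALSE
  )
}

# Shown on the built-in "datasets" package; swap in "deuce" once it is installed.
head(list_datasets("datasets"))
```

Calling list_datasets("deuce") after installation should give a table of dataset names and descriptions like the one above.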

4. Exploration of the datasets

Exploration - 1

Let us take a look at a few of the analyses we can do with this package.

One question comes up in almost every sport: which player is the greatest of all time (GOAT)? In tennis this debate has been going on for decades. Everyone has a different opinion, so let us consider one way to approach the discussion. We are going to use Elo ratings to measure a player’s ability, since they give a numerical estimate of a player’s strength over time that takes into account the opponents they have faced.

We use each player’s peak Elo, the highest rating reached during their career, as the measure of their best level. The data comes from the atp_elo dataset, and we use the overall rating, which covers all surfaces (grass, hard, clay).
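
For readers unfamiliar with Elo, the standard update rule can be sketched as follows. This is the textbook formula with the conventional 400-point scale and an assumed constant k of 32; the ratings stored in atp_elo may have been computed with different settings, so the numbers here are only illustrative. The function names are made up for this example.

```r
# Expected probability that player A beats player B under standard Elo.
elo_expected <- function(rating_a, rating_b) {
  1 / (1 + 10^((rating_b - rating_a) / 400))
}

# Update A's rating after one match. `won` is 1 for a win and 0 for a loss.
# k = 32 is an assumed constant, not a value taken from deuce.
elo_update <- function(rating_a, rating_b, won, k = 32) {
  rating_a + k * (won - elo_expected(rating_a, rating_b))
}

elo_expected(1500, 1500)    # evenly matched players: 0.5
elo_update(1500, 1500, 1)   # win against an equal opponent: 1516
```

Beating a higher-rated opponent earns more points than beating an equal one, which is why a peak Elo reflects the quality of the opposition as well as the number of wins.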

#Loading the dataset
data("atp_elo")

#Getting the peak Elo rating of every player in the Open Era
peak_atp_elo <- atp_elo %>%
    dplyr::group_by(player_name) %>%
    dplyr::summarise(
      peak.elo = max(overall_elo, na.rm = TRUE)
    )

#Keeping the 10 players with the highest peak Elo ratings
peak_atp_elo <- peak_atp_elo[order(peak_atp_elo$peak.elo, decreasing = TRUE), ][1:10, ]
#Ordering the factor levels by peak Elo so the plot is sorted
peak_atp_elo$player_name <- factor(peak_atp_elo$player_name, levels = peak_atp_elo$player_name[order(peak_atp_elo$peak.elo)], ordered = TRUE)

#Plotting our data
peak_atp_elo %>%
  ggplot(aes(y = peak.elo, x = player_name)) + 
  geom_point(size = 2, col = "red") +
  theme(legend.position = "none") +
  scale_y_continuous("Career Peak Elo") + 
  scale_x_discrete("") + 
  coord_flip()

  • As the graph shows, the male player in the Open Era with the highest peak Elo is Novak Djokovic, followed by Bjorn Borg and Roger Federer, who sit a little below him. Since the overall rating combines all surfaces, this means Djokovic has reached the highest rating across all surfaces put together.
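
To make the peak computation easier to follow, here is a small base R sketch of the same steps (take the maximum rating per player, then sort) on toy data. The player names and ratings are invented for illustration and are not real Elo values; only the column names mirror the ones used above.

```r
# Toy ratings invented for illustration; these are not real Elo values.
toy_elo <- data.frame(
  player_name = c("Player A", "Player A", "Player B", "Player B", "Player C"),
  overall_elo = c(2100, 2250, 2300, NA, 1900)
)

# Peak rating per player. The formula interface of aggregate() drops
# rows with missing ratings before taking the maximum.
toy_peaks <- aggregate(overall_elo ~ player_name, data = toy_elo, FUN = max)

# Sort from highest to lowest peak and keep the top two.
toy_peaks <- toy_peaks[order(toy_peaks$overall_elo, decreasing = TRUE), ]
toy_top2 <- head(toy_peaks, 2)
toy_top2
```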

Exploration - 2

In this exploration we look at rally length trends by estimating the average rally length in both men’s and women’s tennis from 2000 to 2019. Rally length is the number of shots played in a point. We use the mcp_points dataset, which provides detailed point-by-point match data.

#Loading the dataset
data("mcp_points")

# Counting points with a rally count of 0 (double faults) as 1 shot
mcp_points <- mcp_points %>%
  dplyr::mutate(
    year = as.numeric(substr(match_id, 1, 4)),              # match_id starts with the year
    ATP = ifelse(grepl("[0-9]-M-", match_id), "ATP", "WTA"), # "-M-" marks men's matches
    rallyCount = as.numeric(ifelse(rallyCount == 0, 1, rallyCount))
  ) %>%
  dplyr::filter(year >= 2000, !is.na(rallyCount))

#Plotting the graph
#ATP refers to men's Tennis and WTA refers to women's Tennis
mcp_points %>%
  ggplot(aes(y = rallyCount, x = year, fill = ATP, colour = ATP)) + 
  geom_smooth(alpha = 0.3) + 
  scale_y_continuous("Rally Length", breaks = scales::pretty_breaks(n = 10)) + 
  scale_x_continuous("", breaks = scales::pretty_breaks(n = 10)) + 
  expand_limits(y = 1) + 
  theme(legend.position = "bottom", legend.direction = "horizontal") + 
  ggtitle("Rally Length Trends")

  • Typical rally lengths are mostly between 3 and 5 shots for both men and women, with the men’s average varying a little more than the women’s. For women, the average rally length is typically between 4 and 4.5 shots.
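
The data preparation behind this plot can be checked on a few toy rows. The sketch below uses invented match IDs that follow the pattern the code above relies on (a date prefix, then "-M-" for men’s matches); the "-W-" marker for women’s matches and the tournament and player names are assumptions made for this example. It uses base R only and computes a plain mean per tour and year rather than the smoothed trend drawn by geom_smooth.

```r
# Toy points with invented match IDs; "-W-" for women's matches is an assumption.
toy_points <- data.frame(
  match_id = c("20180115-M-Example_Open-F-Player_A-Player_B",
               "20180115-M-Example_Open-F-Player_A-Player_B",
               "20190610-W-Example_Open-F-Player_C-Player_D",
               "20190610-W-Example_Open-F-Player_C-Player_D"),
  rallyCount = c(0, 5, 3, 8)
)

# The year is the first four characters of the match ID.
toy_points$year <- as.numeric(substr(toy_points$match_id, 1, 4))
# "-M-" after a digit marks a men's match; everything else is treated as WTA.
toy_points$tour <- ifelse(grepl("[0-9]-M-", toy_points$match_id), "ATP", "WTA")
# Same convention as above: a recorded rally count of 0 counts as 1 shot.
toy_points$rallyCount <- ifelse(toy_points$rallyCount == 0, 1, toy_points$rallyCount)

# Mean rally length per tour and year.
rally_means <- aggregate(rallyCount ~ tour + year, data = toy_points, FUN = mean)
rally_means
```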

5. Summary

Many other analyses are possible with the datasets in the deuce package. The package contains detailed data from the professional tennis tours and can be very useful for further exploration.