Introduction

For my final project, I chose the ATP Tennis data set. It contains information on all matches on the men’s ATP tour from 2000 to 2026, with daily updates as the tour continues throughout the year. The columns include the players in each match, their rankings, the surface, the tournament, the score, the date, the round, the level, how many sets it took, the points and odds, and the winner of each match. Each row is an individual match, with about 60,000 matches recorded.

With this data set, I wanted to explore upsets, both wins and losses. I wanted to see which players had the most upsets and on which surface they were most dominant. I also wanted to see how ranking gaps affect the probability of an upset and, finally, how the two greatest players of all time, Rafael Nadal and Roger Federer, compare in terms of upsets.

Overall, these visualizations aim to show which upsets have happened on the ATP tour and to compare players and surfaces to see if there are any patterns in those upsets.

Data

rm(list=ls())
library(tidyverse)

tennis <- read_csv("atp_tennis.csv")

Visualization 1: Players with the most upsets

This first visualization aims to find the 10 players who have upset their opponents the most, meaning they won against a higher-ranked player (one with a smaller ranking number).

tennis_upsets <- tennis %>%
  mutate(
    Upset = ifelse(
      (Rank_1 > Rank_2 & Winner == Player_1) |
        (Rank_2 > Rank_1 & Winner == Player_2),
      "Upset", "Expected"
    ),
    Upset_Winner = ifelse(
      Upset == "Upset" & Winner == Player_1, Player_1,
      ifelse(Upset == "Upset" & Winner == Player_2, Player_2, NA)
    )
  ) %>%
  filter(Upset == "Upset")
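As a quick check of the upset rule, here is a minimal sketch on a made-up table. The rows and the names toy_matches, toy_flagged, Winner_Rank, and Loser_Rank are assumptions for illustration and are not part of atp_tennis.csv. It expresses the same rule as "the winner's ranking number is larger than the loser's", and it shows that a match with a missing rank comes out as NA rather than as an upset.

```r
library(dplyr)

# Hypothetical toy matches (not from atp_tennis.csv), same column names
toy_matches <- tibble(
  Player_1 = c("A", "C", "E"),
  Player_2 = c("B", "D", "F"),
  Rank_1   = c(50, 3, NA),
  Rank_2   = c(10, 40, 20),
  Winner   = c("A", "C", "F")
)

toy_flagged <- toy_matches %>%
  mutate(
    # Rank of whoever won and whoever lost, regardless of column position
    Winner_Rank = if_else(Winner == Player_1, Rank_1, Rank_2),
    Loser_Rank  = if_else(Winner == Player_1, Rank_2, Rank_1),
    # An upset is a win by the player with the larger ranking number
    Upset       = Winner_Rank > Loser_Rank
  )

toy_flagged
```

In this toy table, only the first match is an upset; the third match has an unknown rank, so it cannot be classified either way.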

# Count upset wins per player and keep the 10 highest counts (ties included)
top_upsets <- tennis_upsets %>%
  count(Upset_Winner, sort = TRUE) %>%
  slice_max(n, n = 10)

ggplot(top_upsets, aes(x = reorder(Upset_Winner, n), y = n)) +
  geom_col(fill = "green") +
  geom_text(aes(label = n), hjust = -0.1, color = "black") +
  coord_flip() +
  labs(
    title = "Top 10 Players with the Most Upset Wins",
    x = "Player",
    y = "Number of Upset Wins"
  ) +
  theme_minimal()

This first visualization shows that Feliciano Lopez is the player with the most upset wins, with Stan Wawrinka and Gael Monfils following closely behind. All three of these players have had long careers on the ATP tour, some still continuing now.

With this information in mind, I continued to explore Lopez’s upset wins, to see if there were any patterns in the ranking gaps of his upset wins, and if there were any patterns in the surfaces he was most dominant on.

lopez_upsets <- tennis %>%
  filter(!is.na(Rank_1), !is.na(Rank_2)) %>%
  mutate(
    Upset = case_when(
      Rank_1 > Rank_2 & Winner == Player_1 ~ TRUE,
      Rank_2 > Rank_1 & Winner == Player_2 ~ TRUE,
      TRUE ~ FALSE
    )
  ) %>%
  filter(Upset == TRUE & Winner == "Lopez F.") %>%
  mutate(
    Rank_Gap = abs(Rank_1 - Rank_2)
  )

ggplot(lopez_upsets, aes(x = Rank_Gap)) +
  geom_histogram(binwidth = 10, fill = "lightgreen", color ="forestgreen") +
  labs(
    title = "Distribution of Ranking Gaps in Lopez F. Upset Wins",
    x = "Ranking Gap",
    y = "Count"
  ) +
  theme_minimal()

# By Surface
ggplot(lopez_upsets, aes(x = Rank_Gap)) +
  geom_histogram(binwidth = 10, fill = "lightgreen", color ="forestgreen") +
  labs(
    title = "Distribution of Ranking Gaps in Lopez F. Upset Wins by Surface",
    x = "Ranking Gap",
    y = "Count"
  ) +
  facet_wrap(~Surface) +
  theme_minimal()

The first histogram shows that most of Lopez’s upset wins came at ranking gaps of 20 or less, with a few outliers at larger gaps. This suggests that Lopez has been more successful in upsetting players who are close to his own ranking than players who are ranked far above him. The larger gaps could be explained by changes in his own ranking: Lopez has been ranked as high as 12 in the world, but has also been ranked outside the top 100 at times in his career, especially early on.
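One way to check the idea that the larger gaps come from periods when Lopez himself was ranked lower is to put his own rank next to each gap. The sketch below does this on made-up rows shaped like lopez_upsets. The values and the names toy_lopez, toy_lopez_ranks, and Own_Rank are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical rows shaped like lopez_upsets (values are illustrative only)
toy_lopez <- tibble(
  Player_1 = c("Lopez F.", "X", "Lopez F."),
  Player_2 = c("Y", "Lopez F.", "Z"),
  Rank_1   = c(30, 5, 120),
  Rank_2   = c(12, 25, 8),
  Winner   = "Lopez F."
)

toy_lopez_ranks <- toy_lopez %>%
  mutate(
    # The winner's own rank, whichever column he appears in
    Own_Rank = if_else(Player_1 == Winner, Rank_1, Rank_2),
    Rank_Gap = abs(Rank_1 - Rank_2)
  )

toy_lopez_ranks
```

Applied to the real lopez_upsets table, a scatter plot of Rank_Gap against Own_Rank would show whether the largest gaps line up with the times he was ranked lowest.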

Visualization 2: Do Bigger Ranking Gaps Lead to Fewer Upsets?

From the last graphs, it seems that Lopez’s upset wins are more common when the ranking gap is smaller. This leads to the question of whether bigger ranking gaps generally lead to fewer upsets across the entire ATP tour.

tennis_diff <- tennis %>%
  mutate(
    Rank_Diff = abs(Rank_1 - Rank_2),
    Upset = ifelse(
      (Rank_1 > Rank_2 & Winner == Player_1) |
        (Rank_2 > Rank_1 & Winner == Player_2),
      1, 0
    )
  )

ggplot(tennis_diff, aes(x = Rank_Diff, y = Upset)) +
  geom_smooth() +
  labs(
    title = "Do Bigger Ranking Gaps Lead to Fewer Upsets?",
    x = "Ranking Difference",
    y = "Probability of Upset"
  ) +
  theme_minimal()

This graph shows that as the ranking difference increases, the probability of an upset decreases. This suggests that lower-ranked players are more likely to beat opponents who are close to their own ranking, and less likely to beat opponents who are ranked much higher. This trend is consistent with the idea that higher-ranked players are generally more skilled and experienced, making it more difficult for lower-ranked players to pull off an upset.
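The smoothed curve can be cross-checked with a simple table of upset rates per ranking-gap bin. The sketch below uses made-up rows with the same two columns as tennis_diff. The bin edges, the values, and the names toy_diff, toy_rates, and Gap_Bin are assumptions chosen for illustration, not results from the data.

```r
library(dplyr)

# Hypothetical toy data with the same columns as tennis_diff
toy_diff <- tibble(
  Rank_Diff = c(3, 8, 15, 40, 45, 120, 150, 300),
  Upset     = c(1, 0, 0, 0, 1, 0, 0, 0)
)

toy_rates <- toy_diff %>%
  mutate(
    # Bins are right-closed: (0,10], (10,50], (50,100], (100,Inf]
    Gap_Bin = cut(
      Rank_Diff,
      breaks = c(0, 10, 50, 100, Inf),
      labels = c("1-10", "11-50", "51-100", "101+")
    )
  ) %>%
  group_by(Gap_Bin) %>%
  # The mean of a 0/1 column is the share of upsets in that bin
  summarise(Matches = n(), Upset_Rate = mean(Upset), .groups = "drop")

toy_rates
```

Reporting Matches next to Upset_Rate matters because very large gaps are rare, so the rate in those bins rests on few matches.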

Visualization 3: Nadal vs Federer Upsets

Finally, I wanted to compare the two greatest players of all time, Rafael Nadal and Roger Federer, in terms of their upset wins and losses. I wanted to see how they compared in terms of their upset rates, and if there were any differences in their upset rates across different surfaces.

players_tennis <- tennis %>%
  filter(
    str_detect(Player_1, "Nadal|Federer") |
      str_detect(Player_2, "Nadal|Federer")
  )

players_tennis <- players_tennis %>%
  mutate(
    Player = case_when(
      str_detect(Player_1, "Nadal") ~ "Nadal",
      str_detect(Player_2, "Nadal") ~ "Nadal",
      str_detect(Player_1, "Federer") ~ "Federer",
      str_detect(Player_2, "Federer") ~ "Federer"
    ),
    
    Is_P1 = case_when(
      Player == "Nadal" ~ str_detect(Player_1, "Nadal"),
      Player == "Federer" ~ str_detect(Player_1, "Federer")
    ),
    
    Player_Rank = ifelse(Is_P1, Rank_1, Rank_2),
    Opponent_Rank = ifelse(Is_P1, Rank_2, Rank_1),
    
    Player_Won = case_when(
      Player == "Nadal" ~ str_detect(Winner, "Nadal"),
      Player == "Federer" ~ str_detect(Winner, "Federer")
    ),
    
    Upset_Type = case_when(
      Player_Rank < Opponent_Rank & Player_Won == FALSE ~ "Upset Loss",
      Player_Rank > Opponent_Rank & Player_Won == TRUE ~ "Upset Win",
      TRUE ~ "Expected"
    )
  )
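One thing to note about the code above is that a match between Nadal and Federer is labeled only as a Nadal match, because case_when() stops at the first condition that is true. The sketch below shows one possible alternative that builds one row per player per match, so head-to-head matches count for both. The helper player_view() and the toy rows are hypothetical and for illustration only; this is not the code used for the plots in this report.

```r
library(dplyr)

# Hypothetical toy matches using the same column names as the data set
toy_matches <- tibble(
  Player_1 = c("Nadal R.", "Federer R.", "Smith J."),
  Player_2 = c("Federer R.", "Jones A.", "Nadal R."),
  Rank_1   = c(2, 1, 80),
  Rank_2   = c(1, 50, 3),
  Winner   = c("Nadal R.", "Federer R.", "Smith J.")
)

# Hypothetical helper: describe every match from one player's point of view
player_view <- function(matches, name, label) {
  matches %>%
    filter(grepl(name, Player_1, fixed = TRUE) |
             grepl(name, Player_2, fixed = TRUE)) %>%
    mutate(
      Player        = label,
      Is_P1         = grepl(name, Player_1, fixed = TRUE),
      Player_Rank   = if_else(Is_P1, Rank_1, Rank_2),
      Opponent_Rank = if_else(Is_P1, Rank_2, Rank_1),
      Player_Won    = grepl(name, Winner, fixed = TRUE),
      Upset_Type    = case_when(
        Player_Rank < Opponent_Rank & !Player_Won ~ "Upset Loss",
        Player_Rank > Opponent_Rank & Player_Won  ~ "Upset Win",
        TRUE ~ "Expected"
      )
    )
}

# Stack both views; a Nadal vs Federer match appears once for each player
toy_players <- bind_rows(
  player_view(toy_matches, "Nadal", "Nadal"),
  player_view(toy_matches, "Federer", "Federer")
)

toy_players
```

In the toy table, the first match is an upset win for Nadal and, from the other side, an upset loss for Federer.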

# Proportion of each outcome type per player
ggplot(players_tennis, aes(x = Player, fill = Upset_Type)) +
  geom_bar(position = "fill") +
  labs(
    title = "Upset Rate Comparison",
    y = "Proportion"
  ) +
  theme_minimal() +
  scale_fill_viridis_d(option = "plasma")

# Count of each outcome type per player, by surface
ggplot(players_tennis, aes(x = Player, fill = Upset_Type)) +
  geom_bar(position = "dodge") +
  facet_wrap(~Surface) +
  labs(
    title = "Upset Comparison by Surface (Counts)",
    x = "Player",
    y = "Count"
  ) +
  theme_minimal() +
  scale_fill_viridis_d(option = "plasma")

The first graph shows that Nadal and Federer have similar proportions of upset wins and losses, with a majority of their matches being expected outcomes. However, Nadal has a slightly higher proportion of upset wins than Federer, while Federer has a slightly higher proportion of upset losses than Nadal.

The second graph, split by surface, shows that Nadal has a higher proportion of upset wins on clay courts than Federer, which is consistent with Nadal’s reputation as the “King of Clay”. On hard courts and grass courts, both players have similar proportions of upset wins and losses. Overall, these graphs suggest that while both players have had their share of upsets, Nadal has been more successful than Federer in pulling off upsets on clay courts.
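Because the surface graph shows counts, comparing proportions is easier with a small table of shares within each player and surface. The sketch below shows the calculation on made-up rows shaped like players_tennis. The values and the names toy_outcomes, toy_props, and Proportion are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy rows shaped like players_tennis (values are illustrative only)
toy_outcomes <- tibble(
  Player     = c("Nadal", "Nadal", "Nadal", "Nadal",
                 "Federer", "Federer", "Federer", "Federer"),
  Surface    = c("Clay", "Clay", "Hard", "Hard",
                 "Clay", "Clay", "Clay", "Clay"),
  Upset_Type = c("Upset Win", "Expected", "Expected", "Expected",
                 "Expected", "Upset Loss", "Expected", "Expected")
)

toy_props <- toy_outcomes %>%
  # Matches per player, surface, and outcome type
  count(Player, Surface, Upset_Type) %>%
  group_by(Player, Surface) %>%
  # Share of each outcome within that player's matches on that surface
  mutate(Proportion = n / sum(n)) %>%
  ungroup()

toy_props
```

Running the same steps on players_tennis would give the proportion of upset wins and upset losses for each player on each surface, which can then be compared directly.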

Conclusion

In conclusion, this analysis set out to learn more about upsets across different players’ careers and to compare two great players.

The first visualization showed that Feliciano Lopez has the most upset wins on the ATP tour, with most of them coming at ranking gaps of 20 or less. This suggests that Lopez has been more successful in upsetting players who are close to his own ranking than players who are ranked far above him.

From the second visualization, it was found that as the ranking difference increases, the probability of an upset decreases. This suggests that players are more likely to win against opponents who are closer to their own ranking, and less likely to win against opponents who are much higher ranked.

Finally, when comparing Nadal and Federer, it was found that both players have a similar proportion of upset wins and losses, with a majority of their matches being expected outcomes. However, Nadal has a slightly higher proportion of upset wins compared to Federer, while Federer has a slightly higher proportion of upset losses compared to Nadal. When looking at the graphs by surface, it was found that Nadal has a higher proportion of upset wins on clay courts compared to Federer, which is consistent with Nadal’s reputation as the “King of Clay”.

Overall, this analysis provides insights into the patterns of upsets on the ATP tour and highlights the differences between two of the greatest players of all time. It also shows how ranking gaps can affect the likelihood of an upset, and how certain players may have more success in pulling off upsets on specific surfaces.

Final Project

Hannah Bui

2026-03-31