Assignment 7

WTA Ranking and Age Analysis

Introduction

The Women’s Tennis Association (WTA) is the governing body of women’s professional tennis worldwide. Players of many ages compete on the tour, and I want to analyze age alongside other factors such as rank and points.

I intend to answer my questions using the rankings of WTA tennis players. I used the chromote package in RStudio to scrape the WTA’s top 50 rankings for singles and doubles players as of May 8th, 2026; this is the data I use for my analysis. It is suitable because it comes directly from the WTA itself.

Data accessed at:

https://www.wtatennis.com/rankings/singles

https://www.wtatennis.com/rankings/doubles

Data Wrangling

After scraping the tables of WTA rankings for singles and doubles players, I needed to do some data wrangling so that the dataset would be easy to use. For example, I needed to filter out the words “Ranking History” and irrelevant numbers that showed up in some of the columns. I also needed to make sure that the rank and points columns were numeric and to remove an empty column. For my analysis, I also create a column called points per tournament (points divided by tournaments played) and a column categorizing age.
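The wrangling steps described above can be sketched in base R. This is only an illustration under stated assumptions: the rows, the `Empty` column name, and the placement of age 30 in the “Over 30” category are hypothetical, and the code actually used for this assignment may differ.

```r
# Hypothetical raw rows imitating a scraped rankings table (all values invented).
raw <- data.frame(
  Rank = c("1", "2", "3"),
  Player = c("Player A Ranking History", "Player B Ranking History",
             "Player C Ranking History"),
  Age = c("28", "31", "24"),
  Tournaments.Played = c("19", "23", "15"),
  Points = c("9500", "8556", "5400"),
  Empty = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

clean <- raw

# Remove columns that contain only missing values (the empty column).
clean <- clean[, colSums(!is.na(clean)) > 0, drop = FALSE]

# Strip the "Ranking History" residue from the player names.
clean$Player <- trimws(gsub("Ranking History", "", clean$Player, fixed = TRUE))

# Convert the columns that should be numbers from character to numeric.
num_cols <- c("Rank", "Age", "Tournaments.Played", "Points")
clean[num_cols] <- lapply(clean[num_cols], as.numeric)

# Derived columns: points per tournament and an age category.
# Assumption: players aged exactly 30 are placed in "Over 30".
clean$PPT <- clean$Points / clean$Tournaments.Played
clean$Age.Group <- ifelse(clean$Age >= 30, "Over 30", "Under 30")

print(clean)
```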

Analysis

  1. Create a table showing each player’s points per tournament.
# A tibble: 100 × 5
    Rank Player                 Age Tournaments.Played   PPT
   <dbl> <chr>                <dbl>              <dbl> <dbl>
 1    11 Veronika Kudermetova    29                  9  624.
 2     2 Taylor Townsend         30                 14  585 
 3     3 Elise Mertens           30                 14  581.
 4     1 Katerina Siniakova      29                 16  547.
 5     1 Aryna Sabalenka         28                 19  532.
 6     4 Jasmine Paolini         30                 15  492.
 7     4 Sara Errani             39                 15  492.
 8     6 Gabriela Dabrowski      34                 16  433.
 9     2 Elena Rybakina          26                 23  372 
10     3 Iga Swiatek             24                 19  366.
# ℹ 90 more rows

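By calculating Points Per Tournament (PPT), we can see that a lower-ranked player can have a higher scoring rate than the top-ranked players. Veronika Kudermetova, ranked 11th, has the highest PPT in the table (about 624 across 9 tournaments), ahead of both No. 1 players, Katerina Siniakova (about 547) and Aryna Sabalenka (about 532). Ranks repeat in the table because it combines the singles and doubles rankings.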
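The PPT calculation and ordering can be sketched as follows. The players and point totals here are hypothetical and were chosen only to show how a lower-ranked player with few tournaments can come out on top; they are not the scraped data.

```r
# Hypothetical players (invented values, not the WTA data).
players <- data.frame(
  Rank = c(1, 2, 11),
  Player = c("Player A", "Player B", "Player C"),
  Points = c(10000, 8400, 5400),
  Tournaments.Played = c(20, 24, 9)
)

# Points per tournament = total points / tournaments played.
players$PPT <- players$Points / players$Tournaments.Played

# Sort from highest to lowest PPT and keep the columns of interest.
ranked <- players[order(-players$PPT),
                  c("Rank", "Player", "Tournaments.Played", "PPT")]
print(ranked)
```

In this toy example the 11th-ranked player has a PPT of 600, ahead of the top-ranked player at 500, because the same division rewards a short schedule with deep runs.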

  2. Next, I create a scatter plot to show the relationship between tournaments played and points earned.

The scatter plot shows no strong positive or negative correlation. However, there are interesting outliers: players who are high on the y-axis (points) but low or average on the x-axis (tournaments played). For example, one player has over 10,000 points from only about 19 tournaments, which matches Aryna Sabalenka’s row in the table above (about 532 points per tournament across 19 events). This indicates high efficiency, most likely from winning or reaching the final of most events entered. Several players have 7,500+ points while playing fewer than 20 tournaments. These players appear to prioritize quality over quantity.

There is a heavy density of players between 20 and 32 tournaments played, with points mostly staying below 2,500. This group represents players who compete very frequently but likely exit in earlier rounds or play in lower-tier tournaments.
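A minimal sketch of how the scatter plot, the correlation, and the “high points, few tournaments” outliers could be computed in base R. The vectors below are invented for illustration and will not reproduce the correlation seen in the real data; the 7,500-point and 20-tournament cutoffs come from the discussion above.

```r
# Hypothetical values (invented, not the scraped rankings).
tournaments <- c(9, 14, 16, 19, 22, 25, 28, 31)
points      <- c(5600, 8100, 8700, 10100, 2400, 1900, 2100, 1700)

# Pearson correlation between tournaments played and points.
r <- cor(tournaments, points)
print(r)

# Scatter plot with tournaments on the x-axis and points on the y-axis.
plot(tournaments, points,
     xlab = "Tournaments played", ylab = "Points",
     main = "Points vs. tournaments played")

# Flag the high-efficiency players: 7,500+ points in fewer than 20 tournaments.
efficient <- which(points >= 7500 & tournaments < 20)
print(efficient)
```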

  3. To dive deeper into this, I create a boxplot to show the point distribution by number of tournaments played.

The boxplot shows that players with a “Strategic” workload (under 16 tournaments) maintain a noticeably higher median point total than those who play more frequently. While the group over 16 tournaments contains more players, their points are more heavily concentrated at the lower end of the scale, suggesting that quantity of play does not guarantee a higher rank. The several high-performing outliers in the over 16 tournaments group show that some players can sustain elite play across a heavy schedule, but in this data the higher point totals are associated with a more selective tournament calendar.
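The workload comparison can be sketched as follows. The data are hypothetical, and because the text does not say where a player with exactly 16 tournaments belongs, this sketch assumes they fall in the heavier-workload group.

```r
# Hypothetical values (invented, not the scraped rankings).
tournaments <- c(9, 14, 15, 19, 22, 25, 28, 31)
points      <- c(5600, 8100, 7400, 10100, 2400, 1900, 2100, 1700)

# Assumption: exactly 16 tournaments counts as the heavier workload.
workload <- ifelse(tournaments < 16, "Under 16", "16 or more")

# Median points and number of players in each workload group.
medians <- tapply(points, workload, median)
counts  <- table(workload)
print(medians)
print(counts)

# Boxplot of points by workload group; outliers are drawn as single points.
boxplot(points ~ workload,
        xlab = "Tournaments played", ylab = "Points",
        main = "Points by workload")
```

With these toy numbers the selective group has the higher median (7,400 versus 2,100), while the single 10,100-point player appears as an outlier in the heavier group, which is the same pattern described above.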

  4. Create a histogram of the distribution of age for the Top 50 WTA Singles and Doubles Players.

This histogram shows that the age distribution of the top 50 WTA singles and doubles players is slightly right-skewed, with a heavy concentration of athletes in their late 20s and early 30s. The highest frequency occurs around age 30. Players appear from the late teens onward, and the counts grow toward the early 30s. The count then drops off sharply after age 34, with only a few outliers competing into their early 40s.
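A minimal sketch of the histogram and a rough skew check, using invented ages. The two-year bin width is an assumption, and comparing the mean with the median is only an informal indication of right skew, not a formal test.

```r
# Hypothetical ages (invented, not the scraped rankings).
ages <- c(19, 22, 24, 24, 26, 28, 28, 29, 29, 30, 30, 30, 31, 32, 34, 39, 41)

# Histogram with two-year bins covering ages 16 to 44.
h <- hist(ages, breaks = seq(16, 44, by = 2),
          xlab = "Age", main = "Age distribution")

# Every player should fall into exactly one bin.
print(sum(h$counts) == length(ages))

# In a right-skewed distribution the mean is pulled above the median
# by the few older players in the tail.
print(c(mean = mean(ages), median = median(ages)))
```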

  5. Create a boxplot showing the point distribution by age group.

The “Over 30” group exhibits a higher median point total and a larger interquartile range. The higher median suggests that veteran players in the top 50 tend to hold higher point totals, while the larger interquartile range means those totals are more spread out rather than more consistent. In contrast, the “Under 30” group has a lower median but features several outliers reaching up to 10,000 points, representing the dominant top stars of the younger generation. Overall, while the younger group contains the absolute highest point earners, the veteran group shows a higher baseline of point accumulation across its middle 50% of players.
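The median and interquartile range behind this comparison can be computed per group as sketched below. The ages and points are hypothetical, and placing players aged exactly 30 in “Over 30” is an assumption, since the grouping rule is not stated in the text.

```r
# Hypothetical values (invented, not the scraped rankings).
age    <- c(22, 24, 26, 28, 29, 30, 31, 32, 34, 39)
points <- c(1500, 2600, 3000, 10100, 1800, 7300, 4100, 2900, 6900, 7400)

# Assumption: age 30 is counted in the "Over 30" group.
group <- ifelse(age >= 30, "Over 30", "Under 30")

# Median = typical point total; IQR = spread of the middle 50% of players.
med <- tapply(points, group, median)
iqr <- tapply(points, group, IQR)
print(rbind(median = med, IQR = iqr))

# Boxplot of points by age group.
boxplot(points ~ group,
        xlab = "Age group", ylab = "Points",
        main = "Points by age group")
```

In this toy data the “Over 30” group has both the higher median (6,900 versus 2,600) and the wider interquartile range (3,200 versus 1,200), while the single 10,100-point player sits in the “Under 30” group as an outlier.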