Final Project

WTA and ATP Tennis Analysis

Introduction & Overview

Hello everyone! My name is Grace and I love watching tennis and attending tournaments. For my final project for Programming in Analytics, I want to explore this area of interest. Tennis is a dynamic sport, and many factors can affect a player’s performance. I want to explore these factors.

I found a dataset that has tennis tournament information from 1978 to 2019. You can download the dataset at https://www.kaggle.com/datasets/taylorbrownlow/atpwta-tennis-data. The dataset provides information on both Women’s Tennis Association (WTA) and Association of Tennis Professionals (ATP) matches. It has 50 columns describing different aspects of 373,436 matches, for example, the type of tournament, surface type, tournament level, and winner name. This dataset will allow me to analyze different factors of tennis matches.

Data Dictionary

  • tourney_name: The name of the tournament

  • surface: court surface (e.g., ‘Clay’, ‘Hard’, ‘Grass’)

  • draw_size: total tournament draw size

  • tourney_level: the level of the tournament (e.g., ‘G’ = Grand Slam)

  • tourney_date: usually the Monday of the tournament week

  • match_num: a match-specific identifier

  • winner_id: player_id of the match winner

  • winner_seed: seed of match winner

  • winner_entry: ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry

  • winner_name: name of winner

  • winner_hand: dominant hand of winner (‘R’ = Right, ‘L’ = Left, ‘U’ = Unknown)

  • winner_ht: height in cm of winner

  • winner_ioc: winner’s 3-character country code

  • winner_age: winner age in years

  • loser_entry: ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry

  • loser_name: name of loser

  • loser_hand: dominant hand of loser (‘R’ = Right, ‘L’ = Left, ‘U’ = Unknown)

  • loser_ht: height in cm of loser

  • loser_ioc: loser’s 3-character country code

  • loser_age: loser’s age in years

  • loser_id: player_id of the match loser

  • loser_seed: seed of match loser

  • score: final score of match

  • best_of: maximum number of sets in the match (3 or 5)

  • round: round of tournament

  • minutes: match length in minutes

  • w_ace: winner’s ace count

  • w_df: winner’s double fault count

  • w_svpt: winner’s serve points

  • w_1stIn: winner’s first serves made

  • w_1stWon: winner’s first-serve points won

  • w_2ndWon: winner’s second-serve points won

  • w_SvGms: winner’s service games played

  • w_bpSaved: winner’s break points saved

  • w_bpFaced: winner’s break points faced

  • l_ace: loser’s ace count

  • l_df: loser’s double fault count

  • l_svpt: loser’s serve points

  • l_1stIn: loser’s first serves made

  • l_1stWon: loser’s first-serve points won

  • l_2ndWon: loser’s second-serve points won

  • l_SvGms: loser’s service games played

  • l_bpSaved: loser’s break points saved

  • l_bpFaced: loser’s break points faced

  • winner_rank: rank of the winner

  • winner_rank_points: the number of rank points the winner had

  • loser_rank: rank of the loser

  • loser_rank_points: the number of rank points the loser had

  • league: Women’s Tennis Association (women’s tennis) or Association of Tennis Professionals (men’s tennis)

Summary Statistics:

   winner_ht        winner_age       loser_age         minutes      
 Min.   :146.0    Min.   :-30.22   Min.   :-30.22   Min.   :   0    
 1st Qu.:177.0    1st Qu.: 21.46   1st Qu.: 21.40   1st Qu.:  74    
 Median :183.0    Median : 24.27   Median : 24.26   Median :  95    
 Mean   :181.8    Mean   : 24.66   Mean   : 24.67   Mean   : 102    
 3rd Qu.:188.0    3rd Qu.: 27.41   3rd Qu.: 27.44   3rd Qu.: 123    
 Max.   :208.0    Max.   : 65.04   Max.   : 65.04   Max.   :2475    
 NA's   :174546   NA's   :18856    NA's   :39602    NA's   :275448  
     w_ace              w_df             w_svpt          w_1stIn      
 Min.   :  0.000   Min.   :  0.000   Min.   :  0.00   Min.   :  0.00  
 1st Qu.:  2.000   1st Qu.:  1.000   1st Qu.: 55.00   1st Qu.: 33.00  
 Median :  4.000   Median :  2.000   Median : 70.00   Median : 43.00  
 Mean   :  5.507   Mean   :  2.989   Mean   : 75.48   Mean   : 46.32  
 3rd Qu.:  8.000   3rd Qu.:  4.000   3rd Qu.: 92.00   3rd Qu.: 56.00  
 Max.   :113.000   Max.   :236.000   Max.   :491.00   Max.   :361.00  
 NA's   :251147    NA's   :251004    NA's   :251145   NA's   :251145  
    w_1stWon         w_2ndWon         w_SvGms         w_bpSaved     
 Min.   :  0      Min.   : 0.00    Min.   : 0.00    Min.   : 0.000  
 1st Qu.: 25      1st Qu.:11.00    1st Qu.: 9.00    1st Qu.: 1.000  
 Median : 32      Median :15.00    Median :11.00    Median : 3.000  
 Mean   : 34      Mean   :15.66    Mean   :12.09    Mean   : 3.661  
 3rd Qu.: 40      3rd Qu.:19.00    3rd Qu.:15.00    3rd Qu.: 5.000  
 Max.   :292      Max.   :82.00    Max.   :90.00    Max.   :25.000  
 NA's   :251145   NA's   :251145   NA's   :271126   NA's   :251150  
   w_bpFaced          l_ace              l_df            l_svpt      
 Min.   : 0.000   Min.   :  0.000   Min.   : 0.000   Min.   :  0.00  
 1st Qu.: 2.000   1st Qu.:  1.000   1st Qu.: 2.000   1st Qu.: 58.00  
 Median : 5.000   Median :  3.000   Median : 3.000   Median : 73.00  
 Mean   : 5.548   Mean   :  3.994   Mean   : 3.634   Mean   : 77.88  
 3rd Qu.: 8.000   3rd Qu.:  6.000   3rd Qu.: 5.000   3rd Qu.: 93.00  
 Max.   :34.000   Max.   :103.000   Max.   :31.000   Max.   :489.00  
 NA's   :251150   NA's   :251152    NA's   :251184   NA's   :251146  
    l_1stIn          l_1stWon         l_2ndWon         l_SvGms      
 Min.   :  0.0    Min.   :  0.00   Min.   :  0.00   Min.   : 0.00   
 1st Qu.: 34.0    1st Qu.: 20.00   1st Qu.:  9.00   1st Qu.: 9.00   
 Median : 43.0    Median : 28.00   Median : 13.00   Median :11.00   
 Mean   : 46.6    Mean   : 29.83   Mean   : 13.85   Mean   :11.89   
 3rd Qu.: 57.0    3rd Qu.: 37.00   3rd Qu.: 18.00   3rd Qu.:14.00   
 Max.   :328.0    Max.   :284.00   Max.   :101.00   Max.   :91.00   
 NA's   :251145   NA's   :251145   NA's   :251145   NA's   :271126  
   l_bpSaved     
 Min.   :-6.000  
 1st Qu.: 2.000  
 Median : 4.000  
 Mean   : 4.842  
 3rd Qu.: 7.000  
 Max.   :28.000  
 NA's   :251147  

Descriptive Analysis

1.

The first question I would like to analyze is: does the average match length in minutes increase with each round of a Grand Slam tournament, for both the WTA and the ATP?

Note: the plotting step dropped 25,530 rows and 12,143 rows containing non-finite (for example, missing) values.

For the ATP, the average match length does increase with each round of a Grand Slam tournament. In the early rounds (R128 or R64), top-seeded players often face qualifiers or lower-ranked players, which tends to produce quicker matches. In the later rounds, however, players usually face similarly ranked opponents, leading to tighter matches and potentially more deuces, tiebreaks, and longer rallies.

For the WTA, the average number of minutes increases from the round of 128 to the round of 16, drops for the quarterfinals and semifinals, then rises to its highest value in the final.
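
The comparison above can be sketched in base R as follows. This is only an illustration on made-up numbers, not the code behind the charts: the column names come from the data dictionary, while the toy values and the round codes R32, R16, and QF are assumptions.

```r
# Hypothetical toy data; column names follow the data dictionary above.
matches <- data.frame(
  league        = c("ATP", "ATP", "ATP", "ATP", "WTA", "WTA", "WTA", "WTA"),
  tourney_level = c("G", "G", "G", "A", "G", "G", "G", "G"),
  round         = c("R128", "R128", "F", "F", "R128", "F", "F", "R128"),
  minutes       = c(100, 120, 200, 90, 80, 110, NA, 70)
)

# Keep Grand Slam matches only and drop rows with a missing match length.
slams <- subset(matches, tourney_level == "G" & !is.na(minutes))

# Order rounds from earliest to latest (R32, R16, QF are assumed codes).
round_order <- c("R128", "R64", "R32", "R16", "QF", "SF", "F")
slams$round <- factor(slams$round, levels = round_order)

# Mean match length for each league and round.
avg_minutes <- aggregate(minutes ~ league + round, data = slams, FUN = mean)
print(avg_minutes)
```

Filtering out missing values explicitly, rather than letting the plotting layer drop them, makes it clear how many matches each average is based on.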

2.

The second question I would like to analyze is: how variable are double faults (failing both the first and second serve) for winners and losers across tournament rounds?

Note: for each boxplot, the plotting step dropped 76,670 rows containing non-finite (for example, missing) values.

The median number of double faults for winners is relatively consistent across rounds, with the earliest round (R128) having a slightly higher median. The median number of double faults for losers starts out higher in the early rounds, then stays consistent in the later rounds. By the time players reach the semifinals (SF) or finals (F), both winners and losers show very similar, low median double fault counts, suggesting that players with weaker service technique are eliminated in the early rounds.

There are many outliers in the winners’ chart. Some winners have 20 or more double faults and still win, likely because their first serve is dominant enough to compensate for the errors.
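
The numbers a boxplot summarizes can also be computed directly. The sketch below uses made-up values and a hypothetical helper, `spread_by_round()`, to report the median and interquartile range (IQR) of double faults per round; it illustrates the idea and is not the code that produced the charts.

```r
# Hypothetical toy data; w_df and l_df follow the data dictionary above.
matches <- data.frame(
  round = c("R128", "R128", "R128", "R128", "F", "F", "F", "F"),
  w_df  = c(1, 2, 3, 4, 2, 2, 3, 3),
  l_df  = c(2, 4, 6, 8, 2, 3, 3, 4)
)

# Hypothetical helper: median and IQR of `values` within each round.
spread_by_round <- function(values, rounds) {
  med <- tapply(values, rounds, median, na.rm = TRUE)
  iqr <- tapply(values, rounds, IQR, na.rm = TRUE)
  data.frame(round = names(med), median = as.vector(med), iqr = as.vector(iqr))
}

winner_spread <- spread_by_round(matches$w_df, matches$round)
loser_spread  <- spread_by_round(matches$l_df, matches$round)
print(winner_spread)
print(loser_spread)
```

A larger IQR in a round means double fault counts vary more from match to match in that round.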

3.

The third question I want to analyze is: does surface type (Carpet, Clay, Grass, or Hard) have an effect on the average number of aces for winners, for both the ATP and WTA?

Note: the plotting step dropped 238,102 rows containing non-finite (for example, missing) values.

The surface type does seem to have an effect on the number of aces for winners. Grass has the highest average number of aces, while clay has the lowest.
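
As a rough sketch of how this comparison can be computed, the example below averages winner aces by surface and league in base R. The values are invented for illustration and do not reproduce the results above; only the column names come from the data dictionary.

```r
# Hypothetical toy data; one row has a missing ace count.
matches <- data.frame(
  league  = c("ATP", "ATP", "ATP", "ATP", "WTA", "WTA", "WTA", "WTA"),
  surface = c("Grass", "Grass", "Clay", "Clay", "Grass", "Clay", "Clay", "Grass"),
  w_ace   = c(12, 10, 4, 6, 5, 1, NA, 3)
)

# The formula interface of aggregate() drops rows with NA by default.
avg_aces <- aggregate(w_ace ~ surface + league, data = matches, FUN = mean)

# Sort within each league from the most to the fewest average aces.
avg_aces <- avg_aces[order(avg_aces$league, -avg_aces$w_ace), ]
print(avg_aces)
```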

4.

The fourth question I would like to analyze is: does player height have an effect on ace percentage?

Note: the plotting step dropped 36, 5, and 31 rows containing non-finite (for example, missing) values from the three charts, respectively.

Across all three charts, there is a consistent positive correlation, suggesting that as player height increases, the efficiency of the serve (ace percentage) also tends to rise. However, a high ace percentage isn’t guaranteed, as serving is a skill and not just a product of height. The ATP regression line appears to have a slightly steeper slope than the WTA line.
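
To make the slope comparison concrete, here is a minimal sketch that fits a separate linear regression per league. It assumes ace percentage is defined as aces divided by serve points, times 100, which may differ from the definition used for the charts, and the player-level values are invented.

```r
# Hypothetical player-level toy data (one row per player).
players <- data.frame(
  league    = c("ATP", "ATP", "ATP", "ATP", "WTA", "WTA", "WTA", "WTA"),
  winner_ht = c(175, 185, 195, 205, 165, 170, 175, 180),
  w_ace     = c(20, 40, 60, 80, 10, 15, 20, 25),
  w_svpt    = c(1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000)
)

# Assumed definition: aces as a percentage of serve points.
# Players with zero serve points get NA instead of a division by zero.
players$ace_pct <- ifelse(players$w_svpt > 0,
                          100 * players$w_ace / players$w_svpt,
                          NA_real_)

# Fit ace_pct ~ height separately for each league and keep the slope.
slope_by_league <- sapply(split(players, players$league), function(d) {
  fit <- lm(ace_pct ~ winner_ht, data = d)
  unname(coef(fit)["winner_ht"])
})
print(slope_by_league)
```

Each slope reads as the change in ace percentage per additional centimeter of height. A positive slope describes an association in the data; on its own it does not show that height causes more aces.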

Secondary Data Source

To supplement my dataset, I scraped WTA rankings data to analyze rankings, points, and age. I used the chromote R package to scrape the singles and doubles player rankings as of May 8th, 2026; this is the data I use for my analysis. It is a suitable source because it comes directly from the WTA itself.

Data accessed at:

https://www.wtatennis.com/rankings/singles

https://www.wtatennis.com/rankings/doubles

I first created a scatter plot to show the relationship between tournaments played and points earned.

There is not a strong positive or negative correlation. However, there are interesting outliers: players who are high on the y-axis and low or average on the x-axis. For example, there is one player with over 10,000 points who has played only about 19 tournaments. This indicates high efficiency, most likely from winning or reaching the finals of nearly every event they enter. Several players have 7,500+ points while playing fewer than 20 tournaments. These are players who prioritize quality over quantity.
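
The relationship described above can be quantified with a correlation coefficient and a simple points-per-tournament ratio. The sketch below uses invented players and hypothetical column names (`tournaments`, `points`), so its numbers do not reproduce the scraped rankings.

```r
# Hypothetical toy rankings; names and values are made up.
rankings <- data.frame(
  player      = c("A", "B", "C", "D", "E"),
  tournaments = c(19, 18, 25, 30, 28),
  points      = c(10000, 7600, 2000, 1500, 2400)
)

# Pearson correlation between tournaments played and points earned.
r <- cor(rankings$tournaments, rankings$points)

# A simple efficiency measure: points earned per tournament played.
rankings$points_per_tournament <- rankings$points / rankings$tournaments
most_efficient <- rankings$player[which.max(rankings$points_per_tournament)]

print(r)
print(most_efficient)
```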

There is a heavy density of points between 20 and 32 tournaments played, with point totals mostly staying below 2,500. This group represents players who compete very frequently but likely exit in earlier rounds or play in lower-tier tournaments.

To dive deeper into this, I then created a boxplot to show the point distribution by number of tournaments played.

The visualization shows that players with a “Strategic” workload (under 16 tournaments) actually maintain a noticeably higher median point total than those playing more frequently. While the group over 16 tournaments contains more players, their points are more heavily concentrated at the lower end of the scale, suggesting that quantity of play does not guarantee a higher rank. The presence of several high-performing outliers in the over 16 tournaments group shows that while some can sustain elite play across a heavy schedule, the most efficient path to the top appears to be a more selective tournament calendar.
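
A minimal sketch of the grouping behind this comparison is shown below. The data and column names are hypothetical, and placing players with exactly 16 tournaments in the heavier-workload group is an assumption, since the text does not say where the cutoff falls.

```r
# Hypothetical toy rankings; values are made up.
rankings <- data.frame(
  tournaments = c(12, 14, 15, 16, 20, 25, 30),
  points      = c(6000, 4000, 5000, 1000, 3000, 800, 1200)
)

# Assumed cutoff: fewer than 16 tournaments counts as the lighter workload.
rankings$workload <- ifelse(rankings$tournaments < 16, "Under 16", "16 or more")

# Median points and number of players in each workload group.
median_points <- tapply(rankings$points, rankings$workload, median)
group_sizes   <- table(rankings$workload)

print(median_points)
print(group_sizes)
```

Reporting group sizes next to the medians helps show when one group is much smaller than the other, which affects how much weight the comparison can bear.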

Next, I created a histogram of the distribution of age for the Top 50 WTA Singles and Doubles Players.

This histogram shows that the age distribution of the top 50 WTA players is slightly right-skewed, with a large concentration of athletes in their late 20s and early 30s. The highest frequency occurs around age 30. Younger players appear starting in their late teens, and the count grows toward age 30. The count drops off sharply after age 34, with only a few outliers competing into their early 40s.
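
The histogram reading above can be checked numerically. The sketch below bins a small, invented set of ages into five-year bins (the bin width is an assumption) and compares the mean with the median, which is one rough indicator of right skew.

```r
# Hypothetical ages; not the scraped WTA data.
ages <- c(19, 24, 26, 27, 28, 28, 29, 29, 30, 33, 38, 42)

# Count ages in five-year bins without drawing the plot.
# Bins are right-closed: (15,20], (20,25], ..., (40,45].
bins <- hist(ages, breaks = seq(15, 45, by = 5), plot = FALSE)

# Locate the fullest bin and report its age range.
modal_bin   <- which.max(bins$counts)
modal_range <- c(bins$breaks[modal_bin], bins$breaks[modal_bin + 1])

# A mean above the median hints at a longer right tail.
right_skew_hint <- mean(ages) > median(ages)

print(bins$counts)
print(modal_range)
print(right_skew_hint)
```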

Finally, I created a boxplot showing the point distribution by age group.

The “Over 30” group exhibits a higher median point total and a larger interquartile range, suggesting that veteran players in the top 50 tend to hold higher, though more widely spread, point totals. In contrast, the “Under 30” group has a lower median but features several outliers reaching up to 10,000 points, representing the dominant top stars of the younger generation. Overall, while the younger group contains the absolute highest point earners, the veteran group shows a higher baseline of point accumulation across its middle 50% of players.

Conclusion

Overall, physical height appears to be an important predictor of serving effectiveness, while external factors like court surface and tournament round play a notable role in shaping the tactical nature and length of professional matches.

The rankings data suggests that ranking is as much a result of scheduling strategy as it is of raw performance. Veterans hold strong positions on the leaderboard through experience and selective participation. Success in the WTA is not just about playing more often, but about making every appearance count.