  winner_ht        winner_age       loser_age        minutes
 Min.   :146.0    Min.   :-30.22   Min.   :-30.22   Min.   :   0
 1st Qu.:177.0    1st Qu.: 21.46   1st Qu.: 21.40   1st Qu.:  74
 Median :183.0    Median : 24.27   Median : 24.26   Median :  95
 Mean   :181.8    Mean   : 24.66   Mean   : 24.67   Mean   : 102
 3rd Qu.:188.0    3rd Qu.: 27.41   3rd Qu.: 27.44   3rd Qu.: 123
 Max.   :208.0    Max.   : 65.04   Max.   : 65.04   Max.   :2475
 NA's   :174546   NA's   :18856    NA's   :39602    NA's   :275448

  w_ace            w_df             w_svpt           w_1stIn
 Min.   :  0.000  Min.   :  0.000  Min.   :  0.00   Min.   :  0.00
 1st Qu.:  2.000  1st Qu.:  1.000  1st Qu.: 55.00   1st Qu.: 33.00
 Median :  4.000  Median :  2.000  Median : 70.00   Median : 43.00
 Mean   :  5.507  Mean   :  2.989  Mean   : 75.48   Mean   : 46.32
 3rd Qu.:  8.000  3rd Qu.:  4.000  3rd Qu.: 92.00   3rd Qu.: 56.00
 Max.   :113.000  Max.   :236.000  Max.   :491.00   Max.   :361.00
 NA's   :251147   NA's   :251004   NA's   :251145   NA's   :251145

  w_1stWon         w_2ndWon         w_SvGms          w_bpSaved
 Min.   :  0      Min.   : 0.00    Min.   : 0.00    Min.   : 0.000
 1st Qu.: 25      1st Qu.:11.00    1st Qu.: 9.00    1st Qu.: 1.000
 Median : 32      Median :15.00    Median :11.00    Median : 3.000
 Mean   : 34      Mean   :15.66    Mean   :12.09    Mean   : 3.661
 3rd Qu.: 40      3rd Qu.:19.00    3rd Qu.:15.00    3rd Qu.: 5.000
 Max.   :292      Max.   :82.00    Max.   :90.00    Max.   :25.000
 NA's   :251145   NA's   :251145   NA's   :271126   NA's   :251150

  w_bpFaced        l_ace            l_df             l_svpt
 Min.   : 0.000   Min.   :  0.000  Min.   : 0.000   Min.   :  0.00
 1st Qu.: 2.000   1st Qu.:  1.000  1st Qu.: 2.000   1st Qu.: 58.00
 Median : 5.000   Median :  3.000  Median : 3.000   Median : 73.00
 Mean   : 5.548   Mean   :  3.994  Mean   : 3.634   Mean   : 77.88
 3rd Qu.: 8.000   3rd Qu.:  6.000  3rd Qu.: 5.000   3rd Qu.: 93.00
 Max.   :34.000   Max.   :103.000  Max.   :31.000   Max.   :489.00
 NA's   :251150   NA's   :251152   NA's   :251184   NA's   :251146

  l_1stIn          l_1stWon         l_2ndWon         l_SvGms
 Min.   :  0.0    Min.   :  0.00   Min.   :  0.00   Min.   : 0.00
 1st Qu.: 34.0    1st Qu.: 20.00   1st Qu.:  9.00   1st Qu.: 9.00
 Median : 43.0    Median : 28.00   Median : 13.00   Median :11.00
 Mean   : 46.6    Mean   : 29.83   Mean   : 13.85   Mean   :11.89
 3rd Qu.: 57.0    3rd Qu.: 37.00   3rd Qu.: 18.00   3rd Qu.:14.00
 Max.   :328.0    Max.   :284.00   Max.   :101.00   Max.   :91.00
 NA's   :251145   NA's   :251145   NA's   :251145   NA's   :271126

  l_bpSaved
 Min.   :-6.000
 1st Qu.: 2.000
 Median : 4.000
 Mean   : 4.842
 3rd Qu.: 7.000
 Max.   :28.000
 NA's   :251147
Final Project
WTA and ATP Tennis Analysis
Introduction & Overview
Hello everyone! My name is Grace, and I love watching tennis and attending tournaments. For my final project for Programming in Analytics, I want to explore this area of interest. Tennis is a dynamic sport, and many factors can affect a player’s performance. I want to explore these factors.
I found a dataset with tennis match information from 1978 to 2019. You can download it at https://www.kaggle.com/datasets/taylorbrownlow/atpwta-tennis-data. It covers matches from both the Women’s Tennis Association (WTA) and the Association of Tennis Professionals (ATP). The dataset has 50 columns describing different aspects of 373,436 matches, for example the type of tournament, surface type, tournament level, and winner name. This will allow me to analyze different factors of tennis matches.
Data Dictionary
tourney_name: The name of the tournament
surface: court surface (e.g., ‘Clay’, ‘Hard’, ‘Grass’)
draw_size: total tournament draw size
tourney_level: the level of the tournament (e.g., ‘G’ = Grand Slam)
tourney_date: usually the Monday of the tournament week
match_num: a match-specific identifier
winner_id: player_id of the match winner
winner_seed: seed of match winner
winner_entry: ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry
winner_name: name of winner
winner_hand: dominant hand of winner (‘R’ = Right, ‘L’ = Left, ‘U’ = Unknown)
winner_ht: height in cm of winner
winner_ioc: winner’s 3-character country code
winner_age: winner age in years
loser_entry: ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry
loser_name: name of loser
loser_hand: dominant hand of loser (‘R’ = Right, ‘L’ = Left, ‘U’ = Unknown)
loser_ht: height in cm of loser
loser_ioc: loser’s 3-character country code
loser_age: loser’s age in years
loser_id: player_id of the match loser
loser_seed: seed of match loser
score: final score of match
best_of: maximum number of sets in the match (best of 3 or 5)
round: round of tournament
minutes: match length in minutes
w_ace: winner’s ace count
w_df: winner’s double fault count
w_svpt: winner’s serve points
w_1stIn: winner’s first serves made
w_1stWon: winner’s first-serve points won
w_2ndWon: winner’s second-serve points won
w_SvGms: winner’s service games played
w_bpSaved: winner’s break points saved
w_bpFaced: winner’s break points faced
l_ace: loser’s ace count
l_df: loser’s double fault count
l_svpt: loser’s serve points
l_1stIn: loser’s first serves made
l_1stWon: loser’s first-serve points won
l_2ndWon: loser’s second-serve points won
l_SvGms: loser’s service games played
l_bpSaved: loser’s break points saved
l_bpFaced: loser’s break points faced
winner_rank: rank of the winner
winner_rank_points: the number of rank points the winner had
loser_rank: rank of the loser
loser_rank_points: the number of ranking points the loser had
league: Women’s Tennis Association (women’s tennis) or Association of Tennis Professionals (men’s tennis)
Summary Statistics:
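Summary statistics for the numeric columns can be produced with base R's `summary()`. The sketch below is a minimal illustration on a tiny, hypothetical data frame (`matches_toy` and its values are invented, only the column names come from the data dictionary). It also counts impossible values, such as the negative minimums reported for `winner_age` and `l_bpSaved`. Recoding them as missing is one possible treatment and an assumption here, not necessarily what this report did.

```r
# Hypothetical toy data; values are invented for illustration only.
matches_toy <- data.frame(
  winner_age = c(24.3, 27.1, -30.2, NA, 21.5),
  minutes    = c(95, 123, 74, NA, 2475),
  l_bpSaved  = c(4, 7, -6, 2, NA)
)

# summary() reports min, quartiles, mean, max, and the NA count per column.
print(summary(matches_toy))

# Count values that cannot be valid (negative ages or negative counts).
n_bad_age <- sum(matches_toy$winner_age < 0, na.rm = TRUE)
n_bad_bp  <- sum(matches_toy$l_bpSaved < 0, na.rm = TRUE)

# Assumed treatment: recode impossible values as missing instead of
# dropping whole rows, so the other columns of those matches stay usable.
cleaned <- matches_toy
cleaned$winner_age[which(cleaned$winner_age < 0)] <- NA
cleaned$l_bpSaved[which(cleaned$l_bpSaved < 0)] <- NA
```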
Descriptive analysis
1.
The first question I would like to analyze is whether the average match length in minutes increases with each round of a Grand Slam tournament, for both the WTA and the ATP.
Warning: Removed 25530 rows containing non-finite outside the scale range
(`stat_summary()`).
Warning: Removed 12143 rows containing non-finite outside the scale range
(`stat_summary()`).
For the ATP, the average match length does increase with each round of a Grand Slam tournament. In the early rounds (R128 or R64), top-seeded players often face qualifiers or lower-ranked players, which tends to produce quicker matches. In the later rounds, however, players usually face similarly ranked opponents, leading to tighter matches and potentially more deuces, tiebreaks, and longer rallies.
For the WTA, the average number of minutes increases from the round of 128 to the round of 16, drops for the quarterfinals and semifinals, then rises to its highest value in the final.
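The averages behind a chart like this can be computed directly. The following is a minimal base R sketch on invented data (`matches_toy` is hypothetical, and `"A"` simply stands in for any non-Grand-Slam level). It filters to Grand Slams, orders the rounds, and takes the mean of `minutes` per league and round, dropping rows where `minutes` is missing.

```r
# Hypothetical toy data; values are invented for illustration only.
matches_toy <- data.frame(
  league        = c("ATP", "ATP", "ATP", "ATP", "WTA", "WTA", "WTA", "WTA"),
  tourney_level = c("G", "G", "G", "A", "G", "G", "G", "G"),
  round         = c("R128", "R128", "F", "F", "R128", "F", "F", "R128"),
  minutes       = c(110, 130, 190, 60, 80, NA, 120, 90)
)

# Keep Grand Slam matches only ('G' per the data dictionary).
slams <- subset(matches_toy, tourney_level == "G")

# Order rounds from first round to final so results read chronologically.
slams$round <- factor(
  slams$round,
  levels = c("R128", "R64", "R32", "R16", "QF", "SF", "F")
)

# Mean minutes per league and round. The formula interface omits rows
# with missing minutes by default (na.action = na.omit).
avg_minutes <- aggregate(minutes ~ league + round, data = slams, FUN = mean)
print(avg_minutes)
```

Rows with a missing `minutes` value are excluded from the means, which corresponds to the "Removed ... rows" warnings shown above.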
2.
The second question I would like to analyze is how much the number of double faults (missing both the first and second serve) varies for winners and losers across tournament rounds.
Warning: Removed 76670 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 76670 rows containing non-finite outside the scale range
(`stat_boxplot()`).
The median number of double faults for winners is relatively consistent across rounds, with the earliest round (R128) having a slightly higher median. The median number of double faults for losers starts out higher in the early rounds, then stays consistent in the later rounds. By the time players reach the semifinals (SF) or final (F), both winners and losers show very similar, low median double fault counts, suggesting that players with weaker service technique are eliminated in the early rounds.
There are many outliers in the winners’ chart. Some winners hit 20 or more double faults and still win, possibly because their first serve is dominant enough to compensate for the errors.
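The quantities a boxplot displays (median, spread, and outliers) can also be computed numerically. The sketch below is a minimal base R illustration on invented values (`matches_toy` is hypothetical). It uses `tapply()` for per-round medians and interquartile ranges, and `boxplot.stats()` to list the points that fall beyond 1.5 times the interquartile range from the hinges, which is the default rule for flagging outliers.

```r
# Hypothetical toy data; values are invented for illustration only.
matches_toy <- data.frame(
  round = c(rep("R128", 6), rep("F", 4)),
  w_df  = c(2, 2, 3, 3, 4, 20, 1, 2, 2, 3),
  l_df  = c(3, 4, 5, 5, 6, 7, 2, 3, 3, 4)
)
matches_toy$round <- factor(matches_toy$round, levels = c("R128", "F"))

# Median and interquartile range of double faults per round.
w_median <- tapply(matches_toy$w_df, matches_toy$round, median, na.rm = TRUE)
w_iqr    <- tapply(matches_toy$w_df, matches_toy$round, IQR, na.rm = TRUE)
l_median <- tapply(matches_toy$l_df, matches_toy$round, median, na.rm = TRUE)

# Winner double-fault counts in R128 that a boxplot would draw as outliers.
r128_w_df     <- matches_toy$w_df[matches_toy$round == "R128"]
r128_outliers <- boxplot.stats(r128_w_df)$out

print(w_median)
print(l_median)
print(r128_outliers)
```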
3.
The third question I want to analyze is whether surface type (Carpet, Clay, Grass, or Hard) has an effect on the average number of aces for winners, for both the ATP and the WTA.
Warning: Removed 238102 rows containing non-finite outside the scale range
(`stat_summary()`).
The surface type does seem to have an effect on the number of aces for winners. Grass has the highest average number of aces, while clay has the lowest.
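A two-way table of means is one way to make a comparison like this concrete. The following minimal base R sketch uses invented numbers (`matches_toy` is hypothetical, and its values were chosen only to show the mechanics, not to reproduce the report's results).

```r
# Hypothetical toy data; values are invented for illustration only.
matches_toy <- data.frame(
  league  = c("ATP", "ATP", "ATP", "ATP", "WTA", "WTA", "WTA", "WTA"),
  surface = c("Grass", "Grass", "Clay", "Clay", "Grass", "Clay", "Clay", "Grass"),
  w_ace   = c(12, 10, 4, NA, 5, 1, 3, 3)
)

# Mean winner aces for every surface/league combination.
# Rows are surfaces, columns are leagues; missing values are ignored.
ace_by_surface <- tapply(
  matches_toy$w_ace,
  list(surface = matches_toy$surface, league = matches_toy$league),
  mean,
  na.rm = TRUE
)
print(ace_by_surface)
```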
4.
The fourth question I would like to analyze is whether player height has an effect on ace percentage.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 36 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 36 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 31 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 31 rows containing missing values or values outside the scale range
(`geom_point()`).
Across all three charts, there is a consistent positive correlation, suggesting that ace percentage, a measure of serve efficiency, tends to rise roughly linearly with player height. However, height alone does not guarantee a high ace percentage, since serving is also a skill. The ATP regression line appears to have a slightly steeper slope than the WTA line.
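The trend line and correlation can be checked numerically with a simple linear model. The sketch below is a minimal base R illustration on invented data (`players_toy` is hypothetical). It assumes ace percentage is defined as aces divided by serve points, times 100, which may differ from the definition used for the charts.

```r
# Hypothetical toy data; values are invented for illustration only.
players_toy <- data.frame(
  height_cm = c(170, 175, 180, 185, 190, 195),
  aces      = c(20, 30, 45, 50, 70, 80),
  svpt      = c(1000, 1000, 1000, 1000, 1000, 1000)
)

# Assumed definition: share of serve points that were aces, in percent.
players_toy$ace_pct <- 100 * players_toy$aces / players_toy$svpt

# Straight-line fit of ace percentage on height (y ~ x).
fit   <- lm(ace_pct ~ height_cm, data = players_toy)
slope <- coef(fit)[["height_cm"]]  # percentage points per extra cm

# Pearson correlation between height and ace percentage.
r <- cor(players_toy$height_cm, players_toy$ace_pct)

print(slope)
print(r)
```

Fitting the same model separately for each league and comparing the two `slope` values is one way to test whether the ATP line really is steeper. The correlation coefficient measures the strength of a linear association and does not by itself show that height causes a higher ace rate.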
Secondary Data Source
To supplement my dataset, I scraped WTA rankings data to analyze rankings, points, and age. I used RStudio’s chromote package to scrape the singles and doubles player rankings as of May 8th 2026; this is the data I use for my analysis. It is a suitable source because it comes directly from the WTA itself.
Data accessed at:
https://www.wtatennis.com/rankings/singles
https://www.wtatennis.com/rankings/doubles
I first created a scatter plot to show the relationship between tournaments played and points earned.
There is no strong positive or negative correlation. However, there are interesting outliers: players who are high on the y-axis (points) and low or average on the x-axis (tournaments played). For example, one player with over 10,000 points has played only about 19 tournaments. This indicates high efficiency, most likely from winning or reaching the final of nearly every event entered. Several players have 7,500+ points while playing fewer than 20 tournaments. These are players who prioritize quality over quantity.
There is a dense cluster of players between 20 and 32 tournaments played, with points mostly staying below 2,500. This group represents players who compete very frequently but likely exit in earlier rounds or play in lower-tier tournaments.
To dive deeper into this, I then created a boxplot to show the point distribution by number of tournaments played.
The visualization shows that players with a “Strategic” workload (under 16 tournaments) actually maintain a significantly higher median point total than those playing more frequently. While the group over 16 tournaments contains more players, their points are more heavily concentrated at the lower end of the scale, suggesting that quantity of play does not guarantee a higher rank. The presence of several high-performing outliers in the over 16 tournaments group shows that while some can sustain elite play across a heavy schedule, the most efficient path to the top appears to be a more selective tournament calendar.
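The strength of the relationship and the idea of "efficiency" can both be expressed as numbers. The sketch below is a minimal base R illustration on invented rankings (`rankings_toy`, its player labels, and the `points_per_event` column are all hypothetical). Points per tournament is an assumed efficiency measure, not one defined by the WTA.

```r
# Hypothetical toy data; values are invented for illustration only.
rankings_toy <- data.frame(
  player      = c("A", "B", "C", "D", "E"),
  tournaments = c(19, 18, 25, 30, 28),
  points      = c(10500, 7600, 2000, 1500, 2400)
)

# Pearson correlation between tournaments played and points earned.
r <- cor(rankings_toy$tournaments, rankings_toy$points)

# Assumed efficiency measure: ranking points earned per tournament played.
rankings_toy$points_per_event <- rankings_toy$points / rankings_toy$tournaments
most_efficient <- rankings_toy$player[which.max(rankings_toy$points_per_event)]

print(r)
print(rankings_toy[order(-rankings_toy$points_per_event), ])
```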
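The grouped comparison can be reproduced in a few lines. The following minimal base R sketch uses invented data (`rankings_toy` is hypothetical). It assumes that a player with exactly 16 tournaments belongs to the heavier-workload group, since the text does not say where that boundary falls.

```r
# Hypothetical toy data; values are invented for illustration only.
rankings_toy <- data.frame(
  tournaments = c(12, 14, 15, 18, 22, 25, 30),
  points      = c(5000, 3000, 4000, 1200, 9000, 900, 1500)
)

# Assumed boundary: fewer than 16 tournaments counts as the lighter workload.
rankings_toy$workload <- ifelse(
  rankings_toy$tournaments < 16, "Under 16", "16 or more"
)

# Median points and number of players in each workload group.
median_points <- tapply(rankings_toy$points, rankings_toy$workload, median)
group_sizes   <- table(rankings_toy$workload)

print(median_points)
print(group_sizes)
```

Comparing medians rather than means keeps a single very high point total, such as the 9000 in the toy data, from dominating the heavier-workload group.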
Next, I created a histogram of the distribution of age for the Top 50 WTA Singles and Doubles Players.
This histogram shows that the age distribution of the top 50 WTA players is slightly right-skewed, with a significant concentration of athletes in their late 20s and early 30s. The highest frequency occurs around age 30. Younger players appear starting in the late teens, and the counts grow toward age 30. The count drops off sharply after age 34, with only a few outliers competing into their early 40s.
Finally, I created a boxplot showing the point distribution by age group.
The “Over 30” group exhibits a higher median point total and a larger interquartile range, suggesting that veteran players in the top 50 tend to hold higher point totals, though with more spread across the middle of the group. In contrast, the “Under 30” group has a lower median but features several outliers reaching up to 10,000 points, representing the dominant top stars of the younger generation. Overall, while the younger group contains the absolute highest point earners, the veteran group shows a higher baseline of point accumulation across its middle 50% of players.
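Median and interquartile range per age group can be computed side by side to back up a reading of the boxplot. The sketch below is a minimal base R illustration on invented data (`rankings_toy` is hypothetical, and its values were not chosen to match the report's results). It assumes a player aged exactly 30 falls in the older group.

```r
# Hypothetical toy data; values are invented for illustration only.
rankings_toy <- data.frame(
  age    = c(22, 25, 28, 31, 33, 35),
  points = c(10000, 2000, 3000, 4000, 5000, 6000)
)

# Assumed boundary: age 30 and above is the older group.
rankings_toy$age_group <- ifelse(rankings_toy$age < 30, "Under 30", "30 and over")

# One column per age group, with the median and interquartile range as rows.
group_stats <- sapply(
  split(rankings_toy$points, rankings_toy$age_group),
  function(x) c(median = median(x), iqr = IQR(x))
)
print(group_stats)
```

A larger `iqr` means the middle half of the group is more spread out, so it should be read as more variation rather than more consistency.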
Conclusion
Overall, player height is positively associated with serve effectiveness as measured by ace percentage, while external factors like court surface and tournament round also play a role in the tactical nature and length of professional matches.
The rankings data suggests that ranking is as much a result of scheduling strategy as it is of raw performance. Veterans hold strong positions on the leaderboard through experience and selective participation. Success in the WTA is not just about playing more often, but about making every appearance count.