Introduction

Tennis is a sport played between two players (singles) or two teams of two players (doubles). All players on the court use a tennis racket to hit a ball back and forth. One player serves the ball at the beginning of each point, and a player wins the point by hitting the ball in bounds on the opponent’s side without the opponent returning it before it bounces twice. The first player to 4 points wins a game and adds a point to their game total. The first player to 6 games wins the set. Both games and sets must be won by a margin of two. At Grand Slam tournaments, men’s matches are best of five sets and women’s matches are best of three. Tennis is a massively popular sport with a number of international tournaments, and it is also featured in the Olympics.

Our purpose in analyzing data from these matches was to highlight certain tendencies of top players in tournament play. Examples include the proportion of Serena’s points that come from short rallies (four shots or fewer) and a comparison of the proportion of aces among the four observed players.

Our data contain 19 variables, nine of which are previewed above. Each observation in our data set represents one individual point in a match. The P1 and P2 Game Score variables track the score of the current game between Players 1 and 2, and the Service and Point Winner variables record who served each point and who won it. The Winning Shot Type variable records the type of shot last used by the winner of the point. The Rally Length variable records how long each point was, that is, how many times the ball was hit in play by either player during the point.
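To make the one-row-per-point layout concrete, here is a minimal Python sketch of how a single observation might be represented and how a short-rally proportion could be computed from it. The field names, the `short_rally_share` helper, and the sample rows are all assumptions made for illustration. They are not the actual column names or values from our spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """One observation: a single point in a match (hypothetical field names)."""
    server: str
    point_winner: str
    winning_shot_type: str  # e.g. "Ace", "Forehand", "Backhand"
    rally_length: int       # number of times the ball was hit in play


def short_rally_share(points, player, max_length=4):
    """Share of the points won by `player` whose rally length is <= max_length."""
    won = [p for p in points if p.point_winner == player]
    if not won:
        return None  # the player won no points in this sample
    short = sum(1 for p in won if p.rally_length <= max_length)
    return short / len(won)


# Made-up rows for illustration only, not taken from the recorded matches.
sample_points = [
    Point("Serena Williams", "Serena Williams", "Ace", 0),
    Point("Serena Williams", "Venus Williams", "Forehand", 7),
    Point("Venus Williams", "Serena Williams", "Backhand", 3),
    Point("Venus Williams", "Serena Williams", "Forehand", 9),
]

serena_share = short_rally_share(sample_points, "Serena Williams")
print(serena_share)
```

Returning `None` when a player has won no points avoids a division by zero and keeps an empty sample distinguishable from a true share of zero.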

For our observations, we each watched recordings of older tennis matches on YouTube. We gathered our data in a shared Google Sheets file, which made it easier to compile everything in one place. Each group member was assigned a match and recorded statistics from one set of that match. The tournaments included Wimbledon, the US Open, and the Australian Open. Two of us watched Serena Williams vs Venus Williams, and the other two watched Roger Federer vs Novak Djokovic. Each member was responsible for recording the same number of variables for their chosen set.

The five variables that we decided were the most important are Across or In, Winner type, Winning shot type, Service in or out, and Rally length. The first variable, Across or In, pertains to the server. If a right-handed server serves from the right-hand side across the court, this is considered Across; if they are standing on the left-hand side, this is called serving In. The second variable, Winner type, records which player won the point. The third variable, Winning shot type, refers to how the winner won the point. This could have been by an ace, a forehand, or a backhand. If a player hit a strong forehand down the line and the opponent reached the ball but hit it into the net, we recorded the winning shot type as a forehand, because that forehand forced the opponent’s error. The fourth variable records whether the server got their first serve in: we recorded a 1 if they did and a 0 if they did not. Finally, the last variable is the length of the rally. An ace was recorded as 0, but if the opponent returned the serve, we counted the serve as part of the rally.

Once all of this data was in the shared Google Sheet, we chose the variables that were most interesting to us and then delegated the work for the other sections.


Summary

Table 1

Table 1 summarizes some key statistics for each player we tracked. The first variable reports the percentage of the time a player’s first serve was in; by this metric, Novak Djokovic was the most accurate server in the sets we tracked. Another key serving metric, Aces, was dominated by Serena Williams, who scored 24 of her 58 total points with an ace. However, Serena also had the most faults of all the players we tracked. Serena Williams won the most points of the four players, but Novak Djokovic won the largest proportion of the points he played. Djokovic also won the most games and was the only player in our data to win both sets we tracked for him. Serena and Venus Williams each won one of their sets, while Roger Federer did not win a set and won only three games overall.
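The per-player figures in a summary like Table 1 can be derived directly from point-level rows. The sketch below shows one way this might be done in Python. The dictionary keys, the `player_summary` helper, and the sample rows are hypothetical and serve only as an illustration. They do not reproduce our recorded data.

```python
def player_summary(points, player):
    """Summarize one player's serving and scoring from point-level rows.

    Assumes `points` contains only points that `player` took part in and that
    `first_serve_in` is coded 1 (in) or 0 (out), as in our recording scheme.
    """
    served = [p for p in points if p["server"] == player]
    won = [p for p in points if p["point_winner"] == player]
    return {
        "first_serve_pct": (
            sum(p["first_serve_in"] for p in served) / len(served) if served else None
        ),
        "aces": sum(1 for p in won if p["winning_shot_type"] == "Ace"),
        "points_won": len(won),
        "win_share": len(won) / len(points) if points else None,
    }


# Made-up rows for illustration only.
sample_rows = [
    {"server": "Novak Djokovic", "point_winner": "Novak Djokovic",
     "first_serve_in": 1, "winning_shot_type": "Ace"},
    {"server": "Novak Djokovic", "point_winner": "Roger Federer",
     "first_serve_in": 0, "winning_shot_type": "Forehand"},
    {"server": "Roger Federer", "point_winner": "Novak Djokovic",
     "first_serve_in": 1, "winning_shot_type": "Backhand"},
    {"server": "Roger Federer", "point_winner": "Novak Djokovic",
     "first_serve_in": 0, "winning_shot_type": "Forehand"},
]

djokovic = player_summary(sample_rows, "Novak Djokovic")
print(djokovic)
```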

Figure 1

Figure 1 displays the distribution of points won by shot type for each player we tracked. Djokovic won more of his points with a forehand than with any other shot type, whereas his opponent Federer won exactly one-third of his points on each of the shot types. Serena won a larger proportion of her points on aces than any other player we tracked, along with a bigger proportion of points won on backhands than any other player in the data. Venus won nearly half of her points on her forehand and had the lowest proportion of points won on backhands among the players we tracked.
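The proportions behind a chart like Figure 1 amount to counting each player’s won points by shot type and dividing by that player’s total. A minimal Python sketch follows. The `(point winner, winning shot type)` tuple layout, the `shot_type_shares` helper, and the sample rows are assumptions chosen for illustration only.

```python
from collections import Counter


def shot_type_shares(points, player):
    """Proportion of the points won by `player` for each winning shot type.

    `points` is a list of (point_winner, winning_shot_type) pairs.
    """
    counts = Counter(shot for winner, shot in points if winner == player)
    total = sum(counts.values())
    if total == 0:
        return {}  # the player won no points in this sample
    return {shot: n / total for shot, n in counts.items()}


# Made-up rows for illustration only.
sample_shots = [
    ("Roger Federer", "Ace"),
    ("Roger Federer", "Forehand"),
    ("Roger Federer", "Backhand"),
    ("Novak Djokovic", "Forehand"),
]

federer_shares = shot_type_shares(sample_shots, "Roger Federer")
print(federer_shares)
```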

Figure 2

Figure 2 visualizes the distribution of rally length for each player, both for points they served and for points they won. One of the first observations we can make is that the two plots are extremely similar, which makes sense because professional tennis players are in general more likely to win a point when they are serving. Each player’s density curve also resembles the overall density curve for rally length. We also observe that Serena Williams, who scores a high proportion of her points on aces (Figure 1), seems to have the steepest peak at the left-most end of her density curve, indicating that she wins many of her points on short rallies. The distribution for Serena’s opponent Venus, however, is more right-skewed than Serena’s, indicating that Venus could be winning more points on lengthy rallies.


Insights

Measuring the Importance of the First Serve

From Figure 1, we can see that the players in our data scored different proportions of their overall points from aces. Supplementing this figure with the data presented in Table 1, we observe that the players had different values for first serve percentages. Given this information, we wanted to attempt to evaluate how important getting the first serve in was for each of the players. To accomplish this, we created an evaluation metric called FSI (First Serve Importance). This metric was calculated using the following formula.

Creating this metric yielded the following values for FSI.

Based on FSI, Novak Djokovic has the most important first serve, largely because his win percentage is so high when his first serve is in. Perhaps a more interesting finding is that getting the first serve in is less important for either of the Williams sisters, as their point win percentage remains extremely high even when their first serve is out. This statistic would clearly benefit from a larger sample, both in the number of points recorded and in the number of players tracked. Knowing ahead of time that an opponent has a high FSI could help a player prepare for competition by indicating whether to devote more practice time to returning first serves.
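Because the FSI formula does not appear in this text, the following Python sketch rests on an assumed definition: a server’s point win rate when the first serve is in, minus the win rate when it is out. This assumption fits the description of a metric that can be positive or negative, but it should not be read as the exact formula we used. The `first_serve_importance` name, the tuple layout, and the sample rows are likewise hypothetical.

```python
def first_serve_importance(points, player):
    """Assumed FSI: win rate with first serve in minus win rate with it out.

    `points` is a list of (server, first_serve_in, point_winner) tuples, with
    first_serve_in coded 1 (in) or 0 (out). Only points served by `player`
    are used. Returns None if either group of serve points is empty.
    """
    serve_points = [p for p in points if p[0] == player]
    won_in = [winner == player for _, first_in, winner in serve_points if first_in == 1]
    won_out = [winner == player for _, first_in, winner in serve_points if first_in == 0]
    if not won_in or not won_out:
        return None
    return sum(won_in) / len(won_in) - sum(won_out) / len(won_out)


# Made-up rows for illustration only.
sample_serves = [
    ("Player A", 1, "Player A"),
    ("Player A", 1, "Player A"),
    ("Player A", 1, "Player B"),
    ("Player A", 0, "Player A"),
    ("Player A", 0, "Player B"),
    ("Player B", 1, "Player B"),
]

fsi_a = first_serve_importance(sample_serves, "Player A")
print(fsi_a)
```

Under this assumed definition, a value near zero means the player wins about as often on serve whether or not the first serve lands, which matches the pattern we describe for the Williams sisters.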


A Look into Longer Rallies Between the Williams Sisters

Referring back to Figure 2, we observed that rally length might differ between points won by Serena and points won by Venus Williams. Given that each of the sisters won one of the sets we tracked, we wanted to explore what was different about the two sets that might explain the different victors. A key difference between the two matches is that the Wimbledon match (won by Venus) was played on a grass court, while the US Open match (won by Serena) was played on a hard court. Building on what we learned from Figure 2, we visualized the distribution of rally lengths in the two matches and conducted a two-sample t-test for a difference in mean rally length between the two court types.

Looking at the figure below, we observe that there was a higher proportion of short rallies in the hard court match and a higher proportion of longer rallies in the grass court match. We also observed that exactly one-half of the points in their matches were decided in rallies of two shots or fewer. Our two-sample t-test yielded a p-value of 0.012 and a 95% confidence interval of [0.321, 2.522] for the difference in means, so the difference in mean rally length between the two matches is statistically significant at the 5% level (the mean rally lengths were 2.7 and 4.12 for the US Open and Wimbledon, respectively).

Interpreting this information, we infer that Venus may be better at winning points in longer rallies and therefore should attempt to take her opponent deep into points in order to gain a competitive edge. Similarly, Serena may be less likely to win a point as the rally gets longer and could adapt her playstyle or practice style accordingly.
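For readers who want to see how a comparison of this kind is computed, the sketch below works through a two-sample test on made-up rally lengths using only the Python standard library. It assumes the unequal-variance (Welch) form of the t-test, which may differ from the variant we ran. Its confidence interval uses a normal quantile as a large-sample approximation rather than the exact t quantile. The `welch_summary` helper and both samples are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_summary(a, b):
    """Welch two-sample comparison of mean(b) - mean(a).

    Returns the mean difference, t statistic, Welch-Satterthwaite degrees of
    freedom, and an approximate 95% interval based on a normal quantile.
    """
    mean_diff = mean(b) - mean(a)
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    se = math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    z = NormalDist().inv_cdf(0.975)  # approximation; the exact CI uses a t quantile
    return {
        "mean_diff": mean_diff,
        "t_stat": mean_diff / se,
        "df": df,
        "approx_ci": (mean_diff - z * se, mean_diff + z * se),
    }


# Made-up rally lengths for illustration only, not our recorded data.
hard_court = [1, 2, 2, 3, 4, 1, 3, 2]
grass_court = [2, 5, 4, 6, 3, 7, 4, 5]

result = welch_summary(hard_court, grass_court)
print(result)
```

An exact p-value requires the t distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(hard_court, grass_court, equal_var=False)` performs the same Welch test and reports the p-value directly.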


Critiques

Numerous measures could have been taken to improve our data collection. To begin, we did not settle on a standard for how certain values in a column should be recorded. For example, we had “Serena” in certain columns where “Serena Williams” would have been the proper input. Some of our variables were also recorded differently from how they are usually defined in tennis. For example, we counted an ace as any serve that could not be returned, while an ace usually means a serve that is not touched at all by the returning player. We also counted the serve in the rally count, which is debated heavily in the tennis world.
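An inconsistency like the “Serena” versus “Serena Williams” example can be cleaned up after the fact by mapping each recorded variant to one canonical name. The Python sketch below shows one possible approach. The `CANONICAL_NAMES` table and the `normalize_player` helper are hypothetical, and the list of variants would need to match what actually appears in the spreadsheet.

```python
# Hypothetical lookup from lower-cased variants to the canonical full name.
CANONICAL_NAMES = {
    "serena": "Serena Williams",
    "serena williams": "Serena Williams",
    "venus": "Venus Williams",
    "venus williams": "Venus Williams",
    "federer": "Roger Federer",
    "roger federer": "Roger Federer",
    "djokovic": "Novak Djokovic",
    "novak djokovic": "Novak Djokovic",
}


def normalize_player(raw):
    """Return the canonical player name for a recorded value.

    Collapses extra whitespace and ignores case. Unknown values are returned
    with whitespace cleaned up so they can be reviewed by hand.
    """
    cleaned = " ".join(raw.split())
    return CANONICAL_NAMES.get(cleaned.lower(), cleaned)


print(normalize_player("Serena"))
print(normalize_player("  serena   williams "))
```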

Beyond standardization, collecting more data, in the form of both more observations and more variables, would have led to greater confidence in our findings and greater potential for exploratory analysis. We cannot draw reliable data-based conclusions about tennis player tendencies without a larger sample that includes not only more points but also different players over a longer period of time. Shot velocity is commonly reported at the conclusion of points and could have been useful for comparing the average velocities of certain players with their proportion of points won in a match. Locational data on shot placement, as well as on where players are located when they win or concede a point, would also have been useful for finding more tendencies unique to certain players. This would also have given us the chance to divide the court into zones to see where players spend most of their time over the course of a rally.

One critique of our insight regarding rally length between the Williams sisters is that the match we tracked at Wimbledon took place in 2008, while the match we tracked at the US Open took place in 2015; this is a large gap in time over which either sister’s play style could have changed dramatically. Also, the FSI metric we calculated can be difficult to interpret, because it can take both positive and negative values and its magnitude in either direction is not entirely indicative of player performance. Rather, it reflects a player tendency that we believe could help tennis players better prepare for their opponents.