1. Why I Chose This Topic
“She’s very strong, she must hit the ball very hard”, “He’s chubby, he must be slow”, “He’s very tall, he must have a great serve”… And the list could go on… No matter what sport we play, there are always many assumptions people make just based on the physical looks of the players.
I have been surrounded by assumptions of all types ever since I picked up a tennis racquet when I was 5. I have spent countless hours on the court either practicing or playing matches. Therefore, I wanted to do my final project on a topic that I am passionate about: tennis. It seemed like a perfect opportunity to apply what I learned in the classroom to my favorite sport.
There were so many directions I could go with this project. And then I thought about myself when I was little. Right before going on the court, I would look at my opponent and, before she had even hit the ball, I would already make presumptions about how she played based on how she looked. I remember once my opponent was very tall and I thought: “wow! she’s probably going to be a good server”. This turned out not to be the case, but even before playing I already had a little mental disadvantage. Over the years, I have learned not to make those types of assumptions. That led me to do the project on whether the height of a tennis player influences how good a server that player is, and to get a better understanding of what being a “good server” means.
2. Introduction to the Topic
First of all, let’s define or try to understand who we can consider a good server.
The goal in tennis is to win points in order to win games, then to win sets that lead the player to win a match. A player with an effective serve (a good server) will be better able to dictate the play in a point. We can measure this effectiveness with a number of parameters, such as percentage of first serves in, percentage of service games won and percentage of service points won. The question is then whether taller tennis players have an advantage when it comes to serving and, if so, which of these parameters (or others) best measure it.
Additionally, I would like to assess whether this relationship differs between men’s and women’s tennis.
3. Data Collection and Dataframe
I have created my own database. I collected the data from the official websites of the men’s and women’s professional tours (ATP for men and WTA for women). I limited my research to the top 100 ranked tennis players (men and women) on June 10th, 2019. For every player, I collected his/her height, percentage of first serves in, percentage of service games won and percentage of service points won in 2019. The average first serve speed was also available for the men, so I added it as well.
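As a rough illustration of the layout described above, a data frame with this structure might look like the sketch below. The column names are the ones used in the code throughout this report; the PLAYER column and every value are made up for illustration and are not taken from the real data.

```r
# Toy stand-in for `servedata`: the column names follow the code in this
# report, but the PLAYER column and all values are invented for illustration.
servedata_toy <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D"),  # hypothetical column
  SEX = c("MALE", "MALE", "FEMALE", "FEMALE"),
  HEIGHT.in.cm = c(198, 180, 185, 168),
  FIRST.SERVE.percentage = c(62.1, 65.4, 60.3, 66.8),
  SERVICE.GAMES.WON.percentage = c(88.0, 79.5, 74.2, 66.9),
  SERVICE.POINTS.WON.percentage = c(69.3, 64.0, 61.5, 57.2),
  AVERAGE.FIRST.SERVE.SPEED = c(124, 112, NA, NA),  # only collected for men
  stringsAsFactors = FALSE
)

# One row per player, one column per statistic
str(servedata_toy)
```

The missing serve speeds for the female rows are stored as NA, which is why the correlation calls in the Results section use `use = "complete.obs"`.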
4. Visualizing the Data
5. Results
5.1. Height vs. Percentage of Service Games Won
The first element that I want to analyze is the correlation between height and the percentage of service games won. The following scatterplot illustrates height versus the percentage of service games won for men and women. I separated the data by sex and added a linear best fit line to show the trends in both men and women.
# Bare column names inside aes() (servedata$... is discouraged there and can misalign data when faceting); theme_clean() comes from the ggthemes package
ggplot(servedata, aes(x=HEIGHT.in.cm, y=SERVICE.GAMES.WON.percentage)) + geom_point(aes(color=SEX)) +
labs(x="Height in cm", y="Percentage of Service Games Won", title="Height vs. Percentage of Service Games Won") + theme_clean() + facet_wrap(~SEX, ncol=2) + geom_smooth(method="lm", size=0.7, color="purple")

At first sight, the data for the female players looks more spread out than the data for the male players, which might indicate a stronger correlation in the latter. To check this observation, I calculated the correlation coefficient for height vs. percentage of service games won separately for each sex.
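To make clear what the correlation coefficient measures, here is a minimal sketch that computes Pearson's r directly from its definition and compares it with R's built-in cor(), which uses the Pearson method by default. The two short vectors are made-up numbers, not values from my data.

```r
# Pearson's r computed from its definition, on made-up numbers.
height <- c(170, 175, 180, 185, 190)   # hypothetical heights in cm
games_won <- c(70, 74, 73, 80, 82)     # hypothetical % of service games won

# Deviations of each value from its mean
dx <- height - mean(height)
dy <- games_won - mean(games_won)

# r = sum of the products of deviations, scaled by the size of the deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in version (Pearson by default)
r_builtin <- cor(height, games_won)

all.equal(r_manual, r_builtin)
```

A value close to 1 or -1 means the points lie close to a straight line, while a value close to 0 means there is little linear relationship.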
FEMALE DATA
The following variable, female_height2, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.GAMES.WON.percentage’ from ‘servedata’, keeps only the female players with a recorded value and arranges them by height.
female_height2 <- servedata %>%
select(SEX, HEIGHT.in.cm, SERVICE.GAMES.WON.percentage) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(SERVICE.GAMES.WON.percentage), SEX=="FEMALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the female players’ height and their percentage of service games won is calculated as follows:
cor(female_height2$HEIGHT.in.cm, female_height2$SERVICE.GAMES.WON.percentage, use = "complete.obs")
## [1] 0.1986259
This result shows only a weak positive correlation between those two elements.
MALE DATA
The following variable, male_height2, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.GAMES.WON.percentage’ from ‘servedata’, keeps only the male players with a recorded value and arranges them by height.
male_height2 <- servedata %>%
select(SEX, HEIGHT.in.cm, SERVICE.GAMES.WON.percentage) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(SERVICE.GAMES.WON.percentage), SEX=="MALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the male players’ height and their percentage of service games won is calculated as follows:
cor(male_height2$HEIGHT.in.cm, male_height2$SERVICE.GAMES.WON.percentage, use = "complete.obs")
## [1] 0.5354061
This result shows a moderate positive correlation between those two elements.
These correlation coefficients confirm my initial observation of the graphs.
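The same two coefficients could also be obtained in a single step by grouping on SEX. The sketch below assumes dplyr is loaded and uses a small made-up data frame in place of ‘servedata’; only the column names match the real data, and the name cor_by_sex is my own.

```r
library(dplyr)

# Made-up rows standing in for `servedata`; only the column names are real.
toy <- data.frame(
  SEX = rep(c("FEMALE", "MALE"), each = 4),
  HEIGHT.in.cm = c(165, 172, 178, 184, 178, 185, 191, 198),
  SERVICE.GAMES.WON.percentage = c(66, 70, 64, 72, 78, 80, 85, NA)
)

# One correlation per sex, plus the number of complete pairs it is based on
cor_by_sex <- toy %>%
  group_by(SEX) %>%
  summarise(
    n_complete = sum(!is.na(HEIGHT.in.cm) & !is.na(SERVICE.GAMES.WON.percentage)),
    r = cor(HEIGHT.in.cm, SERVICE.GAMES.WON.percentage, use = "complete.obs")
  )

cor_by_sex
```

Reporting the number of complete pairs next to each coefficient makes it easier to see how many players each value is based on.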
5.2. Height vs. Percentage of Service Points Won
After the previous analysis of the height versus the percentage of service games won, I will now look at the correlation between the players’ height and their percentage of service points won. The data is again put in a scatterplot then separated by sex with a best fit line added to show their trends.
# Bare column names inside aes() (servedata$... is discouraged there and can misalign data when faceting); theme_clean() comes from the ggthemes package
ggplot(servedata, aes(x=HEIGHT.in.cm, y=SERVICE.POINTS.WON.percentage)) + geom_point(aes(color=SEX)) +
labs(x="Height in cm", y="Percentage of Service Points Won", title="Height vs. Percentage of Service Points Won") + theme_clean() + facet_wrap(~SEX, ncol=2) + geom_smooth(method="lm", size=0.7, color="purple")

Again, at first sight it looks like there is a higher positive correlation in the male scatterplot than in the female one.
FEMALE DATA
The following variable, female_height3, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.POINTS.WON.percentage’ from ‘servedata’, keeps only the female players with a recorded value and arranges them by height.
female_height3 <- servedata %>%
select(SEX, HEIGHT.in.cm, SERVICE.POINTS.WON.percentage) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(SERVICE.POINTS.WON.percentage), SEX=="FEMALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the female players’ height and their percentage of service points won is calculated as follows:
cor(female_height3$HEIGHT.in.cm, female_height3$SERVICE.POINTS.WON.percentage, use = "complete.obs")
## [1] 0.1916793
This result shows a weak positive correlation between those two elements.
MALE DATA
The following variable, male_height3, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.POINTS.WON.percentage’ from ‘servedata’, keeps only the male players with a recorded value and arranges them by height.
male_height3 <- servedata %>%
select(SEX, HEIGHT.in.cm, SERVICE.POINTS.WON.percentage) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(SERVICE.POINTS.WON.percentage), SEX=="MALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the male players’ height and their percentage of service points won is calculated as follows:
cor(male_height3$HEIGHT.in.cm, male_height3$SERVICE.POINTS.WON.percentage, use = "complete.obs")
## [1] 0.5425625
This result shows a moderate positive correlation between those two elements.
Once again, these correlation coefficients confirm my initial observation of the graphs.
5.3. Height vs. Percentage of First Serves In
After looking at height versus the percentage of service points won, I will now look at the correlation between the players’ height and their percentage of first serves in. The data is again put in a scatterplot, separated by sex, with a best fit line added to show the trends.
# Bare column names inside aes() (servedata$... is discouraged there and can misalign data when faceting); theme_clean() comes from the ggthemes package
ggplot(servedata, aes(x=HEIGHT.in.cm, y=FIRST.SERVE.percentage)) + geom_point(aes(color=SEX)) +
labs(x="Height in cm", y="Percentage of First Serves In", title="Height vs. Percentage of First Serves In") + theme_clean() + facet_wrap(~SEX, ncol=2) + geom_smooth(method="lm", size=0.7, color="purple")

Contrary to the previous two cases, it looks like there is a limited negative correlation in both the men’s and the women’s scatterplots.
FEMALE DATA
The following variable, female_height4, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘FIRST.SERVE.percentage’ from ‘servedata’, keeps only the female players with a recorded value and arranges them by height.
female_height4 <- servedata %>%
select(SEX, HEIGHT.in.cm, FIRST.SERVE.percentage) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(FIRST.SERVE.percentage), SEX=="FEMALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the female players’ height and their percentage of first serves in is calculated as follows:
cor(female_height4$HEIGHT.in.cm, female_height4$FIRST.SERVE.percentage, use = "complete.obs")
## [1] -0.2312219
This result shows a slight negative correlation between those two elements.
MALE DATA
The following variable, male_height4, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘FIRST.SERVE.percentage’ from ‘servedata’, keeps only the male players with a recorded value and arranges them by height.
male_height4 <- servedata %>%
select(SEX, HEIGHT.in.cm, FIRST.SERVE.percentage) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(FIRST.SERVE.percentage), SEX=="MALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the male players’ height and their percentage of first serves in is calculated as follows:
cor(male_height4$HEIGHT.in.cm, male_height4$FIRST.SERVE.percentage, use = "complete.obs")
## [1] -0.08369432
This coefficient is so close to zero that there is effectively no linear correlation between these two factors.
5.4. Height vs. Average First Serve Speed (Men Only)
The following variable, male_height5, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘AVERAGE.FIRST.SERVE.SPEED’ from ‘servedata’, keeps only the male players with a recorded value and arranges them by height.
male_height5 <- servedata %>%
select(SEX, HEIGHT.in.cm, AVERAGE.FIRST.SERVE.SPEED) %>%
# filter() needs logical conditions, so missing values are dropped explicitly
filter(!is.na(AVERAGE.FIRST.SERVE.SPEED), SEX=="MALE") %>%
arrange(HEIGHT.in.cm)
The correlation coefficient between the male players’ height and their average first serve speed is calculated as follows:
cor(male_height5$HEIGHT.in.cm, male_height5$AVERAGE.FIRST.SERVE.SPEED, use = "complete.obs")
## [1] 0.8020822
This result shows that there is a strong positive correlation between those two elements, as originally observed.
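A scatterplot of height against average first serve speed can be drawn in the same style as the earlier plots. The sketch below is only an illustration: it assumes ggplot2 is installed, uses made-up values in a data frame I call toy_men, and uses ggplot2's own theme_minimal() instead of theme_clean() so that it does not depend on ggthemes.

```r
library(ggplot2)

# Made-up male players; only the column names match `servedata`.
toy_men <- data.frame(
  HEIGHT.in.cm = c(175, 180, 185, 188, 193, 198, 203),
  AVERAGE.FIRST.SERVE.SPEED = c(108, 111, 113, 117, 119, 123, 126)
)

# Scatterplot with a linear best fit line, as in the earlier sections
speed_plot <- ggplot(toy_men, aes(x = HEIGHT.in.cm, y = AVERAGE.FIRST.SERVE.SPEED)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, color = "purple") +
  labs(x = "Height in cm", y = "Average First Serve Speed",
       title = "Height vs. Average First Serve Speed (Men)") +
  theme_minimal()

speed_plot
```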
6. Interpretation of the Results & Conclusion
Overall, for the percentage of service games won and the percentage of service points won, the correlation with height is clearly stronger for men (about 0.54 in both cases) than for women (about 0.20 and 0.19). For the percentage of first serves in, both correlations are weak and negative (-0.08 for men and -0.23 for women), so height does not appear to help players put more first serves in. This suggests that a player’s height is more of a factor in being a “good server” in men’s tennis than in women’s tennis, and that at the highest level of tennis (top 100 players in the world), the serve is more relevant in men’s tennis than in women’s tennis.
For players with an average first serve speed of 120 mph or more (top third of the scatterplot above), the average percentage of service games won is 83.7%, while for players with an average first serve speed of 110 mph or less (lower third of the scatterplot above), it is 75.3%. Since height and average first serve speed are strongly correlated for men (about 0.80), this suggests that it is harder for a shorter player to hold serve, because on average their first serve is slower.
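A comparison like the one above can be computed by splitting the male players into groups by serve speed and averaging within each group. The sketch below assumes dplyr is loaded and that speeds are in mph; the data frame toy_men, the speed_group labels and all values are made up, so the averages it produces are not the 83.7% and 75.3% reported here.

```r
library(dplyr)

# Made-up male players; speeds are assumed to be in mph.
toy_men <- data.frame(
  AVERAGE.FIRST.SERVE.SPEED = c(125, 122, 118, 114, 109, 106),
  SERVICE.GAMES.WON.percentage = c(88, 84, 81, 79, 76, 74)
)

# Label each player by speed band, then average service games won per band
hold_by_speed <- toy_men %>%
  mutate(speed_group = case_when(
    AVERAGE.FIRST.SERVE.SPEED >= 120 ~ "120 mph or more",
    AVERAGE.FIRST.SERVE.SPEED <= 110 ~ "110 mph or less",
    TRUE ~ "in between"
  )) %>%
  group_by(speed_group) %>%
  summarise(
    players = n(),
    mean_games_won = mean(SERVICE.GAMES.WON.percentage, na.rm = TRUE)
  )

hold_by_speed
```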
7. Limitations
There are many other factors that contribute to being a “good server”. On the one hand, there are the factors related to the server and his/her abilities, for example, the placement of the ball in the service box, the spin of the ball, and the capability to adapt to playing conditions (wind, sun, altitude, etc.). On the other hand, there are the factors related to the returner, such as his/her ability to return serves, which could influence how effective the serves are.
In terms of the data I used, I limited my analysis to the top 100 players on a particular date. Ideally, more players would have been included, over a longer period of time. Furthermore, there are fewer statistics available on women’s tennis than on men’s tennis. I was not able to find the average first serve speed of the female players, which could have helped me in my conclusions about women’s tennis.
8. My Takeaway
I was surprised by the number of statistics available on professional tennis players. They are such a useful tool to analyze and understand not only one’s own game but also the opponent’s game and strategy. Hopefully, in the next few years, these statistics will also be collected for collegiate and junior matches and everyone will have access to them.
I have really enjoyed doing this project because I was actually able to put into practice the material learned in class to analyze a topic of interest to me. Furthermore, I managed to use statistical and graphical tools to confirm intuitive observations that I had in my daily life experience when playing tennis.