John Isner serving

1. Why I Chose This Topic

“She’s very strong, she must hit the ball very hard”, “He’s chubby, he must be slow”, “He’s very tall, he must have a great serve”… And the list could go on. No matter what sport we play, people always make assumptions based purely on the players’ physical appearance.

I have been surrounded by assumptions of all types ever since I picked up a tennis racquet when I was 5. I have spent countless hours on the court, either practicing or playing matches. Therefore, I wanted to do my final project on a topic that I am passionate about: tennis. It seemed like a perfect opportunity to apply what I learned in the classroom to my favorite sport.

There were so many directions I could go with this project. And then I thought about myself when I was little. Right before going on the court, I would look at my opponent and, before she had even hit a ball, make assumptions about how she played based on how she looked. I remember once my opponent was very tall and I thought: “Wow! She’s probably going to be a good server.” This turned out not to be the case, but even before playing I already had a small mental disadvantage. Over the years, I have learned not to make those types of assumptions. That led me to do the project on whether the height of a tennis player influences how good a server that player is, and to get a better understanding of what being a “good server” means.

2. Introduction to the Topic

First of all, let’s define or try to understand who we can consider a good server.

The goal in tennis is to win points in order to win games, then to win sets, which lead the player to win a match. A player with an effective serve (a good server) is better able to dictate play in a point. We can measure this effectiveness with a number of parameters, such as percentage of first serves in, percentage of service games won and percentage of service points won. The question is then whether taller tennis players have an advantage when it comes to serving and, if so, which of these parameters (or others) best measure that effectiveness.

Additionally, I would like to assess whether this relationship differs between men’s and women’s tennis.

3. Data Collection and Dataframe

I created my own dataset. I collected the data from the official websites of the men’s and women’s professional tours (ATP for men and WTA for women). I limited my research to the top 100 ranked tennis players (men and women) on June 10th, 2019. For every player, I collected his or her height, percentage of first serves in, percentage of service games won and percentage of service points won, all for 2019. The average first serve speed was also available for the men, so I added it as well.

4. Visualizing the Data

The following shows the first six rows of my data frame. The whole data frame can be accessed here.

head(servedata)
##       NAME RANK    SEX HAND HEIGHT.in.cm FIRST.SERVE.percentage
## 1    Osaka    1 FEMALE    R          180                   62.4
## 2    Barty    2 FEMALE    R          166                   58.4
## 3 Pliskova    3 FEMALE    R          186                   65.9
## 4  Bertens    4 FEMALE    R          182                   57.7
## 5  Kvitova    5 FEMALE    L          182                   61.4
## 6   Kerber    6 FEMALE    L          173                   61.8
##   SERVICE.GAMES.WON.percentage SERVICE.POINTS.WON.percentage
## 1                         78.1                          61.8
## 2                         77.3                          62.5
## 3                         79.8                          63.2
## 4                         76.6                          62.5
## 5                         79.1                          62.9
## 6                         70.7                          59.4
##   AVERAGE.FIRST.SERVE.SPEED
## 1                        NA
## 2                        NA
## 3                        NA
## 4                        NA
## 5                        NA
## 6                        NA
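To experiment with the code in the following sections without the full file, the six rows printed above can be rebuilt by hand. This is only a sketch: the object name `serve_head` is my own, and with six players (all women) anything computed from it is illustrative rather than a result.

```r
# Rebuild the six rows shown by head(servedata); the values are copied from
# the printed output above. `serve_head` is a hypothetical name that is not
# part of the original analysis.
serve_head <- data.frame(
  NAME = c("Osaka", "Barty", "Pliskova", "Bertens", "Kvitova", "Kerber"),
  RANK = 1:6,
  SEX = "FEMALE",
  HAND = c("R", "R", "R", "R", "L", "L"),
  HEIGHT.in.cm = c(180, 166, 186, 182, 182, 173),
  FIRST.SERVE.percentage = c(62.4, 58.4, 65.9, 57.7, 61.4, 61.8),
  SERVICE.GAMES.WON.percentage = c(78.1, 77.3, 79.8, 76.6, 79.1, 70.7),
  SERVICE.POINTS.WON.percentage = c(61.8, 62.5, 63.2, 62.5, 62.9, 59.4),
  AVERAGE.FIRST.SERVE.SPEED = NA_real_,  # not available for the women
  stringsAsFactors = FALSE
)

# Inspect column types and the first values of each column
str(serve_head)
```

Passing `serve_head` in place of `servedata` lets the later `select()`/`filter()` steps be tried out, keeping in mind that the male subsets will be empty.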

5. Results

5.1. Height vs. Percentage of Service Games Won

At first sight, the data for women looks more spread out than the data for men, which might indicate a higher correlation in the latter. To check this observation, I calculated the correlation coefficient for height vs. percentage of service games won, separately for each sex.

FEMALE DATA

The following variable, female_height2, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.GAMES.WON.percentage’ from ‘servedata’, keeps only the female players with a recorded value, and arranges the rows by height.

library(dplyr)  # provides %>%, select(), filter() and arrange()

female_height2 <- servedata %>%
  select(SEX, HEIGHT.in.cm, SERVICE.GAMES.WON.percentage) %>%
  # keep female players whose statistic is not missing
  filter(!is.na(SERVICE.GAMES.WON.percentage), SEX == "FEMALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the female players’ height and their percentage of service games won is calculated as follows:

cor(female_height2$HEIGHT.in.cm, female_height2$SERVICE.GAMES.WON.percentage, use = "complete.obs")
## [1] 0.1986259

This result shows that there is only a weak positive correlation between those two variables.

MALE DATA

The following variable, male_height2, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.GAMES.WON.percentage’ from ‘servedata’, keeps only the male players with a recorded value, and arranges the rows by height.

male_height2 <- servedata %>%
  select(SEX, HEIGHT.in.cm, SERVICE.GAMES.WON.percentage) %>%
  # keep male players whose statistic is not missing
  filter(!is.na(SERVICE.GAMES.WON.percentage), SEX == "MALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the male players’ height and their percentage of service games won is calculated as follows:

cor(male_height2$HEIGHT.in.cm, male_height2$SERVICE.GAMES.WON.percentage, use = "complete.obs")
## [1] 0.5354061

This result shows that there is a moderate positive correlation between those two variables.

These correlation coefficients confirm my initial observation of the graphs.
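A coefficient on its own does not say whether it could have arisen by chance. As a rough check, the sketch below converts a Pearson coefficient and a sample size into a t statistic and a two-sided p-value, using t = r·sqrt(n − 2) / sqrt(1 − r²). The function name `cor_p_value` is my own, and n = 100 per group is an assumption: the text does not say how many players remained in each subset after rows with missing values were dropped.

```r
# Two-sided t-test for a Pearson correlation, given only r and the sample size.
# `cor_p_value` is a hypothetical helper written for this sketch.
cor_p_value <- function(r, n) {
  stopifnot(n > 2, abs(r) < 1)
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic with n - 2 df
  p <- 2 * pt(-abs(t_stat), df = n - 2)       # two-sided p-value
  c(t = t_stat, p = p)
}

# ASSUMPTION: n = 100 players per group; the real n may be smaller if some
# players had missing values. The r values are the ones reported above.
female_games <- cor_p_value(0.1986259, n = 100)
male_games   <- cor_p_value(0.5354061, n = 100)

print(round(female_games, 4))
print(round(male_games, 4))
```

With the full data frame available, `cor.test(female_height2$HEIGHT.in.cm, female_height2$SERVICE.GAMES.WON.percentage)` would run the same kind of test on the actual sample size and also report a confidence interval.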

5.2. Height vs. Percentage of Service Points Won

After the previous analysis of the height versus the percentage of service games won, I will now look at the correlation between the players’ height and their percentage of service points won. The data is again put in a scatterplot then separated by sex with a best fit line added to show their trends.

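library(ggplot2)
# theme_clean() comes from an add-on package such as ggthemes, which must be loaded too

# Refer to columns by name inside aes(); servedata$... can misbehave with facets
ggplot(servedata, aes(x = HEIGHT.in.cm, y = SERVICE.POINTS.WON.percentage)) +
  geom_point(aes(color = SEX)) +
  geom_smooth(method = "lm", size = 0.7, color = "purple") +
  facet_wrap(~SEX, ncol = 2) +
  labs(x = "Height in cm", y = "Percentage of Service Points Won",
       title = "Height vs. Percentage of Service Points Won") +
  theme_clean()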

Again, at first sight it looks like there is a higher positive correlation in the male scatterplot than in the female one.

FEMALE DATA

The following variable, female_height3, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.POINTS.WON.percentage’ from ‘servedata’, keeps only the female players with a recorded value, and arranges the rows by height.

female_height3 <- servedata %>%
  select(SEX, HEIGHT.in.cm, SERVICE.POINTS.WON.percentage) %>%
  # keep female players whose statistic is not missing
  filter(!is.na(SERVICE.POINTS.WON.percentage), SEX == "FEMALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the female players’ height and their percentage of service points won is calculated as follows:

cor(female_height3$HEIGHT.in.cm, female_height3$SERVICE.POINTS.WON.percentage, use = "complete.obs")
## [1] 0.1916793

This result shows that there is a weak positive correlation between those two variables.

MALE DATA

The following variable, male_height3, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘SERVICE.POINTS.WON.percentage’ from ‘servedata’, keeps only the male players with a recorded value, and arranges the rows by height.

male_height3 <- servedata %>%
  select(SEX, HEIGHT.in.cm, SERVICE.POINTS.WON.percentage) %>%
  # keep male players whose statistic is not missing
  filter(!is.na(SERVICE.POINTS.WON.percentage), SEX == "MALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the male players’ height and their percentage of service points won is calculated as follows:

cor(male_height3$HEIGHT.in.cm, male_height3$SERVICE.POINTS.WON.percentage, use = "complete.obs")
## [1] 0.5425625

This result shows that there is a moderate positive correlation between those two variables.

Once again, these correlation coefficients confirm my initial observation of the graphs.

5.3. Height vs. Percentage of First Serves In

After visualizing height versus the percentage of service points won, we will look at the correlation between the players’ height and their percentage of first serves in. The data is put in a scatterplot, separated by sex, with a best fit line added to show the trends.

# Refer to columns by name inside aes(); servedata$... can misbehave with facets
ggplot(servedata, aes(x = HEIGHT.in.cm, y = FIRST.SERVE.percentage)) +
  geom_point(aes(color = SEX)) +
  geom_smooth(method = "lm", size = 0.7, color = "purple") +
  facet_wrap(~SEX, ncol = 2) +
  labs(x = "Height in cm", y = "Percentage of First Serves In",
       title = "Height vs. Percentage of First Serves In") +
  theme_clean()

Contrary to the previous two cases, it looks like there is a weak negative correlation in both the men’s and the women’s scatterplots.

FEMALE DATA

The following variable, female_height4, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘FIRST.SERVE.percentage’ from ‘servedata’, keeps only the female players with a recorded value, and arranges the rows by height.

female_height4 <- servedata %>%
  select(SEX, HEIGHT.in.cm, FIRST.SERVE.percentage) %>%
  # keep female players whose statistic is not missing
  filter(!is.na(FIRST.SERVE.percentage), SEX == "FEMALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the female players’ height and their percentage of first serves in is calculated as follows:

cor(female_height4$HEIGHT.in.cm, female_height4$FIRST.SERVE.percentage, use = "complete.obs")
## [1] -0.2312219

This result shows that there is a weak negative correlation between those two variables.

MALE DATA

The following variable, male_height4, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘FIRST.SERVE.percentage’ from ‘servedata’, keeps only the male players with a recorded value, and arranges the rows by height.

male_height4 <- servedata %>%
  select(SEX, HEIGHT.in.cm, FIRST.SERVE.percentage) %>%
  # keep male players whose statistic is not missing
  filter(!is.na(FIRST.SERVE.percentage), SEX == "MALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the male players’ height and their percentage of first serves in is calculated as follows:

cor(male_height4$HEIGHT.in.cm, male_height4$FIRST.SERVE.percentage, use = "complete.obs")
## [1] -0.08369432

This coefficient is so close to zero that there is essentially no linear correlation between these two variables.
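The filter-then-correlate pattern used in the three comparisons above can also be written once and applied per sex. The sketch below uses base R only; `cor_by_sex` and `toy` are names I made up, and the toy numbers are deliberately artificial (perfectly linear) so that the result is easy to verify. They are not taken from the dataset.

```r
# Correlation between height and one serve statistic, computed per sex.
# `cor_by_sex` and `toy` are hypothetical names used only for this sketch.
cor_by_sex <- function(data, stat, height = "HEIGHT.in.cm") {
  groups <- split(data, data$SEX)            # one data frame per sex
  sapply(groups, function(d) {
    # complete.obs drops players with a missing value in either column
    cor(d[[height]], d[[stat]], use = "complete.obs")
  })
}

# Made-up, perfectly linear numbers (NOT from the dataset) so the expected
# result is obvious: +1 for FEMALE, -1 for MALE.
toy <- data.frame(
  SEX = rep(c("FEMALE", "MALE"), each = 3),
  HEIGHT.in.cm = c(160, 170, 180, 180, 190, 200),
  FIRST.SERVE.percentage = c(60, 62, 64, 66, 63, 60)
)

print(cor_by_sex(toy, "FIRST.SERVE.percentage"))
```

Applied to the real data, `cor_by_sex(servedata, "SERVICE.POINTS.WON.percentage")` should give both coefficients in one call. It would likely stop with an error for a column that is entirely missing for one sex, such as the average first serve speed.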

5.4. Height vs. Average First Serve Speed (Men Only)

After visualizing height versus the percentage of first serves in, we will look at the correlation between male players’ height and their average first serve speed. This information was not available for the female players, so the scatterplot only includes the data for the men. A best fit line was also added to show the trend.

# Rows without a serve speed (all of the women) are dropped from the plot
ggplot(servedata, aes(x = HEIGHT.in.cm, y = AVERAGE.FIRST.SERVE.SPEED)) +
  geom_point(aes(color = SEX)) +
  geom_smooth(method = "lm", size = 0.7, color = "blue") +
  labs(x = "Height in cm", y = "Average First Serve Speed",
       title = "Height vs. Average First Serve Speed") +
  theme_clean()

At first sight, it looks like there is a strong positive correlation in the scatterplot.

The following variable, male_height5, selects the columns ‘SEX’, ‘HEIGHT.in.cm’ and ‘AVERAGE.FIRST.SERVE.SPEED’ from ‘servedata’, keeps only the male players with a recorded value, and arranges the rows by height.

male_height5 <- servedata %>%
  select(SEX, HEIGHT.in.cm, AVERAGE.FIRST.SERVE.SPEED) %>%
  # keep male players whose serve speed is not missing
  filter(!is.na(AVERAGE.FIRST.SERVE.SPEED), SEX == "MALE") %>%
  arrange(HEIGHT.in.cm)

The correlation coefficient between the male players’ height and their average first serve speed is calculated as follows:

cor(male_height5$HEIGHT.in.cm, male_height5$AVERAGE.FIRST.SERVE.SPEED, use = "complete.obs")
## [1] 0.8020822

This result shows that there is a very strong positive correlation between those two elements, as originally observed.
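To compare all seven coefficients at a glance, they can be collected in one small table. The sketch below only re-enters the values printed by `cor()` in sections 5.1 to 5.4; the object name `cor_summary` and the added `r_squared` column are my own. Squaring r gives the share of variance that a straight-line fit on height would account for, which is roughly 64% for serve speed.

```r
# Coefficients copied from the cor() outputs reported in sections 5.1 to 5.4.
# `cor_summary` is a hypothetical name for this sketch.
cor_summary <- data.frame(
  comparison = c("Service games won", "Service games won",
                 "Service points won", "Service points won",
                 "First serves in", "First serves in",
                 "Average first serve speed"),
  sex = c("FEMALE", "MALE", "FEMALE", "MALE", "FEMALE", "MALE", "MALE"),
  r = c(0.1986259, 0.5354061, 0.1916793, 0.5425625,
        -0.2312219, -0.08369432, 0.8020822)
)

# r squared: share of variance shared with height under a linear fit
cor_summary$r_squared <- round(cor_summary$r^2, 3)

# Strongest relationships first, regardless of sign
print(cor_summary[order(-abs(cor_summary$r)), ])
```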

6. Interpretation of the Results & Conclusion

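Overall, the correlation between height and serve results is clearly higher for men than for women in the first two comparisons: about 0.54 for men against about 0.20 (service games won) and 0.19 (service points won) for women. The percentage of first serves in is the exception: both correlations are weak and negative (-0.08 for men, -0.23 for women), so taller players do not land more first serves. This suggests that a player’s height is more of a factor in being a “good server” in men’s tennis than in women’s tennis. It also suggests that, at the highest level of tennis (top 100 players in the world), the serve is more relevant in men’s tennis than in women’s tennis.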

For players with an average first serve speed of 120 mph or more (top third of the scatterplot above), the average percentage of service games won is 83.7%, while for players with an average first serve speed of 110 mph or less (bottom third of the scatterplot above), it is 75.3%. Since height and first serve speed are strongly correlated among the men, shorter players tend to have a slower first serve and therefore find it harder to hold serve.
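The two averages quoted above come from splitting the men by first serve speed. The code for that step is not shown, so the following is only a sketch of how such a comparison could be computed in base R. `hold_by_speed` and `toy_men` are names I made up, and the toy numbers are invented so the means can be checked by hand; they do not reproduce the 83.7% and 75.3% figures.

```r
# Mean percentage of service games won for fast and slow servers.
# `hold_by_speed` and `toy_men` are hypothetical names; the 120 and 110 mph
# cut-offs are the ones quoted in the paragraph above.
hold_by_speed <- function(data, fast = 120, slow = 110) {
  speed <- data$AVERAGE.FIRST.SERVE.SPEED
  held  <- data$SERVICE.GAMES.WON.percentage
  c(
    # which() skips players whose serve speed is missing
    fast = mean(held[which(speed >= fast)], na.rm = TRUE),
    slow = mean(held[which(speed <= slow)], na.rm = TRUE)
  )
}

# Made-up numbers (NOT from the dataset), chosen so the means are easy to check.
toy_men <- data.frame(
  AVERAGE.FIRST.SERVE.SPEED    = c(125, 121, 115, 109, 105, NA),
  SERVICE.GAMES.WON.percentage = c(90, 80, 78, 76, 70, 85)
)

print(hold_by_speed(toy_men))
```

On the real data the call would be `hold_by_speed(servedata)`, since only the men have a recorded serve speed.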

7. Limitations

In terms of the data I used, I limited my analysis to the top 100 players on a particular date. Ideally, more players would have been included, over a longer period of time. Furthermore, there are fewer statistics available on women’s tennis than on men’s tennis. I was not able to find the average first serve speed of the female players, which could have helped with my conclusions about women’s tennis.

8. My Takeaway

I was surprised by the number of statistics available on professional tennis players. They are such a useful tool to analyze and understand not only one’s own game but also the opponent’s game and strategy. Hopefully, in the next few years, these statistics will also be collected for collegiate and junior matches and made accessible to everyone.

I have really enjoyed doing this project because I was actually able to put into practice the material learned in class to analyze a topic of interest to me. Furthermore, I managed to use statistical and graphical tools to confirm intuitive observations that I had in my daily life experience when playing tennis.