The Wimbledon Tennis Championships is one of the four Grand Slam tournaments held each tennis season and takes place in the UK. This dataset, from the UCI Machine Learning Repository, contains information about the 2013 Wimbledon Men's and Women's Singles Championships. 2013 was a notable year for Wimbledon because Andy Murray, a UK native, won the men's title. Each row of data represents a match played in the tournament and includes a variety of numerical statistics about that match. All of the variables listed below are collected for both players and carry a .1 or .2 suffix in the dataset; for example, FSP.1 is the first serve percentage for Player 1 and FSP.2 is the same statistic for Player 2.
| Variables | Variable Label |
|---|---|
| Player 1 | Name of Player 1 |
| Player 2 | Name of Player 2 |
| Results | 1 if Player 1 wins |
| FSP | First Serve Percentage |
| FSW | First Serves Won |
| SSP | Second Serve Percentage |
| SSW | Second Serves Won |
| ACE | Aces Won |
| DBF | Double Faults Committed |
| WNR | Winners Hit |
| UFE | Unforced Errors Committed |
| BPC | Break Points Created |
| BPW | Break Points Won |
| NPA | Net Points Attempted |
| NPW | Net Points Won |
| TPW | Total Points Won |
| ST1-5 | Results for Sets 1-5 |
| FNL | Final Number of Games Won |
In order to answer this question, we create decision tree models using the data from Wimbledon 2013. After some consideration, we decided to build two separate models for the Women's and Men's tournaments, since they are separate events at Wimbledon. Additionally, we want to compare the Gini and information splitting methods to see which works better for each of the two datasets.
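Both splitting criteria measure how mixed the classes are within a node. For a node with class proportions $p_k$, the Gini index is $G = 1 - \sum_k p_k^2$, while the information criterion uses the entropy $H = -\sum_k p_k \log_2 p_k$; rpart chooses the split that most reduces the selected impurity measure, and the two criteria often, but not always, pick the same splits.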
library(tidyverse)
wmen <- read.csv("Wimbledon-men-2013.csv") #loading in datasets
wwmen <- read.csv("Wimbledon-women-2013.csv")
str(wmen, list.len=8) # viewing structure of dataset and variable classes
## 'data.frame': 114 obs. of 42 variables:
## $ Player1: Factor w/ 77 levels "A.Haider-Maurer",..: 7 40 58 72 65 50 2 39 49 42 ...
## $ Player2: Factor w/ 75 levels "A.Bedene","A.Bogomolov Jr.",..: 8 74 33 2 52 71 7 72 46 57 ...
## $ Round : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Result : int 0 0 1 1 0 0 1 0 0 1 ...
## $ FNL.1 : int 0 1 3 3 0 0 3 0 0 3 ...
## $ FNL.2 : int 3 3 0 0 3 3 1 3 3 0 ...
## $ FSP.1 : int 59 62 72 77 68 59 63 61 61 67 ...
## $ FSW.1 : int 29 77 44 40 61 41 56 47 31 56 ...
## [list output truncated]
str(wwmen, list.len=8)
## 'data.frame': 122 obs. of 42 variables:
## $ Player1: Factor w/ 77 levels "A.Cadantu","A.Cornet",..: 48 18 63 2 76 10 77 25 75 20 ...
## $ Player2: Factor w/ 84 levels "A.Beck","A.Cadantu",..: 79 30 74 81 40 12 80 37 5 32 ...
## $ Round : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Result : int 0 0 1 1 0 0 0 1 0 1 ...
## $ FNL.1 : int 0 0 2 2 0 1 1 2 0 2 ...
## $ FNL.2 : int 2 2 0 1 2 2 2 0 2 1 ...
## $ FSP.1 : int 60 69 63 57 73 59 69 68 78 61 ...
## $ FSW.1 : int 21 23 17 36 34 38 40 31 21 44 ...
## [list output truncated]
Looking at the structure of the data, most variables are either factors (such as the player names) or integer match statistics. As a result, summary statistics of the raw table provide little meaningful information. In the men's Wimbledon data, variables such as first serve percentage (FSP) and first serves won (FSW) are not supplied on a continuous scale, and the same logic applies to all of the other variables, including those in the women's Wimbledon dataset.
The first issue with the data is that several variables could be used to calculate the winner rather than predict the winner. This is especially true of the number of games each player wins in each set and the final tally, since a player wins by taking three sets in the men's draw or two sets in the women's. For this reason, any variable the model could use to simply deduce the result is removed, such as FNL (final games won) and ST1-ST5, which contain the set-by-set results. The player names are also removed because we want the results to generalize to any player.
library(tidyverse)
#CLEANING MENS SINGLES DATA WIMBLEDON 2013
wmen$Result <- as.factor(wmen$Result) # changing class of Result variable to factor
wmen_drop = subset(wmen, select = c(Result, Round, FSP.1, SSP.1, ACE.1, DBF.1, WNR.1, UFE.1, BPC.1, NPA.1, FSP.2, SSP.2, ACE.2, DBF.2, WNR.2, UFE.2, BPC.2, NPA.2)) # selecting variables for analysis
wmen_drop[is.na(wmen_drop)] <- 0 # Replacing NAs with 0's to be safe
wm_df <- data.frame(wmen_drop) # manipulating resulting data into a dataframe
#CLEANING WOMENS SINGLES DATA WIMBLEDON 2013
wwmen$Result <- as.factor(wwmen$Result) # changing class of result variable to factor
wwmen_drop = subset(wwmen, select = c(Result, Round, FSP.1, SSP.1, ACE.1, DBF.1, WNR.1, UFE.1, BPC.1, NPA.1, FSP.2, SSP.2, ACE.2, DBF.2, WNR.2, UFE.2, BPC.2, NPA.2)) # selecting variables for analysis
wwmen_drop[is.na(wwmen_drop)] <- 0 # Replacing NAs with 0's to be safe
wwm_df <- data.frame(wwmen_drop) # manipulating resulting data into a dataframe
table(wm_df$Result)[2] / sum(table(wm_df$Result))
## 1
## 0.4824561
### Conclusion: the base rate is 48% for player 1 winning. Hopefully our algorithm performs better than this!
library(rpart)
library(rpart.plot)
set.seed(1999)
wm_men_tree_gini = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "gini"), #<- method for choosing tree split
data = wm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wm_men_tree_gini, type =4, extra = 101) # displays gini tree
wm_men_tree_info = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "information"), #<- method for choosing tree split
data = wm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wm_men_tree_info, type =4, extra = 101) # displays information tree
library(caret)
wm_men_tree_gini_pred = predict(wm_men_tree_gini, type= "class") # predicting using gini tree
confusionMatrix(as.factor(wm_men_tree_gini_pred), as.factor(wm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for gini tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 56 9
## 1 3 46
##
## Accuracy : 0.8947
## 95% CI : (0.8233, 0.9444)
## No Information Rate : 0.5175
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7884
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.8364
## Specificity : 0.9492
## Pos Pred Value : 0.9388
## Neg Pred Value : 0.8615
## Prevalence : 0.4825
## Detection Rate : 0.4035
## Detection Prevalence : 0.4298
## Balanced Accuracy : 0.8928
##
## 'Positive' Class : 1
##
wm_men_tree_info_pred = predict(wm_men_tree_info, type= "class") # predicting using information tree
confusionMatrix(as.factor(wm_men_tree_info_pred), as.factor(wm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for information tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 48 3
## 1 11 52
##
## Accuracy : 0.8772
## 95% CI : (0.8025, 0.9312)
## No Information Rate : 0.5175
## P-Value [Acc > NIR] : 3.365e-16
##
## Kappa : 0.7553
##
## Mcnemar's Test P-Value : 0.06137
##
## Sensitivity : 0.9455
## Specificity : 0.8136
## Pos Pred Value : 0.8254
## Neg Pred Value : 0.9412
## Prevalence : 0.4825
## Detection Rate : 0.4561
## Detection Prevalence : 0.5526
## Balanced Accuracy : 0.8795
##
## 'Positive' Class : 1
##
Based on the confusion matrices for our two models, we decided to move forward with the Gini-split tree because of its higher accuracy (89.5% versus 87.7%). In this case we did not favor sensitivity or specificity over the other, because predicting Player 1 wins and Player 1 losses correctly are equally important outcomes, so overall accuracy was our deciding metric.
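The same comparison can also be made programmatically rather than read off the printed output. A minimal sketch, assuming the two caret confusion matrices are saved to objects first (the names gini_cm and info_cm are ours):
gini_cm <- confusionMatrix(as.factor(wm_men_tree_gini_pred), as.factor(wm_df$Result), positive = "1") # gini tree confusion matrix, stored
info_cm <- confusionMatrix(as.factor(wm_men_tree_info_pred), as.factor(wm_df$Result), positive = "1") # information tree confusion matrix, stored
round(c(gini = gini_cm$overall["Accuracy"], information = info_cm$overall["Accuracy"]), 3) # side-by-side accuracy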
## Plotting Cp values
plotcp(wm_men_tree_gini)
## Creating a CP Table
cp_gini <- as.data.frame(wm_men_tree_gini$cptable) # turning cp table into a data frame
cp_gini$opt <- cp_gini$`rel error` + cp_gini$xstd # threshold for each cp: relative error plus cross-validation standard deviation
cp_gini
## CP nsplit rel error xerror xstd opt
## 1 0.50909091 0 1.0000000 1.1454545 0.09652510 1.0965251
## 2 0.10909091 1 0.4909091 0.7454545 0.09316188 0.5840710
## 3 0.05454545 3 0.2727273 0.5818182 0.08723019 0.3599575
## 4 0.01000000 4 0.2181818 0.4181818 0.07790575 0.2960876
# Conclusions: in every row of the table the cross-validated error (xerror) stays above the relative error plus xstd, so no smaller tree is clearly better and the default cp of 0.01 will be maintained.
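As a cross-check, the conventional one-standard-error rule can be applied to the same table: pick the simplest tree whose cross-validated error is within one standard deviation of the minimum. A minimal sketch using the cp_gini data frame built above (not the exact rule applied above, but it leads to the same choice here):
best <- which.min(cp_gini$xerror) # row with the lowest cross-validated error
threshold <- cp_gini$xerror[best] + cp_gini$xstd[best] # one standard error above that minimum
cp_gini$CP[min(which(cp_gini$xerror <= threshold))] # simplest tree within the threshold; here only the last row qualifies, so cp = 0.01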
set.seed(1999)
wm_men_tree_gini2 = rpart(Result~.,
method = "class",
parms = list(split = "gini"), #<- method for choosing tree split
data = wm_df, #<- data used
control = rpart.control(cp=0.01, minsplit=10)) #new parameters based on what we learned from the cp table
rpart.plot(wm_men_tree_gini2, type =4, extra = 101) # displaying pruned gini tree
wm_men_tree_gini2_pred = predict(wm_men_tree_gini2, type= "class") # predicting using pruned gini tree
confusionMatrix(as.factor(wm_men_tree_gini2_pred), as.factor(wm_df$Result)) # obtaining confusion matrix for new gini tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57 8
## 1 2 47
##
## Accuracy : 0.9123
## 95% CI : (0.8446, 0.9571)
## No Information Rate : 0.5175
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8237
##
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9661
## Specificity : 0.8545
## Pos Pred Value : 0.8769
## Neg Pred Value : 0.9592
## Prevalence : 0.5175
## Detection Rate : 0.5000
## Detection Prevalence : 0.5702
## Balanced Accuracy : 0.9103
##
## 'Positive' Class : 0
##
# To prune our initial Gini tree, we first checked whether there was a better cp value; the analysis above showed that 0.01 was sufficient. We then looked for signs of overfitting. We tested a range of minsplit values (4-15) and found that 10 gave higher accuracy with limited overfitting. For example, a minsplit of 4 pushed accuracy up to 94%, but the tree was clearly overfitting, while a minsplit of 15 left accuracy unchanged from the default minsplit of 20.
With the new parameters, accuracy rose from the original model's 89% to 91%. The final Gini tree uses a cp of 0.01 and a minsplit of 10.
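The minsplit search described above can be reproduced with a short loop. A minimal sketch (the helper objects minsplit_grid and gini_acc are ours); note that accuracy here is still measured on the training data, as elsewhere in this analysis:
minsplit_grid <- 4:15 # range of minsplit values tested
gini_acc <- sapply(minsplit_grid, function(m) {
  fit <- rpart(Result ~ ., method = "class", parms = list(split = "gini"),
               data = wm_df, control = rpart.control(cp = 0.01, minsplit = m)) # refit with candidate minsplit
  mean(predict(fit, type = "class") == wm_df$Result) # in-sample accuracy of the refit tree
})
data.frame(minsplit = minsplit_grid, accuracy = round(gini_acc, 3)) # compare accuracy across the grid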
table(wwm_df$Result)[2] / sum(table(wwm_df$Result))
## 1
## 0.5491803
### Conclusion: the base rate is about 55% for player 1 winning. Hopefully our algorithm performs better than this!
set.seed(1999)
wwm_men_tree_gini = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "gini"), #<- method for choosing tree split
data = wwm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wwm_men_tree_gini, type =4, extra = 101) # plotting women's gini tree
wwm_men_tree_info = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "information"), #<- method for choosing tree split
data = wwm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wwm_men_tree_info, type =4, extra = 101) # plotting women's information tree
library(caret)
wwm_men_tree_gini_pred = predict(wwm_men_tree_gini, type= "class") # predicting using gini tree
confusionMatrix(as.factor(wwm_men_tree_gini_pred), as.factor(wwm_df$Result), positive="1", dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for gini tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 38 2
## 1 17 65
##
## Accuracy : 0.8443
## 95% CI : (0.7675, 0.9036)
## No Information Rate : 0.5492
## P-Value [Acc > NIR] : 4.341e-12
##
## Kappa : 0.6776
##
## Mcnemar's Test P-Value : 0.001319
##
## Sensitivity : 0.9701
## Specificity : 0.6909
## Pos Pred Value : 0.7927
## Neg Pred Value : 0.9500
## Prevalence : 0.5492
## Detection Rate : 0.5328
## Detection Prevalence : 0.6721
## Balanced Accuracy : 0.8305
##
## 'Positive' Class : 1
##
wwm_men_tree_info_pred = predict(wwm_men_tree_info, type= "class") # predicting using information tree
confusionMatrix(as.factor(wwm_men_tree_info_pred), as.factor(wwm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for information tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 49 11
## 1 6 56
##
## Accuracy : 0.8607
## 95% CI : (0.7863, 0.9167)
## No Information Rate : 0.5492
## P-Value [Acc > NIR] : 1.951e-13
##
## Kappa : 0.7209
##
## Mcnemar's Test P-Value : 0.332
##
## Sensitivity : 0.8358
## Specificity : 0.8909
## Pos Pred Value : 0.9032
## Neg Pred Value : 0.8167
## Prevalence : 0.5492
## Detection Rate : 0.4590
## Detection Prevalence : 0.5082
## Balanced Accuracy : 0.8634
##
## 'Positive' Class : 1
##
Based on these results, the information tree had a higher accuracy (86% versus 84% for the Gini index), so we decided to use it going forward. As before, we weighted overall accuracy rather than sensitivity or specificity, for the reasons stated above.
## Plotting the Cp
plotcp(wwm_men_tree_info)
## Creating a Cp Table
cp_info <- as.data.frame(wwm_men_tree_info$cptable) # turning cp table into a data frame
cp_info$opt <- cp_info$`rel error` + cp_info$xstd # threshold for each cp: relative error plus cross-validation standard deviation
cp_info
## CP nsplit rel error xerror xstd opt
## 1 0.36363636 0 1.0000000 1.0000000 0.09992546 1.0999255
## 2 0.12727273 1 0.6363636 0.8363636 0.09732919 0.7336928
## 3 0.07272727 3 0.3818182 0.6181818 0.09004056 0.4718587
## 4 0.01000000 4 0.3090909 0.5272727 0.08548657 0.3945775
set.seed(1999)
wwm_men_tree_info2 = rpart(Result~.,
method = "class",
parms = list(split = "information"), #<- method for choosing tree split
data = wwm_df, #<- data used
control = rpart.control(cp=.01, minsplit = 14)) #changing the parameters based on new cp insights
rpart.plot(wwm_men_tree_info2, type =4, extra = 101) # plotting pruned information tree
wwm_men_tree_info2_pred = predict(wwm_men_tree_info2, type= "class") # predicting using new information tree
confusionMatrix(as.factor(wwm_men_tree_info2_pred), as.factor(wwm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for new information tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 42 2
## 1 13 65
##
## Accuracy : 0.877
## 95% CI : (0.8053, 0.9295)
## No Information Rate : 0.5492
## P-Value [Acc > NIR] : 6.731e-15
##
## Kappa : 0.7472
##
## Mcnemar's Test P-Value : 0.009823
##
## Sensitivity : 0.9701
## Specificity : 0.7636
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.9545
## Prevalence : 0.5492
## Detection Rate : 0.5328
## Detection Prevalence : 0.6393
## Balanced Accuracy : 0.8669
##
## 'Positive' Class : 1
##
# The cp table suggested an optimal cp value of 0.127. We tried this value with a range of minsplit settings, but accuracy dropped relative to the cp = 0.01 tree and stayed flat across minsplit values 1-25. We then tried the optimal cp value suggested by the plot, 0.027, which also failed to improve accuracy over the same range of minsplit values. We ultimately kept our original cp of 0.01 and found that a minsplit of 14 gave the highest accuracy with the least overfitting.
With the new parameters, the model's accuracy improved from 86% to 87.7%, a small but real gain.
After creating both decision trees, we learned a lot about which variables matter most for winning. Both the men's and women's trees split on break points created, which is not surprising, as break points, once converted, lead directly to games won. What was interesting is that the men's tree also used aces while the women's tree also used winners hit. Both shots guarantee a point for the player who hits them, but the split suggests that aces were more predictive of winning in the men's draw and winners more predictive in the women's.
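A quick way to sanity-check these impressions is rpart's built-in variable importance scores, which the fitted objects already carry. A minimal sketch, scaled so the top variable in each tree is 100:
round(100 * wm_men_tree_gini2$variable.importance / max(wm_men_tree_gini2$variable.importance)) # men's final tree
round(100 * wwm_men_tree_info2$variable.importance / max(wwm_men_tree_info2$variable.importance)) # women's final tree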
We created two default trees for each of the men's and women's datasets, one using the Gini index as the split method and one using information, and carried forward whichever had the higher accuracy. With the men's data the Gini tree was more accurate, and with the women's data the information tree was. This made us glad we had tested both, and the process gave us more insight into the Gini versus information splitting methods.
While tuning the hyperparameters for both trees, it became clear that overfitting is easy to fall into with a dataset of this size. We tested multiple combinations of cp and minsplit values to build trees that were more accurate without being badly overfit.
rpart.plot(wm_men_tree_gini2, type =4, extra = 101) #plotting final men's tree
# The men's final tree suggests the most important factors in winning a match were creating more than 8 break points, hitting more than 17 aces, and committing fewer than 42 unforced errors.
rpart.plot(wwm_men_tree_info2, type =4, extra = 101) #plotting final women's tree
# The women's final tree suggests that winning a match required creating more than 5 break points, a second serve percentage of 34 or less, hitting 22 winners, and fewer than 3 aces.
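As a readable cross-check of the thresholds quoted in the comments above, rpart.plot can also print each tree's decision rules as text. A minimal sketch:
rpart.rules(wm_men_tree_gini2) # text rules for the men's final tree
rpart.rules(wwm_men_tree_info2) # text rules for the women's final tree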
Further analyses could be conducted on data from other years to give more confidence when generalizing about which variables matter most for winning at Wimbledon. To improve on the decision trees built here, additional hyperparameters could be tuned to prune the trees further, and the relationship between the optimal cp and minsplit values could be explored more systematically to balance high accuracy against overfitting on these specific datasets. It would also be interesting to compare these results against a random forest model as well as a deep learning model; because those models use different algorithms and decision-making processes, the differences in their results would provide more information.
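As a possible starting point for the random forest comparison mentioned above, a minimal sketch using the randomForest package (not used elsewhere in this analysis; the object name wm_rf is ours):
library(randomForest)
set.seed(1999)
wm_rf <- randomForest(Result ~ ., data = wm_df, ntree = 500, importance = TRUE) # random forest on the cleaned men's data
wm_rf # the out-of-bag error gives a less optimistic accuracy estimate than in-sample predictions
varImpPlot(wm_rf) # compare the forest's variable importance with the single-tree splits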