The Wimbledon Tennis Championships is one of the four Grand Slam tournaments held each tennis season and takes place in the UK. This dataset, from the UCI Machine Learning Repository, contains information about the 2013 Wimbledon Men's and Women's Singles Championships. 2013 was a notable year for Wimbledon because Andy Murray, a UK native, won the men's title. Each row of data represents a match played in the tournament and includes a variety of numerical statistics about that match. All of the variables listed below are collected for both players and carry a .1 or .2 suffix in the dataset; for example, FSP.1 is the first serve percentage for Player 1 and FSP.2 is the same statistic for Player 2.
| Variables | Variable Label |
|---|---|
| Player 1 | Name of Player 1 |
| Player 2 | Name of Player 2 |
| Results | 1 if Player 1 wins |
| FSP | First Serve Percentage |
| FSW | First Serves Won |
| SSP | Second Serve Percentage |
| SSW | Second Serves Won |
| ACE | Aces Won |
| DBF | Double Faults Committed |
| WNR | Winners Hit |
| UFE | Unforced Errors Committed |
| BPC | Break Points Created |
| BPW | Break Points Won |
| NPA | Net Points Attempted |
| NPW | Net Points Won |
| TPW | Total Points Won |
| ST1-5 | Results for Sets 1-5 |
| FNL | Final Number of Games Won |
In order to answer this question, we create decision tree models using the data from Wimbledon 2013. After some consideration, we decided to build two separate models for the Women's and Men's tournaments, since they are separate events at Wimbledon. Additionally, we want to compare the Gini and information splitting methods to see which works better for each of the two datasets.
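Both splitting criteria measure how mixed the classes are within a node. For a node with class proportions $p_k$, the Gini index is $G = 1 - \sum_k p_k^2$, while the information criterion uses the entropy $H = -\sum_k p_k \log_2 p_k$; rpart chooses the split that most reduces the selected impurity measure, and the two criteria often, but not always, pick the same splits.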
library(tidyverse)
wmen <- read.csv("Wimbledon-men-2013.csv") #loading in datasets
wwmen <- read.csv("Wimbledon-women-2013.csv")
str(wmen, list.len=8) # viewing structure of dataset and variable classes
## 'data.frame': 114 obs. of 42 variables:
## $ Player1: Factor w/ 77 levels "A.Haider-Maurer",..: 7 40 58 72 65 50 2 39 49 42 ...
## $ Player2: Factor w/ 75 levels "A.Bedene","A.Bogomolov Jr.",..: 8 74 33 2 52 71 7 72 46 57 ...
## $ Round : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Result : int 0 0 1 1 0 0 1 0 0 1 ...
## $ FNL.1 : int 0 1 3 3 0 0 3 0 0 3 ...
## $ FNL.2 : int 3 3 0 0 3 3 1 3 3 0 ...
## $ FSP.1 : int 59 62 72 77 68 59 63 61 61 67 ...
## $ FSW.1 : int 29 77 44 40 61 41 56 47 31 56 ...
## [list output truncated]
str(wwmen, list.len=8)
## 'data.frame': 122 obs. of 42 variables:
## $ Player1: Factor w/ 77 levels "A.Cadantu","A.Cornet",..: 48 18 63 2 76 10 77 25 75 20 ...
## $ Player2: Factor w/ 84 levels "A.Beck","A.Cadantu",..: 79 30 74 81 40 12 80 37 5 32 ...
## $ Round : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Result : int 0 0 1 1 0 0 0 1 0 1 ...
## $ FNL.1 : int 0 0 2 2 0 1 1 2 0 2 ...
## $ FNL.2 : int 2 2 0 1 2 2 2 0 2 1 ...
## $ FSP.1 : int 60 69 63 57 73 59 69 68 78 61 ...
## $ FSW.1 : int 21 23 17 36 34 38 40 31 21 44 ...
## [list output truncated]
Looking at the structure of the data, most variables are either factors (such as the player names) or integer match statistics. As a result, summary statistics of the raw table provide little meaningful information. In the men's Wimbledon data, variables such as first serve percentage (FSP) and first serves won (FSW) are not supplied on a continuous scale, and the same logic applies to all of the other variables, including those in the women's Wimbledon dataset.
The first issue with the data is that several variables could be used to calculate the winner rather than predict the winner. This is especially true of the number of games each player wins in each set and the final tally, since a player wins by taking three sets in the men's draw or two sets in the women's. For this reason, any variable the model could use to simply deduce the result is removed, such as FNL (final games won) and ST1-ST5, which contain the set-by-set results. The player names are also removed because we want the results to generalize to any player.
library(tidyverse)
#CLEANING MENS SINGLES DATA WIMBLEDON 2013
wmen$Result <- as.factor(wmen$Result) # changing class of Result variable to factor
wmen_drop = subset(wmen, select = c(Result, Round, FSP.1, SSP.1, ACE.1, DBF.1, WNR.1, UFE.1, BPC.1, NPA.1, FSP.2, SSP.2, ACE.2, DBF.2, WNR.2, UFE.2, BPC.2, NPA.2)) # selecting variables for analysis
wmen_drop[is.na(wmen_drop)] <- 0 # Replacing NAs with 0's to be safe
wm_df <- data.frame(wmen_drop) # manipulating resulting data into a dataframe
#CLEANING WOMENS SINGLES DATA WIMBLEDON 2013
wwmen$Result <- as.factor(wwmen$Result) # changing class of result variable to factor
wwmen_drop = subset(wwmen, select = c(Result, Round, FSP.1, SSP.1, ACE.1, DBF.1, WNR.1, UFE.1, BPC.1, NPA.1, FSP.2, SSP.2, ACE.2, DBF.2, WNR.2, UFE.2, BPC.2, NPA.2)) # selecting variables for analysis
wwmen_drop[is.na(wwmen_drop)] <- 0 # Replacing NAs with 0's to be safe
wwm_df <- data.frame(wwmen_drop) # manipulating resulting data into a dataframe
table(wm_df$Result)[2] / sum(table(wm_df$Result))
## 1
## 0.4824561
### Conclusion: the base rate is 48% for player 1 winning. Hopefully our algorithm performs better than this!
library(rpart)
library(rpart.plot)
set.seed(1999)
wm_men_tree_gini = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "gini"), #<- method for choosing tree split
data = wm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wm_men_tree_gini, type =4, extra = 101) # displays gini tree
wm_men_tree_info = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "information"), #<- method for choosing tree split
data = wm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wm_men_tree_info, type =4, extra = 101) # displays information tree
library(caret)
wm_men_tree_gini_pred = predict(wm_men_tree_gini, type= "class") # predicting using gini tree
confusionMatrix(as.factor(wm_men_tree_gini_pred), as.factor(wm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for gini tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 56 9
## 1 3 46
##
## Accuracy : 0.8947
## 95% CI : (0.8233, 0.9444)
## No Information Rate : 0.5175
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7884
##
## Mcnemar's Test P-Value : 0.1489
##
## Sensitivity : 0.8364
## Specificity : 0.9492
## Pos Pred Value : 0.9388
## Neg Pred Value : 0.8615
## Prevalence : 0.4825
## Detection Rate : 0.4035
## Detection Prevalence : 0.4298
## Balanced Accuracy : 0.8928
##
## 'Positive' Class : 1
##
wm_men_tree_info_pred = predict(wm_men_tree_info, type= "class") # predicting using information tree
confusionMatrix(as.factor(wm_men_tree_info_pred), as.factor(wm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for information tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 48 3
## 1 11 52
##
## Accuracy : 0.8772
## 95% CI : (0.8025, 0.9312)
## No Information Rate : 0.5175
## P-Value [Acc > NIR] : 3.365e-16
##
## Kappa : 0.7553
##
## Mcnemar's Test P-Value : 0.06137
##
## Sensitivity : 0.9455
## Specificity : 0.8136
## Pos Pred Value : 0.8254
## Neg Pred Value : 0.9412
## Prevalence : 0.4825
## Detection Rate : 0.4561
## Detection Prevalence : 0.5526
## Balanced Accuracy : 0.8795
##
## 'Positive' Class : 1
##
Based on the confusion matrices for our two models, we decided to move forward with the Gini-split tree because of its higher accuracy (89.5% versus 87.7%). In this case we did not favor sensitivity or specificity over the other, because predicting Player 1 wins and Player 1 losses correctly are equally important outcomes, so overall accuracy was our deciding metric.
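The same comparison can also be made programmatically rather than read off the printed output. A minimal sketch, assuming the two caret confusion matrices are saved to objects first (the names gini_cm and info_cm are ours):
gini_cm <- confusionMatrix(as.factor(wm_men_tree_gini_pred), as.factor(wm_df$Result), positive = "1") # gini tree confusion matrix, stored
info_cm <- confusionMatrix(as.factor(wm_men_tree_info_pred), as.factor(wm_df$Result), positive = "1") # information tree confusion matrix, stored
round(c(gini = gini_cm$overall["Accuracy"], information = info_cm$overall["Accuracy"]), 3) # side-by-side accuracy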
## Plotting Cp values
plotcp(wm_men_tree_gini)
## Creating a CP Table
cp_gini <- as.data.frame(wm_men_tree_gini$cptable) # turning cp table into a data frame
cp_gini$opt <- cp_gini$`rel error` + cp_gini$xstd # threshold for each cp: relative error plus cross-validation standard deviation
cp_gini
## CP nsplit rel error xerror xstd opt
## 1 0.50909091 0 1.0000000 1.1454545 0.09652510 1.0965251
## 2 0.10909091 1 0.4909091 0.7454545 0.09316188 0.5840710
## 3 0.05454545 3 0.2727273 0.5818182 0.08723019 0.3599575
## 4 0.01000000 4 0.2181818 0.4181818 0.07790575 0.2960876
# Conclusions: in every row of the table the cross-validated error (xerror) stays above the relative error plus xstd, so no smaller tree is clearly better and the default cp of 0.01 will be maintained.
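As a cross-check, the conventional one-standard-error rule can be applied to the same table: pick the simplest tree whose cross-validated error is within one standard deviation of the minimum. A minimal sketch using the cp_gini data frame built above (not the exact rule applied above, but it leads to the same choice here):
best <- which.min(cp_gini$xerror) # row with the lowest cross-validated error
threshold <- cp_gini$xerror[best] + cp_gini$xstd[best] # one standard error above that minimum
cp_gini$CP[min(which(cp_gini$xerror <= threshold))] # simplest tree within the threshold; here only the last row qualifies, so cp = 0.01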
set.seed(1999)
wm_men_tree_gini2 = rpart(Result~.,
method = "class",
parms = list(split = "gini"), #<- method for choosing tree split
data = wm_df, #<- data used
control = rpart.control(cp=0.01, minsplit=10)) #new parameters based on what we learned from the cp table
rpart.plot(wm_men_tree_gini2, type =4, extra = 101) # displaying pruned gini tree
wm_men_tree_gini2_pred = predict(wm_men_tree_gini2, type= "class") # predicting using pruned gini tree
confusionMatrix(as.factor(wm_men_tree_gini2_pred), as.factor(wm_df$Result)) # obtaining confusion matrix for new gini tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57 8
## 1 2 47
##
## Accuracy : 0.9123
## 95% CI : (0.8446, 0.9571)
## No Information Rate : 0.5175
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8237
##
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9661
## Specificity : 0.8545
## Pos Pred Value : 0.8769
## Neg Pred Value : 0.9592
## Prevalence : 0.5175
## Detection Rate : 0.5000
## Detection Prevalence : 0.5702
## Balanced Accuracy : 0.9103
##
## 'Positive' Class : 0
##
# To prune our initial Gini tree, we first checked whether there was a better cp value; the analysis above showed that 0.01 was sufficient. We then looked for signs of overfitting. We tested a range of minsplit values (4-15) and found that 10 gave higher accuracy with limited overfitting. For example, a minsplit of 4 pushed accuracy up to 94%, but the tree was clearly overfitting, while a minsplit of 15 left accuracy unchanged from the default minsplit of 20.
With the new parameters, accuracy rose from the original model's 89% to 91%. The final Gini tree uses a cp of 0.01 and a minsplit of 10.
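The minsplit search described above can be reproduced with a short loop. A minimal sketch (the helper objects minsplit_grid and gini_acc are ours); note that accuracy here is still measured on the training data, as elsewhere in this analysis:
minsplit_grid <- 4:15 # range of minsplit values tested
gini_acc <- sapply(minsplit_grid, function(m) {
  fit <- rpart(Result ~ ., method = "class", parms = list(split = "gini"),
               data = wm_df, control = rpart.control(cp = 0.01, minsplit = m)) # refit with candidate minsplit
  mean(predict(fit, type = "class") == wm_df$Result) # in-sample accuracy of the refit tree
})
data.frame(minsplit = minsplit_grid, accuracy = round(gini_acc, 3)) # compare accuracy across the grid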
table(wwm_df$Result)[2] / sum(table(wwm_df$Result))
## 1
## 0.5491803
### Conclusion: the base rate is about 55% for player 1 winning. Hopefully our algorithm performs better than this!
set.seed(1999)
wwm_men_tree_gini = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "gini"), #<- method for choosing tree split
data = wwm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wwm_men_tree_gini, type =4, extra = 101) # plotting women's gini tree
wwm_men_tree_info = rpart(Result~., # testing Result against all other variables
method = "class", #<- specify method, use "class" for tree
parms = list(split = "information"), #<- method for choosing tree split
data = wwm_df, #<- data used
control = rpart.control(cp=.01))
rpart.plot(wwm_men_tree_info, type =4, extra = 101) # plotting women's information tree
library(caret)
wwm_men_tree_gini_pred = predict(wwm_men_tree_gini, type= "class") # predicting using gini tree
confusionMatrix(as.factor(wwm_men_tree_gini_pred), as.factor(wwm_df$Result), positive="1", dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for gini tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 38 2
## 1 17 65
##
## Accuracy : 0.8443
## 95% CI : (0.7675, 0.9036)
## No Information Rate : 0.5492
## P-Value [Acc > NIR] : 4.341e-12
##
## Kappa : 0.6776
##
## Mcnemar's Test P-Value : 0.001319
##
## Sensitivity : 0.9701
## Specificity : 0.6909
## Pos Pred Value : 0.7927
## Neg Pred Value : 0.9500
## Prevalence : 0.5492
## Detection Rate : 0.5328
## Detection Prevalence : 0.6721
## Balanced Accuracy : 0.8305
##
## 'Positive' Class : 1
##
wwm_men_tree_info_pred = predict(wwm_men_tree_info, type= "class") # predicting using information tree
confusionMatrix(as.factor(wwm_men_tree_info_pred), as.factor(wwm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for information tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 49 11
## 1 6 56
##
## Accuracy : 0.8607
## 95% CI : (0.7863, 0.9167)
## No Information Rate : 0.5492
## P-Value [Acc > NIR] : 1.951e-13
##
## Kappa : 0.7209
##
## Mcnemar's Test P-Value : 0.332
##
## Sensitivity : 0.8358
## Specificity : 0.8909
## Pos Pred Value : 0.9032
## Neg Pred Value : 0.8167
## Prevalence : 0.5492
## Detection Rate : 0.4590
## Detection Prevalence : 0.5082
## Balanced Accuracy : 0.8634
##
## 'Positive' Class : 1
##
Based on these results, the information tree had a higher accuracy (86% versus 84% for the Gini index), so we decided to use it going forward. As before, we weighted overall accuracy rather than sensitivity or specificity, for the reasons stated above.
## Plotting the Cp
plotcp(wwm_men_tree_info)
## Creating a Cp Table
cp_info <- as.data.frame(wwm_men_tree_info$cptable) # turning cp table into a data frame
cp_info$opt <- cp_info$`rel error` + cp_info$xstd # threshold for each cp: relative error plus cross-validation standard deviation
cp_info
## CP nsplit rel error xerror xstd opt
## 1 0.36363636 0 1.0000000 1.0000000 0.09992546 1.0999255
## 2 0.12727273 1 0.6363636 0.8363636 0.09732919 0.7336928
## 3 0.07272727 3 0.3818182 0.6181818 0.09004056 0.4718587
## 4 0.01000000 4 0.3090909 0.5272727 0.08548657 0.3945775
set.seed(1999)
wwm_men_tree_info2 = rpart(Result~.,
method = "class",
parms = list(split = "information"), #<- method for choosing tree split
data = wwm_df, #<- data used
control = rpart.control(cp=.01, minsplit = 14)) #changing the parameters based on new cp insights
rpart.plot(wwm_men_tree_info2, type =4, extra = 101) # plotting pruned information tree
wwm_men_tree_info2_pred = predict(wwm_men_tree_info2, type= "class") # predicting using new information tree
confusionMatrix(as.factor(wwm_men_tree_info2_pred), as.factor(wwm_df$Result), positive="1",dnn=c("Prediction", "Actual"), mode = "sens_spec") # obtaining confusion matrix for new information tree predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 42 2
## 1 13 65
##
## Accuracy : 0.877
## 95% CI : (0.8053, 0.9295)
## No Information Rate : 0.5492
## P-Value [Acc > NIR] : 6.731e-15
##
## Kappa : 0.7472
##
## Mcnemar's Test P-Value : 0.009823
##
## Sensitivity : 0.9701
## Specificity : 0.7636
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.9545
## Prevalence : 0.5492
## Detection Rate : 0.5328
## Detection Prevalence : 0.6393
## Balanced Accuracy : 0.8669
##
## 'Positive' Class : 1
##
# The cp table suggested an optimal cp value of 0.127. We tried this value with a range of minsplit settings, but accuracy dropped relative to the cp = 0.01 tree and stayed flat across minsplit values 1-25. We then tried the optimal cp value suggested by the plot, 0.027, which also failed to improve accuracy over the same range of minsplit values. We ultimately kept our original cp of 0.01 and found that a minsplit of 14 gave the highest accuracy with the least overfitting.
With the new parameters, the model's accuracy improved from 86% to 87.7%, a small but real gain.
After creating both decision trees, we learned a lot about which variables matter most for winning. Both the men's and women's trees split on break points created, which is not surprising, as break points, once converted, lead directly to games won. What was interesting is that the men's tree also used aces while the women's tree also used winners hit. Both shots guarantee a point for the player who hits them, but the split suggests that aces were more predictive of winning in the men's draw and winners more predictive in the women's.
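A quick way to sanity-check these impressions is rpart's built-in variable importance scores, which the fitted objects already carry. A minimal sketch, scaled so the top variable in each tree is 100:
round(100 * wm_men_tree_gini2$variable.importance / max(wm_men_tree_gini2$variable.importance)) # men's final tree
round(100 * wwm_men_tree_info2$variable.importance / max(wwm_men_tree_info2$variable.importance)) # women's final tree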
We created two default trees for each of the men's and women's datasets, one using the Gini index as the split method and one using information, and carried forward whichever had the higher accuracy. With the men's data the Gini tree was more accurate, and with the women's data the information tree was. This made us glad we had tested both, and the process gave us more insight into the Gini versus information splitting methods.
While tuning the hyperparameters for both trees, it became clear that overfitting is easy to fall into with a dataset of this size. We tested multiple combinations of cp and minsplit values to build trees that were more accurate without being badly overfit.
rpart.plot(wm_men_tree_gini2, type =4, extra = 101) #plotting final men's tree
# The men's final tree suggests the most important factors in winning a match were creating more than 8 break points, hitting more than 17 aces, and committing fewer than 42 unforced errors.
rpart.plot(wwm_men_tree_info2, type =4, extra = 101) #plotting final women's tree
# The women's final tree suggests that winning a match required creating more than 5 break points, a second serve percentage of 34 or less, hitting 22 winners, and fewer than 3 aces.
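As a readable cross-check of the thresholds quoted in the comments above, rpart.plot can also print each tree's decision rules as text. A minimal sketch:
rpart.rules(wm_men_tree_gini2) # text rules for the men's final tree
rpart.rules(wwm_men_tree_info2) # text rules for the women's final tree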
Further analyses could be conducted on data from other years to give more confidence when generalizing about which variables matter most for winning at Wimbledon. To improve on the decision trees built here, additional hyperparameters could be tuned to prune the trees further, and the relationship between the optimal cp and minsplit values could be explored more systematically to balance high accuracy against overfitting on these specific datasets. It would also be interesting to compare these results against a random forest model as well as a deep learning model; because those models use different algorithms and decision-making processes, the differences in their results would provide more information.
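As a possible starting point for the random forest comparison mentioned above, a minimal sketch using the randomForest package (not used elsewhere in this analysis; the object name wm_rf is ours):
library(randomForest)
set.seed(1999)
wm_rf <- randomForest(Result ~ ., data = wm_df, ntree = 500, importance = TRUE) # random forest on the cleaned men's data
wm_rf # the out-of-bag error gives a less optimistic accuracy estimate than in-sample predictions
varImpPlot(wm_rf) # compare the forest's variable importance with the single-tree splits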