Predicting Outcome of ATP Matches

Katie Evans
January 30, 2018

Introduction to tennis

Many factors might correlate with a player winning or losing a match:
- If a player has a bad day and doesn't play their best tennis, they are more likely to lose.
In this dataset curated by Jeff Sackmann, we can look at every match in 2017. Variables of interest are:
- Aces: when the server wins the point with the serve
- Double faults: when the server misses both serves (and thus loses the point)
- % First Serves In: How often the player gets the first serve in
- % First Serve Points Won: When the player gets the first serve in, do they win the point?
- % Second Serve Points Won: If the player has to use a second serve, do they win the point?
- Rank: ATP tour rank of the player
- % Break Points Saved: Number of break points saved out of total break points faced
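
As a rough illustration of how the percentage variables above can be derived from raw serve counts, here is a minimal Python sketch. The counts and dictionary keys are hypothetical (they are not taken from the dataset), and it assumes that second-serve points are the serve points where the first serve missed, so double faults count as second-serve points lost.

```python
def pct(numerator, denominator):
    """Return numerator / denominator as a percentage, or None if undefined."""
    if denominator == 0:
        return None  # e.g. a player who never faced a break point
    return 100.0 * numerator / denominator


# Hypothetical serve counts for one player in one match (illustrative only).
match = {
    "serve_points": 80,
    "first_serves_in": 50,
    "first_serve_points_won": 38,
    "second_serve_points_won": 15,
    "break_points_faced": 6,
    "break_points_saved": 4,
}

# Assumption: every serve point without a first serve in goes to a second serve.
second_serve_points = match["serve_points"] - match["first_serves_in"]

stats = {
    "pct_first_serves_in": pct(match["first_serves_in"], match["serve_points"]),
    "pct_first_serve_points_won": pct(
        match["first_serve_points_won"], match["first_serves_in"]
    ),
    "pct_second_serve_points_won": pct(
        match["second_serve_points_won"], second_serve_points
    ),
    "pct_break_points_saved": pct(
        match["break_points_saved"], match["break_points_faced"]
    ),
}

for name, value in stats.items():
    print(name, "n/a" if value is None else f"{value:.1f}")
```

Returning `None` when the denominator is zero keeps matches with no break points faced from being scored as 0% saved, which would otherwise look like a very bad serving day.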

My Application - EXPLORE tab

In the EXPLORE tab, you can view the relationship between any of these 7 variables and the outcome of the match as a boxplot.
The example below shows the percentage of break points saved, grouped by outcome. Players who won the match saved, on average, a higher percentage of break points than players who lost.

[Figure: boxplot of percent break points saved, grouped by match outcome]

My Application - MODEL tab

In the MODEL tab, you can generate a random forest model using any or all of the 7 variables to predict the outcome of the match.
Select as many variables as you want, then click the Calibrate Model button to generate the new model.
The output is a confusion matrix showing the accuracy of the model:

Confusion Matrix and Statistics

          Reference
Prediction Lose Win
      Lose  349 151
      Win   114 274

               Accuracy : 0.7016          
                 95% CI : (0.6703, 0.7315)
    No Information Rate : 0.5214          
    P-Value [Acc > NIR] : <2e-16          

                  Kappa : 0.3999          
 Mcnemar's Test P-Value : 0.027           

            Sensitivity : 0.7538          
            Specificity : 0.6447          
         Pos Pred Value : 0.6980          
         Neg Pred Value : 0.7062          
             Prevalence : 0.5214          
         Detection Rate : 0.3930          
   Detection Prevalence : 0.5631          
      Balanced Accuracy : 0.6992          

       'Positive' Class : Lose
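
To see where the headline numbers come from, the statistics can be recomputed by hand from the four cell counts in the matrix above. This is a minimal Python sketch, not the code behind the app; the variable names are my own, and "Lose" is treated as the positive class, as in the output.

```python
# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference; positive class = Lose).
tp = 349  # predicted Lose, actually Lose
fp = 151  # predicted Lose, actually Win
fn = 114  # predicted Win, actually Lose
tn = 274  # predicted Win, actually Win

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of actual losses predicted as Lose
specificity = tn / (tn + fp)   # share of actual wins predicted as Win
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
prevalence = (tp + fn) / total
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
kappa = (accuracy - expected) / (1 - expected)

metrics = {
    "Accuracy": accuracy,
    "Kappa": kappa,
    "Sensitivity": sensitivity,
    "Specificity": specificity,
    "Pos Pred Value": pos_pred_value,
    "Neg Pred Value": neg_pred_value,
    "Prevalence": prevalence,
    "Balanced Accuracy": balanced_accuracy,
}

for name, value in metrics.items():
    print(f"{name:>18}: {value:.4f}")
```

The no-information rate in the output matches the prevalence of the larger class (Lose), so the model's accuracy of about 0.70 is being compared against always guessing "Lose".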

Why this is cool

Using just service statistics, we can predict the outcome of a tennis match fairly accurately (about 70% accuracy, against a no-information rate of about 52%)!
This could help players focus on serving (or not, depending on what the model says).
I would love to incorporate other statistics (winners and unforced errors), but finding good datasets is hard!