Do Certain Tennis Strokes Predict Winning in Women’s Professional Tennis?

Section 1: Introduction

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/WTA logo.jpg")

Section 1.1: Business Background

The Women’s Tennis Association WTA was founded in 1973 by Billie Jean King and currently consists of more than 1600 players from all around the world. The WTA is a sports organization where competitors compete in sanctioned tennis tournaments to earn ranking points, tournament titles, and cash prizes. Part of the organization’s mission is to provide ongoing player and coach development. The WTA uses a player development advisory panel that is comprised of volunteer and independent sports science and medicine experts that advise on things such as: health, sports psychology, primary care, fitness training, adolescent and women’s health, performance coaching, athlete development and training.

Section 1.2: Business Problem

The WTA would like to identify if certain strokes: serves, forehands, backhands, or slice predict winning. By identifying if certain strokes predict winning the organization may be able to help develop better training plans for coaches and players.

Section 1.3: Data

The WTA does not have an extensive list of data or databases that have stored information regarding player strokes publicly available. New technology is emerging; however, comprehensive publicly available data has been created and made available on GitHub. Using this data (access date July 2022) which spans four decades and is part of the Match Charting Project by Jeff Sackmann, an analysis of women’s professional tennis player strokes can be conducted to see what strokes may predict winning.

There is a total of (36) excel spreadsheets, (18) of the spreadsheets are for men’s tennis stats and (18) are for women’s tennis stats.
The focus of this analysis will be on the women’s tennis spreadsheets, more specifically the spreadsheets labeled charting-w-matches as a lookup table regarding player and match information, charting-w-stats-ServeBasics, and charting-w-stats-ShotDirOutcomes(charting-w-stats-ShotDirection is a subset of data from charting-w-stats-ShotDirOutcomes and may be used in some analyses).
While there is a treasure-trove of information in the Match Charting Project, narrowing down the aforementioned spreadsheets will allow for a focused look at most commonly used player strokes that may predict winning.

Section 1.4: Methods

Several types of analysis will be conducted using Excel, Tableau, Tableau Prep, and R-Markdown to predict winning of women’s professional tennis players:

Descriptive Statistics: Used to evaluate the frequencies of player strokes of those who won or lost matches. Descriptive statistics will allow players and coaches to filter stats from player to player to see what strokes players often hit, which will allow coaches and players to develop customized training programs for specific opponents. Descriptive statistics may also reveal potential outliers in the data that can be further evaluated.
Classification Trees and KNN: Used with binary data (win/loss) to predict what contributes to the outcome of yes/no or in this case win/loss. This evaluation may give an idea of strokes or attributes such as player hand, year, or player number that contribute to a players win/loss record.
Logistic Regression: Used with binary data to predict an outcome (win/loss) based on variables in the data set. This will help narrow down which types of strokes are the best at predicting whether a player won or lost. Further evaluation will be done using random forest and bagging models to see which model(s) is best at predicting win/loss and what strokes are most important to winning and losing. Regression analysis in and of itself may yield the most useful method for this particular business problem in that it focuses entirely on predicting win/loss and will give further insight into what variables in the data may contribute to a players win/loss record.
K-means Clustering: Used to see if any groupings can be derived in the data, more specifically are there groups within winning and losing.

Section 1.5: Assumptions

There is a statistically significant relationship between strokes (serves, forehands, backhands, and slice) and winning.
Exploratory analysis and descriptive statistics will yield usable information for player development plans.

* Throughout this document serves will be assessed as a stroke by themselves. Forehands, backhands, and slice will be referenced as a group and this group may be interchangeably called shots or strokes for their assessment in this project.

Section 2: Packages Required

Section 2.1: Package Descriptions

The following seven R packages are used for this project:

tidyverse
- The tidyverse package allows a user to load other files like excel files into R. It is also a package that helps with data visualization and data manipulation.
class
- The class package allows a user to use the class library which allows for KNN to run. The package allows for classifications to be performed on data.
rpart
- Rpart is a package that allows users to run decision trees. It uses the rpart function to fit and predict regression or classification data.
Rpart.plot
- The rpart.plot package allows the user to plot a decision tree. When plotted using the library rpart.plot the user can read what makes up the decision tree such as the root node and the leaves under the root node.
ipred
- The ipred package allows the user to run a bagging model and improves classification and bagging for classification, regression, and survival problems.
randomForest
- The randomForest package uses Breiman’s random forest algorithm for classification and regression and allows the user to run random forest models.
fpc
- The fpc package allows the user to cluster data.

Section 2.2: Loading of Packages

By clicking on the code button shown to the right, you can see the code for loading all packages/libraries needed to complete this project.

#libraries to run if not already in R
#install.packages('tidyverse')
#install.packages('class')
#install.packages('rpart')
#install.packages('rpart.plot')
#install.packages('ipred')
#install.packages('randomForest')
#install.packages('fpc')

# Below are the libraries for this projects analysis
library(tidyverse)#for exploratory analysis
library(readxl) # to read excel files
library(class) #KNN
library(rpart)#decision tree
library(rpart.plot)#decision tree
#library(ipred)#bagging will not be initially run as it creates conflicts with other libraries/will run if it is needed to show analysis
library(randomForest)# random forest
#library(adabag)#boosting muted for purposes of running, but will be posted with code if needed
library(fpc) # for Kmeans clustering

Section 3: Data

Section 3.1: Original Data

The data in this project comes from the match charting project which was started to chart and chronical professional tennis player stats during a match. Everyday people have volunteered to watch a match and chart each shot during a point so that anyone who may have an interest could look up statistical information regarding their favorite player.
The orignal data can be found here.
The four original data sets can be found below:
- Charting-w-matches-The original accessed table consisted of 1819 rows and 17 columns.
- Charting-w-stats-ServeBasics-The original accessed table consisted of 12,067 rows and 11 columns.
- Charting-w-stats-ShotDirection-The original accessed table consisted of 15,809 rows and 8 columns.
- Charting-w-stats-ShotDirOutcomes-The original accessed table consisted of 43,091 rows and 10 columns.
A data dictionary and how the information was collected/charted can be found here

Section 3.2: Data Preparation

The charting-w-matches (aka: lookup table) CSV file was cleaned in excel by doing the following items:
- Added a win/loss column since this is the main focus of this project. This was done by looking up every match and noting whether player 1 or 2 won, this was changed later to 1’s and 0’s for a binary classification format and occurs after the stack mentioned later
- 6 rows from 1779 to 1884 are blank- rows deleted
- Row 1785 is messed up and portions are missing- row deleted due to missing info
- Row 1778 has a problem where about 200 entries were put into one column of data- all of this data was put into a word document and put back into the csv file into the appropriate rows where they should have been
- Columns K, N and P have 1 or 2 items missing those items will be looked up to fill in the missing values
- Fixed a surface type for surface carpet, one was marked Carpet (hard) changed to Carpet
- Changed Australian Open Q by taking the Q away in one entry
- Time has 1111 entries missing and a lot of the entries have various time types reported, since this is over half the column, maybe eliminated in further analyses.
- Court has 541 missing entries- will need to be omitted for some types of analyses
- Umpire has 916 entries missing- will need to be omitted for some types of analyses
- Formatted court names to be the same
- Changed two hardcourts to hard
- Changed Gard to hard
- Combined outdoor clay with clay
- In order to join the serve and lookup tables, the data needs to be stacked with player 1 on top with all corresponding information and player 2 stacked below player 1 with all corresponding information, therefore doubling the data set rows. A unique ID had to be created b/c when you stack there will no longer be a unique identifier in the lookup table. The unique ID was created by adding a column and assigned numbers were created, 1 to 4066.
The charting-w-stats-ServeBasics CSV file was cleaned in excel by doing the following items:
- The serve table does not have any missing values in the columns
- Took out the total rows b/c the information is redundant. If you add player 1 and player 2 together for the matchup you get the total therefore a total row is not needed. Once total rows are removed the total rows in the data set decreases to 8044.
- Modified the “row” column and separated it out into two columns one for set number and one for player number
- Sorted by Player 1 and then Player 2 to stack one on top of the other making 8044 rows
- In order to get the unique identifier into the serve table, the match ID and player number columns were combined and v-lookup was performed with the serve table and lookup table
Join 1- Completed in Tableau Prep
- The lookup table and the serve table were joined on the unique ID in tableau prep
- Duplicate columns were deleted from the join process
- Saved as an excel file
- May be modified later for R use, but used in tableau as is
The charting-w-stats-shotdirection CSV file was cleaned in excel by doing the following items:
- Took out total rows like the serve table since it is redundant, row total decreases to 11,788
- Changed the “row” column name to shot_type and spelled out the type of shots used, Forehand, Backhand, Slice *Sorted by player 1 and then player 2 and stacked on top just like the previous tables and did a v-lookup with the lookup table to get the unique ID like the serve table, this further decreased the row total due to non-matching data to the lookup table to a total of 8,834
Join 2- Completed in Tableau Prep
- At this point it did not make sense to combine all tables together into one table since serves are a different beast in and of themselves and a point can be won on a serve alone, so a second join was created with the lookup table and the shot location table
- The tables were joined on unique ID in tableau prep
- Duplicate columns were deleted from the join process
- Saved as an excel file
- May be modified later for R use but used in tableau as is
The charting-w-stats-shotdiroutcomes CSV file was cleaned in excel by doing the following items:
- After working with the other data in Tableau it was noted that the shot direction table did not have the outcomes of each shot like the serve table did, so the original CSV tables were observed and there is a table with shot direction outcomes and shot direction combined together, this table will be cleaned and joined with the lookup table in a similar fashion as the other tables
- split “row” column at “-” point and made columns shot_type and shot_location
- changed shot_types into their full name F to Forehand, B to Backhand, S to Slice
- changed shot locations to: DTL- down_the_line, DTM-down_middle, XC-crosscourt, IO-inside_out, II-inside_in
Join 3- Completed in Tableau Prep
- Shot direction with outcomes excel sheet was combined with the lookup table
- The tables were joined on unique ID in tableau prep
- Duplicate columns were deleted from the join process
- There is a column called winners in the shot type outcomes and this was relabeled to shot winners to differentiate from overall winner when joining the two tables together
- Saved as an excel file
- May be modified later for R use, but used in tableau as is
Geocoding
- Geocoding can be used to plot X,Y coordinates on an object/picture
- CBI studio was used to plot X,Y coordinates on a tennis court according to the areas labeled for serves and stroke types (Forehand, Backhand, Slice), two separate plots were created, and a CSV file was created with the plot numbers- see image used below for plotting in CBI Studio.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/side view of tennis court.jpg")

The stacked joins of shot types and serves were used because each is joined with player names who are associated with player 1 and player 2 which were only reported in the serves and shot location tables. This is important for this process because player totals and averages are what will be plotted on the image.
Each join with subtracted columns: all columns subtracted to get down to players associated with shot types- see example below.

knitr::include_graphics ("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/CBI studie excel sheet example.jpeg")

* Modifications for R will consist of taking out columns that might have too many levels/categories for assessment such as player, month, day, round, tournament and columns like unique ID and match ID will be removed because they are not probative for the analysis.

Section 4: Exploratory Analysis

Section 4.1: Tableau Exploratory Analysis

Top 25 all time players shown below

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Top 25 players.jpg")

Below graph shows linearity in outcomes for serves that are down the center or wide for forced errors, and points won for all players. There was minimal difference in observed plots when only winners or losers were observed separately.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Serve location outcomes.jpg")

In the initial and joined data sets, players are listed as player 1 and player 2. Player 1 is designated as the player who served first. When looking at those who won, the data shows that the player who served first won more. This is an interesting fact for strategy.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Does serve Matter.jpg")

Body shot serves are observed much less in the top 25 players for overall wins. Body serves are also observed less as a whole with all players.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Avg serve top 25.jpg")

The below graph highlights forehands as the most commonly used shot/stroke in the data used for this project.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Most used shots.jpg")

As observed in the heat map below most shots are hit crosscourt or down the middle of the court for all players as well as those players who win.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Where are most shots hit.jpg")

Again, the forehand and backhand shots crosscourt forced more player errors with the inside out shot coming into play as well. The inside out shot is where the player will run around their backhand to hit a forehand. The trend in the two heat maps above is commonly observed throughout the data and is observed with all players as well as those who win.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Which shots forced errors.jpg")

Below is an interactive chart in Tableau that could be used by coaches and players to explore shots used by certain players. This chart could be modified to view all players and filtered on years. The drop down has the top 25 players of all time, and players can be searched to see what their strengths and weaknesses are; looking at Serena Williams, she hits on average more backhands than forehands crosscourt. This is interesting because most players are stronger with their forehand and will hit more forehands overall.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Interactive strokes.jpg")

Section 4.2: Exploratory Analysis Using R

Serve data below

There are 8044 rows and 21 columns in the serve data set and a mixture of numerical and character variables.

#reading in the serve data
serve <- read_excel("G:/Other computers/My Laptop/Documents/Capstone project/Joined data/Serves joined R.xlsx")
#converting the serve data into a data frame
serves <- as.data.frame(serve)

#observation of what the serve data frame looks like
str(serves)

## 'data.frame':    8044 obs. of  21 variables:
##  $ Unique_ID    : num  1798 1798 1770 1770 1736 ...
##  $ Players      : chr  "Agnieszka Radwanska" "Agnieszka Radwanska" "Svetlana Kuznetsova" "Svetlana Kuznetsova" ...
##  $ Winner       : num  1 1 1 1 1 1 0 0 1 1 ...
##  $ player_num   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ set_num      : num  1 2 1 2 1 2 1 2 1 2 ...
##  $ Pl_Hand      : chr  "R" "R" "R" "R" ...
##  $ Year         : num  2007 2007 2009 2009 2011 ...
##  $ Month        : num  9 9 6 6 6 6 10 10 4 4 ...
##  $ Day          : num  1 1 6 6 14 14 7 7 1 1 ...
##  $ Tournament   : chr  "US Open" "US Open" "Roland Garros" "Roland Garros" ...
##  $ Round        : chr  "R32" "R32" "F" "F" ...
##  $ Surface      : chr  "Hard" "Hard" "Clay" "Clay" ...
##  $ pts          : num  46 50 42 30 79 108 37 43 51 42 ...
##  $ pts_won      : num  24 33 28 16 50 56 21 29 39 29 ...
##  $ aces         : num  1 2 0 0 12 1 1 0 1 3 ...
##  $ unret        : num  1 3 0 0 5 2 1 4 0 0 ...
##  $ forced_err   : num  2 2 5 0 12 10 4 6 6 2 ...
##  $ pts_won3shots: num  13 13 10 6 34 25 8 12 15 15 ...
##  $ wide         : num  11 13 22 12 13 30 11 9 18 16 ...
##  $ body         : num  17 16 9 9 31 43 3 13 11 13 ...
##  $ down_center  : num  18 21 11 9 32 34 23 21 22 13 ...

There are no missing values in the data set

#No missing data
sum(is.na(serves))

## [1] 0

Histograms of numerical variables- All have a right skew which means that most of the data falls below the mean.

par(mfrow=c(3,3))
hist(serves$pts, col="blue", breaks = 30)
hist(serves$pts_won, col="blue", breaks = 30)
hist(serves$aces, col="blue", breaks = 30)
hist(serves$unret, col="blue", breaks = 30)
hist(serves$forced_err, col="blue", breaks = 30)
hist(serves$pts_won3shots, col="blue", breaks = 30)
hist(serves$wide, col="blue", breaks = 30)
hist(serves$body, col="blue", breaks = 30)
hist(serves$down_center, col="blue", breaks = 30)

Scatter plots of numeric variables- Linearity observed with many variables

pairs(serves[,13:21])

Box plots of types of serves (wide, body, and down the center) to winning- All have outliers

par(mfrow=c(1,1))
boxplot(serves[,19]~serves[,3], notch=FALSE, ylab="Wide", xlab="Loser/Winner", col="blue")

boxplot(serves[,20]~serves[,3], notch=FALSE, ylab="Body", xlab="Loser/Winner", col="blue")

boxplot(serves[,21]~serves[,3], notch=FALSE, ylab="Down The Center", xlab="Loser/Winner", col="blue")

Shots/Strokes data below

There are 43,901 rows and 19 columns in the shots data set and a mixture of numerical and character variables

#reading in the serve data
shots<-read_excel("G:/Other computers/My Laptop/Documents/Capstone project/Joined data/Shot type with outcomes joined to lookup for R.xlsx")

#converting the serve data into a data frame
shots<-as.data.frame(shots)

#observation of what the serve data frame looks like
str(shots)

## 'data.frame':    43901 obs. of  19 variables:
##  $ shot_type        : chr  "Forehand" "Forehand" "Forehand" "Forehand" ...
##  $ shot_location    : chr  "crosscourt" "down_middle" "down_the_line" "inside_out" ...
##  $ shots            : num  34 23 12 11 2 19 16 9 8 6 ...
##  $ pt_ending        : num  9 3 5 4 1 5 2 5 1 0 ...
##  $ shot_winners     : num  0 0 0 4 0 2 0 0 0 0 ...
##  $ induced_forced   : num  2 0 1 0 0 2 0 2 1 0 ...
##  $ unforced         : num  7 3 4 0 1 1 2 3 0 0 ...
##  $ shots_in_pts_won : num  20 14 5 9 0 12 12 4 6 2 ...
##  $ shots_in_pts_lost: num  14 9 7 2 2 7 4 5 2 4 ...
##  $ match_id         : chr  "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" ...
##  $ player_num       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Unique ID        : num  1770 1770 1770 1770 1770 1770 1770 1770 1770 1770 ...
##  $ Winner           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Players          : chr  "Svetlana Kuznetsova" "Svetlana Kuznetsova" "Svetlana Kuznetsova" "Svetlana Kuznetsova" ...
##  $ Pl_Hand          : chr  "R" "R" "R" "R" ...
##  $ Year             : num  2009 2009 2009 2009 2009 ...
##  $ Month            : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ Day              : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ Surface          : chr  "Clay" "Clay" "Clay" "Clay" ...

Several variables were eliminated and can be observed by hitting the code button to the right. The reason why they were elimnated can be observed in the code.

shots$match_id=NULL # not needed for analysis
shots$Players=NULL # too many levels, not focus study
shots$Round=NULL #too many levels
shots$Tournament=NULL #too many levels
shots$Month=NULL # too many levels
shots$Day=NULL #too many levels
shots$`Unique ID`=NULL # not needed
str(shots)

## 'data.frame':    43901 obs. of  14 variables:
##  $ shot_type        : chr  "Forehand" "Forehand" "Forehand" "Forehand" ...
##  $ shot_location    : chr  "crosscourt" "down_middle" "down_the_line" "inside_out" ...
##  $ shots            : num  34 23 12 11 2 19 16 9 8 6 ...
##  $ pt_ending        : num  9 3 5 4 1 5 2 5 1 0 ...
##  $ shot_winners     : num  0 0 0 4 0 2 0 0 0 0 ...
##  $ induced_forced   : num  2 0 1 0 0 2 0 2 1 0 ...
##  $ unforced         : num  7 3 4 0 1 1 2 3 0 0 ...
##  $ shots_in_pts_won : num  20 14 5 9 0 12 12 4 6 2 ...
##  $ shots_in_pts_lost: num  14 9 7 2 2 7 4 5 2 4 ...
##  $ player_num       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Winner           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Pl_Hand          : chr  "R" "R" "R" "R" ...
##  $ Year             : num  2009 2009 2009 2009 2009 ...
##  $ Surface          : chr  "Clay" "Clay" "Clay" "Clay" ...

Scatter plots of numerical shot variables- Linearity observed in some shot outcomes

pairs(shots[,3:9]) # most linearity observed in shot outcomes

All histograms of the numerical shot variables have a right skew meaning most of the data falls below the mean

par(mfrow=c(3,3)) # all right skewed

hist(shots$shots, col="blue", breaks = 10)
hist(shots$pt_ending, col="blue", breaks = 10)
hist(shots$shot_winners, col="blue", breaks = 10)
hist(shots$induced_forced, col="blue", breaks = 10)
hist(shots$unforced, col="blue", breaks = 10)

Section 5: Statistical Analysis

Section 5.1: Serve Statistical Analysis

Prior to running the analysis several columns were dropped and some character columns were changed to factors. This process can be viewed in the code button to the right along with the final data frame observed below.

serves$Unique_ID=NULL # not useful
serves$Players=NULL # research is not about individual player
serves$Tournament=NULL # too many levels
serves$pts=NULL # total points not the goal of the research
serves$Round=NULL # too many levels
serves$Month=NULL # too many levels
serves$Day=NULL # too many levels


serves$Pl_Hand<-as.factor(serves$Pl_Hand)
serves$Surface<-as.factor(serves$Surface)

str(serves)

## 'data.frame':    8044 obs. of  14 variables:
##  $ Winner       : num  1 1 1 1 1 1 0 0 1 1 ...
##  $ player_num   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ set_num      : num  1 2 1 2 1 2 1 2 1 2 ...
##  $ Pl_Hand      : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year         : num  2007 2007 2009 2009 2011 ...
##  $ Surface      : Factor w/ 4 levels "Carpet","Clay",..: 4 4 2 2 3 3 4 4 4 4 ...
##  $ pts_won      : num  24 33 28 16 50 56 21 29 39 29 ...
##  $ aces         : num  1 2 0 0 12 1 1 0 1 3 ...
##  $ unret        : num  1 3 0 0 5 2 1 4 0 0 ...
##  $ forced_err   : num  2 2 5 0 12 10 4 6 6 2 ...
##  $ pts_won3shots: num  13 13 10 6 34 25 8 12 15 15 ...
##  $ wide         : num  11 13 22 12 13 30 11 9 18 16 ...
##  $ body         : num  17 16 9 9 31 43 3 13 11 13 ...
##  $ down_center  : num  18 21 11 9 32 34 23 21 22 13 ...

The data set was split into training and test sets with 90% of the data put in the training set as it was observed to perform the best with the data.

set.seed(1234)
sample_serves <- sample(nrow(serves), nrow(serves)*0.90)
serves_train <- serves[sample_serves,]
serves_test <- serves[-sample_serves,]

# 50, 60, 70, 80, and 90 Percent were tried for training and 90% was the best

When running a decision tree on the data it was noted that player number and only player number of 1 is contributing the most to whether or not the player wins. This model has a misclassification rate of 368.

#library(rpart)
#library(rpart.plot)
set.seed(1234)

serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)


serves_rpart1 <- rpart(formula = Winner ~., data = serves_train, method = "class")
pred1 <- predict(serves_rpart1, serves_test, type = "class")
#pred1

###plotting the model
par(mfrow=c(1,1))
#serves_rpart1
prp(serves_rpart1, extra = 1)

###confusion matrix
table(serves_test$Winner, pred1, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 204 182
##    1 186 233

sum(serves_test$Winner != pred1)

## [1] 368

#368 misclassified

Using Logistic Regression and backwards selection, the lower player number is found to be the only variable of importance when predicting winning. This model has a misclassification rate of 371.

set.seed (1234)

serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)

serves_glm0 <- glm(Winner~., family = binomial, data = serves_train)
summary(serves_glm0) # player number only significant variable, lower is better

## 
## Call:
## glm(formula = Winner ~ ., family = binomial, data = serves_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.341  -1.144  -1.062   1.167   1.337  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    6.979e-01  6.082e+00   0.115    0.909    
## player_num    -3.806e-01  7.956e-02  -4.784 1.72e-06 ***
## set_num        7.892e-03  4.730e-02   0.167    0.867    
## Pl_HandR       7.375e-02  7.667e-02   0.962    0.336    
## Year          -7.972e-05  3.029e-03  -0.026    0.979    
## SurfaceClay    1.751e-02  5.063e-01   0.035    0.972    
## SurfaceGrass   1.144e-02  5.074e-01   0.023    0.982    
## SurfaceHard   -1.831e-02  5.046e-01  -0.036    0.971    
## pts_won        2.172e-05  8.295e-03   0.003    0.998    
## aces          -2.107e-03  1.716e-02  -0.123    0.902    
## unret         -3.143e-02  3.270e-02  -0.961    0.336    
## forced_err    -1.290e-02  1.223e-02  -1.055    0.291    
## pts_won3shots  8.085e-03  1.145e-02   0.706    0.480    
## wide          -4.806e-03  5.415e-03  -0.888    0.375    
## body           2.906e-03  4.939e-03   0.588    0.556    
## down_center   -2.100e-03  5.960e-03  -0.352    0.725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10035.2  on 7238  degrees of freedom
## Residual deviance:  9993.8  on 7223  degrees of freedom
## AIC: 10026
## 
## Number of Fisher Scoring iterations: 3

## Prediction
servespred_resp <- predict(serves_glm0, newdata = serves_test, type = "response")

## Create a confusion matrix 0.5 cutoff probability
table(serves_test$Winner, (servespred_resp > 0.5)*1, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth   0   1
##     0 208 178
##     1 193 226

#misclassification rate is 371

A bagging model was run. It had the lowest misclassification rate of all models, at 271. This model does the best job at predicting winning and losing.

#install.packages('ipred')
library(ipred)
set.seed(1234)

serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)

servesbag_model <- bagging(formula = Winner ~., data = serves_train, nbagg = 50)

servesbag_pred <- predict(servesbag_model, newdata = serves_test)
#servesbag_pred

table(serves_test$Winner, servesbag_pred, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 257 129
##    1 142 277

sum(serves_test$Winner != servesbag_pred)

## [1] 271

# 271 misclassified- 50, 100, 500, and 1000 nbaggs tried 50 was the best model

The player number data appeared to be drowning the data. The player number variable was removed to see if any further insights could be gained.

serves$player_num=NULL
str(serves)
set.seed(1234)
sample_serves <- sample(nrow(serves), nrow(serves)*0.90)
serves_train <- serves[sample_serves,]
serves_test <- serves[-sample_serves,]

The decision tree appeared below and had the same misclassification rate as the first run at 368. Forced error became the root node in this model. Causing more then 3 forced errors equated to winning more. Several other models were rerun, and Logistic regression revealed no statistically significant variables this time and had a misclassification rate of 375, while backward selection honed in on the same thing as the decision tree with forced errors being a deciding factor. All in all, nothing was gained from removing the variable player number. Player number is the deciding factor in this data set not serve types.

#library(rpart)
#library(rpart.plot)
set.seed(1234)

serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)


serves_rpart1 <- rpart(formula = Winner ~., data = serves_train, method = "class")
pred1 <- predict(serves_rpart1, serves_test, type = "class")
#pred1

###plotting the model
par(mfrow=c(1,1))
#serves_rpart1
prp(serves_rpart1, extra = 1)

###confusion matrix
table(serves_test$Winner, pred1, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 159 227
##    1 141 278

sum(serves_test$Winner != pred1)

## [1] 368

#368 misclassified

Section 5.2: Stroke Statistical Analysis

The data set was split into training and test sets with 80% of the data put in the training set as it was observed to perform the best with the data (data frame can be observed below). Prior to running the data, Player hand and Surface were changed to factor variables. Shots points in won and shots point in lost were removed from the data, because these two columns dominate the results.

set.seed(1234)

shots$Pl_Hand <- as.factor(shots$Pl_Hand)
shots$Surface <- as.factor(shots$Surface)

shots$shots_in_pts_won=NULL
shots$shots_in_pts_lost=NULL
sample_shots <- sample(nrow(shots), nrow(shots)*0.80) # tried 50, 60 and 70 prior with worse results
shots_train <- shots[sample_shots,]
shots_test <- shots[-sample_shots,]

The decision tree below focused on shot outcomes with shot winners as the root node and had a misclassification rate of 3890. Shot winners are when a player hits a shot and wins the point without the opponent touching the shot. This model does a good job at predicting winning. Following the root node to the right, the player who has more than two winners and less than or equal to five unforced errors is more likely to win. Following the root node to the left, players who have less than two winners, less than or equal to three unforced errors, and player number 1 were more likely to win. It is interesting to see player number plays a part in the strokes data as well.

#library(rpart)
library(rpart.plot)
set.seed(1234)
shots$shot_type <-as.factor(shots$shot_type)
shots$shot_location<-as.factor(shots$shot_location)

shots$shots_in_pts_won=NULL
shots$shots_in_pts_lost=NULL
str(shots)

## 'data.frame':    43901 obs. of  12 variables:
##  $ shot_type     : Factor w/ 3 levels "Backhand","Forehand",..: 2 2 2 2 2 1 1 1 1 3 ...
##  $ shot_location : Factor w/ 5 levels "crosscourt","down_middle",..: 1 2 3 5 4 1 2 3 5 1 ...
##  $ shots         : num  34 23 12 11 2 19 16 9 8 6 ...
##  $ pt_ending     : num  9 3 5 4 1 5 2 5 1 0 ...
##  $ shot_winners  : num  0 0 0 4 0 2 0 0 0 0 ...
##  $ induced_forced: num  2 0 1 0 0 2 0 2 1 0 ...
##  $ unforced      : num  7 3 4 0 1 1 2 3 0 0 ...
##  $ player_num    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Winner        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Pl_Hand       : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year          : num  2009 2009 2009 2009 2009 ...
##  $ Surface       : Factor w/ 4 levels "Carpet","Clay",..: 2 2 2 2 2 2 2 2 2 2 ...

shots_rpart1 <- rpart(formula = Winner ~., data = shots_train, method = "class")
pred1 <- predict(shots_rpart1, shots_test, type = "class")
#pred1

###plotting the model
par(mfrow=c(1,1))
#shots_rpart1
prp(shots_rpart1, extra = 1)

###confusion matrix
table(shots_test$Winner, pred1, dnn = c("True", "Pred"))

##     Pred
## True    0    1
##    0 2506 1917
##    1 1973 2385

sum(shots_test$Winner != pred1)

## [1] 3890

#3890 misclassified

Logistic regression was the best model for this data and showed that forehands, slice, shots down the middle, number of shots, player number and player hand are statistically significant variables for winning. Misclassification rate is 3779. This model does a good job at predicting winning as well. Looking at the statistically significant variables, it is interesting to note that as forehands, slice shots, and player number go down winning increases. As more shots go down the middle winning goes up. This information seems a bit odd to me, because depending on the player, hitting more strokes down the middle could be bad as player’s opponent will not be moving much and will be able to dictate where the point shots will go. Backwards selection did not yield a better misclassification rate.

 #shot type and location changed to factors
shots$shot_type <-as.factor(shots$shot_type)
shots$shot_location<-as.factor(shots$shot_location)

set.seed(1234)
#logistic regression model
shots_glm0 <- glm(Winner~., family = binomial, data = shots_train)
summary(shots_glm0)

## 
## Call:
## glm(formula = Winner ~ ., family = binomial, data = shots_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5245  -1.1552  -0.6164   1.1520   2.1778  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -2.065133   2.689709  -0.768   0.4426    
## shot_typeForehand          -0.103496   0.026297  -3.936 8.30e-05 ***
## shot_typeSlice             -0.063346   0.031372  -2.019   0.0435 *  
## shot_locationdown_middle    0.172743   0.033941   5.090 3.59e-07 ***
## shot_locationdown_the_line  0.001717   0.034900   0.049   0.9608    
## shot_locationinside_in      0.065148   0.059746   1.090   0.2755    
## shot_locationinside_out    -0.021389   0.035834  -0.597   0.5506    
## shots                       0.006156   0.001025   6.007 1.89e-09 ***
## pt_ending                  -0.176739   0.225745  -0.783   0.4337    
## shot_winners                0.342177   0.225908   1.515   0.1299    
## induced_forced              0.291281   0.225931   1.289   0.1973    
## unforced                   -0.003935   0.225970  -0.017   0.9861    
## player_num                 -0.278190   0.021716 -12.810  < 2e-16 ***
## Pl_HandR                    0.084206   0.035328   2.384   0.0171 *  
## Year                        0.001259   0.001339   0.940   0.3470    
## SurfaceClay                -0.149733   0.250304  -0.598   0.5497    
## SurfaceGrass               -0.122758   0.250769  -0.490   0.6245    
## SurfaceHard                -0.137286   0.249603  -0.550   0.5823    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48687  on 35119  degrees of freedom
## Residual deviance: 47503  on 35102  degrees of freedom
## AIC: 47539
## 
## Number of Fisher Scoring iterations: 4

#predictions if needed
shotspred_glm0 <- predict(shots_glm0, newdata = shots_test, type = "response")
#shotspred_resp[500]

## Create a confusion matrix 0.5 cutoff probability
table(shots_test$Winner, (shotspred_glm0 > 0.5)*1, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth    0    1
##     0 2505 1918
##     1 1861 2497

Section 5.3: Additional Stroke Statistical Analysis

After observing the stroke analysis joined with shot outcomes and the lookup table, I had an idea to only use the shot data joined with the lookup table to see if the outcomes were drowning out my shot type/location data. I also wanted to observe if adding shot location and type together would give a clear picture as to how the data was performing since the other data set seemed off when looking at hitting more shots down the middle being more desirable. I filtered the joined data in excel to show only forehands for shot locations and put F in front of all shot locations, then I deleted the shot type column, and made it a separate excel sheet. I also did this for backhands and slice shots. I then joined each sheet together in tableau prep on the unique ID column to create one data set. This should give more insight into what types of shots may contribute to winning and maybe the down the middle significance in the other data set may be more meaningful with this data.

The code used in the button to the right can be observed for reading the data set in and elimination of some columns as well as change some columns to factors for various reasons. The data frame can be observed below prior to running any models. There are 2166 rows 20 variables.

shot<-read.csv("G:/Other computers/My Laptop/Documents/Capstone project/Joined data/Shots only no outcomes/Join R FBS.csv")
shot<-as.data.frame(shot)


shot$Players=NULL #too many levels
shot$Unique.ID=NULL #not informative
shot$Pl_Hand<-as.factor(shot$Pl_Hand) # done to run analysis
shot$Surface<-as.factor(shot$Surface) # done to run analysis

str(shot)#2166 obs. of  20 variables

## 'data.frame':    2166 obs. of  20 variables:
##  $ Winner      : int  1 1 0 1 0 1 1 0 1 1 ...
##  $ Pl_Hand     : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year        : int  2009 2011 2012 2012 2014 2014 2014 2014 2013 2013 ...
##  $ Surface     : Factor w/ 4 levels "Carpet","Clay",..: 2 3 4 2 4 4 4 4 4 4 ...
##  $ player      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Fcrosscourt : int  34 36 37 28 14 31 11 80 17 44 ...
##  $ Fdown_middle: int  23 45 42 15 12 22 8 46 7 33 ...
##  $ FdownDL     : int  12 18 5 3 3 8 2 16 11 20 ...
##  $ Finside_out : int  11 29 22 6 10 25 7 33 21 22 ...
##  $ Finside_in  : int  2 0 1 0 0 2 0 1 0 0 ...
##  $ Bcrosscourt : int  19 49 19 21 48 48 12 34 53 56 ...
##  $ Bdown_middle: int  16 50 28 17 43 34 14 25 15 60 ...
##  $ BdownDL     : int  9 15 8 9 6 9 3 12 6 15 ...
##  $ Binside_out : int  8 12 5 2 9 12 7 13 7 16 ...
##  $ Binside_in  : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Scrosscourt : int  6 2 0 0 0 0 0 0 1 1 ...
##  $ Sdown_middle: int  6 5 1 1 3 5 2 2 3 7 ...
##  $ SdownDL     : int  2 1 0 0 0 0 0 2 1 1 ...
##  $ Sinside_out : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ Sinside_in  : int  0 0 0 0 0 0 0 0 0 0 ...

Exploratory analysis of this data showed that Backhands down the middle were the only shots that backhands outpaced forehands for any other type of shot. This is not surprising, as backhands are in general harder to hit and apply direction so this may contribute to more down the middle shots. This data set showed that numerical data exhibited a right skew similar to serves and there was a lot of linearity in the numerical data.

avgshotdm<- apply(shot[,c(7,12,17)], 2, mean)
barplot(avgshotdm, ylim = c(0, 25), ylab = "Average", col = "blue")  # only shot where backhand is used more than forehand

The data set was split into training and test sets with 80% of the data put in the training set as it was observed to perform the best with the data.

set.seed(1234)

sample_shot <- sample(nrow(shot), nrow(shot)*0.80) # tried 50, 60 and 70 prior with worse results
shot_train <- shot[sample_shot,]
shot_test <- shot[-sample_shot,]

A decision tree was run, and the root node was found to be forehands down the middle, in fact hitting 10 or less would lead to winning if following the root node to the right. The decision tree has a misclassification rate of 203. This information is more telling as to what down the middle meant in the last data set. It isn’t just down the middle, but hitting less forehands down the middle is better. Looking at the left side of the root node with forehands of 10 or more combine with other shots, lead to more winning in a variety of different ways as seen below.

#library(rpart)
#library(rpart.plot)
set.seed(1234)


shot_rpart1 <- rpart(formula = Winner ~., data = shot_train, method = "class")
pred1 <- predict(shot_rpart1, shot_test, type = "class")
#pred1

###plotting the model
par(mfrow=c(1,1))
#shot_rpart1
prp(shot_rpart1, extra = 1)

###confusion matrix
table(shot_test$Winner, pred1, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0  80 131
##    1  72 151

sum(shot_test$Winner != pred1)

## [1] 203

#misclassificaiton rate 203

Backwards selection had a misclassification rate of 192. Backwards selection was the best model with the lowest missclassification rate for this data set. Hitting more forehands crosscourt leads to winning and hitting less forehands down the middle increases the chance of winning which makes more sense than the prior data set, however when looking at the decision tree it is probably a combination of both that will lead to winning.

summary(shot_glm_back)

## 
## Call:
## glm(formula = Winner ~ player + Fcrosscourt + Fdown_middle + 
##     FdownDL, family = binomial, data = shot_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6028  -1.2107   0.9923   1.1311   1.4595  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.388541   0.201364   1.930 0.053662 .  
## player       -0.277310   0.147431  -1.881 0.059979 .  
## Fcrosscourt   0.010197   0.002807   3.632 0.000281 ***
## Fdown_middle -0.010242   0.003743  -2.736 0.006211 ** 
## FdownDL      -0.013148   0.006729  -1.954 0.050703 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2397.2  on 1731  degrees of freedom
## Residual deviance: 2377.2  on 1727  degrees of freedom
## AIC: 2387.2
## 
## Number of Fisher Scoring iterations: 4

## Prediction
shotpred_resp1 <- predict(shot_glm_back, newdata = shot_test, type = "response")

## Create a confusion matrix 0.5 cutoff probability
table(shot_test$Winner, (shotpred_resp1 > 0.5)*1, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth   0   1
##     0  76 135
##     1  57 166

Section 6: Summary

Section 6.1: Revisting The Business Problem and Assumptions

The business problem was to identify if certain strokes: serves, forehands, backhands, or slice predict winning. By identifying if certain strokes predict winning the WTA may be able to help develop better training plans for coaches and players.
Two Assumption were made about the data in section 1.5

There is a statistically significant relationship between strokes (serves, forehands, backhands, and slice) and winning.
Exploratory analysis and descriptive statistics will yield usable information for player development plans.

Section 6.2: Summary of Analysis

In summary the main takeaways from the serve data is that types of serves do not predict winning. Which person served first is the most statistically significant variable when predicting winning with this data. The best model at predicting winning for the serve data is the bagging model. Exploratory analysis of serving showed that when looking at the top 25 all-time winners, most winners, on average used wide serves and down the center serves, with body serves being the least likely serve to use. Serve types were more closely related to serve outcomes when looking at exploratory analysis than winning or losing. Both, serve types and serve outcomes, were not good at predicting winning or losing and this was echoed in the regression analysis when only player number was exposed as the only significant variable in this data set.
The exploratory stroke analysis (forehand, backhand, and slice) revealed most shots are forehand. Most strokes are hit crosscourt or down the middle. Forehand crosscourt strokes cause more errors for the opponent. Backhands down the middle, outpaced forehands and slice, otherwise forehands outpaced backhands and slice in the rest of the data when it came to shot locations.
When combining shots, shot outcomes, and the lookup table the best model was identified as Regression and it identified statistically significant variables as forehands, slice, player number, player hand, shots, and shots down the middle. As forehands, slice shots, and player number go down winning increases and hitting more shots down the middle winning goes up. Hitting more shots down the middle did not seem to make sense as this would make it easier on the opponent to take advantage of the player, by being able to not run as much and dictate where the opponent would like to hit a shot. Since this data did not yield outcomes that made the most sense, the below bullet point with shot types and locations combined into their respective columns and joined with only the lookup table were observed.
The best model, when looking at only the combined shot types and shot locations, was identified as backwards regression, which highlighted forehands crosscourt and forehands down the middle as the only statistically significant variables. As forehands down the middle go down winning goes up and as forehands crosscourt go up so does winning. As far as the ability to cluster, this data did not cluster well.

Section 6.3: Implementation, Key Takeways, Further Ideas

Strokes in and of themselves are not great at predicting winning. There is more of a relationship between strokes and outcomes such as forced errors and points won, which is highlighted in the exploratory analysis.
Types of strokes, serves, backhands, and slice strokes are not statistically significant and do not predict winning. With that being said, there is still value in what was observed. For strategic purposes, if a player wins the toss for serve selection they should choose to serve first as they are more likely to win.
Looking at the regression for serves in 5.1, it should be noted that even though serve types don’t predict winning, wide and down the center serves should be hit less as winning goes up and body serves should be hit more as they increase the chances of wining. This is useful for creating a player development plan, which would contribute to the overall business problem.
Looking at the regular regression below, the WTA, players and coaches can use this as a good resource for a player development plan. Strokes that do not have a negative sign in front of them in the estimate column, which is the first column on the left after types of columns, equate to being used more for winning and those that have a negative sign in front of them should be used less for winning.

knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Shots regression jpeg.jpg")

Crosscourt forehands and forehands down the middle are statistically significant and predict winning and should be exploited in a player development plan.
Assumption one was debunked by the use of this data, for the most part strokes did not predict winning. A better revision of the business problem might be to look at: do strokes predict stroke outcomes?
Assumption two is correct in that exploratory analysis did yield useful information and the use of the geocoded tableau charts in 4.1 may be helpful when studying other opponents prior to a match, either observing a players weakness or for observing what the player may hit most often. This can help develop a curated training plan for a specific opponent.
If time allotted, it would be interesting to go back and apply bullet point 4 in 6.2, to the shot types, shot outcome, and lookup table join and see if there could be more meaningful information derived from the larger data set.
This study did not look at all strokes used by tennis players, so there may be more information derived from adding those additional strokes.