knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/WTA logo.jpg")
The Women’s Tennis Association (WTA) was founded in 1973 by Billie Jean King and currently consists of more than 1,600 players from around the world. The WTA is a sports organization whose competitors play in sanctioned tennis tournaments to earn ranking points, tournament titles, and cash prizes. Part of the organization’s mission is to provide ongoing player and coach development. The WTA uses a player development advisory panel composed of volunteer, independent sports science and medicine experts who advise on topics such as health, sports psychology, primary care, fitness training, adolescent and women’s health, performance coaching, and athlete development and training.
The WTA would like to identify whether certain strokes (serves, forehands, backhands, or slices) predict winning. By identifying whether certain strokes predict winning, the organization may be able to help develop better training plans for coaches and players.
The WTA does not make an extensive database of player stroke information publicly available. New tracking technology is emerging; in the meantime, comprehensive publicly available data has been created and published on GitHub. Using this data (accessed July 2022), which spans four decades and is part of the Match Charting Project by Jeff Sackmann, an analysis of women’s professional tennis strokes can be conducted to see which strokes may predict winning.
There is a total of 36 Excel spreadsheets: 18 for men’s tennis stats and 18 for women’s tennis stats.
The focus of this analysis is the women’s tennis spreadsheets, specifically charting-w-matches (used as a lookup table for player and match information), charting-w-stats-ServeBasics, and charting-w-stats-ShotDirOutcomes (charting-w-stats-ShotDirection is a subset of charting-w-stats-ShotDirOutcomes and may be used in some analyses).
While there is a treasure trove of information in the Match Charting Project, narrowing the data to the aforementioned spreadsheets allows for a focused look at the most commonly used player strokes that may predict winning.
Several types of analysis will be conducted using Excel, Tableau, Tableau Prep, and R Markdown to predict winning among women’s professional tennis players:
Descriptive Statistics: Used to evaluate the frequency of player strokes for those who won or lost matches. Descriptive statistics allow players and coaches to filter stats from player to player to see which strokes each player hits most often, which supports customized training programs for specific opponents. Descriptive statistics may also reveal potential outliers in the data that can be further evaluated.
Classification Trees and KNN: Used with binary data (win/loss) to predict what contributes to a yes/no outcome, in this case win/loss. This evaluation may give an idea of which strokes or attributes, such as player hand, year, or player number, contribute to a player’s win/loss record.
Logistic Regression: Used with binary data to predict an outcome (win/loss) based on variables in the data set. This will help narrow down which types of strokes are best at predicting whether a player won or lost. Further evaluation will be done using random forest and bagging models to see which model(s) best predict win/loss and which strokes are most important to winning and losing. Regression analysis may itself prove the most useful method for this business problem, in that it focuses entirely on predicting win/loss and gives further insight into which variables in the data may contribute to a player’s win/loss record.
K-means Clustering: Used to see if any groupings can be derived from the data; more specifically, whether there are groups within winning and losing.
There is a statistically significant relationship between strokes (serves, forehands, backhands, and slice) and winning.
Exploratory analysis and descriptive statistics will yield usable information for player development plans.
The following R packages are used for this project (readxl is also loaded to read in the Excel files):
tidyverse
class
rpart
rpart.plot
ipred
randomForest
fpc
#libraries to run if not already in R
#install.packages('tidyverse')
#install.packages('class')
#install.packages('rpart')
#install.packages('rpart.plot')
#install.packages('ipred')
#install.packages('randomForest')
#install.packages('fpc')
# Below are the libraries for this project's analysis
library(tidyverse)#for exploratory analysis
library(readxl) # to read excel files
library(class) #KNN
library(rpart)#decision tree
library(rpart.plot)#decision tree
#library(ipred)#bagging will not be initially run as it creates conflicts with other libraries/will run if it is needed to show analysis
library(randomForest)# random forest
#library(adabag)#boosting muted for purposes of running, but will be posted with code if needed
library(fpc) # for Kmeans clustering
The data in this project comes from the Match Charting Project, which was started to chart and chronicle professional tennis players’ shot-by-shot stats during a match. Volunteers watch a match and chart each shot during a point so that anyone with an interest can look up statistical information about their favorite player.
The original data can be found here.
The four original data sets can be found below:
Charting-w-matches: the original accessed table consisted of 1,819 rows and 17 columns.
Charting-w-stats-ServeBasics: the original accessed table consisted of 12,067 rows and 11 columns.
Charting-w-stats-ShotDirection: the original accessed table consisted of 15,809 rows and 8 columns.
Charting-w-stats-ShotDirOutcomes: the original accessed table consisted of 43,091 rows and 10 columns.
A data dictionary and a description of how the information was collected and charted can be found here.
The charting-w-matches (aka lookup table) CSV file was cleaned in Excel by doing the following:
The charting-w-stats-ServeBasics CSV file was cleaned in Excel by doing the following:
Join 1- Completed in Tableau Prep
The charting-w-stats-ShotDirection CSV file was cleaned in Excel by doing the following:
Join 2- Completed in Tableau Prep
The charting-w-stats-ShotDirOutcomes CSV file was cleaned in Excel by doing the following:
Join 3- Completed in Tableau Prep
Geocoding
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/side view of tennis court.jpg")
The stacked joins of shot types and serves were used because each is joined to player names, which are associated with player 1 and player 2 and were only reported in the serves and shot-location tables. This matters for this process because player totals and averages are what will be plotted on the court image.
Each join had columns subtracted: all extra columns were removed to get down to the players and their associated shot types (see the example below).
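The joining itself was done in Tableau Prep, but as a hedged illustration, equivalent per-player totals and averages could be computed in R with dplyr (loaded via tidyverse); the sketch below assumes the joined serve data frame (serves) that is read in later in this report.
library(tidyverse)
# Hedged sketch: per-player serve-location averages for plotting.
# Assumes the `serves` data frame (read in later) with Players, wide,
# body, and down_center columns.
serve_location_avgs <- serves %>%
  group_by(Players) %>%
  summarise(
    avg_wide        = mean(wide),
    avg_body        = mean(body),
    avg_down_center = mean(down_center),
    rows_charted    = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(rows_charted))
head(serve_location_avgs)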
knitr::include_graphics ("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/CBI studie excel sheet example.jpeg")
The top 25 all-time players are shown below.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Top 25 players.jpg")
The graph below shows a linear relationship between serves hit down the center or wide and outcomes such as forced errors and points won, for all players. There was minimal difference in the plots when winners and losers were viewed separately.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Serve location outcomes.jpg")
In the initial and joined data sets, players are listed as player 1 and player 2, where player 1 is the player who served first. When looking at those who won, the data shows that the player who served first won more often. This is an interesting fact for strategy.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Does serve Matter.jpg")
Body serves are observed much less often among the top 25 players by overall wins, and less often across all players as a whole.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Avg serve top 25.jpg")
The graph below highlights the forehand as the most commonly used stroke in the data used for this project.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Most used shots.jpg")
As observed in the heat map below, most shots are hit crosscourt or down the middle of the court, both for all players and for players who win.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Where are most shots hit.jpg")
Again, crosscourt forehands and backhands forced more opponent errors, with the inside-out shot coming into play as well. An inside-out shot is one where the player runs around their backhand to hit a forehand. The trend in the two heat maps above is commonly observed throughout the data, both for all players and for those who win.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Which shots forced errors.jpg")
Below is an interactive Tableau chart that could be used by coaches and players to explore the shots certain players use. The chart could be modified to view all players and filtered by year. The drop-down contains the top 25 players of all time, and individual players can be searched to see their strengths and weaknesses; looking at Serena Williams, for example, she hits on average more crosscourt backhands than crosscourt forehands. This is interesting because most players are stronger on their forehand and hit more forehands overall.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Interactive strokes.jpg")
There are 8,044 rows and 21 columns in the serve data set, with a mixture of numerical and character variables.
#reading in the serve data
serve <- read_excel("G:/Other computers/My Laptop/Documents/Capstone project/Joined data/Serves joined R.xlsx")
#converting the serve data into a data frame
serves <- as.data.frame(serve)
#observation of what the serve data frame looks like
str(serves)
## 'data.frame': 8044 obs. of 21 variables:
## $ Unique_ID : num 1798 1798 1770 1770 1736 ...
## $ Players : chr "Agnieszka Radwanska" "Agnieszka Radwanska" "Svetlana Kuznetsova" "Svetlana Kuznetsova" ...
## $ Winner : num 1 1 1 1 1 1 0 0 1 1 ...
## $ player_num : num 1 1 1 1 1 1 1 1 1 1 ...
## $ set_num : num 1 2 1 2 1 2 1 2 1 2 ...
## $ Pl_Hand : chr "R" "R" "R" "R" ...
## $ Year : num 2007 2007 2009 2009 2011 ...
## $ Month : num 9 9 6 6 6 6 10 10 4 4 ...
## $ Day : num 1 1 6 6 14 14 7 7 1 1 ...
## $ Tournament : chr "US Open" "US Open" "Roland Garros" "Roland Garros" ...
## $ Round : chr "R32" "R32" "F" "F" ...
## $ Surface : chr "Hard" "Hard" "Clay" "Clay" ...
## $ pts : num 46 50 42 30 79 108 37 43 51 42 ...
## $ pts_won : num 24 33 28 16 50 56 21 29 39 29 ...
## $ aces : num 1 2 0 0 12 1 1 0 1 3 ...
## $ unret : num 1 3 0 0 5 2 1 4 0 0 ...
## $ forced_err : num 2 2 5 0 12 10 4 6 6 2 ...
## $ pts_won3shots: num 13 13 10 6 34 25 8 12 15 15 ...
## $ wide : num 11 13 22 12 13 30 11 9 18 16 ...
## $ body : num 17 16 9 9 31 43 3 13 11 13 ...
## $ down_center : num 18 21 11 9 32 34 23 21 22 13 ...
There are no missing values in the data set
#No missing data
sum(is.na(serves))
## [1] 0
Histograms of the numerical variables: all have a right skew, which means most of the data falls below the mean.
par(mfrow=c(3,3))
hist(serves$pts, col="blue", breaks = 30)
hist(serves$pts_won, col="blue", breaks = 30)
hist(serves$aces, col="blue", breaks = 30)
hist(serves$unret, col="blue", breaks = 30)
hist(serves$forced_err, col="blue", breaks = 30)
hist(serves$pts_won3shots, col="blue", breaks = 30)
hist(serves$wide, col="blue", breaks = 30)
hist(serves$body, col="blue", breaks = 30)
hist(serves$down_center, col="blue", breaks = 30)
Scatter plots of the numeric variables: linearity is observed among many variables.
pairs(serves[,13:21])
Box plots of serve types (wide, body, and down the center) against win/loss: all have outliers.
par(mfrow=c(1,1))
boxplot(serves[,19]~serves[,3], notch=FALSE, ylab="Wide", xlab="Loser/Winner", col="blue")
boxplot(serves[,20]~serves[,3], notch=FALSE, ylab="Body", xlab="Loser/Winner", col="blue")
boxplot(serves[,21]~serves[,3], notch=FALSE, ylab="Down The Center", xlab="Loser/Winner", col="blue")
There are 43,901 rows and 19 columns in the shots data set, with a mixture of numerical and character variables.
#reading in the shots data
shots<-read_excel("G:/Other computers/My Laptop/Documents/Capstone project/Joined data/Shot type with outcomes joined to lookup for R.xlsx")
#converting the shots data into a data frame
shots<-as.data.frame(shots)
#observation of what the shots data frame looks like
str(shots)
## 'data.frame': 43901 obs. of 19 variables:
## $ shot_type : chr "Forehand" "Forehand" "Forehand" "Forehand" ...
## $ shot_location : chr "crosscourt" "down_middle" "down_the_line" "inside_out" ...
## $ shots : num 34 23 12 11 2 19 16 9 8 6 ...
## $ pt_ending : num 9 3 5 4 1 5 2 5 1 0 ...
## $ shot_winners : num 0 0 0 4 0 2 0 0 0 0 ...
## $ induced_forced : num 2 0 1 0 0 2 0 2 1 0 ...
## $ unforced : num 7 3 4 0 1 1 2 3 0 0 ...
## $ shots_in_pts_won : num 20 14 5 9 0 12 12 4 6 2 ...
## $ shots_in_pts_lost: num 14 9 7 2 2 7 4 5 2 4 ...
## $ match_id : chr "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" "20090606-W-Roland_Garros-F-Svetlana_Kuznetsova-Dinara_Safina" ...
## $ player_num : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Unique ID : num 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770 ...
## $ Winner : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Players : chr "Svetlana Kuznetsova" "Svetlana Kuznetsova" "Svetlana Kuznetsova" "Svetlana Kuznetsova" ...
## $ Pl_Hand : chr "R" "R" "R" "R" ...
## $ Year : num 2009 2009 2009 2009 2009 ...
## $ Month : num 6 6 6 6 6 6 6 6 6 6 ...
## $ Day : num 6 6 6 6 6 6 6 6 6 6 ...
## $ Surface : chr "Clay" "Clay" "Clay" "Clay" ...
Several variables were eliminated; the removals can be observed by hitting the code button to the right, and the reason each was eliminated is noted in the code comments.
shots$match_id=NULL # not needed for analysis
shots$Players=NULL # too many levels, not focus study
shots$Round=NULL #too many levels
shots$Tournament=NULL #too many levels
shots$Month=NULL # too many levels
shots$Day=NULL #too many levels
shots$`Unique ID`=NULL # not needed
str(shots)
## 'data.frame': 43901 obs. of 14 variables:
## $ shot_type : chr "Forehand" "Forehand" "Forehand" "Forehand" ...
## $ shot_location : chr "crosscourt" "down_middle" "down_the_line" "inside_out" ...
## $ shots : num 34 23 12 11 2 19 16 9 8 6 ...
## $ pt_ending : num 9 3 5 4 1 5 2 5 1 0 ...
## $ shot_winners : num 0 0 0 4 0 2 0 0 0 0 ...
## $ induced_forced : num 2 0 1 0 0 2 0 2 1 0 ...
## $ unforced : num 7 3 4 0 1 1 2 3 0 0 ...
## $ shots_in_pts_won : num 20 14 5 9 0 12 12 4 6 2 ...
## $ shots_in_pts_lost: num 14 9 7 2 2 7 4 5 2 4 ...
## $ player_num : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Winner : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Pl_Hand : chr "R" "R" "R" "R" ...
## $ Year : num 2009 2009 2009 2009 2009 ...
## $ Surface : chr "Clay" "Clay" "Clay" "Clay" ...
Scatter plots of the numerical shot variables: linearity is observed in some shot outcomes.
pairs(shots[,3:9]) # most linearity observed in shot outcomes
All histograms of the numerical shot variables have a right skew, meaning most of the data falls below the mean.
par(mfrow=c(3,3)) # all right skewed
hist(shots$shots, col="blue", breaks = 10)
hist(shots$pt_ending, col="blue", breaks = 10)
hist(shots$shot_winners, col="blue", breaks = 10)
hist(shots$induced_forced, col="blue", breaks = 10)
hist(shots$unforced, col="blue", breaks = 10)
Prior to running the analysis, several columns were dropped and some character columns were converted to factors. This process can be viewed via the code button to the right, and the final data frame is shown below.
serves$Unique_ID=NULL # not useful
serves$Players=NULL # research is not about individual player
serves$Tournament=NULL # too many levels
serves$pts=NULL # total points not the goal of the research
serves$Round=NULL # too many levels
serves$Month=NULL # too many levels
serves$Day=NULL # too many levels
serves$Pl_Hand<-as.factor(serves$Pl_Hand)
serves$Surface<-as.factor(serves$Surface)
str(serves)
## 'data.frame': 8044 obs. of 14 variables:
## $ Winner : num 1 1 1 1 1 1 0 0 1 1 ...
## $ player_num : num 1 1 1 1 1 1 1 1 1 1 ...
## $ set_num : num 1 2 1 2 1 2 1 2 1 2 ...
## $ Pl_Hand : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : num 2007 2007 2009 2009 2011 ...
## $ Surface : Factor w/ 4 levels "Carpet","Clay",..: 4 4 2 2 3 3 4 4 4 4 ...
## $ pts_won : num 24 33 28 16 50 56 21 29 39 29 ...
## $ aces : num 1 2 0 0 12 1 1 0 1 3 ...
## $ unret : num 1 3 0 0 5 2 1 4 0 0 ...
## $ forced_err : num 2 2 5 0 12 10 4 6 6 2 ...
## $ pts_won3shots: num 13 13 10 6 34 25 8 12 15 15 ...
## $ wide : num 11 13 22 12 13 30 11 9 18 16 ...
## $ body : num 17 16 9 9 31 43 3 13 11 13 ...
## $ down_center : num 18 21 11 9 32 34 23 21 22 13 ...
The data set was split into training and test sets, with 90% of the data placed in the training set, as this split was observed to perform best with the data.
set.seed(1234)
sample_serves <- sample(nrow(serves), nrow(serves)*0.90)
serves_train <- serves[sample_serves,]
serves_test <- serves[-sample_serves,]
# 50, 60, 70, 80, and 90 Percent were tried for training and 90% was the best
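KNN was listed alongside classification trees in the methods overview and the class package is loaded above, but no KNN output is shown here; the sketch below is a hedged illustration of how KNN could be applied to this split (scaled numeric predictors only; k = 5 is an assumption, not a tuned value).
library(class)
# Hedged sketch: KNN on the serve train/test split above.
num_cols <- sapply(serves_train, is.numeric)
num_cols["Winner"] <- FALSE                     # exclude the response from the predictors
train_x <- scale(serves_train[, num_cols])
test_x  <- scale(serves_test[, num_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
knn_pred <- knn(train = train_x, test = test_x,
                cl = serves_train$Winner, k = 5)
table(serves_test$Winner, knn_pred, dnn = c("True", "Pred"))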
When running a decision tree on the data, it was noted that player number, and specifically being player number 1, contributes the most to whether or not a player wins. This model misclassified 368 of the 805 test observations.
#library(rpart)
#library(rpart.plot)
set.seed(1234)
serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)
serves_rpart1 <- rpart(formula = Winner ~., data = serves_train, method = "class")
pred1 <- predict(serves_rpart1, serves_test, type = "class")
#pred1
###plotting the model
par(mfrow=c(1,1))
#serves_rpart1
prp(serves_rpart1, extra = 1)
###confusion matrix
table(serves_test$Winner, pred1, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 204 182
## 1 186 233
sum(serves_test$Winner != pred1)
## [1] 368
#368 misclassified
Using logistic regression and backwards selection, player number is found to be the only important variable for predicting winning, with a lower player number being better. This model misclassified 371 of the 805 test observations.
set.seed (1234)
serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)
serves_glm0 <- glm(Winner~., family = binomial, data = serves_train)
summary(serves_glm0) # player number only significant variable, lower is better
##
## Call:
## glm(formula = Winner ~ ., family = binomial, data = serves_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.341 -1.144 -1.062 1.167 1.337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.979e-01 6.082e+00 0.115 0.909
## player_num -3.806e-01 7.956e-02 -4.784 1.72e-06 ***
## set_num 7.892e-03 4.730e-02 0.167 0.867
## Pl_HandR 7.375e-02 7.667e-02 0.962 0.336
## Year -7.972e-05 3.029e-03 -0.026 0.979
## SurfaceClay 1.751e-02 5.063e-01 0.035 0.972
## SurfaceGrass 1.144e-02 5.074e-01 0.023 0.982
## SurfaceHard -1.831e-02 5.046e-01 -0.036 0.971
## pts_won 2.172e-05 8.295e-03 0.003 0.998
## aces -2.107e-03 1.716e-02 -0.123 0.902
## unret -3.143e-02 3.270e-02 -0.961 0.336
## forced_err -1.290e-02 1.223e-02 -1.055 0.291
## pts_won3shots 8.085e-03 1.145e-02 0.706 0.480
## wide -4.806e-03 5.415e-03 -0.888 0.375
## body 2.906e-03 4.939e-03 0.588 0.556
## down_center -2.100e-03 5.960e-03 -0.352 0.725
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10035.2 on 7238 degrees of freedom
## Residual deviance: 9993.8 on 7223 degrees of freedom
## AIC: 10026
##
## Number of Fisher Scoring iterations: 3
## Prediction
servespred_resp <- predict(serves_glm0, newdata = serves_test, type = "response")
## Create a confusion matrix 0.5 cutoff probability
table(serves_test$Winner, (servespred_resp > 0.5)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 208 178
## 1 193 226
#371 of 805 test observations misclassified
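The backwards selection referenced above is not shown in the code; a minimal sketch, assuming it was run with step() on the full model serves_glm0, follows.
# Hedged sketch: backward selection from the full serve model via AIC.
serves_glm_back <- step(serves_glm0, direction = "backward", trace = 0)
summary(serves_glm_back)
# Confusion matrix for the reduced model at a 0.5 cutoff
servespred_back <- predict(serves_glm_back, newdata = serves_test, type = "response")
table(serves_test$Winner, (servespred_back > 0.5)*1, dnn = c("Truth", "Predicted"))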
A bagging model was run. It had the lowest misclassification count of all the serve models, misclassifying 271 of the 805 test observations. This model does the best job of predicting winning and losing.
#install.packages('ipred')
library(ipred)
set.seed(1234)
serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)
servesbag_model <- bagging(formula = Winner ~., data = serves_train, nbagg = 50)
servesbag_pred <- predict(servesbag_model, newdata = serves_test)
#servesbag_pred
table(serves_test$Winner, servesbag_pred, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 257 129
## 1 142 277
sum(serves_test$Winner != servesbag_pred)
## [1] 271
# 271 misclassified- 50, 100, 500, and 1000 nbaggs tried 50 was the best model
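The modeling plan also names random forest and the randomForest package is loaded above, but its output is not shown here; the sketch below is a hedged example of how it could be fit on the same split (ntree = 500 is the package default, not a tuned value).
library(randomForest)
# Hedged sketch: random forest on the same serve split.
set.seed(1234)
serves_rf <- randomForest(Winner ~ ., data = serves_train,
                          ntree = 500, importance = TRUE)
serves_rf_pred <- predict(serves_rf, newdata = serves_test)
table(serves_test$Winner, serves_rf_pred, dnn = c("True", "Pred"))
sum(serves_test$Winner != serves_rf_pred)  # misclassification count
varImpPlot(serves_rf)  # which variables matter most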
The player number variable appeared to be drowning out the other variables. It was removed to see if any further insights could be gained.
serves$player_num=NULL
str(serves)
set.seed(1234)
sample_serves <- sample(nrow(serves), nrow(serves)*0.90)
serves_train <- serves[sample_serves,]
serves_test <- serves[-sample_serves,]
The decision tree appears below and misclassified the same number of test observations as the first run, 368. Forced error became the root node in this model: causing more than 3 forced errors equated to winning more. Several other models were rerun. Logistic regression revealed no statistically significant variables this time and misclassified 375 test observations, while backwards selection honed in on the same thing as the decision tree, with forced errors being a deciding factor. All in all, nothing was gained from removing the player number variable; player number, not serve type, is the deciding factor in this data set.
#library(rpart)
#library(rpart.plot)
set.seed(1234)
serves_train$Winner <- as.factor(serves_train$Winner)
serves_test$Winner <- as.factor(serves_test$Winner)
serves_rpart1 <- rpart(formula = Winner ~., data = serves_train, method = "class")
pred1 <- predict(serves_rpart1, serves_test, type = "class")
#pred1
###plotting the model
par(mfrow=c(1,1))
#serves_rpart1
prp(serves_rpart1, extra = 1)
###confusion matrix
table(serves_test$Winner, pred1, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 159 227
## 1 141 278
sum(serves_test$Winner != pred1)
## [1] 368
#368 misclassified
The data set was split into training and test sets, with 80% of the data placed in the training set, as this split was observed to perform best with the data (the data frame can be observed below). Prior to modeling, player hand and surface were converted to factor variables, and shots in points won and shots in points lost were removed because these two columns dominate the results.
set.seed(1234)
shots$Pl_Hand <- as.factor(shots$Pl_Hand)
shots$Surface <- as.factor(shots$Surface)
shots$shots_in_pts_won=NULL
shots$shots_in_pts_lost=NULL
sample_shots <- sample(nrow(shots), nrow(shots)*0.80) # tried 50, 60 and 70 prior with worse results
shots_train <- shots[sample_shots,]
shots_test <- shots[-sample_shots,]
The decision tree below focused on shot outcomes, with shot winners as the root node, and misclassified 3,890 of the 8,781 test observations. A shot winner is a shot that wins the point without the opponent touching the ball. This model does a good job of predicting winning. Following the root node to the right, a player who hits more than two winners and five or fewer unforced errors is more likely to win. Following the root node to the left, players who hit fewer than two winners, had three or fewer unforced errors, and were player number 1 were more likely to win. It is interesting to see that player number plays a part in the strokes data as well.
#library(rpart)
library(rpart.plot)
set.seed(1234)
shots$shot_type <-as.factor(shots$shot_type)
shots$shot_location<-as.factor(shots$shot_location)
shots$shots_in_pts_won=NULL
shots$shots_in_pts_lost=NULL
str(shots)
## 'data.frame': 43901 obs. of 12 variables:
## $ shot_type : Factor w/ 3 levels "Backhand","Forehand",..: 2 2 2 2 2 1 1 1 1 3 ...
## $ shot_location : Factor w/ 5 levels "crosscourt","down_middle",..: 1 2 3 5 4 1 2 3 5 1 ...
## $ shots : num 34 23 12 11 2 19 16 9 8 6 ...
## $ pt_ending : num 9 3 5 4 1 5 2 5 1 0 ...
## $ shot_winners : num 0 0 0 4 0 2 0 0 0 0 ...
## $ induced_forced: num 2 0 1 0 0 2 0 2 1 0 ...
## $ unforced : num 7 3 4 0 1 1 2 3 0 0 ...
## $ player_num : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Winner : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Pl_Hand : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : num 2009 2009 2009 2009 2009 ...
## $ Surface : Factor w/ 4 levels "Carpet","Clay",..: 2 2 2 2 2 2 2 2 2 2 ...
shots_rpart1 <- rpart(formula = Winner ~., data = shots_train, method = "class")
pred1 <- predict(shots_rpart1, shots_test, type = "class")
#pred1
###plotting the model
par(mfrow=c(1,1))
#shots_rpart1
prp(shots_rpart1, extra = 1)
###confusion matrix
table(shots_test$Winner, pred1, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 2506 1917
## 1 1973 2385
sum(shots_test$Winner != pred1)
## [1] 3890
#3890 misclassified
Logistic regression was the best model for this data and showed that forehands, slices, shots down the middle, number of shots, player number, and player hand are statistically significant variables for winning. It misclassified 3,779 of the 8,781 test observations, so it also does a good job of predicting winning. Looking at the statistically significant variables, it is interesting to note that as forehands, slice shots, and player number go down, winning increases, and as more shots go down the middle, winning goes up. This seems a bit odd to me because, depending on the player, hitting more strokes down the middle could be bad: the opponent will not be moving much and will be able to dictate where the point goes. Backwards selection did not yield a better misclassification count.
#shot type and location changed to factors
shots$shot_type <-as.factor(shots$shot_type)
shots$shot_location<-as.factor(shots$shot_location)
set.seed(1234)
#logistic regression model
shots_glm0 <- glm(Winner~., family = binomial, data = shots_train)
summary(shots_glm0)
##
## Call:
## glm(formula = Winner ~ ., family = binomial, data = shots_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5245 -1.1552 -0.6164 1.1520 2.1778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.065133 2.689709 -0.768 0.4426
## shot_typeForehand -0.103496 0.026297 -3.936 8.30e-05 ***
## shot_typeSlice -0.063346 0.031372 -2.019 0.0435 *
## shot_locationdown_middle 0.172743 0.033941 5.090 3.59e-07 ***
## shot_locationdown_the_line 0.001717 0.034900 0.049 0.9608
## shot_locationinside_in 0.065148 0.059746 1.090 0.2755
## shot_locationinside_out -0.021389 0.035834 -0.597 0.5506
## shots 0.006156 0.001025 6.007 1.89e-09 ***
## pt_ending -0.176739 0.225745 -0.783 0.4337
## shot_winners 0.342177 0.225908 1.515 0.1299
## induced_forced 0.291281 0.225931 1.289 0.1973
## unforced -0.003935 0.225970 -0.017 0.9861
## player_num -0.278190 0.021716 -12.810 < 2e-16 ***
## Pl_HandR 0.084206 0.035328 2.384 0.0171 *
## Year 0.001259 0.001339 0.940 0.3470
## SurfaceClay -0.149733 0.250304 -0.598 0.5497
## SurfaceGrass -0.122758 0.250769 -0.490 0.6245
## SurfaceHard -0.137286 0.249603 -0.550 0.5823
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 48687 on 35119 degrees of freedom
## Residual deviance: 47503 on 35102 degrees of freedom
## AIC: 47539
##
## Number of Fisher Scoring iterations: 4
#predictions if needed
shotspred_glm0 <- predict(shots_glm0, newdata = shots_test, type = "response")
#shotspred_resp[500]
## Create a confusion matrix 0.5 cutoff probability
table(shots_test$Winner, (shotspred_glm0 > 0.5)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 2505 1918
## 1 1861 2497
After observing the stroke analysis joined with shot outcomes and the lookup table, I had an idea to use only the shot data joined with the lookup table, to see whether the outcome columns were drowning out the shot type/location data. I also wanted to see whether combining shot location and type would give a clearer picture, since the other data set seemed off when it suggested that hitting more shots down the middle is desirable. In Excel, I filtered the joined data to show only forehands, put an F in front of each shot location, deleted the shot type column, and saved the result as a separate Excel sheet. I did the same for backhands and slice shots. I then joined the sheets together in Tableau Prep on the unique ID column to create one data set. This should give more insight into which types of shots may contribute to winning, and the down-the-middle significance in the other data set may become more meaningful with this data.
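The reshape above was done in Excel and Tableau Prep; as a hedged alternative, the same wide layout could be built in R with tidyr. The data frame shots_full below is a hypothetical re-read of the joined shot-type file that still contains the Unique ID and player columns (those columns were dropped from the shots data frame earlier).
library(tidyverse)
# Hedged sketch: build one column per shot type + location (F/B/S prefixes),
# mirroring the Excel/Tableau Prep approach. `shots_full` is a hypothetical
# data frame with Unique ID, player_num, Winner, Pl_Hand, Year, Surface,
# shot_type, shot_location, and shots.
shot_wide <- shots_full %>%
  mutate(prefix   = recode(shot_type, "Forehand" = "F", "Backhand" = "B", "Slice" = "S"),
         location = recode(shot_location, "down_the_line" = "downDL")) %>%
  unite("shot_col", prefix, location, sep = "") %>%
  group_by(`Unique ID`, player_num, Winner, Pl_Hand, Year, Surface, shot_col) %>%
  summarise(shots = sum(shots), .groups = "drop") %>%
  pivot_wider(names_from = shot_col, values_from = shots, values_fill = 0)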
The code button to the right shows how the data set was read in, how some columns were eliminated, and how some columns were converted to factors. The resulting data frame, with 2,166 rows and 20 variables, can be observed below prior to running any models.
shot<-read.csv("G:/Other computers/My Laptop/Documents/Capstone project/Joined data/Shots only no outcomes/Join R FBS.csv")
shot<-as.data.frame(shot)
shot$Players=NULL #too many levels
shot$Unique.ID=NULL #not informative
shot$Pl_Hand<-as.factor(shot$Pl_Hand) # done to run analysis
shot$Surface<-as.factor(shot$Surface) # done to run analysis
str(shot)#2166 obs. of 20 variables
## 'data.frame': 2166 obs. of 20 variables:
## $ Winner : int 1 1 0 1 0 1 1 0 1 1 ...
## $ Pl_Hand : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : int 2009 2011 2012 2012 2014 2014 2014 2014 2013 2013 ...
## $ Surface : Factor w/ 4 levels "Carpet","Clay",..: 2 3 4 2 4 4 4 4 4 4 ...
## $ player : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Fcrosscourt : int 34 36 37 28 14 31 11 80 17 44 ...
## $ Fdown_middle: int 23 45 42 15 12 22 8 46 7 33 ...
## $ FdownDL : int 12 18 5 3 3 8 2 16 11 20 ...
## $ Finside_out : int 11 29 22 6 10 25 7 33 21 22 ...
## $ Finside_in : int 2 0 1 0 0 2 0 1 0 0 ...
## $ Bcrosscourt : int 19 49 19 21 48 48 12 34 53 56 ...
## $ Bdown_middle: int 16 50 28 17 43 34 14 25 15 60 ...
## $ BdownDL : int 9 15 8 9 6 9 3 12 6 15 ...
## $ Binside_out : int 8 12 5 2 9 12 7 13 7 16 ...
## $ Binside_in : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Scrosscourt : int 6 2 0 0 0 0 0 0 1 1 ...
## $ Sdown_middle: int 6 5 1 1 3 5 2 2 3 7 ...
## $ SdownDL : int 2 1 0 0 0 0 0 2 1 1 ...
## $ Sinside_out : int 0 0 0 0 0 0 1 0 0 1 ...
## $ Sinside_in : int 0 0 0 0 0 0 0 0 0 0 ...
Exploratory analysis of this data showed that down the middle was the only shot location where backhands outpaced forehands. This is not surprising, as backhands are generally harder to hit with direction, which may contribute to more down-the-middle shots. The numerical variables in this data set exhibited a right skew similar to the serves, and there was a lot of linearity among them.
avgshotdm<- apply(shot[,c(7,12,17)], 2, mean)
barplot(avgshotdm, ylim = c(0, 25), ylab = "Average", col = "blue") # only shot where backhand is used more than forehand
The data set was split into training and test sets, with 80% of the data placed in the training set, as this split was observed to perform best with the data.
set.seed(1234)
sample_shot <- sample(nrow(shot), nrow(shot)*0.80) # tried 50, 60 and 70 prior with worse results
shot_train <- shot[sample_shot,]
shot_test <- shot[-sample_shot,]
A decision tree was run, and the root node was found to be forehands down the middle; following the root node to the right, hitting 10 or fewer of these led to winning. The decision tree misclassified 203 of the 434 test observations. This is more telling as to what “down the middle” meant in the previous data set: it is not just hitting down the middle, but hitting fewer forehands down the middle, that is better. Following the left side of the root node, forehands down the middle of 10 or more, combined with other shots, led to winning in a variety of different ways, as seen below.
#library(rpart)
#library(rpart.plot)
set.seed(1234)
shot_rpart1 <- rpart(formula = Winner ~., data = shot_train, method = "class")
pred1 <- predict(shot_rpart1, shot_test, type = "class")
#pred1
###plotting the model
par(mfrow=c(1,1))
#shot_rpart1
prp(shot_rpart1, extra = 1)
###confusion matrix
table(shot_test$Winner, pred1, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 80 131
## 1 72 151
sum(shot_test$Winner != pred1)
## [1] 203
#203 misclassified
Backwards selection misclassified 192 of the 434 test observations, making it the best model for this data set. Hitting more crosscourt forehands leads to winning, and hitting fewer forehands down the middle increases the chance of winning, which makes more sense than the prior data set. However, looking at the decision tree, it is probably a combination of both that leads to winning.
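The backwards-selection model summarized below (shot_glm_back) is not defined in the code shown in this report; a minimal sketch, assuming it came from step() applied to a full logistic model on the training data, is provided here so the following calls can be reproduced.
# Hedged sketch: fit the full logistic model and reduce it by backward selection.
# The exact selection path used in the original analysis is an assumption.
set.seed(1234)
shot_glm_full <- glm(Winner ~ ., family = binomial, data = shot_train)
shot_glm_back <- step(shot_glm_full, direction = "backward", trace = 0)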
summary(shot_glm_back)
##
## Call:
## glm(formula = Winner ~ player + Fcrosscourt + Fdown_middle +
## FdownDL, family = binomial, data = shot_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6028 -1.2107 0.9923 1.1311 1.4595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.388541 0.201364 1.930 0.053662 .
## player -0.277310 0.147431 -1.881 0.059979 .
## Fcrosscourt 0.010197 0.002807 3.632 0.000281 ***
## Fdown_middle -0.010242 0.003743 -2.736 0.006211 **
## FdownDL -0.013148 0.006729 -1.954 0.050703 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2397.2 on 1731 degrees of freedom
## Residual deviance: 2377.2 on 1727 degrees of freedom
## AIC: 2387.2
##
## Number of Fisher Scoring iterations: 4
## Prediction
shotpred_resp1 <- predict(shot_glm_back, newdata = shot_test, type = "response")
## Create a confusion matrix 0.5 cutoff probability
table(shot_test$Winner, (shotpred_resp1 > 0.5)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 76 135
## 1 57 166
The business problem was to identify whether certain strokes (serves, forehands, backhands, or slices) predict winning. By identifying whether certain strokes predict winning, the WTA may be able to help develop better training plans for coaches and players.
Two assumptions were made about the data in section 1.5:
There is a statistically significant relationship between strokes (serves, forehands, backhands, and slice) and winning.
Exploratory analysis and descriptive statistics will yield usable information for player development plans.
In summary, the main takeaway from the serve data is that types of serves do not predict winning. Which player served first is the most statistically significant variable for predicting winning with this data, and the best model at predicting winning for the serve data was the bagging model. Exploratory analysis of serving showed that, among the top 25 all-time winners, most winners on average used wide and down-the-center serves, with body serves being the least used. Serve types were more closely related to serve outcomes than to winning or losing in the exploratory analysis. Neither serve types nor serve outcomes were good at predicting winning or losing, and this was echoed in the regression analysis, where player number was exposed as the only significant variable in this data set.
The exploratory stroke analysis (forehand, backhand, and slice) revealed that most shots are forehands and that most strokes are hit crosscourt or down the middle. Crosscourt forehands cause the most errors for the opponent. Down the middle was the only shot location where backhands outpaced forehands; at every other location, forehands outpaced backhands and slices.
When combining shots, shot outcomes, and the lookup table, the best model was logistic regression, which identified forehands, slices, player number, player hand, number of shots, and shots down the middle as statistically significant variables. As forehands, slice shots, and player number go down, winning increases, and as shots down the middle go up, so does winning. Hitting more shots down the middle did not seem to make sense, as it would make it easier for the opponent: they would not have to run as much and could dictate where they want to hit. Since this data did not yield outcomes that made the most sense, the shot types and locations were combined into their respective columns and joined with only the lookup table, as described in the next bullet.
The best model, when looking at only the combined shot types and shot locations, was backwards regression, which highlighted crosscourt forehands and forehands down the middle as the only statistically significant variables. As forehands down the middle go down, winning goes up, and as crosscourt forehands go up, so does winning. As for clustering, this data did not cluster well.
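No clustering code appears in this report; as a hedged illustration of the grouping check described here, the sketch below applies k-means to the numeric shot columns of the combined shot data (the column range, scaling, and k = 2 are assumptions, not the settings used in the original analysis).
library(fpc)
# Hedged sketch: k-means on the combined shot data (the `shot` data frame above).
set.seed(1234)
shot_num  <- shot[, 6:20]                                 # F/B/S shot-count columns
shot_num  <- shot_num[, apply(shot_num, 2, sd) > 0]       # drop any zero-variance columns
shot_scl  <- scale(shot_num)
shot_km   <- kmeans(shot_scl, centers = 2, nstart = 25)
table(cluster = shot_km$cluster, winner = shot$Winner)    # compare clusters with win/loss
plotcluster(shot_scl, shot_km$cluster)                    # fpc discriminant-coordinates plot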
Strokes in and of themselves are not great at predicting winning. There is more of a relationship between strokes and outcomes such as forced errors and points won, which is highlighted in the exploratory analysis.
Types of strokes (serves, backhands, and slice strokes) are not statistically significant and do not predict winning. That said, there is still value in what was observed: for strategic purposes, if a player wins the toss for serve selection, she should choose to serve first, as the player who serves first is more likely to win.
Looking at the regression for serves in 5.1, it should be noted that even though serve types do not predict winning, wide and down-the-center serves should be hit less often and body serves more often, as body serves slightly increase the chances of winning. This is useful for creating a player development plan, which contributes to the overall business problem.
Looking at the full regression below, the WTA, players, and coaches can use it as a resource for a player development plan. Strokes without a negative sign in the Estimate column (the first numeric column after the variable names) are associated with winning and can be used more, while those with a negative sign should be used less.
knitr::include_graphics("G:/Other computers/My Laptop/Documents/Capstone project/R markdown jpegs/Shots regression jpeg.jpg")
Crosscourt forehands and forehands down the middle are statistically significant predictors of winning and should be exploited in a player development plan.
Assumption one was debunked by this data: for the most part, strokes did not predict winning. A better revision of the business problem might be to ask whether strokes predict stroke outcomes.
Assumption two is correct in that exploratory analysis did yield useful information, and the geocoded Tableau charts in 4.1 may be helpful when studying opponents prior to a match, either to observe a player’s weaknesses or to see what that player hits most often. This can help develop a curated training plan for a specific opponent.
If time allowed, it would be interesting to go back and apply bullet point 4 in 6.2 to the shot types, shot outcomes, and lookup table join, to see whether more meaningful information could be derived from the larger data set.
This study did not look at all strokes used by tennis players, so adding the remaining strokes may yield further information.