This report analyzes factors within the game of tennis that are associated with match winning percentage. Players at every level, from recreational to competitive, receive coaching on which strategies will help them succeed, yet a few well-worn phrases are repeated to nearly all of them. In short, these coaching cues hold that players win more points when they get their first serve in play and win more matches when they reduce double faults. By investigating which factors are associated with a higher winning percentage, we can target the specific aspects of a player’s game whose improvement is most likely to pay off. Additionally, as previous research has attempted, we can improve forecasts of who is more likely to win a matchup based on the players’ differing styles of play.
The existing body of research has mainly combined probability and statistical formulae to predict the outcomes of matches. Most of this work focuses on whether a player is serving or returning in order to calculate the outcome of a game or set within a match. For example, O’Malley (2008) derives a probability model for winning a game within a tennis match, using a binomial distribution to establish what he calls a tennis formula. When O’Malley accounts for the probability \(p\) of winning a point in this distribution, however, he relies extensively on whether the player is serving or not. Similarly, in studying optimal tennis strategy, George (1973) examines the optimal sequencing of a player’s two serves. What this body of research suggests is that an effective model for predicting the winning percentages of tennis players should be built around elements of tennis that relate to service. What sets this analysis apart is that we attempt to include factors alongside service and return in a model that predicts winning percentage.
To examine the hypothesis that more factors predict winning percentage than those related to serve and return alone, we scraped data from TennisAbstract.com. The data cover the matches played over the past year by the fifty highest-ranked male professional players. They include service statistics, return statistics, and miscellaneous statistics such as overall points, games, and sets won in the past year, average lengths of points, games, sets, and matches, and the player’s dominance ratio, given by the percent of return points won divided by the percent of service points lost. To simplify our analysis, we retained variables related to winning individual points, including percentages for first serves in, first serve points won, second serve points won, double faults, aces, return points won, break points won (as the returner), and break points saved (as the server), along with tiebreaker win percentage, average opponent ranking, and dominance ratio. These variables allow us to analyze the association between the percentage of specific types of points won and the overall percentage of matches won. In disregarding the analogous percentages of games and sets won, we are expanding upon former research that operates under the assumption that points in a tennis match are independent and identically distributed.
It is also worth noting that we chose not to center our variables around an average value for each of our predictors. While this leads to the intercept of our model having no plausible interpretation with regards to winning percentage, it does allow for a more flexible use of our model as new data is collected. That is, as new data is collected, a centered model would require calculations of updated average values for all predictors in our model, as opposed to leaving the model uncentered and having the freedom to use raw percentages as predictors for winning percentage.
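The effect of centering can be made explicit with a short derivation. This is a general property of linear regression rather than something specific to our data; the symbols \(\beta_j\), \(x_j\), and \(\bar{x}_j\) (coefficients, predictors, and sample means) are notation introduced here for illustration.

\[ \beta_0 + \sum_{j} \beta_j x_j = \Big(\beta_0 + \sum_{j} \beta_j \bar{x}_j\Big) + \sum_{j} \beta_j \,(x_j - \bar{x}_j). \]

Centering therefore leaves every slope unchanged and only replaces the intercept \(\beta_0\) with \(\beta_0 + \sum_j \beta_j \bar{x}_j\), the predicted winning percentage for a player who is average on every predictor. The slopes interpreted later in this report would be the same under either choice.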
We will use multiple linear regression, with predictors chosen by stepwise selection, to determine the associations these variables have with overall winning percentage among the top-ranked male professionals. Within this model, we can perform t-tests to assess the significance of each factor individually. Each t-test reports a t-statistic under the null hypothesis that the given explanatory variable has no linear association with a player’s winning percentage after accounting for the other variables in the model, against the alternative hypothesis that some linear association exists. We will also examine the adjusted R-squared value to judge how much of the variability in our data the model explains. Once our linear model is established, we examine experimental models with various transformations and interaction terms to check whether any of them improves on the model given by stepwise selection. To compare models, we use ANOVA F-tests. Each F-test operates under the null hypothesis that the larger model explains no more variation than the smaller one, against the alternative hypothesis that the larger model is a significant improvement.
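For reference, the two test statistics described above take the following standard forms. The notation (\(\hat\beta_j\), \(\mathrm{SE}\), \(\mathrm{RSS}\), \(\mathrm{df}\)) is introduced here for illustration, and the worked numbers are the residual sums of squares printed in the appendix output for the model with and without tiebreaker win percentage.

\[ t_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}, \qquad F = \frac{(\mathrm{RSS}_{0} - \mathrm{RSS}_{1})/(\mathrm{df}_{0} - \mathrm{df}_{1})}{\mathrm{RSS}_{1}/\mathrm{df}_{1}}. \]

\[ F = \frac{(1056.42 - 891.01)/(45 - 44)}{891.01/44} = \frac{165.41}{20.25} \approx 8.17. \]

Because the two models differ by a single term, this F-statistic is the square of that term’s t-statistic (\(2.858^2 \approx 8.17\)), so the two tests give the same p-value.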
Removal Decisions
One variable that our initial exploration suggested could be very valuable was the dominance ratio. Dominance ratio is a value computed by TennisAbstract.com, the source of our data, as the percentage of return points won divided by the percentage of serve points lost. However, we decided against using dominance ratio in our final model because its correlation with winning percentage (0.86) was much higher than that of any other variable, and after controlling for dominance ratio most other variables were no longer significant predictors of winning percentage. Thus, in order to learn more about the effect of a variety of aspects of play, we created our final model without dominance ratio.
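Written as a formula, the definition above is as follows. The abbreviations RPW and SPW (percent of return points won and percent of service points won) are shorthand introduced here, and the numbers in the example are hypothetical rather than taken from our data.

\[ \text{Dominance Ratio} = \frac{\text{RPW}}{100 - \text{SPW}}, \qquad \text{e.g.} \quad \frac{40}{100 - 65} \approx 1.14. \]

A ratio above 1 means the player wins return points at a higher rate than opponents win return points against them. Since the ratio combines serve and return performance in a single number, it is unsurprising that it absorbs most of the signal carried by the individual serve and return variables.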
In this section, we will begin by reporting some preliminary data visualization, before reporting the results of our multiple linear regression. Expanding upon this linear model, we will report confidence intervals related to strengths and weaknesses of a player in our explanatory variables. The discussion portion of this report examines implications of our findings along with paths for future research.
To begin, Figure 3.1 shows that the distribution of winning percentages among the sampled players is roughly normal, with possibly a slight right skew. This is consistent with, though it does not by itself confirm, the assumption made by previous researchers and adopted here that points in a tennis match are independent and identically distributed.
Serving Percentages
Table 1: First serve percentage (% of first serves in)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 53.1 | 60.225 | 63 | 65.25 | 71.4 | 62.788 | 3.579798 | 50 | 0 |

Table 2: First serve points won (%)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 64.7 | 70.95 | 73.45 | 77.15 | 84.2 | 73.982 | 4.16495 | 50 | 0 |
Following the common coaching claim that players win more points when their first serve goes in, we look at both how often these professional players put their first serve in and how often they win the point when they do. Over the past year of professional matches, players put their first serve into play about 62.8% of the time on average (Table 1). On those first serves, players win the point 74.0% of the time on average (Table 2). Considering the level at which these players perform, the average rate of putting a first serve into play, a stroke over which the player has complete control, seems lower than one might expect.
Additionally, Figure 1, which plots the percentage of first serves put into play against winning percentage, shows a positive trend, although the association is not especially strong. Given that our observational units are the top 50 ranked players, this makes sense. These players all have strengths that have carried them to the top of the rankings, so players with lower first serve percentages may rely more heavily on other strengths to win matches and could still have similar winning percentages. Several unusual points are worth noting. The three points with the highest winning percentages are, unsurprisingly, the top three ranked players, who are considered the “Big Three” of men’s tennis. They have above-average first serve percentages, suggesting that the combination of a strong serve and the rest of their game helps push them to the top. The two points with the highest first serve percentages, by contrast, have relatively low winning percentages; the player with the second-highest first serve percentage has the second-lowest winning percentage.
Table 3: Break points saved (%)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 56.4 | 60.625 | 63.2 | 65.975 | 75.4 | 63.744 | 4.204427 | 50 | 0 |
Another service-game predictor that could influence our model is the percentage of break points saved as the server. This speaks to the server’s ability to play under the pressure of possibly losing an entire game. As Table 3 shows, these professionals save 63.7% of the break points their opponents obtain on average. However, this value is nearly identical to the simple average of the mean percentage of first serve points won and the mean percentage of second serve points won ((74.0 + 52.2) / 2 = 63.1). This suggests that whether or not a point is a break point has little effect on a player’s chances of winning it. Hence, we are not sure whether we will see this predictor in our final model.
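As a rough check on this comparison, the two serve rates can also be weighted by how often each serve is actually played rather than averaged equally. The sketch below uses the sample means reported above (62.8% first serves in, 74.0% and 52.2% of first and second serve points won). It assumes that the second serve percentage counts double faults as lost points, and it multiplies sample means rather than averaging player-level products, so it is only an approximation.

\[ P(\text{win service point}) \approx 0.628 \times 0.740 + (1 - 0.628) \times 0.522 \approx 0.659. \]

Under these assumptions the typical rate of winning a service point is about 65.9%, still within a few percentage points of the 63.7% of break points saved, so the weighted comparison points in the same direction as the simple average.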
Return of Serve Percentages
Table 4: Break points converted (%)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 25.3 | 36.05 | 39.5 | 42.075 | 48.8 | 39.376 | 4.322771 | 50 | 0 |

Table 5: Return points won (%)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 27.9 | 35.3 | 36.65 | 38.475 | 42.3 | 36.72 | 2.926166 | 50 | 0 |
On the other side of serving is returning serve. A player’s ability to win a return game, known as “breaking serve”, is an important factor in winning a match. Here we explore the average rate at which these professionals convert a break point, that is, win a point that would give them the return game. We find that professional players win a break point 39.4% of the time on average (Table 4). The average rate at which these professionals win return points overall is 36.7% (Table 5), which suggests that break points are won at nearly the same rate as any other return point. Figure 2 shows the association between the percentage of return points won and winning percentage. The plot shows a rather strong, positive linear association between these two variables, so it is worth investigating whether this association remains significant in a final model. It suggests that as a player’s rate of winning return points increases, we expect their winning percentage to increase as well. Additionally, the same “Big Three” appear here with rather dominant percentages of return points won.
Tiebreak Percentages
Table 6: Tiebreakers won (%)

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 26.7 | 48.95 | 55.15 | 60 | 87 | 54.14 | 11.34642 | 50 | 0 |
A final explanatory variable, which our references suggest as a route for further investigation, is the rate at which players win tiebreakers, whether at the set or match level. Within our analysis, we treat the two tiebreaker types as equivalent, since the structure and order of playing points is identical and the only difference between a set tiebreaker and a match tiebreaker is the number of points needed to win (7 points vs. 10 points). Figure 4 plots the rate at which players win tiebreakers, which has a mean of 54.1% (Table 6) among our sample of professional players, against winning percentage. The figure suggests a roughly linear, positive association between these two variables. Thus, we could expect that players who win a higher share of their tiebreakers will also have a higher winning percentage in general.
Model
Our model was created using bidirectional stepwise selection (R’s step function with direction "both"), which adds and removes variables according to AIC. The candidate variables were: First Serve Percentage (FSPct), First Serve Winning Percentage (FSPtW), Second Serve Winning Percentage (SSPtW), Ace Percentage (APct), Double Fault Percentage (DFPct), Return Points Won Percentage (RPtWPct), Break Point Conversion Percentage (BPCPct), Break Points Saved Percentage (BPSPct), Tiebreaker Win Percentage (TBWPct), and Mean Opponent Rank (MeanOppRk). From these variables, the procedure reduced our model to: \[ WPct = -254.5 + 2.66 \cdot RPtWPct + 1.53 \cdot FSPtW + 0.78\cdot FSPct + 0.86\cdot SSPtW + 0.17\cdot TBWPct.\]
All variables in this model were found to be statistically significant at the 0.001 level except for Tiebreaker Win Percentage, which was still significant at the 0.01 level. The model has an R-squared value of 0.81 (adjusted R-squared 0.79), so about 81% of the variation in winning percentage can be explained by our model.
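The R-squared values can be recovered from the sums of squares printed in the appendix, which serves as a quick consistency check. The formulas are the standard definitions; \(n = 50\) players and \(p = 5\) predictors, and the residual and total sums of squares (891.01 and 4681.2) are read from the stepwise output.

\[ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{891.01}{4681.2} \approx 0.810, \qquad R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - 0.190 \times \frac{49}{44} \approx 0.788. \]

The small gap between the two values indicates that the five retained predictors are not inflating the fit merely by their number.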
Assume that for each of the following interpretations all other variables are held constant.
Our model includes three different variables involving the serve. Among these, First Serve Points Won Percentage (\(t = 7.832, \, p < 0.001\); 95% CI: 1.14 - 1.93) predicts an average increase in winning percentage of 1.53 percentage points for each additional percentage point in points won after a made first serve. The next serve statistic, First Serve Percentage (\(t = 4.089, \, p < 0.001\); 95% CI: 0.40 - 1.16), predicts an average increase in winning percentage of 0.78 percentage points per additional percentage point of first serves in. The last serve statistic, Second Serve Points Won (\(t = 3.712, \, p < 0.001\); 95% CI: 0.39 - 1.33), predicts an average increase in winning percentage of 0.86 percentage points for each additional percentage point in second serve points won.
Only one statistic directly associated with return of serve made it into our model: Return Points Won Percentage (\(t = 9.864, \, p < 0.001\); 95% CI: 2.12 - 3.21). Its coefficient predicts an average increase in winning percentage of 2.66 percentage points for each additional percentage point in points won while returning. The only other statistic in our model, and the only one not tied to a single serve or return point, was Tiebreakers Won Percentage (\(t = 2.858, \, p < 0.01\); 95% CI: 0.05 - 0.29). Our model suggests an average increase of 0.17 percentage points in winning percentage for each additional percentage point in tiebreakers won.
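The t-statistics and confidence intervals quoted above can be reproduced from the estimates and standard errors in the appendix summary. The following is a minimal sketch that hard-codes those printed values instead of refitting the model (the original data frame is not assumed to be available); the object names `est`, `se`, and `ci` are illustrative.

```r
# Estimates and standard errors copied from summary(final_model) in the appendix
est <- c(RetPtWPct = 2.66314, FirstServePtW = 1.53344, FirstServePct = 0.78039,
         SecondServePtW = 0.86328, TBWPct = 0.17099)
se  <- c(RetPtWPct = 0.26998, FirstServePtW = 0.19579, FirstServePct = 0.19083,
         SecondServePtW = 0.23254, TBWPct = 0.05983)
df_resid <- 44  # 50 players minus 6 estimated coefficients

# t-statistic for H0: coefficient equals zero
t_stat <- est / se

# 95% confidence interval: estimate +/- critical t value times standard error
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(cbind(t = t_stat, ci), 2))
```

With the fitted model object available, `confint(final_model)` would give the same intervals directly.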
Table 7: Influential players by Cook’s distance

| Player | WinningPct | FirstServePct | FirstServePtW | CooksDistance | RetPtWPct |
|---|---|---|---|---|---|
| Milos Raonic | 64.3 | 62.5 | 84.2 | 0.2040603 | 34.7 |
| Reilly Opelka | 58.7 | 64.0 | 79.3 | 0.1240316 | 29.1 |
| Dusan Lajovic | 45.5 | 69.4 | 68.0 | 0.1177649 | 36.6 |
As Figure 5 demonstrates, this model contains three points with Cook’s distances greater than 5/45 (number of predictors / (sample size - number of predictors)), which we therefore label as influential. These three players and some of their notable statistics are given in Table 7.
We compared this model against alternatives with interaction terms or transformations and found no significant improvements. These results can be found in the appendix. Plots for checking the conditions required for linear regression, along with indicators of possible multicollinearity, are placed in the appendix as well.
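The influence screen described above can be sketched as follows. Because the scraped data are not reproduced here, the example fits a model to synthetic data of the same shape (50 observations, 5 predictors); the variable names and simulated values are placeholders, and only the cutoff \(p/(n - p)\) mirrors the rule used in this report.

```r
set.seed(1)  # synthetic data only; these are not the TennisAbstract statistics

n <- 50  # number of players
p <- 5   # number of predictors in the model

# Simulate five predictors and a response that depends on two of them
sim <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(sim) <- paste0("x", 1:p)
sim$y <- 60 + 2 * sim$x1 + 1.5 * sim$x2 + rnorm(n, sd = 4)

fit <- lm(y ~ ., data = sim)

# Cook's distance for every observation, compared with the report's cutoff
cooks <- cooks.distance(fit)
threshold <- p / (n - p)  # 5 / 45
influential <- which(cooks > threshold)

print(threshold)
print(cooks[influential])
```

Other conventions for the cutoff exist, so the set of flagged points depends on the rule chosen; the one above is simply the rule stated in the text.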
The multiple linear regression model reported above shows that, among the candidates we considered, the most valuable predictors of a tennis player’s winning percentage, and thus success, are the percentages of return points won, first serve points won, first serves put into play, second serve points won, and tiebreakers won. This closely aligns with previous research from O’Malley (2008) and George (1973) in emphasizing the importance of service and return in predicting which player will win. Following from O’Malley’s probability model for winning a single game within a match, our model suggests that winning percentage among the top-ranked male professional players relies heavily on serve and return as well. Yet George’s strong emphasis on how a player should sequence serves for an optimal strategy says little about which specific scenarios players at any level should train more or less. The results of this model give us a chance to better evaluate coaching methodology and focus.
With this model, we can identify which predictors have more practical significance than others and comment on what coaching and player development programs should emphasize to produce players with higher winning percentages. As noted above, coaching tends to emphasize playing a strong first serve followed by a safer second serve for the benefit of consistency, as George agrees, but gives far less attention to how to play return points. Serving gives a player an innate advantage, since the server is in complete control of the first stroke of the point. At the professional level, the average rate of winning a service point is over 70% on first serves and over 50% on second serves, whereas the average rate of winning a return point is relatively low at under 40%. Considering this and the reported model, increasing a player’s percentage of return points won by a single percentage point is associated with a 2.66 percentage point increase in winning percentage, the largest effect of any predictor. Hence, the implication of this model is that one of the most efficient ways for a player to potentially increase their winning percentage is to improve their return play.
This implication becomes even more interesting when the influential points of our model are considered. All three of the players singled out as influential (Milos Raonic, Dusan Lajovic, and Reilly Opelka) have negative residuals, meaning our model overpredicted their winning percentages. These three players follow a particular pattern: their winning percentages (64.3%, 45.5%, and 58.7% respectively) are mediocre relative to their serving statistics. They put roughly an average or greater share of first serves in, at 62.5%, 69.4%, and 64.0% respectively, and, more importantly, their percentages of first serve points won are 84.2%, 68.0%, and 79.3% respectively, which are very high for Raonic and Opelka. Again, note that this is relative to their respective winning percentages. These percentages, in combination with each player’s below-average percentage of return points won, suggest that our model may overvalue serving statistics in its predictions. Taking this into account, the implication that increasing return points won has the most potential to increase winning percentage becomes even stronger.
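To make the size of this effect concrete, the fitted equation can be evaluated directly. The sketch below hard-codes the rounded coefficients from the appendix and evaluates the model at the sample means reported in the tables (with 52.2% for second serve points won, as quoted earlier); the function name `predict_win_pct` and the choice of an "average" player as the baseline are illustrative assumptions.

```r
# Rounded coefficients from the final model in the appendix
coefs <- c(intercept = -254.5145, RetPtWPct = 2.6631, FirstServePtW = 1.5334,
           FirstServePct = 0.7804, SecondServePtW = 0.8633, TBWPct = 0.1710)

# Predicted winning percentage for a given set of point-level percentages
predict_win_pct <- function(RetPtWPct, FirstServePtW, FirstServePct,
                            SecondServePtW, TBWPct) {
  unname(coefs["intercept"] +
           coefs["RetPtWPct"] * RetPtWPct +
           coefs["FirstServePtW"] * FirstServePtW +
           coefs["FirstServePct"] * FirstServePct +
           coefs["SecondServePtW"] * SecondServePtW +
           coefs["TBWPct"] * TBWPct)
}

# Baseline: a player at the sample mean of every predictor
baseline <- predict_win_pct(36.72, 73.982, 62.788, 52.2, 54.14)

# Same player with one more percentage point of return points won
improved <- predict_win_pct(37.72, 73.982, 62.788, 52.2, 54.14)

print(c(baseline = baseline, improved = improved, gain = improved - baseline))
```

The baseline comes out at roughly 60%, and the gain equals the return coefficient by construction. As with any regression slope, this describes an association among these players and not a guaranteed effect of training.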
In considering the scope of inference, we must be careful about both generalizability and causality. With regard to generalizability, this model was built on matches among world-class male tennis players, so the players are roughly evenly matched and the range of skill is narrow. Youth and recreational tennis, by contrast, span a wide range, anywhere from beginner to collegiate athlete. League play typically groups players into similarly narrow skill bands, but a sample of professionals still does not support generalizing to those populations. For one, serves among professionals often exceed 100 mph, a speed that even the best youth players rarely reach. With regard to causality, this study had no experimental design. While it is tempting to say that improving any predictor in our model results in an increase in winning percentage, the observational nature of the data supports only association, not causation.

Despite these limits, the results open the door for further research. One path would be to collect similar statistics for players on high school teams within a given region, conference, or section to determine how these conclusions differ in a larger and more representative sample. Similarly, one could repeat these analyses by court surface to determine whether other explanatory factors play a stronger or weaker role in their relationship with winning percentage. Another path would be to include female professionals as well as male professionals, since differences in play style may bring other explanatory factors into the model.
A final consideration for further research could include data on what type of strokes are used in winning a given point, as this study focused almost strictly on what type of points are won. Following these recommendations for further research could allow for more accurate predictions of a player winning a match as well as beneficial coaching implementation.
Bibliography
George, S. L. (1973). Optimal Strategy in Tennis: A Simple Probabilistic Model. Applied Statistics, 22(1), 97–104. doi: 10.2307/2346309
O’Malley, A. (2008). Probability Formulas and Statistical Analysis in Tennis. Journal of Quantitative Analysis in Sports, 4(2). doi: 10.2202/1559-0410.1100
Final Model Code
# Final Model
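library(dplyr)  # provides %>% and select()

# Keep the response and the candidate predictors (Tennis is the scraped data frame)
FinalVars <- Tennis %>%
  select(WinningPct, MeanOppRk, AcePct, DFPct, FirstServePct, FirstServePtW,
         SecondServePtW, RetPtWPct, BPConvPct, BPSavedPct, TBWPct)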
mod1 <- lm(WinningPct ~ 1, data = FinalVars)
mod2 <- lm(WinningPct ~ ., data = FinalVars)
step(mod1, direction = "both", scope = formula(mod2))
## Start: AIC=228.96
## WinningPct ~ 1
##
## Df Sum of Sq RSS AIC
## + RetPtWPct 1 1039.22 3642.0 218.41
## + SecondServePtW 1 934.45 3746.8 219.83
## + MeanOppRk 1 914.82 3766.4 220.09
## + TBWPct 1 850.52 3830.7 220.94
## + BPConvPct 1 583.68 4097.5 224.31
## + BPSavedPct 1 466.80 4214.4 225.71
## + FirstServePtW 1 379.96 4301.3 226.73
## <none> 4681.2 228.96
## + FirstServePct 1 138.86 4542.3 229.46
## + DFPct 1 42.37 4638.8 230.51
## + AcePct 1 5.24 4676.0 230.91
##
## Step: AIC=218.41
## WinningPct ~ RetPtWPct
##
## Df Sum of Sq RSS AIC
## + FirstServePtW 1 1704.47 1937.5 188.86
## + BPSavedPct 1 1394.65 2247.3 196.27
## + AcePct 1 1364.98 2277.0 196.93
## + SecondServePtW 1 804.72 2837.3 207.93
## + TBWPct 1 738.98 2903.0 209.07
## + MeanOppRk 1 642.91 2999.1 210.70
## + FirstServePct 1 226.29 3415.7 217.21
## <none> 3642.0 218.41
## + DFPct 1 31.77 3610.2 219.97
## + BPConvPct 1 21.37 3620.6 220.12
## - RetPtWPct 1 1039.22 4681.2 228.96
##
## Step: AIC=188.86
## WinningPct ~ RetPtWPct + FirstServePtW
##
## Df Sum of Sq RSS AIC
## + FirstServePct 1 591.26 1346.3 172.65
## + SecondServePtW 1 500.25 1437.3 175.92
## + BPSavedPct 1 456.50 1481.0 177.42
## + DFPct 1 333.43 1604.1 181.41
## + TBWPct 1 237.26 1700.2 184.32
## <none> 1937.5 188.86
## + MeanOppRk 1 61.86 1875.6 189.23
## + BPConvPct 1 55.81 1881.7 189.40
## + AcePct 1 36.58 1900.9 189.90
## - FirstServePtW 1 1704.47 3642.0 218.41
## - RetPtWPct 1 2363.74 4301.3 226.73
##
## Step: AIC=172.65
## WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct
##
## Df Sum of Sq RSS AIC
## + SecondServePtW 1 289.83 1056.4 162.53
## + BPSavedPct 1 179.30 1167.0 167.51
## + TBWPct 1 176.15 1170.1 167.64
## + DFPct 1 112.40 1233.8 170.29
## <none> 1346.3 172.65
## + BPConvPct 1 29.98 1316.3 173.53
## + MeanOppRk 1 8.53 1337.7 174.34
## + AcePct 1 0.20 1346.0 174.65
## - FirstServePct 1 591.26 1937.5 188.86
## - FirstServePtW 1 2069.44 3415.7 217.21
## - RetPtWPct 1 2745.14 4091.4 226.23
##
## Step: AIC=162.53
## WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW
##
## Df Sum of Sq RSS AIC
## + TBWPct 1 165.41 891.0 156.02
## + BPSavedPct 1 49.35 1007.1 162.14
## <none> 1056.4 162.53
## + MeanOppRk 1 36.88 1019.5 162.75
## + BPConvPct 1 24.05 1032.4 163.38
## + AcePct 1 15.19 1041.2 163.81
## + DFPct 1 4.98 1051.4 164.29
## - SecondServePtW 1 289.83 1346.3 172.65
## - FirstServePct 1 380.84 1437.3 175.92
## - FirstServePtW 1 1689.39 2745.8 208.29
## - RetPtWPct 1 2337.18 3393.6 218.88
##
## Step: AIC=156.02
## WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW +
## TBWPct
##
## Df Sum of Sq RSS AIC
## <none> 891.01 156.02
## + BPConvPct 1 32.36 858.65 156.17
## + BPSavedPct 1 22.88 868.13 156.72
## + AcePct 1 5.50 885.51 157.71
## + MeanOppRk 1 5.29 885.72 157.72
## + DFPct 1 1.49 889.52 157.93
## - TBWPct 1 165.41 1056.42 162.53
## - SecondServePtW 1 279.09 1170.10 167.64
## - FirstServePct 1 338.64 1229.65 170.12
## - FirstServePtW 1 1242.11 2133.12 197.67
## - RetPtWPct 1 1970.48 2861.49 212.35
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct +
## SecondServePtW + TBWPct, data = FinalVars)
##
## Coefficients:
## (Intercept) RetPtWPct FirstServePtW FirstServePct
## -254.5145 2.6631 1.5334 0.7804
## SecondServePtW TBWPct
## 0.8633 0.1710
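# The AIC values in the trace above can be checked by hand. For linear models,
# step() reports n * log(RSS / n) + 2 * (number of estimated coefficients).
# The sketch below applies that formula to two RSS values printed in the trace;
# the helper name step_aic is illustrative and is not part of base R.

```r
n_obs <- 50  # number of players in the sample

# AIC as reported by step() for a linear model, up to an additive constant
step_aic <- function(rss, n_coef, n = n_obs) {
  n * log(rss / n) + 2 * n_coef
}

aic_null  <- step_aic(4681.2, n_coef = 1)  # intercept-only model
aic_final <- step_aic(891.01, n_coef = 6)  # intercept plus five predictors

print(round(c(null = aic_null, final = aic_final), 2))
```

# These agree with the starting and final AIC values in the trace
# (228.96 and 156.02) up to rounding of the printed RSS.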
final_model <- lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct +
SecondServePtW + TBWPct, data = FinalVars)
summary(final_model)
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct +
## SecondServePtW + TBWPct, data = FinalVars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7734 -3.2390 -0.3393 3.2544 8.4486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -254.51453 26.34480 -9.661 1.92e-12 ***
## RetPtWPct 2.66314 0.26998 9.864 1.02e-12 ***
## FirstServePtW 1.53344 0.19579 7.832 7.04e-10 ***
## FirstServePct 0.78039 0.19083 4.089 0.000181 ***
## SecondServePtW 0.86328 0.23254 3.712 0.000574 ***
## TBWPct 0.17099 0.05983 2.858 0.006489 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.5 on 44 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.788
## F-statistic: 37.43 on 5 and 44 DF, p-value: 8.802e-15
# Checks Variance inflation factors for multicollinearity
car::vif(final_model)
## RetPtWPct FirstServePtW FirstServePct SecondServePtW TBWPct
## 1.510121 1.609118 1.129266 1.092261 1.115057
# Fitting conditions for applying linear regression
plot(final_model)
# Experimental Models to test against final model
# Same variables as FinalVars, plus dominance ratio (DomRat)
FinalVars.DR <- Tennis %>%
  select(WinningPct, MeanOppRk, AcePct, DFPct, FirstServePct, FirstServePtW,
         SecondServePtW, RetPtWPct, BPConvPct, BPSavedPct, TBWPct, DomRat)
# Adding dominance ratio
stepwin.DR <- lm(formula = WinningPct ~ DomRat + TBWPct + BPConvPct, data = FinalVars.DR)
summary(stepwin.DR)
##
## Call:
## lm(formula = WinningPct ~ DomRat + TBWPct + BPConvPct, data = FinalVars.DR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9347 -2.9734 0.1278 3.4165 9.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.25670 8.35954 -4.696 2.42e-05 ***
## DomRat 72.55419 7.17899 10.106 2.90e-13 ***
## TBWPct 0.16753 0.06214 2.696 0.00977 **
## BPConvPct 0.30068 0.16234 1.852 0.07043 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.668 on 46 degrees of freedom
## Multiple R-squared: 0.7859, Adjusted R-squared: 0.7719
## F-statistic: 56.27 on 3 and 46 DF, p-value: 1.979e-15
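# Note: these two models are not nested, so this F-test is only a rough
# comparison; AIC or adjusted R-squared are the safer basis here
anova(final_model, stepwin.DR)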
## Analysis of Variance Table
##
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW +
## TBWPct
## Model 2: WinningPct ~ DomRat + TBWPct + BPConvPct
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 891.01
## 2 46 1002.43 -2 -111.42 2.751 0.07486 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Taking out TBWPct since not at the point level
serve_return <- lm(WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW, data = FinalVars)
summary(serve_return)
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct +
## SecondServePtW, data = FinalVars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.887 -3.081 1.066 2.974 8.243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -267.6464 27.9309 -9.582 1.94e-12 ***
## RetPtWPct 2.8310 0.2837 9.978 5.58e-13 ***
## FirstServePtW 1.7036 0.2008 8.483 6.90e-11 ***
## FirstServePct 0.8248 0.2048 4.028 0.000214 ***
## SecondServePtW 0.8795 0.2503 3.514 0.001020 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.845 on 45 degrees of freedom
## Multiple R-squared: 0.7743, Adjusted R-squared: 0.7543
## F-statistic: 38.6 on 4 and 45 DF, p-value: 5.232e-14
anova(serve_return, final_model)
## Analysis of Variance Table
##
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW
## Model 2: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW +
## TBWPct
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 1056.42
## 2 44 891.01 1 165.41 8.1683 0.006489 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Interaction between first serves in play and
# winning first serve points
serve.int <- lm(WinningPct ~ RetPtWPct + FirstServePtW + FirstServePtW:FirstServePct + FirstServePct + SecondServePtW, data = FinalVars)
summary(serve.int)
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePtW:FirstServePct +
## FirstServePct + SecondServePtW, data = FinalVars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4716 -2.7015 0.7377 2.9932 8.0991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.87840 217.92601 0.243 0.80941
## RetPtWPct 2.90871 0.28489 10.210 3.51e-13 ***
## FirstServePtW -2.60550 2.91301 -0.894 0.37596
## FirstServePct -4.29169 3.45671 -1.242 0.22098
## SecondServePtW 0.85630 0.24753 3.459 0.00121 **
## FirstServePtW:FirstServePct 0.06846 0.04617 1.483 0.14528
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.782 on 44 degrees of freedom
## Multiple R-squared: 0.7851, Adjusted R-squared: 0.7606
## F-statistic: 32.14 on 5 and 44 DF, p-value: 1.221e-13
# The interaction is not a significant improvement over serve_return,
# so we do not carry it forward into final_model
anova(serve_return, serve.int)
## Analysis of Variance Table
##
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW
## Model 2: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePtW:FirstServePct +
## FirstServePct + SecondServePtW
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 1056.4
## 2 44 1006.1 1 50.271 2.1984 0.1453
# Transformed variables for the experimental models
FinalVars.DR <- FinalVars.DR %>%
  mutate(log_WPct = log(WinningPct),
         log_FSPct = log(FirstServePct),
         cube_TBWPct = (TBWPct)^3,
         sqr_Ret = (RetPtWPct)^2)
# Lower adjusted R-squared and a harder interpretation; the response is on a
# different scale here, so the R-squared comparison is only indicative
logWPct <- lm(log_WPct ~ RetPtWPct + FirstServePtW + FirstServePct +
                SecondServePtW + TBWPct, data = FinalVars.DR)
summary(logWPct)
##
## Call:
## lm(formula = log_WPct ~ RetPtWPct + FirstServePtW + FirstServePct +
## SecondServePtW + TBWPct, data = FinalVars.DR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.172505 -0.055925 0.000312 0.054429 0.133356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.795961 0.465869 -1.709 0.09458 .
## RetPtWPct 0.041699 0.004774 8.734 3.65e-11 ***
## FirstServePtW 0.024480 0.003462 7.070 9.02e-09 ***
## FirstServePct 0.011768 0.003375 3.487 0.00112 **
## SecondServePtW 0.012624 0.004112 3.070 0.00366 **
## TBWPct 0.002563 0.001058 2.423 0.01958 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07958 on 44 degrees of freedom
## Multiple R-squared: 0.7653, Adjusted R-squared: 0.7387
## F-statistic: 28.7 on 5 and 44 DF, p-value: 8.143e-13
# Log-transformed first-serve percentage: adjusted R-squared is not as high,
# and the coefficient is more complex to interpret
logFSPct <- lm(WinningPct ~ RetPtWPct + FirstServePtW + log_FSPct +
                 SecondServePtW + TBWPct, data = FinalVars.DR)
summary(logFSPct)
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + log_FSPct +
## SecondServePtW + TBWPct, data = FinalVars.DR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8005 -3.3409 -0.3298 3.2006 8.4268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -406.24447 56.70684 -7.164 6.58e-09 ***
## RetPtWPct 2.65940 0.27023 9.841 1.09e-12 ***
## FirstServePtW 1.53656 0.19635 7.826 7.19e-10 ***
## log_FSPct 48.62038 11.95546 4.067 0.000194 ***
## SecondServePtW 0.85270 0.23358 3.651 0.000691 ***
## TBWPct 0.17079 0.05992 2.850 0.006627 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.507 on 44 degrees of freedom
## Multiple R-squared: 0.8091, Adjusted R-squared: 0.7874
## F-statistic: 37.29 on 5 and 44 DF, p-value: 9.4e-15
# Both response and first-serve percentage logged: again a lower adjusted
# R-squared in exchange for more complicated interpretations
logged <- lm(log_WPct ~ RetPtWPct + FirstServePtW + log_FSPct +
               SecondServePtW + TBWPct, data = FinalVars.DR)
summary(logged)
##
## Call:
## lm(formula = log_WPct ~ RetPtWPct + FirstServePtW + log_FSPct +
## SecondServePtW + TBWPct, data = FinalVars.DR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.173932 -0.057441 -0.000197 0.055910 0.133031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.083484 1.002408 -3.076 0.00360 **
## RetPtWPct 0.041642 0.004777 8.718 3.85e-11 ***
## FirstServePtW 0.024526 0.003471 7.066 9.14e-09 ***
## log_FSPct 0.733064 0.211337 3.469 0.00118 **
## SecondServePtW 0.012465 0.004129 3.019 0.00421 **
## TBWPct 0.002560 0.001059 2.417 0.01986 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07967 on 44 degrees of freedom
## Multiple R-squared: 0.7648, Adjusted R-squared: 0.7381
## F-statistic: 28.61 on 5 and 44 DF, p-value: 8.557e-13
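The comments above describe the logged models as harder to interpret. The sketch below works through what two of the printed coefficients mean on the original scale, holding the other predictors fixed. It uses only estimates copied from the summaries above; the variable names are illustrative.

```r
# Coefficients copied from the summaries above
b_log_fspct <- 48.62038  # log_FSPct in the logFSPct model (response: WinningPct)
b_ret_logy  <- 0.041699  # RetPtWPct in the logWPct model (response: log_WPct)

# Log predictor: a 1% relative increase in FirstServePct shifts the
# predicted WinningPct by b * log(1.01) percentage points
effect_1pct_fspct <- b_log_fspct * log(1.01)

# Log response: one more percentage point of RetPtWPct multiplies the
# predicted WinningPct by exp(b), a relative change of exp(b) - 1
rel_change_ret <- exp(b_ret_logy) - 1

round(c(effect_1pct_fspct = effect_1pct_fspct,
        rel_change_ret = rel_change_ret), 4)
```

In the untransformed models a coefficient is read directly as percentage points of `WinningPct` per unit of the predictor, which is the simpler interpretation the comments refer to.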
# Cubed term for TBWPct, suggested by the scatterplot above
cubeTBWPct <- lm(WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct +
                   SecondServePtW + cube_TBWPct + TBWPct, data = FinalVars.DR)
summary(cubeTBWPct)
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct +
## SecondServePtW + cube_TBWPct + TBWPct, data = FinalVars.DR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.921 -3.312 -0.040 3.344 8.207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.607e+02 3.193e+01 -8.164 2.78e-10 ***
## RetPtWPct 2.702e+00 2.944e-01 9.176 1.10e-11 ***
## FirstServePtW 1.546e+00 2.011e-01 7.690 1.31e-09 ***
## FirstServePct 8.000e-01 2.008e-01 3.985 0.000257 ***
## SecondServePtW 8.733e-01 2.366e-01 3.691 0.000625 ***
## cube_TBWPct -6.595e-06 1.884e-05 -0.350 0.728051
## TBWPct 2.309e-01 1.815e-01 1.272 0.210125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.546 on 43 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.7837
## F-statistic: 30.59 on 6 and 43 DF, p-value: 5.391e-14
anova(final_model, cubeTBWPct)
## Analysis of Variance Table
##
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW +
## TBWPct
## Model 2: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW +
## cube_TBWPct + TBWPct
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 891.01
## 2 43 888.48 1 2.5311 0.1225 0.7281
# Squared term for RetPtWPct, suggested by the scatterplot above
sqrRet <- lm(WinningPct ~ RetPtWPct + sqr_Ret + FirstServePtW + FirstServePct +
               SecondServePtW + TBWPct, data = FinalVars.DR)
summary(sqrRet)
##
## Call:
## lm(formula = WinningPct ~ RetPtWPct + sqr_Ret + FirstServePtW +
## FirstServePct + SecondServePtW + TBWPct, data = FinalVars.DR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3354 -3.2610 -0.1187 3.4057 9.5859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -107.58180 79.82591 -1.348 0.18481
## RetPtWPct -4.48511 3.68850 -1.216 0.23063
## sqr_Ret 0.09936 0.05114 1.943 0.05860 .
## FirstServePtW 1.47845 0.19200 7.700 1.26e-09 ***
## FirstServePct 0.62957 0.20071 3.137 0.00308 **
## SecondServePtW 0.76658 0.23096 3.319 0.00185 **
## TBWPct 0.15868 0.05837 2.718 0.00942 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.365 on 43 degrees of freedom
## Multiple R-squared: 0.825, Adjusted R-squared: 0.8006
## F-statistic: 33.79 on 6 and 43 DF, p-value: 9.715e-15
anova(final_model, sqrRet)
## Analysis of Variance Table
##
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW +
## TBWPct
## Model 2: WinningPct ~ RetPtWPct + sqr_Ret + FirstServePtW + FirstServePct +
## SecondServePtW + TBWPct
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 891.01
## 2 43 819.11 1 71.906 3.7748 0.0586 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
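Several of the comparisons above rest on adjusted R-squared, so it may help to see how it penalizes extra terms. This is a small sketch that recomputes the adjusted values from the printed multiple R-squared values, assuming n = 50 players (residual df plus the number of coefficients). The helper `adj_r2` is hypothetical and not part of the original analysis.

```r
# Adjusted R-squared from multiple R-squared, sample size n, and p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 50  # assumed from the residual degrees of freedom in the summaries

adj <- c(
  logFSPct   = adj_r2(0.8091, n, 5),
  cubeTBWPct = adj_r2(0.8102, n, 6),
  sqrRet     = adj_r2(0.8250, n, 6)
)
round(adj, 4)
```

The cubed tiebreak model has a slightly higher multiple R-squared than `logFSPct` but a lower adjusted value, because its sixth predictor adds almost no explanatory power.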
Summary Statistics for EDA
favstats(~ DFPct, data = Tennis) # % double faults
## min Q1 median Q3 max mean sd n missing
## 1.7 2.6 3.4 4.1 7.7 3.564 1.370351 50 0
favstats(~ BPSavedPct, data = Tennis) # % of break points faced that the player saved
## min Q1 median Q3 max mean sd n missing
## 56.4 60.625 63.2 65.975 75.4 63.744 4.204427 50 0
favstats(~ DomRat, data = Tennis) # % return points won / % serve points lost
## min Q1 median Q3 max mean sd n missing
## 0.94 1.02 1.06 1.1075 1.42 1.081 0.1026834 50 0
favstats(~ AcePct, data = Tennis) # % of serves that are aces
## min Q1 median Q3 max mean sd n missing
## 2.1 5.5 8.1 11.75 25.4 9.438 5.457475 50 0
favstats(~ InPlayRetPtW, data = Tennis) # % return points won excluding aces and double faults
## min Q1 median Q3 max mean sd n missing
## 27.7 35.775 37.6 39.275 44.4 37.47 3.321805 50 0
favstats(~ SecondServePtW, data = Tennis) # % serve points won starting with a second serve
## min Q1 median Q3 max mean sd n missing
## 42.9 50.6 52.2 54.5 58.5 52.248 2.889265 50 0
favstats(~ TBWPct, data = Tennis) # percent of tiebreakers won
## min Q1 median Q3 max mean sd n missing
## 26.7 48.95 55.15 60 87 54.14 11.34642 50 0
EDA
# EDA for most variables we examine
ggplot(Tennis, aes(x = FirstServePct, y = WinningPct)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(Tennis, aes(BreakPct, WinningPct)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(Tennis, aes(WinningPct, ..density..)) +
  geom_histogram() +
  geom_density()
ggplot(Tennis, aes(WinningPct)) +
  geom_histogram()
ggplot(Tennis, aes(Rank, WinningPct)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, color = "red")
cor(Rank, WinningPct, data = Tennis)
ggplot(Tennis, aes(TBWPct, WinningPct)) +
  geom_point() +
  geom_smooth(se = FALSE)
Correlation
InitVars <- Tennis %>%
select(WinningPct, Rank, FirstServePct, FirstServePtW, BPConvPct, BreakPct, DFPct, BPSavedPct, DomRat, AcePct, InPlayRetPtW, SecondServePtW, TBWPct)
pairs(InitVars) # scatterplot matrix
round(cor(InitVars), 2) # table of correlations
## WinningPct Rank FirstServePct FirstServePtW BPConvPct
## WinningPct 1.00 -0.65 0.17 0.28 0.35
## Rank -0.65 1.00 -0.13 -0.14 -0.22
## FirstServePct 0.17 -0.13 1.00 -0.13 0.01
## FirstServePtW 0.28 -0.14 -0.13 1.00 -0.37
## BPConvPct 0.35 -0.22 0.01 -0.37 1.00
## BreakPct 0.52 -0.45 -0.01 -0.51 0.70
## DFPct -0.10 -0.16 -0.38 0.26 -0.13
## BPSavedPct 0.32 -0.09 0.25 0.55 -0.42
## DomRat 0.86 -0.58 0.13 0.40 0.29
## AcePct 0.03 0.07 0.04 0.85 -0.47
## InPlayRetPtW 0.45 -0.36 -0.08 -0.53 0.65
## SecondServePtW 0.45 -0.13 0.19 0.10 0.08
## TBWPct 0.43 -0.36 0.01 0.23 -0.01
## BreakPct DFPct BPSavedPct DomRat AcePct InPlayRetPtW
## WinningPct 0.52 -0.10 0.32 0.86 0.03 0.45
## Rank -0.45 -0.16 -0.09 -0.58 0.07 -0.36
## FirstServePct -0.01 -0.38 0.25 0.13 0.04 -0.08
## FirstServePtW -0.51 0.26 0.55 0.40 0.85 -0.53
## BPConvPct 0.70 -0.13 -0.42 0.29 -0.47 0.65
## BreakPct 1.00 -0.07 -0.39 0.47 -0.69 0.96
## DFPct -0.07 1.00 -0.09 -0.12 0.26 -0.10
## BPSavedPct -0.39 -0.09 1.00 0.34 0.51 -0.40
## DomRat 0.47 -0.12 0.34 1.00 0.09 0.43
## AcePct -0.69 0.26 0.51 0.09 1.00 -0.74
## InPlayRetPtW 0.96 -0.10 -0.40 0.43 -0.74 1.00
## SecondServePtW 0.12 -0.65 0.41 0.56 -0.02 0.11
## TBWPct 0.03 0.09 0.25 0.31 0.17 0.03
## SecondServePtW TBWPct
## WinningPct 0.45 0.43
## Rank -0.13 -0.36
## FirstServePct 0.19 0.01
## FirstServePtW 0.10 0.23
## BPConvPct 0.08 -0.01
## BreakPct 0.12 0.03
## DFPct -0.65 0.09
## BPSavedPct 0.41 0.25
## DomRat 0.56 0.31
## AcePct -0.02 0.17
## InPlayRetPtW 0.11 0.03
## SecondServePtW 1.00 0.09
## TBWPct 0.09 1.00
Initial Modeling
# Initial Modeling setup and results
#######SETUP#######
model1 <- lm(WinningPct ~ 1, data = Tennis)
model2 <- lm(Rank ~ 1, data = Tennis)
# These subsets and models make it easy to run the stepwise functions using
# only the initial variables we picked out as candidates for the final model.
WInitVars <- InitVars %>%
  select(-Rank)
WInitVars.noDR <- InitVars %>%
  select(-Rank, -DomRat)
RInitVars <- InitVars %>%
  select(-WinningPct)
model1Init <- lm(WinningPct ~ ., data = WInitVars)
model1Init.noDR <- lm(WinningPct ~ ., data = WInitVars.noDR)
model2Init <- lm(Rank ~ ., data = RInitVars)
## CLEAN OUT VARIABLES BEFORE USING BELOW ##
# These are for throwing all of the predictors into one big stepwise search.
# Some variables are just calculations of others, and some are raw counts that
# are largely driven by the number of matches played, so we should review all
# of the variables and drop the ones we don't want before using these.
#WPredictors <- Tennis %>%
# select(-Rank, -Player, -Country)
#WPredNoDR <- Tennis %>%
# select(-Rank, -Player, -Country, -DomRat)
#RPredictors <- Tennis %>%
# select(-WinningPct, -Player, -Country)
#RPredNoDR <- Tennis %>%
# select(-WinningPct, -Player, -Country, -DomRat)
#model1big <- lm(WinningPct ~ ., data = WPredictors)
#model1big.noDR <- lm(WinningPct ~ ., data = WPredNoDR)
#model2big <- lm(Rank ~ ., data = RPredictors)
#model2big.noDR <- lm(Rank ~ ., data = RPredNoDR)
#######EXPERIMENTAL MODELS#######
model1a <- lm(WinningPct ~ DomRat, data = Tennis)
model2a <- lm(Rank ~ DomRat, data = Tennis)
summary(model1a)
##
## Call:
## lm(formula = WinningPct ~ DomRat, data = Tennis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3637 -3.6617 0.4441 3.6489 9.2440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.475 7.596 -3.749 0.000478 ***
## DomRat 81.924 6.996 11.710 1.13e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.029 on 48 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.7353
## F-statistic: 137.1 on 1 and 48 DF, p-value: 1.126e-15
summary(model2a)
##
## Call:
## lm(formula = Rank ~ DomRat, data = Tennis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0148 -10.1341 0.5657 10.2336 24.2366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.77 18.10 6.342 7.52e-08 ***
## DomRat -82.58 16.67 -4.955 9.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.98 on 48 degrees of freedom
## Multiple R-squared: 0.3384, Adjusted R-squared: 0.3246
## F-statistic: 24.55 on 1 and 48 DF, p-value: 9.408e-06
model1b <- lm(WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW, data = Tennis)
model2b <- lm(Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW, data = Tennis)
summary(model1b)
##
## Call:
## lm(formula = WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW,
## data = Tennis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1489 -3.8786 -0.0153 3.8547 12.9941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -191.4998 27.1136 -7.063 7.33e-09 ***
## InPlayRetPtW 2.2513 0.3023 7.447 1.95e-09 ***
## FirstServePtW 1.5532 0.2406 6.455 5.99e-08 ***
## SecondServePtW 1.0013 0.2962 3.381 0.00148 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.848 on 46 degrees of freedom
## Multiple R-squared: 0.664, Adjusted R-squared: 0.6421
## F-statistic: 30.3 on 3 and 46 DF, p-value: 5.779e-11
summary(model2b)
##
## Call:
## lm(formula = Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW,
## data = Tennis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8456 -11.2203 -0.0928 8.9812 23.1730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 245.81915 59.11976 4.158 0.000138 ***
## InPlayRetPtW -2.63672 0.65914 -4.000 0.000228 ***
## FirstServePtW -1.58880 0.52463 -3.028 0.004022 **
## SecondServePtW -0.07615 0.64578 -0.118 0.906647
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.75 on 46 degrees of freedom
## Multiple R-squared: 0.2818, Adjusted R-squared: 0.235
## F-statistic: 6.017 on 3 and 46 DF, p-value: 0.001515
model1c <- lm(WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW + DomRat, data = Tennis)
model2c <- lm(Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW + DomRat, data = Tennis)
summary(model1c)
##
## Call:
## lm(formula = WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW +
## DomRat, data = Tennis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8319 -3.0307 0.5679 3.2883 9.3918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.31668 48.68238 -0.541 0.591463
## InPlayRetPtW 0.21080 0.58799 0.359 0.721634
## FirstServePtW -0.03969 0.46085 -0.086 0.931747
## SecondServePtW -0.13093 0.38993 -0.336 0.738595
## DomRat 81.66449 21.01745 3.886 0.000332 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.116 on 45 degrees of freedom
## Multiple R-squared: 0.7484, Adjusted R-squared: 0.726
## F-statistic: 33.46 on 4 and 45 DF, p-value: 5.85e-13
summary(model2c)
##
## Call:
## lm(formula = Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW +
## DomRat, data = Tennis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.554 -7.625 -2.319 9.064 21.635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -81.5941 109.1976 -0.747 0.45882
## InPlayRetPtW 1.4078 1.3189 1.067 0.29147
## FirstServePtW 1.5686 1.0337 1.517 0.13615
## SecondServePtW 2.1680 0.8746 2.479 0.01700 *
## DomRat -161.8691 47.1434 -3.434 0.00129 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared: 0.4309, Adjusted R-squared: 0.3803
## F-statistic: 8.518 on 4 and 45 DF, p-value: 3.318e-05
# This was mostly to check whether DomRat would overpower the other variables,
# and it appears to, at least with WinningPct as the response. We may need to
# look at other specific factors and leave DomRat out, although we still have
# to mention it. It has less of an effect with Rank as the response, so we
# could look at what improves on DomRat's predictions of Rank if we want more
# to discuss.
#######STEPWISE MODELS#######
# These are the stepwise models generated with only the initial set of
# variables as the full model
step(model1, direction = "both", scope = formula(model1Init))
## Start: AIC=228.96
## WinningPct ~ 1
##
## Df Sum of Sq RSS AIC
## + DomRat 1 3467.5 1213.7 163.47
## + BreakPct 1 1283.0 3398.2 214.95
## + InPlayRetPtW 1 941.8 3739.4 219.73
## + SecondServePtW 1 934.4 3746.8 219.83
## + TBWPct 1 850.5 3830.7 220.94
## + BPConvPct 1 583.7 4097.5 224.31
## + BPSavedPct 1 466.8 4214.4 225.71
## + FirstServePtW 1 380.0 4301.3 226.73
## <none> 4681.2 228.96
## + FirstServePct 1 138.9 4542.3 229.46
## + DFPct 1 42.4 4638.8 230.51
## + AcePct 1 5.2 4676.0 230.91
##
## Step: AIC=163.47
## WinningPct ~ DomRat
##
## Df Sum of Sq RSS AIC
## + TBWPct 1 136.5 1077.2 159.50
## + BreakPct 1 81.7 1132.0 161.99
## + BPConvPct 1 52.9 1160.8 163.24
## <none> 1213.7 163.47
## + InPlayRetPtW 1 32.4 1181.3 164.12
## + FirstServePtW 1 19.8 1193.9 164.65
## + FirstServePct 1 15.4 1198.3 164.83
## + AcePct 1 7.8 1206.0 165.15
## + SecondServePtW 1 7.7 1206.0 165.15
## + BPSavedPct 1 2.7 1211.0 165.36
## + DFPct 1 0.1 1213.6 165.47
## - DomRat 1 3467.5 4681.2 228.96
##
## Step: AIC=159.5
## WinningPct ~ DomRat + TBWPct
##
## Df Sum of Sq RSS AIC
## + BreakPct 1 117.04 960.1 155.75
## + BPConvPct 1 74.75 1002.4 157.91
## + InPlayRetPtW 1 51.57 1025.6 159.05
## <none> 1077.2 159.50
## + FirstServePtW 1 35.17 1042.0 159.84
## + AcePct 1 21.82 1055.4 160.48
## + FirstServePct 1 18.24 1058.9 160.65
## + SecondServePtW 1 2.62 1074.6 161.38
## + DFPct 1 1.61 1075.6 161.43
## + BPSavedPct 1 0.06 1077.1 161.50
## - TBWPct 1 136.55 1213.7 163.47
## - DomRat 1 2753.51 3830.7 220.94
##
## Step: AIC=155.75
## WinningPct ~ DomRat + TBWPct + BreakPct
##
## Df Sum of Sq RSS AIC
## + InPlayRetPtW 1 101.37 858.77 152.17
## + BPSavedPct 1 82.86 877.27 153.24
## + AcePct 1 58.34 901.80 154.62
## + FirstServePtW 1 46.17 913.97 155.29
## <none> 960.14 155.75
## + FirstServePct 1 27.68 932.46 156.29
## + BPConvPct 1 3.71 956.43 157.56
## + DFPct 1 1.52 958.62 157.67
## + SecondServePtW 1 0.43 959.71 157.73
## - BreakPct 1 117.04 1077.18 159.50
## - TBWPct 1 171.89 1132.03 161.99
## - DomRat 1 1638.79 2598.93 203.54
##
## Step: AIC=152.17
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW
##
## Df Sum of Sq RSS AIC
## + BPSavedPct 1 71.45 787.32 149.83
## <none> 858.77 152.17
## + FirstServePtW 1 25.81 832.96 152.65
## + AcePct 1 14.05 844.72 153.35
## + FirstServePct 1 8.21 850.56 153.69
## + DFPct 1 5.62 853.14 153.84
## + SecondServePtW 1 1.20 857.57 154.10
## + BPConvPct 1 0.20 858.57 154.16
## - InPlayRetPtW 1 101.37 960.14 155.75
## - BreakPct 1 166.84 1025.61 159.05
## - TBWPct 1 183.33 1042.09 159.85
## - DomRat 1 1549.48 2408.24 201.73
##
## Step: AIC=149.83
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct
##
## Df Sum of Sq RSS AIC
## + FirstServePtW 1 60.67 726.65 147.82
## <none> 787.32 149.83
## + AcePct 1 24.08 763.24 150.28
## + BPConvPct 1 7.10 780.22 151.38
## + DFPct 1 1.91 785.41 151.71
## + FirstServePct 1 1.20 786.11 151.75
## + SecondServePtW 1 0.66 786.66 151.79
## - BPSavedPct 1 71.45 858.77 152.17
## - InPlayRetPtW 1 89.95 877.27 153.24
## - TBWPct 1 160.72 948.04 157.12
## - BreakPct 1 206.18 993.50 159.46
## - DomRat 1 683.54 1470.86 179.08
##
## Step: AIC=147.82
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct +
## FirstServePtW
##
## Df Sum of Sq RSS AIC
## + DFPct 1 67.408 659.24 144.95
## + FirstServePct 1 54.605 672.05 145.91
## + SecondServePtW 1 35.825 690.83 147.29
## <none> 726.65 147.82
## + BPConvPct 1 7.828 718.82 149.28
## - InPlayRetPtW 1 59.254 785.90 149.74
## + AcePct 1 0.131 726.52 149.81
## - FirstServePtW 1 60.668 787.32 149.83
## - DomRat 1 75.017 801.67 150.73
## - BPSavedPct 1 106.307 832.96 152.65
## - TBWPct 1 153.918 880.57 155.43
## - BreakPct 1 264.795 991.45 161.36
##
## Step: AIC=144.95
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct +
## FirstServePtW + DFPct
##
## Df Sum of Sq RSS AIC
## + FirstServePct 1 53.06 606.18 142.76
## - DomRat 1 5.51 664.75 143.37
## <none> 659.24 144.95
## + SecondServePtW 1 4.16 655.08 146.64
## + AcePct 1 1.70 657.54 146.82
## + BPConvPct 1 1.41 657.83 146.85
## - InPlayRetPtW 1 57.41 716.65 147.13
## - DFPct 1 67.41 726.65 147.82
## - BPSavedPct 1 120.32 779.56 151.34
## - FirstServePtW 1 126.17 785.41 151.71
## - TBWPct 1 184.24 843.48 155.28
## - BreakPct 1 330.73 989.98 163.28
##
## Step: AIC=142.76
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct +
## FirstServePtW + DFPct + FirstServePct
##
## Df Sum of Sq RSS AIC
## + SecondServePtW 1 73.83 532.36 138.26
## - DomRat 1 1.34 607.53 140.87
## - InPlayRetPtW 1 14.71 620.89 141.96
## <none> 606.18 142.76
## + AcePct 1 0.99 605.20 144.68
## + BPConvPct 1 0.64 605.54 144.71
## - FirstServePct 1 53.06 659.24 144.95
## - DFPct 1 65.86 672.05 145.91
## - BPSavedPct 1 112.90 719.09 149.30
## - FirstServePtW 1 178.87 785.05 153.69
## - TBWPct 1 191.66 797.85 154.50
## - BreakPct 1 331.34 937.53 162.56
##
## Step: AIC=138.26
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct +
## FirstServePtW + DFPct + FirstServePct + SecondServePtW
##
## Df Sum of Sq RSS AIC
## - InPlayRetPtW 1 0.06 532.41 136.27
## - DFPct 1 1.10 533.46 136.37
## <none> 532.36 138.26
## + AcePct 1 9.26 523.09 139.39
## + BPConvPct 1 0.11 532.25 140.25
## - DomRat 1 52.75 585.11 140.99
## - SecondServePtW 1 73.83 606.18 142.76
## - BPSavedPct 1 82.95 615.31 143.50
## - FirstServePct 1 122.72 655.08 146.64
## - TBWPct 1 228.93 761.29 154.15
## - FirstServePtW 1 246.18 778.53 155.27
## - BreakPct 1 396.96 929.31 164.12
##
## Step: AIC=136.27
## WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + FirstServePtW +
## DFPct + FirstServePct + SecondServePtW
##
## Df Sum of Sq RSS AIC
## - DFPct 1 1.32 533.74 134.39
## <none> 532.41 136.27
## + AcePct 1 9.26 523.15 137.39
## + BPConvPct 1 0.09 532.32 138.26
## + InPlayRetPtW 1 0.06 532.36 138.26
## - DomRat 1 67.90 600.31 140.27
## - BPSavedPct 1 82.91 615.32 141.51
## - SecondServePtW 1 88.48 620.89 141.96
## - FirstServePct 1 175.88 708.30 148.54
## - TBWPct 1 229.10 761.51 152.16
## - FirstServePtW 1 330.95 863.36 158.44
## - BreakPct 1 478.65 1011.06 166.34
##
## Step: AIC=134.39
## WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + FirstServePtW +
## FirstServePct + SecondServePtW
##
## Df Sum of Sq RSS AIC
## <none> 533.74 134.39
## + AcePct 1 10.52 523.21 135.40
## + DFPct 1 1.32 532.41 136.27
## + InPlayRetPtW 1 0.28 533.46 136.37
## + BPConvPct 1 0.18 533.56 136.38
## - DomRat 1 70.28 604.02 138.58
## - BPSavedPct 1 81.79 615.52 139.52
## - SecondServePtW 1 153.51 687.25 145.03
## - FirstServePct 1 206.45 740.18 148.74
## - TBWPct 1 228.26 762.00 150.20
## - FirstServePtW 1 329.74 863.47 156.45
## - BreakPct 1 478.36 1012.09 164.39
##
## Call:
## lm(formula = WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct +
## FirstServePtW + FirstServePct + SecondServePtW, data = Tennis)
##
## Coefficients:
## (Intercept) DomRat TBWPct BreakPct
## -270.1609 -55.8446 0.2089 2.5595
## BPSavedPct FirstServePtW FirstServePct SecondServePtW
## 0.4602 2.4912 0.7835 1.1659
# lm(formula = WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + FirstServePtW + FirstServePct + SecondServePtW, data = Tennis)
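The AIC values in the `step()` trace above can be checked directly from the printed RSS values. For a linear model, `step()` reports n * log(RSS / n) + 2 * (number of estimated coefficients), which is AIC up to an additive constant. The sketch below recomputes three of the printed values, assuming n = 50 players; the helper `step_aic` is hypothetical.

```r
# AIC as reported by step() for lm fits (up to an additive constant)
step_aic <- function(rss, n, n_coef) n * log(rss / n) + 2 * n_coef

n <- 50  # assumed number of players

aic_null   <- step_aic(4681.2, n, 1)  # WinningPct ~ 1
aic_domrat <- step_aic(1213.7, n, 2)  # WinningPct ~ DomRat
aic_final  <- step_aic(533.74, n, 8)  # selected model: intercept + 7 predictors

round(c(aic_null, aic_domrat, aic_final), 2)
```

At each step the procedure adds or drops the single term that gives the lowest value of this quantity, and it stops when `<none>` is at the top of the table.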
step(model1, direction = "both", scope = formula(model1Init.noDR))
## Start: AIC=228.96
## WinningPct ~ 1
##
## Df Sum of Sq RSS AIC
## + BreakPct 1 1282.98 3398.2 214.95
## + InPlayRetPtW 1 941.83 3739.4 219.73
## + SecondServePtW 1 934.45 3746.8 219.83
## + TBWPct 1 850.52 3830.7 220.94
## + BPConvPct 1 583.68 4097.5 224.31
## + BPSavedPct 1 466.80 4214.4 225.71
## + FirstServePtW 1 379.96 4301.3 226.73
## <none> 4681.2 228.96
## + FirstServePct 1 138.86 4542.3 229.46
## + DFPct 1 42.37 4638.8 230.51
## + AcePct 1 5.24 4676.0 230.91
##
## Step: AIC=214.95
## WinningPct ~ BreakPct
##
## Df Sum of Sq RSS AIC
## + FirstServePtW 1 1927.85 1470.4 175.06
## + BPSavedPct 1 1491.95 1906.3 188.04
## + AcePct 1 1386.65 2011.6 190.73
## + TBWPct 1 799.30 2598.9 203.54
## + SecondServePtW 1 693.02 2705.2 205.54
## + InPlayRetPtW 1 180.45 3217.8 214.22
## + FirstServePct 1 147.72 3250.5 214.73
## <none> 3398.2 214.95
## + DFPct 1 14.99 3383.2 216.73
## + BPConvPct 1 2.21 3396.0 216.92
## - BreakPct 1 1282.98 4681.2 228.96
##
## Step: AIC=175.06
## WinningPct ~ BreakPct + FirstServePtW
##
## Df Sum of Sq RSS AIC
## + BPSavedPct 1 461.82 1008.6 158.21
## + FirstServePct 1 377.03 1093.4 162.25
## + SecondServePtW 1 341.65 1128.7 163.84
## + TBWPct 1 274.33 1196.0 166.74
## + DFPct 1 252.67 1217.7 167.63
## <none> 1470.4 175.06
## + InPlayRetPtW 1 39.16 1431.2 175.71
## + AcePct 1 15.10 1455.3 176.55
## + BPConvPct 1 0.87 1469.5 177.03
## - FirstServePtW 1 1927.85 3398.2 214.95
## - BreakPct 1 2830.87 4301.3 226.73
##
## Step: AIC=158.21
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct
##
## Df Sum of Sq RSS AIC
## + TBWPct 1 164.78 843.8 151.29
## + FirstServePct 1 148.69 859.9 152.24
## + DFPct 1 103.67 904.9 154.79
## + SecondServePtW 1 89.70 918.9 155.56
## <none> 1008.6 158.21
## + InPlayRetPtW 1 31.01 977.5 158.65
## + BPConvPct 1 19.84 988.7 159.22
## + AcePct 1 9.94 998.6 159.72
## - BPSavedPct 1 461.82 1470.4 175.06
## - FirstServePtW 1 897.72 1906.3 188.04
## - BreakPct 1 3122.01 4130.6 226.71
##
## Step: AIC=151.29
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct
##
## Df Sum of Sq RSS AIC
## + FirstServePct 1 151.30 692.5 143.41
## + DFPct 1 125.22 718.6 145.26
## + SecondServePtW 1 103.54 740.2 146.75
## + InPlayRetPtW 1 42.11 801.7 150.73
## <none> 843.8 151.29
## + BPConvPct 1 19.45 824.3 152.13
## + AcePct 1 5.37 838.4 152.97
## - TBWPct 1 164.78 1008.6 158.21
## - BPSavedPct 1 352.27 1196.0 166.74
## - FirstServePtW 1 734.68 1578.5 180.61
## - BreakPct 1 2727.54 3571.3 221.43
##
## Step: AIC=143.41
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct +
## FirstServePct
##
## Df Sum of Sq RSS AIC
## + SecondServePtW 1 88.47 604.0 138.58
## + DFPct 1 65.92 626.6 140.41
## <none> 692.5 143.41
## + InPlayRetPtW 1 8.46 684.0 144.80
## + BPConvPct 1 8.18 684.3 144.82
## + AcePct 1 2.48 690.0 145.23
## - FirstServePct 1 151.30 843.8 151.29
## - BPSavedPct 1 164.01 856.5 152.04
## - TBWPct 1 167.39 859.9 152.24
## - FirstServePtW 1 875.77 1568.3 182.28
## - BreakPct 1 2766.67 3459.1 221.84
##
## Step: AIC=138.58
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct +
## FirstServePct + SecondServePtW
##
## Df Sum of Sq RSS AIC
## <none> 604.02 138.58
## + InPlayRetPtW 1 12.07 591.94 139.57
## + DFPct 1 3.70 600.31 140.27
## + BPConvPct 1 3.28 600.74 140.31
## + AcePct 1 0.04 603.98 140.58
## - BPSavedPct 1 56.88 660.89 141.08
## - SecondServePtW 1 88.47 692.48 143.41
## - FirstServePct 1 136.22 740.24 146.75
## - TBWPct 1 180.12 784.13 149.63
## - FirstServePtW 1 881.52 1485.54 181.57
## - BreakPct 1 2229.70 2833.71 213.87
##
## Call:
## lm(formula = WinningPct ~ BreakPct + FirstServePtW + BPSavedPct +
## TBWPct + FirstServePct + SecondServePtW, data = Tennis)
##
## Coefficients:
## (Intercept) BreakPct FirstServePtW BPSavedPct
## -173.7426 1.6216 1.4125 0.3762
## TBWPct FirstServePct SecondServePtW
## 0.1795 0.5108 0.5428
# lm(formula = WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct + FirstServePct + SecondServePtW, data = Tennis)
step(model2, direction = "both", scope = formula(model2Init))
## Start: AIC=268.94
## Rank ~ 1
##
## Df Sum of Sq RSS AIC
## + DomRat 1 3523.3 6889.2 250.28
## + BreakPct 1 2086.6 8325.9 259.75
## + InPlayRetPtW 1 1365.8 9046.7 263.91
## + TBWPct 1 1316.8 9095.7 264.18
## + BPConvPct 1 508.1 9904.4 268.44
## <none> 10412.5 268.94
## + DFPct 1 255.1 10157.4 269.70
## + FirstServePtW 1 196.0 10216.5 269.99
## + SecondServePtW 1 168.6 10243.9 270.12
## + FirstServePct 1 164.9 10247.6 270.14
## + BPSavedPct 1 85.8 10326.7 270.52
## + AcePct 1 55.6 10356.9 270.67
##
## Step: AIC=250.28
## Rank ~ DomRat
##
## Df Sum of Sq RSS AIC
## + SecondServePtW 1 589.7 6299.5 247.81
## + DFPct 1 527.6 6361.6 248.30
## + BreakPct 1 399.1 6490.1 249.30
## + TBWPct 1 361.6 6527.6 249.59
## <none> 6889.2 250.28
## + AcePct 1 158.9 6730.3 251.12
## + InPlayRetPtW 1 154.4 6734.8 251.15
## + BPSavedPct 1 135.5 6753.7 251.29
## + FirstServePtW 1 113.5 6775.7 251.45
## + BPConvPct 1 29.6 6859.6 252.07
## + FirstServePct 1 24.3 6864.9 252.11
## - DomRat 1 3523.3 10412.5 268.94
##
## Step: AIC=247.81
## Rank ~ DomRat + SecondServePtW
##
## Df Sum of Sq RSS AIC
## + TBWPct 1 278.4 6021.1 247.55
## <none> 6299.5 247.81
## + BreakPct 1 244.0 6055.5 247.84
## + FirstServePtW 1 223.7 6075.8 248.00
## + AcePct 1 213.9 6085.7 248.08
## + FirstServePct 1 71.6 6227.9 249.24
## + InPlayRetPtW 1 70.6 6228.9 249.25
## + DFPct 1 65.3 6234.2 249.29
## + BPSavedPct 1 26.1 6273.5 249.60
## + BPConvPct 1 7.9 6291.6 249.75
## - SecondServePtW 1 589.7 6889.2 250.28
## - DomRat 1 3944.4 10243.9 270.12
##
## Step: AIC=247.55
## Rank ~ DomRat + SecondServePtW + TBWPct
##
## Df Sum of Sq RSS AIC
## + BreakPct 1 347.53 5673.6 246.58
## + AcePct 1 299.78 5721.3 247.00
## + FirstServePtW 1 284.19 5736.9 247.13
## <none> 6021.1 247.55
## - TBWPct 1 278.40 6299.5 247.81
## + InPlayRetPtW 1 118.53 5902.6 248.56
## + FirstServePct 1 76.09 5945.0 248.91
## + BPSavedPct 1 73.96 5947.1 248.93
## + DFPct 1 44.23 5976.9 249.18
## + BPConvPct 1 24.36 5996.8 249.35
## - SecondServePtW 1 506.47 6527.6 249.59
## - DomRat 1 2981.14 9002.3 265.66
##
## Step: AIC=246.58
## Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct
##
## Df Sum of Sq RSS AIC
## + InPlayRetPtW 1 471.10 5202.5 244.24
## <none> 5673.6 246.58
## - SecondServePtW 1 328.27 6001.9 247.39
## + DFPct 1 121.16 5552.4 247.50
## - BreakPct 1 347.53 6021.1 247.55
## + FirstServePct 1 98.11 5575.5 247.71
## + BPConvPct 1 96.46 5577.1 247.72
## - TBWPct 1 381.96 6055.5 247.84
## + BPSavedPct 1 18.12 5655.5 248.42
## + AcePct 1 5.94 5667.6 248.53
## + FirstServePtW 1 1.57 5672.0 248.56
## - DomRat 1 1398.98 7072.6 255.60
##
## Step: AIC=244.24
## Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct + InPlayRetPtW
##
## Df Sum of Sq RSS AIC
## <none> 5202.5 244.24
## + AcePct 1 184.18 5018.3 244.44
## + BPConvPct 1 174.16 5028.3 244.54
## - SecondServePtW 1 294.10 5496.6 244.99
## + DFPct 1 72.97 5129.5 245.54
## + FirstServePct 1 20.52 5182.0 246.05
## + FirstServePtW 1 10.60 5191.9 246.14
## - TBWPct 1 423.91 5626.4 246.16
## + BPSavedPct 1 6.61 5195.9 246.18
## - InPlayRetPtW 1 471.10 5673.6 246.58
## - BreakPct 1 700.10 5902.6 248.56
## - DomRat 1 1225.35 6427.8 252.82
##
## Call:
## lm(formula = Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct +
## InPlayRetPtW, data = Tennis)
##
## Coefficients:
## (Intercept) DomRat SecondServePtW TBWPct
## -4.7762 -72.1582 1.0512 -0.2779
## BreakPct InPlayRetPtW
## -2.6916 3.3847
# lm(formula = Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct + InPlayRetPtW, data = Tennis)
### LASSO & RIDGE Experimental models ###
InitWin <- InitVars %>%
  select(-Rank)
InitRank <- InitVars %>%
  select(-WinningPct)
Xwin <- model.matrix(WinningPct ~ ., InitWin)[,-1]
Xrank <- model.matrix(Rank ~ ., InitRank)[,-1]
ywin <- as.numeric(InitVars$WinningPct)
yrank <- as.numeric(InitVars$Rank)
#RidgeWin
win.ridge.cv <- cv.glmnet(Xwin, ywin, alpha = 0, nfolds = 5)
plot(win.ridge.cv)
win.ridge.cv$lambda.min
## [1] 0.832764
win.ridge.cv$lambda.1se
## [1] 11.2677
#RidgeRank
rank.ridge.cv <- cv.glmnet(Xrank, yrank, alpha = 0, nfolds = 5)
plot(rank.ridge.cv)
rank.ridge.cv$lambda.min
## [1] 7.828625
rank.ridge.cv$lambda.1se
## [1] 41.77902
#LassoWin
win.lasso.cv <- cv.glmnet(Xwin, ywin, alpha = 1, nfolds = 5)
plot(win.lasso.cv)
win.lasso.cv$lambda.min
## [1] 0.002791094
win.lasso.cv$lambda.1se
## [1] 0.1524573
#LassoRank
rank.lasso.cv <- cv.glmnet(Xrank, yrank, alpha = 1, nfolds = 5)
plot(rank.lasso.cv)
rank.lasso.cv$lambda.min
## [1] 0.9878605
rank.lasso.cv$lambda.1se
## [1] 3.98801
#All Models Together
# Each fit has 12 coefficients (intercept + 11 predictors), so take all 12.
# Taking only [1:11] makes cbind() recycle the intercept into the TBWPct row,
# which is what the b2-b8 columns of the table below show.
b1 = as.matrix(coef(win.ridge.cv, s = "lambda.min"))
b2 = coef(win.ridge.cv, s = "lambda.1se")[1:12]
b5 = coef(rank.ridge.cv, s = "lambda.min")[1:12]
b6 = coef(rank.ridge.cv, s = "lambda.1se")[1:12]
b3 = coef(win.lasso.cv, s = "lambda.min")[1:12]
b4 = coef(win.lasso.cv, s = "lambda.1se")[1:12]
b7 = coef(rank.lasso.cv, s = "lambda.min")[1:12]
b8 = coef(rank.lasso.cv, s = "lambda.1se")[1:12]
cbind(b1, b2, b3, b4, b5, b6, b7, b8)
## 1 b2 b3 b4
## (Intercept) -123.09984702 -58.82574031 -284.47333686 -163.00766491
## FirstServePct 0.31041003 0.15158970 0.83054451 0.44849633
## FirstServePtW 0.68203972 0.26511430 2.72652605 1.33757427
## BPConvPct 0.20841288 0.19847692 0.01961784 0.06000491
## BreakPct 0.81334818 0.29414387 2.55462592 1.53209182
## DFPct -0.05557015 0.02248809 -0.06020046 -0.05695194
## BPSavedPct 0.37707320 0.22958076 0.47013328 0.38796553
## DomRat 24.64658451 23.31777876 -58.56106749 0.00000000
## AcePct 0.11834527 0.06411360 -0.19782175 0.00000000
## InPlayRetPtW 0.31970024 0.33987790 -0.04159828 0.00000000
## SecondServePtW 0.28560393 0.31218712 1.14632812 0.50520551
## TBWPct 0.16392231 -58.82574031 -284.47333686 -163.00766491
## b5 b6 b7 b8
## (Intercept) 158.55494885 95.39301088 129.5313051 71.93407986
## FirstServePct -0.42758952 -0.13741233 -0.2704484 0.00000000
## FirstServePtW -0.32662277 -0.13133281 0.0000000 0.00000000
## BPConvPct -0.02502694 -0.09580483 0.0000000 0.00000000
## BreakPct -0.52408582 -0.23917172 -0.5815116 -0.09613719
## DFPct -1.52162128 -0.53073979 -1.6857479 0.00000000
## BPSavedPct -0.07788130 -0.06831170 0.0000000 0.00000000
## DomRat -34.05840673 -16.62792310 -53.5599163 -41.02435338
## AcePct 0.11072908 0.02429171 0.0000000 0.00000000
## InPlayRetPtW -0.33576260 -0.26511393 0.0000000 0.00000000
## SecondServePtW -0.00112217 -0.06381424 0.0000000 0.00000000
## TBWPct 158.55494885 95.39301088 129.5313051 71.93407986
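# Rank lasso at lambda.min, read from column b7 above:
# Rank = 129.53 - 0.27(FirstServePct) - 0.58(BreakPct) - 1.69(DFPct) - 53.56(DomRat)
# The TBWPct term cannot be read from the table (its row repeats the intercept), and the
# CV folds are random, so these values change between runs.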
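The warning and the repeated intercept values in the `TBWPct` row come from combining a 12-row matrix with 11-element vectors. The sketch below reproduces that recycling with stand-in numbers (not the fitted coefficients) and shows one way to avoid it by keeping every row. With the glmnet fits above, the same idea would be to apply `as.matrix(coef(fit, s = ...))` to every model rather than only to `b1`. The object names `full`, `truncated`, `bad`, and `good` are hypothetical.

```r
# Stand-in for one 12-row coefficient column (intercept + 11 predictors).
# The values are placeholders, not fitted coefficients.
terms <- c("(Intercept)", "FirstServePct", "FirstServePtW", "BPConvPct",
           "BreakPct", "DFPct", "BPSavedPct", "DomRat", "AcePct",
           "InPlayRetPtW", "SecondServePtW", "TBWPct")
full <- matrix(seq_along(terms) * 10, ncol = 1,
               dimnames = list(terms, "b1"))

# Reproduce the problem: an 11-element vector is recycled to fill row 12,
# so the TBWPct row receives the intercept value.
truncated <- full[1:11, 1]
bad <- suppressWarnings(cbind(full, b2 = truncated))

# Fix: keep all 12 rows in every column so nothing is recycled.
good <- cbind(full, b2 = full[, 1])

bad["TBWPct", ]   # b2 repeats the intercept
good["TBWPct", ]  # b2 holds the TBWPct value
```

Building the table from full-length columns also removes the need for the `[1:11]` index, which is easy to get wrong when predictors are added or dropped.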
###Outliers###
Tennis <- Tennis %>%
  mutate(fitted = fitted(final_model),
         standardized = stdres(final_model),
         studentized = studres(final_model))
table8 <- Tennis %>%
  select(Player, WinningPct, FirstServePct, FirstServePtW, studentized, RetPtWPct) %>%
  filter(studentized > 2 | studentized < -2)
kable(table8, caption = "Outliers")
| Player | WinningPct | FirstServePct | FirstServePtW | studentized | RetPtWPct |
|---|---|---|---|---|---|
| Dusan Lajovic | 45.5 | 69.4 | 68.0 | -2.188589 | 36.6 |
| Milos Raonic | 64.3 | 62.5 | 84.2 | -2.276267 | 34.7 |
###Plots for Outliers
outs2 <- Tennis %>%
  filter(studentized > 2 | studentized < -2)
ggplot(data = Tennis, aes(x = Rank, y = studentized)) +
  geom_point() +
  geom_hline(yintercept = c(-2, 2), color = "red") +
  ggrepel::geom_label_repel(aes(label = Player), data = outs2)
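The outlier screen above flags observations whose studentized residual exceeds 2 in absolute value. A minimal base-R sketch of the same rule follows. It uses `rstudent()` from the stats package, which should agree with `MASS::studres()` for an `lm` fit, and the built-in `mtcars` data as a stand-in, since the tennis data and `final_model` are not reproduced here.

```r
# Illustrative data only: mtcars ships with base R and is not the tennis data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Externally studentized residuals: each point is left out of the
# variance estimate used to scale its own residual.
diag_df <- data.frame(
  car = rownames(mtcars),
  studentized = rstudent(fit)
)

# Same |value| > 2 rule as the filter above, written with abs().
flagged <- diag_df[abs(diag_df$studentized) > 2, ]
flagged
```

Writing the condition as `abs(x) > 2` is equivalent to `x > 2 | x < -2` and keeps the cutoff in one place.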
###Influential Points###
finalmodel_diag <- ls.diag(final_model)
Tennis <- Tennis %>%
  mutate(cooks = finalmodel_diag$cooks)
table9 <- Tennis %>%
  select(Player, WinningPct, FirstServePct, FirstServePtW, cooks, RetPtWPct) %>%
  filter(cooks > 5/45) %>%
  arrange(desc(cooks))
kable(table9, caption = "Influential Points")
| Player | WinningPct | FirstServePct | FirstServePtW | cooks | RetPtWPct |
|---|---|---|---|---|---|
| Milos Raonic | 64.3 | 62.5 | 84.2 | 0.2040603 | 34.7 |
| Reilly Opelka | 58.7 | 64.0 | 79.3 | 0.1240316 | 29.1 |
| Dusan Lajovic | 45.5 | 69.4 | 68.0 | 0.1177649 | 36.6 |
###Plots for Influential Points###
outs1 <- Tennis %>%
  filter(cooks > 5/45)
ggplot(data = Tennis, aes(x = Rank, y = cooks)) +
  geom_point() +
  geom_hline(yintercept = 5/45, color = "red", linetype = 2) +
  ggrepel::geom_label_repel(aes(label = Player), data = outs1)
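The influence screen above compares Cook's distance against a fixed cutoff. The sketch below shows the same quantity computed with `cooks.distance()` from the stats package and cross-checked against its definition in terms of standardized residuals and leverage. It uses `mtcars` as stand-in data. The `4 / n` cutoff is a common rule of thumb chosen here for illustration and is an assumption; the report itself uses `5/45`.

```r
# Illustrative data only: mtcars ships with base R and is not the tennis data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance straight from the fitted model.
cooks <- cooks.distance(fit)

# Cross-check against the definition:
# D_i = r_i^2 * h_i / (p * (1 - h_i)),
# with r_i the standardized residual, h_i the leverage, p the coefficient count.
h <- hatvalues(fit)
r <- rstandard(fit)
p <- length(coef(fit))
manual <- r^2 * h / (p * (1 - h))

# Assumption: 4 / n rule of thumb, used here only for illustration.
cutoff <- 4 / nrow(mtcars)
influential <- sort(cooks[cooks > cutoff], decreasing = TRUE)
influential
```

Whatever cutoff is chosen, storing it in a single variable keeps the table filter and the reference line in the plot consistent.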