Final Report

Introduction

This report analyzes factors within the game of tennis that are associated with match winning percentage. Players at every level, from recreational to competitive, receive a wide range of coaching on which strategies will make them successful, yet a few well-worn phrases reach nearly every player at some point. In short, these coaching cues hold that players win more points when they get their first serve in play and win more matches when they reduce their double faults. By investigating which factors are associated with a higher winning percentage, we can target the specific aspects of a player’s game whose improvement is most likely to maximize success. Additionally, as previous research has attempted, we can improve forecasts of who is more likely to win a matchup based on the players’ differing styles of play.

Existing research has mainly combined probability and statistical formulae to predict the outcomes of matches. Most of this work focuses on whether a player is serving or returning in order to calculate the outcome of a game or set within a match. For example, O’Malley (2008) derives a probability model for winning a game within a tennis match, using a binomial distribution to establish what is termed a tennis formula. When O’Malley accounts for the probability \(p\) of winning a point in this distribution, however, he relies extensively on whether the player is serving. Similarly, in studying optimal tennis strategy, George (1973) examines the optimal sequencing of a player’s two serves. What this body of research suggests is that an effective model for predicting the winning percentages of tennis players should center on elements of the game related to service. What sets this analysis apart is that we include factors beyond service and return in a model to predict winning percentage.

Materials and Methods

To examine the hypothesis that factors beyond serve and return help predict winning percentage, we scraped data from TennisAbstract.com. The data cover the matches played over the past year by the fifty top-ranked male professional players. They include service statistics, return statistics, and miscellaneous statistics such as overall points, games, and sets won in the past year; average lengths of points, games, sets, and matches; and each player’s dominance ratio, given by the percentage of return points won divided by the percentage of service points lost. To simplify our analysis, we retained variables related to winning individual points, including percentages for first serves in, first serve points won, second serve points won, double faults, aces, return points won, break points won (as the returner), and break points saved (as the server), along with tiebreaker win percentage, average opponent ranking, and dominance ratio. These variables allow us to analyze how the percentages of specific types of points won are associated with the overall percentage of matches won. In disregarding the percentages of games and sets won, we build on previous research that operates under the assumption that points in a tennis match are independent and identically distributed.

It is also worth noting that we chose not to center our predictors around their average values. While this leaves the intercept of our model with no plausible interpretation in terms of winning percentage, it allows for more flexible use of the model as new data are collected: a centered model would require recalculating the average of every predictor, whereas the uncentered model can take raw percentages directly as predictors of winning percentage.

We use multiple linear regression, with predictors chosen via stepwise selection, to determine the associations these variables have with overall winning percentage among the top-ranked male professionals. Within this model, we perform t-tests to assess the significance of each factor individually. Each t-test reports a t-statistic under the null hypothesis that the given explanatory variable has no linear association with a player’s winning percentage once the other predictors are accounted for, against the alternative hypothesis that some association exists. We also examine the adjusted R-squared to judge how much of the variability in our data the model explains. Once our linear model is established, we examine experimental models with various transformations and interaction terms to check whether any of them improves on the model given by stepwise selection. To compare models, we use ANOVA F-tests. Each F-test operates under the null hypothesis that the simpler model fits as well as the more complex one, against the alternative hypothesis that the more complex model is a significant improvement.
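For reference, the two test statistics described above take the following standard forms. The symbols are notation introduced here for illustration and are not defined elsewhere in this report: \(\hat{\beta}_j\) is the estimated coefficient of predictor \(j\), \(SE(\hat{\beta}_j)\) is its standard error, and \(RSS\) and \(df\) are the residual sum of squares and residual degrees of freedom of the smaller model (subscript 0) and the larger model (subscript 1).

\[ t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}, \qquad F = \frac{(RSS_0 - RSS_1)/(df_0 - df_1)}{RSS_1/df_1} \]

The t-statistic is compared with a t distribution on the residual degrees of freedom of the fitted model, and the F-statistic with an F distribution on \(df_0 - df_1\) and \(df_1\) degrees of freedom. This form of the F-test assumes that the smaller model is nested within the larger one.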

Removal Decisions

One variable that our initial exploration suggested could be very valuable was the dominance ratio. Dominance ratio is a statistic reported by TennisAbstract.com, calculated as the percentage of return points won divided by the percentage of serve points lost. However, we decided against using dominance ratio in our final model: its correlation with winning percentage (0.86) was much higher than that of any other variable, and after controlling for dominance ratio most other variables were no longer significant predictors of winning percentage. Thus, in favor of learning more about how a variety of aspects of play relate to winning, we built our final model without dominance ratio.
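Written as a formula, the definition above reads as follows, where \(RPW\) and \(SPW\) are shorthand introduced here (not the website's notation) for the percentages of return points won and service points won:

\[ DR = \frac{RPW}{100 - SPW} \]

As a purely hypothetical illustration, a player who wins 38% of return points and 65% of service points would have \(DR = 38 / (100 - 65) \approx 1.09\). Because the ratio already combines return and service performance in a single number, it plausibly overlaps with the individual serve and return predictors, which may help explain why those predictors lose significance once dominance ratio is included.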

Results

In this section, we begin with some preliminary data visualization before reporting the results of our multiple linear regression. Building on this linear model, we report confidence intervals for the coefficients of our explanatory variables. The discussion portion of this report examines implications of our findings along with paths for future research.

To begin, Figure 3.1 shows that the distribution of winning percentages among the sampled players is approximately normal, with perhaps a slight right skew. This is consistent with, though it does not by itself confirm, the assumption made by previous researchers and adopted here that each point in a tennis match is independent and identically distributed.

Serving Percentages

Table 1: First Serve Percentage
	min	Q1	median	Q3	max	mean	sd	n	missing
	53.1	60.225	63	65.25	71.4	62.788	3.579798	50	0

Table 2: First Serve Win Percentage
	min	Q1	median	Q3	max	mean	sd	n	missing
	64.7	70.95	73.45	77.15	84.2	73.982	4.16495	50	0

Following the common coaching theme that players win more points when their first serve goes in, we look at both how often these professional players put their first serve in play and how often they win the point when they do. Over the past year of professional matches, players put their first serve into play about 62.8% of the time on average (Table 1). On those first serves, the average rate at which players win the point is 74.0% (Table 2). Considering that these players perform at an elite level, the average first serve percentage, for a stroke over which the player has complete control, seems lower than one might expect.

Additionally, Figure 1, which plots the percentage of first serves put into play against winning percentage, shows a positive trend, although the association is not especially strong. Given that our observational units are the top 50 ranked players, this makes sense. These players all must have strengths that have carried them to the top of the rankings, so players with lower first serve percentages may rely more heavily on other strengths to win matches and could still have similar winning percentages. Several unusual points are worth noting. The three points with the highest winning percentages are, unsurprisingly, the top three ranked players, who are considered the “Big Three” of men’s tennis. They have above-average first serve percentages, suggesting that the combination of a strong serve and the rest of their game helps push them to the top. Also notable are the two points with the highest first serve percentages, which have relatively low winning percentages; the player with the second-highest first serve percentage has the second-lowest winning percentage.

Table 3: Percentage of Break Points Saved
	min	Q1	median	Q3	max	mean	sd	n	missing
	56.4	60.625	63.2	65.975	75.4	63.744	4.204427	50	0

Another service game statistic that could plausibly enter our model is the percentage of break points saved as a server. This speaks to the server’s ability to play under the pressure of possibly losing a game. As we see in Table 3, these professionals save on average 63.7% of the break points their opponents earn. However, this value is nearly identical to the average of the mean percentage of first serve points won and the mean percentage of second serve points won ((74.0 + 52.2) / 2 = 63.1). This suggests that whether or not a point is a break point has little effect on a server’s chances of winning it. Hence, we are not sure whether this predictor will appear in our final model.

Return of Serve Percentages

Table 4: Percentage of Break Points Converted
	min	Q1	median	Q3	max	mean	sd	n	missing
	25.3	36.05	39.5	42.075	48.8	39.376	4.322771	50	0

Table 5: Return Point Win Percentage
	min	Q1	median	Q3	max	mean	sd	n	missing
	27.9	35.3	36.65	38.475	42.3	36.72	2.926166	50	0

On the other side of serving is returning serve. A player’s ability to win a return game, known as “breaking serve”, is an important factor in winning a match. Here we explore the rate at which these professionals win a break point, that is, a point on which the returner would win the game. We find that on average professional players convert a break point 39.4% of the time (Table 4). The average rate at which these professionals win return points overall is 36.7% (Table 5), which suggests that break points are won at nearly the same rate as any other return point. Figure 2 plots return points won percentage against winning percentage. The plot shows a rather strong, positive linear association between the two variables, so it is worth investigating whether this association is significant in the final model. It suggests that as a player’s rate of winning return points increases, we expect their winning percentage to increase as well. Additionally, the same “Big Three” appear here with dominant return points won percentages.

Tiebreak Percentages

Table 6: Tiebreakers Won Percentage
	min	Q1	median	Q3	max	mean	sd	n	missing
	26.7	48.95	55.15	60	87	54.14	11.34642	50	0

A final explanatory variable, which our references suggest as a route for further investigation, is the rate at which players win tiebreakers, whether at the set or match level. In our analysis, we treat the two tiebreaker types as equivalent, since the structure and order of play are identical and the only difference between a set tiebreaker and a match tiebreaker is the number of points needed to win (7 points vs. 10 points). Figure 4 plots tiebreaker win percentage, which has a mean of 54.1% among our sample of professional players (Table 6), against winning percentage. The figure suggests a roughly linear, positive association between the two variables. Thus, we could expect players who win a higher share of their tiebreakers to also have a higher winning percentage overall.

Model

Our model was created with a stepwise selection function that adds and removes variables in both directions. The candidate variables were: Average Opponent Ranking (MeanOppRk), First Serve Percentage (FSPct), First Serve Points Won Percentage (FSPtW), Second Serve Points Won Percentage (SSPtW), Ace Percentage (APct), Double Fault Percentage (DFPct), Return Points Won Percentage (RPtWPct), Break Point Conversion Percentage (BPCPct), Break Points Saved Percentage (BPSPct), and Tiebreaker Win Percentage (TBWPct). From these variables, the function reduced our model to: \[ WPct = -254.5 + 2.66 \cdot RPtWPct + 1.53 \cdot FSPtW + 0.78 \cdot FSPct + 0.86 \cdot SSPtW + 0.17 \cdot TBWPct. \]

All variables in this model were found to be statistically significant at the 0.001 level except for Tiebreaker Win Percentage, which was still significant at the 0.01 level. The model has an R-squared value of 0.81 (adjusted R-squared 0.79), so about 81% of the variation in winning percentage is explained by our model.
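To make the fitted equation concrete, the following is a minimal sketch of how it could be applied to a single player. The helper `predict_wpct` and the `average_player` vector are hypothetical names introduced here; the coefficients are copied from the model output in the appendix, and the inputs are the sample means quoted in the tables and text of this report.

```r
# Coefficients copied from the fitted model reported in this report.
coefs <- c(
  intercept      = -254.5145,
  RetPtWPct      = 2.6631,
  FirstServePtW  = 1.5334,
  FirstServePct  = 0.7804,
  SecondServePtW = 0.8633,
  TBWPct         = 0.1710
)

# Hypothetical helper: predicted winning percentage for one player,
# given a named vector of that player's statistics (in percent).
predict_wpct <- function(player, coefs) {
  predictors <- setdiff(names(coefs), "intercept")
  unname(coefs["intercept"] + sum(coefs[predictors] * player[predictors]))
}

# Illustrative input: the sample means quoted in this report.
average_player <- c(
  RetPtWPct = 36.72, FirstServePtW = 73.982, FirstServePct = 62.788,
  SecondServePtW = 52.2, TBWPct = 54.14
)

predicted <- predict_wpct(average_player, coefs)
print(round(predicted, 1))
```

With these inputs the prediction comes out at roughly 60%. Because the model is uncentered, the large negative intercept is offset only once all five percentages are supplied.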

Assume that for each of the following interpretations all other variables are held constant.

Our model includes three variables involving the serve. Among these, First Serve Points Won Percentage (\(t = 7.832, \, p < 0.001\); 95% CI: 1.14 - 1.93) predicts an average increase in winning percentage of 1.53 points for each additional percentage point in points won after a made first serve. The next serve statistic, First Serve Percentage (\(t = 4.089, \, p < 0.001\); 95% CI: 0.40 - 1.16), predicts an average increase in winning percentage of 0.78 points per additional percentage point of first serves in. The last serve statistic, Second Serve Points Won Percentage (\(t = 3.712, \, p < 0.001\); 95% CI: 0.39 - 1.33), predicts an average increase in winning percentage of 0.86 points for each additional percentage point in points won after a made second serve.

Only one statistic directly associated with return of serve made it into our model: Return Points Won Percentage (\(t = 9.864, \, p < 0.001\); 95% CI: 2.12 - 3.21). Its coefficient predicts an average increase in winning percentage of 2.66 points for each additional percentage point in points won while returning. The only other statistic in our model not directly tied to serving is Tiebreakers Won Percentage (\(t = 2.858, \, p < 0.01\); 95% CI: 0.05 - 0.29). Our model suggests an average increase of 0.17 points in winning percentage for each additional percentage point in tiebreakers won.
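The t-statistics and 95% confidence intervals quoted above can be reproduced from the estimates and standard errors in the regression summary in the appendix. The sketch below does this by hand; the vector names are introduced here for illustration. With the fitted model object available, `confint(final_model)` would be the more direct route.

```r
# Estimates and standard errors copied from the regression summary.
est <- c(RetPtWPct = 2.66314, FirstServePtW = 1.53344, FirstServePct = 0.78039,
         SecondServePtW = 0.86328, TBWPct = 0.17099)
se  <- c(RetPtWPct = 0.26998, FirstServePtW = 0.19579, FirstServePct = 0.19083,
         SecondServePtW = 0.23254, TBWPct = 0.05983)

df_resid <- 44                    # residual degrees of freedom from the summary
t_crit   <- qt(0.975, df_resid)   # two-sided 95% critical value

t_value <- est / se
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(t_value, 3))
print(round(ci, 2))
```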

Table 7: Influential Points
Player	WinningPct	FirstServePct	FirstServePtW	CooksDistance	RetPtWPct
Milos Raonic	64.3	62.5	84.2	0.2040603	34.7
Reilly Opelka	58.7	64.0	79.3	0.1240316	29.1
Dusan Lajovic	45.5	69.4	68.0	0.1177649	36.6

We note that, as Figure 5 demonstrates, three points have Cook’s distances greater than 5/45, or about 0.11 (number of predictors / (sample size - number of predictors)), which marks them as influential. These three players and some notable statistics are given in Table 7.
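The cutoff used here can be checked directly against the Cook’s distances listed in Table 7. The sketch below hard-codes those three values; the variable names are introduced here for illustration. With the fitted model object available, `cooks.distance(final_model)` would return the distance for every player.

```r
n <- 50   # players in the sample
p <- 5    # predictors in the final model

# Cutoff used in this report: predictors / (sample size - predictors)
threshold <- p / (n - p)

# Cook's distances copied from Table 7
cooks <- c("Milos Raonic"  = 0.2040603,
           "Reilly Opelka" = 0.1240316,
           "Dusan Lajovic" = 0.1177649)

flagged <- cooks > threshold
print(round(threshold, 4))
print(flagged)
```

All three players exceed the cutoff. Other rules of thumb for Cook’s distance exist, so the set of flagged points depends on the convention chosen.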

Comparison models with interaction terms or transformations showed no significant improvement over this model; the results can be found in the appendix. Plots for checking the conditions for linear regression and indicators of possible multicollinearity are also in the appendix.
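As an illustration of how such a comparison works, the sketch below recomputes the F-test for one nested pair from the appendix: the serve-and-return model with and without the first-serve interaction term. The helper `nested_f_test` is a hypothetical name introduced here, and the inputs are the rounded residual sums of squares printed in the appendix, so the result matches the reported values only approximately.

```r
# Hypothetical helper: F statistic and p-value for comparing two nested
# linear models from their residual sums of squares and residual df.
nested_f_test <- function(rss_reduced, df_reduced, rss_full, df_full) {
  df_diff <- df_reduced - df_full
  f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
  p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)
  c(F = f_stat, p = p_value)
}

# Reduced model: serve_return; full model: serve.int (adds the interaction).
# RSS values are the rounded figures printed in the appendix.
interaction_test <- nested_f_test(1056.42, 45, 1006.1, 44)
print(round(interaction_test, 3))
```

The F statistic comes out near 2.2 with a p-value of roughly 0.145, in line with the appendix output, so the interaction term is not a significant improvement at the 0.05 level.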

Discussion

The multiple linear regression model reported above shows that, among the many candidates, the most valuable predictors of a tennis player’s winning percentage, and thus success, are the percentages of return points won, first serve points won, first serves put into play, second serve points won, and tiebreakers won. This closely aligns with previous research from O’Malley (2008) and George (1973) in emphasizing the importance of service and return in predicting which player will win. Following from O’Malley’s probability model for winning a single game within a match, our model suggests that winning percentage among the top-ranked male professionals is heavily reliant on serve and return as well. Yet George’s strong emphasis on how a player should sequence serves for an optimal strategy gives little guidance on which specific scenarios players at any level should train more or less. The results of this model give us a chance to better evaluate coaching methodology and focus.

In developing this model, we are able to identify which predictors have more practical significance than others and can comment on what coaching and player development programs should emphasize to produce players with a higher winning percentage. As noted above, coaching tends to emphasize a strong first serve followed by a weaker second serve for the sake of consistency, as George agrees, but gives far less attention to how to play return points. Serving gives a player an innate advantage, since the server has complete control of the first stroke of the point. At the professional level, the average rate of winning a service point is over 70% on first serves and over 50% on second serves, whereas the average rate of winning a return point is relatively low, at under 40%. Considering this and the reported model, increasing a player’s percentage of return points won by a single percentage point is associated with a 2.66 percentage point increase in winning percentage, the largest coefficient of any predictor in the model. Hence, the implication of this model is that one of the most efficient ways for a player to potentially increase their winning percentage is to improve their return play.

The previous implication becomes even more interesting when the influential points of our model are considered. All three of the players singled out as influential (Milos Raonic, Dusan Lajovic, and Reilly Opelka) have negative residuals, meaning our model overpredicted their winning percentages. These three players follow a particular pattern: their win percentages (64.3%, 45.5%, and 58.7% respectively) are mediocre relative to their serving statistics. They all have average or better percentages of first serves in, at 62.5%, 69.4%, and 64.0% respectively, and, more importantly, high percentages of first serve points won relative to their winning percentages, at 84.2%, 68.0%, and 79.3% respectively. These percentages, in combination with each player’s below-average return points won percentage, suggest that our model may overvalue serving statistics in its predictions. Taking this into account, the implication that increasing return points won has the most potential to increase winning percentage becomes even stronger.

In considering the scope of inference, we must be careful about both generalizability and causality. With regard to generalizability, this model is based on matches among world-class male tennis players, so it is reasonable to assume that these players are roughly evenly matched and that the range of skill is relatively narrow. In youth and recreational tennis, by contrast, skill levels range anywhere from beginner to collegiate athlete. League play typically does group players of similar skill, but a sample of professionals is still not suitable for generalizing to the entire population of tennis players. For one, serves among professionals often exceed 100 mph, a speed that even the best youth players rarely reach. With regard to causality, this study had no experimental design. While it is tempting to say that improving any predictor in our model results in some increase in winning percentage, the observational nature of the study does not support a causal claim.

Despite these limits on causality and generalizability, the study opens the door for further research. One path would be to collect similar statistics among players on high school teams within a given region, conference, or section to determine how these conclusions differ in a larger and more representative sample. Similarly, one could break these analyses down by court surface to determine whether other explanatory factors play a stronger or weaker role in their relationship with winning percentage. Another path would be to include female professionals as well as male professionals, as there may be differences in play style that bring other explanatory factors into play. A final consideration for further research is data on what types of strokes are used to win a given point, as this study focused almost strictly on what types of points are won. Following these recommendations could allow for more accurate predictions of match outcomes as well as more beneficial coaching.

Appendix

Bibliography

George, S. L. (1973). Optimal Strategy in Tennis: A Simple Probabilistic Model. Applied Statistics, 22(1), 97–104. doi: 10.2307/2346309
O’Malley, A. (2008). Probability Formulas and Statistical Analysis in Tennis. Journal of Quantitative Analysis in Sports, 4(2). doi: 10.2202/1559-0410.1100

Final Model Code

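# Final Model
library(dplyr)  # provides %>% and select()

# Candidate predictors (dominance ratio excluded); assumes the scraped data
# are already loaded in a data frame named Tennis
FinalVars <- Tennis %>%
  select(WinningPct, MeanOppRk, AcePct, DFPct, FirstServePct, FirstServePtW,
         SecondServePtW, RetPtWPct, BPConvPct, BPSavedPct, TBWPct)

# Intercept-only model (lower bound of the search) and full model (upper bound)
mod1 <- lm(WinningPct ~ 1, data = FinalVars)
mod2 <- lm(WinningPct ~ ., data = FinalVars)

# Stepwise selection by AIC, adding and dropping terms in both directions
step(mod1, direction = "both", scope = formula(mod2))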

## Start:  AIC=228.96
## WinningPct ~ 1
## 
##                  Df Sum of Sq    RSS    AIC
## + RetPtWPct       1   1039.22 3642.0 218.41
## + SecondServePtW  1    934.45 3746.8 219.83
## + MeanOppRk       1    914.82 3766.4 220.09
## + TBWPct          1    850.52 3830.7 220.94
## + BPConvPct       1    583.68 4097.5 224.31
## + BPSavedPct      1    466.80 4214.4 225.71
## + FirstServePtW   1    379.96 4301.3 226.73
## <none>                        4681.2 228.96
## + FirstServePct   1    138.86 4542.3 229.46
## + DFPct           1     42.37 4638.8 230.51
## + AcePct          1      5.24 4676.0 230.91
## 
## Step:  AIC=218.41
## WinningPct ~ RetPtWPct
## 
##                  Df Sum of Sq    RSS    AIC
## + FirstServePtW   1   1704.47 1937.5 188.86
## + BPSavedPct      1   1394.65 2247.3 196.27
## + AcePct          1   1364.98 2277.0 196.93
## + SecondServePtW  1    804.72 2837.3 207.93
## + TBWPct          1    738.98 2903.0 209.07
## + MeanOppRk       1    642.91 2999.1 210.70
## + FirstServePct   1    226.29 3415.7 217.21
## <none>                        3642.0 218.41
## + DFPct           1     31.77 3610.2 219.97
## + BPConvPct       1     21.37 3620.6 220.12
## - RetPtWPct       1   1039.22 4681.2 228.96
## 
## Step:  AIC=188.86
## WinningPct ~ RetPtWPct + FirstServePtW
## 
##                  Df Sum of Sq    RSS    AIC
## + FirstServePct   1    591.26 1346.3 172.65
## + SecondServePtW  1    500.25 1437.3 175.92
## + BPSavedPct      1    456.50 1481.0 177.42
## + DFPct           1    333.43 1604.1 181.41
## + TBWPct          1    237.26 1700.2 184.32
## <none>                        1937.5 188.86
## + MeanOppRk       1     61.86 1875.6 189.23
## + BPConvPct       1     55.81 1881.7 189.40
## + AcePct          1     36.58 1900.9 189.90
## - FirstServePtW   1   1704.47 3642.0 218.41
## - RetPtWPct       1   2363.74 4301.3 226.73
## 
## Step:  AIC=172.65
## WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct
## 
##                  Df Sum of Sq    RSS    AIC
## + SecondServePtW  1    289.83 1056.4 162.53
## + BPSavedPct      1    179.30 1167.0 167.51
## + TBWPct          1    176.15 1170.1 167.64
## + DFPct           1    112.40 1233.8 170.29
## <none>                        1346.3 172.65
## + BPConvPct       1     29.98 1316.3 173.53
## + MeanOppRk       1      8.53 1337.7 174.34
## + AcePct          1      0.20 1346.0 174.65
## - FirstServePct   1    591.26 1937.5 188.86
## - FirstServePtW   1   2069.44 3415.7 217.21
## - RetPtWPct       1   2745.14 4091.4 226.23
## 
## Step:  AIC=162.53
## WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW
## 
##                  Df Sum of Sq    RSS    AIC
## + TBWPct          1    165.41  891.0 156.02
## + BPSavedPct      1     49.35 1007.1 162.14
## <none>                        1056.4 162.53
## + MeanOppRk       1     36.88 1019.5 162.75
## + BPConvPct       1     24.05 1032.4 163.38
## + AcePct          1     15.19 1041.2 163.81
## + DFPct           1      4.98 1051.4 164.29
## - SecondServePtW  1    289.83 1346.3 172.65
## - FirstServePct   1    380.84 1437.3 175.92
## - FirstServePtW   1   1689.39 2745.8 208.29
## - RetPtWPct       1   2337.18 3393.6 218.88
## 
## Step:  AIC=156.02
## WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW + 
##     TBWPct
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                         891.01 156.02
## + BPConvPct       1     32.36  858.65 156.17
## + BPSavedPct      1     22.88  868.13 156.72
## + AcePct          1      5.50  885.51 157.71
## + MeanOppRk       1      5.29  885.72 157.72
## + DFPct           1      1.49  889.52 157.93
## - TBWPct          1    165.41 1056.42 162.53
## - SecondServePtW  1    279.09 1170.10 167.64
## - FirstServePct   1    338.64 1229.65 170.12
## - FirstServePtW   1   1242.11 2133.12 197.67
## - RetPtWPct       1   1970.48 2861.49 212.35

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
##     SecondServePtW + TBWPct, data = FinalVars)
## 
## Coefficients:
##    (Intercept)       RetPtWPct   FirstServePtW   FirstServePct  
##      -254.5145          2.6631          1.5334          0.7804  
## SecondServePtW          TBWPct  
##         0.8633          0.1710

final_model <- lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
    SecondServePtW + TBWPct, data = FinalVars)
summary(final_model)

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
##     SecondServePtW + TBWPct, data = FinalVars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7734 -3.2390 -0.3393  3.2544  8.4486 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -254.51453   26.34480  -9.661 1.92e-12 ***
## RetPtWPct         2.66314    0.26998   9.864 1.02e-12 ***
## FirstServePtW     1.53344    0.19579   7.832 7.04e-10 ***
## FirstServePct     0.78039    0.19083   4.089 0.000181 ***
## SecondServePtW    0.86328    0.23254   3.712 0.000574 ***
## TBWPct            0.17099    0.05983   2.858 0.006489 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.5 on 44 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.788 
## F-statistic: 37.43 on 5 and 44 DF,  p-value: 8.802e-15

# Check variance inflation factors for multicollinearity
car::vif(final_model)

##      RetPtWPct  FirstServePtW  FirstServePct SecondServePtW         TBWPct 
##       1.510121       1.609118       1.129266       1.092261       1.115057

# Diagnostic plots for checking the conditions of linear regression
plot(final_model)

# Experimental Models to test against final model

FinalVars.DR <- Tennis %>%
  select(WinningPct, MeanOppRk, AcePct, DFPct, FirstServePct, FirstServePtW, SecondServePtW, RetPtWPct, BPConvPct, BPSavedPct, TBWPct, DomRat)

# Adding dominance ratio
stepwin.DR <- lm(formula = WinningPct ~ DomRat + TBWPct + BPConvPct, data = FinalVars.DR)
summary(stepwin.DR)

## 
## Call:
## lm(formula = WinningPct ~ DomRat + TBWPct + BPConvPct, data = FinalVars.DR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9347  -2.9734   0.1278   3.4165   9.4820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.25670    8.35954  -4.696 2.42e-05 ***
## DomRat       72.55419    7.17899  10.106 2.90e-13 ***
## TBWPct        0.16753    0.06214   2.696  0.00977 ** 
## BPConvPct     0.30068    0.16234   1.852  0.07043 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.668 on 46 degrees of freedom
## Multiple R-squared:  0.7859, Adjusted R-squared:  0.7719 
## F-statistic: 56.27 on 3 and 46 DF,  p-value: 1.979e-15

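# Note: these two models are not nested, so this F-test is only a rough
# comparison; an information criterion such as AIC() is better suited here
anova(final_model, stepwin.DR)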

## Analysis of Variance Table
## 
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW + 
##     TBWPct
## Model 2: WinningPct ~ DomRat + TBWPct + BPConvPct
##   Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
## 1     44  891.01                             
## 2     46 1002.43 -2   -111.42 2.751 0.07486 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Taking out TBWPct since not at the point level
serve_return <- lm(WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW, data = FinalVars)
summary(serve_return)

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
##     SecondServePtW, data = FinalVars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.887  -3.081   1.066   2.974   8.243 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -267.6464    27.9309  -9.582 1.94e-12 ***
## RetPtWPct         2.8310     0.2837   9.978 5.58e-13 ***
## FirstServePtW     1.7036     0.2008   8.483 6.90e-11 ***
## FirstServePct     0.8248     0.2048   4.028 0.000214 ***
## SecondServePtW    0.8795     0.2503   3.514 0.001020 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.845 on 45 degrees of freedom
## Multiple R-squared:  0.7743, Adjusted R-squared:  0.7543 
## F-statistic:  38.6 on 4 and 45 DF,  p-value: 5.232e-14

anova(serve_return, final_model)

## Analysis of Variance Table
## 
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW
## Model 2: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW + 
##     TBWPct
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
## 1     45 1056.42                                
## 2     44  891.01  1    165.41 8.1683 0.006489 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Interaction between first serves in play and 
# winning first serve points
serve.int <- lm(WinningPct ~ RetPtWPct + FirstServePtW + FirstServePtW:FirstServePct + FirstServePct + SecondServePtW, data = FinalVars)
summary(serve.int)

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePtW:FirstServePct + 
##     FirstServePct + SecondServePtW, data = FinalVars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4716  -2.7015   0.7377   2.9932   8.0991 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  52.87840  217.92601   0.243  0.80941    
## RetPtWPct                     2.90871    0.28489  10.210 3.51e-13 ***
## FirstServePtW                -2.60550    2.91301  -0.894  0.37596    
## FirstServePct                -4.29169    3.45671  -1.242  0.22098    
## SecondServePtW                0.85630    0.24753   3.459  0.00121 ** 
## FirstServePtW:FirstServePct   0.06846    0.04617   1.483  0.14528    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.782 on 44 degrees of freedom
## Multiple R-squared:  0.7851, Adjusted R-squared:  0.7606 
## F-statistic: 32.14 on 5 and 44 DF,  p-value: 1.221e-13

# The interaction is not a significant improvement over the serve_return
# model (p = 0.145). serve.int and final_model are not nested, so this test
# does not compare those two models directly.
anova(serve_return, serve.int)

## Analysis of Variance Table
## 
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW
## Model 2: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePtW:FirstServePct + 
##     FirstServePct + SecondServePtW
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     45 1056.4                           
## 2     44 1006.1  1    50.271 2.1984 0.1453
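
Because serve.int and final_model each add a different term to serve_return, neither is nested in the other and anova() cannot test one against the other. One possible alternative is an information criterion. The sketch below computes AIC on the scale that step() reports, n * log(RSS / n) + 2k, from the residual sums of squares printed in the two anova tables. It assumes both models were fit to the same 50 players (44 residual degrees of freedom plus 6 coefficients); the helper name aic_from_rss is made up for this illustration.

```r
n <- 50  # assumed: 44 residual df + 6 estimated coefficients

# AIC up to an additive constant, on the scale used by step()
aic_from_rss <- function(rss, k, n) n * log(rss / n) + 2 * k

aic_final <- aic_from_rss(rss = 891.01, k = 6, n = n)  # final_model (adds TBWPct)
aic_int   <- aic_from_rss(rss = 1006.1, k = 6, n = n)  # serve.int (adds interaction)

round(c(final_model = aic_final, serve.int = aic_int), 2)
```

Since both models estimate the same number of coefficients, the comparison reduces to which one has the smaller residual sum of squares.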

FinalVars.DR <- FinalVars.DR %>%
  mutate(log_WPct = log(WinningPct),
         log_FSPct = log(FirstServePct),
         cube_TBWPct = (TBWPct)^3,
         sqr_Ret = (RetPtWPct)^2)

# Adjusted R-squared is not as high, and the interpretation
# is more complex
logWPct <- lm(log_WPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
    SecondServePtW + TBWPct, data = FinalVars.DR)
summary(logWPct)

## 
## Call:
## lm(formula = log_WPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
##     SecondServePtW + TBWPct, data = FinalVars.DR)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.172505 -0.055925  0.000312  0.054429  0.133356 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.795961   0.465869  -1.709  0.09458 .  
## RetPtWPct       0.041699   0.004774   8.734 3.65e-11 ***
## FirstServePtW   0.024480   0.003462   7.070 9.02e-09 ***
## FirstServePct   0.011768   0.003375   3.487  0.00112 ** 
## SecondServePtW  0.012624   0.004112   3.070  0.00366 ** 
## TBWPct          0.002563   0.001058   2.423  0.01958 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07958 on 44 degrees of freedom
## Multiple R-squared:  0.7653, Adjusted R-squared:  0.7387 
## F-statistic:  28.7 on 5 and 44 DF,  p-value: 8.143e-13
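
With a logged response, each coefficient describes a multiplicative change in WinningPct rather than an additive one. As a rough illustration, the sketch below converts the coefficients printed above into approximate percent changes per one-point increase in each predictor. The vector name b is illustrative only.

```r
# Coefficients copied from the logWPct summary above
b <- c(RetPtWPct      = 0.041699,
       FirstServePtW  = 0.024480,
       FirstServePct  = 0.011768,
       SecondServePtW = 0.012624,
       TBWPct         = 0.002563)

# Percent change in WinningPct associated with a one-point increase
# in each predictor, holding the others fixed
pct_change <- 100 * (exp(b) - 1)
round(pct_change, 2)
```

Note that the adjusted R-squared here describes variation in log(WinningPct), so it is only loosely comparable to the value from a model for WinningPct itself.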

# Adjusted R-squared is not as high, and the interpretation
# is more complex
logFSPct <- lm(WinningPct ~ RetPtWPct + FirstServePtW + log_FSPct + 
    SecondServePtW + TBWPct, data = FinalVars.DR)
summary(logFSPct)

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + log_FSPct + 
##     SecondServePtW + TBWPct, data = FinalVars.DR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8005 -3.3409 -0.3298  3.2006  8.4268 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -406.24447   56.70684  -7.164 6.58e-09 ***
## RetPtWPct         2.65940    0.27023   9.841 1.09e-12 ***
## FirstServePtW     1.53656    0.19635   7.826 7.19e-10 ***
## log_FSPct        48.62038   11.95546   4.067 0.000194 ***
## SecondServePtW    0.85270    0.23358   3.651 0.000691 ***
## TBWPct            0.17079    0.05992   2.850 0.006627 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.507 on 44 degrees of freedom
## Multiple R-squared:  0.8091, Adjusted R-squared:  0.7874 
## F-statistic: 37.29 on 5 and 44 DF,  p-value: 9.4e-15
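
When only the predictor is logged, its coefficient is read in terms of relative changes in that predictor. The sketch below works this out for log_FSPct using the estimate printed above. The 60 and 63 percent figures are hypothetical values chosen for illustration, not observations from the data.

```r
b_log_fs <- 48.62038  # coefficient on log_FSPct from the logFSPct summary

# Change in WinningPct (percentage points) associated with a 1% relative
# increase in FirstServePct, holding the other predictors fixed
effect_1pct <- b_log_fs * log(1.01)

# Same calculation for a hypothetical move from 60% to 63% first serves in
effect_60_to_63 <- b_log_fs * log(63 / 60)

round(c(one_percent = effect_1pct, from_60_to_63 = effect_60_to_63), 3)
```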

# Again, lower adjusted R-squared and a more complicated
# interpretation
logged <- lm(log_WPct ~ RetPtWPct + FirstServePtW + log_FSPct + 
    SecondServePtW + TBWPct, data = FinalVars.DR)
summary(logged)

## 
## Call:
## lm(formula = log_WPct ~ RetPtWPct + FirstServePtW + log_FSPct + 
##     SecondServePtW + TBWPct, data = FinalVars.DR)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.173932 -0.057441 -0.000197  0.055910  0.133031 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.083484   1.002408  -3.076  0.00360 ** 
## RetPtWPct       0.041642   0.004777   8.718 3.85e-11 ***
## FirstServePtW   0.024526   0.003471   7.066 9.14e-09 ***
## log_FSPct       0.733064   0.211337   3.469  0.00118 ** 
## SecondServePtW  0.012465   0.004129   3.019  0.00421 ** 
## TBWPct          0.002560   0.001059   2.417  0.01986 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07967 on 44 degrees of freedom
## Multiple R-squared:  0.7648, Adjusted R-squared:  0.7381 
## F-statistic: 28.61 on 5 and 44 DF,  p-value: 8.557e-13

# Cubed term for TBWPct from scatterplot above
cubeTBWPct <- lm(WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
    SecondServePtW + cube_TBWPct + TBWPct, data = FinalVars.DR)
summary(cubeTBWPct)

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + 
##     SecondServePtW + cube_TBWPct + TBWPct, data = FinalVars.DR)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.921 -3.312 -0.040  3.344  8.207 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.607e+02  3.193e+01  -8.164 2.78e-10 ***
## RetPtWPct       2.702e+00  2.944e-01   9.176 1.10e-11 ***
## FirstServePtW   1.546e+00  2.011e-01   7.690 1.31e-09 ***
## FirstServePct   8.000e-01  2.008e-01   3.985 0.000257 ***
## SecondServePtW  8.733e-01  2.366e-01   3.691 0.000625 ***
## cube_TBWPct    -6.595e-06  1.884e-05  -0.350 0.728051    
## TBWPct          2.309e-01  1.815e-01   1.272 0.210125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.546 on 43 degrees of freedom
## Multiple R-squared:  0.8102, Adjusted R-squared:  0.7837 
## F-statistic: 30.59 on 6 and 43 DF,  p-value: 5.391e-14

anova(final_model, cubeTBWPct)

## Analysis of Variance Table
## 
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW + 
##     TBWPct
## Model 2: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW + 
##     cube_TBWPct + TBWPct
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     44 891.01                           
## 2     43 888.48  1    2.5311 0.1225 0.7281

# Squared term for RetPtWPct from scatterplot above
sqrRet <- lm(WinningPct ~ RetPtWPct + sqr_Ret + FirstServePtW + FirstServePct + 
    SecondServePtW + TBWPct, data = FinalVars.DR)
summary(sqrRet)

## 
## Call:
## lm(formula = WinningPct ~ RetPtWPct + sqr_Ret + FirstServePtW + 
##     FirstServePct + SecondServePtW + TBWPct, data = FinalVars.DR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3354 -3.2610 -0.1187  3.4057  9.5859 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -107.58180   79.82591  -1.348  0.18481    
## RetPtWPct        -4.48511    3.68850  -1.216  0.23063    
## sqr_Ret           0.09936    0.05114   1.943  0.05860 .  
## FirstServePtW     1.47845    0.19200   7.700 1.26e-09 ***
## FirstServePct     0.62957    0.20071   3.137  0.00308 ** 
## SecondServePtW    0.76658    0.23096   3.319  0.00185 ** 
## TBWPct            0.15868    0.05837   2.718  0.00942 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.365 on 43 degrees of freedom
## Multiple R-squared:  0.825,  Adjusted R-squared:  0.8006 
## F-statistic: 33.79 on 6 and 43 DF,  p-value: 9.715e-15

anova(final_model, sqrRet)

## Analysis of Variance Table
## 
## Model 1: WinningPct ~ RetPtWPct + FirstServePtW + FirstServePct + SecondServePtW + 
##     TBWPct
## Model 2: WinningPct ~ RetPtWPct + sqr_Ret + FirstServePtW + FirstServePct + 
##     SecondServePtW + TBWPct
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1     44 891.01                             
## 2     43 819.11  1    71.906 3.7748 0.0586 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
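
In the quadratic model the effect of RetPtWPct is no longer a single number: the fitted slope depends on where a player sits on the return scale. The following sketch evaluates that slope from the two printed coefficients. The grid of RetPtWPct values is hypothetical and chosen only for illustration.

```r
b_ret  <- -4.48511  # coefficient on RetPtWPct from the sqrRet summary
b_ret2 <-  0.09936  # coefficient on sqr_Ret

# Slope of the fitted curve: d/dx (b1 * x + b2 * x^2) = b1 + 2 * b2 * x
marginal_effect <- function(x) b_ret + 2 * b_ret2 * x

# Turning point of the parabola, where the slope is zero
turning_point <- -b_ret / (2 * b_ret2)

x_grid <- c(35, 38, 41)  # hypothetical return-point percentages
data.frame(RetPtWPct = x_grid, slope = round(marginal_effect(x_grid), 2))
round(turning_point, 1)
```

The negative linear coefficient on its own therefore does not mean that winning more return points lowers WinningPct; the slope is positive for any RetPtWPct above the turning point.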

Summary Statistics for EDA

favstats(~ DFPct, data = Tennis) # % double faults

##  min  Q1 median  Q3 max  mean       sd  n missing
##  1.7 2.6    3.4 4.1 7.7 3.564 1.370351 50       0

favstats(~ BPSavedPct, data = Tennis) # % of break points faced that the player saves

##   min     Q1 median     Q3  max   mean       sd  n missing
##  56.4 60.625   63.2 65.975 75.4 63.744 4.204427 50       0

favstats(~ DomRat, data = Tennis) # % return points won / % serve points lost

##   min   Q1 median     Q3  max  mean        sd  n missing
##  0.94 1.02   1.06 1.1075 1.42 1.081 0.1026834 50       0

favstats(~ AcePct, data = Tennis) # % of serves that are aces

##  min  Q1 median    Q3  max  mean       sd  n missing
##  2.1 5.5    8.1 11.75 25.4 9.438 5.457475 50       0

favstats(~ InPlayRetPtW, data = Tennis) # % return points won excluding aces and double faults

##   min     Q1 median     Q3  max  mean       sd  n missing
##  27.7 35.775   37.6 39.275 44.4 37.47 3.321805 50       0

favstats(~ SecondServePtW, data = Tennis) # % serve points won starting with a second serve

##   min   Q1 median   Q3  max   mean       sd  n missing
##  42.9 50.6   52.2 54.5 58.5 52.248 2.889265 50       0

favstats(~ TBWPct, data = Tennis) # percent of tiebreakers won

##   min    Q1 median Q3 max  mean       sd  n missing
##  26.7 48.95  55.15 60  87 54.14 11.34642 50       0

EDA

# EDA for most variables we examine
ggplot(Tennis, aes(x = FirstServePct, y =  WinningPct)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

ggplot(Tennis, aes(BreakPct, WinningPct)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# after_stat(density) replaces the deprecated ..density.. notation
ggplot(Tennis, aes(x = WinningPct, y = after_stat(density))) +
  geom_histogram() +
  geom_density()

ggplot(Tennis, aes(WinningPct)) +
  geom_histogram()

ggplot(Tennis, aes(Rank, WinningPct)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, color = "red")

cor(Rank, WinningPct, data = Tennis)

ggplot(Tennis, aes(TBWPct, WinningPct)) +
  geom_point() +
  geom_smooth(se = FALSE)

Correlation

InitVars <- Tennis %>%
  select(WinningPct, Rank, FirstServePct, FirstServePtW, BPConvPct, BreakPct, DFPct, BPSavedPct, DomRat, AcePct, InPlayRetPtW, SecondServePtW, TBWPct)

pairs(InitVars) # scatterplot matrix

round(cor(InitVars), 2) # table of correlations

##                WinningPct  Rank FirstServePct FirstServePtW BPConvPct
## WinningPct           1.00 -0.65          0.17          0.28      0.35
## Rank                -0.65  1.00         -0.13         -0.14     -0.22
## FirstServePct        0.17 -0.13          1.00         -0.13      0.01
## FirstServePtW        0.28 -0.14         -0.13          1.00     -0.37
## BPConvPct            0.35 -0.22          0.01         -0.37      1.00
## BreakPct             0.52 -0.45         -0.01         -0.51      0.70
## DFPct               -0.10 -0.16         -0.38          0.26     -0.13
## BPSavedPct           0.32 -0.09          0.25          0.55     -0.42
## DomRat               0.86 -0.58          0.13          0.40      0.29
## AcePct               0.03  0.07          0.04          0.85     -0.47
## InPlayRetPtW         0.45 -0.36         -0.08         -0.53      0.65
## SecondServePtW       0.45 -0.13          0.19          0.10      0.08
## TBWPct               0.43 -0.36          0.01          0.23     -0.01
##                BreakPct DFPct BPSavedPct DomRat AcePct InPlayRetPtW
## WinningPct         0.52 -0.10       0.32   0.86   0.03         0.45
## Rank              -0.45 -0.16      -0.09  -0.58   0.07        -0.36
## FirstServePct     -0.01 -0.38       0.25   0.13   0.04        -0.08
## FirstServePtW     -0.51  0.26       0.55   0.40   0.85        -0.53
## BPConvPct          0.70 -0.13      -0.42   0.29  -0.47         0.65
## BreakPct           1.00 -0.07      -0.39   0.47  -0.69         0.96
## DFPct             -0.07  1.00      -0.09  -0.12   0.26        -0.10
## BPSavedPct        -0.39 -0.09       1.00   0.34   0.51        -0.40
## DomRat             0.47 -0.12       0.34   1.00   0.09         0.43
## AcePct            -0.69  0.26       0.51   0.09   1.00        -0.74
## InPlayRetPtW       0.96 -0.10      -0.40   0.43  -0.74         1.00
## SecondServePtW     0.12 -0.65       0.41   0.56  -0.02         0.11
## TBWPct             0.03  0.09       0.25   0.31   0.17         0.03
##                SecondServePtW TBWPct
## WinningPct               0.45   0.43
## Rank                    -0.13  -0.36
## FirstServePct            0.19   0.01
## FirstServePtW            0.10   0.23
## BPConvPct                0.08  -0.01
## BreakPct                 0.12   0.03
## DFPct                   -0.65   0.09
## BPSavedPct               0.41   0.25
## DomRat                   0.56   0.31
## AcePct                  -0.02   0.17
## InPlayRetPtW             0.11   0.03
## SecondServePtW           1.00   0.09
## TBWPct                   0.09   1.00
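
Several predictors in the table above are strongly correlated with one another, which matters when they enter the same regression. As a rough sketch, the code below rebuilds a small sub-matrix from the printed correlations and lists the pairs whose absolute correlation exceeds a cutoff. The 0.8 threshold is an arbitrary choice for this illustration, not a value from the analysis.

```r
# Sub-matrix copied from the correlation table above
vars <- c("BreakPct", "InPlayRetPtW", "AcePct", "FirstServePtW")
r <- matrix(c( 1.00,  0.96, -0.69, -0.51,
               0.96,  1.00, -0.74, -0.53,
              -0.69, -0.74,  1.00,  0.85,
              -0.51, -0.53,  0.85,  1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

threshold <- 0.8  # arbitrary cutoff chosen for this sketch

# Look only above the diagonal so each pair is reported once
idx <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(r)[idx[, "row"]],
                         var2 = colnames(r)[idx[, "col"]],
                         r    = r[idx])
high_pairs
```

With the real data, the same idea could be applied directly to cor(InitVars) instead of a hand-copied sub-matrix.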

Initial Modeling

# Initial Modeling setup and results

#######SETUP#######
model1 <- lm(WinningPct ~ 1, data = Tennis)
model2 <- lm(Rank ~ 1, data = Tennis)

## These next sets and models are to easily generate stepwise functions just
## using those initial variables we picked out as the final model.

WInitVars <- InitVars %>%
  select(-Rank)

WInitVars.noDR <- InitVars %>%
  select(-Rank, -DomRat)

RInitVars <- InitVars %>%
  select(-WinningPct)

model1Init <- lm(WinningPct ~ ., data = WInitVars)
model1Init.noDR <- lm(WinningPct ~ ., data = WInitVars.noDR)
model2Init <- lm(Rank ~ ., data = RInitVars)

##CLEAN OUT VARIABLES BEFORE USING BELOW##

## These next ones make it easy to throw all the predictors into one big
## stepwise search. Some variables are just calculations of others, and some
## are raw numbers and therefore largely affected by the number of matches,
## so we should look at all the variables and clean out the ones we don't
## want before we use these.

#WPredictors <- Tennis %>%
#  select(-Rank, -Player, -Country)

#WPredNoDR <- Tennis %>%
#  select(-Rank, -Player, -Country, -DomRat)

#RPredictors <- Tennis %>%
#  select(-WinningPct, -Player, -Country)

#RPredNoDR <- Tennis %>%
#  select(-WinningPct, -Player, -Country, -DomRat)

#model1big <- lm(WinningPct ~ ., data = WPredictors)
#model1big.noDR <- lm(WinningPct ~ ., data = WPredNoDR)
#model2big <- lm(Rank ~ ., data = RPredictors)
#model2big.noDR <- lm(Rank ~ ., data = RPredNoDR)

#######EXPERIMENTAL MODELS#######
model1a <- lm(WinningPct ~ DomRat, data = Tennis)
model2a <- lm(Rank ~ DomRat, data = Tennis)
summary(model1a)

## 
## Call:
## lm(formula = WinningPct ~ DomRat, data = Tennis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3637  -3.6617   0.4441   3.6489   9.2440 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -28.475      7.596  -3.749 0.000478 ***
## DomRat        81.924      6.996  11.710 1.13e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.029 on 48 degrees of freedom
## Multiple R-squared:  0.7407, Adjusted R-squared:  0.7353 
## F-statistic: 137.1 on 1 and 48 DF,  p-value: 1.126e-15

summary(model2a)

## 
## Call:
## lm(formula = Rank ~ DomRat, data = Tennis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0148 -10.1341   0.5657  10.2336  24.2366 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   114.77      18.10   6.342 7.52e-08 ***
## DomRat        -82.58      16.67  -4.955 9.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.98 on 48 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3246 
## F-statistic: 24.55 on 1 and 48 DF,  p-value: 9.408e-06

model1b <- lm(WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW, data = Tennis)
model2b <- lm(Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW, data = Tennis)
summary(model1b)

## 
## Call:
## lm(formula = WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW, 
##     data = Tennis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1489  -3.8786  -0.0153   3.8547  12.9941 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -191.4998    27.1136  -7.063 7.33e-09 ***
## InPlayRetPtW      2.2513     0.3023   7.447 1.95e-09 ***
## FirstServePtW     1.5532     0.2406   6.455 5.99e-08 ***
## SecondServePtW    1.0013     0.2962   3.381  0.00148 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.848 on 46 degrees of freedom
## Multiple R-squared:  0.664,  Adjusted R-squared:  0.6421 
## F-statistic:  30.3 on 3 and 46 DF,  p-value: 5.779e-11

summary(model2b)

## 
## Call:
## lm(formula = Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW, 
##     data = Tennis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8456 -11.2203  -0.0928   8.9812  23.1730 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    245.81915   59.11976   4.158 0.000138 ***
## InPlayRetPtW    -2.63672    0.65914  -4.000 0.000228 ***
## FirstServePtW   -1.58880    0.52463  -3.028 0.004022 ** 
## SecondServePtW  -0.07615    0.64578  -0.118 0.906647    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.75 on 46 degrees of freedom
## Multiple R-squared:  0.2818, Adjusted R-squared:  0.235 
## F-statistic: 6.017 on 3 and 46 DF,  p-value: 0.001515

model1c <- lm(WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW + DomRat, data = Tennis)
model2c <- lm(Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW + DomRat, data = Tennis)
summary(model1c)

## 
## Call:
## lm(formula = WinningPct ~ InPlayRetPtW + FirstServePtW + SecondServePtW + 
##     DomRat, data = Tennis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8319  -3.0307   0.5679   3.2883   9.3918 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -26.31668   48.68238  -0.541 0.591463    
## InPlayRetPtW     0.21080    0.58799   0.359 0.721634    
## FirstServePtW   -0.03969    0.46085  -0.086 0.931747    
## SecondServePtW  -0.13093    0.38993  -0.336 0.738595    
## DomRat          81.66449   21.01745   3.886 0.000332 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.116 on 45 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.726 
## F-statistic: 33.46 on 4 and 45 DF,  p-value: 5.85e-13

summary(model2c)

## 
## Call:
## lm(formula = Rank ~ InPlayRetPtW + FirstServePtW + SecondServePtW + 
##     DomRat, data = Tennis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.554  -7.625  -2.319   9.064  21.635 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -81.5941   109.1976  -0.747  0.45882   
## InPlayRetPtW      1.4078     1.3189   1.067  0.29147   
## FirstServePtW     1.5686     1.0337   1.517  0.13615   
## SecondServePtW    2.1680     0.8746   2.479  0.01700 * 
## DomRat         -161.8691    47.1434  -3.434  0.00129 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.48 on 45 degrees of freedom
## Multiple R-squared:  0.4309, Adjusted R-squared:  0.3803 
## F-statistic: 8.518 on 4 and 45 DF,  p-value: 3.318e-05

# These models were mostly to check whether DomRat would overpower the other
# variables. It looks like it does, at least with WinningPct as the response:
# once DomRat is included, no other predictor is significant. We might have
# to look at other specific factors and leave DomRat out, though we still
# have to mention it. DomRat is less dominant with Rank as the response, so
# we could look at which variables improve on DomRat for predicting Rank if
# we want more to discuss.

#######STEPWISE MODELS#######

# These are the stepwise models generated with just the initial couple of
# variables as the full model

step(model1, direction = "both", scope = formula(model1Init))

## Start:  AIC=228.96
## WinningPct ~ 1
## 
##                  Df Sum of Sq    RSS    AIC
## + DomRat          1    3467.5 1213.7 163.47
## + BreakPct        1    1283.0 3398.2 214.95
## + InPlayRetPtW    1     941.8 3739.4 219.73
## + SecondServePtW  1     934.4 3746.8 219.83
## + TBWPct          1     850.5 3830.7 220.94
## + BPConvPct       1     583.7 4097.5 224.31
## + BPSavedPct      1     466.8 4214.4 225.71
## + FirstServePtW   1     380.0 4301.3 226.73
## <none>                        4681.2 228.96
## + FirstServePct   1     138.9 4542.3 229.46
## + DFPct           1      42.4 4638.8 230.51
## + AcePct          1       5.2 4676.0 230.91
## 
## Step:  AIC=163.47
## WinningPct ~ DomRat
## 
##                  Df Sum of Sq    RSS    AIC
## + TBWPct          1     136.5 1077.2 159.50
## + BreakPct        1      81.7 1132.0 161.99
## + BPConvPct       1      52.9 1160.8 163.24
## <none>                        1213.7 163.47
## + InPlayRetPtW    1      32.4 1181.3 164.12
## + FirstServePtW   1      19.8 1193.9 164.65
## + FirstServePct   1      15.4 1198.3 164.83
## + AcePct          1       7.8 1206.0 165.15
## + SecondServePtW  1       7.7 1206.0 165.15
## + BPSavedPct      1       2.7 1211.0 165.36
## + DFPct           1       0.1 1213.6 165.47
## - DomRat          1    3467.5 4681.2 228.96
## 
## Step:  AIC=159.5
## WinningPct ~ DomRat + TBWPct
## 
##                  Df Sum of Sq    RSS    AIC
## + BreakPct        1    117.04  960.1 155.75
## + BPConvPct       1     74.75 1002.4 157.91
## + InPlayRetPtW    1     51.57 1025.6 159.05
## <none>                        1077.2 159.50
## + FirstServePtW   1     35.17 1042.0 159.84
## + AcePct          1     21.82 1055.4 160.48
## + FirstServePct   1     18.24 1058.9 160.65
## + SecondServePtW  1      2.62 1074.6 161.38
## + DFPct           1      1.61 1075.6 161.43
## + BPSavedPct      1      0.06 1077.1 161.50
## - TBWPct          1    136.55 1213.7 163.47
## - DomRat          1   2753.51 3830.7 220.94
## 
## Step:  AIC=155.75
## WinningPct ~ DomRat + TBWPct + BreakPct
## 
##                  Df Sum of Sq     RSS    AIC
## + InPlayRetPtW    1    101.37  858.77 152.17
## + BPSavedPct      1     82.86  877.27 153.24
## + AcePct          1     58.34  901.80 154.62
## + FirstServePtW   1     46.17  913.97 155.29
## <none>                         960.14 155.75
## + FirstServePct   1     27.68  932.46 156.29
## + BPConvPct       1      3.71  956.43 157.56
## + DFPct           1      1.52  958.62 157.67
## + SecondServePtW  1      0.43  959.71 157.73
## - BreakPct        1    117.04 1077.18 159.50
## - TBWPct          1    171.89 1132.03 161.99
## - DomRat          1   1638.79 2598.93 203.54
## 
## Step:  AIC=152.17
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW
## 
##                  Df Sum of Sq     RSS    AIC
## + BPSavedPct      1     71.45  787.32 149.83
## <none>                         858.77 152.17
## + FirstServePtW   1     25.81  832.96 152.65
## + AcePct          1     14.05  844.72 153.35
## + FirstServePct   1      8.21  850.56 153.69
## + DFPct           1      5.62  853.14 153.84
## + SecondServePtW  1      1.20  857.57 154.10
## + BPConvPct       1      0.20  858.57 154.16
## - InPlayRetPtW    1    101.37  960.14 155.75
## - BreakPct        1    166.84 1025.61 159.05
## - TBWPct          1    183.33 1042.09 159.85
## - DomRat          1   1549.48 2408.24 201.73
## 
## Step:  AIC=149.83
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct
## 
##                  Df Sum of Sq     RSS    AIC
## + FirstServePtW   1     60.67  726.65 147.82
## <none>                         787.32 149.83
## + AcePct          1     24.08  763.24 150.28
## + BPConvPct       1      7.10  780.22 151.38
## + DFPct           1      1.91  785.41 151.71
## + FirstServePct   1      1.20  786.11 151.75
## + SecondServePtW  1      0.66  786.66 151.79
## - BPSavedPct      1     71.45  858.77 152.17
## - InPlayRetPtW    1     89.95  877.27 153.24
## - TBWPct          1    160.72  948.04 157.12
## - BreakPct        1    206.18  993.50 159.46
## - DomRat          1    683.54 1470.86 179.08
## 
## Step:  AIC=147.82
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct + 
##     FirstServePtW
## 
##                  Df Sum of Sq    RSS    AIC
## + DFPct           1    67.408 659.24 144.95
## + FirstServePct   1    54.605 672.05 145.91
## + SecondServePtW  1    35.825 690.83 147.29
## <none>                        726.65 147.82
## + BPConvPct       1     7.828 718.82 149.28
## - InPlayRetPtW    1    59.254 785.90 149.74
## + AcePct          1     0.131 726.52 149.81
## - FirstServePtW   1    60.668 787.32 149.83
## - DomRat          1    75.017 801.67 150.73
## - BPSavedPct      1   106.307 832.96 152.65
## - TBWPct          1   153.918 880.57 155.43
## - BreakPct        1   264.795 991.45 161.36
## 
## Step:  AIC=144.95
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct + 
##     FirstServePtW + DFPct
## 
##                  Df Sum of Sq    RSS    AIC
## + FirstServePct   1     53.06 606.18 142.76
## - DomRat          1      5.51 664.75 143.37
## <none>                        659.24 144.95
## + SecondServePtW  1      4.16 655.08 146.64
## + AcePct          1      1.70 657.54 146.82
## + BPConvPct       1      1.41 657.83 146.85
## - InPlayRetPtW    1     57.41 716.65 147.13
## - DFPct           1     67.41 726.65 147.82
## - BPSavedPct      1    120.32 779.56 151.34
## - FirstServePtW   1    126.17 785.41 151.71
## - TBWPct          1    184.24 843.48 155.28
## - BreakPct        1    330.73 989.98 163.28
## 
## Step:  AIC=142.76
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct + 
##     FirstServePtW + DFPct + FirstServePct
## 
##                  Df Sum of Sq    RSS    AIC
## + SecondServePtW  1     73.83 532.36 138.26
## - DomRat          1      1.34 607.53 140.87
## - InPlayRetPtW    1     14.71 620.89 141.96
## <none>                        606.18 142.76
## + AcePct          1      0.99 605.20 144.68
## + BPConvPct       1      0.64 605.54 144.71
## - FirstServePct   1     53.06 659.24 144.95
## - DFPct           1     65.86 672.05 145.91
## - BPSavedPct      1    112.90 719.09 149.30
## - FirstServePtW   1    178.87 785.05 153.69
## - TBWPct          1    191.66 797.85 154.50
## - BreakPct        1    331.34 937.53 162.56
## 
## Step:  AIC=138.26
## WinningPct ~ DomRat + TBWPct + BreakPct + InPlayRetPtW + BPSavedPct + 
##     FirstServePtW + DFPct + FirstServePct + SecondServePtW
## 
##                  Df Sum of Sq    RSS    AIC
## - InPlayRetPtW    1      0.06 532.41 136.27
## - DFPct           1      1.10 533.46 136.37
## <none>                        532.36 138.26
## + AcePct          1      9.26 523.09 139.39
## + BPConvPct       1      0.11 532.25 140.25
## - DomRat          1     52.75 585.11 140.99
## - SecondServePtW  1     73.83 606.18 142.76
## - BPSavedPct      1     82.95 615.31 143.50
## - FirstServePct   1    122.72 655.08 146.64
## - TBWPct          1    228.93 761.29 154.15
## - FirstServePtW   1    246.18 778.53 155.27
## - BreakPct        1    396.96 929.31 164.12
## 
## Step:  AIC=136.27
## WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + FirstServePtW + 
##     DFPct + FirstServePct + SecondServePtW
## 
##                  Df Sum of Sq     RSS    AIC
## - DFPct           1      1.32  533.74 134.39
## <none>                         532.41 136.27
## + AcePct          1      9.26  523.15 137.39
## + BPConvPct       1      0.09  532.32 138.26
## + InPlayRetPtW    1      0.06  532.36 138.26
## - DomRat          1     67.90  600.31 140.27
## - BPSavedPct      1     82.91  615.32 141.51
## - SecondServePtW  1     88.48  620.89 141.96
## - FirstServePct   1    175.88  708.30 148.54
## - TBWPct          1    229.10  761.51 152.16
## - FirstServePtW   1    330.95  863.36 158.44
## - BreakPct        1    478.65 1011.06 166.34
## 
## Step:  AIC=134.39
## WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + FirstServePtW + 
##     FirstServePct + SecondServePtW
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                         533.74 134.39
## + AcePct          1     10.52  523.21 135.40
## + DFPct           1      1.32  532.41 136.27
## + InPlayRetPtW    1      0.28  533.46 136.37
## + BPConvPct       1      0.18  533.56 136.38
## - DomRat          1     70.28  604.02 138.58
## - BPSavedPct      1     81.79  615.52 139.52
## - SecondServePtW  1    153.51  687.25 145.03
## - FirstServePct   1    206.45  740.18 148.74
## - TBWPct          1    228.26  762.00 150.20
## - FirstServePtW   1    329.74  863.47 156.45
## - BreakPct        1    478.36 1012.09 164.39

## 
## Call:
## lm(formula = WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + 
##     FirstServePtW + FirstServePct + SecondServePtW, data = Tennis)
## 
## Coefficients:
##    (Intercept)          DomRat          TBWPct        BreakPct  
##      -270.1609        -55.8446          0.2089          2.5595  
##     BPSavedPct   FirstServePtW   FirstServePct  SecondServePtW  
##         0.4602          2.4912          0.7835          1.1659

# lm(formula = WinningPct ~ DomRat + TBWPct + BreakPct + BPSavedPct + FirstServePtW + FirstServePct + SecondServePtW, data = Tennis)

step(model1, direction = "both", scope = formula(model1Init.noDR))

## Start:  AIC=228.96
## WinningPct ~ 1
## 
##                  Df Sum of Sq    RSS    AIC
## + BreakPct        1   1282.98 3398.2 214.95
## + InPlayRetPtW    1    941.83 3739.4 219.73
## + SecondServePtW  1    934.45 3746.8 219.83
## + TBWPct          1    850.52 3830.7 220.94
## + BPConvPct       1    583.68 4097.5 224.31
## + BPSavedPct      1    466.80 4214.4 225.71
## + FirstServePtW   1    379.96 4301.3 226.73
## <none>                        4681.2 228.96
## + FirstServePct   1    138.86 4542.3 229.46
## + DFPct           1     42.37 4638.8 230.51
## + AcePct          1      5.24 4676.0 230.91
## 
## Step:  AIC=214.95
## WinningPct ~ BreakPct
## 
##                  Df Sum of Sq    RSS    AIC
## + FirstServePtW   1   1927.85 1470.4 175.06
## + BPSavedPct      1   1491.95 1906.3 188.04
## + AcePct          1   1386.65 2011.6 190.73
## + TBWPct          1    799.30 2598.9 203.54
## + SecondServePtW  1    693.02 2705.2 205.54
## + InPlayRetPtW    1    180.45 3217.8 214.22
## + FirstServePct   1    147.72 3250.5 214.73
## <none>                        3398.2 214.95
## + DFPct           1     14.99 3383.2 216.73
## + BPConvPct       1      2.21 3396.0 216.92
## - BreakPct        1   1282.98 4681.2 228.96
## 
## Step:  AIC=175.06
## WinningPct ~ BreakPct + FirstServePtW
## 
##                  Df Sum of Sq    RSS    AIC
## + BPSavedPct      1    461.82 1008.6 158.21
## + FirstServePct   1    377.03 1093.4 162.25
## + SecondServePtW  1    341.65 1128.7 163.84
## + TBWPct          1    274.33 1196.0 166.74
## + DFPct           1    252.67 1217.7 167.63
## <none>                        1470.4 175.06
## + InPlayRetPtW    1     39.16 1431.2 175.71
## + AcePct          1     15.10 1455.3 176.55
## + BPConvPct       1      0.87 1469.5 177.03
## - FirstServePtW   1   1927.85 3398.2 214.95
## - BreakPct        1   2830.87 4301.3 226.73
## 
## Step:  AIC=158.21
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct
## 
##                  Df Sum of Sq    RSS    AIC
## + TBWPct          1    164.78  843.8 151.29
## + FirstServePct   1    148.69  859.9 152.24
## + DFPct           1    103.67  904.9 154.79
## + SecondServePtW  1     89.70  918.9 155.56
## <none>                        1008.6 158.21
## + InPlayRetPtW    1     31.01  977.5 158.65
## + BPConvPct       1     19.84  988.7 159.22
## + AcePct          1      9.94  998.6 159.72
## - BPSavedPct      1    461.82 1470.4 175.06
## - FirstServePtW   1    897.72 1906.3 188.04
## - BreakPct        1   3122.01 4130.6 226.71
## 
## Step:  AIC=151.29
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct
## 
##                  Df Sum of Sq    RSS    AIC
## + FirstServePct   1    151.30  692.5 143.41
## + DFPct           1    125.22  718.6 145.26
## + SecondServePtW  1    103.54  740.2 146.75
## + InPlayRetPtW    1     42.11  801.7 150.73
## <none>                         843.8 151.29
## + BPConvPct       1     19.45  824.3 152.13
## + AcePct          1      5.37  838.4 152.97
## - TBWPct          1    164.78 1008.6 158.21
## - BPSavedPct      1    352.27 1196.0 166.74
## - FirstServePtW   1    734.68 1578.5 180.61
## - BreakPct        1   2727.54 3571.3 221.43
## 
## Step:  AIC=143.41
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct + 
##     FirstServePct
## 
##                  Df Sum of Sq    RSS    AIC
## + SecondServePtW  1     88.47  604.0 138.58
## + DFPct           1     65.92  626.6 140.41
## <none>                         692.5 143.41
## + InPlayRetPtW    1      8.46  684.0 144.80
## + BPConvPct       1      8.18  684.3 144.82
## + AcePct          1      2.48  690.0 145.23
## - FirstServePct   1    151.30  843.8 151.29
## - BPSavedPct      1    164.01  856.5 152.04
## - TBWPct          1    167.39  859.9 152.24
## - FirstServePtW   1    875.77 1568.3 182.28
## - BreakPct        1   2766.67 3459.1 221.84
## 
## Step:  AIC=138.58
## WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct + 
##     FirstServePct + SecondServePtW
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                         604.02 138.58
## + InPlayRetPtW    1     12.07  591.94 139.57
## + DFPct           1      3.70  600.31 140.27
## + BPConvPct       1      3.28  600.74 140.31
## + AcePct          1      0.04  603.98 140.58
## - BPSavedPct      1     56.88  660.89 141.08
## - SecondServePtW  1     88.47  692.48 143.41
## - FirstServePct   1    136.22  740.24 146.75
## - TBWPct          1    180.12  784.13 149.63
## - FirstServePtW   1    881.52 1485.54 181.57
## - BreakPct        1   2229.70 2833.71 213.87

## 
## Call:
## lm(formula = WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + 
##     TBWPct + FirstServePct + SecondServePtW, data = Tennis)
## 
## Coefficients:
##    (Intercept)        BreakPct   FirstServePtW      BPSavedPct  
##      -173.7426          1.6216          1.4125          0.3762  
##         TBWPct   FirstServePct  SecondServePtW  
##         0.1795          0.5108          0.5428

# lm(formula = WinningPct ~ BreakPct + FirstServePtW + BPSavedPct + TBWPct + FirstServePct + SecondServePtW, data = Tennis)
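
As a check on the step() output above, the AIC column can be reproduced by hand. For a linear model, step() reports n * log(RSS / n) + 2p, where p counts the estimated coefficients including the intercept. The sketch below assumes n = 50 players; that sample size is not stated in this output and is inferred here only because it reproduces the printed values to within rounding. The helper name `step_aic` is hypothetical.

```r
# AIC as reported by step() for an lm fit: n * log(RSS / n) + 2 * p
# n = 50 is an assumption inferred from the printed output, not a stated value
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 50
aic_null  <- step_aic(rss = 4681.2, n = n, p = 1)  # WinningPct ~ 1
aic_final <- step_aic(rss = 604.02, n = n, p = 7)  # six predictors plus intercept

# Compare with the printed AIC values for the first and last steps
round(c(null = aic_null, final = aic_final), 2)
```

The same formula applies to the Rank models, so the two stepwise searches are scored on the same basis.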

step(model2, direction = "both", scope = formula(model2Init))

## Start:  AIC=268.94
## Rank ~ 1
## 
##                  Df Sum of Sq     RSS    AIC
## + DomRat          1    3523.3  6889.2 250.28
## + BreakPct        1    2086.6  8325.9 259.75
## + InPlayRetPtW    1    1365.8  9046.7 263.91
## + TBWPct          1    1316.8  9095.7 264.18
## + BPConvPct       1     508.1  9904.4 268.44
## <none>                        10412.5 268.94
## + DFPct           1     255.1 10157.4 269.70
## + FirstServePtW   1     196.0 10216.5 269.99
## + SecondServePtW  1     168.6 10243.9 270.12
## + FirstServePct   1     164.9 10247.6 270.14
## + BPSavedPct      1      85.8 10326.7 270.52
## + AcePct          1      55.6 10356.9 270.67
## 
## Step:  AIC=250.28
## Rank ~ DomRat
## 
##                  Df Sum of Sq     RSS    AIC
## + SecondServePtW  1     589.7  6299.5 247.81
## + DFPct           1     527.6  6361.6 248.30
## + BreakPct        1     399.1  6490.1 249.30
## + TBWPct          1     361.6  6527.6 249.59
## <none>                         6889.2 250.28
## + AcePct          1     158.9  6730.3 251.12
## + InPlayRetPtW    1     154.4  6734.8 251.15
## + BPSavedPct      1     135.5  6753.7 251.29
## + FirstServePtW   1     113.5  6775.7 251.45
## + BPConvPct       1      29.6  6859.6 252.07
## + FirstServePct   1      24.3  6864.9 252.11
## - DomRat          1    3523.3 10412.5 268.94
## 
## Step:  AIC=247.81
## Rank ~ DomRat + SecondServePtW
## 
##                  Df Sum of Sq     RSS    AIC
## + TBWPct          1     278.4  6021.1 247.55
## <none>                         6299.5 247.81
## + BreakPct        1     244.0  6055.5 247.84
## + FirstServePtW   1     223.7  6075.8 248.00
## + AcePct          1     213.9  6085.7 248.08
## + FirstServePct   1      71.6  6227.9 249.24
## + InPlayRetPtW    1      70.6  6228.9 249.25
## + DFPct           1      65.3  6234.2 249.29
## + BPSavedPct      1      26.1  6273.5 249.60
## + BPConvPct       1       7.9  6291.6 249.75
## - SecondServePtW  1     589.7  6889.2 250.28
## - DomRat          1    3944.4 10243.9 270.12
## 
## Step:  AIC=247.55
## Rank ~ DomRat + SecondServePtW + TBWPct
## 
##                  Df Sum of Sq    RSS    AIC
## + BreakPct        1    347.53 5673.6 246.58
## + AcePct          1    299.78 5721.3 247.00
## + FirstServePtW   1    284.19 5736.9 247.13
## <none>                        6021.1 247.55
## - TBWPct          1    278.40 6299.5 247.81
## + InPlayRetPtW    1    118.53 5902.6 248.56
## + FirstServePct   1     76.09 5945.0 248.91
## + BPSavedPct      1     73.96 5947.1 248.93
## + DFPct           1     44.23 5976.9 249.18
## + BPConvPct       1     24.36 5996.8 249.35
## - SecondServePtW  1    506.47 6527.6 249.59
## - DomRat          1   2981.14 9002.3 265.66
## 
## Step:  AIC=246.58
## Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct
## 
##                  Df Sum of Sq    RSS    AIC
## + InPlayRetPtW    1    471.10 5202.5 244.24
## <none>                        5673.6 246.58
## - SecondServePtW  1    328.27 6001.9 247.39
## + DFPct           1    121.16 5552.4 247.50
## - BreakPct        1    347.53 6021.1 247.55
## + FirstServePct   1     98.11 5575.5 247.71
## + BPConvPct       1     96.46 5577.1 247.72
## - TBWPct          1    381.96 6055.5 247.84
## + BPSavedPct      1     18.12 5655.5 248.42
## + AcePct          1      5.94 5667.6 248.53
## + FirstServePtW   1      1.57 5672.0 248.56
## - DomRat          1   1398.98 7072.6 255.60
## 
## Step:  AIC=244.24
## Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct + InPlayRetPtW
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                        5202.5 244.24
## + AcePct          1    184.18 5018.3 244.44
## + BPConvPct       1    174.16 5028.3 244.54
## - SecondServePtW  1    294.10 5496.6 244.99
## + DFPct           1     72.97 5129.5 245.54
## + FirstServePct   1     20.52 5182.0 246.05
## + FirstServePtW   1     10.60 5191.9 246.14
## - TBWPct          1    423.91 5626.4 246.16
## + BPSavedPct      1      6.61 5195.9 246.18
## - InPlayRetPtW    1    471.10 5673.6 246.58
## - BreakPct        1    700.10 5902.6 248.56
## - DomRat          1   1225.35 6427.8 252.82

## 
## Call:
## lm(formula = Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct + 
##     InPlayRetPtW, data = Tennis)
## 
## Coefficients:
##    (Intercept)          DomRat  SecondServePtW          TBWPct  
##        -4.7762        -72.1582          1.0512         -0.2779  
##       BreakPct    InPlayRetPtW  
##        -2.6916          3.3847

# lm(formula = Rank ~ DomRat + SecondServePtW + TBWPct + BreakPct + InPlayRetPtW, data = Tennis)

### LASSO & RIDGE Experimental models ###
InitWin <- InitVars %>%
  select(-Rank)

InitRank <- InitVars %>%
  select(-WinningPct)

Xwin <- model.matrix(WinningPct ~ ., InitWin)[,-1]

Xrank <- model.matrix(Rank ~ ., InitRank)[,-1]

ywin <- as.numeric(InitVars$WinningPct)

yrank <- as.numeric(InitVars$Rank)


#RidgeWin
win.ridge.cv <- cv.glmnet(Xwin, ywin, alpha = 0, nfolds = 5)

plot(win.ridge.cv)

win.ridge.cv$lambda.min

## [1] 0.832764

win.ridge.cv$lambda.1se

## [1] 11.2677

#RidgeRank
rank.ridge.cv <- cv.glmnet(Xrank, yrank, alpha = 0, nfolds = 5)

plot(rank.ridge.cv)

rank.ridge.cv$lambda.min

## [1] 7.828625

rank.ridge.cv$lambda.1se

## [1] 41.77902

#LassoWin
win.lasso.cv <- cv.glmnet(Xwin, ywin, alpha = 1, nfolds = 5)

plot(win.lasso.cv)

win.lasso.cv$lambda.min

## [1] 0.002791094

win.lasso.cv$lambda.1se

## [1] 0.1524573

#LassoRank
rank.lasso.cv <- cv.glmnet(Xrank, yrank, alpha = 1, nfolds = 5)

plot(rank.lasso.cv)

rank.lasso.cv$lambda.min

## [1] 0.9878605

rank.lasso.cv$lambda.1se

## [1] 3.98801

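#All Models Together
# Columns: b1/b2 = win ridge, b3/b4 = win lasso, b5/b6 = rank ridge,
# b7/b8 = rank lasso (lambda.min first, then lambda.1se).
# Each coef() result has 12 rows: the intercept plus 11 predictors.
# b1 keeps all 12 as a one-column matrix. The [1:11] subsets keep only the
# intercept and the first 10 predictors, so TBWPct is dropped from b2-b8;
# cbind() then recycles each vector's first element (the intercept) into the
# TBWPct row and raises the warning shown below. Indexing with [1:12] would
# keep every coefficient.
b1 = as.matrix(coef(win.ridge.cv, s = "lambda.min"))
b2 = coef(win.ridge.cv, s = "lambda.1se")[1:11]
b5 = coef(rank.ridge.cv, s = "lambda.min")[1:11]
b6 = coef(rank.ridge.cv, s = "lambda.1se")[1:11]
b3 = coef(win.lasso.cv, s = "lambda.min")[1:11]
b4 = coef(win.lasso.cv, s = "lambda.1se")[1:11]
b7 = coef(rank.lasso.cv, s = "lambda.min")[1:11]
b8 = coef(rank.lasso.cv, s = "lambda.1se")[1:11]
cbind(b1, b2, b3, b4, b5, b6, b7, b8)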

## Warning in cbind(b1, b2, b3, b4, b5, b6, b7, b8): number of rows of result
## is not a multiple of vector length (arg 2)

##                            1           b2            b3            b4
## (Intercept)    -123.09984702 -58.82574031 -284.47333686 -163.00766491
## FirstServePct     0.31041003   0.15158970    0.83054451    0.44849633
## FirstServePtW     0.68203972   0.26511430    2.72652605    1.33757427
## BPConvPct         0.20841288   0.19847692    0.01961784    0.06000491
## BreakPct          0.81334818   0.29414387    2.55462592    1.53209182
## DFPct            -0.05557015   0.02248809   -0.06020046   -0.05695194
## BPSavedPct        0.37707320   0.22958076    0.47013328    0.38796553
## DomRat           24.64658451  23.31777876  -58.56106749    0.00000000
## AcePct            0.11834527   0.06411360   -0.19782175    0.00000000
## InPlayRetPtW      0.31970024   0.33987790   -0.04159828    0.00000000
## SecondServePtW    0.28560393   0.31218712    1.14632812    0.50520551
## TBWPct            0.16392231 -58.82574031 -284.47333686 -163.00766491
##                          b5           b6          b7           b8
## (Intercept)    158.55494885  95.39301088 129.5313051  71.93407986
## FirstServePct   -0.42758952  -0.13741233  -0.2704484   0.00000000
## FirstServePtW   -0.32662277  -0.13133281   0.0000000   0.00000000
## BPConvPct       -0.02502694  -0.09580483   0.0000000   0.00000000
## BreakPct        -0.52408582  -0.23917172  -0.5815116  -0.09613719
## DFPct           -1.52162128  -0.53073979  -1.6857479   0.00000000
## BPSavedPct      -0.07788130  -0.06831170   0.0000000   0.00000000
## DomRat         -34.05840673 -16.62792310 -53.5599163 -41.02435338
## AcePct           0.11072908   0.02429171   0.0000000   0.00000000
## InPlayRetPtW    -0.33576260  -0.26511393   0.0000000   0.00000000
## SecondServePtW  -0.00112217  -0.06381424   0.0000000   0.00000000
## TBWPct         158.55494885  95.39301088 129.5313051  71.93407986
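
The warning above arises because a 12-row coefficient matrix is bound to vectors of length 11, so R recycles each short vector and its first element (the intercept) reappears in the TBWPct row. The sketch below reproduces that behavior with made-up numbers and shows one way to keep every row aligned. The values and the names `full`, `short`, and `complete` are hypothetical stand-ins, not the fitted glmnet coefficients.

```r
# Stand-in for a 12-row coefficient column (intercept + 11 predictors)
terms <- c("(Intercept)", paste0("x", 1:11))
full  <- matrix(1:12, ncol = 1, dimnames = list(terms, "b1"))

# Truncating to [1:11] drops the last term, as in the listing above
short <- (101:112)[1:11]

# cbind() recycles the short vector: row 12 receives short[1]
recycled <- suppressWarnings(cbind(full, b2 = short))
recycled["x11", "b2"] == short[1]

# Keeping all 12 entries leaves every row aligned with its term
complete <- setNames(101:112, terms)
aligned  <- cbind(full, b2 = complete)
aligned["x11", "b2"]
```

With the glmnet fits, the same effect would come from indexing `[1:12]` or calling `as.matrix()` on every `coef()` result.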

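# Rank lasso (lambda.min), column b7 above: 129.53 - 0.27(FirstServePct) - 0.58(BreakPct) - 1.69(DFPct) - 53.56(DomRat); the TBWPct entry in that column repeats the intercept, so its coefficient is not shown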

###Outliers###
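# stdres() and studres() come from MASS: internally standardized and
# externally studentized residuals of final_model
Tennis <- Tennis %>%
  mutate(fitted = fitted(final_model),
         standardized = stdres(final_model),
         studentized = studres(final_model))

# Flag players whose studentized residual exceeds 2 in absolute value
table8 <- Tennis %>%
  select(Player, WinningPct, FirstServePct, FirstServePtW, studentized, RetPtWPct) %>%
  filter(abs(studentized) > 2)
kable(table8, caption = "Outliers")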

Outliers
Player	WinningPct	FirstServePct	FirstServePtW	studentized	RetPtWPct
Dusan Lajovic	45.5	69.4	68.0	-2.188589	36.6
Milos Raonic	64.3	62.5	84.2	-2.276267	34.7

###Plots for Outliers
outs2 <- Tennis %>%
  filter(studentized > 2 | studentized < -2)

ggplot(data = Tennis, aes(x = Rank, y = studentized)) +
  geom_point() +
  geom_hline(yintercept = c(-2, 2), color = "red") +
  ggrepel::geom_label_repel(aes(label = Player), data = outs2)
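
The studentized residuals used above are externally studentized: each residual is scaled by an error estimate computed with that observation left out. A minimal sketch of the calculation follows. It uses the built-in mtcars data and an arbitrary model as stand-ins (neither is part of this analysis) and base R's rstudent(), which should agree with MASS::studres() for a linear model.

```r
# Stand-in model on built-in data; not the report's final_model
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- resid(fit)          # raw residuals
h <- hatvalues(fit)      # leverages
n <- nobs(fit)
p <- length(coef(fit))   # coefficients including the intercept

# Error variance with observation i left out
s2   <- sum(e^2) / (n - p)
s2_i <- ((n - p) * s2 - e^2 / (1 - h)) / (n - p - 1)

# Externally studentized residuals computed by hand
t_manual <- e / sqrt(s2_i * (1 - h))

# Should match the built-in calculation
all.equal(t_manual, rstudent(fit))

# Observations beyond the +/- 2 lines drawn in the plot above
names(which(abs(t_manual) > 2))
```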

###Influential Points###
finalmodel_diag <- ls.diag(final_model)

Tennis <- Tennis %>%
  mutate(cooks = finalmodel_diag$cooks) 

table9 <- Tennis %>%
  select(Player, WinningPct, FirstServePct, FirstServePtW, cooks, RetPtWPct) %>%
  filter(cooks > 5/45) %>%
  arrange(desc(cooks))
kable(table9, caption = "Influential Points")

Influential Points
Player	WinningPct	FirstServePct	FirstServePtW	cooks	RetPtWPct
Milos Raonic	64.3	62.5	84.2	0.2040603	34.7
Reilly Opelka	58.7	64.0	79.3	0.1240316	29.1
Dusan Lajovic	45.5	69.4	68.0	0.1177649	36.6

###Plots for Influential Points###
outs1 <- Tennis %>%
  filter(cooks > 5/45)

ggplot(data = Tennis, aes(x = Rank, y = cooks)) + 
  geom_point() +
  geom_hline(yintercept = 5/45, color = "red", linetype = 2) +
  ggrepel::geom_label_repel(aes(label = Player), data = outs1)
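
Cook's distance, used above to flag influential points, combines the size of a residual with the leverage of its observation. The sketch below computes it by hand and compares the result with base R's cooks.distance(). The mtcars data, the model, and the 4/n cutoff are stand-ins chosen for illustration; the cutoff is a common rule of thumb and is not the 5/45 threshold used in this report.

```r
# Stand-in model on built-in data; not the report's final_model
fit <- lm(mpg ~ wt + hp, data = mtcars)

e  <- resid(fit)                   # raw residuals
h  <- hatvalues(fit)               # leverages
p  <- length(coef(fit))            # coefficients including the intercept
s2 <- sum(e^2) / df.residual(fit)  # residual variance estimate

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2

# Should match the built-in calculation
all.equal(cooks_manual, cooks.distance(fit))

# Hypothetical cutoff for illustration only
cutoff <- 4 / nobs(fit)
names(which(cooks_manual > cutoff))
```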