The goal of this research is to develop a model that could predict the future performance of NCAA baseball players based on their freshman year performance, defined through basic and advanced baseball statistics. I used publicly available data that was published by the NCAA and was collected for all Division I NCAA baseball players between the 2013 and 2021 seasons. I analyzed the full dataset and created appropriate subsets to build predictive models for sophomore, junior, and senior year performance based on a players freshman year statistics. I applied different statistical methods to create predictive models in attempt to achieve the highest possible accuracy. These methods were linear regression, logistics regression, and k-nearest neighbors for both regression and classification. I compared these models using appropriate assessment parameters selected a best model to make developmental predictions for the latest class of American Athletic Conference freshman baseball players. All model computations were done using R statistical software.
College athletics introduces a unique problem that coaches must address in order to build the best team possible - players are only eligible for 4 seasons. It is important to understand how players will develop as early as possible in their collegiate careers. First year players often do not get many opportunities to play in games, so their statistics can be difficult to assess by individuals. Instead, I will attempt to build predictive models with this same set of freshman year statistics and give greater clarity to how a players freshman year performance relates to their future performance in their sophomore, junior, and senior seasons. Making this information available to coaches may assist them in recruiting, shifting positions, selecting the team, creating lineups, and more strategic decision making.
The NCAA baseball data that I am using contains 39 different variables for each player from each season. Some of these are basic baseball statistics that have been used to measure performance for over 100 years. Other advanced statistics have become popular more recently. The following table defines these statistics from the MLB.com glossary and Fangraphs sabermetrics library.
I will be using wRC+ to measure performance and will use it as the response variable in each of the models. Fangraphs describes weighted runs created plus as a "rate statistic which attempts to credit a hitter for the value of each outcome (single, double, etc) rather than treating all hits or times on base equally, while also controlling for park effects and the current run environment. wRC+ is scaled so that league average is 100 each year and every point above or below 100 is equal to one percentage point better or worse than league average. This makes wRC+ a better representation of offensive value than batting average, RBI, OPS, or wOBA."
\[ wRC+ = \frac{(wRAA/PA + LgR/PA) + (LgR/PA - (\mbox{Park Factor} \times LgR/PA))}{(wRC/PA \mbox{ excluding pitchers})} \]
For classification modeling, I will group ranges of wRC+ into buckets based on estimates from Fangraphs. These buckets are Excellent, Great, Above Average, Average, Below Average, Poor and Awful.
The data that I will use for this analysis is publicly availble from the NCAA. It contains basic and advanced baseball statistics for each Division I NCAA baseball player between the 2013 and 2021 seasons. The full dataset, before any manipulation, includes 45938 observations with 39 variables. There are 313 different schools and 21567 individual players represented in the data. Initially, I will remove row number, year, school, and conference, reducing the number of variables to 36.
length(unique(full$school)) # 313 unique schools represented
## [1] 313
length(unique(full$player_id)) # 21567 unique players represented
## [1] 21567
full <- full[, -c(1:3)] # remove row number, year, school, conference
dim(full) # 45938 rows, 36 variables
## [1] 45938 36
There are not many missing values in the data. There are 258 cases where the player ID = 0, and 179 cases where the year is missing. Each represents less than 1% of the data, and these rows will simply be removed.
# Remove these
full <- full[!(full$player_id==0), ]
full <- full[!(full$Yr=="N/A"), ]
dim(full) # 45505 rows, 36 variables
## [1] 45505 36
Next, it is important to set minimum requirements to qualify a player for this analysis. Players who do not have enough freshman year playing time may give unreliable predictive results. I will set the minimum number of plate appearances to be 32, equal to the first quantile. This ensures that a majority of the data is still available to be used in the model creation and is not so large that it is difficult for freshman players to meet it. This reduces the number of total observations to 34338.
## [1] 34338 36
Some of the variables in this dataset are very highly correlated, especially as some are variations of others. The correlation matrix shows which are highly corelated and will allow me to select variables to drop. These are: wOBA, GS, AB, OPS, H, ISO, OBpct, and GP. This reduces the number of total variables to 28.
## Var1 Var2 Freq
## 627 wOBA wRCplus 0.9946347
## 296 PA AB 0.9943010
## 259 GS PA 0.9841904
## 295 GS AB 0.9810809
## 517 OPS wOBA 0.9808464
## 625 OPS wRCplus 0.9727984
## 444 SlgPct OPS 0.9612833
## 693 AB H 0.9607523
## 692 PA H 0.9555079
## 478 BA ISO -0.9478833
## 515 OBPct wOBA 0.9435811
## 623 OBPct wRCplus 0.9387046
## 222 GP GS 0.9298970
## 258 GP PA 0.9283577
## 691 GS H 0.9252221
## 294 GP AB 0.9240593
## 703 R H 0.9118883
# Drop selected variables
full <- full[, -c(6:7,9,11,13:15,20)]
dim(full) # 34338 rows, 28 variables
## [1] 34338 28
Next, I will create the individual datasets that will be used to predict sophomore, junior, and senior year performance separately. These subsets will include a players freshman year stats and corresponding year stats in a single observation. This will limit the analysis to subsets of players with 2 years of stats, 3 years of stats, and 4 years of stats respectively. To predict sophomore years performance, I will be able to use observations from players with 2-year, 3-year, 4-year, and 5-year careers. To predict junior years performance, players with 3-year, 4-year, and 5-year careers. To predict senior year performance, only those players with 4-year and 5-year careers.
There are 4738 players that will be used to predict sophomore years performance, 3784 players to predict junior year performance, and 2443 players to predict senior year performance.
nrow(stats_fr_so) # 4738 players to predict sophomore year stats
## [1] 4738
nrow(stats_fr_jr) # 3784 players to predict junior year stats
## [1] 3784
nrow(stats_fr_sr) # 2443 players to predict senior year stats
## [1] 2443
The final form of the datasets before modeling begins includes all selected freshman year stats, along the wRC+ value for the prediction year (so, jr, or sr).
I will use several variations of statistical models to create the best model for prediction of future wRC+. I will use a combination of regression models (predicting the value itself) and classification models (predicting a class that the value will fall into) to determine which model will be most reliable and produced useable information. Regression models
Regresssion Models
\[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \varepsilon \] \[ \mbox{where } y \mbox{ is the response variable (wRC+), } \\X_1, X_2, ..., X_p \mbox{ are the predictors, and } \\ \varepsilon \mbox{ is the random error.} \]
\[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon \]
\[ \mbox{where } y \mbox{ is the response variable (wRC+), } \\X_1, \mbox{ } X_2, \mbox{ } X_1X_2 \mbox{ are the predictors, and } \\ \varepsilon \mbox{ is the random error.} \]
\[ \mbox{Euclidean Distance} = \sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2} \]
\[ \mbox{where } k \mbox{ is the number of dimensions in the data, } \\x_i, \mbox{ is the observed predictor value, and } \\ y_i \mbox{ is the new data point to be predicted.} \]
Classification Models
\[ p = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + ... + \beta_p X_p)}} \] \[ \mbox{where } p \mbox{ is the probability that an observation belongs to a certain class.} \]
Regression models and classification models will be compared separately to determine the best models. The parameters which will determine the efficacy of the regression models are adjusted R^2 and root mean square error (RMSE). The paramteres which will determine the efficacy of the classification models are Akaike information criterion (AIC) and accuracy.
Regression Models
\[ R^2_{adj} = 1 - [\frac{(1-R^2)(n-1)}{n-k-1}] \] \[ \mbox{where } n \mbox{ is the number of points in the data sample and} \\ k \mbox{ is the number of predictors in the model} \]
\[ RMSE = \sqrt{\sum\limits_{i=1}^n \frac{(\hat{y_i} - y_i)^2}{n}} \]
\[ \mbox{where } \hat{y_i} \mbox{is the predicted value, } \\ y_i \mbox{ is the actual value, and } \\ n \mbox{ is the number of observations.} \] Classification Models
\[ AIC = 2K -2ln(L) \\\mbox{where } K \mbox{ is the number of predictors used in the model and } \\ L \mbox{ is the log-likelihood estimate.} \]
# Sophomore year
full_lm_so <- lm(wRCplus.so ~ ., data = wrc_fr_so)
summary(full_lm_so) # adj R^2 = 0.1377
##
## Call:
## lm(formula = wRCplus.so ~ ., data = wrc_fr_so)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.349 -23.228 0.972 24.551 159.310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.44335 9.00656 9.598 < 2e-16 ***
## PA.fr 0.06048 0.04881 1.239 0.21544
## BA.fr -0.36557 2.16270 -0.169 0.86578
## SlgPct.fr 51.45628 38.62222 1.332 0.18283
## BABIP.fr -103.55894 34.71738 -2.983 0.00287 **
## wRAA.fr 0.52093 0.25886 2.012 0.04423 *
## wRCplus.fr 0.13535 0.08387 1.614 0.10666
## R.fr 0.07105 0.16212 0.438 0.66120
## X2B.fr 0.11991 0.38441 0.312 0.75511
## X3B.fr 0.32391 0.95632 0.339 0.73485
## TB.fr 0.02640 0.02483 1.063 0.28782
## HR.fr 1.36835 0.72380 1.891 0.05875 .
## HRpct.fr -2.94547 1.81991 -1.618 0.10563
## RBI.fr 0.16932 0.14221 1.191 0.23387
## BB.fr 0.18052 0.23889 0.756 0.44988
## K.fr -0.24261 0.13558 -1.789 0.07362 .
## BBpct.fr 0.22320 0.39072 0.571 0.56786
## Kpct.fr 0.28900 0.22336 1.294 0.19578
## HBP.fr 0.35951 0.29674 1.212 0.22575
## SF.fr 0.82697 0.53224 1.554 0.12031
## SH.fr -1.08922 0.25286 -4.308 1.68e-05 ***
## SB.fr 0.52107 0.22661 2.299 0.02152 *
## CS.fr 0.23243 0.46973 0.495 0.62075
## SPD.fr -0.38131 0.60713 -0.628 0.52999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.36 on 4714 degrees of freedom
## Multiple R-squared: 0.1419, Adjusted R-squared: 0.1377
## F-statistic: 33.9 on 23 and 4714 DF, p-value: < 2.2e-16
To determine if the overall model is useful for predicting wRC+, we can perform an ANOVA test with the following null and alternative hypothesis:
\[ H_0: \beta_1 = \beta_2 = ... = \beta_{23} = 0 \\ H_1: \mbox{not } H_0 \]
The p-value for this test is <2.2e-16 (essentially 0). Therefore, we are able to reject the null hypothesis and conclude that the overall model is useful for predicting wRC+. The adjusted R2 value of 0.1377 is rather low however.
To determine which predictors are singificant for this model, we perform a t test with the following null and alternative hypothesis for each predictor separately:
\[ H_0: \beta_1 = 0 \\ H_1: \beta_1 \neq 0 \\ . \\ . \\ . \\ H_0: \beta_{23} = 0 \\ H_1: \beta_{23} \neq 0 \]
I will use an alpha of 0.10. In the model output above, predictors with a p-value < 0.10 are denoted with one, two, or three asterisks, or a period. For these predictors, we can reject the null hypothesis and conclude that it is a significant predictor of wRC+. The significant predictors in this model are BABIP.fr, wRAA.fr, HR.fr, K.fr, SH.fr, and SB.fr.
To achieve a parsimonious model (the simplest model with good predictive power), I will remove the variables that are not significant at an aplpha = 0.10 level.
reduced_model_so <- lm(wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + SH.fr + SB.fr, data = wrc_fr_so)
summary(reduced_model_so) # adj R^2 = 0.1183
##
## Call:
## lm(formula = wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr +
## SH.fr + SB.fr, data = wrc_fr_so)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176.222 -23.489 1.131 25.018 154.665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.12088 3.83598 28.447 < 2e-16 ***
## BABIP.fr -57.73230 12.42174 -4.648 3.45e-06 ***
## wRAA.fr 1.76403 0.14003 12.597 < 2e-16 ***
## HR.fr 1.20080 0.39412 3.047 0.00233 **
## K.fr 0.24458 0.06183 3.956 7.74e-05 ***
## SH.fr -0.38153 0.22527 -1.694 0.09040 .
## SB.fr 0.79551 0.16172 4.919 8.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.8 on 4731 degrees of freedom
## Multiple R-squared: 0.1195, Adjusted R-squared: 0.1183
## F-statistic: 107 on 6 and 4731 DF, p-value: < 2.2e-16
The reduced model is:
\[ \hat{y} = 109.12 + (-57.73)\mbox{BABIP.fr} + 1.76\mbox{wRAA.fr} + 1.20\mbox{HR.fr} + 0.24\mbox{K.fr} + (-0.38)\mbox{SH.fr} + 0.80\mbox{SB.fr} \] Since the reduced model is nested in the full model we can compare them using an anova test. Since the reduced model has 6 predictors and the full model has 17 additional predictors, the following are the null and alternative hypothesis:
\[ H_0: \beta_7 = \beta_8 = ... = \beta_23 = 0 \\ H_1: \mbox{not }H_0 \]
anova(reduced_model_so, full_lm_so)
## Analysis of Variance Table
##
## Model 1: wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + SH.fr + SB.fr
## Model 2: wRCplus.so ~ PA.fr + BA.fr + SlgPct.fr + BABIP.fr + wRAA.fr +
## wRCplus.fr + R.fr + X2B.fr + X3B.fr + TB.fr + HR.fr + HRpct.fr +
## RBI.fr + BB.fr + K.fr + BBpct.fr + Kpct.fr + HBP.fr + SF.fr +
## SH.fr + SB.fr + CS.fr + SPD.fr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4731 7494853
## 2 4714 7303490 17 191364 7.2656 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of <2.2e-16 allows us to reject the null hypothesis and conclude that the full model with its 17 additional predictors is a significantly more useful model for predicting wRC+.
After repeating this exercise for junior and senior year predictions, we have the following results:
The interaction models function the same as the first order linear regression models with the addition of interaction terms. Each predictor has been multiplied with the others, bringing the total number of predictors to 276. This will be cut down based on which predictors are found to be significant.
# Sophomore year
int_vars_so <- wrc_fr_so[, -24]
wrc_int_so <- cbind(int_vars_so^2, do.call(cbind, combn(colnames(int_vars_so), 2,
FUN=function(x) list(int_vars_so[x[1]]*int_vars_so[x[2]]))))
colnames(wrc_int_so)[-(seq_len(ncol(int_vars_so)))] <- combn(colnames(int_vars_so), 2,
FUN = paste, collapse=":")
wrc_int_so$wRCplus.so <- wrc_fr_so$wRCplus.so
ncol(wrc_int_so) # 276 total predictors with 1 response variable
## [1] 277
full_int_so <- lm(wRCplus.so ~ ., data = wrc_int_so)
summary(full_int_so) # adj R^2 = 0.1417
##
## Call:
## lm(formula = wRCplus.so ~ ., data = wrc_int_so)
##
## Residuals:
## Min 1Q Median 3Q Max
## -182.675 -22.736 0.799 23.862 140.112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.686e+01 1.682e+01 3.380 0.00073 ***
## PA.fr 2.154e-03 3.404e-03 0.633 0.52698
## BA.fr -3.289e+01 4.253e+01 -0.773 0.43938
## SlgPct.fr 4.694e+02 1.708e+03 0.275 0.78350
## BABIP.fr 2.307e+02 1.728e+03 0.133 0.89383
## wRAA.fr -2.663e-02 9.509e-02 -0.280 0.77942
## wRCplus.fr -3.997e-04 6.098e-03 -0.066 0.94775
## R.fr 4.082e-02 2.980e-02 1.370 0.17087
## X2B.fr 6.719e-02 2.473e-01 0.272 0.78589
## X3B.fr 8.318e-01 1.373e+00 0.606 0.54478
## TB.fr 6.304e-02 2.573e-02 2.450 0.01433 *
## HR.fr -1.505e-01 7.623e-01 -0.197 0.84355
## HRpct.fr -1.670e+00 3.356e+00 -0.498 0.61880
## RBI.fr -4.472e-02 2.189e-02 -2.042 0.04116 *
## BB.fr 2.799e-02 9.164e-02 0.305 0.76003
## K.fr 1.268e-02 2.563e-02 0.495 0.62093
## BBpct.fr 6.864e-02 1.194e-01 0.575 0.56552
## Kpct.fr 4.662e-02 6.317e-02 0.738 0.46061
## HBP.fr -3.371e-02 8.801e-02 -0.383 0.70172
## SF.fr 2.942e-01 3.165e-01 0.929 0.35272
## SH.fr 2.175e-02 6.861e-02 0.317 0.75124
## SB.fr -4.587e-02 4.233e-02 -1.084 0.27856
## CS.fr 7.048e-02 2.461e-01 0.286 0.77460
## SPD.fr 5.682e-01 5.032e-01 1.129 0.25890
## `PA.fr:BA.fr` 7.827e+00 1.516e+01 0.516 0.60569
## `PA.fr:SlgPct.fr` 6.487e+00 8.608e+00 0.754 0.45115
## `PA.fr:BABIP.fr` -1.998e+01 1.402e+01 -1.425 0.15416
## `PA.fr:wRAA.fr` 1.146e-02 3.019e-02 0.380 0.70418
## `PA.fr:wRCplus.fr` 1.099e-02 8.713e-03 1.262 0.20710
## `PA.fr:R.fr` -2.808e-02 1.532e-02 -1.833 0.06688 .
## `PA.fr:X2B.fr` 5.976e-03 4.610e-02 0.130 0.89686
## `PA.fr:X3B.fr` -1.833e-01 1.065e-01 -1.721 0.08524 .
## `PA.fr:TB.fr` -1.992e-02 9.325e-03 -2.136 0.03272 *
## `PA.fr:HR.fr` -2.464e-03 8.286e-02 -0.030 0.97628
## `PA.fr:HRpct.fr` 1.690e-01 4.072e-01 0.415 0.67812
## `PA.fr:RBI.fr` 1.015e-02 1.372e-02 0.740 0.45949
## `PA.fr:BB.fr` 8.929e-04 2.957e-02 0.030 0.97591
## `PA.fr:K.fr` -9.624e-03 1.442e-02 -0.667 0.50466
## `PA.fr:BBpct.fr` 6.435e-02 7.203e-02 0.893 0.37168
## `PA.fr:Kpct.fr` 2.650e-03 4.166e-02 0.064 0.94928
## `PA.fr:HBP.fr` -2.545e-02 2.909e-02 -0.875 0.38179
## `PA.fr:SF.fr` -2.956e-02 4.933e-02 -0.599 0.54907
## `PA.fr:SH.fr` 2.038e-02 2.499e-02 0.816 0.41479
## `PA.fr:SB.fr` 3.526e-02 2.434e-02 1.449 0.14751
## `PA.fr:CS.fr` 2.350e-02 4.158e-02 0.565 0.57205
## `PA.fr:SPD.fr` 5.394e-02 6.892e-02 0.783 0.43386
## `BA.fr:SlgPct.fr` 1.393e+03 3.239e+03 0.430 0.66709
## `BA.fr:BABIP.fr` -2.328e+03 3.068e+03 -0.759 0.44801
## `BA.fr:wRAA.fr` -1.221e+01 5.787e+01 -0.211 0.83296
## `BA.fr:wRCplus.fr` 1.825e+00 9.192e+00 0.199 0.84265
## `BA.fr:R.fr` 1.828e+01 2.655e+01 0.689 0.49113
## `BA.fr:X2B.fr` -1.597e+01 7.216e+01 -0.221 0.82491
## `BA.fr:X3B.fr` -1.016e+02 1.641e+02 -0.619 0.53568
## `BA.fr:TB.fr` -9.226e+00 5.752e+00 -1.604 0.10876
## `BA.fr:HR.fr` -2.955e+01 1.880e+02 -0.157 0.87511
## `BA.fr:HRpct.fr` -2.115e+02 1.686e+02 -1.254 0.20992
## `BA.fr:RBI.fr` 4.928e+01 2.558e+01 1.926 0.05414 .
## `BA.fr:BB.fr` -4.137e+01 5.917e+01 -0.699 0.48443
## `BA.fr:K.fr` -1.317e+01 3.198e+01 -0.412 0.68051
## `BA.fr:BBpct.fr` 1.714e+01 3.541e+01 0.484 0.62847
## `BA.fr:Kpct.fr` 1.290e+01 2.354e+01 0.548 0.58352
## `BA.fr:HBP.fr` 1.434e+01 5.206e+01 0.275 0.78299
## `BA.fr:SF.fr` -3.032e+01 9.926e+01 -0.305 0.76005
## `BA.fr:SH.fr` -4.938e+01 4.750e+01 -1.040 0.29856
## `BA.fr:SB.fr` -3.938e+01 4.778e+01 -0.824 0.40989
## `BA.fr:CS.fr` -3.356e+01 7.091e+01 -0.473 0.63601
## `BA.fr:SPD.fr` 4.801e+01 7.089e+01 0.677 0.49830
## `SlgPct.fr:BABIP.fr` 1.439e+03 2.480e+03 0.580 0.56168
## `SlgPct.fr:wRAA.fr` 4.132e+01 4.017e+01 1.029 0.30367
## `SlgPct.fr:wRCplus.fr` -7.586e+00 5.625e+00 -1.349 0.17749
## `SlgPct.fr:R.fr` -1.990e+01 1.756e+01 -1.133 0.25732
## `SlgPct.fr:X2B.fr` -2.343e+01 4.642e+01 -0.505 0.61369
## `SlgPct.fr:X3B.fr` 3.762e+01 1.098e+02 0.343 0.73195
## `SlgPct.fr:TB.fr` 1.363e+00 3.562e+00 0.383 0.70201
## `SlgPct.fr:HR.fr` 7.013e+00 1.119e+02 0.063 0.95002
## `SlgPct.fr:HRpct.fr` 5.551e+01 1.434e+02 0.387 0.69863
## `SlgPct.fr:RBI.fr` -2.412e+01 1.638e+01 -1.473 0.14083
## `SlgPct.fr:BB.fr` 3.022e+00 3.707e+01 0.082 0.93503
## `SlgPct.fr:K.fr` 8.654e+00 2.406e+01 0.360 0.71913
## `SlgPct.fr:BBpct.fr` 7.293e+00 2.191e+01 0.333 0.73925
## `SlgPct.fr:Kpct.fr` -1.195e+01 1.557e+01 -0.767 0.44298
## `SlgPct.fr:HBP.fr` -2.305e+01 3.552e+01 -0.649 0.51650
## `SlgPct.fr:SF.fr` -1.139e+01 6.018e+01 -0.189 0.84994
## `SlgPct.fr:SH.fr` -1.936e+01 3.133e+01 -0.618 0.53661
## `SlgPct.fr:SB.fr` 4.553e+00 3.136e+01 0.145 0.88456
## `SlgPct.fr:CS.fr` 4.649e+01 5.128e+01 0.907 0.36471
## `SlgPct.fr:SPD.fr` -1.610e+01 4.701e+01 -0.342 0.73205
## `BABIP.fr:wRAA.fr` -5.410e+01 3.377e+01 -1.602 0.10927
## `BABIP.fr:wRCplus.fr` 2.594e+00 7.139e+00 0.363 0.71633
## `BABIP.fr:R.fr` 1.867e+01 1.903e+01 0.981 0.32661
## `BABIP.fr:X2B.fr` -3.948e-01 5.507e+01 -0.007 0.99428
## `BABIP.fr:X3B.fr` 1.101e+02 1.167e+02 0.944 0.34536
## `BABIP.fr:TB.fr` 4.400e+00 4.142e+00 1.062 0.28813
## `BABIP.fr:HR.fr` -7.403e+01 9.397e+01 -0.788 0.43085
## `BABIP.fr:HRpct.fr` 4.038e+01 1.181e+02 0.342 0.73234
## `BABIP.fr:RBI.fr` -1.184e+01 1.765e+01 -0.671 0.50245
## `BABIP.fr:BB.fr` 3.303e+01 3.486e+01 0.947 0.34346
## `BABIP.fr:K.fr` 2.704e+01 1.884e+01 1.435 0.15128
## `BABIP.fr:BBpct.fr` -1.512e+01 2.635e+01 -0.574 0.56618
## `BABIP.fr:Kpct.fr` -7.836e+00 2.058e+01 -0.381 0.70334
## `BABIP.fr:HBP.fr` 8.971e+00 3.605e+01 0.249 0.80349
## `BABIP.fr:SF.fr` 5.264e+00 7.048e+01 0.075 0.94047
## `BABIP.fr:SH.fr` 7.047e+01 3.737e+01 1.886 0.05939 .
## `BABIP.fr:SB.fr` 2.635e+01 3.276e+01 0.804 0.42130
## `BABIP.fr:CS.fr` -2.343e-01 5.406e+01 -0.004 0.99654
## `BABIP.fr:SPD.fr` -7.876e+01 5.338e+01 -1.475 0.14019
## `wRAA.fr:wRCplus.fr` 7.235e-02 4.539e-02 1.594 0.11100
## `wRAA.fr:R.fr` -6.061e-02 8.102e-02 -0.748 0.45443
## `wRAA.fr:X2B.fr` -1.160e-01 2.598e-01 -0.447 0.65517
## `wRAA.fr:X3B.fr` -7.748e-01 5.905e-01 -1.312 0.18952
## `wRAA.fr:TB.fr` -8.213e-02 3.083e-02 -2.664 0.00776 **
## `wRAA.fr:HR.fr` 5.811e-01 4.359e-01 1.333 0.18253
## `wRAA.fr:HRpct.fr` -5.047e-01 1.810e+00 -0.279 0.78044
## `wRAA.fr:RBI.fr` 5.709e-02 6.813e-02 0.838 0.40217
## `wRAA.fr:BB.fr` 1.040e-02 1.550e-01 0.067 0.94651
## `wRAA.fr:K.fr` -2.624e-02 7.961e-02 -0.330 0.74170
## `wRAA.fr:BBpct.fr` 1.499e-01 3.377e-01 0.444 0.65711
## `wRAA.fr:Kpct.fr` 8.256e-03 1.920e-01 0.043 0.96571
## `wRAA.fr:HBP.fr` -2.113e-01 1.379e-01 -1.533 0.12540
## `wRAA.fr:SF.fr` -3.137e-01 2.435e-01 -1.289 0.19763
## `wRAA.fr:SH.fr` 3.646e-02 1.241e-01 0.294 0.76890
## `wRAA.fr:SB.fr` 1.678e-01 1.199e-01 1.399 0.16179
## `wRAA.fr:CS.fr` -1.422e-02 2.305e-01 -0.062 0.95081
## `wRAA.fr:SPD.fr` -5.413e-02 3.600e-01 -0.150 0.88051
## `wRCplus.fr:R.fr` -2.911e-02 2.667e-02 -1.091 0.27513
## `wRCplus.fr:X2B.fr` 3.799e-02 7.788e-02 0.488 0.62568
## `wRCplus.fr:X3B.fr` -8.684e-02 1.766e-01 -0.492 0.62286
## `wRCplus.fr:TB.fr` 9.638e-03 5.352e-03 1.801 0.07175 .
## `wRCplus.fr:HR.fr` -2.209e-01 1.385e-01 -1.595 0.11076
## `wRCplus.fr:HRpct.fr` 2.654e-01 2.354e-01 1.127 0.25976
## `wRCplus.fr:RBI.fr` -9.775e-03 2.534e-02 -0.386 0.69972
## `wRCplus.fr:BB.fr` -3.484e-02 4.450e-02 -0.783 0.43377
## `wRCplus.fr:K.fr` 1.267e-02 2.436e-02 0.520 0.60310
## `wRCplus.fr:BBpct.fr` -6.792e-03 4.643e-02 -0.146 0.88369
## `wRCplus.fr:Kpct.fr` 1.118e-02 3.553e-02 0.315 0.75308
## `wRCplus.fr:HBP.fr` 5.836e-02 4.746e-02 1.230 0.21888
## `wRCplus.fr:SF.fr` 4.092e-02 8.920e-02 0.459 0.64644
## `wRCplus.fr:SH.fr` -2.383e-02 4.395e-02 -0.542 0.58772
## `wRCplus.fr:SB.fr` -1.792e-02 4.459e-02 -0.402 0.68774
## `wRCplus.fr:CS.fr` -2.895e-02 7.823e-02 -0.370 0.71131
## `wRCplus.fr:SPD.fr` 1.344e-01 8.143e-02 1.651 0.09879 .
## `R.fr:X2B.fr` 1.711e-01 1.301e-01 1.315 0.18843
## `R.fr:X3B.fr` 6.894e-01 3.157e-01 2.184 0.02903 *
## `R.fr:TB.fr` -9.441e-03 6.041e-03 -1.563 0.11816
## `R.fr:HR.fr` -1.137e-01 2.323e-01 -0.489 0.62467
## `R.fr:HRpct.fr` 1.309e+00 7.769e-01 1.686 0.09196 .
## `R.fr:RBI.fr` -4.219e-02 3.710e-02 -1.137 0.25542
## `R.fr:BB.fr` 2.399e-03 7.453e-02 0.032 0.97433
## `R.fr:K.fr` 5.713e-02 3.994e-02 1.430 0.15267
## `R.fr:BBpct.fr` 1.926e-01 1.279e-01 1.506 0.13209
## `R.fr:Kpct.fr` -1.837e-01 9.929e-02 -1.850 0.06437 .
## `R.fr:HBP.fr` 1.116e-01 7.596e-02 1.469 0.14197
## `R.fr:SF.fr` 1.764e-01 1.456e-01 1.212 0.22562
## `R.fr:SH.fr` -5.938e-03 6.777e-02 -0.088 0.93018
## `R.fr:SB.fr` -1.414e-01 6.399e-02 -2.209 0.02723 *
## `R.fr:CS.fr` 1.526e-01 1.260e-01 1.212 0.22574
## `R.fr:SPD.fr` -8.717e-02 2.448e-01 -0.356 0.72183
## `X2B.fr:X3B.fr` -3.462e-01 9.562e-01 -0.362 0.71733
## `X2B.fr:TB.fr` -4.792e-02 2.825e-02 -1.696 0.08992 .
## `X2B.fr:HR.fr` -1.630e-01 6.530e-01 -0.250 0.80285
## `X2B.fr:HRpct.fr` -5.711e-01 2.369e+00 -0.241 0.80948
## `X2B.fr:RBI.fr` 1.304e-01 1.135e-01 1.149 0.25060
## `X2B.fr:BB.fr` -2.553e-01 2.347e-01 -1.088 0.27661
## `X2B.fr:K.fr` -6.055e-02 1.188e-01 -0.510 0.61025
## `X2B.fr:BBpct.fr` 1.464e-01 4.988e-01 0.294 0.76906
## `X2B.fr:Kpct.fr` 1.030e-01 3.562e-01 0.289 0.77240
## `X2B.fr:HBP.fr` 8.466e-02 2.466e-01 0.343 0.73135
## `X2B.fr:SF.fr` 4.091e-01 4.284e-01 0.955 0.33961
## `X2B.fr:SH.fr` 8.584e-02 2.317e-01 0.370 0.71104
## `X2B.fr:SB.fr` -2.440e-01 2.229e-01 -1.095 0.27378
## `X2B.fr:CS.fr` 3.230e-03 3.912e-01 0.008 0.99341
## `X2B.fr:SPD.fr` -6.803e-02 5.915e-01 -0.115 0.90844
## `X3B.fr:TB.fr` -1.531e-01 6.003e-02 -2.550 0.01081 *
## `X3B.fr:HR.fr` -1.570e+00 1.545e+00 -1.016 0.30985
## `X3B.fr:HRpct.fr` 7.831e-01 5.297e+00 0.148 0.88249
## `X3B.fr:RBI.fr` 6.172e-01 2.759e-01 2.237 0.02533 *
## `X3B.fr:BB.fr` 2.201e-01 5.386e-01 0.409 0.68277
## `X3B.fr:K.fr` 2.106e-01 2.635e-01 0.800 0.42401
## `X3B.fr:BBpct.fr` -4.353e-02 1.104e+00 -0.039 0.96856
## `X3B.fr:Kpct.fr` -9.950e-01 7.863e-01 -1.265 0.20579
## `X3B.fr:HBP.fr` 9.188e-01 5.393e-01 1.704 0.08851 .
## `X3B.fr:SF.fr` -3.222e-01 1.016e+00 -0.317 0.75114
## `X3B.fr:SH.fr` -8.008e-02 5.049e-01 -0.159 0.87400
## `X3B.fr:SB.fr` -4.175e-01 4.461e-01 -0.936 0.34943
## `X3B.fr:CS.fr` 2.598e-01 8.047e-01 0.323 0.74684
## `X3B.fr:SPD.fr` -2.004e+00 1.438e+00 -1.393 0.16355
## `TB.fr:HR.fr` -1.302e-01 6.404e-02 -2.034 0.04204 *
## `TB.fr:HRpct.fr` -2.051e-02 1.447e-01 -0.142 0.88726
## `TB.fr:RBI.fr` 1.754e-04 5.042e-03 0.035 0.97225
## `TB.fr:BB.fr` 5.670e-02 2.442e-02 2.322 0.02030 *
## `TB.fr:K.fr` -6.856e-03 7.257e-03 -0.945 0.34485
## `TB.fr:BBpct.fr` -2.695e-02 2.842e-02 -0.948 0.34309
## `TB.fr:Kpct.fr` -1.287e-02 2.127e-02 -0.605 0.54514
## `TB.fr:HBP.fr` 3.711e-02 2.227e-02 1.667 0.09565 .
## `TB.fr:SF.fr` 8.028e-03 1.997e-02 0.402 0.68777
## `TB.fr:SH.fr` 1.390e-03 1.495e-02 0.093 0.92592
## `TB.fr:SB.fr` 1.348e-03 9.342e-03 0.144 0.88526
## `TB.fr:CS.fr` 1.999e-02 1.819e-02 1.099 0.27179
## `TB.fr:SPD.fr` 3.306e-02 3.421e-02 0.967 0.33379
## `HR.fr:HRpct.fr` -2.288e+00 4.002e+00 -0.572 0.56748
## `HR.fr:RBI.fr` 5.199e-03 1.910e-01 0.027 0.97828
## `HR.fr:BB.fr` -2.659e-01 4.144e-01 -0.642 0.52116
## `HR.fr:K.fr` 2.233e-01 2.073e-01 1.077 0.28133
## `HR.fr:BBpct.fr` -4.690e+00 3.636e+00 -1.290 0.19719
## `HR.fr:Kpct.fr` 2.983e+00 1.992e+00 1.498 0.13424
## `HR.fr:HBP.fr` 6.596e-01 4.228e-01 1.560 0.11884
## `HR.fr:SF.fr` 5.257e-02 7.183e-01 0.073 0.94166
## `HR.fr:SH.fr` 5.510e-01 5.003e-01 1.101 0.27088
## `HR.fr:SB.fr` -2.374e-01 4.057e-01 -0.585 0.55848
## `HR.fr:CS.fr` 2.032e-01 8.116e-01 0.250 0.80228
## `HR.fr:SPD.fr` 5.684e-01 9.436e-01 0.602 0.54698
## `HRpct.fr:RBI.fr` 9.356e-01 6.605e-01 1.417 0.15666
## `HRpct.fr:BB.fr` 4.253e+00 3.825e+00 1.112 0.26631
## `HRpct.fr:K.fr` -2.911e+00 1.924e+00 -1.513 0.13037
## `HRpct.fr:BBpct.fr` -1.876e-01 9.710e-01 -0.193 0.84679
## `HRpct.fr:Kpct.fr` -6.931e-02 7.201e-01 -0.096 0.92333
## `HRpct.fr:HBP.fr` -1.151e+00 1.494e+00 -0.770 0.44134
## `HRpct.fr:SF.fr` -3.190e-01 2.518e+00 -0.127 0.89920
## `HRpct.fr:SH.fr` 4.826e-01 1.513e+00 0.319 0.74981
## `HRpct.fr:SB.fr` 3.243e-01 1.443e+00 0.225 0.82216
## `HRpct.fr:CS.fr` -1.149e+00 2.443e+00 -0.470 0.63819
## `HRpct.fr:SPD.fr` -1.734e+00 2.036e+00 -0.852 0.39446
## `RBI.fr:BB.fr` 2.903e-02 6.857e-02 0.423 0.67206
## `RBI.fr:K.fr` -2.002e-02 3.395e-02 -0.590 0.55548
## `RBI.fr:BBpct.fr` -1.002e-01 1.172e-01 -0.855 0.39266
## `RBI.fr:Kpct.fr` 6.998e-02 8.913e-02 0.785 0.43243
## `RBI.fr:HBP.fr` -9.240e-02 6.671e-02 -1.385 0.16611
## `RBI.fr:SF.fr` -1.314e-01 1.171e-01 -1.122 0.26199
## `RBI.fr:SH.fr` -4.668e-02 6.148e-02 -0.759 0.44772
## `RBI.fr:SB.fr` 4.840e-02 6.094e-02 0.794 0.42714
## `RBI.fr:CS.fr` -2.667e-01 1.098e-01 -2.428 0.01522 *
## `RBI.fr:SPD.fr` -2.287e-02 1.940e-01 -0.118 0.90615
## `BB.fr:K.fr` 1.186e-02 7.353e-02 0.161 0.87188
## `BB.fr:BBpct.fr` -2.157e-01 2.419e-01 -0.892 0.37260
## `BB.fr:Kpct.fr` -1.508e+00 7.148e-01 -2.109 0.03498 *
## `BB.fr:HBP.fr` 7.339e-02 1.512e-01 0.485 0.62751
## `BB.fr:SF.fr` -9.890e-02 2.600e-01 -0.380 0.70365
## `BB.fr:SH.fr` -4.050e-02 1.308e-01 -0.310 0.75690
## `BB.fr:SB.fr` -9.939e-02 1.218e-01 -0.816 0.41454
## `BB.fr:CS.fr` -9.369e-02 2.396e-01 -0.391 0.69583
## `BB.fr:SPD.fr` -1.798e-01 3.222e-01 -0.558 0.57695
## `K.fr:BBpct.fr` 1.445e+00 7.027e-01 2.057 0.03979 *
## `K.fr:Kpct.fr` -1.008e-01 8.381e-02 -1.203 0.22922
## `K.fr:HBP.fr` -3.593e-02 7.494e-02 -0.479 0.63167
## `K.fr:SF.fr` 7.433e-02 1.347e-01 0.552 0.58110
## `K.fr:SH.fr` -8.374e-02 7.212e-02 -1.161 0.24569
## `K.fr:SB.fr` 3.629e-03 6.454e-02 0.056 0.95516
## `K.fr:CS.fr` -9.596e-02 1.157e-01 -0.829 0.40708
## `K.fr:SPD.fr` 1.588e-02 1.611e-01 0.099 0.92149
## `BBpct.fr:Kpct.fr` 4.481e-02 1.532e-01 0.293 0.76989
## `BBpct.fr:HBP.fr` -2.804e-01 2.748e-01 -1.021 0.30751
## `BBpct.fr:SF.fr` 4.162e-01 4.364e-01 0.954 0.34027
## `BBpct.fr:SH.fr` -5.383e-02 2.225e-01 -0.242 0.80885
## `BBpct.fr:SB.fr` 1.816e-01 2.199e-01 0.826 0.40899
## `BBpct.fr:CS.fr` 1.698e-01 3.818e-01 0.445 0.65646
## `BBpct.fr:SPD.fr` -4.622e-01 3.565e-01 -1.297 0.19485
## `Kpct.fr:HBP.fr` 1.507e-01 1.653e-01 0.912 0.36197
## `Kpct.fr:SF.fr` -5.876e-02 3.322e-01 -0.177 0.85961
## `Kpct.fr:SH.fr` -1.709e-01 1.668e-01 -1.024 0.30577
## `Kpct.fr:SB.fr` -9.829e-02 1.672e-01 -0.588 0.55656
## `Kpct.fr:CS.fr` 1.530e-01 2.743e-01 0.558 0.57702
## `Kpct.fr:SPD.fr` 3.632e-01 2.677e-01 1.357 0.17486
## `HBP.fr:SF.fr` 2.713e-01 2.468e-01 1.099 0.27182
## `HBP.fr:SH.fr` -2.633e-02 1.143e-01 -0.230 0.81781
## `HBP.fr:SB.fr` 2.223e-02 1.139e-01 0.195 0.84523
## `HBP.fr:CS.fr` -9.380e-02 2.077e-01 -0.452 0.65158
## `HBP.fr:SPD.fr` -5.015e-01 3.888e-01 -1.290 0.19711
## `SF.fr:SH.fr` 2.599e-01 2.261e-01 1.149 0.25045
## `SF.fr:SB.fr` 2.036e-01 2.220e-01 0.917 0.35920
## `SF.fr:CS.fr` 2.895e-02 4.130e-01 0.070 0.94412
## `SF.fr:SPD.fr` -3.986e-01 7.110e-01 -0.561 0.57514
## `SH.fr:SB.fr` -2.468e-02 1.038e-01 -0.238 0.81204
## `SH.fr:CS.fr` -3.012e-02 1.857e-01 -0.162 0.87120
## `SH.fr:SPD.fr` 2.285e-01 3.530e-01 0.647 0.51743
## `SB.fr:CS.fr` 3.364e-01 1.831e-01 1.837 0.06628 .
## `SB.fr:SPD.fr` 3.514e-01 3.291e-01 1.068 0.28570
## `CS.fr:SPD.fr` -1.767e+00 6.556e-01 -2.696 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.27 on 4461 degrees of freedom
## Multiple R-squared: 0.1917, Adjusted R-squared: 0.1417
## F-statistic: 3.833 on 276 and 4461 DF, p-value: < 2.2e-16
full_int_so_RMSE <- round(sqrt(mean(full_int_so$residuals^2)),2) # RMSE = 38.11
model_stats_so[3,1:5] <- c("full_int_so", "Multiple linear regression",
"Linear model with all interaction terms", 0.1417, full_int_so_RMSE)
anova_so_int <- anova(full_lm_so, full_int_so) # interaction model not significantly more useful model
The significant predictors from the full interaction model are PA.fr, wRAA.fr, R.fr, X3B.fr, SB.fr, HR.fr, BB.fr, CS.fr, Kpct.fr, K.fr, BBpct.fr, SPD.fr, TB.fr, RBI.fr, PA.fr:TB.fr, wRAA.fr:TB.fr, R.fr:X3B.fr, R.fr:SB.fr, X3B.fr:TB.fr, 3B.fr:RBI.fr, TB.fr:HR.fr, TB.fr:BB.fr, RBI.fr:CS.fr, BB.fr:Kpct.fr, K.fr:BBpct.fr, and CS.fr:SPD.fr.
Again we have nested models, allowing us to compare the full first order model and the full interaction model. In this case, the p-value is large (0.1746) and we cannot reject the null hypothesis, concluding that the larger interaction model is not more significantly useful than the more simple model.
anova_so_int$`Pr(>F)`
## [1] NA 0.1746037
reduced_int_so <- lm(wRCplus.so ~ PA.fr + wRAA.fr + R.fr + X3B.fr + SB.fr + HR.fr + BB.fr + CS.fr + Kpct.fr +
K.fr + BBpct.fr + SPD.fr + TB.fr + RBI.fr + PA.fr:TB.fr + wRAA.fr:TB.fr +
R.fr:X3B.fr + R.fr:SB.fr + X3B.fr:TB.fr + X3B.fr:RBI.fr + TB.fr:HR.fr + TB.fr:BB.fr +
RBI.fr:CS.fr + BB.fr:Kpct.fr + K.fr:BBpct.fr + CS.fr:SPD.fr, data = wrc_int_so)
summary(reduced_int_so) # adj R^2 = 0.1133
##
## Call:
## lm(formula = wRCplus.so ~ PA.fr + wRAA.fr + R.fr + X3B.fr + SB.fr +
## HR.fr + BB.fr + CS.fr + Kpct.fr + K.fr + BBpct.fr + SPD.fr +
## TB.fr + RBI.fr + PA.fr:TB.fr + wRAA.fr:TB.fr + R.fr:X3B.fr +
## R.fr:SB.fr + X3B.fr:TB.fr + X3B.fr:RBI.fr + TB.fr:HR.fr +
## TB.fr:BB.fr + RBI.fr:CS.fr + BB.fr:Kpct.fr + K.fr:BBpct.fr +
## CS.fr:SPD.fr, data = wrc_int_so)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.537 -23.701 0.832 25.260 163.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.915e+01 2.056e+00 43.358 < 2e-16 ***
## PA.fr 6.235e-05 9.284e-05 0.672 0.50183
## wRAA.fr -3.373e-02 8.317e-03 -4.056 5.07e-05 ***
## R.fr 7.572e-03 2.842e-03 2.665 0.00773 **
## X3B.fr 5.039e-01 2.689e-01 1.874 0.06106 .
## SB.fr 4.099e-02 1.753e-02 2.338 0.01944 *
## HR.fr 2.323e-01 5.384e-02 4.314 1.64e-05 ***
## BB.fr 6.000e-03 4.929e-03 1.217 0.22360
## CS.fr 1.608e-01 1.591e-01 1.011 0.31218
## Kpct.fr -6.079e-03 2.190e-03 -2.776 0.00553 **
## K.fr -4.083e-03 1.627e-03 -2.510 0.01210 *
## BBpct.fr 2.517e-02 8.118e-03 3.100 0.00195 **
## SPD.fr 1.308e-02 5.103e-02 0.256 0.79771
## TB.fr 2.268e-03 7.632e-04 2.971 0.00298 **
## RBI.fr 8.517e-03 2.635e-03 3.232 0.00124 **
## PA.fr:TB.fr -2.284e-08 1.453e-08 -1.572 0.11596
## wRAA.fr:TB.fr 3.073e-06 1.098e-06 2.800 0.00514 **
## R.fr:X3B.fr 1.134e-05 1.906e-04 0.059 0.95258
## R.fr:SB.fr -1.797e-05 7.510e-06 -2.393 0.01677 *
## X3B.fr:TB.fr -4.157e-05 3.172e-05 -1.311 0.18999
## X3B.fr:RBI.fr -2.517e-05 2.620e-04 -0.096 0.92347
## HR.fr:TB.fr -1.336e-05 4.165e-06 -3.208 0.00134 **
## BB.fr:TB.fr -3.635e-07 6.511e-07 -0.558 0.57673
## CS.fr:RBI.fr 7.810e-05 1.305e-04 0.599 0.54947
## BB.fr:Kpct.fr -1.511e-03 6.974e-04 -2.167 0.03029 *
## K.fr:BBpct.fr 1.526e-03 6.987e-04 2.184 0.02899 *
## CS.fr:SPD.fr -4.567e-03 4.002e-03 -1.141 0.25384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.92 on 4711 degrees of freedom
## Multiple R-squared: 0.1182, Adjusted R-squared: 0.1133
## F-statistic: 24.28 on 26 and 4711 DF, p-value: < 2.2e-16
The equation for the reduced interaction model is:
\[ \hat{y} = 8.92\times10^1 + 6.24\times10^{-5}\mbox{PA.fr} + (-3.37\times10^{-2})\mbox{wRAA.fr} + ... +(-4.567\times10^-3)\mbox{CS.fr:SPD.fr} \]
Repeating this analysis for junior and senior year predictions, we have the following results:
k-nearest neighbors models require for the predictors to be normalized so that Euclidean distance is not skewed, which I do with the following code:
# Sophomore year
fr_so_train_norm <- fr_so_train
fr_so_valid_norm <- fr_so_valid
wrc_fr_so_norm <- wrc_fr_so
norm_values <- preProcess(fr_so_train[, 1:23], method = c("center", "scale"))
fr_so_train_norm[, 1:23] <- predict(norm_values, fr_so_train[, 1:23])
fr_so_valid_norm[, 1:23] <- predict(norm_values, fr_so_valid[, 1:23])
wrc_fr_so_norm[, 1:23] <- predict(norm_values, wrc_fr_so[, 1:23])
Then, I run the knn model with k values between 10 and 200. The best k-value is the one that corresponds with the lowest RMSE. For the sophomore year prediction model, k=180 produces the best results.
knitr::kable(knn_results_so, escape = FALSE, booktabs = TRUE) # k = 180, valid rmse = 38.88
| k | Train RMSE | Valid RMSE | Fit? |
|---|---|---|---|
| 10 | 40.08 | 40.08 | Over |
| 20 | 39.61 | 39.61 | Over |
| 30 | 39.28 | 39.28 | Over |
| 40 | 39.09 | 39.09 | Over |
| 50 | 39.02 | 39.02 | Over |
| 60 | 39.08 | 39.08 | Over |
| 70 | 38.92 | 38.92 | Over |
| 80 | 38.98 | 38.98 | Over |
| 90 | 38.97 | 38.97 | Over |
| 100 | 38.99 | 38.99 | Over |
| 110 | 39.00 | 39.00 | Over |
| 120 | 38.97 | 38.97 | Over |
| 130 | 38.92 | 38.92 | Over |
| 140 | 38.95 | 38.95 | Over |
| 150 | 38.93 | 38.93 | Over |
| 160 | 38.94 | 38.94 | Over |
| 170 | 38.91 | 38.91 | Over |
| 180 | 38.88 | 38.88 | Best |
| 190 | 38.89 | 38.89 | Under |
| 200 | 38.96 | 38.96 | Under |
Finally, we have the results of all regression models for all prediction years.
First, I will train a logistic regression model for all seven wRC+ classifications. This requires the addition of these classifications to the dataset, which can be done with:
# Sophomore year
# All levels
wrc_fr_so_glm <- wrc_fr_so %>% mutate(wRCplusrating.so =
case_when(wRCplus.so <= 60 ~ "Awful",
wRCplus.so > 60 & wRCplus.so <= 75 ~ "Poor",
wRCplus.so > 75 & wRCplus.so <= 90 ~ "Below Average",
wRCplus.so > 90 & wRCplus.so <= 110 ~ "Average",
wRCplus.so > 110 & wRCplus.so <= 130 ~ "Above Average",
wRCplus.so > 130 & wRCplus.so <= 150 ~ "Great",
wRCplus.so > 150 ~ "Excellent")
)
multinom_so <- multinom(wRCplusrating.so ~ .-wRCplus.so, data = wrc_fr_so_glm)
## # weights: 175 (144 variable)
## initial value 9219.722286
## iter 10 value 8868.089780
## iter 20 value 8807.358678
## iter 30 value 8782.130396
## iter 40 value 8768.806553
## iter 50 value 8764.351329
## iter 60 value 8751.007971
## iter 70 value 8730.458130
## iter 80 value 8682.358291
## iter 90 value 8655.799304
## iter 100 value 8628.337570
## final value 8628.337570
## stopped after 100 iterations
summary(multinom_so) # AIC = 17544.68
## Call:
## multinom(formula = wRCplusrating.so ~ . - wRCplus.so, data = wrc_fr_so_glm)
##
## Coefficients:
## (Intercept) PA.fr BA.fr SlgPct.fr BABIP.fr
## Above Average 0.61754215 -0.0017648055 0.01656815 0.15233187 -0.6728247
## Awful -0.12861713 -0.0068938247 0.70224919 0.07703881 1.7426851
## Below Average 0.70664950 -0.0061367378 1.10305891 -0.63082573 -0.6931290
## Excellent -1.13800888 -0.0067897903 -0.02906636 0.03503968 0.1015323
## Great 0.34043509 -0.0045085851 -0.37567708 -0.74037139 -1.5034511
## Poor 0.08692583 0.0002362675 -0.24826749 0.06516696 0.5382463
## wRAA.fr wRCplus.fr R.fr X2B.fr X3B.fr
## Above Average 0.0093078007 0.001072873 0.001902029 0.00304035 0.14914831
## Awful -0.0391158442 -0.003694869 0.001715085 -0.03348998 0.06050513
## Below Average -0.0006215197 -0.002311949 0.005320622 -0.02747100 0.07372247
## Excellent 0.0212483826 -0.003087980 0.001907237 0.02533869 0.15124430
## Great 0.0332013163 0.002625320 0.018968212 -0.01693816 0.24439156
## Poor 0.0261193184 -0.004623368 0.000598793 -0.01385670 0.13982594
## TB.fr HR.fr HRpct.fr RBI.fr BB.fr
## Above Average 0.0006942084 0.11542878 -0.02347664 -0.013289809 -0.009884806
## Awful 0.0058605554 0.04601809 -0.04096951 -0.011363193 -0.014913919
## Below Average -0.0012871319 -0.14210244 0.10825088 0.009947479 -0.018703935
## Excellent 0.0062657293 0.07306020 0.10272059 0.009085875 0.021014336
## Great -0.0018431672 0.07338962 -0.02316038 0.010657164 -0.011056815
## Poor -0.0017580146 -0.01899575 -0.01788090 -0.022461142 -0.063048813
## K.fr BBpct.fr Kpct.fr HBP.fr SF.fr
## Above Average 3.863094e-03 0.004369362 -0.018900408 -0.010327634 0.10761386
## Awful 1.793095e-02 0.015914712 0.002619607 -0.028303735 0.04117028
## Below Average 3.346905e-02 -0.001236747 -0.021458220 -0.036815459 0.07887567
## Excellent 9.937545e-05 0.042233079 -0.017097630 0.045772734 0.13891523
## Great -3.424453e-03 0.009625511 -0.003208336 -0.003068599 0.08287099
## Poor 3.001120e-02 0.039427604 -0.015734582 -0.074848176 0.02037623
## SH.fr SB.fr CS.fr SPD.fr
## Above Average 0.003685347 0.055218570 0.02274999 -0.12966463
## Awful 0.080883421 -0.040443181 0.03349001 -0.05389362
## Below Average 0.053585336 0.004630763 -0.02531996 -0.07484186
## Excellent -0.033317469 -0.002068828 0.08815314 -0.02524444
## Great 0.051660782 0.050768717 -0.04996450 -0.11857894
## Poor 0.071418617 -0.001629147 0.01798138 -0.05351196
##
## Std. Errors:
## (Intercept) PA.fr BA.fr SlgPct.fr BABIP.fr
## Above Average 0.4441279 0.003809495 0.09187218 0.09533794 0.06502439
## Awful 0.4418409 0.004453149 0.27651583 0.09333780 0.05953991
## Below Average 0.4750591 0.004450547 0.27600015 0.10080063 0.07068363
## Excellent 0.5808199 0.004599882 0.12175166 0.12710417 0.08329032
## Great 0.5029879 0.004249624 0.10426757 0.10712893 0.07296899
## Poor 0.4977756 0.004884565 0.10233704 0.10517199 0.07249790
## wRAA.fr wRCplus.fr R.fr X2B.fr X3B.fr
## Above Average 0.02079071 0.003308130 0.01308715 0.02186517 0.07187833
## Awful 0.02377963 0.003244235 0.01547385 0.02573976 0.08750224
## Below Average 0.02397699 0.003510159 0.01537288 0.02567444 0.08614900
## Excellent 0.02482636 0.004262231 0.01498312 0.02564009 0.08219101
## Great 0.02326907 0.003713960 0.01442527 0.02491434 0.07585241
## Poor 0.02644901 0.003727045 0.01712910 0.02838352 0.09055238
## TB.fr HR.fr HRpct.fr RBI.fr BB.fr
## Above Average 0.001922316 0.06174088 0.08164811 0.01150345 0.01920975
## Awful 0.002509806 0.07600944 0.08504675 0.01347576 0.02257654
## Below Average 0.002427970 0.07841285 0.09115142 0.01341703 0.02274869
## Excellent 0.002145618 0.06833050 0.09634715 0.01312329 0.02236789
## Great 0.002120893 0.06771862 0.09067646 0.01279920 0.02113943
## Poor 0.002700848 0.08768818 0.10160842 0.01516683 0.02507699
## K.fr BBpct.fr Kpct.fr HBP.fr SF.fr SH.fr
## Above Average 0.01124671 0.02484403 0.01494964 0.01939941 0.04247292 0.02120005
## Awful 0.01236172 0.02479743 0.01393496 0.02282583 0.04929101 0.02249155
## Below Average 0.01287203 0.02693490 0.01559192 0.02292864 0.04876633 0.02283241
## Excellent 0.01338027 0.03086455 0.01934323 0.02207276 0.04883035 0.02714375
## Great 0.01261795 0.02744382 0.01676611 0.02139603 0.04785122 0.02309596
## Poor 0.01398667 0.02826136 0.01631665 0.02604699 0.05588617 0.02480693
## SB.fr CS.fr SPD.fr
## Above Average 0.01848102 0.03784051 0.05168203
## Awful 0.02495010 0.04288419 0.05448126
## Below Average 0.02356634 0.04404800 0.05723813
## Excellent 0.02243708 0.04476043 0.06407481
## Great 0.02030721 0.04352384 0.05707977
## Poor 0.02547533 0.04748018 0.06048924
##
## Residual Deviance: 17256.68
## AIC: 17544.68
so_table <- table(wrc_fr_so_glm$wRCplusrating.so, wrc_fr_so_glm$wRCplusPred)
multinom_so_acc <- round((sum(diag(so_table))/sum(so_table)),4) # overall accuracy = 24.53%
so_table
##
## Average Above Average Awful Below Average Excellent Great Poor
## Average 442 130 318 12 53 17 3
## Above Average 357 167 205 8 53 31 3
## Awful 268 83 403 11 15 5 2
## Below Average 284 77 247 10 11 3 7
## Excellent 175 100 82 3 107 19 0
## Great 206 117 129 12 54 33 1
## Poor 174 62 218 9 7 5 0
multinom_so_acc
## [1] 0.2453
The 7x7 table shows the results of the predictions, where the labels across the top indicate the actual value and the labels to left indicate the predicted values. The model performs with an overall accuracy of 0.2453 or 24.53%.
Next, I will train a logistic regression model to predict a binary case, only wRC+ above average (1) and wRC+ below average (0).
wrc_fr_so_glm2 <- wrc_fr_so %>% mutate(wRCplusrating.so =
case_when(wRCplus.so < 100 ~ 0,
wRCplus.so >= 100 ~ 1))
glm_so <- glm(wRCplusrating.so ~ . - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
summary(glm_so) # AIC = 6054.2
##
## Call:
## glm(formula = wRCplusrating.so ~ . - wRCplus.so, family = "binomial",
## data = wrc_fr_so_glm2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5309 -1.0556 0.3667 1.1170 1.8657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.724e-01 4.915e-01 -1.368 0.17132
## PA.fr 4.651e-03 2.738e-03 1.699 0.08936 .
## BA.fr -3.638e-01 7.185e-01 -0.506 0.61260
## SlgPct.fr 2.359e+00 2.097e+00 1.125 0.26068
## BABIP.fr -5.179e+00 1.896e+00 -2.732 0.00629 **
## wRAA.fr 2.452e-02 1.456e-02 1.684 0.09213 .
## wRCplus.fr 8.605e-03 4.542e-03 1.894 0.05818 .
## R.fr 6.708e-03 9.166e-03 0.732 0.46426
## X2B.fr 6.080e-03 2.164e-02 0.281 0.77877
## X3B.fr 3.771e-02 5.563e-02 0.678 0.49779
## TB.fr 6.973e-05 1.454e-03 0.048 0.96175
## HR.fr 1.153e-01 4.435e-02 2.599 0.00934 **
## HRpct.fr -1.937e-01 9.947e-02 -1.947 0.05148 .
## RBI.fr 3.519e-03 7.995e-03 0.440 0.65984
## BB.fr 3.643e-03 1.356e-02 0.269 0.78818
## K.fr -2.164e-02 7.656e-03 -2.826 0.00471 **
## BBpct.fr 9.299e-03 2.108e-02 0.441 0.65908
## Kpct.fr 2.690e-02 1.224e-02 2.199 0.02790 *
## HBP.fr 1.783e-02 1.673e-02 1.066 0.28648
## SF.fr 4.011e-02 2.966e-02 1.352 0.17627
## SH.fr -5.002e-02 1.380e-02 -3.624 0.00029 ***
## SB.fr 4.020e-02 1.372e-02 2.930 0.00339 **
## CS.fr -2.517e-03 2.578e-02 -0.098 0.92221
## SPD.fr -3.525e-02 3.398e-02 -1.037 0.29962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6566.2 on 4737 degrees of freedom
## Residual deviance: 6006.2 on 4714 degrees of freedom
## AIC: 6054.2
##
## Number of Fisher Scoring iterations: 6
cm_glm_so <- confusionMatrix(as.factor(ifelse(glm_so_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so), positive = "1") # overall accuracy = 0.6351
cm_glm_so
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1583 993
## 1 736 1426
##
## Accuracy : 0.6351
## 95% CI : (0.6212, 0.6488)
## No Information Rate : 0.5106
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2715
##
## Mcnemar's Test P-Value : 7.431e-10
##
## Sensitivity : 0.5895
## Specificity : 0.6826
## Pos Pred Value : 0.6596
## Neg Pred Value : 0.6145
## Prevalence : 0.5106
## Detection Rate : 0.3010
## Detection Prevalence : 0.4563
## Balanced Accuracy : 0.6361
##
## 'Positive' Class : 1
##
In this confusion matrix, the prediction label refers to the prediction made by the model and the reference label refers to the actual classificaion of the test data. The binary case model performs with an accuracy of 0.6351 or 63.51%.
Taking only the significant predictors, we can create a reduced model.
glm_so_reduced <- glm(wRCplusrating.so ~ PA.fr + BABIP.fr + wRAA.fr + wRCplus.fr + HR.fr + HRpct.fr + K.fr + Kpct.fr + SH.fr + SB.fr - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
summary(glm_so_reduced) # AIC = 6038.9
##
## Call:
## glm(formula = wRCplusrating.so ~ PA.fr + BABIP.fr + wRAA.fr +
## wRCplus.fr + HR.fr + HRpct.fr + K.fr + Kpct.fr + SH.fr +
## SB.fr - wRCplus.so, family = "binomial", data = wrc_fr_so_glm2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4477 -1.0611 0.3754 1.1184 1.8692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.392624 0.287983 -1.363 0.17277
## PA.fr 0.007641 0.001435 5.325 1.01e-07 ***
## BABIP.fr -5.133025 0.933535 -5.498 3.83e-08 ***
## wRAA.fr 0.034195 0.011091 3.083 0.00205 **
## wRCplus.fr 0.012126 0.002426 4.998 5.78e-07 ***
## HR.fr 0.118222 0.041084 2.878 0.00401 **
## HRpct.fr -0.127802 0.054724 -2.335 0.01952 *
## K.fr -0.021298 0.007286 -2.923 0.00346 **
## Kpct.fr 0.026675 0.009822 2.716 0.00661 **
## SH.fr -0.056854 0.013259 -4.288 1.80e-05 ***
## SB.fr 0.030881 0.009786 3.156 0.00160 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6566.2 on 4737 degrees of freedom
## Residual deviance: 6016.9 on 4727 degrees of freedom
## AIC: 6038.9
##
## Number of Fisher Scoring iterations: 4
cm_glm_so_reduced <- confusionMatrix(as.factor(ifelse(glm_so_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so), positive = "1") # overall accuracy = 0.6334
cm_glm_so_reduced
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1582 1000
## 1 737 1419
##
## Accuracy : 0.6334
## 95% CI : (0.6195, 0.6471)
## No Information Rate : 0.5106
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2682
##
## Mcnemar's Test P-Value : 3.249e-10
##
## Sensitivity : 0.5866
## Specificity : 0.6822
## Pos Pred Value : 0.6582
## Neg Pred Value : 0.6127
## Prevalence : 0.5106
## Detection Rate : 0.2995
## Detection Prevalence : 0.4550
## Balanced Accuracy : 0.6344
##
## 'Positive' Class : 1
##
The reduced model performs with an overall accuracy of 0.6334 or 63.34%.
Repeating these three models for junior year and senior year predictions, we have the follwoign results:
Using k-nearest neighbors for classification, like for regression, requires noramlized predictor. After normalizing the data, we can train the model using various k values to determine which performs with the highest accuracy. This model uses all seven classifications of wRC+ to predict sophomore year performance, and works best with a k value of 30.
accuracy_so_all # k = 30 gives best accuracy - 20.99%
## k accuracy
## 1 10 0.1867089
## 2 20 0.1951477
## 3 30 0.2099156
## 4 40 0.1898734
## 5 50 0.1803797
## 6 60 0.1814346
## 7 70 0.1993671
## 8 80 0.1962025
## 9 90 0.2067511
## 10 100 0.2004219
## 11 110 0.1983122
## 12 120 0.1940928
## 13 130 0.1951477
## 14 140 0.1940928
## 15 150 0.2078059
## 16 160 0.2004219
## 17 170 0.1993671
## 18 180 0.1940928
## 19 190 0.2067511
## 20 200 0.2099156
The confusion matrix shows how the best k model performs for each classification.
fr_so_knn_all_best <- knn(train = fr_so_knn_train_norm[, 1:23], test = fr_so_knn_valid_norm[, 1:23], cl = fr_so_knn_train_norm[, 24], k = 30)
confusionMatrix(fr_so_knn_all_best, fr_so_knn_valid_norm[, 24])
## Confusion Matrix and Statistics
##
## Reference
## Prediction Average Above Average Awful Below Average Excellent Great Poor
## Average 70 77 54 55 26 38 32
## Above Average 31 43 25 22 19 26 18
## Awful 36 23 55 30 17 25 25
## Below Average 14 14 11 17 4 6 5
## Excellent 7 16 4 6 19 19 2
## Great 5 10 3 3 1 4 4
## Poor 2 5 7 2 2 3 6
##
## Overall Statistics
##
## Accuracy : 0.2257
## 95% CI : (0.1995, 0.2537)
## No Information Rate : 0.1983
## P-Value [Acc > NIR] : 0.02009
##
## Kappa : 0.0727
##
## Mcnemar's Test P-Value : < 2e-16
##
## Statistics by Class:
##
## Class: Average Class: Above Average Class: Awful
## Sensitivity 0.42424 0.22872 0.34591
## Specificity 0.63985 0.81447 0.80228
## Pos Pred Value 0.19886 0.23370 0.26066
## Neg Pred Value 0.84060 0.81021 0.85889
## Prevalence 0.17405 0.19831 0.16772
## Detection Rate 0.07384 0.04536 0.05802
## Detection Prevalence 0.37131 0.19409 0.22257
## Balanced Accuracy 0.53204 0.52160 0.57410
## Class: Below Average Class: Excellent Class: Great
## Sensitivity 0.12593 0.21591 0.033058
## Specificity 0.93358 0.93721 0.968561
## Pos Pred Value 0.23944 0.26027 0.133333
## Neg Pred Value 0.86545 0.92114 0.872549
## Prevalence 0.14241 0.09283 0.127637
## Detection Rate 0.01793 0.02004 0.004219
## Detection Prevalence 0.07489 0.07700 0.031646
## Balanced Accuracy 0.52975 0.57656 0.500809
## Class: Poor
## Sensitivity 0.065217
## Specificity 0.975467
## Pos Pred Value 0.222222
## Neg Pred Value 0.906623
## Prevalence 0.097046
## Detection Rate 0.006329
## Detection Prevalence 0.028481
## Balanced Accuracy 0.520342
Next, like for logisitic regression, I will also predict using binary outcomes. The table shows which k value provides the best accuracy. I will be using the model with k=150 as the best model for binary outcomes.
accuracy_so_bin # k = 150 gives best accuracy - 60.44%
## k accuracy
## 1 10 0.5580169
## 2 20 0.5527426
## 3 30 0.5854430
## 4 40 0.5580169
## 5 50 0.5907173
## 6 60 0.5907173
## 7 70 0.5991561
## 8 80 0.5864979
## 9 90 0.5864979
## 10 100 0.5833333
## 11 110 0.5949367
## 12 120 0.5854430
## 13 130 0.5886076
## 14 140 0.5938819
## 15 150 0.6044304
## 16 160 0.5875527
## 17 170 0.6023207
## 18 180 0.5864979
## 19 190 0.5970464
## 20 200 0.5812236
fr_so_knn_bin_best <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
test = fr_so_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_so_knn_bin_train_norm[, 24]), k = 150)
confusionMatrix(fr_so_knn_bin_best, as.factor(fr_so_knn_bin_valid_norm[, 24]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 340 265
## 1 120 223
##
## Accuracy : 0.5939
## 95% CI : (0.5618, 0.6253)
## No Information Rate : 0.5148
## P-Value [Acc > NIR] : 5.919e-07
##
## Kappa : 0.1943
##
## Mcnemar's Test P-Value : 2.153e-13
##
## Sensitivity : 0.7391
## Specificity : 0.4570
## Pos Pred Value : 0.5620
## Neg Pred Value : 0.6501
## Prevalence : 0.4852
## Detection Rate : 0.3586
## Detection Prevalence : 0.6382
## Balanced Accuracy : 0.5980
##
## 'Positive' Class : 0
##
After performing the same analysis for junior and senior year predictions, we get th follwoing results:
Overall, the regression models did not perform exceptionally well. The five models performed similarly for sophomore year predictions with RMSE values falling between 38 and 40 for each, quite a large amount of error for a wRC+ value that typically falls between -100 and 200. The adj. R2 values for the linear regression models did not fair well either, falling very short of the desired range of 0.65 or greater. Since these models are not very distinguishable by performs, I will select the first order linear regression model (full_lm_so/jr/sr) as the best model. The results are very similar for junior and senior year predictions as well.
The classification models had some more success than the regression models due to the ability to simplify into a binary case. The classification models for predicting all seven classes of wRC+ still resulted in poor accuracy. While the binary outcome models tell us less about future seasons, it is still a useful prediction to judge general development of players, and performs with much bettter accuracy. Again with similar accuracy across each of the binary classification models, I am going to select simplest model of the three. This is the binary logistic regression model with only the variables found to be significant. This will be the model that I use to predict future seasons for AAC freshmen from 2021.
To predict future seasons, I first take the subset of the data that fits the description of the players to be predicted. Then I simply use the predict() function to compute a probability (0,1) of a player have an above average wRC+ season, with probabilities 0.50 and greater designated as above average.
aac_pred_so <- predict(glm_so_reduced, newdata = full_aac_fr, type = "response")
aac_pred_jr <- predict(glm_jr_reduced, newdata = full_aac_fr, type = "response")
aac_pred_sr <- predict(glm_sr_reduced, newdata = full_aac_fr, type = "response")
All but one of the players (2301334) is predicted to achieve above average wRC+ by their senior season. The prediction for how quickly the players achieve this, though, varies. Some are predicted to perform at above average levels for the remainder of their collegiate careers, while others are predicted to reach this level during their junior or senior seasons.
setwd("~/College Classes/Fall 2021/Math Seminar")
library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(caret)
library(nnet)
library(FNN)
options(max.print=1000000)
##### Read data #####
full <- read.csv("D1 Batting 2013-2021.csv")
dim(full) # 45938 rows, 39 variables
length(unique(full$school)) # 313 unique schools represented
length(unique(full$player_id)) # 21567 unique players represented
full <- full[, -c(1:3)] # remove row number, year, school, conference
dim(full) # 45938 rows, 36 variables
##### Data Exploration #####
# Missing values
nrow(subset(full, full$player_id == "0")) # 258 player IDs=0: 258/45938 = less than 1% of data
nrow(full[is.na(full$player_id), ]) # 0 na
nrow(full[is.na(full$Yr), ]) # 0 na
nrow(full[full$Yr=="N/A", ]) # 179 na: 179/45938 = less than 1% of data
nrow(full[is.na(full$Player), ]) # 0 na
nrow(full[full$Player=="N/A", ]) # 0 na
nrow(full[is.na(full$Pos), ]) # 0 na
nrow(full[full$Pos=="N/A", ]) # 0 na
nrow(full[is.na(full$BA), ]) # 0 na
nrow(full[full$BA=="N/A", ]) # 0 na
# Remove these
full <- full[!(full$player_id==0), ]
full <- full[!(full$Yr=="N/A"), ]
dim(full) # 45505 rows, 36 variables
# Minimum Plate Appearances
summary_pa <- summary(full$PA)
first_qrt_pa <- summary_pa[2] # 32
median_pa <- median(full$PA) # 99
hist1 <- ggplot(full, aes(x=PA)) +
geom_histogram(fill="azure3", col="black") +
ggtitle("PA Distribution") +
geom_segment(aes(x=first_qrt_pa, y=0, xend=first_qrt_pa, yend=3500), col="red", size=1) +
geom_segment(aes(x=median_pa, y=0, xend=median_pa, yend=2500), col="red", size=1) +
annotate("text", x=first_qrt_pa+20, y=3400, label= "First") +
annotate("text", x=first_qrt_pa+30, y=3250, label= "Quartile") +
annotate("text", x=median_pa+30, y=2350, label = "Median") +
ylab("Number of players") +
xlab("Plate appearances in a season") +
theme_economist_white()
hist1
full <- full[full$PA>=first_qrt_pa, ]
dim(full) # 34338 rows, 36 varaibles
# Correlation Matrix
corr_simple <- function(data=full,sig=0.90){
#convert data to numeric in order to run correlations
#convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
df_cor <- data %>% mutate_if(is.character, as.factor)
df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
#run a correlation and drop the insignificant ones
corr <- cor(df_cor)
#prepare to drop duplicates and correlations of 1
corr[lower.tri(corr,diag=TRUE)] <- NA
#drop perfect correlations
corr[corr == 1] <- NA
#turn into a 3-column table
corr <- as.data.frame(as.table(corr))
#remove the NA values from above
corr <- na.omit(corr)
#select significant values
corr <- subset(corr, abs(Freq) > sig)
#sort by highest correlation
corr <- corr[order(-abs(corr$Freq)),]
#print table
print(corr)
#turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
#plot correlations visually
corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ", title = "Correlation between variables")
}
corr_simple()
# Drop:
# wOBA 15
# GS 7
# AB 9
# OPS 13
# H 20
# ISO 14
# OBpct 11
# GP 6
# Drop selected variables
full <- full[, -c(6:7,9,11,13:15,20)]
dim(full) # 34338 rows, 28 variables
##### Dataset manipulation #####
# Players with Freshman year stats
nrow(subset(full, full$Yr=="Fr")) # 7134 players with freshman year stats
full_fr <- subset(full, full$Yr=="Fr")
player_id_fr <- full_fr$player_id
# Subset stats only for players who have freshman year stats
full_from_fr <- full[full$player_id %in% player_id_fr, ]
dim(full_from_fr) # 18080 rows, 31 variables
length(unique(full_from_fr$player_id)) # 6992 unique players represented
# Length of college career
full_from_fr_years <- full_from_fr %>%
count(player_id)
nrow(subset(full_from_fr_years, full_from_fr_years$n == "1")) # 2025 one year players
nrow(subset(full_from_fr_years, full_from_fr_years$n == "2")) # 1214 two year players
nrow(subset(full_from_fr_years, full_from_fr_years$n == "3")) # 1615 three year players
nrow(subset(full_from_fr_years, full_from_fr_years$n == "4")) # 1911 four year players
nrow(subset(full_from_fr_years, full_from_fr_years$n == "5")) # 224 five year players
nrow(subset(full_from_fr_years, full_from_fr_years$n == "6")) # 3 six year players
# Bar graph
full_from_fr_years_1_5 <- full_from_fr_years[!(full_from_fr_years$n==6), ]
bar1 <- ggplot(full_from_fr_years_1_5, aes(x=n)) +
geom_bar(fill="azure3") +
ggtitle("Length of Playing Career") +
xlab("Number of seasons with stats") +
ylab("Number of players") +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5) +
theme_economist_white()
bar1
# Separate dataset by year classification
stats_fr <- subset(full_from_fr, full_from_fr$Yr=="Fr")
stats_so <- subset(full_from_fr, full_from_fr$Yr=="So")
stats_jr <- subset(full_from_fr, full_from_fr$Yr=="Jr")
stats_sr <- subset(full_from_fr, full_from_fr$Yr=="Sr")
# Create final form data frames
# Freshman and sophomore
stats_fr_so <- inner_join(stats_fr, stats_so, by="player_id", suffix=c(".fr", ".so"))
stats_fr_so <- stats_fr_so[, -c(1,4,5,29:32)]
stats_fr_so <- stats_fr_so %>%
rename(Player = Player.fr)
nrow(stats_fr_so) # 4738 players to predict sophomore year stats
# Freshman and junior
stats_fr_jr <- inner_join(stats_fr, stats_jr, by="player_id", suffix=c(".fr", ".jr"))
stats_fr_jr <- stats_fr_jr[, -c(1,4,5,29:32)]
stats_fr_jr <- stats_fr_jr %>%
rename(Player = Player.fr)
nrow(stats_fr_jr) # 3784 players to predict junior year stats
# Freshman and senior
stats_fr_sr <- inner_join(stats_fr, stats_sr, by="player_id", suffix=c(".fr", ".sr"))
stats_fr_sr <- stats_fr_sr[, -c(1,4,5,29:32)]
stats_fr_sr <- stats_fr_sr %>%
rename(Player = Player.fr)
nrow(stats_fr_sr) # 2443 players to predict senior year stats
# Full freshman to senior data frame
stats_fr_to_jr <- left_join(stats_fr_so, stats_jr, by="player_id")
stats_fr_to_jr <- stats_fr_to_jr[, -c(49:52)]
stats_fr_to_jr <- stats_fr_to_jr %>%
rename(Player = Player.x,
PA.jr = PA,
BA.jr = BA,
SlgPct.jr = SlgPct,
BABIP.jr = BABIP,
wRAA.jr = wRAA,
wRCplus.jr = wRCplus,
R.jr = R,
X2B.jr = X2B,
X3B.jr = X3B,
TB.jr = TB,
HR.jr = HR,
HRpct.jr = HRpct,
RBI.jr = RBI,
BB.jr = BB,
K.jr = K,
BBpct.jr = BBpct,
Kpct.jr = Kpct,
HBP.jr = HBP,
SF.jr = SF,
SH.jr = SH,
SB.jr = SB,
CS.jr = CS,
SPD.jr = SPD)
stats_fr_to_sr <- left_join(stats_fr_to_jr, stats_sr, by="player_id")
stats_fr_to_sr <- stats_fr_to_sr[, -c(72:75)]
stats_fr_to_sr <- stats_fr_to_sr %>%
rename(Player = Player.x,
PA.sr = PA,
BA.sr = BA,
SlgPct.sr = SlgPct,
BABIP.sr = BABIP,
wRAA.sr = wRAA,
wRCplus.sr = wRCplus,
R.sr = R,
X2B.sr = X2B,
X3B.sr = X3B,
TB.sr = TB,
HR.sr = HR,
HRpct.sr = HRpct,
RBI.sr = RBI,
BB.sr = BB,
K.sr = K,
BBpct.sr = BBpct,
Kpct.sr = Kpct,
HBP.sr = HBP,
SF.sr = SF,
SH.sr = SH,
SB.sr = SB,
CS.sr = CS,
SPD.sr = SPD)
nrow(stats_fr_to_sr) # 5127 total players remaining in full data frame
# Regression Modeling
# wRCplus prediction datasets
wrc_fr_so <- stats_fr_so[, -c(1:2, 26:30, 32:48)]
wrc_fr_jr <- stats_fr_jr[, -c(1:2, 26:30, 32:48)]
wrc_fr_sr <- stats_fr_sr[, -c(1:2, 26:30, 32:48)]
# Test & validation data sets
# Freshman to sophomore
set.seed(1819)
train.rows1 <- sample(nrow(wrc_fr_so), nrow(wrc_fr_so) * 0.8)
fr_so_train <- wrc_fr_so[train.rows1, ]
fr_so_valid <- wrc_fr_so[-train.rows1, ]
# Freshman to junior
set.seed(1819)
train.rows2 <- sample(nrow(wrc_fr_jr), nrow(wrc_fr_jr) * 0.8)
fr_jr_train <- wrc_fr_jr[train.rows2, ]
fr_jr_valid <- wrc_fr_jr[-train.rows2, ]
# Freshman to senior
set.seed(1819)
train.rows3 <- sample(nrow(wrc_fr_sr), nrow(wrc_fr_sr) * 0.8)
fr_sr_train <- wrc_fr_sr[train.rows3, ]
fr_sr_valid <- wrc_fr_sr[-train.rows3, ]
# Model Comparison Table
model_stats_so <- data.frame(Model_Name=character(),
Model_Type=character(),
Model_Description=character(),
Adj_R_Squared=numeric(),
RMSE=numeric(),
AIC=numeric(),
Accuracy=numeric())
model_stats_jr <- data.frame(Model_Name=character(),
Model_Type=character(),
Model_Description=character(),
Adj_R_Squared=numeric(),
RMSE=numeric(),
AIC=numeric(),
Accuracy=numeric())
model_stats_sr <- data.frame(Model_Name=character(),
Model_Type=character(),
Model_Description=character(),
Adj_R_Squared=numeric(),
RMSE=numeric(),
AIC=numeric(),
Accuracy=numeric())
# Multiple Regression Models (wRC+)
# Sophomore year
full_lm_so <- lm(wRCplus.so ~ ., data = wrc_fr_so)
summary(full_lm_so) # adj R^2 = 0.1377
full_lm_so_RMSE <- round(sqrt(mean(full_lm_so$residuals^2)),2) # RMSE = 39.26
model_stats_so[1,1:5] <- c("full_lm_so", "Mulitple linear regression", "First order linear model", 0.1377,
full_lm_so_RMSE)
reduced_model_so <- lm(wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + SH.fr + SB.fr, data = wrc_fr_so)
summary(reduced_model_so) # adj R^2 = 0.1183
reduced_model_so_RMSE <- round(sqrt(mean(reduced_model_so$residuals^2)),2) # RMSE = 39.77
model_stats_so[2,1:5] <- c("reduced_model_so", "Multiple linear regression",
"Linear model with singificant predictors", 0.1183, reduced_model_so_RMSE)
anova(reduced_model_so, full_lm_so)
# Junior year
full_lm_jr <- lm(wRCplus.jr ~ ., data = wrc_fr_jr)
summary(full_lm_jr) # adj R^2 = 0.1395
full_lm_jr_RMSE <- round(sqrt(mean(full_lm_jr$residuals^2)),2) # RMSE = 40.02
model_stats_jr[1,1:5] <- c("full_lm_jr", "Mulitple linear regression", "First order linear model", 0.1395,
full_lm_jr_RMSE)
reduced_model_jr <- lm(wRCplus.jr ~ PA.fr + wRAA.fr + TB.fr + K.fr + SH.fr + CS.fr, data = wrc_fr_jr)
summary(reduced_model_jr) # adj R^2 = 0.1296
reduced_model_jr_RMSE <- round(sqrt(mean(reduced_model_jr$residuals^2)),2) # RMSE = 40.33
model_stats_jr[2,1:5] <- c("reduced_model_jr", "Multiple linear regression",
"Linear model with singificant predictors", 0.1296, reduced_model_jr_RMSE)
anova(reduced_model_jr, full_lm_jr)
# Senior year
full_lm_sr <- lm(wRCplus.sr ~ ., data = wrc_fr_sr)
summary(full_lm_sr) # adj R^2 = 0.06893
full_lm_sr_RMSE <- round(sqrt(mean(full_lm_sr$residuals^2)),2) # RMSE = 39.44
model_stats_sr[1,1:5] <- c("full_lm_sr", "Mulitple linear regression", "First order linear model", 0.06893,
full_lm_sr_RMSE)
reduced_model_sr <- lm(wRCplus.sr ~ TB.fr + BBpct.fr + SF.fr + SH.fr +CS.fr, data = wrc_fr_sr)
summary(reduced_model_sr) # adj R^2 = 0.04159
reduced_model_sr_RMSE <- round(sqrt(mean(reduced_model_sr$residuals^2)),2) # RMSE = 40.17
model_stats_sr[2,1:5] <- c("reduced_model_sr", "Multiple linear regression",
"Linear model with singificant predictors", 0.04159, reduced_model_sr_RMSE)
anova(reduced_model_sr, full_lm_sr)
# Multiple Regression Models with Interaction
# Sophomore year
int_vars_so <- wrc_fr_so[, -24]
wrc_int_so <- cbind(int_vars_so^2, do.call(cbind, combn(colnames(int_vars_so), 2,
FUN=function(x) list(int_vars_so[x[1]]*int_vars_so[x[2]]))))
colnames(wrc_int_so)[-(seq_len(ncol(int_vars_so)))] <- combn(colnames(int_vars_so), 2,
FUN = paste, collapse=":")
wrc_int_so$wRCplus.so <- wrc_fr_so$wRCplus.so
full_int_so <- lm(wRCplus.so ~ ., data = wrc_int_so)
summary(full_int_so) # adj R^2 = 0.1417
full_int_so_RMSE <- round(sqrt(mean(full_int_so$residuals^2)),2) # RMSE = 38.11
model_stats_so[3,1:5] <- c("full_int_so", "Multiple linear regression",
"Linear model with all interaction terms", 0.1417, full_int_so_RMSE)
anova(full_lm_so, full_int_so) # interaction model not significantly more useful model
reduced_int_so <- lm(wRCplus.so ~ PA.fr + wRAA.fr + R.fr + X3B.fr + SB.fr + HR.fr + BB.fr + CS.fr + Kpct.fr +
K.fr + BBpct.fr + SPD.fr + TB.fr + RBI.fr + PA.fr:TB.fr + wRAA.fr:TB.fr +
R.fr:X3B.fr + R.fr:SB.fr + X3B.fr:TB.fr + X3B.fr:RBI.fr + TB.fr:HR.fr + TB.fr:BB.fr +
RBI.fr:CS.fr + BB.fr:Kpct.fr + K.fr:BBpct.fr + CS.fr:SPD.fr, data = wrc_int_so)
summary(reduced_int_so) # adj R^2 = 0.1133
reduced_int_so_RMSE <- round(sqrt(mean(reduced_int_so$residuals^2)),2) # RMSE = 38.11
model_stats_so[4,1:5] <- c("reduced_int_so", "Multiple linear regression",
"Linear model with interaction terms and significant predictors", 0.1133,
reduced_int_so_RMSE)
# Junior year
int_vars_jr <- wrc_fr_jr[, -24]
wrc_int_jr <- cbind(int_vars_jr^2, do.call(cbind, combn(colnames(int_vars_jr), 2,
FUN=function(x) list(int_vars_jr[x[1]]*int_vars_jr[x[2]]))))
colnames(wrc_int_jr)[-(seq_len(ncol(int_vars_jr)))] <- combn(colnames(int_vars_jr), 2,
FUN = paste, collapse=":")
wrc_int_jr$wRCplus.jr <- wrc_fr_jr$wRCplus.jr
full_int_jr <- lm(wRCplus.jr ~., data = wrc_int_jr)
summary(full_int_jr) # adj R^2 = 0.1373
full_int_jr_RMSE <- round(sqrt(mean(full_int_jr$residuals^2)),2) # RMSE = 38.70
model_stats_jr[3,1:5] <- c("full_int_jr", "Multiple linear regression",
"Linear model with all interaction terms", 0.1373, full_int_jr_RMSE)
anova(full_lm_jr, full_int_jr) # interaction model not significantly more useful model
reduced_int_jr <- lm(wRCplus.jr ~ TB.fr + R.fr + Kpct.fr + R.fr:Kpct.fr + HBP.fr + R.fr:HBP.fr + X2B.fr +
X2B.fr:TB.fr + X3B.fr + X3B.fr:TB.fr + HR.fr + TB.fr:HR.fr + RBI.fr + SB.fr +
RBI.fr:SB.fr + SPD.fr + RBI.fr:SPD.fr + K.fr + CS.fr + K.fr:CS.fr, data = wrc_int_jr)
summary(reduced_int_jr) # adj R^2 = 0.1084
reduced_int_jr_RMSE <- round(sqrt(mean(reduced_int_jr$residuals^2)),2) # RMSE = 40.75
model_stats_jr[4,1:5] <- c("reduced_int_jr", "Multiple linear regression",
"Linear model with interaction terms and significant predictors", 0.1084,
reduced_int_jr_RMSE)
# Senior year
int_vars_sr <- wrc_fr_sr[, -24]
wrc_int_sr <- cbind(int_vars_sr^2, do.call(cbind, combn(colnames(int_vars_sr), 2,
FUN=function(x) list(int_vars_sr[x[1]]*int_vars_sr[x[2]]))))
colnames(wrc_int_sr)[-(seq_len(ncol(int_vars_sr)))] <- combn(colnames(int_vars_sr), 2,
FUN = paste, collapse=":")
wrc_int_sr$wRCplus.sr <- wrc_fr_sr$wRCplus.sr
full_int_sr <- lm(wRCplus.sr ~., data = wrc_int_sr)
summary(full_int_sr) # adj R^2 = 0.08442
full_int_sr_RMSE <- round(sqrt(mean(full_int_sr$residuals^2)),2) # RMSE = 37.01
model_stats_sr[3,1:5] <- c("full_int_sr", "Multiple linear regression",
"Linear model with all interaction terms", 0.08442, full_int_sr_RMSE)
anova(full_lm_sr, full_int_sr) # interaction model is signifcantly more useful model
reduced_int_sr <- lm(wRCplus.sr ~ BA.fr + SlgPct.fr + X3B.fr + SH.fr + PA.fr + X2B.fr + PA.fr:X2B.fr +
BA.fr:SlgPct.fr + wRAA.fr + BA.fr:wRAA.fr + wRCplus.fr + BA.fr:wRCplus.fr +
BA.fr:X2B.fr + BA.fr:X3B.fr + HRpct.fr + BA.fr:HRpct.fr + SF.fr + BA.fr:SF.fr +
CS.fr + BA.fr:CS.fr + SlgPct.fr:X3B.fr + SlgPct.fr:HRpct.fr + BABIP.fr +
BABIP.fr:wRAA.fr + R.fr + BABIP.fr:R.fr + BABIP.fr:X2B.fr + BABIP.fr+X3B.fr +
wRCplus.fr:CS.fr + R.fr:HRpct.fr + X2B.fr:X3B.fr + Kpct.fr + X3B.fr:Kpct.fr +
TB.fr + SB.fr + TB.fr:SB.fr + HR.fr + SH.fr + HR.fr:SH.fr, data = wrc_int_sr)
summary(reduced_int_sr) # adj R^2 = 0.06089
reduced_int_sr_RMSE <- round(sqrt(mean(reduced_int_sr$residuals^2)),2) # RMSE = 39.50
model_stats_sr[4,1:5] <- c("reduced_int_sr", "Multiple linear regression",
"Linear model with interaction terms and significant predictors", 0.06089,
reduced_int_sr_RMSE)
# Additional Models
# k-nearest neighbors for regression
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
make_knn_pred <- function(k = 1, training, predicting) {
pred <- FNN::knn.reg(train = training[,1:23],
test = predicting[,1:23],
y = training[,24], k = k)$pred
act <- predicting[,24]
rmse(predicted = pred, actual = act)
}
k <- seq(10,200,10)
# Sophomore year
fr_so_train_norm <- fr_so_train
fr_so_valid_norm <- fr_so_valid
wrc_fr_so_norm <- wrc_fr_so
norm_values <- preProcess(fr_so_train[, 1:23], method = c("center", "scale"))
fr_so_train_norm[, 1:23] <- predict(norm_values, fr_so_train[, 1:23])
fr_so_valid_norm[, 1:23] <- predict(norm_values, fr_so_valid[, 1:23])
wrc_fr_so_norm[, 1:23] <- predict(norm_values, wrc_fr_so[, 1:23])
knn_train_rmse_so <- sapply(k, make_knn_pred,
training = fr_so_train_norm,
predicting = fr_so_valid_norm)
knn_valid_rmse_so <- sapply(k, make_knn_pred,
training = fr_so_train_norm,
predicting = fr_so_valid_norm)
best_k_so <- k[which.min(knn_valid_rmse_so)]
fit_status <- ifelse(k < best_k_so, "Over", ifelse(k == best_k_so, "Best", "Under"))
knn_results_so <- data.frame(
k,
round(knn_train_rmse_so, 2),
round(knn_valid_rmse_so, 2),
fit_status
)
colnames(knn_results_so) <- c("k", "Train RMSE", "Valid RMSE", "Fit?")
knitr::kable(knn_results_so, escape = FALSE, booktabs = TRUE) # k = 180, valid rmse = 38.88
so_knn_best <- knn.reg(train = fr_so_train_norm[, 1:23], test = fr_so_valid_norm[, 1:23],
y = fr_so_train_norm[,24], k = 180)
model_stats_so[5,1:5] <- c("so_knn_best", "k-nearest neighbors",
"k-nearest neighbors regression model with k=180", NA, 38.88)
# Junior year
fr_jr_train_norm <- fr_jr_train
fr_jr_valid_norm <- fr_jr_valid
wrc_fr_jr_norm <- wrc_fr_jr
norm_values <- preProcess(fr_jr_train[, 1:23], method = c("center", "scale"))
fr_jr_train_norm[, 1:23] <- predict(norm_values, fr_jr_train[, 1:23])
fr_jr_valid_norm[, 1:23] <- predict(norm_values, fr_jr_valid[, 1:23])
wrc_fr_jr_norm[, 1:23] <- predict(norm_values, wrc_fr_jr[, 1:23])
knn_train_rmse_jr <- sapply(k, make_knn_pred,
training = fr_jr_train_norm,
predicting = fr_jr_valid_norm)
knn_valid_rmse_jr <- sapply(k, make_knn_pred,
training = fr_jr_train_norm,
predicting = fr_jr_valid_norm)
best_k_jr <- k[which.min(knn_valid_rmse_jr)]
fit_status = ifelse(k < best_k_jr, "Over", ifelse(k == best_k_jr, "Best", "Under"))
knn_results_jr <- data.frame(
k,
round(knn_train_rmse_jr, 2),
round(knn_valid_rmse_jr, 2),
fit_status
)
colnames(knn_results_jr) <- c("k", "Train RMSE", "Valid RMSE", "Fit?")
knitr::kable(knn_results_jr, escape = FALSE, booktabs = TRUE) # k = 120, valid rmse = 42.08
jr_knn_best <- knn.reg(train = fr_jr_train_norm[, 1:23], test = fr_jr_valid_norm[, 1:23],
y = fr_jr_train_norm[,24], k = 120)
model_stats_jr[5,1:5] <- c("jr_knn_best", "k-nearest neighbors",
"k-nearest neighbors regression model with k=120", NA, 42.08)
# Senior year
fr_sr_train_norm <- fr_sr_train
fr_sr_valid_norm <- fr_sr_valid
wrc_fr_sr_norm <- wrc_fr_sr
norm_values <- preProcess(fr_sr_train[, 1:23], method = c("center", "scale"))
fr_sr_train_norm[, 1:23] <- predict(norm_values, fr_sr_train[, 1:23])
fr_sr_valid_norm[, 1:23] <- predict(norm_values, fr_sr_valid[, 1:23])
wrc_fr_sr_norm[, 1:23] <- predict(norm_values, wrc_fr_sr[, 1:23])
knn_train_rmse_sr <- sapply(k, make_knn_pred,
training = fr_sr_train_norm,
predicting = fr_sr_valid_norm)
knn_valid_rmse_sr <- sapply(k, make_knn_pred,
training = fr_sr_train_norm,
predicting = fr_sr_valid_norm)
best_k_sr <- k[which.min(knn_valid_rmse_sr)]
fit_status = ifelse(k < best_k_sr, "Over", ifelse(k == best_k_sr, "Best", "Under"))
knn_results_sr <- data.frame(
k,
round(knn_train_rmse_sr, 2),
round(knn_valid_rmse_sr, 2),
fit_status
)
colnames(knn_results_sr) <- c("k", "Train RMSE", "Valid RMSE", "Fit?")
knitr::kable(knn_results_sr, escape = FALSE, booktabs = TRUE) # k = 110, valid rmse = 39.76
sr_knn_best <- knn.reg(train = fr_sr_train_norm[, 1:23], test = fr_sr_valid_norm[, 1:23],
y = fr_sr_train_norm[,24], k = 110)
model_stats_sr[5,1:5] <- c("sr_knn_best", "k-nearest neighbors",
"k-nearest neighbors regression model with k=110", NA, 39.76)
# Classification Modeling
# Logistic Regression Models
# Sophomore year
# All levels
wrc_fr_so_glm <- wrc_fr_so %>% mutate(wRCplusrating.so =
case_when(wRCplus.so <= 60 ~ "Awful",
wRCplus.so > 60 & wRCplus.so <= 75 ~ "Poor",
wRCplus.so > 75 & wRCplus.so <= 90 ~ "Below Average",
wRCplus.so > 90 & wRCplus.so <= 110 ~ "Average",
wRCplus.so > 110 & wRCplus.so <= 130 ~ "Above Average",
wRCplus.so > 130 & wRCplus.so <= 150 ~ "Great",
wRCplus.so > 150 ~ "Excellent")
)
wrc_fr_so_glm$wRCplusrating.so <- as.factor(wrc_fr_so_glm$wRCplusrating.so)
wrc_fr_so_glm$wRCplusrating.so <- relevel(wrc_fr_so_glm$wRCplusrating.so, ref = "Average")
multinom_so <- multinom(wRCplusrating.so ~ .-wRCplus.so, data = wrc_fr_so_glm)
summary(multinom_so) # AIC = 17544.68
multinom_so_AIC <- round(multinom_so$AIC,0)
wrc_fr_so_glm$wRCplusPred <- predict(multinom_so, newdata = wrc_fr_so_glm, "class")
so_table <- table(wrc_fr_so_glm$wRCplusrating.so, wrc_fr_so_glm$wRCplusPred)
multinom_so_acc <- round((sum(diag(so_table))/sum(so_table)),4) # overall accuracy = 24.53%
model_stats_so[6,c(1:3,6:7)] <- c("multinom_so", "Logistic regression",
"Multinomial logistic regression model with 7 categorical outcomes",
multinom_so_AIC, multinom_so_acc)
# Binary outcome
wrc_fr_so_glm2 <- wrc_fr_so %>% mutate(wRCplusrating.so =
case_when(wRCplus.so < 100 ~ 0,
wRCplus.so >= 100 ~ 1))
wrc_fr_so_glm2$wRCplusrating.so <- as.factor(wrc_fr_so_glm2$wRCplusrating.so)
glm_so <- glm(wRCplusrating.so ~ . - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
summary(glm_so) # AIC = 6054.2
glm_so_AIC <- round(glm_so$aic,0)
glm_so_pred <- predict(glm_so, wrc_fr_so_glm2, type = "response")
cm_glm_so <- confusionMatrix(as.factor(ifelse(glm_so_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so),
positive = "1") # overall accuracy = 0.6351
glm_so_acc <- round(cm_glm_so$overall[1],4)
model_stats_so[7,c(1:3,6:7)] <- c("glm_so", "Logistic regression",
"Binary logistic regression model", glm_so_AIC, glm_so_acc)
glm_so_reduced <- glm(wRCplusrating.so ~ PA.fr + BABIP.fr + wRAA.fr + wRCplus.fr + HR.fr + HRpct.fr +
K.fr + Kpct.fr + SH.fr + SB.fr - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
summary(glm_so_reduced) # AIC = 6038.9
glm_so_reduced_AIC <- round(glm_so_reduced$aic,0)
glm_so_reduced_pred <- predict(glm_so_reduced, wrc_fr_so_glm2, type = "response")
cm_glm_so_reduced <- confusionMatrix(as.factor(ifelse(glm_so_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so),
positive = "1") # overall accuracy = 0.6334
glm_so_reduced_acc <- round(cm_glm_so_reduced$overall[1],4)
model_stats_so[8, c(1:3,6:7)] <- c("glm_so_reduced", "Logistic regression",
"Binary logistic regression model with significant variables",
glm_so_reduced_AIC, glm_so_reduced_acc)
# Junior Year
# All levels
wrc_fr_jr_glm <- wrc_fr_jr %>% mutate(wRCplusrating.jr =
case_when(wRCplus.jr <= 60 ~ "Awful",
wRCplus.jr > 60 & wRCplus.jr <= 75 ~ "Poor",
wRCplus.jr > 75 & wRCplus.jr <= 90 ~ "Below Average",
wRCplus.jr > 90 & wRCplus.jr <= 110 ~ "Average",
wRCplus.jr > 110 & wRCplus.jr <= 130 ~ "Above Average",
wRCplus.jr > 130 & wRCplus.jr <= 150 ~ "Great",
wRCplus.jr > 150 ~ "Excellent")
)
wrc_fr_jr_glm$wRCplusrating.jr <- as.factor(wrc_fr_jr_glm$wRCplusrating.jr)
wrc_fr_jr_glm$wRCplusrating.jr <- relevel(wrc_fr_jr_glm$wRCplusrating.jr, ref = "Average")
multinom_jr <- multinom(wRCplusrating.jr ~ .-wRCplus.jr, data = wrc_fr_jr_glm)
summary(multinom_jr) # AIC = 13693.88
multinom_jr_AIC <- round(multinom_jr$AIC,0)
wrc_fr_jr_glm$wRCplusPred <- predict(multinom_jr, newdata = wrc_fr_jr_glm, "class")
jr_table <- table(wrc_fr_jr_glm$wRCplusrating.jr, wrc_fr_jr_glm$wRCplusPred)
multinom_jr_acc <- round((sum(diag(jr_table))/sum(jr_table)),4) # overall accuracy = 25.11%
model_stats_jr[6,c(1:3,6:7)] <- c("multinom_jr", "Logistic regression",
"Multinomial logistic regression model with 7 categorical outcomes",
multinom_jr_AIC, multinom_jr_acc)
# Binary
# Binary outcome
wrc_fr_jr_glm2 <- wrc_fr_jr %>% mutate(wRCplusrating.jr =
case_when(wRCplus.jr < 100 ~ 0,
wRCplus.jr >= 100 ~ 1))
wrc_fr_jr_glm2$wRCplusrating.jr <- as.factor(wrc_fr_jr_glm2$wRCplusrating.jr)
glm_jr <- glm(wRCplusrating.jr ~ . - wRCplus.jr, data = wrc_fr_jr_glm2, family = "binomial")
summary(glm_jr) # AIC = 4707.4
glm_jr_AIC <- round(glm_jr$aic,0)
glm_jr_pred <- predict(glm_jr, wrc_fr_jr_glm2, type = "response")
cm_glm_jr <- confusionMatrix(as.factor(ifelse(glm_jr_pred >= 0.5, "1", "0")), as.factor(wrc_fr_jr_glm2$wRCplusrating.jr),
positive = "1") # overall accuracy = 0.6435
glm_jr_acc <- round(cm_glm_jr$overall[1],4)
model_stats_jr[7,c(1:3,6:7)] <- c("glm_jr", "Logistic regression",
"Binary logistic regression model",
glm_jr_AIC, glm_jr_acc)
glm_jr_reduced <- glm(wRCplusrating.jr ~ PA.fr + BA.fr + wRCplus.fr + R.fr + K.fr + SH.fr + SB.fr +
CS.fr - wRCplus.jr, data = wrc_fr_jr_glm2, family = "binomial")
summary(glm_jr_reduced) # AIC = 4707.9
glm_jr_reduced_AIC <- round(glm_jr_reduced$aic,0)
glm_jr_reduced_pred <- predict(glm_jr_reduced, wrc_fr_jr_glm2, type = "response")
cm_glm_jr_reduced <- confusionMatrix(as.factor(ifelse(glm_jr_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_jr_glm2$wRCplusrating.jr),
positive = "1") # overall accuracy = 0.6382
glm_jr_reduced_acc <- round(cm_glm_jr_reduced$overall[1], 4)
model_stats_jr[8, c(1:3,6:7)] <- c("glm_jr_reduced", "Logistic regression",
"Binary logistic regression model with significant variables",
glm_jr_reduced_AIC, glm_jr_reduced_acc)
# Senior Year
# All levels
wrc_fr_sr_glm <- wrc_fr_sr %>% mutate(wRCplusrating.sr =
case_when(wRCplus.sr <= 60 ~ "Awful",
wRCplus.sr > 60 & wRCplus.sr <= 75 ~ "Poor",
wRCplus.sr > 75 & wRCplus.sr <= 90 ~ "Below Average",
wRCplus.sr > 90 & wRCplus.sr <= 110 ~ "Average",
wRCplus.sr > 110 & wRCplus.sr <= 130 ~ "Above Average",
wRCplus.sr > 130 & wRCplus.sr <= 150 ~ "Great",
wRCplus.sr > 150 ~ "Excellent")
)
wrc_fr_sr_glm$wRCplusrating.sr <- as.factor(wrc_fr_sr_glm$wRCplusrating.sr)
wrc_fr_sr_glm$wRCplusrating.sr <- relevel(wrc_fr_sr_glm$wRCplusrating.sr, ref = "Average")
multinom_sr <- multinom(wRCplusrating.sr ~ .-wRCplus.sr, data = wrc_fr_sr_glm)
summary(multinom_sr) # AIC = 9265.011
multinom_sr_AIC <- round(multinom_sr$AIC,0)
wrc_fr_sr_glm$wRCplusPred <- predict(multinom_sr, newdata = wrc_fr_sr_glm, "class")
sr_table <- table(wrc_fr_sr_glm$wRCplusrating.sr, wrc_fr_sr_glm$wRCplusPred)
multinom_sr_acc <- round((sum(diag(sr_table))/sum(sr_table)),4) # overall accuracy = 24.15%
model_stats_sr[6,c(1:3,6:7)] <- c("multinom_sr", "Logistic regression",
"Multinomial logistic regression model with 7 categorical outcomes",
multinom_sr_AIC, multinom_sr_acc)
# Binary
# Binary outcome
wrc_fr_sr_glm2 <- wrc_fr_sr %>% mutate(wRCplusrating.sr =
case_when(wRCplus.sr < 100 ~ 0,
wRCplus.sr >= 100 ~ 1))
wrc_fr_sr_glm2$wRCplusrating.sr <- as.factor(wrc_fr_sr_glm2$wRCplusrating.sr)
glm_sr <- glm(wRCplusrating.sr ~ . - wRCplus.sr, data = wrc_fr_sr_glm2, family = "binomial")
summary(glm_sr) # AIC = 3172.7
glm_sr_AIC <- round(glm_sr$aic,0)
glm_sr_pred <- predict(glm_sr, wrc_fr_sr_glm2, type = "response")
cm_glm_sr <- confusionMatrix(as.factor(ifelse(glm_sr_pred >= 0.5, "1", "0")), as.factor(wrc_fr_sr_glm2$wRCplusrating.sr),
positive = "1") # overall accuracy = 0.6279
glm_sr_acc <- round(cm_glm_sr$overall[1],4)
model_stats_sr[7,c(1:3,6:7)] <- c("glm_sr", "Logistic regression",
"Binary logistic regression model",
glm_sr_AIC, glm_sr_acc)
glm_sr_reduced <- glm(wRCplusrating.sr ~ PA.fr + K.fr + SF.fr + CS.fr + SPD.fr - wRCplus.sr,
data = wrc_fr_sr_glm2, family = "binomial")
summary(glm_sr_reduced) # AIC = 3192.7
glm_sr_reduced_AIC <- round(glm_sr_reduced$aic,0)
glm_sr_reduced_pred <- predict(glm_sr_reduced, wrc_fr_sr_glm2, type = "response")
cm_glm_sr_reduced <- confusionMatrix(as.factor(ifelse(glm_sr_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_sr_glm2$wRCplusrating.sr),
positive = "1") # overall accuracy = 0.6177
glm_sr_reduced_acc <- round(cm_glm_sr_reduced$overall[1],4)
model_stats_sr[8, c(1:3,6:7)] <- c("glm_sr_reduced", "Logistic regression",
"Binary logistic regression model with significant variables",
glm_sr_reduced_AIC, glm_sr_reduced_acc)
# k-nearest neighbors for classification
# Sophomore year
# All outcomes
wrc_fr_so_knn <- wrc_fr_so_glm[, -c(24,26)]
set.seed(1819)
train.index1 <- sample(nrow(wrc_fr_so_knn), nrow(wrc_fr_so_knn) * 0.8)
fr_so_knn_train <- wrc_fr_so_knn[train.index1, ]
fr_so_knn_valid <- wrc_fr_so_knn[-train.index1, ]
fr_so_knn_train_norm <- fr_so_knn_train
fr_so_knn_valid_norm <- fr_so_knn_valid
fr_so_knn_norm <- wrc_fr_so_knn
norm_values1 <- preProcess(fr_so_knn_train[, 1:23], method = c("center", "scale"))
fr_so_knn_train_norm[, 1:23] <- predict(norm_values1, fr_so_knn_train[, 1:23])
fr_so_knn_valid_norm[, 1:23] <- predict(norm_values1, fr_so_knn_valid[, 1:23])
fr_so_knn_norm[, 1:23] <- predict(norm_values1, fr_so_knn_norm[, 1:23])
fr_so_knn_all <- knn(train = fr_so_knn_train_norm[, 1:23], test = fr_so_knn_valid_norm[, 1:23],
cl = fr_so_knn_train_norm[, 24], k = 3)
accuracy_so_all <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
for (i in 1:20) {
knn_pred_so_all <- knn(train = fr_so_knn_train_norm[, 1:23],
test = fr_so_knn_valid_norm[, 1:23],
cl = fr_so_knn_train_norm[, 24], k = i)
accuracy_so_all[i, 2] <- confusionMatrix(knn_pred_so_all, fr_so_knn_valid_norm[, 24])$overall[1]
}
accuracy_so_all # k = 30 gives best accuracy - 20.99%
fr_so_knn_all_best <- knn(train = fr_so_knn_train_norm[, 1:23], test = fr_so_knn_valid_norm[, 1:23],
cl = fr_so_knn_train_norm[, 24], k = 30)
confusionMatrix(fr_so_knn_all_best, fr_so_knn_valid_norm[, 24])
fr_so_knn_all_best_acc <- round(confusionMatrix(fr_so_knn_all_best, fr_so_knn_valid_norm[, 24])$overall[1],4)
model_stats_so[9, c(1:3,7)] <- c("fr_so_knn_all_best", "k-nearest neighbors",
"k-nearest neighbors model with all 7 wRC+ classifications",
fr_so_knn_all_best_acc)
# Binary Outcome
wrc_fr_so_knn_bin <- wrc_fr_so_glm2[, -24]
set.seed(1819)
train.index1 <- sample(nrow(wrc_fr_so_knn_bin), nrow(wrc_fr_so_knn_bin) * 0.8)
fr_so_knn_bin_train <- wrc_fr_so_knn_bin[train.index1, ]
fr_so_knn_bin_valid <- wrc_fr_so_knn_bin[-train.index1, ]
fr_so_knn_bin_train_norm <- fr_so_knn_bin_train
fr_so_knn_bin_valid_norm <- fr_so_knn_bin_valid
fr_so_knn_bin_norm <- wrc_fr_so_knn_bin
norm_values1.5 <- preProcess(fr_so_knn_bin_train[, 1:23], method = c("center", "scale"))
fr_so_knn_bin_train_norm[, 1:23] <- predict(norm_values1.5, fr_so_knn_bin_train[, 1:23])
fr_so_knn_bin_valid_norm[, 1:23] <- predict(norm_values1.5, fr_so_knn_bin_valid[, 1:23])
fr_so_knn_bin_norm[, 1:23] <- predict(norm_values1.5, fr_so_knn_bin_norm[, 1:23])
fr_so_knn_bin <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
test = fr_so_knn_bin_valid_norm[, 1:23],
cl = fr_so_knn_bin_train_norm[, 24], k = 3)
accuracy_so_bin <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
for (i in 1:20) {
knn_pred_so_bin <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
test = fr_so_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_so_knn_bin_train_norm[, 24]), k = i)
accuracy_so_bin[i, 2] <- confusionMatrix(knn_pred_so_bin, as.factor(fr_so_knn_bin_valid_norm[, 24]))$overall[1]
}
accuracy_so_bin # k = 150 gives best accuracy - 60.44%
fr_so_knn_bin_best <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
test = fr_so_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_so_knn_bin_train_norm[, 24]), k = 150)
confusionMatrix(fr_so_knn_bin_best, as.factor(fr_so_knn_bin_valid_norm[, 24]))
fr_so_knn_bin_best_acc <- round(confusionMatrix(fr_so_knn_bin_best, as.factor(fr_so_knn_bin_valid_norm[, 24]))$overall[1],4)
model_stats_so[10, c(1:3,7)] <- c("fr_so_knn_bin_best", "k-nearest neighbors",
"k-nearest neighbors model with binary wRC+ classifications",
fr_so_knn_bin_best_acc)
# Junior year
# All outcomes
wrc_fr_jr_knn <- wrc_fr_jr_glm[, -c(24,26)]
set.seed(1819)
train.index1 <- sample(nrow(wrc_fr_jr_knn), nrow(wrc_fr_jr_knn) * 0.8)
fr_jr_knn_train <- wrc_fr_jr_knn[train.index1, ]
fr_jr_knn_valid <- wrc_fr_jr_knn[-train.index1, ]
fr_jr_knn_train_norm <- fr_jr_knn_train
fr_jr_knn_valid_norm <- fr_jr_knn_valid
fr_jr_knn_norm <- wrc_fr_jr_knn
norm_values2 <- preProcess(fr_jr_knn_train[, 1:23], method = c("center", "scale"))
fr_jr_knn_train_norm[, 1:23] <- predict(norm_values2, fr_jr_knn_train[, 1:23])
fr_jr_knn_valid_norm[, 1:23] <- predict(norm_values2, fr_jr_knn_valid[, 1:23])
fr_jr_knn_norm[, 1:23] <- predict(norm_values2, fr_jr_knn_norm[, 1:23])
fr_jr_knn_all <- knn(train = fr_jr_knn_train_norm[, 1:23], test = fr_jr_knn_valid_norm[, 1:23],
cl = fr_jr_knn_train_norm[, 24], k = 3)
accuracy_jr_all <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
for (i in 1:20) {
knn_pred_jr_all <- knn(train = fr_jr_knn_train_norm[, 1:23],
test = fr_jr_knn_valid_norm[, 1:23],
cl = fr_jr_knn_train_norm[, 24], k = i)
accuracy_jr_all[i, 2] <- confusionMatrix(knn_pred_jr_all, fr_jr_knn_valid_norm[, 24])$overall[1]
}
accuracy_jr_all # k = 80 gives best accuracy - 21.66%
fr_jr_knn_all_best <- knn(train = fr_jr_knn_train_norm[, 1:23], test = fr_jr_knn_valid_norm[, 1:23],
cl = fr_jr_knn_train_norm[, 24], k = 80)
fr_jr_knn_all_best_acc <- round(confusionMatrix(fr_jr_knn_all_best, fr_jr_knn_valid_norm[, 24])$overall[1],4)
model_stats_jr[9, c(1:3,7)] <- c("fr_jr_knn_all_best", "k-nearest neighbors",
"k-nearest neighbors model with all 7 wRC+ classifications",
fr_jr_knn_all_best_acc)
# Binary Outcome
wrc_fr_jr_knn_bin <- wrc_fr_jr_glm2[, -24]
set.seed(1819)
train.index1 <- sample(nrow(wrc_fr_jr_knn_bin), nrow(wrc_fr_jr_knn_bin) * 0.8)
fr_jr_knn_bin_train <- wrc_fr_jr_knn_bin[train.index1, ]
fr_jr_knn_bin_valid <- wrc_fr_jr_knn_bin[-train.index1, ]
fr_jr_knn_bin_train_norm <- fr_jr_knn_bin_train
fr_jr_knn_bin_valid_norm <- fr_jr_knn_bin_valid
fr_jr_knn_bin_norm <- wrc_fr_jr_knn_bin
norm_values2.5 <- preProcess(fr_jr_knn_bin_train[, 1:23], method = c("center", "scale"))
fr_jr_knn_bin_train_norm[, 1:23] <- predict(norm_values2.5, fr_jr_knn_bin_train[, 1:23])
fr_jr_knn_bin_valid_norm[, 1:23] <- predict(norm_values2.5, fr_jr_knn_bin_valid[, 1:23])
fr_jr_knn_bin_norm[, 1:23] <- predict(norm_values2.5, fr_jr_knn_bin_norm[, 1:23])
fr_jr_knn_bin <- knn(train = fr_jr_knn_bin_train_norm[, 1:23],
test = fr_jr_knn_bin_valid_norm[, 1:23],
cl = fr_jr_knn_bin_train_norm[, 24], k = 3)
accuracy_jr_bin <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
for (i in 1:20) {
knn_pred_jr_bin <- knn(train = fr_jr_knn_bin_train_norm[, 1:23],
test = fr_jr_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_jr_knn_bin_train_norm[, 24]), k = i)
accuracy_jr_bin[i, 2] <- confusionMatrix(knn_pred_jr_bin, as.factor(fr_jr_knn_bin_valid_norm[, 24]))$overall[1]
}
accuracy_jr_bin # k = 150 gives best accuracy - 63.94%
fr_jr_knn_bin_best <- knn(train = fr_jr_knn_bin_train_norm[, 1:23],
test = fr_jr_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_jr_knn_bin_train_norm[, 24]), k = 150)
fr_jr_knn_bin_best_acc <- round(confusionMatrix(fr_jr_knn_bin_best, as.factor(fr_jr_knn_bin_valid_norm[, 24]))$overall[1],4)
model_stats_jr[10, c(1:3,7)] <- c("fr_jr_knn_bin_best", "k-nearest neighbors",
"k-nearest neighbors model with binary wRC+ classifications",
fr_jr_knn_bin_best_acc)
# Senior year
# All outcomes
wrc_fr_sr_knn <- wrc_fr_sr_glm[, -c(24,26)]
set.seed(1819)
train.index1 <- sample(nrow(wrc_fr_sr_knn), nrow(wrc_fr_sr_knn) * 0.8)
fr_sr_knn_train <- wrc_fr_sr_knn[train.index1, ]
fr_sr_knn_valid <- wrc_fr_sr_knn[-train.index1, ]
fr_sr_knn_train_norm <- fr_sr_knn_train
fr_sr_knn_valid_norm <- fr_sr_knn_valid
fr_sr_knn_norm <- wrc_fr_sr_knn
norm_values3 <- preProcess(fr_sr_knn_train[, 1:23], method = c("center", "scale"))
fr_sr_knn_train_norm[, 1:23] <- predict(norm_values3, fr_sr_knn_train[, 1:23])
fr_sr_knn_valid_norm[, 1:23] <- predict(norm_values3, fr_sr_knn_valid[, 1:23])
fr_sr_knn_norm[, 1:23] <- predict(norm_values3, fr_sr_knn_norm[, 1:23])
fr_sr_knn_all <- knn(train = fr_sr_knn_train_norm[, 1:23],
test = fr_sr_knn_valid_norm[, 1:23],
cl = fr_sr_knn_train_norm[, 24], k = 3)
accuracy_sr_all <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
for (i in 1:20) {
knn_pred_sr_all <- knn(train = fr_sr_knn_train_norm[, 1:23],
test = fr_sr_knn_valid_norm[, 1:23],
cl = fr_sr_knn_train_norm[, 24], k = i)
accuracy_sr_all[i, 2] <- confusionMatrix(knn_pred_sr_all, fr_sr_knn_valid_norm[, 24])$overall[1]
}
accuracy_sr_all # k = 190 gives best accuracy - 20.25%
fr_sr_knn_all_best <- knn(train = fr_sr_knn_train_norm[, 1:23],
test = fr_sr_knn_valid_norm[, 1:23],
cl = fr_sr_knn_train_norm[, 24], k = 190)
fr_sr_knn_all_best_acc <- round(confusionMatrix(fr_sr_knn_all_best, fr_sr_knn_valid_norm[, 24])$overall[1],4)
model_stats_sr[9, c(1:3,7)] <- c("fr_sr_knn_all_best", "k-nearest neighbors",
"k-nearest neighbors model with all 7 wRC+ classifications",
fr_sr_knn_all_best_acc)
# Binary Outcome
wrc_fr_sr_knn_bin <- wrc_fr_sr_glm2[, -24]
set.seed(1819)
train.index1 <- sample(nrow(wrc_fr_sr_knn_bin), nrow(wrc_fr_sr_knn_bin) * 0.8)
fr_sr_knn_bin_train <- wrc_fr_sr_knn_bin[train.index1, ]
fr_sr_knn_bin_valid <- wrc_fr_sr_knn_bin[-train.index1, ]
fr_sr_knn_bin_train_norm <- fr_sr_knn_bin_train
fr_sr_knn_bin_valid_norm <- fr_sr_knn_bin_valid
fr_sr_knn_bin_norm <- wrc_fr_sr_knn_bin
norm_values3.5 <- preProcess(fr_sr_knn_bin_train[, 1:23], method = c("center", "scale"))
fr_sr_knn_bin_train_norm[, 1:23] <- predict(norm_values3.5, fr_sr_knn_bin_train[, 1:23])
fr_sr_knn_bin_valid_norm[, 1:23] <- predict(norm_values3.5, fr_sr_knn_bin_valid[, 1:23])
fr_sr_knn_bin_norm[, 1:23] <- predict(norm_values3.5, fr_sr_knn_bin_norm[, 1:23])
fr_sr_knn_bin <- knn(train = fr_sr_knn_bin_train_norm[, 1:23],
test = fr_sr_knn_bin_valid_norm[, 1:23],
cl = fr_sr_knn_bin_train_norm[, 24], k = 3)
accuracy_sr_bin <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
for (i in 1:20) {
knn_pred_sr_bin <- knn(train = fr_sr_knn_bin_train_norm[, 1:23],
test = fr_sr_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_sr_knn_bin_train_norm[, 24]), k = i)
accuracy_sr_bin[i, 2] <- confusionMatrix(knn_pred_sr_bin, as.factor(fr_sr_knn_bin_valid_norm[, 24]))$overall[1]
}
accuracy_sr_bin # k = 50 gives best accuracy - 61.55%
fr_sr_knn_bin_best <- knn(train = fr_sr_knn_bin_train_norm[, 1:23],
test = fr_sr_knn_bin_valid_norm[, 1:23],
cl = as.factor(fr_sr_knn_bin_train_norm[, 24]), k = 50)
fr_sr_knn_bin_best_acc <- round(confusionMatrix(fr_sr_knn_bin_best, as.factor(fr_sr_knn_bin_valid_norm[, 24]))$overall[1],4)
model_stats_sr[10, c(1:3,7)] <- c("fr_sr_knn_bin_best", "k-nearest neighbors",
"k-nearest neighbors model with binary wRC+ classifications",
fr_sr_knn_bin_best_acc)
# AAC Prediction using glm model
# Sophomore year
full <- read.csv("D1 Batting 2013-2021.csv")
full_aac_fr <- full[full$year==2021 & full$Yr=="Fr" & full$conference == "AAC" &
full$PA >= 32, ] # 30 eligable freshman players in the AAC in 2021
full_aac_fr <- full_aac_fr %>%
rename(PA.fr = PA,
BA.fr = BA,
SlgPct.fr = SlgPct,
BABIP.fr = BABIP,
wRAA.fr = wRAA,
wRCplus.fr = wRCplus,
R.fr = R,
X2B.fr = X2B,
X3B.fr = X3B,
TB.fr = TB,
HR.fr = HR,
HRpct.fr = HRpct,
RBI.fr = RBI,
BB.fr = BB,
K.fr = K,
BBpct.fr = BBpct,
Kpct.fr = Kpct,
HBP.fr = HBP,
SF.fr = SF,
SH.fr = SH,
SB.fr = SB,
CS.fr = CS,
SPD.fr = SPD)
full_aac_fr$wRCplus.so <- 0
full_aac_fr$wRCplus.jr <- 0
full_aac_fr$wRCplus.sr <- 0
aac_pred_so <- predict(glm_so_reduced, newdata = full_aac_fr, type = "response")
aac_pred_jr <- predict(glm_jr_reduced, newdata = full_aac_fr, type = "response")
aac_pred_sr <- predict(glm_sr_reduced, newdata = full_aac_fr, type = "response")
full_aac_fr <- cbind(full_aac_fr, aac_pred_so, aac_pred_jr, aac_pred_sr)
full_aac_fr <- full_aac_fr %>% mutate(So_Prediction =
case_when(aac_pred_so < 0.50 ~ "Below Average",
aac_pred_so >= 0.50 ~ "Above Average"))
full_aac_fr <- full_aac_fr %>% mutate(Jr_Prediction =
case_when(aac_pred_jr < 0.50 ~ "Below Average",
aac_pred_jr >= 0.50 ~ "Above Average"))
full_aac_fr <- full_aac_fr %>% mutate(Sr_Prediction =
case_when(aac_pred_sr < 0.50 ~ "Below Average",
aac_pred_sr >= 0.50 ~ "Above Average"))
aac_fr_predictions <- full_aac_fr[, c(3:4, 6, 8, 46:48)]