1 Abstract
2 Introduction
3 Preparation
- 3.1 Data definitions
- 3.2 Data cleaning
4 Methodolgy
- 4.1 Models
5 Statistical Models
- 5.1 Regression
- 5.2 Classification
6 Model Selection
- 6.1 Regression
- 6.2 Classification
7 Prediction
8 Appendix
- 8.1 Full code

1 Abstract

The goal of this research is to develop a model that could predict the future performance of NCAA baseball players based on their freshman year performance, defined through basic and advanced baseball statistics. I used publicly available data that was published by the NCAA and was collected for all Division I NCAA baseball players between the 2013 and 2021 seasons. I analyzed the full dataset and created appropriate subsets to build predictive models for sophomore, junior, and senior year performance based on a players freshman year statistics. I applied different statistical methods to create predictive models in attempt to achieve the highest possible accuracy. These methods were linear regression, logistics regression, and k-nearest neighbors for both regression and classification. I compared these models using appropriate assessment parameters selected a best model to make developmental predictions for the latest class of American Athletic Conference freshman baseball players. All model computations were done using R statistical software.

2 Introduction

College athletics introduces a unique problem that coaches must address in order to build the best team possible - players are only eligible for 4 seasons. It is important to understand how players will develop as early as possible in their collegiate careers. First year players often do not get many opportunities to play in games, so their statistics can be difficult to assess by individuals. Instead, I will attempt to build predictive models with this same set of freshman year statistics and give greater clarity to how a players freshman year performance relates to their future performance in their sophomore, junior, and senior seasons. Making this information available to coaches may assist them in recruiting, shifting positions, selecting the team, creating lineups, and more strategic decision making.

3 Preparation

3.1 Data definitions

The NCAA baseball data that I am using contains 39 different variables for each player from each season. Some of these are basic baseball statistics that have been used to measure performance for over 100 years. Other advanced statistics have become popular more recently. The following table defines these statistics from the MLB.com glossary and Fangraphs sabermetrics library.

I will be using wRC+ to measure performance and will use it as the response variable in each of the models. Fangraphs describes weighted runs created plus as a "rate statistic which attempts to credit a hitter for the value of each outcome (single, double, etc) rather than treating all hits or times on base equally, while also controlling for park effects and the current run environment. wRC+ is scaled so that league average is 100 each year and every point above or below 100 is equal to one percentage point better or worse than league average. This makes wRC+ a better representation of offensive value than batting average, RBI, OPS, or wOBA."

\[ wRC+ = \frac{(wRAA/PA + LgR/PA) + (LgR/PA - (\mbox{Park Factor} \times LgR/PA))}{(wRC/PA \mbox{ excluding pitchers})} \]

For classification modeling, I will group ranges of wRC+ into buckets based on estimates from Fangraphs. These buckets are Excellent, Great, Above Average, Average, Below Average, Poor and Awful.

3.2 Data cleaning

The data that I will use for this analysis is publicly availble from the NCAA. It contains basic and advanced baseball statistics for each Division I NCAA baseball player between the 2013 and 2021 seasons. The full dataset, before any manipulation, includes 45938 observations with 39 variables. There are 313 different schools and 21567 individual players represented in the data. Initially, I will remove row number, year, school, and conference, reducing the number of variables to 36.

length(unique(full$school)) # 313 unique schools represented

## [1] 313

length(unique(full$player_id)) # 21567 unique players represented

## [1] 21567

full <- full[, -c(1:3)] # remove row number, year, school, conference
dim(full) # 45938 rows, 36 variables

## [1] 45938    36

There are not many missing values in the data. There are 258 cases where the player ID = 0, and 179 cases where the year is missing. Each represents less than 1% of the data, and these rows will simply be removed.

  # Remove these
  
  full <- full[!(full$player_id==0), ]
  full <- full[!(full$Yr=="N/A"), ]
  dim(full) # 45505 rows, 36 variables

## [1] 45505    36

Next, it is important to set minimum requirements to qualify a player for this analysis. Players who do not have enough freshman year playing time may give unreliable predictive results. I will set the minimum number of plate appearances to be 32, equal to the first quantile. This ensures that a majority of the data is still available to be used in the model creation and is not so large that it is difficult for freshman players to meet it. This reduces the number of total observations to 34338.

## [1] 34338    36

Some of the variables in this dataset are very highly correlated, especially as some are variations of others. The correlation matrix shows which are highly corelated and will allow me to select variables to drop. These are: wOBA, GS, AB, OPS, H, ISO, OBpct, and GP. This reduces the number of total variables to 28.

##       Var1    Var2       Freq
## 627   wOBA wRCplus  0.9946347
## 296     PA      AB  0.9943010
## 259     GS      PA  0.9841904
## 295     GS      AB  0.9810809
## 517    OPS    wOBA  0.9808464
## 625    OPS wRCplus  0.9727984
## 444 SlgPct     OPS  0.9612833
## 693     AB       H  0.9607523
## 692     PA       H  0.9555079
## 478     BA     ISO -0.9478833
## 515  OBPct    wOBA  0.9435811
## 623  OBPct wRCplus  0.9387046
## 222     GP      GS  0.9298970
## 258     GP      PA  0.9283577
## 691     GS       H  0.9252221
## 294     GP      AB  0.9240593
## 703      R       H  0.9118883

  # Drop selected variables
  
  full <- full[, -c(6:7,9,11,13:15,20)]
  dim(full) # 34338 rows, 28 variables

## [1] 34338    28

Next, I will create the individual datasets that will be used to predict sophomore, junior, and senior year performance separately. These subsets will include a players freshman year stats and corresponding year stats in a single observation. This will limit the analysis to subsets of players with 2 years of stats, 3 years of stats, and 4 years of stats respectively. To predict sophomore years performance, I will be able to use observations from players with 2-year, 3-year, 4-year, and 5-year careers. To predict junior years performance, players with 3-year, 4-year, and 5-year careers. To predict senior year performance, only those players with 4-year and 5-year careers.

There are 4738 players that will be used to predict sophomore years performance, 3784 players to predict junior year performance, and 2443 players to predict senior year performance.

    nrow(stats_fr_so) # 4738 players to predict sophomore year stats

## [1] 4738

    nrow(stats_fr_jr) # 3784 players to predict junior year stats

## [1] 3784

    nrow(stats_fr_sr) # 2443 players to predict senior year stats

## [1] 2443

The final form of the datasets before modeling begins includes all selected freshman year stats, along the wRC+ value for the prediction year (so, jr, or sr).

4 Methodolgy

4.1 Models

I will use several variations of statistical models to create the best model for prediction of future wRC+. I will use a combination of regression models (predicting the value itself) and classification models (predicting a class that the value will fall into) to determine which model will be most reliable and produced useable information. Regression models

Regresssion Models

Multiple linear regression: Multiple linear regression works to fit a line to the data in order to assign coefficients to each predictor variable. The values of the predictors multiplied with their associated coefficients, along with an intercept and error term, are added together to formulate the predicted value. After running the model with all variables, you are able to perform hypothesis testing to determine which variables are significant predictors and adjust the model accordingly.

\[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \varepsilon \] \[ \mbox{where } y \mbox{ is the response variable (wRC+), } \\X_1, X_2, ..., X_p \mbox{ are the predictors, and } \\ \varepsilon \mbox{ is the random error.} \]

Multiple linear regression with interaction: Multiple linear regression with interaction works like multiple linear regression, with the addition of interaction terms. These interaction terms are careted by multiplying predictors together and allowing them to function as new variables. Similarly, we will be able to perform hypothesis testing to determine which variables are significant predictors and adjust the model accordingly. An example with two individual predictors is as follows.

\[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon \]

\[ \mbox{where } y \mbox{ is the response variable (wRC+), } \\X_1, \mbox{ } X_2, \mbox{ } X_1X_2 \mbox{ are the predictors, and } \\ \varepsilon \mbox{ is the random error.} \]

k-nearest neighbors for regression: k-nearest neighbors is a machine learning algorithm that groups similar observations together. Knn calculates the Euclidean distance between a set of test observations and a new observation to be predicted. Since differently scaled variables may skew the distances, all predictor variables will be normalized before training the model. The algorithm selected the k (manually selected) nearest observations to the new observation and averages the response variable of those nearest observations to predict the new response value.

\[ \mbox{Euclidean Distance} = \sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2} \]

\[ \mbox{where } k \mbox{ is the number of dimensions in the data, } \\x_i, \mbox{ is the observed predictor value, and } \\ y_i \mbox{ is the new data point to be predicted.} \]

Classification Models

Logistic regression: Logistic regression works similar to multiple linear regression where coefficients are assigned to each predictor variables. However, the output is not a prediction of the value of the response variable. Logistic regression produces a probability, a value between 0 and 1, that a certain observation will belong to one class or another. You will notice the presence of the linear regression function within the function that produces this probability.

\[ p = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + ... + \beta_p X_p)}} \] \[ \mbox{where } p \mbox{ is the probability that an observation belongs to a certain class.} \]

K-nearest neighbors for classification: Knn for classification works very similar to knn for regression. However, rather than averaging the nearest k observations, a simple majority will determine the prediction for the new observation.

4.1.1 Assessment Parameters

Regression models and classification models will be compared separately to determine the best models. The parameters which will determine the efficacy of the regression models are adjusted R^2 and root mean square error (RMSE). The paramteres which will determine the efficacy of the classification models are Akaike information criterion (AIC) and accuracy.

Regression Models

Adj. R²: for multiple linear regression models, R² and adjusted R² are provided as outputs for the model. R² can be explained as the proportion of the variation in the data that can be explained by the model. A perfect model has an R² value of 1.00, with values above 0.65 generally being considered acceptable, depending on the circumstances of the prediction. \[ R^2 = 1 - \frac{RSS}{TSS} \\ \mbox{where } R^2 = \mbox{coefficient of determination} \\ RSS = \mbox{sum of squares of residuals} \\ \mbox{and } TSS = \mbox{total sum of squares} \] Adjusted R² is a modified version of R² that has been adjusted for the number of predictors in the model. It increases when new terms improve the model more than would be expected by chance and decreases when predictors improve the model less than expected. This discourages overfitting of a model to improve the R² value.

\[ R^2_{adj} = 1 - [\frac{(1-R^2)(n-1)}{n-k-1}] \] \[ \mbox{where } n \mbox{ is the number of points in the data sample and} \\ k \mbox{ is the number of predictors in the model} \]

RMSE: Root mean square error is the standard deviation of the residuals or prediction errors. It is a measure of the average distance between the predicted values and the actual values in the dataset. A lower RMSE indicates a lower average error and better model for prediction.

\[ RMSE = \sqrt{\sum\limits_{i=1}^n \frac{(\hat{y_i} - y_i)^2}{n}} \]

\[ \mbox{where } \hat{y_i} \mbox{is the predicted value, } \\ y_i \mbox{ is the actual value, and } \\ n \mbox{ is the number of observations.} \] Classification Models

AIC: Akaike information criterion is used to compare statistical models and determine which one is the best fit for the data. It grades a best fit model as the one that explains the greatest amount of variation using the fewest possible independent variables. A lower AIC indicates a better model.

\[ AIC = 2K -2ln(L) \\\mbox{where } K \mbox{ is the number of predictors used in the model and } \\ L \mbox{ is the log-likelihood estimate.} \]

Accuracy: for each classification model, I will use the trained model to predict on a test or validation set of data. Accuracy is simply the ratio of correct predictions to all predictions made. It will be represented as a value between 0 and 1 and can be interpreted as a percentage.

5 Statistical Models

5.1 Regression

5.1.0.1 Multiple Linear Regression

# Sophomore year
      
      full_lm_so <- lm(wRCplus.so ~ ., data = wrc_fr_so)
      summary(full_lm_so) # adj R^2 = 0.1377

## 
## Call:
## lm(formula = wRCplus.so ~ ., data = wrc_fr_so)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -179.349  -23.228    0.972   24.551  159.310 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   86.44335    9.00656   9.598  < 2e-16 ***
## PA.fr          0.06048    0.04881   1.239  0.21544    
## BA.fr         -0.36557    2.16270  -0.169  0.86578    
## SlgPct.fr     51.45628   38.62222   1.332  0.18283    
## BABIP.fr    -103.55894   34.71738  -2.983  0.00287 ** 
## wRAA.fr        0.52093    0.25886   2.012  0.04423 *  
## wRCplus.fr     0.13535    0.08387   1.614  0.10666    
## R.fr           0.07105    0.16212   0.438  0.66120    
## X2B.fr         0.11991    0.38441   0.312  0.75511    
## X3B.fr         0.32391    0.95632   0.339  0.73485    
## TB.fr          0.02640    0.02483   1.063  0.28782    
## HR.fr          1.36835    0.72380   1.891  0.05875 .  
## HRpct.fr      -2.94547    1.81991  -1.618  0.10563    
## RBI.fr         0.16932    0.14221   1.191  0.23387    
## BB.fr          0.18052    0.23889   0.756  0.44988    
## K.fr          -0.24261    0.13558  -1.789  0.07362 .  
## BBpct.fr       0.22320    0.39072   0.571  0.56786    
## Kpct.fr        0.28900    0.22336   1.294  0.19578    
## HBP.fr         0.35951    0.29674   1.212  0.22575    
## SF.fr          0.82697    0.53224   1.554  0.12031    
## SH.fr         -1.08922    0.25286  -4.308 1.68e-05 ***
## SB.fr          0.52107    0.22661   2.299  0.02152 *  
## CS.fr          0.23243    0.46973   0.495  0.62075    
## SPD.fr        -0.38131    0.60713  -0.628  0.52999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.36 on 4714 degrees of freedom
## Multiple R-squared:  0.1419, Adjusted R-squared:  0.1377 
## F-statistic:  33.9 on 23 and 4714 DF,  p-value: < 2.2e-16

To determine if the overall model is useful for predicting wRC+, we can perform an ANOVA test with the following null and alternative hypothesis:

\[ H_0: \beta_1 = \beta_2 = ... = \beta_{23} = 0 \\ H_1: \mbox{not } H_0 \]

The p-value for this test is <2.2e-16 (essentially 0). Therefore, we are able to reject the null hypothesis and conclude that the overall model is useful for predicting wRC+. The adjusted R² value of 0.1377 is rather low however.

To determine which predictors are singificant for this model, we perform a t test with the following null and alternative hypothesis for each predictor separately:

\[ H_0: \beta_1 = 0 \\ H_1: \beta_1 \neq 0 \\ . \\ . \\ . \\ H_0: \beta_{23} = 0 \\ H_1: \beta_{23} \neq 0 \]

I will use an alpha of 0.10. In the model output above, predictors with a p-value < 0.10 are denoted with one, two, or three asterisks, or a period. For these predictors, we can reject the null hypothesis and conclude that it is a significant predictor of wRC+. The significant predictors in this model are BABIP.fr, wRAA.fr, HR.fr, K.fr, SH.fr, and SB.fr.

To achieve a parsimonious model (the simplest model with good predictive power), I will remove the variables that are not significant at an aplpha = 0.10 level.

reduced_model_so <- lm(wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + SH.fr + SB.fr, data = wrc_fr_so)
      summary(reduced_model_so) # adj R^2 = 0.1183

## 
## Call:
## lm(formula = wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + 
##     SH.fr + SB.fr, data = wrc_fr_so)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -176.222  -23.489    1.131   25.018  154.665 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 109.12088    3.83598  28.447  < 2e-16 ***
## BABIP.fr    -57.73230   12.42174  -4.648 3.45e-06 ***
## wRAA.fr       1.76403    0.14003  12.597  < 2e-16 ***
## HR.fr         1.20080    0.39412   3.047  0.00233 ** 
## K.fr          0.24458    0.06183   3.956 7.74e-05 ***
## SH.fr        -0.38153    0.22527  -1.694  0.09040 .  
## SB.fr         0.79551    0.16172   4.919 8.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.8 on 4731 degrees of freedom
## Multiple R-squared:  0.1195, Adjusted R-squared:  0.1183 
## F-statistic:   107 on 6 and 4731 DF,  p-value: < 2.2e-16

The reduced model is:

\[ \hat{y} = 109.12 + (-57.73)\mbox{BABIP.fr} + 1.76\mbox{wRAA.fr} + 1.20\mbox{HR.fr} + 0.24\mbox{K.fr} + (-0.38)\mbox{SH.fr} + 0.80\mbox{SB.fr} \] Since the reduced model is nested in the full model we can compare them using an anova test. Since the reduced model has 6 predictors and the full model has 17 additional predictors, the following are the null and alternative hypothesis:

\[ H_0: \beta_7 = \beta_8 = ... = \beta_23 = 0 \\ H_1: \mbox{not }H_0 \]

anova(reduced_model_so, full_lm_so)

## Analysis of Variance Table
## 
## Model 1: wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + SH.fr + SB.fr
## Model 2: wRCplus.so ~ PA.fr + BA.fr + SlgPct.fr + BABIP.fr + wRAA.fr + 
##     wRCplus.fr + R.fr + X2B.fr + X3B.fr + TB.fr + HR.fr + HRpct.fr + 
##     RBI.fr + BB.fr + K.fr + BBpct.fr + Kpct.fr + HBP.fr + SF.fr + 
##     SH.fr + SB.fr + CS.fr + SPD.fr
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   4731 7494853                                  
## 2   4714 7303490 17    191364 7.2656 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of <2.2e-16 allows us to reject the null hypothesis and conclude that the full model with its 17 additional predictors is a significantly more useful model for predicting wRC+.

After repeating this exercise for junior and senior year predictions, we have the following results:

5.1.0.2 Multiple Linear Regression with Interaction

The interaction models function the same as the first order linear regression models with the addition of interaction terms. Each predictor has been multiplied with the others, bringing the total number of predictors to 276. This will be cut down based on which predictors are found to be significant.

# Sophomore year
      
      int_vars_so <- wrc_fr_so[, -24]
      wrc_int_so <- cbind(int_vars_so^2, do.call(cbind, combn(colnames(int_vars_so), 2,
                                                                  FUN=function(x) list(int_vars_so[x[1]]*int_vars_so[x[2]]))))
      colnames(wrc_int_so)[-(seq_len(ncol(int_vars_so)))] <- combn(colnames(int_vars_so), 2, 
                                                                  FUN = paste, collapse=":")
      wrc_int_so$wRCplus.so <- wrc_fr_so$wRCplus.so

ncol(wrc_int_so) # 276 total predictors with 1 response variable

## [1] 277

      full_int_so <- lm(wRCplus.so ~ ., data = wrc_int_so)
      summary(full_int_so) # adj R^2 = 0.1417

## 
## Call:
## lm(formula = wRCplus.so ~ ., data = wrc_int_so)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -182.675  -22.736    0.799   23.862  140.112 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.686e+01  1.682e+01   3.380  0.00073 ***
## PA.fr                   2.154e-03  3.404e-03   0.633  0.52698    
## BA.fr                  -3.289e+01  4.253e+01  -0.773  0.43938    
## SlgPct.fr               4.694e+02  1.708e+03   0.275  0.78350    
## BABIP.fr                2.307e+02  1.728e+03   0.133  0.89383    
## wRAA.fr                -2.663e-02  9.509e-02  -0.280  0.77942    
## wRCplus.fr             -3.997e-04  6.098e-03  -0.066  0.94775    
## R.fr                    4.082e-02  2.980e-02   1.370  0.17087    
## X2B.fr                  6.719e-02  2.473e-01   0.272  0.78589    
## X3B.fr                  8.318e-01  1.373e+00   0.606  0.54478    
## TB.fr                   6.304e-02  2.573e-02   2.450  0.01433 *  
## HR.fr                  -1.505e-01  7.623e-01  -0.197  0.84355    
## HRpct.fr               -1.670e+00  3.356e+00  -0.498  0.61880    
## RBI.fr                 -4.472e-02  2.189e-02  -2.042  0.04116 *  
## BB.fr                   2.799e-02  9.164e-02   0.305  0.76003    
## K.fr                    1.268e-02  2.563e-02   0.495  0.62093    
## BBpct.fr                6.864e-02  1.194e-01   0.575  0.56552    
## Kpct.fr                 4.662e-02  6.317e-02   0.738  0.46061    
## HBP.fr                 -3.371e-02  8.801e-02  -0.383  0.70172    
## SF.fr                   2.942e-01  3.165e-01   0.929  0.35272    
## SH.fr                   2.175e-02  6.861e-02   0.317  0.75124    
## SB.fr                  -4.587e-02  4.233e-02  -1.084  0.27856    
## CS.fr                   7.048e-02  2.461e-01   0.286  0.77460    
## SPD.fr                  5.682e-01  5.032e-01   1.129  0.25890    
## `PA.fr:BA.fr`           7.827e+00  1.516e+01   0.516  0.60569    
## `PA.fr:SlgPct.fr`       6.487e+00  8.608e+00   0.754  0.45115    
## `PA.fr:BABIP.fr`       -1.998e+01  1.402e+01  -1.425  0.15416    
## `PA.fr:wRAA.fr`         1.146e-02  3.019e-02   0.380  0.70418    
## `PA.fr:wRCplus.fr`      1.099e-02  8.713e-03   1.262  0.20710    
## `PA.fr:R.fr`           -2.808e-02  1.532e-02  -1.833  0.06688 .  
## `PA.fr:X2B.fr`          5.976e-03  4.610e-02   0.130  0.89686    
## `PA.fr:X3B.fr`         -1.833e-01  1.065e-01  -1.721  0.08524 .  
## `PA.fr:TB.fr`          -1.992e-02  9.325e-03  -2.136  0.03272 *  
## `PA.fr:HR.fr`          -2.464e-03  8.286e-02  -0.030  0.97628    
## `PA.fr:HRpct.fr`        1.690e-01  4.072e-01   0.415  0.67812    
## `PA.fr:RBI.fr`          1.015e-02  1.372e-02   0.740  0.45949    
## `PA.fr:BB.fr`           8.929e-04  2.957e-02   0.030  0.97591    
## `PA.fr:K.fr`           -9.624e-03  1.442e-02  -0.667  0.50466    
## `PA.fr:BBpct.fr`        6.435e-02  7.203e-02   0.893  0.37168    
## `PA.fr:Kpct.fr`         2.650e-03  4.166e-02   0.064  0.94928    
## `PA.fr:HBP.fr`         -2.545e-02  2.909e-02  -0.875  0.38179    
## `PA.fr:SF.fr`          -2.956e-02  4.933e-02  -0.599  0.54907    
## `PA.fr:SH.fr`           2.038e-02  2.499e-02   0.816  0.41479    
## `PA.fr:SB.fr`           3.526e-02  2.434e-02   1.449  0.14751    
## `PA.fr:CS.fr`           2.350e-02  4.158e-02   0.565  0.57205    
## `PA.fr:SPD.fr`          5.394e-02  6.892e-02   0.783  0.43386    
## `BA.fr:SlgPct.fr`       1.393e+03  3.239e+03   0.430  0.66709    
## `BA.fr:BABIP.fr`       -2.328e+03  3.068e+03  -0.759  0.44801    
## `BA.fr:wRAA.fr`        -1.221e+01  5.787e+01  -0.211  0.83296    
## `BA.fr:wRCplus.fr`      1.825e+00  9.192e+00   0.199  0.84265    
## `BA.fr:R.fr`            1.828e+01  2.655e+01   0.689  0.49113    
## `BA.fr:X2B.fr`         -1.597e+01  7.216e+01  -0.221  0.82491    
## `BA.fr:X3B.fr`         -1.016e+02  1.641e+02  -0.619  0.53568    
## `BA.fr:TB.fr`          -9.226e+00  5.752e+00  -1.604  0.10876    
## `BA.fr:HR.fr`          -2.955e+01  1.880e+02  -0.157  0.87511    
## `BA.fr:HRpct.fr`       -2.115e+02  1.686e+02  -1.254  0.20992    
## `BA.fr:RBI.fr`          4.928e+01  2.558e+01   1.926  0.05414 .  
## `BA.fr:BB.fr`          -4.137e+01  5.917e+01  -0.699  0.48443    
## `BA.fr:K.fr`           -1.317e+01  3.198e+01  -0.412  0.68051    
## `BA.fr:BBpct.fr`        1.714e+01  3.541e+01   0.484  0.62847    
## `BA.fr:Kpct.fr`         1.290e+01  2.354e+01   0.548  0.58352    
## `BA.fr:HBP.fr`          1.434e+01  5.206e+01   0.275  0.78299    
## `BA.fr:SF.fr`          -3.032e+01  9.926e+01  -0.305  0.76005    
## `BA.fr:SH.fr`          -4.938e+01  4.750e+01  -1.040  0.29856    
## `BA.fr:SB.fr`          -3.938e+01  4.778e+01  -0.824  0.40989    
## `BA.fr:CS.fr`          -3.356e+01  7.091e+01  -0.473  0.63601    
## `BA.fr:SPD.fr`          4.801e+01  7.089e+01   0.677  0.49830    
## `SlgPct.fr:BABIP.fr`    1.439e+03  2.480e+03   0.580  0.56168    
## `SlgPct.fr:wRAA.fr`     4.132e+01  4.017e+01   1.029  0.30367    
## `SlgPct.fr:wRCplus.fr` -7.586e+00  5.625e+00  -1.349  0.17749    
## `SlgPct.fr:R.fr`       -1.990e+01  1.756e+01  -1.133  0.25732    
## `SlgPct.fr:X2B.fr`     -2.343e+01  4.642e+01  -0.505  0.61369    
## `SlgPct.fr:X3B.fr`      3.762e+01  1.098e+02   0.343  0.73195    
## `SlgPct.fr:TB.fr`       1.363e+00  3.562e+00   0.383  0.70201    
## `SlgPct.fr:HR.fr`       7.013e+00  1.119e+02   0.063  0.95002    
## `SlgPct.fr:HRpct.fr`    5.551e+01  1.434e+02   0.387  0.69863    
## `SlgPct.fr:RBI.fr`     -2.412e+01  1.638e+01  -1.473  0.14083    
## `SlgPct.fr:BB.fr`       3.022e+00  3.707e+01   0.082  0.93503    
## `SlgPct.fr:K.fr`        8.654e+00  2.406e+01   0.360  0.71913    
## `SlgPct.fr:BBpct.fr`    7.293e+00  2.191e+01   0.333  0.73925    
## `SlgPct.fr:Kpct.fr`    -1.195e+01  1.557e+01  -0.767  0.44298    
## `SlgPct.fr:HBP.fr`     -2.305e+01  3.552e+01  -0.649  0.51650    
## `SlgPct.fr:SF.fr`      -1.139e+01  6.018e+01  -0.189  0.84994    
## `SlgPct.fr:SH.fr`      -1.936e+01  3.133e+01  -0.618  0.53661    
## `SlgPct.fr:SB.fr`       4.553e+00  3.136e+01   0.145  0.88456    
## `SlgPct.fr:CS.fr`       4.649e+01  5.128e+01   0.907  0.36471    
## `SlgPct.fr:SPD.fr`     -1.610e+01  4.701e+01  -0.342  0.73205    
## `BABIP.fr:wRAA.fr`     -5.410e+01  3.377e+01  -1.602  0.10927    
## `BABIP.fr:wRCplus.fr`   2.594e+00  7.139e+00   0.363  0.71633    
## `BABIP.fr:R.fr`         1.867e+01  1.903e+01   0.981  0.32661    
## `BABIP.fr:X2B.fr`      -3.948e-01  5.507e+01  -0.007  0.99428    
## `BABIP.fr:X3B.fr`       1.101e+02  1.167e+02   0.944  0.34536    
## `BABIP.fr:TB.fr`        4.400e+00  4.142e+00   1.062  0.28813    
## `BABIP.fr:HR.fr`       -7.403e+01  9.397e+01  -0.788  0.43085    
## `BABIP.fr:HRpct.fr`     4.038e+01  1.181e+02   0.342  0.73234    
## `BABIP.fr:RBI.fr`      -1.184e+01  1.765e+01  -0.671  0.50245    
## `BABIP.fr:BB.fr`        3.303e+01  3.486e+01   0.947  0.34346    
## `BABIP.fr:K.fr`         2.704e+01  1.884e+01   1.435  0.15128    
## `BABIP.fr:BBpct.fr`    -1.512e+01  2.635e+01  -0.574  0.56618    
## `BABIP.fr:Kpct.fr`     -7.836e+00  2.058e+01  -0.381  0.70334    
## `BABIP.fr:HBP.fr`       8.971e+00  3.605e+01   0.249  0.80349    
## `BABIP.fr:SF.fr`        5.264e+00  7.048e+01   0.075  0.94047    
## `BABIP.fr:SH.fr`        7.047e+01  3.737e+01   1.886  0.05939 .  
## `BABIP.fr:SB.fr`        2.635e+01  3.276e+01   0.804  0.42130    
## `BABIP.fr:CS.fr`       -2.343e-01  5.406e+01  -0.004  0.99654    
## `BABIP.fr:SPD.fr`      -7.876e+01  5.338e+01  -1.475  0.14019    
## `wRAA.fr:wRCplus.fr`    7.235e-02  4.539e-02   1.594  0.11100    
## `wRAA.fr:R.fr`         -6.061e-02  8.102e-02  -0.748  0.45443    
## `wRAA.fr:X2B.fr`       -1.160e-01  2.598e-01  -0.447  0.65517    
## `wRAA.fr:X3B.fr`       -7.748e-01  5.905e-01  -1.312  0.18952    
## `wRAA.fr:TB.fr`        -8.213e-02  3.083e-02  -2.664  0.00776 ** 
## `wRAA.fr:HR.fr`         5.811e-01  4.359e-01   1.333  0.18253    
## `wRAA.fr:HRpct.fr`     -5.047e-01  1.810e+00  -0.279  0.78044    
## `wRAA.fr:RBI.fr`        5.709e-02  6.813e-02   0.838  0.40217    
## `wRAA.fr:BB.fr`         1.040e-02  1.550e-01   0.067  0.94651    
## `wRAA.fr:K.fr`         -2.624e-02  7.961e-02  -0.330  0.74170    
## `wRAA.fr:BBpct.fr`      1.499e-01  3.377e-01   0.444  0.65711    
## `wRAA.fr:Kpct.fr`       8.256e-03  1.920e-01   0.043  0.96571    
## `wRAA.fr:HBP.fr`       -2.113e-01  1.379e-01  -1.533  0.12540    
## `wRAA.fr:SF.fr`        -3.137e-01  2.435e-01  -1.289  0.19763    
## `wRAA.fr:SH.fr`         3.646e-02  1.241e-01   0.294  0.76890    
## `wRAA.fr:SB.fr`         1.678e-01  1.199e-01   1.399  0.16179    
## `wRAA.fr:CS.fr`        -1.422e-02  2.305e-01  -0.062  0.95081    
## `wRAA.fr:SPD.fr`       -5.413e-02  3.600e-01  -0.150  0.88051    
## `wRCplus.fr:R.fr`      -2.911e-02  2.667e-02  -1.091  0.27513    
## `wRCplus.fr:X2B.fr`     3.799e-02  7.788e-02   0.488  0.62568    
## `wRCplus.fr:X3B.fr`    -8.684e-02  1.766e-01  -0.492  0.62286    
## `wRCplus.fr:TB.fr`      9.638e-03  5.352e-03   1.801  0.07175 .  
## `wRCplus.fr:HR.fr`     -2.209e-01  1.385e-01  -1.595  0.11076    
## `wRCplus.fr:HRpct.fr`   2.654e-01  2.354e-01   1.127  0.25976    
## `wRCplus.fr:RBI.fr`    -9.775e-03  2.534e-02  -0.386  0.69972    
## `wRCplus.fr:BB.fr`     -3.484e-02  4.450e-02  -0.783  0.43377    
## `wRCplus.fr:K.fr`       1.267e-02  2.436e-02   0.520  0.60310    
## `wRCplus.fr:BBpct.fr`  -6.792e-03  4.643e-02  -0.146  0.88369    
## `wRCplus.fr:Kpct.fr`    1.118e-02  3.553e-02   0.315  0.75308    
## `wRCplus.fr:HBP.fr`     5.836e-02  4.746e-02   1.230  0.21888    
## `wRCplus.fr:SF.fr`      4.092e-02  8.920e-02   0.459  0.64644    
## `wRCplus.fr:SH.fr`     -2.383e-02  4.395e-02  -0.542  0.58772    
## `wRCplus.fr:SB.fr`     -1.792e-02  4.459e-02  -0.402  0.68774    
## `wRCplus.fr:CS.fr`     -2.895e-02  7.823e-02  -0.370  0.71131    
## `wRCplus.fr:SPD.fr`     1.344e-01  8.143e-02   1.651  0.09879 .  
## `R.fr:X2B.fr`           1.711e-01  1.301e-01   1.315  0.18843    
## `R.fr:X3B.fr`           6.894e-01  3.157e-01   2.184  0.02903 *  
## `R.fr:TB.fr`           -9.441e-03  6.041e-03  -1.563  0.11816    
## `R.fr:HR.fr`           -1.137e-01  2.323e-01  -0.489  0.62467    
## `R.fr:HRpct.fr`         1.309e+00  7.769e-01   1.686  0.09196 .  
## `R.fr:RBI.fr`          -4.219e-02  3.710e-02  -1.137  0.25542    
## `R.fr:BB.fr`            2.399e-03  7.453e-02   0.032  0.97433    
## `R.fr:K.fr`             5.713e-02  3.994e-02   1.430  0.15267    
## `R.fr:BBpct.fr`         1.926e-01  1.279e-01   1.506  0.13209    
## `R.fr:Kpct.fr`         -1.837e-01  9.929e-02  -1.850  0.06437 .  
## `R.fr:HBP.fr`           1.116e-01  7.596e-02   1.469  0.14197    
## `R.fr:SF.fr`            1.764e-01  1.456e-01   1.212  0.22562    
## `R.fr:SH.fr`           -5.938e-03  6.777e-02  -0.088  0.93018    
## `R.fr:SB.fr`           -1.414e-01  6.399e-02  -2.209  0.02723 *  
## `R.fr:CS.fr`            1.526e-01  1.260e-01   1.212  0.22574    
## `R.fr:SPD.fr`          -8.717e-02  2.448e-01  -0.356  0.72183    
## `X2B.fr:X3B.fr`        -3.462e-01  9.562e-01  -0.362  0.71733    
## `X2B.fr:TB.fr`         -4.792e-02  2.825e-02  -1.696  0.08992 .  
## `X2B.fr:HR.fr`         -1.630e-01  6.530e-01  -0.250  0.80285    
## `X2B.fr:HRpct.fr`      -5.711e-01  2.369e+00  -0.241  0.80948    
## `X2B.fr:RBI.fr`         1.304e-01  1.135e-01   1.149  0.25060    
## `X2B.fr:BB.fr`         -2.553e-01  2.347e-01  -1.088  0.27661    
## `X2B.fr:K.fr`          -6.055e-02  1.188e-01  -0.510  0.61025    
## `X2B.fr:BBpct.fr`       1.464e-01  4.988e-01   0.294  0.76906    
## `X2B.fr:Kpct.fr`        1.030e-01  3.562e-01   0.289  0.77240    
## `X2B.fr:HBP.fr`         8.466e-02  2.466e-01   0.343  0.73135    
## `X2B.fr:SF.fr`          4.091e-01  4.284e-01   0.955  0.33961    
## `X2B.fr:SH.fr`          8.584e-02  2.317e-01   0.370  0.71104    
## `X2B.fr:SB.fr`         -2.440e-01  2.229e-01  -1.095  0.27378    
## `X2B.fr:CS.fr`          3.230e-03  3.912e-01   0.008  0.99341    
## `X2B.fr:SPD.fr`        -6.803e-02  5.915e-01  -0.115  0.90844    
## `X3B.fr:TB.fr`         -1.531e-01  6.003e-02  -2.550  0.01081 *  
## `X3B.fr:HR.fr`         -1.570e+00  1.545e+00  -1.016  0.30985    
## `X3B.fr:HRpct.fr`       7.831e-01  5.297e+00   0.148  0.88249    
## `X3B.fr:RBI.fr`         6.172e-01  2.759e-01   2.237  0.02533 *  
## `X3B.fr:BB.fr`          2.201e-01  5.386e-01   0.409  0.68277    
## `X3B.fr:K.fr`           2.106e-01  2.635e-01   0.800  0.42401    
## `X3B.fr:BBpct.fr`      -4.353e-02  1.104e+00  -0.039  0.96856    
## `X3B.fr:Kpct.fr`       -9.950e-01  7.863e-01  -1.265  0.20579    
## `X3B.fr:HBP.fr`         9.188e-01  5.393e-01   1.704  0.08851 .  
## `X3B.fr:SF.fr`         -3.222e-01  1.016e+00  -0.317  0.75114    
## `X3B.fr:SH.fr`         -8.008e-02  5.049e-01  -0.159  0.87400    
## `X3B.fr:SB.fr`         -4.175e-01  4.461e-01  -0.936  0.34943    
## `X3B.fr:CS.fr`          2.598e-01  8.047e-01   0.323  0.74684    
## `X3B.fr:SPD.fr`        -2.004e+00  1.438e+00  -1.393  0.16355    
## `TB.fr:HR.fr`          -1.302e-01  6.404e-02  -2.034  0.04204 *  
## `TB.fr:HRpct.fr`       -2.051e-02  1.447e-01  -0.142  0.88726    
## `TB.fr:RBI.fr`          1.754e-04  5.042e-03   0.035  0.97225    
## `TB.fr:BB.fr`           5.670e-02  2.442e-02   2.322  0.02030 *  
## `TB.fr:K.fr`           -6.856e-03  7.257e-03  -0.945  0.34485    
## `TB.fr:BBpct.fr`       -2.695e-02  2.842e-02  -0.948  0.34309    
## `TB.fr:Kpct.fr`        -1.287e-02  2.127e-02  -0.605  0.54514    
## `TB.fr:HBP.fr`          3.711e-02  2.227e-02   1.667  0.09565 .  
## `TB.fr:SF.fr`           8.028e-03  1.997e-02   0.402  0.68777    
## `TB.fr:SH.fr`           1.390e-03  1.495e-02   0.093  0.92592    
## `TB.fr:SB.fr`           1.348e-03  9.342e-03   0.144  0.88526    
## `TB.fr:CS.fr`           1.999e-02  1.819e-02   1.099  0.27179    
## `TB.fr:SPD.fr`          3.306e-02  3.421e-02   0.967  0.33379    
## `HR.fr:HRpct.fr`       -2.288e+00  4.002e+00  -0.572  0.56748    
## `HR.fr:RBI.fr`          5.199e-03  1.910e-01   0.027  0.97828    
## `HR.fr:BB.fr`          -2.659e-01  4.144e-01  -0.642  0.52116    
## `HR.fr:K.fr`            2.233e-01  2.073e-01   1.077  0.28133    
## `HR.fr:BBpct.fr`       -4.690e+00  3.636e+00  -1.290  0.19719    
## `HR.fr:Kpct.fr`         2.983e+00  1.992e+00   1.498  0.13424    
## `HR.fr:HBP.fr`          6.596e-01  4.228e-01   1.560  0.11884    
## `HR.fr:SF.fr`           5.257e-02  7.183e-01   0.073  0.94166    
## `HR.fr:SH.fr`           5.510e-01  5.003e-01   1.101  0.27088    
## `HR.fr:SB.fr`          -2.374e-01  4.057e-01  -0.585  0.55848    
## `HR.fr:CS.fr`           2.032e-01  8.116e-01   0.250  0.80228    
## `HR.fr:SPD.fr`          5.684e-01  9.436e-01   0.602  0.54698    
## `HRpct.fr:RBI.fr`       9.356e-01  6.605e-01   1.417  0.15666    
## `HRpct.fr:BB.fr`        4.253e+00  3.825e+00   1.112  0.26631    
## `HRpct.fr:K.fr`        -2.911e+00  1.924e+00  -1.513  0.13037    
## `HRpct.fr:BBpct.fr`    -1.876e-01  9.710e-01  -0.193  0.84679    
## `HRpct.fr:Kpct.fr`     -6.931e-02  7.201e-01  -0.096  0.92333    
## `HRpct.fr:HBP.fr`      -1.151e+00  1.494e+00  -0.770  0.44134    
## `HRpct.fr:SF.fr`       -3.190e-01  2.518e+00  -0.127  0.89920    
## `HRpct.fr:SH.fr`        4.826e-01  1.513e+00   0.319  0.74981    
## `HRpct.fr:SB.fr`        3.243e-01  1.443e+00   0.225  0.82216    
## `HRpct.fr:CS.fr`       -1.149e+00  2.443e+00  -0.470  0.63819    
## `HRpct.fr:SPD.fr`      -1.734e+00  2.036e+00  -0.852  0.39446    
## `RBI.fr:BB.fr`          2.903e-02  6.857e-02   0.423  0.67206    
## `RBI.fr:K.fr`          -2.002e-02  3.395e-02  -0.590  0.55548    
## `RBI.fr:BBpct.fr`      -1.002e-01  1.172e-01  -0.855  0.39266    
## `RBI.fr:Kpct.fr`        6.998e-02  8.913e-02   0.785  0.43243    
## `RBI.fr:HBP.fr`        -9.240e-02  6.671e-02  -1.385  0.16611    
## `RBI.fr:SF.fr`         -1.314e-01  1.171e-01  -1.122  0.26199    
## `RBI.fr:SH.fr`         -4.668e-02  6.148e-02  -0.759  0.44772    
## `RBI.fr:SB.fr`          4.840e-02  6.094e-02   0.794  0.42714    
## `RBI.fr:CS.fr`         -2.667e-01  1.098e-01  -2.428  0.01522 *  
## `RBI.fr:SPD.fr`        -2.287e-02  1.940e-01  -0.118  0.90615    
## `BB.fr:K.fr`            1.186e-02  7.353e-02   0.161  0.87188    
## `BB.fr:BBpct.fr`       -2.157e-01  2.419e-01  -0.892  0.37260    
## `BB.fr:Kpct.fr`        -1.508e+00  7.148e-01  -2.109  0.03498 *  
## `BB.fr:HBP.fr`          7.339e-02  1.512e-01   0.485  0.62751    
## `BB.fr:SF.fr`          -9.890e-02  2.600e-01  -0.380  0.70365    
## `BB.fr:SH.fr`          -4.050e-02  1.308e-01  -0.310  0.75690    
## `BB.fr:SB.fr`          -9.939e-02  1.218e-01  -0.816  0.41454    
## `BB.fr:CS.fr`          -9.369e-02  2.396e-01  -0.391  0.69583    
## `BB.fr:SPD.fr`         -1.798e-01  3.222e-01  -0.558  0.57695    
## `K.fr:BBpct.fr`         1.445e+00  7.027e-01   2.057  0.03979 *  
## `K.fr:Kpct.fr`         -1.008e-01  8.381e-02  -1.203  0.22922    
## `K.fr:HBP.fr`          -3.593e-02  7.494e-02  -0.479  0.63167    
## `K.fr:SF.fr`            7.433e-02  1.347e-01   0.552  0.58110    
## `K.fr:SH.fr`           -8.374e-02  7.212e-02  -1.161  0.24569    
## `K.fr:SB.fr`            3.629e-03  6.454e-02   0.056  0.95516    
## `K.fr:CS.fr`           -9.596e-02  1.157e-01  -0.829  0.40708    
## `K.fr:SPD.fr`           1.588e-02  1.611e-01   0.099  0.92149    
## `BBpct.fr:Kpct.fr`      4.481e-02  1.532e-01   0.293  0.76989    
## `BBpct.fr:HBP.fr`      -2.804e-01  2.748e-01  -1.021  0.30751    
## `BBpct.fr:SF.fr`        4.162e-01  4.364e-01   0.954  0.34027    
## `BBpct.fr:SH.fr`       -5.383e-02  2.225e-01  -0.242  0.80885    
## `BBpct.fr:SB.fr`        1.816e-01  2.199e-01   0.826  0.40899    
## `BBpct.fr:CS.fr`        1.698e-01  3.818e-01   0.445  0.65646    
## `BBpct.fr:SPD.fr`      -4.622e-01  3.565e-01  -1.297  0.19485    
## `Kpct.fr:HBP.fr`        1.507e-01  1.653e-01   0.912  0.36197    
## `Kpct.fr:SF.fr`        -5.876e-02  3.322e-01  -0.177  0.85961    
## `Kpct.fr:SH.fr`        -1.709e-01  1.668e-01  -1.024  0.30577    
## `Kpct.fr:SB.fr`        -9.829e-02  1.672e-01  -0.588  0.55656    
## `Kpct.fr:CS.fr`         1.530e-01  2.743e-01   0.558  0.57702    
## `Kpct.fr:SPD.fr`        3.632e-01  2.677e-01   1.357  0.17486    
## `HBP.fr:SF.fr`          2.713e-01  2.468e-01   1.099  0.27182    
## `HBP.fr:SH.fr`         -2.633e-02  1.143e-01  -0.230  0.81781    
## `HBP.fr:SB.fr`          2.223e-02  1.139e-01   0.195  0.84523    
## `HBP.fr:CS.fr`         -9.380e-02  2.077e-01  -0.452  0.65158    
## `HBP.fr:SPD.fr`        -5.015e-01  3.888e-01  -1.290  0.19711    
## `SF.fr:SH.fr`           2.599e-01  2.261e-01   1.149  0.25045    
## `SF.fr:SB.fr`           2.036e-01  2.220e-01   0.917  0.35920    
## `SF.fr:CS.fr`           2.895e-02  4.130e-01   0.070  0.94412    
## `SF.fr:SPD.fr`         -3.986e-01  7.110e-01  -0.561  0.57514    
## `SH.fr:SB.fr`          -2.468e-02  1.038e-01  -0.238  0.81204    
## `SH.fr:CS.fr`          -3.012e-02  1.857e-01  -0.162  0.87120    
## `SH.fr:SPD.fr`          2.285e-01  3.530e-01   0.647  0.51743    
## `SB.fr:CS.fr`           3.364e-01  1.831e-01   1.837  0.06628 .  
## `SB.fr:SPD.fr`          3.514e-01  3.291e-01   1.068  0.28570    
## `CS.fr:SPD.fr`         -1.767e+00  6.556e-01  -2.696  0.00705 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.27 on 4461 degrees of freedom
## Multiple R-squared:  0.1917, Adjusted R-squared:  0.1417 
## F-statistic: 3.833 on 276 and 4461 DF,  p-value: < 2.2e-16

      full_int_so_RMSE <- round(sqrt(mean(full_int_so$residuals^2)),2) # RMSE = 38.11
      model_stats_so[3,1:5] <- c("full_int_so", "Multiple linear regression", 
                                 "Linear model with all interaction terms", 0.1417, full_int_so_RMSE)
      
      anova_so_int <- anova(full_lm_so, full_int_so) # interaction model not significantly more useful model

The significant predictors from the full interaction model are PA.fr, wRAA.fr, R.fr, X3B.fr, SB.fr, HR.fr, BB.fr, CS.fr, Kpct.fr, K.fr, BBpct.fr, SPD.fr, TB.fr, RBI.fr, PA.fr:TB.fr, wRAA.fr:TB.fr, R.fr:X3B.fr, R.fr:SB.fr, X3B.fr:TB.fr, 3B.fr:RBI.fr, TB.fr:HR.fr, TB.fr:BB.fr, RBI.fr:CS.fr, BB.fr:Kpct.fr, K.fr:BBpct.fr, and CS.fr:SPD.fr.

Again we have nested models, allowing us to compare the full first order model and the full interaction model. In this case, the p-value is large (0.1746) and we cannot reject the null hypothesis, concluding that the larger interaction model is not more significantly useful than the more simple model.

anova_so_int$`Pr(>F)`

## [1]        NA 0.1746037

reduced_int_so <- lm(wRCplus.so ~ PA.fr + wRAA.fr + R.fr + X3B.fr + SB.fr + HR.fr + BB.fr + CS.fr + Kpct.fr +
                            K.fr + BBpct.fr + SPD.fr + TB.fr + RBI.fr + PA.fr:TB.fr + wRAA.fr:TB.fr + 
                             R.fr:X3B.fr + R.fr:SB.fr + X3B.fr:TB.fr + X3B.fr:RBI.fr + TB.fr:HR.fr + TB.fr:BB.fr +
                             RBI.fr:CS.fr + BB.fr:Kpct.fr + K.fr:BBpct.fr + CS.fr:SPD.fr, data = wrc_int_so)
      summary(reduced_int_so) # adj R^2 = 0.1133

## 
## Call:
## lm(formula = wRCplus.so ~ PA.fr + wRAA.fr + R.fr + X3B.fr + SB.fr + 
##     HR.fr + BB.fr + CS.fr + Kpct.fr + K.fr + BBpct.fr + SPD.fr + 
##     TB.fr + RBI.fr + PA.fr:TB.fr + wRAA.fr:TB.fr + R.fr:X3B.fr + 
##     R.fr:SB.fr + X3B.fr:TB.fr + X3B.fr:RBI.fr + TB.fr:HR.fr + 
##     TB.fr:BB.fr + RBI.fr:CS.fr + BB.fr:Kpct.fr + K.fr:BBpct.fr + 
##     CS.fr:SPD.fr, data = wrc_int_so)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -170.537  -23.701    0.832   25.260  163.224 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.915e+01  2.056e+00  43.358  < 2e-16 ***
## PA.fr          6.235e-05  9.284e-05   0.672  0.50183    
## wRAA.fr       -3.373e-02  8.317e-03  -4.056 5.07e-05 ***
## R.fr           7.572e-03  2.842e-03   2.665  0.00773 ** 
## X3B.fr         5.039e-01  2.689e-01   1.874  0.06106 .  
## SB.fr          4.099e-02  1.753e-02   2.338  0.01944 *  
## HR.fr          2.323e-01  5.384e-02   4.314 1.64e-05 ***
## BB.fr          6.000e-03  4.929e-03   1.217  0.22360    
## CS.fr          1.608e-01  1.591e-01   1.011  0.31218    
## Kpct.fr       -6.079e-03  2.190e-03  -2.776  0.00553 ** 
## K.fr          -4.083e-03  1.627e-03  -2.510  0.01210 *  
## BBpct.fr       2.517e-02  8.118e-03   3.100  0.00195 ** 
## SPD.fr         1.308e-02  5.103e-02   0.256  0.79771    
## TB.fr          2.268e-03  7.632e-04   2.971  0.00298 ** 
## RBI.fr         8.517e-03  2.635e-03   3.232  0.00124 ** 
## PA.fr:TB.fr   -2.284e-08  1.453e-08  -1.572  0.11596    
## wRAA.fr:TB.fr  3.073e-06  1.098e-06   2.800  0.00514 ** 
## R.fr:X3B.fr    1.134e-05  1.906e-04   0.059  0.95258    
## R.fr:SB.fr    -1.797e-05  7.510e-06  -2.393  0.01677 *  
## X3B.fr:TB.fr  -4.157e-05  3.172e-05  -1.311  0.18999    
## X3B.fr:RBI.fr -2.517e-05  2.620e-04  -0.096  0.92347    
## HR.fr:TB.fr   -1.336e-05  4.165e-06  -3.208  0.00134 ** 
## BB.fr:TB.fr   -3.635e-07  6.511e-07  -0.558  0.57673    
## CS.fr:RBI.fr   7.810e-05  1.305e-04   0.599  0.54947    
## BB.fr:Kpct.fr -1.511e-03  6.974e-04  -2.167  0.03029 *  
## K.fr:BBpct.fr  1.526e-03  6.987e-04   2.184  0.02899 *  
## CS.fr:SPD.fr  -4.567e-03  4.002e-03  -1.141  0.25384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 4711 degrees of freedom
## Multiple R-squared:  0.1182, Adjusted R-squared:  0.1133 
## F-statistic: 24.28 on 26 and 4711 DF,  p-value: < 2.2e-16

The equation for the reduced interaction model is:

\[ \hat{y} = 8.92\times10^1 + 6.24\times10^{-5}\mbox{PA.fr} + (-3.37\times10^{-2})\mbox{wRAA.fr} + ... +(-4.567\times10^-3)\mbox{CS.fr:SPD.fr} \]

Repeating this analysis for junior and senior year predictions, we have the following results:

5.1.0.3 k-Nearest Neighbors for Regression

k-nearest neighbors models require for the predictors to be normalized so that Euclidean distance is not skewed, which I do with the following code:

# Sophomore year      

      fr_so_train_norm <- fr_so_train
      fr_so_valid_norm <- fr_so_valid
      wrc_fr_so_norm <- wrc_fr_so
      
      norm_values <- preProcess(fr_so_train[, 1:23], method = c("center", "scale"))
      fr_so_train_norm[, 1:23] <- predict(norm_values, fr_so_train[, 1:23])
      fr_so_valid_norm[, 1:23] <- predict(norm_values, fr_so_valid[, 1:23])
      wrc_fr_so_norm[, 1:23] <- predict(norm_values, wrc_fr_so[, 1:23])

Then, I run the knn model with k values between 10 and 200. The best k-value is the one that corresponds with the lowest RMSE. For the sophomore year prediction model, k=180 produces the best results.

knitr::kable(knn_results_so, escape = FALSE, booktabs = TRUE) # k = 180, valid rmse = 38.88

k	Train RMSE	Valid RMSE	Fit?
10	40.08	40.08	Over
20	39.61	39.61	Over
30	39.28	39.28	Over
40	39.09	39.09	Over
50	39.02	39.02	Over
60	39.08	39.08	Over
70	38.92	38.92	Over
80	38.98	38.98	Over
90	38.97	38.97	Over
100	38.99	38.99	Over
110	39.00	39.00	Over
120	38.97	38.97	Over
130	38.92	38.92	Over
140	38.95	38.95	Over
150	38.93	38.93	Over
160	38.94	38.94	Over
170	38.91	38.91	Over
180	38.88	38.88	Best
190	38.89	38.89	Under
200	38.96	38.96	Under

Finally, we have the results of all regression models for all prediction years.

5.2 Classification

5.2.0.1 Logistic Regression

First, I will train a logistic regression model for all seven wRC+ classifications. This requires the addition of these classifications to the dataset, which can be done with:

# Sophomore year
      
        # All levels
      
        wrc_fr_so_glm <- wrc_fr_so %>% mutate(wRCplusrating.so = 
                                        case_when(wRCplus.so <= 60 ~ "Awful",
                                          wRCplus.so > 60 & wRCplus.so <= 75 ~ "Poor",
                                          wRCplus.so > 75 & wRCplus.so <= 90 ~ "Below Average",
                                          wRCplus.so > 90 & wRCplus.so <= 110 ~ "Average",
                                          wRCplus.so > 110 & wRCplus.so <= 130 ~ "Above Average",
                                          wRCplus.so > 130 & wRCplus.so <= 150 ~ "Great",
                                          wRCplus.so > 150 ~ "Excellent")
        )

        multinom_so <- multinom(wRCplusrating.so ~ .-wRCplus.so, data = wrc_fr_so_glm)

## # weights:  175 (144 variable)
## initial  value 9219.722286 
## iter  10 value 8868.089780
## iter  20 value 8807.358678
## iter  30 value 8782.130396
## iter  40 value 8768.806553
## iter  50 value 8764.351329
## iter  60 value 8751.007971
## iter  70 value 8730.458130
## iter  80 value 8682.358291
## iter  90 value 8655.799304
## iter 100 value 8628.337570
## final  value 8628.337570 
## stopped after 100 iterations

        summary(multinom_so) # AIC = 17544.68

## Call:
## multinom(formula = wRCplusrating.so ~ . - wRCplus.so, data = wrc_fr_so_glm)
## 
## Coefficients:
##               (Intercept)         PA.fr       BA.fr   SlgPct.fr   BABIP.fr
## Above Average  0.61754215 -0.0017648055  0.01656815  0.15233187 -0.6728247
## Awful         -0.12861713 -0.0068938247  0.70224919  0.07703881  1.7426851
## Below Average  0.70664950 -0.0061367378  1.10305891 -0.63082573 -0.6931290
## Excellent     -1.13800888 -0.0067897903 -0.02906636  0.03503968  0.1015323
## Great          0.34043509 -0.0045085851 -0.37567708 -0.74037139 -1.5034511
## Poor           0.08692583  0.0002362675 -0.24826749  0.06516696  0.5382463
##                     wRAA.fr   wRCplus.fr        R.fr      X2B.fr     X3B.fr
## Above Average  0.0093078007  0.001072873 0.001902029  0.00304035 0.14914831
## Awful         -0.0391158442 -0.003694869 0.001715085 -0.03348998 0.06050513
## Below Average -0.0006215197 -0.002311949 0.005320622 -0.02747100 0.07372247
## Excellent      0.0212483826 -0.003087980 0.001907237  0.02533869 0.15124430
## Great          0.0332013163  0.002625320 0.018968212 -0.01693816 0.24439156
## Poor           0.0261193184 -0.004623368 0.000598793 -0.01385670 0.13982594
##                       TB.fr       HR.fr    HRpct.fr       RBI.fr        BB.fr
## Above Average  0.0006942084  0.11542878 -0.02347664 -0.013289809 -0.009884806
## Awful          0.0058605554  0.04601809 -0.04096951 -0.011363193 -0.014913919
## Below Average -0.0012871319 -0.14210244  0.10825088  0.009947479 -0.018703935
## Excellent      0.0062657293  0.07306020  0.10272059  0.009085875  0.021014336
## Great         -0.0018431672  0.07338962 -0.02316038  0.010657164 -0.011056815
## Poor          -0.0017580146 -0.01899575 -0.01788090 -0.022461142 -0.063048813
##                        K.fr     BBpct.fr      Kpct.fr       HBP.fr      SF.fr
## Above Average  3.863094e-03  0.004369362 -0.018900408 -0.010327634 0.10761386
## Awful          1.793095e-02  0.015914712  0.002619607 -0.028303735 0.04117028
## Below Average  3.346905e-02 -0.001236747 -0.021458220 -0.036815459 0.07887567
## Excellent      9.937545e-05  0.042233079 -0.017097630  0.045772734 0.13891523
## Great         -3.424453e-03  0.009625511 -0.003208336 -0.003068599 0.08287099
## Poor           3.001120e-02  0.039427604 -0.015734582 -0.074848176 0.02037623
##                      SH.fr        SB.fr       CS.fr      SPD.fr
## Above Average  0.003685347  0.055218570  0.02274999 -0.12966463
## Awful          0.080883421 -0.040443181  0.03349001 -0.05389362
## Below Average  0.053585336  0.004630763 -0.02531996 -0.07484186
## Excellent     -0.033317469 -0.002068828  0.08815314 -0.02524444
## Great          0.051660782  0.050768717 -0.04996450 -0.11857894
## Poor           0.071418617 -0.001629147  0.01798138 -0.05351196
## 
## Std. Errors:
##               (Intercept)       PA.fr      BA.fr  SlgPct.fr   BABIP.fr
## Above Average   0.4441279 0.003809495 0.09187218 0.09533794 0.06502439
## Awful           0.4418409 0.004453149 0.27651583 0.09333780 0.05953991
## Below Average   0.4750591 0.004450547 0.27600015 0.10080063 0.07068363
## Excellent       0.5808199 0.004599882 0.12175166 0.12710417 0.08329032
## Great           0.5029879 0.004249624 0.10426757 0.10712893 0.07296899
## Poor            0.4977756 0.004884565 0.10233704 0.10517199 0.07249790
##                  wRAA.fr  wRCplus.fr       R.fr     X2B.fr     X3B.fr
## Above Average 0.02079071 0.003308130 0.01308715 0.02186517 0.07187833
## Awful         0.02377963 0.003244235 0.01547385 0.02573976 0.08750224
## Below Average 0.02397699 0.003510159 0.01537288 0.02567444 0.08614900
## Excellent     0.02482636 0.004262231 0.01498312 0.02564009 0.08219101
## Great         0.02326907 0.003713960 0.01442527 0.02491434 0.07585241
## Poor          0.02644901 0.003727045 0.01712910 0.02838352 0.09055238
##                     TB.fr      HR.fr   HRpct.fr     RBI.fr      BB.fr
## Above Average 0.001922316 0.06174088 0.08164811 0.01150345 0.01920975
## Awful         0.002509806 0.07600944 0.08504675 0.01347576 0.02257654
## Below Average 0.002427970 0.07841285 0.09115142 0.01341703 0.02274869
## Excellent     0.002145618 0.06833050 0.09634715 0.01312329 0.02236789
## Great         0.002120893 0.06771862 0.09067646 0.01279920 0.02113943
## Poor          0.002700848 0.08768818 0.10160842 0.01516683 0.02507699
##                     K.fr   BBpct.fr    Kpct.fr     HBP.fr      SF.fr      SH.fr
## Above Average 0.01124671 0.02484403 0.01494964 0.01939941 0.04247292 0.02120005
## Awful         0.01236172 0.02479743 0.01393496 0.02282583 0.04929101 0.02249155
## Below Average 0.01287203 0.02693490 0.01559192 0.02292864 0.04876633 0.02283241
## Excellent     0.01338027 0.03086455 0.01934323 0.02207276 0.04883035 0.02714375
## Great         0.01261795 0.02744382 0.01676611 0.02139603 0.04785122 0.02309596
## Poor          0.01398667 0.02826136 0.01631665 0.02604699 0.05588617 0.02480693
##                    SB.fr      CS.fr     SPD.fr
## Above Average 0.01848102 0.03784051 0.05168203
## Awful         0.02495010 0.04288419 0.05448126
## Below Average 0.02356634 0.04404800 0.05723813
## Excellent     0.02243708 0.04476043 0.06407481
## Great         0.02030721 0.04352384 0.05707977
## Poor          0.02547533 0.04748018 0.06048924
## 
## Residual Deviance: 17256.68 
## AIC: 17544.68

so_table <- table(wrc_fr_so_glm$wRCplusrating.so, wrc_fr_so_glm$wRCplusPred)
multinom_so_acc <- round((sum(diag(so_table))/sum(so_table)),4) # overall accuracy = 24.53%

so_table

##                
##                 Average Above Average Awful Below Average Excellent Great Poor
##   Average           442           130   318            12        53    17    3
##   Above Average     357           167   205             8        53    31    3
##   Awful             268            83   403            11        15     5    2
##   Below Average     284            77   247            10        11     3    7
##   Excellent         175           100    82             3       107    19    0
##   Great             206           117   129            12        54    33    1
##   Poor              174            62   218             9         7     5    0

multinom_so_acc

## [1] 0.2453

The 7x7 table shows the results of the predictions, where the labels across the top indicate the actual value and the labels to left indicate the predicted values. The model performs with an overall accuracy of 0.2453 or 24.53%.

Next, I will train a logistic regression model to predict a binary case, only wRC+ above average (1) and wRC+ below average (0).

wrc_fr_so_glm2 <- wrc_fr_so %>% mutate(wRCplusrating.so = 
                                                 case_when(wRCplus.so < 100 ~ 0,
                                                           wRCplus.so >= 100 ~ 1))

glm_so <- glm(wRCplusrating.so ~ . - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
summary(glm_so) # AIC = 6054.2

## 
## Call:
## glm(formula = wRCplusrating.so ~ . - wRCplus.so, family = "binomial", 
##     data = wrc_fr_so_glm2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5309  -1.0556   0.3667   1.1170   1.8657  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.724e-01  4.915e-01  -1.368  0.17132    
## PA.fr        4.651e-03  2.738e-03   1.699  0.08936 .  
## BA.fr       -3.638e-01  7.185e-01  -0.506  0.61260    
## SlgPct.fr    2.359e+00  2.097e+00   1.125  0.26068    
## BABIP.fr    -5.179e+00  1.896e+00  -2.732  0.00629 ** 
## wRAA.fr      2.452e-02  1.456e-02   1.684  0.09213 .  
## wRCplus.fr   8.605e-03  4.542e-03   1.894  0.05818 .  
## R.fr         6.708e-03  9.166e-03   0.732  0.46426    
## X2B.fr       6.080e-03  2.164e-02   0.281  0.77877    
## X3B.fr       3.771e-02  5.563e-02   0.678  0.49779    
## TB.fr        6.973e-05  1.454e-03   0.048  0.96175    
## HR.fr        1.153e-01  4.435e-02   2.599  0.00934 ** 
## HRpct.fr    -1.937e-01  9.947e-02  -1.947  0.05148 .  
## RBI.fr       3.519e-03  7.995e-03   0.440  0.65984    
## BB.fr        3.643e-03  1.356e-02   0.269  0.78818    
## K.fr        -2.164e-02  7.656e-03  -2.826  0.00471 ** 
## BBpct.fr     9.299e-03  2.108e-02   0.441  0.65908    
## Kpct.fr      2.690e-02  1.224e-02   2.199  0.02790 *  
## HBP.fr       1.783e-02  1.673e-02   1.066  0.28648    
## SF.fr        4.011e-02  2.966e-02   1.352  0.17627    
## SH.fr       -5.002e-02  1.380e-02  -3.624  0.00029 ***
## SB.fr        4.020e-02  1.372e-02   2.930  0.00339 ** 
## CS.fr       -2.517e-03  2.578e-02  -0.098  0.92221    
## SPD.fr      -3.525e-02  3.398e-02  -1.037  0.29962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6566.2  on 4737  degrees of freedom
## Residual deviance: 6006.2  on 4714  degrees of freedom
## AIC: 6054.2
## 
## Number of Fisher Scoring iterations: 6

cm_glm_so <- confusionMatrix(as.factor(ifelse(glm_so_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so), positive = "1") # overall accuracy = 0.6351
cm_glm_so

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1583  993
##          1  736 1426
##                                           
##                Accuracy : 0.6351          
##                  95% CI : (0.6212, 0.6488)
##     No Information Rate : 0.5106          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2715          
##                                           
##  Mcnemar's Test P-Value : 7.431e-10       
##                                           
##             Sensitivity : 0.5895          
##             Specificity : 0.6826          
##          Pos Pred Value : 0.6596          
##          Neg Pred Value : 0.6145          
##              Prevalence : 0.5106          
##          Detection Rate : 0.3010          
##    Detection Prevalence : 0.4563          
##       Balanced Accuracy : 0.6361          
##                                           
##        'Positive' Class : 1               
##

In this confusion matrix, the prediction label refers to the prediction made by the model and the reference label refers to the actual classificaion of the test data. The binary case model performs with an accuracy of 0.6351 or 63.51%.

Taking only the significant predictors, we can create a reduced model.

glm_so_reduced <- glm(wRCplusrating.so ~ PA.fr + BABIP.fr + wRAA.fr + wRCplus.fr + HR.fr + HRpct.fr + K.fr + Kpct.fr + SH.fr + SB.fr - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
summary(glm_so_reduced) # AIC = 6038.9

## 
## Call:
## glm(formula = wRCplusrating.so ~ PA.fr + BABIP.fr + wRAA.fr + 
##     wRCplus.fr + HR.fr + HRpct.fr + K.fr + Kpct.fr + SH.fr + 
##     SB.fr - wRCplus.so, family = "binomial", data = wrc_fr_so_glm2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4477  -1.0611   0.3754   1.1184   1.8692  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.392624   0.287983  -1.363  0.17277    
## PA.fr        0.007641   0.001435   5.325 1.01e-07 ***
## BABIP.fr    -5.133025   0.933535  -5.498 3.83e-08 ***
## wRAA.fr      0.034195   0.011091   3.083  0.00205 ** 
## wRCplus.fr   0.012126   0.002426   4.998 5.78e-07 ***
## HR.fr        0.118222   0.041084   2.878  0.00401 ** 
## HRpct.fr    -0.127802   0.054724  -2.335  0.01952 *  
## K.fr        -0.021298   0.007286  -2.923  0.00346 ** 
## Kpct.fr      0.026675   0.009822   2.716  0.00661 ** 
## SH.fr       -0.056854   0.013259  -4.288 1.80e-05 ***
## SB.fr        0.030881   0.009786   3.156  0.00160 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6566.2  on 4737  degrees of freedom
## Residual deviance: 6016.9  on 4727  degrees of freedom
## AIC: 6038.9
## 
## Number of Fisher Scoring iterations: 4

cm_glm_so_reduced <- confusionMatrix(as.factor(ifelse(glm_so_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so), positive = "1") # overall accuracy = 0.6334
cm_glm_so_reduced

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1582 1000
##          1  737 1419
##                                           
##                Accuracy : 0.6334          
##                  95% CI : (0.6195, 0.6471)
##     No Information Rate : 0.5106          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2682          
##                                           
##  Mcnemar's Test P-Value : 3.249e-10       
##                                           
##             Sensitivity : 0.5866          
##             Specificity : 0.6822          
##          Pos Pred Value : 0.6582          
##          Neg Pred Value : 0.6127          
##              Prevalence : 0.5106          
##          Detection Rate : 0.2995          
##    Detection Prevalence : 0.4550          
##       Balanced Accuracy : 0.6344          
##                                           
##        'Positive' Class : 1               
##

The reduced model performs with an overall accuracy of 0.6334 or 63.34%.

Repeating these three models for junior year and senior year predictions, we have the follwoign results:

5.2.0.2 k-Nearest Neighbors for Classification

Using k-nearest neighbors for classification, like for regression, requires noramlized predictor. After normalizing the data, we can train the model using various k values to determine which performs with the highest accuracy. This model uses all seven classifications of wRC+ to predict sophomore year performance, and works best with a k value of 30.

accuracy_so_all # k = 30 gives best accuracy - 20.99%

##      k  accuracy
## 1   10 0.1867089
## 2   20 0.1951477
## 3   30 0.2099156
## 4   40 0.1898734
## 5   50 0.1803797
## 6   60 0.1814346
## 7   70 0.1993671
## 8   80 0.1962025
## 9   90 0.2067511
## 10 100 0.2004219
## 11 110 0.1983122
## 12 120 0.1940928
## 13 130 0.1951477
## 14 140 0.1940928
## 15 150 0.2078059
## 16 160 0.2004219
## 17 170 0.1993671
## 18 180 0.1940928
## 19 190 0.2067511
## 20 200 0.2099156

The confusion matrix shows how the best k model performs for each classification.

fr_so_knn_all_best <- knn(train = fr_so_knn_train_norm[, 1:23], test = fr_so_knn_valid_norm[, 1:23], cl = fr_so_knn_train_norm[, 24], k = 30)
confusionMatrix(fr_so_knn_all_best, fr_so_knn_valid_norm[, 24])

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Average Above Average Awful Below Average Excellent Great Poor
##   Average            70            77    54            55        26    38   32
##   Above Average      31            43    25            22        19    26   18
##   Awful              36            23    55            30        17    25   25
##   Below Average      14            14    11            17         4     6    5
##   Excellent           7            16     4             6        19    19    2
##   Great               5            10     3             3         1     4    4
##   Poor                2             5     7             2         2     3    6
## 
## Overall Statistics
##                                           
##                Accuracy : 0.2257          
##                  95% CI : (0.1995, 0.2537)
##     No Information Rate : 0.1983          
##     P-Value [Acc > NIR] : 0.02009         
##                                           
##                   Kappa : 0.0727          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
## 
## Statistics by Class:
## 
##                      Class: Average Class: Above Average Class: Awful
## Sensitivity                 0.42424              0.22872      0.34591
## Specificity                 0.63985              0.81447      0.80228
## Pos Pred Value              0.19886              0.23370      0.26066
## Neg Pred Value              0.84060              0.81021      0.85889
## Prevalence                  0.17405              0.19831      0.16772
## Detection Rate              0.07384              0.04536      0.05802
## Detection Prevalence        0.37131              0.19409      0.22257
## Balanced Accuracy           0.53204              0.52160      0.57410
##                      Class: Below Average Class: Excellent Class: Great
## Sensitivity                       0.12593          0.21591     0.033058
## Specificity                       0.93358          0.93721     0.968561
## Pos Pred Value                    0.23944          0.26027     0.133333
## Neg Pred Value                    0.86545          0.92114     0.872549
## Prevalence                        0.14241          0.09283     0.127637
## Detection Rate                    0.01793          0.02004     0.004219
## Detection Prevalence              0.07489          0.07700     0.031646
## Balanced Accuracy                 0.52975          0.57656     0.500809
##                      Class: Poor
## Sensitivity             0.065217
## Specificity             0.975467
## Pos Pred Value          0.222222
## Neg Pred Value          0.906623
## Prevalence              0.097046
## Detection Rate          0.006329
## Detection Prevalence    0.028481
## Balanced Accuracy       0.520342

Next, like for logisitic regression, I will also predict using binary outcomes. The table shows which k value provides the best accuracy. I will be using the model with k=150 as the best model for binary outcomes.

accuracy_so_bin # k = 150 gives best accuracy - 60.44%

##      k  accuracy
## 1   10 0.5580169
## 2   20 0.5527426
## 3   30 0.5854430
## 4   40 0.5580169
## 5   50 0.5907173
## 6   60 0.5907173
## 7   70 0.5991561
## 8   80 0.5864979
## 9   90 0.5864979
## 10 100 0.5833333
## 11 110 0.5949367
## 12 120 0.5854430
## 13 130 0.5886076
## 14 140 0.5938819
## 15 150 0.6044304
## 16 160 0.5875527
## 17 170 0.6023207
## 18 180 0.5864979
## 19 190 0.5970464
## 20 200 0.5812236

fr_so_knn_bin_best <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
                          test = fr_so_knn_bin_valid_norm[, 1:23],
                          cl = as.factor(fr_so_knn_bin_train_norm[, 24]), k = 150)
confusionMatrix(fr_so_knn_bin_best, as.factor(fr_so_knn_bin_valid_norm[, 24]))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 340 265
##          1 120 223
##                                           
##                Accuracy : 0.5939          
##                  95% CI : (0.5618, 0.6253)
##     No Information Rate : 0.5148          
##     P-Value [Acc > NIR] : 5.919e-07       
##                                           
##                   Kappa : 0.1943          
##                                           
##  Mcnemar's Test P-Value : 2.153e-13       
##                                           
##             Sensitivity : 0.7391          
##             Specificity : 0.4570          
##          Pos Pred Value : 0.5620          
##          Neg Pred Value : 0.6501          
##              Prevalence : 0.4852          
##          Detection Rate : 0.3586          
##    Detection Prevalence : 0.6382          
##       Balanced Accuracy : 0.5980          
##                                           
##        'Positive' Class : 0               
##

After performing the same analysis for junior and senior year predictions, we get th follwoing results:

6 Model Selection

6.1 Regression

Overall, the regression models did not perform exceptionally well. The five models performed similarly for sophomore year predictions with RMSE values falling between 38 and 40 for each, quite a large amount of error for a wRC+ value that typically falls between -100 and 200. The adj. R² values for the linear regression models did not fair well either, falling very short of the desired range of 0.65 or greater. Since these models are not very distinguishable by performs, I will select the first order linear regression model (full_lm_so/jr/sr) as the best model. The results are very similar for junior and senior year predictions as well.

6.2 Classification

The classification models had some more success than the regression models due to the ability to simplify into a binary case. The classification models for predicting all seven classes of wRC+ still resulted in poor accuracy. While the binary outcome models tell us less about future seasons, it is still a useful prediction to judge general development of players, and performs with much bettter accuracy. Again with similar accuracy across each of the binary classification models, I am going to select simplest model of the three. This is the binary logistic regression model with only the variables found to be significant. This will be the model that I use to predict future seasons for AAC freshmen from 2021.

7 Prediction

To predict future seasons, I first take the subset of the data that fits the description of the players to be predicted. Then I simply use the predict() function to compute a probability (0,1) of a player have an above average wRC+ season, with probabilities 0.50 and greater designated as above average.

aac_pred_so <- predict(glm_so_reduced, newdata = full_aac_fr, type = "response")
aac_pred_jr <- predict(glm_jr_reduced, newdata = full_aac_fr, type = "response")
aac_pred_sr <- predict(glm_sr_reduced, newdata = full_aac_fr, type = "response")

All but one of the players (2301334) is predicted to achieve above average wRC+ by their senior season. The prediction for how quickly the players achieve this, though, varies. Some are predicted to perform at above average levels for the remainder of their collegiate careers, while others are predicted to reach this level during their junior or senior seasons.

8 Appendix

8.1 Full code

setwd("~/College Classes/Fall 2021/Math Seminar")

library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(caret)
library(nnet)
library(FNN)

options(max.print=1000000)

##### Read data #####

full <- read.csv("D1 Batting 2013-2021.csv")
dim(full) # 45938 rows, 39 variables

length(unique(full$school)) # 313 unique schools represented
length(unique(full$player_id)) # 21567 unique players represented

full <- full[, -c(1:3)] # remove row number, year, school, conference
dim(full) # 45938 rows, 36 variables
  
##### Data Exploration #####

  # Missing values
  
  nrow(subset(full, full$player_id == "0")) # 258 player IDs=0: 258/45938 = less than 1% of data
  nrow(full[is.na(full$player_id), ]) # 0 na
  
  nrow(full[is.na(full$Yr), ]) # 0 na
  nrow(full[full$Yr=="N/A", ]) # 179 na: 179/45938 = less than 1% of data
  
  nrow(full[is.na(full$Player), ]) # 0 na
  nrow(full[full$Player=="N/A", ]) # 0 na
  
  nrow(full[is.na(full$Pos), ]) # 0 na
  nrow(full[full$Pos=="N/A", ]) # 0 na
  
  nrow(full[is.na(full$BA), ]) # 0 na
  nrow(full[full$BA=="N/A", ]) # 0 na
  
  # Remove these
  
  full <- full[!(full$player_id==0), ]
  full <- full[!(full$Yr=="N/A"), ]
  dim(full) # 45505 rows, 36 variables
  
  # Minimum Plate Appearances
  
  summary_pa <- summary(full$PA)
  first_qrt_pa <- summary_pa[2] # 32
  median_pa <- median(full$PA) # 99
  
  hist1 <- ggplot(full, aes(x=PA)) +
    geom_histogram(fill="azure3", col="black") +
    ggtitle("PA Distribution") +
    geom_segment(aes(x=first_qrt_pa, y=0, xend=first_qrt_pa, yend=3500), col="red", size=1) +
    geom_segment(aes(x=median_pa, y=0, xend=median_pa, yend=2500), col="red", size=1) +
    annotate("text", x=first_qrt_pa+20, y=3400, label= "First") + 
    annotate("text", x=first_qrt_pa+30, y=3250, label= "Quartile") +
    annotate("text", x=median_pa+30, y=2350, label = "Median") +
    ylab("Number of players") +
    xlab("Plate appearances in a season") +
    theme_economist_white()
  
  hist1
  
  full <- full[full$PA>=first_qrt_pa, ]
  dim(full) # 34338 rows, 36 varaibles
  
  # Correlation Matrix
  
  corr_simple <- function(data=full,sig=0.90){
    #convert data to numeric in order to run correlations
    #convert to factor first to keep the integrity of the data - each value will become a number rather than turn into NA
    df_cor <- data %>% mutate_if(is.character, as.factor)
    df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
    #run a correlation and drop the insignificant ones
    corr <- cor(df_cor)
    #prepare to drop duplicates and correlations of 1     
    corr[lower.tri(corr,diag=TRUE)] <- NA 
    #drop perfect correlations
    corr[corr == 1] <- NA 
    #turn into a 3-column table
    corr <- as.data.frame(as.table(corr))
    #remove the NA values from above 
    corr <- na.omit(corr) 
    #select significant values  
    corr <- subset(corr, abs(Freq) > sig) 
    #sort by highest correlation
    corr <- corr[order(-abs(corr$Freq)),] 
    #print table
    print(corr)
    #turn corr back into matrix in order to plot with corrplot
    mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
    
    #plot correlations visually
    corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ", title = "Correlation between variables")
  }
  corr_simple()
  
    # Drop:
      # wOBA 15
      # GS 7
      # AB 9
      # OPS 13
      # H 20
      # ISO 14
      # OBpct 11
      # GP 6
  
  # Drop selected variables
  
  full <- full[, -c(6:7,9,11,13:15,20)]
  dim(full) # 34338 rows, 28 variables
  
##### Dataset manipulation #####
  
  # Players with Freshman year stats
  
  nrow(subset(full, full$Yr=="Fr")) # 7134 players with freshman year stats
  full_fr <- subset(full, full$Yr=="Fr")
  player_id_fr <- full_fr$player_id
  
  # Subset stats only for players who have freshman year stats
  
  full_from_fr <- full[full$player_id %in% player_id_fr, ]
  
  dim(full_from_fr) # 18080 rows, 31 variables
  length(unique(full_from_fr$player_id)) # 6992 unique players represented
  
  # Length of college career
  
  full_from_fr_years <- full_from_fr %>%
    count(player_id)
  
  nrow(subset(full_from_fr_years, full_from_fr_years$n == "1")) # 2025 one year players
  nrow(subset(full_from_fr_years, full_from_fr_years$n == "2")) # 1214 two year players
  nrow(subset(full_from_fr_years, full_from_fr_years$n == "3")) # 1615 three year players
  nrow(subset(full_from_fr_years, full_from_fr_years$n == "4")) # 1911 four year players
  nrow(subset(full_from_fr_years, full_from_fr_years$n == "5")) # 224 five year players
  nrow(subset(full_from_fr_years, full_from_fr_years$n == "6")) # 3 six year players
  
  # Bar graph
  
  full_from_fr_years_1_5 <- full_from_fr_years[!(full_from_fr_years$n==6), ]
  
  bar1 <- ggplot(full_from_fr_years_1_5, aes(x=n)) +
          geom_bar(fill="azure3") +
          ggtitle("Length of Playing Career") +
          xlab("Number of seasons with stats") +
          ylab("Number of players") +
          geom_text(aes(label = ..count..), stat = "count", vjust = 1.5) +
          theme_economist_white()
  bar1
  
  # Separate dataset by year classification
  
  stats_fr <- subset(full_from_fr, full_from_fr$Yr=="Fr")
  stats_so <- subset(full_from_fr, full_from_fr$Yr=="So")
  stats_jr <- subset(full_from_fr, full_from_fr$Yr=="Jr")
  stats_sr <- subset(full_from_fr, full_from_fr$Yr=="Sr")
  
  # Create final form data frames
  
    # Freshman and sophomore
    stats_fr_so <- inner_join(stats_fr, stats_so, by="player_id", suffix=c(".fr", ".so"))
    stats_fr_so <- stats_fr_so[, -c(1,4,5,29:32)]
    stats_fr_so <- stats_fr_so %>%
      rename(Player = Player.fr)
    nrow(stats_fr_so) # 4738 players to predict sophomore year stats 
    
    # Freshman and junior
    stats_fr_jr <- inner_join(stats_fr, stats_jr, by="player_id", suffix=c(".fr", ".jr"))
    stats_fr_jr <- stats_fr_jr[, -c(1,4,5,29:32)]
    stats_fr_jr <- stats_fr_jr %>%
      rename(Player = Player.fr)
    nrow(stats_fr_jr) # 3784 players to predict junior year stats
    
    # Freshman and senior
    stats_fr_sr <- inner_join(stats_fr, stats_sr, by="player_id", suffix=c(".fr", ".sr"))
    stats_fr_sr <- stats_fr_sr[, -c(1,4,5,29:32)]
    stats_fr_sr <- stats_fr_sr %>%
      rename(Player = Player.fr)
    nrow(stats_fr_sr) # 2443 players to predict senior year stats
    
    # Full freshman to senior data frame
    stats_fr_to_jr <- left_join(stats_fr_so, stats_jr, by="player_id")
    stats_fr_to_jr <- stats_fr_to_jr[, -c(49:52)]
    stats_fr_to_jr <- stats_fr_to_jr %>%
      rename(Player = Player.x,
             PA.jr = PA,
             BA.jr = BA,
             SlgPct.jr = SlgPct,
             BABIP.jr = BABIP,
             wRAA.jr = wRAA,
             wRCplus.jr = wRCplus,
             R.jr = R,
             X2B.jr = X2B,
             X3B.jr = X3B,
             TB.jr = TB,
             HR.jr = HR,
             HRpct.jr = HRpct,
             RBI.jr = RBI,
             BB.jr = BB,
             K.jr = K,
             BBpct.jr = BBpct,
             Kpct.jr = Kpct,
             HBP.jr = HBP,
             SF.jr = SF,
             SH.jr = SH,
             SB.jr = SB,
             CS.jr = CS,
             SPD.jr = SPD)
    
    stats_fr_to_sr <- left_join(stats_fr_to_jr, stats_sr, by="player_id") 
    stats_fr_to_sr <- stats_fr_to_sr[, -c(72:75)]
    stats_fr_to_sr <- stats_fr_to_sr %>%
      rename(Player = Player.x,
             PA.sr = PA,
             BA.sr = BA,
             SlgPct.sr = SlgPct,
             BABIP.sr = BABIP,
             wRAA.sr = wRAA,
             wRCplus.sr = wRCplus,
             R.sr = R,
             X2B.sr = X2B,
             X3B.sr = X3B,
             TB.sr = TB,
             HR.sr = HR,
             HRpct.sr = HRpct,
             RBI.sr = RBI,
             BB.sr = BB,
             K.sr = K,
             BBpct.sr = BBpct,
             Kpct.sr = Kpct,
             HBP.sr = HBP,
             SF.sr = SF,
             SH.sr = SH,
             SB.sr = SB,
             CS.sr = CS,
             SPD.sr = SPD)
    
    nrow(stats_fr_to_sr) # 5127 total players remaining in full data frame
    
    
  # Regression Modeling
    
    # wRCplus prediction datasets
    
      wrc_fr_so <- stats_fr_so[, -c(1:2, 26:30, 32:48)]
      wrc_fr_jr <- stats_fr_jr[, -c(1:2, 26:30, 32:48)]
      wrc_fr_sr <- stats_fr_sr[, -c(1:2, 26:30, 32:48)]
    
    # Test & validation data sets
      
      # Freshman to sophomore
      
      set.seed(1819)
      train.rows1 <- sample(nrow(wrc_fr_so), nrow(wrc_fr_so) * 0.8)
      fr_so_train <- wrc_fr_so[train.rows1, ]
      fr_so_valid <- wrc_fr_so[-train.rows1, ]
      
      # Freshman to junior
      
      set.seed(1819)
      train.rows2 <- sample(nrow(wrc_fr_jr), nrow(wrc_fr_jr) * 0.8)
      fr_jr_train <- wrc_fr_jr[train.rows2, ]
      fr_jr_valid <- wrc_fr_jr[-train.rows2, ]
      
      # Freshman to senior
      
      set.seed(1819)
      train.rows3 <- sample(nrow(wrc_fr_sr), nrow(wrc_fr_sr) * 0.8)
      fr_sr_train <- wrc_fr_sr[train.rows3, ]
      fr_sr_valid <- wrc_fr_sr[-train.rows3, ]
      
      
    # Model Comparison Table
      
      model_stats_so <- data.frame(Model_Name=character(),
                                Model_Type=character(),
                                Model_Description=character(),
                                Adj_R_Squared=numeric(),
                                RMSE=numeric(),
                                AIC=numeric(),
                                Accuracy=numeric())
      
      model_stats_jr <- data.frame(Model_Name=character(),
                                   Model_Type=character(),
                                   Model_Description=character(),
                                   Adj_R_Squared=numeric(),
                                   RMSE=numeric(),
                                   AIC=numeric(),
                                   Accuracy=numeric())
      
      model_stats_sr <- data.frame(Model_Name=character(),
                                   Model_Type=character(),
                                   Model_Description=character(),
                                   Adj_R_Squared=numeric(),
                                   RMSE=numeric(),
                                   AIC=numeric(),
                                   Accuracy=numeric())
            
    # Multiple Regression Models (wRC+)
      
      # Sophomore year
      
      full_lm_so <- lm(wRCplus.so ~ ., data = wrc_fr_so)
      summary(full_lm_so) # adj R^2 = 0.1377
      full_lm_so_RMSE <- round(sqrt(mean(full_lm_so$residuals^2)),2) # RMSE = 39.26
      model_stats_so[1,1:5] <- c("full_lm_so", "Mulitple linear regression", "First order linear model", 0.1377,
                              full_lm_so_RMSE)
      
      reduced_model_so <- lm(wRCplus.so ~ BABIP.fr + wRAA.fr + HR.fr + K.fr + SH.fr + SB.fr, data = wrc_fr_so)
      summary(reduced_model_so) # adj R^2 = 0.1183
      reduced_model_so_RMSE <- round(sqrt(mean(reduced_model_so$residuals^2)),2) # RMSE = 39.77
      model_stats_so[2,1:5] <- c("reduced_model_so", "Multiple linear regression", 
                              "Linear model with singificant predictors", 0.1183, reduced_model_so_RMSE)
      
      anova(reduced_model_so, full_lm_so)
      
      # Junior year
      
      full_lm_jr <- lm(wRCplus.jr ~ ., data = wrc_fr_jr)
      summary(full_lm_jr) # adj R^2 = 0.1395
      full_lm_jr_RMSE <- round(sqrt(mean(full_lm_jr$residuals^2)),2) # RMSE = 40.02
      model_stats_jr[1,1:5] <- c("full_lm_jr", "Mulitple linear regression", "First order linear model", 0.1395,
                              full_lm_jr_RMSE)
      
      reduced_model_jr <- lm(wRCplus.jr ~ PA.fr + wRAA.fr + TB.fr + K.fr + SH.fr + CS.fr, data = wrc_fr_jr)
      summary(reduced_model_jr) # adj R^2 = 0.1296
      reduced_model_jr_RMSE <- round(sqrt(mean(reduced_model_jr$residuals^2)),2) # RMSE = 40.33
      model_stats_jr[2,1:5] <- c("reduced_model_jr", "Multiple linear regression", 
                                 "Linear model with singificant predictors", 0.1296, reduced_model_jr_RMSE)
      
      anova(reduced_model_jr, full_lm_jr)
      
      # Senior year
      
      full_lm_sr <- lm(wRCplus.sr ~ ., data = wrc_fr_sr)
      summary(full_lm_sr) # adj R^2 = 0.06893
      full_lm_sr_RMSE <- round(sqrt(mean(full_lm_sr$residuals^2)),2) # RMSE = 39.44
      model_stats_sr[1,1:5] <- c("full_lm_sr", "Mulitple linear regression", "First order linear model", 0.06893,
                                 full_lm_sr_RMSE)
      
      reduced_model_sr <- lm(wRCplus.sr ~ TB.fr + BBpct.fr + SF.fr + SH.fr +CS.fr, data = wrc_fr_sr)
      summary(reduced_model_sr) # adj R^2 = 0.04159
      reduced_model_sr_RMSE <- round(sqrt(mean(reduced_model_sr$residuals^2)),2) # RMSE = 40.17
      model_stats_sr[2,1:5] <- c("reduced_model_sr", "Multiple linear regression", 
                                 "Linear model with singificant predictors", 0.04159, reduced_model_sr_RMSE)
    
      anova(reduced_model_sr, full_lm_sr)
      
    # Multiple Regression Models with Interaction
      
      # Sophomore year
      
      int_vars_so <- wrc_fr_so[, -24]
      wrc_int_so <- cbind(int_vars_so^2, do.call(cbind, combn(colnames(int_vars_so), 2,
                                                                  FUN=function(x) list(int_vars_so[x[1]]*int_vars_so[x[2]]))))
      colnames(wrc_int_so)[-(seq_len(ncol(int_vars_so)))] <- combn(colnames(int_vars_so), 2, 
                                                                  FUN = paste, collapse=":")
      wrc_int_so$wRCplus.so <- wrc_fr_so$wRCplus.so
      
      
      full_int_so <- lm(wRCplus.so ~ ., data = wrc_int_so)
      summary(full_int_so) # adj R^2 = 0.1417
      full_int_so_RMSE <- round(sqrt(mean(full_int_so$residuals^2)),2) # RMSE = 38.11
      model_stats_so[3,1:5] <- c("full_int_so", "Multiple linear regression", 
                                 "Linear model with all interaction terms", 0.1417, full_int_so_RMSE)
      
      anova(full_lm_so, full_int_so) # interaction model not significantly more useful model
    
      reduced_int_so <- lm(wRCplus.so ~ PA.fr + wRAA.fr + R.fr + X3B.fr + SB.fr + HR.fr + BB.fr + CS.fr + Kpct.fr +
                            K.fr + BBpct.fr + SPD.fr + TB.fr + RBI.fr + PA.fr:TB.fr + wRAA.fr:TB.fr + 
                             R.fr:X3B.fr + R.fr:SB.fr + X3B.fr:TB.fr + X3B.fr:RBI.fr + TB.fr:HR.fr + TB.fr:BB.fr +
                             RBI.fr:CS.fr + BB.fr:Kpct.fr + K.fr:BBpct.fr + CS.fr:SPD.fr, data = wrc_int_so)
      summary(reduced_int_so) # adj R^2 = 0.1133
      reduced_int_so_RMSE <- round(sqrt(mean(reduced_int_so$residuals^2)),2) # RMSE = 38.11
      model_stats_so[4,1:5] <- c("reduced_int_so", "Multiple linear regression", 
                                 "Linear model with interaction terms and significant predictors", 0.1133,
                                 reduced_int_so_RMSE)
      
      # Junior year
      
      int_vars_jr <- wrc_fr_jr[, -24]
      wrc_int_jr <- cbind(int_vars_jr^2, do.call(cbind, combn(colnames(int_vars_jr), 2,
                                                              FUN=function(x) list(int_vars_jr[x[1]]*int_vars_jr[x[2]]))))
      colnames(wrc_int_jr)[-(seq_len(ncol(int_vars_jr)))] <- combn(colnames(int_vars_jr), 2, 
                                                                   FUN = paste, collapse=":")
      wrc_int_jr$wRCplus.jr <- wrc_fr_jr$wRCplus.jr
      
      full_int_jr <- lm(wRCplus.jr ~., data = wrc_int_jr)
      summary(full_int_jr) # adj R^2 = 0.1373
      full_int_jr_RMSE <- round(sqrt(mean(full_int_jr$residuals^2)),2) # RMSE = 38.70
      model_stats_jr[3,1:5] <- c("full_int_jr", "Multiple linear regression", 
                                 "Linear model with all interaction terms", 0.1373, full_int_jr_RMSE)
      
      anova(full_lm_jr, full_int_jr) # interaction model not significantly more useful model
      
      reduced_int_jr <- lm(wRCplus.jr ~ TB.fr + R.fr + Kpct.fr + R.fr:Kpct.fr + HBP.fr + R.fr:HBP.fr + X2B.fr +
                             X2B.fr:TB.fr + X3B.fr + X3B.fr:TB.fr + HR.fr + TB.fr:HR.fr + RBI.fr + SB.fr +
                             RBI.fr:SB.fr + SPD.fr + RBI.fr:SPD.fr + K.fr + CS.fr + K.fr:CS.fr, data = wrc_int_jr)
      summary(reduced_int_jr) # adj R^2 = 0.1084
      reduced_int_jr_RMSE <- round(sqrt(mean(reduced_int_jr$residuals^2)),2) # RMSE = 40.75
      model_stats_jr[4,1:5] <- c("reduced_int_jr", "Multiple linear regression", 
                                 "Linear model with interaction terms and significant predictors", 0.1084,
                                 reduced_int_jr_RMSE)
      
      # Senior year
      
      int_vars_sr <- wrc_fr_sr[, -24]
      wrc_int_sr <- cbind(int_vars_sr^2, do.call(cbind, combn(colnames(int_vars_sr), 2,
                                                              FUN=function(x) list(int_vars_sr[x[1]]*int_vars_sr[x[2]]))))
      colnames(wrc_int_sr)[-(seq_len(ncol(int_vars_sr)))] <- combn(colnames(int_vars_sr), 2, 
                                                                   FUN = paste, collapse=":")
      wrc_int_sr$wRCplus.sr <- wrc_fr_sr$wRCplus.sr
      
      full_int_sr <- lm(wRCplus.sr ~., data = wrc_int_sr)
      summary(full_int_sr) # adj R^2 = 0.08442
      full_int_sr_RMSE <- round(sqrt(mean(full_int_sr$residuals^2)),2) # RMSE = 37.01
      model_stats_sr[3,1:5] <- c("full_int_sr", "Multiple linear regression", 
                                 "Linear model with all interaction terms", 0.08442, full_int_sr_RMSE)
      
      anova(full_lm_sr, full_int_sr) # interaction model is signifcantly more useful model
      
      reduced_int_sr <- lm(wRCplus.sr ~ BA.fr + SlgPct.fr + X3B.fr + SH.fr + PA.fr + X2B.fr + PA.fr:X2B.fr +
                             BA.fr:SlgPct.fr + wRAA.fr + BA.fr:wRAA.fr + wRCplus.fr + BA.fr:wRCplus.fr +
                             BA.fr:X2B.fr + BA.fr:X3B.fr + HRpct.fr + BA.fr:HRpct.fr + SF.fr + BA.fr:SF.fr +
                             CS.fr + BA.fr:CS.fr + SlgPct.fr:X3B.fr + SlgPct.fr:HRpct.fr + BABIP.fr +
                             BABIP.fr:wRAA.fr + R.fr + BABIP.fr:R.fr + BABIP.fr:X2B.fr + BABIP.fr+X3B.fr +
                             wRCplus.fr:CS.fr + R.fr:HRpct.fr + X2B.fr:X3B.fr + Kpct.fr + X3B.fr:Kpct.fr +
                             TB.fr + SB.fr + TB.fr:SB.fr + HR.fr + SH.fr + HR.fr:SH.fr, data = wrc_int_sr)
      summary(reduced_int_sr) # adj R^2 = 0.06089
      reduced_int_sr_RMSE <- round(sqrt(mean(reduced_int_sr$residuals^2)),2) # RMSE = 39.50
      model_stats_sr[4,1:5] <- c("reduced_int_sr", "Multiple linear regression", 
                                 "Linear model with interaction terms and significant predictors", 0.06089,
                                 reduced_int_sr_RMSE)
      
    # Additional Models
      
      # k-nearest neighbors for regression
      
      rmse = function(actual, predicted) {
        sqrt(mean((actual - predicted) ^ 2))
      }
      
      make_knn_pred <-  function(k = 1, training, predicting) {
        pred <-  FNN::knn.reg(train = training[,1:23], 
                              test = predicting[,1:23], 
                              y = training[,24], k = k)$pred
        act  <-  predicting[,24]
        rmse(predicted = pred, actual = act)
      }
      
      k <- seq(10,200,10)  
      
      
      # Sophomore year
      
      fr_so_train_norm <- fr_so_train
      fr_so_valid_norm <- fr_so_valid
      wrc_fr_so_norm <- wrc_fr_so
      
      norm_values <- preProcess(fr_so_train[, 1:23], method = c("center", "scale"))
      fr_so_train_norm[, 1:23] <- predict(norm_values, fr_so_train[, 1:23])
      fr_so_valid_norm[, 1:23] <- predict(norm_values, fr_so_valid[, 1:23])
      wrc_fr_so_norm[, 1:23] <- predict(norm_values, wrc_fr_so[, 1:23])
      
      knn_train_rmse_so <-  sapply(k, make_knn_pred, 
                                   training = fr_so_train_norm, 
                                   predicting = fr_so_valid_norm)
      
      knn_valid_rmse_so <-  sapply(k, make_knn_pred, 
                                   training = fr_so_train_norm, 
                                   predicting = fr_so_valid_norm)
      
      best_k_so <-  k[which.min(knn_valid_rmse_so)]
      fit_status <-  ifelse(k < best_k_so, "Over", ifelse(k == best_k_so, "Best", "Under"))
      
      knn_results_so <- data.frame(
        k,
        round(knn_train_rmse_so, 2),
        round(knn_valid_rmse_so, 2),
        fit_status
      )
      
      colnames(knn_results_so) <-  c("k", "Train RMSE", "Valid RMSE", "Fit?")
      knitr::kable(knn_results_so, escape = FALSE, booktabs = TRUE) # k = 180, valid rmse = 38.88
      
      so_knn_best <- knn.reg(train = fr_so_train_norm[, 1:23], test = fr_so_valid_norm[, 1:23], 
                             y = fr_so_train_norm[,24], k = 180)
      
      model_stats_so[5,1:5] <- c("so_knn_best", "k-nearest neighbors",
                                 "k-nearest neighbors regression model with k=180", NA, 38.88)
      
      # Junior year
      
      fr_jr_train_norm <- fr_jr_train
      fr_jr_valid_norm <- fr_jr_valid
      wrc_fr_jr_norm <- wrc_fr_jr
      
      norm_values <- preProcess(fr_jr_train[, 1:23], method = c("center", "scale"))
      fr_jr_train_norm[, 1:23] <- predict(norm_values, fr_jr_train[, 1:23])
      fr_jr_valid_norm[, 1:23] <- predict(norm_values, fr_jr_valid[, 1:23])
      wrc_fr_jr_norm[, 1:23] <- predict(norm_values, wrc_fr_jr[, 1:23])
      
      
      knn_train_rmse_jr <-  sapply(k, make_knn_pred, 
                                   training = fr_jr_train_norm, 
                                   predicting = fr_jr_valid_norm)
      
      knn_valid_rmse_jr <-  sapply(k, make_knn_pred, 
                                   training = fr_jr_train_norm, 
                                   predicting = fr_jr_valid_norm)
      
      best_k_jr <-  k[which.min(knn_valid_rmse_jr)]
      fit_status = ifelse(k < best_k_jr, "Over", ifelse(k == best_k_jr, "Best", "Under"))
      
      knn_results_jr <-  data.frame(
        k,
        round(knn_train_rmse_jr, 2),
        round(knn_valid_rmse_jr, 2),
        fit_status
      )
      
      colnames(knn_results_jr) <-  c("k", "Train RMSE", "Valid RMSE", "Fit?")
      knitr::kable(knn_results_jr, escape = FALSE, booktabs = TRUE) # k = 120, valid rmse = 42.08
      
      jr_knn_best <- knn.reg(train = fr_jr_train_norm[, 1:23], test = fr_jr_valid_norm[, 1:23], 
                             y = fr_jr_train_norm[,24], k = 120)
      
      model_stats_jr[5,1:5] <- c("jr_knn_best", "k-nearest neighbors",
                                 "k-nearest neighbors regression model with k=120", NA, 42.08)
      
      # Senior year
      
      fr_sr_train_norm <- fr_sr_train
      fr_sr_valid_norm <- fr_sr_valid
      wrc_fr_sr_norm <- wrc_fr_sr
      
      norm_values <- preProcess(fr_sr_train[, 1:23], method = c("center", "scale"))
      fr_sr_train_norm[, 1:23] <- predict(norm_values, fr_sr_train[, 1:23])
      fr_sr_valid_norm[, 1:23] <- predict(norm_values, fr_sr_valid[, 1:23])
      wrc_fr_sr_norm[, 1:23] <- predict(norm_values, wrc_fr_sr[, 1:23])
      
      
      knn_train_rmse_sr <-  sapply(k, make_knn_pred, 
                                   training = fr_sr_train_norm, 
                                   predicting = fr_sr_valid_norm)
      
      knn_valid_rmse_sr <-  sapply(k, make_knn_pred, 
                                   training = fr_sr_train_norm, 
                                   predicting = fr_sr_valid_norm)
      
      best_k_sr <-  k[which.min(knn_valid_rmse_sr)]
      fit_status = ifelse(k < best_k_sr, "Over", ifelse(k == best_k_sr, "Best", "Under"))
      
      knn_results_sr <-  data.frame(
        k,
        round(knn_train_rmse_sr, 2),
        round(knn_valid_rmse_sr, 2),
        fit_status
      )
      
      colnames(knn_results_sr) <-  c("k", "Train RMSE", "Valid RMSE", "Fit?")
      knitr::kable(knn_results_sr, escape = FALSE, booktabs = TRUE) # k = 110, valid rmse = 39.76
      
      sr_knn_best <- knn.reg(train = fr_sr_train_norm[, 1:23], test = fr_sr_valid_norm[, 1:23], 
                             y = fr_sr_train_norm[,24], k = 110)
      
      model_stats_sr[5,1:5] <- c("sr_knn_best", "k-nearest neighbors",
                                 "k-nearest neighbors regression model with k=110", NA, 39.76)
      
      
      
  # Classification Modeling
      
    # Logistic Regression Models
      
      # Sophomore year
      
        # All levels
      
        wrc_fr_so_glm <- wrc_fr_so %>% mutate(wRCplusrating.so = 
                                                  case_when(wRCplus.so <= 60 ~ "Awful", 
                                                            wRCplus.so > 60 & wRCplus.so <= 75 ~ "Poor",
                                                            wRCplus.so > 75 & wRCplus.so <= 90 ~ "Below Average",
                                                            wRCplus.so > 90 & wRCplus.so <= 110 ~ "Average",
                                                            wRCplus.so > 110 & wRCplus.so <= 130 ~ "Above Average",
                                                            wRCplus.so > 130 & wRCplus.so <= 150 ~ "Great",
                                                            wRCplus.so > 150 ~ "Excellent")
        )
  
        wrc_fr_so_glm$wRCplusrating.so <- as.factor(wrc_fr_so_glm$wRCplusrating.so)   
        wrc_fr_so_glm$wRCplusrating.so <- relevel(wrc_fr_so_glm$wRCplusrating.so, ref = "Average")
        
        multinom_so <- multinom(wRCplusrating.so ~ .-wRCplus.so, data = wrc_fr_so_glm)
        summary(multinom_so) # AIC = 17544.68
        multinom_so_AIC <- round(multinom_so$AIC,0)
        
        wrc_fr_so_glm$wRCplusPred <- predict(multinom_so, newdata = wrc_fr_so_glm, "class")
        so_table <- table(wrc_fr_so_glm$wRCplusrating.so, wrc_fr_so_glm$wRCplusPred)
        multinom_so_acc <- round((sum(diag(so_table))/sum(so_table)),4) # overall accuracy = 24.53%
        
        model_stats_so[6,c(1:3,6:7)] <- c("multinom_so", "Logistic regression",
                                          "Multinomial logistic regression model with 7 categorical outcomes",
                                          multinom_so_AIC, multinom_so_acc)
      
        # Binary outcome
        
        wrc_fr_so_glm2 <- wrc_fr_so %>% mutate(wRCplusrating.so = 
                                                 case_when(wRCplus.so < 100 ~ 0,
                                                           wRCplus.so >= 100 ~ 1))
        wrc_fr_so_glm2$wRCplusrating.so <- as.factor(wrc_fr_so_glm2$wRCplusrating.so)
        
        glm_so <- glm(wRCplusrating.so ~ . - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
        summary(glm_so) # AIC = 6054.2
        glm_so_AIC <- round(glm_so$aic,0)
        
        glm_so_pred <- predict(glm_so, wrc_fr_so_glm2, type = "response")
        cm_glm_so <- confusionMatrix(as.factor(ifelse(glm_so_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so),
                        positive = "1") # overall accuracy = 0.6351
        glm_so_acc <- round(cm_glm_so$overall[1],4)
        
        model_stats_so[7,c(1:3,6:7)] <- c("glm_so", "Logistic regression",
                                          "Binary logistic regression model", glm_so_AIC, glm_so_acc)
        
        glm_so_reduced <- glm(wRCplusrating.so ~ PA.fr + BABIP.fr + wRAA.fr + wRCplus.fr + HR.fr + HRpct.fr +
                                K.fr + Kpct.fr + SH.fr + SB.fr - wRCplus.so, data = wrc_fr_so_glm2, family = "binomial")
        summary(glm_so_reduced) # AIC = 6038.9
        glm_so_reduced_AIC <- round(glm_so_reduced$aic,0)
        
        
        glm_so_reduced_pred <- predict(glm_so_reduced, wrc_fr_so_glm2, type = "response")
        cm_glm_so_reduced <- confusionMatrix(as.factor(ifelse(glm_so_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_so_glm2$wRCplusrating.so),
                        positive = "1") # overall accuracy = 0.6334
        glm_so_reduced_acc <- round(cm_glm_so_reduced$overall[1],4)
        
        model_stats_so[8, c(1:3,6:7)] <- c("glm_so_reduced", "Logistic regression",
                                           "Binary logistic regression model with significant variables",
                                           glm_so_reduced_AIC, glm_so_reduced_acc)
        
      # Junior Year
        
        # All levels
        
        wrc_fr_jr_glm <- wrc_fr_jr %>% mutate(wRCplusrating.jr = 
                                                case_when(wRCplus.jr <= 60 ~ "Awful", 
                                                          wRCplus.jr > 60 & wRCplus.jr <= 75 ~ "Poor",
                                                          wRCplus.jr > 75 & wRCplus.jr <= 90 ~ "Below Average",
                                                          wRCplus.jr > 90 & wRCplus.jr <= 110 ~ "Average",
                                                          wRCplus.jr > 110 & wRCplus.jr <= 130 ~ "Above Average",
                                                          wRCplus.jr > 130 & wRCplus.jr <= 150 ~ "Great",
                                                          wRCplus.jr > 150 ~ "Excellent")
        )
        
        wrc_fr_jr_glm$wRCplusrating.jr <- as.factor(wrc_fr_jr_glm$wRCplusrating.jr)
        wrc_fr_jr_glm$wRCplusrating.jr <- relevel(wrc_fr_jr_glm$wRCplusrating.jr, ref = "Average")
        
        multinom_jr <- multinom(wRCplusrating.jr ~ .-wRCplus.jr, data = wrc_fr_jr_glm)
        summary(multinom_jr) # AIC = 13693.88
        multinom_jr_AIC <- round(multinom_jr$AIC,0)
        
        wrc_fr_jr_glm$wRCplusPred <- predict(multinom_jr, newdata = wrc_fr_jr_glm, "class")
        jr_table <- table(wrc_fr_jr_glm$wRCplusrating.jr, wrc_fr_jr_glm$wRCplusPred)
        multinom_jr_acc <- round((sum(diag(jr_table))/sum(jr_table)),4) # overall accuracy = 25.11%
        
        model_stats_jr[6,c(1:3,6:7)] <- c("multinom_jr", "Logistic regression",
                                          "Multinomial logistic regression model with 7 categorical outcomes",
                                          multinom_jr_AIC, multinom_jr_acc)
        
        # Binary 
        
        # Binary outcome
        
        wrc_fr_jr_glm2 <- wrc_fr_jr %>% mutate(wRCplusrating.jr = 
                                                 case_when(wRCplus.jr < 100 ~ 0,
                                                           wRCplus.jr >= 100 ~ 1))
        wrc_fr_jr_glm2$wRCplusrating.jr <- as.factor(wrc_fr_jr_glm2$wRCplusrating.jr)
        
        glm_jr <- glm(wRCplusrating.jr ~ . - wRCplus.jr, data = wrc_fr_jr_glm2, family = "binomial")
        summary(glm_jr) # AIC = 4707.4
        glm_jr_AIC <- round(glm_jr$aic,0)
        
        glm_jr_pred <- predict(glm_jr, wrc_fr_jr_glm2, type = "response")
        cm_glm_jr <- confusionMatrix(as.factor(ifelse(glm_jr_pred >= 0.5, "1", "0")), as.factor(wrc_fr_jr_glm2$wRCplusrating.jr),
                        positive = "1") # overall accuracy = 0.6435
        glm_jr_acc <- round(cm_glm_jr$overall[1],4)
        
        model_stats_jr[7,c(1:3,6:7)] <- c("glm_jr", "Logistic regression",
                                          "Binary logistic regression model",
                                          glm_jr_AIC, glm_jr_acc)
        
        glm_jr_reduced <- glm(wRCplusrating.jr ~ PA.fr + BA.fr + wRCplus.fr + R.fr + K.fr + SH.fr + SB.fr +
                                CS.fr - wRCplus.jr, data = wrc_fr_jr_glm2, family = "binomial")
        summary(glm_jr_reduced) # AIC = 4707.9
        glm_jr_reduced_AIC <- round(glm_jr_reduced$aic,0)
        
        glm_jr_reduced_pred <- predict(glm_jr_reduced, wrc_fr_jr_glm2, type = "response")
        cm_glm_jr_reduced <- confusionMatrix(as.factor(ifelse(glm_jr_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_jr_glm2$wRCplusrating.jr),
                        positive = "1") # overall accuracy = 0.6382
        glm_jr_reduced_acc <- round(cm_glm_jr_reduced$overall[1], 4)
        
        model_stats_jr[8, c(1:3,6:7)] <- c("glm_jr_reduced", "Logistic regression",
                                           "Binary logistic regression model with significant variables",
                                           glm_jr_reduced_AIC, glm_jr_reduced_acc)
        
      # Senior Year
        
        # All levels
        
        wrc_fr_sr_glm <- wrc_fr_sr %>% mutate(wRCplusrating.sr = 
                                                case_when(wRCplus.sr <= 60 ~ "Awful", 
                                                          wRCplus.sr > 60 & wRCplus.sr <= 75 ~ "Poor",
                                                          wRCplus.sr > 75 & wRCplus.sr <= 90 ~ "Below Average",
                                                          wRCplus.sr > 90 & wRCplus.sr <= 110 ~ "Average",
                                                          wRCplus.sr > 110 & wRCplus.sr <= 130 ~ "Above Average",
                                                          wRCplus.sr > 130 & wRCplus.sr <= 150 ~ "Great",
                                                          wRCplus.sr > 150 ~ "Excellent")
        )
        
        wrc_fr_sr_glm$wRCplusrating.sr <- as.factor(wrc_fr_sr_glm$wRCplusrating.sr)
        wrc_fr_sr_glm$wRCplusrating.sr <- relevel(wrc_fr_sr_glm$wRCplusrating.sr, ref = "Average")
        
        multinom_sr <- multinom(wRCplusrating.sr ~ .-wRCplus.sr, data = wrc_fr_sr_glm)
        summary(multinom_sr) # AIC = 9265.011
        multinom_sr_AIC <- round(multinom_sr$AIC,0)
        
        wrc_fr_sr_glm$wRCplusPred <- predict(multinom_sr, newdata = wrc_fr_sr_glm, "class")
        sr_table <- table(wrc_fr_sr_glm$wRCplusrating.sr, wrc_fr_sr_glm$wRCplusPred)
        multinom_sr_acc <- round((sum(diag(sr_table))/sum(sr_table)),4) # overall accuracy = 24.15%
        
        model_stats_sr[6,c(1:3,6:7)] <- c("multinom_sr", "Logistic regression",
                                          "Multinomial logistic regression model with 7 categorical outcomes",
                                          multinom_sr_AIC, multinom_sr_acc)
        
        # Binary 
        
        # Binary outcome
        
        wrc_fr_sr_glm2 <- wrc_fr_sr %>% mutate(wRCplusrating.sr = 
                                                 case_when(wRCplus.sr < 100 ~ 0,
                                                           wRCplus.sr >= 100 ~ 1))
        wrc_fr_sr_glm2$wRCplusrating.sr <- as.factor(wrc_fr_sr_glm2$wRCplusrating.sr)
        
        glm_sr <- glm(wRCplusrating.sr ~ . - wRCplus.sr, data = wrc_fr_sr_glm2, family = "binomial")
        summary(glm_sr) # AIC = 3172.7
        glm_sr_AIC <- round(glm_sr$aic,0)
        
        glm_sr_pred <- predict(glm_sr, wrc_fr_sr_glm2, type = "response")
        cm_glm_sr <- confusionMatrix(as.factor(ifelse(glm_sr_pred >= 0.5, "1", "0")), as.factor(wrc_fr_sr_glm2$wRCplusrating.sr),
                        positive = "1") # overall accuracy = 0.6279
        glm_sr_acc <- round(cm_glm_sr$overall[1],4)
        
        model_stats_sr[7,c(1:3,6:7)] <- c("glm_sr", "Logistic regression",
                                          "Binary logistic regression model",
                                          glm_sr_AIC, glm_sr_acc)
        
        glm_sr_reduced <- glm(wRCplusrating.sr ~ PA.fr + K.fr + SF.fr + CS.fr + SPD.fr - wRCplus.sr, 
                              data = wrc_fr_sr_glm2, family = "binomial")
        summary(glm_sr_reduced) # AIC = 3192.7
        glm_sr_reduced_AIC <- round(glm_sr_reduced$aic,0)
        
        glm_sr_reduced_pred <- predict(glm_sr_reduced, wrc_fr_sr_glm2, type = "response")
        cm_glm_sr_reduced <- confusionMatrix(as.factor(ifelse(glm_sr_reduced_pred >= 0.5, "1", "0")), as.factor(wrc_fr_sr_glm2$wRCplusrating.sr),
                        positive = "1") # overall accuracy = 0.6177
        glm_sr_reduced_acc <- round(cm_glm_sr_reduced$overall[1],4)
        
        model_stats_sr[8, c(1:3,6:7)] <- c("glm_sr_reduced", "Logistic regression",
                                           "Binary logistic regression model with significant variables",
                                           glm_sr_reduced_AIC, glm_sr_reduced_acc)
     
    # k-nearest neighbors for classification
        
      # Sophomore year
        
        # All outcomes
        
          wrc_fr_so_knn <- wrc_fr_so_glm[, -c(24,26)]
          
          set.seed(1819)
          train.index1 <- sample(nrow(wrc_fr_so_knn), nrow(wrc_fr_so_knn) * 0.8)
          fr_so_knn_train <- wrc_fr_so_knn[train.index1, ]
          fr_so_knn_valid <- wrc_fr_so_knn[-train.index1, ]
          
          fr_so_knn_train_norm <- fr_so_knn_train
          fr_so_knn_valid_norm <- fr_so_knn_valid
          fr_so_knn_norm <- wrc_fr_so_knn
          
          norm_values1 <- preProcess(fr_so_knn_train[, 1:23], method = c("center", "scale"))
          fr_so_knn_train_norm[, 1:23] <- predict(norm_values1, fr_so_knn_train[, 1:23])
          fr_so_knn_valid_norm[, 1:23] <- predict(norm_values1, fr_so_knn_valid[, 1:23])
          fr_so_knn_norm[, 1:23] <- predict(norm_values1, fr_so_knn_norm[, 1:23])
          
          fr_so_knn_all <- knn(train = fr_so_knn_train_norm[, 1:23], test = fr_so_knn_valid_norm[, 1:23],
                               cl = fr_so_knn_train_norm[, 24], k = 3)
          accuracy_so_all <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
          for (i in 1:20) {
            knn_pred_so_all <- knn(train = fr_so_knn_train_norm[, 1:23],
                                   test = fr_so_knn_valid_norm[, 1:23],
                                   cl = fr_so_knn_train_norm[, 24], k = i)
            accuracy_so_all[i, 2] <- confusionMatrix(knn_pred_so_all, fr_so_knn_valid_norm[, 24])$overall[1]
          }
          accuracy_so_all # k = 30 gives best accuracy - 20.99%
          
          fr_so_knn_all_best <- knn(train = fr_so_knn_train_norm[, 1:23], test = fr_so_knn_valid_norm[, 1:23],
                               cl = fr_so_knn_train_norm[, 24], k = 30)
          confusionMatrix(fr_so_knn_all_best, fr_so_knn_valid_norm[, 24])
          fr_so_knn_all_best_acc <- round(confusionMatrix(fr_so_knn_all_best, fr_so_knn_valid_norm[, 24])$overall[1],4)
          
          model_stats_so[9, c(1:3,7)] <- c("fr_so_knn_all_best", "k-nearest neighbors",
                                             "k-nearest neighbors model with all 7 wRC+ classifications",
                                             fr_so_knn_all_best_acc)
        
        # Binary Outcome
          
          wrc_fr_so_knn_bin <- wrc_fr_so_glm2[, -24]
          
          set.seed(1819)
          train.index1 <- sample(nrow(wrc_fr_so_knn_bin), nrow(wrc_fr_so_knn_bin) * 0.8)
          fr_so_knn_bin_train <- wrc_fr_so_knn_bin[train.index1, ]
          fr_so_knn_bin_valid <- wrc_fr_so_knn_bin[-train.index1, ]
          
          fr_so_knn_bin_train_norm <- fr_so_knn_bin_train
          fr_so_knn_bin_valid_norm <- fr_so_knn_bin_valid
          fr_so_knn_bin_norm <- wrc_fr_so_knn_bin
          
          norm_values1.5 <- preProcess(fr_so_knn_bin_train[, 1:23], method = c("center", "scale"))
          fr_so_knn_bin_train_norm[, 1:23] <- predict(norm_values1.5, fr_so_knn_bin_train[, 1:23])
          fr_so_knn_bin_valid_norm[, 1:23] <- predict(norm_values1.5, fr_so_knn_bin_valid[, 1:23])
          fr_so_knn_bin_norm[, 1:23] <- predict(norm_values1.5, fr_so_knn_bin_norm[, 1:23])
          
          fr_so_knn_bin <- knn(train = fr_so_knn_bin_train_norm[, 1:23], 
                                   test = fr_so_knn_bin_valid_norm[, 1:23],
                                   cl = fr_so_knn_bin_train_norm[, 24], k = 3)
          

          accuracy_so_bin <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
          for (i in 1:20) {
            knn_pred_so_bin <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
                                   test = fr_so_knn_bin_valid_norm[, 1:23],
                                   cl = as.factor(fr_so_knn_bin_train_norm[, 24]), k = i)
            accuracy_so_bin[i, 2] <- confusionMatrix(knn_pred_so_bin, as.factor(fr_so_knn_bin_valid_norm[, 24]))$overall[1]
          }
          accuracy_so_bin # k = 150 gives best accuracy - 60.44%
        
          fr_so_knn_bin_best <- knn(train = fr_so_knn_bin_train_norm[, 1:23],
                                    test = fr_so_knn_bin_valid_norm[, 1:23],
                                    cl = as.factor(fr_so_knn_bin_train_norm[, 24]), k = 150)
          confusionMatrix(fr_so_knn_bin_best, as.factor(fr_so_knn_bin_valid_norm[, 24]))
          
          fr_so_knn_bin_best_acc <- round(confusionMatrix(fr_so_knn_bin_best, as.factor(fr_so_knn_bin_valid_norm[, 24]))$overall[1],4)
          
          model_stats_so[10, c(1:3,7)] <- c("fr_so_knn_bin_best", "k-nearest neighbors",
                                           "k-nearest neighbors model with binary wRC+ classifications",
                                           fr_so_knn_bin_best_acc)
  
  
        # Junior year
          
          # All outcomes
          
          wrc_fr_jr_knn <- wrc_fr_jr_glm[, -c(24,26)]
          
          set.seed(1819)
          train.index1 <- sample(nrow(wrc_fr_jr_knn), nrow(wrc_fr_jr_knn) * 0.8)
          fr_jr_knn_train <- wrc_fr_jr_knn[train.index1, ]
          fr_jr_knn_valid <- wrc_fr_jr_knn[-train.index1, ]
          
          fr_jr_knn_train_norm <- fr_jr_knn_train
          fr_jr_knn_valid_norm <- fr_jr_knn_valid
          fr_jr_knn_norm <- wrc_fr_jr_knn
          
          norm_values2 <- preProcess(fr_jr_knn_train[, 1:23], method = c("center", "scale"))
          fr_jr_knn_train_norm[, 1:23] <- predict(norm_values2, fr_jr_knn_train[, 1:23])
          fr_jr_knn_valid_norm[, 1:23] <- predict(norm_values2, fr_jr_knn_valid[, 1:23])
          fr_jr_knn_norm[, 1:23] <- predict(norm_values2, fr_jr_knn_norm[, 1:23])
          
          fr_jr_knn_all <- knn(train = fr_jr_knn_train_norm[, 1:23], test = fr_jr_knn_valid_norm[, 1:23],
                               cl = fr_jr_knn_train_norm[, 24], k = 3)
          accuracy_jr_all <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
          for (i in 1:20) {
            knn_pred_jr_all <- knn(train = fr_jr_knn_train_norm[, 1:23],
                                   test = fr_jr_knn_valid_norm[, 1:23],
                                   cl = fr_jr_knn_train_norm[, 24], k = i)
            accuracy_jr_all[i, 2] <- confusionMatrix(knn_pred_jr_all, fr_jr_knn_valid_norm[, 24])$overall[1]
          }
          accuracy_jr_all # k = 80 gives best accuracy - 21.66%
          
          fr_jr_knn_all_best <- knn(train = fr_jr_knn_train_norm[, 1:23], test = fr_jr_knn_valid_norm[, 1:23],
                                    cl = fr_jr_knn_train_norm[, 24], k = 80)
          fr_jr_knn_all_best_acc <- round(confusionMatrix(fr_jr_knn_all_best, fr_jr_knn_valid_norm[, 24])$overall[1],4)
          
          model_stats_jr[9, c(1:3,7)] <- c("fr_jr_knn_all_best", "k-nearest neighbors",
                                           "k-nearest neighbors model with all 7 wRC+ classifications",
                                           fr_jr_knn_all_best_acc)
          
          # Binary Outcome
          
          wrc_fr_jr_knn_bin <- wrc_fr_jr_glm2[, -24]
          
          set.seed(1819)
          train.index1 <- sample(nrow(wrc_fr_jr_knn_bin), nrow(wrc_fr_jr_knn_bin) * 0.8)
          fr_jr_knn_bin_train <- wrc_fr_jr_knn_bin[train.index1, ]
          fr_jr_knn_bin_valid <- wrc_fr_jr_knn_bin[-train.index1, ]
          
          fr_jr_knn_bin_train_norm <- fr_jr_knn_bin_train
          fr_jr_knn_bin_valid_norm <- fr_jr_knn_bin_valid
          fr_jr_knn_bin_norm <- wrc_fr_jr_knn_bin
          
          norm_values2.5 <- preProcess(fr_jr_knn_bin_train[, 1:23], method = c("center", "scale"))
          fr_jr_knn_bin_train_norm[, 1:23] <- predict(norm_values2.5, fr_jr_knn_bin_train[, 1:23])
          fr_jr_knn_bin_valid_norm[, 1:23] <- predict(norm_values2.5, fr_jr_knn_bin_valid[, 1:23])
          fr_jr_knn_bin_norm[, 1:23] <- predict(norm_values2.5, fr_jr_knn_bin_norm[, 1:23])
          
          fr_jr_knn_bin <- knn(train = fr_jr_knn_bin_train_norm[, 1:23], 
                               test = fr_jr_knn_bin_valid_norm[, 1:23],
                               cl = fr_jr_knn_bin_train_norm[, 24], k = 3)
          
          
          accuracy_jr_bin <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
          for (i in 1:20) {
            knn_pred_jr_bin <- knn(train = fr_jr_knn_bin_train_norm[, 1:23],
                                   test = fr_jr_knn_bin_valid_norm[, 1:23],
                                   cl = as.factor(fr_jr_knn_bin_train_norm[, 24]), k = i)
            accuracy_jr_bin[i, 2] <- confusionMatrix(knn_pred_jr_bin, as.factor(fr_jr_knn_bin_valid_norm[, 24]))$overall[1]
          }
          accuracy_jr_bin # k = 150 gives best accuracy - 63.94%
          
          fr_jr_knn_bin_best <- knn(train = fr_jr_knn_bin_train_norm[, 1:23],
                                    test = fr_jr_knn_bin_valid_norm[, 1:23],
                                    cl = as.factor(fr_jr_knn_bin_train_norm[, 24]), k = 150)
          fr_jr_knn_bin_best_acc <- round(confusionMatrix(fr_jr_knn_bin_best, as.factor(fr_jr_knn_bin_valid_norm[, 24]))$overall[1],4)
          
          model_stats_jr[10, c(1:3,7)] <- c("fr_jr_knn_bin_best", "k-nearest neighbors",
                                            "k-nearest neighbors model with binary wRC+ classifications",
                                            fr_jr_knn_bin_best_acc)
     
        # Senior year
          
          # All outcomes
          
          wrc_fr_sr_knn <- wrc_fr_sr_glm[, -c(24,26)]
          
          set.seed(1819)
          train.index1 <- sample(nrow(wrc_fr_sr_knn), nrow(wrc_fr_sr_knn) * 0.8)
          fr_sr_knn_train <- wrc_fr_sr_knn[train.index1, ]
          fr_sr_knn_valid <- wrc_fr_sr_knn[-train.index1, ]
          
          fr_sr_knn_train_norm <- fr_sr_knn_train
          fr_sr_knn_valid_norm <- fr_sr_knn_valid
          fr_sr_knn_norm <- wrc_fr_sr_knn
          
          norm_values3 <- preProcess(fr_sr_knn_train[, 1:23], method = c("center", "scale"))
          fr_sr_knn_train_norm[, 1:23] <- predict(norm_values3, fr_sr_knn_train[, 1:23])
          fr_sr_knn_valid_norm[, 1:23] <- predict(norm_values3, fr_sr_knn_valid[, 1:23])
          fr_sr_knn_norm[, 1:23] <- predict(norm_values3, fr_sr_knn_norm[, 1:23])
          
          fr_sr_knn_all <- knn(train = fr_sr_knn_train_norm[, 1:23],
                               test = fr_sr_knn_valid_norm[, 1:23],
                               cl = fr_sr_knn_train_norm[, 24], k = 3)
          accuracy_sr_all <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
          for (i in 1:20) {
            knn_pred_sr_all <- knn(train = fr_sr_knn_train_norm[, 1:23],
                                   test = fr_sr_knn_valid_norm[, 1:23],
                                   cl = fr_sr_knn_train_norm[, 24], k = i)
            accuracy_sr_all[i, 2] <- confusionMatrix(knn_pred_sr_all, fr_sr_knn_valid_norm[, 24])$overall[1]
          }
          accuracy_sr_all # k = 190 gives best accuracy - 20.25%
          
          fr_sr_knn_all_best <- knn(train = fr_sr_knn_train_norm[, 1:23],
                                    test = fr_sr_knn_valid_norm[, 1:23],
                                    cl = fr_sr_knn_train_norm[, 24], k = 190)
          fr_sr_knn_all_best_acc <- round(confusionMatrix(fr_sr_knn_all_best, fr_sr_knn_valid_norm[, 24])$overall[1],4)
          
          model_stats_sr[9, c(1:3,7)] <- c("fr_sr_knn_all_best", "k-nearest neighbors",
                                           "k-nearest neighbors model with all 7 wRC+ classifications",
                                           fr_sr_knn_all_best_acc)
          
          # Binary Outcome
          
          wrc_fr_sr_knn_bin <- wrc_fr_sr_glm2[, -24]
          
          set.seed(1819)
          train.index1 <- sample(nrow(wrc_fr_sr_knn_bin), nrow(wrc_fr_sr_knn_bin) * 0.8)
          fr_sr_knn_bin_train <- wrc_fr_sr_knn_bin[train.index1, ]
          fr_sr_knn_bin_valid <- wrc_fr_sr_knn_bin[-train.index1, ]
          
          fr_sr_knn_bin_train_norm <- fr_sr_knn_bin_train
          fr_sr_knn_bin_valid_norm <- fr_sr_knn_bin_valid
          fr_sr_knn_bin_norm <- wrc_fr_sr_knn_bin
          
          norm_values3.5 <- preProcess(fr_sr_knn_bin_train[, 1:23], method = c("center", "scale"))
          fr_sr_knn_bin_train_norm[, 1:23] <- predict(norm_values3.5, fr_sr_knn_bin_train[, 1:23])
          fr_sr_knn_bin_valid_norm[, 1:23] <- predict(norm_values3.5, fr_sr_knn_bin_valid[, 1:23])
          fr_sr_knn_bin_norm[, 1:23] <- predict(norm_values3.5, fr_sr_knn_bin_norm[, 1:23])
          
          fr_sr_knn_bin <- knn(train = fr_sr_knn_bin_train_norm[, 1:23], 
                               test = fr_sr_knn_bin_valid_norm[, 1:23],
                               cl = fr_sr_knn_bin_train_norm[, 24], k = 3)
        
          
          accuracy_sr_bin <- data.frame(k = seq(10, 200, 10), accuracy = rep(0, 20))
          for (i in 1:20) {
            knn_pred_sr_bin <- knn(train = fr_sr_knn_bin_train_norm[, 1:23],
                                   test = fr_sr_knn_bin_valid_norm[, 1:23],
                                   cl = as.factor(fr_sr_knn_bin_train_norm[, 24]), k = i)
            accuracy_sr_bin[i, 2] <- confusionMatrix(knn_pred_sr_bin, as.factor(fr_sr_knn_bin_valid_norm[, 24]))$overall[1]
          }
          accuracy_sr_bin # k = 50 gives best accuracy - 61.55%
          
          fr_sr_knn_bin_best <- knn(train = fr_sr_knn_bin_train_norm[, 1:23],
                                    test = fr_sr_knn_bin_valid_norm[, 1:23],
                                    cl = as.factor(fr_sr_knn_bin_train_norm[, 24]), k = 50)
          fr_sr_knn_bin_best_acc <- round(confusionMatrix(fr_sr_knn_bin_best, as.factor(fr_sr_knn_bin_valid_norm[, 24]))$overall[1],4)
          
          model_stats_sr[10, c(1:3,7)] <- c("fr_sr_knn_bin_best", "k-nearest neighbors",
                                            "k-nearest neighbors model with binary wRC+ classifications",
                                            fr_sr_knn_bin_best_acc)
          
      
    # AAC Prediction using glm model
          
        # Sophomore year
          
          full <- read.csv("D1 Batting 2013-2021.csv")
          full_aac_fr <- full[full$year==2021 & full$Yr=="Fr" & full$conference == "AAC" &
                                full$PA >= 32, ] # 30 eligable freshman players in the AAC in 2021
          
          full_aac_fr <- full_aac_fr %>%
            rename(PA.fr = PA,
                   BA.fr = BA,
                   SlgPct.fr = SlgPct,
                   BABIP.fr = BABIP,
                   wRAA.fr = wRAA,
                   wRCplus.fr = wRCplus,
                   R.fr = R,
                   X2B.fr = X2B,
                   X3B.fr = X3B,
                   TB.fr = TB,
                   HR.fr = HR,
                   HRpct.fr = HRpct,
                   RBI.fr = RBI,
                   BB.fr = BB,
                   K.fr = K,
                   BBpct.fr = BBpct,
                   Kpct.fr = Kpct,
                   HBP.fr = HBP,
                   SF.fr = SF,
                   SH.fr = SH,
                   SB.fr = SB,
                   CS.fr = CS,
                   SPD.fr = SPD)
          
          full_aac_fr$wRCplus.so <- 0
          full_aac_fr$wRCplus.jr <- 0
          full_aac_fr$wRCplus.sr <- 0
          
          aac_pred_so <- predict(glm_so_reduced, newdata = full_aac_fr, type = "response")
          aac_pred_jr <- predict(glm_jr_reduced, newdata = full_aac_fr, type = "response")
          aac_pred_sr <- predict(glm_sr_reduced, newdata = full_aac_fr, type = "response")
          
          full_aac_fr <- cbind(full_aac_fr, aac_pred_so, aac_pred_jr, aac_pred_sr)
          
          full_aac_fr <- full_aac_fr %>% mutate(So_Prediction = 
                                                  case_when(aac_pred_so < 0.50 ~ "Below Average",
                                                            aac_pred_so >= 0.50 ~ "Above Average"))
          full_aac_fr <- full_aac_fr %>% mutate(Jr_Prediction = 
                                                  case_when(aac_pred_jr < 0.50 ~ "Below Average",
                                                            aac_pred_jr >= 0.50 ~ "Above Average"))
          full_aac_fr <- full_aac_fr %>% mutate(Sr_Prediction = 
                                                  case_when(aac_pred_sr < 0.50 ~ "Below Average",
                                                            aac_pred_sr >= 0.50 ~ "Above Average"))          
          
          aac_fr_predictions <- full_aac_fr[, c(3:4, 6, 8, 46:48)]

Predicting NCAA Baseball Player Development: An analysis of player performance over the course of a collegiate baseball career

Nicholas Behrendt

December 1, 2021