library(dplyr)
library(Metrics)
library(MLmetrics)
library(leaps)
library(car)
# Review NA counts
colSums(is.na(data))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0 0 0 102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 131 772 2085 0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0 0 102 0
## TEAM_FIELDING_DP
## 286
# Remove NAs (na.omit drops every row that has at least one missing value)
data_clean <- na.omit(data)
# Confirm NAs were removed
colSums(is.na(data_clean))
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## 0 0 0 0
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 0 0 0 0
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 0 0 0 0
## TEAM_FIELDING_DP
## 0
# Split into training (75%) and test (25%) sets; no seed is set, so the split changes between runs
train_clean <- data_clean %>% dplyr::sample_frac(.75)
test_clean <- dplyr::anti_join(data_clean, train_clean, by = 'INDEX')
# Fit model
backward_model <- lm(TARGET_WINS ~ TEAM_BASERUN_SB + TEAM_BATTING_HR + TEAM_BATTING_BB
                     + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_clean)
# View summary
summary(backward_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BASERUN_SB + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E +
## TEAM_FIELDING_DP, data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.5399 -5.3040 -0.3688 4.7563 21.9519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.955038 12.975184 7.395 1.31e-11 ***
## TEAM_BASERUN_SB 0.030832 0.024195 1.274 0.2047
## TEAM_BATTING_HR 0.136807 0.025602 5.344 3.74e-07 ***
## TEAM_BATTING_BB 0.059201 0.011683 5.067 1.29e-06 ***
## TEAM_PITCHING_SO -0.030815 0.007384 -4.173 5.33e-05 ***
## TEAM_FIELDING_E -0.250154 0.048273 -5.182 7.75e-07 ***
## TEAM_FIELDING_DP -0.102194 0.042582 -2.400 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.533 on 136 degrees of freedom
## Multiple R-squared: 0.5451, Adjusted R-squared: 0.525
## F-statistic: 27.16 on 6 and 136 DF, p-value: < 2.2e-16
# Make predictions on test set
backward_model_predictions <- predict(backward_model, newdata = test_clean)
# Obtain RMSE between actuals and predicted
rmse(test_clean$TARGET_WINS, backward_model_predictions)
## [1] 9.006097
# Fit model
foward_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E + TEAM_FIELDING_DP
+ TEAM_PITCHING_HR + TEAM_PITCHING_SO, data = train_clean)
# View summary
summary(foward_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_PITCHING_SO,
## data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.6833 -5.1513 -0.3735 4.8161 21.3754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.667397 21.997150 2.258 0.025545 *
## TEAM_BATTING_H 0.033231 0.011834 2.808 0.005718 **
## TEAM_BATTING_BB 0.054802 0.011501 4.765 4.79e-06 ***
## TEAM_FIELDING_E -0.234642 0.047359 -4.955 2.12e-06 ***
## TEAM_FIELDING_DP -0.113452 0.041081 -2.762 0.006546 **
## TEAM_PITCHING_HR 0.096474 0.027799 3.470 0.000697 ***
## TEAM_PITCHING_SO -0.021810 0.008004 -2.725 0.007278 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.341 on 136 degrees of freedom
## Multiple R-squared: 0.5654, Adjusted R-squared: 0.5462
## F-statistic: 29.48 on 6 and 136 DF, p-value: < 2.2e-16
# Make predictions on test set
forward_model_predictions <- predict(foward_model, newdata = test_clean)
# Obtain RMSE between actuals and predicted
rmse(test_clean$TARGET_WINS, forward_model_predictions)
## [1] 9.233138
# Assumption: No Multicollinearity (VIF under 5)
vif(foward_model)
## TEAM_BATTING_H TEAM_BATTING_BB TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.586554 1.342971 1.119632 1.039797
## TEAM_PITCHING_HR TEAM_PITCHING_SO
## 1.654357 1.475093
# Assumption: Mean of residuals is zero
mean(residuals(foward_model))
## [1] -4.262598e-16
# Assumption: Homoscedasticity of residuals
plot(foward_model)
# Assumption: No auto-correlation
acf(residuals(foward_model), lag.max = 20)
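First, before fully evaluating the models, we checked each predictor's p-value against 0.05, the cutoff for significance at the 5% level (95% confidence). Every predictor in the forward selection model meets this cutoff; in the backward elimination model, TEAM_BASERUN_SB (p = 0.20) does not. We also confirmed that both models' F-statistics are significant at the same level.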
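The significance checks described above can be read directly off a fitted model object. The following is a minimal sketch, assuming the built-in `mtcars` data and an arbitrary formula as stand-ins for `train_clean` and our models; `fit`, `alpha`, `coef_p`, `not_significant`, and `model_p` are illustrative names, not objects from this report.

```r
# Stand-in model: mtcars and this formula are assumptions for illustration only
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
alpha <- 0.05  # significance level matching the 0.05 cutoff

# Coefficient table: column "Pr(>|t|)" holds the per-predictor p-values
coef_table <- summary(fit)$coefficients
coef_p <- coef_table[-1, "Pr(>|t|)"]  # drop the intercept row

# Predictors that do NOT meet the cutoff (empty if all are significant)
not_significant <- names(coef_p)[coef_p >= alpha]

# Overall F-test p-value, rebuilt from the stored F statistic and its degrees of freedom
f <- summary(fit)$fstatistic
model_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(coef_p)
print(not_significant)
print(model_p < alpha)
```

Listing the predictors that miss the cutoff, rather than only checking that all of them pass, makes exceptions easy to spot when several models are reviewed at once.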
Then, the two primary statistics used to choose our final model were adjusted R-squared and root mean square error (RMSE). Like R-squared, adjusted R-squared measures the share of variation in the dependent variable explained by the independent variables, but it adds a penalty for each additional predictor, so it rises only when a new predictor improves the fit more than would be expected by chance. RMSE was perhaps even more crucial to model selection: it is the standard deviation of the residuals (the prediction errors), so it measures accuracy in the same units as the response variable. To ensure the model can generalize to unobserved data, we calculated the RMSE on our test set.
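To make the two statistics concrete, the sketch below computes both by hand. It assumes the built-in `mtcars` data, an arbitrary 24/8 split, and an arbitrary seed as stand-ins for our data; none of these come from this report. The hand-computed adjusted R-squared is printed next to the value `summary()` reports, and the RMSE line is the same calculation that `Metrics::rmse(actual, predicted)` performs.

```r
# Stand-in data and split: mtcars and the 24/8 split are assumptions for illustration
set.seed(1)  # arbitrary seed so the split is repeatable
train_rows <- sample(nrow(mtcars), 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

fit <- lm(mpg ~ wt + hp, data = train)

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of training rows and p the number of predictors
r2 <- summary(fit)$r.squared
n  <- nrow(train)
p  <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# RMSE on held-out rows: square root of the mean squared prediction error
pred <- predict(fit, newdata = test)
test_rmse <- sqrt(mean((test$mpg - pred)^2))

print(c(adj_r2 = adj_r2, summary_adj_r2 = summary(fit)$adj.r.squared))
print(test_rmse)
```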
Both of our top models, forward selection and backward elimination, had similar test-set RMSEs: 9.23 and 9.01, respectively, in the run shown above. With so little separating them on RMSE, we chose the forward selection model for its slightly higher adjusted R-squared (0.546 vs 0.525). Additionally, since both top-performing models included six predictors, parsimony was not a consideration.
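A side-by-side comparison of candidate models can be tabulated with a small helper. This is a sketch under stated assumptions, not the code used for this report: `mtcars`, the two formulas, the seed, and the names `candidates`, `score_model`, and `comparison` are all hypothetical stand-ins.

```r
# Stand-in data and formulas: mtcars and both formulas are assumptions for illustration
set.seed(42)  # arbitrary seed so the train/test split is repeatable
train_rows <- sample(nrow(mtcars), 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

candidates <- list(
  model_a = mpg ~ wt + hp,
  model_b = mpg ~ wt + qsec
)

# Fit one candidate on the training rows and score it on the test rows
score_model <- function(formula, train, test) {
  fit <- lm(formula, data = train)
  response <- all.vars(formula)[1]  # name of the response variable
  pred <- predict(fit, newdata = test)
  data.frame(
    n_predictors = length(coef(fit)) - 1,
    adj_r2       = summary(fit)$adj.r.squared,
    test_rmse    = sqrt(mean((test[[response]] - pred)^2))
  )
}

# One row per candidate: predictor count, adjusted R-squared, test RMSE
comparison <- do.call(rbind, lapply(candidates, score_model, train = train, test = test))
print(comparison)
```

Scoring every candidate on the same fixed split keeps the RMSE values comparable and lets the table be regenerated exactly.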
Lastly, we verified that the forward selection model meets the OLS regression assumptions: no significant multicollinearity, a residual mean of zero, homoscedasticity of residuals, and no significant autocorrelation. We deemed all assumptions met, but note a slight trend in the residuals vs fitted plot (Assumption: Homoscedasticity of residuals), which may indicate a small nonlinear relationship.
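Several of these checks can also be expressed as numbers rather than plots. The sketch below uses base R only and assumes the built-in `mtcars` data and an arbitrary formula in place of our model; `manual_vif`, `mean_res`, `res_acf`, and `band` are illustrative names. The VIF is computed from its usual definition, which should agree with `car::vif()` when all predictors are numeric main effects.

```r
# Stand-in model: mtcars and this formula are assumptions for illustration only
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- residuals(fit)

# Multicollinearity: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the remaining predictors
predictors <- attr(terms(fit), "term.labels")
manual_vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2_j <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2_j)
})

# Mean of residuals: zero up to floating-point error when the model has an intercept
mean_res <- mean(res)

# Autocorrelation: sample ACF without plotting; the first element (lag 0) is always 1
res_acf <- acf(res, lag.max = 20, plot = FALSE)
# Approximate 95% band for white noise, the same band acf() draws by default
band <- qnorm(0.975) / sqrt(length(res))

print(manual_vif)
print(mean_res)
print(which(abs(res_acf$acf[-1]) > band))  # lags outside the band, if any
```

Homoscedasticity and linearity are still best judged from the residuals vs fitted and scale-location plots that `plot()` produces for an `lm` object.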
Bhandari, Aniruddha, “Key Difference between R-squared and Adjusted R-squared for Regression Analysis”, Analytics Vidhya, 2020, https://www.analyticsvidhya.com/blog/2020/07/difference-between-r-squared-and-adjusted-r-squared/
Glen, Stephanie, “RMSE: Root Mean Square Error”, StatisticsHowTo.com, https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/
Gupta, Aryansh, “Linear Regression Assumptions and Diagnostics in R”, RPubs, https://rpubs.com/aryn999/LinearRegressionAssumptionsAndDiagnosticsInR
Kim, Bommae, “Understanding Diagnostic Plots for Linear Regression Analysis”, University of Virginia Library, https://data.library.virginia.edu/diagnostic-plots/