HW1.5_ML

library(dplyr)
library(Metrics)
library(MLmetrics)
library(leaps)
library(car)

Load Data

Clean data

# Review NA counts
colSums(is.na(data))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0              102 
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
##              131              772             2085                0 
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##                0                0              102                0 
## TEAM_FIELDING_DP 
##              286

# Remove every row that contains at least one NA (listwise deletion)
data_clean <- na.omit(data)

# Confirm NAs were removed
colSums(is.na(data_clean))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##                0                0                0                0 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##                0                0                0                0 
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
##                0                0                0                0 
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##                0                0                0                0 
## TEAM_FIELDING_DP 
##                0
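Because `na.omit()` drops every row with at least one missing value, a single sparse column (here `TEAM_BATTING_HBP`, with 2085 NAs) largely determines how many rows survive. The sketch below is only an illustration on a made-up data frame: the `toy` object and its columns are hypothetical and not part of the assignment data. It shows one way to check that cost before committing to listwise deletion.

```r
# Hypothetical toy data: column b is mostly missing, like a sparse predictor
toy <- data.frame(
  a = c(1, 2, NA, 4, 5, 6),
  b = c(NA, NA, NA, NA, 5, 6),
  y = c(10, 20, 30, 40, 50, 60)
)

# Rows surviving listwise deletion across all columns
rows_all <- nrow(na.omit(toy))

# Rows surviving if the sparse column is set aside first
rows_without_b <- nrow(na.omit(toy[, c("a", "y")]))

c(all_columns = rows_all, without_b = rows_without_b)
```

Comparing the two counts makes the trade-off explicit: keeping a sparse column costs rows, while setting it aside costs a candidate predictor.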

Create Train, Test Split

train_clean <- data_clean %>% dplyr::sample_frac(.75)
test_clean  <- dplyr::anti_join(data_clean, train_clean, by = 'INDEX')
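The split above is random, so the fitted coefficients and test metrics can change from one run to the next unless the random seed is fixed. The sketch below shows the idea in base R on a made-up data frame. The `toy` data and the seed value 123 are assumptions for illustration; calling `set.seed()` before `sample_frac()` should have the same effect, since dplyr draws from R's random number generator.

```r
# Hypothetical toy data with a unique INDEX column, mirroring the key used above
toy <- data.frame(INDEX = 1:20, TARGET_WINS = seq(60, 98, by = 2))

set.seed(123)  # arbitrary seed (an assumption); any fixed value makes the split repeatable

# Draw 75% of the row positions for training; the remaining rows form the test set
train_rows <- sample(nrow(toy), size = round(0.75 * nrow(toy)))
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Sanity checks: no INDEX appears in both sets, and every row is used exactly once
length(intersect(toy_train$INDEX, toy_test$INDEX))
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```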

Model 3: Backward Elimination Model

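# Fit model
# Note: TEAM_BASERUN_SB is listed twice in the formula; R keeps a single copy of a
# repeated term, so the fitted model has six predictors (see the coefficient table)
backward_model <- lm(TARGET_WINS ~ TEAM_BASERUN_SB + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB 
                     + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_clean)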
# View summary
summary(backward_model)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BASERUN_SB + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = train_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.5399  -5.3040  -0.3688   4.7563  21.9519 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      95.955038  12.975184   7.395 1.31e-11 ***
## TEAM_BASERUN_SB   0.030832   0.024195   1.274   0.2047    
## TEAM_BATTING_HR   0.136807   0.025602   5.344 3.74e-07 ***
## TEAM_BATTING_BB   0.059201   0.011683   5.067 1.29e-06 ***
## TEAM_PITCHING_SO -0.030815   0.007384  -4.173 5.33e-05 ***
## TEAM_FIELDING_E  -0.250154   0.048273  -5.182 7.75e-07 ***
## TEAM_FIELDING_DP -0.102194   0.042582  -2.400   0.0178 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.533 on 136 degrees of freedom
## Multiple R-squared:  0.5451, Adjusted R-squared:  0.525 
## F-statistic: 27.16 on 6 and 136 DF,  p-value: < 2.2e-16

# Make predictions on test set
backward_model_predictions <- predict(backward_model, newdata = test_clean)

# Obtain RMSE between actuals and predicted
rmse(test_clean$TARGET_WINS, backward_model_predictions)

## [1] 9.006097
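For reference, the `rmse()` call above can be reproduced by hand. The sketch below is a minimal version of the calculation; the function name `rmse_by_hand` and the numbers are hypothetical and chosen only for illustration.

```r
# RMSE: square root of the mean squared difference between actual and predicted
rmse_by_hand <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical values for illustration only (not taken from the baseball data)
actual    <- c(80, 75, 90, 68)
predicted <- c(77, 79, 90, 63)

rmse_by_hand(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.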

Model 4: Forward Selection Model

# Fit model
foward_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E + TEAM_FIELDING_DP
                   + TEAM_PITCHING_HR + TEAM_PITCHING_SO, data = train_clean)
# View summary
summary(foward_model)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_PITCHING_HR + TEAM_PITCHING_SO, 
##     data = train_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.6833  -5.1513  -0.3735   4.8161  21.3754 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      49.667397  21.997150   2.258 0.025545 *  
## TEAM_BATTING_H    0.033231   0.011834   2.808 0.005718 ** 
## TEAM_BATTING_BB   0.054802   0.011501   4.765 4.79e-06 ***
## TEAM_FIELDING_E  -0.234642   0.047359  -4.955 2.12e-06 ***
## TEAM_FIELDING_DP -0.113452   0.041081  -2.762 0.006546 ** 
## TEAM_PITCHING_HR  0.096474   0.027799   3.470 0.000697 ***
## TEAM_PITCHING_SO -0.021810   0.008004  -2.725 0.007278 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.341 on 136 degrees of freedom
## Multiple R-squared:  0.5654, Adjusted R-squared:  0.5462 
## F-statistic: 29.48 on 6 and 136 DF,  p-value: < 2.2e-16

# Make predictions on test set
forward_model_predictions <- predict(foward_model, newdata = test_clean)

# Obtain RMSE between actuals and predicted
rmse(test_clean$TARGET_WINS, forward_model_predictions)

## [1] 9.233138
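Models 3 and 4 are shown with their predictors already chosen; the selection step itself does not appear in this report. One common way to run backward elimination and forward selection in R is `step()` from the stats package, which adds or drops terms by AIC. The sketch below assumes that approach and uses the built-in `mtcars` data as a stand-in, so it is not necessarily the procedure that produced the formulas above.

```r
# Stand-in data: built-in mtcars, with mpg as the response (not the baseball data)
full_model <- lm(mpg ~ ., data = mtcars)   # every candidate predictor
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only

# Backward elimination: start from the full model and drop terms while AIC improves
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Forward selection: start from the intercept and add terms while AIC improves
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = formula(full_model)),
                    direction = "forward", trace = 0)

formula(backward_fit)
formula(forward_fit)
```

The two directions search the same candidate set but can stop at different models, which is why the report evaluates both.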

Verifying OLS Regression Assumptions

# Assumption: No Multicollinearity (VIF under 5)
vif(foward_model)

##   TEAM_BATTING_H  TEAM_BATTING_BB  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         1.586554         1.342971         1.119632         1.039797 
## TEAM_PITCHING_HR TEAM_PITCHING_SO 
##         1.654357         1.475093
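The VIF for a predictor comes from regressing that predictor on all the others: VIF = 1 / (1 - R^2) of that auxiliary regression. The sketch below computes it by hand in base R on the built-in `mtcars` data; the predictor set is an arbitrary stand-in, not the baseball variables.

```r
# Stand-in predictors from built-in mtcars (not the baseball data)
predictors <- c("wt", "hp", "disp")

# VIF for predictor p: regress it on the other predictors, then take 1 / (1 - R^2)
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

vif_by_hand
```

A value of 1 means the predictor is uncorrelated with the others; values well above the cutoff of 5 used here would signal problematic multicollinearity.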

# Assumption: Mean of residuals is zero
mean(residuals(foward_model))

## [1] -4.262598e-16

# Assumption: Homoscedasticity of residuals
plot(foward_model)

# Assumption: No auto-correlation
# acf() takes lag.max (not lags) to set the number of lags shown
acf(residuals(foward_model), lag.max = 20)
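As a numeric companion to the ACF plot, the Durbin-Watson statistic summarizes first-order autocorrelation in a single number: values near 2 suggest little autocorrelation. The sketch below computes it by hand on a stand-in model fitted to the built-in `mtcars` data; it is a supplementary check that is not part of the original analysis, and it is only meaningful when the rows have a natural order.

```r
# Stand-in model on built-in mtcars (not the baseball data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences over sum of squares
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation as computed by acf(); plot = FALSE returns the values only
r1 <- acf(e, lag.max = 1, plot = FALSE)$acf[2]

c(durbin_watson = dw, lag1_autocorrelation = r1)
```

The two numbers are linked by the approximation DW ≈ 2(1 - r1), so a lag-1 autocorrelation near zero corresponds to a statistic near 2.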

Model Selection

Before comparing models, we checked that individual predictors were significant at the 5% level (p-values below 0.05) and that each model's overall F-statistic was significant at the same level. Every predictor in the forward selection model meets this threshold; in the backward elimination model, TEAM_BASERUN_SB (p = 0.20) is the one exception.

The two primary statistics used to choose our final model were adjusted R-squared and root mean square error (RMSE). Like R-squared, adjusted R-squared measures the share of variation in the dependent variable explained by the independent variables, but it applies a penalty for each added predictor, so it rises only when a new variable improves the fit by more than chance alone would. RMSE was perhaps even more important: it is the standard deviation of the prediction errors, which makes it a measure of accuracy in the same units as the response variable (wins). To check that the model generalizes to unobserved data, we calculated RMSE on the held-out test set.
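The adjustment can be checked against the model summaries above using the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1). In the sketch below, the function name `adjusted_r2` is hypothetical, and n = 143 is inferred from the summaries (136 residual degrees of freedom plus 6 predictors plus the intercept).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values are copied from the two model summaries above
adjusted_r2(r2 = 0.5654, n = 143, p = 6)  # forward selection model
adjusted_r2(r2 = 0.5451, n = 143, p = 6)  # backward elimination model
```

Both results agree with the adjusted R-squared values that `summary()` reports, about 0.546 and 0.525.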

Our top two models, forward selection and backward elimination, had similar test-set RMSEs: 9.23 and 9.01, respectively, in the run shown above. Because the train/test split is random, these values vary somewhat between runs. With RMSE this close, we chose the forward selection model for its slightly higher adjusted R-squared (0.546 vs 0.525). Since both models include six predictors, parsimony was not a deciding factor.

Lastly, we verified that the forward selection model meets the OLS regression assumptions: no significant multicollinearity, residuals with a mean of zero, homoscedastic residuals, and no significant autocorrelation. We judged all of these assumptions to be met. We note, however, a slight trend in the residuals vs fitted plot (Assumption: Homoscedasticity of residuals), which may indicate a small nonlinear relationship that the model does not capture.

References

Bhandari, Aniruddha, “Key Difference between R-squared and Adjusted R-squared for Regression Analysis”, Analytics Vidhya, 2020, https://www.analyticsvidhya.com/blog/2020/07/difference-between-r-squared-and-adjusted-r-squared/

Glen, Stephanie, “RMSE: Root Mean Square Error”, StatisticsHowTo.com, https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/

Gupta, Aryansh, “Linear Regression Assumptions and Diagnostics in R”, RPubs, https://rpubs.com/aryn999/LinearRegressionAssumptionsAndDiagnosticsInR

Kim, Bommae, “Understanding Diagnostic Plots for Linear Regression Analysis”, University of Virginia Library, https://data.library.virginia.edu/diagnostic-plots/