Setup:

First, let’s read in the provided dataset.

Data Exploration:

## The dataset consists of 2276 observations of 17 variables.

The variables and their definitions can be seen below:

Variable           Definition
-----------------  --------------------------------------
INDEX              Identification variable
TARGET_WINS        Number of wins
TEAM_BATTING_H     Base hits by batters (1B, 2B, 3B, HR)
TEAM_BATTING_2B    Doubles by batters (2B)
TEAM_BATTING_3B    Triples by batters (3B)
TEAM_BATTING_HR    Homeruns by batters (4B)
TEAM_BATTING_BB    Walks by batters
TEAM_BATTING_HBP   Batters hit by pitch (get a free base)
TEAM_BATTING_SO    Strikeouts by batters
TEAM_BASERUN_SB    Stolen bases
TEAM_BASERUN_CS    Caught stealing
TEAM_FIELDING_E    Errors
TEAM_FIELDING_DP   Double plays
TEAM_PITCHING_BB   Walks allowed
TEAM_PITCHING_H    Hits allowed
TEAM_PITCHING_HR   Homeruns allowed
TEAM_PITCHING_SO   Strikeouts by pitchers

INDEX is an identifying feature and should not be included in the linear regression model.

Next, let’s print out some summary statistics. We’re primarily interested in the TARGET_WINS variable, so we’ll look at that first.

## The mean number of wins in a season is 80.79.
## The median number of wins in a season is 82.
## The standard deviation for number of wins in a season is 15.75.

Let’s also make a histogram of the TARGET_WINS variable. This should give us a sense of the distribution of wins for teams/seasons in our population.
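The histogram is produced with ggplot2 (the full code appears in the appendix):

ggplot(train, aes(x = TARGET_WINS)) +
  geom_histogram() +
  labs(title = "Distribution of Wins (Histogram)", x = "Number of Wins", y = "Count")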

Overall, the number of wins in a season for a given baseball team looks fairly normally distributed. Looking at a boxplot helps to highlight the outliers.

We could describe the average team’s season using the mean for all variables below:

Variable             Mean
-----------------  ------
TARGET_WINS          80.8
TEAM_BATTING_H     1469.3
TEAM_BATTING_2B     241.2
TEAM_BATTING_3B      55.2
TEAM_BATTING_HR      99.6
TEAM_BATTING_BB     501.6
TEAM_BATTING_SO     735.6
TEAM_BASERUN_SB     124.8
TEAM_BASERUN_CS      52.8
TEAM_BATTING_HBP     59.4
TEAM_PITCHING_H    1779.2
TEAM_PITCHING_HR    105.7
TEAM_PITCHING_BB    553.0
TEAM_PITCHING_SO    817.7
TEAM_FIELDING_E     246.5
TEAM_FIELDING_DP    146.4

Data Preparation:

Let’s take a closer look at all the summary statistics for these variables and identify any data completeness issues:

##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286

We can see quite a few NA values for TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP. Let’s take a look at the distributions of these variables to see how to impute these missing values.

TEAM_BASERUN_SB, TEAM_PITCHING_SO, and TEAM_BASERUN_CS are skewed to the right, so we impute their missing values using the median. TEAM_BATTING_HBP and TEAM_FIELDING_DP look approximately normal, so the mean is appropriate there; however, TEAM_BATTING_HBP has 2,085 NA values out of 2,276 observations, so we will leave this variable out of our model entirely. TEAM_BATTING_SO is bimodally distributed, so we use KNN imputation for it, since KNN does not rely on the shape of the distribution.
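Condensed from the appendix, the imputation step uses dplyr for the median/mean replacements and the VIM package for KNN:

library(tidyverse)
library(VIM)

# Median imputation for the right-skewed variables, mean for TEAM_FIELDING_DP;
# TEAM_BATTING_HBP is dropped because it is almost entirely missing
train_imputed <- train |>
    mutate(TEAM_BASERUN_SB  = replace(TEAM_BASERUN_SB,  is.na(TEAM_BASERUN_SB),
                                      median(TEAM_BASERUN_SB, na.rm = TRUE)),
           TEAM_BASERUN_CS  = replace(TEAM_BASERUN_CS,  is.na(TEAM_BASERUN_CS),
                                      median(TEAM_BASERUN_CS, na.rm = TRUE)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm = TRUE)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm = TRUE))) |>
    select(-TEAM_BATTING_HBP)

# Distribution-free KNN imputation for the bimodal TEAM_BATTING_SO
train_imputed <- train_imputed |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)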

Let’s look at raw correlations between our other included variables and a team’s win total for a season:

##                         [,1]
## TARGET_WINS       1.00000000
## TEAM_BATTING_H    0.38876752
## TEAM_BATTING_2B   0.28910365
## TEAM_BATTING_3B   0.14260841
## TEAM_BATTING_HR   0.17615320
## TEAM_BATTING_BB   0.23255986
## TEAM_BATTING_SO  -0.03606403
## TEAM_BASERUN_SB   0.12361087
## TEAM_BASERUN_CS   0.01595982
## TEAM_PITCHING_H  -0.10993705
## TEAM_PITCHING_HR  0.18901373
## TEAM_PITCHING_BB  0.12417454
## TEAM_PITCHING_SO -0.07579967
## TEAM_FIELDING_E  -0.17648476
## TEAM_FIELDING_DP -0.02884126

None of the independent variables has an especially high correlation with TARGET_WINS; TEAM_BATTING_H is the most highly correlated, at 0.39. TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are all positively correlated with TARGET_WINS, while TEAM_BATTING_SO, TEAM_PITCHING_H, TEAM_PITCHING_SO, TEAM_FIELDING_E, and TEAM_FIELDING_DP are negatively correlated.

Some of these correlations are surprising, as we would have expected TEAM_BASERUN_CS, TEAM_PITCHING_HR, and TEAM_PITCHING_BB to be negatively correlated with TARGET_WINS, and we would have expected TEAM_PITCHING_SO and TEAM_FIELDING_DP to be positively correlated with TARGET_WINS. We won’t exclude them from our models based solely on this surprise, however.
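For reference, the full correlation vector above is a single cor() call on the imputed training data (appendix):

# Correlation of each column with the response
cor(train_imputed, train_imputed$TARGET_WINS)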

Let’s review relationships between batting independent variables.

Most of the batting variables appear roughly normal, although some show right skew. Overall, a preliminary visual inspection shows no very strong correlations between these statistics. From the distributions, we can also see that some of these variables will need to be transformed toward normality before we use them in our linear model.

Let’s review relationships between other independent variables.

As with the batting statistics, there are no very strong correlations among the other independent variables, although more of these inputs show skewed distributions. Once again, some of these variables will need to be transformed.

Model Development:

First, let’s create a basic model with all untransformed variables:

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.260  -8.612   0.151   8.425  59.018 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.657e+01  5.234e+00   5.078 4.14e-07 ***
## TEAM_BATTING_H    4.708e-02  3.699e-03  12.729  < 2e-16 ***
## TEAM_BATTING_2B  -1.788e-02  9.206e-03  -1.942 0.052245 .  
## TEAM_BATTING_3B   6.137e-02  1.678e-02   3.657 0.000261 ***
## TEAM_BATTING_HR   5.752e-02  2.749e-02   2.093 0.036500 *  
## TEAM_BATTING_BB   1.085e-02  5.816e-03   1.865 0.062310 .  
## TEAM_BATTING_SO  -1.141e-02  2.579e-03  -4.427 1.00e-05 ***
## TEAM_BASERUN_SB   2.580e-02  4.317e-03   5.976 2.66e-09 ***
## TEAM_BASERUN_CS  -7.159e-03  1.577e-02  -0.454 0.649853    
## TEAM_PITCHING_H  -8.980e-04  3.673e-04  -2.445 0.014562 *  
## TEAM_PITCHING_HR  1.612e-02  2.431e-02   0.663 0.507243    
## TEAM_PITCHING_BB -2.408e-05  4.124e-03  -0.006 0.995341    
## TEAM_PITCHING_SO  3.201e-03  9.134e-04   3.505 0.000466 ***
## TEAM_FIELDING_E  -1.961e-02  2.448e-03  -8.011 1.80e-15 ***
## TEAM_FIELDING_DP -1.201e-01  1.293e-02  -9.290  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 2261 degrees of freedom
## Multiple R-squared:  0.3181, Adjusted R-squared:  0.3139 
## F-statistic: 75.34 on 14 and 2261 DF,  p-value: < 2.2e-16

We can see that the \(R^2\) value of a model that includes all the variables is not particularly high.

Multicollinearity Issues

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.822624         2.480705         2.936429        37.007740 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         6.801581         5.336220         1.816473         1.167416 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##         3.567853        29.669252         6.288097         3.257794 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         4.155586         1.343048

Despite the simplicity of including every provided variable, several predictors show signs of multicollinearity, which undermines the reliability of the model's variance estimates and coefficients. TEAM_BATTING_HR and TEAM_PITCHING_HR (VIFs of 37.0 and 29.7) are the most collinear with other variables, which is interesting given that the correlation plots did not clearly flag this in our initial spot check.

We can remove some variables that are not significant using backward stepwise elimination.
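Condensed from the appendix, the elimination is a single call to step(), which drops terms by AIC:

# Backward stepwise elimination by AIC, starting from the model with all variables
lm_all_reduced <- step(lm_all, direction = "backward", trace = 0)
summary(lm_all_reduced)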

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.201  -8.548   0.137   8.404  59.080 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      25.5925419  5.0907238   5.027 5.36e-07 ***
## TEAM_BATTING_H    0.0473024  0.0036728  12.879  < 2e-16 ***
## TEAM_BATTING_2B  -0.0182083  0.0091927  -1.981 0.047742 *  
## TEAM_BATTING_3B   0.0633643  0.0165996   3.817 0.000139 ***
## TEAM_BATTING_HR   0.0752404  0.0098361   7.649 2.97e-14 ***
## TEAM_BATTING_BB   0.0109356  0.0033639   3.251 0.001167 ** 
## TEAM_BATTING_SO  -0.0114146  0.0024962  -4.573 5.07e-06 ***
## TEAM_BASERUN_SB   0.0254110  0.0041873   6.069 1.51e-09 ***
## TEAM_PITCHING_H  -0.0008562  0.0003209  -2.669 0.007672 ** 
## TEAM_PITCHING_SO  0.0032329  0.0006703   4.823 1.51e-06 ***
## TEAM_FIELDING_E  -0.0192393  0.0023792  -8.086 9.91e-16 ***
## TEAM_FIELDING_DP -0.1201245  0.0129038  -9.309  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2264 degrees of freedom
## Multiple R-squared:  0.3179, Adjusted R-squared:  0.3145 
## F-statistic: 95.91 on 11 and 2264 DF,  p-value: < 2.2e-16

The \(R^2\) for this model is not much improved. Most of the batting coefficients are positively associated with target wins, which makes sense: more hits and stolen bases should correspond to more runs and ultimately translate into wins. TEAM_BATTING_2B has a negative coefficient, which is unexpected, since it is unclear what would differentiate doubles from other hits in their relationship to the dependent variable; its individual t-test also indicates it is only marginally significant. TEAM_BATTING_SO was expected to have a negative relationship with wins, and its negative coefficient aligns with that expectation. The pitching predictors do not have very strong coefficients, but they are significant in the model, and their signs align with expectations: allowing hits works against winning, while striking out opposing batters helps. Lastly, the remaining fielding variables (TEAM_FIELDING_E and TEAM_FIELDING_DP) appear consistent with expectations. Double plays may have one of the strongest impacts given the size of their coefficient, although this predictor shows more sample variability than the others and had more NA values.

Let’s check for multicollinearity between variables.

Reviewing the variance inflation factors:

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         3.772304         2.475869         2.876917         4.744128 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H 
##         2.277645         5.005090         1.710653         2.725441 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         1.755773         3.928217         1.339341

The variance inflation factor for TEAM_BATTING_SO is greater than 5. We can remove this predictor.
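The VIFs above come from car::vif(), and the drop is done with update() (appendix):

library(car)

vif(lm_all_reduced)                                   # flag predictors with VIF > 5
lm_all_reduced <- update(lm_all_reduced, . ~ . - TEAM_BATTING_SO)
summary(lm_all_reduced)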

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.354  -8.637   0.046   8.422  56.185 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.4516005  3.8837691   2.691  0.00717 ** 
## TEAM_BATTING_H    0.0543248  0.0033510  16.212  < 2e-16 ***
## TEAM_BATTING_2B  -0.0275776  0.0090008  -3.064  0.00221 ** 
## TEAM_BATTING_3B   0.0742212  0.0165010   4.498 7.21e-06 ***
## TEAM_BATTING_HR   0.0472798  0.0077385   6.110 1.17e-09 ***
## TEAM_BATTING_BB   0.0134412  0.0033335   4.032 5.71e-05 ***
## TEAM_BASERUN_SB   0.0204726  0.0040634   5.038 5.07e-07 ***
## TEAM_PITCHING_H  -0.0005109  0.0003132  -1.631  0.10300    
## TEAM_PITCHING_SO  0.0020318  0.0006194   3.281  0.00105 ** 
## TEAM_FIELDING_E  -0.0186422  0.0023861  -7.813 8.48e-15 ***
## TEAM_FIELDING_DP -0.1165458  0.0129366  -9.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 2265 degrees of freedom
## Multiple R-squared:  0.3116, Adjusted R-squared:  0.3085 
## F-statistic: 102.5 on 10 and 2265 DF,  p-value: < 2.2e-16

The coefficients were not impacted substantially by this modification. The one caveat is that TEAM_BATTING_2B remains negatively associated with target wins, although its significance in the model improved markedly once the collinear variable was excluded; it is not intuitive why TEAM_BATTING_SO would affect TEAM_BATTING_2B, since their correlation was only 0.185. TEAM_PITCHING_SO also became slightly less significant to the model.

Let’s remove TEAM_PITCHING_H as it is no longer significant.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.785  -8.584   0.002   8.446  57.585 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.1798845  3.7378157   3.259  0.00114 ** 
## TEAM_BATTING_H    0.0528676  0.0032309  16.363  < 2e-16 ***
## TEAM_BATTING_2B  -0.0272628  0.0090020  -3.029  0.00249 ** 
## TEAM_BATTING_3B   0.0787841  0.0162681   4.843 1.37e-06 ***
## TEAM_BATTING_HR   0.0481046  0.0077248   6.227 5.64e-10 ***
## TEAM_BATTING_BB   0.0133595  0.0033344   4.007 6.36e-05 ***
## TEAM_BASERUN_SB   0.0216382  0.0040015   5.408 7.06e-08 ***
## TEAM_PITCHING_SO  0.0016132  0.0005639   2.861  0.00426 ** 
## TEAM_FIELDING_E  -0.0208625  0.0019604 -10.642  < 2e-16 ***
## TEAM_FIELDING_DP -0.1173637  0.0129316  -9.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 2266 degrees of freedom
## Multiple R-squared:  0.3108, Adjusted R-squared:  0.308 
## F-statistic: 113.5 on 9 and 2266 DF,  p-value: < 2.2e-16
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.891551         2.351793         2.737074         2.898411 
##  TEAM_BATTING_BB  TEAM_BASERUN_SB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         2.216711         1.547476         1.230982         2.641838 
## TEAM_FIELDING_DP 
##         1.332410

Based on the definitions of TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, and TEAM_BATTING_HR, there is probably some multicollinearity going on with these variables. Let’s compare a model that uses just the total hits against a model using each individual type of hit.
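Both specifications are built by dropping terms from the reduced model with update() (appendix):

# Total hits only: drop the individual hit types
lm_all_reduced_hits <- update(lm_all_reduced, . ~ . - TEAM_BATTING_2B -
                                  TEAM_BATTING_3B - TEAM_BATTING_HR)
summary(lm_all_reduced_hits)

# Individual hit types only: drop total hits
lm_all_reduced_others <- update(lm_all_reduced, . ~ . - TEAM_BATTING_H)
summary(lm_all_reduced_others)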

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.823  -8.638   0.156   8.473  52.443 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.7032191  3.5315442   2.748  0.00605 ** 
## TEAM_BATTING_H    0.0542548  0.0021078  25.740  < 2e-16 ***
## TEAM_BATTING_BB   0.0171889  0.0032588   5.275 1.46e-07 ***
## TEAM_BASERUN_SB   0.0237070  0.0037196   6.374 2.23e-10 ***
## TEAM_PITCHING_SO  0.0015116  0.0005316   2.844  0.00450 ** 
## TEAM_FIELDING_E  -0.0210881  0.0018286 -11.532  < 2e-16 ***
## TEAM_FIELDING_DP -0.1107454  0.0128421  -8.624  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.24 on 2269 degrees of freedom
## Multiple R-squared:  0.2955, Adjusted R-squared:  0.2936 
## F-statistic: 158.6 on 6 and 2269 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.232  -8.904   0.069   8.910  65.787 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      55.5049520  2.7892547  19.900  < 2e-16 ***
## TEAM_BATTING_2B   0.0694280  0.0071795   9.670  < 2e-16 ***
## TEAM_BATTING_3B   0.1953241  0.0154629  12.632  < 2e-16 ***
## TEAM_BATTING_HR   0.0748393  0.0079819   9.376  < 2e-16 ***
## TEAM_BATTING_BB   0.0103989  0.0035199   2.954  0.00317 ** 
## TEAM_BASERUN_SB   0.0221969  0.0042302   5.247 1.69e-07 ***
## TEAM_PITCHING_SO -0.0012958  0.0005657  -2.291  0.02208 *  
## TEAM_FIELDING_E  -0.0110738  0.0019737  -5.611 2.26e-08 ***
## TEAM_FIELDING_DP -0.0947871  0.0135932  -6.973 4.05e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.85 on 2267 degrees of freedom
## Multiple R-squared:  0.2293, Adjusted R-squared:  0.2266 
## F-statistic: 84.31 on 8 and 2267 DF,  p-value: < 2.2e-16

Comparing Partial F-Tests/ANOVA of reduced models

## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2267 435053                                  
## 2   2266 389079  1     45974 267.75 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The partial F-test indicates a statistically significant difference between the reduced model and the revision that excludes total hits (TEAM_BATTING_H): dropping that predictor significantly worsens the fit.

## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2269 397696                                  
## 2   2266 389079  3    8617.2 16.729 9.486e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similarly, this partial F-test indicates that the individual hit-type variables also add significant explanatory power, so reducing the model to the total hits variable alone is not statistically justified, despite the potential reduction in collinearity.
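Both tests above come from anova(), with the nested model passed first (appendix):

# Partial F-tests comparing each nested model against the full reduced model
anova(lm_all_reduced_others, lm_all_reduced)
anova(lm_all_reduced_hits, lm_all_reduced)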

The model using total hits (TEAM_BATTING_H) has the higher \(R^2\) of the two, so it accounts for more variability. Let’s use this variable in our model.

We can make some plots to help test our assumptions of our basic model using the plot function on our model variable:
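As in the appendix, the diagnostics are the four standard plots produced by calling plot() on the fitted lm object:

par(mfrow = c(2, 2))   # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
plot(lm_all_reduced_hits)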

The Q-Q plot shows that the residuals of this model are fairly normally distributed. The residuals vs. fitted plot shows a cluster of residuals and one apparent outlier. There is no general pattern here, and the cluster of points suggests that homoscedasticity is satisfied for this model.

Let’s try transforming some of our variables to come up with a more accurate model.

TEAM_PITCHING_SO is a right-skewed variable with very large outliers. Let’s compare how four common transformations (log, fourth root, cube root, and square root) would normalize the distribution of this variable (after adding a small constant, since the variable contains legitimate values of 0).

The square root transformation appears to normalize the data best. Let’s confirm that the ideal lambda proposed by the boxcox function from the MASS library is close to the square root transformation’s lambda (0.5), which we assume will work best for this data.

## [1] 0.45

The proposed lambda of 0.45 is in fact very close to 0.5, so we will go with the easier-to-interpret square root transformation. We will follow a similar process to find reasonable transformations for several other variables in our model, without repeating the details each time.
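The lambda is estimated with MASS::boxcox() on an intercept-only model, after the small constant has been added so all values are positive (appendix):

library(MASS)

# Grid-search the Box-Cox log-likelihood over lambda in [-2, 2]
bc <- boxcox(lm(train_imputed_transformed$TEAM_PITCHING_SO ~ 1),
             lambda = seq(-2, 2, length.out = 81),
             plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # 0.45 for TEAM_PITCHING_SO
lambda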

Variable           Lambda   Transformation
-----------------  -------  ---------------
TEAM_BASERUN_SB    0.2      log
TEAM_BASERUN_CS    0.3      fourth root
TEAM_PITCHING_SO   0.45     square root
TEAM_BATTING_3B    0.3      log
TEAM_BATTING_BB    1.75     square
TEAM_PITCHING_H    -2       square inverse
TEAM_PITCHING_BB   0.35     cube root
TEAM_FIELDING_E    -0.95    inverse

Adjusting the proposed lambdas to commonly understood transformations yields mixed results in normalizing the distributions. Let’s use the same variables from our final untransformed model above to see if we can improve the model using transformations.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     I(TEAM_BATTING_BB^2) + log(TEAM_BASERUN_SB + 1e-04) + I(TEAM_PITCHING_SO^0.5) + 
##     I(TEAM_FIELDING_E^-1) + TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.340  -8.557   0.113   8.603  55.118 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -3.404e+00  4.958e+00  -0.686   0.4925    
## TEAM_BATTING_H                4.868e-02  2.173e-03  22.405  < 2e-16 ***
## TEAM_BATTING_BB               5.538e-03  1.068e-02   0.518   0.6042    
## I(TEAM_BATTING_BB^2)          2.482e-05  1.131e-05   2.195   0.0283 *  
## log(TEAM_BASERUN_SB + 1e-04)  3.037e+00  4.263e-01   7.125 1.39e-12 ***
## I(TEAM_PITCHING_SO^0.5)      -1.991e-02  5.392e-02  -0.369   0.7119    
## I(TEAM_FIELDING_E^-1)         1.411e+03  1.427e+02   9.885  < 2e-16 ***
## TEAM_FIELDING_DP             -1.270e-01  1.313e-02  -9.672  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.29 on 2268 degrees of freedom
## Multiple R-squared:  0.2905, Adjusted R-squared:  0.2883 
## F-statistic: 132.7 on 7 and 2268 DF,  p-value: < 2.2e-16

Note: There are instances of TEAM_BASERUN_SB where the value is zero, and the log transformation is undefined there (log(0) returns -Inf in R, which breaks the model fit). To account for this, we increment by a small number (0.0001) so that the log transformation can be applied.

The transformed TEAM_PITCHING_SO is no longer significant, so let’s remove it.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     I(TEAM_BATTING_BB^2) + log(TEAM_BASERUN_SB + 1e-04) + I(TEAM_FIELDING_E^-1) + 
##     TEAM_FIELDING_DP, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.611  -8.620   0.107   8.597  53.441 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -4.223e+00  4.433e+00  -0.953   0.3409    
## TEAM_BATTING_H                4.893e-02  2.059e-03  23.771  < 2e-16 ***
## TEAM_BATTING_BB               5.693e-03  1.067e-02   0.533   0.5938    
## I(TEAM_BATTING_BB^2)          2.473e-05  1.131e-05   2.188   0.0288 *  
## log(TEAM_BASERUN_SB + 1e-04)  3.019e+00  4.233e-01   7.132 1.33e-12 ***
## I(TEAM_FIELDING_E^-1)         1.394e+03  1.354e+02  10.297  < 2e-16 ***
## TEAM_FIELDING_DP             -1.269e-01  1.312e-02  -9.668  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.29 on 2269 degrees of freedom
## Multiple R-squared:  0.2905, Adjusted R-squared:  0.2886 
## F-statistic: 154.8 on 6 and 2269 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) is slightly lower for this model than for the untransformed one. The coefficients and relationships for the predictor variables become a little less intuitive due to the transformations. It does make sense that inverted fielding errors show a positive relationship with target wins, since the inverse behaves like a rate rather than a count. The other transformed variables do not appear to behave much differently. Let’s take a look at the diagnostic plots for this transformed model.

Once again, the Q-Q plot shows that the residuals are fairly normally distributed. From the plot of Cook’s distance, it seems there are fewer possible leverage points. The residuals vs. fitted plot also seems to indicate that homoscedasticity is satisfied.

Now we can build a model using inputs chosen from our knowledge of baseball.

We chose these variables based on our understanding that good teams generally get on base more frequently (positive predictors TEAM_BATTING_H and TEAM_BATTING_BB) while allowing fewer runners on base (negative predictors TEAM_PITCHING_H and TEAM_PITCHING_BB).

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_BB, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.133  -8.860   0.379   9.373  52.416 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.3518000  3.2552864  -0.108 0.913949    
## TEAM_BATTING_H    0.0497667  0.0021032  23.663  < 2e-16 ***
## TEAM_BATTING_BB   0.0148499  0.0039923   3.720 0.000204 ***
## TEAM_PITCHING_H  -0.0025469  0.0003317  -7.679 2.36e-14 ***
## TEAM_PITCHING_BB  0.0092317  0.0027681   3.335 0.000867 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.73 on 2271 degrees of freedom
## Multiple R-squared:  0.2416, Adjusted R-squared:  0.2403 
## F-statistic: 180.9 on 4 and 2271 DF,  p-value: < 2.2e-16

Reviewing the model output, all of these factors are statistically significant, and some of the intuition around successful baseball performance does translate into the statistics. The coefficients are all positively associated with target wins except for TEAM_PITCHING_H, which makes sense given that giving up more hits typically means the other team is scoring. What is slightly surprising is that TEAM_PITCHING_BB has a positive, highly significant coefficient, as in theory more opposing base runners, even via walks, should help the other team win.

It’s interesting to note that with these selected variables (walks and hits gained/allowed per team), our adjusted \(R^2\) actually went down, indicating that the more selective walks/hits model explains less of the variability in TARGET_WINS than the model including all variables.

Looking at our residual plot above, there seems to be a clustering of residuals along the x-axis at \(X \approx 80\). This shows a pattern in our residuals.

Let’s plot our response variable (Total Wins) versus each of our predictor variables to get a sense of linear relationships.

Overall, we’re seeing some loosely linear relationships between our input variables and wins. For example, offensive hits (TEAM_BATTING_H) have a plausibly linear relationship to wins, whereas hits allowed (TEAM_PITCHING_H) shows a less clear linear relationship.

One alternative theory, using some of the same logic as before, is to maximize run-producing offense (TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR), since extra-base hits advance batters further and make it more likely that any runners score. Another way to supercharge the offense is to advance runners once they are on base (TEAM_BASERUN_SB), which in theory should produce more runs. On defense, the key areas we will focus on are limiting mistakes (TEAM_FIELDING_E) and, when batters do get on base, cleaning up the mess efficiently (TEAM_FIELDING_DP). On paper, a lean toward offense paired with crisp, error-free defense is a recipe for success.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BASERUN_SB + TEAM_FIELDING_DP + 
##     TEAM_FIELDING_E, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.143  -8.756  -0.050   8.494  65.205 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      20.391917   3.301174   6.177 7.71e-10 ***
## TEAM_BATTING_H    0.049586   0.003078  16.108  < 2e-16 ***
## TEAM_BATTING_2B  -0.019814   0.008771  -2.259    0.024 *  
## TEAM_BATTING_3B   0.082551   0.016245   5.082 4.04e-07 ***
## TEAM_BATTING_HR   0.057850   0.007480   7.734 1.56e-14 ***
## TEAM_BASERUN_SB   0.027281   0.003804   7.171 1.00e-12 ***
## TEAM_FIELDING_DP -0.105804   0.012612  -8.389  < 2e-16 ***
## TEAM_FIELDING_E  -0.023849   0.001633 -14.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.16 on 2268 degrees of freedom
## Multiple R-squared:  0.3039, Adjusted R-squared:  0.3017 
## F-statistic: 141.4 on 7 and 2268 DF,  p-value: < 2.2e-16

Because of the number of null values in the fielding predictors, one modification here is that we ran the regression on the imputed data set to limit the records excluded from the model. Overall, this is a competitive model with a higher \(R^2\) than some of the other variants, and it could potentially perform close to the backward-selection model. TEAM_BATTING_2B again has a counterintuitive negative coefficient, and its distribution and weaker significance might justify excluding it; however, we will keep it in the model since it appears to carry predictive value.

Let’s confirm there aren’t any collinearity issues with this model variant:

##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.601543         2.212394         2.704599         2.693294 
##  TEAM_BASERUN_SB TEAM_FIELDING_DP  TEAM_FIELDING_E 
##         1.386145         1.255963         1.817196

There is some collinearity, which is to be expected given the data available to us, but nothing problematic enough to require major changes.

Let’s confirm that the assumptions of OLS are not violated:

The diagnostic plots do not seem to indicate a violation of the assumptions: although the fitted values are mostly concentrated around 80, neither the raw residuals nor the standardized ones show much heteroskedasticity across that range. There are some leverage points (observation 1342 in particular) that could skew the fit, although some teams simply have very strong seasons that are neither invalid nor anomalous enough to warrant exclusion.

Model Evaluation/Selection:

First we read in our evaluation data.

Now we can make predictions on the test holdout data and compare results across our models before selecting the best model to use on the evaluation data. First, we compare the distributions of the test data to confirm that the imputation methods used to fill missing values in the train data are also appropriate here.

The test data distributions are similar to the distributions observed in the train data for these variables, so the same imputation methods can be used for each of them.

We can use the root mean squared error (RMSE), from the modelr package, to analyze our models from above. This is one way to measure the performance of a model: in simple terms, a smaller RMSE indicates better performance when predicting on new data.
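For \(n\) held-out observations with actual win totals \(y_i\) and model predictions \(\hat{y}_i\):

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]

Each value below is computed with a call like modelr::rmse(lm_all, test) (see the appendix).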

## RMSE lm_all: 13.1356930852633
## RMSE lm_reduced: 13.1677751937142
## RMSE lm_select: 13.7738992868948
## RMSE lm_off: 13.1707703029008
## RMSE lm_trans_reduced: 13.4054141239418

Lastly, we can generate predictions on our evaluation data. The model with the lowest RMSE is the all-variable model without transformations, but given its collinearity concerns it probably isn’t fair to use that version for prediction, so we will use the reduced model, along with the offensive/mistake-free-defense model, for evaluation. Since we don’t have TARGET_WINS in our evaluation data, we won’t be able to measure model performance against actual win totals.

However, we can look at the distribution of predicted wins to make sure our model predicts reasonable values. Knowing what we know about baseball, average teams tend to win \(\approx 80\) games in a season (out of 162 total regular season games).

These predicted win totals look roughly normal and are centered around 80 wins, as expected.

Conclusions:

Overall, we saw comparable RMSE values between our models, indicating similar performance on the initial evaluation dataset using this metric. If we had to get our manager’s approval for these decisions, we would likely select the reduced model, since it evaluated the possible variable combinations and used AIC as a statistically grounded basis for its decision making. That might not be convincing enough for some managers, but we would explain that this combination of predictors minimized our loss function. If our manager wanted us to prioritize judgment and baseball intuition over the statistics to more closely justify the predictors in the model, we would use the offensive model for any future forecasting.

The RMSE for each model is listed in the Model Evaluation/Selection section above.

Appendix: Report Code

Below is the code for this report to generate the models and charts above.

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(glue)
library(tidyverse)
library(car)
library(ResourceSelection)
library(VIM)
library(pracma)
library(MASS)
select <- dplyr::select  # MASS masks dplyr::select; restore the dplyr version
library(knitr)
library(modelr)
df <- read.csv("https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/moneyball-training-data.csv")
dims <- dim(df)

print(glue("The dataset consists of {dims[1]} observations of {dims[2]} variables."))
set.seed(42)

# Hold out part of the input data for model evaluation later
# (sample_frac(0.7) keeps 70% of rows, i.e., a 30% holdout)
train <- subset(df[sample(1:nrow(df)), ], select=-c(TEAM_BATTING_HBP)) %>%
          sample_frac(0.7)

test  <- dplyr::anti_join(df, train, by = 'INDEX')
test <- subset(test, select=-c(INDEX))

# The models below are fit on the full dataset (minus INDEX); the holdout
# above is used only for the RMSE comparison during model evaluation
train <- subset(df, select=-c(INDEX))
mean_wins <- mean(train$TARGET_WINS)
median_wins <- median(train$TARGET_WINS)
sd_wins <- sd(train$TARGET_WINS)

# Print summary stats
print(glue("The mean number of wins in a season is {round(mean_wins,2)}."))
print(glue("The median number of wins in a season is {median_wins}."))
print(glue("The standard deviation for number of wins in a season is {round(sd_wins,2)}."))
ggplot(train, aes(x=TARGET_WINS)) + 
  geom_histogram() +
  labs(title = "Distribution of Wins (Histogram)", x = "Number of Wins", y = "Count")
ggplot(train, aes(x=TARGET_WINS)) + 
  geom_boxplot(fill="darkgrey") +
  labs(title = "Distribution of Wins (Boxplot)", x = "Number of Wins", y = "Count")
cMeans <- as.data.frame(round(colMeans(train, na.rm = TRUE), 1))
colnames(cMeans) <- NULL
kable(cMeans, format = "simple")

summary(train)
par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))

variables <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP", "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")

for (i in 1:(length(variables))) {
    hist(train[[variables[i]]], main = variables[i], col = "lightblue")
}
train_imputed <- train |>
    mutate(TEAM_BASERUN_SB = replace(TEAM_BASERUN_SB, is.na(TEAM_BASERUN_SB),
                                     median(TEAM_BASERUN_SB, na.rm=T)),
           TEAM_BASERUN_CS = replace(TEAM_BASERUN_CS, is.na(TEAM_BASERUN_CS),
                                     median(TEAM_BASERUN_CS, na.rm=T)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm=T)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm=T))) |>
    select(-TEAM_BATTING_HBP)
    

train_imputed <- train_imputed |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)

# Correlation of every training column with the response
cor(train_imputed, train_imputed$TARGET_WINS)
train_cleaned <- train_imputed |> rename_all(~stringr::str_replace(.,"^TEAM_",""))
subset_batting <- train_cleaned |> select(contains('batting'))
kdepairs(subset_batting)
subset_pitching <- train_cleaned |> select(!contains('batting'), -TARGET_WINS)
kdepairs(subset_pitching)
lm_all <- lm(TARGET_WINS~., train_imputed)
summary(lm_all)
vif(lm_all)
lm_all_reduced <- step(lm_all, direction="backward", trace = 0)
summary(lm_all_reduced)
vif(lm_all_reduced)
lm_all_reduced <- update(lm_all_reduced, .~. - TEAM_BATTING_SO)
summary(lm_all_reduced)
lm_all_reduced <- update(lm_all_reduced, .~. - TEAM_PITCHING_H)
summary(lm_all_reduced)
vif(lm_all_reduced)
lm_all_reduced_hits <- update(lm_all_reduced, .~. - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
summary(lm_all_reduced_hits)
lm_all_reduced_others <- update(lm_all_reduced, .~. - TEAM_BATTING_H)
summary(lm_all_reduced_others)
anova(lm_all_reduced_others,lm_all_reduced)
anova(lm_all_reduced_hits,lm_all_reduced)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_all_reduced_hits)
train_imputed_transformed <- train_imputed
#Add a small constant to TEAM_PITCHING_SO so there are no 0 values.
train_imputed_transformed$TEAM_PITCHING_SO <- train_imputed_transformed$TEAM_PITCHING_SO + 0.001
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
#Compare how easy to understand transformations alter the distribution
hist(log(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Log Transformation", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_SO, 4),
     main = "Fourth Root Transformation", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_SO, 3),
     main = "Cube Root Transformation", col="lightblue")
hist(sqrt(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Square Root Transformation", col="lightblue")
bc <- boxcox(lm(train_imputed_transformed$TEAM_PITCHING_SO ~ 1),
             lambda = seq(-2, 2, length.out = 81),
             plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambda
variables <- c("TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_PITCHING_SO",
               "TEAM_BATTING_3B", "TEAM_BATTING_BB", "TEAM_PITCHING_H",
               "TEAM_PITCHING_BB", "TEAM_FIELDING_E")
for (i in 1:(length(variables))){
    #Add a small constant to columns with any 0 values
    if (sum(train_imputed_transformed[[variables[i]]] == 0) > 0){
        train_imputed_transformed[[variables[i]]] <-
            train_imputed_transformed[[variables[i]]] + 0.001
    }
}
lambdas <- c()
for (i in 1:(length(variables))){
    bc <- boxcox(lm(train_imputed_transformed[[variables[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(variables, lambdas))
adj <- c("log", "fourth root", "square root", "log", "square", "square inverse", "cube root", "inverse")
lambdas <- cbind(lambdas, adj)
kable(lambdas, format = "simple")
par(mfrow=c(3, 3))
par(mai=c(.3,.3,.3,.3))
#Compare how easy to understand transformations alter the distribution
hist(log(train_imputed_transformed$TEAM_BASERUN_SB),
     main = "Log(TEAM_BASERUN_SB)", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_BASERUN_CS, 4),
     main = "Fourth Root(TEAM_BASERUN_CS)", col="lightblue")
hist(sqrt(train_imputed_transformed$TEAM_PITCHING_SO),
     main = "Square Root(TEAM_PITCHING_SO)", col="lightblue")
hist(log(train_imputed_transformed$TEAM_BATTING_3B),
     main = "Log(TEAM_BATTING_3B)", col="lightblue")
hist(train_imputed_transformed$TEAM_BATTING_BB^2,
     main = "TEAM_BATTING_BB SQUARED", col="lightblue")
hist(train_imputed_transformed$TEAM_PITCHING_H^-2,
     main = "TEAM_PITCHING_H INVERSE SQUARED", col="lightblue")
hist(nthroot(train_imputed_transformed$TEAM_PITCHING_BB, 3),
     main = "Cube Root(TEAM_PITCHING_BB)", col="lightblue")
hist(train_imputed_transformed$TEAM_FIELDING_E^-1,
     main = "TEAM_FIELDING_E INVERSE", col="lightblue")

lm_trans <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + I(TEAM_BATTING_BB**2) + log(TEAM_BASERUN_SB + .0001) + I(TEAM_PITCHING_SO**.5) + I(TEAM_FIELDING_E**-1) + TEAM_FIELDING_DP, train_imputed)
summary(lm_trans)
lm_trans_reduced <- update(lm_trans, .~. - I(TEAM_PITCHING_SO**.5), train_imputed)
summary(lm_trans_reduced)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_trans_reduced)
# Create model with select inputs (walks and hits allowed/gained)
lm_select <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_PITCHING_H + TEAM_PITCHING_BB, train)

summary(lm_select)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_select)
# Plot selective model residuals
plot(lm_select$residuals)
# par(mfrow=c(2,2))
# par(mai=c(.3,.3,.3,.3))
# plot(TARGET_WINS ~ TEAM_BATTING_H +
#                    TEAM_BATTING_BB +
#                    TEAM_PITCHING_H +
#                    TEAM_PITCHING_BB,
#                    data=train)
ggplot(train, aes(x=TEAM_BATTING_H, y=TARGET_WINS)) + geom_point() + labs(x="Hits", y="Wins")
ggplot(train, aes(x=TEAM_BATTING_BB, y=TARGET_WINS)) + geom_point() + labs(x="Walks", y="Wins")
ggplot(train, aes(x=TEAM_PITCHING_H, y=TARGET_WINS)) + geom_point() + labs(x="Hits Allowed", y="Wins")
ggplot(train, aes(x=TEAM_PITCHING_BB, y=TARGET_WINS)) + geom_point()  + labs(x="Walks Allowed", y="Wins")


lm_off <- lm(TARGET_WINS ~ TEAM_BATTING_H+TEAM_BATTING_2B+TEAM_BATTING_3B+TEAM_BATTING_HR +TEAM_BASERUN_SB +TEAM_FIELDING_DP+TEAM_FIELDING_E, train_imputed)

summary(lm_off)
vif(lm_off)
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(lm_off)
eval_data_url <- "https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/moneyball-evaluation-data.csv"

eval <- read.csv(eval_data_url)
par(mfrow=c(2,3))
par(mai=c(.3,.3,.3,.3))
variables <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", 
               "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
for (i in 1:(length(variables))) {
    hist(test[[variables[i]]], main = variables[i], col = "lightblue")
}
test <- test |>
    select(-TEAM_BATTING_HBP) |>
    mutate(TEAM_BASERUN_SB = replace(TEAM_BASERUN_SB, is.na(TEAM_BASERUN_SB),
                                     median(TEAM_BASERUN_SB, na.rm=T)),
           TEAM_BASERUN_CS = replace(TEAM_BASERUN_CS, is.na(TEAM_BASERUN_CS),
                                     median(TEAM_BASERUN_CS, na.rm=T)),
           TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO, is.na(TEAM_PITCHING_SO),
                                      median(TEAM_PITCHING_SO, na.rm=T)),
           TEAM_FIELDING_DP = replace(TEAM_FIELDING_DP, is.na(TEAM_FIELDING_DP),
                                      mean(TEAM_FIELDING_DP, na.rm=T)))

test <- test |>
    VIM::kNN(variable = "TEAM_BATTING_SO", k = 15, numFun = weighted.mean,
             weightDist = TRUE) |>
    select(-TEAM_BATTING_SO_imp)

# Predict using the model using all input variables
predict_all <- predict(lm_all, test)
predict_reduced <- predict(lm_all_reduced, test)
predict_select <- predict(lm_select, test)
predict_off <- predict(lm_off,test)
predict_trans_reduced <- predict(lm_trans_reduced, test)

# Calculate RMSE and print to screen
rmse_all <- modelr::rmse(lm_all, test)
rmse_reduced <- modelr::rmse(lm_all_reduced, test)
rmse_select <- modelr::rmse(lm_select, test)
rmse_off <- modelr::rmse(lm_off, test)
rmse_trans_reduced <- modelr::rmse(lm_trans_reduced, test)

print(glue('RMSE lm_all: {rmse_all}'))
print(glue('RMSE lm_reduced: {rmse_reduced}'))
print(glue('RMSE lm_select: {rmse_select}'))
print(glue('RMSE lm_off: {rmse_off}'))
print(glue('RMSE lm_trans_reduced: {rmse_trans_reduced}'))
# Predict and plot on evaluation data (no wins listed)
prediction_reduced <- predict(lm_all_reduced, eval)
predict_reduced_eval <- as.data.frame(prediction_reduced)

# Plot reduced model evaluation
ggplot(predict_reduced_eval,
       aes(x=prediction_reduced)) +
       geom_histogram(bins=15) +
       labs(x="Wins", 
            title="Predicted Wins (Reduced Model): Evaluation Data.")
predict_off_eval <- as.data.frame(predict(lm_off, eval))
colnames(predict_off_eval) <- "prediction_off"

ggplot(predict_off_eval,
       aes(x = prediction_off)) +
       geom_histogram(bins = 15) +
       labs(x = "Wins",
            title = "Predicted Wins (Offensive Imputed Model): Evaluation Data.")