Introduction

We will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from a year between 1871 and 2006 inclusive, and contains the team's performance statistics for that season, with all statistics adjusted to the length of a 162-game season.

The model will be a multiple linear regression fit to the training data to predict each team's number of wins.


Subsections:
- Loading Data
- Description of Variables


Loading Data

Here we load the data from the CSV files we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing it.
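
For reference, the loading step as written in the Code Appendix:

# Read the training and evaluation sets, then preview the training data
train_1 <- read_csv("moneyball-training-data.csv", show_col_types = FALSE)
eval_1 <- read_csv("moneyball-evaluation-data.csv", show_col_types = FALSE)
glimpse(train_1)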

We can see that the expected number of records has been loaded: 2,276.

## Rows: 2,276
## Columns: 17
## $ INDEX            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
## $ TARGET_WINS      <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H   <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B  <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B  <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR  <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB  <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO  <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB  <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS  <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H  <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E  <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…


Description of Variables

Below is a short description of the variables of interest in the data set:

Variable            Definition
INDEX               Identification variable
TARGET_WINS         Number of wins
TEAM_BATTING_H      Base hits by batters (1B, 2B, 3B, HR)
TEAM_BATTING_2B     Doubles by batters (2B)
TEAM_BATTING_3B     Triples by batters (3B)
TEAM_BATTING_HR     Home runs by batters (4B)
TEAM_BATTING_BB     Walks by batters
TEAM_BATTING_HBP    Batters hit by pitch (free base)
TEAM_BATTING_SO     Strikeouts by batters
TEAM_BASERUN_SB     Stolen bases
TEAM_BASERUN_CS     Caught stealing
TEAM_FIELDING_E     Errors
TEAM_FIELDING_DP    Double plays
TEAM_PITCHING_BB    Walks allowed
TEAM_PITCHING_H     Hits allowed
TEAM_PITCHING_HR    Home runs allowed
TEAM_PITCHING_SO    Strikeouts by pitchers



§1 Data Exploration


Subsections:
- Data Skimming
- Histogram
- Outliers
- Missing Values
- Correlation


Data Skimming

We begin our data exploration by skimming the training data for immediate summary statistics, data types, and a visualization of distributions with mini histograms.

We have 2,276 rows and 17 columns, all numeric.

Note that the average number of wins per team is 81 with a standard deviation of 16 wins, which is very close to the median of 82 wins. Combined with our boxplot, this suggests an approximately normal distribution in the middle, with extremely bad and extremely good teams deviating from the norm.

Data summary
Name train_1
Number of rows 2276
Number of columns 17
_______________________
Column type frequency:
numeric 17
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 1268.46 736.35 1 630.75 1270.5 1915.50 2535 ▇▇▇▇▇
TARGET_WINS 0 1.00 80.79 15.75 0 71.00 82.0 92.00 146 ▁▁▇▅▁
TEAM_BATTING_H 0 1.00 1469.27 144.59 891 1383.00 1454.0 1537.25 2554 ▁▇▂▁▁
TEAM_BATTING_2B 0 1.00 241.25 46.80 69 208.00 238.0 273.00 458 ▁▆▇▂▁
TEAM_BATTING_3B 0 1.00 55.25 27.94 0 34.00 47.0 72.00 223 ▇▇▂▁▁
TEAM_BATTING_HR 0 1.00 99.61 60.55 0 42.00 102.0 147.00 264 ▇▆▇▅▁
TEAM_BATTING_BB 0 1.00 501.56 122.67 0 451.00 512.0 580.00 878 ▁▁▇▇▁
TEAM_BATTING_SO 102 0.96 735.61 248.53 0 548.00 750.0 930.00 1399 ▁▆▇▇▁
TEAM_BASERUN_SB 131 0.94 124.76 87.79 0 66.00 101.0 156.00 697 ▇▃▁▁▁
TEAM_BASERUN_CS 772 0.66 52.80 22.96 0 38.00 49.0 62.00 201 ▃▇▁▁▁
TEAM_BATTING_HBP 2085 0.08 59.36 12.97 29 50.50 58.0 67.00 95 ▂▇▇▅▁
TEAM_PITCHING_H 0 1.00 1779.21 1406.84 1137 1419.00 1518.0 1682.50 30132 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 105.70 61.30 0 50.00 107.0 150.00 343 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 553.01 166.36 0 476.00 536.5 611.00 3645 ▇▁▁▁▁
TEAM_PITCHING_SO 102 0.96 817.73 553.09 0 615.00 813.5 968.00 19278 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 246.48 227.77 65 127.00 159.0 249.25 1898 ▇▁▁▁▁
TEAM_FIELDING_DP 286 0.87 146.39 26.23 52 131.00 149.0 164.00 228 ▁▂▇▆▁


Histogram

Our histograms suggest that most of our variables are not normally distributed, with a rightward skew being very common. Once we get to data preparation we will want to adjust the distributions where outlier removal is not enough by applying Box-Cox transformations, since a variable that is closer to normally distributed tends to be more useful as a predictor in a linear regression model.


Outliers

What is most immediately noticeable in the skimmed summary is that there are four columns whose histograms show outliers up to twenty times the 75th percentile: TEAM_PITCHING_H (hits allowed), TEAM_PITCHING_BB (walks allowed), TEAM_PITCHING_SO (strikeouts by pitchers), and TEAM_FIELDING_E (fielding errors). Outliers of this magnitude are usually erroneously entered data, or perhaps a totals row, and warrant a closer look.

Looking at the top 5 rows of each of these columns, there does not seem to be an obvious totals row, or any row that is completely filled with erroneous data; the top 5 rows differ across the columns rather than being the same few rows filled with outliers. Still, if these extremely large outliers were left in, our model would be greatly affected; we can see this by observing how far these points fall from the general trend in the scatter plots against wins. So, during data preparation we will cap values that fall far outside the IQR-based fences in these columns.
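
A small sketch of the fence computation behind this decision (illustrated on TEAM_PITCHING_H; data preparation later uses 2 * IQR caps):

# IQR-based fences for a single column
q1 <- quantile(train_1$TEAM_PITCHING_H, 0.25, na.rm = TRUE)
q3 <- quantile(train_1$TEAM_PITCHING_H, 0.75, na.rm = TRUE)
c(lower = q1 - 1.5 * (q3 - q1), upper = q3 + 1.5 * (q3 - q1))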


Missing Values

While the majority of our columns do not have any missing values, several columns do. Most strikingly, TEAM_BATTING_HBP is missing data in 2,085 rows, or 92% of our data. This column will have to be dropped: the amount of data in the column is too small to help our model, and imputing it would not be useful.

TEAM_BASERUN_CS also has a large amount of missing data, with NA values in 772 rows. This could be a candidate for imputation or removal, but since its coefficient of variation is below 1, we will impute rather than remove it.

TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_DP all have comparatively low proportions of missing data that can easily be dealt with via imputation.
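
A small sketch of how these missingness counts can be tallied directly (in addition to the plot_missing output used in the Code Appendix):

# Share of missing values per column, largest first
na_counts <- colSums(is.na(train_1))
sort(round(na_counts / nrow(train_1), 2), decreasing = TRUE)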

Rationale for Exclusion

An alternative explanation for why 92% of TEAM_BATTING_HBP is missing is that any missing value is actually a zero. To test this we researched how common it is for a batter to get hit.

Major League Baseball has stats going back to 1876: mlb.com/stats/pitching/hit-batsmen/all-time-by-season.

Out of 47,206 pitcher-seasons between 1876 and 2023, 68% (32,001) involved the pitcher hitting at least one batter.

The hit-by-pitch rule dates back to 1887, which covers the majority of the time period we are looking at (source: the 'Hit by pitch' Wikipedia article).

If missing values in TEAM_BATTING_HBP represented zero batters hit, we would expect the missing-value percentage to be closer to 32% rather than 92%, so we feel safe in treating the missing values as genuinely missing.


Correlation

Perhaps the most important thing to explore before building this linear model is whether any of our columns have a strong direct correlation with wins, since we want to select variables with strong correlation when building the model.

Note in the correlation plot below that many correlations are missing because of the missing values in some of our columns, so we will revisit this once data preparation is finished. From what we have, we can see the following:

  • TARGET_WINS has many small positive correlations, with TEAM_BATTING_H being the most correlated at around 0.3.
  • TEAM_BATTING_HR and TEAM_PITCHING_HR have a correlation of almost 1. This is interesting because it implies that hitting home runs and allowing them are associated. Still, we will want to remove one of these variables to avoid collinearity; TEAM_PITCHING_HR has the weaker correlation with wins, so it is the one we will remove.
  • The next highest absolute correlations are between TEAM_FIELDING_E and both TEAM_BATTING_BB and TEAM_PITCHING_H. This implies that teams with more fielding errors allow more hits and are less likely to have their batters take walks; both seem to be general indicators of a bad team.
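
A sketch of the pairwise correlations behind these observations (use = "pairwise.complete.obs" works around the missing values noted above):

# Correlation of every column with TARGET_WINS, using pairwise-complete observations
cors <- cor(select(train_1, -INDEX), use = "pairwise.complete.obs")
sort(cors[, "TARGET_WINS"], decreasing = TRUE)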




§2 Data Preparation

After exploring our data we move on to preparing it to be more useful for our goal of creating a linear model to predict team wins.


Subsections:
- Outliers
- Missing Values
- Distribution Transformation
- Interaction Columns
- Correlation


Outliers

As previously discussed, it is important to deal with outliers that fall beyond the 2 * IQR fences, as they have an outsized effect on the individual coefficients, skewing them larger or smaller than they should be. Here we chose to handle these outliers in the predictor variables by capping their values at the 2 * IQR fences rather than, say, replacing them with the mean, in order not to distort the data too much. Traditionally we would filter or cap values at 1.5 * IQR; however, given the large skew in some of our columns, that would change too much of the data. Notice in our scatter plots that although we have introduced an artificial cap at the rightmost ends, the axes are now reduced by up to 10x.
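
An equivalent, compact way to express this capping (a sketch using pmin/pmax; the Code Appendix applies the same 2 * IQR fences with replace):

# Cap a numeric vector at q1 - 2*IQR below and q3 + 2*IQR above
cap_iqr <- function(x, k = 2) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  fence <- k * (q3 - q1)
  pmin(pmax(x, q1 - fence), q3 + fence)
}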


Missing Values

When there are missing values in our data, fitting a linear model will end up dropping every row that contains one. To avoid this loss of data we will deal with the missing values ourselves.

Drop Column with Too Many Missing Values

To begin with, we drop the TEAM_BATTING_HBP column from our data. Although not related to missing values, we also drop the INDEX column, as it is not useful for generating the model.

Impute the Mean for Other Missing Values

Next we deal with the other columns that have missing values by imputing the column mean (computed after outlier capping).

## [1] "The amount of remaining NAs in the data is: 0"


Distribution Transformation

Histograms After Replacing Missing Values

After we have capped outliers and imputed missing values, our distributions will look different, so we want to look at them again before deciding on any transformations. We can now outline the distributions that we would want to transform:

  • TEAM_BASERUN_SB is rightward skewed with the capping effect visible at the right tail.
  • TEAM_FIELDING_E and TEAM_PITCHING_H echo this capped rightward skew with a more extreme effect.
  • TEAM_BATTING_3B is rightward skewed.
  • TEAM_BATTING_HR seems to be bimodal.
  • TEAM_PITCHING_HR also seems to be bimodal.

Manually Tune Lambda Parameters

We manually tune the lambda parameters used for the Box-Cox transformations in an attempt to normalize these distributions. Note that for BOX_TEAM_FIELDING_E and BOX_TEAM_PITCHING_H it is very difficult to get a normal distribution because of the cap, so we will likely be better served by not using these in our model.
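
As a sketch of one such transformation (lambda value as chosen in the Code Appendix, using fabletools::box_cox):

# Box-Cox: (x^lambda - 1) / lambda for lambda != 0, log(x) when lambda = 0
train_2 <- train_2 |>
  mutate(BOX_TEAM_BASERUN_SB = box_cox(TEAM_BASERUN_SB, 0.57))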


Interaction Columns

Often it is not enough to use the existing data, even after transforming it; interactions between variables can also affect how well a linear regression fits. For example, if we want a general batting evaluation variable we can divide TEAM_BATTING_H by TEAM_BATTING_SO, giving a ratio of how much more likely a batter is to get a hit than to strike out, something that logically should feed directly into winning more games. We also generate a similar ratio variable for pitching.
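
A sketch of the two ratio columns described here (the Code Appendix uses case_when; if_else is equivalent for guarding against division by zero):

# Hits per strikeout for batters; strikeouts per hit allowed for pitchers
train_2 <- train_2 |>
  mutate(BATTING = if_else(TEAM_BATTING_SO > 0, TEAM_BATTING_H / TEAM_BATTING_SO, TEAM_BATTING_H),
         PITCHING = TEAM_PITCHING_SO / TEAM_PITCHING_H)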


Correlation

After all of our data preparation is done, we revisit the correlation plot to see whether any of our transformations have substantially changed the correlation values. Unfortunately, there are no major changes in the correlations with TARGET_WINS. We also see that the Box-Cox transformation behind BOX_TEAM_BATTING_3B has produced missing values, so we will drop that column rather than use it.




§3 Build Models


Subsections:
- Model 1
- Model 2
- Model 3


Model 1

Raw data fed into a stepwise function

Model Description

The first model we will tackle is a baseline linear regression on the untransformed data, which has not gone through any of our data processing. This is to determine whether our transformations might have had the opposite effect and weakened the regression model. We use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
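
The fitting step, as in the Code Appendix (a full additive model handed to olsrr's stepwise selection):

# Fit on the raw training data, then let stepwise selection choose the predictors
model_1 <- lm(TARGET_WINS ~ ., data = train_1)
model_1 <- ols_step_both_p(model_1)
summary(model_1$model)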

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.6103  -6.6903  -0.0775   6.5444  27.9073 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      58.169860   6.633370   8.769  < 2e-16 ***
## TEAM_BATTING_H    0.034994   0.004687   7.466 1.41e-13 ***
## TEAM_FIELDING_E  -0.155842   0.009923 -15.705  < 2e-16 ***
## TEAM_BASERUN_SB   0.035711   0.008671   4.118 4.03e-05 ***
## TEAM_BATTING_BB   0.039641   0.003356  11.813  < 2e-16 ***
## TEAM_FIELDING_DP -0.113055   0.013133  -8.609  < 2e-16 ***
## TEAM_BATTING_3B   0.162367   0.022164   7.326 3.90e-13 ***
## TEAM_BATTING_2B  -0.069446   0.009322  -7.449 1.59e-13 ***
## TEAM_BATTING_HR   0.096828   0.009421  10.278  < 2e-16 ***
## TEAM_BATTING_SO  -0.014586   0.004015  -3.633  0.00029 ***
## TEAM_BASERUN_CS   0.051942   0.018212   2.852  0.00440 ** 
## TEAM_PITCHING_SO -0.006658   0.003245  -2.052  0.04034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.554 on 1474 degrees of freedom
##   (790 observations deleted due to missingness)
## Multiple R-squared:  0.4378, Adjusted R-squared:  0.4336 
## F-statistic: 104.3 on 11 and 1474 DF,  p-value: < 2.2e-16

Model Results

Model 1 gives us an adjusted R^2 of 0.4336 with an RMSE (residual standard error) of 9.554.

The F-statistic is 104.3 on 11 and 1474 degrees of freedom.

Model Evaluation

The residuals for model_1 are approximately normally distributed, as seen from the Normal Q-Q plot, and they are centered on 0. However, the variance of the residuals does not appear constant: in the Residuals vs Fitted graph the variance is largest around a fitted value of 80 and decreases on both sides, which suggests heteroscedasticity. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.


Model 2

Processed data (but not transformed), fed into a stepwise function.

Processing the data consists of handling outliers, removing variables with too much missing data, and imputing missing data in the remaining columns.

To handle the outliers we take any point that lies more than two times the interquartile range (IQR) below the first quartile or above the third quartile and cap it at that boundary.

Model Description

The second model we will generate is a linear regression on the processed dataset without our Box-Cox transformations. This is to determine whether the Box-Cox transformations might have had the opposite effect and weakened the regression model. We use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.637  -8.395   0.205   8.086  79.102 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.599729   6.940483   1.527 0.126843    
## TEAM_BATTING_H    0.034565   0.003959   8.731  < 2e-16 ***
## TEAM_BATTING_BB   0.060941   0.008297   7.345 2.86e-13 ***
## TEAM_FIELDING_DP -0.115499   0.013312  -8.676  < 2e-16 ***
## TEAM_FIELDING_E  -0.054642   0.005440 -10.044  < 2e-16 ***
## TEAM_BASERUN_SB   0.052924   0.005887   8.990  < 2e-16 ***
## TEAM_BATTING_3B   0.125732   0.017567   7.157 1.11e-12 ***
## TEAM_PITCHING_HR  0.048300   0.008968   5.386 7.94e-08 ***
## BATTING           0.006196   0.002121   2.921 0.003524 ** 
## TEAM_BASERUN_CS  -0.049032   0.019678  -2.492 0.012783 *  
## TEAM_PITCHING_H   0.016394   0.003099   5.290 1.34e-07 ***
## TEAM_PITCHING_BB -0.039449   0.007150  -5.517 3.84e-08 ***
## TEAM_BATTING_SO  -0.028624   0.006730  -4.253 2.19e-05 ***
## PITCHING         34.787007   9.281586   3.748 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.13 on 2262 degrees of freedom
## Multiple R-squared:  0.3095, Adjusted R-squared:  0.3055 
## F-statistic: 77.97 on 13 and 2262 DF,  p-value: < 2.2e-16

Model Results

Model 2 gives us an adjusted R^2 of 0.3055 with an RMSE of 13.13. It seems our processing has actually weakened the linear model; the outliers that we capped were likely important for determining whether a team was winning games.

The F-statistic is 77.97 on 13 and 2262 degrees of freedom.

Model Evaluation

The residuals for model_2 are approximately normally distributed, as seen from the Normal Q-Q plot with acceptable deviations at the ends, and they are centered on 0. The variance of the residuals is slightly less heteroscedastic because of the outlier capping we have done, which can be seen in the Residuals vs Fitted graph. The residuals can otherwise be considered independent, as there is no clear pattern beyond the change in variance.


Model 3

Processed and Transformed data, fed into a stepwise function

In addition to the data processing described for Model 2, we transform selected variables with Box-Cox transformations, aiming for normal distributions.

Model Description

The third model we will generate is a linear regression on the processed dataset with our Box-Cox transformations included. This is to determine whether the Box-Cox transformations have helped the regression model. We use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.222  -7.946   0.244   8.231  72.879 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.868e+04  5.459e+03   7.086 1.84e-12 ***
## TEAM_BATTING_H       4.201e-02  4.249e-03   9.885  < 2e-16 ***
## TEAM_BATTING_BB      5.273e-02  6.741e-03   7.823 7.82e-15 ***
## TEAM_FIELDING_DP    -1.233e-01  1.325e-02  -9.310  < 2e-16 ***
## BOX_TEAM_FIELDING_E -1.186e+02  1.147e+01 -10.340  < 2e-16 ***
## TEAM_BATTING_3B      1.438e-01  1.766e-02   8.143 6.29e-16 ***
## BOX_TEAM_BASERUN_SB  3.702e-01  4.377e-02   8.457  < 2e-16 ***
## TEAM_PITCHING_HR     4.095e-02  8.986e-03   4.558 5.44e-06 ***
## TEAM_PITCHING_BB    -3.007e-02  5.938e-03  -5.065 4.42e-07 ***
## TEAM_PITCHING_H      7.607e-02  1.091e-02   6.975 3.99e-12 ***
## BOX_TEAM_PITCHING_H -2.676e+04  3.802e+03  -7.037 2.59e-12 ***
## TEAM_BASERUN_CS     -5.705e-02  1.958e-02  -2.913  0.00361 ** 
## TEAM_PITCHING_SO     4.081e-02  8.590e-03   4.751 2.15e-06 ***
## PITCHING            -8.186e+01  1.471e+01  -5.564 2.95e-08 ***
## BATTING              6.873e-03  2.396e-03   2.868  0.00417 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.98 on 2261 degrees of freedom
## Multiple R-squared:  0.3255, Adjusted R-squared:  0.3214 
## F-statistic: 77.95 on 14 and 2261 DF,  p-value: < 2.2e-16

Model Results

Model 3 gives us an adjusted R^2 of 0.3214 with an RMSE of 12.98. Although the Box-Cox transformed variables seem to have helped relative to Model 2, this is still a weaker model than the one built from the original data. Once again, the outliers that we capped were likely important for determining whether a team was winning games.

The F-statistic is 77.95 on 14 and 2261 degrees of freedom.

Model Evaluation

The residuals for model_3 are approximately normally distributed, as seen from the Normal Q-Q plot with acceptable deviations at the ends, and they are centered on 0. The variance of the residuals is slightly less heteroscedastic than in both previous models because of the outlier capping and transformations we have done, which can be seen in the Residuals vs Fitted graph. The residuals can otherwise be considered independent, as there is no clear pattern beyond the change in variance.




§4 Select Models

Selecting model_1, which was built from the original data, seems like the best choice here. Not only does it have the highest adjusted R^2 and the lowest RMSE, but its coefficients are also easier to explain than those of the Box-Cox transformed model_3.

model_1 also has a larger F-statistic, indicating that the joint effect of its variables is more significant than in the other two models.

Do note that we have to deal with NA values in the evaluation dataset to get predictions from this model. Here we convert them all to 0, since the training data used for model_1 did not have its NAs handled in any way.
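
The prediction step, as in the Code Appendix (NAs in eval_1 replaced with 0 before predicting):

# Zero-fill NAs in the evaluation data, predict wins, and write the results out
predictions <- predict(model_1$model, eval_1 |>
  mutate_if(is.numeric, ~ replace_na(., 0)))
write_csv(as_tibble(predictions), "evaluated.csv")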




Conclusion

It seems that a simple linear regression model is not well suited for predicting team wins in this baseball data set. This is highlighted not only by the data transformations failing to improve the R^2 and RMSE, but also by the diagnostic residual plots: the Residuals vs Fitted plot shows heteroscedasticity in the residuals of every model. Weighted least squares regression, which controls for this, could be a more suitable way to extend the analysis.
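
As an illustrative sketch only (not fit in this report), one common way to set up weighted least squares is to weight each observation by the inverse of its estimated residual spread, for example:

# Sketch: estimate the error spread from the OLS fit, then refit with inverse-variance weights
ols_fit <- model_1$model
spread <- pmax(fitted(lm(abs(residuals(ols_fit)) ~ fitted(ols_fit))), 1e-6)
wls_fit <- lm(formula(ols_fit), data = model.frame(ols_fit), weights = 1 / spread^2)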

Sometimes the data you are dealt, despite its apparent flaws, is the best data to work with for training your model.




Code Appendix


Introduction

Library

Here are the packages we are using:

library(tidyverse)
library(olsrr)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)


Loading Data

train_1 <- read_csv("moneyball-training-data.csv", show_col_types = FALSE)
eval_1 <- read_csv("moneyball-evaluation-data.csv", show_col_types = FALSE)
glimpse(train_1)


§1 Data Exploration

Data Skimming

skim(train_1)


Histogram

train_1 |>
  select(TARGET_WINS) |>
  ggplot(aes(x= TARGET_WINS)) +
    geom_boxplot() +
    labs(title = "Distribution of Season Wins Per Team",x = "Season Wins")


Outliers

train_1 |>
  slice_max(order_by = TEAM_PITCHING_H, n = 5)
train_1 |>
  slice_max(order_by = TEAM_PITCHING_BB, n = 5)
train_1 |>
  slice_max(order_by = TEAM_PITCHING_SO, n = 5)
train_1 |>
  slice_max(order_by = TEAM_FIELDING_E, n = 5)
suppressWarnings(
  train_1 |>
    select(TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
    plot_scatterplot(by = "TARGET_WINS",nrow = 2L, ncol = 2L)
)


Missing Values

plot_missing(train_1, missing_only = TRUE, title = "Percent Rows Missing Values by Column")


Correlation

M <- train_1 %>%
  select(-INDEX,-TEAM_BATTING_HBP) %>%
  cor()
corrplot(M, type = 'lower', tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey")
corrplot(M, method = 'number', type = 'lower', tl.srt = 45, tl.cex = 0.5, number.cex = 0.55, na.label = "square", na.label.col = "lightgrey")


§2 Data Preparation


Outliers

train_2 <- train_1 |>
  select(-TARGET_WINS)

train_2 <- cbind(TARGET_WINS = train_1$TARGET_WINS, as_tibble(
  lapply(train_2, function(x) {
    # Cap each predictor at the 2 * IQR fences on both sides
    q1 <- quantile(x, .25, na.rm = TRUE)
    q3 <- quantile(x, .75, na.rm = TRUE)
    IQR <- q3 - q1
    x <- replace(x, x < (q1 - 2*IQR), (q1 - 2*IQR))
    replace(x, x > (q3 + 2*IQR), (q3 + 2*IQR))
})
))

suppressWarnings(
  train_2 |>
    select(TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
    plot_scatterplot(by = "TARGET_WINS",nrow = 2L, ncol = 2L)
)


Missing Values

Drop Column with Too Many Missing Values

train_2 <- train_2 |>
  drop_columns(c("TEAM_BATTING_HBP", "INDEX"))

Impute the Mean for Other Missing Values

train_2$TEAM_BATTING_SO[is.na(train_2$TEAM_BATTING_SO)] <- mean(train_2$TEAM_BATTING_SO,na.rm=TRUE)
train_2$TEAM_BASERUN_SB[is.na(train_2$TEAM_BASERUN_SB)] <- mean(train_2$TEAM_BASERUN_SB,na.rm=TRUE)
train_2$TEAM_BASERUN_CS[is.na(train_2$TEAM_BASERUN_CS)] <- mean(train_2$TEAM_BASERUN_CS,na.rm=TRUE)
train_2$TEAM_PITCHING_SO[is.na(train_2$TEAM_PITCHING_SO)] <- mean(train_2$TEAM_PITCHING_SO,na.rm=TRUE)
train_2$TEAM_FIELDING_DP[is.na(train_2$TEAM_FIELDING_DP)] <- mean(train_2$TEAM_FIELDING_DP,na.rm=TRUE)
print(paste0("The amount of remaining NAs in the data is: ", sum(is.na(train_2))))


Distribution Transformation

Histograms After Replacing Missing Values

train_2 |>
  plot_histogram(nrow = 3L, ncol = 3L)

Manually Tune Lambda Parameters

train_2 <- train_2 %>%
  mutate(BOX_TEAM_BASERUN_SB = box_cox(TEAM_BASERUN_SB, 0.57),
         BOX_TEAM_FIELDING_E = box_cox(TEAM_FIELDING_E, -0.42),
         BOX_TEAM_PITCHING_H = box_cox(TEAM_PITCHING_H, -0.69),
         BOX_TEAM_BATTING_3B = box_cox(TEAM_BATTING_3B, -.001),
         BOX_TEAM_BATTING_HR = box_cox(TEAM_BATTING_HR, 0.50),
         BOX_TEAM_PITCHING_HR = box_cox(TEAM_PITCHING_HR, 0.57)
         )

train_2 |>
  dplyr::select(BOX_TEAM_BASERUN_SB, BOX_TEAM_FIELDING_E, BOX_TEAM_PITCHING_H, BOX_TEAM_BATTING_3B, BOX_TEAM_BATTING_HR, BOX_TEAM_PITCHING_HR) %>%
  plot_histogram(nrow = 2L, ncol = 3L)


Interaction Columns

train_2 <- train_2 %>%
  mutate(BATTING = case_when(TEAM_BATTING_SO > 0 ~ TEAM_BATTING_H/TEAM_BATTING_SO, .default = TEAM_BATTING_H),
         PITCHING = TEAM_PITCHING_SO / TEAM_PITCHING_H)


Correlation

M <- train_2 %>%
  cor()
corrplot(M, type = 'lower', tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey")
corrplot(M, method = 'number', type = 'lower', tl.srt = 45, tl.cex = 0.5, number.cex = 0.35, na.label = "square", na.label.col = "lightgrey")


§3 Build Models

Model 1

Model Description

model_1 <- lm(TARGET_WINS ~ ., data = train_1)
model_1 <- ols_step_both_p(model_1)
summary(model_1$model)

Model Evaluation

autoplot(model_1$model)


Model 2

Model Description

model_2 <- lm(TARGET_WINS ~ .-BOX_TEAM_BASERUN_SB -BOX_TEAM_FIELDING_E -BOX_TEAM_PITCHING_H -BOX_TEAM_BATTING_3B -BOX_TEAM_BATTING_HR -BOX_TEAM_PITCHING_HR, data = train_2)
model_2 <- ols_step_both_p(model_2)
summary(model_2$model)

Model Evaluation

autoplot(model_2$model)


Model 3

Model Description

model_3 <- lm(TARGET_WINS ~ ., data = train_2)
model_3 <- ols_step_both_p(model_3)
summary(model_3$model)

Model Evaluation

autoplot(model_3$model)


§4 Select Models

predictions <- predict(model_1$model, eval_1 |>
  mutate_if(is.numeric, ~replace_na(.,0)))
write_csv(as_tibble(predictions),"evaluated.csv")