library(tidyverse)
library(olsrr)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)
We will explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and captures the team's performance for that year, with all statistics adjusted to match the performance of a 162-game season.
Here we load the data from the CSVs we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing it. (Note that despite its name, test_1 holds the training data.)
test_1 <- read_csv("moneyball-training-data.csv", show_col_types = FALSE)
eval_1 <- read_csv("moneyball-evaluation-data.csv", show_col_types = FALSE)
glimpse(test_1)
## Rows: 2,276
## Columns: 17
## $ INDEX <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
## $ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
We can see that the expected number of records has been loaded.
We begin our data exploration by skimming the training data for immediate summary statistics, data types, and a visualization of distributions via mini histograms.
skim(test_1)
| Name | test_1 |
| Number of rows | 2276 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| numeric | 17 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 1268.46 | 736.35 | 1 | 630.75 | 1270.5 | 1915.50 | 2535 | ▇▇▇▇▇ |
| TARGET_WINS | 0 | 1.00 | 80.79 | 15.75 | 0 | 71.00 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
| TEAM_BATTING_H | 0 | 1.00 | 1469.27 | 144.59 | 891 | 1383.00 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
| TEAM_BATTING_2B | 0 | 1.00 | 241.25 | 46.80 | 69 | 208.00 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
| TEAM_BATTING_3B | 0 | 1.00 | 55.25 | 27.94 | 0 | 34.00 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
| TEAM_BATTING_HR | 0 | 1.00 | 99.61 | 60.55 | 0 | 42.00 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
| TEAM_BATTING_BB | 0 | 1.00 | 501.56 | 122.67 | 0 | 451.00 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
| TEAM_BATTING_SO | 102 | 0.96 | 735.61 | 248.53 | 0 | 548.00 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
| TEAM_BASERUN_SB | 131 | 0.94 | 124.76 | 87.79 | 0 | 66.00 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
| TEAM_BASERUN_CS | 772 | 0.66 | 52.80 | 22.96 | 0 | 38.00 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
| TEAM_BATTING_HBP | 2085 | 0.08 | 59.36 | 12.97 | 29 | 50.50 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
| TEAM_PITCHING_H | 0 | 1.00 | 1779.21 | 1406.84 | 1137 | 1419.00 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
| TEAM_PITCHING_HR | 0 | 1.00 | 105.70 | 61.30 | 0 | 50.00 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
| TEAM_PITCHING_BB | 0 | 1.00 | 553.01 | 166.36 | 0 | 476.00 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
| TEAM_PITCHING_SO | 102 | 0.96 | 817.73 | 553.09 | 0 | 615.00 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
| TEAM_FIELDING_E | 0 | 1.00 | 246.48 | 227.77 | 65 | 127.00 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
| TEAM_FIELDING_DP | 286 | 0.87 | 146.39 | 26.23 | 52 | 131.00 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
We have 2276 rows with 17 columns that are all numeric.
Note here that the average number of wins per team is 81 with a standard deviation of 16 wins, which is very close to the median of 82 wins. Combined with the boxplot below, this suggests a roughly normal distribution toward the middle, with extremely bad and extremely good teams deviating from the norm.
test_1 |>
  select(TARGET_WINS) |>
  ggplot(aes(x = TARGET_WINS)) +
  geom_boxplot() +
  labs(title = "Distribution of Season Wins Per Team", x = "Season Wins")
Our histograms suggest that most of our variables are not actually normally distributed, with a rightward skew being very common. Once we get to data preparation, we will want to adjust the distributions where outlier removal is not enough using Box-Cox transformations, since a variable being approximately normally distributed lends to its predictive power in a linear regression model.
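To put a number on that visual impression, here is a quick sketch computing the sample skewness of each column (the skew helper is our own, written inline to avoid an extra package dependency):

skew <- function(x) {
  # Simple sample skewness: third central moment scaled by sd cubed.
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

test_1 |>
  select(-INDEX) |>
  summarise(across(everything(), skew)) |>
  pivot_longer(everything(), names_to = "column", values_to = "skewness") |>
  arrange(desc(skewness))

Large positive values would confirm the rightward skew we see in the mini histograms.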
Most immediately noticeable in the skimmed summary is that four columns have histograms showing outliers up to twenty times the 75th percentile: TEAM_PITCHING_H (hits allowed by pitchers), TEAM_PITCHING_BB (walks allowed by pitchers), TEAM_PITCHING_SO (strikeouts by pitchers), and TEAM_FIELDING_E (errors by the team). Outliers of this magnitude are usually erroneously entered data, or perhaps a totals row, and warrant a closer look.
test_1 |>
slice_max(order_by = TEAM_PITCHING_H, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1769 36 1674 216 54
## 2 1347 0 891 135 0
## 3 2380 41 992 263 20
## 4 459 23 1458 220 35
## 5 1494 108 1188 338 0
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
test_1 |>
slice_max(order_by = TEAM_PITCHING_BB, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1494 108 1188 338 0
## 2 2380 41 992 263 20
## 3 309 71 1325 162 76
## 4 1202 60 1518 162 68
## 5 1492 95 1494 261 68
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
test_1 |>
slice_max(order_by = TEAM_PITCHING_SO, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2380 41 992 263 20
## 2 1494 108 1188 338 0
## 3 1 39 1445 194 39
## 4 2051 46 1254 154 127
## 5 309 71 1325 162 76
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
test_1 |>
slice_max(order_by = TEAM_FIELDING_E, n = 5)
## # A tibble: 5 × 17
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 459 23 1458 220 35
## 2 1347 0 891 135 0
## 3 2050 14 1437 148 56
## 4 1769 36 1674 216 54
## 5 433 39 1620 201 44
## # ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
## # TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
## # TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
## # TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
## # TEAM_FIELDING_DP <dbl>
suppressWarnings(
  test_1 |>
    select(TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
    plot_scatterplot(by = "TARGET_WINS", nrow = 2L, ncol = 2L)
)
Looking at the top 5 of each of these columns, there does not seem to be an obvious totals row, or any single row completely filled with erroneous data: the top 5 rows differ across categories rather than being one row full of outliers. Still, if these extremely large outliers were left in, our model would be greatly affected; we can see how far these points sit from the general trend in the scatter plots against wins. So we will cap the extreme values relative to the IQR when we prepare the data.
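Before settling on a rule, it is worth quantifying how much data the traditional 1.5 x IQR cutoff would flag in these four columns; a quick sketch (the outlier_count helper is ours):

outlier_count <- function(x) {
  # Count values falling outside 1.5 * IQR beyond the quartiles.
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

test_1 |>
  select(TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
  summarise(across(everything(), outlier_count))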
While the majority of our columns have no missing values, a few do. Most strikingly, TEAM_BATTING_HBP is missing data in 2,085 rows, or 92% of our data. This column will have to be dropped: with so little of it present, imputation would not be useful and the column cannot help our model.
TEAM_BASERUN_CS also has a large amount of missing data, with 772 rows holding NA values. This could be a candidate for imputation or removal, but since its coefficient of variation is below 1, we will keep it and impute.
TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_BASERUN_SB, and TEAM_FIELDING_DP all have comparatively lower proportions of missing data that will easily be dealt with via imputation.
plot_missing(test_1, missing_only = TRUE, title = "Percent Rows Missing Values by Column")
Perhaps the most important thing to explore before building a linear model is whether any of our columns have a strong direct correlation with wins themselves, since we want to select variables with strong correlation when building the model.
We will evaluate correlation with a correlation plot.
M <- test_1 %>%
select(-INDEX,-TEAM_BATTING_HBP) %>%
cor()
corrplot(M, type = 'lower', tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey")
corrplot(M, method = 'number', type = 'lower', tl.srt = 45, tl.cex = 0.5, number.cex = 0.55, na.label = "square", na.label.col = "lightgrey")
Note that many correlations are missing because of the NA values in some of our columns, so we will revisit this once data preparation is finished. In the meantime, pairwise-complete correlations against TARGET_WINS give us a preview of which variables look promising, as sketched below.
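A minimal sketch, using pairwise-complete observations so each correlation is computed only on the rows where both columns are present:

# Correlations with TARGET_WINS, sidestepping the NA gaps in the matrix above.
cors <- cor(select(test_1, -INDEX, -TEAM_BATTING_HBP),
            use = "pairwise.complete.obs")
sort(cors[, "TARGET_WINS"], decreasing = TRUE)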
After exploring our data we move on to preparing it to be more useful for our goal of creating a linear model to predict team wins.
As previously discussed, it is important to deal with outliers lying beyond 2 * IQR, as these have an outsized effect on the individual coefficients, skewing them to be larger or smaller than they should be. Here we chose to handle the outliers in the predictor variables by capping their values at 2 * IQR beyond the quartiles, rather than removing rows, so as not to distort the data too much. Traditionally we would filter and cap values at 1.5 * IQR, but with the heavy skew in some of our columns that would alter too much of the data. Notice in the scatter plots that although we have introduced an artificial cap at the rightmost ends, the axes are now reduced by up to 10x.
test_2 <- test_1 |>
  select(-TARGET_WINS)

test_2 <- cbind(TARGET_WINS = test_1$TARGET_WINS, as_tibble(
  lapply(test_2, function(x) {
    # Cap each predictor at 2 * IQR beyond the quartiles. The lower-bound
    # replace() must be assigned back to x, otherwise its result is
    # silently discarded before the upper-bound replace runs.
    q1 <- quantile(x, .25, na.rm = TRUE)
    q3 <- quantile(x, .75, na.rm = TRUE)
    iqr <- q3 - q1
    x <- replace(x, x < (q1 - 2 * iqr), q1 - 2 * iqr)
    replace(x, x > (q3 + 2 * iqr), q3 + 2 * iqr)
  })
))
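As a quick numeric check (a sketch, not part of the pipeline), the maxima of the four heavy-tailed columns should now sit at their respective q3 + 2 * IQR caps:

# Verify the caps took effect on the heavy-tailed columns.
test_2 |>
  summarise(across(c(TEAM_PITCHING_H, TEAM_PITCHING_BB,
                     TEAM_PITCHING_SO, TEAM_FIELDING_E),
                   ~ max(.x, na.rm = TRUE)))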
suppressWarnings(
  test_2 |>
    select(TARGET_WINS, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, TEAM_FIELDING_E) |>
    plot_scatterplot(by = "TARGET_WINS", nrow = 2L, ncol = 2L)
)
When there are missing values in our data, fitting a linear model will silently drop every row that contains an NA. To avoid this loss of data we will handle the missing values ourselves. To begin, we drop the TEAM_BATTING_HBP column. Though unrelated to missing values, we also drop INDEX, as it is not useful for generating the model.
test_2 <- test_2 |>
drop_columns(c("TEAM_BATTING_HBP", "INDEX"))
Next we deal with the other columns that have missing values, using mean imputation (with means computed after the outlier capping above).
test_2 <- test_2 |>
  mutate(across(c(TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS,
                  TEAM_PITCHING_SO, TEAM_FIELDING_DP),
                # Fill NAs with the column mean of the non-missing values.
                ~ replace_na(.x, mean(.x, na.rm = TRUE))))
print(paste0("The number of remaining NAs in the data is: ", sum(is.na(test_2))))
## [1] "The number of remaining NAs in the data is: 0"
After we have capped outliers and imputed missing values, our distributions will look different, so we want to inspect them before deciding on any transformations. We can now identify the distributions we want to transform:
test_2 |>
plot_histogram(nrow = 3L, ncol = 3L)
We manually tune the lambda parameters used for the Box-Cox transformations in an attempt to normalize these distributions. Note that for BOX_TEAM_FIELDING_E and BOX_TEAM_PITCHING_H it is very difficult to obtain a normal distribution due to the cap, so we will likely be better served not using these in our model.
test_2 <- test_2 %>%
mutate(BOX_TEAM_BASERUN_SB = box_cox(TEAM_BASERUN_SB, 0.57),
BOX_TEAM_FIELDING_E = box_cox(TEAM_FIELDING_E, -0.42),
BOX_TEAM_PITCHING_H = box_cox(TEAM_PITCHING_H, -0.69),
BOX_TEAM_BATTING_3B = box_cox(TEAM_BATTING_3B, -.001),
BOX_TEAM_BATTING_HR = box_cox(TEAM_BATTING_HR, 0.50),
BOX_TEAM_PITCHING_HR = box_cox(TEAM_PITCHING_HR, 0.57)
)
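As a cross-check on the hand-tuned lambdas, lambda can also be estimated by profile likelihood; a sketch using MASS::boxcox (MASS ships with R) on an intercept-only model for one column:

# Profile-likelihood estimate of lambda for TEAM_FIELDING_E; MASS:: is used
# directly so that attaching MASS does not mask dplyr::select.
bc <- MASS::boxcox(lm(TEAM_FIELDING_E ~ 1, data = test_2), plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda maximizing the log-likelihood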
test_2 |>
  dplyr::select(BOX_TEAM_BASERUN_SB, BOX_TEAM_FIELDING_E, BOX_TEAM_PITCHING_H,
                BOX_TEAM_BATTING_3B, BOX_TEAM_BATTING_HR, BOX_TEAM_PITCHING_HR) |>
  plot_histogram(nrow = 2L, ncol = 3L)
Often it is not enough to use the existing data even after transforming it; interactions between variables can affect how well a linear regression fits. For example, for a general batting evaluation variable we can divide TEAM_BATTING_H by TEAM_BATTING_SO, giving a ratio of how much more likely a batter is to make a hit than to strike out, something that logically should feed directly into winning more games. We also generate a similar ratio variable for pitching.
test_2 <- test_2 %>%
mutate(BATTING = case_when(TEAM_BATTING_SO > 0 ~ TEAM_BATTING_H/TEAM_BATTING_SO, .default = TEAM_BATTING_H),
PITCHING = TEAM_PITCHING_SO / TEAM_PITCHING_H)
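As a quick sanity check (a sketch), we can confirm the engineered ratios carry some signal about wins before relying on them:

# Correlation of the new ratio features with the response; no NAs remain
# at this point, so plain cor() is fine.
test_2 |>
  summarise(r_batting  = cor(BATTING, TARGET_WINS),
            r_pitching = cor(PITCHING, TARGET_WINS))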
After all of our data preparation is done, we revisit the correlation plot to see whether any of our transformations have majorly changed the correlation values. Unfortunately, the correlations with TARGET_WINS are essentially unchanged. We also see that the Box-Cox transformation behind BOX_TEAM_BATTING_3B produced non-finite values (its correlations are missing), so we will want to drop that column rather than use it.
M <- test_2 %>%
cor()
corrplot(M, type = 'lower', tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey")
corrplot(M, method = 'number', type = 'lower', tl.srt = 45, tl.cex = 0.5, number.cex = 0.35, na.label = "square", na.label.col = "lightgrey")
The first model we build is a baseline linear regression on the untransformed data, which has not gone through any of our processing. This is to determine whether our transformations may have had the opposite effect and weakened the regression model. We use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
model_1 <- lm(TARGET_WINS ~ ., data = test_1)
model_1 <- ols_step_both_p(model_1)
summary(model_1$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.6103 -6.6903 -0.0775 6.5444 27.9073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.169860 6.633370 8.769 < 2e-16 ***
## TEAM_BATTING_H 0.034994 0.004687 7.466 1.41e-13 ***
## TEAM_FIELDING_E -0.155842 0.009923 -15.705 < 2e-16 ***
## TEAM_BASERUN_SB 0.035711 0.008671 4.118 4.03e-05 ***
## TEAM_BATTING_BB 0.039641 0.003356 11.813 < 2e-16 ***
## TEAM_FIELDING_DP -0.113055 0.013133 -8.609 < 2e-16 ***
## TEAM_BATTING_3B 0.162367 0.022164 7.326 3.90e-13 ***
## TEAM_BATTING_2B -0.069446 0.009322 -7.449 1.59e-13 ***
## TEAM_BATTING_HR 0.096828 0.009421 10.278 < 2e-16 ***
## TEAM_BATTING_SO -0.014586 0.004015 -3.633 0.00029 ***
## TEAM_BASERUN_CS 0.051942 0.018212 2.852 0.00440 **
## TEAM_PITCHING_SO -0.006658 0.003245 -2.052 0.04034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.554 on 1474 degrees of freedom
## (790 observations deleted due to missingness)
## Multiple R-squared: 0.4378, Adjusted R-squared: 0.4336
## F-statistic: 104.3 on 11 and 1474 DF, p-value: < 2.2e-16
Model 1 gives us an adjusted R^2 of 0.4336 with a residual standard error of 9.554.
autoplot(model_1$model)
The residuals for model_1 are normally distributed, as seen in the Normal Q-Q plot, and are centered on 0. However, the variance of the residuals does not appear constant: in the Residuals vs Fitted plot the variance is largest around a fitted value of 80 and decreases on both sides, a mild form of heteroscedasticity. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.
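That suspicion of non-constant variance can be checked formally; a sketch using olsrr's Breusch-Pagan test (olsrr is already loaded), where a small p-value indicates heteroscedastic residuals:

# Breusch-Pagan test of constant residual variance for model 1.
ols_test_breusch_pagan(model_1$model)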
The second model we generate is a linear regression on the transformed dataset, excluding our Box-Cox transformed columns. This is to determine whether the Box-Cox transformations may have had the opposite effect on the regression model and weakened it. We again use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
model_2 <- lm(TARGET_WINS ~ .-BOX_TEAM_BASERUN_SB -BOX_TEAM_FIELDING_E -BOX_TEAM_PITCHING_H -BOX_TEAM_BATTING_3B -BOX_TEAM_BATTING_HR -BOX_TEAM_PITCHING_HR, data = test_2)
model_2 <- ols_step_both_p(model_2)
summary(model_2$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.637 -8.395 0.205 8.086 79.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.599729 6.940483 1.527 0.126843
## TEAM_BATTING_H 0.034565 0.003959 8.731 < 2e-16 ***
## TEAM_BATTING_BB 0.060941 0.008297 7.345 2.86e-13 ***
## TEAM_FIELDING_DP -0.115499 0.013312 -8.676 < 2e-16 ***
## TEAM_FIELDING_E -0.054642 0.005440 -10.044 < 2e-16 ***
## TEAM_BASERUN_SB 0.052924 0.005887 8.990 < 2e-16 ***
## TEAM_BATTING_3B 0.125732 0.017567 7.157 1.11e-12 ***
## TEAM_PITCHING_HR 0.048300 0.008968 5.386 7.94e-08 ***
## BATTING 0.006196 0.002121 2.921 0.003524 **
## TEAM_BASERUN_CS -0.049032 0.019678 -2.492 0.012783 *
## TEAM_PITCHING_H 0.016394 0.003099 5.290 1.34e-07 ***
## TEAM_PITCHING_BB -0.039449 0.007150 -5.517 3.84e-08 ***
## TEAM_BATTING_SO -0.028624 0.006730 -4.253 2.19e-05 ***
## PITCHING 34.787007 9.281586 3.748 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.13 on 2262 degrees of freedom
## Multiple R-squared: 0.3095, Adjusted R-squared: 0.3055
## F-statistic: 77.97 on 13 and 2262 DF, p-value: < 2.2e-16
Model 2 gives us an adjusted R^2 of 0.3055 with a residual standard error of 13.13. It seems our transformations have actually weakened the linear model; the outliers we capped were likely genuinely informative about whether a team was winning games.
autoplot(model_2$model)
The residuals for model_2 are normally distributed, as seen in the Normal Q-Q plot with acceptable deviations at the ends, and are centered on 0. The variance of the residuals is slightly less heteroscedastic because of the outlier capping we have done, as can be seen in the Residuals vs Fitted plot. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.
The third model we generate is a linear regression on the transformed dataset including our Box-Cox transformed columns. This is to determine whether the Box-Cox transformations may have helped the regression model. We again use automatic stepwise regression to gradually add variables, determining how significant each may be within the model.
model_3 <- lm(TARGET_WINS ~ ., data = test_2)
model_3 <- ols_step_both_p(model_3)
summary(model_3$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.222 -7.946 0.244 8.231 72.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.868e+04 5.459e+03 7.086 1.84e-12 ***
## TEAM_BATTING_H 4.201e-02 4.249e-03 9.885 < 2e-16 ***
## TEAM_BATTING_BB 5.273e-02 6.741e-03 7.823 7.82e-15 ***
## TEAM_FIELDING_DP -1.233e-01 1.325e-02 -9.310 < 2e-16 ***
## BOX_TEAM_FIELDING_E -1.186e+02 1.147e+01 -10.340 < 2e-16 ***
## TEAM_BATTING_3B 1.438e-01 1.766e-02 8.143 6.29e-16 ***
## BOX_TEAM_BASERUN_SB 3.702e-01 4.377e-02 8.457 < 2e-16 ***
## TEAM_PITCHING_HR 4.095e-02 8.986e-03 4.558 5.44e-06 ***
## TEAM_PITCHING_BB -3.007e-02 5.938e-03 -5.065 4.42e-07 ***
## TEAM_PITCHING_H 7.607e-02 1.091e-02 6.975 3.99e-12 ***
## BOX_TEAM_PITCHING_H -2.676e+04 3.802e+03 -7.037 2.59e-12 ***
## TEAM_BASERUN_CS -5.705e-02 1.958e-02 -2.913 0.00361 **
## TEAM_PITCHING_SO 4.081e-02 8.590e-03 4.751 2.15e-06 ***
## PITCHING -8.186e+01 1.471e+01 -5.564 2.95e-08 ***
## BATTING 6.873e-03 2.396e-03 2.868 0.00417 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.98 on 2261 degrees of freedom
## Multiple R-squared: 0.3255, Adjusted R-squared: 0.3214
## F-statistic: 77.95 on 14 and 2261 DF, p-value: < 2.2e-16
Model 3 gives us an adjusted R^2 of 0.3214 with a residual standard error of 12.98. Although the Box-Cox transformed variables seem to have helped relative to model 2, overall this is still a worse model than the one built on the original data. Once again, the outliers we capped were likely genuinely informative about whether a team was winning games.
autoplot(model_3$model)
The residuals for model_3 are normally distributed, as seen in the Normal Q-Q plot with acceptable deviations at the ends, and are centered on 0. The variance of the residuals is slightly less heteroscedastic than in both previous models because of the outlier capping and transformations, as can be seen in the Residuals vs Fitted plot. The residuals can otherwise be considered independent, as there is no clear pattern beyond this change in variance.
Keeping model_1, built from the original data, seems like the best choice here: not only does it have the highest adjusted R^2 and the lowest residual standard error, its coefficients are also easier to interpret than those of the Box-Cox transformed model_3.
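A sketch tabulating the three fits side by side, pulling adjusted R^2 and residual standard error from each summary:

# Side-by-side comparison of the three candidate models.
fits <- list(model_1 = model_1$model,
             model_2 = model_2$model,
             model_3 = model_3$model)
tibble(model  = names(fits),
       adj_r2 = sapply(fits, \(m) summary(m)$adj.r.squared),
       sigma  = sapply(fits, \(m) summary(m)$sigma))

Note that model_1 was fit on fewer rows (790 observations were deleted due to missingness), so the comparison is indicative rather than strictly apples-to-apples.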
Do note that we have to deal with NA values in the evaluation dataset to get predictions from this model for every row. Here we convert them all to 0, since the pipeline behind model_1 did not itself alter any NAs (lm simply dropped those rows during training).
predictions <- predict(model_1$model,
                       eval_1 |> mutate(across(where(is.numeric), ~ replace_na(.x, 0))))
write_csv(as_tibble(predictions),"evaluated.csv")
Sometimes the data you are dealt, despite its apparent flaws, is the best data to work with for training your model.