Introduction

How can I estimate the number of wins a baseball team will achieve in a season based on its performance data? In this report, I develop a linear model to address this question. The Moneyball dataset may provide insights into the key performance factors that contribute most to a team’s success in securing wins. Variables strongly correlated with a higher number of wins will be explored further.

In this report, I will explore the data and assess its quality, prepare the data for modeling, build and compare several linear models, and select a final model based on test-set performance.

Data Exploration

As part of my initial data exploration, I will focus on several key diagnostics: the extent and pattern of missing values, implausible or extreme values, the shape of each variable's distribution, and correlations among the predictors.

Exploring the dataset is crucial to uncovering unexpected patterns or anomalies. As I proceed, I will also consider the key assumptions for linear modeling: a linear relationship between each predictor and the target, independent observations, homoskedastic and approximately normal residuals, and limited collinearity among predictors.

While real-world data rarely meets these conditions perfectly, there are potential data transformations I can apply to mitigate violations.

The dataset contains 2,276 rows, with each representing the full-season performance of a baseball team from 1871 to 2006.
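The analysis relies on the packages below. This setup chunk is a reconstruction implied by the report's function calls, and the input file name is an assumption:

#Packages implied by the calls in this report (reconstructed setup chunk)
library(tidyverse)    #dplyr, tidyr, ggplot2
library(skimr)        #skim_without_charts(), yank()
library(DataExplorer) #plot_missing()
library(ggthemes)     #theme_fivethirtyeight()
library(corrplot)     #corrplot.mixed()
library(mice)         #multiple imputation
library(stargazer)    #regression tables

raw <- read_csv("moneyball-training-data.csv") #file name assumed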

Let’s begin by examining a numerical summary of each variable:

skim_without_charts(raw)

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
INDEX 0 1 1268 736 1 631 1270 1916 2535
TARGET_WINS 0 1 81 16 0 71 82 92 146
TEAM_BATTING_H 0 1 1469 145 891 1383 1454 1537 2554
TEAM_BATTING_2B 0 1 241 47 69 208 238 273 458
TEAM_BATTING_3B 0 1 55 28 0 34 47 72 223
TEAM_BATTING_HR 0 1 100 61 0 42 102 147 264
TEAM_BATTING_BB 0 1 502 123 0 451 512 580 878
TEAM_BATTING_SO 102 1 736 249 0 548 750 930 1399
TEAM_BASERUN_SB 131 1 125 88 0 66 101 156 697
TEAM_BASERUN_CS 772 1 53 23 0 38 49 62 201
TEAM_BATTING_HBP 2085 0 59 13 29 50 58 67 95
TEAM_PITCHING_H 0 1 1779 1407 1137 1419 1518 1682 30132
TEAM_PITCHING_HR 0 1 106 61 0 50 107 150 343
TEAM_PITCHING_BB 0 1 553 166 0 476 536 611 3645
TEAM_PITCHING_SO 102 1 818 553 0 615 814 968 19278
TEAM_FIELDING_E 0 1 246 228 65 127 159 249 1898
TEAM_FIELDING_DP 286 1 146 26 52 131 149 164 228

From the summary above, I notice that certain variables have a significant amount of missing data, particularly TEAM_BATTING_HBP and TEAM_BASERUN_CS.

Some variables also display unusually high maximum values, such as TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_FIELDING_E, which may warrant further investigation.

Additionally, some variables contain values of zero that are unrealistic in the context of a full baseball season. These include TARGET_WINS, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_HR, TEAM_PITCHING_BB, and TEAM_PITCHING_SO.

Some of these entries are clearly erroneous. For instance, the all-time minimum batting strikeouts for a team over a complete season was 308, achieved by the 1921 Cincinnati Reds.
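A quick tally, sketched below for a few of the affected columns (output not shown), can confirm how widespread these zeros are:

#Count zero entries in a few affected columns (sketch; output not shown)
raw %>%
  summarise(across(c(TARGET_WINS, TEAM_BATTING_SO, TEAM_PITCHING_SO),
                   ~ sum(.x == 0, na.rm = TRUE)))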

My approach to handling missing data will depend on its distribution across columns and rows.

plot_missing(raw, title = "% of Missing Values by Variable", ggtheme = theme_fivethirtyeight())

Here, I observe once again that missing values are not randomly distributed throughout the dataset. Notably, TEAM_BATTING_HBP has nearly 92% missing values, making it unlikely to provide useful information for my model, so it will be removed.

Similarly, TEAM_BASERUN_CS has a high percentage of missing data, with 34% of its values missing. Additionally, TEAM_PITCHING_SO and TEAM_BATTING_SO have the exact same number of missing entries, suggesting that the missingness in these variables is likely not random.

Further analysis reveals that the missing data patterns in both variables are identical. One possible approach is to code the missingness as a separate factor variable. Alternatively, I may consider removing one or both variables from the models if further analysis raises concerns about data quality.
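A minimal sketch of that factor-coding alternative, using a simple observed/missing flag (illustrative only; not applied to the working dataset):

#Sketch: encode the shared missingness as a factor (not applied here)
raw_flagged <- raw %>%
  mutate(SO_MISSING = factor(is.na(TEAM_BATTING_SO),
                             levels = c(FALSE, TRUE),
                             labels = c("observed", "missing")))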

After reviewing the missing values and identifying areas of concern to address later in the report, I will now examine the characteristics of the predictor variables.

Assessing the distribution of each variable can help identify unusual features such as extreme skewness or low variance.

raw %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol = 3) +
  geom_density() +
  ggthemes::theme_fivethirtyeight()
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_density()`).

Density plots for each variable confirm the highly skewed distributions of TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_SO, as indicated by the numeric summary. Additionally, TEAM_BASERUN_SB and TEAM_BATTING_3B exhibit moderate skewness. Other distributions appear to be either roughly normal or bimodal.

A closer examination of the highly skewed variables may provide insights into the reasons behind their non-normality.

ggplot(data = raw, aes(x = TEAM_FIELDING_E)) +
  geom_histogram() +
  ggtitle("Errors Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In TEAM_FIELDING_E, a small number of entries exceed 1000. When these outliers are excluded, the distribution shows a moderate right skew.
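That zoomed-in view can be reproduced with the same filtering pattern used for other variables later in this section (sketch; plot not shown):

raw %>%
  filter(TEAM_FIELDING_E < 1000) %>%
  ggplot(aes(x = TEAM_FIELDING_E)) +
  geom_histogram() +
  ggtitle("Errors, Focused 0-1,000 Distribution") +
  theme_fivethirtyeight()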

ggplot(data = raw, aes(x = TEAM_PITCHING_BB)) +
  geom_histogram() +
  ggtitle("Walks AlloId Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of TEAM_PITCHING_BB appears mostly normal, aside from a few outliers with values exceeding 1000.

ggplot(data = raw, aes(x = TEAM_PITCHING_H)) +
  geom_histogram() +
  ggtitle("Hits AlloId Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s look more closely at the region where most of the values in TEAM_PITCHING_H lie, [0,6000]:

raw %>%
  filter(TEAM_PITCHING_H < 6000) %>%
  ggplot(aes(x = TEAM_PITCHING_H)) +
  geom_histogram() +
  ggtitle("Hit AlloId, Focused 0 - 6,000 Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TEAM_PITCHING_H remains skewed even within the narrower region around its peak. The skewness and the wide spread of data points in the right tail could point to unique phenomena, data entry errors, or structural changes, such as rule adjustments or changes in equipment, that may have impacted the data.

ggplot(data = raw, aes(x = TEAM_PITCHING_SO)) +
  geom_histogram() +
  ggtitle("Strikeouts by Pitchers Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite outside the scale range
## (`stat_bin()`).

Narrowing in on TEAM_PITCHING_SO:

raw %>%
  filter(TEAM_PITCHING_SO < 2500) %>%
  ggplot(aes(x = TEAM_PITCHING_SO)) +
  geom_histogram() +
  ggtitle("Strikeouts by Pitchers, Focused 0-2,500 Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data appears to be roughly normal; however, there are a considerable number of unrealistic values. Several entries surpass the all-time team strikeout record of 1,687, set by the Houston Astros in 2018, many fall below the all-time low of 308, set by the 1921 Cincinnati Reds, and a significant number of values are recorded as zero, which is impossible over a full season.
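A sketch of how these violations could be tallied against the cited records (output not shown):

#Count entries outside the historical bounds cited above (sketch)
raw %>%
  summarise(above_record = sum(TEAM_PITCHING_SO > 1687, na.rm = TRUE),
            below_record = sum(TEAM_PITCHING_SO < 308 & TEAM_PITCHING_SO > 0,
                               na.rm = TRUE),
            zeros = sum(TEAM_PITCHING_SO == 0, na.rm = TRUE))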

ggplot(data = raw, aes(x = TEAM_BASERUN_SB)) +
  geom_histogram() +
  ggtitle("Stolen Bases Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 131 rows containing non-finite outside the scale range
## (`stat_bin()`).

The distribution of TEAM_BASERUN_SB does not indicate any data entry errors, despite the presence of a few extreme outliers.

ggplot(data = raw, aes(x = TEAM_BATTING_3B)) +
  geom_histogram() +
  ggtitle("Tripples by Batters Distribution") +
  theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of TEAM_BATTING_3B does not suggest any data entry errors, though a few extreme outliers are present.

As observed in my analysis of TEAM_PITCHING_SO, I have identified outliers that fall outside the historical maximum and minimum values for these categories. These outliers appear across multiple variables, leading to suspicion that some of the data may be erroneous. Ideally, I would pause the analysis and revisit the data source to investigate why it deviates from historical baseball records. However, since this is not feasible, I will use multiple imputation to handle both extreme outliers and missing values.

This approach comes with some risk, as imputing values can reduce the model’s overall quality. However, in this case, I believe that including extreme, unrealistic values would be more detrimental to the models. To proceed cautiously, I will create and compare two versions of the dataset: one with extreme values replaced via imputation, and another where the extreme values are left unchanged.
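A sketch of the outlier treatment, using the historical records cited above as illustrative thresholds rather than the exact cutoffs applied in this analysis:

#Flag implausible values as NA so the imputation step can replace them
#(thresholds are illustrative, taken from the records cited earlier)
raw_OM <- raw %>%
  mutate(TEAM_PITCHING_SO = replace(TEAM_PITCHING_SO,
                                    TEAM_PITCHING_SO > 1687 |
                                      TEAM_PITCHING_SO < 308, NA))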

Another key condition for linear modeling is that independent variables should not be highly correlated with one another. While real-world data rarely meets this condition perfectly, I will aim to minimize strong pairwise correlations between variables.

The correlation plot below is organized such that variables with a theoretical positive effect on TARGET_WINS are grouped at the top and left, followed by those expected to have a negative effect. I expect to observe positive correlations among variables with positive effects, positive correlations among variables with negative effects, and negative correlations between variables with opposing effects.

#reorder based on variable list on assignment sheet. Drop TEAM_BATTING_HBP and TEAM_BASERUN_CS due to missingness.
tmp <- raw[,c(2:7,9,17,15,8,16,14,12,13)]
correlation <- cor(tmp, use = "complete.obs")
corrplot.mixed(correlation, tl.col = 'pink', tl.pos = 'lt', number.cex= 11/ncol(tmp))

This is not what I observe. No single variable shows a correlation greater than 0.35, in either direction, with the target. Additionally, some pairs of variables exhibit correlations that are so strong or misaligned that they raise concerns about data accuracy. For instance, TEAM_PITCHING_HR and TEAM_BATTING_HR have a correlation of 0.98, essentially conveying the same information. Similarly, TEAM_PITCHING_SO and TEAM_BATTING_SO, which share identical patterns of missing data, also display an unrealistically high correlation of 0.95.

Another requirement for linear models is that independent variables must be linearly related to the target. I can explore this by using scatterplots, examining one group of variables at a time.

Batting variables

raw %>%
  gather(starts_with("TEAM_BAT"), key = "var", value = "value") %>%
  ggplot(aes(x = value, y = TARGET_WINS)) +
  geom_point() +
  facet_wrap(~ var, scales = "free", ncol = 2) +
  ggthemes::theme_fivethirtyeight()
## Warning: Removed 2187 rows containing missing values or values outside the scale range
## (`geom_point()`).

Excluding TEAM_BATTING_HBP, the batting variables have potential for positive linear relationships. These will need to be investigated further.

Baserun and fielding variables

raw %>%
  gather(c(starts_with("TEAM_BASERUN"), starts_with("TEAM_FIELD")), key = "var", value = "value") %>%
  ggplot(aes(x = value, y = TARGET_WINS)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  ggthemes::theme_fivethirtyeight()
## Warning: Removed 1189 rows containing missing values or values outside the scale range
## (`geom_point()`).

The situation with these variables is similar to that of the batting variables. While there is potential for linearity, further investigation is needed to confirm it.

Pitching variables

raw %>%
  gather(starts_with("TEAM_PITCH"), key = "var", value = "value") %>%
  ggplot(aes(x = value, y = TARGET_WINS)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  ggthemes::theme_fivethirtyeight()
## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).

The scatterplots for the pitching variables do not reveal any clear linear relationships. Instead, they emphasize the presence of extreme outliers and significant skewness in three of the four pitching variables.

An unexpected finding during my exploratory data analysis is that the scatterplot for TEAM_BATTING_SO and TEAM_PITCHING_SO suggests the data can be divided into four distinct groups, with three of these groups displaying highly correlated data.

raw %>%
  mutate(SO_factor = case_when(
    TEAM_BATTING_SO >= TEAM_PITCHING_SO*.96 + 10 ~ 'high',
    TEAM_BATTING_SO < TEAM_PITCHING_SO*.96 + 10 &
      TEAM_BATTING_SO > TEAM_PITCHING_SO*.96 - 50 ~ 'med_high',
    TEAM_BATTING_SO < TEAM_PITCHING_SO*.96 - 50 &
      TEAM_BATTING_SO > TEAM_PITCHING_SO*.96 - 120 ~ 'med_low',
    TEAM_BATTING_SO < TEAM_PITCHING_SO*.96 - 120 ~ 'low')) %>%
  filter(TEAM_PITCHING_SO < 2000) %>%
  ggplot(aes(x = TEAM_PITCHING_SO, y = TEAM_BATTING_SO, colour = SO_factor)) +
  geom_point() +
  ggtitle("Strikeouts by Batters by Strikeouts by Pitchers") +
  theme_fivethirtyeight()

There is no theoretical reason to expect such a strong relationship between a team’s batting strikeouts and pitching strikeouts. The high correlations between these variables are unexpected and may indicate a potential issue with the data. However, since I have already chosen to proceed with the analysis, I will treat these patterns as inherent features of the data that the model will attempt to capture.

Data Preparation

To prepare the data for modeling, I first simplify variable names by removing the prefix TEAM_, and I add the grouping variable SO_FACTOR.

#Simplify column names
names(raw) <- gsub('TEAM_', '', x = names(raw))

#Add group variable, SO_FACTOR
raw <- raw %>% mutate(
  SO_FACTOR = case_when(
    BATTING_SO >= PITCHING_SO*.96+10 ~ 'high',
    (BATTING_SO<PITCHING_SO*.96+10 & BATTING_SO>PITCHING_SO*.96-50) ~ 'med_high',
    (BATTING_SO<PITCHING_SO*.96-50 & BATTING_SO>PITCHING_SO*.96-120) ~ 'med_low',
    BATTING_SO<PITCHING_SO*.96-120 ~ 'low'))

A solid understanding of baseball suggests that certain variables can be combined to create more meaningful metrics for evaluating team performance, such as OBP (on-base percentage), a rate statistic that measures how often a player reaches base via a hit or walk relative to their plate appearances. Creating new variables by combining existing ones helps reduce collinearity between some pairs of variables and produces models that better align with commonly used baseball performance metrics.

I created new variables to represent net stolen bases (BASERUN_NET_SB), offensive on-base percentage (OFFENSE_OBP), defensive on-base percentage (DEFENSE_OBP), and total at-bats (TOT_AT_BATS). Afterward, I removed the original variables used to calculate these new metrics.

raw <- raw %>%
  mutate("BASERUN_NET_SB" = BASERUN_SB - BASERUN_CS) %>%
  mutate("OFFENSE_OBP" = (BATTING_H + BATTING_BB)/(BATTING_H + BATTING_BB - BASERUN_CS + (162*27))) %>%
  mutate("DEFENSE_OBP" = (PITCHING_H + FIELDING_E + PITCHING_BB - FIELDING_DP)/(PITCHING_H + FIELDING_E + PITCHING_BB - FIELDING_DP + (162*27))) %>%
  mutate("TOT_AT_BATS" = BATTING_H + BATTING_BB - BASERUN_CS + (162*27))
raw <- raw[,c(1,4:6,8,13,15,18:22,2)]

After these initial transformations, I split the data into a training set (80% = 1820 rows) and a test set (20% = 456 rows).

#Sample 80% of row indices without replacement; the remainder forms the test set
train_rows <- sample(nrow(raw), 0.80 * nrow(raw), replace = FALSE)
train <- raw[train_rows,]
test <- raw[-train_rows,]

During the data exploration, I identified several variables with extreme outliers and outlined my approach to building two versions of the models: one trained on data with only missing values imputed and another trained on data where both missing values and outliers were imputed. I will evaluate all these models before making the final selection.

Simply replacing missing values with the column mean or median can introduce bias by disregarding the influence of other variables and the natural variability within the data. To reduce this bias, I applied a combined regression and bootstrapping method using the MICE package to handle missing data. Diagnostic checks (not shown here) confirmed that this approach produced better results, with distributions more accurately reflecting the actual data compared to other techniques, such as predictive mean matching or random forest.
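A minimal sketch of that imputation call, assuming mice's "norm.boot" method (linear regression with bootstrap); the m and seed values are illustrative, and SO_FACTOR would be re-derived from the imputed strikeout columns afterward:

#Sketch of the MICE step (norm.boot = regression + bootstrap; values illustrative)
imp <- mice(dplyr::select(train, where(is.numeric)),
            method = "norm.boot", m = 5, seed = 123, printFlag = FALSE)
train_imp_M <- mice::complete(imp)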

A look at the summaries of my training sets, both with and without outliers imputed, shows the raw materials for my models.

With missing values imputed

train_imp_M %>%
  skim_without_charts() %>%
  yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
INDEX 0 1 1266 739 1 621 1280 1903 2534
BATTING_2B 0 1 242 47 69 209 238 274 458
BATTING_3B 0 1 55 28 0 34 47 72 223
BATTING_HR 0 1 101 61 0 44 104 148 264
BATTING_SO 0 1 731 253 0 543 748 928 1399
PITCHING_HR 0 1 107 61 0 53 108 152 343
PITCHING_SO 0 1 810 592 -248 603 810 966 19278
BASERUN_NET_SB 0 1 38 40 -140 13 35 60 628
OFFENSE_OBP 0 1 0 0 0 0 0 0 0
DEFENSE_OBP 0 1 0 0 0 0 0 0 4
TOT_AT_BATS 0 1 6325 226 1270 6217 6317 6424 7518
TARGET_WINS 0 1 81 16 0 71 82 92 146

With missing values and outliers imputed

train_imp_OM %>%
  skim_without_charts() %>%
  yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
INDEX 0 1 1266 739 1 621 1280 1903 2534
BATTING_2B 0 1 241 41 120 211 238 272 330
BATTING_3B 0 1 54 24 3 35 47 70 117
BATTING_HR 0 1 100 57 -35 46 104 147 232
BATTING_SO 0 1 739 233 -788 556 748 925 1246
PITCHING_HR 0 1 106 57 -2 56 109 150 243
PITCHING_SO 0 1 798 220 289 620 810 957 1453
BASERUN_NET_SB 0 1 38 30 -54 16 36 57 129
OFFENSE_OBP 0 1 0 0 0 0 0 0 0
DEFENSE_OBP 0 1 0 0 0 0 0 0 1
TOT_AT_BATS 0 1 6314 131 5898 6219 6309 6401 6723
TARGET_WINS 0 1 81 16 0 71 82 92 146

Build Models

In this section, I begin by building models using training sets where only missing values were imputed and evaluate their performance on a test set with the same treatment for missing values. Then, I develop and test comparable models where both missing values and outliers were imputed.

#A function to compute RMSE for test sets; the denominator adjusts for the
#model's degrees of freedom, mirroring the residual standard error
rmse <- function(lm, test) {
  preds <- predict(lm, test[,c(2:12)])
  errors <- test$TARGET_WINS - preds
  return(sqrt(sum(errors^2)/(nrow(test) - length(lm$coefficients) - 1)))
}

#A function for generating residuals plots
res_plot <- function(lm){
  plot(fitted(lm),
       residuals(lm),
       xlab = "Fitted",
       ylab = "Residuals")
}

Model 1. Almost all variables

This model includes all variables except for BATTING_HR, BATTING_SO, and TOT_AT_BATS, as each of these variables is highly correlated with at least one other variable.

With only missing values imputed

#Missing imputed
lm1_M <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP + DEFENSE_OBP, data = train_imp_M)
stargazer(lm1_M,type='html',out='lm1_M.html',title='Model 1 - with only missing values imputed',
          covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
                               'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP',
                               'Opponents OBP'),
          dep.var.caption = 'Target Wins',dep.var.labels.include=F,
          single.row=T, no.space = T, header=F,
          digits = 3, model.numbers = F)
res_plot(lm1_M)

qqnorm(residuals(lm1_M), ylab = "Residuals")
qqline(residuals(lm1_M))

With missing values and outliers imputed

#Missing and outliers imputed
lm1_OM <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP + DEFENSE_OBP, data = train_imp_OM)
stargazer(lm1_OM,type='html',out='lm1_OM.html',
          title='Model 1 - missing values and outliers imputed',
          covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
                               'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP',
                               'Opponents OBP'),
          dep.var.caption = 'Target Wins',dep.var.labels.include=F,
          single.row=T, no.space = T, header=F,
          digits = 3, model.numbers = F)
# summary(lm1_OM)
res_plot(lm1_OM)

qqnorm(residuals(lm1_OM), ylab = "Residuals")
qqline(residuals(lm1_OM))

Evaluation and discussion

The root mean square error for each model is reported below.

lm1_M_test_rmse <- rmse(lm1_M,test_imp_M)
lm1_OM_test_rmse <- rmse(lm1_OM,test_imp_OM)

Test set RMSE for model 1 with only missing data imputed: 15.57
Test set RMSE for model 1 with both missing data and outliers imputed: 14.59

An initial review of the residuals for the model with only missing values imputed shows signs of heteroskedasticity, indicating a violation of one of the model assumptions. Specifically, the residual variance decreases at extreme values. Additionally, the QQ plot reveals that the residuals deviate from a normal distribution at the extremes.
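A formal check such as the Breusch-Pagan test, from the lmtest package and not part of the original diagnostics, could quantify this visual impression:

#Sketch: formal test for heteroskedasticity in Model 1's residuals
lmtest::bptest(lm1_M)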

In contrast, the model trained with outlier imputation shows improvements, suggesting that the extreme outliers removed were negatively affecting the model’s performance.

The RMSE values for both models on the test set are comparable to those on the training set, indicating that the models are not overfitting.

Model 2: Piecewise Analysis by SO_FACTOR

The scatterplot of BATTING_SO versus PITCHING_SO revealed four distinct groups within the data. For this model, I developed a separate model for each group. As this involves a piecewise analysis across two different training sets, a total of eight output sets were produced. Rather than presenting all the outputs, I provide a summary of the key findings.

#Build the train/test subsets for each SO_FACTOR cohort, dropping the
#factor column (it is constant within each subset)
sets <- list(train_M = train_imp_M, train_OM = train_imp_OM,
             test_M = test_imp_M, test_OM = test_imp_OM)
for (nm in names(sets)) {
  parts <- strsplit(nm, "_", fixed = TRUE)[[1]]   #e.g. "train", "M"
  for (lvl in c("low", "med_low", "med_high", "high")) {
    assign(paste0(parts[1], lvl, "_", parts[2]),
           sets[[nm]] %>%
             filter(SO_FACTOR == lvl) %>%
             dplyr::select(-SO_FACTOR))
  }
}
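The per-cohort fits themselves are not reproduced here. A sketch for one cohort, assuming the Model 1 formula without SO_FACTOR (which is constant within a cohort):

#Sketch of one cohort's fit; the exact specification is assumed
lm_low_M <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR +
                 PITCHING_SO + BASERUN_NET_SB + OFFENSE_OBP + DEFENSE_OBP,
               data = trainlow_M)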

Evaluation and discussion

The root mean square error for each model is reported below.

#RMSE on training set for piecewise model with missing values imputed:
trn_rmse_low_M <- rmse(lm_low_M,trainlow_M)
trn_rmse_med_low_M <- rmse(lm_med_low_M,trainmed_low_M)
trn_rmse_med_high_M <- rmse(lm_med_high_M,trainmed_high_M)
trn_rmse_high_M <- rmse(lm_high_M,trainhigh_M)

trn_rmse_total_M <- (trn_rmse_low_M * nrow(trainlow_M) + trn_rmse_med_low_M * nrow(trainmed_low_M) + trn_rmse_med_high_M * nrow(trainmed_high_M) + trn_rmse_high_M * nrow(trainhigh_M)) / nrow(train_imp_M)

#RMSE on test set for piecewise model with missing values imputed:
test_rmse_low_M <- rmse(lm_low_M,testlow_M)
test_rmse_med_low_M <- rmse(lm_med_low_M,testmed_low_M)
test_rmse_med_high_M <- rmse(lm_med_high_M,testmed_high_M)
test_rmse_high_M <- rmse(lm_high_M,testhigh_M)

tst_rmse_total_M <- (test_rmse_low_M * nrow(testlow_M) + test_rmse_med_low_M * nrow(testmed_low_M) + test_rmse_med_high_M * nrow(testmed_high_M) + test_rmse_high_M * nrow(testhigh_M)) / nrow(test_imp_M)

#RMSE on training set for piecewise model with missing values and outliers imputed:
trn_rmse_low_OM <- rmse(lm_low_OM,trainlow_OM)
trn_rmse_med_low_OM <- rmse(lm_med_low_OM,trainmed_low_OM)
trn_rmse_med_high_OM <- rmse(lm_med_high_OM,trainmed_high_OM)
trn_rmse_high_OM <- rmse(lm_high_OM,trainhigh_OM)

trn_rmse_total_OM <- (trn_rmse_low_OM * nrow(trainlow_OM) + trn_rmse_med_low_OM * nrow(trainmed_low_OM) + trn_rmse_med_high_OM * nrow(trainmed_high_OM) + trn_rmse_high_OM * nrow(trainhigh_OM)) / nrow(train_imp_OM)

#RMSE on test set for piecewise model with missing values and outliers imputed:
rmse_low_OM <- rmse(lm_low_OM,testlow_OM)
rmse_med_low_OM <- rmse(lm_med_low_OM,testmed_low_OM)
rmse_med_high_OM <- rmse(lm_med_high_OM,testmed_high_OM)
rmse_high_OM <- rmse(lm_high_OM,testhigh_OM)

tst_rmse_total_OM <- (rmse_low_OM * nrow(testlow_OM) + rmse_med_low_OM * nrow(testmed_low_OM) + rmse_med_high_OM * nrow(testmed_high_OM) + rmse_high_OM * nrow(testhigh_OM)) / nrow(test_imp_OM)

RMSE for piecewise models with missing values imputed:
Training set RMSE: 12.3
Test set RMSE: 15.43

RMSE for piecewise models with missing values and outliers imputed:
Training set RMSE: 12.67
Test set RMSE: 14.82

This piecewise modeling approach resulted in RMSE values comparable to those of the previous model. The residual and QQ plots for the models where only missing values were imputed show the same heteroskedasticity and deviations from normality observed in Model 1. However, these issues were somewhat mitigated in the models where outliers were also imputed.

The second model was driven by the relationship between BATTING_SO and PITCHING_SO identified in earlier scatterplots. However, this approach presents potential challenges: the groups vary significantly in size, making visual comparisons less reliable, and dividing the data based on the relationship between these two predictors might lack sufficient context. While this could have led to overfitting, the similar RMSE values for both the training and test sets indicate that this issue was likely avoided.

Model 3: Dropping DEFENSE_OBP

In this model, I remove DEFENSE_OBP from Model 1. Although this variable is a statistically significant predictor of TARGET_WINS, its large coefficient and standard error, relative to the other variables, may be due to its high correlation with OFFENSE_OBP.

With only missing values imputed:

lm3_M <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP, data = train_imp_M)
summary(lm3_M)
stargazer(lm3_M,type='html',out='lm3_M.html',title='Model 3 - only missing values imputed',
          covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
                               'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP'),
          dep.var.caption = 'Target Wins',dep.var.labels.include=F,
          single.row=T, no.space = T, header=F,
          digits = 3, model.numbers = F)
res_plot(lm3_M)

qqnorm(residuals(lm3_M), ylab = "Residuals")
qqline(residuals(lm3_M))

With missing values and outliers imputed

lm3_OM <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP, data = train_imp_OM)
summary(lm3_OM)
stargazer(lm3_OM,type='html',out='lm3_OM.html',
          title='Model 3 - missing values and outliers imputed',
          covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
                               'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP'),
          dep.var.caption = 'Target Wins',dep.var.labels.include=F,
          single.row=T, no.space = T, header=F,
          digits = 3, model.numbers = F)
res_plot(lm3_OM)

qqnorm(residuals(lm3_OM), ylab = "Residuals")
qqline(residuals(lm3_OM))

Evaluation and discussion

The root mean square error for each model is reported below.

lm3_M_test_rmse <- rmse(lm3_M,test_imp_M)
lm3_OM_test_rmse <- rmse(lm3_OM,test_imp_OM)

Test set RMSE for model 3 with only missing data imputed: 15.41
Test set RMSE for model 3 with both missing data and outliers imputed: 14.59

In this model, the residuals exhibit similar behavior to those in the first model. However, I observed improvements once outliers were imputed. The residuals became more homoskedastic and normally distributed, and the statistical significance of the coefficients increased as well.

Conclusion and Model Selection

Before selecting the best-performing model, let’s review the analysis:

Data Exploration: I identified a strong likelihood of flawed data, including unrelated variables with near-perfect correlations and outliers representing unrealistic values. These issues underscore the need for closer examination, particularly with more subtle data characteristics like bimodal distributions. In practice, such concerns would need to be addressed before progressing further.

Data Preparation: Despite the concerns, I proceeded cautiously by creating factors to represent distinct cohorts within the data and generating new variables informed by my understanding of baseball.

Modeling: In the first model, I included all the remaining features of the dataset. The second model focused on training separate models for each SO_FACTOR level and combining them into a piecewise linear structure. The third model enhanced the first by eliminating one of the highly correlated variables from the dataset.

Below are the RMSE values for each model:

Models with only missing values imputed:
- Model 1 RMSE: 15.57
- Model 2 RMSE: 15.43
- Model 3 RMSE: 15.41

Models with missing values and outliers imputed:
- Model 1 RMSE: 14.59
- Model 2 RMSE: 14.82
- Model 3 RMSE: 14.59

The most effective model is Model 3, where both missing values and outliers were imputed. Unlike Model 2, this model leverages the entire training set in a unified approach, with a highly significant intercept and predictors, something not observed for all variables in Model 1. I believe the new factor successfully mitigated some of the dataset's flaws, while the engineered features contributed to better overall model efficiency and performance. The outlier imputation strategy further reduced RMSE and improved the quality of the residual plots.
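As a closing usage note, win estimates for new team seasons would come from a standard predict() call on the selected model:

#Illustrative use of the selected model to estimate wins
predicted_wins <- predict(lm3_OM, newdata = test_imp_OM)
head(round(predicted_wins))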