How can I estimate the number of wins a baseball team will achieve in a season based on its performance data? In this report, I develop a linear model to address this question. The Moneyball dataset may provide insights into the key performance factors that contribute most to a team’s success in securing wins. Variables strongly correlated with a higher number of wins will be explored further.
In this report, I will explore the data, prepare it for modeling, build and evaluate a set of candidate linear models, and select the best performer.
As part of my initial data exploration, I will focus on the following key diagnostics: missing values, implausible or extreme values, and the shape of each variable's distribution.
Exploring the dataset is crucial to uncovering unexpected patterns or anomalies. As I proceed, I will also consider the key assumptions for linear modeling: predictors that are linearly related to the target, low correlation among the predictors, and residuals that are homoskedastic and normally distributed.
While real-world data rarely meets these conditions perfectly, there are potential data transformations I can apply to mitigate violations; one such transformation is sketched below.
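For example, a log transformation can pull in a heavy right tail. A minimal sketch, assuming the data has been read into the raw data frame used throughout this report (the variable choice here is illustrative):
#log1p() compresses a long right tail while still handling zeros,
#which plain log() cannot
raw_logged <- raw %>%
  mutate(LOG_PITCHING_H = log1p(TEAM_PITCHING_H))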
The dataset contains 2,276 rows, with each representing the full-season performance of a baseball team from 1871 to 2006.
Let’s begin by examining a numerical summary of each variable:
skim_without_charts(raw)
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
INDEX | 0 | 1 | 1268 | 736 | 1 | 631 | 1270 | 1916 | 2535 |
TARGET_WINS | 0 | 1 | 81 | 16 | 0 | 71 | 82 | 92 | 146 |
TEAM_BATTING_H | 0 | 1 | 1469 | 145 | 891 | 1383 | 1454 | 1537 | 2554 |
TEAM_BATTING_2B | 0 | 1 | 241 | 47 | 69 | 208 | 238 | 273 | 458 |
TEAM_BATTING_3B | 0 | 1 | 55 | 28 | 0 | 34 | 47 | 72 | 223 |
TEAM_BATTING_HR | 0 | 1 | 100 | 61 | 0 | 42 | 102 | 147 | 264 |
TEAM_BATTING_BB | 0 | 1 | 502 | 123 | 0 | 451 | 512 | 580 | 878 |
TEAM_BATTING_SO | 102 | 0.96 | 736 | 249 | 0 | 548 | 750 | 930 | 1399 |
TEAM_BASERUN_SB | 131 | 0.94 | 125 | 88 | 0 | 66 | 101 | 156 | 697 |
TEAM_BASERUN_CS | 772 | 0.66 | 53 | 23 | 0 | 38 | 49 | 62 | 201 |
TEAM_BATTING_HBP | 2085 | 0.08 | 59 | 13 | 29 | 50 | 58 | 67 | 95 |
TEAM_PITCHING_H | 0 | 1 | 1779 | 1407 | 1137 | 1419 | 1518 | 1682 | 30132 |
TEAM_PITCHING_HR | 0 | 1 | 106 | 61 | 0 | 50 | 107 | 150 | 343 |
TEAM_PITCHING_BB | 0 | 1 | 553 | 166 | 0 | 476 | 536 | 611 | 3645 |
TEAM_PITCHING_SO | 102 | 0.96 | 818 | 553 | 0 | 615 | 814 | 968 | 19278 |
TEAM_FIELDING_E | 0 | 1 | 246 | 228 | 65 | 127 | 159 | 249 | 1898 |
TEAM_FIELDING_DP | 286 | 0.87 | 146 | 26 | 52 | 131 | 149 | 164 | 228 |
From the summary above, I notice that certain variables have a significant amount of missing data, particularly TEAM_BATTING_HBP and TEAM_BASERUN_CS. Some variables also display unusually high maximum values, such as TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_FIELDING_E, which may warrant further investigation.
Additionally, some variables contain values of zero that are unrealistic in the context of a full baseball season: TARGET_WINS, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_HR, TEAM_PITCHING_BB, and TEAM_PITCHING_SO.
Some of these entries are clearly erroneous. For instance, the all-time minimum batting strikeouts for a team over a complete season was 308, achieved by the 1921 Cincinnati Reds.
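To gauge how widespread these zeros are, a quick tally can help. This is a sketch of a check I would run at this stage, not part of the original output; it assumes dplyr 1.0+ for across():
#Count zero entries in each numeric column; zeros in full-season
#counting stats like strikeouts are implausible
raw %>%
  summarise(across(where(is.numeric), ~ sum(.x == 0, na.rm = TRUE)))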
My approach to handling missing data will depend on its distribution across columns and rows.
plot_missing(raw, title = "% of Missing Values by Variable", ggtheme = theme_fivethirtyeight())
Here, I observe once again that missing values are not randomly distributed throughout the dataset. Notably, TEAM_BATTING_HBP has nearly 92% missing values, making it unlikely to provide useful information for my model, so it will be removed. Similarly, TEAM_BASERUN_CS has a high percentage of missing data, with 34% of its values missing. Additionally, TEAM_PITCHING_SO and TEAM_BATTING_SO have the exact same number of missing entries, suggesting that the missingness in these variables is likely not random.
Further analysis reveals that the missing data patterns in both variables are identical. One possible approach is to code the missingness as a separate factor variable. Alternatively, I may consider removing one or both variables from the models if further analysis raises concerns about data quality.
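A minimal version of that check, using the mice package that is applied later for imputation:
#TRUE if the two strikeout variables are missing in exactly the same rows
identical(is.na(raw$TEAM_BATTING_SO), is.na(raw$TEAM_PITCHING_SO))
#A missingness-pattern table shows the same information for all variables at once
mice::md.pattern(raw, plot = FALSE)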
After reviewing the missing values and identifying areas of concern to address later in the report, I will now examine the characteristics of the predictor variables.
Assessing the distribution of each variable can help identify unusual features such as extreme skewness or low variance.
raw %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free", ncol = 3) +
geom_density() +
ggthemes::theme_fivethirtyeight()
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_density()`).
Density plots for each variable confirm the highly skewed distributions of TEAM_FIELDING_E, TEAM_PITCHING_BB, TEAM_PITCHING_H, and TEAM_PITCHING_SO, as indicated by the numeric summary. Additionally, TEAM_BASERUN_SB and TEAM_BATTING_3B exhibit moderate skewness. Other distributions appear to be either roughly normal or bimodal.
A closer examination of the highly skewed variables may provide insights into the reasons behind their non-normality.
ggplot(data = raw, aes(x = TEAM_FIELDING_E)) +
geom_histogram() +
ggtitle("Errors Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In TEAM_FIELDING_E, a small number of entries exceed 1,000. When these outliers are excluded, the distribution shows a moderate right skew.
ggplot(data = raw, aes(x = TEAM_PITCHING_BB)) +
geom_histogram() +
ggtitle("Walks AlloId Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of TEAM_PITCHING_BB appears mostly normal, aside from a few outliers with values exceeding 1,000.
ggplot(data = raw, aes(x = TEAM_PITCHING_H)) +
geom_histogram() +
ggtitle("Hits AlloId Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s look more closely at the region where most of the values in TEAM_PITCHING_H lie, [0, 6000]:
raw %>%
filter(TEAM_PITCHING_H < 6000) %>%
ggplot(aes(x = TEAM_PITCHING_H)) +
geom_histogram() +
ggtitle("Hit AlloId, Focused 0 - 6,000 Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
TEAM_PITCHING_H remains skewed even within the narrower region around its peak. The skewness and the wide spread of data points in the right tail could point to unique phenomena, data entry errors, or structural changes, such as rule adjustments or changes in equipment, that may have impacted the data.
ggplot(data = raw, aes(x = TEAM_PITCHING_SO)) +
geom_histogram() +
ggtitle("Strikeouts by Pitchers Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite outside the scale range
## (`stat_bin()`).
Narrowing in on TEAM_PITCHING_SO:
raw %>%
filter(TEAM_PITCHING_SO < 2500) %>%
ggplot(aes(x = TEAM_PITCHING_SO)) +
geom_histogram() +
ggtitle("Strikeouts by Pitchers, Focused 0-2,500 Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data appears roughly normal; however, there are a considerable number of unrealistic values. Several entries in this distribution surpass the all-time strikeout record of 1,687, set by the Houston Astros in 2018, many entries fall below the all-time low of 308, set by the 1921 Cincinnati Reds, and a significant number of values are recorded as zero, which is highly unrealistic.
ggplot(data = raw, aes(x = TEAM_BASERUN_SB)) +
geom_histogram() +
ggtitle("Stolen Bases Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 131 rows containing non-finite outside the scale range
## (`stat_bin()`).
The distribution of TEAM_BASERUN_SB does not indicate any data entry errors, despite the presence of a few extreme outliers.
ggplot(data = raw, aes(x = TEAM_BATTING_3B)) +
geom_histogram() +
ggtitle("Tripples by Batters Distribution") +
theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of TEAM_BATTING_3B does not suggest any data entry errors, though a few extreme outliers are present.
As observed in my analysis of TEAM_PITCHING_SO, I have identified outliers that exceed the historical maximum and minimum values for these categories. These outliers appear across multiple variables, leading to suspicion that some of the data may be erroneous. Ideally, I would pause the analysis and revisit the data source to investigate why it deviates from historical baseball records. However, since this is not feasible, I will use multiple imputation to handle both extreme outliers and missing values.
This approach comes with some risk, as imputing values can reduce the model’s overall quality. However, in this case, I believe that including extreme, unrealistic values would be more detrimental to the models. To proceed cautiously, I will create and compare two versions of the dataset: one with extreme values replaced via imputation, and another where the extreme values are left unchanged.
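The mechanics of the second version can be as simple as recoding implausible entries to NA so the imputation step treats them like missing data. A sketch, shown for one variable only, with thresholds taken from the historical records cited above (the exact cutoffs are an illustrative choice):
#Values outside plausible historical bounds become NA for re-imputation
raw_flagged <- raw %>%
  mutate(TEAM_PITCHING_SO = if_else(TEAM_PITCHING_SO > 1687 | TEAM_PITCHING_SO < 308,
                                    NA_real_, as.numeric(TEAM_PITCHING_SO)))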
Another key condition for linear modeling is that independent variables should not be highly correlated with one another. While real-world data rarely meets this condition perfectly, I will aim to minimize strong pairwise correlations between variables.
The correlation plot below is organized such that variables with a theoretical positive effect on TARGET_WINS are grouped at the top and left, followed by those expected to have a negative effect. I expect to observe positive correlations among variables with positive effects, positive correlations among variables with negative effects, and negative correlations between variables with opposing effects.
#reorder based on variable list on assignment sheet. Drop TEAM_BATTING_HBP and TEAM_BASERUN_CS due to missingness.
tmp <- raw[,c(2:7,9,17,15,8,16,14,12,13)]
correlation <- cor(tmp, use = "complete.obs")
corrplot.mixed(correlation, tl.col = 'pink', tl.pos = 'lt', number.cex= 11/ncol(tmp))
This is not what I observe. No single variable shows a correlation greater than 0.35, in either direction, with the target. Additionally, some pairs of variables exhibit correlations that are so strong or misaligned that they raise concerns about data accuracy. For instance, TEAM_PITCHING_HR and TEAM_BATTING_HR have a correlation of 0.98, essentially conveying the same information. Similarly, TEAM_PITCHING_SO and TEAM_BATTING_SO, which share identical patterns of missing data, also display an unrealistically high correlation of 0.95.
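One way to list the offending pairs programmatically, reusing the correlation matrix computed above (the 0.9 cutoff is an illustrative choice):
#Index predictor pairs in the upper triangle with |r| above 0.9;
#these are candidates for removal or for combining into new features
high_cor <- which(abs(correlation) > 0.9 & upper.tri(correlation), arr.ind = TRUE)
data.frame(var1 = rownames(correlation)[high_cor[, 1]],
           var2 = colnames(correlation)[high_cor[, 2]],
           r    = correlation[high_cor])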
Another requirement for linear models is that independent variables must be linearly related to the target. I can explore this by using scatterplots, examining one group of variables at a time.
Batting variables
raw %>%
gather(starts_with("TEAM_BAT"), key = "var", value = "value") %>%
ggplot(aes(x = value, y = TARGET_WINS)) +
geom_point() +
facet_wrap(~ var, scales = "free", ncol = 2) +
ggthemes::theme_fivethirtyeight()
## Warning: Removed 2187 rows containing missing values or values outside the scale range
## (`geom_point()`).
Excluding TEAM_BATTING_HBP, the batting variables have potential for positive linear relationships. These will need to be investigated further.
Baserun and fielding variables
raw %>%
gather(c(starts_with("TEAM_BASERUN"), starts_with("TEAM_FIELD")), key = "var", value = "value") %>%
ggplot(aes(x = value, y = TARGET_WINS)) +
geom_point() +
facet_wrap(~ var, scales = "free") +
ggthemes::theme_fivethirtyeight()
## Warning: Removed 1189 rows containing missing values or values outside the scale range
## (`geom_point()`).
The situation with these variables is similar to that of the batting variables. While there is potential for linearity, further investigation is needed to confirm it.
Pitching variables
raw %>%
gather(starts_with("TEAM_PITCH"), key = "var", value = "value") %>%
ggplot(aes(x = value, y = TARGET_WINS)) +
geom_point() +
facet_wrap(~ var, scales = "free") +
ggthemes::theme_fivethirtyeight()
## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatterplots for the pitching variables do not reveal any clear linear relationships. Instead, they emphasize the presence of extreme outliers and significant skewness in three of the four pitching variables.
An unexpected finding during my exploratory data analysis is that the scatterplot of TEAM_BATTING_SO against TEAM_PITCHING_SO suggests the data can be divided into four distinct groups, with three of these groups displaying highly correlated data.
raw %>%
  mutate(SO_factor = case_when(
    TEAM_BATTING_SO >= TEAM_PITCHING_SO*.96 + 10 ~ 'high',
    TEAM_BATTING_SO < TEAM_PITCHING_SO*.96 + 10 &
      TEAM_BATTING_SO > TEAM_PITCHING_SO*.96 - 50 ~ 'med_high',
    TEAM_BATTING_SO < TEAM_PITCHING_SO*.96 - 50 &
      TEAM_BATTING_SO > TEAM_PITCHING_SO*.96 - 120 ~ 'med_low',
    TEAM_BATTING_SO < TEAM_PITCHING_SO*.96 - 120 ~ 'low')) %>%
  filter(TEAM_PITCHING_SO < 2000) %>%
  ggplot(aes(x = TEAM_PITCHING_SO, y = TEAM_BATTING_SO, colour = SO_factor)) +
  geom_point() +
  ggtitle("Strikeouts by Batters by Strikeouts by Pitchers") +
  theme_fivethirtyeight()
There is no theoretical reason to expect such a strong relationship between a team’s batting strikeouts and pitching strikeouts. The high correlations between these variables are unexpected and may indicate a potential issue with the data. However, since I have already chosen to proceed with the analysis, I will treat these patterns as inherent features of the data that the model will attempt to capture.
To prepare the data for modeling, I first simplify variable names by removing the prefix TEAM_, and I add the grouping variable SO_FACTOR.
#Simplify column names
names(raw) <- gsub('TEAM_', '', x = names(raw))
#Add group variable, SO_FACTOR
raw <- raw %>% mutate(
SO_FACTOR = case_when(
BATTING_SO >= PITCHING_SO*.96+10 ~ 'high',
(BATTING_SO<PITCHING_SO*.96+10 & BATTING_SO>PITCHING_SO*.96-50) ~ 'med_high',
(BATTING_SO<PITCHING_SO*.96-50 & BATTING_SO>PITCHING_SO*.96-120) ~ 'med_low',
BATTING_SO<PITCHING_SO*.96-120 ~ 'low'))
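A quick check of the resulting group sizes (an assumed step, not shown in the original output):
#The cutoffs above produce groups of unequal size, which matters
#for the piecewise models built later
raw %>% count(SO_FACTOR)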
A solid understanding of baseball suggests that certain variables can be combined to create more meaningful metrics for evaluating team performance, such as on-base percentage (OBP), a rate statistic that measures how often a player reaches base via a hit or walk relative to their plate appearances. Creating new variables by combining existing ones helps reduce collinearity between some pairs of variables and produces models that better align with commonly used baseball performance metrics.
I created new variables to represent net stolen bases (BASERUN_NET_SB), offensive on-base percentage (OFFENSE_OBP), defensive on-base percentage (DEFENSE_OBP), and total at-bats (TOT_AT_BATS). Afterward, I removed the original variables used to calculate these new metrics.
raw <- raw %>%
  mutate("BASERUN_NET_SB" = BASERUN_SB - BASERUN_CS) %>%   #net stolen bases
  #The rates below use 162 games x 27 outs to approximate a season's outs
  mutate("OFFENSE_OBP" = (BATTING_H + BATTING_BB)/(BATTING_H + BATTING_BB - BASERUN_CS + (162*27))) %>%
  mutate("DEFENSE_OBP" = (PITCHING_H + FIELDING_E + PITCHING_BB - FIELDING_DP)/(PITCHING_H + FIELDING_E + PITCHING_BB - FIELDING_DP + (162*27))) %>%
  mutate("TOT_AT_BATS" = BATTING_H + BATTING_BB - BASERUN_CS + (162*27))
#Drop the originals used to build these metrics
raw <- raw[,c(1,4:6,8,13,15,18:22,2)]
After these initial transformations, I split the data into a training set (80% = 1820 rows) and a test set (20% = 456 rows).
train_rows <- sample(nrow(raw), 0.80 * nrow(raw), replace = FALSE)
train <- raw[train_rows,]
test <- raw[-train_rows,]
During the data exploration, I identified several variables with extreme outliers and outlined my approach to building two versions of the models: one trained on data with only missing values imputed and another trained on data where both missing values and outliers were imputed. I will evaluate all these models before making the final selection.
Simply replacing missing values with the column mean or median can introduce bias by disregarding the influence of other variables and the natural variability within the data. To reduce this bias, I applied a combined regression and bootstrapping method using the MICE package to handle missing data. Diagnostic checks (not shown here) confirmed that this approach produced better results, with distributions more accurately reflecting the actual data compared to other techniques, such as predictive mean matching or random forest.
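The imputation code itself is not shown here; the sketch below illustrates the approach described, assuming the norm.boot method in mice (linear regression with bootstrap). The number of imputations, iterations, and seed are illustrative, and the test set would be treated the same way:
library(mice)
#Multiple imputation via regression with bootstrap; categorical columns
#such as SO_FACTOR may need their own method entry
imp <- mice(train, m = 5, method = "norm.boot", maxit = 10, seed = 123, printFlag = FALSE)
train_imp_M <- complete(imp, 1)   #one completed dataset for modeling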
A look at the summaries of my training sets, both with and without outliers imputed, shows the raw materials for my models.
With missing values imputed
train_imp_M %>%
skim_without_charts() %>%
yank("numeric")
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
INDEX | 0 | 1 | 1266 | 739 | 1 | 621 | 1280 | 1903 | 2534 |
BATTING_2B | 0 | 1 | 242 | 47 | 69 | 209 | 238 | 274 | 458 |
BATTING_3B | 0 | 1 | 55 | 28 | 0 | 34 | 47 | 72 | 223 |
BATTING_HR | 0 | 1 | 101 | 61 | 0 | 44 | 104 | 148 | 264 |
BATTING_SO | 0 | 1 | 731 | 253 | 0 | 543 | 748 | 928 | 1399 |
PITCHING_HR | 0 | 1 | 107 | 61 | 0 | 53 | 108 | 152 | 343 |
PITCHING_SO | 0 | 1 | 810 | 592 | -248 | 603 | 810 | 966 | 19278 |
BASERUN_NET_SB | 0 | 1 | 38 | 40 | -140 | 13 | 35 | 60 | 628 |
OFFENSE_OBP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DEFENSE_OBP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
TOT_AT_BATS | 0 | 1 | 6325 | 226 | 1270 | 6217 | 6317 | 6424 | 7518 |
TARGET_WINS | 0 | 1 | 81 | 16 | 0 | 71 | 82 | 92 | 146 |
With missing values and outliers imputed
train_imp_OM %>%
skim_without_charts() %>%
yank("numeric")
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
INDEX | 0 | 1 | 1266 | 739 | 1 | 621 | 1280 | 1903 | 2534 |
BATTING_2B | 0 | 1 | 241 | 41 | 120 | 211 | 238 | 272 | 330 |
BATTING_3B | 0 | 1 | 54 | 24 | 3 | 35 | 47 | 70 | 117 |
BATTING_HR | 0 | 1 | 100 | 57 | -35 | 46 | 104 | 147 | 232 |
BATTING_SO | 0 | 1 | 739 | 233 | -788 | 556 | 748 | 925 | 1246 |
PITCHING_HR | 0 | 1 | 106 | 57 | -2 | 56 | 109 | 150 | 243 |
PITCHING_SO | 0 | 1 | 798 | 220 | 289 | 620 | 810 | 957 | 1453 |
BASERUN_NET_SB | 0 | 1 | 38 | 30 | -54 | 16 | 36 | 57 | 129 |
OFFENSE_OBP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DEFENSE_OBP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
TOT_AT_BATS | 0 | 1 | 6314 | 131 | 5898 | 6219 | 6309 | 6401 | 6723 |
TARGET_WINS | 0 | 1 | 81 | 16 | 0 | 71 | 82 | 92 | 146 |
In this section, I begin by building models using training sets where only missing values were imputed and evaluate their performance on a test set with the same treatment for missing values. Then, I develop and test comparable models where both missing values and outliers were imputed.
#A function to compute RMSE for test sets
rmse <- function(lm, test) {
  preds <- predict(lm, test[,c(2:12)])   #columns 2:12 hold the predictors
  errors <- test$TARGET_WINS - preds
  #Denominator adjusts for the number of estimated coefficients
  return(sqrt(sum(errors^2)/(nrow(test) - length(lm$coefficients) - 1)))
}
#A function for generating residuals plots
res_plot <- function(lm){
plot(fitted(lm),
residuals(lm),
xlab = "Fitted",
ylab = "Residuals")
}
This model includes all variables except BATTING_HR, BATTING_SO, and TOT_AT_BATS, as each of these variables is highly correlated with at least one other variable.
With only missing values imputed
#Missing imputed
lm1_M <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP + DEFENSE_OBP, data = train_imp_M)
stargazer(lm1_M,type='html',out='lm1_M.html',title='Model 1 - with only missing values imputed',
covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP',
'Opponents OBP'),
dep.var.caption = 'Target Wins',dep.var.labels.include=F,
single.row=T, no.space = T, header=F,
digits = 3, model.numbers = F)
res_plot(lm1_M)
qqnorm(residuals(lm1_M), ylab = "Residuals")
qqline(residuals(lm1_M))
With missing values and outliers imputed
#Missing and outliers imputed
lm1_OM <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP + DEFENSE_OBP, data = train_imp_OM)
stargazer(lm1_OM,type='html',out='lm1_OM.html',
title='Model 1 - missing values and outliers imputed',
covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP',
'Opponents OBP'),
dep.var.caption = 'Target Wins',dep.var.labels.include=F,
single.row=T, no.space = T, header=F,
digits = 3, model.numbers = F)
# summary(lm1_OM)
res_plot(lm1_OM)
qqnorm(residuals(lm1_OM), ylab = "Residuals")
qqline(residuals(lm1_OM))
Evaluation and discussion
The root mean square error for each model is reported below.
lm1_M_test_rmse <- rmse(lm1_M,test_imp_M)
lm1_OM_test_rmse <- rmse(lm1_OM,test_imp_OM)
Test set RMSE for model 1 with only missing data imputed: 15.57
Test set RMSE for model 1 with both missing data and outliers imputed: 14.59
An initial review of the residuals for the model with only missing values imputed shows signs of heteroskedasticity, indicating a violation of one of the model assumptions. Specifically, the residual variance decreases at extreme values. Additionally, the QQ plot reveals that the residuals deviate from a normal distribution at the extremes.
In contrast, the model trained with outlier imputation shows improvements, suggesting that the extreme outliers removed were negatively affecting the model’s performance.
The RMSE values for both models on the test set are comparable to those on the training set, indicating that the models are not overfitting.
SO_FACTOR
The scatterplot of BATTING_SO versus PITCHING_SO revealed four distinct groups within the data. For this model, I developed a separate model for each group. As this involves a piecewise analysis across two different training sets, a total of eight output sets were produced. Rather than presenting all the outputs, I provide a summary of the key findings; one group-level fit is sketched after the data-splitting code below.
trainlow_M <- train_imp_M %>%
filter(SO_FACTOR == "low") %>%
dplyr::select(-SO_FACTOR)
trainmed_low_M <- train_imp_M %>%
filter(SO_FACTOR == "med_low") %>%
dplyr::select(-SO_FACTOR)
trainmed_high_M <- train_imp_M %>%
filter(SO_FACTOR == "med_high") %>%
dplyr::select(-SO_FACTOR)
trainhigh_M <- train_imp_M %>%
filter(SO_FACTOR == "high") %>%
dplyr::select(-SO_FACTOR)
trainlow_OM <- train_imp_OM %>%
filter(SO_FACTOR == "low") %>%
dplyr::select(-SO_FACTOR)
trainmed_low_OM <- train_imp_OM %>%
filter(SO_FACTOR == "med_low") %>%
dplyr::select(-SO_FACTOR)
trainmed_high_OM <- train_imp_OM %>%
filter(SO_FACTOR == "med_high") %>%
dplyr::select(-SO_FACTOR)
trainhigh_OM <- train_imp_OM %>%
filter(SO_FACTOR == "high") %>%
dplyr::select(-SO_FACTOR)
testlow_M <- test_imp_M %>%
filter(SO_FACTOR == "low") %>%
dplyr::select(-SO_FACTOR)
testmed_low_M <- test_imp_M %>%
filter(SO_FACTOR == "med_low") %>%
dplyr::select(-SO_FACTOR)
testmed_high_M <- test_imp_M %>%
filter(SO_FACTOR == "med_high") %>%
dplyr::select(-SO_FACTOR)
testhigh_M <- test_imp_M %>%
filter(SO_FACTOR == "high") %>%
dplyr::select(-SO_FACTOR)
testlow_OM <- test_imp_OM %>%
filter(SO_FACTOR == "low") %>%
dplyr::select(-SO_FACTOR)
testmed_low_OM <- test_imp_OM %>%
filter(SO_FACTOR == "med_low") %>%
dplyr::select(-SO_FACTOR)
testmed_high_OM <- test_imp_OM %>%
filter(SO_FACTOR == "med_high") %>%
dplyr::select(-SO_FACTOR)
testhigh_OM <- test_imp_OM %>%
filter(SO_FACTOR == "high") %>%
dplyr::select(-SO_FACTOR)
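The eight group-level fits themselves are not shown; below is a sketch of one, assuming each group reuses the Model 1 formula with SO_FACTOR dropped (the other seven follow the same pattern):
#Piecewise fit for the 'low' group; SO_FACTOR is constant within a group,
#so it is excluded from the formula
lm_low_M <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO +
                 BASERUN_NET_SB + OFFENSE_OBP + DEFENSE_OBP, data = trainlow_M)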
Evaluation and discussion
The root mean square error for each model is reported below.
#RMSE on training set for piecewise model with missing values imputed:
trn_rmse_low_M <- rmse(lm_low_M,trainlow_M)
trn_rmse_med_low_M <- rmse(lm_med_low_M,trainmed_low_M)
trn_rmse_med_high_M <- rmse(lm_med_high_M,trainmed_high_M)
trn_rmse_high_M <- rmse(lm_high_M,trainhigh_M)
trn_rmse_total_M <- (trn_rmse_low_M * nrow(trainlow_M) + trn_rmse_med_low_M * nrow(trainmed_low_M) + trn_rmse_med_high_M * nrow(trainmed_high_M) + trn_rmse_high_M * nrow(trainhigh_M)) / nrow(train_imp_M)
#RMSE on test set for piecewise model with missing values imputed:
test_rmse_low_M <- rmse(lm_low_M,testlow_M)
test_rmse_med_low_M <- rmse(lm_med_low_M,testmed_low_M)
test_rmse_med_high_M <- rmse(lm_med_high_M,testmed_high_M)
test_rmse_high_M <- rmse(lm_high_M,testhigh_M)
tst_rmse_total_M <- (test_rmse_low_M * nrow(testlow_M) + test_rmse_med_low_M * nrow(testmed_low_M) + test_rmse_med_high_M * nrow(testmed_high_M) + test_rmse_high_M * nrow(testhigh_M)) / nrow(test_imp_M)
#RMSE on training set for piecewise model with missing values and outliers imputed:
trn_rmse_low_OM <- rmse(lm_low_OM,trainlow_OM)
trn_rmse_med_low_OM <- rmse(lm_med_low_OM,trainmed_low_OM)
trn_rmse_med_high_OM <- rmse(lm_med_high_OM,trainmed_high_OM)
trn_rmse_high_OM <- rmse(lm_high_OM,trainhigh_OM)
trn_rmse_total_OM <- (trn_rmse_low_OM * nrow(trainlow_OM) + trn_rmse_med_low_OM * nrow(trainmed_low_OM) + trn_rmse_med_high_OM * nrow(trainmed_high_OM) + trn_rmse_high_OM * nrow(trainhigh_OM)) / nrow(train_imp_OM)
#RMSE on test set for piecewise model with missing values and outliers imputed:
rmse_low_OM <- rmse(lm_low_OM,testlow_OM)
rmse_med_low_OM <- rmse(lm_med_low_OM,testmed_low_OM)
rmse_med_high_OM <- rmse(lm_med_high_OM,testmed_high_OM)
rmse_high_OM <- rmse(lm_high_OM,testhigh_OM)
tst_rmse_total_OM <- (rmse_low_OM * nrow(testlow_OM) + rmse_med_low_OM * nrow(testmed_low_OM) + rmse_med_high_OM * nrow(testmed_high_OM) + rmse_high_OM * nrow(testhigh_OM)) / nrow(test_imp_OM)
RMSE for piecewise models with missing values imputed
Training set RMSE: 12.3
Test set RMSE: 15.43
RMSE for piecewise models with missing values and outliers imputed
Training set RMSE: 12.67
Test set RMSE: 14.82
This piecewise modeling approach resulted in RMSE values comparable to those of the previous model. The residual and QQ plots for the models where only missing values were imputed show the same heteroskedasticity and deviations from normality observed in Model 1. However, these issues were somewhat mitigated in the models where outliers were also imputed.
The second model was driven by the relationship between BATTING_SO and PITCHING_SO identified in earlier scatterplots. However, this approach presents potential challenges: the groups vary significantly in size, making visual comparisons less reliable, and dividing the data based on the relationship between these two predictors might lack sufficient context. While this could have led to overfitting, the similar RMSE values for the training and test sets indicate that this issue was likely avoided.
DEFENSE_OBP
In this model, I remove DEFENSE_OBP from Model 1. Although this variable is a statistically significant predictor of TARGET_WINS, its large coefficient and standard error, relative to the other variables, may be due to its high correlation with OFFENSE_OBP.
With only missing values imputed:
lm3_M <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP, data = train_imp_M)
summary(lm3_M)
stargazer(lm3_M,type='html',out='lm3_M.html',title='Model 3 - only missing values imputed',
covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP'),
dep.var.caption = 'Target Wins',dep.var.labels.include=F,
single.row=T, no.space = T, header=F,
digits = 3, model.numbers = F)
res_plot(lm3_M)
qqnorm(residuals(lm3_M), ylab = "Residuals")
qqline(residuals(lm3_M))
With missing values and outliers imputed
lm3_OM <- lm(TARGET_WINS ~ BATTING_2B + BATTING_3B + PITCHING_HR + PITCHING_SO + SO_FACTOR + BASERUN_NET_SB + OFFENSE_OBP, data = train_imp_OM)
summary(lm3_OM)
stargazer(lm3_OM,type='html',out='lm3_OM.html',
title='Model 3 - missing values and outliers imputed',
covariate.labels = c('Doubles','Triples','Pitching HRs','Pitching SO','Low SO',
'Med/High SO','Med/Low SO','Baserunners Net SB','Offense OBP'),
dep.var.caption = 'Target Wins',dep.var.labels.include=F,
single.row=T, no.space = T, header=F,
digits = 3, model.numbers = F)
res_plot(lm3_OM)
qqnorm(residuals(lm3_OM), ylab = "Residuals")
qqline(residuals(lm3_OM))
Evaluation and discussion
The root mean square error for each model is reported below.
lm3_M_test_rmse <- rmse(lm3_M,test_imp_M)
lm3_OM_test_rmse <- rmse(lm3_OM,test_imp_OM)
Test set RMSE for model 3 with only missing data imputed: 15.41
Test set RMSE for model 3 with both missing data and outliers imputed: 14.59
In this model, the residuals exhibit similar behavior to those in the first model. However, I observed improvements once outliers were imputed. The residuals became more homoskedastic and normally distributed, and the statistical significance of the coefficients increased as well.
Before selecting the best-performing model, let’s review the analysis:
Data Exploration: I identified a strong likelihood of flawed data, including unrelated variables with near-perfect correlations and outliers representing unrealistic values. These issues underscore the need for closer examination, particularly with more subtle data characteristics like bimodal distributions. In practice, such concerns would need to be addressed before progressing further.
Data Preparation: Despite the concerns, I proceeded cautiously by creating factors to represent distinct cohorts within the data and generating new variables informed by my understanding of baseball.
Modeling: In the first model, I included all the remaining features of the dataset. The second model focused on training separate models for each SO_FACTOR level and combining them into a piecewise linear structure. The third model enhanced the first by eliminating one of the highly correlated variables from the dataset.
Below are the RMSE values for each model:
Models with only missing values imputed:
- Model 1 RMSE: 15.57
- Model 2 RMSE: 15.43
- Model 3 RMSE: 15.41
Models with missing values and outliers imputed:
- Model 1 RMSE: 14.59
- Model 2 RMSE: 14.82
- Model 3 RMSE: 14.59
The most effective model is Model 3, where both missing values and outliers were imputed. Unlike Model 2, this model leverages the entire training set in a unified approach, with highly significant predictors and an intercept, something not observed for all variables in Model 1. I believe the inclusion of new factors successfully mitigated some of the dataset's flaws, while the engineered features contributed to better overall model efficiency and performance. The outlier imputation strategy further reduced RMSE and improved the quality of the residual plots.