Code
library(tidyverse)
library(corrplot)
library(GGally)
library(car)
library(broom)
library(performance)
library(lmtest)
library(sandwich)
library(tidymodels)
library(glmnet)
library(gt)OPTION 1
Instructions for HW #1 are included in the attached PDF file. This assignment is due by 03/01/2026, 11.59 PM EST. Please submit your report as a PDF file.
OPTION 2
Define a concept that you have learned from this week’s readings and videos. Compare (i,e., Identify similarities) and Contrast (i.e., Identify differences) this concept with another that you have learned throughout the course. Use R to provide a real-world example of how you would approach solving a problem using the concept that you have learned. Please submit your report as a PDF file.
The goal of this assignment is to use the Moneyball dataset to understand the effect of baseball team performance metrics on the total number of games won in a season. To do this, we’ll use historical team data standardized for a 162-game season and try to model the relationship between the response variable TARGET_WINS.
The data analysis is performed following a data mining framework. The data is first examined through exploratory data analysis. In this phase, the distribution and scale of the data are examined. Special emphasis is given to the presence of missing values and the correlation between the features. Then, data preparation techniques are used. These techniques are median imputation, creation of missing values indicators, and feature engineering. Multiple models are developed using multiple linear regression. At least three models are developed using different features. Model selection is performed through out-of-sample metrics such as Mean Squared Error (MSE) and R². F-statistics, multicollinearity statistics, and residual analysis are also used in model selection. Finally, the model is refitted on the entire data set and used to produce predictions on the evaluation set.
The objective of this assignment is to build, test, and interpret multiple linear regression models that forecast or predict team wins based on their previous performance records in baseball games. This involves carrying out exploratory data analysis, using suitable data transformations, building multiple candidate models, selecting the best model based on out-of-sample performance, as well as making predictions for the evaluation set. By doing this, it is possible to demonstrate a data mining process that is informed by statistical thinking and predictive assessment.
To understand the data structure and the relationship between the data points in the Moneyball dataset used for the model’s training, exploratory data analysis was performed. The data set had a total of 2,276 data points and included multiple team performance metrics standardized for a 162-game season.
The summary statistics show significant variation in the data points of almost all the team performance metrics. There are also some right-skewed distributions in the data points of the home runs, walks, and total hits metrics. These statistics show the benefits of data transformations for some of the data points in the subsequent stages of the modeling process.
Missing values are also present in the data points of the pitching metrics. Given the historical context of the data set, the missing data points are likely the result of incomplete data in the earlier seasons of the league’s history. Therefore, median imputation was a suitable option for the data set in the data preparation phase.
The correlation matrix shows a positive correlation between the team’s wins and the data points representing the team’s offensive production. There is also a negative correlation between the defensive errors and the team’s wins. There are also positive correlations between the data points of the team’s pitching metrics and the team’s wins.
The strong correlations between the data points of the team’s offensive production also show the possibility of multicollinearity in the data set. Therefore, the Variance Inflation Factor diagnostic tool should also be used in the model evaluation phase.
The exploratory data analysis also confirmed the presence of statistically significant variation in the data points in the data set and the presence of logical correlations in the data set.
The training and evaluation datasets are loaded from CSV files using efficient data import functions in R. These datasets contain historical baseball performance statistics used for model development and prediction.
library(tidyverse)
library(corrplot)
library(GGally)
library(car)
library(broom)
library(performance)
library(lmtest)
library(sandwich)
library(tidymodels)
library(glmnet)
library(gt)train_raw <- read_csv("moneyball-training-data.csv")
eval_raw <- read_csv("moneyball-evaluation-data.csv")
dim(train_raw)[1] 2276 17
glimpse(train_raw)Rows: 2,276
Columns: 17
$ INDEX <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
$ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
$ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
$ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
$ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
$ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
$ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
$ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
$ TEAM_BASERUN_SB <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
$ TEAM_BASERUN_CS <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
$ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
$ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
$ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
$ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
$ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
$ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
summary(train_raw) INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
Median :1270.5 Median : 82.00 Median :1454 Median :238.0
Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
NA's :102
TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
Median :101.0 Median : 49.0 Median :58.00 Median : 1518
Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
NA's :131 NA's :772 NA's :2085
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
NA's :102
TEAM_FIELDING_DP
Min. : 52.0
1st Qu.:131.0
Median :149.0
Mean :146.4
3rd Qu.:164.0
Max. :228.0
NA's :286
colSums(is.na(train_raw)) INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
0 0 0 0
TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
0 0 0 102
TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
131 772 2085 0
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
0 0 102 0
TEAM_FIELDING_DP
286
colSums(is.na(eval_raw)) INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
0 0 18 13
TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
87 240 0 0
TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
0 18 0 31
The histogram shows that TARGET_WINS is roughly centered around the mid-70s to low-80s range, with most teams falling between about 65 and 95 wins. The distribution appears fairly symmetric, with only a few extreme high- or low-win seasons.
ggplot(train_raw, aes(x = TARGET_WINS)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of TARGET_WINS",
x = "Wins",
y = "Count")The target variable appears suitably distributed for regression analysis, with no extreme imbalance or severe skewness. This supports the use of multiple linear regression to model team wins without requiring transformation of the response variable.
The density plots display the distribution of each numeric variable in the dataset on its own scale. Several offensive variables, such as home runs and walks, show right-skewed distributions, indicating that a small number of teams recorded unusually high values. Other variables appear more symmetric and approximately bell-shaped.
The differences in spread across panels also highlight that predictors are measured on very different scales.
train_long <- train_raw %>%
select(-INDEX) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value")
ggplot(train_long, aes(x = value)) +
geom_density(fill = "orange", alpha = 0.5) +
facet_wrap(~variable, scales = "free") +
theme_minimal()Some variables exhibit skewness and scale differences, which suggests that transformations or careful model specification may improve stability and interpretation in regression modeling.
The correlation matrix shows the pairwise linear relationships among all numeric variables in the dataset. Offensive metrics such as home runs and walks display strong positive correlations with TARGET_WINS, while defensive errors show a negative correlation with wins. Pitching-related variables also exhibit meaningful relationships with the target.
Additionally, several offensive predictors are highly correlated with one another, suggesting potential multicollinearity.
cor_mat <- cor(train_raw %>% select(-INDEX),
use = "pairwise.complete.obs")
corrplot(cor_mat,
method = "color",
type = "upper",
tl.cex = 0.6)cor_with_target <- cor(train_raw %>% select(-INDEX),
use = "pairwise.complete.obs")["TARGET_WINS",]
sort(cor_with_target, decreasing = TRUE) TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_BB
1.00000000 0.38876752 0.28910365 0.23255986
TEAM_PITCHING_HR TEAM_BATTING_HR TEAM_BATTING_3B TEAM_BASERUN_SB
0.18901373 0.17615320 0.14260841 0.13513892
TEAM_PITCHING_BB TEAM_BATTING_HBP TEAM_BASERUN_CS TEAM_BATTING_SO
0.12417454 0.07350424 0.02240407 -0.03175071
TEAM_FIELDING_DP TEAM_PITCHING_SO TEAM_PITCHING_H TEAM_FIELDING_E
-0.03485058 -0.07843609 -0.10993705 -0.17648476
The correlation analysis confirms that key performance metrics are logically related to team wins. However, strong correlations among predictors indicate that multicollinearity should be evaluated during model development.
The ranking of these correlations with TARGET_WINS indicates which variables are most strongly linearly related to wins. The offensive statistics such as home runs and walks are strongly related to wins. The defensive statistics, such as errors, are inversely related. The high correlation between offensive statistics indicates that redundancy and possibly multicollinearity are issues.
The scatterplot matrix displays pairwise relationships among TARGET_WINS and selected key performance variables. The plots show clear positive linear trends between wins and offensive measures such as home runs and walks.
GGally::ggpairs(
train_raw %>%
select(TARGET_WINS,
TEAM_BATTING_HR,
TEAM_BATTING_BB,
TEAM_FIELDING_E,
TEAM_PITCHING_HR,
TEAM_PITCHING_SO) %>%
drop_na()
)The scatterplots support the assumption of linear relationships between key predictors and team wins, while also highlighting correlations among predictors that should be monitored during regression modeling.
train_raw %>%
select(where(is.numeric)) %>%
summarise(across(everything(), sd))# A tibble: 1 × 17
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
<dbl> <dbl> <dbl> <dbl> <dbl>
1 736. 15.8 145. 46.8 27.9
# ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
# TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
# TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
# TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
# TEAM_FIELDING_DP <dbl>
The range of standard deviations varies significantly for different predictors, depending upon the scales used for measurement. It is observed that linear regression provides unbiased results without the need for normalizing the predictors, but scaling may be useful for numerical stability as well as for comparison of coefficients.
boxplot.stats(train_raw$TARGET_WINS)$out [1] 39 38 134 146 39 23 128 129 126 124 21 22 17 33 32 34 0 24 38
[20] 36 27 14 26 135 34 36 33 12 38 29 35 31
m_temp <- lm(TARGET_WINS ~ ., data = train_raw %>% select(-INDEX))
hatvalues(m_temp) %>% summary() Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01789 0.04966 0.07018 0.08377 0.08717 0.95247
The presence of extreme values and potentially high leverage points suggests that certain seasons may have a large influence on the regression findings. This emphasizes the importance of examining leverage and residual values when choosing a model, in order to remain safe and robust. As linear regression is particularly susceptible to the influence of points, we will look at residuals and leverage as part of the regression evaluation process.
While early scatterplots suggest that there may be linear relationships between important offensive measures and team wins, certain predictor variables may also have skewed distributions, making nonlinear scaling methods such as log scaling worthy of consideration during regression development.
Prior to regression modeling, we performed a number of preprocessing steps to ensure that the regression models remain stable, interpretable, and conducive to inference. These preprocessing steps address missing data, scaling issues, and collinearity among performance metrics.
During the exploratory phase, it was noted that the missing data are mainly concentrated in a few pitching variables. As the missing data are likely due to incomplete historical records, which are unlikely to be completely random, median imputation was chosen. Median is more robust to extreme values, preserving the “center” of the data.
Also, missing value indicator flags are included to detect any underlying structural information in the missing data.
eval_index <- eval_raw$INDEX
train <- train_raw
eval <- eval_raw
na_cols <- names(train)[colSums(is.na(train)) > 0]
na_cols <- setdiff(na_cols, c("INDEX", "TARGET_WINS"))
for (nm in na_cols) {
train[[paste0(nm, "_NA")]] <- ifelse(is.na(train[[nm]]), 1, 0)
eval[[paste0(nm, "_NA")]] <- ifelse(is.na(eval[[nm]]), 1, 0)
}
train <- train %>%
mutate(across(where(is.numeric),
~ifelse(is.na(.x),
median(.x, na.rm = TRUE),
.x)))
eval <- eval %>%
mutate(across(where(is.numeric),
~ifelse(is.na(.x),
median(.x, na.rm = TRUE),
.x)))
train <- train %>% select(-INDEX)
eval <- eval %>% select(-INDEX)
colSums(is.na(train)) TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
0 0 0 0
TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
0 0 0 0
TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
0 0 0 0
TEAM_BATTING_SO_NA TEAM_BASERUN_SB_NA TEAM_BASERUN_CS_NA TEAM_BATTING_HBP_NA
0 0 0 0
TEAM_PITCHING_SO_NA TEAM_FIELDING_DP_NA
0 0
colSums(is.na(eval)) TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
0 0 0 0
TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
0 0 0 0
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
0 0 0 0
TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_SO_NA
0 0 0 0
TEAM_BASERUN_SB_NA TEAM_BASERUN_CS_NA TEAM_BATTING_HBP_NA TEAM_PITCHING_SO_NA
0 0 0 0
TEAM_FIELDING_DP_NA
0
After preprocessing, all predictors contain complete observations, ensuring compatibility with linear regression modeling.
Several offensive variables capture related aspects of hitting performance and may introduce multicollinearity when modeled simultaneously. To address this, derived features were constructed to summarize overall offensive production more efficiently.
Total bases provide a consolidated measure of power hitting:
train <- train %>%
mutate(
TOTAL_BASES = TEAM_BATTING_H +
TEAM_BATTING_2B +
2 * TEAM_BATTING_3B +
3 * TEAM_BATTING_HR,
POWER_RATIO = ifelse(TEAM_BATTING_H > 0,
TEAM_BATTING_HR / TEAM_BATTING_H,
0)
)
eval <- eval %>%
mutate(
TOTAL_BASES = TEAM_BATTING_H +
TEAM_BATTING_2B +
2 * TEAM_BATTING_3B +
3 * TEAM_BATTING_HR,
POWER_RATIO = ifelse(TEAM_BATTING_H > 0,
TEAM_BATTING_HR / TEAM_BATTING_H,
0)
)
glimpse(train)Rows: 2,276
Columns: 24
$ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72…
$ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13…
$ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, …
$ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41…
$ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, …
$ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, …
$ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82…
$ TEAM_BASERUN_SB <dbl> 101, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119,…
$ TEAM_BASERUN_CS <dbl> 49, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10…
$ TEAM_BATTING_HBP <dbl> 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58…
$ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13…
$ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, …
$ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, …
$ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8…
$ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,…
$ TEAM_FIELDING_DP <dbl> 149, 155, 153, 156, 168, 149, 186, 136, 169, 159, …
$ TEAM_BATTING_SO_NA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BASERUN_SB_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BASERUN_CS_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BATTING_HBP_NA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ TEAM_PITCHING_SO_NA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_FIELDING_DP_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TOTAL_BASES <dbl> 1756, 2172, 2090, 1960, 1843, 1827, 1897, 1863, 20…
$ POWER_RATIO <dbl> 0.00899654, 0.14189694, 0.09949165, 0.06921413, 0.…
Thus, all the predictors are now fully observed, making them suitable for linear regression modeling.
While ordinary least squares regression does not necessitate standardization of the predictors to achieve unbiased estimators of the regression coefficients, scale issues may impact stability and interpretation. This may be investigated by considering alternative model specifications, particularly when comparing coefficient magnitude across predictors.
cor(train %>%
select(TARGET_WINS, TOTAL_BASES, POWER_RATIO),
use = "pairwise.complete.obs") TARGET_WINS TOTAL_BASES POWER_RATIO
TARGET_WINS 1.0000000 0.4227754 0.136171
TOTAL_BASES 0.4227754 1.0000000 0.556130
POWER_RATIO 0.1361710 0.5561300 1.000000
Having dealt with missing values, constructed indicator variables, and engineered features, the data is now ready for regression modeling. The data preprocessing steps have effectively dealt with missing values, reduced data redundancy due to correlated predictors, and promoted interpretability. The next steps will involve developing and comparing multiple linear regression models to identify the best regression specification for team win predictions.
To compare different specifications, the prepared data set will be divided into a set of training data and a set of test data. We’ll fit the models on the training data and compare how well the models do on the test data.
set.seed(42)
idx <- sample(seq_len(nrow(train)),
size = floor(0.9 * nrow(train)))
train_tr <- train[idx, ]
test_tr <- train[-idx, ]
dim(train_tr)[1] 2048 24
dim(test_tr)[1] 228 24
To assess the models’ out-of-sample predictive ability without sacrificing the number of data points for model estimation, we’ll use a 90/10 split.
This baseline model establishes a straightforward linear relationship between key traditional baseball statistics and total team wins. It provides a reference point for evaluating whether more advanced feature engineering or transformations meaningfully improve predictive performance in later models.
Model 1 uses the basic set of offensive and pitching statistics most commonly related to team success.
m1 <- lm(
TARGET_WINS ~ TEAM_BATTING_HR +
TEAM_BATTING_BB +
TEAM_FIELDING_E +
TEAM_PITCHING_H +
TEAM_PITCHING_HR,
data = train_tr
)
summary(m1)
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_FIELDING_E + TEAM_PITCHING_H + TEAM_PITCHING_HR, data = train_tr)
Residuals:
Min 1Q Median 3Q Max
-52.839 -10.006 0.489 10.025 77.005
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.7746445 2.2708167 30.286 < 2e-16 ***
TEAM_BATTING_HR -0.1297548 0.0259857 -4.993 6.44e-07 ***
TEAM_BATTING_BB 0.0231387 0.0037337 6.197 6.93e-10 ***
TEAM_FIELDING_E -0.0041112 0.0026511 -1.551 0.121
TEAM_PITCHING_H -0.0004030 0.0003298 -1.222 0.222
TEAM_PITCHING_HR 0.1409561 0.0242072 5.823 6.70e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.23 on 2042 degrees of freedom
Multiple R-squared: 0.07324, Adjusted R-squared: 0.07097
F-statistic: 32.27 on 5 and 2042 DF, p-value: < 2.2e-16
The results of Model 1 show that offensive variables such as home runs and walks are positively associated with team wins, while fielding errors tend to have a negative relationship with wins. This aligns with basic baseball intuition: stronger offensive production increases wins, and defensive mistakes reduce them.
Pitching variables also contribute to explaining variation in wins, although their statistical significance and magnitude may vary. Overall, the model is statistically significant and explains a meaningful portion of the variation in team performance, making it a reasonable baseline for comparison with more advanced specifications.
Model 2 replaces several individual hitting statistics with engineered variables that summarize offensive production more efficiently. TOTAL_BASES captures overall power output, while POWER_RATIO reflects the share of home runs relative to total hits. These are combined with walks, fielding errors, and selected pitching metrics to create a more compact and theoretically structured specification.
m2 <- lm(
TARGET_WINS ~ TOTAL_BASES +
TEAM_BATTING_BB +
POWER_RATIO +
TEAM_FIELDING_E +
TEAM_PITCHING_SO +
TEAM_PITCHING_HR,
data = train_tr
)
summary(m2)
Call:
lm(formula = TARGET_WINS ~ TOTAL_BASES + TEAM_BATTING_BB + POWER_RATIO +
TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = train_tr)
Residuals:
Min 1Q Median 3Q Max
-51.935 -8.953 0.175 9.007 49.706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.455e+01 3.823e+00 3.807 0.000145 ***
TOTAL_BASES 3.402e-02 1.812e-03 18.776 < 2e-16 ***
TEAM_BATTING_BB 1.379e-02 3.376e-03 4.084 4.60e-05 ***
POWER_RATIO -1.299e+02 3.036e+01 -4.280 1.96e-05 ***
TEAM_FIELDING_E -1.504e-02 2.052e-03 -7.331 3.28e-13 ***
TEAM_PITCHING_SO 5.672e-04 5.733e-04 0.989 0.322671
TEAM_PITCHING_HR -7.816e-03 2.156e-02 -0.363 0.716953
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.68 on 2041 degrees of freedom
Multiple R-squared: 0.253, Adjusted R-squared: 0.2508
F-statistic: 115.2 on 6 and 2041 DF, p-value: < 2.2e-16
The results indicate that total offensive production, as measured by total bases and walks, is positively associated with team wins, while fielding errors remain negatively related. The engineered features improve interpretability by reducing redundancy among correlated batting statistics. Overall, this model provides a more structured and potentially more stable alternative to the baseline specification
Model 3 applies log transformations to selected skewed predictors, including home runs, walks, and pitching home runs allowed. The goal is to reduce the influence of extreme values and capture potential nonlinear relationships between these variables and team wins. Fielding errors and pitching strikeouts remain in their original scale.
m3 <- lm(
TARGET_WINS ~ log(TEAM_BATTING_HR + 1) +
log(TEAM_BATTING_BB + 1) +
TEAM_FIELDING_E +
log(TEAM_PITCHING_HR + 1) +
TEAM_PITCHING_SO,
data = train_tr
)
summary(m3)
Call:
lm(formula = TARGET_WINS ~ log(TEAM_BATTING_HR + 1) + log(TEAM_BATTING_BB +
1) + TEAM_FIELDING_E + log(TEAM_PITCHING_HR + 1) + TEAM_PITCHING_SO,
data = train_tr)
Residuals:
Min 1Q Median 3Q Max
-67.378 -9.905 0.709 10.000 77.826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.854e+01 8.570e+00 3.331 0.000882 ***
log(TEAM_BATTING_HR + 1) -1.986e+01 2.519e+00 -7.882 5.19e-15 ***
log(TEAM_BATTING_BB + 1) 8.152e+00 1.315e+00 6.200 6.82e-10 ***
TEAM_FIELDING_E -1.306e-02 3.108e-03 -4.202 2.76e-05 ***
log(TEAM_PITCHING_HR + 1) 2.105e+01 2.350e+00 8.956 < 2e-16 ***
TEAM_PITCHING_SO -2.719e-03 5.903e-04 -4.606 4.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.01 on 2042 degrees of freedom
Multiple R-squared: 0.1005, Adjusted R-squared: 0.09833
F-statistic: 45.65 on 5 and 2042 DF, p-value: < 2.2e-16
The transformed variables maintain the expected directional relationships with wins, while potentially improving model stability by reducing skewness. The log specification suggests diminishing marginal effects, meaning that increases in already large values may have smaller incremental impacts on wins. This model provides an alternative functional form to test whether nonlinear scaling improves predictive performance compared to the earlier linear specifications.
These three models represent different theories and statistical models for predicting team wins. The following section compares the models using out-of-sample error metrics and statistics to determine which model is the most appropriate.
The models will be evaluated in terms of how they perform on the data that they have not seen, and that is done through the holdout set. What is important for the choice of the model is the out-of-sample measures, such as the MSE, R², F-stat for the overall significance, and the residuals, which will indicate the goodness of fit.
model_metrics <- function(mod, data){
pred <- predict(mod, newdata = data)
mse <- mean((data$TARGET_WINS - pred)^2)
r2 <- 1 - sum((data$TARGET_WINS - pred)^2) /
sum((data$TARGET_WINS - mean(data$TARGET_WINS))^2)
fstat <- summary(mod)$fstatistic
tibble(
MSE = mse,
R2 = r2,
F_value = unname(fstat["value"]),
df1 = unname(fstat["numdf"]),
df2 = unname(fstat["dendf"])
)
}results_table <- bind_rows(
Model_1 = model_metrics(m1, test_tr),
Model_2 = model_metrics(m2, test_tr),
Model_3 = model_metrics(m3, test_tr),
.id = "Model"
)
results_table# A tibble: 3 × 6
Model MSE R2 F_value df1 df2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Model_1 219. 0.0519 32.3 5 2042
2 Model_2 198. 0.141 115. 6 2041
3 Model_3 227. 0.0167 45.6 5 2042
When comparing the models, it indicatess that the model with the lowest out-of-sample MSE has the highest power of prediction. The differences in R² will indicate the goodness of fit, while the F-stat will indicate the overall significance of the models.
When considering the holdout MSE and R², it shows that the most meaningful choice for predicting the wins of the teams is Model 2, so it is the final choice for the model.
car::vif(m2) TOTAL_BASES TEAM_BATTING_BB POWER_RATIO TEAM_FIELDING_E
2.444247 1.907044 17.406845 2.455342
TEAM_PITCHING_SO TEAM_PITCHING_HR
1.141935 19.238526
The Variance Inflation Factor (VIF) was also checked to understand the level of multicollinearity between the predictors, and most predictors have low to moderate values, but the values for POWER_RATIO and TEAM_PITCHING_HR are high, indicating the redundancy in the features that were created.
par(mfrow = c(2,2))
plot(m2)par(mfrow = c(1,1))Residual diagnostics check if linearity, constant variance, and normality assumptions are reasonable. Residual plots show that the chosen model is reasonable, and there are no major issues of extreme nonlinearity. However, the Breusch-Pagan test results show that there is heteroskedasticity. Thus, robust standard errors are employed.
lmtest::bptest(m2)
studentized Breusch-Pagan test
data: m2
BP = 225.73, df = 6, p-value < 2.2e-16
coeftest(m2, vcov = vcovHC(m2, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4553e+01 4.4470e+00 3.2725 0.0010839 **
TOTAL_BASES 3.4021e-02 2.1421e-03 15.8818 < 2.2e-16 ***
TEAM_BATTING_BB 1.3788e-02 3.8716e-03 3.5612 0.0003776 ***
POWER_RATIO -1.2993e+02 3.3488e+01 -3.8798 0.0001079 ***
TEAM_FIELDING_E -1.5042e-02 2.7249e-03 -5.5201 3.819e-08 ***
TEAM_PITCHING_SO 5.6717e-04 1.0485e-03 0.5409 0.5886148
TEAM_PITCHING_HR -7.8160e-03 2.4389e-02 -0.3205 0.7486431
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Breusch-Pagan test is statistically significant, which shows that heteroskedasticity is present. Though the coefficient estimates of OLS will still be consistent, it is advised that robust standard errors should be employed.
From the chosen model, it is evident that more offensive output, measured by total bases and walks, is associated with more wins, while defensive errors are associated with fewer wins. The pitching variables perform as expected, although with lower precision.
The negative coefficient of POWER_RATIO shows that teams that use home runs relative to total hits do not win more games, given that total bases are already factored into the model. Thus, there may be diminishing returns to using home runs when already having more total bases.
final_model <- lm(
TARGET_WINS ~ TOTAL_BASES +
TEAM_BATTING_BB +
POWER_RATIO +
TEAM_FIELDING_E +
TEAM_PITCHING_SO +
TEAM_PITCHING_HR,
data = train
)
summary(final_model)
Call:
lm(formula = TARGET_WINS ~ TOTAL_BASES + TEAM_BATTING_BB + POWER_RATIO +
TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = train)
Residuals:
Min 1Q Median 3Q Max
-52.271 -9.087 0.151 9.171 49.596
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.489e+01 3.664e+00 4.063 5.02e-05 ***
TOTAL_BASES 3.362e-02 1.750e-03 19.217 < 2e-16 ***
TEAM_BATTING_BB 1.395e-02 3.209e-03 4.346 1.45e-05 ***
POWER_RATIO -1.226e+02 2.954e+01 -4.151 3.43e-05 ***
TEAM_FIELDING_E -1.438e-02 1.970e-03 -7.298 4.01e-13 ***
TEAM_PITCHING_SO 6.432e-04 5.706e-04 1.127 0.260
TEAM_PITCHING_HR -9.275e-03 2.101e-02 -0.441 0.659
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.72 on 2269 degrees of freedom
Multiple R-squared: 0.2438, Adjusted R-squared: 0.2418
F-statistic: 121.9 on 6 and 2269 DF, p-value: < 2.2e-16
eval_pred <- predict(final_model, newdata = eval)
submission <- tibble(
INDEX = eval_index,
TARGET_WINS = eval_pred
)
write_csv(submission, "moneyball_predictions.csv")
head(submission)# A tibble: 6 × 2
INDEX TARGET_WINS
<dbl> <dbl>
1 9 67.6
2 10 68.0
3 14 75.6
4 47 86.4
5 60 68.0
6 63 69.7
As this analysis demonstrates, offensive production in total bases and walks really matters in predicting the number of games won, while errors hurt those chances. Of all the model selection approaches we tested, the engineered feature model was the standout because of its lowest out-of-sample Mean Squared Error and highest predictive R².
In spite of the presence of heteroscedasticity as suggested by the Breusch-Pagan test, we employed robust standard errors in order to ensure that our statistical conclusions remain reliable. Overall, the final model achieves a good balance of ease of interpretation, statistical rigor, and prediction accuracy.