Code
library(tidyverse)
library(corrplot)
library(GGally)
library(car)
library(broom)
library(performance)
library(lmtest)
library(sandwich)
library(tidymodels)
library(glmnet)
library(gt)The goal of this assignment is to use the Moneyball dataset to understand the effect of baseball team performance metrics on the total number of games won in a season. To do this, we’ll use historical team data standardized for a 162-game season and try to model the relationship between the response variable TARGET_WINS.
The data analysis is performed following a data mining framework. The data is first examined through exploratory data analysis. In this phase, the distribution and scale of the data are examined. Special emphasis is given to the presence of missing values and the correlation between the features. Then, data preparation techniques are used. These techniques are median imputation, creation of missing values indicators, and feature engineering. Multiple models are developed using multiple linear regression. At least three models are developed using different features. Model selection is performed through out-of-sample metrics such as Mean Squared Error (MSE) and R². F-statistics, multicollinearity statistics, and residual analysis are also used in model selection. Finally, the model is refitted on the entire data set and used to produce predictions on the evaluation set.
The objective of this assignment is to build, test, and interpret multiple linear regression models that forecast or predict team wins based on their previous performance records in baseball games. This involves carrying out exploratory data analysis, using suitable data transformations, building multiple candidate models, selecting the best model based on out-of-sample performance, as well as making predictions for the evaluation set. By doing this, it is possible to demonstrate a data mining process that is informed by statistical thinking and predictive assessment.
To understand the data structure and the relationship between the data points in the Moneyball dataset used for the model’s training, exploratory data analysis was performed. The data set had a total of 2,276 data points and included multiple team performance metrics standardized for a 162-game season.
The summary statistics show significant variation in the data points of almost all the team performance metrics. There are also some right-skewed distributions in the data points of the home runs, walks, and total hits metrics. These statistics show the benefits of data transformations for some of the data points in the subsequent stages of the modeling process.
Missing values are also present in the data points of the pitching metrics. Given the historical context of the data set, the missing data points are likely the result of incomplete data in the earlier seasons of the league’s history. Therefore, median imputation was a suitable option for the data set in the data preparation phase.
The correlation matrix shows a positive correlation between the team’s wins and the data points representing the team’s offensive production. There is also a negative correlation between the defensive errors and the team’s wins. There are also positive correlations between the data points of the team’s pitching metrics and the team’s wins.
The strong correlations between the data points of the team’s offensive production also show the possibility of multicollinearity in the data set. Therefore, the Variance Inflation Factor diagnostic tool should also be used in the model evaluation phase.
The exploratory data analysis also confirmed the presence of statistically significant variation in the data points in the data set and the presence of logical correlations in the data set.
library(tidyverse)
library(corrplot)
library(GGally)
library(car)
library(broom)
library(performance)
library(lmtest)
library(sandwich)
library(tidymodels)
library(glmnet)
library(gt)train_raw <- read_csv("moneyball-training-data.csv")
eval_raw <- read_csv("moneyball-evaluation-data.csv")
dim(train_raw)[1] 2276 17
glimpse(train_raw)Rows: 2,276
Columns: 17
$ INDEX <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
$ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
$ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
$ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
$ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
$ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
$ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
$ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
$ TEAM_BASERUN_SB <dbl> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
$ TEAM_BASERUN_CS <dbl> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
$ TEAM_BATTING_HBP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
$ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
$ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
$ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
$ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
$ TEAM_FIELDING_DP <dbl> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
summary(train_raw) INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
Median :1270.5 Median : 82.00 Median :1454 Median :238.0
Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
NA's :102
TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
Median :101.0 Median : 49.0 Median :58.00 Median : 1518
Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
NA's :131 NA's :772 NA's :2085
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
NA's :102
TEAM_FIELDING_DP
Min. : 52.0
1st Qu.:131.0
Median :149.0
Mean :146.4
3rd Qu.:164.0
Max. :228.0
NA's :286
colSums(is.na(train_raw)) INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
0 0 0 0
TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
0 0 0 102
TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
131 772 2085 0
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
0 0 102 0
TEAM_FIELDING_DP
286
colSums(is.na(eval_raw)) INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
0 0 18 13
TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
87 240 0 0
TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
0 18 0 31
ggplot(train_raw, aes(x = TARGET_WINS)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of TARGET_WINS",
x = "Wins",
y = "Count")train_long <- train_raw %>%
select(-INDEX) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value")
ggplot(train_long, aes(x = value)) +
geom_density(fill = "orange", alpha = 0.5) +
facet_wrap(~variable, scales = "free") +
theme_minimal()cor_mat <- cor(train_raw %>% select(-INDEX),
use = "pairwise.complete.obs")
corrplot(cor_mat,
method = "color",
type = "upper",
tl.cex = 0.6)cor_with_target <- cor(train_raw %>% select(-INDEX),
use = "pairwise.complete.obs")["TARGET_WINS",]
sort(cor_with_target, decreasing = TRUE) TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_BB
1.00000000 0.38876752 0.28910365 0.23255986
TEAM_PITCHING_HR TEAM_BATTING_HR TEAM_BATTING_3B TEAM_BASERUN_SB
0.18901373 0.17615320 0.14260841 0.13513892
TEAM_PITCHING_BB TEAM_BATTING_HBP TEAM_BASERUN_CS TEAM_BATTING_SO
0.12417454 0.07350424 0.02240407 -0.03175071
TEAM_FIELDING_DP TEAM_PITCHING_SO TEAM_PITCHING_H TEAM_FIELDING_E
-0.03485058 -0.07843609 -0.10993705 -0.17648476
The ranking of these correlations with TARGET_WINS indicates which variables are most strongly linearly related to wins. The offensive statistics such as home runs and walks are strongly related to wins. The defensive statistics, such as errors, are inversely related. The high correlation between offensive statistics indicates that redundancy and possibly multicollinearity are issues.
GGally::ggpairs(
train_raw %>%
select(TARGET_WINS,
TEAM_BATTING_HR,
TEAM_BATTING_BB,
TEAM_FIELDING_E,
TEAM_PITCHING_HR,
TEAM_PITCHING_SO) %>%
drop_na()
)train_raw %>%
select(where(is.numeric)) %>%
summarise(across(everything(), sd))# A tibble: 1 × 17
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
<dbl> <dbl> <dbl> <dbl> <dbl>
1 736. 15.8 145. 46.8 27.9
# ℹ 12 more variables: TEAM_BATTING_HR <dbl>, TEAM_BATTING_BB <dbl>,
# TEAM_BATTING_SO <dbl>, TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>,
# TEAM_BATTING_HBP <dbl>, TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>,
# TEAM_PITCHING_BB <dbl>, TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>,
# TEAM_FIELDING_DP <dbl>
The range of standard deviations varies significantly for different predictors, depending upon the scales used for measurement. It is observed that linear regression provides unbiased results without the need for normalizing the predictors, but scaling may be useful for numerical stability as well as for comparison of coefficients.
boxplot.stats(train_raw$TARGET_WINS)$out [1] 39 38 134 146 39 23 128 129 126 124 21 22 17 33 32 34 0 24 38
[20] 36 27 14 26 135 34 36 33 12 38 29 35 31
m_temp <- lm(TARGET_WINS ~ ., data = train_raw %>% select(-INDEX))
hatvalues(m_temp) %>% summary() Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01789 0.04966 0.07018 0.08377 0.08717 0.95247
The presence of extreme values and potentially high leverage points suggests that certain seasons may have a large influence on the regression findings. This emphasizes the importance of examining leverage and residual values when choosing a model, in order to remain safe and robust. As linear regression is particularly susceptible to the influence of points, we will look at residuals and leverage as part of the regression evaluation process.
While early scatterplots suggest that there may be linear relationships between important offensive measures and team wins, certain predictor variables may also have skewed distributions, making nonlinear scaling methods such as log scaling worthy of consideration during regression development.
Prior to regression modeling, we performed a number of preprocessing steps to ensure that the regression models remain stable, interpretable, and conducive to inference. These preprocessing steps address missing data, scaling issues, and collinearity among performance metrics.
During the exploratory phase, it was noted that the missing data are mainly concentrated in a few pitching variables. As the missing data are likely due to incomplete historical records, which are unlikely to be completely random, median imputation was chosen. Median is more robust to extreme values, preserving the “center” of the data.
Also, missing value indicator flags are included to detect any underlying structural information in the missing data.
eval_index <- eval_raw$INDEX
train <- train_raw
eval <- eval_raw
na_cols <- names(train)[colSums(is.na(train)) > 0]
na_cols <- setdiff(na_cols, c("INDEX", "TARGET_WINS"))
for (nm in na_cols) {
train[[paste0(nm, "_NA")]] <- ifelse(is.na(train[[nm]]), 1, 0)
eval[[paste0(nm, "_NA")]] <- ifelse(is.na(eval[[nm]]), 1, 0)
}
train <- train %>%
mutate(across(where(is.numeric),
~ifelse(is.na(.x),
median(.x, na.rm = TRUE),
.x)))
eval <- eval %>%
mutate(across(where(is.numeric),
~ifelse(is.na(.x),
median(.x, na.rm = TRUE),
.x)))
train <- train %>% select(-INDEX)
eval <- eval %>% select(-INDEX)
colSums(is.na(train)) TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
0 0 0 0
TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
0 0 0 0
TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
0 0 0 0
TEAM_BATTING_SO_NA TEAM_BASERUN_SB_NA TEAM_BASERUN_CS_NA TEAM_BATTING_HBP_NA
0 0 0 0
TEAM_PITCHING_SO_NA TEAM_FIELDING_DP_NA
0 0
colSums(is.na(eval)) TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
0 0 0 0
TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
0 0 0 0
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
0 0 0 0
TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_SO_NA
0 0 0 0
TEAM_BASERUN_SB_NA TEAM_BASERUN_CS_NA TEAM_BATTING_HBP_NA TEAM_PITCHING_SO_NA
0 0 0 0
TEAM_FIELDING_DP_NA
0
After preprocessing, all predictors contain complete observations, ensuring compatibility with linear regression modeling.
Several offensive variables capture related aspects of hitting performance and may introduce multicollinearity when modeled simultaneously. To address this, derived features were constructed to summarize overall offensive production more efficiently.
Total bases provide a consolidated measure of power hitting:
train <- train %>%
mutate(
TOTAL_BASES = TEAM_BATTING_H +
TEAM_BATTING_2B +
2 * TEAM_BATTING_3B +
3 * TEAM_BATTING_HR,
POWER_RATIO = ifelse(TEAM_BATTING_H > 0,
TEAM_BATTING_HR / TEAM_BATTING_H,
0)
)
eval <- eval %>%
mutate(
TOTAL_BASES = TEAM_BATTING_H +
TEAM_BATTING_2B +
2 * TEAM_BATTING_3B +
3 * TEAM_BATTING_HR,
POWER_RATIO = ifelse(TEAM_BATTING_H > 0,
TEAM_BATTING_HR / TEAM_BATTING_H,
0)
)
glimpse(train)Rows: 2,276
Columns: 24
$ TARGET_WINS <dbl> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72…
$ TEAM_BATTING_H <dbl> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13…
$ TEAM_BATTING_2B <dbl> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, …
$ TEAM_BATTING_3B <dbl> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41…
$ TEAM_BATTING_HR <dbl> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, …
$ TEAM_BATTING_BB <dbl> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, …
$ TEAM_BATTING_SO <dbl> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82…
$ TEAM_BASERUN_SB <dbl> 101, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119,…
$ TEAM_BASERUN_CS <dbl> 49, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10…
$ TEAM_BATTING_HBP <dbl> 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58…
$ TEAM_PITCHING_H <dbl> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13…
$ TEAM_PITCHING_HR <dbl> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, …
$ TEAM_PITCHING_BB <dbl> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, …
$ TEAM_PITCHING_SO <dbl> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8…
$ TEAM_FIELDING_E <dbl> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,…
$ TEAM_FIELDING_DP <dbl> 149, 155, 153, 156, 168, 149, 186, 136, 169, 159, …
$ TEAM_BATTING_SO_NA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BASERUN_SB_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BASERUN_CS_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_BATTING_HBP_NA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ TEAM_PITCHING_SO_NA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TEAM_FIELDING_DP_NA <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ TOTAL_BASES <dbl> 1756, 2172, 2090, 1960, 1843, 1827, 1897, 1863, 20…
$ POWER_RATIO <dbl> 0.00899654, 0.14189694, 0.09949165, 0.06921413, 0.…
Thus, all the predictors are now fully observed, making them suitable for linear regression modeling.
While ordinary least squares regression does not necessitate standardization of the predictors to achieve unbiased estimators of the regression coefficients, scale issues may impact stability and interpretation. This may be investigated by considering alternative model specifications, particularly when comparing coefficient magnitude across predictors.
cor(train %>%
select(TARGET_WINS, TOTAL_BASES, POWER_RATIO),
use = "pairwise.complete.obs") TARGET_WINS TOTAL_BASES POWER_RATIO
TARGET_WINS 1.0000000 0.4227754 0.136171
TOTAL_BASES 0.4227754 1.0000000 0.556130
POWER_RATIO 0.1361710 0.5561300 1.000000
Having dealt with missing values, constructed indicator variables, and engineered features, the data is now ready for regression modeling. The data preprocessing steps have effectively dealt with missing values, reduced data redundancy due to correlated predictors, and promoted interpretability. The next steps will involve developing and comparing multiple linear regression models to identify the best regression specification for team win predictions.
To compare different specifications, the prepared data set will be divided into a set of training data and a set of test data. We’ll fit the models on the training data and compare how well the models do on the test data.
set.seed(42)
idx <- sample(seq_len(nrow(train)),
size = floor(0.9 * nrow(train)))
train_tr <- train[idx, ]
test_tr <- train[-idx, ]
dim(train_tr)[1] 2048 24
dim(test_tr)[1] 228 24
To assess the models’ out-of-sample predictive ability without sacrificing the number of data points for model estimation, we’ll use a 90/10 split.
m1 <- lm(
TARGET_WINS ~ TEAM_BATTING_HR +
TEAM_BATTING_BB +
TEAM_FIELDING_E +
TEAM_PITCHING_H +
TEAM_PITCHING_HR,
data = train_tr
)
summary(m1)
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_FIELDING_E + TEAM_PITCHING_H + TEAM_PITCHING_HR, data = train_tr)
Residuals:
Min 1Q Median 3Q Max
-52.839 -10.006 0.489 10.025 77.005
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.7746445 2.2708167 30.286 < 2e-16 ***
TEAM_BATTING_HR -0.1297548 0.0259857 -4.993 6.44e-07 ***
TEAM_BATTING_BB 0.0231387 0.0037337 6.197 6.93e-10 ***
TEAM_FIELDING_E -0.0041112 0.0026511 -1.551 0.121
TEAM_PITCHING_H -0.0004030 0.0003298 -1.222 0.222
TEAM_PITCHING_HR 0.1409561 0.0242072 5.823 6.70e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.23 on 2042 degrees of freedom
Multiple R-squared: 0.07324, Adjusted R-squared: 0.07097
F-statistic: 32.27 on 5 and 2042 DF, p-value: < 2.2e-16
Model 1 uses the basic set of offensive and pitching statistics most commonly related to team success.
m2 <- lm(
TARGET_WINS ~ TOTAL_BASES +
TEAM_BATTING_BB +
POWER_RATIO +
TEAM_FIELDING_E +
TEAM_PITCHING_SO +
TEAM_PITCHING_HR,
data = train_tr
)
summary(m2)
Call:
lm(formula = TARGET_WINS ~ TOTAL_BASES + TEAM_BATTING_BB + POWER_RATIO +
TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = train_tr)
Residuals:
Min 1Q Median 3Q Max
-51.935 -8.953 0.175 9.007 49.706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.455e+01 3.823e+00 3.807 0.000145 ***
TOTAL_BASES 3.402e-02 1.812e-03 18.776 < 2e-16 ***
TEAM_BATTING_BB 1.379e-02 3.376e-03 4.084 4.60e-05 ***
POWER_RATIO -1.299e+02 3.036e+01 -4.280 1.96e-05 ***
TEAM_FIELDING_E -1.504e-02 2.052e-03 -7.331 3.28e-13 ***
TEAM_PITCHING_SO 5.672e-04 5.733e-04 0.989 0.322671
TEAM_PITCHING_HR -7.816e-03 2.156e-02 -0.363 0.716953
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.68 on 2041 degrees of freedom
Multiple R-squared: 0.253, Adjusted R-squared: 0.2508
F-statistic: 115.2 on 6 and 2041 DF, p-value: < 2.2e-16
Model 2 replaces some redundant hitting statistics with consolidated offensive statistics such as total bases and power ratio.
m3 <- lm(
TARGET_WINS ~ log(TEAM_BATTING_HR + 1) +
log(TEAM_BATTING_BB + 1) +
TEAM_FIELDING_E +
log(TEAM_PITCHING_HR + 1) +
TEAM_PITCHING_SO,
data = train_tr
)
summary(m3)
Call:
lm(formula = TARGET_WINS ~ log(TEAM_BATTING_HR + 1) + log(TEAM_BATTING_BB +
1) + TEAM_FIELDING_E + log(TEAM_PITCHING_HR + 1) + TEAM_PITCHING_SO,
data = train_tr)
Residuals:
Min 1Q Median 3Q Max
-67.378 -9.905 0.709 10.000 77.826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.854e+01 8.570e+00 3.331 0.000882 ***
log(TEAM_BATTING_HR + 1) -1.986e+01 2.519e+00 -7.882 5.19e-15 ***
log(TEAM_BATTING_BB + 1) 8.152e+00 1.315e+00 6.200 6.82e-10 ***
TEAM_FIELDING_E -1.306e-02 3.108e-03 -4.202 2.76e-05 ***
log(TEAM_PITCHING_HR + 1) 2.105e+01 2.350e+00 8.956 < 2e-16 ***
TEAM_PITCHING_SO -2.719e-03 5.903e-04 -4.606 4.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.01 on 2042 degrees of freedom
Multiple R-squared: 0.1005, Adjusted R-squared: 0.09833
F-statistic: 45.65 on 5 and 2042 DF, p-value: < 2.2e-16
Model 3 attempts to add log transformations for the skewed variables we detected in the exploratory data analysis. This is a simple experiment to assess the effect of nonlinear scaling on the models.
These three models represent different theories and statistical models for predicting team wins. The following section compares the models using out-of-sample error metrics and statistics to determine which model is the most appropriate.
The models will be evaluated in terms of how they perform on the data that they have not seen, and that is done through the holdout set. What is important for the choice of the model is the out-of-sample measures, such as the MSE, R², F-stat for the overall significance, and the residuals, which will indicate the goodness of fit.
model_metrics <- function(mod, data){
pred <- predict(mod, newdata = data)
mse <- mean((data$TARGET_WINS - pred)^2)
r2 <- 1 - sum((data$TARGET_WINS - pred)^2) /
sum((data$TARGET_WINS - mean(data$TARGET_WINS))^2)
fstat <- summary(mod)$fstatistic
tibble(
MSE = mse,
R2 = r2,
F_value = unname(fstat["value"]),
df1 = unname(fstat["numdf"]),
df2 = unname(fstat["dendf"])
)
}results_table <- bind_rows(
Model_1 = model_metrics(m1, test_tr),
Model_2 = model_metrics(m2, test_tr),
Model_3 = model_metrics(m3, test_tr),
.id = "Model"
)
results_table# A tibble: 3 × 6
Model MSE R2 F_value df1 df2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Model_1 219. 0.0519 32.3 5 2042
2 Model_2 198. 0.141 115. 6 2041
3 Model_3 227. 0.0167 45.6 5 2042
When comparing the models, it indicatess that the model with the lowest out-of-sample MSE has the highest power of prediction. The differences in R² will indicate the goodness of fit, while the F-stat will indicate the overall significance of the models.
When considering the holdout MSE and R², it shows that the most meaningful choice for predicting the wins of the teams is Model 2, so it is the final choice for the model.
car::vif(m2) TOTAL_BASES TEAM_BATTING_BB POWER_RATIO TEAM_FIELDING_E
2.444247 1.907044 17.406845 2.455342
TEAM_PITCHING_SO TEAM_PITCHING_HR
1.141935 19.238526
The Variance Inflation Factor (VIF) was also checked to understand the level of multicollinearity between the predictors, and most predictors have low to moderate values, but the values for POWER_RATIO and TEAM_PITCHING_HR are high, indicating the redundancy in the features that were created.
par(mfrow = c(2,2))
plot(m2)par(mfrow = c(1,1))Residual diagnostics check if linearity, constant variance, and normality assumptions are reasonable. Residual plots show that the chosen model is reasonable, and there are no major issues of extreme nonlinearity. However, the Breusch-Pagan test results show that there is heteroskedasticity. Thus, robust standard errors are employed.
lmtest::bptest(m2)
studentized Breusch-Pagan test
data: m2
BP = 225.73, df = 6, p-value < 2.2e-16
coeftest(m2, vcov = vcovHC(m2, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4553e+01 4.4470e+00 3.2725 0.0010839 **
TOTAL_BASES 3.4021e-02 2.1421e-03 15.8818 < 2.2e-16 ***
TEAM_BATTING_BB 1.3788e-02 3.8716e-03 3.5612 0.0003776 ***
POWER_RATIO -1.2993e+02 3.3488e+01 -3.8798 0.0001079 ***
TEAM_FIELDING_E -1.5042e-02 2.7249e-03 -5.5201 3.819e-08 ***
TEAM_PITCHING_SO 5.6717e-04 1.0485e-03 0.5409 0.5886148
TEAM_PITCHING_HR -7.8160e-03 2.4389e-02 -0.3205 0.7486431
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Breusch-Pagan test is statistically significant, which shows that heteroskedasticity is present. Though the coefficient estimates of OLS will still be consistent, it is advised that robust standard errors should be employed.
From the chosen model, it is evident that more offensive output, measured by total bases and walks, is associated with more wins, while defensive errors are associated with fewer wins. The pitching variables perform as expected, although with lower precision.
The negative coefficient of POWER_RATIO shows that teams that use home runs relative to total hits do not win more games, given that total bases are already factored into the model. Thus, there may be diminishing returns to using home runs when already having more total bases.
final_model <- lm(
TARGET_WINS ~ TOTAL_BASES +
TEAM_BATTING_BB +
POWER_RATIO +
TEAM_FIELDING_E +
TEAM_PITCHING_SO +
TEAM_PITCHING_HR,
data = train
)
summary(final_model)
Call:
lm(formula = TARGET_WINS ~ TOTAL_BASES + TEAM_BATTING_BB + POWER_RATIO +
TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR, data = train)
Residuals:
Min 1Q Median 3Q Max
-52.271 -9.087 0.151 9.171 49.596
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.489e+01 3.664e+00 4.063 5.02e-05 ***
TOTAL_BASES 3.362e-02 1.750e-03 19.217 < 2e-16 ***
TEAM_BATTING_BB 1.395e-02 3.209e-03 4.346 1.45e-05 ***
POWER_RATIO -1.226e+02 2.954e+01 -4.151 3.43e-05 ***
TEAM_FIELDING_E -1.438e-02 1.970e-03 -7.298 4.01e-13 ***
TEAM_PITCHING_SO 6.432e-04 5.706e-04 1.127 0.260
TEAM_PITCHING_HR -9.275e-03 2.101e-02 -0.441 0.659
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.72 on 2269 degrees of freedom
Multiple R-squared: 0.2438, Adjusted R-squared: 0.2418
F-statistic: 121.9 on 6 and 2269 DF, p-value: < 2.2e-16
eval_pred <- predict(final_model, newdata = eval)
submission <- tibble(
INDEX = eval_index,
TARGET_WINS = eval_pred
)
write_csv(submission, "moneyball_predictions.csv")
head(submission)# A tibble: 6 × 2
INDEX TARGET_WINS
<dbl> <dbl>
1 9 67.6
2 10 68.0
3 14 75.6
4 47 86.4
5 60 68.0
6 63 69.7
As this analysis demonstrates, offensive production in total bases and walks really matters in predicting the number of games won, while errors hurt those chances. Of all the model selection approaches we tested, the engineered feature model was the standout because of its lowest out-of-sample Mean Squared Error and highest predictive R².
In spite of the presence of heteroscedasticity as suggested by the Breusch-Pagan test, we employed robust standard errors in order to ensure that our statistical conclusions remain reliable. Overall, the final model achieves a good balance of ease of interpretation, statistical rigor, and prediction accuracy.