This report develops regression models to predict MLB team wins using
batting, pitching, and fielding statistics (1871–2006).
The strongest model (Model 2), which combines batting, pitching, and fielding
variables, explains about 24% of the variation in wins (adjusted R² = 0.243).
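Key Insights:
- More Hits and Walks are associated with more wins.
- More Errors are associated with fewer wins.
- Efficiency metrics (e.g., HR rate) are statistically significant, which is
consistent with the Moneyball philosophy that smarter play matters, although the
efficiency-based model fits worse than the model built on raw totals.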
Recommendation: Use the balanced model for predictions, while exploring advanced metrics in future work.
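Dimensions of the training data (2,276 rows, 17 columns) and the evaluation data (259 rows, 16 columns):
## [1] 2276 17
## [1] 259 16
Summary of TARGET_WINS in the training data:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   71.00   82.00   80.79   92.00  146.00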
variable | pct_missing |
---|---|
TEAM_BATTING_HBP | 91.6% |
TEAM_BASERUN_CS | 33.9% |
TEAM_FIELDING_DP | 12.6% |
TEAM_BASERUN_SB | 5.8% |
TEAM_BATTING_SO | 4.5% |
TEAM_PITCHING_SO | 4.5% |
INDEX | 0.0% |
TARGET_WINS | 0.0% |
We focus on Hits and Walks because correlation analysis shows
they are positively related to Wins, and both remain significant in the regression models.
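The screening step described above can be sketched as follows. This is a minimal illustration on simulated data, not the report's data: the column names `hits`, `walks`, `errors`, and `wins` and the coefficients used to generate them are assumptions made up for the example.

```r
set.seed(7)
n <- 300

# Simulated stand-ins for team statistics (hypothetical values)
toy <- data.frame(
  hits   = rnorm(n, 1450, 140),
  walks  = rnorm(n, 500, 120),
  errors = rnorm(n, 250, 100)
)

# Made-up relationship, only to give the correlations something to detect
toy$wins <- 0.05 * toy$hits + 0.03 * toy$walks - 0.04 * toy$errors +
  rnorm(n, sd = 10)

# Correlation of each predictor with the target, ranked by absolute size
cors   <- cor(toy)[, "wins"]
cors   <- cors[names(cors) != "wins"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranking by absolute correlation keeps negatively related variables such as errors in view alongside the positive ones.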
Variable | Mean | Median | SD |
---|---|---|---|
TARGET_WINS | 80.79 | 82.00 | 15.75 |
TEAM_BATTING_H | 1469.27 | 1454.00 | 144.59 |
TEAM_BATTING_BB | 501.56 | 512.00 | 122.67 |
TEAM_FIELDING_E | 246.48 | 159.00 | 227.77 |
TEAM_PITCHING_SO | 817.54 | 813.50 | 540.54 |
HR_rate | 0.07 | 0.07 | 0.04 |
BB_rate | 0.34 | 0.36 | 0.09 |
SO_rate | 0.51 | 0.53 | 0.18 |
Net_Steals | 71.88 | 52.00 | 83.06 |
Net_BB | -51.45 | -24.00 | 150.83 |
Median imputation was chosen because it’s robust to outliers in count data.
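A minimal sketch of why the median is the safer fill value for skewed counts. The vector below is invented for illustration and is not taken from the training data.

```r
# Hypothetical count column with one extreme value and two missing entries
x <- c(120, 135, 150, 160, 175, 1900, NA, NA)

fill_mean   <- mean(x, na.rm = TRUE)    # pulled upward by the single outlier
fill_median <- median(x, na.rm = TRUE)  # stays near the bulk of the data

# Replace the missing entries with the median
x_imp <- ifelse(is.na(x), fill_median, x)

print(c(mean = fill_mean, median = fill_median))
print(x_imp)
```

With a mean fill, every imputed team would inherit a value far above the typical one, which is the distortion median imputation avoids.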
##
## Regression Models
## ================================================================================================
##                                                 Dependent variable:
##                     ----------------------------------------------------------------------------
##                                                     TARGET_WINS
##                                (1)                       (2)                      (3)
## ------------------------------------------------------------------------------------------------
## TEAM_BATTING_H              0.048***                  0.051***
##                              (0.003)                   (0.003)
##
## TEAM_BATTING_HR               0.001                    -0.026
##                              (0.008)                   (0.024)
##
## TEAM_BATTING_BB             0.030***                  0.017***
##                              (0.003)                   (0.003)
##
## TEAM_BATTING_SO              0.005**                    0.001
##                              (0.002)                   (0.002)
##
## TEAM_FIELDING_E                                       -0.016***
##                                                        (0.002)
##
## HR_rate                                                                        109.298***
##                                                                                 (12.069)
##
## BB_rate                                                                          -3.706
##                                                                                 (4.751)
##
## SO_rate                                                                        -45.808***
##                                                                                 (2.857)
##
## Net_Steals                                                                      0.061***
##                                                                                 (0.004)
##
## Net_BB                                                                           0.006*
##                                                                                 (0.003)
##
## TEAM_PITCHING_SO                                       0.001*                   0.003***
##                                                        (0.001)                  (0.001)
##
## TEAM_PITCHING_HR                                        0.015
##                                                        (0.021)
##
## log_E                                                                          -10.056***
##                                                                                 (0.978)
##
## Constant                     -8.918*                    0.557                  144.152***
##                              (4.717)                   (4.866)                  (6.486)
##
## ------------------------------------------------------------------------------------------------
## Observations                  2,276                     2,276                    2,276
## R2                            0.224                     0.246                    0.187
## Adjusted R2                   0.223                     0.243                    0.185
## Residual Std. Error    13.887 (df = 2271)        13.702 (df = 2268)        14.223 (df = 2268)
## F Statistic         164.078*** (df = 4; 2271) 105.547*** (df = 7; 2268) 74.666*** (df = 7; 2268)
## ================================================================================================
## Note:                                                                *p<0.1; **p<0.05; ***p<0.01
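Model 1: simple batting baseline (hits, home runs, walks, strikeouts).
Model 2: batting plus fielding errors and pitching; strongest performance, interpretable.
Model 3: efficiency-focused (rates and net differentials), but the weakest fit of the three (Adj. R² 0.185 vs. 0.243 for Model 2).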
Model | Adj_R2 | AIC | RMSE |
---|---|---|---|
M1 | 0.223 | 18441.97 | 13.871 |
M2 | 0.243 | 18383.96 | 13.678 |
M3 | 0.185 | 18553.77 | 14.198 |
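The RMSE column is computed on the training rows (the appendix calls `RMSE()` on in-sample predictions), so it differs from the Residual Std. Error in the regression table only by a degrees-of-freedom factor. The sketch below checks that arithmetic using the rounded values reported above; the vector names are illustrative.

```r
# Residual standard errors and residual degrees of freedom from the regression table
rse <- c(M1 = 13.887, M2 = 13.702, M3 = 14.223)
df  <- c(M1 = 2271,   M2 = 2268,   M3 = 2268)
n   <- 2276  # training observations

# RSE = sqrt(RSS / df) and RMSE = sqrt(RSS / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n)
print(round(rmse_from_rse, 2))
```

The recomputed values agree with the RMSE column to within rounding of the inputs. Because these are in-sample figures, an out-of-sample estimate (for example from cross-validation) would likely be somewhat higher.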
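Decision: Model 2 is selected. It has the highest Adj. R², the lowest AIC and
RMSE, and coefficients that are easy to interpret:
- Hits (+), Walks (+), Errors (−), Pitcher SO (+, significant only at the 10% level).
- The HR coefficient is negative but not statistically significant (−0.026, SE 0.024);
the sign is attributed to multicollinearity, and overall model fit is still the best of the three.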
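One way to check a multicollinearity explanation like the one above is the variance inflation factor. The sketch below computes it by hand in base R on simulated data; the variables `bat_hr`, `pitch_hr`, and `hits` are hypothetical stand-ins, and the assumption that two home-run counts move almost together is made up for the example rather than measured from the report's data.

```r
set.seed(1)
n <- 500

bat_hr   <- rnorm(n, mean = 100, sd = 30)    # stand-in for a batting HR count
pitch_hr <- bat_hr + rnorm(n, sd = 5)        # nearly a copy: strongly collinear
hits     <- rnorm(n, mean = 1450, sd = 140)  # unrelated predictor
X <- data.frame(bat_hr, pitch_hr, hits)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 1))
```

Values well above 10 for a pair of predictors would support the collinearity reading, while values near 1 would argue against it. On the fitted model itself, `car::vif(m2)` reports the same quantity.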
INDEX | PREDICTED_WINS |
---|---|
9 | 68.89 |
10 | 70.43 |
14 | 78.62 |
47 | 85.61 |
60 | 67.77 |
63 | 69.31 |
Takeaways:
- Teams win more games when they hit more and walk more.
- Errors are associated with fewer wins and are a controllable
weakness.
- Efficiency-based features are significant predictors, in line with the Moneyball thesis, but raw
batting and pitching totals still outperform them in predictive power.
Future Work:
- Apply regularization (ridge/lasso) to stabilize coefficient estimates in the presence of multicollinearity.
- Add interaction terms (e.g., HR × SO).
- Integrate modern sabermetric stats (OPS, WAR) for
potentially stronger prediction.
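As a rough illustration of the regularization item, the sketch below fits ridge regression in closed form with base R on simulated, nearly collinear predictors. The data, the helper name `ridge_coef`, and the penalty value are all assumptions for the example. A real analysis would more likely use a dedicated package and choose the penalty by cross-validation.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost collinear with x1
y  <- 2 * x1 + 1 * x2 + rnorm(n)

X  <- scale(cbind(x1, x2))       # standardize so the penalty treats both predictors equally
yc <- y - mean(y)                # center the response, so no intercept is needed

# Hypothetical helper: ridge solution (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_ols   <- ridge_coef(X, yc, lambda = 0)    # lambda = 0 reproduces least squares
b_ridge <- ridge_coef(X, yc, lambda = 10)   # penalized fit

print(rbind(ols = b_ols, ridge = b_ridge))
```

The penalty does not remove the correlation between predictors. It shrinks the coefficient vector, which tends to make the individual estimates less erratic when predictors overlap.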
## Appendix A: Full R Code
# =========================
# Libraries
# =========================
library(tidyverse); library(corrplot); library(caret); library(car)
library(stargazer); library(kableExtra); library(broom)
# =========================
# Data Load
# =========================
train <- read.csv("moneyball-training-data.csv")
eval <- read.csv("moneyball-evaluation-data.csv")
dim(train); dim(eval)
summary(train$TARGET_WINS)
# =========================
# Missingness
# =========================
pct_na <- function(x) mean(is.na(x)) * 100
missing_tbl <- train %>%
  summarise(across(everything(), pct_na)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  arrange(desc(pct_missing))
# =========================
# Correlation Matrix
# =========================
num_vars <- train %>% select(-INDEX)
cor_mat <- cor(num_vars, use="pairwise.complete.obs")
corrplot(cor_mat, method="color", tl.cex=0.7)
# =========================
# Imputation & Feature Engineering
# =========================
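# Fit the imputer on predictors only, so the same object applies to both data sets:
# the evaluation data has no TARGET_WINS column, and INDEX is an identifier.
preproc <- preProcess(train %>% select(-INDEX, -TARGET_WINS), method = "medianImpute")
train_imp <- predict(preproc, train)
eval_imp <- predict(preproc, eval)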
# Guarded division: returns 0 when the denominator is zero or missing
safe_div <- function(num, den) ifelse(den == 0 | is.na(den), 0, num / den)
featurize <- function(df) {
  df %>%
    mutate(
      HR_rate    = safe_div(TEAM_BATTING_HR, TEAM_BATTING_H),
      BB_rate    = safe_div(TEAM_BATTING_BB, TEAM_BATTING_H),
      SO_rate    = safe_div(TEAM_BATTING_SO, TEAM_BATTING_H),
      Net_Steals = TEAM_BASERUN_SB - TEAM_BASERUN_CS,
      Net_BB     = TEAM_BATTING_BB - TEAM_PITCHING_BB,
      log_E      = log1p(TEAM_FIELDING_E)
    )
}
train_feat <- featurize(train_imp)
eval_feat <- featurize(eval_imp)
# =========================
# Descriptive Statistics
# =========================
desc_tbl <- train_feat %>%
  select(TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_BB, TEAM_FIELDING_E,
         TEAM_PITCHING_SO, HR_rate, BB_rate, SO_rate, Net_Steals, Net_BB) %>%
  summarise(across(everything(),
                   list(Mean = mean, Median = median, SD = sd),
                   .names = "{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to = c("Variable", ".value"),
               names_pattern = "(.*)_(Mean|Median|SD)")
# =========================
# Model Building
# =========================
# Model 1: raw batting counts
m1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
           TEAM_BATTING_BB + TEAM_BATTING_SO,
         data = train_feat)
# Model 2: batting plus fielding errors and pitching
m2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
           TEAM_BATTING_BB + TEAM_BATTING_SO +
           TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR,
         data = train_feat)
# Model 3: efficiency rates and net differentials
m3 <- lm(TARGET_WINS ~ HR_rate + BB_rate + SO_rate +
           Net_Steals + Net_BB + TEAM_PITCHING_SO + log_E,
         data = train_feat)
stargazer(m1, m2, m3, type="text")
# =========================
# Model Comparison
# =========================
# RMSE here is in-sample: predictions are made on the training rows
cmp <- tibble(
  Model  = c("M1", "M2", "M3"),
  Adj_R2 = c(summary(m1)$adj.r.squared,
             summary(m2)$adj.r.squared,
             summary(m3)$adj.r.squared),
  AIC    = c(AIC(m1), AIC(m2), AIC(m3)),
  RMSE   = c(RMSE(predict(m1, train_feat), train_feat$TARGET_WINS),
             RMSE(predict(m2, train_feat), train_feat$TARGET_WINS),
             RMSE(predict(m3, train_feat), train_feat$TARGET_WINS))
)
# =========================
# Diagnostics
# =========================
par(mfrow=c(2,2)); plot(m2); par(mfrow=c(1,1))
# =========================
# Predictions
# =========================
pred_out <- eval %>%
  mutate(PREDICTED_WINS = predict(m2, newdata = eval_feat)) %>%
  select(INDEX, PREDICTED_WINS)
write.csv(pred_out,"moneyball_predictions.csv",row.names=FALSE)