
Executive Summary

This report develops regression models to predict MLB team wins using batting, pitching, and fielding statistics (1871–2006).
The strongest model, which combines batting, pitching, and fielding variables, explains about 24% of the variation in wins (adjusted R² = 0.243).

Key Insights:
- More Hits and Walks drive wins.
- More Errors cost wins.
- Efficiency metrics (e.g., HR rate) are statistically significant, which is consistent with the Moneyball philosophy, but the efficiency-based model fits worse than the model built on raw totals.

Recommendation: Use the balanced model for predictions, while exploring advanced metrics in future work.

1. Data Exploration

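Training data dimensions (rows, columns):

## [1] 2276   17

Evaluation data dimensions (rows, columns):

## [1] 259  16

Summary of TARGET_WINS in the training data:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00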

Top Missingness by Variable
variable	pct_missing
TEAM_BATTING_HBP	91.6%
TEAM_BASERUN_CS	33.9%
TEAM_FIELDING_DP	12.6%
TEAM_BASERUN_SB	5.8%
TEAM_BATTING_SO	4.5%
TEAM_PITCHING_SO	4.5%
INDEX	0.0%
TARGET_WINS	0.0%

We focus on Hits and Walks because correlation analysis shows they’re strongly related to Wins.
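
One way to check which predictors are most related to Wins is to rank them by their correlation with the target. The sketch below is illustrative only: the `toy` data frame is simulated with made-up coefficients and is not the training file, so the printed values say nothing about the real data.

```r
# Simulated stand-in for the training data (hypothetical values, not the real file)
set.seed(1)
n <- 200
hits  <- rnorm(n, 1450, 140)
walks <- rnorm(n, 500, 120)
errs  <- rnorm(n, 250, 100)
wins  <- 0.05 * hits + 0.02 * walks - 0.02 * errs + rnorm(n, 0, 10)
toy <- data.frame(TARGET_WINS = wins, TEAM_BATTING_H = hits,
                  TEAM_BATTING_BB = walks, TEAM_FIELDING_E = errs)

# Correlation of every column with the target, using pairwise-complete rows
cors <- cor(toy, use = "pairwise.complete.obs")[, "TARGET_WINS"]
cors <- cors[names(cors) != "TARGET_WINS"]

# Rank by absolute correlation so strong negative predictors are not buried
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to the real data, the same two steps could be run on the `cor_mat` object built in the appendix by taking its `TARGET_WINS` column.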

2. Data Preparation

Key Descriptive Statistics
Variable	Mean	Median	SD
TARGET_WINS	80.79	82.00	15.75
TEAM_BATTING_H	1469.27	1454.00	144.59
TEAM_BATTING_BB	501.56	512.00	122.67
TEAM_FIELDING_E	246.48	159.00	227.77
TEAM_PITCHING_SO	817.54	813.50	540.54
HR_rate	0.07	0.07	0.04
BB_rate	0.34	0.36	0.09
SO_rate	0.51	0.53	0.18
Net_Steals	71.88	52.00	83.06
Net_BB	-51.45	-24.00	150.83

Median imputation was chosen because it’s robust to outliers in count data.
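
A small illustration of why the median is the safer fill value for skewed counts. The `errors` vector and the `impute_median` helper are hypothetical and written only for this sketch; the report itself relies on caret's `preProcess` (see the appendix).

```r
# Hypothetical count column with one extreme outlier and two missing values
errors <- c(120, 135, 150, 160, 170, NA, 1900, NA)

# Median imputation: fill NAs with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

filled <- impute_median(errors)

# The median ignores the outlier; the mean is pulled far above typical values
med <- median(errors, na.rm = TRUE)  # 155
avg <- mean(errors, na.rm = TRUE)    # well above every non-outlier value
print(c(median = med, mean = avg))
print(filled)
```

Filling with the mean here would assign the two missing teams an error count higher than any ordinary observation, which is the distortion median imputation avoids.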

3. Build Models

## 
## Regression Models
## ================================================================================================
##                                                 Dependent variable:                             
##                     ----------------------------------------------------------------------------
##                                                     TARGET_WINS                                 
##                                (1)                       (2)                      (3)           
## ------------------------------------------------------------------------------------------------
## TEAM_BATTING_H              0.048***                  0.051***                                  
##                              (0.003)                   (0.003)                                  
##                                                                                                 
## TEAM_BATTING_HR               0.001                    -0.026                                   
##                              (0.008)                   (0.024)                                  
##                                                                                                 
## TEAM_BATTING_BB             0.030***                  0.017***                                  
##                              (0.003)                   (0.003)                                  
##                                                                                                 
## TEAM_BATTING_SO              0.005**                    0.001                                   
##                              (0.002)                   (0.002)                                  
##                                                                                                 
## TEAM_FIELDING_E                                       -0.016***                                 
##                                                        (0.002)                                  
##                                                                                                 
## HR_rate                                                                        109.298***       
##                                                                                 (12.069)        
##                                                                                                 
## BB_rate                                                                          -3.706         
##                                                                                 (4.751)         
##                                                                                                 
## SO_rate                                                                        -45.808***       
##                                                                                 (2.857)         
##                                                                                                 
## Net_Steals                                                                      0.061***        
##                                                                                 (0.004)         
##                                                                                                 
## Net_BB                                                                           0.006*         
##                                                                                 (0.003)         
##                                                                                                 
## TEAM_PITCHING_SO                                       0.001*                   0.003***        
##                                                        (0.001)                  (0.001)         
##                                                                                                 
## TEAM_PITCHING_HR                                        0.015                                   
##                                                        (0.021)                                  
##                                                                                                 
## log_E                                                                          -10.056***       
##                                                                                 (0.978)         
##                                                                                                 
## Constant                     -8.918*                    0.557                  144.152***       
##                              (4.717)                   (4.866)                  (6.486)         
##                                                                                                 
## ------------------------------------------------------------------------------------------------
## Observations                  2,276                     2,276                    2,276          
## R2                            0.224                     0.246                    0.187          
## Adjusted R2                   0.223                     0.243                    0.185          
## Residual Std. Error    13.887 (df = 2271)        13.702 (df = 2268)        14.223 (df = 2268)   
## F Statistic         164.078*** (df = 4; 2271) 105.547*** (df = 7; 2268) 74.666*** (df = 7; 2268)
## ================================================================================================
## Note:                                                                *p<0.1; **p<0.05; ***p<0.01

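Model 1: simple batting baseline (adjusted R² = 0.223).

Model 2: adds fielding and pitching variables; strongest fit (adjusted R² = 0.243) and still interpretable.

Model 3: efficiency-focused (rates and net differences), but weaker fit (adjusted R² = 0.185).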

4. Select Model

Model Comparison
Model	Adj_R2	AIC	RMSE
M1	0.223	18441.97	13.871
M2	0.243	18383.96	13.678
M3	0.185	18553.77	14.198
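
The RMSE values here are slightly lower than the residual standard errors in the regression table. RMSE divides the residual sum of squares by the number of observations n, while the residual standard error divides by the residual degrees of freedom, so RMSE = RSE × sqrt(df / n). The sketch below rechecks that relation using only figures reported in this document; small differences are expected because the inputs are rounded to three decimals.

```r
# Residual standard errors and residual degrees of freedom from the regression table
rse <- c(M1 = 13.887, M2 = 13.702, M3 = 14.223)
df  <- c(M1 = 2271,   M2 = 2268,   M3 = 2268)
n   <- 2276  # observations in the training data

# RMSE implied by each residual standard error
rmse_implied <- rse * sqrt(df / n)

# RMSE values reported in the model comparison table
rmse_reported <- c(M1 = 13.871, M2 = 13.678, M3 = 14.198)

print(round(rbind(rmse_implied, rmse_reported), 3))
```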

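Decision: Model 2 is selected. It has the best adjusted R² (0.243), the lowest AIC (18383.96), and the lowest RMSE (13.678), and its coefficients are interpretable:
- Hits (+), Walks (+), Errors (−), Pitcher SO (+, significant only at the 10% level).
- The TEAM_BATTING_HR coefficient is negative but not statistically significant, which likely reflects multicollinearity with TEAM_PITCHING_HR. Overall model fit is still the best of the three.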
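
Multicollinearity can be quantified with variance inflation factors (VIFs). The sketch below computes them by hand on simulated data. The variable names, the strength of the collinearity, and the `vif_manual` helper are all assumptions made for illustration, not results from the report's data.

```r
# Simulated data (hypothetical): batting and pitching HR are nearly the same series
set.seed(42)
n <- 500
bat_hr   <- rnorm(n, 100, 60)
pitch_hr <- bat_hr + rnorm(n, 0, 10)   # strongly collinear with bat_hr
hits     <- rnorm(n, 1450, 140)        # independent of the other two
X <- data.frame(bat_hr, pitch_hr, hits)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(X)
print(round(vifs, 1))
```

With the fitted model from the appendix, `car::vif(m2)` should report the same quantity for each predictor. A common rule of thumb treats values above roughly 5 to 10 as a sign that a coefficient's sign and size are unreliable.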

Predictions

Sample of Predicted Wins
INDEX	PREDICTED_WINS
9	68.89
10	70.43
14	78.62
47	85.61
60	67.77
63	69.31

Key Takeaways & Future Work

Takeaways:
- Teams win more games when they hit more and walk more.
- Errors directly cost wins and are a controllable weakness.
- Efficiency-based features validate the Moneyball thesis, but raw batting + pitching still outperform them in predictive power.

Future Work:
- Apply regularization (ridge/lasso) to mitigate the effects of multicollinearity on the coefficient estimates.
- Add interaction terms (e.g., HR × SO).
- Integrate modern sabermetric stats (OPS, WAR) for stronger prediction.
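
As a rough idea of what the regularization item could look like, the sketch below fits ridge regression in closed form on simulated, nearly duplicate predictors. Everything in it (the data, the `ridge_coef` helper, the penalty value of 10) is a hypothetical illustration. In practice a package such as glmnet, with cross-validation to choose the penalty, would be the more usual route.

```r
# Simulated data (hypothetical): x2 is almost a copy of x1
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y  <- 2 * x1 + rnorm(n)

X  <- scale(cbind(x1, x2))   # standardize predictors
yc <- y - mean(y)            # center the response so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, yc, lambda = 0)    # lambda = 0 reduces to ordinary least squares
ridge <- ridge_coef(X, yc, lambda = 10)   # penalty value chosen arbitrarily
print(round(rbind(ols, ridge), 2))
```

The penalty shrinks the coefficient vector and pulls the two collinear coefficients toward each other, which is what stabilizes estimates such as the batting and pitching home-run terms.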

Appendix — Full Code

Appendix A: R Code

# =========================
# Libraries
# =========================
library(tidyverse); library(corrplot); library(caret); library(car)
library(stargazer); library(kableExtra); library(broom)

# =========================
# Data Load
# =========================
train <- read.csv("moneyball-training-data.csv")
eval  <- read.csv("moneyball-evaluation-data.csv")

dim(train); dim(eval)
summary(train$TARGET_WINS)

# =========================
# Missingness
# =========================
pct_na <- function(x) mean(is.na(x))*100
missing_tbl <- train %>%
  summarise(across(everything(), pct_na)) %>%
  pivot_longer(everything(), names_to="variable", values_to="pct_missing") %>%
  arrange(desc(pct_missing))

# =========================
# Correlation Matrix
# =========================
num_vars <- train %>% select(-INDEX)
cor_mat  <- cor(num_vars, use="pairwise.complete.obs")
corrplot(cor_mat, method="color", tl.cex=0.7)

# =========================
# Imputation & Feature Engineering
# =========================
preproc <- preProcess(train, method=c("medianImpute"))
train_imp <- predict(preproc, train)
eval_imp  <- predict(preproc, eval)

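# Element-wise division that returns 0 where the denominator is zero or missing
safe_div <- function(num, den) ifelse(den==0|is.na(den), 0, num/den)

# Add efficiency and net-difference features; the rates are expressed per hit
featurize <- function(df){
  df %>%
    mutate(
      HR_rate = safe_div(TEAM_BATTING_HR, TEAM_BATTING_H),  # home runs per hit
      BB_rate = safe_div(TEAM_BATTING_BB, TEAM_BATTING_H),  # walks per hit
      SO_rate = safe_div(TEAM_BATTING_SO, TEAM_BATTING_H),  # strikeouts per hit
      Net_Steals = TEAM_BASERUN_SB - TEAM_BASERUN_CS,       # steals minus caught stealing
      Net_BB = TEAM_BATTING_BB - TEAM_PITCHING_BB,          # walks drawn minus walks allowed
      log_E = log1p(TEAM_FIELDING_E)                        # log(1 + errors) to reduce skew
    )
}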

train_feat <- featurize(train_imp)
eval_feat  <- featurize(eval_imp)

# =========================
# Descriptive Statistics
# =========================
desc_tbl <- train_feat %>%
  select(TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_BB, TEAM_FIELDING_E,
         TEAM_PITCHING_SO, HR_rate, BB_rate, SO_rate, Net_Steals, Net_BB) %>%
  summarise(across(everything(),
                   list(Mean=mean, Median=median, SD=sd),
                   .names="{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to=c("Variable",".value"),
               names_pattern="(.*)_(Mean|Median|SD)")

# =========================
# Model Building
# =========================
m1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
           TEAM_BATTING_BB + TEAM_BATTING_SO,
         data=train_feat)

m2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
           TEAM_BATTING_BB + TEAM_BATTING_SO +
           TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR,
         data=train_feat)

m3 <- lm(TARGET_WINS ~ HR_rate + BB_rate + SO_rate +
           Net_Steals + Net_BB + TEAM_PITCHING_SO + log_E,
         data=train_feat)

stargazer(m1, m2, m3, type="text")

# =========================
# Model Comparison
# =========================
cmp <- tibble(
  Model=c("M1","M2","M3"),
  Adj_R2=c(summary(m1)$adj.r.squared,
           summary(m2)$adj.r.squared,
           summary(m3)$adj.r.squared),
  AIC=c(AIC(m1),AIC(m2),AIC(m3)),
  RMSE=c(RMSE(predict(m1,train_feat),train_feat$TARGET_WINS),
         RMSE(predict(m2,train_feat),train_feat$TARGET_WINS),
         RMSE(predict(m3,train_feat),train_feat$TARGET_WINS))
)

# =========================
# Diagnostics
# =========================
par(mfrow=c(2,2)); plot(m2); par(mfrow=c(1,1))

# =========================
# Predictions
# =========================
pred_out <- eval %>%
  mutate(PREDICTED_WINS = predict(m2, newdata=eval_feat)) %>%
  select(INDEX,PREDICTED_WINS)

write.csv(pred_out,"moneyball_predictions.csv",row.names=FALSE)

Predicting MLB Wins: A Multiple Linear Regression Analysis

Sheriann McLarty

September 29, 2025