Executive Summary

This report develops regression models to predict MLB team wins using batting, pitching, and fielding statistics (1871–2006).
The strongest model, which combines batting, pitching, and fielding variables, explains about 24% of the variation in wins (adjusted R² = 0.243).

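Key Insights:
- More Hits and Walks are associated with more wins.
- More Errors are associated with fewer wins.
- Efficiency metrics (e.g., HR rate) are statistically significant, which is consistent with the Moneyball emphasis on efficiency, but the model built on them fits worse than the models using raw totals (adjusted R² = 0.185 vs. 0.243).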

Recommendation: Use the balanced model for predictions, while exploring advanced metrics in future work.

1. Data Exploration

## [1] 2276   17
## [1] 259  16
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   71.00   82.00   80.79   92.00  146.00

Top Missingness by Variable
variable            pct_missing
TEAM_BATTING_HBP    91.6%
TEAM_BASERUN_CS     33.9%
TEAM_FIELDING_DP    12.6%
TEAM_BASERUN_SB     5.8%
TEAM_BATTING_SO     4.5%
TEAM_PITCHING_SO    4.5%
INDEX               0.0%
TARGET_WINS         0.0%

We focus on Hits and Walks because correlation analysis shows they’re strongly related to Wins.
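
As a minimal sketch of how such a ranking can be read off a correlation matrix, the snippet below sorts each predictor by the absolute size of its correlation with TARGET_WINS. The data frame `toy` is simulated for illustration only (its values and effect sizes are assumptions, not the report's data); with the real data the same steps would apply to the training set.

```r
# Simulated stand-in data (hypothetical values, not the report's dataset)
set.seed(1)
n <- 200
toy <- data.frame(
  TEAM_BATTING_H  = rnorm(n, 1470, 145),
  TEAM_BATTING_BB = rnorm(n, 500, 120),
  TEAM_FIELDING_E = rnorm(n, 250, 100)
)
toy$TARGET_WINS <- 0.05 * toy$TEAM_BATTING_H + 0.02 * toy$TEAM_BATTING_BB -
  0.02 * toy$TEAM_FIELDING_E + rnorm(n, 0, 10)

# Correlation of every predictor with the target, strongest first
cor_mat  <- cor(toy, use = "pairwise.complete.obs")
cor_wins <- cor_mat[, "TARGET_WINS"]
cor_wins <- cor_wins[names(cor_wins) != "TARGET_WINS"]
cor_wins <- cor_wins[order(abs(cor_wins), decreasing = TRUE)]
print(round(cor_wins, 2))
```

Sorting by absolute value keeps strong negative relationships (such as errors) near the top alongside strong positive ones.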

2. Data Preparation

Key Descriptive Statistics
Variable            Mean       Median     SD
TARGET_WINS         80.79      82.00      15.75
TEAM_BATTING_H      1469.27    1454.00    144.59
TEAM_BATTING_BB     501.56     512.00     122.67
TEAM_FIELDING_E     246.48     159.00     227.77
TEAM_PITCHING_SO    817.54     813.50     540.54
HR_rate             0.07       0.07       0.04
BB_rate             0.34       0.36       0.09
SO_rate             0.51       0.53       0.18
Net_Steals          71.88      52.00      83.06
Net_BB              -51.45     -24.00     150.83

Median imputation was chosen because it’s robust to outliers in count data.
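
The gap between the mean and median of TEAM_FIELDING_E in the table above (246.48 vs. 159.00) hints at why this matters. The following sketch uses a small made-up vector of error counts (the values and the helper `impute_median` are hypothetical) to show how one extreme season drags the mean while leaving the median close to a typical value.

```r
# Hypothetical error counts with one extreme season and two missing values
errors <- c(120, 135, 150, 160, 170, NA, 1900, NA)

mean(errors, na.rm = TRUE)    # pulled upward by the extreme value
median(errors, na.rm = TRUE)  # stays near the typical season

# Replace missing entries with the median of the observed entries
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
errors_imp <- impute_median(errors)
errors_imp
```

Filling the gaps with the mean here would assign each missing season a count far above every typical observation, whereas the median fill stays inside the bulk of the data.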

3. Build Models

## 
## Regression Models
## ================================================================================================
##                                                 Dependent variable:                             
##                     ----------------------------------------------------------------------------
##                                                     TARGET_WINS                                 
##                                (1)                       (2)                      (3)           
## ------------------------------------------------------------------------------------------------
## TEAM_BATTING_H              0.048***                  0.051***                                  
##                              (0.003)                   (0.003)                                  
##                                                                                                 
## TEAM_BATTING_HR               0.001                    -0.026                                   
##                              (0.008)                   (0.024)                                  
##                                                                                                 
## TEAM_BATTING_BB             0.030***                  0.017***                                  
##                              (0.003)                   (0.003)                                  
##                                                                                                 
## TEAM_BATTING_SO              0.005**                    0.001                                   
##                              (0.002)                   (0.002)                                  
##                                                                                                 
## TEAM_FIELDING_E                                       -0.016***                                 
##                                                        (0.002)                                  
##                                                                                                 
## HR_rate                                                                        109.298***       
##                                                                                 (12.069)        
##                                                                                                 
## BB_rate                                                                          -3.706         
##                                                                                 (4.751)         
##                                                                                                 
## SO_rate                                                                        -45.808***       
##                                                                                 (2.857)         
##                                                                                                 
## Net_Steals                                                                      0.061***        
##                                                                                 (0.004)         
##                                                                                                 
## Net_BB                                                                           0.006*         
##                                                                                 (0.003)         
##                                                                                                 
## TEAM_PITCHING_SO                                       0.001*                   0.003***        
##                                                        (0.001)                  (0.001)         
##                                                                                                 
## TEAM_PITCHING_HR                                        0.015                                   
##                                                        (0.021)                                  
##                                                                                                 
## log_E                                                                          -10.056***       
##                                                                                 (0.978)         
##                                                                                                 
## Constant                     -8.918*                    0.557                  144.152***       
##                              (4.717)                   (4.866)                  (6.486)         
##                                                                                                 
## ------------------------------------------------------------------------------------------------
## Observations                  2,276                     2,276                    2,276          
## R2                            0.224                     0.246                    0.187          
## Adjusted R2                   0.223                     0.243                    0.185          
## Residual Std. Error    13.887 (df = 2271)        13.702 (df = 2268)        14.223 (df = 2268)   
## F Statistic         164.078*** (df = 4; 2271) 105.547*** (df = 7; 2268) 74.666*** (df = 7; 2268)
## ================================================================================================
## Note:                                                                *p<0.1; **p<0.05; ***p<0.01

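Model 1: simple batting baseline (adjusted R² = 0.223).

Model 2: adds fielding errors and pitching variables; strongest fit (adjusted R² = 0.243) and still interpretable.

Model 3: efficiency-focused (rates, net differentials, and log errors), with the weakest fit of the three (adjusted R² = 0.185).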
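
To make the Model 2 coefficients concrete, the sketch below multiplies three of the rounded coefficients from the regression table by a hypothetical change in each statistic. The `delta` values are invented for illustration, and the result should be read as an association with other variables held fixed, not as a causal effect.

```r
# Model 2 coefficients as reported (rounded) in the regression table
coef_m2 <- c(TEAM_BATTING_H = 0.051,
             TEAM_BATTING_BB = 0.017,
             TEAM_FIELDING_E = -0.016)

# Hypothetical season-over-season changes for one team
delta <- c(TEAM_BATTING_H = 100,
           TEAM_BATTING_BB = 50,
           TEAM_FIELDING_E = -25)

# Expected change in wins from each component, and in total
effect <- coef_m2 * delta[names(coef_m2)]
effect
sum(effect)
```

Under these assumed changes, 100 extra hits account for most of the roughly six-win difference, which matches the relative size of the coefficients.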

4. Select Model

Model Comparison
Model    Adj_R2    AIC         RMSE
M1       0.223     18441.97    13.871
M2       0.243     18383.96    13.678
M3       0.185     18553.77    14.198
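
The RMSE values here are computed on the training data, so they are closely tied to the residual standard errors in the regression table: RMSE divides the residual sum of squares by n, while the residual standard error divides by the residual degrees of freedom. The short check below, a sketch using only numbers already reported above, recovers the RMSE column from the table values.

```r
n    <- 2276                                       # training observations
rse  <- c(M1 = 13.887, M2 = 13.702, M3 = 14.223)   # residual std. errors from the table
df_r <- c(M1 = 2271,   M2 = 2268,   M3 = 2268)     # residual degrees of freedom

# RMSE = sqrt(RSS / n) and RSE = sqrt(RSS / df), so RMSE = RSE * sqrt(df / n)
rmse <- rse * sqrt(df_r / n)
round(rmse, 3)
```

The results agree with the comparison table to within rounding. Because these are in-sample figures, a held-out or cross-validated RMSE would be a stricter basis for comparison.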

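Decision: Model 2 is selected because it has the best adjusted R², the lowest AIC, and the lowest RMSE, and it remains interpretable:
- Hits (+), Walks (+), Errors (−), Pitcher SO (+, significant only at the 10% level).
- The HR coefficient is negative but not statistically significant (−0.026, SE 0.024). Multicollinearity is a plausible cause, since TEAM_BATTING_HR and TEAM_PITCHING_HR both enter the model, but this has not been checked here.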
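
One way to check for multicollinearity is the variance inflation factor (VIF). The sketch below computes it by hand in base R on simulated data in which pitching home runs closely track batting home runs; that correlation, the variable names, and the helper `vif_manual` are assumptions made for illustration. In the report's own environment, `car::vif(m2)` should give the equivalent figures for the fitted model.

```r
set.seed(42)
n <- 500
# Simulated data (hypothetical): pitching HR closely tracks batting HR
bat_hr <- rnorm(n, 100, 60)
pit_hr <- bat_hr + rnorm(n, 5, 10)
hits   <- rnorm(n, 1470, 145)
wins   <- 0.05 * hits + 0.02 * bat_hr + rnorm(n, 0, 12)
toy    <- data.frame(wins, hits, bat_hr, pit_hr)

fit <- lm(wins ~ hits + bat_hr + pit_hr, data = toy)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}
vif_manual(fit)
```

A common rule of thumb treats VIF values above 5 or 10 as a sign that a coefficient's sign and size are unstable, which is the situation in which a negative HR coefficient would not be surprising.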

Predictions

Sample of Predicted Wins
INDEX    PREDICTED_WINS
9        68.89
10       70.43
14       78.62
47       85.61
60       67.77
63       69.31

Key Takeaways & Future Work

Takeaways
- Teams win more games when they hit more and walk more.
- Errors are associated with fewer wins and are a controllable weakness.
- Efficiency-based features are consistent with the Moneyball thesis, but raw batting and pitching totals still outperform them in predictive power.

Future Work
- Apply regularization (ridge/lasso) to mitigate the effects of multicollinearity on the coefficient estimates.
- Add interaction terms (e.g., HR × SO).
- Integrate modern sabermetric stats (OPS, WAR) for stronger prediction.
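
As a rough illustration of the regularization idea, the sketch below fits ridge regression in closed form on two nearly collinear simulated predictors. Everything in it (the data, the penalty `lambda = 50`, and the helper `ridge_coef`) is an assumption for demonstration. In practice a package such as glmnet with a cross-validated penalty is the more usual route.

```r
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)    # nearly collinear with x1
y  <- 1 * x1 + 1 * x2 + rnorm(n)

X  <- scale(cbind(x1, x2))       # standardize predictors
yc <- y - mean(y)                # center the response so no intercept is needed

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_ols   <- ridge_coef(X, yc, lambda = 0)    # lambda = 0 reproduces least squares
b_ridge <- ridge_coef(X, yc, lambda = 50)
rbind(OLS = b_ols, ridge = b_ridge)
```

The penalty does not remove the correlation between the predictors. It shrinks the coefficients and pulls the two collinear estimates toward each other, which makes them less sensitive to small changes in the data.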

Appendix A: Full R Code

# =========================
# Libraries
# =========================
library(tidyverse); library(corrplot); library(caret); library(car)
library(stargazer); library(kableExtra); library(broom)

# =========================
# Data Load
# =========================
train <- read.csv("moneyball-training-data.csv")
eval  <- read.csv("moneyball-evaluation-data.csv")

dim(train); dim(eval)
summary(train$TARGET_WINS)

# =========================
# Missingness
# =========================
pct_na <- function(x) mean(is.na(x))*100
missing_tbl <- train %>%
  summarise(across(everything(), pct_na)) %>%
  pivot_longer(everything(), names_to="variable", values_to="pct_missing") %>%
  arrange(desc(pct_missing))

# =========================
# Correlation Matrix
# =========================
num_vars <- train %>% select(-INDEX)
cor_mat  <- cor(num_vars, use="pairwise.complete.obs")
corrplot(cor_mat, method="color", tl.cex=0.7)

# =========================
# Imputation & Feature Engineering
# =========================
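# Fit the imputer on predictors only, so it can be applied to the
# evaluation set, which has no TARGET_WINS column
preproc <- preProcess(train %>% select(-INDEX, -TARGET_WINS),
                      method=c("medianImpute"))
train_imp <- predict(preproc, train)
eval_imp  <- predict(preproc, eval)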

safe_div <- function(num, den) ifelse(den==0|is.na(den), 0, num/den)

featurize <- function(df){
  df %>%
    mutate(
      HR_rate = safe_div(TEAM_BATTING_HR, TEAM_BATTING_H),
      BB_rate = safe_div(TEAM_BATTING_BB, TEAM_BATTING_H),
      SO_rate = safe_div(TEAM_BATTING_SO, TEAM_BATTING_H),
      Net_Steals = TEAM_BASERUN_SB - TEAM_BASERUN_CS,
      Net_BB = TEAM_BATTING_BB - TEAM_PITCHING_BB,
      log_E = log1p(TEAM_FIELDING_E)
    )
}

train_feat <- featurize(train_imp)
eval_feat  <- featurize(eval_imp)

# =========================
# Descriptive Statistics
# =========================
desc_tbl <- train_feat %>%
  select(TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_BB, TEAM_FIELDING_E,
         TEAM_PITCHING_SO, HR_rate, BB_rate, SO_rate, Net_Steals, Net_BB) %>%
  summarise(across(everything(),
                   list(Mean=mean, Median=median, SD=sd),
                   .names="{.col}_{.fn}")) %>%
  pivot_longer(everything(),
               names_to=c("Variable",".value"),
               names_pattern="(.*)_(Mean|Median|SD)")

# =========================
# Model Building
# =========================
m1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
           TEAM_BATTING_BB + TEAM_BATTING_SO,
         data=train_feat)

m2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
           TEAM_BATTING_BB + TEAM_BATTING_SO +
           TEAM_FIELDING_E + TEAM_PITCHING_SO + TEAM_PITCHING_HR,
         data=train_feat)

m3 <- lm(TARGET_WINS ~ HR_rate + BB_rate + SO_rate +
           Net_Steals + Net_BB + TEAM_PITCHING_SO + log_E,
         data=train_feat)

stargazer(m1, m2, m3, type="text")

# =========================
# Model Comparison
# =========================
cmp <- tibble(
  Model=c("M1","M2","M3"),
  Adj_R2=c(summary(m1)$adj.r.squared,
           summary(m2)$adj.r.squared,
           summary(m3)$adj.r.squared),
  AIC=c(AIC(m1),AIC(m2),AIC(m3)),
  RMSE=c(RMSE(predict(m1,train_feat),train_feat$TARGET_WINS),
         RMSE(predict(m2,train_feat),train_feat$TARGET_WINS),
         RMSE(predict(m3,train_feat),train_feat$TARGET_WINS))
)

# =========================
# Diagnostics
# =========================
par(mfrow=c(2,2)); plot(m2); par(mfrow=c(1,1))

# =========================
# Predictions
# =========================
pred_out <- eval %>%
  mutate(PREDICTED_WINS = predict(m2, newdata=eval_feat)) %>%
  select(INDEX,PREDICTED_WINS)

write.csv(pred_out,"moneyball_predictions.csv",row.names=FALSE)