Overview

This project compares the correlation between on-base percentage (OBP) and team winning percentage across two eras of baseball: the “Moneyball” era (1995-2004) and the modern game (2015-2024).

Introduction

The film Moneyball highlighted how the Oakland Athletics used undervalued statistics to gain a competitive advantage during the late 1990s and early 2000s. They relied on statistics that were not widely used as indicators of success at the time, such as on-base percentage and slugging percentage. Since then, baseball analytics have evolved significantly, and many teams now incorporate on-base skills into their player evaluation strategies. This project investigates whether OBP is still as strongly associated with team success today as it was during the Moneyball era. The data used in this project comes from the publicly available Lahman Baseball dataset, which contains team-level statistics for seasons dating back to the early 1870s.

Research Question

How does on-base percentage (OBP) correlate with team winning percentage in Major League Baseball, and has that relationship changed between the Moneyball era (1995–2004) and the modern era (2015–2024)?

Data Cleanup

The variables used for this analysis include:

At-bats (AB)
Hits (H)
Walks (BB)
Hit by pitch (HBP)
Sacrifice flies (SF)
Wins (W)
Losses (L)
On-base percentage (OBP)
Winning percentage (WinPct)

OBP and WinPct are not recorded directly in the dataset, but can be derived from the following formulas:

\(OBP = \frac{H + BB + HBP}{AB + BB + HBP + SF}\)

\(WinPct = \frac{W}{W + L}\)
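As a quick worked example of the two formulas, the sketch below computes OBP and WinPct for a single made-up team season. The season totals are hypothetical values chosen for illustration and are not taken from the Lahman data.

```r
# Hypothetical season totals for one team (illustrative values only)
team <- list(H = 1500, BB = 550, HBP = 60, AB = 5500, SF = 45, W = 95, L = 67)

# Times on base divided by the plate appearances that count toward OBP.
# The parentheses matter: without them R would divide only HBP by AB.
obp <- (team$H + team$BB + team$HBP) / (team$AB + team$BB + team$HBP + team$SF)

# Wins divided by total decisions
win_pct <- team$W / (team$W + team$L)

round(c(OBP = obp, WinPct = win_pct), 3)
```

For these made-up totals the team reaches base 2110 times in 6155 qualifying plate appearances, giving an OBP of about .343, and wins 95 of 162 games, giving a WinPct of about .586.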

The data is divided into two equal-length eras:

Moneyball Era: 1995–2004
Modern Era: 2015–2024

# Load the required packages and data
library(tidyverse)
teams <- read_csv("/Users/patrickmannion/downloads/Teams.csv")

# Filter data for the specific eras
teams_filtered <- teams %>%
  filter(yearID %in% c(1995:2004, 2015:2024))

This is a large dataset containing a great deal of information that is not relevant to the question being asked. All that is needed is the data from 1995-2004 and 2015-2024; the rest can be disregarded.

# Calculate OBP and WinPct Using Derived Formulas
teams_clean <- teams_filtered %>%
  mutate(
    WinPct = W / (W + L),
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
    Era = ifelse(yearID <= 2004, "Moneyball Era", "Modern Era")
  ) %>%
  filter(!is.na(OBP))

# era specific datasets with OBP and WinPct

moneyball_data <- teams_clean %>% filter(Era == "Moneyball Era")
modern_data    <- teams_clean %>% filter(Era == "Modern Era")

The final datasets contain team statistics from 1995-2004, which is our Moneyball era, and 2015-2024, which is our Modern era. The teams_clean data contains the two new variables, \(OBP\) and \(WinPct\), calculated using the formulas shown above.
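Before analyzing the cleaned data, a few sanity checks can help confirm that the derived variables behave as expected. The sketch below is one possible set of checks, run on a small made-up data frame (`toy_clean`, a hypothetical stand-in for teams_clean) so that it runs on its own.

```r
# Toy stand-in for teams_clean (made-up rows, not real Lahman values)
toy_clean <- data.frame(
  yearID = c(1995, 2004, 2015, 2024),
  OBP    = c(0.330, 0.345, 0.315, 0.322),
  WinPct = c(0.480, 0.560, 0.450, 0.540)
)
toy_clean$Era <- ifelse(toy_clean$yearID <= 2004, "Moneyball Era", "Modern Era")

# Both derived variables are proportions, so they must lie between 0 and 1
stopifnot(all(toy_clean$OBP > 0 & toy_clean$OBP < 1))
stopifnot(all(toy_clean$WinPct >= 0 & toy_clean$WinPct <= 1))

# Every row should fall inside one of the two ten-season windows
stopifnot(all(toy_clean$yearID %in% c(1995:2004, 2015:2024)))

# Team-seasons per era; on the real data this gives each era's sample size
era_counts <- table(toy_clean$Era)
era_counts
```

On the real data, the era counts should agree with the regression output later in this report, where the residual degrees of freedom plus two give 294 and 300 team-seasons.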

Data Exploration

# Create scatterplot of OBP vs WinPct
ggplot(teams_clean, aes(x = OBP, y = WinPct, color = Era)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "On-base Percentage (OBP)",
    y = "Winning Percentage",
    title = "OBP vs Winning Percentage by Era"
  )

Data Analysis

Correlation Coefficients

# Correlation between OBP and WinPct
 cor(moneyball_data$OBP, moneyball_data$WinPct)
## [1] 0.6034974
 cor(modern_data$OBP, modern_data$WinPct)
## [1] 0.6552929

The correlation coefficient for the Moneyball era is r = 0.603, and the correlation coefficient for the modern era is r = 0.655. Both indicate a moderately strong, positive linear relationship between OBP and WinPct.
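The two coefficients can also be compared more formally. One common option, which is not part of the original analysis, is Fisher's z-transformation for two independent correlations. The sketch below uses the r values printed above and assumes sample sizes of 294 and 300 team-seasons, inferred from the residual degrees of freedom in the regression summaries later in this report (292 + 2 and 298 + 2). It also treats team-seasons as independent observations, which is a simplification.

```r
# Correlations reported above
r_moneyball <- 0.6034974
r_modern    <- 0.6552929

# Assumed sample sizes (residual df + 2 from the regression summaries)
n_moneyball <- 294
n_modern    <- 300

# Fisher z-transform of each correlation (atanh is the same transform)
z_moneyball <- atanh(r_moneyball)
z_modern    <- atanh(r_modern)

# Standard error of the difference between two independent transformed correlations
se_diff <- sqrt(1 / (n_moneyball - 3) + 1 / (n_modern - 3))

# Test statistic and two-sided p-value from the standard normal distribution
z_stat  <- (z_modern - z_moneyball) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

c(z = z_stat, p = p_value)
```

This test asks whether the two correlation coefficients differ, which is a separate question from whether the regression slopes differ between eras.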

Linear Regression

# Linear model of the OBP/WinPct relationship, fit separately for each era
 model_moneyball <- lm(WinPct ~ OBP, data = moneyball_data)
 summary(model_moneyball)
## 
## Call:
## lm(formula = WinPct ~ OBP, data = moneyball_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.140596 -0.043007 -0.004749  0.039170  0.160740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.59055    0.08439  -6.998 1.78e-11 ***
## OBP          3.23748    0.25032  12.933  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05983 on 292 degrees of freedom
## Multiple R-squared:  0.3642, Adjusted R-squared:  0.362 
## F-statistic: 167.3 on 1 and 292 DF,  p-value: < 2.2e-16
 model_modern <- lm(WinPct ~ OBP, data = modern_data)
 summary(model_modern)
## 
## Call:
## lm(formula = WinPct ~ OBP, data = modern_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.190034 -0.038898  0.004926  0.043272  0.150378 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.8257     0.0886   -9.32   <2e-16 ***
## OBP           4.1628     0.2780   14.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06177 on 298 degrees of freedom
## Multiple R-squared:  0.4294, Adjusted R-squared:  0.4275 
## F-statistic: 224.3 on 1 and 298 DF,  p-value: < 2.2e-16

Important Values (Moneyball)

OBP coefficient: 3.237
p-value: < 2e-16
R²: 0.364
OBP explains about 36% of the variation in team winning percentage.

Important Values (Modern)

OBP coefficient: 4.163
p-value: < 2e-16
R²: 0.429
OBP explains about 43% of the variation in team winning percentage.
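In a simple linear regression with a single predictor, R² equals the square of the Pearson correlation, so these values can be cross-checked against the correlation coefficients reported earlier. A minimal check using only numbers printed in this report:

```r
# Correlation coefficients reported in the Correlation Coefficients section
r_moneyball <- 0.6034974
r_modern    <- 0.6552929

# Squaring r gives the share of variance in WinPct explained by OBP
r2_moneyball <- r_moneyball^2
r2_modern    <- r_modern^2

round(c(Moneyball = r2_moneyball, Modern = r2_modern), 4)
```

Both squares round to the Multiple R-squared values in the regression summaries (0.3642 and 0.4294), which confirms that the correlation and regression results describe the same fit.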

Interaction Model

# Test whether the relationship between OBP and WinPct changes between eras
 interaction_model <- lm(WinPct ~ OBP * Era, data = teams_clean)
 summary(interaction_model)
## 
## Call:
## lm(formula = WinPct ~ OBP * Era, data = teams_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.190034 -0.040785 -0.000122  0.041555  0.160740 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -0.82575    0.08723  -9.466   <2e-16 ***
## OBP                   4.16279    0.27369  15.210   <2e-16 ***
## EraMoneyball Era      0.23520    0.12235   1.922   0.0550 .  
## OBP:EraMoneyball Era -0.92531    0.37370  -2.476   0.0136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06082 on 590 degrees of freedom
## Multiple R-squared:  0.3999, Adjusted R-squared:  0.3969 
## F-statistic: 131.1 on 3 and 590 DF,  p-value: < 2.2e-16

Slopes

OBP (Modern Era baseline)

Estimate: 4.163

OBP × Moneyball Era (Interaction Term)

Estimate: −0.925

Modern Era OBP effect: 4.163

Moneyball Era OBP effect: 4.163 − 0.925 = 3.238
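Written out as fitted lines, the interaction model reproduces the two separate regressions. Adding the era terms to the Modern Era baseline gives the Moneyball intercept and slope, using the coefficients from the interaction model summary:

```latex
\[
\begin{aligned}
\widehat{WinPct}_{\text{Modern}} &= -0.82575 + 4.16279 \cdot OBP \\
\widehat{WinPct}_{\text{Moneyball}} &= (-0.82575 + 0.23520) + (4.16279 - 0.92531) \cdot OBP \\
&= -0.59055 + 3.23748 \cdot OBP
\end{aligned}
\]
```

These match the intercept and slope of the separate Moneyball regression (−0.59055 and 3.23748), so the interaction term is exactly the difference between the two era-specific slopes.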

Since a full one-unit increase in OBP (from .000 to 1.000) is not realistic, a more realistic increase of 0.010 (10 points of OBP) will be used.

0.010 × 3.24 = 0.032, an increase of about 3.2 percentage points in WinPct (Moneyball)
0.010 × 4.16 = 0.042, an increase of about 4.2 percentage points in WinPct (Modern)
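To put these effects on a more familiar scale, the sketch below converts the change in WinPct into wins over a full season. The 162-game schedule is an assumption added here (the standard MLB season length), and the helper name `wins_per_season` is hypothetical.

```r
# Era-specific slopes from the interaction model
slope_moneyball <- 4.163 - 0.925   # 3.238
slope_modern    <- 4.163

# Hypothetical helper: expected change in wins for a given change in OBP,
# assuming the linear fit holds and a season of `games` games
wins_per_season <- function(slope, obp_change = 0.010, games = 162) {
  slope * obp_change * games
}

round(c(Moneyball = wins_per_season(slope_moneyball),
        Modern    = wins_per_season(slope_modern)), 1)
```

Under these assumptions, a 10-point gain in OBP is associated with roughly 5 additional wins per season in the Moneyball era and roughly 7 in the modern era.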

Conclusions

Linear regression results show that on-base percentage (OBP) is a statistically significant predictor of team winning percentage in both eras. During the Moneyball era, OBP explained approximately 36% of the variation in winning percentage, while in the modern era it explained approximately 43%.

During the Moneyball era, a 10-point (0.010) increase in OBP was associated with about 3 more wins per 100 games. In modern MLB, a 10-point increase in OBP corresponds to roughly 4 more wins per 100 games. The interaction model indicates that the slope relating OBP to winning percentage differs by era. The interaction term between OBP and era is statistically significant at the 0.05 level (p = 0.0136), providing evidence that OBP has a stronger relationship with winning percentage in the modern era than during the Moneyball era. Teams that excel at getting on base continue to win more often.

Limitations

This project only looks at one offensive statistic, on-base percentage. Baseball is a multi-dimensional sport, and statistics such as earned run average (ERA), run prevention, power hitting, and others are needed to fully explain team success. This analysis uses team-level statistics, not individual player data. It cannot identify the players who increase their team's OBP, and it does not account for roster changes and lineup construction. The results describe correlation rather than causation, and the linear modeling approach assumes a simplified relationship between OBP and winning percentage.


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Patrick Mannion
Semester: Spring 2026