Introduction

The Oakland Athletics’ 2002 season, chronicled in Michael Lewis’ book “Moneyball: The Art of Winning an Unfair Game,” stands as a testament to the revolutionary approach taken by the team’s front office in the face of financial constraints. Conventional baseball wisdom often relied on subjective assessments of player abilities, emphasizing outdated statistics such as batting average and runs batted in. However, the A’s challenged this narrative by embracing a data-centric methodology that prioritized overlooked statistical metrics, particularly on-base percentage and slugging percentage.

This project delves into the essence of Moneyball, seeking to harness the power of statistical analysis and advanced metrics to identify undervalued players who can potentially replace the lost talents of Jason Giambi, Johnny Damon, and Jason Isringhausen. By leveraging historical baseball data and applying analytical techniques in R, this project aims to unveil players whose performance might have been undervalued by the traditional scouting and player assessment methods, aligning with the Moneyball philosophy of fielding a competitive team despite financial constraints.

Through this exploration, we seek to exemplify the transformative potential of data-driven decision-making in sports and showcase the practical application of statistical analysis in redefining player evaluation and team composition within Major League Baseball.

Objective:

The objective of this project is to replicate the principles of Moneyball, as outlined in Michael Lewis’ book, by employing rigorous statistical analysis and data-driven decision-making within the realm of baseball. Specifically, we aim to identify undervalued players to potentially replace the key players lost by the Oakland Athletics during the 2001-02 offseason (Jason Giambi, Johnny Damon, and Jason Isringhausen) using statistical metrics to redefine player value in Major League Baseball (MLB).

Analysis

# installing packages
install.packages("tidyverse", repos = "https:\\cran.rstudio.com")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# loading the data
batting_mb <- read.csv("F:\\R practice\\R Projects\\Project 4\\Money Ball Project\\Batting.csv")

# checking head() and str()
head(batting_mb)
##    playerID yearID stint teamID lgID  G G_batting AB R H X2B X3B HR RBI SB CS
## 1 aardsda01   2004     1    SFN   NL 11        11  0 0 0   0   0  0   0  0  0
## 2 aardsda01   2006     1    CHN   NL 45        43  2 0 0   0   0  0   0  0  0
## 3 aardsda01   2007     1    CHA   AL 25         2  0 0 0   0   0  0   0  0  0
## 4 aardsda01   2008     1    BOS   AL 47         5  1 0 0   0   0  0   0  0  0
## 5 aardsda01   2009     1    SEA   AL 73         3  0 0 0   0   0  0   0  0  0
## 6 aardsda01   2010     1    SEA   AL 53         4  0 0 0   0   0  0   0  0  0
##   BB SO IBB HBP SH SF GIDP G_old
## 1  0  0   0   0  0  0    0    11
## 2  0  0   0   0  1  0    0    45
## 3  0  0   0   0  0  0    0     2
## 4  0  1   0   0  0  0    0     5
## 5  0  0   0   0  0  0    0    NA
## 6  0  0   0   0  0  0    0    NA
str(batting_mb)
## 'data.frame':    97889 obs. of  24 variables:
##  $ playerID : chr  "aardsda01" "aardsda01" "aardsda01" "aardsda01" ...
##  $ yearID   : int  2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
##  $ stint    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID   : chr  "SFN" "CHN" "CHA" "BOS" ...
##  $ lgID     : chr  "NL" "NL" "AL" "AL" ...
##  $ G        : int  11 45 25 47 73 53 1 122 153 153 ...
##  $ G_batting: int  11 43 2 5 3 4 NA 122 153 153 ...
##  $ AB       : int  0 2 0 1 0 0 NA 468 602 609 ...
##  $ R        : int  0 0 0 0 0 0 NA 58 105 106 ...
##  $ H        : int  0 0 0 0 0 0 NA 131 189 200 ...
##  $ X2B      : int  0 0 0 0 0 0 NA 27 37 34 ...
##  $ X3B      : int  0 0 0 0 0 0 NA 6 9 14 ...
##  $ HR       : int  0 0 0 0 0 0 NA 13 27 26 ...
##  $ RBI      : int  0 0 0 0 0 0 NA 69 106 92 ...
##  $ SB       : int  0 0 0 0 0 0 NA 2 3 2 ...
##  $ CS       : int  0 0 0 0 0 0 NA 2 1 4 ...
##  $ BB       : int  0 0 0 0 0 0 NA 28 49 37 ...
##  $ SO       : int  0 0 0 1 0 0 NA 39 61 54 ...
##  $ IBB      : int  0 0 0 0 0 0 NA NA 5 6 ...
##  $ HBP      : int  0 0 0 0 0 0 NA 3 3 2 ...
##  $ SH       : int  0 1 0 0 0 0 NA 6 7 5 ...
##  $ SF       : int  0 0 0 0 0 0 NA 4 4 7 ...
##  $ GIDP     : int  0 0 0 0 0 0 NA 13 20 21 ...
##  $ G_old    : int  11 45 2 5 NA NA NA 122 153 153 ...
# Call the head() of the first five rows of AB (At Bats) column
head(batting_mb$AB, 5)
## [1] 0 2 0 1 0
# Call the head of the doubles (X2B) column
head(batting_mb$X2B)
## [1] 0 0 0 0 0 0

We need to add four more statistics that were used in Moneyball! These are:

  1. Batting Average

  2. On Base Percentage

  3. 1B (Singles)

  4. Slugging Percentage

# creating new columns
batting_mb$BA <- batting_mb$H / batting_mb$AB #Batting Average

batting_mb <- batting_mb %>% 
  mutate(OBG = (H+BB+HBP) / (AB+BB+HBP+SF)) #On-base percentage
batting_mb <- batting_mb %>% rename("OBP" = "OBG")

batting_mb <- batting_mb %>%  #1B (Singles)
  mutate(X1B = H-X2B-X3B-HR)

batting_mb <- batting_mb %>% 
  mutate(SLG = (X1B) + (2*X2B) + (3*X3B) + (4*HR) / AB) #Slugging Percentage

colnames(batting_mb)
##  [1] "playerID"  "yearID"    "stint"     "teamID"    "lgID"      "G"        
##  [7] "G_batting" "AB"        "R"         "H"         "X2B"       "X3B"      
## [13] "HR"        "RBI"       "SB"        "CS"        "BB"        "SO"       
## [19] "IBB"       "HBP"       "SH"        "SF"        "GIDP"      "G_old"    
## [25] "BA"        "OBP"       "X1B"       "SLG"

Merging Salary Data with Batting Data

We know we don’t just want the best players, we want the most undervalued players, meaning we will also need to know current salary information! We have salary information in the csv file ‘Salaries.csv’.

# Merging Salary Data with Batting Data
sal <- read.csv("F:\\R practice\\R Projects\\Project 4\\Money Ball Project\\Salaries.csv")

# Use summary to get a summary of the batting data frame and notice the minimum year 
# in the yearID column
summary(batting_mb$yearID) #1871 is the minimum year
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1871    1931    1970    1962    1995    2013
summary(sal$yearID) #1985 is the minimum year
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1985    1993    1999    1999    2006    2013
# to merge we need to filter out years less than 1985 from batting dataset
batting_mb_adj <- batting_mb %>% 
  subset(yearID >= 1985)

summary(batting_mb_adj$yearID)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1985    1993    2000    2000    2007    2013
# merging
combo <- merge(batting_mb_adj, sal, by =  c('playerID','yearID'))

summary(combo)
##    playerID             yearID         stint         teamID.x        
##  Length:25397       Min.   :1985   Min.   :1.000   Length:25397      
##  Class :character   1st Qu.:1993   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :1999   Median :1.000   Mode  :character  
##                     Mean   :1999   Mean   :1.098                     
##                     3rd Qu.:2006   3rd Qu.:1.000                     
##                     Max.   :2013   Max.   :4.000                     
##                                                                      
##     lgID.x                G            G_batting            AB       
##  Length:25397       Min.   :  1.00   Min.   :  0.00   Min.   :  0.0  
##  Class :character   1st Qu.: 26.00   1st Qu.:  8.00   1st Qu.:  5.0  
##  Mode  :character   Median : 50.00   Median : 42.00   Median : 85.0  
##                     Mean   : 64.06   Mean   : 57.58   Mean   :182.4  
##                     3rd Qu.:101.00   3rd Qu.:101.00   3rd Qu.:336.0  
##                     Max.   :163.00   Max.   :163.00   Max.   :716.0  
##                                      NA's   :906      NA's   :2661   
##        R                H               X2B              X3B        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  0.00   1st Qu.:  1.00   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median :  9.00   Median : 19.00   Median : 3.000   Median : 0.000  
##  Mean   : 24.71   Mean   : 48.18   Mean   : 9.276   Mean   : 1.033  
##  3rd Qu.: 43.00   3rd Qu.: 87.25   3rd Qu.:16.000   3rd Qu.: 1.000  
##  Max.   :152.00   Max.   :262.00   Max.   :59.000   Max.   :23.000  
##  NA's   :2661     NA's   :2661     NA's   :2661     NA's   :2661    
##        HR              RBI               SB                CS       
##  Min.   : 0.000   Min.   :  0.00   Min.   :  0.000   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.: 0.00  
##  Median : 1.000   Median :  8.00   Median :  0.000   Median : 0.00  
##  Mean   : 5.369   Mean   : 23.56   Mean   :  3.568   Mean   : 1.54  
##  3rd Qu.: 7.000   3rd Qu.: 39.00   3rd Qu.:  3.000   3rd Qu.: 2.00  
##  Max.   :73.000   Max.   :165.00   Max.   :110.000   Max.   :29.00  
##  NA's   :2661     NA's   :2661     NA's   :2661      NA's   :2661   
##        BB               SO              IBB               HBP        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.:  0.00   1st Qu.:  2.00   1st Qu.:  0.000   1st Qu.: 0.000  
##  Median :  6.00   Median : 20.00   Median :  0.000   Median : 0.000  
##  Mean   : 17.98   Mean   : 33.52   Mean   :  1.533   Mean   : 1.614  
##  3rd Qu.: 29.00   3rd Qu.: 55.00   3rd Qu.:  2.000   3rd Qu.: 2.000  
##  Max.   :232.00   Max.   :223.00   Max.   :120.000   Max.   :35.000  
##  NA's   :2661     NA's   :2661     NA's   :2662      NA's   :2670    
##        SH               SF              GIDP            G_old       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 20.00  
##  Median : 0.000   Median : 0.000   Median : 2.000   Median : 47.00  
##  Mean   : 1.786   Mean   : 1.554   Mean   : 4.127   Mean   : 61.43  
##  3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 7.000   3rd Qu.:101.00  
##  Max.   :39.000   Max.   :17.000   Max.   :35.000   Max.   :163.00  
##  NA's   :2661     NA's   :2662     NA's   :2661     NA's   :3414    
##        BA             OBP             X1B             SLG        
##  Min.   :0.000   Min.   :0.000   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.:0.160   1st Qu.:0.208   1st Qu.:  0.0   1st Qu.:  4.00  
##  Median :0.242   Median :0.305   Median : 13.0   Median : 37.15  
##  Mean   :0.212   Mean   :0.270   Mean   : 32.5   Mean   : 62.32  
##  3rd Qu.:0.276   3rd Qu.:0.346   3rd Qu.: 59.0   3rd Qu.:112.05  
##  Max.   :1.000   Max.   :1.000   Max.   :225.0   Max.   :291.05  
##  NA's   :5618    NA's   :5562    NA's   :2661    NA's   :5618    
##    teamID.y            lgID.y              salary        
##  Length:25397       Length:25397       Min.   :       0  
##  Class :character   Class :character   1st Qu.:  255000  
##  Mode  :character   Mode  :character   Median :  550000  
##                                        Mean   : 1879256  
##                                        3rd Qu.: 2150000  
##                                        Max.   :33000000  
## 

Analyzing the Lost Players

As previously mentioned, the Oakland A’s lost 3 key players during the off-season. We’ll want to get their stats to see what we have to replace. The players lost were: first baseman 2000 AL MVP Jason Giambi (giambja01) to the New York Yankees, outfielder Johnny Damon (damonjo01) to the Boston Red Sox and infielder Rainer Gustavo “Ray” Olmedo (‘saenzol01’).

# Create a data frame called lost_players from the combo data frame consisting of these 3 players
lp <- c ("giambja01" , "damonjo01", "saenzol01")

lost_players <- combo %>% 
  subset(playerID %in% lp)

# Since all these players were lost in after 2001 in the off-season, let's only 
# concern ourselves with the data from 2001
lost_players <- lost_players %>% 
  subset(yearID == 2001)

# Reduce the lost_players data frame to the following columns:
# playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB
lost_players <- lost_players %>% 
  select(playerID, H, X2B, X3B, HR, OBP, SLG, BA, AB)

Replacement Players

Now we have all the information we need! Now we need to find Replacement Players for the key three players we lost! However, you have three constraints:

- The total combined salary of the three players can not exceed 15 million              dollars.

- Their combined number of At Bats (AB) needs to be equal to or greater than the        lost players.

- Their mean OBP had to equal to or greater than the mean OBP of the lost players
combo_without_lp <- combo %>% 
  subset(!(playerID %in% lp)) %>% 
  filter(yearID == 2001)

ggplot(combo_without_lp, aes(x = OBP, y = salary)) +
  geom_point()

Looks like there is no point in paying above 8 million or so (I’m just eyeballing this number). I’ll choose that as a cutt off point. There are also a lot of players with OBP==0. Let’s get rid of them too.

combo_without_lp <- filter(combo_without_lp, salary < 8000000, OBP > 0)


sum(lost_players$AB) # 1469
## [1] 1469
mean(lost_players$OBP) # 0.363
## [1] 0.3638687
# The total AB of the lost players is 1469. This is about 1500, meaning I should probably cut off my avail.players at 1500/3= 500 AB.
combo_without_lp <- filter(combo_without_lp, AB >= 500)

# Now let's sort by OBP and see what we've got!
shortlist <- head(arrange(combo_without_lp, desc(OBP)), 10)

shortlist <- shortlist %>% 
  select(playerID, salary, AB, OBP)

selected_players <- shortlist[2:4, ]

print(selected_players)
##    playerID  salary  AB       OBP
## 2 berkmla01  305000 577 0.4302326
## 3 gonzalu01 4833333 609 0.4285714
## 4 thomeji01 7875000 526 0.4161491

Conclusion: