The Oakland Athletics’ 2002 season, chronicled in Michael Lewis’ book “Moneyball: The Art of Winning an Unfair Game,” stands as a testament to the revolutionary approach taken by the team’s front office in the face of financial constraints. Conventional baseball wisdom often relied on subjective assessments of player abilities, emphasizing outdated statistics such as batting average and runs batted in. However, the A’s challenged this narrative by embracing a data-centric methodology that prioritized overlooked statistical metrics, particularly on-base percentage and slugging percentage.
This project delves into the essence of Moneyball, seeking to harness the power of statistical analysis and advanced metrics to identify undervalued players who can potentially replace the lost talents of Jason Giambi, Johnny Damon, and Jason Isringhausen. By leveraging historical baseball data and applying analytical techniques in R, this project aims to unveil players whose performance might have been undervalued by the traditional scouting and player assessment methods, aligning with the Moneyball philosophy of fielding a competitive team despite financial constraints.
Through this exploration, we seek to exemplify the transformative potential of data-driven decision-making in sports and showcase the practical application of statistical analysis in redefining player evaluation and team composition within Major League Baseball.
The objective of this project is to replicate the principles of Moneyball, as outlined in Michael Lewis’ book, by employing rigorous statistical analysis and data-driven decision-making within the realm of baseball. Specifically, we aim to identify undervalued players to potentially replace the key players lost by the Oakland Athletics during the 2001-02 offseason (Jason Giambi, Johnny Damon, and Jason Isringhausen) using statistical metrics to redefine player value in Major League Baseball (MLB).
# installing packages
install.packages("tidyverse", repos = "https:\\cran.rstudio.com")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# loading the data
batting_mb <- read.csv("F:\\R practice\\R Projects\\Project 4\\Money Ball Project\\Batting.csv")
# checking head() and str()
head(batting_mb)
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR RBI SB CS
## 1 aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0 0 0 0 0
## 2 aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0 0 0 0 0
## 3 aardsda01 2007 1 CHA AL 25 2 0 0 0 0 0 0 0 0 0
## 4 aardsda01 2008 1 BOS AL 47 5 1 0 0 0 0 0 0 0 0
## 5 aardsda01 2009 1 SEA AL 73 3 0 0 0 0 0 0 0 0 0
## 6 aardsda01 2010 1 SEA AL 53 4 0 0 0 0 0 0 0 0 0
## BB SO IBB HBP SH SF GIDP G_old
## 1 0 0 0 0 0 0 0 11
## 2 0 0 0 0 1 0 0 45
## 3 0 0 0 0 0 0 0 2
## 4 0 1 0 0 0 0 0 5
## 5 0 0 0 0 0 0 0 NA
## 6 0 0 0 0 0 0 0 NA
str(batting_mb)
## 'data.frame': 97889 obs. of 24 variables:
## $ playerID : chr "aardsda01" "aardsda01" "aardsda01" "aardsda01" ...
## $ yearID : int 2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : chr "SFN" "CHN" "CHA" "BOS" ...
## $ lgID : chr "NL" "NL" "AL" "AL" ...
## $ G : int 11 45 25 47 73 53 1 122 153 153 ...
## $ G_batting: int 11 43 2 5 3 4 NA 122 153 153 ...
## $ AB : int 0 2 0 1 0 0 NA 468 602 609 ...
## $ R : int 0 0 0 0 0 0 NA 58 105 106 ...
## $ H : int 0 0 0 0 0 0 NA 131 189 200 ...
## $ X2B : int 0 0 0 0 0 0 NA 27 37 34 ...
## $ X3B : int 0 0 0 0 0 0 NA 6 9 14 ...
## $ HR : int 0 0 0 0 0 0 NA 13 27 26 ...
## $ RBI : int 0 0 0 0 0 0 NA 69 106 92 ...
## $ SB : int 0 0 0 0 0 0 NA 2 3 2 ...
## $ CS : int 0 0 0 0 0 0 NA 2 1 4 ...
## $ BB : int 0 0 0 0 0 0 NA 28 49 37 ...
## $ SO : int 0 0 0 1 0 0 NA 39 61 54 ...
## $ IBB : int 0 0 0 0 0 0 NA NA 5 6 ...
## $ HBP : int 0 0 0 0 0 0 NA 3 3 2 ...
## $ SH : int 0 1 0 0 0 0 NA 6 7 5 ...
## $ SF : int 0 0 0 0 0 0 NA 4 4 7 ...
## $ GIDP : int 0 0 0 0 0 0 NA 13 20 21 ...
## $ G_old : int 11 45 2 5 NA NA NA 122 153 153 ...
# Call the head() of the first five rows of AB (At Bats) column
head(batting_mb$AB, 5)
## [1] 0 2 0 1 0
# Call the head of the doubles (X2B) column
head(batting_mb$X2B)
## [1] 0 0 0 0 0 0
We need to add four more statistics that were used in Moneyball! These are:
Batting Average
On Base Percentage
1B (Singles)
Slugging Percentage
# creating new columns
batting_mb$BA <- batting_mb$H / batting_mb$AB #Batting Average
batting_mb <- batting_mb %>%
mutate(OBG = (H+BB+HBP) / (AB+BB+HBP+SF)) #On-base percentage
batting_mb <- batting_mb %>% rename("OBP" = "OBG")
batting_mb <- batting_mb %>% #1B (Singles)
mutate(X1B = H-X2B-X3B-HR)
batting_mb <- batting_mb %>%
mutate(SLG = (X1B) + (2*X2B) + (3*X3B) + (4*HR) / AB) #Slugging Percentage
colnames(batting_mb)
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
## [7] "G_batting" "AB" "R" "H" "X2B" "X3B"
## [13] "HR" "RBI" "SB" "CS" "BB" "SO"
## [19] "IBB" "HBP" "SH" "SF" "GIDP" "G_old"
## [25] "BA" "OBP" "X1B" "SLG"
We know we don’t just want the best players, we want the most undervalued players, meaning we will also need to know current salary information! We have salary information in the csv file ‘Salaries.csv’.
# Merging Salary Data with Batting Data
sal <- read.csv("F:\\R practice\\R Projects\\Project 4\\Money Ball Project\\Salaries.csv")
# Use summary to get a summary of the batting data frame and notice the minimum year
# in the yearID column
summary(batting_mb$yearID) #1871 is the minimum year
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1871 1931 1970 1962 1995 2013
summary(sal$yearID) #1985 is the minimum year
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1985 1993 1999 1999 2006 2013
# to merge we need to filter out years less than 1985 from batting dataset
batting_mb_adj <- batting_mb %>%
subset(yearID >= 1985)
summary(batting_mb_adj$yearID)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1985 1993 2000 2000 2007 2013
# merging
combo <- merge(batting_mb_adj, sal, by = c('playerID','yearID'))
summary(combo)
## playerID yearID stint teamID.x
## Length:25397 Min. :1985 Min. :1.000 Length:25397
## Class :character 1st Qu.:1993 1st Qu.:1.000 Class :character
## Mode :character Median :1999 Median :1.000 Mode :character
## Mean :1999 Mean :1.098
## 3rd Qu.:2006 3rd Qu.:1.000
## Max. :2013 Max. :4.000
##
## lgID.x G G_batting AB
## Length:25397 Min. : 1.00 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 26.00 1st Qu.: 8.00 1st Qu.: 5.0
## Mode :character Median : 50.00 Median : 42.00 Median : 85.0
## Mean : 64.06 Mean : 57.58 Mean :182.4
## 3rd Qu.:101.00 3rd Qu.:101.00 3rd Qu.:336.0
## Max. :163.00 Max. :163.00 Max. :716.0
## NA's :906 NA's :2661
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 9.00 Median : 19.00 Median : 3.000 Median : 0.000
## Mean : 24.71 Mean : 48.18 Mean : 9.276 Mean : 1.033
## 3rd Qu.: 43.00 3rd Qu.: 87.25 3rd Qu.:16.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :2661 NA's :2661 NA's :2661 NA's :2661
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 1.000 Median : 8.00 Median : 0.000 Median : 0.00
## Mean : 5.369 Mean : 23.56 Mean : 3.568 Mean : 1.54
## 3rd Qu.: 7.000 3rd Qu.: 39.00 3rd Qu.: 3.000 3rd Qu.: 2.00
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.00
## NA's :2661 NA's :2661 NA's :2661 NA's :2661
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 6.00 Median : 20.00 Median : 0.000 Median : 0.000
## Mean : 17.98 Mean : 33.52 Mean : 1.533 Mean : 1.614
## 3rd Qu.: 29.00 3rd Qu.: 55.00 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :232.00 Max. :223.00 Max. :120.000 Max. :35.000
## NA's :2661 NA's :2661 NA's :2662 NA's :2670
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 20.00
## Median : 0.000 Median : 0.000 Median : 2.000 Median : 47.00
## Mean : 1.786 Mean : 1.554 Mean : 4.127 Mean : 61.43
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 7.000 3rd Qu.:101.00
## Max. :39.000 Max. :17.000 Max. :35.000 Max. :163.00
## NA's :2661 NA's :2662 NA's :2661 NA's :3414
## BA OBP X1B SLG
## Min. :0.000 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:0.160 1st Qu.:0.208 1st Qu.: 0.0 1st Qu.: 4.00
## Median :0.242 Median :0.305 Median : 13.0 Median : 37.15
## Mean :0.212 Mean :0.270 Mean : 32.5 Mean : 62.32
## 3rd Qu.:0.276 3rd Qu.:0.346 3rd Qu.: 59.0 3rd Qu.:112.05
## Max. :1.000 Max. :1.000 Max. :225.0 Max. :291.05
## NA's :5618 NA's :5562 NA's :2661 NA's :5618
## teamID.y lgID.y salary
## Length:25397 Length:25397 Min. : 0
## Class :character Class :character 1st Qu.: 255000
## Mode :character Mode :character Median : 550000
## Mean : 1879256
## 3rd Qu.: 2150000
## Max. :33000000
##
As previously mentioned, the Oakland A’s lost 3 key players during the off-season. We’ll want to get their stats to see what we have to replace. The players lost were: first baseman 2000 AL MVP Jason Giambi (giambja01) to the New York Yankees, outfielder Johnny Damon (damonjo01) to the Boston Red Sox and infielder Rainer Gustavo “Ray” Olmedo (‘saenzol01’).
# Create a data frame called lost_players from the combo data frame consisting of these 3 players
lp <- c ("giambja01" , "damonjo01", "saenzol01")
lost_players <- combo %>%
subset(playerID %in% lp)
# Since all these players were lost in after 2001 in the off-season, let's only
# concern ourselves with the data from 2001
lost_players <- lost_players %>%
subset(yearID == 2001)
# Reduce the lost_players data frame to the following columns:
# playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB
lost_players <- lost_players %>%
select(playerID, H, X2B, X3B, HR, OBP, SLG, BA, AB)
Now we have all the information we need! Now we need to find Replacement Players for the key three players we lost! However, you have three constraints:
- The total combined salary of the three players can not exceed 15 million dollars.
- Their combined number of At Bats (AB) needs to be equal to or greater than the lost players.
- Their mean OBP had to equal to or greater than the mean OBP of the lost players
combo_without_lp <- combo %>%
subset(!(playerID %in% lp)) %>%
filter(yearID == 2001)
ggplot(combo_without_lp, aes(x = OBP, y = salary)) +
geom_point()
Looks like there is no point in paying above 8 million or so (I’m just eyeballing this number). I’ll choose that as a cutt off point. There are also a lot of players with OBP==0. Let’s get rid of them too.
combo_without_lp <- filter(combo_without_lp, salary < 8000000, OBP > 0)
sum(lost_players$AB) # 1469
## [1] 1469
mean(lost_players$OBP) # 0.363
## [1] 0.3638687
# The total AB of the lost players is 1469. This is about 1500, meaning I should probably cut off my avail.players at 1500/3= 500 AB.
combo_without_lp <- filter(combo_without_lp, AB >= 500)
# Now let's sort by OBP and see what we've got!
shortlist <- head(arrange(combo_without_lp, desc(OBP)), 10)
shortlist <- shortlist %>%
select(playerID, salary, AB, OBP)
selected_players <- shortlist[2:4, ]
print(selected_players)
## playerID salary AB OBP
## 2 berkmla01 305000 577 0.4302326
## 3 gonzalu01 4833333 609 0.4285714
## 4 thomeji01 7875000 526 0.4161491
This project, a vital component of my Udemy Data Science certification with R course, replicates the Moneyball principles elucidated by Michael Lewis.
Aimed at identifying undervalued baseball players, the project scrutinized historical data using statistical analysis in R.
The objective was to find replacement players for the Oakland Athletics lost during the 2001-02 offseason, considering constraints such as salary, At Bats (AB), and On Base Percentage (OBP).
Successfully identified berkmla01, gonzalu01, and thomeji01 as potential replacements meeting salary limits, AB criteria, and surpassing the mean OBP of the departed players.
Demonstrates the practical application of data science methodologies in redefining player evaluation and strategic decision-making in professional sports, aligning with the transformative spirit of Moneyball.