This is a project inspired by the book MoneyBall, here is a wiki of it: Moneyball. The data can be found here
During the 01-02 offseason the Oakland A’s lost three key players to teams with larger revenues.
The goal of this project is to look at the data of players and their salaries for those years and try to find players of the same calibre (statistically atleast) that have been undervalued by the market and thus, are suitable low salary replacements.
library(ggplot2)
library(dplyr)
batting <- read.csv('Batting.csv')
sal <- read.csv('Salaries.csv')
head(batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 aardsda01 2004 1 SFN NL 11 0 0 0 0 0 0 0 0 0
## 2 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2
## 3 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1
## 4 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4
## 5 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1
## 6 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1
## BB SO IBB HBP SH SF GIDP
## 1 0 0 0 0 0 0 0
## 2 28 39 NA 3 6 4 13
## 3 49 61 5 3 7 4 20
## 4 37 54 6 2 5 7 21
## 5 57 58 15 0 0 3 13
## 6 59 49 16 1 0 3 21
The type of the variables are the following:
str(batting)
## 'data.frame': 85978 obs. of 22 variables:
## $ playerID: Factor w/ 16208 levels "aardsda01","aaronha01",..: 1 2 2 2 2 2 2 2 2 2 ...
## $ yearID : int 2004 1954 1955 1956 1957 1958 1959 1960 1961 1962 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 148 levels "ALT","ANA","ARI",..: 116 79 79 79 79 79 79 79 79 79 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ G : int 11 122 153 153 151 153 154 153 155 156 ...
## $ AB : int 0 468 602 609 615 601 629 590 603 592 ...
## $ R : int 0 58 105 106 118 109 116 102 115 127 ...
## $ H : int 0 131 189 200 198 196 223 172 197 191 ...
## $ X2B : int 0 27 37 34 27 34 46 20 39 28 ...
## $ X3B : int 0 6 9 14 6 4 7 11 10 6 ...
## $ HR : int 0 13 27 26 44 30 39 40 34 45 ...
## $ RBI : int 0 69 106 92 132 95 123 126 120 128 ...
## $ SB : int 0 2 3 2 1 4 8 16 21 15 ...
## $ CS : int 0 2 1 4 1 1 0 7 9 7 ...
## $ BB : int 0 28 49 37 57 59 51 60 56 66 ...
## $ SO : int 0 39 61 54 58 49 54 63 64 73 ...
## $ IBB : int 0 NA 5 6 15 16 17 13 20 14 ...
## $ HBP : int 0 3 3 2 0 1 4 2 2 3 ...
## $ SH : int 0 6 7 5 0 0 0 0 1 0 ...
## $ SF : int 0 4 4 7 3 3 9 12 9 6 ...
## $ GIDP : int 0 13 20 21 13 21 19 8 16 14 ...
str(sal)
## 'data.frame': 16447 obs. of 5 variables:
## $ yearID : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ teamID : Factor w/ 32 levels "ANA","ARI","ATL",..: 27 27 27 28 28 28 28 28 28 28 ...
## $ lgID : Factor w/ 2 levels "AL","NL": 1 1 1 2 2 2 2 2 2 2 ...
## $ playerID: Factor w/ 3288 levels "aardsda01","aasedo01",..: 3044 3270 3282 269 335 361 691 701 791 1017 ...
## $ salary : num 315000 159000 272500 250000 385000 ...
There are three more statistics used in the book MoneyBall that are not present in the data. But they can be calculated from other features.
The statistics are the following:
The wiki articles gives us their formulas:
First, the Batting Average is calculated by dividing Hits by “At Bat”:
batting$BA <- batting$H / batting$AB
On Base Percentage (OBP):
batting <- batting %>%
mutate(OBP = (H+BB+HBP)/(AB+BB+HBP+SF))
For the Slugging Percentage, we’ll need the 1B (singles), which we’ll have to calculate by subtracting doubles, triples and home runs from the total hits.
batting <- batting %>%
mutate(X1B = H - X2B - X3B - HR)
Then can calculate the slugging percentage (SLG):
batting <- batting %>%
mutate(SLG = (X1B + (2*X2B) + (3*X3B) + (4*HR))/AB)
We will now merge the datasets, Batting for statistics and Salaries for the players salaries. But since the salary data only starts from 1985, while the Batting data starts in 1871, we need to cut the years before that from the Batting data:
c(min(sal$yearID), min(batting$yearID))
## [1] 1985 1871
Removing data before 1985 and merging the two datasets:
batting <- subset(batting, yearID >= 1985)
combo <- merge(batting,sal,by=c('playerID','yearID'))
Above we mentioned that the A’s lost 3 key players, they were (with PlayerID):
To find fitting replacements we have to look at their statistics to see what the benchmark is that we are looking for:
lost_players <- subset(combo,playerID %in% c('giambja01','damonjo01','isrinja01'))
The A’s lost those players in 2001, so we limit the data to 2001 and look at the required columns:
lost_players <- lost_players %>%
subset(yearID==2001) %>%
select(playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB)
lost_players
## playerID H X2B X3B HR OBP SLG BA AB
## 3486 damonjo01 165 34 4 9 0.3235294 0.3633540 0.2562112 644
## 5373 giambja01 178 47 2 38 0.4769001 0.6596154 0.3423077 520
## 7543 isrinja01 67 21 1 9 0.2911765 0.3836066 0.2196721 305
Limiting the entire data to 2001:
combo_2001 <- subset(combo,yearID==2001)
When we look for our new players the constraints in our search are:
sum(lost_players$AB)
## [1] 1469
mean(lost_players$OBP)
## [1] 0.3638687
So, combined AB should be greater than or equal to 1469 and the mean OBP should be greater than or equal to 0.364
Let’s look at salaries and OBP to get a feel of what we have:
ggplot(combo_2001,aes(x=OBP,y=salary)) + geom_point()
## Warning: Removed 167 rows containing missing values (geom_point).
A reasonable threshold for salary seems to be around 8 million, while OBP has to be atleast more than 0.
combo_2001 <- subset(combo_2001, salary < 8000000 & OBP > 0)
The average AB for the lost players:
mean(lost_players$AB)
## [1] 489.6667
So we should be looking to select players with AB around 489, but let’s put it a little lower as we will have players with a higher AB as well.
combo_2001 <- subset(combo_2001, AB > 450)
Let’s arrange our filtered players in descending order based on their AB, and pick the top 10 as our options.
options <- head(arrange(combo_2001,desc(OBP)),10)
select(options, playerID,AB,salary,OBP)
## playerID AB salary OBP
## 1 giambja01 520 4103333 0.4769001
## 2 heltoto01 587 4950000 0.4316547
## 3 berkmla01 577 305000 0.4302326
## 4 gonzalu01 609 4833333 0.4285714
## 5 martied01 470 5500000 0.4234079
## 6 thomeji01 526 7875000 0.4161491
## 7 alomaro01 575 7750000 0.4146707
## 8 edmonji01 500 6333333 0.4102142
## 9 gilesbr02 576 7333333 0.4035608
## 10 pujolal01 590 200000 0.4029630
Now, Jason Giambi will have to be ignored, but we can look at different combinations of these top players.
For example a possible trio could be the top 3 (not regarding Jason Giambi).
new_players <- subset(combo_2001,playerID %in% c('heltoto01','berkmla01','gonzalu01'))
Combined salary:
sum(new_players$salary)
## [1] 10088333
Total “At Bat”:
sum(new_players$AB)
## [1] 1773
Mean OBP:
mean(new_players$OBP)
## [1] 0.4301529