During the 2001-02 offseason, the Oakland A’s team lost three key players to teams with larger revenues.
The goal of this project is to look at player and salary data for those years, to find players of the same calibre (statistically) who have been undervalued by the market and thus, are suitable low salary replacements.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
batting <- read.csv('Batting.csv')
sal <- read.csv('Salaries.csv')
head(batting)
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR RBI SB
## 1 aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0 0 0 0
## 2 aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0 0 0 0
## 3 aardsda01 2007 1 CHA AL 25 2 0 0 0 0 0 0 0 0
## 4 aardsda01 2008 1 BOS AL 47 5 1 0 0 0 0 0 0 0
## 5 aardsda01 2009 1 SEA AL 73 3 0 0 0 0 0 0 0 0
## 6 aardsda01 2010 1 SEA AL 53 4 0 0 0 0 0 0 0 0
## CS BB SO IBB HBP SH SF GIDP G_old
## 1 0 0 0 0 0 0 0 0 11
## 2 0 0 0 0 0 1 0 0 45
## 3 0 0 0 0 0 0 0 0 2
## 4 0 0 1 0 0 0 0 0 5
## 5 0 0 0 0 0 0 0 0 NA
## 6 0 0 0 0 0 0 0 0 NA
Checking out the type of variables:
glimpse(batting)
## Observations: 95,195
## Variables: 24
## $ playerID <fct> aardsda01, aardsda01, aardsda01, aardsda01, aardsda0...
## $ yearID <int> 2004, 2006, 2007, 2008, 2009, 2010, 1954, 1955, 1956...
## $ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ teamID <fct> SFN, CHN, CHA, BOS, SEA, SEA, ML1, ML1, ML1, ML1, ML...
## $ lgID <fct> NL, NL, AL, AL, AL, AL, NL, NL, NL, NL, NL, NL, NL, ...
## $ G <int> 11, 45, 25, 47, 73, 53, 122, 153, 153, 151, 153, 154...
## $ G_batting <int> 11, 43, 2, 5, 3, 4, 122, 153, 153, 151, 153, 154, 15...
## $ AB <int> 0, 2, 0, 1, 0, 0, 468, 602, 609, 615, 601, 629, 590,...
## $ R <int> 0, 0, 0, 0, 0, 0, 58, 105, 106, 118, 109, 116, 102, ...
## $ H <int> 0, 0, 0, 0, 0, 0, 131, 189, 200, 198, 196, 223, 172,...
## $ X2B <int> 0, 0, 0, 0, 0, 0, 27, 37, 34, 27, 34, 46, 20, 39, 28...
## $ X3B <int> 0, 0, 0, 0, 0, 0, 6, 9, 14, 6, 4, 7, 11, 10, 6, 4, 2...
## $ HR <int> 0, 0, 0, 0, 0, 0, 13, 27, 26, 44, 30, 39, 40, 34, 45...
## $ RBI <int> 0, 0, 0, 0, 0, 0, 69, 106, 92, 132, 95, 123, 126, 12...
## $ SB <int> 0, 0, 0, 0, 0, 0, 2, 3, 2, 1, 4, 8, 16, 21, 15, 31, ...
## $ CS <int> 0, 0, 0, 0, 0, 0, 2, 1, 4, 1, 1, 0, 7, 9, 7, 5, 4, 4...
## $ BB <int> 0, 0, 0, 0, 0, 0, 28, 49, 37, 57, 59, 51, 60, 56, 66...
## $ SO <int> 0, 0, 0, 1, 0, 0, 39, 61, 54, 58, 49, 54, 63, 64, 73...
## $ IBB <int> 0, 0, 0, 0, 0, 0, NA, 5, 6, 15, 16, 17, 13, 20, 14, ...
## $ HBP <int> 0, 0, 0, 0, 0, 0, 3, 3, 2, 0, 1, 4, 2, 2, 3, 0, 0, 1...
## $ SH <int> 0, 1, 0, 0, 0, 0, 6, 7, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ SF <int> 0, 0, 0, 0, 0, 0, 4, 4, 7, 3, 3, 9, 12, 9, 6, 5, 2, ...
## $ GIDP <int> 0, 0, 0, 0, 0, 0, 13, 20, 21, 13, 21, 19, 8, 16, 14,...
## $ G_old <int> 11, 45, 2, 5, NA, NA, 122, 153, 153, 151, 153, 154, ...
There are three more statistics that were used in Moneyball, that we don’t have here. So we’ll have to calculate them from the data we currently have.
The statistics are:
batting$BA <- batting$H / batting$AB
tail(batting$BA,5)
## [1] NaN 0.1839080 0.3125000 0.2864964 0.1132075
We can find the formula for On Base Percentage on the wikipedia page linked above.
batting <- batting %>%
mutate(OBP = (H+BB+HBP)/(AB+BB+HBP+SF))
For the Slugging Percentage, we’ll need the 1B (singles), which we’ll have to calculate by subtracting doubles, triples and home runs from the total hits.
Then we’ll calculate the slugging percentage using the formula in the wikipedia article.
batting <- batting %>%
mutate(X1B = H - X2B - X3B - HR)
batting <- batting %>%
mutate(SLG = (X1B + (2*X2B) + (3*X3B) + (4*HR))/AB)
Remember that the Oakland A’s had limited revenue, so the players that we look for have to be cheap, in addition to being good. We loaded two datasets at the beginning, one for batting statistics and another for salary data for the players. We’ll merge both into a single dataset.
Let’s start by looking at the summaries for both datasets.
summary(batting)
## playerID yearID stint teamID
## mcguide01: 31 Min. :1871 Min. :1.000 CHN : 4611
## henderi01: 29 1st Qu.:1930 1st Qu.:1.000 PHI : 4522
## newsobo01: 29 Median :1969 Median :1.000 PIT : 4478
## johnto01 : 28 Mean :1960 Mean :1.076 SLN : 4446
## kaatji01 : 28 3rd Qu.:1993 3rd Qu.:1.000 CIN : 4316
## ansonca01: 27 Max. :2011 Max. :5.000 CLE : 4231
## (Other) :95023 (Other):68591
## lgID G G_batting AB
## AA : 1891 Min. : 1.00 Min. : 0.00 Min. : 0.0
## AL :43124 1st Qu.: 12.00 1st Qu.: 7.00 1st Qu.: 9.0
## FL : 470 Median : 35.00 Median : 32.00 Median : 62.0
## NL :48494 Mean : 51.67 Mean : 49.08 Mean :154.8
## PL : 147 3rd Qu.: 82.00 3rd Qu.: 82.00 3rd Qu.:261.0
## UA : 332 Max. :165.00 Max. :165.00 Max. :716.0
## NA's: 737 NA's :6284
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 5.00 Median : 12.00 Median : 2.00 Median : 0.000
## Mean : 20.59 Mean : 40.59 Mean : 6.81 Mean : 1.446
## 3rd Qu.: 32.00 3rd Qu.: 66.00 3rd Qu.:11.00 3rd Qu.: 2.000
## Max. :192.00 Max. :262.00 Max. :67.00 Max. :36.000
## NA's :6284 NA's :6284 NA's :6284 NA's :6284
## HR RBI SB CS
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.00 Median : 5.00 Median : 0.000 Median : 0.000
## Mean : 2.98 Mean : 18.56 Mean : 3.293 Mean : 1.406
## 3rd Qu.: 3.00 3rd Qu.: 28.00 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :73.00 Max. :191.00 Max. :138.000 Max. :42.000
## NA's :6284 NA's :6708 NA's :7584 NA's :29739
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 4.00 Median : 11.00 Median : 0.00 Median : 0.000
## Mean : 14.29 Mean : 21.74 Mean : 1.31 Mean : 1.135
## 3rd Qu.: 21.00 3rd Qu.: 31.00 3rd Qu.: 1.00 3rd Qu.: 1.000
## Max. :232.00 Max. :223.00 Max. :120.00 Max. :51.000
## NA's :6284 NA's :14122 NA's :42849 NA's :9104
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 11.00
## Median : 1.000 Median : 0.00 Median : 1.00 Median : 34.00
## Mean : 2.609 Mean : 1.21 Mean : 3.35 Mean : 50.99
## 3rd Qu.: 3.000 3rd Qu.: 2.00 3rd Qu.: 5.00 3rd Qu.: 82.00
## Max. :67.000 Max. :19.00 Max. :36.00 Max. :165.00
## NA's :12623 NA's :42318 NA's :32393 NA's :2494
## BA OBP X1B SLG
## Min. :0.000 Min. :0.00 Min. : 0.00 Min. :0.000
## 1st Qu.:0.149 1st Qu.:0.19 1st Qu.: 1.00 1st Qu.:0.180
## Median :0.231 Median :0.29 Median : 9.00 Median :0.308
## Mean :0.210 Mean :0.26 Mean : 29.35 Mean :0.291
## 3rd Qu.:0.275 3rd Qu.:0.34 3rd Qu.: 48.00 3rd Qu.:0.397
## Max. :1.000 Max. :1.00 Max. :225.00 Max. :4.000
## NA's :12860 NA's :48460 NA's :6284 NA's :12860
summary(sal)
## yearID teamID lgID playerID
## Min. :1985 CLE : 867 AL:11744 moyerja01: 25
## 1st Qu.:1993 LAN : 861 NL:12212 vizquom01: 24
## Median :1999 PHI : 861 glavito02: 23
## Mean :1999 SLN : 858 bondsba01: 22
## 3rd Qu.:2006 BAL : 855 griffke02: 22
## Max. :2013 NYA : 855 thomeji01: 22
## (Other):18799 (Other) :23818
## salary
## Min. : 0
## 1st Qu.: 250000
## Median : 507950
## Mean : 1864357
## 3rd Qu.: 2100000
## Max. :33000000
##
The batting data goes back to 1871, whie the salary data starts at 1985. We’ll have to remove batting data before 1985 (which we have no use for anyways).
batting <- subset(batting, yearID >= 1985)
summary(batting)
## playerID yearID stint teamID
## moyerja01: 26 Min. :1985 Min. :1.000 CLE : 1219
## mulhote01: 26 1st Qu.:1992 1st Qu.:1.000 SDN : 1213
## weathda01: 26 Median :1999 Median :1.000 CIN : 1202
## maddugr01: 25 Mean :1999 Mean :1.079 PIT : 1202
## sierrru01: 25 3rd Qu.:2005 3rd Qu.:1.000 TEX : 1198
## johnsra05: 24 Max. :2011 Max. :4.000 NYN : 1195
## (Other) :32805 (Other):25728
## lgID G G_batting AB
## AA: 0 Min. : 1.00 Min. : 0.00 Min. : 0
## AL:15981 1st Qu.: 14.00 1st Qu.: 4.00 1st Qu.: 3
## FL: 0 Median : 34.00 Median : 27.00 Median : 48
## NL:16976 Mean : 51.74 Mean : 46.02 Mean :146
## PL: 0 3rd Qu.: 78.00 3rd Qu.: 77.00 3rd Qu.:245
## UA: 0 Max. :163.00 Max. :163.00 Max. :716
## NA's :4248
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 4.00 Median : 8.00 Median : 1.000 Median : 0.000
## Mean : 19.74 Mean : 38.41 Mean : 7.371 Mean : 0.838
## 3rd Qu.: 30.00 3rd Qu.: 62.00 3rd Qu.:12.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :4248 NA's :4248 NA's :4248 NA's :4248
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 3.00 Median : 0.000 Median : 0.000
## Mean : 4.208 Mean : 18.69 Mean : 2.856 Mean : 1.253
## 3rd Qu.: 5.000 3rd Qu.: 28.00 3rd Qu.: 2.000 3rd Qu.: 1.000
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.000
## NA's :4248 NA's :4248 NA's :4248 NA's :4248
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.0 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 3.00 Median : 12.0 Median : 0.000 Median : 0.000
## Mean : 14.29 Mean : 26.9 Mean : 1.204 Mean : 1.281
## 3rd Qu.: 21.00 3rd Qu.: 42.0 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :232.00 Max. :223.0 Max. :120.000 Max. :35.000
## NA's :4248 NA's :4248 NA's :4249 NA's :4258
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 11.0
## Median : 0.000 Median : 0.000 Median : 1.000 Median : 32.0
## Mean : 1.497 Mean : 1.236 Mean : 3.285 Mean : 49.7
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 5.000 3rd Qu.: 77.0
## Max. :39.000 Max. :17.000 Max. :35.000 Max. :163.0
## NA's :4248 NA's :4249 NA's :4248 NA's :2494
## BA OBP X1B SLG
## Min. :0.000 Min. :0.000 Min. : 0 Min. :0.000
## 1st Qu.:0.139 1st Qu.:0.190 1st Qu.: 0 1st Qu.:0.167
## Median :0.235 Median :0.298 Median : 6 Median :0.333
## Mean :0.207 Mean :0.264 Mean : 26 Mean :0.306
## 3rd Qu.:0.275 3rd Qu.:0.343 3rd Qu.: 43 3rd Qu.:0.424
## Max. :1.000 Max. :1.000 Max. :225 Max. :4.000
## NA's :8245 NA's :8165 NA's :4248 NA's :8245
The merge:
combo <- merge(batting,sal,by=c('playerID','yearID'))
summary(combo)
## playerID yearID stint teamID.x
## moyerja01: 26 Min. :1985 Min. :1.000 PHI : 877
## weathda01: 25 1st Qu.:1993 1st Qu.:1.000 BOS : 867
## gaettga01: 23 Median :1998 Median :1.000 CLE : 865
## griffke02: 23 Mean :1998 Mean :1.098 LAN : 864
## johnsra05: 23 3rd Qu.:2005 3rd Qu.:1.000 NYA : 863
## maddugr01: 23 Max. :2011 Max. :4.000 CIN : 856
## (Other) :23506 (Other):18457
## lgID.x G G_batting AB
## AA: 0 Min. : 1.00 Min. : 0.00 Min. : 0.0
## AL:11463 1st Qu.: 26.00 1st Qu.: 7.00 1st Qu.: 5.0
## FL: 0 Median : 50.00 Median : 41.00 Median : 87.0
## NL:12186 Mean : 63.96 Mean : 57.24 Mean :183.4
## PL: 0 3rd Qu.:101.00 3rd Qu.:101.00 3rd Qu.:338.0
## UA: 0 Max. :163.00 Max. :163.00 Max. :716.0
## NA's :2609
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 9.00 Median : 20.00 Median : 3.000 Median : 0.000
## Mean : 24.98 Mean : 48.56 Mean : 9.335 Mean : 1.047
## 3rd Qu.: 43.00 3rd Qu.: 88.00 3rd Qu.:16.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :2609 NA's :2609 NA's :2609 NA's :2609
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 8.00 Median : 0.000 Median : 0.000
## Mean : 5.395 Mean : 23.81 Mean : 3.611 Mean : 1.575
## 3rd Qu.: 7.000 3rd Qu.: 40.00 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.000
## NA's :2609 NA's :2609 NA's :2609 NA's :2609
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 7.00 Median : 20.00 Median : 0.000 Median : 0.000
## Mean : 18.19 Mean : 33.26 Mean : 1.565 Mean : 1.617
## 3rd Qu.: 29.00 3rd Qu.: 54.00 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :232.00 Max. :223.00 Max. :120.000 Max. :35.000
## NA's :2609 NA's :2609 NA's :2610 NA's :2618
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 20.00
## Median : 0.000 Median : 0.000 Median : 2.000 Median : 47.00
## Mean : 1.818 Mean : 1.577 Mean : 4.152 Mean : 61.43
## 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:101.00
## Max. :39.000 Max. :17.000 Max. :35.000 Max. :163.00
## NA's :2609 NA's :2610 NA's :2609 NA's :1666
## BA OBP X1B SLG
## Min. :0.000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:0.162 1st Qu.:0.211 1st Qu.: 0.00 1st Qu.:0.202
## Median :0.243 Median :0.306 Median : 14.00 Median :0.352
## Mean :0.213 Mean :0.272 Mean : 32.78 Mean :0.319
## 3rd Qu.:0.277 3rd Qu.:0.347 3rd Qu.: 59.00 3rd Qu.:0.433
## Max. :1.000 Max. :1.000 Max. :225.00 Max. :4.000
## NA's :5232 NA's :5180 NA's :2609 NA's :5232
## teamID.y lgID.y salary
## CLE : 882 AL:11484 Min. : 0
## PIT : 871 NL:12165 1st Qu.: 237500
## PHI : 868 Median : 520000
## SDN : 863 Mean : 1747040
## CIN : 862 3rd Qu.: 2000000
## LAN : 855 Max. :33000000
## (Other):18448
We mentioned that Oakland A’s lost 3 key players. They were:
Selecting the lost players:
lost_players <- subset(combo,playerID %in% c('giambja01','damonjo01','saenzol01'))
2001 was the year when Oakland lost those players. Limiting data to 2001 and required columns:
lost_players <- lost_players %>%
subset(yearID==2001) %>%
select(playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB)
lost_players
## playerID H X2B X3B HR OBP SLG BA AB
## 4758 damonjo01 165 34 4 9 0.3235294 0.3633540 0.2562112 644
## 7291 giambja01 178 47 2 38 0.4769001 0.6596154 0.3423077 520
## 18696 saenzol01 67 21 1 9 0.2911765 0.3836066 0.2196721 305
Limiting source of information to 2001:
combo_2001 <- subset(combo,yearID==2001)
The constraints for our player search are:
sum(lost_players$AB)
## [1] 1469
mean(lost_players$OBP)
## [1] 0.3638687
So, - Combined AB should be equal to or greater than 1469. - Mean OBP should be greater than or equal to 0.364
Let’s plot the salaries and OBP to get a feel for what we have:
ggplot(combo_2001,aes(x=OBP,y=salary)) + geom_point()
## Warning: Removed 168 rows containing missing values (geom_point).
A good threshold for the salary seems to be 8 million, while we definitely have to look for OBPs above 0.
combo_2001 <- subset(combo_2001, salary < 8000000 & OBP > 0)
Let’s check out the average AB for our players:
mean(lost_players$AB)
## [1] 489.6667
We should be aiming to select players with AB around that mark 489, but it’s okay to go a bit lower than that for our threshold as we’ll have players with higher AB as well.
combo_2001 <- subset(combo_2001, AB > 450)
Let’s arrange our filtered players in the descending order of their AB, and pick the top 10 as our options.
options <- head(arrange(combo_2001,desc(OBP)),10)
select(options, playerID,AB,salary,OBP)
## playerID AB salary OBP
## 1 giambja01 520 4103333 0.4769001
## 2 heltoto01 587 4950000 0.4316547
## 3 berkmla01 577 305000 0.4302326
## 4 gonzalu01 609 4833333 0.4285714
## 5 martied01 470 5500000 0.4234079
## 6 thomeji01 526 7875000 0.4161491
## 7 alomaro01 575 7750000 0.4146707
## 8 edmonji01 500 6333333 0.4102142
## 9 gilesbr02 576 7333333 0.4035608
## 10 pujolal01 590 200000 0.4029630
And there we have it, we obviously have to ignore giambja01 (as he’s one of the players we have lost), but we can look at different combinations of the top players in our list (or explore further down the list).
This finishes the project.