Baseball was never my favorite sport. To be honest, I was always a basketball guy. That doesn’t deter me from being amazed by the incredible story of the 2002 Oakland A’s.
The previous year the Athletics lost three of their key players to the big gun teams. Billy Beane, the team’s general manager, didn’t let this faze him, rather, he made some very unorthodox free agent signings for players that nobody thought would make a difference. Through the power of statistics and data analysis, the team was able to find the right players that would help them reach an amazing record of 103-59!
This story was the subject of the book Moneyball, written by the legendary author Michael Lewis. The main premise is that what common knowledge suggests about how to succeed in sports, specially in the MLB, is not always correct. Metrics like stolen bases, runs batted in, and batting average are known and respected by everyone. Measurements like on-base percentage and slugging percentage, however, are not usually thought of as better indicators of offensive success, but that’s exactly what the A’s were looking for.
In this project, we will recreate the story of the 2002 Oakland A’s and see if we, with the power of data analysis and R, can find the players that can replace the stars that left them the year before.
A useful website to consult for baseball statistics is http://www.seanlahman.com/baseball-archive/statistics/. We will use the ‘Batting.csv’ file.
batting <- read.csv('./data/Batting.csv')
I’ll use head() to check out the data
head(batting)
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR RBI SB
## 1 aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0 0 0 0
## 2 aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0 0 0 0
## 3 aardsda01 2007 1 CHA AL 25 2 0 0 0 0 0 0 0 0
## 4 aardsda01 2008 1 BOS AL 47 5 1 0 0 0 0 0 0 0
## 5 aardsda01 2009 1 SEA AL 73 3 0 0 0 0 0 0 0 0
## 6 aardsda01 2010 1 SEA AL 53 4 0 0 0 0 0 0 0 0
## CS BB SO IBB HBP SH SF GIDP G_old
## 1 0 0 0 0 0 0 0 0 11
## 2 0 0 0 0 0 1 0 0 45
## 3 0 0 0 0 0 0 0 0 2
## 4 0 0 1 0 0 0 0 0 5
## 5 0 0 0 0 0 0 0 0 NA
## 6 0 0 0 0 0 0 0 0 NA
Now I’ll use str() to take a look at the structure. Right away I can tell that calling specific columns might be tricky because of the weird acronymns.
str(batting)
## 'data.frame': 97889 obs. of 24 variables:
## $ playerID : Factor w/ 18107 levels "aardsda01","aaronha01",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ yearID : int 2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 117 35 33 16 116 116 93 80 80 80 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 4 4 2 2 2 2 2 4 4 4 ...
## $ G : int 11 45 25 47 73 53 1 122 153 153 ...
## $ G_batting: int 11 43 2 5 3 4 NA 122 153 153 ...
## $ AB : int 0 2 0 1 0 0 NA 468 602 609 ...
## $ R : int 0 0 0 0 0 0 NA 58 105 106 ...
## $ H : int 0 0 0 0 0 0 NA 131 189 200 ...
## $ X2B : int 0 0 0 0 0 0 NA 27 37 34 ...
## $ X3B : int 0 0 0 0 0 0 NA 6 9 14 ...
## $ HR : int 0 0 0 0 0 0 NA 13 27 26 ...
## $ RBI : int 0 0 0 0 0 0 NA 69 106 92 ...
## $ SB : int 0 0 0 0 0 0 NA 2 3 2 ...
## $ CS : int 0 0 0 0 0 0 NA 2 1 4 ...
## $ BB : int 0 0 0 0 0 0 NA 28 49 37 ...
## $ SO : int 0 0 0 1 0 0 NA 39 61 54 ...
## $ IBB : int 0 0 0 0 0 0 NA NA 5 6 ...
## $ HBP : int 0 0 0 0 0 0 NA 3 3 2 ...
## $ SH : int 0 1 0 0 0 0 NA 6 7 5 ...
## $ SF : int 0 0 0 0 0 0 NA 4 4 7 ...
## $ GIDP : int 0 0 0 0 0 0 NA 13 20 21 ...
## $ G_old : int 11 45 2 5 NA NA NA 122 153 153 ...
I’ll explore some of the columns with head() again
head(batting$AB)
## [1] 0 2 0 1 0 0
head(batting$X2B)
## [1] 0 0 0 0 0 0
Time to create some new statistics!
In Moneyball, there were three important metrics: Batting Average, On Base Percentage, and Slugging Percentage.
The formula for Batting Average is:
AVG = \(\frac{H}{AB}\)
This means that Batting Average is equal to H (hits) divided by AB (At Base)
Let’s create a new column called BA and add it to the data frame.
batting$BA <- batting$H / batting$AB
Let’s check what happened.
tail(batting$BA, 5)
## [1] 0.1230769 0.2746479 0.1470588 0.2745098 0.2138728
The formula for On-Base Percentage is considerably more complicated:
OBP = \(\frac{H+BB+HBP}{AB+BB+HBP+SF}\)
H is hits, BB is bases on balls (walks), HBP is hit by pitch, AB is at bat, and SF is sacrifice fly
Let’s create a column called OBP and add it to the data frame.
batting$OBP = (batting$H + batting$BB + batting$HBP) / (batting$AB + batting$BB + batting$HBP + batting$SF)
Let’s check what happened.
tail(batting$OBP, 5)
## [1] 0.1343284 0.3443918 0.1470588 0.3543759 0.2901554
The formula for slugging percentage is:
SLG = \(\frac{(1B) + (2 * 2B) + (3 * 3B) + (4 * HR)}{AB}\)
Because the dataframe doesn’t have first base data, we have to create it
batting$X1B <- batting$H - batting$X2B - batting$X3B - batting$HR
Now we can create a colulmn for slugging average
batting$SLG <- ((1 * batting$X1B) + (2 * batting$X2B) + (3 * batting$X3B) + (4 * batting$HR)) / batting$AB
Let’s check the structure of the data frame
str(batting)
## 'data.frame': 97889 obs. of 28 variables:
## $ playerID : Factor w/ 18107 levels "aardsda01","aaronha01",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ yearID : int 2004 2006 2007 2008 2009 2010 2012 1954 1955 1956 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 117 35 33 16 116 116 93 80 80 80 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 4 4 2 2 2 2 2 4 4 4 ...
## $ G : int 11 45 25 47 73 53 1 122 153 153 ...
## $ G_batting: int 11 43 2 5 3 4 NA 122 153 153 ...
## $ AB : int 0 2 0 1 0 0 NA 468 602 609 ...
## $ R : int 0 0 0 0 0 0 NA 58 105 106 ...
## $ H : int 0 0 0 0 0 0 NA 131 189 200 ...
## $ X2B : int 0 0 0 0 0 0 NA 27 37 34 ...
## $ X3B : int 0 0 0 0 0 0 NA 6 9 14 ...
## $ HR : int 0 0 0 0 0 0 NA 13 27 26 ...
## $ RBI : int 0 0 0 0 0 0 NA 69 106 92 ...
## $ SB : int 0 0 0 0 0 0 NA 2 3 2 ...
## $ CS : int 0 0 0 0 0 0 NA 2 1 4 ...
## $ BB : int 0 0 0 0 0 0 NA 28 49 37 ...
## $ SO : int 0 0 0 1 0 0 NA 39 61 54 ...
## $ IBB : int 0 0 0 0 0 0 NA NA 5 6 ...
## $ HBP : int 0 0 0 0 0 0 NA 3 3 2 ...
## $ SH : int 0 1 0 0 0 0 NA 6 7 5 ...
## $ SF : int 0 0 0 0 0 0 NA 4 4 7 ...
## $ GIDP : int 0 0 0 0 0 0 NA 13 20 21 ...
## $ G_old : int 11 45 2 5 NA NA NA 122 153 153 ...
## $ BA : num NaN 0 NaN 0 NaN ...
## $ OBP : num NaN 0 NaN 0 NaN ...
## $ X1B : int 0 0 0 0 0 0 NA 85 116 126 ...
## $ SLG : num NaN 0 NaN 0 NaN ...
Because we are looking for the most undervalued players for our team, we need to also know the current salary information! We will use the file ‘Salaries.csv’
sal <- read.csv('./data/Salaries.csv')
If we use summary(), we will see that a lot of the data for the batting dataframe goes WAYY back in time, like 1871! That information is not important to us, so let’s make a subset that contains data from 1985 and onwards.
summary(batting)
## playerID yearID stint teamID
## mcguide01: 31 Min. :1871 Min. :1.000 CHN : 4720
## henderi01: 29 1st Qu.:1931 1st Qu.:1.000 PHI : 4621
## newsobo01: 29 Median :1970 Median :1.000 PIT : 4575
## johnto01 : 28 Mean :1962 Mean :1.077 SLN : 4535
## kaatji01 : 28 3rd Qu.:1995 3rd Qu.:1.000 CIN : 4393
## ansonca01: 27 Max. :2013 Max. :5.000 CLE : 4318
## (Other) :97717 (Other):70727
## lgID G G_batting AB
## AA : 1890 Min. : 1.00 Min. : 0.00 Min. : 0.0
## AL :44369 1st Qu.: 13.00 1st Qu.: 7.00 1st Qu.: 9.0
## FL : 470 Median : 35.00 Median : 32.00 Median : 61.0
## NL :49944 Mean : 51.65 Mean : 49.13 Mean :154.1
## PL : 147 3rd Qu.: 81.00 3rd Qu.: 81.00 3rd Qu.:260.0
## UA : 332 Max. :165.00 Max. :165.00 Max. :716.0
## NA's: 737 NA's :1406 NA's :6413
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 5.00 Median : 12.00 Median : 2.0 Median : 0.000
## Mean : 20.47 Mean : 40.37 Mean : 6.8 Mean : 1.424
## 3rd Qu.: 31.00 3rd Qu.: 66.00 3rd Qu.:10.0 3rd Qu.: 2.000
## Max. :192.00 Max. :262.00 Max. :67.0 Max. :36.000
## NA's :6413 NA's :6413 NA's :6413 NA's :6413
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 5.00 Median : 0.000 Median : 0.000
## Mean : 3.002 Mean : 18.47 Mean : 3.265 Mean : 1.385
## 3rd Qu.: 3.000 3rd Qu.: 28.00 3rd Qu.: 2.000 3rd Qu.: 1.000
## Max. :73.000 Max. :191.00 Max. :138.000 Max. :42.000
## NA's :6413 NA's :6837 NA's :7713 NA's :29867
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 4.00 Median : 11.00 Median : 0.00 Median : 0.000
## Mean : 14.21 Mean : 21.95 Mean : 1.28 Mean : 1.136
## 3rd Qu.: 21.00 3rd Qu.: 31.00 3rd Qu.: 1.00 3rd Qu.: 1.000
## Max. :232.00 Max. :223.00 Max. :120.00 Max. :51.000
## NA's :6413 NA's :14251 NA's :42977 NA's :9233
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 11.00
## Median : 1.000 Median : 0.0 Median : 1.00 Median : 34.00
## Mean : 2.564 Mean : 1.2 Mean : 3.33 Mean : 50.99
## 3rd Qu.: 3.000 3rd Qu.: 2.0 3rd Qu.: 5.00 3rd Qu.: 82.00
## Max. :67.000 Max. :19.0 Max. :36.00 Max. :165.00
## NA's :12751 NA's :42446 NA's :32521 NA's :5189
## BA OBP X1B SLG
## Min. :0.000 Min. :0.00 Min. : 0.00 Min. :0.000
## 1st Qu.:0.148 1st Qu.:0.19 1st Qu.: 1.00 1st Qu.:0.179
## Median :0.231 Median :0.29 Median : 9.00 Median :0.309
## Mean :0.209 Mean :0.26 Mean : 29.14 Mean :0.291
## 3rd Qu.:0.275 3rd Qu.:0.34 3rd Qu.: 48.00 3rd Qu.:0.397
## Max. :1.000 Max. :1.00 Max. :225.00 Max. :4.000
## NA's :13520 NA's :49115 NA's :6413 NA's :13520
batting <- subset(batting, yearID >= 1985)
summary(batting)
## playerID yearID stint teamID
## moyerja01: 27 Min. :1985 Min. :1.00 SDN : 1313
## mulhote01: 26 1st Qu.:1993 1st Qu.:1.00 CLE : 1306
## weathda01: 26 Median :2000 Median :1.00 PIT : 1299
## maddugr01: 25 Mean :2000 Mean :1.08 NYN : 1297
## sierrru01: 25 3rd Qu.:2007 3rd Qu.:1.00 BOS : 1279
## thomeji01: 25 Max. :2013 Max. :4.00 CIN : 1279
## (Other) :35498 (Other):27879
## lgID G G_batting AB
## AA: 0 Min. : 1.0 Min. : 0.00 Min. : 0.0
## AL:17226 1st Qu.: 14.0 1st Qu.: 4.00 1st Qu.: 3.0
## FL: 0 Median : 34.0 Median : 27.00 Median : 47.0
## NL:18426 Mean : 51.7 Mean : 46.28 Mean :144.7
## PL: 0 3rd Qu.: 77.0 3rd Qu.: 77.00 3rd Qu.:241.0
## UA: 0 Max. :163.0 Max. :163.00 Max. :716.0
## NA's :1406 NA's :4377
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 4.00 Median : 8.00 Median : 1.000 Median : 0.000
## Mean : 19.44 Mean : 37.95 Mean : 7.293 Mean : 0.824
## 3rd Qu.: 30.00 3rd Qu.: 61.00 3rd Qu.:11.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :4377 NA's :4377 NA's :4377 NA's :4377
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 3.00 Median : 0.000 Median : 0.000
## Mean : 4.169 Mean : 18.41 Mean : 2.811 Mean : 1.219
## 3rd Qu.: 5.000 3rd Qu.: 27.00 3rd Qu.: 2.000 3rd Qu.: 1.000
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.000
## NA's :4377 NA's :4377 NA's :4377 NA's :4377
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 3.00 Median : 12.00 Median : 0.000 Median : 0.000
## Mean : 14.06 Mean : 27.03 Mean : 1.171 Mean : 1.273
## 3rd Qu.: 21.00 3rd Qu.: 42.00 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :232.00 Max. :223.00 Max. :120.000 Max. :35.000
## NA's :4377 NA's :4377 NA's :4378 NA's :4387
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 11.0
## Median : 0.000 Median : 0.000 Median : 1.00 Median : 32.0
## Mean : 1.465 Mean : 1.212 Mean : 3.25 Mean : 49.7
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 5.00 3rd Qu.: 77.0
## Max. :39.000 Max. :17.000 Max. :35.00 Max. :163.0
## NA's :4377 NA's :4378 NA's :4377 NA's :5189
## BA OBP X1B SLG
## Min. :0.000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:0.136 1st Qu.:0.188 1st Qu.: 0.00 1st Qu.:0.167
## Median :0.233 Median :0.296 Median : 6.00 Median :0.333
## Mean :0.205 Mean :0.262 Mean : 25.66 Mean :0.304
## 3rd Qu.:0.274 3rd Qu.:0.342 3rd Qu.: 42.00 3rd Qu.:0.423
## Max. :1.000 Max. :1.000 Max. :225.00 Max. :4.000
## NA's :8905 NA's :8821 NA's :4377 NA's :8905
As we can see, the minimum for the yearID is now 1985.
Now we need to merge the batting data with the salary data. Because of repetition, we need to merge on both playerID and yearID.
combo <- merge(batting, sal, by=c('playerID', 'yearID'))
Let’s check the data
summary(combo)
## playerID yearID stint teamID.x
## moyerja01: 27 Min. :1985 Min. :1.000 LAN : 940
## thomeji01: 25 1st Qu.:1993 1st Qu.:1.000 PHI : 937
## weathda01: 25 Median :1999 Median :1.000 BOS : 935
## vizquom01: 24 Mean :1999 Mean :1.098 NYA : 928
## gaettga01: 23 3rd Qu.:2006 3rd Qu.:1.000 CLE : 920
## griffke02: 23 Max. :2013 Max. :4.000 SDN : 914
## (Other) :25250 (Other):19823
## lgID.x G G_batting AB
## AA: 0 Min. : 1.00 Min. : 0.00 Min. : 0.0
## AL:12292 1st Qu.: 26.00 1st Qu.: 8.00 1st Qu.: 5.0
## FL: 0 Median : 50.00 Median : 42.00 Median : 85.0
## NL:13105 Mean : 64.06 Mean : 57.58 Mean :182.4
## PL: 0 3rd Qu.:101.00 3rd Qu.:101.00 3rd Qu.:336.0
## UA: 0 Max. :163.00 Max. :163.00 Max. :716.0
## NA's :906 NA's :2661
## R H X2B X3B
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 9.00 Median : 19.00 Median : 3.000 Median : 0.000
## Mean : 24.71 Mean : 48.18 Mean : 9.276 Mean : 1.033
## 3rd Qu.: 43.00 3rd Qu.: 87.25 3rd Qu.:16.000 3rd Qu.: 1.000
## Max. :152.00 Max. :262.00 Max. :59.000 Max. :23.000
## NA's :2661 NA's :2661 NA's :2661 NA's :2661
## HR RBI SB CS
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 1.000 Median : 8.00 Median : 0.000 Median : 0.00
## Mean : 5.369 Mean : 23.56 Mean : 3.568 Mean : 1.54
## 3rd Qu.: 7.000 3rd Qu.: 39.00 3rd Qu.: 3.000 3rd Qu.: 2.00
## Max. :73.000 Max. :165.00 Max. :110.000 Max. :29.00
## NA's :2661 NA's :2661 NA's :2661 NA's :2661
## BB SO IBB HBP
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 6.00 Median : 20.00 Median : 0.000 Median : 0.000
## Mean : 17.98 Mean : 33.52 Mean : 1.533 Mean : 1.614
## 3rd Qu.: 29.00 3rd Qu.: 55.00 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :232.00 Max. :223.00 Max. :120.000 Max. :35.000
## NA's :2661 NA's :2661 NA's :2662 NA's :2670
## SH SF GIDP G_old
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 20.00
## Median : 0.000 Median : 0.000 Median : 2.000 Median : 47.00
## Mean : 1.786 Mean : 1.554 Mean : 4.127 Mean : 61.43
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 7.000 3rd Qu.:101.00
## Max. :39.000 Max. :17.000 Max. :35.000 Max. :163.00
## NA's :2661 NA's :2662 NA's :2661 NA's :3414
## BA OBP X1B SLG
## Min. :0.000 Min. :0.000 Min. : 0.0 Min. :0.000
## 1st Qu.:0.160 1st Qu.:0.208 1st Qu.: 0.0 1st Qu.:0.200
## Median :0.242 Median :0.305 Median : 13.0 Median :0.351
## Mean :0.212 Mean :0.270 Mean : 32.5 Mean :0.317
## 3rd Qu.:0.276 3rd Qu.:0.346 3rd Qu.: 59.0 3rd Qu.:0.432
## Max. :1.000 Max. :1.000 Max. :225.0 Max. :4.000
## NA's :5618 NA's :5562 NA's :2661 NA's :5618
## teamID.y lgID.y salary
## CLE : 935 AL:12304 Min. : 0
## PIT : 932 NL:13093 1st Qu.: 255000
## PHI : 931 Median : 550000
## SDN : 923 Mean : 1879256
## LAN : 921 3rd Qu.: 2150000
## CIN : 912 Max. :33000000
## (Other):19843
It was stated earlier that the Oakland A’s lost 3 important players during the 2001 off-season. Those players were: Jason Giambi, Johnny Damon, and Rainer Gustavo Olmedo. Let’s make a subset of these
lost_players <- subset(combo, playerID %in% c('giambja01', 'damonjo01', 'saenzol01'))
lost_players
## playerID yearID stint teamID.x lgID.x G G_batting AB R H X2B
## 5135 damonjo01 1995 1 KCA AL 47 47 188 32 53 11
## 5136 damonjo01 1996 1 KCA AL 145 145 517 61 140 22
## 5137 damonjo01 1997 1 KCA AL 146 146 472 70 130 12
## 5138 damonjo01 1998 1 KCA AL 161 161 642 104 178 30
## 5139 damonjo01 1999 1 KCA AL 145 145 583 101 179 39
## 5140 damonjo01 2000 1 KCA AL 159 159 655 136 214 42
## 5141 damonjo01 2001 1 OAK AL 155 155 644 108 165 34
## 5142 damonjo01 2002 1 BOS AL 154 154 623 118 178 34
## 5143 damonjo01 2003 1 BOS AL 145 145 608 103 166 32
## 5144 damonjo01 2004 1 BOS AL 150 150 621 123 189 35
## 5145 damonjo01 2005 1 BOS AL 148 148 624 117 197 35
## 5146 damonjo01 2006 1 NYA AL 149 149 593 115 169 35
## 5147 damonjo01 2007 1 NYA AL 141 141 533 93 144 27
## 5148 damonjo01 2008 1 NYA AL 143 143 555 95 168 27
## 5149 damonjo01 2009 1 NYA AL 143 143 550 107 155 36
## 5150 damonjo01 2010 1 DET AL 145 145 539 81 146 36
## 5151 damonjo01 2011 1 TBA AL 150 150 582 79 152 29
## 7872 giambja01 1995 1 OAK AL 54 54 176 27 45 7
## 7873 giambja01 1996 1 OAK AL 140 140 536 84 156 40
## 7874 giambja01 1997 1 OAK AL 142 142 519 66 152 41
## 7875 giambja01 1998 1 OAK AL 153 153 562 92 166 28
## 7876 giambja01 1999 1 OAK AL 158 158 575 115 181 36
## 7877 giambja01 2000 1 OAK AL 152 152 510 108 170 29
## 7878 giambja01 2001 1 OAK AL 154 154 520 109 178 47
## 7879 giambja01 2002 1 NYA AL 155 155 560 120 176 34
## 7880 giambja01 2003 1 NYA AL 156 156 535 97 134 25
## 7881 giambja01 2004 1 NYA AL 80 80 264 33 55 9
## 7882 giambja01 2005 1 NYA AL 139 139 417 74 113 14
## 7883 giambja01 2006 1 NYA AL 139 139 446 92 113 25
## 7884 giambja01 2007 1 NYA AL 83 83 254 31 60 8
## 7885 giambja01 2008 1 NYA AL 145 145 458 68 113 19
## 7886 giambja01 2009 1 OAK AL 83 83 269 39 52 13
## 7887 giambja01 2009 2 COL NL 19 19 24 4 7 1
## 7888 giambja01 2010 1 COL NL 87 87 176 17 43 9
## 7889 giambja01 2011 1 COL NL 64 64 131 20 34 6
## 7890 giambja01 2012 1 COL NL 60 NA 89 7 20 4
## 7891 giambja01 2013 1 CLE AL 71 71 186 21 34 8
## 20112 saenzol01 1999 1 OAK AL 97 97 255 41 70 18
## 20113 saenzol01 2000 1 OAK AL 76 76 214 40 67 12
## 20114 saenzol01 2001 1 OAK AL 106 106 305 33 67 21
## 20115 saenzol01 2002 1 OAK AL 68 68 156 15 43 10
## 20116 saenzol01 2005 1 LAN NL 109 109 319 39 84 24
## 20117 saenzol01 2006 1 LAN NL 103 103 179 30 53 15
## 20118 saenzol01 2007 1 LAN NL 92 92 110 9 21 5
## X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old BA
## 5135 5 3 23 7 0 12 22 0 1 2 3 2 47 0.2819149
## 5136 5 6 50 25 5 31 64 3 3 10 5 4 145 0.2707930
## 5137 8 8 48 16 10 42 70 2 3 6 1 3 146 0.2754237
## 5138 10 18 66 26 12 58 84 4 4 3 3 4 161 0.2772586
## 5139 9 14 77 36 6 67 50 5 3 3 4 13 145 0.3070326
## 5140 10 16 88 46 9 65 60 4 1 8 12 7 159 0.3267176
## 5141 4 9 49 27 12 61 70 1 5 5 4 7 155 0.2562112
## 5142 11 14 63 31 6 65 70 5 6 3 5 4 154 0.2857143
## 5143 6 12 67 30 6 68 74 4 2 6 6 5 145 0.2730263
## 5144 6 20 94 19 8 76 71 1 2 0 3 8 150 0.3043478
## 5145 6 10 75 18 1 53 69 3 2 0 9 5 148 0.3157051
## 5146 5 24 80 25 10 67 85 1 4 2 5 4 149 0.2849916
## 5147 2 12 63 27 3 66 79 1 2 1 3 4 141 0.2701689
## 5148 5 17 71 29 8 64 82 0 1 2 1 5 143 0.3027027
## 5149 3 24 82 12 0 71 98 1 2 2 1 9 NA 0.2818182
## 5150 5 8 51 11 1 69 90 2 2 2 1 5 NA 0.2708720
## 5151 7 16 73 19 6 51 92 1 7 2 5 4 150 0.2611684
## 7872 0 6 25 2 1 28 31 0 3 1 2 4 54 0.2556818
## 7873 1 20 79 0 1 51 95 3 5 1 5 15 140 0.2910448
## 7874 2 20 81 0 1 55 89 3 6 0 8 11 142 0.2928709
## 7875 0 27 110 2 2 81 102 7 5 0 9 16 153 0.2953737
## 7876 1 33 123 1 1 105 106 6 7 0 8 11 158 0.3147826
## 7877 1 43 137 2 0 137 96 6 9 0 8 9 152 0.3333333
## 7878 2 38 120 2 0 129 83 24 13 0 9 17 154 0.3423077
## 7879 1 41 122 2 2 109 112 4 15 0 5 18 155 0.3142857
## 7880 0 41 107 2 1 129 140 9 21 0 5 9 156 0.2504673
## 7881 0 12 40 0 1 47 62 1 8 0 3 5 80 0.2083333
## 7882 0 32 87 0 0 108 109 5 19 0 1 7 139 0.2709832
## 7883 0 37 113 2 0 110 106 12 16 0 7 10 139 0.2533632
## 7884 0 14 39 1 0 40 66 2 8 0 1 1 83 0.2362205
## 7885 1 32 96 2 1 76 111 5 22 0 9 6 145 0.2467249
## 7886 0 11 40 0 0 50 72 1 7 0 2 6 NA 0.1933086
## 7887 0 2 11 0 0 7 8 0 0 0 0 0 NA 0.2916667
## 7888 0 6 35 2 0 35 47 5 6 0 5 5 NA 0.2443182
## 7889 0 13 32 0 0 17 45 0 3 0 1 1 64 0.2595420
## 7890 0 1 8 0 0 20 24 2 2 0 2 4 NA 0.2247191
## 7891 0 9 31 0 1 23 56 0 4 0 3 8 NA 0.1827957
## 20112 0 11 41 1 1 22 47 1 15 0 3 6 97 0.2745098
## 20113 2 9 33 1 0 25 40 2 7 0 1 6 76 0.3130841
## 20114 1 9 32 0 1 19 64 1 13 1 3 9 106 0.2196721
## 20115 1 6 18 1 1 13 31 1 7 0 2 2 68 0.2756410
## 20116 0 15 63 0 1 27 63 1 3 0 2 12 109 0.2633229
## 20117 0 11 48 0 0 14 47 1 7 0 4 4 103 0.2960894
## 20118 0 4 18 0 0 16 25 0 2 0 4 5 92 0.1909091
## OBP X1B SLG teamID.y lgID.y salary
## 5135 0.3235294 34 0.4414894 KCA AL 109000
## 5136 0.3129496 107 0.3675048 KCA AL 180000
## 5137 0.3378378 102 0.3855932 KCA AL 240000
## 5138 0.3394625 120 0.4392523 KCA AL 460000
## 5139 0.3789954 117 0.4768439 KCA AL 2100000
## 5140 0.3819918 146 0.4946565 KCA AL 4000000
## 5141 0.3235294 118 0.3633540 OAK AL 7100000
## 5142 0.3562232 119 0.4430177 BOS AL 7250000
## 5143 0.3450292 116 0.4046053 BOS AL 7500000
## 5144 0.3803419 128 0.4766506 BOS AL 8000000
## 5145 0.3662791 146 0.4391026 BOS AL 8250000
## 5146 0.3587444 105 0.4822934 NYA AL 13000000
## 5147 0.3509934 103 0.3958724 NYA AL 13000000
## 5148 0.3752013 119 0.4612613 NYA AL 13000000
## 5149 0.3653846 92 0.4890909 NYA AL 13000000
## 5150 0.3551555 97 0.4007421 DET AL 8000000
## 5151 0.3255814 100 0.4175258 TBA AL 5250000
## 7872 0.3636364 32 0.3977273 OAK AL 109000
## 7873 0.3551089 95 0.4813433 OAK AL 120000
## 7874 0.3622449 89 0.4951830 OAK AL 205000
## 7875 0.3835616 111 0.4893238 OAK AL 315000
## 7876 0.4215827 111 0.5530435 OAK AL 2103333
## 7877 0.4759036 97 0.6470588 OAK AL 3103333
## 7878 0.4769001 91 0.6596154 OAK AL 4103333
## 7879 0.4354136 100 0.5982143 NYA AL 10428571
## 7880 0.4115942 68 0.5271028 NYA AL 11428571
## 7881 0.3416149 34 0.3787879 NYA AL 12428571
## 7882 0.4403670 67 0.5347722 NYA AL 13428571
## 7883 0.4127807 51 0.5582960 NYA AL 20428571
## 7884 0.3564356 38 0.4330709 NYA AL 23428571
## 7885 0.3734513 61 0.5021834 NYA AL 23428571
## 7886 0.3323171 28 0.3643123 OAK AL 4000000
## 7887 0.4516129 4 0.5833333 OAK AL 4000000
## 7888 0.3783784 28 0.3977273 COL NL 1750000
## 7889 0.3552632 15 0.6030534 COL NL 1000000
## 7890 0.3716814 15 0.3033708 COL NL 1000000
## 7891 0.2824074 17 0.3709677 CLE AL 750000
## 20112 0.3627119 41 0.4745098 OAK AL 240000
## 20113 0.4008097 44 0.5140187 OAK AL 260000
## 20114 0.2911765 36 0.3836066 OAK AL 290000
## 20115 0.3539326 26 0.4679487 OAK AL 800000
## 20116 0.3247863 45 0.4796238 LAN NL 650000
## 20117 0.3627451 27 0.5642458 LAN NL 1000000
## 20118 0.2954545 12 0.3454545 LAN NL 1000000
Now let’s focus only on the year 2001.
lost_players <- subset(lost_players, yearID == 2001)
Let’s reduce the lost_players data frame to just a certain amount of columns
lost_players <- lost_players[,c('playerID', 'H', 'X2B', 'X3B', 'HR', 'OBP', 'SLG', 'BA', 'AB')]
head(lost_players)
## playerID H X2B X3B HR OBP SLG BA AB
## 5141 damonjo01 165 34 4 9 0.3235294 0.3633540 0.2562112 644
## 7878 giambja01 178 47 2 38 0.4769001 0.6596154 0.3423077 520
## 20114 saenzol01 67 21 1 9 0.2911765 0.3836066 0.2196721 305
Finally we have all of the information we need to find the necessary replacement players! In order to do this, we need to follow three constraints:
First we’ll grab the available players from year 2001
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
avail.players <- filter(combo, yearID == 2001)
library(ggplot2)
ggplot(avail.players, aes(x=OBP, y=salary)) + geom_point()
## Warning: Removed 168 rows containing missing values (geom_point).
Let’s get rid of players with salary higher than 8 million and OBP equal to 0
avail.players <- filter(avail.players, salary<8000000, OBP > 0)
Let’s try to find players with AB higher than 500
avail.players <- filter(avail.players, AB >= 500)
Let’s sort by OBP and see what we have so far.
possible <- head(arrange(avail.players, desc(OBP)), 10)
possible <- possible[,c('playerID', 'OBP', 'AB', 'salary')]
possible
## playerID OBP AB salary
## 1 giambja01 0.4769001 520 4103333
## 2 heltoto01 0.4316547 587 4950000
## 3 berkmla01 0.4302326 577 305000
## 4 gonzalu01 0.4285714 609 4833333
## 5 thomeji01 0.4161491 526 7875000
## 6 alomaro01 0.4146707 575 7750000
## 7 edmonji01 0.4102142 500 6333333
## 8 gilesbr02 0.4035608 576 7333333
## 9 pujolal01 0.4029630 590 200000
## 10 olerujo01 0.4011799 572 6700000
Obviously we can’t choose giambja again, but the next three look good!
possible[2:4,]
## playerID OBP AB salary
## 2 heltoto01 0.4316547 587 4950000
## 3 berkmla01 0.4302326 577 305000
## 4 gonzalu01 0.4285714 609 4833333
Boom! We just found three players that can replace the ones we lost! Looks like we just saved the Oakland A’s season!