This is a project inspired by the book MoneyBall, here is a wiki of it: Moneyball. The data can be found here

During the 01-02 offseason the Oakland A’s lost three key players to teams with larger revenues.

The goal of this project is to look at the data of players and their salaries for those years and try to find players of the same calibre (statistically atleast) that have been undervalued by the market and thus, are suitable low salary replacements.

Setup

library(ggplot2)
library(dplyr)
batting <- read.csv('Batting.csv')
sal <- read.csv('Salaries.csv')
head(batting)
##    playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1 aardsda01   2004     1    SFN   NL  11   0   0   0   0   0  0   0  0  0
## 2 aaronha01   1954     1    ML1   NL 122 468  58 131  27   6 13  69  2  2
## 3 aaronha01   1955     1    ML1   NL 153 602 105 189  37   9 27 106  3  1
## 4 aaronha01   1956     1    ML1   NL 153 609 106 200  34  14 26  92  2  4
## 5 aaronha01   1957     1    ML1   NL 151 615 118 198  27   6 44 132  1  1
## 6 aaronha01   1958     1    ML1   NL 153 601 109 196  34   4 30  95  4  1
##   BB SO IBB HBP SH SF GIDP
## 1  0  0   0   0  0  0    0
## 2 28 39  NA   3  6  4   13
## 3 49 61   5   3  7  4   20
## 4 37 54   6   2  5  7   21
## 5 57 58  15   0  0  3   13
## 6 59 49  16   1  0  3   21

The type of the variables are the following:

str(batting)
## 'data.frame':    85978 obs. of  22 variables:
##  $ playerID: Factor w/ 16208 levels "aardsda01","aaronha01",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ yearID  : int  2004 1954 1955 1956 1957 1958 1959 1960 1961 1962 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 148 levels "ALT","ANA","ARI",..: 116 79 79 79 79 79 79 79 79 79 ...
##  $ lgID    : Factor w/ 6 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ G       : int  11 122 153 153 151 153 154 153 155 156 ...
##  $ AB      : int  0 468 602 609 615 601 629 590 603 592 ...
##  $ R       : int  0 58 105 106 118 109 116 102 115 127 ...
##  $ H       : int  0 131 189 200 198 196 223 172 197 191 ...
##  $ X2B     : int  0 27 37 34 27 34 46 20 39 28 ...
##  $ X3B     : int  0 6 9 14 6 4 7 11 10 6 ...
##  $ HR      : int  0 13 27 26 44 30 39 40 34 45 ...
##  $ RBI     : int  0 69 106 92 132 95 123 126 120 128 ...
##  $ SB      : int  0 2 3 2 1 4 8 16 21 15 ...
##  $ CS      : int  0 2 1 4 1 1 0 7 9 7 ...
##  $ BB      : int  0 28 49 37 57 59 51 60 56 66 ...
##  $ SO      : int  0 39 61 54 58 49 54 63 64 73 ...
##  $ IBB     : int  0 NA 5 6 15 16 17 13 20 14 ...
##  $ HBP     : int  0 3 3 2 0 1 4 2 2 3 ...
##  $ SH      : int  0 6 7 5 0 0 0 0 1 0 ...
##  $ SF      : int  0 4 4 7 3 3 9 12 9 6 ...
##  $ GIDP    : int  0 13 20 21 13 21 19 8 16 14 ...
str(sal)
## 'data.frame':    16447 obs. of  5 variables:
##  $ yearID  : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ teamID  : Factor w/ 32 levels "ANA","ARI","ATL",..: 27 27 27 28 28 28 28 28 28 28 ...
##  $ lgID    : Factor w/ 2 levels "AL","NL": 1 1 1 2 2 2 2 2 2 2 ...
##  $ playerID: Factor w/ 3288 levels "aardsda01","aasedo01",..: 3044 3270 3282 269 335 361 691 701 791 1017 ...
##  $ salary  : num  315000 159000 272500 250000 385000 ...

Missing Features/Feature Engineering

There are three more statistics used in the book MoneyBall that are not present in the data. But they can be calculated from other features.

The statistics are the following:

The wiki articles gives us their formulas:

First, the Batting Average is calculated by dividing Hits by “At Bat”:

batting$BA <- batting$H / batting$AB

On Base Percentage (OBP):

batting <- batting %>%
  mutate(OBP = (H+BB+HBP)/(AB+BB+HBP+SF))

For the Slugging Percentage, we’ll need the 1B (singles), which we’ll have to calculate by subtracting doubles, triples and home runs from the total hits.

batting <- batting %>%
  mutate(X1B = H - X2B - X3B - HR)

Then can calculate the slugging percentage (SLG):

batting <- batting %>%
  mutate(SLG = (X1B + (2*X2B) + (3*X3B) + (4*HR))/AB)

Combining Batting and Salary Data

We will now merge the datasets, Batting for statistics and Salaries for the players salaries. But since the salary data only starts from 1985, while the Batting data starts in 1871, we need to cut the years before that from the Batting data:

c(min(sal$yearID), min(batting$yearID))
## [1] 1985 1871

Removing data before 1985 and merging the two datasets:

batting <- subset(batting, yearID >= 1985)

combo <- merge(batting,sal,by=c('playerID','yearID'))

Analyzing the lost players

Above we mentioned that the A’s lost 3 key players, they were (with PlayerID):

To find fitting replacements we have to look at their statistics to see what the benchmark is that we are looking for:

lost_players <- subset(combo,playerID %in% c('giambja01','damonjo01','isrinja01'))

The A’s lost those players in 2001, so we limit the data to 2001 and look at the required columns:

lost_players <- lost_players %>%
  subset(yearID==2001) %>%
  select(playerID,H,X2B,X3B,HR,OBP,SLG,BA,AB)

lost_players
##       playerID   H X2B X3B HR       OBP       SLG        BA  AB
## 3486 damonjo01 165  34   4  9 0.3235294 0.3633540 0.2562112 644
## 5373 giambja01 178  47   2 38 0.4769001 0.6596154 0.3423077 520
## 7543 isrinja01  67  21   1  9 0.2911765 0.3836066 0.2196721 305

Limiting the entire data to 2001:

combo_2001 <- subset(combo,yearID==2001)

When we look for our new players the constraints in our search are:

sum(lost_players$AB)
## [1] 1469
mean(lost_players$OBP)
## [1] 0.3638687

So, combined AB should be greater than or equal to 1469 and the mean OBP should be greater than or equal to 0.364

Let’s look at salaries and OBP to get a feel of what we have:

ggplot(combo_2001,aes(x=OBP,y=salary)) + geom_point()
## Warning: Removed 167 rows containing missing values (geom_point).

A reasonable threshold for salary seems to be around 8 million, while OBP has to be atleast more than 0.

combo_2001 <- subset(combo_2001, salary < 8000000 & OBP > 0)

The average AB for the lost players:

mean(lost_players$AB)
## [1] 489.6667

So we should be looking to select players with AB around 489, but let’s put it a little lower as we will have players with a higher AB as well.

combo_2001 <- subset(combo_2001, AB > 450)

Let’s arrange our filtered players in descending order based on their AB, and pick the top 10 as our options.

options <- head(arrange(combo_2001,desc(OBP)),10)

select(options, playerID,AB,salary,OBP)
##     playerID  AB  salary       OBP
## 1  giambja01 520 4103333 0.4769001
## 2  heltoto01 587 4950000 0.4316547
## 3  berkmla01 577  305000 0.4302326
## 4  gonzalu01 609 4833333 0.4285714
## 5  martied01 470 5500000 0.4234079
## 6  thomeji01 526 7875000 0.4161491
## 7  alomaro01 575 7750000 0.4146707
## 8  edmonji01 500 6333333 0.4102142
## 9  gilesbr02 576 7333333 0.4035608
## 10 pujolal01 590  200000 0.4029630

Now, Jason Giambi will have to be ignored, but we can look at different combinations of these top players.

For example a possible trio could be the top 3 (not regarding Jason Giambi).

new_players <- subset(combo_2001,playerID %in% c('heltoto01','berkmla01','gonzalu01'))

Combined salary:

sum(new_players$salary)
## [1] 10088333

Total “At Bat”:

sum(new_players$AB)
## [1] 1773

Mean OBP:

mean(new_players$OBP)
## [1] 0.4301529