We use Integer Linear Programming (ILP) to build a perfect 25-man baseball roster. Below we present our best team, which is the solution of the ILP model we built using the 2015 MLB season player data. If you understand baseball, please evaluate the resulting team and drop a comment, so that we know whether ILP can be used to get a decent baseball team. After the table I describe how we arrived at our solution.
Let’s read in the 2015 regular season player level data.
dat = read.csv("Baseball Data.csv")
head(dat[,1:4])
## Salary Name POS Bats
## 1 510000 Joc Pederson OF L
## 2 512500 Stephen Vogt 1B L
## 3 3550000 Wilson Ramos C R
## 4 31000000 Clayton Kershaw SP
## 5 15000000 Jhonny Peralta SS R
## 6 2000000 Carlos Villanueva Reliever
dat[is.na(dat)] = 0
There were NA's for some players and their game statistics, which we replaced with 0. We replaced the missing data with zeros so that, when we construct the player utility index, missing data won’t count towards or against players.
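Before overwriting missing values it can help to see how many there are per column. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the real player data) and then applies the same replacement as above.

```r
# Toy data frame standing in for the player table (values are made up)
toy = data.frame(Name = c("A", "B", "C"),
                 HR   = c(12, NA, 30),
                 AVG  = c(0.250, 0.301, NA),
                 stringsAsFactors = FALSE)

# Count missing values per column before replacing them
na.count = colSums(is.na(toy))
print(na.count)

# Same replacement as in the post: every NA becomes 0
toy[is.na(toy)] = 0
print(toy)
```

On the real data, `colSums(is.na(dat))` run before the replacement would show which statistics are affected.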
Each baseball player has game statistics associated with them. Below is the list of player level data.
names(dat)
## [1] "Salary" "Name" "POS" "Bats" "Throws" "Team"
## [7] "G" "PA" "HR" "R" "RBI" "SB"
## [13] "BB." "K." "ISO" "BABIP" "AVG" "OBP"
## [19] "SLG" "wOBA" "wRC." "BsR" "Off" "Def"
## [25] "WAR" "playerid"
Descriptions of the statistics are given in the appendix.
Since the game statistics are in different units, we standardize the data by subtracting the mean and dividing by the standard deviation, \(x_{changed} = \frac{x-\mu}{s}\). Additionally, we add two new variables, Off.norm and Def.norm, which are the Off and Def ratings normalized using the formula \(x_{changed}=\frac{x-min(x)}{max(x)-min(x)}\). We use the normalized offensive and defensive ratings to quickly evaluate the optimal team according to the ILP.
# select numeric columns and relevant variables
dat.scaled = scale(dat[,sapply(dat, class) == "numeric"][,c(-1:-2,-19)])
# normalize Off and Def
dat$Off.norm = (dat$Off-min(dat$Off))/(max(dat$Off)-min(dat$Off))
dat$Def.norm = (dat$Def-min(dat$Def))/(max(dat$Def)-min(dat$Def))
head(dat.scaled[,1:4])
## PA HR R RBI
## [1,] 0.9239111 1.2879067 0.7024833 0.4469482
## [2,] 0.6851676 0.6505590 0.4831027 0.8744364
## [3,] 0.6625837 0.4115537 0.0687172 0.7989973
## [4,] -0.9634531 -0.7834733 -0.9306832 -0.9109555
## [5,] 1.1013556 0.5708906 0.6293565 0.8744364
## [6,] -0.9634531 -0.7834733 -0.9306832 -0.9109555
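To make the two transformations concrete, here is a small sketch on a made-up vector of three values (the vector `x` is an assumption for illustration). It shows that `scale()` matches the manual standardization and that min-max normalization maps the values onto the interval from 0 to 1.

```r
x = c(2, 4, 6)                              # made-up statistic for three players

# Standardization: subtract the mean, divide by the sample standard deviation
z = (x - mean(x)) / sd(x)
z.scale = as.vector(scale(x))               # scale() does the same thing

# Min-max normalization: smallest value becomes 0, largest becomes 1
x.norm = (x - min(x)) / (max(x) - min(x))

print(z)        # -1 0 1
print(x.norm)   # 0.0 0.5 1.0
```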
Now that we have scaled player stats, we weight them and add them up to obtain the player utility index \(U_i\) for player \(i\), which we use in the objective function.
\[U_i(x) = w_{1}\text{PA}_i+w_{2}\text{HR}_i+w_{3}\text{R}_i+w_{4}\text{RBI}_i+w_{5}\text{SB}_i+w_{6}\text{ISO}_i+w_{7}\text{BABIP}_i+w_{8}\text{AVG}_i+w_{9}\text{OBP}_i+w_{10}\text{SLG}_i+w_{11}\text{wOBA}_i+w_{12}\text{wRC.}_i+w_{13}\text{BsR}_i+w_{14}\text{Off}_i+w_{15}\text{Def}_i+w_{16}\text{WAR}_i\]
\(\text{ for player } i \text{ where } i \in \{1,\dots,199\}\)
By introducing weights we can construct the weight vector which best suits our preferences. For example, if we wanted the player utility index to value offensive statistics like RBI more than defensive statistics like Def, we would just assign a bigger weight to RBI. We decided to value each statistic equally, i.e. all weights are equal.
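A weighted utility index is just a matrix-vector product of the standardized stats and the weight vector. The following is a minimal sketch on made-up numbers; the matrix `stats` and the alternative weight vector `w.off` are hypothetical and only illustrate how a non-equal weighting would change the index.

```r
# Made-up standardized stats: three players (rows), three statistics (columns)
stats = matrix(c( 1.0, -0.5,  0.2,
                  0.3,  0.8, -1.0,
                 -1.3, -0.3,  0.8),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("P1", "P2", "P3"), c("RBI", "Def", "WAR")))

w.equal = c(RBI = 1, Def = 1, WAR = 1)   # equal weights, as chosen in this post
w.off   = c(RBI = 2, Def = 1, WAR = 1)   # hypothetical: RBI counts double

U.equal = as.vector(stats %*% w.equal)   # identical to rowSums(stats)
U.off   = as.vector(stats %*% w.off)

print(U.equal)   # 0.7 0.1 -0.8
print(U.off)     # 1.7 0.4 -2.1
```

With equal weights the index reduces to a plain row sum, which is why the objective in our model can be computed by summing each row of the scaled data.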
In baseball there is a 25-man active roster and a 40-man roster, which includes the 25-man active roster. To start a new team we focus on building the perfect 25-man roster. Typically, a 25-man roster consists of five starting pitchers (SP), seven relief pitchers (Reliever), two catchers (C), six infielders (IN), and five outfielders (OF). The current position variable POS has more categories than these five groups, so we map it to a new variable POS2 with the five types SP, Reliever, C, IN, OF.
position = function(x){ # map a detailed position x to its roster group
  x = as.character(x)
  if(x %in% c("1B","2B","3B","SS")) "IN"  # all infield positions
  else if(x == "Closer") "Reliever"       # closers count as relievers
  else x                                  # SP, Reliever, C, OF stay as they are
}
dat$POS2 = sapply(dat$POS, position)
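The same grouping can also be written without a helper function, using vectorized assignment. This is only an alternative sketch on a made-up vector of positions (`pos` is an assumption). If the real POS column is a factor, it would need `as.character()` first.

```r
pos = c("1B", "SS", "Closer", "Reliever", "SP", "C", "OF")  # example POS values

pos2 = pos
pos2[pos %in% c("1B", "2B", "3B", "SS")] = "IN"   # infield positions
pos2[pos == "Closer"] = "Reliever"                # closers are relievers

print(table(pos2))   # quick check of the group sizes
```

A `table()` of the grouped variable is a cheap way to confirm that only the five intended groups remain.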
Additionally, we will make sure that our 25-man active roster has at least one player at each of the following positions: first base (1B), second base (2B), third base (3B) and shortstop (SS).
There is no salary cap in Major League Baseball, but rather a threshold of $189 million on the 40-man roster payroll for the period 2014-2016, beyond which a luxury tax applies. For first-time violators the tax is 22.5% of the amount by which they exceed the threshold. We decided to allocate $178 million to the 25-man roster.
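As a quick worked example of the tax rule just described, the sketch below computes the tax for a given payroll in millions of dollars. The function name `luxury_tax` is hypothetical, and the defaults are simply the threshold and first-time rate quoted above.

```r
# Luxury tax in $ millions for a payroll in $ millions.
# threshold and rate are the figures quoted in the text (first-time violators).
luxury_tax = function(payroll, threshold = 189, rate = 0.225) {
  rate * pmax(payroll - threshold, 0)   # only the excess over the threshold is taxed
}

print(luxury_tax(178))   # 0: below the threshold, no tax
print(luxury_tax(200))   # 0.225 * (200 - 189) = 2.475
```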
To model the above basic constraints and an objective function, we use the player utility index \(U(x_1,x_2,...,x_n)\), which is a function of the chosen set of \(n\) player game statistics. In our model we maximize the sum of the player utility indices of the selected players. The \(n = 16\) game statistics of interest are
PA, HR, R, RBI, SB, ISO, BABIP, AVG, OBP, SLG, wOBA, wRC., BsR, Off, Def, WAR
Below is the resulting model. \[ \begin{align} \text{max } & \sum^{199}_{i=1}U_i*x_i \\ \text{s. t. } & \sum^{199}_{i=1}x_i = 25 \\ & \sum x_{\text{SP}} \ge 5 \\ & \sum x_{\text{Reliever}} \ge 7 \\ & \sum x_{\text{C}} \ge 2 \\ & \sum x_{\text{IN}} \ge 6 \\ & \sum x_{\text{OF}} \ge 5 \\ & \sum x_{\text{POS}} \ge 1 \text{ for } POS \in \{\text{1B,2B,3B,SS}\}\\ & \sum x_{\text{LeftHandPitchers}} \ge 2 \\ & \sum x_{\text{LeftHandBatters}} \ge 2 \\ & \frac{1}{25} \sum Stat_{ij}x_{i} \ge mean(Stat_{j}) \text{ for } j = 1,2,...,16 \\ & \sum^{199}_{i=1}salary_i*x_i \le 178 \end{align} \] where \(x_i \in \{0,1\}\) indicates whether player \(i\) is selected, \(Stat_{ij}\) is the value of statistic \(j\) for player \(i\), and \(salary_i\) is the salary of player \(i\) in millions of dollars.
Constraint (2) ensures that we get 25 players. Constraints (3) through (10) ensure that the number of players with certain attributes meets the required minimum. The collection of constraints (11) makes sure that our team’s average game statistics match or exceed the average game statistics across all players. Constraint (12) ensures that we stay within our budget including the luxury tax.
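To see what such a model selects, it helps to solve a tiny instance by hand. The sketch below is not the solver approach used in this post: it brute-forces a made-up six-player problem in base R (pick 3 players, at least one SP, salary at most 10) by enumerating every subset. All numbers and names are assumptions for illustration.

```r
u = c(5, 4, 3, 7, 2, 1)                       # made-up utility index per player
s = c(6, 3, 2, 7, 1, 1)                       # made-up salaries in $ millions
g = c("SP", "OF", "OF", "SP", "OF", "SP")     # made-up position groups

rosters = combn(length(u), 3)                 # every 3-player subset, one per column

# A roster is feasible if it fits the budget and has at least one SP
feasible = apply(rosters, 2, function(idx)
  sum(s[idx]) <= 10 && sum(g[idx] == "SP") >= 1)

# Objective: total utility of the roster; infeasible rosters are ruled out
value = apply(rosters, 2, function(idx) sum(u[idx]))
value[!feasible] = -Inf

best = rosters[, which.max(value)]
print(best)         # players 3, 4 and 5
print(max(value))   # total utility 12
```

Enumeration only works for toy sizes. With 199 players and 25 roster spots there are far too many subsets to list, which is why an ILP solver is needed for the real problem.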
Below is the solution of this program.
library("lpSolve")
i = 199 # number of players (variables)
# constraints
cons = rbind(
rep(1,i), # 25 man constraint (2)
sapply(dat$POS2, function(x) if (x == "SP") x=1 else x=0), # (3)
sapply(dat$POS2, function(x) if (x == "Reliever") x=1 else x=0), # (4)
sapply(dat$POS2, function(x) if (x == "C") x=1 else x=0), # (5)
sapply(dat$POS2, function(x) if (x == "IN") x=1 else x=0), # (6)
sapply(dat$POS2, function(x) if (x == "OF") x=1 else x=0), # (7)
sapply(dat$POS, function(x) if (x == "1B") x=1 else x=0), # (8)
sapply(dat$POS, function(x) if (x == "2B") x=1 else x=0), # (8)
sapply(dat$POS, function(x) if (x == "3B") x=1 else x=0), # (8)
sapply(dat$POS, function(x) if (x == "SS") x=1 else x=0), # (8)
sapply(dat$Throws, function(x) if (x == "L") x=1 else x=0), # (9)
sapply(dat$Bats, function(x) if (x == "L") x=1 else x=0), # (10)
t(dat[,colnames(dat.scaled)])/25, # (11) outperform the average
dat$Salary/1000000 # (12) budget constraint
)
# model
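f.obj = apply(dat.scaled,1,sum) # utility index with equal weights
f.dir = c("=",rep(">=",27),"<=")
# right-hand sides, in the same order as the rows of cons:
# (2)-(7), then (8) for 1B, 2B, 3B, SS, then (9) and (10)
f.rhs = c(25,5,7,2,6,5,rep(1,4),2,2,
apply(dat[,colnames(dat.scaled)],2,mean),
178)
model = lp("max", f.obj, cons, f.dir, f.rhs, all.bin=TRUE,compute.sens=1)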
model
## Success: the objective function is 135.6201
sol = model$solution
sol
## [1] 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## [71] 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Let’s look at our ideal baseball team given the constraints outlined above.
# selected players
dat[which(sol>0),c(1:3,6,28:29)]
## Salary Name POS Team Def.norm POS2
## 4 31000000 Clayton Kershaw SP 0.3713163 SP
## 8 6083333 Mike Trout OF Angels 0.4125737 OF
## 24 19750000 David Price SP 0.3713163 SP
## 26 507500 Dellin Betances Closer Yankees 0.3713163 Reliever
## 29 509000 Matt Duffy 2B Giants 0.5972495 IN
## 53 17142857 Max Scherzer SP 0.3713163 SP
## 54 17277777 Buster Posey C Giants 0.5324165 C
## 62 509500 Carson Smith Closer Mariners 0.3713163 Reliever
## 71 519500 A.J. Pollock OF Diamondbacks 0.5422397 OF
## 83 535000 Trevor Rosenthal Closer Cardinals 0.3713163 Reliever
## 87 547100 Cody Allen Closer Indians 0.3713163 Reliever
## 108 2500000 Dee Gordon 2B Marlins 0.5402750 IN
## 109 2500000 Bryce Harper OF Nationals 0.2043222 OF
## 113 2725000 Lorenzo Cain OF Royals 0.6797642 OF
## 115 3083333 Paul Goldschmidt 1B Diamondbacks 0.2357564 IN
## 119 3200000 Zach Britton Closer Orioles 0.3713163 Reliever
## 121 3630000 Jake Arrieta SP 0.3713163 SP
## 129 4300000 Josh Donaldson 3B Blue Jays 0.5815324 IN
## 139 6000000 Chris Sale SP 0.3713163 SP
## 142 7000000 Russell Martin C Blue Jays 0.6110020 C
## 143 7000000 Wade Davis Closer Royals 0.3713163 Reliever
## 150 8050000 Aroldis Chapman Closer Reds 0.3713163 Reliever
## 163 10500000 Yoenis Cespedes OF - - - 0.5795678 OF
## 176 14000000 Joey Votto 1B Reds 0.1886051 IN
## 194 543000 Xander Bogaerts SS Red Sox 0.5284872 IN
Seems like a decent team, with mean normalized offensive and defensive ratings of 0.414495 and 0.4275835 respectively. For comparison, the mean normalized offensive and defensive ratings for all players are 0.3019702 and 0.3821564 respectively. Our team outperforms the average, and its mean offensive and defensive ratings are better than those of about 82.9% and 78.4% of players, respectively.
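One way such a "better than X% of players" figure can be computed is as the share of players whose rating lies below the team mean. The sketch below illustrates this on made-up ratings; the vector `off.norm` and the selected rows in `team` are assumptions, and this is not necessarily the exact calculation behind the numbers above.

```r
# Made-up normalized ratings for ten players
off.norm = c(0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.50, 0.60, 0.80, 1.00)
team = c(7, 9, 10)                            # hypothetical selected rows

team.mean = mean(off.norm[team])              # mean rating of the selected players

# Share of all players whose rating is below the team mean, in percent
pct.below = 100 * mean(off.norm < team.mean)

print(team.mean)   # about 0.767
print(pct.below)   # 80
```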
While this is a straightforward way to model the selection of players, there are several nuances we need to address. One of them is that the standardized game statistics are not additively independent. As a result, our utility index measures the player’s value poorly and is biased. It is possible to construct an unbiased utility index, which has been done a lot in baseball (look up sabermetrics). Off, Def and a lot of other statistics are examples of such utility indices.
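One reading of this concern is that overlapping statistics get counted more than once when they are simply summed. The extreme case is sketched below with made-up data: `avg` is constructed as a rescaled copy of `hits` (a deliberately artificial assumption), so after standardization the two columns are identical and the sum counts the same information twice.

```r
hits = c(100, 150, 200, 120)      # made-up hit totals for four players
avg  = hits / 500                 # artificial: a rescaled copy of hits

z = scale(cbind(hits, avg))       # standardize both columns
U = rowSums(z)                    # equal-weight utility index

print(cor(hits, avg))             # 1: the two stats carry the same information
print(U)
print(2 * z[, "hits"])            # the index is just the hits z-score, doubled
```

Real statistics overlap less than this, but stats such as OBP, SLG and wOBA share ingredients, so a plain sum gives those ingredients extra weight.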
Another issue we need to address is the substitution of missing values with zero. Players with missing game statistics have their utility index diminished because one of the stats used to calculate it is zero. However, imputing with zero is better than imputing with the mean in our case. By imputing with the mean we would introduce new information into the data which may be misleading, e.g. when a player’s actual game stat is worse or better than the average. As a result, the player utility index would be overestimated or underestimated.
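The difference between the two imputation choices shows up after standardization. The sketch below uses a made-up statistic with one missing value (`x` is an assumption) and compares the standardized value the missing entry ends up with under each choice.

```r
x = c(10, 20, NA, 30)                                  # made-up stat, one value missing

x.zero = ifelse(is.na(x), 0, x)                        # impute with zero
x.mean = ifelse(is.na(x), mean(x, na.rm = TRUE), x)    # impute with the mean

z.zero = as.vector(scale(x.zero))
z.mean = as.vector(scale(x.mean))

print(z.zero[3])   # negative: zero imputation lowers the player's index
print(z.mean[3])   # 0: mean imputation treats the player as exactly average
```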
Finally, I believe that using statistical and mathematical methods is only acceptable as a supplement to the decision making process not only in baseball, but in every field.
Baseball statistics abbreviations
Source: Wikipedia::Baseball statistics