Baseball Optimization

Introduction

For this project, we decided to come up with our own project instead of referring to a previous sabermetric model. The project ended up becoming an optimization model, run in Excel with OpenSolver, attempting to derive an “All Star” Major League Baseball team from 2018 league and salary data. As you know, there are nine starting positions on a baseball field ranging from the pitcher and catcher to the four infielders and three outfielders. Right off the bat, the goal was to create a model involving comparing salaries of players and on-field baseball performance data. The final model ended up optimizing the net WAR (Wins Above Replacement), which is an on-field baseball statistic that could apply to every player regardless of position, of a team given a certain salary cap (the maximum amount one team can pay all of its players) in the year 2018.

Initially, the hardest part was deciding which player statistics would be used in evaluating the performance of each player in their specific position, and it came down to looking between on-base percentage (OBP) or WAR. The OBP statistic shows how frequently a batter reaches a base per plate appearance, and it can sometimes be used to evaluate pitchers. On the other hand, WAR is a Sabermetric invention that tries to sum up a player’s total contributions to their team in one number. The equation adds together Batting Runs, Base Running Runs, Fielding Runs, Positional Adjustment, League Adjustment, and Replacement Runs, and then divides the total by Runs Per Win. After some guidance, WAR was determined to be the better, or at least more interesting, statistic to look at for this model.

Looking through all of the databases used for baseball data, we landed on using Baseball Reference’s statistics in R Studio. We combined fielding and player tables to ultimately filter the data for the variables that we wanted which were playerID, position, salary, and WAR, as seen in Figure 1. A challenge that we came across with filtering the data at first was that some players played multiple positions, but we were able to overcome that obstacle. Moving to our model, we were interested in finding a mix of players that would fit under a certain salary cap while still achieving a high net WAR, so we looked up the highest salary cap in the MLB. Most websites surprisingly had different answers for which team had the highest, but we landed on using the number $235,650,000 as our salary cap.

Due to simplifying the model and time constraints, we left pitchers out of the mix of positions. Assuming a 25 man roster, we decided that 10 of the players could be pitchers, and they were factored out. To still account for them in the salary, a pitcher average salary was created and subtracted from the gross salary cap. Figure 2 shows how the new net salary was derived. In addition, if someone wanted to run this model for other levels of salary caps, we included a medium and a low for reference.

(Image 1)

After this, the other eight starting positions were organized into six separate categories by combining the three outfield positions into one, along with the four infielders and the catcher. Each category had a “Required” cell that indicated how many players to select for the specific position, as you can see in Figure 3. The position of catcher and the four infielders had two required players while the outfielders had five. The model depicts every player by row with columns showing the player name, their salary, their WAR statistic for that season, and a column depicting whether that player was picked for the All-Star team.

(Image 2)

The goal for the model was to create an “All-Star” team that could produce the highest net WAR while staying within the boundaries of a salary cap. The cell being optimized in this situation was the net WAR because the max WAR was desired. The variable cells were designated to be the highlighted “Chosen” columns because depending on who was chosen, that is where we would gather our net WAR. As Figure 4 shows, one can see the different constraints that were set in the model using OpenSolver.

(Image 3)

Every “Chosen” cell ended up matching the “Required” cell number due to the constraints placed upon them. Additionally, every “Chosen” column had to be binary in order to indicate which players were being picked for each position. Lastly, the net salaries of all the players chosen remained less than or equal to the salary less the salaries excluded for pitchers. As a reminder, for every figure shown, the highest salary cap was used. If one wanted to observe the other salary caps, they would just need to change the net salaries constraint to be less than or equal to the new salary cap. Overall, the goal of selecting teams with the highest net WAR based off the given salary caps was achieved.

After finding the optimal team roster in Excel, we went back to R Studio with the model to look closer at the individual statistics of the players chosen.

Optimal Team

All of our players that were chosen have the highest WAR that we could afford according to the Dodgers 2018 salary cap. Of the players that made the optimal roster, only one had a WAR under 4. When the best players of the optimal roster are selected for a starting lineup (of 8 players because we are disregarding pitchers and DH) the optimal roster achieves a WAR of 57.45. In comparison, the best WAR non-pitching team had in 2018 was the Houston Astros with 19.7. One unit of WAR is equal to 10 runs more than you could have with a replacement player, which is also how many runs you need to be up to win one more game. Because one unit of WAR is equal to 10 runs, our lineup has a positive run differential of 574.5 more runs than it would with a team full of replacements. That means that our team would have a differential of 377.5 more runs than the 2018 Astros. The optimal team had a batting average of .270, .025 higher than the 2018 Astros. The optimal team also put together better stats in every major category than 2018 Astros. The Houston Astros had the best OPS in the league at .847 while our team compiled an OPS of .891, a .044 difference.

Comparable Teams

In order to better understand the strength and significance of the Optimal Team, comparable teams were created. Included are the Best, Worst, Efficient, and Inefficient teams. These players are the same ones used for the Optimal team and will also be built with 2 players at each position except for the outfield, which will include 5 players on the roster. Like the optimal team, players with salaries under $1 million will be excluded since they will skew the data.

Best Team

The best team was created by putting together a roster with the 15 highest WAR players. The team does not observe any contraints with salary and is truly meant to be the best possible team available.

The roster is quite impressive, boasting nine players with WARs over four. However, despite being the best team, it only differs by one player from the optimal team. The optimal team uses Marcus Semien as its secondary shortstop opposed to the best team’s Didi Gregorius. This one difference is not much as Gregorius adds only .1 WAR to the team and significant amount of salary. This roster is most similar to the optimal roster of the rosters produced.

Efficient Team

In order to further compare the optimal team, the efficient team was created to evaluate the best way to maximize a roster’s WAR while minimizing the cost to achieve such WAR. In order to achieve this team, we created a new variable called WAR_Cost. This variable takes the salary of the player and divides it by their WAR. In order to avoid complications with bad players, any players with a WAR under 0 was filtered out of the data set. The data table below shows the players that can be found on this roster.

The efficient team differs significantly from the optimal team. While the optimal team’s best individual player WAR belongs to Mookie Betts at 10.63, the efficient team’s best WAR is Jose Ramirez at 7.62. However, while Bett’s WAR comes at a cost of $998,000 per WAR, Ramirez produces his WAR at only $371,000 per WAR. Ramirez did make it on to the optimal team and had the lowest cost per WAR on the team. Comparatively, he ranks fourth lowest on the efficient team. In fact, Paul Goldschmidt, who was able to make it onto both rosters, ranked as the highest cost per WAR on the efficient team while ranking 8th highest cost per WAR on the optimal team. Overall, the two rosters only had four players overlap.

The biggest differences between these two rosters is the amount spent and their combined WAR. While the optimal roster spent nearly $182 million, the efficient roster spent only $46 million. In addition, the optimal roster was still able to produce a team WAR of 94.86, towering over the efficient team’s 62.52. When looking at how the teams spent their money per WAR, the optimal team spent $1,178,000 more per WAR. A team like the Rays, who have a smaller payroll, would be more likely to use the efficient roster while the Dodgers might target the optimal roster.

Inefficient and Worst Teams

For fun, it seemed like it would be interesting to compare how the worst team and inefficient teams would look. These teams consisted of some of the worst players that received a salary over $1 million.

The inefficient team was created taking any players with a WAR over 0. Using players under 0 created a negative number for the cost per WAR, which did not make logical sense. What was not taken into account, however, was the cost of WAR for marginal players, or players that may not have seen much time in the field despite their salary. This could have been the result of injury, a bench role, or some other circumstance. These players remained included anyways. Below is the inefficient team.

The roster for the inefficient team is actually very interesting. It includes several players that are likely bound for the hall of fame, including Albert Pujols and Miguel Cabrera. Considering that higher salaries lend more opportunities for inefficient players, it might be expected that many of the players have high salaries. However, seven of the 15 players have salaries over $10 million, three of which are over $20 million. Even a player making “only” $1 million made this team with 61 games played. So not all inefficient players came in the form of high salaries.

The worst team ignored any salaries or costs of WAR. Put simply, it is a compilation of the absolute worst players based on WAR a tean could possibly field. Unlike the efficiency teams, this team includes negative WAR players. As a result the worst team uses the absolute worst players. Below is the data table containing this roster.

The best full-time player on the worst roster was Adam Jones, accumulating a staggering WAR of 0.28 at the cost of $17.3 million. In contrast, Chris Davis was, unsurprisingly, the worst player on the worst roster with a -3.6 WAR at a cost of $23 million. This glorious roster could be produced at the cheap rate of $103 million, $78 million less than the optimal team and $57 million more than the efficient team.

Team Stat Table

The table below shows the combined stats of each of the teams created. It allows for easy comparison between the teams and to see how vastly different some of these teams were from one another.

team	WAR	Cost	avg_GMs	BA	OBP	SLG	OPS	H	2B	3B	HR	RBI	SB	SB%	BB	SO	SO_BB	war_cost
Best	94.96	187061600	145.733	0.297	0.380	0.522	0.901	2425	490	53	412	1296	207	0.805	1082	1645	1.520	1969898.9
Efficient	62.52	45997648	138.400	0.269	0.342	0.489	0.831	1995	407	53	375	1158	113	0.743	823	1711	2.079	735726.9
Inefficient	4.01	175670833	95.400	0.250	0.304	0.390	0.695	1208	241	7	142	572	36	0.735	379	979	2.583	43808187.8
Optimal	94.86	181936600	147.400	0.296	0.378	0.513	0.891	2451	502	50	400	1280	211	0.808	1095	1707	1.559	1917948.6
Worst	-11.85	103445833	106.533	0.232	0.280	0.356	0.636	1249	232	14	135	539	49	0.721	360	1194	3.317	-8729606.2