Baseball Salaries vs. Performance

Introduction

In this document I compare the salaries of Major League Baseball position players versus their performance as measured by the “on-base plus slugging” (OPS) statistic. I chose OPS because it has been shown to be a better predictor of a player’s value to a team than batting average, while still being relatively simple to calculate.

My primary goals in doing this analysis are as follows:

To explore the distribution of OPS: how many players have a relatively low OPS, a relatively high OPS, or an OPS value somewhere in the middle.
To explore the distribution of salaries in a similar way.
To explore the relationship between OPS and salaries.

I look at OPS for the 2015 season and salaries for the 2016 season, under the assumption that 2015 performance would have at least some influence on 2016 salaries. I chose 2015 and 2016 for the analysis because 2016 is the last year for which my data source has salary data.

I also explore the distribution of players’ time spent in the major leagues, since players’ lifetime earnings are determined not only by their salaries but also by how many years they play.

Finally, I also have an interest in knowing whether OPS values follow a Gaussian (normal) distribution, so I apply two common tests for normality to the OPS values for 2015.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the tidyverse package of functions for general data manipulation and the tools package to get the md5sum function.

library("tidyverse")
library("tools")

Data sources

I use data from the Lahman baseball database, a publicly-available resource containing historical data for baseball players from the initial days of the game through the present. For more information on how I obtained this data see the “References” section.

I use the following CSV files from the Lahman database in the analysis:

Batting.csv contains regular season batting statistics for players.
Appearances.csv contains information on the positions played by players in each season.
Salaries.csv contains payroll salaries for players.

Rows in each of these files contain the following variables:

playerID. A unique alphanumeric ID assigned to each player.
yearID. A (four-digit) year in which the player was on a major league team roster.

Since a player may play for multiple teams in a given season, the combination of playerID and yearID is not necessarily unique.

The CSV file Batting.csv contains an additional variable to uniquely identify a row:

stint. A number identifying each of a player’s stints in MLB in a given year; for example, if a player starts out the season with one team and then switches teams mid-season, their time with the first team would have stint equal to 1 and their time with the second team would have stint equal to 2.

Within Batting.csv I use the following variables for each player/year/stint:

AB. Number of official at-bats.
H. Hits.
2B. Doubles.
3B. Triples.
HR. Home runs.
BB. Walks.
HBP. On base due to being hit by a pitch.
SF. Sacrifice flies.

Within Appearances.csv I use the G_p variable, games pitched.

Within Salaries.csv I use the salary variable.

Before doing anything else I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("Batting.csv") == "5081b151f61a5debb6cb91b2f2f00195")
stopifnot(md5sum("Appearances.csv") == "ba95f3e1b55ca6748b5140ee648346e7")
stopifnot(md5sum("Salaries.csv") == "0bb931df8e973987c4dd1f8c4f68a4c3")

Reading in and preparing the data

I begin by reading in the various CSV files; the col_types parameter is a string identifying each column as a character string (“c”) or an integer (“i”):

batting <- read_csv("Batting.csv", , col_types = "ciicciiiiiiiiiiiiiiiii")
salaries <- read_csv("Salaries.csv", col_types = "iccci")
appearances <- read_csv("Appearances.csv", col_types = "iccciiiiiiiiiiiiiiiii")

Analysis

I am interested in the OPS of position players evaluated on their batting performance (i.e., who are not pitchers). I also want to look only at players who have some minimum number of batting chances in a given season.

There are at least two ways to make this determination: First, if a new player has accumulated 130 or more at-bats in previous seasons then they are no longer considered a rookie in the current year. Thus any player who has 130 or more at-bats in a season can be considered to be reasonably active.

Second, a player must have at least 502 plate appearances in a year in order to qualify for the batting championship. The actual number of at-bats can be less, since (for example) a walk counts as a plate appearance but not as an at-bat. In practice a player contending for the batting championship would typically have at least 400 at-bats.

For purposes of this analysis I require a minimum of 130 at-bats, since at-bats are available directly in the database and also because this less-stringent criterion makes for a larger sample size.

Identifying position players with 130 or more at-bats

I first create a temporary table listing all players in the 2015 season who were not pitchers, as follows:

Start with the Appearances table.
Filter for all player appearances in 2015.
For each player, calculate the total number of games they pitched that year for all teams on which they played.
Filter for players who pitched no games at all.
Keep only the player IDs.

temp1 <-
  appearances %>%
  filter(yearID == 2015) %>%
  group_by(playerID) %>%
  summarize(G_p = sum(G_p)) %>%
  filter(G_p == 0) %>%
  select(playerID)

In 2015 there were 613 players who were position players (not pitchers).

I then create a second temporary table totaling at-bats for all players in 2015, as follows:

Start with the batting table.
Filter for all player at-bats in 2015.
For each player, calculate the total number of at-bats that year for all teams on which they played.
Filter for players with 130 or more at-bats.
Keep only the player IDs.

temp2 <-
  batting %>%
  filter(yearID == 2015) %>%
  group_by(playerID) %>%
  summarize(AB = sum(AB)) %>%
  filter(AB >= 130) %>%
  select(playerID)

In 2015 there were 391 players who had 130 or more at-bats (either as a position player or a pitcher).

Finally I join the two temporary tables to create a table players containing only position players with more than 130 at-bats in 2015.

players <- inner_join(temp1, temp2, by = "playerID")

In 2015 there were 375 position players who had more than 130 at-bats.

Calculating OPS

I next calculate the 2015 OPS for the players.

OPS is defined as the sum of the on-base percentage and the slugging percentage. Players can get on-base as the result of a hit (variable H), a walk (BB), or being hit by a pitch (HBP). The on-base percentage is the sum of those variables divided by the number of plate appearances, which is in turn calculated as the sum of the at-bats, walks, being hit by a pitch, and sacrifice flies (SF).

The slugging percentage is defined as the total number of bases produced by hits divided by the number of at-bats. The total number of bases is calculated as the number of hits (H), each of which counts for at least one base, together with the number of extra bases resulting from doubles (2B), triples (3B), and home runs (HR).

I create a table ops_2015 as follows:

Start with the batting table.
Filter for players in 2015.
For each player, calculate the total hits, at-bats, doubles, triples, home runs, walks, times hit by a pitch, and sacrifice flies, across all the teams for which they played in that year.
Join the resulting table with the players table, to produce a table containing only position players with at least 130 at-bats.
Calculate the OPS for each player as a new field OPS.
Keep only the fields for the player ID and the OPS.

ops_2015 <-
  batting %>%
  filter(yearID == 2015) %>%
  group_by(playerID) %>%
  summarize(
    H = sum(H),
    AB = sum(AB),
    `2B` = sum(`2B`),
    `3B` = sum(`3B`),
    HR = sum(HR),
    BB = sum(BB),
    HBP = sum(HBP),
    SF = sum(SF)
  ) %>%
  inner_join(players, by = "playerID") %>%
  mutate(
    OPS =
      (H + BB + HBP) / (AB + BB + HBP + SF) +
        (H + `2B` + 2 * `3B` + 3 * HR) / AB
  ) %>%
  select(playerID, OPS)

Plotting the distribution of OPS

I now create a histogram showing the distribution of OPS in 2015, that is, how many position players had an OPS within particular ranges. I add a vertical dashed line to show the average OPS across all the qualifying position players.

ops_2015 %>%
  ggplot() +
  geom_histogram(
    mapping = aes(x = OPS),
    binwidth = 0.020
  ) +
  geom_vline(
    mapping = aes(xintercept = mean(OPS)),
    linetype = "dashed"
  ) +
  xlab("2015 On-Base Percentage Plus Slugging Percentage (OPS)") +
  ylab("Number of Players") +
  scale_x_continuous(breaks = seq(0.4, 1.2, 0.1))

The OPS of qualifying position players in 2015 ranged from a low of 0.459 to a high of 1.109. The average OPS was 0.732 (with standard deviation of 0.097).

Calculating salaries

I next look at the distribution of salaries in 2016, starting with creating a table salaries_2016 of the total salaries for players in 2016:

Start with the salaries table.
Filter for salaries in 2016.
Calculate the total salary for each player summed across all teams for which they played. I remove missing values, since a player may have played for more than one team but received salary from only one.
Convert the salaries to express them in thousands of dollars.
Join the resulting table with the players table using the player ID as the common field, in order to include only salaries for players who were qualifying position players in 2015.
Keep only the playerID and salary fields.

salaries_2016 <-
  salaries %>%
  filter(yearID == 2016) %>%
  group_by(playerID) %>%
  summarize(salary = sum(salary, na.rm = TRUE)) %>%
  mutate(salary = salary / 1000) %>%
  inner_join(players, by = "playerID") %>%
  select(playerID, salary)

Some players in 2015 were no longer in Major League Baseball in 2016. Of the 375 qualifying position players in 2015, only 330 were on MLB team payrolls in 2016.

Plotting the distribution of salaries

I now create a histogram showing the distribution of salaries in 2016, that is, how many qualifying position players had a salary within particular ranges (for example, between $500,000 and $1.5 million, between $1.5 million and $2.5 million, and so on). I add a vertical dashed line to show the average salary across all the qualifying position players.

Because salaries for some players are so high that they distort the average, I also add a dotted line to show the median salary. (By definition half of all players make less than the median salary.)

salaries_2016 %>%
  ggplot() +
  geom_histogram(
    mapping = aes(x = salary),
    binwidth = 1000
  ) +
  geom_vline(
    mapping = aes(xintercept = mean(salary)),
    linetype = "dashed"
  ) +
  geom_vline(
    mapping = aes(xintercept = median(salary)),
    linetype = "dotted"
  ) +
  xlab("2016 Salary ($000)") +
  ylab("Number of Players") +
  scale_x_continuous(breaks = seq(0, 30000, 5000))

The salary of qualifying position players in 2016 ranged from a low of $507,500 to a high of about $28 million. The median salary was about $3,075,000 (versus an average salary of about $5,913,000). There were 116 players in the lowest salary category in the histogram above (between $500,000 and $1.5 million), about 35% of the 330 players in the sample.

Comparing 2016 salaries to 2015 OPS

I now create a scatterplot showing 2016 salaries compared to the 2015 OPS of each qualifying position player, as follows:

Create a table salary_vs_ops combining the 2015 OPS and 2016 salary data for each qualifying position player.
Create a plot with salary on the x-axis and OPS on the y-axis.
Plot one point for each player’s 2015 OPS and corresponding 2016 salary.
Add a linear regression line to indicate the general trend of 2016 salary increasing with 2015 OPS.
Customize x- and y-axis labels and tick marks.
Use the black and white theme for plot colors.

salary_vs_ops <- inner_join(ops_2015, salaries_2016, by = "playerID")
salary_vs_ops %>%
  ggplot(mapping = aes(x = OPS, y = salary)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("2015 On-Base Percentage Plus Slugging Percentage (OPS)") +
  ylab("2016 Salary ($000)") +
  scale_x_continuous(breaks = seq(0.4, 1.2, 0.1)) +
  scale_y_continuous(breaks = seq(0, 30000, 5000))

To take a more formal look at the relationship between 2015 OPS and 2016 salaries I calculate a linear model of salary versus OPS:

lm_salary_vs_ops <- lm(salary_vs_ops$salary ~ salary_vs_ops$OPS)
summary(lm_salary_vs_ops)

## 
## Call:
## lm(formula = salary_vs_ops$salary ~ salary_vs_ops$OPS)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8873  -4549  -1882   2987  18981 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -8587       2882  -2.979   0.0031 ** 
## salary_vs_ops$OPS    19459       3839   5.069 6.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6346 on 328 degrees of freedom
## Multiple R-squared:  0.07264,    Adjusted R-squared:  0.06981 
## F-statistic: 25.69 on 1 and 328 DF,  p-value: 6.709e-07

Based on the scatterplot and the relatively low R-squared values, OPS is a relatively poor predictor of salary in the following year. The most likely reason is that many players are on long-term contracts at pay levels that do not reflect their current performance.

How long do players stay in the major leagues?

I want to know how long players typically stay in the major leagues once they succeed in getting there. In order to do this I need to ensure that I count all the years that a given player was active, from their very first season to their very last season. That means I need to exclude currently active players, since their careers may extend for several more years. But what counts as being “currently active”?

To be conservative I exclude all players who made major league appearances after 2010. This helps account for players who (for example) were in the major leagues in 2016, were sent down to the minors for 2017, and might reappear in the major leagues in 2018 or later. If a player has been out of the major leagues for seven or eight years the chances of them coming back are very small, if not zero.

I also exclude players who were active before 1901, the time many consider to be the beginning of “modern baseball”. Most notably, it marks the consolidation of professional baseball into the two current major leagues, the National League and American League, with all other leagues relegated to “minor” status.

I first construct a table of players excluded from the analysis, as follows:

Start with the appearances table.
Filter for years before 1901 and after 2010.
Keep only the player ID.
Eliminate duplicate rows so that each player ID appears only once in the table.

excluded_players <-
  appearances %>%
  filter(yearID < 1901 | yearID > 2010) %>%
  select(playerID) %>%
  unique()

There are 5,074 players excluded from the analysis.

I then calculate the number of seasons played by players who came into the major leagues in 1901 or thereafter and whose last season was no later than 2010. I do this as follows:

Again, start with the appearances table.
Filter for the years 1901 through 2010.
Keep only the year and player ID.
Eliminate duplicate rows. This prevents double counting when a player plays with two or more teams in a single season.
Remove all seasons played by the excluded players, i.e., those active before 1901 or after 2010. (The anti_join() function keeps only those rows where the player ID is not in the excluded_players table being joined.)
For each remaining player calculate the number of seasons they appeared in the major leagues. (Note that these seasons may not be consecutive, for example if a player gets sent back down to the minor leagues for a season and then is called back up to the majors in a subsequent season.)

seasons_per_player <-
  appearances %>%
  filter(yearID %in% 1901:2010) %>%
  select(yearID, playerID) %>%
  unique() %>%
  anti_join(excluded_players, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(seasons = n())

The resulting table includes 14,108 players who played a total of 72,707 seasons within the 110-year period from 1901 through 2010.

I now plot a histogram showing the distribution of number of seasons played. As I did when plotting the distribution of salaries, I add a vertical dashed line to show the average number of seasons played, and a vertical dotted line to show the median number of seasons played.

seasons_per_player %>%
  ggplot() +
  geom_histogram(
    mapping = aes(x = seasons),
    binwidth = 1
  ) +
  geom_vline(
    mapping = aes(xintercept = mean(seasons)),
    linetype = "dashed"
  ) +
  geom_vline(
    mapping = aes(xintercept = median(seasons)),
    linetype = "dotted"
  ) +
  xlab("Number of Seasons in Major Leagues") +
  ylab("Number of Players") +
  scale_x_continuous(breaks = seq(0, 30, 5)) +
  scale_y_continuous(breaks = seq(0, 3000, 500))

The maximum number of seasons played by any player was 27. The median number of seasons played was 3 (versus an average number of seasons played of about 5.2). There were 3,808 players who played only one season, about 27% of the 14,108 players in the sample.

How many players leave the major leagues after playing a given number of seasons? (In other words, what is the typical “mortality rate” for baseball players?) I calculate this as follows:

Start with the seasons_per_player table from above.
Total the number of players who played a given number of seasons.
Add a final row to indicate that zero players played beyond the maximum number of seasons played by anyone.
Compute the number of players who left baseball after a given number of seasons, as the number of players who played that many seasons minus the number of players who played one more season. (The function lead() takes values from the next row of the table.)
Filter out the final row, for which the number of players who left is undefined.
Compute the mortality rate as the ratio of the number of players leaving after a given number of seasons to the number of players who played that many seasons.
Keep only the number of seasons and the mortality rate for each number of seasons.

mortality <-
  seasons_per_player %>%
  group_by(seasons) %>%
  summarize(n_players = n()) %>%
  add_row(seasons = max(seasons_per_player$seasons) + 1, n_players = 0) %>%
  mutate(n_left_bb = n_players - lead(n_players)) %>%
  filter(!is.na(n_left_bb)) %>%
  mutate(rate = n_left_bb / n_players) %>%
  select(seasons, rate)

On average about 27% of players leave baseball after playing a given number of seasons. However this disguises a good deal of variation in the “mortality rate”, as shown in the following graph.

mortality %>%
  ggplot(mapping = aes(x = seasons, y = rate)) +
  geom_bar(stat = "identity") +
  geom_smooth(method = "loess") +
  xlab("Number of Seasons in Major Leagues") +
  ylab("Fraction of Players Leaving the Major Leagues") +
  scale_x_continuous(breaks = seq(0, 30, 5)) +
  scale_y_continuous(breaks = seq(0, 1, 0.2))

In general the fraction of players leaving baseball after playing a given number of seasons follows a U-shaped curve: “mortality” is highest in the early years, flattens out around 8 to 10 seasons played, and then gradually rises again. (There are a couple of anomalies at 21 and 26 seasons played, likely due to the small sample size that far out in the tail.)

Are OPS values normally distributed?

At first glance the OPS values for 2015 have the rough form of a Gaussian (normal) distribution: symmetrical around a central peak, with left and right tails that diminish relatively rapidly to zero. Is OPS truly normally distributed?

There are at least two ways to check this. First, I use the Shapiro-Wilk normality test on the OPS values for qualifying position players in 2015:

sw_2015 <- shapiro.test(ops_2015$OPS)

The value of the Shapiro-Wilk test statistic is 0.995 with associated p-value of 0.335. The smaller the p-value the less likely the distribution is to be Gaussian. Since the p-value is relatively high, I shouldn’t rule out the possibility that the OPS values are in fact normally distributed.

Next I do a Q-Q plot of the OPS data to compare the quantiles of the OPS values with the quantiles of a set of values from a Gaussian distribution.

qqnorm(
  ops_2015$OPS,
  main = "Q-Q Plot of 2015 OPS vs. Gaussian Distribution",
  xlab = "Quantiles for Gaussian Distribution",
  ylab = "2015 OPS Quantiles"
)
qqline(ops_2015$OPS)

The OPS values seem fairly consistent with values from a Gaussian distribution within their middle range, but there are systematic deviations from normality in the left and right tails. Thus I have some doubts as to whether OPS values are truly normally distributed.

Appendix

Caveats

Many people consider OPS to fall short of an ideal statistic, for a variety of reasons. In particular, it puts equal weight on on-base percentage and slugging percentage, even though in practice on-base percentage is more important.

Various statistics have been proposed as replacements for OPS, including On-base Plus Slugging Plus (OPS+) and Wins Above Replacement (WAR). However for my purposes OPS is adequate while being much easier to calculate.

Another caveat relates to long-term contracts: a player’s current salary is not a function of their immediate past performance (i.e., in the previous year, as I’ve graphed above), but rather of their performance prior to when they signed the contract—which may have been several years ago. Modeling this better would require information about contract dates and terms, information not available in the data source I’m using.

References

I extracted the CSV files Batting.csv, Salaries.csv, and Appearances.csv from the March 28, 2018 version of the Lahman baseball database. (This is a zip archive of CSV files.)

Environment

I used the following R environment in doing the analysis above:

sessionInfo()

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] bindrcpp_0.2    dplyr_0.7.2     purrr_0.2.3     readr_1.1.1    
## [5] tidyr_0.7.0     tibble_1.4.2    ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.12     cellranger_1.1.0 pillar_1.2.1     compiler_3.4.4  
##  [5] plyr_1.8.4       bindr_0.1        forcats_0.2.0    digest_0.6.12   
##  [9] lubridate_1.6.0  jsonlite_1.5     evaluate_0.10.1  nlme_3.1-137    
## [13] gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.2.0     
## [17] psych_1.7.5      yaml_2.1.14      parallel_3.4.4   haven_1.1.0     
## [21] xml2_1.1.1       httr_1.3.0       stringr_1.2.0    knitr_1.17      
## [25] hms_0.3          rprojroot_1.2    grid_3.4.4       glue_1.1.1      
## [29] R6_2.2.2         readxl_1.0.0     foreign_0.8-70   rmarkdown_1.6   
## [33] modelr_0.1.1     reshape2_1.4.2   magrittr_1.5     backports_1.1.0 
## [37] scales_0.4.1     htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.0
## [41] mnormt_1.5-5     colorspace_1.3-2 labeling_0.3     stringi_1.1.5   
## [45] lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

Source code

You can find the source code for this analysis and others at my Seven Answers public Gitlab repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.