In this document I compare the salaries of Major League Baseball position players versus their performance as measured by the “on-base plus slugging” (OPS) statistic. I chose OPS because it has been shown to be a better predictor of a player’s value to a team than batting average, while still being relatively simple to calculate.
My primary goals in doing this analysis are as follows:
I look at OPS for the 2015 season and salaries for the 2016 season, under the assumption that 2015 performance would have at least some influence on 2016 salaries. I chose 2015 and 2016 for the analysis because 2016 is the last year for which my data source has salary data.
I also explore the distribution of players’ time spent in the major leagues, since players’ lifetime earnings are determined not only by their salaries but also by how many years they play.
Finally, I also have an interest in knowing whether OPS values follow a Gaussian (normal) distribution, so I apply two common tests for normality to the OPS values for 2015.
For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.
I use the tidyverse package of functions for general data manipulation and the tools package to get the md5sum
function.
library("tidyverse")
library("tools")
I use data from the Lahman baseball database, a publicly-available resource containing historical data for baseball players from the initial days of the game through the present. For more information on how I obtained this data see the “References” section.
I use the following CSV files from the Lahman database in the analysis:
Batting.csv
contains regular season batting statistics for players.Appearances.csv
contains information on the positions played by players in each season.Salaries.csv
contains payroll salaries for players.Rows in each of these files contain the following variables:
playerID
. A unique alphanumeric ID assigned to each player.yearID
. A (four-digit) year in which the player was on a major league team roster.Since a player may play for multiple teams in a given season, the combination of playerID
and yearID
is not necessarily unique.
The CSV file Batting.csv
contains an additional variable to uniquely identify a row:
stint
. A number identifying each of a player’s stints in MLB in a given year; for example, if a player starts out the season with one team and then switches teams mid-season, their time with the first team would have stint
equal to 1 and their time with the second team would have stint
equal to 2.Within Batting.csv
I use the following variables for each player/year/stint:
AB
. Number of official at-bats.H
. Hits.2B
. Doubles.3B
. Triples.HR
. Home runs.BB
. Walks.HBP
. On base due to being hit by a pitch.SF
. Sacrifice flies.Within Appearances.csv
I use the G_p
variable, games pitched.
Within Salaries.csv
I use the salary
variable.
Before doing anything else I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.
stopifnot(md5sum("Batting.csv") == "5081b151f61a5debb6cb91b2f2f00195")
stopifnot(md5sum("Appearances.csv") == "ba95f3e1b55ca6748b5140ee648346e7")
stopifnot(md5sum("Salaries.csv") == "0bb931df8e973987c4dd1f8c4f68a4c3")
I begin by reading in the various CSV files; the col_types
parameter is a string identifying each column as a character string (“c”) or an integer (“i”):
batting <- read_csv("Batting.csv", , col_types = "ciicciiiiiiiiiiiiiiiii")
salaries <- read_csv("Salaries.csv", col_types = "iccci")
appearances <- read_csv("Appearances.csv", col_types = "iccciiiiiiiiiiiiiiiii")
I am interested in the OPS of position players evaluated on their batting performance (i.e., who are not pitchers). I also want to look only at players who have some minimum number of batting chances in a given season.
There are at least two ways to make this determination: First, if a new player has accumulated 130 or more at-bats in previous seasons then they are no longer considered a rookie in the current year. Thus any player who has 130 or more at-bats in a season can be considered to be reasonably active.
Second, a player must have at least 502 plate appearances in a year in order to qualify for the batting championship. The actual number of at-bats can be less, since (for example) a walk counts as a plate appearance but not as an at-bat. In practice a player contending for the batting championship would typically have at least 400 at-bats.
For purposes of this analysis I require a minimum of 130 at-bats, since at-bats are available directly in the database and also because this less-stringent criterion makes for a larger sample size.
I first create a temporary table listing all players in the 2015 season who were not pitchers, as follows:
Appearances
table.temp1 <-
appearances %>%
filter(yearID == 2015) %>%
group_by(playerID) %>%
summarize(G_p = sum(G_p)) %>%
filter(G_p == 0) %>%
select(playerID)
In 2015 there were 613 players who were position players (not pitchers).
I then create a second temporary table totaling at-bats for all players in 2015, as follows:
batting
table.temp2 <-
batting %>%
filter(yearID == 2015) %>%
group_by(playerID) %>%
summarize(AB = sum(AB)) %>%
filter(AB >= 130) %>%
select(playerID)
In 2015 there were 391 players who had 130 or more at-bats (either as a position player or a pitcher).
Finally I join the two temporary tables to create a table players
containing only position players with more than 130 at-bats in 2015.
players <- inner_join(temp1, temp2, by = "playerID")
In 2015 there were 375 position players who had more than 130 at-bats.
I next calculate the 2015 OPS for the players.
OPS is defined as the sum of the on-base percentage and the slugging percentage. Players can get on-base as the result of a hit (variable H
), a walk (BB
), or being hit by a pitch (HBP
). The on-base percentage is the sum of those variables divided by the number of plate appearances, which is in turn calculated as the sum of the at-bats, walks, being hit by a pitch, and sacrifice flies (SF
).
The slugging percentage is defined as the total number of bases produced by hits divided by the number of at-bats. The total number of bases is calculated as the number of hits (H
), each of which counts for at least one base, together with the number of extra bases resulting from doubles (2B
), triples (3B
), and home runs (HR
).
I create a table ops_2015
as follows:
batting
table.players
table, to produce a table containing only position players with at least 130 at-bats.OPS
.ops_2015 <-
batting %>%
filter(yearID == 2015) %>%
group_by(playerID) %>%
summarize(
H = sum(H),
AB = sum(AB),
`2B` = sum(`2B`),
`3B` = sum(`3B`),
HR = sum(HR),
BB = sum(BB),
HBP = sum(HBP),
SF = sum(SF)
) %>%
inner_join(players, by = "playerID") %>%
mutate(
OPS =
(H + BB + HBP) / (AB + BB + HBP + SF) +
(H + `2B` + 2 * `3B` + 3 * HR) / AB
) %>%
select(playerID, OPS)
I now create a histogram showing the distribution of OPS in 2015, that is, how many position players had an OPS within particular ranges. I add a vertical dashed line to show the average OPS across all the qualifying position players.
ops_2015 %>%
ggplot() +
geom_histogram(
mapping = aes(x = OPS),
binwidth = 0.020
) +
geom_vline(
mapping = aes(xintercept = mean(OPS)),
linetype = "dashed"
) +
xlab("2015 On-Base Percentage Plus Slugging Percentage (OPS)") +
ylab("Number of Players") +
scale_x_continuous(breaks = seq(0.4, 1.2, 0.1))
The OPS of qualifying position players in 2015 ranged from a low of 0.459 to a high of 1.109. The average OPS was 0.732 (with standard deviation of 0.097).
I next look at the distribution of salaries in 2016, starting with creating a table salaries_2016
of the total salaries for players in 2016:
salaries
table.players
table using the player ID as the common field, in order to include only salaries for players who were qualifying position players in 2015.playerID
and salary
fields.salaries_2016 <-
salaries %>%
filter(yearID == 2016) %>%
group_by(playerID) %>%
summarize(salary = sum(salary, na.rm = TRUE)) %>%
mutate(salary = salary / 1000) %>%
inner_join(players, by = "playerID") %>%
select(playerID, salary)
Some players in 2015 were no longer in Major League Baseball in 2016. Of the 375 qualifying position players in 2015, only 330 were on MLB team payrolls in 2016.
I now create a histogram showing the distribution of salaries in 2016, that is, how many qualifying position players had a salary within particular ranges (for example, between $500,000 and $1.5 million, between $1.5 million and $2.5 million, and so on). I add a vertical dashed line to show the average salary across all the qualifying position players.
Because salaries for some players are so high that they distort the average, I also add a dotted line to show the median salary. (By definition half of all players make less than the median salary.)
salaries_2016 %>%
ggplot() +
geom_histogram(
mapping = aes(x = salary),
binwidth = 1000
) +
geom_vline(
mapping = aes(xintercept = mean(salary)),
linetype = "dashed"
) +
geom_vline(
mapping = aes(xintercept = median(salary)),
linetype = "dotted"
) +
xlab("2016 Salary ($000)") +
ylab("Number of Players") +
scale_x_continuous(breaks = seq(0, 30000, 5000))
The salary of qualifying position players in 2016 ranged from a low of $507,500 to a high of about $28 million. The median salary was about $3,075,000 (versus an average salary of about $5,913,000). There were 116 players in the lowest salary category in the histogram above (between $500,000 and $1.5 million), about 35% of the 330 players in the sample.
I now create a scatterplot showing 2016 salaries compared to the 2015 OPS of each qualifying position player, as follows:
salary_vs_ops
combining the 2015 OPS and 2016 salary data for each qualifying position player.salary_vs_ops <- inner_join(ops_2015, salaries_2016, by = "playerID")
salary_vs_ops %>%
ggplot(mapping = aes(x = OPS, y = salary)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("2015 On-Base Percentage Plus Slugging Percentage (OPS)") +
ylab("2016 Salary ($000)") +
scale_x_continuous(breaks = seq(0.4, 1.2, 0.1)) +
scale_y_continuous(breaks = seq(0, 30000, 5000))
To take a more formal look at the relationship between 2015 OPS and 2016 salaries I calculate a linear model of salary versus OPS:
lm_salary_vs_ops <- lm(salary_vs_ops$salary ~ salary_vs_ops$OPS)
summary(lm_salary_vs_ops)
##
## Call:
## lm(formula = salary_vs_ops$salary ~ salary_vs_ops$OPS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8873 -4549 -1882 2987 18981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8587 2882 -2.979 0.0031 **
## salary_vs_ops$OPS 19459 3839 5.069 6.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6346 on 328 degrees of freedom
## Multiple R-squared: 0.07264, Adjusted R-squared: 0.06981
## F-statistic: 25.69 on 1 and 328 DF, p-value: 6.709e-07
Based on the scatterplot and the relatively low R-squared values, OPS is a relatively poor predictor of salary in the following year. The most likely reason is that many players are on long-term contracts at pay levels that do not reflect their current performance.
I want to know how long players typically stay in the major leagues once they succeed in getting there. In order to do this I need to ensure that I count all the years that a given player was active, from their very first season to their very last season. That means I need to exclude currently active players, since their careers may extend for several more years. But what counts as being “currently active”?
To be conservative I exclude all players who made major league appearances after 2010. This helps account for players who (for example) were in the major leagues in 2016, were sent down to the minors for 2017, and might reappear in the major leagues in 2018 or later. If a player has been out of the major leagues for seven or eight years the chances of them coming back are very small, if not zero.
I also exclude players who were active before 1901, the time many consider to be the beginning of “modern baseball”. Most notably, it marks the consolidation of professional baseball into the two current major leagues, the National League and American League, with all other leagues relegated to “minor” status.
I first construct a table of players excluded from the analysis, as follows:
appearances
table.excluded_players <-
appearances %>%
filter(yearID < 1901 | yearID > 2010) %>%
select(playerID) %>%
unique()
There are 5,074 players excluded from the analysis.
I then calculate the number of seasons played by players who came into the major leagues in 1901 or thereafter and whose last season was no later than 2010. I do this as follows:
appearances
table.anti_join()
function keeps only those rows where the player ID is not in the excluded_players
table being joined.)seasons_per_player <-
appearances %>%
filter(yearID %in% 1901:2010) %>%
select(yearID, playerID) %>%
unique() %>%
anti_join(excluded_players, by = "playerID") %>%
group_by(playerID) %>%
summarize(seasons = n())
The resulting table includes 14,108 players who played a total of 72,707 seasons within the 110-year period from 1901 through 2010.
I now plot a histogram showing the distribution of number of seasons played. As I did when plotting the distribution of salaries, I add a vertical dashed line to show the average number of seasons played, and a vertical dotted line to show the median number of seasons played.
seasons_per_player %>%
ggplot() +
geom_histogram(
mapping = aes(x = seasons),
binwidth = 1
) +
geom_vline(
mapping = aes(xintercept = mean(seasons)),
linetype = "dashed"
) +
geom_vline(
mapping = aes(xintercept = median(seasons)),
linetype = "dotted"
) +
xlab("Number of Seasons in Major Leagues") +
ylab("Number of Players") +
scale_x_continuous(breaks = seq(0, 30, 5)) +
scale_y_continuous(breaks = seq(0, 3000, 500))
The maximum number of seasons played by any player was 27. The median number of seasons played was 3 (versus an average number of seasons played of about 5.2). There were 3,808 players who played only one season, about 27% of the 14,108 players in the sample.
How many players leave the major leagues after playing a given number of seasons? (In other words, what is the typical “mortality rate” for baseball players?) I calculate this as follows:
seasons_per_player
table from above.lead()
takes values from the next row of the table.)mortality <-
seasons_per_player %>%
group_by(seasons) %>%
summarize(n_players = n()) %>%
add_row(seasons = max(seasons_per_player$seasons) + 1, n_players = 0) %>%
mutate(n_left_bb = n_players - lead(n_players)) %>%
filter(!is.na(n_left_bb)) %>%
mutate(rate = n_left_bb / n_players) %>%
select(seasons, rate)
On average about 27% of players leave baseball after playing a given number of seasons. However this disguises a good deal of variation in the “mortality rate”, as shown in the following graph.
mortality %>%
ggplot(mapping = aes(x = seasons, y = rate)) +
geom_bar(stat = "identity") +
geom_smooth(method = "loess") +
xlab("Number of Seasons in Major Leagues") +
ylab("Fraction of Players Leaving the Major Leagues") +
scale_x_continuous(breaks = seq(0, 30, 5)) +
scale_y_continuous(breaks = seq(0, 1, 0.2))
In general the fraction of players leaving baseball after playing a given number of seasons follows a U-shaped curve: “mortality” is highest in the early years, flattens out around 8 to 10 seasons played, and then gradually rises again. (There are a couple of anomalies at 21 and 26 seasons played, likely due to the small sample size that far out in the tail.)
At first glance the OPS values for 2015 have the rough form of a Gaussian (normal) distribution: symmetrical around a central peak, with left and right tails that diminish relatively rapidly to zero. Is OPS truly normally distributed?
There are at least two ways to check this. First, I use the Shapiro-Wilk normality test on the OPS values for qualifying position players in 2015:
sw_2015 <- shapiro.test(ops_2015$OPS)
The value of the Shapiro-Wilk test statistic is 0.995 with associated p-value of 0.335. The smaller the p-value the less likely the distribution is to be Gaussian. Since the p-value is relatively high, I shouldn’t rule out the possibility that the OPS values are in fact normally distributed.
Next I do a Q-Q plot of the OPS data to compare the quantiles of the OPS values with the quantiles of a set of values from a Gaussian distribution.
qqnorm(
ops_2015$OPS,
main = "Q-Q Plot of 2015 OPS vs. Gaussian Distribution",
xlab = "Quantiles for Gaussian Distribution",
ylab = "2015 OPS Quantiles"
)
qqline(ops_2015$OPS)
The OPS values seem fairly consistent with values from a Gaussian distribution within their middle range, but there are systematic deviations from normality in the left and right tails. Thus I have some doubts as to whether OPS values are truly normally distributed.
Many people consider OPS to fall short of an ideal statistic, for a variety of reasons. In particular, it puts equal weight on on-base percentage and slugging percentage, even though in practice on-base percentage is more important.
Various statistics have been proposed as replacements for OPS, including On-base Plus Slugging Plus (OPS+) and Wins Above Replacement (WAR). However for my purposes OPS is adequate while being much easier to calculate.
Another caveat relates to long-term contracts: a player’s current salary is not a function of their immediate past performance (i.e., in the previous year, as I’ve graphed above), but rather of their performance prior to when they signed the contract—which may have been several years ago. Modeling this better would require information about contract dates and terms, information not available in the data source I’m using.
I extracted the CSV files Batting.csv
, Salaries.csv
, and Appearances.csv
from the March 28, 2018 version of the Lahman baseball database. (This is a zip archive of CSV files.)
I used the following R environment in doing the analysis above:
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] tools stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bindrcpp_0.2 dplyr_0.7.2 purrr_0.2.3 readr_1.1.1
## [5] tidyr_0.7.0 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 cellranger_1.1.0 pillar_1.2.1 compiler_3.4.4
## [5] plyr_1.8.4 bindr_0.1 forcats_0.2.0 digest_0.6.12
## [9] lubridate_1.6.0 jsonlite_1.5 evaluate_0.10.1 nlme_3.1-137
## [13] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0
## [17] psych_1.7.5 yaml_2.1.14 parallel_3.4.4 haven_1.1.0
## [21] xml2_1.1.1 httr_1.3.0 stringr_1.2.0 knitr_1.17
## [25] hms_0.3 rprojroot_1.2 grid_3.4.4 glue_1.1.1
## [29] R6_2.2.2 readxl_1.0.0 foreign_0.8-70 rmarkdown_1.6
## [33] modelr_0.1.1 reshape2_1.4.2 magrittr_1.5 backports_1.1.0
## [37] scales_0.4.1 htmltools_0.3.6 rvest_0.3.2 assertthat_0.2.0
## [41] mnormt_1.5-5 colorspace_1.3-2 labeling_0.3 stringi_1.1.5
## [45] lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
You can find the source code for this analysis and others at my Seven Answers public Gitlab repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.