Statistical analysis has revolutionized Major League Baseball (MLB) but there are still inefficiencies. Due to various subjective factors (marketability, negotiation skills, loyalty), and a labor union which favors veteran contracts, many players aren’t paid proportionally to their on-field performance. While baseball performance is easily measurable thanks to the Wins Above Replacement (WAR) statistic, “years played” is more correlated with salary than WAR. This analysis reveals that multiple players are under or overpaid by 10s of millions of dollars, and that certain teams systematically over or underpay. Included in this investigation are action items to address these inefficiencies for players, teams, and union negotiators alike. The primary model explains salary as a function of experience and WAR. Conclusions include that contracts over $10M are almost always overpaying and for every increase in 1 of years played in the MLB, a typical player can expect a salary boost of $809K.
Major League Baseball (MLB) is the second highest revenue sports league in the United States, having generated $10.8B in revenue for 2022, second only to the NFL’s $18B. Roughly 45% of MLB’s revenue goes towards its players’ salaries. But, as with office work, baseball salaries are not perfectly correlated with performance. Unlike office workers, for MLB players we can quantify which players are overpaid, underpaid, and by how much using linear regression.
My project will use the player statistics Wins Above Replacement (WAR) and experience (years in MLB) to predict MLB salaries. WAR is a baseball statistic which captures a player’s overall offensive and defensive performance in a single value, therefore it’s an efficient way compare players.
My research question is: What is the relationship between a baseball player’s Wins Above Replacement (WAR) statistic, their experience level, and their salary? Here the independent variables are experience and WAR while the dependent variable is salary. For example, how much would increasing WAR or years of MLB played by 1 be expected to increase a player’s salary? I will use this research question as a jumping off point for such data exploration as which players are the most overpaid or underpaid, and which teams tend to overpay or underpay.
The MLB baseball player data I will use was scraped from Baseball Reference pages from 2005-2019 then uploaded to Kaggle. Baseball-Reference.com is the authoritative site on baseball stats. For documentation, and so users running this notebook don’t need a Kaggle account, I’ve taken the data from Kaggle and uploaded it to my Github here, where I’ll access it in this RMD.
Reading in dataset, showing dimensions, and head.
baseball <- read.csv("https://raw.githubusercontent.com/rossboehme/DATA606/main/finalproject/baseball_2005_to_2019.csv")
dim(baseball)
## [1] 16224 8
head(baseball)
## team year name war sal exp playerid lastsal
## 1 BOS 2018 Mookie Betts 10.6 10500000 5 bettsmo01 950000
## 2 LAA 2012 Mike Trout 10.5 492500 2 troutmi01 414000
## 3 KCR 2009 Zack Greinke 10.4 3750000 6 greinza01 1400000
## 4 NYM 2018 Jacob deGrom 10.3 7400000 5 degroja01 4050000
## 5 STL 2009 Albert Pujols 9.7 14427326 9 pujolal01 13870949
## 6 WSN 2015 Bryce Harper 9.7 2500000 4 harpebr03 2150000
This Baseball-Reference data contains 16,224 observations, one for each
MLB player per season. The 8 columns are:
While I could use lastsal as an independent variable/predictor, it would ruin my model’s ability to without bias answer “Which players are over or underpaid?” A given player’s salary in their previous year would have a strong pearson coefficient with their salary in the current year. However the “previous salary” does not give insight into whether they merit that salary, like “war” and “exp” do.
The only major changes I’ll make to the dataset are:This dataset is relatively simple and clean. Overview of my data preparation:
#Checking data types
str(baseball)
## 'data.frame': 16224 obs. of 8 variables:
## $ team : chr "BOS" "LAA" "KCR" "NYM" ...
## $ year : int 2018 2012 2009 2018 2009 2015 2018 2015 2016 2005 ...
## $ name : chr "Mookie Betts" "Mike Trout" "Zack Greinke" "Jacob deGrom" ...
## $ war : num 10.6 10.5 10.4 10.3 9.7 9.7 9.7 9.5 9.5 9.4 ...
## $ sal : int 10500000 492500 3750000 7400000 14427326 2500000 573000 25000000 566000 26000000 ...
## $ exp : int 5 2 6 5 9 4 4 12 3 12 ...
## $ playerid: chr "bettsmo01" "troutmi01" "greinza01" "degroja01" ...
## $ lastsal : int 950000 414000 1400000 4050000 13870949 2150000 544000 26000000 514500 22000000 ...
#Convering year to string
baseball$year <- as.character(baseball$year)
#Ensuring correct data types
str(baseball)
## 'data.frame': 16224 obs. of 8 variables:
## $ team : chr "BOS" "LAA" "KCR" "NYM" ...
## $ year : chr "2018" "2012" "2009" "2018" ...
## $ name : chr "Mookie Betts" "Mike Trout" "Zack Greinke" "Jacob deGrom" ...
## $ war : num 10.6 10.5 10.4 10.3 9.7 9.7 9.7 9.5 9.5 9.4 ...
## $ sal : int 10500000 492500 3750000 7400000 14427326 2500000 573000 25000000 566000 26000000 ...
## $ exp : int 5 2 6 5 9 4 4 12 3 12 ...
## $ playerid: chr "bettsmo01" "troutmi01" "greinza01" "degroja01" ...
## $ lastsal : int 950000 414000 1400000 4050000 13870949 2150000 544000 26000000 514500 22000000 ...
#Filter for only 2019, remove observations with 0 salary, then reset index
baseball <- baseball[baseball$year == '2019',]
baseball <- baseball[baseball$sal > 0,]
rownames(baseball) <- NULL
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
baseball %>%
select(sal, war, exp) %>%
summary() %>%
knitr::kable()
| sal | war | exp | |
|---|---|---|---|
| Min. : 100000 | Min. :-2.2000 | Min. : 2.000 | |
| 1st Qu.: 556950 | 1st Qu.:-0.2000 | 1st Qu.: 3.000 | |
| Median : 591875 | Median : 0.3000 | Median : 4.000 | |
| Mean : 3645685 | Mean : 0.8743 | Mean : 5.564 | |
| 3rd Qu.: 4000000 | 3rd Qu.: 1.5000 | 3rd Qu.: 8.000 | |
| Max. :38333333 | Max. : 9.1000 | Max. :19.000 |
library(ggplot2)
library(cowplot)
sal_hist <- ggplot(data = baseball, aes(x = sal,fill=)) +
geom_histogram(fill="#F8B195") +
xlab("Salary Bin") +
ylab("Number of Players") +
ggtitle("MLB Player Salaries 2019") +
guides(fill="none") +
scale_x_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6),breaks=seq(0,50000000,5000000))
war_bar <- ggplot(data = baseball, aes(x = war,)) +
geom_bar(fill="#F67280") +
xlab("WAR") +
ylab("Number of Players") +
ggtitle("MLB Player WAR 2019") +
guides(fill="none")
exp_bar <- ggplot(data = baseball, aes(x = exp)) +
geom_bar(fill="#C06C84") +
xlab("Experience (Years Played)") +
ylab("Number of Players") +
ggtitle("MLB Player Experience 2019") +
guides(fill="none") +
scale_x_continuous(breaks=seq(0,20,1))
plot_grid(sal_hist,war_bar,exp_bar)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Showing salary by team. There are large discrepancies in payrolls. The Boston Red Sox (BOS), Chicago Cubs (CHC), and Washington Nationals (WSN) all spent more than $175M on salaries, while nearly half (14) of all the 32 teams spent less than $100M. The lowest spender, Toronto Blue Jays (TOR), payed out only ~$40M.
library(dplyr)
library(ggplot2)
baseball %>%
group_by(team) %>%
dplyr::summarise(total_salaries = sum(sal)) %>%
dplyr::arrange(total_salaries) %>%
ggplot(aes(x= reorder(team,total_salaries), y = total_salaries)) +
geom_bar(stat='identity',fill="#008000") +
xlab("Team") +
ylab("Sum of Salaries") +
ggtitle("MLB Salaries Per Team in 2019") +
coord_flip() +
guides(fill="none") +
scale_y_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6))
Showing relationship between salary and WAR, including linear regression line. As you can see, there are many outliers, but there is a slight apparent trend that players with higher WAR earn a higher salary. If we look at only exp levels of 7+, where players are past their rookie contract, the trend is more apparent, however the sample size is more limited.
sal_vs_war_all <- baseball %>%
ggplot(mapping = aes(x = war, y = sal)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Wins Above Replacement (WAR)") +
ylab("Salary") +
ggtitle("MLB Player Salary vs. WAR - 2019") +
scale_y_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6))
sal_vs_war_all
## `geom_smooth()` using formula = 'y ~ x'
sal_vs_war_7plus <- baseball %>%
filter(exp >= 7) %>%
ggplot(mapping = aes(x = war, y = sal)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Wins Above Replacement (WAR)") +
ylab("Salary") +
ggtitle("7+ Year Exp Player Salary vs. WAR - 2019") +
scale_y_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6))
sal_vs_war_7plus
## `geom_smooth()` using formula = 'y ~ x'
Using a correlation matrix to assess which x variables (experience, WAR) have the highest pearson coefficient with my y variable (salary). It appears that experience (“exp”) has a higher correlation with salary than WAR. This is an interesting finding because experience doesn’t win baseball games while WAR does; there’s an apparent inefficiency worth exploring.
In addition, it’s important to show that my two predictors are not collinear, which adds credibility to my regression coefficients and p-values.
bball_small <- baseball[c('sal','exp','war')]
cor(bball_small)
## sal exp war
## sal 1.0000000 0.61913822 0.29749278
## exp 0.6191382 1.00000000 0.09640601
## war 0.2974928 0.09640601 1.00000000
Graphing to show stronger correlation between salary vs. experience than the previous chart’s salary vs. WAR. There are still many outliers, but hopefully I can create an effective model if I combine WAR + experience as independent variables.
sal_vs_exp <- baseball %>%
ggplot(mapping = aes(x = exp, y = sal)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Experience (Year in MLB)") +
ylab("Salary") +
ggtitle("MLB Player Salary vs. Experience - 2019") +
scale_y_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6))
sal_vs_exp
## `geom_smooth()` using formula = 'y ~ x'
I will use lm, a linear regression model for my analysis because as shown in section 3 above, there is a generally linear relationship between my independent (WAR + experience) and dependent (salary) variables.
Initializing models for evaluation. Using both WAR + experience as independent variables yields the highest R squared (0.4392) vs. just WAR (0.08752) or just exp (0.3827). The “combined” model has p-values well below 0.05 for both WAR and experience (<2e-16 for each).
war_model <- lm(sal ~ war, data=baseball)
summary(war_model)
##
## Call:
## lm(formula = sal ~ war, data = baseball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11287855 -2518270 -1898288 426877 29147535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2769072 202982 13.642 <2e-16 ***
## war 1002614 105567 9.497 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5516000 on 929 degrees of freedom
## Multiple R-squared: 0.0885, Adjusted R-squared: 0.08752
## F-statistic: 90.2 on 1 and 929 DF, p-value: < 2.2e-16
exp_model <- lm(sal ~ exp, data=baseball)
summary(exp_model)
##
## Call:
## lm(formula = sal ~ exp, data = baseball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13125819 -1492902 -479137 532978 30183840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2003159 278149 -7.202 1.23e-12 ***
## exp 1015265 42248 24.031 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4537000 on 929 degrees of freedom
## Multiple R-squared: 0.3833, Adjusted R-squared: 0.3827
## F-statistic: 577.5 on 1 and 929 DF, p-value: < 2.2e-16
combined_model <- lm(sal ~ war + exp, data=baseball)
summary(combined_model)
##
## Call:
## lm(formula = sal ~ war + exp, data = baseball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11972996 -2101750 -35317 1180592 25882078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2499333 269967 -9.258 <2e-16 ***
## war 808969 83147 9.729 <2e-16 ***
## exp 977319 40456 24.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4324000 on 928 degrees of freedom
## Multiple R-squared: 0.4404, Adjusted R-squared: 0.4392
## F-statistic: 365.2 on 2 and 928 DF, p-value: < 2.2e-16
However, we can’t trust p-values and parameter estimates on their own. Before I perform an OLS regression analysis, I will need to assess whether the conditions of least squares regression are reasonable (using residual plots to evaluate).
plot(combined_model)
summary(combined_model)
##
## Call:
## lm(formula = sal ~ war + exp, data = baseball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11972996 -2101750 -35317 1180592 25882078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2499333 269967 -9.258 <2e-16 ***
## war 808969 83147 9.729 <2e-16 ***
## exp 977319 40456 24.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4324000 on 928 degrees of freedom
## Multiple R-squared: 0.4404, Adjusted R-squared: 0.4392
## F-statistic: 365.2 on 2 and 928 DF, p-value: < 2.2e-16
Interpreting equation of line. Assuming a base of $-2,499,333 yearly salary at 0 exp and 0 war, for every increase of 1 in war, salary increases by $808,969 and for every increase of 1 in exp, salary increases by $977,319. \[ \hat{y} = -2,499,333 + 808,969 \times war\ + 977,319 \times exp \]
Interpreting R-squared - The adjusted R-squared is 0.4392, which means that 43.92% of the dependent (salary) variable’s variance can be explained.I’ll add columns to the data to calculate y-hat and the residuals for each observation. I can use these new columns to determine which are the most overpaid players and which teams tend to overpay the most.
y_interc <- -2499333
war_coef <- 808969
exp_coef <- 977319
baseball <- baseball %>%
mutate(expected_sal = y_interc + (war_coef*war) + (exp_coef*exp)) %>%
mutate(residual_sal = sal - expected_sal) %>%
mutate(pay_status = if_else(residual_sal >=0, "Overpaid","Underpaid" ))
Charting players’ actual v. expected salary, with “expected” based on my lm model. As you can see, once a player receives a salary of $10M or more, the vast majority of them are overpaid relative to their experience and WAR. There are only 2 players (1.7%) who are “underpaid” while 117 (98.3%) are “overpaid.” This shows an inefficiency for potential further exploration: high salaries ($10M+) are almost never worth it in terms of the WAR and experience you’re buying.
actual_v_exp_chart <- ggplot(baseball, aes(x = expected_sal, y = sal, colour=pay_status)) +
geom_point() +
ggtitle('MLB Players\' Actual v. Expected Salary 2019') +
labs(x="Expected Player Salary",y="Actual Salary") +
scale_y_continuous(labels = scales::dollar_format(scale = .000001,suffix = "M")) +
scale_x_continuous(labels = scales::dollar_format(scale = .000001,suffix = "M")) +
guides(fill=guide_legend(title="Pay Status"))
actual_v_exp_chart
To drive home the point that $10M+ salary players are overpaid, this histogram shows that a much higher proportion of sub-$10M salaries are underpaid.
sal_status_hist <- ggplot(data = baseball, aes(x = sal,fill=pay_status)) +
geom_histogram(bins=10) +
xlab("Salary Bin") +
ylab("Number of Players") +
ggtitle("MLB Player Salaries 2019") +
scale_x_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6),breaks=seq(0,50000000,10000000))
sal_status_hist
The three most underpaid and overplayed players in the MLB are graphed below. All underpaid players should attempt to renegotiate as soon as possible or sign with a different team because their opportunity cost is equal to the delta between expected and actual salary (+$10M for the three most underpaid). Overpaid players are in a monetarily comfortable position because all MLB contracts are guaranteed; They don’t have to worry about being dropped from the team. However, for every overpaid player, there is an underpaid player, and therefore overpaid players may come under scrutiny from their teammates and the media.
baseball %>%
group_by(name) %>%
dplyr::summarise(residual_sal = sum(residual_sal)) %>%
dplyr::arrange(residual_sal) %>%
filter(dense_rank(residual_sal) <= 3 | dense_rank(desc(residual_sal)) <= 3) %>%
ggplot(aes(x= reorder(name,residual_sal), y = residual_sal)) +
geom_bar(stat='identity',fill="#950ACA") +
xlab("Player") +
ylab("Salary Compared to Expected") +
ggtitle("3 Most Overpaid and Underpaid MLB Players in 2019") +
guides(fill="none") +
scale_y_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6))
The 3 teams which underpay and overpay the most in MLB are graphed below. Overpaying teams should potentially not renew their overpaying contracts. It’s not a sustainable way to win unless because the marketing of big names brings in proportional revenue. Underpaying teams risk developing a bad reputation, and are likely not paying well enough to achieve a decent season record. This poor play may be deliberate as “tanking” has come into fashion for pro sports.
baseball %>%
group_by(team) %>%
dplyr::summarise(residual_sal = sum(residual_sal)) %>%
dplyr::arrange(residual_sal) %>%
filter(dense_rank(residual_sal) <= 3 | dense_rank(desc(residual_sal)) <= 3) %>%
ggplot(aes(x= reorder(team,residual_sal), y = residual_sal)) +
geom_bar(stat='identity',fill="#BCCA0A") +
xlab("Team") +
ylab("Salary Compared to Expected") +
ggtitle("3 Most Overpaid and Underpaid MLB Teams in 2019") +
guides(fill="none") +
scale_y_continuous(labels = scales::dollar_format(prefix="$", suffix = "M",scale = 1e-6))