We consider Major League Baseball (MLB) teams regular season performance as a function of the players’ salaries. We would like to test and assess the relationship between the two and which other factors affect and improve predictability of team performance using salaries. Our analysis starts with simple linear regression between the two and its dynamic in time, and proceeds by building more complex models based on players’ impact and role. While such approach may seem illusive due to unavailability of players data before season begins its feasibility is based on of the assumption that every team expects certain performance and impact from the players.
MLB teams pioneered application of data science in sport which resulted in the analytical field and environment rich with data sources, advanced metrics, models and research. MLB salaries negotiated up front to get players to play for teams are not simple transactions. They include a lot of analysis by team scouts and analytical departments to allocate budget among many available players. But at the end of the day (and before season starts) each player gets certain salary and at the end of season each team gets certain number of wins and losses. In between we have a 6 intense months filled with hundreds of games played by hundreds of players. Below we will run an exercise of predicting how player salaries affect overall team performance incrementally considering more complex models.
Consider how overall team spendings on players predict team performance. Let’s take total team payroll for a season as independent variable (predictor) and use team regular season winning percentage as a proxy for team performance. So this simple model looks like this:
Each team each season presents a single data point for our model starting with 1985 when salary records started in MLB. Below we prepare data with Lahman and data.table packages resulting in teamPayroll
dataset:
library(Lahman)
library(data.table)
teams = as.data.table(Teams)
teams = teams[, .(yearID,
lgID = as.character(lgID),
teamID = as.character(teamID),
franchID = as.character(franchID),
Rank, G, W, L, R, ERA, SO,
WinPercent = W/(W+L))]
salaries = as.data.table(Salaries)
salaries = salaries[, c("lgID", "teamID", "salary1M") :=
list(as.character(lgID), as.character(teamID), salary / 1e6L)]
payroll = salaries[, .(payroll = sum(salary1M)), by=.(teamID, yearID)]
teamPayroll = merge(teams, payroll, by=c("teamID","yearID"))
First, observe linear regression model visually:
Comparing models for AL and NL shows no obvious differences in linear trends so we will keep our data together for both leagues. This doesn’t mean that looking into more dynamic between leagues is not warranted in further analysis though.
While the latest Lahman package includes 2016 stats its Salaries
data doesn’t seem to contain correct teamID
s so our analysis covers 1985 through 2015.
The interesting aspect of analysis of payroll effect on performance is observing its trend over time. So we construct series of models for each year from 1985 through 2015:
Observing how the model’s slope \(\beta_1\) trends over years will tell us a story of how more or less effective payroll spendings became over time. Remember that for our model interpretation of the slope \(\beta_1\) is how much $1 million in payroll increases (positive slope) or decreases (negative slope) team’s winning percentage.
YEARS = seq(1985, 2015)
lmodelsEach = vector("list", length = length(YEARS))
lmodelsAccu = vector("list", length = length(YEARS))
lmodelsAfter = vector("list", length = length(YEARS))
for(YEAR_START in YEARS) {
data = teamPayroll[yearID == YEAR_START & yearID <= 2015]
lmodelsEach[[YEAR_START - 1985 + 1]] = lm(formula = WinPercent ~ payroll, data = data)
data = teamPayroll[yearID <= YEAR_START & yearID <= 2015]
lmodelsAccu[[YEAR_START - 1985 + 1]] = lm(formula = WinPercent ~ payroll, data = data)
data = teamPayroll[yearID >= YEAR_START & yearID <= 2015]
lmodelsAfter[[YEAR_START - 1985 + 1]] = lm(formula = WinPercent ~ payroll, data = data)
}
The top plot shows how \(\beta_1\) trends from season to season. Starting with 2000 the effect of payroll becomes negligeble compared to most of previous seasons. Payroll plays lesser and lesser role as time pregresses from 80’s into 90’s and 2000’s.
The middle plot builds cumulative models on the data from all seasons to the left (inclusive) and is yet another confirmation of the above trend that payroll’s effect on performance reduces with time.
Note that the bottom plot has different scale for \(\beta_1\) which is ten times smaller than the first two. Its models build cumulative models on the seasons to the right (inclusive). Thus its \(\beta_1\) for 1985 model is the same as the \(\beta_1\) for 2015 from the middle plot as both operate on the same seasons. Given such small scale it is difficult to make proper conclusions about its up and then down trend but it certainly indicates that payroll impact diminishes after 2002.
If we consider team performance vs. payroll models over 5 year stretches then their dynamics look like this:
teamPayroll[, year_interval := cut(teamPayroll$yearID,
seq(from=1985, to=2015, by=5),
include.lowest = TRUE,
right=FALSE, ordered_result = TRUE)]
linRegModelsOver5YearIntervals = teamPayroll[yearID <= 2015,
.(payrollModel = list(lm(WinPercent ~ payroll, .SD))),
by = year_interval ]
data = data.table(interval = linRegModelsOver5YearIntervals[[1]],
beta1 = sapply(linRegModelsOver5YearIntervals[[2]],
function(x) {coefficients(x)[[2]]}),
adjRSquared = sapply(linRegModelsOver5YearIntervals[[2]],
function(x) {summary(x)$adj.r.squared}))
Those are hardly impressive models but to understand why they behave like this it’s beneficial to look at actual data points for each 5-year interval:
We can conclude that over last 30 years range and disbalance in team payrolls turned MLB into different type of league the least. It also hints at possibility that there is more the relationship between how much players paid and teams results.
So is it really irrrelevant how much teams pay? What if we attempt to dig deeper inside salaries to uncover more predictability and hidden relationship between how much teams pay and how well they perform? One idea is to connect players’ impact on team performance with their salaries:
When Major League Baseball (MLB) team signs up a player there are more factors besides salary associated with contract. Among these factors a player’s position stands out as more important. If we add position into the mix then instead of single predictor (team payroll) we get multiple predictors corresponding to moneys paid according player’s position. But how to measure player’s impact based on position or positions he played?
Lahman has records of all player appearances in the table Appearances
. Player season total appearances (per team) are recorded by season and team (just as salaries) as follows:
Having number of games played in these positions each player’s salary can be allocated between them. Hence we introduce position-based “payrolls” corresponding to fractions of team payroll. Not to overplay own hand we start with the following position-based predictors:
appearances = as.data.table(Appearances)
appearances = appearances[, .(yearID,
lgID = as.character(lgID),
teamID = as.character(teamID),
playerID,
G_all,
G_batting, G_defense, G_p, G_c,
G_1b, G_2b, G_3b, G_ss, G_of)]
appearances = appearances[, c("P", "B", "C", "B1", "B2", "B3", "SS", "OF") :=
list(G_p / G_all,
ifelse(G_p > 0, 0.0, G_batting / G_all),
ifelse(G_p > 0, 0.0, G_c / G_defense),
ifelse(G_p > 0, 0.0, G_1b / G_defense),
ifelse(G_p > 0, 0.0, G_2b / G_defense),
ifelse(G_p > 0, 0.0, G_3b / G_defense),
ifelse(G_p > 0, 0.0, G_ss / G_defense),
ifelse(G_p > 0, 0.0, G_of / G_defense))]
positionSalaries = merge(salaries, appearances)
positionSalaries = positionSalaries[, .(yearID, teamID, lgID, playerID, salary1M,
pitcher = salary1M * P,
batter = salary1M * B,
catcher = salary1M * C,
firstBase = salary1M * B1,
secondBase = salary1M * B2,
thirdBase = salary1M * B3,
shortstop = salary1M * SS,
outfielder = salary1M * OF)]
payrollByPositions = positionSalaries[,
.(pitcher = sum(pitcher),
batter = sum(batter),
catcher = sum(catcher),
firstBase = sum(firstBase),
secondBase = sum(secondBase),
thirdBase = sum(thirdBase),
shortstop = sum(shortstop),
outfielder = sum(outfielder)),
by=.(teamID, yearID)]
teamPayrollByPositions = merge(teams, payrollByPositions, by=c("teamID","yearID"))
Not surprisingly the position-based model (Positions) produces the best (albeit marginally) model and predictions. What surprises is that leaving out all positions but pitcher from that model (Pitcher) results in simple regression model that at least as good or better than our first Payroll simple regression. Lastly, the effect of payroll doesn’t compare to effect of total pitcher salary after looking at \(\beta_1\) values.
We conclude that effective position-based payroll model executed on pitching salary only is equivalent to full payrol/position model:
At this point true baseball fan may ask a question: does it matter if pitcher was a starter, a closer, or a reliever? We will attempt to answer this next by constructing following model:
Unfortunately Lahman dataset has not explicit pitcher roles for obvious reason: pitchers’ roles are fluent and may change during season. But even as they do these roles ususally are identified with most pitchers during single season. Which means they could be identified by observing pitchers’ playing statistics.Remember that we want to assign each pitcher on a team each season to one of 3 roles - starter, closer, or reliever - using pitching stats during this season (and team tenure) from Lahman dataset. Pitching table contains following attributes (among others) that should help:
We could simply devise empirical rules for each role based on our understanding of the game. But lacking real baseball experience and possessing realistic view of our baseball knowledge such approach may cause additional questions the least. Instead we chose to let data speak for itself and cluster players into groups that w approximate pitcher roles. By examining resulting clusters and assigning them roles we then assign each pitcher to a role based on cluster membership:
To use Kmeans clustering algorithm first we want to determine if 3 (remember that there are 3 roles: starter, closer, and reliever) indeed produce optimal number of groups when clustering pitchers by G
, GS
, and `SV’. The elbow method computes 9 models for \(k = 2, 3, ..., 10\) (number of clusters):
pitching = as.data.table(Pitching)
pitching = pitching[, c("lgID", "teamID") :=
list(as.character(lgID), as.character(teamID))]
# scale attributes used in clustering first
pitching[ , c("G_scaled", "GS_scaled", "SV_scaled") := lapply(.SD, function(x) as.double(scale(x))),
.SDcols = c("G","GS","SV")]
km.models = vector(mode = "list", length = 9)
for(k in seq(2,10)) {
km.models[[k-1]] = kmeans(pitching[yearID >= 1985 & yearID <= 2015,
c("G_scaled","GS_scaled","SV_scaled")],
centers = k, nstart = 20)
}
Then we plot line chart of total within-cluster sum of squares for each value of \(k\) to find the “elbow” on the arm the plot resembles:
This clearly indicates that \(k=4\) is the best choice with next best choice \(k=3\). To interpret and confirm which clusters correpond to which pitcher roles we visualize centroids for both \(k=3,4\):
By observing these plots we can conlclude the following:
To wrap up kmeans analysis next plot contains cluster sizes and roles for \(k=4\) model:
pitching_data = pitching[yearID >= 1985 & yearID <= 2015,]
pitching_data = cbind(pitching_data, clusterid=km.model4$cluster)
clusterToRoleMap = data.table(clusterid = c(1,2,3,4),
role = c('Reliever','Reliever','Closer','Starter'))
pitching_data[clusterToRoleMap, role := role, on = c(clusterid = "clusterid")]
pitcherRoleSalaries = merge(salaries, pitching_data)
pitcherRoleSalaries = pitcherRoleSalaries[, .(yearID, teamID, lgID, playerID, salary1M,
starter = ifelse(role == 'Starter', salary1M, 0.),
reliever = ifelse(role == 'Reliever', salary1M, 0.),
closer = ifelse(role == 'Closer', salary1M, 0.))]
payrollByPitcherRole = pitcherRoleSalaries[,
.(starter = sum(starter),
reliever = sum(reliever),
closer = sum(closer)),
by=.(teamID, yearID)]
teamPayrollByPitcherRole = merge(teams, payrollByPitcherRole, by=c("teamID","yearID"))
lm.fit.roles = lm(formula = WinPercent ~ starter + reliever + closer,
data = teamPayrollByPitcherRole[yearID <= 2015])
lm.fit.roleStarterCloser = lm(formula = WinPercent ~ starter + closer,
data = teamPayrollByPitcherRole[yearID <= 2015])
lm.fit.roleStarterReliever = lm(formula = WinPercent ~ starter + reliever,
data = teamPayrollByPitcherRole[yearID <= 2015])
lm.fit.roleStarter = lm(formula = WinPercent ~ starter,
data = teamPayrollByPitcherRole[yearID <= 2015])
summary(lm.fit.roles)
##
## Call:
## lm(formula = WinPercent ~ starter + reliever + closer, data = teamPayrollByPitcherRole[yearID <=
## 2015])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.207048 -0.047137 -0.002674 0.049935 0.205454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4836747 0.0035814 135.053 < 2e-16 ***
## starter 0.0016391 0.0002211 7.412 2.9e-13 ***
## reliever -0.0011648 0.0003033 -3.841 0.000131 ***
## closer 0.0023087 0.0008353 2.764 0.005827 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06548 on 884 degrees of freedom
## Multiple R-squared: 0.09767, Adjusted R-squared: 0.09461
## F-statistic: 31.9 on 3 and 884 DF, p-value: < 2.2e-16
Note that we placed signficant predictor in first position (\(\beta_1\)) in every model (\(p-value < 0.001\) or better)
Based on comparing adjusted \(R^2\), \(RMSE\) (Root Mean Squared Error), \(\beta_1\) of the linear models constrcuted that far the pitcher role model stands out (not clear how much…).
TO DO: unfinished - work in progress …
In fact, we can consider every baseball team’s season as realization of certain function \(F(s, p)\) that translates player signings (\(s\) is a vector of contract salaries and \(p\) is a vector of player positions - both the same length for each team) before season starts into team resulting performance after end of the season:
Function \(F(s,p)\) acts on cumulative numbers for each team and is not concerned with how each player performed personally (under-, over- or just achieved). Moreover, it aggregates over all players for the same position looking at total spendings per position.
If we assess team performance based on salaries spendings over its whole roster using the 2 factors above we may find out which positional spendings are key to team success. But having contract just salaries contract info without looking at real impact players had during the season will skew results. Thus, we want to consider not just how much players by these two factors to assess how much money spent per position affect team performance
Sean Lahman maintains historical baseball database that contains complete batting and pitching statistics from 1871 to 2016, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. We will use it for the R exercises.
The dataset is available in variety of formats including CSV. In fact, R package Lahman provides the tables from the ‘Sean Lahman Baseball Database’ as a set of R data.frames. Similar package(s) are available for Python.
Data dictionary readme2014.txt is found with data files.
Lahman Baseball Database files are found in the folder baseballdatabank-2017. Because of the large footprint we packaged data with the archive baseballdatabank-2017.zip and unzip it in place before loading them.