How are a Major League Baseball (MLB) player’s annual salary and his career and recent-season performance related? Specifically, to what extent do years of experience, career hits, career home runs, and recent walks explain variation in player salaries?
The dependent variable is Salary
(player annual salary, in thousands of US dollars). The candidate
independent variables are:
| Variable | Description |
|---|---|
Years |
Number of years in the major leagues |
CHits |
Number of hits during career |
CHmRun |
Number of career home runs |
Walks |
Number of walks in 1986 |
Hits |
Number of hits in 1986 |
AtBat |
Number of times at bat in 1986 |
League, Division |
Categorical league/division indicators |
This easily satisfies the requirement of at least three independent variables, with both continuous and categorical predictors available for the analysis.
Sports compensation is a public, high-stakes topic that mixes labor economics with performance measurement. A few reasons I find it interesting:
I will use the Hitters dataset from the
ISLR2 package — a well-known cleaned dataset of MLB player
records from the 1986 season together with career statistics through
1986 and salary for the 1987 season. The dataset originates from the
StatLib library at Carnegie Mellon University and is documented in
James, Witten, Hastie & Tibshirani’s An Introduction to
Statistical Learning.
data("Hitters", package = "ISLR2")
dim(Hitters)
## [1] 322 20
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
Hitters contains 322 players and
20 variables. Salary information is missing for some
players, so we remove those rows before modeling.
hit <- Hitters |>
filter(!is.na(Salary))
cat("Rows before:", nrow(Hitters), "\n")
## Rows before: 322
cat("Rows after dropping NA Salary:", nrow(hit), "\n")
## Rows after dropping NA Salary: 263
cat("Players dropped:", nrow(Hitters) - nrow(hit), "\n")
## Players dropped: 59
A short codebook for the variables used in this analysis:
| Variable | Type | Meaning |
|---|---|---|
Salary |
numeric | 1987 annual salary in thousands of dollars (response) |
AtBat, Hits, HmRun,
Runs, RBI, Walks |
numeric | 1986-season counts |
Years |
numeric | Years in MLB |
CAtBat, CHits, CHmRun,
CRuns, CRBI, CWalks |
numeric | Career counts |
League |
factor (A/N) | American or National League at end of 1986 |
Division |
factor (E/W) | Position in his division at end of 1986 |
PutOuts, Assists, Errors |
numeric | Defensive stats, 1986 |
NewLeague |
factor (A/N) | League at the start of 1987 |
summary(hit[, c("Salary", "Years", "CHits", "CHmRun",
"Walks", "Hits", "AtBat")])
## Salary Years CHits CHmRun
## Min. : 67.5 Min. : 1.000 Min. : 4.0 Min. : 0.00
## 1st Qu.: 190.0 1st Qu.: 4.000 1st Qu.: 212.0 1st Qu.: 15.00
## Median : 425.0 Median : 6.000 Median : 516.0 Median : 40.00
## Mean : 535.9 Mean : 7.312 Mean : 722.2 Mean : 69.24
## 3rd Qu.: 750.0 3rd Qu.:10.000 3rd Qu.:1054.0 3rd Qu.: 92.50
## Max. :2460.0 Max. :24.000 Max. :4256.0 Max. :548.00
## Walks Hits AtBat
## Min. : 0.00 Min. : 1.0 Min. : 19.0
## 1st Qu.: 23.00 1st Qu.: 71.5 1st Qu.:282.5
## Median : 37.00 Median :103.0 Median :413.0
## Mean : 41.11 Mean :107.8 Mean :403.6
## 3rd Qu.: 57.00 3rd Qu.:141.5 3rd Qu.:526.0
## Max. :105.00 Max. :238.0 Max. :687.0
A few things stand out immediately:
CHits,
CHmRun) have enormous range — rookies sit near zero while
veterans accumulate thousands.Years ranges from 1 to 24, with median
around 6 — a healthy mix of rookies, mid-career players, and
veterans.p1 <- ggplot(hit, aes(Salary)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Salary (raw)", x = "Salary ($1000s)", y = "Count") +
theme_minimal()
p2 <- ggplot(hit, aes(log(Salary))) +
geom_histogram(bins = 30, fill = "darkorange", color = "white") +
labs(title = "log(Salary)", x = "log Salary", y = "Count") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)
The raw Salary distribution is strongly right-skewed,
while log(Salary) is roughly symmetric. This
already foreshadows that a log-transformed response will likely work
better — we will return to this in Part 3.
hit |>
select(Salary, Years, CHits, CHmRun, Walks, Hits) |>
GGally::ggpairs(
lower = list(continuous = wrap("points", alpha = 0.4, size = 0.7)),
upper = list(continuous = wrap("cor", size = 3))
) +
theme_minimal(base_size = 9)
Observations:
Salary correlates positively with all five predictors
shown.CHits and CHmRun are highly
correlated (≈ 0.95) — a hint of multicollinearity among career
counters.Salary vs. Years plot looks
non-linear (saturating), suggesting a quadratic term
may help.ggplot(hit, aes(League, Salary, fill = League)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.4) +
scale_y_log10() +
labs(title = "Salary by League (log scale)",
y = "Salary ($1000s, log scale)") +
theme_minimal() + theme(legend.position = "none")
League differences look modest. We’ll let the regression decide whether to keep this variable.
Based on Part 2 and on domain reasoning (career cumulative output should price into salary; recent-season output signals current form; league/division capture market and divisional context), the initial full model is:
\[ \log(\text{Salary}_i) \;=\; \beta_0 + \beta_1\,\text{Years}_i + \beta_2\,\text{CHits}_i + \beta_3\,\text{CHmRun}_i + \beta_4\,\text{Walks}_i + \beta_5\,\text{Hits}_i + \beta_6\,\text{AtBat}_i + \beta_7\,\text{League}_i + \beta_8\,\text{Division}_i + \varepsilon_i \]
with \(\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\).
Why log-transform Salary? Part 2 showed
Salary is right-skewed and spans an order of magnitude. A
log-response makes coefficients interpretable as approximate
percent changes and stabilizes residual variance — a more
natural assumption for salary data.
mod_full <- lm(log(Salary) ~ Years + CHits + CHmRun + Walks +
Hits + AtBat + League + Division,
data = hit)
summary(mod_full)
##
## Call:
## lm(formula = log(Salary) ~ Years + CHits + CHmRun + Walks + Hits +
## AtBat + League + Division, data = hit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25433 -0.48807 0.06565 0.42205 2.89074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5784732 0.1715968 26.682 < 2e-16 ***
## Years 0.0505591 0.0204473 2.473 0.014067 *
## CHits 0.0002885 0.0001698 1.699 0.090594 .
## CHmRun 0.0006994 0.0007994 0.875 0.382441
## Walks 0.0067921 0.0023657 2.871 0.004436 **
## Hits 0.0125655 0.0032853 3.825 0.000165 ***
## AtBat -0.0021905 0.0010277 -2.132 0.034006 *
## LeagueN 0.1195378 0.0786726 1.519 0.129897
## DivisionW -0.1650037 0.0769699 -2.144 0.033003 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6177 on 254 degrees of freedom
## Multiple R-squared: 0.5321, Adjusted R-squared: 0.5174
## F-statistic: 36.11 on 8 and 254 DF, p-value: < 2.2e-16
We refine the model in two complementary ways: an ANOVA F-test comparing the full model to a reduced model, and stepwise selection by AIC.
mod_step <- step(mod_full, direction = "both", trace = FALSE)
summary(mod_step)
##
## Call:
## lm(formula = log(Salary) ~ Years + CHits + Walks + Hits + AtBat +
## League + Division, data = hit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2769 -0.4954 0.0439 0.4100 2.9097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5666892 0.1709887 26.708 < 2e-16 ***
## Years 0.0516609 0.0203991 2.533 0.011926 *
## CHits 0.0003483 0.0001554 2.241 0.025870 *
## Walks 0.0072884 0.0022956 3.175 0.001683 **
## Hits 0.0120889 0.0032384 3.733 0.000233 ***
## AtBat -0.0020779 0.0010191 -2.039 0.042487 *
## LeagueN 0.1092144 0.0777470 1.405 0.161315
## DivisionW -0.1662300 0.0769217 -2.161 0.031624 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6175 on 255 degrees of freedom
## Multiple R-squared: 0.5307, Adjusted R-squared: 0.5178
## F-statistic: 41.19 on 7 and 255 DF, p-value: < 2.2e-16
anova(mod_step, mod_full)
## Analysis of Variance Table
##
## Model 1: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League +
## Division
## Model 2: log(Salary) ~ Years + CHits + CHmRun + Walks + Hits + AtBat +
## League + Division
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 255 97.221
## 2 254 96.929 1 0.29212 0.7655 0.3824
The stepwise procedure drops variables whose contribution does not
justify their complexity, and the ANOVA F-test confirms there is no
significant loss going from the full model to the reduced model (large
p-value → fail to reject the reduced model). We therefore proceed with
mod_step.
broom::tidy(mod_step) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
knitr::kable(caption = "Refined model coefficients")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.5667 | 0.1710 | 26.7076 | 0.0000 |
| Years | 0.0517 | 0.0204 | 2.5325 | 0.0119 |
| CHits | 0.0003 | 0.0002 | 2.2413 | 0.0259 |
| Walks | 0.0073 | 0.0023 | 3.1749 | 0.0017 |
| Hits | 0.0121 | 0.0032 | 3.7331 | 0.0002 |
| AtBat | -0.0021 | 0.0010 | -2.0389 | 0.0425 |
| LeagueN | 0.1092 | 0.0777 | 1.4047 | 0.1613 |
| DivisionW | -0.1662 | 0.0769 | -2.1610 | 0.0316 |
par(mfrow = c(2, 2))
plot(mod_step)
par(mfrow = c(1, 1))
Interpretation of the four diagnostic plots:
vif(mod_step)
## Years CHits Walks Hits AtBat League Division
## 6.570991 6.972886 1.708108 14.674739 15.486885 1.039031 1.020041
Variance Inflation Factors above 5 (some sources use 10) indicate
problematic multicollinearity. If any VIF is large (we
expect CHits and CHmRun may be the culprits,
since their pairwise correlation was ~0.95), interpretation of those
individual coefficients is unreliable, though predictions remain valid.
A common fix — replacing two correlated career counters with a single
composite — is discussed in Part 4.
The Part 2 scatterplot of Salary vs. Years
suggested a saturating (concave) shape. We test that
explicitly:
mod_quad <- update(mod_step, . ~ . + I(Years^2))
anova(mod_step, mod_quad)
## Analysis of Variance Table
##
## Model 1: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League +
## Division
## Model 2: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League +
## Division + I(Years^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 255 97.221
## 2 254 64.505 1 32.716 128.83 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_quad)$coefficients["I(Years^2)", ]
## Estimate Std. Error t value Pr(>|t|)
## -1.580990e-02 1.392925e-03 -1.135015e+01 2.018548e-24
A significant coefficient on I(Years^2) (and a
significant ANOVA F) would confirm a non-linear Years
effect: salary rises with experience but at a decreasing rate,
eventually plateauing or falling — consistent with player aging
curves.
We also check whether the value of one extra hit depends on experience (an interaction):
mod_int <- update(mod_step, . ~ . + Years:CHits)
anova(mod_step, mod_int)
## Analysis of Variance Table
##
## Model 1: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League +
## Division
## Model 2: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League +
## Division + Years:CHits
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 255 97.221
## 2 254 69.734 1 27.487 100.12 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If the interaction p-value is not small, we keep the simpler model.
The final model below includes the quadratic term in
Years if it was significant above:
mod_final <- mod_quad
summary(mod_final)
##
## Call:
## lm(formula = log(Salary) ~ Years + CHits + Walks + Hits + AtBat +
## League + Division + I(Years^2), data = hit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.26944 -0.26809 -0.01143 0.25886 3.03369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1068864 0.1453133 28.262 < 2e-16 ***
## Years 0.2592748 0.0247339 10.483 < 2e-16 ***
## CHits 0.0010584 0.0001414 7.484 1.18e-12 ***
## Walks 0.0058999 0.0018775 3.142 0.00187 **
## Hits 0.0050473 0.0027148 1.859 0.06416 .
## AtBat -0.0010224 0.0008369 -1.222 0.22297
## LeagueN 0.0726107 0.0635350 1.143 0.25418
## DivisionW -0.0805389 0.0632319 -1.274 0.20393
## I(Years^2) -0.0158099 0.0013929 -11.350 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5039 on 254 degrees of freedom
## Multiple R-squared: 0.6886, Adjusted R-squared: 0.6788
## F-statistic: 70.21 on 8 and 254 DF, p-value: < 2.2e-16
Coefficients of lm(log(Salary) ~ ...) are interpreted as
approximate percent changes in salary per unit change
in the predictor (holding others fixed):
Years effect is non-linear: the
linear and quadratic terms together imply salary climbs steeply early in
a career, then flattens (and possibly declines) — consistent with
veterans being paid for past production rather than future
projection.Returning to the research question: career and
recent-season performance jointly explain a substantial share of salary
variation (Adjusted \(R^2\) reported
above). Career cumulative production (CHits,
CHmRun) and recent on-base activity (Walks)
are the most consistent positive predictors, while the experience effect
saturates over time.
Address multicollinearity. Career counters move together because every plate appearance contributes to all of them. Options:
CHits, CHmRun, CRBI,
CRuns, CWalks with per-year
rates (e.g. CHits / Years) — interpretable and
largely uncorrelated with Years.glmnet) instead of OLS for stable coefficient
estimates.Robust regression for outliers. A handful of
high-salary stars exert disproportionate leverage.
MASS::rlm or quantile regression
(quantreg::rq) would reduce their influence and yield more
representative coefficients for “typical” players.
Cross-validated model comparison. AIC is fine
for in-sample selection, but a 10-fold CV RMSE would give a more honest
sense of out-of-sample predictive accuracy. The caret or
tidymodels packages make this clean.
Years of experience: a 32-year-old rookie is paid very
differently than a 22-year-old rookie.sqrt() of count
variables sometimes linearizes them better than the raw values,
particularly for the heavily right-skewed career totals.CHits and Salary doesn’t
mean adding a hit causes a salary bump; both reflect underlying
talent.Future extensions: