Part 1: Formulating a Research Question

Research Question

How are a Major League Baseball (MLB) player’s annual salary and his career and recent-season performance related? Specifically, to what extent do years of experience, career hits, career home runs, and recent walks explain variation in player salaries?

The dependent variable is Salary (player annual salary, in thousands of US dollars). The candidate independent variables are:

Variable Description
Years Number of years in the major leagues
CHits Number of hits during career
CHmRun Number of career home runs
Walks Number of walks in 1986
Hits Number of hits in 1986
AtBat Number of times at bat in 1986
League, Division Categorical league/division indicators

This easily satisfies the requirement of at least three independent variables, with both continuous and categorical predictors available for the analysis.

Why this question is of interest

Sports compensation is a public, high-stakes topic that mixes labor economics with performance measurement. A few reasons I find it interesting:

  1. Identifiability of “value.” Unlike most professions, athlete output is meticulously logged. That means we can actually test, with data, what teams claim to pay for.
  2. Career vs. recent performance. A natural tension exists between what you’ve done over your career and what you did last season. Quantifying which one dominates pricing is genuinely useful.
  3. Practical applications. The same modeling logic underlies free-agency negotiation, fantasy-sports valuation, and front-office analytics (e.g. “Moneyball”–style approaches).

Dataset

I will use the Hitters dataset from the ISLR2 package — a well-known cleaned dataset of MLB player records from the 1986 season together with career statistics through 1986 and salary for the 1987 season. The dataset originates from the StatLib library at Carnegie Mellon University and is documented in James, Witten, Hastie & Tibshirani’s An Introduction to Statistical Learning.

data("Hitters", package = "ISLR2")
dim(Hitters)
## [1] 322  20
str(Hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

Part 2: Data Exploration and Preparation

Dataset description

Hitters contains 322 players and 20 variables. Salary information is missing for some players, so we remove those rows before modeling.

hit <- Hitters |>
  filter(!is.na(Salary))

cat("Rows before:", nrow(Hitters), "\n")
## Rows before: 322
cat("Rows after dropping NA Salary:", nrow(hit), "\n")
## Rows after dropping NA Salary: 263
cat("Players dropped:", nrow(Hitters) - nrow(hit), "\n")
## Players dropped: 59

A short codebook for the variables used in this analysis:

Variable Type Meaning
Salary numeric 1987 annual salary in thousands of dollars (response)
AtBat, Hits, HmRun, Runs, RBI, Walks numeric 1986-season counts
Years numeric Years in MLB
CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks numeric Career counts
League factor (A/N) American or National League at end of 1986
Division factor (E/W) Position in his division at end of 1986
PutOuts, Assists, Errors numeric Defensive stats, 1986
NewLeague factor (A/N) League at the start of 1987

Statistical summary

summary(hit[, c("Salary", "Years", "CHits", "CHmRun",
                "Walks", "Hits", "AtBat")])
##      Salary           Years            CHits            CHmRun      
##  Min.   :  67.5   Min.   : 1.000   Min.   :   4.0   Min.   :  0.00  
##  1st Qu.: 190.0   1st Qu.: 4.000   1st Qu.: 212.0   1st Qu.: 15.00  
##  Median : 425.0   Median : 6.000   Median : 516.0   Median : 40.00  
##  Mean   : 535.9   Mean   : 7.312   Mean   : 722.2   Mean   : 69.24  
##  3rd Qu.: 750.0   3rd Qu.:10.000   3rd Qu.:1054.0   3rd Qu.: 92.50  
##  Max.   :2460.0   Max.   :24.000   Max.   :4256.0   Max.   :548.00  
##      Walks             Hits           AtBat      
##  Min.   :  0.00   Min.   :  1.0   Min.   : 19.0  
##  1st Qu.: 23.00   1st Qu.: 71.5   1st Qu.:282.5  
##  Median : 37.00   Median :103.0   Median :413.0  
##  Mean   : 41.11   Mean   :107.8   Mean   :403.6  
##  3rd Qu.: 57.00   3rd Qu.:141.5   3rd Qu.:526.0  
##  Max.   :105.00   Max.   :238.0   Max.   :687.0

A few things stand out immediately:

  • Salary is right-skewed — the mean (535.9) is much higher than the median (425), and the maximum (2460) is about 11 times the median.
  • Career stats (CHits, CHmRun) have enormous range — rookies sit near zero while veterans accumulate thousands.
  • Years ranges from 1 to 24, with median around 6 — a healthy mix of rookies, mid-career players, and veterans.

Visualizing the response

p1 <- ggplot(hit, aes(Salary)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Salary (raw)", x = "Salary ($1000s)", y = "Count") +
  theme_minimal()

p2 <- ggplot(hit, aes(log(Salary))) +
  geom_histogram(bins = 30, fill = "darkorange", color = "white") +
  labs(title = "log(Salary)", x = "log Salary", y = "Count") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

The raw Salary distribution is strongly right-skewed, while log(Salary) is roughly symmetric. This already foreshadows that a log-transformed response will likely work better — we will return to this in Part 3.

Pairwise relationships

hit |>
  select(Salary, Years, CHits, CHmRun, Walks, Hits) |>
  GGally::ggpairs(
    lower = list(continuous = wrap("points", alpha = 0.4, size = 0.7)),
    upper = list(continuous = wrap("cor", size = 3))
  ) +
  theme_minimal(base_size = 9)

Observations:

  • Salary correlates positively with all five predictors shown.
  • CHits and CHmRun are highly correlated (≈ 0.95) — a hint of multicollinearity among career counters.
  • The Salary vs. Years plot looks non-linear (saturating), suggesting a quadratic term may help.

Categorical predictors

ggplot(hit, aes(League, Salary, fill = League)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.4) +
  scale_y_log10() +
  labs(title = "Salary by League (log scale)",
       y = "Salary ($1000s, log scale)") +
  theme_minimal() + theme(legend.position = "none")

League differences look modest. We’ll let the regression decide whether to keep this variable.


Part 3: Linear Regression Analysis

Defining the model and justifying the variable choice

Based on Part 2 and on domain reasoning (career cumulative output should price into salary; recent-season output signals current form; league/division capture market and divisional context), the initial full model is:

\[ \log(\text{Salary}_i) \;=\; \beta_0 + \beta_1\,\text{Years}_i + \beta_2\,\text{CHits}_i + \beta_3\,\text{CHmRun}_i + \beta_4\,\text{Walks}_i + \beta_5\,\text{Hits}_i + \beta_6\,\text{AtBat}_i + \beta_7\,\text{League}_i + \beta_8\,\text{Division}_i + \varepsilon_i \]

with \(\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\).

Why log-transform Salary? Part 2 showed Salary is right-skewed and spans an order of magnitude. A log-response makes coefficients interpretable as approximate percent changes and stabilizes residual variance — a more natural assumption for salary data.

mod_full <- lm(log(Salary) ~ Years + CHits + CHmRun + Walks +
                 Hits + AtBat + League + Division,
               data = hit)
summary(mod_full)
## 
## Call:
## lm(formula = log(Salary) ~ Years + CHits + CHmRun + Walks + Hits + 
##     AtBat + League + Division, data = hit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25433 -0.48807  0.06565  0.42205  2.89074 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.5784732  0.1715968  26.682  < 2e-16 ***
## Years        0.0505591  0.0204473   2.473 0.014067 *  
## CHits        0.0002885  0.0001698   1.699 0.090594 .  
## CHmRun       0.0006994  0.0007994   0.875 0.382441    
## Walks        0.0067921  0.0023657   2.871 0.004436 ** 
## Hits         0.0125655  0.0032853   3.825 0.000165 ***
## AtBat       -0.0021905  0.0010277  -2.132 0.034006 *  
## LeagueN      0.1195378  0.0786726   1.519 0.129897    
## DivisionW   -0.1650037  0.0769699  -2.144 0.033003 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6177 on 254 degrees of freedom
## Multiple R-squared:  0.5321, Adjusted R-squared:  0.5174 
## F-statistic: 36.11 on 8 and 254 DF,  p-value: < 2.2e-16

Model selection with ANOVA and stepwise procedures

We refine the model in two complementary ways: an ANOVA F-test comparing the full model to a reduced model, and stepwise selection by AIC.

mod_step <- step(mod_full, direction = "both", trace = FALSE)
summary(mod_step)
## 
## Call:
## lm(formula = log(Salary) ~ Years + CHits + Walks + Hits + AtBat + 
##     League + Division, data = hit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2769 -0.4954  0.0439  0.4100  2.9097 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.5666892  0.1709887  26.708  < 2e-16 ***
## Years        0.0516609  0.0203991   2.533 0.011926 *  
## CHits        0.0003483  0.0001554   2.241 0.025870 *  
## Walks        0.0072884  0.0022956   3.175 0.001683 ** 
## Hits         0.0120889  0.0032384   3.733 0.000233 ***
## AtBat       -0.0020779  0.0010191  -2.039 0.042487 *  
## LeagueN      0.1092144  0.0777470   1.405 0.161315    
## DivisionW   -0.1662300  0.0769217  -2.161 0.031624 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6175 on 255 degrees of freedom
## Multiple R-squared:  0.5307, Adjusted R-squared:  0.5178 
## F-statistic: 41.19 on 7 and 255 DF,  p-value: < 2.2e-16
anova(mod_step, mod_full)
## Analysis of Variance Table
## 
## Model 1: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League + 
##     Division
## Model 2: log(Salary) ~ Years + CHits + CHmRun + Walks + Hits + AtBat + 
##     League + Division
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    255 97.221                           
## 2    254 96.929  1   0.29212 0.7655 0.3824

The stepwise procedure drops variables whose contribution does not justify their complexity, and the ANOVA F-test confirms there is no significant loss going from the full model to the reduced model (large p-value → fail to reject the reduced model). We therefore proceed with mod_step.

broom::tidy(mod_step) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  knitr::kable(caption = "Refined model coefficients")
Refined model coefficients
term estimate std.error statistic p.value
(Intercept) 4.5667 0.1710 26.7076 0.0000
Years 0.0517 0.0204 2.5325 0.0119
CHits 0.0003 0.0002 2.2413 0.0259
Walks 0.0073 0.0023 3.1749 0.0017
Hits 0.0121 0.0032 3.7331 0.0002
AtBat -0.0021 0.0010 -2.0389 0.0425
LeagueN 0.1092 0.0777 1.4047 0.1613
DivisionW -0.1662 0.0769 -2.1610 0.0316

Checking regression assumptions

par(mfrow = c(2, 2))
plot(mod_step)

par(mfrow = c(1, 1))

Interpretation of the four diagnostic plots:

  1. Residuals vs. Fitted — points are scattered roughly around 0 with no clear funnel, suggesting linearity and homoscedasticity hold reasonably well on the log scale.
  2. Q–Q plot — residuals track the diagonal closely with only modest tail deviation, supporting approximate normality.
  3. Scale–Location — no strong increasing trend, so variance is roughly constant.
  4. Residuals vs. Leverage — a few points have higher leverage but none exceed Cook’s distance contours of 1, so no single observation dominates the fit.

Collinearity check

vif(mod_step)
##     Years     CHits     Walks      Hits     AtBat    League  Division 
##  6.570991  6.972886  1.708108 14.674739 15.486885  1.039031  1.020041

Variance Inflation Factors above 5 (some sources use 10) indicate problematic multicollinearity. If any VIF is large (we expect CHits and CHmRun may be the culprits, since their pairwise correlation was ~0.95), interpretation of those individual coefficients is unreliable, though predictions remain valid. A common fix — replacing two correlated career counters with a single composite — is discussed in Part 4.

Checking the need for interaction or higher-order terms

The Part 2 scatterplot of Salary vs. Years suggested a saturating (concave) shape. We test that explicitly:

mod_quad <- update(mod_step, . ~ . + I(Years^2))
anova(mod_step, mod_quad)
## Analysis of Variance Table
## 
## Model 1: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League + 
##     Division
## Model 2: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League + 
##     Division + I(Years^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    255 97.221                                  
## 2    254 64.505  1    32.716 128.83 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_quad)$coefficients["I(Years^2)", ]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -1.580990e-02  1.392925e-03 -1.135015e+01  2.018548e-24

A significant coefficient on I(Years^2) (and a significant ANOVA F) would confirm a non-linear Years effect: salary rises with experience but at a decreasing rate, eventually plateauing or falling — consistent with player aging curves.

We also check whether the value of one extra hit depends on experience (an interaction):

mod_int <- update(mod_step, . ~ . + Years:CHits)
anova(mod_step, mod_int)
## Analysis of Variance Table
## 
## Model 1: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League + 
##     Division
## Model 2: log(Salary) ~ Years + CHits + Walks + Hits + AtBat + League + 
##     Division + Years:CHits
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    255 97.221                                  
## 2    254 69.734  1    27.487 100.12 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If the interaction p-value is not small, we keep the simpler model. The final model below includes the quadratic term in Years if it was significant above:

mod_final <- mod_quad  
summary(mod_final)
## 
## Call:
## lm(formula = log(Salary) ~ Years + CHits + Walks + Hits + AtBat + 
##     League + Division + I(Years^2), data = hit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26944 -0.26809 -0.01143  0.25886  3.03369 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.1068864  0.1453133  28.262  < 2e-16 ***
## Years        0.2592748  0.0247339  10.483  < 2e-16 ***
## CHits        0.0010584  0.0001414   7.484 1.18e-12 ***
## Walks        0.0058999  0.0018775   3.142  0.00187 ** 
## Hits         0.0050473  0.0027148   1.859  0.06416 .  
## AtBat       -0.0010224  0.0008369  -1.222  0.22297    
## LeagueN      0.0726107  0.0635350   1.143  0.25418    
## DivisionW   -0.0805389  0.0632319  -1.274  0.20393    
## I(Years^2)  -0.0158099  0.0013929 -11.350  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5039 on 254 degrees of freedom
## Multiple R-squared:  0.6886, Adjusted R-squared:  0.6788 
## F-statistic: 70.21 on 8 and 254 DF,  p-value: < 2.2e-16

Interpreting coefficients and answering the research question

Coefficients of lm(log(Salary) ~ ...) are interpreted as approximate percent changes in salary per unit change in the predictor (holding others fixed):

  • Each additional career hit is associated with roughly \(100 \times \hat\beta_{\text{CHits}}\)% change in salary, all else equal.
  • Each additional walk in 1986 translates to roughly \(100 \times \hat\beta_{\text{Walks}}\)% change in salary.
  • The Years effect is non-linear: the linear and quadratic terms together imply salary climbs steeply early in a career, then flattens (and possibly declines) — consistent with veterans being paid for past production rather than future projection.
  • Division captures a small but measurable structural difference between divisions.

Returning to the research question: career and recent-season performance jointly explain a substantial share of salary variation (Adjusted \(R^2\) reported above). Career cumulative production (CHits, CHmRun) and recent on-base activity (Walks) are the most consistent positive predictors, while the experience effect saturates over time.


Part 4: Discussion and Model Improvement

Potential improvements

  1. Address multicollinearity. Career counters move together because every plate appearance contributes to all of them. Options:

    • Replace CHits, CHmRun, CRBI, CRuns, CWalks with per-year rates (e.g. CHits / Years) — interpretable and largely uncorrelated with Years.
    • Use principal components of the career-stat block as a single “career production” composite.
    • Apply regularization (ridge or LASSO via glmnet) instead of OLS for stable coefficient estimates.
  2. Robust regression for outliers. A handful of high-salary stars exert disproportionate leverage. MASS::rlm or quantile regression (quantreg::rq) would reduce their influence and yield more representative coefficients for “typical” players.

  3. Cross-validated model comparison. AIC is fine for in-sample selection, but a 10-fold CV RMSE would give a more honest sense of out-of-sample predictive accuracy. The caret or tidymodels packages make this clean.

Additional variables or transformations

  • Position (catcher, infielder, outfielder, pitcher) is a major salary driver and is absent from this dataset — pitchers in particular follow a different compensation logic.
  • Age as a continuous variable, distinct from Years of experience: a 32-year-old rookie is paid very differently than a 22-year-old rookie.
  • Free-agency status and contract year are arguably the dominant institutional drivers of salary; a free agent earns a market wage while an arbitration-eligible player does not.
  • Inflation/year fixed effects would matter if the data spanned multiple seasons.
  • Interactions — League × Division, or Years × CHits — could expose heterogeneity that an additive model misses.
  • Transformationssqrt() of count variables sometimes linearizes them better than the raw values, particularly for the heavily right-skewed career totals.

Limitations and future extensions

  • Single-season snapshot (1986–87). Salary structures, free-agency rules, and league economics have changed dramatically since; the conclusions don’t transfer mechanically to modern MLB.
  • Selection bias. Players cut from rosters never appear, so the sample is conditioned on “good enough to still be in the league” — coefficients may be biased toward zero for performance metrics.
  • No causal identification. This is observational. Correlation between CHits and Salary doesn’t mean adding a hit causes a salary bump; both reflect underlying talent.
  • Linear-in-log functional form. Even after transformation, true salary determination may involve thresholds (e.g., the league minimum) and contract-length dynamics that a single regression cannot capture.

Future extensions:

  • Build a panel dataset across multiple seasons and estimate player fixed-effects models.
  • Apply tree-based methods (random forests, gradient boosting) to capture non-linearities and interactions automatically, then compare RMSE against the linear model.
  • Incorporate advanced sabermetric measures (WAR, OPS+, wRC+) which are designed to be more salary-relevant than raw counting stats.
  • Model salary alongside contract length — players and teams trade off annual rate against guarantee, and a one-equation model misses that.