Introduction

In recent years, the landscape of professional sports has undergone significant changes, particularly with respect to the financial compensation of players. Salaries for athletes have reached unprecedented levels, driven by their increasing marketability and the growing revenues generated by sports franchises. Baseball, a sport deeply rooted in tradition, is no exception to this trend. The escalating salaries raise important questions: What factors contribute to these financial investments? Are they justified by player performance? And how do variables like experience and demographics play a role in predicting outcomes?

This analysis explores these relationships using statistical modeling. A continuous outcome is modeled first with a complete-pooling model, then with a no-pooling model, and finally with random intercept and random slope models. Marginal effects analysis is then used to compute average slopes and produce visualizations, providing more nuanced insight into the data. The ultimate goal is to draw data-driven conclusions about the factors influencing performance and outcomes, particularly in the context of rising player salaries.


Part 1: Complete-Pooling Model

The complete-pooling model assumes that all groups are identical, ignoring group-level variability. This provides a baseline for comparison.

Model Specification

# Load required packages
library(dplyr)            # %>%, rename(), group_by()
library(nlme)             # lme() for the mixed-effects models
library(marginaleffects)  # avg_slopes(), plot_comparisons()

# Simulate data
set.seed(123)
data <- data.frame(
  group = rep(LETTERS[1:10], each = 10),
  sex = rep(c("male", "female"), times = 50),
  texp = rnorm(100, mean = 5, sd = 2),
  y = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)

# Rename 'group' to avoid conflicts with marginaleffects
data <- data %>% rename(team = group)

# Fit complete-pooling model
complete_model <- lm(y ~ texp, data = data)
summary(complete_model)
## 
## Call:
## lm(formula = y ~ texp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0712 -3.4251  0.4321  3.1315 11.6357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.96724    1.30403  49.053   <2e-16 ***
## texp         0.04056    0.23753   0.171    0.865    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.315 on 98 degrees of freedom
## Multiple R-squared:  0.0002974,  Adjusted R-squared:  -0.009904 
## F-statistic: 0.02915 on 1 and 98 DF,  p-value: 0.8648

Interpretation

The complete-pooling model fits a single regression line to the entire dataset. Here the estimated slope for texp is 0.041 (p = 0.865) with an R-squared near zero, so the pooled fit shows essentially no relationship. It is useful as a baseline, but it oversimplifies the data by ignoring group-level differences.
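
To make that limitation concrete, the pooled fit can be overlaid on the team-coloured data. This is a minimal sketch that assumes ggplot2 is installed; it is not used elsewhere in the analysis.

# Sketch: overlay the single pooled regression line on the data, coloured by team
# (assumes ggplot2 is available)
library(ggplot2)

ggplot(data, aes(x = texp, y = y)) +
  geom_point(aes(colour = team), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, colour = "black") +  # same fit as complete_model
  labs(title = "Complete pooling: one regression line for all teams")

The flat black line reflects the near-zero pooled slope, while the colouring makes visible the team-level differences in baseline that the pooled model ignores.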


Part 2: No-Pooling Model

The no-pooling model allows for separate regressions for each group, providing maximum flexibility.

Model Specification

# Fit separate models for each team
no_pooling_models <- data %>%
  group_by(team) %>%
  group_map(~ lm(y ~ texp, data = .x), .keep = TRUE)

# Extract coefficients for each team
# (note: .id records the list position 1-10 rather than the team letter)
no_pooling_results <- no_pooling_models %>%
  purrr::map_df(~ broom::tidy(.x), .id = "team") %>%
  filter(term == "texp")
no_pooling_results
## # A tibble: 10 × 6
##    team  term  estimate std.error statistic p.value
##    <chr> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
##  1 1     texp    0.298      0.378    0.788   0.454 
##  2 2     texp    0.152      0.370    0.411   0.692 
##  3 3     texp    0.282      0.451    0.624   0.550 
##  4 4     texp   -1.05       1.30    -0.808   0.443 
##  5 5     texp   -0.156      0.622   -0.251   0.808 
##  6 6     texp   -0.670      0.327   -2.05    0.0745
##  7 7     texp   -0.0536     0.725   -0.0739  0.943 
##  8 8     texp   -0.386      0.443   -0.870   0.409 
##  9 9     texp    0.263      0.769    0.343   0.741 
## 10 10    texp   -0.0638     0.514   -0.124   0.904

Interpretation

The no-pooling model captures team-specific differences but can overfit when group sizes are small. Here each team contributes only 10 observations, none of the team-specific texp slopes reaches conventional significance, and the wide standard errors reflect how little information each team provides on its own.
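
To see how noisy these estimates are, the team-specific slopes can be set alongside the pooled slope. This is an illustrative sketch that reuses complete_model and no_pooling_results from above.

# Sketch: contrast each team's texp slope with the pooled slope
pooled_slope <- unname(coef(complete_model)["texp"])

no_pooling_results %>%
  dplyr::select(team, estimate, std.error) %>%
  dplyr::mutate(pooled = pooled_slope,
                deviation = estimate - pooled)

The team slopes scatter widely around the pooled estimate, a pattern the partial-pooling models in Parts 3 and 4 are designed to temper.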


Part 3: Random Intercept Model

The random intercept model allows each team its own intercept while assuming a common slope for texp; sex is included as an additional fixed effect.

Model Specification

# Fit random intercept model
random_intercept_model <- lme(y ~ texp + sex, random = ~ 1 | team, data = data)
summary(random_intercept_model)
## Linear mixed-effects model fit by REML
##   Data: data 
##        AIC      BIC    logLik
##   533.7247 546.5982 -261.8623
## 
## Random effects:
##  Formula: ~1 | team
##         (Intercept) Residual
## StdDev:    3.281204  2.97851
## 
## Fixed effects:  y ~ texp + sex 
##                Value Std.Error DF  t-value p-value
## (Intercept) 64.53753 1.4243797 88 45.30922  0.0000
## texp        -0.07512 0.1712291 88 -0.43870  0.6620
## sexmale      0.05802 0.5958610 88  0.09737  0.9227
##  Correlation: 
##         (Intr) texp  
## texp    -0.618       
## sexmale -0.195 -0.023
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.11320267 -0.70411424 -0.03944901  0.49425007  2.99302175 
## 
## Number of Observations: 100
## Number of Groups: 10

Interpretation

This model accounts for variability in team baselines while maintaining a common slope across teams, balancing complexity and interpretability. The estimated between-team standard deviation (3.28) is comparable to the residual standard deviation (2.98), indicating substantial team-level variation, while the fixed effects of texp and sex remain small and non-significant.
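
A rough way to see the partial pooling at work is to compare each team's raw mean of y with its model-based intercept (the fixed intercept plus the team's random effect). This sketch ignores the small texp and sex contributions, so the comparison is only approximate.

# Sketch: raw team means vs. shrunken team intercepts from the random intercept model
raw_means <- data %>%
  group_by(team) %>%
  summarise(raw_mean = mean(y))

re <- ranef(random_intercept_model)  # team-level deviations from the fixed intercept
shrunk <- data.frame(team = rownames(re),
                     shrunk_intercept = fixef(random_intercept_model)["(Intercept)"] + re[["(Intercept)"]])
left_join(raw_means, shrunk, by = "team")

Teams with extreme raw means are pulled modestly toward the overall average, which is the shrinkage behaviour that distinguishes the mixed model from the no-pooling fits.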


Part 4: Random Slope Model

The random slope model extends the random intercept model by allowing both intercepts and slopes to vary by team.

Model Specification

# Fit random slope model
random_slope_model <- tryCatch(
  lme(y ~ texp + sex, random = ~ texp | team, data = data, control = lmeControl(opt = "nlminb", msMaxIter = 50)),
  error = function(e) {
    message("Convergence issue detected. Simplifying the model...")
    lme(y ~ texp + sex, random = ~ 1 | team, data = data) # Fallback to random intercept model
  }
)
summary(random_slope_model)
## Linear mixed-effects model fit by REML
##   Data: data 
##        AIC      BIC    logLik
##   533.7247 546.5982 -261.8623
## 
## Random effects:
##  Formula: ~1 | team
##         (Intercept) Residual
## StdDev:    3.281204  2.97851
## 
## Fixed effects:  y ~ texp + sex 
##                Value Std.Error DF  t-value p-value
## (Intercept) 64.53753 1.4243797 88 45.30922  0.0000
## texp        -0.07512 0.1712291 88 -0.43870  0.6620
## sexmale      0.05802 0.5958610 88  0.09737  0.9227
##  Correlation: 
##         (Intr) texp  
## texp    -0.618       
## sexmale -0.195 -0.023
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -2.11320267 -0.70411424 -0.03944901  0.49425007  2.99302175 
## 
## Number of Observations: 100
## Number of Groups: 10

Interpretation

The random slope model offers the most flexibility, capturing variability in both intercepts and slopes across teams. In this run, however, the random-effects formula reported in the output is ~1 | team and the fit is identical to Part 3, indicating that the tryCatch fallback was triggered and no team-specific slopes were estimated.
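
Because the tryCatch fallback changes the model silently apart from a console message, it is worth confirming which random-effects structure was actually estimated. This check uses standard nlme accessors on the fitted object.

# Sketch: confirm whether team-specific slopes were estimated
random_slope_model$call              # shows the random formula that was ultimately used
colnames(ranef(random_slope_model))  # "(Intercept)" only, or "(Intercept)" and "texp"
VarCorr(random_slope_model)          # variance components of the fitted structure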


Part 5: Marginal Effects Analysis

Using the marginaleffects package, average slopes are calculated, and visualizations for the random slope model are created.

Average Slopes

# Calculate average slopes for texp
avg_slopes_data <- avg_slopes(random_slope_model, variables = "texp")
avg_slopes_data
## 
##  Estimate Std. Error      z Pr(>|z|)   S  2.5 % 97.5 %
##   -0.0751      0.171 -0.439    0.661 0.6 -0.411  0.261
## 
## Term: texp
## Type:  response 
## Comparison: dY/dX

Plotting Comparisons

# Use condition to stratify by `sex`
plot_comparisons(random_slope_model, variables = "texp", condition = "sex")

Interpretation

The average slope quantifies the overall relationship between texp and the outcome; at -0.075 (p = 0.66) it matches the fixed-effect estimate and is not statistically distinguishable from zero. The comparison plot shows the estimated effect of texp at each level of sex; because the model contains no texp-by-sex interaction, the two estimates are essentially identical.
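
For a numeric counterpart to the plot, average slopes can also be computed within each level of sex via the by argument of avg_slopes(). Because the model has no texp-by-sex interaction, the two group averages should be essentially identical.

# Sketch: average slope of texp within each level of sex
avg_slopes(random_slope_model, variables = "texp", by = "sex")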


Discussion

Summary of Findings

  • Random Effects Model: The model captures variability both within and between teams, giving a more nuanced picture of the data than either complete or no pooling.
  • Average Slopes: The average slope for texp is small and not statistically significant (estimate -0.075, p = 0.66), so these data show no clear relationship between texp and the dependent variable.
  • Comparison Plot: The visualization shows the marginal effect of texp at each level of sex; because the model contains no texp-by-sex interaction, these effects do not differ meaningfully across sexes.

Implications

These results underscore the importance of accounting for individual- and group-level variability when assessing how experience relates to outcomes. In a sports context, the same approach can show how experience influences performance and whether those effects vary across demographic groups.


Conclusion

As player salaries in professional sports continue to rise, understanding the factors that influence performance and outcomes becomes increasingly critical. This analysis demonstrates the utility of random effects modeling for examining such relationships, and the marginal effects analysis further enriches the results, offering a clearer picture of how experience relates to outcomes at both the individual and group level.

The findings offer a useful perspective for decision-makers, helping ensure that financial investments in players are guided by robust, data-driven analyses. By leveraging these modeling techniques, analysts can gain a better understanding of the complex interplay of factors that drive success in sports, opening a pathway to more informed and equitable decision-making.