In recent years, the landscape of professional sports has undergone significant changes, particularly with respect to the financial compensation of players. Salaries for athletes have reached unprecedented levels, driven by their increasing marketability and the growing revenues generated by sports franchises. Baseball, a sport deeply rooted in tradition, is no exception to this trend. The escalating salaries raise important questions: What factors contribute to these financial investments? Are they justified by player performance? And how do variables like experience and demographics play a role in predicting outcomes?
This analysis explores these relationships using statistical modeling approaches applied to a simulated dataset. A continuous dependent variable is analyzed, beginning with a complete-pooling model, progressing to a no-pooling model, and then moving to random intercept and random slope models. Marginal effects analysis is then used to calculate average slopes and generate visualizations. The goal is to show how these models can support data-driven conclusions about the factors influencing performance and outcomes, particularly in the context of rising player salaries.
The complete-pooling model assumes that all groups are identical, ignoring group-level variability. This provides a baseline for comparison.
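# Load packages used throughout (purrr and broom are called via ::)
library(dplyr)
library(nlme)
library(marginaleffects)
# Simulate data: y varies by team mean (60 to 69) but is generated
# independently of texp and sex
set.seed(123)
data <- data.frame(
  group = rep(LETTERS[1:10], each = 10),
  sex = rep(c("male", "female"), times = 50),
  texp = rnorm(100, mean = 5, sd = 2),
  y = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)
# Rename 'group' to avoid conflicts with marginaleffects
data <- data %>% rename(team = group)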
# Fit complete-pooling model
complete_model <- lm(y ~ texp, data = data)
summary(complete_model)
##
## Call:
## lm(formula = y ~ texp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0712 -3.4251 0.4321 3.1315 11.6357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.96724 1.30403 49.053 <2e-16 ***
## texp 0.04056 0.23753 0.171 0.865
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.315 on 98 degrees of freedom
## Multiple R-squared: 0.0002974, Adjusted R-squared: -0.009904
## F-statistic: 0.02915 on 1 and 98 DF, p-value: 0.8648
The complete-pooling model provides a single regression line for the entire dataset. While useful for an overall understanding, it oversimplifies the data by ignoring group-level differences.
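One way to see what the single regression line leaves out is to average its residuals within each team. The following is a minimal base-R sketch under the same simulated setup (the `sex` column is omitted because it is not used here, and `team_resid` is an illustrative name); it is not part of the original analysis.

```r
# Re-create the simulated data (base R only)
set.seed(123)
data <- data.frame(
  team = rep(LETTERS[1:10], each = 10),
  texp = rnorm(100, mean = 5, sd = 2),
  y    = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)

# Complete pooling: one intercept and one slope for all teams
complete_model <- lm(y ~ texp, data = data)

# Mean residual per team; values far from zero indicate team-level
# structure that the pooled line cannot represent
team_resid <- tapply(resid(complete_model), data$team, mean)
round(team_resid, 2)
```

If the pooled model were adequate, these team means would all sit near zero. Systematic departures by team are the motivation for the models that follow.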
The no-pooling model allows for separate regressions for each group, providing maximum flexibility.
# Fit separate models for each team
no_pooling_models <- data %>%
  group_by(team) %>%
  group_map(~ lm(y ~ texp, data = .x), .keep = TRUE)
# Extract the texp coefficient from each fit. group_map() returns an unnamed
# list, so .id yields positional indices (1 = team A, ..., 10 = team J)
no_pooling_results <- no_pooling_models %>%
  purrr::map_df(~ broom::tidy(.x), .id = "team") %>%
  filter(term == "texp")
no_pooling_results
## # A tibble: 10 × 6
## team term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 texp 0.298 0.378 0.788 0.454
## 2 2 texp 0.152 0.370 0.411 0.692
## 3 3 texp 0.282 0.451 0.624 0.550
## 4 4 texp -1.05 1.30 -0.808 0.443
## 5 5 texp -0.156 0.622 -0.251 0.808
## 6 6 texp -0.670 0.327 -2.05 0.0745
## 7 7 texp -0.0536 0.725 -0.0739 0.943
## 8 8 texp -0.386 0.443 -0.870 0.409
## 9 9 texp 0.263 0.769 0.343 0.741
## 10 10 texp -0.0638 0.514 -0.124 0.904
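The no-pooling model captures team-specific differences but may overfit when groups are small. Here each regression rests on only 10 observations, which is reflected in the wide standard errors and in slope estimates that change sign from team to team.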
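A base-R variant of the per-team fits, sketched here under the same simulated setup, keeps the team letters as labels by splitting the data on `team`. The names `fits` and `slopes_df` are illustrative and do not appear in the original code.

```r
# Re-create the simulated data (base R only)
set.seed(123)
data <- data.frame(
  team = rep(LETTERS[1:10], each = 10),
  texp = rnorm(100, mean = 5, sd = 2),
  y    = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)

# No pooling: one independent regression per team; split() names the
# list elements by team, so the labels carry through
fits <- lapply(split(data, data$team), function(d) lm(y ~ texp, data = d))

# Collect the texp estimate and its standard error for each team
slopes <- t(sapply(fits, function(m) {
  summary(m)$coefficients["texp", c("Estimate", "Std. Error")]
}))
slopes_df <- data.frame(
  team      = rownames(slopes),
  estimate  = slopes[, 1],
  std.error = slopes[, 2],
  row.names = NULL
)
slopes_df
```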
The random intercept model allows for team-specific intercepts while assuming a common slope.
# Fit random intercept model
random_intercept_model <- lme(y ~ texp + sex, random = ~ 1 | team, data = data)
summary(random_intercept_model)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 533.7247 546.5982 -261.8623
##
## Random effects:
## Formula: ~1 | team
## (Intercept) Residual
## StdDev: 3.281204 2.97851
##
## Fixed effects: y ~ texp + sex
## Value Std.Error DF t-value p-value
## (Intercept) 64.53753 1.4243797 88 45.30922 0.0000
## texp -0.07512 0.1712291 88 -0.43870 0.6620
## sexmale 0.05802 0.5958610 88 0.09737 0.9227
## Correlation:
## (Intr) texp
## texp -0.618
## sexmale -0.195 -0.023
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.11320267 -0.70411424 -0.03944901 0.49425007 2.99302175
##
## Number of Observations: 100
## Number of Groups: 10
This model accounts for variability in team baselines while maintaining a consistent slope across teams, balancing complexity and interpretability.
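The share of variance attributable to team baselines can be summarized with the intraclass correlation (ICC), the team-level variance divided by the total variance. From the standard deviations printed above, this is roughly 3.28^2 / (3.28^2 + 2.98^2), or about 0.55. The sketch below computes the same quantity from a fitted model. It assumes the simulated setup above, converts `sex` to a factor for safety, and uses illustrative names (`fit`, `icc`).

```r
library(nlme)  # recommended package shipped with standard R installations

# Re-create the simulated data
set.seed(123)
data <- data.frame(
  team = rep(LETTERS[1:10], each = 10),
  sex  = factor(rep(c("male", "female"), times = 50)),
  texp = rnorm(100, mean = 5, sd = 2),
  y    = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)

# Random intercept model, as in the text
fit <- lme(y ~ texp + sex, random = ~ 1 | team, data = data)

# VarCorr() returns a character matrix; convert the variances to numeric
vc        <- VarCorr(fit)
var_team  <- as.numeric(vc["(Intercept)", "Variance"])
var_resid <- as.numeric(vc["Residual", "Variance"])

# ICC: proportion of total variance that lies between teams
icc <- var_team / (var_team + var_resid)
icc
```

An ICC well above zero indicates that observations within a team are more alike than observations from different teams, which is the situation in which a random intercept is worth including.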
The random slope model extends the random intercept model by allowing both intercepts and slopes to vary by team.
# Fit random slope model
random_slope_model <- tryCatch(
  lme(y ~ texp + sex, random = ~ texp | team, data = data,
      control = lmeControl(opt = "nlminb", msMaxIter = 50)),
  error = function(e) {
    message("Convergence issue detected. Simplifying the model...")
    # Fallback to random intercept model
    lme(y ~ texp + sex, random = ~ 1 | team, data = data)
  }
)
summary(random_slope_model)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 533.7247 546.5982 -261.8623
##
## Random effects:
## Formula: ~1 | team
## (Intercept) Residual
## StdDev: 3.281204 2.97851
##
## Fixed effects: y ~ texp + sex
## Value Std.Error DF t-value p-value
## (Intercept) 64.53753 1.4243797 88 45.30922 0.0000
## texp -0.07512 0.1712291 88 -0.43870 0.6620
## sexmale 0.05802 0.5958610 88 0.09737 0.9227
## Correlation:
## (Intr) texp
## texp -0.618
## sexmale -0.195 -0.023
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.11320267 -0.70411424 -0.03944901 0.49425007 2.99302175
##
## Number of Observations: 100
## Number of Groups: 10
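In principle, the random slope model provides the most flexibility by capturing variability in both intercepts and slopes across teams. If convergence issues arise, the code falls back to a random intercept model. That is what happened here: the summary above reports `Formula: ~1 | team` and matches the random intercept output exactly, so the object named `random_slope_model` actually holds the random intercept fit, and the results that follow describe that model.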
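Because a silent fallback is easy to miss, it can help to record whether it occurred. The following is a minimal sketch, assuming the same simulated setup; the helper `fit_random_slope()` and the fields `fell_back` and `reason` are hypothetical names introduced for illustration, not part of nlme.

```r
library(nlme)

# Re-create the simulated data
set.seed(123)
data <- data.frame(
  team = rep(LETTERS[1:10], each = 10),
  sex  = factor(rep(c("male", "female"), times = 50)),
  texp = rnorm(100, mean = 5, sd = 2),
  y    = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)

# Hypothetical helper: try the random slope model and, on error, fall back
# to the random intercept model while recording that the fallback happened
fit_random_slope <- function(data) {
  tryCatch(
    list(
      model = lme(y ~ texp + sex, random = ~ texp | team, data = data,
                  control = lmeControl(opt = "nlminb", msMaxIter = 50)),
      fell_back = FALSE,
      reason    = NA_character_
    ),
    error = function(e) {
      list(
        model     = lme(y ~ texp + sex, random = ~ 1 | team, data = data),
        fell_back = TRUE,
        reason    = conditionMessage(e)
      )
    }
  )
}

res <- fit_random_slope(data)
res$fell_back  # TRUE if the random intercept fallback was used
res$reason     # error message from the failed fit, or NA
```

Checking `res$fell_back` before interpreting the output avoids describing a random intercept fit as a random slope model.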
Using the marginaleffects package, average slopes are calculated and visualizations are created for the model stored in `random_slope_model`.
# Calculate average slopes for texp
avg_slopes_data <- avg_slopes(random_slope_model, variables = "texp")
avg_slopes_data
##
## Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## -0.0751 0.171 -0.439 0.661 0.6 -0.411 0.261
##
## Term: texp
## Type: response
## Comparison: dY/dX
# Use condition to stratify by `sex`
plot_comparisons(random_slope_model, variables = "texp", condition = "sex")
The average slope quantifies the overall relationship between texp and the outcome, while the comparison plot visualizes that relationship stratified by the sex variable. Because the model includes no texp-by-sex interaction, the estimated slope is the same for both sexes.
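To make the idea of an average slope concrete, it can be approximated by hand: nudge texp by a small amount for every observation, take the change in the population-level prediction, and average. The sketch below assumes the simulated setup above and a random intercept model. It illustrates the concept with a simple finite difference and is not a reproduction of how marginaleffects computes its estimates. For a linear model without interactions, the result should coincide with the texp coefficient.

```r
library(nlme)

# Re-create the simulated data
set.seed(123)
data <- data.frame(
  team = rep(LETTERS[1:10], each = 10),
  sex  = factor(rep(c("male", "female"), times = 50)),
  texp = rnorm(100, mean = 5, sd = 2),
  y    = rnorm(100, mean = rep(60:69, each = 10), sd = 3)
)
fit <- lme(y ~ texp + sex, random = ~ 1 | team, data = data)

# Shift texp by a small step h for every row
h <- 1e-4
shifted <- data
shifted$texp <- shifted$texp + h

# Population-level predictions (level = 0 uses fixed effects only)
p0 <- predict(fit, newdata = data,    level = 0)
p1 <- predict(fit, newdata = shifted, level = 0)

# Average of the unit-level finite-difference slopes
avg_slope <- mean((p1 - p0) / h)
c(finite_difference = avg_slope, coefficient = fixef(fit)[["texp"]])
```

In models with interactions or nonlinear terms, the unit-level slopes differ across observations, and that is where averaging them becomes informative.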
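The average slope of texp (-0.075, 95% interval -0.411 to 0.261, p = 0.661) does not indicate a statistically significant relationship with the dependent variable. The comparison plot shows the estimated effect of texp across levels of sex.
Because the simulated outcome was generated independently of experience, this null result is expected. The workflow nonetheless shows how to examine the relationship between experience and outcomes while accounting for individual and group-level variability. In the context of sports, the same approach applied to real data can provide insights into how experience influences performance and how these effects vary across demographics.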
As player salaries in professional sports continue to rise, understanding the factors that influence performance and outcomes becomes increasingly critical. This analysis illustrates, on simulated data, how complete-pooling, no-pooling, and random effects models differ in their treatment of group structure. Marginal effects analysis complements these models by summarizing how experience relates to outcomes on the response scale.
Applied to real player data, this workflow can give decision-makers a sounder basis for financial investments in players. These modeling techniques help clarify the complex interplay of factors that drive success in sports, offering a pathway to more informed and equitable decision-making.