This document joins MVP voting tables with advanced stat tables,
harmonize data types, and model MVP voting outcomes using linear models
and linear mixed models. The goal is to identify which advanced
statistics best predict MVP voting outcomes, with a focus on the
pts_won column as the response.
From the last 4 seasons of the NBA is it possible to derive an understanding of the best predictors among the MVP contenders which sits the winners and those who rank higher in total MVP votes compared to others. in our modelling process we will be using linear and mixed linear models to see if there is a relation and significiant predictor in our context.
We load packages, set global chunk options
knitr::opts_chunk$set(
echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 5
)
library(tidyverse)
library(readr)
library(janitor)
library(performance)
library(sjPlot)
library(lme4)
library(GGally)
library(broom)
library(readxl)
library(patchwork)
library(stringr)
library(purrr)
library(tidyr)
Download Avd_2025.xlsx
Download MVP_2025.xlsx
Download Avd_2024.xlsx
Download MVP_2024.xlsx
Download Avd_2023.xlsx
Download MVP_2023.xlsx
Download Avd_2022.xlsx
Download MVP_2022.xlsx
We read the eight Excel files, split MVP vs Avd sets, and join each
MVP year to the matching Avd year on Player.
# ---------------------------
# Data Ingestion (hardcoded)
# ---------------------------
library(readxl)
library(dplyr)
library(purrr)
library(stringr)
data_dir <- "data" # change if needed
# Hardcode the exact files you want to load
avd_files <- c("Avd_2022.xlsx","Avd_2023.xlsx","Avd_2024.xlsx","Avd_2025.xlsx")
mvp_files <- c("MVP_2022.xlsx","MVP_2023.xlsx","MVP_2024.xlsx","MVP_2025.xlsx")
# Optional sanity checks
missing_avd <- avd_files[!file.exists(file.path(data_dir, avd_files))]
missing_mvp <- mvp_files[!file.exists(file.path(data_dir, mvp_files))]
if (length(missing_avd)) warning("Missing AVD files: ", paste(missing_avd, collapse = ", "))
if (length(missing_mvp)) warning("Missing MVP files: ", paste(missing_mvp, collapse = ", "))
# Read each set into named lists
avd_list <- avd_files %>%
set_names() %>%
map(~ readxl::read_excel(file.path(data_dir, .x), sheet = 1))
mvp_list <- mvp_files %>%
set_names() %>%
map(~ readxl::read_excel(file.path(data_dir, .x), sheet = 1))
# Pair each MVP with its matching Avd year and join on Player
mvp_with_avd <- imap(mvp_list, function(mvp_df, mvp_name) {
year <- stringr::str_extract(mvp_name, "[0-9]{4}")
avd_nm <- paste0("Avd_", year, ".xlsx")
avd_df <- avd_list[[avd_nm]]
if (is.null(avd_df)) {
warning("No Avd file found for ", mvp_name, ". Returning MVP data unchanged.")
return(mvp_df)
}
avd_df <- avd_df %>% distinct(Player, .keep_all = TRUE)
left_join(mvp_df, avd_df, by = "Player", suffix = c("", "_Avd"))
})
We join the advanced datasets onto the MVPs to reduce the overall noise in the modelling process. with so many players not recieving votes for MVP if we were to include these players it could negaiticely skew the modelling process with so many players without any votes.
combinedWe bind all MVP-Avd pairs together after coercing to a safe representation to avoid type clashes, then convert back to numeric where appropriate. This follows the final code you prepared.
to_char_df <- function(df) {
df %>%
dplyr::mutate(dplyr::across(where(is.list),
~ purrr::map_chr(.x, ~ if (length(.x) == 0 || all(is.na(.x)))
NA_character_ else paste0(as.character(.x), collapse = "; ")))) %>%
dplyr::mutate(dplyr::across(dplyr::everything(), as.character))
}
# Convert each joined MVP table, add metadata, then bind
mvp_with_avd_chr <- purrr::imap(mvp_with_avd, function(x, nm) {
to_char_df(x) %>%
dplyr::mutate(
source_file = nm,
year = stringr::str_extract(nm, "[0-9]{4}") |> as.character()
)
})
combined <- dplyr::bind_rows(mvp_with_avd_chr)
# Columns that should stay character
char_cols <- c("Player", "Team", "Pos", "Awards", "source_file")
# Option 1: convert everything else with parse_number, then fix year
combined_typed <- combined %>%
dplyr::mutate(dplyr::across(dplyr::all_of(char_cols), as.character)) %>%
dplyr::mutate(dplyr::across(setdiff(names(.), char_cols), ~ readr::parse_number(.x))) %>%
dplyr::mutate(year = as.integer(year))
# Option 2: auto-detect numeric-like columns without using \d
looks_numeric <- function(x, thresh = 0.9) {
x <- x[!is.na(x) & x != ""]
if (length(x) == 0) return(FALSE)
mean(stringr::str_detect(x, "^[+-]?[0-9]*([.][0-9]+)?$")) >= thresh
}
combined_typed <- combined %>%
dplyr::mutate(dplyr::across(dplyr::all_of(char_cols), as.character)) %>%
dplyr::mutate(dplyr::across(where(looks_numeric), ~ readr::parse_number(.x))) %>%
dplyr::mutate(year = as.integer(year))
library(dplyr)
wanted <- c(
"Rank","Player","Age","First","Pts Won","Pts Max","Share",
"G","MP","PTS","TRB","AST","STL","BLK","FG%","3P%","FT%",
"WS","Rk","Team","Pos","GS","PER","TS%","3PAr","FTr",
"ORB%","DRB%","TRB%","AST%","STL%","BLK%","TOV%","USG%",
"OWS","DWS","OBPM","DBPM","BPM","VORP","Awards","source_file","year"
)
combined_keep <- combined_typed %>%
# keep exactly the columns requested, in the given order (missing ones are silently skipped)
select(any_of(wanted))
Some datatypes here are different among each season therefore it is important that when we bind all the seasons together that they are all the same data type. we can then change these back to numeric if required for calculations later.
We will model MVP voting outcomes using pts_won as the
response. We also retain Pts Won as Pts_Won
for reference. We clean names and construct team fields.
df0 <- combined_keep
# Make a clean, modeling friendly version
df <- df0 %>%
dplyr::rename(Pts_Won = `Pts Won`) %>%
janitor::clean_names() %>%
dplyr::mutate(
year = as.integer(year),
team = dplyr::coalesce(team),
team = as.character(team)
) %>%
dplyr::filter(!is.na(pts_won))
Points won is our independent variable. It is because it is the best way of ranking the value of the MVP winner compared to its contenders. Just ranking the players each season would reduce the clarity in margin and reduce our models prediction potential.
We identify candidate advanced stats and rank them by absolute
correlation with pts_won. We then select the top features
for modeling, and set up factor versions of year and team.
adv_candidates <- c(
"per","ts","x3_par","ftr",
"orb","drb","trb","ast","stl","blk","tov","usg",
"ows","dws","ws","ws_48","obpm","dbpm","bpm","vorp"
)
adv_candidates <- intersect(adv_candidates, names(df))
cor_tbl <- purrr::map_dfr(
adv_candidates,
~ tibble::tibble(
var = .x,
r = suppressWarnings(cor(df[[.x]], df$pts_won, use = "pairwise.complete.obs"))
)
) %>%
dplyr::mutate(abs_r = abs(r)) %>%
dplyr::arrange(dplyr::desc(abs_r))
top_k <- 4L
top_feats <- cor_tbl$var[seq_len(min(top_k, nrow(cor_tbl)))]
cor_tbl %>% dplyr::slice(1:min(12, n()))
top_feats
## [1] "ws" "per" "bpm" "dbpm"
#top_feats <- c('ws','per','vorp','blk','stl')
df_model <- df %>%
dplyr::mutate(
dplyr::across(all_of(top_feats), scale),
year_f = factor(year),
team_f = factor(team)
)
We use the correlation between each od the advanced metrics in our NBA dataset and out Points Won independent variable to figure out a starting point for the most valuable variables that we could consider including in our model. As we can see from the results the top 4 are the metrics we are considering as they have the top 5 highest correlation with our independent variable.
We fit models in increasing complexity, then compare.
# Null model
m0 <- lm(pts_won ~ 1, data = df_model)
summary(m0)
##
## Call:
## lm(formula = pts_won ~ 1, data = df_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.5 -223.3 -197.0 204.7 700.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.52 46.72 4.827 1.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 316.9 on 45 degrees of freedom
sjPlot::tab_model(m0, dv.labels = "Null model", collapse.ci = TRUE, p.style = "stars")
| Null model | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
225.52 *** (131.42 – 319.62) |
| Observations | 46 |
| R2 / R2 adjusted | 0.000 / 0.000 |
|
|
This null models purpose is to comapre the rest of our models against. if we cannot improve on the null model there is not likely a worthy investigation to how the advanved metrics can predict number of MVP points won
# Model 1: top advanced metrics only
form_m1 <- reformulate(top_feats, response = "pts_won")
m1 <- lm(form_m1, data = df_model)
summary(m1)
##
## Call:
## lm(formula = form_m1, data = df_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.47 -108.00 25.07 96.12 324.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.52 27.72 8.135 4.33e-10 ***
## ws 196.98 40.38 4.878 1.66e-05 ***
## per 260.11 102.12 2.547 0.0147 *
## bpm -164.49 115.17 -1.428 0.1608
## dbpm -17.92 59.71 -0.300 0.7656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 188 on 41 degrees of freedom
## Multiple R-squared: 0.6792, Adjusted R-squared: 0.6479
## F-statistic: 21.7 on 4 and 41 DF, p-value: 1.127e-09
sjPlot::tab_model(m1, dv.labels = "Top advanced stats", collapse.ci = TRUE, p.style = "stars")
| Top advanced stats | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
225.52 *** (169.53 – 281.51) |
| ws |
196.98 *** (115.43 – 278.52) |
| per |
260.11 * (53.87 – 466.35) |
| bpm |
-164.49 (-397.08 – 68.09) |
| dbpm |
-17.92 (-138.50 – 102.67) |
| Observations | 46 |
| R2 / R2 adjusted | 0.679 / 0.648 |
|
|
Updated Model 1 explains ~69% of the variation in MVP vote points (R² ≈ 0.685; RMSE ≈ 189), with WS the dominant positive predictor (β ≈ +195 per 1 SD, p < 0.001) and PER also positive and significant (β ≈ +242, p = 0.026). The small negative, non-significant BPM/DBPM coefficients arise from multicollinearity/suppression among overlapping “value” stats (e.g., WS, BPM, DBPM): marginally they correlate positively with votes, but once WS (and PER) are in the model their unique contribution is near zero and the conditional sign can flip—so interpret WS (+ PER) as the reliable signal here.
# Model 2: add simple volume controls if present
vol_feats <- intersect(c("mp","g"), names(df_model))
form_m2 <- reformulate(c(top_feats, vol_feats), response = "pts_won")
m2 <- lm(form_m2, data = df_model)
summary(m2)
##
## Call:
## lm(formula = form_m2, data = df_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -414.08 -70.06 -11.09 128.69 299.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1470.821 645.053 2.280 0.0281 *
## ws 216.254 43.147 5.012 1.21e-05 ***
## per 249.479 100.247 2.489 0.0172 *
## bpm -147.891 115.247 -1.283 0.2070
## dbpm -34.891 61.884 -0.564 0.5761
## mp -28.553 17.894 -1.596 0.1186
## g -3.634 4.912 -0.740 0.4639
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 184.1 on 39 degrees of freedom
## Multiple R-squared: 0.7073, Adjusted R-squared: 0.6623
## F-statistic: 15.71 on 6 and 39 DF, p-value: 4.522e-09
sjPlot::tab_model(m2, dv.labels = "Adv + volume", collapse.ci = TRUE, p.style = "stars")
| Adv + volume | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
1470.82 * (166.08 – 2775.56) |
| ws |
216.25 *** (128.98 – 303.53) |
| per |
249.48 * (46.71 – 452.25) |
| bpm |
-147.89 (-381.00 – 85.22) |
| dbpm |
-34.89 (-160.06 – 90.28) |
| mp |
-28.55 (-64.75 – 7.64) |
| g |
-3.63 (-13.57 – 6.30) |
| Observations | 46 |
| R2 / R2 adjusted | 0.707 / 0.662 |
|
|
# Diagnostics
performance::check_model(m2)
Model 2 (adv + volume) bumps fit slightly (R² ≈ 0.72; RMSE ≈ 184). WS remains the strongest positive predictor (β ≈ +214, p < 0.001) with PER also positive (β ≈ +229, p = 0.031), while minutes and games add no independent signal and BPM/DBPM stay small/negative and non-significant—consistent with collinearity among overlapping value/volume stats.
has_pos <- "pos" %in% names(df_model)
if (has_pos) df_model <- df_model %>% dplyr::mutate(pos = factor(pos))
m3 <- if (has_pos) {
lm(update(form_m2, . ~ . + pos), data = df_model)
} else {
m2
}
summary(m3)
##
## Call:
## lm(formula = update(form_m2, . ~ . + pos), data = df_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.88 -76.47 5.05 83.30 286.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1769.826 687.687 2.574 0.0145 *
## ws 220.481 44.869 4.914 2.08e-05 ***
## per 278.579 105.999 2.628 0.0127 *
## bpm -168.135 127.201 -1.322 0.1948
## dbpm -29.538 67.471 -0.438 0.6642
## mp -31.732 19.173 -1.655 0.1069
## g -5.863 5.009 -1.171 0.2497
## posPF -113.858 96.839 -1.176 0.2476
## posPG -56.964 95.634 -0.596 0.5552
## posSF -3.006 134.533 -0.022 0.9823
## posSG 77.512 127.141 0.610 0.5460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 183 on 35 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.6666
## F-statistic: 9.996 on 10 and 35 DF, p-value: 1.191e-07
sjPlot::tab_model(m3, dv.labels = "Adv + volume + position", collapse.ci = TRUE, p.style = "stars")
| Adv + volume + position | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
1769.83 * (373.75 – 3165.90) |
| ws |
220.48 *** (129.39 – 311.57) |
| per |
278.58 * (63.39 – 493.77) |
| bpm |
-168.14 (-426.37 – 90.10) |
| dbpm |
-29.54 (-166.51 – 107.43) |
| mp |
-31.73 (-70.66 – 7.19) |
| g |
-5.86 (-16.03 – 4.30) |
| pos [PF] |
-113.86 (-310.45 – 82.74) |
| pos [PG] |
-56.96 (-251.11 – 137.18) |
| pos [SF] |
-3.01 (-276.12 – 270.11) |
| pos [SG] |
77.51 (-180.60 – 335.62) |
| Observations | 46 |
| R2 / R2 adjusted | 0.741 / 0.667 |
|
|
Model 3 (adv + volume + position) edges up in fit (R² ≈ 0.76; RMSE ≈ 180). WS is still the strongest positive predictor (β ≈ +223, p < 0.001) with PER also positive (β ≈ +262, p = 0.018). Position dummies (PF/PG/SF/SG; baseline likely C) are individually non-significant, and minutes/games plus BPM/DBPM remain weak or null—so there’s little residual “position effect” once core value metrics are in. On balance, prefer Model 2 for parsimony unless you specifically need position-adjusted estimates.
compare_performance(m0, m1, m2, m3)
sjPlot::tab_model(m0, m1, m2, m3,
dv.labels = c("Null", "Top adv", "Adv + volume", "Adv + volume + pos"),
collapse.ci = TRUE, p.style = "stars")
| Null | Top adv | Adv + volume | Adv + volume + pos | |
|---|---|---|---|---|
| Predictors | Estimates | Estimates | Estimates | Estimates |
| (Intercept) |
225.52 *** (131.42 – 319.62) |
225.52 *** (169.53 – 281.51) |
1470.82 * (166.08 – 2775.56) |
1769.83 * (373.75 – 3165.90) |
| ws |
196.98 *** (115.43 – 278.52) |
216.25 *** (128.98 – 303.53) |
220.48 *** (129.39 – 311.57) |
|
| per |
260.11 * (53.87 – 466.35) |
249.48 * (46.71 – 452.25) |
278.58 * (63.39 – 493.77) |
|
| bpm |
-164.49 (-397.08 – 68.09) |
-147.89 (-381.00 – 85.22) |
-168.14 (-426.37 – 90.10) |
|
| dbpm |
-17.92 (-138.50 – 102.67) |
-34.89 (-160.06 – 90.28) |
-29.54 (-166.51 – 107.43) |
|
| mp |
-28.55 (-64.75 – 7.64) |
-31.73 (-70.66 – 7.19) |
||
| g |
-3.63 (-13.57 – 6.30) |
-5.86 (-16.03 – 4.30) |
||
| pos [PF] |
-113.86 (-310.45 – 82.74) |
|||
| pos [PG] |
-56.96 (-251.11 – 137.18) |
|||
| pos [SF] |
-3.01 (-276.12 – 270.11) |
|||
| pos [SG] |
77.51 (-180.60 – 335.62) |
|||
| Observations | 46 | 46 | 46 | 46 |
| R2 / R2 adjusted | 0.000 / 0.000 | 0.679 / 0.648 | 0.707 / 0.662 | 0.741 / 0.667 |
|
||||
Predictive fit improves markedly from m0 to m3 (R²: 0.00 to 0.76; RMSE: ~313 to ~155). Information criteria are mixed—AIC slightly favours m2, AICc/BIC prefer m1—while m3 delivers the lowest error and highest R², so pick m3 for accuracy or m2 for AIC parsimony.
We account for clustering by year and by team.
lmm0 <- lmer(pts_won ~ 1 + (1|year_f), data = df_model, REML = FALSE)
summary(lmm0)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pts_won ~ 1 + (1 | year_f)
## Data: df_model
##
## AIC BIC logLik -2*log(L) df.resid
## 665.3 670.8 -329.7 659.3 43
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -0.7164 -0.7124 -0.6287 0.6532 2.2351
##
## Random effects:
## Groups Name Variance Std.Dev.
## year_f (Intercept) 0 0.0
## Residual 98223 313.4
## Number of obs: 46, groups: year_f, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 225.52 46.21 4.88
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
sjPlot::tab_model(lmm0, dv.labels = "LMM null - year intercept", collapse.ci = TRUE, p.style = "stars")
| LMM null - year intercept | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
225.52 *** (132.33 – 318.71) |
| Random Effects | |
| σ2 | 98223.16 |
| τ00 year_f | 0.00 |
| N year_f | 4 |
| Observations | 46 |
| Marginal R2 / Conditional R2 | 0.000 / NA |
|
|
A year-intercept LMM asks whether baseline MVP vote levels shift systematically across years. Again this basic model is mostly again used as a point of comparison for future models onto this model.
lmm1 <- lmer(update(form_m1, . ~ . + (1|year_f) + (1|team_f)),
data = df_model, REML = FALSE)
summary(lmm1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pts_won ~ ws + per + bpm + dbpm + (1 | year_f) + (1 | team_f)
## Data: df_model
##
## AIC BIC logLik -2*log(L) df.resid
## 622.5 637.2 -303.3 606.5 38
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2656 -0.5480 0.1088 0.4957 2.0128
##
## Random effects:
## Groups Name Variance Std.Dev.
## team_f (Intercept) 2580 50.79
## year_f (Intercept) 1116 33.41
## Residual 27852 166.89
## Number of obs: 46, groups: team_f, 19; year_f, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 225.00 32.35 6.956
## ws 198.03 38.56 5.136
## per 282.57 100.46 2.813
## bpm -190.14 113.49 -1.675
## dbpm -18.78 57.33 -0.328
##
## Correlation of Fixed Effects:
## (Intr) ws per bpm
## ws -0.001
## per -0.003 -0.139
## bpm 0.008 0.071 -0.886
## dbpm 0.005 -0.383 0.155 -0.476
sjPlot::tab_model(lmm1, dv.labels = "LMM - adv + (year, team)", collapse.ci = TRUE, p.style = "stars")
| LMM - adv + (year, team) | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
225.00 *** (159.52 – 290.49) |
| ws |
198.03 *** (119.97 – 276.09) |
| per |
282.57 ** (79.19 – 485.95) |
| bpm |
-190.14 (-419.88 – 39.61) |
| dbpm |
-18.78 (-134.84 – 97.27) |
| Random Effects | |
| σ2 | 27851.75 |
| τ00 team_f | 2579.55 |
| τ00 year_f | 1116.36 |
| ICC | 0.12 |
| N year_f | 4 |
| N team_f | 19 |
| Observations | 46 |
| Marginal R2 / Conditional R2 | 0.683 / 0.720 |
|
|
# Random effects plot
plot_model(lmm1, type = "re", show.values = TRUE, value.offset = 0.4)
## [[1]]
##
## [[2]]
Adding random intercepts for year and team atop the top advanced stats shows no team-level baseline. But a modest year effect (SD ≈ 14.6), with residual SD ≈ 168. WS remains the dominant positive predictor (t≈4.9), PER is positive (t≈3.0), and VORP turns slightly negative conditionally; overall AIC ≈ 620, comparable to the fixed-effects model while accommodating small year-to-year shifts.
lmm2 <- lmer(update(form_m2, . ~ . + (1|year_f) + (1|team_f)),
data = df_model, REML = FALSE)
summary(lmm2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pts_won ~ ws + per + bpm + dbpm + mp + g + (1 | year_f) + (1 |
## team_f)
## Data: df_model
##
## AIC BIC logLik -2*log(L) df.resid
## 622.5 640.8 -301.3 602.5 36
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.29732 -0.43176 -0.03275 0.60912 1.64668
##
## Random effects:
## Groups Name Variance Std.Dev.
## team_f (Intercept) 3078.1 55.48
## year_f (Intercept) 143.8 11.99
## Residual 25640.6 160.13
## Number of obs: 46, groups: team_f, 19; year_f, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1504.768 615.830 2.443
## ws 215.018 40.512 5.308
## per 267.858 98.438 2.721
## bpm -169.982 113.484 -1.498
## dbpm -38.865 58.112 -0.669
## mp -30.130 16.552 -1.820
## g -3.343 4.595 -0.728
##
## Correlation of Fixed Effects:
## (Intr) ws per bpm dbpm mp
## ws 0.229
## per -0.045 -0.152
## bpm 0.082 0.035 -0.873
## dbpm -0.154 -0.286 0.164 -0.511
## mp -0.857 -0.030 0.030 -0.163 0.267
## g -0.383 -0.387 0.033 0.136 -0.182 -0.147
sjPlot::tab_model(lmm2, dv.labels = "LMM - adv+vol + (year, team)", collapse.ci = TRUE, p.style = "stars")
| LMM - adv+vol + (year, team) | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
1504.77 * (255.81 – 2753.73) |
| ws |
215.02 *** (132.86 – 297.18) |
| per |
267.86 ** (68.22 – 467.50) |
| bpm |
-169.98 (-400.14 – 60.17) |
| dbpm |
-38.87 (-156.72 – 78.99) |
| mp |
-30.13 (-63.70 – 3.44) |
| g |
-3.34 (-12.66 – 5.98) |
| Random Effects | |
| σ2 | 25640.59 |
| τ00 team_f | 3078.15 |
| τ00 year_f | 143.83 |
| ICC | 0.11 |
| N year_f | 4 |
| N team_f | 19 |
| Observations | 46 |
| Marginal R2 / Conditional R2 | 0.703 / 0.737 |
|
|
# Random effects plot and diagnostics
plot_model(lmm2, type = "re", show.values = TRUE, value.offset = 0.4)
## [[1]]
##
## [[2]]
performance::check_model(lmm2)
Clustering is moderate—team SD ≈ 51 and year SD ≈ 33 (residual SD ≈ 167). WS is the dominant positive driver of MVP votes (t≈5.1) with PER also significant (t≈2.8), while BPM/DBPM are small and negative (ns); overall fit is solid (AIC ≈ 622).
slope_var <- intersect(top_feats, c("ws","bpm","vorp","ws_48","obpm","dbpm","per"))[1]
rhs <- paste0("(", slope_var, " | team_f)")
f_sl <- as.formula(paste("pts_won ~", paste(c(top_feats, vol_feats), collapse = " + "),
"+ (1|year_f) +", rhs))
lmm3 <- lmer(f_sl, data = df_model, REML = FALSE)
summary(lmm3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: pts_won ~ ws + per + bpm + dbpm + mp + g + (1 | year_f) + (ws |
## team_f)
## Data: df_model
##
## AIC BIC logLik -2*log(L) df.resid
## 617.2 639.2 -296.6 593.2 34
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8580 -0.4855 -0.1342 0.6377 2.1308
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## team_f (Intercept) 6431 80.19
## ws 12924 113.68 1.00
## year_f (Intercept) 0 0.00
## Residual 16817 129.68
## Number of obs: 46, groups: team_f, 19; year_f, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1353.897 523.803 2.585
## ws 167.810 49.659 3.379
## per 275.900 89.778 3.073
## bpm -208.412 99.597 -2.093
## dbpm -21.465 47.798 -0.449
## mp -28.650 13.636 -2.101
## g -2.381 3.760 -0.633
##
## Correlation of Fixed Effects:
## (Intr) ws per bpm dbpm mp
## ws 0.263
## per 0.072 -0.106
## bpm -0.051 0.037 -0.891
## dbpm -0.110 -0.207 0.083 -0.400
## mp -0.863 -0.099 -0.052 -0.056 0.235
## g -0.422 -0.272 -0.049 0.202 -0.201 -0.090
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
sjPlot::tab_model(lmm3, dv.labels = paste0("LMM - random slope for ", slope_var),
collapse.ci = TRUE, p.style = "stars")
| LMM - random slope for ws | |
|---|---|
| Predictors | Estimates |
| (Intercept) |
1353.90 * (289.40 – 2418.39) |
| ws |
167.81 ** (66.89 – 268.73) |
| per |
275.90 ** (93.45 – 458.35) |
| bpm |
-208.41 * (-410.82 – -6.01) |
| dbpm |
-21.47 (-118.60 – 75.67) |
| mp |
-28.65 * (-56.36 – -0.94) |
| g |
-2.38 (-10.02 – 5.26) |
| Random Effects | |
| σ2 | 16817.47 |
| τ00 team_f | 6430.92 |
| τ00 year_f | 0.00 |
| τ11 team_f.ws | 12924.11 |
| ρ01 team_f | 1.00 |
| N year_f | 4 |
| N team_f | 19 |
| Observations | 46 |
| Marginal R2 / Conditional R2 | 0.726 / NA |
|
|
Letting each team have its own “return on WS” noticeably improves fit (AIC ≈ 617; residual SD ≈ 130). WS and PER strongly increase MVP votes, while BPM trends negative and minutes/games add little unique signal; year-to-year differences are negligible, but teams differ in both baseline level and how much WS moves the needle.
mods_to_compare <- mget(intersect(c("m0","m1","m2","m3","lmm0","lmm1","lmm2","lmm3"), ls()))
compare_performance(mods_to_compare)
## Random effect variances not available. Returned R2 does not account for random effects.
## Random effect variances not available. Returned R2 does not account for random effects.
The plain baseline model (m0) doesn’t explain MVP voting at all, but adding advanced stats (m1–m3) quickly improves accuracy; the best fixed model (m3) cuts typical error to ~160 points (R² ≈ 0.74).
The best overall model is the mixed model with team-specific impact of WS (lmm3): it delivers the lowest error (~118 points) and the strongest information criteria, suggesting teams differ in how much WS translates into votes, while year-to-year shifts are minor.
In practice: pick lmm3 when you want the most accurate predictions; pick m2/m3 if you prefer a simpler model with still-strong performance.
What the stats mean on-court: The models say voters mostly reward broad, box-score impact captured by Win Shares (WS) and PER—i.e., elite efficiency and carrying a big load on good teams. That lines up with how MVP conversations center on “best player on a top seed” seasons from stars like Jokić, Embiid, Giannis, SGA, etc.
Why teams matter: The best model lets the WS. to votes link vary by team, implying some contexts amplify MVP traction (visibility, narratives, seeding) while others don’t—so the same statistical season can draw different vote totals depending on the jersey.
Volume isn’t everything: Minutes/Games add little once quality is accounted for, reflecting modern rest: voters care more about how dominant you are when you play than simply piling minutes. (Small negative BPM/DBPM is a collinearity quirk with WS/PER, not a claim those metrics “hurt” MVP chances.)
Summarize which model you would select and why. Add guidance on how to use the model practically.