Overview

This document joins MVP voting tables with advanced stat tables, harmonize data types, and model MVP voting outcomes using linear models and linear mixed models. The goal is to identify which advanced statistics best predict MVP voting outcomes, with a focus on the pts_won column as the response.

From the last 4 seasons of the NBA is it possible to derive an understanding of the best predictors among the MVP contenders which sits the winners and those who rank higher in total MVP votes compared to others. in our modelling process we will be using linear and mixed linear models to see if there is a relation and significiant predictor in our context.

Setup

We load packages, set global chunk options

knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 5
)

library(tidyverse)
library(readr)
library(janitor)
library(performance)
library(sjPlot)
library(lme4)
library(GGally)
library(broom)
library(readxl)
library(patchwork)
library(stringr)
library(purrr)
library(tidyr)

Data Ingestion: Excel Workbooks (MVP and Avd)

Download Avd_2025.xlsx
Download MVP_2025.xlsx

Download Avd_2024.xlsx
Download MVP_2024.xlsx

Download Avd_2023.xlsx
Download MVP_2023.xlsx

Download Avd_2022.xlsx
Download MVP_2022.xlsx

We read the eight Excel files, split MVP vs Avd sets, and join each MVP year to the matching Avd year on Player.

# ---------------------------
# Data Ingestion (hardcoded)
# ---------------------------
library(readxl)
library(dplyr)
library(purrr)
library(stringr)

data_dir <- "data"  # change if needed

# Hardcode the exact files you want to load
avd_files <- c("Avd_2022.xlsx","Avd_2023.xlsx","Avd_2024.xlsx","Avd_2025.xlsx")
mvp_files <- c("MVP_2022.xlsx","MVP_2023.xlsx","MVP_2024.xlsx","MVP_2025.xlsx")

# Optional sanity checks
missing_avd <- avd_files[!file.exists(file.path(data_dir, avd_files))]
missing_mvp <- mvp_files[!file.exists(file.path(data_dir, mvp_files))]
if (length(missing_avd)) warning("Missing AVD files: ", paste(missing_avd, collapse = ", "))
if (length(missing_mvp)) warning("Missing MVP files: ", paste(missing_mvp, collapse = ", "))

# Read each set into named lists
avd_list <- avd_files %>%
  set_names() %>%
  map(~ readxl::read_excel(file.path(data_dir, .x), sheet = 1))

mvp_list <- mvp_files %>%
  set_names() %>%
  map(~ readxl::read_excel(file.path(data_dir, .x), sheet = 1))

# Pair each MVP with its matching Avd year and join on Player
mvp_with_avd <- imap(mvp_list, function(mvp_df, mvp_name) {
  year   <- stringr::str_extract(mvp_name, "[0-9]{4}")
  avd_nm <- paste0("Avd_", year, ".xlsx")
  avd_df <- avd_list[[avd_nm]]

  if (is.null(avd_df)) {
    warning("No Avd file found for ", mvp_name, ". Returning MVP data unchanged.")
    return(mvp_df)
  }

  avd_df <- avd_df %>% distinct(Player, .keep_all = TRUE)
  left_join(mvp_df, avd_df, by = "Player", suffix = c("", "_Avd"))
})

We join the advanced datasets onto the MVPs to reduce the overall noise in the modelling process. with so many players not recieving votes for MVP if we were to include these players it could negaiticely skew the modelling process with so many players without any votes.

Harmonize Types and Build `combined`

We bind all MVP-Avd pairs together after coercing to a safe representation to avoid type clashes, then convert back to numeric where appropriate. This follows the final code you prepared.

to_char_df <- function(df) {
  df %>%
    dplyr::mutate(dplyr::across(where(is.list),
                  ~ purrr::map_chr(.x, ~ if (length(.x) == 0 || all(is.na(.x)))
                    NA_character_ else paste0(as.character(.x), collapse = "; ")))) %>%
    dplyr::mutate(dplyr::across(dplyr::everything(), as.character))
}

# Convert each joined MVP table, add metadata, then bind
mvp_with_avd_chr <- purrr::imap(mvp_with_avd, function(x, nm) {
  to_char_df(x) %>%
    dplyr::mutate(
      source_file = nm,
      year = stringr::str_extract(nm, "[0-9]{4}") |> as.character()
    )
})

combined <- dplyr::bind_rows(mvp_with_avd_chr)

# Columns that should stay character
char_cols <- c("Player", "Team", "Pos", "Awards", "source_file")

# Option 1: convert everything else with parse_number, then fix year
combined_typed <- combined %>%
  dplyr::mutate(dplyr::across(dplyr::all_of(char_cols), as.character)) %>%
  dplyr::mutate(dplyr::across(setdiff(names(.), char_cols), ~ readr::parse_number(.x))) %>%
  dplyr::mutate(year = as.integer(year))

# Option 2: auto-detect numeric-like columns without using \d
looks_numeric <- function(x, thresh = 0.9) {
  x <- x[!is.na(x) & x != ""]
  if (length(x) == 0) return(FALSE)
  mean(stringr::str_detect(x, "^[+-]?[0-9]*([.][0-9]+)?$")) >= thresh
}

combined_typed <- combined %>%
  dplyr::mutate(dplyr::across(dplyr::all_of(char_cols), as.character)) %>%
  dplyr::mutate(dplyr::across(where(looks_numeric), ~ readr::parse_number(.x))) %>%
  dplyr::mutate(year = as.integer(year))

library(dplyr)

wanted <- c(
  "Rank","Player","Age","First","Pts Won","Pts Max","Share",
  "G","MP","PTS","TRB","AST","STL","BLK","FG%","3P%","FT%",
  "WS","Rk","Team","Pos","GS","PER","TS%","3PAr","FTr",
  "ORB%","DRB%","TRB%","AST%","STL%","BLK%","TOV%","USG%",
  "OWS","DWS","OBPM","DBPM","BPM","VORP","Awards","source_file","year"
)

combined_keep <- combined_typed %>%
  # keep exactly the columns requested, in the given order (missing ones are silently skipped)
  select(any_of(wanted))

Some datatypes here are different among each season therefore it is important that when we bind all the seasons together that they are all the same data type. we can then change these back to numeric if required for calculations later.

Modeling Objective and Data Prep

We will model MVP voting outcomes using pts_won as the response. We also retain Pts Won as Pts_Won for reference. We clean names and construct team fields.

df0 <- combined_keep

# Make a clean, modeling friendly version
df <- df0 %>%
  dplyr::rename(Pts_Won = `Pts Won`) %>%
  janitor::clean_names() %>%
  dplyr::mutate(
    year = as.integer(year),
    team = dplyr::coalesce(team),
    team = as.character(team)
  ) %>%
  dplyr::filter(!is.na(pts_won))

Points won is our independent variable. It is because it is the best way of ranking the value of the MVP winner compared to its contenders. Just ranking the players each season would reduce the clarity in margin and reduce our models prediction potential.

Feature Scouting

We identify candidate advanced stats and rank them by absolute correlation with pts_won. We then select the top features for modeling, and set up factor versions of year and team.

adv_candidates <- c(
  "per","ts","x3_par","ftr",
  "orb","drb","trb","ast","stl","blk","tov","usg",
  "ows","dws","ws","ws_48","obpm","dbpm","bpm","vorp"
)

adv_candidates <- intersect(adv_candidates, names(df))

cor_tbl <- purrr::map_dfr(
  adv_candidates,
  ~ tibble::tibble(
      var = .x,
      r = suppressWarnings(cor(df[[.x]], df$pts_won, use = "pairwise.complete.obs"))
    )
) %>%
  dplyr::mutate(abs_r = abs(r)) %>%
  dplyr::arrange(dplyr::desc(abs_r))

top_k <- 4L
top_feats <- cor_tbl$var[seq_len(min(top_k, nrow(cor_tbl)))]

cor_tbl %>% dplyr::slice(1:min(12, n()))

top_feats

## [1] "ws"   "per"  "bpm"  "dbpm"

#top_feats <- c('ws','per','vorp','blk','stl')

df_model <- df %>%
  dplyr::mutate(
    dplyr::across(all_of(top_feats), scale),
    year_f = factor(year),
    team_f = factor(team)
  )

We use the correlation between each od the advanced metrics in our NBA dataset and out Points Won independent variable to figure out a starting point for the most valuable variables that we could consider including in our model. As we can see from the results the top 4 are the metrics we are considering as they have the top 5 highest correlation with our independent variable.

Linear Models

We fit models in increasing complexity, then compare.

Linear Models

Null model - m0

# Null model
m0 <- lm(pts_won ~ 1, data = df_model)
summary(m0)

## 
## Call:
## lm(formula = pts_won ~ 1, data = df_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -224.5 -223.3 -197.0  204.7  700.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   225.52      46.72   4.827 1.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 316.9 on 45 degrees of freedom

sjPlot::tab_model(m0, dv.labels = "Null model", collapse.ci = TRUE, p.style = "stars")

	Null model
Predictors	Estimates
(Intercept)	225.52 ^*** (131.42 – 319.62)
Observations	46
R² / R² adjusted	0.000 / 0.000
p<0.05 p<0.01 * p<0.001

This null models purpose is to comapre the rest of our models against. if we cannot improve on the null model there is not likely a worthy investigation to how the advanved metrics can predict number of MVP points won

Model 1 - Top advanced stats only (m1)

# Model 1: top advanced metrics only
form_m1 <- reformulate(top_feats, response = "pts_won")
m1 <- lm(form_m1, data = df_model)
summary(m1)

## 
## Call:
## lm(formula = form_m1, data = df_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.47 -108.00   25.07   96.12  324.70 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   225.52      27.72   8.135 4.33e-10 ***
## ws            196.98      40.38   4.878 1.66e-05 ***
## per           260.11     102.12   2.547   0.0147 *  
## bpm          -164.49     115.17  -1.428   0.1608    
## dbpm          -17.92      59.71  -0.300   0.7656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 188 on 41 degrees of freedom
## Multiple R-squared:  0.6792, Adjusted R-squared:  0.6479 
## F-statistic:  21.7 on 4 and 41 DF,  p-value: 1.127e-09

sjPlot::tab_model(m1, dv.labels = "Top advanced stats", collapse.ci = TRUE, p.style = "stars")

	Top advanced stats
Predictors	Estimates
(Intercept)	225.52 ^*** (169.53 – 281.51)
ws	196.98 ^*** (115.43 – 278.52)
per	260.11 ^* (53.87 – 466.35)
bpm	-164.49 (-397.08 – 68.09)
dbpm	-17.92 (-138.50 – 102.67)
Observations	46
R² / R² adjusted	0.679 / 0.648
p<0.05 p<0.01 * p<0.001

Updated Model 1 explains ~69% of the variation in MVP vote points (R² ≈ 0.685; RMSE ≈ 189), with WS the dominant positive predictor (β ≈ +195 per 1 SD, p < 0.001) and PER also positive and significant (β ≈ +242, p = 0.026). The small negative, non-significant BPM/DBPM coefficients arise from multicollinearity/suppression among overlapping “value” stats (e.g., WS, BPM, DBPM): marginally they correlate positively with votes, but once WS (and PER) are in the model their unique contribution is near zero and the conditional sign can flip—so interpret WS (+ PER) as the reliable signal here.

Model 2 - Add simple volume controls (m2)

# Model 2: add simple volume controls if present
vol_feats <- intersect(c("mp","g"), names(df_model))
form_m2 <- reformulate(c(top_feats, vol_feats), response = "pts_won")
m2 <- lm(form_m2, data = df_model)
summary(m2)

## 
## Call:
## lm(formula = form_m2, data = df_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -414.08  -70.06  -11.09  128.69  299.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1470.821    645.053   2.280   0.0281 *  
## ws           216.254     43.147   5.012 1.21e-05 ***
## per          249.479    100.247   2.489   0.0172 *  
## bpm         -147.891    115.247  -1.283   0.2070    
## dbpm         -34.891     61.884  -0.564   0.5761    
## mp           -28.553     17.894  -1.596   0.1186    
## g             -3.634      4.912  -0.740   0.4639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 184.1 on 39 degrees of freedom
## Multiple R-squared:  0.7073, Adjusted R-squared:  0.6623 
## F-statistic: 15.71 on 6 and 39 DF,  p-value: 4.522e-09

sjPlot::tab_model(m2, dv.labels = "Adv + volume", collapse.ci = TRUE, p.style = "stars")

	Adv + volume
Predictors	Estimates
(Intercept)	1470.82 ^* (166.08 – 2775.56)
ws	216.25 ^*** (128.98 – 303.53)
per	249.48 ^* (46.71 – 452.25)
bpm	-147.89 (-381.00 – 85.22)
dbpm	-34.89 (-160.06 – 90.28)
mp	-28.55 (-64.75 – 7.64)
g	-3.63 (-13.57 – 6.30)
Observations	46
R² / R² adjusted	0.707 / 0.662
p<0.05 p<0.01 * p<0.001

# Diagnostics
performance::check_model(m2)

Model 2 (adv + volume) bumps fit slightly (R² ≈ 0.72; RMSE ≈ 184). WS remains the strongest positive predictor (β ≈ +214, p < 0.001) with PER also positive (β ≈ +229, p = 0.031), while minutes and games add no independent signal and BPM/DBPM stay small/negative and non-significant—consistent with collinearity among overlapping value/volume stats.

Model 3 - Add position factor if available (m3)

has_pos <- "pos" %in% names(df_model)
if (has_pos) df_model <- df_model %>% dplyr::mutate(pos = factor(pos))

m3 <- if (has_pos) {
  lm(update(form_m2, . ~ . + pos), data = df_model)
} else {
  m2
}

summary(m3)

## 
## Call:
## lm(formula = update(form_m2, . ~ . + pos), data = df_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.88  -76.47    5.05   83.30  286.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1769.826    687.687   2.574   0.0145 *  
## ws           220.481     44.869   4.914 2.08e-05 ***
## per          278.579    105.999   2.628   0.0127 *  
## bpm         -168.135    127.201  -1.322   0.1948    
## dbpm         -29.538     67.471  -0.438   0.6642    
## mp           -31.732     19.173  -1.655   0.1069    
## g             -5.863      5.009  -1.171   0.2497    
## posPF       -113.858     96.839  -1.176   0.2476    
## posPG        -56.964     95.634  -0.596   0.5552    
## posSF         -3.006    134.533  -0.022   0.9823    
## posSG         77.512    127.141   0.610   0.5460    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 183 on 35 degrees of freedom
## Multiple R-squared:  0.7407, Adjusted R-squared:  0.6666 
## F-statistic: 9.996 on 10 and 35 DF,  p-value: 1.191e-07

sjPlot::tab_model(m3, dv.labels = "Adv + volume + position", collapse.ci = TRUE, p.style = "stars")

	Adv + volume + position
Predictors	Estimates
(Intercept)	1769.83 ^* (373.75 – 3165.90)
ws	220.48 ^*** (129.39 – 311.57)
per	278.58 ^* (63.39 – 493.77)
bpm	-168.14 (-426.37 – 90.10)
dbpm	-29.54 (-166.51 – 107.43)
mp	-31.73 (-70.66 – 7.19)
g	-5.86 (-16.03 – 4.30)
pos [PF]	-113.86 (-310.45 – 82.74)
pos [PG]	-56.96 (-251.11 – 137.18)
pos [SF]	-3.01 (-276.12 – 270.11)
pos [SG]	77.51 (-180.60 – 335.62)
Observations	46
R² / R² adjusted	0.741 / 0.667
p<0.05 p<0.01 * p<0.001

Model 3 (adv + volume + position) edges up in fit (R² ≈ 0.76; RMSE ≈ 180). WS is still the strongest positive predictor (β ≈ +223, p < 0.001) with PER also positive (β ≈ +262, p = 0.018). Position dummies (PF/PG/SF/SG; baseline likely C) are individually non-significant, and minutes/games plus BPM/DBPM remain weak or null—so there’s little residual “position effect” once core value metrics are in. On balance, prefer Model 2 for parsimony unless you specifically need position-adjusted estimates.

LM comparison and summary

compare_performance(m0, m1, m2, m3)

sjPlot::tab_model(m0, m1, m2, m3,
                  dv.labels = c("Null", "Top adv", "Adv + volume", "Adv + volume + pos"),
                  collapse.ci = TRUE, p.style = "stars")

	Null	Top adv	Adv + volume	Adv + volume + pos
Predictors	Estimates	Estimates	Estimates	Estimates
(Intercept)	225.52 ^*** (131.42 – 319.62)	225.52 ^*** (169.53 – 281.51)	1470.82 ^* (166.08 – 2775.56)	1769.83 ^* (373.75 – 3165.90)
ws		196.98 ^*** (115.43 – 278.52)	216.25 ^*** (128.98 – 303.53)	220.48 ^*** (129.39 – 311.57)
per		260.11 ^* (53.87 – 466.35)	249.48 ^* (46.71 – 452.25)	278.58 ^* (63.39 – 493.77)
bpm		-164.49 (-397.08 – 68.09)	-147.89 (-381.00 – 85.22)	-168.14 (-426.37 – 90.10)
dbpm		-17.92 (-138.50 – 102.67)	-34.89 (-160.06 – 90.28)	-29.54 (-166.51 – 107.43)
mp			-28.55 (-64.75 – 7.64)	-31.73 (-70.66 – 7.19)
g			-3.63 (-13.57 – 6.30)	-5.86 (-16.03 – 4.30)
pos [PF]				-113.86 (-310.45 – 82.74)
pos [PG]				-56.96 (-251.11 – 137.18)
pos [SF]				-3.01 (-276.12 – 270.11)
pos [SG]				77.51 (-180.60 – 335.62)
Observations	46	46	46	46
R² / R² adjusted	0.000 / 0.000	0.679 / 0.648	0.707 / 0.662	0.741 / 0.667
p<0.05 p<0.01 * p<0.001

Predictive fit improves markedly from m0 to m3 (R²: 0.00 to 0.76; RMSE: ~313 to ~155). Information criteria are mixed—AIC slightly favours m2, AICc/BIC prefer m1—while m3 delivers the lowest error and highest R², so pick m3 for accuracy or m2 for AIC parsimony.

Mixed-Effects Models

We account for clustering by year and by team.

LMM0 - Random intercepts for year

lmm0 <- lmer(pts_won ~ 1 + (1|year_f), data = df_model, REML = FALSE)
summary(lmm0)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: pts_won ~ 1 + (1 | year_f)
##    Data: df_model
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##     665.3     670.8    -329.7     659.3        43 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -0.7164 -0.7124 -0.6287  0.6532  2.2351 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year_f   (Intercept)     0      0.0   
##  Residual             98223    313.4   
## Number of obs: 46, groups:  year_f, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   225.52      46.21    4.88
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

sjPlot::tab_model(lmm0, dv.labels = "LMM null - year intercept", collapse.ci = TRUE, p.style = "stars")

	LMM null - year intercept
Predictors	Estimates
(Intercept)	225.52 ^*** (132.33 – 318.71)
Random Effects
σ²	98223.16
τ₀₀ _{year_f}	0.00
N _{year_f}	4
Observations	46
Marginal R² / Conditional R²	0.000 / NA
p<0.05 p<0.01 * p<0.001

A year-intercept LMM asks whether baseline MVP vote levels shift systematically across years. Again this basic model is mostly again used as a point of comparison for future models onto this model.

LMM1 - Top advanced stats with random intercepts for year and team

lmm1 <- lmer(update(form_m1, . ~ . + (1|year_f) + (1|team_f)),
             data = df_model, REML = FALSE)
summary(lmm1)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: pts_won ~ ws + per + bpm + dbpm + (1 | year_f) + (1 | team_f)
##    Data: df_model
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##     622.5     637.2    -303.3     606.5        38 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2656 -0.5480  0.1088  0.4957  2.0128 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  team_f   (Intercept)  2580     50.79  
##  year_f   (Intercept)  1116     33.41  
##  Residual             27852    166.89  
## Number of obs: 46, groups:  team_f, 19; year_f, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   225.00      32.35   6.956
## ws            198.03      38.56   5.136
## per           282.57     100.46   2.813
## bpm          -190.14     113.49  -1.675
## dbpm          -18.78      57.33  -0.328
## 
## Correlation of Fixed Effects:
##      (Intr) ws     per    bpm   
## ws   -0.001                     
## per  -0.003 -0.139              
## bpm   0.008  0.071 -0.886       
## dbpm  0.005 -0.383  0.155 -0.476

sjPlot::tab_model(lmm1, dv.labels = "LMM - adv + (year, team)", collapse.ci = TRUE, p.style = "stars")

	LMM - adv + (year, team)
Predictors	Estimates
(Intercept)	225.00 ^*** (159.52 – 290.49)
ws	198.03 ^*** (119.97 – 276.09)
per	282.57 ^** (79.19 – 485.95)
bpm	-190.14 (-419.88 – 39.61)
dbpm	-18.78 (-134.84 – 97.27)
Random Effects
σ²	27851.75
τ₀₀ _{team_f}	2579.55
τ₀₀ _{year_f}	1116.36
ICC	0.12
N _{year_f}	4
N _{team_f}	19
Observations	46
Marginal R² / Conditional R²	0.683 / 0.720
p<0.05 p<0.01 * p<0.001

# Random effects plot
plot_model(lmm1, type = "re", show.values = TRUE, value.offset = 0.4)

## [[1]]

## 
## [[2]]

Adding random intercepts for year and team atop the top advanced stats shows no team-level baseline. But a modest year effect (SD ≈ 14.6), with residual SD ≈ 168. WS remains the dominant positive predictor (t≈4.9), PER is positive (t≈3.0), and VORP turns slightly negative conditionally; overall AIC ≈ 620, comparable to the fixed-effects model while accommodating small year-to-year shifts.

LMM2 - Advanced + volume with random intercepts for year and team

lmm2 <- lmer(update(form_m2, . ~ . + (1|year_f) + (1|team_f)),
             data = df_model, REML = FALSE)
summary(lmm2)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: pts_won ~ ws + per + bpm + dbpm + mp + g + (1 | year_f) + (1 |  
##     team_f)
##    Data: df_model
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##     622.5     640.8    -301.3     602.5        36 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.29732 -0.43176 -0.03275  0.60912  1.64668 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  team_f   (Intercept)  3078.1   55.48  
##  year_f   (Intercept)   143.8   11.99  
##  Residual             25640.6  160.13  
## Number of obs: 46, groups:  team_f, 19; year_f, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 1504.768    615.830   2.443
## ws           215.018     40.512   5.308
## per          267.858     98.438   2.721
## bpm         -169.982    113.484  -1.498
## dbpm         -38.865     58.112  -0.669
## mp           -30.130     16.552  -1.820
## g             -3.343      4.595  -0.728
## 
## Correlation of Fixed Effects:
##      (Intr) ws     per    bpm    dbpm   mp    
## ws    0.229                                   
## per  -0.045 -0.152                            
## bpm   0.082  0.035 -0.873                     
## dbpm -0.154 -0.286  0.164 -0.511              
## mp   -0.857 -0.030  0.030 -0.163  0.267       
## g    -0.383 -0.387  0.033  0.136 -0.182 -0.147

sjPlot::tab_model(lmm2, dv.labels = "LMM - adv+vol + (year, team)", collapse.ci = TRUE, p.style = "stars")

	LMM - adv+vol + (year, team)
Predictors	Estimates
(Intercept)	1504.77 ^* (255.81 – 2753.73)
ws	215.02 ^*** (132.86 – 297.18)
per	267.86 ^** (68.22 – 467.50)
bpm	-169.98 (-400.14 – 60.17)
dbpm	-38.87 (-156.72 – 78.99)
mp	-30.13 (-63.70 – 3.44)
g	-3.34 (-12.66 – 5.98)
Random Effects
σ²	25640.59
τ₀₀ _{team_f}	3078.15
τ₀₀ _{year_f}	143.83
ICC	0.11
N _{year_f}	4
N _{team_f}	19
Observations	46
Marginal R² / Conditional R²	0.703 / 0.737
p<0.05 p<0.01 * p<0.001

# Random effects plot and diagnostics
plot_model(lmm2, type = "re", show.values = TRUE, value.offset = 0.4)

## [[1]]

## 
## [[2]]

performance::check_model(lmm2)

Clustering is moderate—team SD ≈ 51 and year SD ≈ 33 (residual SD ≈ 167). WS is the dominant positive driver of MVP votes (t≈5.1) with PER also significant (t≈2.8), while BPM/DBPM are small and negative (ns); overall fit is solid (AIC ≈ 622).

LMM3 - Optional random slope across teams

slope_var <- intersect(top_feats, c("ws","bpm","vorp","ws_48","obpm","dbpm","per"))[1]

rhs <- paste0("(", slope_var, " | team_f)")
f_sl <- as.formula(paste("pts_won ~", paste(c(top_feats, vol_feats), collapse = " + "),
                         "+ (1|year_f) +", rhs))
lmm3 <- lmer(f_sl, data = df_model, REML = FALSE)
summary(lmm3)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: pts_won ~ ws + per + bpm + dbpm + mp + g + (1 | year_f) + (ws |  
##     team_f)
##    Data: df_model
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##     617.2     639.2    -296.6     593.2        34 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8580 -0.4855 -0.1342  0.6377  2.1308 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  team_f   (Intercept)  6431     80.19       
##           ws          12924    113.68   1.00
##  year_f   (Intercept)     0      0.00       
##  Residual             16817    129.68       
## Number of obs: 46, groups:  team_f, 19; year_f, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 1353.897    523.803   2.585
## ws           167.810     49.659   3.379
## per          275.900     89.778   3.073
## bpm         -208.412     99.597  -2.093
## dbpm         -21.465     47.798  -0.449
## mp           -28.650     13.636  -2.101
## g             -2.381      3.760  -0.633
## 
## Correlation of Fixed Effects:
##      (Intr) ws     per    bpm    dbpm   mp    
## ws    0.263                                   
## per   0.072 -0.106                            
## bpm  -0.051  0.037 -0.891                     
## dbpm -0.110 -0.207  0.083 -0.400              
## mp   -0.863 -0.099 -0.052 -0.056  0.235       
## g    -0.422 -0.272 -0.049  0.202 -0.201 -0.090
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

sjPlot::tab_model(lmm3, dv.labels = paste0("LMM - random slope for ", slope_var),
                  collapse.ci = TRUE, p.style = "stars")

	LMM - random slope for ws
Predictors	Estimates
(Intercept)	1353.90 ^* (289.40 – 2418.39)
ws	167.81 ^** (66.89 – 268.73)
per	275.90 ^** (93.45 – 458.35)
bpm	-208.41 ^* (-410.82 – -6.01)
dbpm	-21.47 (-118.60 – 75.67)
mp	-28.65 ^* (-56.36 – -0.94)
g	-2.38 (-10.02 – 5.26)
Random Effects
σ²	16817.47
τ₀₀ _{team_f}	6430.92
τ₀₀ _{year_f}	0.00
τ₁₁ _{team_f.ws}	12924.11
ρ₀₁ _{team_f}	1.00
N _{year_f}	4
N _{team_f}	19
Observations	46
Marginal R² / Conditional R²	0.726 / NA
p<0.05 p<0.01 * p<0.001

Letting each team have its own “return on WS” noticeably improves fit (AIC ≈ 617; residual SD ≈ 130). WS and PER strongly increase MVP votes, while BPM trends negative and minutes/games add little unique signal; year-to-year differences are negligible, but teams differ in both baseline level and how much WS moves the needle.

Mixed model comparison and summary

mods_to_compare <- mget(intersect(c("m0","m1","m2","m3","lmm0","lmm1","lmm2","lmm3"), ls()))
compare_performance(mods_to_compare)

## Random effect variances not available. Returned R2 does not account for random effects.
## Random effect variances not available. Returned R2 does not account for random effects.

The plain baseline model (m0) doesn’t explain MVP voting at all, but adding advanced stats (m1–m3) quickly improves accuracy; the best fixed model (m3) cuts typical error to ~160 points (R² ≈ 0.74).

The best overall model is the mixed model with team-specific impact of WS (lmm3): it delivers the lowest error (~118 points) and the strongest information criteria, suggesting teams differ in how much WS translates into votes, while year-to-year shifts are minor.

In practice: pick lmm3 when you want the most accurate predictions; pick m2/m3 if you prefer a simpler model with still-strong performance.

What the stats mean on-court: The models say voters mostly reward broad, box-score impact captured by Win Shares (WS) and PER—i.e., elite efficiency and carrying a big load on good teams. That lines up with how MVP conversations center on “best player on a top seed” seasons from stars like Jokić, Embiid, Giannis, SGA, etc.

Why teams matter: The best model lets the WS. to votes link vary by team, implying some contexts amplify MVP traction (visibility, narratives, seeding) while others don’t—so the same statistical season can draw different vote totals depending on the jersey.

Volume isn’t everything: Minutes/Games add little once quality is accounted for, reflecting modern rest: voters care more about how dominant you are when you play than simply piling minutes. (Small negative BPM/DBPM is a collinearity quirk with WS/PER, not a claim those metrics “hurt” MVP chances.)

Summarize which model you would select and why. Add guidance on how to use the model practically.

Modeling NBA MVP Voting with Advanced Stats and Mixed Models

Alexander Martin

2025-10-25

Overview

Setup

Data Ingestion: Excel Workbooks (MVP and Avd)

Harmonize Types and Build `combined`

Modeling Objective and Data Prep

Feature Scouting

Linear Models

Linear Models

Null model - m0

Model 1 - Top advanced stats only (m1)

Model 2 - Add simple volume controls (m2)

Model 3 - Add position factor if available (m3)

LM comparison and summary

Mixed-Effects Models

LMM0 - Random intercepts for year

LMM1 - Top advanced stats with random intercepts for year and team

LMM2 - Advanced + volume with random intercepts for year and team

LMM3 - Optional random slope across teams

Mixed model comparison and summary

Modeling NBA MVP Voting with Advanced Stats and Mixed Models

Alexander Martin

2025-10-25

Overview

Setup

Data Ingestion: Excel Workbooks (MVP and Avd)

Harmonize Types and Build combined

Modeling Objective and Data Prep

Feature Scouting

Linear Models

Linear Models

Null model - m0

Model 1 - Top advanced stats only (m1)

Model 2 - Add simple volume controls (m2)

Model 3 - Add position factor if available (m3)

LM comparison and summary

Mixed-Effects Models

LMM0 - Random intercepts for year

LMM1 - Top advanced stats with random intercepts for year and team

LMM2 - Advanced + volume with random intercepts for year and team

LMM3 - Optional random slope across teams

Mixed model comparison and summary

Harmonize Types and Build `combined`