MVP! MVP!

NBA basketball has started again. It’s a chance for closure on the 2019-2020 season and an opportunity to thrill millions of fans stuck at home during the global pandemic.



Goat

Real MVP

I’ve always been interested in knowing whether season stats actually predict who the MVP is. There are certainly instances where the best player should have received the MVP nod but did not (e.g., Michael Jordan in 1996-97).

So, there is most definitely some error in who ends up being named MVP.

Quick thanks to basketball-reference.com, which has all the basketball stats you could ever want.

Some of the metrics that basketball-reference lists are:

  • g (games played)
  • mp (average minutes played)
  • pts (average points per game)
  • trb (average total rebounds per game)
  • ast (average assists per game)
  • stl (average steals per game)
  • blk (average blocks per game)
  • fg_percent (average field goal percentage)
  • x3p_percent (average 3pt percentage)
  • ft_percent (average free throw percentage)
  • ws (an estimate of number of wins contributed by a player)
  • ws_48 (an estimate of the number of wins contributed by a player per 48 minutes; league average is .100)

The data also include other information, such as the share of MVP votes a player received, their age, their team, etc.

What predicts an MVP

I used rank as the outcome. There is an issue with this decision: MVP rank is an ordinal (rank-ordered) variable, but for these purposes I’m going to treat it as continuous. Putting that aside, let’s run a model using the stats listed above.

Because there are multiple rows (players) per year, I need to run a mixed-effects model. lme4 will allow me to account for the multilevel structure of the data (I’ll do so by estimating random intercepts for year). And just for the p-values, I’ll load the lmerTest package.

library(tidyverse) #load GOAT package
library(lme4)
library(lmerTest)

head(mvp3)
## # A tibble: 6 x 20
##     age     g    mp   pts   trb   ast   stl   blk fg_percent x3p_percent
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>       <dbl>
## 1    24    72  32.8  27.7  12.5   5.9   1.3   1.5      0.578       0.256
## 2    29    78  36.8  36.1   6.6   7.5   2     0.7      0.442       0.368
## 3    28    77  36.9  28     8.2   4.1   2.2   0.4      0.438       0.386
## 4    23    80  31.3  20.1  10.8   7.3   1.4   0.7      0.511       0.307
## 5    30    69  33.8  27.3   5.3   5.2   1.3   0.4      0.472       0.437
## 6    28    80  35.5  25.8   4.6   6.9   1.1   0.4      0.444       0.369
## # ... with 10 more variables: ft_percent <dbl>, ws <dbl>, ws_48 <dbl>,
## #   mvp <dbl>, year <dbl>, rank <dbl>, player <chr>, pred <dbl>, screwed <chr>,
## #   lucky <chr>
# Mixed-effects model with random intercepts for year
m1 <- lmer(
  rank ~ g + mp + pts + trb + ast + stl + blk +
    fg_percent + x3p_percent + ft_percent + ws + ws_48 + (1 | year),
  data = mvp3
)

Warning! Nerd Stuff

Alright, excellent.

Let’s see what predicts MVP rank.

summary(m1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## rank ~ g + mp + pts + trb + ast + stl + blk + fg_percent + x3p_percent +  
##     ft_percent + ws + ws_48 + (1 | year)
##    Data: mvp3
## 
## REML criterion at convergence: 3281.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6099 -0.6868 -0.0289  0.6548  3.2459 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept)  0.683   0.8264  
##  Residual             10.397   3.2245  
## Number of obs: 628, groups:  year, 40
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  19.96873    7.46286 609.89359   2.676  0.00766 ** 
## g            -0.00881    0.04717 612.27973  -0.187  0.85191    
## mp           -0.08167    0.10858 614.34555  -0.752  0.45223    
## pts          -0.26989    0.03854 529.39034  -7.003 7.65e-12 ***
## trb          -0.26811    0.06776 608.88047  -3.957 8.50e-05 ***
## ast          -0.52444    0.07154 613.34058  -7.331 7.24e-13 ***
## stl           0.68512    0.29928 603.81689   2.289  0.02241 *  
## blk          -0.55838    0.21971 606.61808  -2.541  0.01129 *  
## fg_percent   17.51947    4.22977 474.60834   4.142 4.08e-05 ***
## x3p_percent   3.09026    1.10596 611.91176   2.794  0.00537 ** 
## ft_percent    4.37864    2.23443 603.28443   1.960  0.05050 .  
## ws           -0.32042    0.31566 609.19869  -1.015  0.31048    
## ws_48       -36.92054   18.54454 612.07915  -1.991  0.04693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
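
The random-effects block in the output above also says how much of the leftover variation in rank sits between years rather than between players within a year. Here is a minimal sketch that uses only the two variance estimates printed there to compute the intraclass correlation; the names var_year, var_resid, and icc are mine, not something from the original analysis.

```r
# Variance estimates copied from the "Random effects" block of summary(m1)
var_year  <- 0.683    # random intercept variance for year
var_resid <- 10.397   # residual variance

# Intraclass correlation: share of (conditional) variance attributable to year
icc <- var_year / (var_year + var_resid)
round(icc, 3)
```

The value is small (about 6%), which suggests that once the stats are in the model, the year a player was ranked in matters far less than differences among players within a year.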

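The variables that seem to predict MVP rank (p < .05 in the summary above) are:

  • pts
  • trb
  • ast
  • stl
  • blk
  • fg_percent
  • x3p_percent
  • ws_48

Keep in mind that a lower rank is better (1st is the MVP), so a negative estimate means that more of that stat goes with a better finish.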

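Just as a side note, if you check out the estimates for fg_percent and ws_48 and are surprised by how much they seem to affect rank, it’s just a scaling issue. Both variables are recorded as small decimals (e.g., .578 or .100), so a one-unit change in either is enormous compared with a one-unit change in, say, points per game.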
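
To make the scaling point concrete, here is a small sketch that re-expresses those two estimates per a more realistic step. The step sizes (one percentage point of field goal percentage, and .01 of ws_48) are my own choice for illustration, not something from the original analysis.

```r
# Estimates copied from the summary(m1) output above
est_fg_percent <- 17.51947    # change in rank per 1.00 change in fg_percent
est_ws_48      <- -36.92054   # change in rank per 1.00 change in ws_48

# Assumed "realistic" step sizes (illustrative only)
step_fg   <- 0.01   # one percentage point of field goal percentage
step_ws48 <- 0.01   # e.g., ws_48 going from .100 to .110

# Rescaled effects: estimate multiplied by the step size
per_step <- c(
  fg_percent = est_fg_percent * step_fg,
  ws_48      = est_ws_48 * step_ws48
)
round(per_step, 3)
```

On that scale the estimates are fractions of a rank position per step, which is in the same ballpark as the per-game stats. Rescaling the columns before fitting (for example, multiplying fg_percent by 100) would change the printed estimates in the same way without changing the model's fit.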

Ok, so let’s visualize this.

What I originally wanted to check was whether there were people who should have gotten the MVP nod but did not.

I used the predict function from the lme4 package to extract predicted values, and then I plotted those against actual ranks. I also used the ggforce package to draw the outline around the players who probably should have won and the players who probably should not have won.

In these graphs, the red line shows where the data would fall if they were perfectly predicted by the model. That is, if the model suggested you should be ranked 1st and you were actually ranked 1st, then you would fall on the red line. The blue line is the relation between the predicted rank and the actual rank. Remember that our predicted values are based on the model we built above from each player’s yearly performance statistics. So we can see that our model is definitely not perfect.
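
The plotting code itself is not shown here, so the following is only a rough sketch of how such a graph could be built, not the code behind the figures. It runs on simulated stand-in data (toy), and the axis orientation, the "snubbed" rule (best predicted rank in a year without finishing 1st), and every object name are assumptions of mine.

```r
library(lme4)     # lmer() and its predict() method
library(ggplot2)
library(ggforce)  # geom_mark_ellipse()

set.seed(2020)

# Simulated stand-in for mvp3 (NOT the real data):
# 10 seasons with 8 to 15 ranked players each
n_per_year <- sample(8:15, 10, replace = TRUE)
toy <- data.frame(
  year = rep(2001:2010, times = n_per_year),
  rank = sequence(n_per_year)
)
toy$pts <- 30 - toy$rank + rnorm(nrow(toy), sd = 3)
toy$ast <- 8 - 0.3 * toy$rank + rnorm(nrow(toy), sd = 2)

# Same structure as m1, with fewer predictors.
# lme4 may print a singular-fit message on toy data like this.
m_toy <- lmer(rank ~ pts + ast + (1 | year), data = toy)

# Predicted rank for every row used to fit the model
toy$pred <- predict(m_toy)

# Assumed rule: best predicted rank within the year, but did not finish 1st
toy$pred_order <- ave(toy$pred, toy$year, FUN = base::rank)
toy$snubbed <- toy$pred_order == 1 & toy$rank != 1

p <- ggplot(toy, aes(x = pred, y = rank)) +
  geom_point(alpha = 0.5) +
  # Red line: perfect prediction (predicted rank equals actual rank)
  geom_abline(slope = 1, intercept = 0, colour = "red") +
  # Blue line: fitted relation between predicted and actual rank
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
  # Outline around the flagged players
  geom_mark_ellipse(aes(filter = snubbed, label = "Predicted 1st, did not win")) +
  labs(x = "Predicted rank", y = "Actual rank")
```

Printing p draws the plot. With the real data, the same steps would start from m1 and mvp3 instead of the toy objects.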

Caveat

It may be dramatic to say that some players were lucky or got screwed. But at least some players were probably surprised to win, or not to win, the MVP.

Below is one more interactive graph that shows you the player, their actual rank, and their predicted rank.

Final Thoughts

It’s interesting to think about what goes into choosing an MVP. I would also imagine that some metrics are weighted more or less heavily now than they were in the 1980s.

Regardless, if you want to have a good shot at predicting who the next MVP will be, think about those metrics listed above. You might get close.
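
If you want to try that by hand, here is a minimal sketch that plugs a stat line into the fixed-effect estimates from the model summary. The stat line is made up for illustration (it is not a real player-season), and the calculation ignores the year-specific random intercept, so treat it as a population-level guess rather than what predict would return for a given season.

```r
# Fixed-effect estimates copied from the summary(m1) output above
coefs <- c(
  "(Intercept)" = 19.96873,
  g = -0.00881, mp = -0.08167, pts = -0.26989, trb = -0.26811,
  ast = -0.52444, stl = 0.68512, blk = -0.55838,
  fg_percent = 17.51947, x3p_percent = 3.09026, ft_percent = 4.37864,
  ws = -0.32042, ws_48 = -36.92054
)

# Hypothetical stat line (invented for illustration only)
player <- c(
  "(Intercept)" = 1,
  g = 75, mp = 35, pts = 28, trb = 8,
  ast = 7, stl = 1.5, blk = 0.7,
  fg_percent = 0.50, x3p_percent = 0.36, ft_percent = 0.85,
  ws = 13, ws_48 = 0.24
)

# Predicted rank = intercept + sum of (estimate * stat),
# matching the two vectors by name
pred_rank <- sum(coefs * player[names(coefs)])
pred_rank
```

For this made-up line the predicted rank comes out at roughly 4.3, so a strong all-around season lands near the top of the ballot but not automatically at 1st.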