NBA basketball has started again. It’s a chance for closure on the 2019-2020 season and an opportunity to thrill millions of fans stuck at home during the global pandemic.
I’ve always been interested in whether season stats actually predict who wins MVP. There are certainly instances where the best player did not receive the MVP nod but should have (e.g., Michael Jordan in 1996-97).
So there is most definitely some error in who ends up being named MVP.
Quick thanks to basketball-reference.com, which has all the basketball stats you could ever want.
Some of the metrics that basketball-reference lists are games played, minutes played, points, total rebounds, assists, steals, blocks, field goal percentage, three-point percentage, free throw percentage, win shares, and win shares per 48 minutes.
There is also other information about what share of MVP votes a player received, their age, team, etc.
I used MVP rank as the outcome. There are issues with this decision, namely that rank is an ordinal variable, but for these purposes I’m going to treat it as continuous. Putting that aside, let’s run a model using the stats listed above. Because there are multiple rows per year, I need to run a mixed-effects model. lme4 will let me account for the multi-level structure of the data (I’ll do so by specifying random intercepts for year), and just for the p-values, I’ll load the lmerTest package.
library(tidyverse) #load GOAT package
library(lme4)
library(lmerTest)
head(mvp3)
## # A tibble: 6 x 20
## age g mp pts trb ast stl blk fg_percent x3p_percent
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24 72 32.8 27.7 12.5 5.9 1.3 1.5 0.578 0.256
## 2 29 78 36.8 36.1 6.6 7.5 2 0.7 0.442 0.368
## 3 28 77 36.9 28 8.2 4.1 2.2 0.4 0.438 0.386
## 4 23 80 31.3 20.1 10.8 7.3 1.4 0.7 0.511 0.307
## 5 30 69 33.8 27.3 5.3 5.2 1.3 0.4 0.472 0.437
## 6 28 80 35.5 25.8 4.6 6.9 1.1 0.4 0.444 0.369
## # ... with 10 more variables: ft_percent <dbl>, ws <dbl>, ws_48 <dbl>,
## # mvp <dbl>, year <dbl>, rank <dbl>, player <chr>, pred <dbl>, screwed <chr>,
## # lucky <chr>
# Mixed-Effects Model with random intercepts of year specified
m1 <- lmer(rank ~ g + mp + pts + trb + ast + stl + blk + fg_percent + x3p_percent + ft_percent + ws + ws_48 + (1|year), data = mvp3)
Alright, excellent.
Let’s see what predicts MVP rank.
summary(m1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## rank ~ g + mp + pts + trb + ast + stl + blk + fg_percent + x3p_percent +
## ft_percent + ws + ws_48 + (1 | year)
## Data: mvp3
##
## REML criterion at convergence: 3281.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.6099 -0.6868 -0.0289 0.6548 3.2459
##
## Random effects:
## Groups Name Variance Std.Dev.
## year (Intercept) 0.683 0.8264
## Residual 10.397 3.2245
## Number of obs: 628, groups: year, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 19.96873 7.46286 609.89359 2.676 0.00766 **
## g -0.00881 0.04717 612.27973 -0.187 0.85191
## mp -0.08167 0.10858 614.34555 -0.752 0.45223
## pts -0.26989 0.03854 529.39034 -7.003 7.65e-12 ***
## trb -0.26811 0.06776 608.88047 -3.957 8.50e-05 ***
## ast -0.52444 0.07154 613.34058 -7.331 7.24e-13 ***
## stl 0.68512 0.29928 603.81689 2.289 0.02241 *
## blk -0.55838 0.21971 606.61808 -2.541 0.01129 *
## fg_percent 17.51947 4.22977 474.60834 4.142 4.08e-05 ***
## x3p_percent 3.09026 1.10596 611.91176 2.794 0.00537 **
## ft_percent 4.37864 2.23443 603.28443 1.960 0.05050 .
## ws -0.32042 0.31566 609.19869 -1.015 0.31048
## ws_48 -36.92054 18.54454 612.07915 -1.991 0.04693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variables that seem to predict MVP rank are points, rebounds, assists, steals, blocks, field goal percentage, three-point percentage, and win shares per 48 minutes, with free throw percentage sitting right at the significance cutoff.
Just as a side note, if you look at the estimates for fg_percent and ws_48 and are surprised by how strongly they seem to affect rank, it’s just a scaling issue: both are on a roughly 0-1 scale, so a one-unit change in them is enormous compared to, say, a one-point change in scoring.
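To put the coefficients on a comparable footing, one option is to standardize the predictors and refit. Here is a minimal sketch, assuming the same mvp3 data and model formula as above; the standardization step itself is my addition, not part of the original analysis.
# Standardize the performance stats so each coefficient is interpreted as
# the change in MVP rank per one standard deviation of that stat
mvp3_z <- mvp3 %>%
  mutate(across(c(g, mp, pts, trb, ast, stl, blk,
                  fg_percent, x3p_percent, ft_percent, ws, ws_48),
                ~ as.numeric(scale(.x))))

m1_z <- lmer(rank ~ g + mp + pts + trb + ast + stl + blk + fg_percent +
               x3p_percent + ft_percent + ws + ws_48 + (1 | year),
             data = mvp3_z)

summary(m1_z)  # fg_percent and ws_48 no longer dwarf the other estimates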
Ok, so let’s visualize this.
What I wanted to originally check is if there were people who should have gotten the mvp nod, but who did not.
I used the predict function from the lme4 package to extract predicted values and then I plotted those against actual ranks. I also used the ggforce package to produce the outline around those players who probably should have won and those players who probably should not have won.
In these graphs, the red line shows where points would fall if the model predicted rank perfectly. That is, if the model said you should be ranked 1st and you actually were ranked 1st, you would fall on the red line. The blue line is the observed relation between predicted rank and actual rank. Remember that the predicted values come from the model we built above using each player’s yearly performance statistics, so we can see that our model is definitely not perfect.
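Here is a minimal sketch of how that predicted-vs-actual plot could be built. The pred, screwed, and lucky columns appear in the tibble above, but their exact coding (and the plot styling) is an assumption on my part, not the original code.
library(ggforce)

# Predicted MVP rank from the mixed-effects model
mvp3$pred <- predict(m1)

ggplot(mvp3, aes(x = pred, y = rank)) +
  geom_point(alpha = 0.6) +
  # red line: where points would fall under perfect prediction
  geom_abline(intercept = 0, slope = 1, color = "red") +
  # blue line: the observed relation between predicted and actual rank
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  # outline the players flagged as "screwed" or "lucky"
  # NOTE: the "yes" coding is assumed; adjust to match the actual values in mvp3
  geom_mark_ellipse(aes(filter = screwed == "yes" | lucky == "yes",
                        label = player)) +
  labs(x = "Predicted MVP rank", y = "Actual MVP rank")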
It may be dramatic to say that some players were lucky or got screwed. But, probably at least some players were surprised to win or not win the MVP.
Below is one more interactive graph that shows you the player, their actual rank, and their predicted rank.
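The original post doesn’t say which package produced the interactive version, but plotly’s ggplotly() is one common way to get hover tooltips showing the player, actual rank, and predicted rank. A rough sketch under that assumption:
library(plotly)

p <- ggplot(mvp3, aes(x = pred, y = rank, text = player)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Predicted MVP rank", y = "Actual MVP rank")

# Hovering over a point shows the player along with predicted and actual rank
ggplotly(p, tooltip = c("text", "x", "y"))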
It’s interesting to think about what goes into choosing an MVP. I would also imagine that some metrics are weighted more or less heavily now than they were in the 1980s.
Regardless, if you want a good shot at predicting who the next MVP will be, pay attention to the metrics listed above. You might get close.