library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Today, I’ll be working with NBA player data. This data contains many different stats accumulated by NBA players over the course of a season (usually calculated on a per-game basis). This particular data set covers the most recent complete regular season of NBA basketball, the 2022-2023 season.
df <- read.csv('nba_22-23_regseason.csv', sep=';')
I’ve decided to build a model around whether average minutes played predicts 3-point shooting percentage. I felt that attempting to correlate many other variables would essentially mean ignoring a true causal factor. For example, a model built around points scored and another stat like rebounds may show a relationship that’s really due to long playing time or simply being a very good player.
3-point percentage, by contrast, is not necessarily likely to go up or down based mostly on volume. It’s a question of whether the players who play a lot (perhaps the best proxy for the best players overall) tend to be the most accurate 3-point shooters.
Here, I’ll filter down to only players who play more than 5 minutes a game and average at least one made 3-pointer a game.
df5 <- df %>%
  # Refer to columns by bare name inside filter(); df$MP bypasses the piped data
  filter(MP > 5, X3P >= 1)
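The `X3P` and `X3P.` column names are not how the stats are labeled in the raw file: by default `read.csv()` passes headers through `make.names()`. Here is a minimal sketch of that renaming, assuming the raw headers are `3P`, `3PA` and `3P%` (an assumption, since the CSV header row isn't shown here):

```r
# Assumed raw CSV headers (hypothetical; the file's header row is not shown)
raw_names <- c("MP", "3P", "3PA", "3P%")

# read.csv() applies make.names() when check.names = TRUE (the default):
# names starting with a digit get an "X" prefix, and invalid characters
# such as "%" are replaced with "."
clean_names <- make.names(raw_names)
print(clean_names)
```

If the headers do look like this, `X3P` would be 3-pointers made and `X3PA` 3-pointers attempted, so a quick `names(df)` is worth running to confirm which column the filter should use.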
Plotting my variables against each other:
ggplot(df5, aes(x = MP, y = X3P.)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)
And building a model:
model <- lm(X3P. ~ MP, data = df5)
Let’s check the model summary:
summary(model)
##
## Call:
## lm(formula = X3P. ~ MP, data = df5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11736 -0.03041 -0.00307 0.02303 0.61885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3920111 0.0116626 33.613 <2e-16 ***
## MP -0.0007239 0.0004371 -1.656 0.0987 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0564 on 299 degrees of freedom
## Multiple R-squared: 0.009093, Adjusted R-squared: 0.005779
## F-statistic: 2.744 on 1 and 299 DF, p-value: 0.09868
# Extracting residuals
residuals <- resid(model)
Interesting! There does not appear to be a meaningful linear relationship between minutes played and 3-point percentage. The slope is slightly negative, its p-value (0.0987) misses the usual 0.05 threshold, and the R-squared of about 0.009 means minutes played explains less than 1% of the variation in 3-point percentage. Practically speaking, what this seems to suggest is that while some players fit a high-minute/high-3P% profile (think Steph Curry) and others fit a low-minute/low-3P% profile (think a young player working on their long game), there are also many players who fit outside that mold. For example, some players who play a great deal of minutes and are considered some of the best players in the game are not necessarily the best 3-point shooters (like Giannis Antetokounmpo). Meanwhile, some of the highest-% 3-point shooters could be considered “role players” who are very good at that one aspect, but whose game is not otherwise “complete” enough to make them high-minute players.
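To put a size on the effect, the numbers in the summary output can be reworked by hand. This is only a sketch: the three inputs are copied from the output above, and the 10- and 35-minute players are illustrative values I picked, not values from the data.

```r
# Values copied from the summary(model) output above
slope    <- -0.0007239   # MP estimate
se_slope <-  0.0004371   # MP standard error
df_resid <-  299         # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_stat  <- slope / se_slope
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = slope - t_crit * se_slope,
        upper = slope + t_crit * se_slope)

# Predicted change in 3P% between a 10-minute and a 35-minute player
# (10 and 35 are illustrative, not taken from the data)
effect_25_min <- slope * (35 - 10)

print(round(c(t = t_stat, p = p_value), 4))
print(signif(ci, 3))
print(round(effect_25_min, 4))
```

The interval spans zero, and even taken at face value the slope implies a drop of a little under two percentage points of 3P% across a 25-minute gap in playing time. With the fitted model in hand, `confint(model)` should give the same interval directly.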
Now let’s do a residuals analysis to test our model and its assumptions further.
# Basic summary statistics of residuals
summary(residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.117360 -0.030410 -0.003071 0.000000 0.023034 0.618848
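The mean of exactly 0 is expected for any least-squares model with an intercept, so it doesn’t tell us much by itself. The median being close to 0 suggests the bulk of the residuals is fairly symmetric around the fitted line, but the maximum of 0.619 is far larger than the minimum of -0.117, which points to at least one extreme outlier on the high side. Overall, the model is simply a very weak predictor.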
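One thing worth keeping in mind when reading this summary: for a least-squares fit with an intercept, the residual mean is zero by construction, however poor the fit. Here is a small sketch on simulated data (not the NBA data; the variable names and distribution parameters are made up for illustration):

```r
set.seed(42)  # simulated data, not the NBA data set

# y has no real dependence on x, so the fit is deliberately poor
x <- runif(300, min = 5, max = 40)
y <- rnorm(300, mean = 0.36, sd = 0.05)
fit <- lm(y ~ x)

res <- resid(fit)
mean_res  <- mean(res)                # zero up to floating-point error
r_squared <- summary(fit)$r.squared   # small, since x carries no signal

print(mean_res)
print(r_squared)
```

So a zero mean says nothing about how well the line fits; the spread and shape of the residuals are what carry that information.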
# Histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")
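This is a narrow but relatively normal distribution for the bulk of the residuals, so the normality assumption looks approximately met, with the caveat that the largest residual (about 0.62) sits far out in the right tail.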
# Q-Q plot of residuals
qqnorm(residuals)
qqline(residuals)
Apart from the points that tail off at the upper end, the residuals follow the line closely, which supports treating them as approximately normal.
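A formal test can back up the visual read. The sketch below uses simulated residuals (not the model's actual residuals) to show how a single extreme value, like the 0.62 maximum seen earlier, affects a Shapiro-Wilk test. The spread of 0.056 mirrors the residual standard error reported above; everything else is made up for illustration.

```r
set.seed(1)  # simulated residuals, not the model's actual residuals

# 300 well-behaved residuals with a spread similar to the model's
# residual standard error, plus one extreme value like the 0.62 maximum
clean_res    <- rnorm(300, mean = 0, sd = 0.056)
with_outlier <- c(clean_res, 0.62)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_clean   <- shapiro.test(clean_res)
sw_outlier <- shapiro.test(with_outlier)

print(sw_clean$p.value)
print(sw_outlier$p.value)
```

On the real data, the equivalent call would be `shapiro.test(residuals)`, and `which.max(residuals)` gives the row of the player behind the largest residual, which is worth a look before trusting the fit.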