library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Today, I’ll be working with NBA player data. This data contains many different stats accumulated by NBA players over the course of a season (usually calculated on a per-game basis). This particular data set covers the most recent complete regular season of NBA basketball, the 2022-2023 season.

df <- read.csv('nba_22-23_regseason.csv', sep=';')

I’ve decided to build a model around whether average minutes played predicts 3-point shooting percentage. I felt that pairing up many of the other variables would essentially mean ignoring a true causal factor. For example, a model relating points scored to another stat like rebounds may show a relationship that’s really due to long playing time or simply being a very good player.

3-point percentage, by contrast, is a rate, so it does not automatically go up or down with volume. The question is whether the players who play a lot (perhaps the best proxy for the best players overall) tend to be the most accurate 3-point shooters.

Here, I’ll filter down to only players who play more than 5 minutes a game and average at least one made 3-pointer a game (the X3P column).

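# Keep players with more than 5 minutes and at least one 3-pointer per game.
# Inside filter(), refer to columns by bare name rather than df$MP.
df5 <- df %>%
  filter(MP > 5 & X3P >= 1)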

Plotting my variables against each other:

ggplot(df5, aes(x = MP, y = X3P.)) +
  geom_point() +
  geom_smooth(method='lm', formula= y~x)

And building a model:

model <- lm(X3P. ~ MP, data = df5)
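
For a single predictor, the coefficients that lm() reports have a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept is chosen so the line passes through the point of means. Here is a minimal sketch of that on made-up data. The `toy` data frame and its values are synthetic stand-ins that only borrow the column names; they are not the NBA data.

```r
# Synthetic stand-in data (NOT the NBA data set); column names mirror df5
set.seed(1)
toy <- data.frame(MP = runif(50, min = 5, max = 38))
toy$X3P. <- 0.39 - 0.0007 * toy$MP + rnorm(50, sd = 0.05)

# Closed-form least-squares estimates for one predictor
slope <- cov(toy$MP, toy$X3P.) / var(toy$MP)
intercept <- mean(toy$X3P.) - slope * mean(toy$MP)

# lm() should agree with the closed form up to floating-point error
fit <- lm(X3P. ~ MP, data = toy)
all.equal(unname(coef(fit)), c(intercept, slope))
```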

Let’s check the model summary:

summary(model)
## 
## Call:
## lm(formula = X3P. ~ MP, data = df5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11736 -0.03041 -0.00307  0.02303  0.61885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3920111  0.0116626  33.613   <2e-16 ***
## MP          -0.0007239  0.0004371  -1.656   0.0987 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0564 on 299 degrees of freedom
## Multiple R-squared:  0.009093,   Adjusted R-squared:  0.005779 
## F-statistic: 2.744 on 1 and 299 DF,  p-value: 0.09868
# Extracting residuals
residuals <- resid(model)

Interesting! There does not appear to be a meaningful linear relationship between minutes played and 3-point percentage. The estimated slope is slightly negative (about -0.0007 per additional minute), it is not significant at the 0.05 level (p ≈ 0.099), and the R-squared of about 0.009 means minutes played explains less than 1% of the variation in 3-point percentage. Practically speaking, what this seems to suggest is that while some players fit a high-minute/high-3P% profile (think Steph Curry) and others fit a low-minute/low-3P% profile (think a young player working on their long game), there are also many players who fall outside that mold. For example, some players who play a great many minutes and are considered some of the best players in the game are not necessarily the best 3-point shooters (like Giannis Antetokounmpo). Meanwhile, some of the highest-% 3-point shooters could be considered “role players” who are very good at that one aspect, but whose game is not otherwise “complete” enough to make them high-minute players.
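
The numbers behind a conclusion like this can be pulled out of the fitted model directly rather than read off the printed summary. The sketch below does that on synthetic data with no built-in relationship between the two columns; `toy`, `fit`, and the simulated values are assumptions for illustration only, not the NBA data.

```r
# Synthetic stand-in data (NOT the NBA data set)
set.seed(2)
toy <- data.frame(MP = runif(100, min = 5, max = 38))
toy$X3P. <- 0.39 + rnorm(100, sd = 0.05)  # no true relationship with MP
fit <- lm(X3P. ~ MP, data = toy)

s <- summary(fit)
slope_p <- coef(s)["MP", "Pr(>|t|)"]    # p-value for the slope
r2 <- s$r.squared                        # share of variance explained
ci <- confint(fit, "MP", level = 0.95)   # 95% confidence interval for the slope

# The slope is not significant at the 5% level exactly when
# its 95% confidence interval contains 0
(slope_p > 0.05) == (ci[1] < 0 & ci[2] > 0)
```

Running the same three extractions on `model` would give the slope p-value, the R-squared, and an interval for the MP coefficient in the real fit.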

Now let’s do a residuals analysis to test our model and its assumptions further.

# Basic summary statistics of residuals
summary(residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.117360 -0.030410 -0.003071  0.000000  0.023034  0.618848

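The mean of the residuals is 0, but that is true by construction for any least-squares fit with an intercept, so it says nothing about how well the model fits. The median is also close to 0 (about -0.003), which suggests the residuals are roughly centered. The maximum of 0.619, however, is far larger than the third quartile (0.023), which points to at least one extreme outlier. Taken together with the R-squared above, the model is simply a very weak predictor.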
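
The zero mean in a residual summary is a property of least squares with an intercept rather than of the data. A minimal sketch, using made-up noise where the response has nothing to do with the predictor (`x`, `y`, and `fit` are illustrative names, not part of the analysis above):

```r
# Synthetic data where y is pure noise, unrelated to x by design
set.seed(3)
x <- runif(200, min = 5, max = 38)
y <- rnorm(200, mean = 0.38, sd = 0.05)
fit <- lm(y ~ x)

mean(resid(fit))        # zero up to floating-point error, even for a useless model
summary(fit)$r.squared  # fit quality has to be read from here instead
```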

# Histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

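The bulk of the residuals form a narrow, roughly normal distribution, so the normality assumption looks approximately satisfied. The one caveat is the right tail: the largest residual (0.619) sits far from the rest.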

# Q-Q plot of residuals
qqnorm(residuals)
qqline(residuals)

Despite the points tailing off at the end, the relatively straight line here supports the approximate normality of the residuals.
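
A visual check like this can be complemented with a formal test and a look at standardized residuals. The sketch below is one possible way to do that, shown on synthetic data with a single planted extreme value; `toy`, `fit`, the planted point, and the cutoff of 3 are all assumptions for illustration, not part of the analysis above.

```r
# Synthetic stand-in data with one planted extreme value (NOT the NBA data)
set.seed(4)
toy <- data.frame(MP = runif(100, min = 5, max = 38))
toy$X3P. <- 0.39 + rnorm(100, sd = 0.05)
toy$X3P.[1] <- 1.0  # planted outlier
fit <- lm(X3P. ~ MP, data = toy)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))
sw$p.value

# Standardized residuals beyond about 3 in absolute value deserve a closer look
flagged <- which(abs(rstandard(fit)) > 3)
toy[flagged, ]
```

The same two calls on `model` would show whether the tail in the Q-Q plot comes from a handful of identifiable players.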