shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")

head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

Introduction

For this week’s data dive, I built a logistic regression (a generalized linear model with a logit link) to explore which factors are associated with whether a shot is made (FGM = 1) in the NBA Shot Logs dataset. This builds on our prior logistic modeling and adds diagnostics and deeper interpretation.

Building the Logistic Regression Model

Response Variable: FGM (Field Goal Made), binary (1 = made, 0 = missed)

Explanatory Variables:

SHOT_DIST: Distance of the shot in feet

CLOSE_DEF_DIST: Distance to the nearest defender

SHOT_CLOCK: Time left on the shot clock

DRIBBLES: Number of dribbles before the shot

#Load dplyr for mutate(), then convert FGM to a binary factor (if not already)#
library(dplyr)
shot_logs <- shot_logs |> mutate(FGM = as.factor(FGM))

#Fit logistic regression model#
model <- glm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + DRIBBLES,
             data = shot_logs, family = binomial(link = "logit"))

summary(model)
## 
## Call:
## glm(formula = FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + 
##     DRIBBLES, family = binomial(link = "logit"), data = shot_logs)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.0129298  0.0192101   0.673    0.501    
## SHOT_DIST      -0.0600867  0.0008597 -69.894   <2e-16 ***
## CLOSE_DEF_DIST  0.1044951  0.0028132  37.144   <2e-16 ***
## SHOT_CLOCK      0.0178157  0.0010562  16.868   <2e-16 ***
## DRIBBLES       -0.0198401  0.0017884 -11.094   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 168479  on 122199  degrees of freedom
## Residual deviance: 161997  on 122195  degrees of freedom
##   (5553 observations deleted due to missingness)
## AIC: 162007
## 
## Number of Fisher Scoring iterations: 4

Interpretation of Coefficients

#View odds ratios for interpretation#
exp(coef(model))
##    (Intercept)      SHOT_DIST CLOSE_DEF_DIST     SHOT_CLOCK       DRIBBLES 
##      1.0130138      0.9416829      1.1101499      1.0179753      0.9803554

Model Summary & Coefficient Interpretation: This logistic regression model predicts the probability that a shot is made (FGM = 1) based on four explanatory variables:

SHOT_DIST (shot distance)

CLOSE_DEF_DIST (defender distance)

SHOT_CLOCK (seconds left on the shot clock)

DRIBBLES (number of dribbles before the shot)

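All four predictors are statistically significant (p < 0.001). The intercept is not (p = 0.501). Note that 5,553 shots with missing values (SHOT_CLOCK is NA for some rows) were dropped, so the model is fit on 122,200 shots.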
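Significance of individual coefficients does not say how much the model improves on a no-predictor baseline. As a rough sketch, the deviances printed in the summary output can be turned into a likelihood-ratio test and a McFadden pseudo-R². The numbers below are copied by hand from that output; the variable names are my own.

```r
# Deviances and degrees of freedom copied from the summary(model) output.
null_dev  <- 168479
resid_dev <- 161997
null_df   <- 122199
resid_df  <- 122195

# Likelihood-ratio statistic: drop in deviance from adding the four predictors.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden pseudo-R^2; for ungrouped 0/1 data, deviance equals -2 * log-likelihood.
mcfadden_r2 <- 1 - resid_dev / null_dev

cat("LR statistic:", lr_stat, "on", lr_df, "df\n")
cat("p-value:", format.pval(p_value), "\n")
cat("McFadden pseudo-R^2:", round(mcfadden_r2, 4), "\n")
```

A pseudo-R² this small alongside a very large test statistic is typical with over 100,000 observations: the predictors clearly matter, but most shot-to-shot variation is left unexplained.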

Odds Ratio Interpretations (Exponentiated coefficients)

Each odds ratio below describes a one-unit change in that predictor, holding the other three constant.

SHOT_DIST = 0.942 - Each additional foot from the basket is associated with ~5.8% lower odds of making the shot.

CLOSE_DEF_DIST = 1.110 - Each extra foot of space from the nearest defender is associated with ~11% higher odds of making the shot.

SHOT_CLOCK = 1.018 - Each additional second left on the shot clock is associated with ~1.8% higher odds of making the shot, which may reflect better-quality shots when the offense is less rushed.

DRIBBLES = 0.980 - Each extra dribble is associated with ~2% lower odds of making the shot, potentially indicating tougher or more contested shots.
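The percentages above come from (odds ratio - 1) x 100. The sketch below reproduces them from the reported estimates and adds approximate 95% Wald intervals, assuming the standard errors in the summary output. The values are typed in by hand, and names such as or_table are illustrative.

```r
# Estimates and standard errors copied from the summary(model) output.
est <- c(SHOT_DIST = -0.0600867, CLOSE_DEF_DIST = 0.1044951,
         SHOT_CLOCK = 0.0178157, DRIBBLES = -0.0198401)
se  <- c(SHOT_DIST = 0.0008597, CLOSE_DEF_DIST = 0.0028132,
         SHOT_CLOCK = 0.0010562, DRIBBLES = 0.0017884)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Odds ratios, Wald interval bounds, and percent change in odds per unit.
or_table <- data.frame(
  odds_ratio = exp(est),
  ci_lower   = exp(est - z * se),
  ci_upper   = exp(est + z * se),
  pct_change = (exp(est) - 1) * 100
)
print(round(or_table, 3))

# Effects multiply on the odds scale, so a 10-foot increase in shot
# distance corresponds to exp(10 * beta), not 10 times the one-foot change.
or_10ft <- exp(10 * est[["SHOT_DIST"]])
cat("Odds ratio for +10 ft:", round(or_10ft, 3), "\n")
```

With the fitted model object available, exp(confint.default(model)) should give the same Wald intervals without retyping anything.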

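Insight Summary: Shot distance, defender distance, time left on the shot clock, and dribble count are all associated with shot success. The model matches basketball intuition: longer, more rushed, and more tightly contested shots are less likely to go in, while more space and more time go with better shot outcomes. Shot distance and defender distance have the largest per-unit effects, and the per-unit effects of shot clock and dribbles are small.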
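Odds ratios can be hard to picture, so it helps to convert the linear predictor into a probability for a couple of concrete shots. The sketch below uses the reported coefficients and two made-up shot profiles. The helper predict_make_prob and the shot values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output.
coefs <- c("(Intercept)" = 0.0129298, SHOT_DIST = -0.0600867,
           CLOSE_DEF_DIST = 0.1044951, SHOT_CLOCK = 0.0178157,
           DRIBBLES = -0.0198401)

# Hypothetical helper: inverse-logit of the linear predictor.
predict_make_prob <- function(shot_dist, def_dist, shot_clock, dribbles) {
  eta <- coefs[["(Intercept)"]] +
    coefs[["SHOT_DIST"]]      * shot_dist +
    coefs[["CLOSE_DEF_DIST"]] * def_dist +
    coefs[["SHOT_CLOCK"]]     * shot_clock +
    coefs[["DRIBBLES"]]       * dribbles
  plogis(eta)  # converts log-odds to a probability
}

# Two illustrative (assumed) shots: an open shot near the rim early in the
# clock, and a late-clock, tightly guarded three after several dribbles.
open_close_shot <- predict_make_prob(shot_dist = 3, def_dist = 5,
                                     shot_clock = 15, dribbles = 0)
contested_three <- predict_make_prob(shot_dist = 25, def_dist = 2,
                                     shot_clock = 4, dribbles = 6)

cat("Open close-range shot:", round(open_close_shot, 3), "\n")
cat("Contested late three: ", round(contested_three, 3), "\n")
```

With the fitted object in memory, predict(model, newdata = ..., type = "response") would do the same calculation directly from a data frame of shot profiles.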

Diagnosing the Model

par(mfrow = c(2, 2))
plot(model)

library(car)  # provides vif()
vif(model)
##      SHOT_DIST CLOSE_DEF_DIST     SHOT_CLOCK       DRIBBLES 
##       1.601752       1.588324       1.044782       1.034664

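Insights from VIF Values: All four VIF values are below 2, which is under both the common rule-of-thumb cutoff of 5 and the more conservative cutoff of 2. The largest values belong to SHOT_DIST (1.60) and CLOSE_DEF_DIST (1.59), which makes sense because defenders tend to give more space on longer shots.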

This means the explanatory variables:

Are not highly correlated with one another

Do not show problematic multicollinearity

Each contribute information that the other predictors do not already carry

What This Means for the Model:

Coefficient estimates and their standard errors are not noticeably inflated by collinearity.

Interpretations (like shot distance decreasing odds, and defender space increasing them) are not distorted by overlapping predictor effects.

No need to remove or combine variables due to collinearity concerns.
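To make the VIF values more concrete, the sketch below converts them into two easier-to-read quantities: the factor by which each standard error is inflated (the square root of VIF) and the share of each predictor's variance explained by the others (1 - 1/VIF). The second reading comes from the linear-model definition VIF = 1 / (1 - R²). For a GLM, vif() works from the estimated coefficient covariance matrix, so treat it here as an approximation. The values are copied from the vif(model) output.

```r
# VIF values copied from the vif(model) output.
vif_values <- c(SHOT_DIST = 1.601752, CLOSE_DEF_DIST = 1.588324,
                SHOT_CLOCK = 1.044782, DRIBBLES = 1.034664)

# Factor by which collinearity inflates each standard error.
se_inflation <- sqrt(vif_values)

# Approximate share of each predictor explained by the other predictors,
# using the linear-model identity VIF = 1 / (1 - R^2).
implied_r2 <- 1 - 1 / vif_values

print(round(data.frame(vif = vif_values,
                       se_inflation = se_inflation,
                       implied_r2 = implied_r2), 3))
```

Even for SHOT_DIST, the standard error is only about 1.27 times what it would be with uncorrelated predictors, which supports leaving all four variables in the model.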