shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")

head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

Introduction

For this data dive, we are building a logistic regression model to predict a binary outcome from the NBA Shot Logs dataset. We’ll interpret the model and calculate a confidence interval for one of the coefficients.

Choosing the Binary Outcome Variable

We’ll use the FGM (Field Goal Made) column:

It is binary: 1 = shot made, 0 = shot missed.

It’s an important performance metric in basketball.

library(dplyr)  # provides mutate()

shot_logs <- shot_logs |> mutate(FGM = as.factor(FGM))

Choosing Explanatory Variables

We’ll use the following:

SHOT_DIST: Shot distance in feet

CLOSE_DEF_DIST: Defender’s distance in feet

SHOT_CLOCK: Time left on the shot clock in seconds

DRIBBLES: Number of dribbles before the shot

These variables may reasonably influence whether a shot is made.
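One thing worth checking before fitting: glm() drops any row with a missing value in a model variable under R's default na.action, and the preview already shows an NA in SHOT_CLOCK. A minimal sketch of that check, using only the six rows printed by head(shot_logs) and retyped by hand (the object name preview is made up for this example):

```r
# The six preview rows printed by head(shot_logs), model columns only
preview <- data.frame(
  SHOT_DIST      = c(7.7, 28.2, 10.1, 17.2, 3.7, 18.4),
  CLOSE_DEF_DIST = c(1.3, 6.1, 0.9, 3.4, 1.1, 2.6),
  SHOT_CLOCK     = c(10.8, 3.4, NA, 10.3, 10.9, 9.1),
  DRIBBLES       = c(2, 0, 3, 2, 2, 2),
  FGM            = c(1, 0, 0, 0, 0, 0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(preview))
print(na_counts)

# Number of rows with no missing values, i.e. rows a model could use
n_complete <- sum(complete.cases(preview))
print(n_complete)
```

Running the same two calls on the relevant columns of shot_logs would show how many shots the model can actually use.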

Build the Logistic Regression Model

logit_model <- glm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + DRIBBLES, 
                   data = shot_logs, 
                   family = binomial(link = "logit"))

summary(logit_model)
## 
## Call:
## glm(formula = FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + 
##     DRIBBLES, family = binomial(link = "logit"), data = shot_logs)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.0129298  0.0192101   0.673    0.501    
## SHOT_DIST      -0.0600867  0.0008597 -69.894   <2e-16 ***
## CLOSE_DEF_DIST  0.1044951  0.0028132  37.144   <2e-16 ***
## SHOT_CLOCK      0.0178157  0.0010562  16.868   <2e-16 ***
## DRIBBLES       -0.0198401  0.0017884 -11.094   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 168479  on 122199  degrees of freedom
## Residual deviance: 161997  on 122195  degrees of freedom
##   (5553 observations deleted due to missingness)
## AIC: 162007
## 
## Number of Fisher Scoring iterations: 4
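The two deviances in this output give a rough sense of overall fit. As a sketch, the numbers below are copied by hand from the summary; for a 0/1 outcome, 1 minus the ratio of residual to null deviance is McFadden's pseudo-R-squared, and the difference in deviances is a likelihood-ratio statistic for the four predictors jointly. The variable names are chosen for this example.

```r
# Deviances and degrees of freedom copied from the summary() output
null_dev  <- 168479
resid_dev <- 161997
null_df   <- 122199
resid_df  <- 122195

# McFadden's pseudo-R^2 (valid here because the outcome is ungrouped 0/1)
mcfadden_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test: four predictors vs. the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(mcfadden_r2, 4))
print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p))
```

The predictors are jointly highly significant, but the pseudo-R-squared is under 4%, so most shot-to-shot variation is left unexplained.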

Interpret the Coefficients

Coefficient Interpretations (in log-odds):

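(Intercept) = 0.0129 (not statistically significant, p = 0.501): when all four predictors are 0, the log-odds of making the shot are ~0.013.

That baseline (a shot from 0 feet with the defender 0 feet away, 0 seconds on the shot clock, and 0 dribbles) is not a realistic shot, and the estimate is not distinguishable from 0, so we don’t interpret the intercept on its own.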
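To see what a log-odds value of that size means as a probability, the inverse logit can be applied directly. A small sketch with the intercept copied from the summary table (p_baseline is a name chosen here):

```r
intercept <- 0.0129298            # (Intercept) estimate from the summary() table

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_baseline <- plogis(intercept)
print(round(p_baseline, 3))
```

This comes out at roughly 0.50, i.e. close to a coin flip at the all-zeros baseline.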

SHOT_DIST = -0.0601 (significant, p < 2e-16): for every additional foot away from the basket, the log-odds of making the shot decrease by ~0.0601, holding the other predictors constant.

In terms of odds, this means:

exp(-0.0601)
## [1] 0.9416704

Each additional foot decreases the odds of making the shot by ~5.8%.
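Because the effect is multiplicative on the odds, several extra feet compound rather than add: k feet multiply the odds by exp(k * b), not by 1 - k * 0.058. A short sketch of that arithmetic, with the coefficient copied from the summary table and the distances picked purely for illustration:

```r
b_shot_dist <- -0.0600867   # SHOT_DIST estimate from the summary() table
feet <- c(1, 5, 10, 20)     # illustrative distances, chosen for this sketch

# Multiplier on the odds after moving `feet` further from the basket
odds_multiplier <- exp(b_shot_dist * feet)
pct_change <- (odds_multiplier - 1) * 100

effect_table <- data.frame(
  extra_feet         = feet,
  odds_multiplier    = round(odds_multiplier, 3),
  pct_change_in_odds = round(pct_change, 1)
)
print(effect_table)
```

Over 10 feet the odds fall by about 45%, noticeably less than the 58% a naive 10 x 5.8% calculation would suggest.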

CLOSE_DEF_DIST = +0.1045 (significant, p < 2e-16): for each extra foot of space from the defender, the log-odds of making the shot increase by ~0.1045, holding the other predictors constant.

In odds:

exp(0.1045)
## [1] 1.110155

Each foot of defender space increases the odds of making the shot by ~11%.

SHOT_CLOCK = +0.0178 (significant, p < 2e-16): for each additional second left on the shot clock, the log-odds of making the shot increase by ~0.0178, holding the other predictors constant.

In odds:

exp(0.0178)
## [1] 1.017959

Each additional second on the shot clock increases the odds of making the shot by ~1.8%. One plausible reading: shots taken with more time left tend to be better looks, while late-clock shots are more often forced.

DRIBBLES = -0.0198 (significant, p < 2e-16): for each extra dribble before the shot, the log-odds of making it decrease by ~0.0198, holding the other predictors constant.

In odds:

exp(-0.0198)
## [1] 0.9803947

Each dribble reduces shot success odds by ~2%, possibly due to tougher, more contested shots.

Compute Odds Ratios

exp(coef(logit_model))
##    (Intercept)      SHOT_DIST CLOSE_DEF_DIST     SHOT_CLOCK       DRIBBLES 
##      1.0130138      0.9416829      1.1101499      1.0179753      0.9803554

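Each odds ratio tells us the factor by which the odds of making a shot are multiplied for a one-unit increase in that predictor, holding the other predictors constant. The exponentiated intercept is the baseline odds when all predictors are 0, not an odds ratio.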

What This Tells Us About Shot Success:

Longer shots = lower odds (makes sense: shots are harder to make from farther away).

More space = higher odds (less defensive pressure).

More time on the shot clock = higher odds, plausibly because the shot is less rushed.

More dribbles = lower odds, possibly due to tougher, rushed, or more contested shots.
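Odds ratios can be hard to picture, so it may help to turn the fitted equation into predicted probabilities for a couple of shots. The sketch below rebuilds the linear predictor from the estimates in the summary table; the helper predict_make_prob() and the two example shots are hypothetical, invented for this illustration.

```r
# Estimates copied from the summary() table
b <- c(intercept      =  0.0129298,
       SHOT_DIST      = -0.0600867,
       CLOSE_DEF_DIST =  0.1044951,
       SHOT_CLOCK     =  0.0178157,
       DRIBBLES       = -0.0198401)

# Hypothetical helper: predicted probability of a made shot
predict_make_prob <- function(shot_dist, def_dist, shot_clock, dribbles) {
  eta <- b[["intercept"]] +
    b[["SHOT_DIST"]]      * shot_dist +
    b[["CLOSE_DEF_DIST"]] * def_dist +
    b[["SHOT_CLOCK"]]     * shot_clock +
    b[["DRIBBLES"]]       * dribbles
  plogis(eta)  # inverse logit turns log-odds into a probability
}

# Two made-up shots that differ only in distance (5 ft vs. 24 ft)
p_close <- predict_make_prob(shot_dist = 5,  def_dist = 4, shot_clock = 12, dribbles = 2)
p_far   <- predict_make_prob(shot_dist = 24, def_dist = 4, shot_clock = 12, dribbles = 2)
print(round(c(close = p_close, far = p_far), 2))
```

With the fitted model object available, predict(logit_model, newdata = ..., type = "response") should give the same probabilities without retyping the coefficients.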

Build Confidence Interval for One Coefficient

coef_shot_dist <- coef(summary(logit_model))["SHOT_DIST", "Estimate"]
se_shot_dist <- coef(summary(logit_model))["SHOT_DIST", "Std. Error"]

ci_lower <- coef_shot_dist - 1.96 * se_shot_dist
ci_upper <- coef_shot_dist + 1.96 * se_shot_dist

exp(c(ci_lower, ci_upper))
## [1] 0.9400975 0.9432709

This means:

We are 95% confident that each additional foot of shot distance reduces the odds of making the shot by between ~5.7% and ~6.0%.

Here’s how we break it down:

The odds ratio for SHOT_DIST is ~0.942.

The 95% CI lies entirely below 1, so the negative effect is statistically significant at the 5% level.

Specifically:

Lower bound = 0.9401, i.e. at most a ~6.0% decrease in odds per foot.

Upper bound = 0.9433, i.e. at least a ~5.7% decrease in odds per foot.

Every extra foot a player moves away from the basket reduces the odds of making the shot by about 5.7–6.0%, with 95% confidence, all else being equal.
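The same Wald (normal-approximation) interval can be computed for every predictor at once. This sketch copies the estimates and standard errors by hand from the summary table and uses qnorm(0.975) in place of the rounded 1.96; the object names are chosen for this example.

```r
# Estimates and standard errors copied from the summary() table
est <- c(SHOT_DIST = -0.0600867, CLOSE_DEF_DIST = 0.1044951,
         SHOT_CLOCK = 0.0178157, DRIBBLES = -0.0198401)
se  <- c(SHOT_DIST = 0.0008597, CLOSE_DEF_DIST = 0.0028132,
         SHOT_CLOCK = 0.0010562, DRIBBLES = 0.0017884)

z_crit <- qnorm(0.975)  # about 1.96 for a two-sided 95% interval

# Build the interval on the log-odds scale, then exponentiate
or_ci <- cbind(odds_ratio = exp(est),
               lower      = exp(est - z_crit * se),
               upper      = exp(est + z_crit * se))
print(round(or_ci, 4))
```

None of the four intervals contains 1. With the model object in hand, exp(confint.default(logit_model)) should reproduce these Wald intervals; confint(logit_model) uses profile likelihood instead and can differ slightly.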

Final Summary: Logistic Regression Data Dive

In this data dive, we modeled the probability of a shot being made (FGM = 1) using a logistic regression model with four explanatory variables:

SHOT_DIST (distance of the shot in feet)

CLOSE_DEF_DIST (distance to the nearest defender)

SHOT_CLOCK (seconds left on the shot clock)

DRIBBLES (number of dribbles before the shot)

Key Findings:

SHOT_DIST had a significant negative effect on shot success.

Odds ratio: 0.942

95% Confidence Interval: [0.9401, 0.9433]

Interpretation: Each additional foot away from the basket reduces the odds of making the shot by ~5.7–6.0%, with 95% confidence.

CLOSE_DEF_DIST had a positive effect on shot success.

Odds ratio: 1.110

Interpretation: Each extra foot of defender space increases the odds of making the shot by ~11%.

SHOT_CLOCK had a modest positive effect.

Odds ratio: 1.018

Interpretation: Each additional second on the shot clock increases the odds of making the shot by ~1.8%.

DRIBBLES had a negative effect.

Odds ratio: 0.980

Interpretation: Each extra dribble reduces shot success odds by ~2%, likely indicating more contested or complex shot situations.

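Model Insights:

All four predictors were statistically significant (p < 0.001); the intercept was not (p = 0.501).

The model was fit on 122,200 shots; R dropped 5,553 rows with missing values in the model variables (SHOT_CLOCK is NA in row 3 of the preview, for example).

The coefficient signs all point in the direction basketball intuition suggests.

The confidence interval for SHOT_DIST is narrow and excludes an odds ratio of 1, so the effect is statistically significant. It is also practically meaningful: compounded over 10 feet, exp(10 * -0.0601) ≈ 0.55, a roughly 45% drop in the odds.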

Final Takeaway

Shot success in the NBA is clearly associated with distance to the basket, defender proximity, and time left on the shot clock. The predictors explain only a small share of the variation (residual deviance 161,997 versus null deviance 168,479, about a 4% reduction), so this logistic model is best read as a statistical lens on these patterns and a starting point for exploring player-specific performance, strategic decisions, or shot quality metrics.