Logistic Regression Model

# Creating binary variable to identify high volume three-point shooters
# Creating dummy variable for position where PGs and SGs are 1 and other positions are 0.

df_logit <-
  df |>
  mutate(
    role_group = if_else(pos %in% c("PG", "SG"), 1, 0),
    high_3pt_volume = if_else(fga_3p >= 0.4, 1, 0)
  )

# GLM model

logit_model <-
  glm(high_3pt_volume ~ dist + role_group + age,
      data = df_logit,
      family = binomial)

summary(logit_model)
## 
## Call:
## glm(formula = high_3pt_volume ~ dist + role_group + age, family = binomial, 
##     data = df_logit)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.76746    0.64890 -19.676  < 2e-16 ***
## dist          1.09537    0.04421  24.779  < 2e-16 ***
## role_group   -0.93735    0.13806  -6.790 1.12e-11 ***
## age          -0.08207    0.01627  -5.043 4.59e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4453.0  on 3220  degrees of freedom
## Residual deviance: 1535.1  on 3217  degrees of freedom
##   (36 observations deleted due to missingness)
## AIC: 1543.1
## 
## Number of Fisher Scoring iterations: 7

Coefficient Interpretation:

- dist (average shot distance): The coefficient is positive. Each additional foot of average shot distance increases the log-odds of being a high-volume three-point shooter by 1.095, holding the other variables constant.
- role_group (position): This dummy variable compares guards (PGs and SGs, coded 1) with non-guards (coded 0). The negative coefficient (-0.937) means that, at the same average shot distance and age, guards have lower odds of being high-volume three-point shooters than non-guards. This is unexpected, since guards tend to shoot more three-pointers. One possible explanation is that dist already captures much of the shooting-range difference between positions, so role_group only compares guards and non-guards who shoot from similar distances. It may also reflect the three-point era, in which players at every position, with few exceptions, need to shoot well to succeed in a modern offense.
- age (player's age): This coefficient helps assess whether experience influences shooting tendencies. The negative coefficient (-0.082) means that each additional year of age lowers the log-odds of being a high-volume three-point shooter by 0.082, holding the other variables constant, so older players are less likely to be high-volume shooters than younger players. This makes intuitive sense: with the recent rise of the three-point era, younger players focus on practicing their three-point shot, whereas older players developed in an era when three-point shots were taken more selectively.
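Log-odds are hard to read directly, so it can help to convert the coefficients to odds ratios by exponentiating them. The sketch below hard-codes the three estimates printed in the summary output so that it runs on its own; the names log_odds, odds_ratio, and pct_change are illustrative and not part of the original analysis.

```r
# Coefficient estimates copied from the summary(logit_model) output above.
log_odds <- c(dist = 1.09537, role_group = -0.93735, age = -0.08207)

# Exponentiating a log-odds coefficient gives an odds ratio:
# the multiplicative change in the odds for a one-unit increase.
odds_ratio <- exp(log_odds)

# The same information expressed as a percent change in the odds.
pct_change <- 100 * (odds_ratio - 1)

round(cbind(log_odds, odds_ratio, pct_change), 3)
```

Under these estimates, the odds ratio for dist is about 2.99, so each extra foot of average shot distance roughly triples the odds. The odds ratio for role_group is about 0.39 and for age about 0.92. With the fitted model in memory, exp(coef(logit_model)) should give the same values without copying numbers by hand.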

Insight: The model shows that all three variables are statistically significant predictors of the log-odds of a player being a high-volume three-point shooter. Average shot distance has a positive effect, while age and position have negative effects.

Significance: Average shot distance is the strongest predictor of three-point shooting behavior in this model (its z-statistic, 24.8, is by far the largest), while role and age also matter but have smaller effects. The unexpected result that guards are less likely to be high-volume three-point shooters is worth noting. It might support the newer concept of positionless basketball, in which anyone can shoot threes regardless of traditional position.

Further Question: Why are guards less likely to be high-volume three-point shooters in this dataset?

Confidence Intervals for Average Shot Distance

# Confidence Interval for dist (Average Shot Distance)

coef_est <- coef(summary(logit_model))["dist", "Estimate"]
se_est   <- coef(summary(logit_model))["dist", "Std. Error"]

lower <- coef_est - 1.96 * se_est
upper <- coef_est + 1.96 * se_est

lower
## [1] 1.008731
upper
## [1] 1.182017

Interpretation: We can be 95% confident that the true effect of average shot distance on the log-odds of being a high-volume three-point shooter is between 1.0087 and 1.1820. Since the interval does not include 0, there is strong evidence that average shot distance has a statistically significant positive effect.

Insights: The 95% confidence interval for the dist coefficient was constructed to assess the precision and reliability of its effect on the log-odds of a player being a high-volume three-point shooter. The interval (1.0087, 1.1820) is positive and fairly narrow, which indicates a stable and consistent estimate.

Significance: This confidence interval reinforces that average shot distance is statistically significant, and because the entire interval lies well above 0, it is consistent with dist being the strongest predictor of shooting profiles in the model. Since the interval is both positive and narrow, we can be confident that the estimate is precise.

Further Question: Given that dist is a significant predictor, and looking at the previous linear regression model that included fga_3p (three-point attempt rate), would using fga_3p instead be more or less impactful?
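The interval above is on the log-odds scale. As a rough sketch, the same Wald interval can be moved to the odds-ratio scale by exponentiating its endpoints. The estimate and standard error below are copied from the summary output so that the block is self-contained; the names est, se, ci_log_odds, and ci_odds_ratio are illustrative.

```r
# Estimate and standard error for dist, copied from summary(logit_model).
est <- 1.09537
se  <- 0.04421

# Exact normal quantile for a two-sided 95% interval (about 1.96).
z <- qnorm(0.975)

# Wald interval on the log-odds scale.
ci_log_odds <- est + c(lower = -1, upper = 1) * z * se

# Exponentiate the endpoints to get an interval for the odds ratio.
ci_odds_ratio <- exp(ci_log_odds)

round(rbind(ci_log_odds, ci_odds_ratio), 4)
```

With these inputs the odds-ratio interval runs from about 2.74 to about 3.26, so each additional foot of average shot distance multiplies the odds by somewhere in that range. If the fitted model is available, confint.default(logit_model) returns Wald intervals for every coefficient at once, while confint(logit_model) uses profile likelihood and can give slightly different endpoints.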

Logistic Regression Diagnostic (Cook’s Distance)

plot(logit_model, which = 4, id.n = 3)

Insights: The plot shows that the majority of observations have low influence on the model, meaning most players contribute similarly to the model fit. However, a few observations have noticeably higher influence.

Issue/Severity: The severity is low since there are few influential observations. I have high confidence that no single observation is driving the model.

Significance: Because the results are not heavily driven by a small number of unusual observations, the fitted coefficients appear robust to individual players. However, the few influential observations are worth identifying, to see whether data-cleaning issues (such as low shooting volume) or particular player profiles are shaping the model.

Further Question: Who are the players behind the most influential observations? Are they data-cleaning cases or true outliers based on role or playstyle?
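One way to approach the further question is to pull the rows with the largest Cook's distances and inspect them. The sketch below uses simulated data so that it runs on its own; the data frame sim, the model fit, and the outcome column high are hypothetical stand-ins for df_logit, logit_model, and high_3pt_volume, and the simulated values say nothing about real players.

```r
# Simulated stand-in data (an assumption for illustration, not the real dataset).
set.seed(1)
n <- 200
sim <- data.frame(
  dist = rnorm(n, mean = 13, sd = 3),
  age  = round(runif(n, min = 19, max = 38))
)
sim$high <- rbinom(n, size = 1, prob = plogis(-6.5 + 0.5 * sim$dist))

# Mimic rows that glm() drops for missingness, as happened in the real fit.
sim$dist[c(5, 17)] <- NA

fit <- glm(high ~ dist + age, data = sim, family = binomial)

# Cook's distance for each row actually used in the fit.
cd <- cooks.distance(fit)

# The three most influential observations, largest first.
top <- sort(cd, decreasing = TRUE)[1:3]

# The names of cd are the row names of the original data, so they can be
# used to look up the matching rows even though some rows were dropped.
cbind(sim[names(top), ], cooks_d = top)
```

Applied to the real analysis, the same pattern would be cooks.distance(logit_model) followed by a lookup in df_logit. Indexing by row name rather than by position matters there because 36 observations were removed for missingness, so positions in the Cook's distance vector do not line up with row numbers in the data.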