library(dplyr)  # group_by(), summarize()
library(car)    # vif()

shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")

head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148
# Group data by shot type and calculate FGM%
fgm_summary <- shot_logs |> 
  group_by(PTS_TYPE) |> 
  summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)

print(fgm_summary)
## # A tibble: 2 × 2
##   PTS_TYPE fgm_percentage
##      <int>          <dbl>
## 1        2           48.9
## 2        3           35.2

Introduction

Last week, I built a simple linear regression model predicting FGM% based on SHOT_DIST. This week, I will expand the model by adding more predictors and diagnosing potential issues.

Part 1: Expanding the Regression Model

Review of Last Week’s Simple Model:

lm_simple <- lm(FGM * 100 ~ SHOT_DIST, data = shot_logs)
summary(lm_simple)
## 
## Call:
## lm(formula = FGM * 100 ~ SHOT_DIST, data = shot_logs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.89 -43.60 -32.71  47.34  91.01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 59.88610    0.24950  240.03   <2e-16 ***
## SHOT_DIST   -1.07828    0.01538  -70.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.84 on 127751 degrees of freedom
## Multiple R-squared:  0.03707,    Adjusted R-squared:  0.03706 
## F-statistic:  4918 on 1 and 127751 DF,  p-value: < 2.2e-16

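This model showed that SHOT_DIST is negatively associated with FGM%: each additional foot lowers the predicted make percentage by about 1.08 points. However, R² was only about 3.7%, so shot distance alone explains very little of the variation in shot outcomes and other factors must matter.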
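To make the slope concrete, the fitted line can be evaluated by hand at a couple of distances. This is a minimal sketch that hard-codes the coefficients printed in the summary above; the 5 ft and 25 ft distances are arbitrary illustrative choices, not values taken from the analysis.

```r
# Coefficients copied from the summary(lm_simple) output above
b0 <- 59.88610   # intercept: predicted FGM% at 0 ft
b1 <- -1.07828   # change in predicted FGM% per additional foot

# Illustrative distances (arbitrary choices, not from the analysis)
dist_ft <- c(5, 25)
pred_fgm_pct <- b0 + b1 * dist_ft

data.frame(SHOT_DIST = dist_ft, predicted_FGM_pct = round(pred_fgm_pct, 1))
```

Under this model a 5 ft shot is predicted to go in about 54.5% of the time and a 25 ft shot about 32.9% of the time, a gap of roughly 21.6 percentage points.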

New Predictors

I will add the following predictors:

CLOSE_DEF_DIST (defender distance in feet). Justification: More space should increase shooting accuracy. Potential issue: If it is highly correlated with SHOT_DIST, it may cause multicollinearity.

SHOT_CLOCK (seconds remaining on the shot clock). Justification: Rushed shots should be less accurate. Potential issue: If most low-shot-clock attempts are also long shots, its effect could overlap with that of SHOT_DIST.

Interaction term SHOT_DIST * CLOSE_DEF_DIST. Justification: The effect of SHOT_DIST may change with defender pressure. Potential issue: If CLOSE_DEF_DIST is already strongly related to SHOT_DIST, this term might not add much insight.

Fit the New Multiple Regression Model

lm_expanded <- lm(FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + SHOT_DIST:CLOSE_DEF_DIST, data = shot_logs)
summary(lm_expanded)
## 
## Call:
## lm(formula = FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + 
##     SHOT_DIST:CLOSE_DEF_DIST, data = shot_logs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.90  -44.41  -30.25   50.01   91.68 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              43.528209   0.532323   81.77   <2e-16 ***
## SHOT_DIST                -0.959496   0.030839  -31.11   <2e-16 ***
## CLOSE_DEF_DIST            4.034113   0.113035   35.69   <2e-16 ***
## SHOT_CLOCK                0.466636   0.024734   18.87   <2e-16 ***
## SHOT_DIST:CLOSE_DEF_DIST -0.108605   0.006141  -17.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.48 on 122195 degrees of freedom
##   (5553 observations deleted due to missingness)
## Multiple R-squared:  0.05268,    Adjusted R-squared:  0.05265 
## F-statistic:  1699 on 4 and 122195 DF,  p-value: < 2.2e-16
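Because the model includes an interaction, the SHOT_DIST and CLOSE_DEF_DIST coefficients are no longer single slopes: each one depends on the value of the other variable. The sketch below works this out from the coefficients printed above. The helper names `dist_slope` and `def_slope` and the example distances are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the summary(lm_expanded) output above
b_dist  <- -0.959496   # SHOT_DIST main effect
b_def   <-  4.034113   # CLOSE_DEF_DIST main effect
b_inter <- -0.108605   # SHOT_DIST:CLOSE_DEF_DIST interaction

# Slope of SHOT_DIST at a given defender distance (feet)
dist_slope <- function(def_dist) b_dist + b_inter * def_dist

# Slope of CLOSE_DEF_DIST at a given shot distance (feet)
def_slope <- function(shot_dist) b_def + b_inter * shot_dist

# Illustrative values (arbitrary, chosen only to show the pattern)
dist_slope(c(2, 6))    # effect of one more foot of shot distance
def_slope(c(5, 25))    # effect of one more foot of defender space
```

With these coefficients, an extra foot of defender space is worth about 3.5 points of predicted FGM% on a 5 ft shot but only about 1.3 points on a 25 ft shot, so the negative interaction means open space matters less the farther out the shot is.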

Part 2: Checking for Multicollinearity

Multicollinearity occurs when two or more predictors are highly correlated, making it hard to isolate their effects.

vif_scores <- vif(lm_expanded, type = "predictor")
## GVIFs computed for predictors
print(vif_scores)
##                    GVIF Df GVIF^(1/(2*Df)) Interacts With
## SHOT_DIST      1.053832  3        1.008777 CLOSE_DEF_DIST
## CLOSE_DEF_DIST 1.053832  3        1.008777      SHOT_DIST
## SHOT_CLOCK     1.053832  1        1.026563           --  
##                         Other Predictors
## SHOT_DIST                     SHOT_CLOCK
## CLOSE_DEF_DIST                SHOT_CLOCK
## SHOT_CLOCK     SHOT_DIST, CLOSE_DEF_DIST

SHOT_DIST, CLOSE_DEF_DIST, and SHOT_CLOCK each have a GVIF of about 1.05, far below the common rule-of-thumb thresholds of 5 or 10. There is no concern about multicollinearity.

Why I used GVIF: The generalized VIF (GVIF) is designed for models with interaction terms or categorical variables with multiple levels. When I did not include type = "predictor", the VIF for the interaction term was difficult to interpret, because an interaction is by construction correlated with its own main effects.
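As a cross-check on the GVIF output, the pairwise correlations and a hand-computed VIF (1 / (1 - R²) from regressing each predictor on the others) can be obtained in base R. This is a sketch only: it runs on a synthetic data frame named `toy`, which is an assumption standing in for shot_logs, and it ignores the interaction term, so it is not the same calculation that vif() performs with type = "predictor".

```r
set.seed(1)
# Synthetic stand-in for shot_logs (NOT the real data)
n <- 500
toy <- data.frame(
  SHOT_DIST      = runif(n, 0, 30),
  CLOSE_DEF_DIST = runif(n, 0, 10),
  SHOT_CLOCK     = runif(n, 0, 24)
)

# Pairwise correlations among the predictors
cor_mat <- cor(toy, use = "complete.obs")
print(round(cor_mat, 2))

# Manual VIF: regress each predictor on the others, then 1 / (1 - R^2)
manual_vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(manual_vif, 3))
```

On the real data, replacing `toy` with the three predictor columns of shot_logs would show whether SHOT_DIST and CLOSE_DEF_DIST are as weakly related as the GVIF values suggest.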

Part 3: Diagnosing Model Assumptions with Diagnostic Plots

We use five key diagnostic plots to check for issues in the model.

par(mfrow = c(2, 3))
plot(lm_expanded, which = 1:5)

Diagnostic Plot Interpretations:

Residuals vs. Fitted (Checking Linearity & Homoscedasticity). Ideal: Randomly scattered residuals around 0. Bad Sign: A pattern (e.g., a curve or funnel shape) suggests non-linearity or heteroscedasticity. Our Residuals vs. Fitted plot is not great: the residuals are not randomly scattered at all but trail downward in bands. This is most likely because FGM * 100 takes only the values 0 and 100, so every residual equals either 0 minus the fitted value or 100 minus the fitted value. Based on the plot, I consider this a severe issue.

Normal Q-Q Plot (Checking Normality of Residuals). Ideal: Residuals should follow a straight 45-degree line. Bad Sign: If points deviate significantly from the line, residuals may not be normally distributed. Our Q-Q plot is mediocre. I would have liked to see less deviation from the line; the deviation we do see suggests the residuals may not be normally distributed. I do not consider this a severe issue, but it is something to keep an eye on.

Scale-Location Plot (Checking Homoscedasticity - Constant Variance). Ideal: A flat horizontal line. Bad Sign: If the points fan out or curve, it suggests heteroscedasticity (the variance changes across fitted values). Our Scale-Location plot shows an unusual pattern, and I am not entirely sure what to make of it.

Cook’s Distance Plot (Checking Influential Observations). Ideal: No points significantly larger than others. Bad Sign: If a few points have extremely high Cook’s distances, they may be highly influential outliers. A few points on our Cook’s distance plot suggest high influence, but otherwise it looks pretty good. I do not consider this a severe concern, though the influential points may be worth investigating and possibly removing.

Residuals vs. Leverage (Checking Influential Points). Ideal: No points with high leverage. Bad Sign: Points that are far from the center and have high Cook’s distance heavily influence the model. Most points in this plot have low leverage, but a tail that trails downward indicates that some points have high leverage. That tail is suspicious, so I rate this plot as a moderate concern.
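One likely source of the banded, downward-trailing patterns is the response itself: when a linear model is fit to an outcome that is only ever 0 or 100, the residuals can take just two values for any fitted value. The sketch below illustrates this on synthetic data (the `toy` data frame and the make-probability formula are assumptions, not the real shot_logs), and also flags high-influence rows using the common 4/n rule of thumb for Cook's distance, which is a convention rather than a strict cutoff.

```r
set.seed(42)
# Synthetic stand-in for the shot data (NOT the real shot_logs)
n <- 1000
toy <- data.frame(SHOT_DIST = runif(n, 0, 30))
p_make <- plogis(0.5 - 0.05 * toy$SHOT_DIST)   # hypothetical make probability
toy$FGM <- rbinom(n, size = 1, prob = p_make)

fit <- lm(FGM * 100 ~ SHOT_DIST, data = toy)

# With a 0/100 response, each residual is either 0 - fitted or 100 - fitted,
# which produces two parallel downward-sloping lines in Residuals vs. Fitted
res <- resid(fit)
fv  <- fitted(fit)
on_miss_line <- abs(res - (0   - fv)) < 1e-8
on_make_line <- abs(res - (100 - fv)) < 1e-8
all(on_miss_line | on_make_line)

# Flag potentially influential shots using the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
length(flagged)
```

If the same check returns TRUE on lm_expanded, the patterns in the diagnostic plots reflect the binary outcome rather than a few unusual shots, which points toward a model designed for binary responses.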

Part 4: Model Evaluation & Next Steps

cat("Simple Model R²:", summary(lm_simple)$r.squared, "\n")
## Simple Model R²: 0.03707199
cat("Expanded Model R²:", summary(lm_expanded)$r.squared, "\n")
## Expanded Model R²: 0.05267887

Adding the new predictors increased R² from about 0.037 to about 0.053, indicating that they improved the explanatory power of the model. However, the expanded model still accounts for only a small share of the variance in FGM%.
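One caveat on this comparison: the expanded model dropped 5553 observations with missing values, so the two R² values above come from slightly different sets of shots. A fairer comparison refits both models on the same complete rows. The sketch below shows the idea on synthetic data; `toy`, the simulated coefficients, and the number of missing values are all assumptions for illustration.

```r
set.seed(7)
# Synthetic stand-in (NOT the real shot_logs), with some missing SHOT_CLOCK values
n <- 2000
toy <- data.frame(
  SHOT_DIST      = runif(n, 0, 30),
  CLOSE_DEF_DIST = runif(n, 0, 10),
  SHOT_CLOCK     = runif(n, 0, 24)
)
toy$FGM <- rbinom(n, 1, plogis(0.3 - 0.05 * toy$SHOT_DIST + 0.1 * toy$CLOSE_DEF_DIST))
toy$SHOT_CLOCK[sample(n, 100)] <- NA

# Keep only rows that the expanded model can actually use
vars <- c("FGM", "SHOT_DIST", "CLOSE_DEF_DIST", "SHOT_CLOCK")
toy_cc <- toy[complete.cases(toy[, vars]), ]

fit_simple   <- lm(FGM * 100 ~ SHOT_DIST, data = toy_cc)
fit_expanded <- lm(FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK +
                     SHOT_DIST:CLOSE_DEF_DIST, data = toy_cc)

# Both models now use the same rows, so R-squared and the F-test are comparable
c(n_simple = nobs(fit_simple), n_expanded = nobs(fit_expanded))
c(simple = summary(fit_simple)$r.squared, expanded = summary(fit_expanded)$r.squared)
anova(fit_simple, fit_expanded)
```

Fitting both models to the same rows also makes the nested-model F-test from anova() valid, which gives a formal test of whether the added terms improve the fit.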

Possible Explanations for the Small R² Increase

Shooting accuracy depends on more than just distance, defender pressure, and shot clock. Factors like shot type (layup vs. jump shot), defender height, player skill, fatigue, and game context could be missing.

The relationship might not be linear. If FGM% does not follow a straight-line relationship with the predictors, a non-linear model (e.g., logistic regression, polynomial regression) may be better.

Multicollinearity might still be an issue. If some predictors are highly correlated, they don’t add much unique information, which limits R² improvement. The GVIF values near 1 in Part 2 make this unlikely here, however.

Random variation plays a role in shot success. FGM% is partially random, meaning some variation cannot be explained by measurable stats.
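Of the alternatives mentioned above, logistic regression is the most direct fit for a made/missed outcome, because it models the probability of a make and keeps predictions between 0 and 1. The following is a minimal sketch on synthetic data, not a fitted result: the `toy` data frame, the simulated coefficients, and the example shot in `new_shot` are all hypothetical.

```r
set.seed(123)
# Synthetic stand-in (NOT the real shot_logs)
n <- 2000
toy <- data.frame(
  SHOT_DIST      = runif(n, 0, 30),
  CLOSE_DEF_DIST = runif(n, 0, 10),
  SHOT_CLOCK     = runif(n, 0, 24)
)
toy$FGM <- rbinom(n, 1, plogis(0.2 - 0.05 * toy$SHOT_DIST + 0.1 * toy$CLOSE_DEF_DIST))

# Logistic regression: models the log-odds of a make
glm_fit <- glm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + SHOT_DIST:CLOSE_DEF_DIST,
               data = toy, family = binomial)

# Predicted make probability for a hypothetical shot
new_shot <- data.frame(SHOT_DIST = 15, CLOSE_DEF_DIST = 4, SHOT_CLOCK = 12)
p_hat <- predict(glm_fit, newdata = new_shot, type = "response")
p_hat

# Unlike the linear model, fitted values cannot leave the [0, 1] range
range(fitted(glm_fit))
```

On the real data, the same formula could be passed to glm() with data = shot_logs; the coefficients would then be on the log-odds scale rather than in percentage points, so they are not directly comparable to the lm() estimates.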

While adding CLOSE_DEF_DIST, SHOT_CLOCK, and an interaction term slightly improved the model (raising R² from about 0.037 to about 0.053, an increase of roughly 0.016), the gain is minimal. This suggests that while these factors influence shot success, there are likely additional factors (e.g., shot type, player ability, game context) that contribute to FGM% variability. Future models could explore non-linear relationships or incorporate additional predictors to improve accuracy.