shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")
head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148
library(dplyr)  # provides group_by() and summarize()

# Group shots by point type (2 or 3) and compute the field-goal percentage
fgm_summary <- shot_logs |>
  group_by(PTS_TYPE) |>
  summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)
print(fgm_summary)
## # A tibble: 2 × 2
## PTS_TYPE fgm_percentage
## <int> <dbl>
## 1 2 48.9
## 2 3 35.2
Last week, I built a simple linear regression model predicting FGM% based on SHOT_DIST. This week, I will expand the model by adding more predictors and diagnosing potential issues.
Review of Last Week’s Simple Model:
lm_simple <- lm(FGM * 100 ~ SHOT_DIST, data = shot_logs)
summary(lm_simple)
##
## Call:
## lm(formula = FGM * 100 ~ SHOT_DIST, data = shot_logs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.89 -43.60 -32.71 47.34 91.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.88610 0.24950 240.03 <2e-16 ***
## SHOT_DIST -1.07828 0.01538 -70.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.84 on 127751 degrees of freedom
## Multiple R-squared: 0.03707, Adjusted R-squared: 0.03706
## F-statistic: 4918 on 1 and 127751 DF, p-value: < 2.2e-16
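This model showed that SHOT_DIST negatively impacts FGM%: each additional foot of shot distance lowers the predicted make percentage by about 1.08 points. However, R² was only about 0.037, so shot distance alone explains roughly 3.7% of the variance in shot outcomes, and many other factors must influence shooting accuracy.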
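To make the slope concrete, here is a small sketch that plugs a few shot distances into the fitted line. The intercept and slope are copied from the summary output above; the three distances (2, 10, and 24 feet) are illustrative choices of mine, not values taken from the analysis.

```r
# Coefficients copied from the summary(lm_simple) output above
intercept <- 59.88610
slope     <- -1.07828

# Illustrative shot distances in feet (assumed for this example)
dist_ft <- c(2, 10, 24)

# Predicted make percentage from the fitted line
pred_fgm_pct <- intercept + slope * dist_ft
names(pred_fgm_pct) <- paste0(dist_ft, " ft")
print(round(pred_fgm_pct, 1))
```

The predictions fall steadily with distance, which matches the lower 3-point percentage in the grouped summary at the top.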
I will add the following predictors:
CLOSE_DEF_DIST (Defender Distance in Feet) Justification: More space between shooter and defender should increase shooting accuracy. Potential issue: If it is highly correlated with SHOT_DIST, it may cause multicollinearity.
SHOT_CLOCK (Seconds Remaining on Shot Clock) Justification: Rushed shots taken late in the shot clock should be less accurate. Potential issue: If shots taken with little time on the shot clock also tend to be long shots, SHOT_CLOCK could overlap with SHOT_DIST.
Interaction Term: SHOT_DIST:CLOSE_DEF_DIST Justification: The effect of SHOT_DIST may change based on defender pressure. Potential issue: If CLOSE_DEF_DIST is already strongly related to SHOT_DIST, this term might not add much insight.
lm_expanded <- lm(FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + SHOT_DIST:CLOSE_DEF_DIST, data = shot_logs)
summary(lm_expanded)
##
## Call:
## lm(formula = FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK +
## SHOT_DIST:CLOSE_DEF_DIST, data = shot_logs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.90 -44.41 -30.25 50.01 91.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.528209 0.532323 81.77 <2e-16 ***
## SHOT_DIST -0.959496 0.030839 -31.11 <2e-16 ***
## CLOSE_DEF_DIST 4.034113 0.113035 35.69 <2e-16 ***
## SHOT_CLOCK 0.466636 0.024734 18.87 <2e-16 ***
## SHOT_DIST:CLOSE_DEF_DIST -0.108605 0.006141 -17.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.48 on 122195 degrees of freedom
## (5553 observations deleted due to missingness)
## Multiple R-squared: 0.05268, Adjusted R-squared: 0.05265
## F-statistic: 1699 on 4 and 122195 DF, p-value: < 2.2e-16
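Because the model includes the SHOT_DIST:CLOSE_DEF_DIST interaction, the CLOSE_DEF_DIST coefficient on its own only describes the effect of defender distance when SHOT_DIST is 0. The sketch below reuses the two coefficients printed above to show how that effect changes with shot distance. The helper name def_effect and the shot distances (5, 15, and 25 feet) are hypothetical choices for illustration.

```r
# Coefficients copied from the summary(lm_expanded) output above
b_def   <- 4.034113    # CLOSE_DEF_DIST
b_inter <- -0.108605   # SHOT_DIST:CLOSE_DEF_DIST

# Change in predicted FGM% per extra foot of defender distance,
# evaluated at a given shot distance in feet (hypothetical helper)
def_effect <- function(shot_dist) b_def + b_inter * shot_dist

# Illustrative shot distances (assumed for this example)
shot_dist <- c(5, 15, 25)
effects <- def_effect(shot_dist)
print(data.frame(shot_dist, effect_per_ft = round(effects, 2)))
```

Under this model, an extra foot of space is worth more on short shots than on long ones, which is what the negative interaction coefficient says.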
Multicollinearity occurs when two or more predictors are highly correlated, making it hard to isolate their effects.
library(car)  # provides vif()
vif_scores <- vif(lm_expanded, type = "predictor")
## GVIFs computed for predictors
print(vif_scores)
##                    GVIF Df GVIF^(1/(2*Df)) Interacts With
## SHOT_DIST      1.053832  3        1.008777 CLOSE_DEF_DIST
## CLOSE_DEF_DIST 1.053832  3        1.008777      SHOT_DIST
## SHOT_CLOCK     1.053832  1        1.026563             --
##                         Other Predictors
## SHOT_DIST                     SHOT_CLOCK
## CLOSE_DEF_DIST                SHOT_CLOCK
## SHOT_CLOCK     SHOT_DIST, CLOSE_DEF_DIST
SHOT_DIST (1.05): no multicollinearity. CLOSE_DEF_DIST (1.05): no multicollinearity. SHOT_CLOCK (1.05): no multicollinearity. All three GVIF values are about 1.05, well below the common rule-of-thumb thresholds of 5 or 10, so multicollinearity is not a concern.
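Why I used GVIF: The generalized VIF (GVIF) helps when there are interaction terms or categorical variables with multiple levels. When I did not include type = "predictor", the VIF for the interaction term was hard to interpret, because the interaction is built from SHOT_DIST and CLOSE_DEF_DIST and is therefore correlated with them by construction.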
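The third column of the GVIF output rescales each GVIF by its degrees of freedom so that values with different Df can be compared. As a quick check, this sketch recomputes that column from the GVIF and Df values printed above; the variable names are my own.

```r
# GVIF value copied from the vif() output above (identical for all three rows)
gvif <- 1.053832

# Adjusted value GVIF^(1/(2*Df)) for the two interacting predictors (Df = 3)
adj_interacting <- gvif^(1 / (2 * 3))

# Adjusted value for SHOT_CLOCK (Df = 1), which is just the square root
adj_shot_clock <- gvif^(1 / (2 * 1))

print(round(c(interacting = adj_interacting, shot_clock = adj_shot_clock), 6))
```

Both results agree with the printed column, and both sit very close to 1.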
We use five key diagnostic plots to check for issues in the model.
par(mfrow = c(2, 3))
plot(lm_expanded, which = 1:5)
Diagnostic Plot Interpretations:
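Residuals vs. Fitted (Checking Linearity & Homoscedasticity) Ideal: Randomly scattered residuals around 0. Bad Sign: If a pattern (e.g., curve or funnel shape) appears, it suggests non-linearity or heteroscedasticity. Our Residuals vs. Fitted plot is not great. The residuals are not randomly scattered at all; instead they trail downward in bands. The most likely reason is that the response is binary: FGM * 100 can only be 0 or 100, so every residual is either 0 minus the fitted value or 100 minus the fitted value, which traces two parallel downward lines. Based on this plot, I think there is a severe issue with fitting a straight-line model to a made/missed outcome.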
Normal Q-Q Plot (Checking Normality of Residuals) Ideal: Residuals should follow a straight 45-degree line. Bad Sign: If points deviate significantly from the line, residuals may not be normally distributed. Our Q-Q plot is mediocre. I would have liked to see less deviation from the line; the deviation we do see tells us that our residuals may not be normally distributed. I would say that this plot does not point to a severe issue, but it is something to keep an eye on.
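Scale-Location Plot (Checking Homoscedasticity - Constant Variance) Ideal: A flat horizontal line. Bad Sign: If the points fan out or curve, it suggests heteroscedasticity (variance is changing across fitted values). Our Scale-Location plot looks unusual, and I am not entirely sure what to make of it. One possible explanation is that the variance of a made/missed outcome depends on the probability of a make, so some non-constant variance is expected with this response.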
Cook’s Distance Plot (Checking Influential Observations) Ideal: No points significantly larger than others. Bad Sign: If a few points have extremely high Cook’s distances, they may be highly influential outliers. There are a few points on our Cook’s distance plot that suggest high influence, but other than that it looks pretty good. I would say this plot does not raise severe concerns, but the influential points should be inspected and possibly removed.
Residuals vs. Leverage (Checking Influential Points) Ideal: No points with high leverage. Bad Sign: If points are far from the center and have high Cook’s distance, they heavily influence the model. Most points in this plot have low leverage, but a tail trails downward and to the right, indicating that some points have high leverage. I would rate this as a moderate concern, and the points in that tail deserve a closer look.
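The banded look of the residual plots can be reproduced with made-up data. The sketch below simulates a binary make/miss outcome (the sample size and the probabilities are hypothetical and not estimated from shot_logs), fits the same kind of linear model, and confirms that fitted value plus residual can only ever equal 0 or 100. That restriction is what forces the residuals onto two parallel lines.

```r
set.seed(1)

# Simulated stand-in data: hypothetical distances and make probabilities
n <- 500
shot_dist <- runif(n, 0, 30)
p_make <- 0.60 - 0.01 * shot_dist
made <- rbinom(n, size = 1, prob = p_make)

# Same style of model as above: a linear fit to a 0/100 outcome
fit <- lm(made * 100 ~ shot_dist)

# fitted + residual reconstructs the response, which is only ever 0 or 100
recovered <- round(fitted(fit) + resid(fit))
print(table(recovered))

# Within the made shots, residual = 100 - fitted, a line with slope -1
print(cor(fitted(fit)[made == 1], resid(fit)[made == 1]))
```

Calling plot(fit, which = 1) on this simulated fit should show the same two-band pattern as the real Residuals vs. Fitted plot.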
cat("Simple Model R²:", summary(lm_simple)$r.squared, "\n")
## Simple Model R²: 0.03707199
cat("Expanded Model R²:", summary(lm_expanded)$r.squared, "\n")
## Expanded Model R²: 0.05267887
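By adding the new predictors, our R-squared increased from about 0.037 to about 0.053, indicating that they improved the explanatory power. However, the model still accounts for only a small amount of the variance in FGM%. One caveat: the expanded model dropped 5,553 observations because of missing values, so the two R² values are computed on slightly different samples.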
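The expanded model summary reports 5,553 observations deleted due to missingness, so one way to make the comparison cleaner is to fit both models on the same complete rows. The sketch below shows the pattern on simulated data. The data frame sim is a hypothetical stand-in whose column names mirror shot_logs, and the simulation settings are made up; with the real data, shot_logs would take the place of sim.

```r
set.seed(42)

# Hypothetical stand-in for shot_logs (simulation settings are made up)
n <- 1000
sim <- data.frame(
  SHOT_DIST      = runif(n, 0, 30),
  CLOSE_DEF_DIST = runif(n, 0, 10),
  SHOT_CLOCK     = runif(n, 0, 24)
)
sim$FGM <- rbinom(n, 1, plogis(0.3 - 0.05 * sim$SHOT_DIST + 0.10 * sim$CLOSE_DEF_DIST))
sim$SHOT_CLOCK[sample(n, 50)] <- NA   # mimic missing shot-clock values

# Keep only the rows that the expanded model is able to use
vars <- c("FGM", "SHOT_DIST", "CLOSE_DEF_DIST", "SHOT_CLOCK")
complete <- sim[complete.cases(sim[, vars]), ]

simple_cc   <- lm(FGM * 100 ~ SHOT_DIST, data = complete)
expanded_cc <- lm(FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK +
                    SHOT_DIST:CLOSE_DEF_DIST, data = complete)

# Both models now use identical observations, so their R-squared values
# and a nested-model F-test are directly comparable
cat("Rows used by each model:", nobs(simple_cc), nobs(expanded_cc), "\n")
print(anova(simple_cc, expanded_cc))
```

Fitting on a shared sample also makes anova() valid, since it requires both models to be fit to the same rows.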
Possible Explanations for the Small R² Increase
Shooting accuracy depends on more than just distance, defender pressure, and shot clock. Factors like shot type (layup vs. jump shot), defender height, player skill, fatigue, and game context could be missing.
The relationship might not be linear. If FGM% does not follow a straight-line relationship with the predictors, a different model (e.g., logistic regression, polynomial regression) may be better.
Multicollinearity might still be an issue. If some predictors are highly correlated, they don’t add much unique information, which limits R² improvement. The GVIF values of about 1.05 reported earlier make this explanation unlikely here, though.
Random variation plays a role in shot success. FGM% is partially random, meaning some variation cannot be explained by measurable stats.
While adding CLOSE_DEF_DIST, SHOT_CLOCK, and an interaction term slightly improved the model (raising R² by about 0.016, from 0.037 to 0.053), the increase is minimal. This suggests that while these factors influence shot success, there are likely additional factors (e.g., shot type, player ability, game context) that contribute to FGM% variability. Future models could explore non-linear relationships or incorporate additional predictors to improve accuracy.
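As a starting point for that future work, here is a minimal sketch of a logistic regression, which models the probability of a make directly instead of fitting a straight line to a 0/100 outcome. It runs on simulated data: the data frame sim, its coefficients, and the two example shots in new_shots are all hypothetical, and I have not fit this model to shot_logs.

```r
set.seed(7)

# Hypothetical stand-in data; column names mirror shot_logs
n <- 2000
sim <- data.frame(
  SHOT_DIST      = runif(n, 0, 30),
  CLOSE_DEF_DIST = runif(n, 0, 10)
)
sim$FGM <- rbinom(n, 1, plogis(0.4 - 0.06 * sim$SHOT_DIST + 0.08 * sim$CLOSE_DEF_DIST))

# Logistic regression: binary FGM modeled through the logit link
logit_fit <- glm(FGM ~ SHOT_DIST + CLOSE_DEF_DIST, family = binomial, data = sim)
print(round(coef(logit_fit), 3))

# Predicted make probabilities for two illustrative shots
new_shots <- data.frame(SHOT_DIST = c(3, 24), CLOSE_DEF_DIST = c(2, 6))
pred_prob <- predict(logit_fit, newdata = new_shots, type = "response")
print(round(pred_prob, 3))
```

Unlike the linear model, the predicted values here always stay between 0 and 1, so they can be read directly as make probabilities.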