New Regression Model Creation

# Creating dummy variable for position where PGs and SGs are 1 and other positions are 0.
df_reg <-
  df |>
  mutate(
    role_group = if_else(pos %in% c("PG", "SG"), 1, 0)
  )
# Extends the week 8 linear regression of FG% on average shot distance
# by adding a position dummy variable and the three-point attempt rate
lm_model <- lm(fg_percent ~ dist + role_group + fga_3p, data = df_reg)

summary(lm_model)
## 
## Call:
## lm(formula = fg_percent ~ dist + role_group + fga_3p, data = df_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60380 -0.02947  0.01167  0.04767  0.69724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.612913   0.006617  92.626  < 2e-16 ***
## dist        -0.011387   0.001082 -10.527  < 2e-16 ***
## role_group  -0.017229   0.004085  -4.217 2.54e-05 ***
## fga_3p      -0.001429   0.024004  -0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1055 on 3217 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.2611, Adjusted R-squared:  0.2604 
## F-statistic:   379 on 3 and 3217 DF,  p-value: < 2.2e-16

Why add variables? role_group: the analysis from previous weeks, including the ANOVA model I created last week, showed that a player’s position does impact their shooting percentage (FG%). This variable captures the difference in player roles (guards vs. all other positions). fga_3p (three-point attempt rate): with the rise of the three-point shot, it is important to measure how much of a player’s shot selection comes from three-point range, which provides more detailed context than average shot distance alone. On average, players who shoot more three-pointers have lower FG%.

Insights: Average shot distance and a player’s position are both significant, with p-values below 0.01. Holding the other variables constant, each additional foot of average shot distance is associated with a decrease in FG% of about 0.011 (roughly 1.1 percentage points), and guards (PG or SG) have an FG% about 0.017 (roughly 1.7 percentage points) lower than other positions. The three-point attempt rate also has a negative coefficient (-0.0014), but with a p-value of 0.953 it is not statistically significant, so its effect cannot be distinguished from zero. Significance: Based on the model, a player’s position and average shot distance are the most significant variables when assessing overall shooting efficiency, while the share of a player’s shots that are three-point attempts is not significant once the other two variables are included. Further Question: Would it be more valuable to replace the average shot distance with different shot types (dunks, mid-range shots, three-pointers such as corner 3s, etc.)?
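To make the coefficient interpretation concrete, here is a small sketch that plugs the estimates reported in the summary output into the regression equation for two hypothetical players. The player profiles are made up for illustration, and the fga_3p value assumes the variable is stored as a proportion between 0 and 1, which the output does not confirm.

```r
# Coefficients copied from the summary(lm_model) output above
coefs <- c(intercept = 0.612913, dist = -0.011387,
           role_group = -0.017229, fga_3p = -0.001429)

# Hypothetical player profiles (illustrative values, not rows from the data).
# Assumption: fga_3p is a proportion between 0 and 1.
profiles <- data.frame(
  label      = c("non-guard", "guard"),
  dist       = c(10, 10),   # average shot distance in feet
  role_group = c(0, 1),     # 1 = PG or SG, 0 = any other position
  fga_3p     = c(0.3, 0.3)
)

# Predicted FG% = intercept + sum of (coefficient * predictor value)
profiles$pred_fg <- coefs[["intercept"]] +
  coefs[["dist"]]       * profiles$dist +
  coefs[["role_group"]] * profiles$role_group +
  coefs[["fga_3p"]]     * profiles$fga_3p

print(profiles)

# With dist and fga_3p held equal, the gap between the two rows
# is exactly the role_group coefficient
guard_gap <- diff(profiles$pred_fg)
print(guard_gap)
```

Because the two profiles differ only in role_group, the gap isolates the position effect. Changing dist by one foot in both rows shifts each prediction by the dist coefficient instead.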

Five Diagnostic Plots

# Residual vs Fitted
plot(lm_model, which = 1)

Insights: The residuals are generally centered around zero, suggesting that the model does not systematically over- or under-predict FG%. However, toward the right side of the plot there is a slight curve and a widening of the spread at higher fitted values, suggesting the relationship may not be perfectly linear and the variance is not constant. Issue/Severity: There appears to be slight heteroskedasticity. The curve is not extreme, but it is noticeable enough to suggest that a purely linear model may not be the best fit. I have moderate confidence that the linearity assumption is reasonably satisfied. Significance: The model does an overall good job of capturing how FG% changes with the chosen variables. The pattern also supports cleaning the data before fitting this type of model, specifically removing outliers such as low-volume players who took only one shot. Further Question: Would adding a non-linear term improve the model’s fit?
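One way to explore the further question is to add a squared distance term and compare the two nested models, then check for non-constant variance more formally. The sketch below runs on simulated data, not the project data, so the object names (sim, fit_linear, fit_quad) and the simulated curve are assumptions for illustration only. The variance check is a hand-computed studentized Breusch-Pagan statistic, which avoids needing an extra package.

```r
set.seed(42)
n <- 500

# Simulated stand-in for df_reg (assumption: not the real data)
sim <- data.frame(
  dist       = runif(n, 0, 25),
  role_group = rbinom(n, 1, 0.4)
)
# FG% generated with a mild curve in distance so there is something to detect
sim$fg_percent <- 0.65 - 0.02 * sim$dist + 0.0004 * sim$dist^2 -
  0.02 * sim$role_group + rnorm(n, sd = 0.05)

fit_linear <- lm(fg_percent ~ dist + role_group, data = sim)
fit_quad   <- lm(fg_percent ~ dist + I(dist^2) + role_group, data = sim)

# Nested F-test: does the squared term explain additional variance?
cmp <- anova(fit_linear, fit_quad)
print(cmp)

# Studentized Breusch-Pagan check by hand:
# regress squared residuals on the predictors; n * R^2 is approximately
# chi-squared with df equal to the number of predictors
sim$resid_sq <- residuals(fit_linear)^2
aux     <- lm(resid_sq ~ dist + role_group, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

On the real data, the same comparison would swap sim for df_reg and keep fga_3p in both formulas. A small p-value in the F-test would point toward curvature, and a small Breusch-Pagan p-value would support the heteroskedasticity seen in the plot.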

# Correlation Heatmap
corr_matrix <-
  cor(df_reg[, c("fg_percent", "dist", "role_group", "fga_3p")],
      use = "complete.obs")

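# zlim pins the color scale to [-1, 1] so the legend below matches the plot
# (without it, colors are scaled to the range of the values in corr_matrix)
heatmap(corr_matrix,
        Rowv = NA, Colv = NA,
        col = colorRampPalette(c("blue", "white", "red"))(100),
        scale = "none",
        zlim = c(-1, 1),
        margins = c(6, 6))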

# Heatmap Legend
legend("topright",
       legend = c("-1", "-0.5", "0", "0.5", "1"),
       fill = colorRampPalette(c("blue", "white", "red"))(5),
       title = "Correlation",
       cex = 0.8)

Insights: There is a strong positive correlation between average shot distance (dist) and three-point attempt rate (fga_3p), meaning the two variables carry largely overlapping information about shot selection. Both variables are also negatively correlated with FG%, which aligns with the idea that shots taken from farther away reduce shooting efficiency. Since fga_3p is highly correlated with distance and not statistically significant, I would most likely remove it from the final model for interpretability. Issue/Severity: High severity, because the correlation between dist and fga_3p is strong, with the cell colored close to 1. Significance: With such high correlation between the two variables, the model struggles to distinguish each variable’s individual effect on FG%. This multicollinearity most likely explains why fga_3p was not statistically significant in the linear regression model despite being conceptually important, and it makes the coefficient interpretations uncertain. Further Question: Would it be more valuable to replace the average shot distance with different shot types (dunks, mid-range shots, three-pointers such as corner 3s, etc.)?
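The multicollinearity concern can be quantified with a variance inflation factor (VIF), which for each predictor equals 1 / (1 - R²) from regressing that predictor on the others. The sketch below computes it by hand in base R on simulated predictors; the simulated relationship between dist and fga_3p is an assumption for illustration and does not reproduce the project data.

```r
set.seed(1)
n <- 500

# Simulated predictors (assumption: fga_3p rises with dist, as in the heatmap)
dist       <- runif(n, 0, 25)
fga_3p     <- pmin(pmax(0.03 * dist + rnorm(n, sd = 0.05), 0), 1)
role_group <- rbinom(n, 1, 0.4)
preds <- data.frame(dist, fga_3p, role_group)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. On the real data, preds would be the dist, role_group, and fga_3p columns of df_reg with incomplete rows dropped.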

# Residual Histogram
hist(residuals(lm_model),
     main = "Histogram of Residuals",
     xlab = "Residuals")

Insights: The residuals are roughly centered around zero and follow a bell-shaped curve, supporting the assumption of normality. However, there are some extreme residuals, as seen in the tails extending in both directions. Issue/Severity: This is a seemingly mild deviation from normality, mainly driven by tails on both ends that extend farther than a perfect normal distribution would allow. I have moderate confidence that normality is a reasonable assumption. Significance: While the residuals are not perfectly normal, the large sample size (3000+ observations) makes the model relatively robust. Statistical inferences (p-values, confidence intervals, etc.) are still reasonably reliable. This again supports some data cleaning to remove outliers, as mentioned in the residuals vs. fitted section. Further Question: Along with potential data cleaning, are these extreme residuals caused by specialized players (someone like Duncan Robinson, who mostly shoots from behind the three-point line at a high accuracy rate)?

# QQ-Plot
plot(lm_model, which = 2)

Insights: The plot shows that the residuals follow the normal line closely around the middle but deviate noticeably at the tails. This shows that while most residuals behave normally, outliers do appear in the distribution. Issue/Severity: There is a moderate deviation from the normal line at the extreme tails. I have moderate confidence that normality is a reasonable assumption, mainly because of the large sample size. Significance: The tail extremes suggest that some observations have unusually large prediction errors. This does not invalidate the model, but it does show that the model does not fully capture extreme cases, which could impact inference at the margins. As stated in the residuals vs. fitted and residual histogram sections, some data cleaning prior to fitting the model would help improve the fit. Further Question: By how much would the tails move toward or away from the normal line if data cleaning was done prior to running the model?
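The tail behavior seen in the histogram and QQ-plot can also be put into numbers by counting how many standardized residuals fall beyond three standard deviations and comparing that share with what a normal distribution would produce. The sketch below uses simulated heavy-tailed errors, so the data and the object names (fit, z) are assumptions for illustration rather than the project model.

```r
set.seed(7)
n <- 1000

# Simulated data with heavy-tailed errors (t distribution, 3 degrees of freedom)
x <- runif(n, 0, 25)
y <- 0.6 - 0.01 * x + 0.05 * rt(n, df = 3)
fit <- lm(y ~ x)

# Standardized residuals put every observation on a comparable scale
z <- rstandard(fit)

# Observed share beyond +/- 3 versus the share expected under normality
share_extreme   <- mean(abs(z) > 3)
expected_normal <- 2 * pnorm(-3)

print(c(observed = share_extreme, expected_if_normal = expected_normal))
```

Applied to the project model, z would be rstandard(lm_model). Rerunning the comparison after filtering low-volume players would show how much of the excess tail mass the data cleaning removes.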

# Cook's D
plot(lm_model, which = 4)

Insights: The plot shows that most observations have minimal influence on the model. However, a few observations have noticeably higher values and are flagged as potentially influential. Issue/Severity: The severity is low since there are only a few influential observations. I have high confidence that no single observation is driving the model. Significance: This suggests the model is stable and not swayed by any single observation. However, these influential points can still affect the coefficient estimates. Further Question: Who are these influential observations, and are they meaningful observations (no need for data cleaning) or actual player/data anomalies?
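To follow up on who the influential observations are, the Cook's distance values can be extracted directly and filtered against a cutoff. The sketch below uses simulated data with one planted anomaly; the 4/n cutoff is a common rule of thumb and an assumption here, not a threshold taken from the analysis above.

```r
set.seed(3)
n <- 200

# Simulated stand-in data (assumption: not the real player data)
sim <- data.frame(dist = runif(n, 0, 25))
sim$fg_percent <- 0.6 - 0.01 * sim$dist + rnorm(n, sd = 0.05)

# Plant one anomaly: a long-distance shooter with a perfect FG%
sim$dist[1]       <- 25
sim$fg_percent[1] <- 1.0

fit <- lm(fg_percent ~ dist, data = sim)

# Cook's distance for every observation used in the fit
cd     <- cooks.distance(fit)
cutoff <- 4 / n   # rule-of-thumb threshold (assumption)

# Keep the flagged rows and sort them from most to least influential
is_flagged      <- cd > cutoff
flagged         <- sim[is_flagged, ]
flagged$cooks_d <- cd[is_flagged]
flagged         <- flagged[order(-flagged$cooks_d), ]

print(head(flagged))
```

For the project model, cooks.distance(lm_model) returns values only for the rows that were actually used in the fit, so the 36 rows dropped for missingness would not appear. Joining the flagged rows back to player names would show whether they are real specialists or data anomalies.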