# Creating the linear model
lm_model <- lm(fg_3p ~ dist + x3p_percent_cor_3 + fga_3p, data = df)
summary(lm_model)
##
## Call:
## lm(formula = fg_3p ~ dist + x3p_percent_cor_3 + fga_3p, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.42971 -0.03101  0.00727  0.03930  0.48813
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)        0.128739   0.006929  18.580  < 2e-16 ***
## dist               0.006486   0.001018   6.373 2.16e-10 ***
## x3p_percent_cor_3  0.355870   0.008521  41.762  < 2e-16 ***
## fga_3p            -0.052435   0.022507  -2.330   0.0199 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08254 on 2806 degrees of freedom
## (447 observations deleted due to missingness)
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4402
## F-statistic: 737.2 on 3 and 2806 DF, p-value: < 2.2e-16
Insights: All three predictors are statistically significant at the 5% level in explaining variation in overall three-point percentage, although the evidence for fga_3p (p = 0.020) is much weaker than for dist and x3p_percent_cor_3 (both p < 0.001). Corner 3pt% (x3p_percent_cor_3) has the largest coefficient (0.356) and by far the largest t-statistic (41.76), meaning players with higher accuracy from the corner tend to have a higher 3pt% overall. Because the predictors are measured in different units, the raw coefficients are not directly comparable in size. There is also a mechanical relationship between corner 3pt% and overall 3pt%, since corner attempts are a subset of all three-point attempts. Average shot distance (dist) has a positive coefficient (0.0065): players whose average shot is longer tend to have a higher 3pt%, likely because more of their shots come from behind the three-point arc. Three-point attempts (fga_3p) has a small negative coefficient (-0.052): players who attempt more three-pointers tend to be slightly less efficient.
Significance: The model explains about 44% of the variation in 3pt% (R^2 = 0.4408), indicating a meaningful association between these variables and 3pt%, though it does not imply a causal relationship. Of these variables, corner 3pt% is the most informative about overall three-point shooting, meaning players who excel at a high-efficiency shot type (corner threes are considered the most efficient three-point shot because the corner is the point on the arc closest to the basket) tend to be strong shooters overall. The negative relationship between three-point attempts and three-point percentage is plausible, since high-volume shooters tend to take more difficult shots and so usually have lower efficiency.
Further Question: Would including a variable that accounts for shot quality/difficulty or defensive pressure improve the model?
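The raw coefficients are in different units (dist is a distance, while the other two predictors look like proportions), so comparing their sizes directly can mislead. A minimal sketch of one way to put them on a common scale is to rescale each slope by sd(x) / sd(y). The helper name `standardized_coefs` and the data frame `sim` are hypothetical: `sim` is simulated stand-in data that only borrows the column names from the model above, so the printed values say nothing about the real players.

```r
# Synthetic stand-in data: column names mirror the model above, values are simulated.
set.seed(1)
n <- 500
sim <- data.frame(
  dist = rnorm(n, mean = 14, sd = 4),
  x3p_percent_cor_3 = runif(n, min = 0, max = 0.6),
  fga_3p = runif(n, min = 0, max = 0.8)
)
sim$fg_3p <- 0.13 + 0.006 * sim$dist + 0.36 * sim$x3p_percent_cor_3 -
  0.05 * sim$fga_3p + rnorm(n, sd = 0.08)

# Hypothetical helper: rescale each slope by sd(x) / sd(y) so it reads as
# "SD change in the response per 1 SD change in the predictor".
# Assumes plain numeric predictors (no factors, poly() or interaction terms).
standardized_coefs <- function(model) {
  mf <- model.frame(model)     # rows actually used in the fit (NAs dropped)
  y_sd <- sd(mf[[1]])          # the response is the first column
  x_sd <- sapply(mf[-1], sd)   # predictors, in formula order
  coef(model)[-1] * x_sd / y_sd
}

sim_model <- lm(fg_3p ~ dist + x3p_percent_cor_3 + fga_3p, data = sim)
std_coefs <- standardized_coefs(sim_model)
print(round(std_coefs, 3))
```

On the real data the same helper could be called as `standardized_coefs(lm_model)`. Because it reads `model.frame()`, it uses only the rows that survived the missing-value deletion reported in the summary.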
Interpretation: Holding dist and fga_3p constant, a 1 percentage point increase in corner three-point percentage is associated with an increase of approximately 0.356 percentage points in overall three-point percentage.
Insights: The coefficient for corner 3pt% (0.356) is large and highly significant (p < 2e-16), meaning a player’s corner three accuracy is strongly associated with their overall three-point shooting performance.
Significance: This suggests that corner three-point shooting is a strong indicator of overall three-point shooting ability. Because the corner three is the highest-percentage shot behind the three-point line (it is the shortest distance from the basket), players who shoot well from the corner tend to be more efficient shooters overall.
Further Question: Are these players specializing in just the corner three, or do they shoot well from three regardless of spot?
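As a quick worked example of that interpretation, the slope can be multiplied by a gap in corner 3pt% to get the expected gap in overall 3pt%. The 10 percentage point gap is a hypothetical value chosen for illustration, and the sketch assumes both percentages are stored as proportions between 0 and 1, which the residual range in the output suggests but does not confirm.

```r
# Slope for x3p_percent_cor_3, copied from the summary() output above.
b_corner <- 0.355870

# Hypothetical comparison: two players who differ by 10 percentage points
# (0.10 on a proportion scale) in corner 3pt%, with dist and fga_3p equal.
corner_gap <- 0.10

expected_gap <- b_corner * corner_gap
print(expected_gap)  # 0.035587, i.e. about 3.6 percentage points
```

This is an association within the fitted data, not a prediction of what happens if a given player improves from the corner.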
# Residuals vs Fitted
plot(lm_model, which = 1)
Insights: While the residuals are generally centered around zero, the
red smoothing line is curved, suggesting the relationship between the
independent variables and 3pt% is not fully linear. Issue/Severity: The
curved trend suggests the model does not fully capture the true
relationship between the variables. There is also slight
heteroskedasticity, since the spread of the residuals increases slightly
as the fitted values increase. This is a moderate-severity issue, so
confidence that the linear model captures this relationship is only
moderate. Significance: While the model captures the general trend in
3pt%, it most likely oversimplifies the relationship between shooting
behavior and shot performance. The effects of corner 3pt% or shot volume
may not be strictly linear across all ranges. Another option is to clean
the data by removing low-volume players (for example, players with a
single three-point attempt), whose percentages are extreme by
construction; minimum-attempt filters are standard practice in
professional basketball analytics.
Further Question: Would adding a non-linear term (for example, a squared
predictor) improve the model’s fit?
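One way to probe that question is to add a squared term for a predictor and compare the two nested models with an F test. The sketch below is illustrative only: `sim` is simulated stand-in data with a curved corner 3pt% effect built in on purpose, and squaring x3p_percent_cor_3 (rather than dist or fga_3p) is an assumption, not something the residual plot identifies.

```r
# Synthetic stand-in data with a deliberately curved corner 3pt% effect.
set.seed(2)
n <- 500
sim <- data.frame(
  dist = rnorm(n, mean = 14, sd = 4),
  x3p_percent_cor_3 = runif(n, min = 0, max = 0.6),
  fga_3p = runif(n, min = 0, max = 0.8)
)
sim$fg_3p <- 0.10 + 0.006 * sim$dist + 0.80 * sim$x3p_percent_cor_3 -
  0.80 * sim$x3p_percent_cor_3^2 - 0.05 * sim$fga_3p + rnorm(n, sd = 0.05)

# Same specification as the model above, fitted to the simulated data.
linear_fit <- lm(fg_3p ~ dist + x3p_percent_cor_3 + fga_3p, data = sim)

# I() protects ^ so it means "square" rather than a formula operator.
quad_fit <- update(linear_fit, . ~ . + I(x3p_percent_cor_3^2))

# Nested-model F test: a small p-value favors keeping the squared term.
comparison <- anova(linear_fit, quad_fit)
print(comparison)
```

On the real data the equivalent call would be `update(lm_model, . ~ . + I(x3p_percent_cor_3^2))`, followed by `plot(..., which = 1)` on the new fit to check whether the red line flattens. A significant squared term would support the curvature reading, but it would not address the heteroskedasticity or the low-volume players.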