library(dplyr)    # group_by(), summarize(), mutate()
library(ggplot2)  # bar chart and line plot
shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")
head(shot_logs)
## GAME_ID MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN A W 24 1
## 2 21400899 MAR 04, 2015 - CHA @ BKN A W 24 2
## 3 21400899 MAR 04, 2015 - CHA @ BKN A W 24 3
## 4 21400899 MAR 04, 2015 - CHA @ BKN A W 24 4
## 5 21400899 MAR 04, 2015 - CHA @ BKN A W 24 5
## 6 21400899 MAR 04, 2015 - CHA @ BKN A W 24 6
## PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1 1 1:09 10.8 2 1.9 7.7 2
## 2 1 0:14 3.4 0 0.8 28.2 3
## 3 1 0:00 NA 3 2.7 10.1 2
## 4 2 11:47 10.3 2 1.9 17.2 2
## 5 2 10:34 10.9 2 2.7 3.7 2
## 6 2 8:15 9.1 2 4.4 18.4 2
## SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1 made Anderson, Alan 101187 1.3 1
## 2 missed Bogdanovic, Bojan 202711 6.1 0
## 3 missed Bogdanovic, Bojan 202711 0.9 0
## 4 missed Brown, Markel 203900 3.4 0
## 5 missed Young, Thaddeus 201152 1.1 0
## 6 missed Williams, Deron 101114 2.6 0
## PTS player_name player_id
## 1 2 brian roberts 203148
## 2 0 brian roberts 203148
## 3 0 brian roberts 203148
## 4 0 brian roberts 203148
## 5 0 brian roberts 203148
## 6 0 brian roberts 203148
In this analysis, we investigate factors that influence field goal percentage (written FGM% throughout), one of the most valuable metrics in basketball. In this data set FGM is 1 for a made shot and 0 for a miss, so FGM% is the mean of FGM multiplied by 100. We perform:
ANOVA to test whether a categorical variable affects shot success.
Linear regression to assess how a continuous variable influences shot success.
Response Variable (Y): FGM% (Field Goal Percentage). This is the most valuable variable because it directly measures shooting efficiency.
Categorical Explanatory Variable for ANOVA: PTS_TYPE (2PT vs. 3PT). We expect 3PT shots to have lower FGM% than 2PT shots.
Continuous Explanatory Variable for Regression: SHOT_DIST (Shot Distance in feet). We expect FGM% to decrease as shot distance increases.
ANOVA - Does Shot Type Influence FGM%?
Null Hypothesis (H₀): The mean FGM% is the same for 2PT and 3PT shots.
Alternative Hypothesis (H₁): The mean FGM% differs between 2PT and 3PT shots.
#Group data by shot type and calculate FGM%#
fgm_summary <- shot_logs |>
group_by(PTS_TYPE) |>
summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)
print(fgm_summary)
## # A tibble: 2 × 2
## PTS_TYPE fgm_percentage
## <int> <dbl>
## 1 2 48.9
## 2 3 35.2
#Perform ANOVA test#
anova_result <- aov(FGM ~ PTS_TYPE, data = shot_logs)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## PTS_TYPE 1 469 469.2 1922 <2e-16 ***
## Residuals 127751 31180 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of ANOVA Output
If the p-value is below 0.05, we reject H₀, meaning shot type significantly affects FGM%. If the p-value is 0.05 or higher, we fail to reject H₀, meaning we have no evidence of a difference between 2PT and 3PT shots. Here F = 1922 on 1 and 127,751 degrees of freedom, with a p-value below 2e-16. That is far below 0.05, so we reject the null hypothesis.
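Because PTS_TYPE has only two levels, the one-way ANOVA is equivalent to a pooled-variance two-sample t-test (F equals t squared). Since FGM is a made/missed indicator, a two-sample test of proportions is another common check. The sketch below illustrates both on simulated data, not on the shot logs: the data frame `sim`, its sample sizes, and the 49% and 35% make rates are assumptions chosen only for illustration.

```r
# Simulated stand-in for the shot logs (hypothetical sizes and make rates)
set.seed(1)
sim <- data.frame(
  PTS_TYPE = factor(rep(c("2PT", "3PT"), times = c(3000, 1000)))
)
sim$FGM <- rbinom(nrow(sim), size = 1,
                  prob = ifelse(sim$PTS_TYPE == "2PT", 0.49, 0.35))

# One-way ANOVA, mirroring the call used in the analysis
f_value <- summary(aov(FGM ~ PTS_TYPE, data = sim))[[1]][["F value"]][1]

# Pooled-variance two-sample t-test: with two groups, F equals t^2
t_value <- unname(t.test(FGM ~ PTS_TYPE, data = sim, var.equal = TRUE)$statistic)

# Two-sample test of proportions on the made/attempted counts
made  <- as.vector(tapply(sim$FGM, sim$PTS_TYPE, sum))
shots <- as.vector(tapply(sim$FGM, sim$PTS_TYPE, length))
prop_result <- prop.test(x = made, n = shots)

print(c(F = f_value, t_squared = t_value^2))
print(prop_result)
```

With a sample as large as ours, the ANOVA and the proportion test would be expected to agree on the conclusion. The proportion test also reports a confidence interval for the difference in make rates.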
#Convert PTS_TYPE to a factor to ensure proper labeling#
shot_logs <- shot_logs |> mutate(PTS_TYPE = factor(PTS_TYPE, labels = c("2PT", "3PT")))
#Create grouped data to compute FGM%#
fgm_summary <- shot_logs |>
group_by(PTS_TYPE) |>
summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)
#Bar chart of FGM% by pts type#
ggplot(fgm_summary, aes(x = PTS_TYPE, y = fgm_percentage, fill = PTS_TYPE)) +
geom_bar(stat = "identity", alpha = 0.7, width = 0.5) +
labs(
title = "FGM% for 2PT vs. 3PT Shots",
x = "Shot Type (PTS_TYPE)",
y = "Field Goal Percentage (%)"
) +
scale_fill_manual(values = c("2PT" = "blue", "3PT" = "red")) +
theme_minimal() +
theme(legend.position = "none")
The bar chart shows the gap that the ANOVA found to be significant: 2PT
shots go in 48.9% of the time versus 35.2% for 3PT shots, a difference of
about 13.7 percentage points. This is consistent with our rejection of the
null hypothesis.
#Bin shot distances into 3-foot intervals#
shot_distance_summary <- shot_logs |>
  mutate(shot_dist_bin = cut(SHOT_DIST, breaks = seq(0, max(SHOT_DIST, na.rm = TRUE), by = 3), include.lowest = TRUE)) |>
  group_by(shot_dist_bin) |>
  summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)
#Line plot of FGM% vs. Shot Distance Bins#
#linewidth replaces size for lines, which is deprecated since ggplot2 3.4.0#
ggplot(shot_distance_summary, aes(x = shot_dist_bin, y = fgm_percentage, group = 1)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "darkblue", size = 2) +
  labs(
    title = "FGM% vs. Shot Distance",
    x = "Shot Distance (Binned in 3-Foot Intervals)",
    y = "Field Goal Percentage (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Expected Insights: Does FGM% decrease steadily as shot distance increases? Are there distance ranges where shooting is surprisingly efficient? Does a steep drop in accuracy suggest optimal shot locations?
As we can see, FGM% generally decreases as shot distance increases. I would say no range has a surprisingly good FGM%, but the curve flattens out from about 9 to 18 feet before it resumes a steady decline. The sharp drop from 27 ft onward is expected, since that is the range of a very long 3-pointer. The drop from 0 to 9 ft is also expected because layups (even with a defender nearby) are much easier than short-range shots. The optimal shot locations are obviously as close to the basket as possible, but a team cannot win by shooting only layups. Distributing shots across different ranges makes an offense harder to defend and less predictable. With that in mind, offenses can scheme plays to open up shots from 0-3 ft as well as shots in the 9-21 ft range.
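One caveat on the binning used for the plot: when the breaks are built with seq(0, max, by = 3), the last break falls below the maximum unless the maximum is an exact multiple of 3, so the longest shots are assigned NA instead of a bin. The sketch below shows the effect and one possible fix on a small vector of made-up distances (`shot_dist` and its values are assumptions, not rows from the shot logs).

```r
# Hypothetical shot distances in feet
shot_dist <- c(0.5, 4.2, 10.1, 28.2, 47.2)

# Breaks built as in the analysis: 0, 3, ..., 45 (stops below the maximum)
breaks_original <- seq(0, max(shot_dist), by = 3)
bins_original <- cut(shot_dist, breaks = breaks_original, include.lowest = TRUE)

# Round the upper limit up to the next multiple of 3 so every shot gets a bin
upper_limit <- ceiling(max(shot_dist) / 3) * 3
breaks_padded <- seq(0, upper_limit, by = 3)
bins_padded <- cut(shot_dist, breaks = breaks_padded, include.lowest = TRUE)

# Count shots left without a bin under each set of breaks
print(c(original = sum(is.na(bins_original)), padded = sum(is.na(bins_padded))))
```

If the real data behave this way, the summary table would contain an extra NA group for the longest shots, which is worth checking before reading the right-hand end of the plot.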
#Fit regression model#
lm_model <- lm(FGM * 100 ~ SHOT_DIST, data = shot_logs)
summary(lm_model)
##
## Call:
## lm(formula = FGM * 100 ~ SHOT_DIST, data = shot_logs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.89 -43.60 -32.71 47.34 91.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.88610 0.24950 240.03 <2e-16 ***
## SHOT_DIST -1.07828 0.01538 -70.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.84 on 127751 degrees of freedom
## Multiple R-squared: 0.03707, Adjusted R-squared: 0.03706
## F-statistic: 4918 on 1 and 127751 DF, p-value: < 2.2e-16
Interpretation of Regression Output Intercept (β₀): Expected FGM% for 0 ft shots (layups). Slope (β₁): Change in FGM% for every 1-ft increase in shot distance. R² Value: Percentage of FGM% variability explained by shot distance.
If β₁ is negative, it confirms that longer shots reduce accuracy. If p-value for β₁ < 0.05, we conclude SHOT_DIST significantly affects FGM%. If R² is low, other factors (e.g., defender pressure) also influence shot success.
For every 1-foot increase in shot distance, the expected FGM% decreases by 1.08 percentage points. Example: If a 5-ft shot has a 60% success rate, then a 6-ft shot is expected to have 58.92% success.
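To read predictions directly off the fitted line, we can plug distances into the estimated equation. The sketch below hard-codes the intercept and slope reported in the regression output; the helper `predict_fgm` and the distances chosen are illustrative assumptions. With these coefficients the fitted line itself gives roughly 54.5% at 5 ft and 53.4% at 6 ft, so the 60% figure in the example above is a hypothetical starting point rather than a model prediction.

```r
# Coefficients copied from the summary(lm_model) output
b0 <- 59.88610   # intercept: expected FGM% at 0 ft
b1 <- -1.07828   # slope: change in FGM% per additional foot

# Hypothetical helper: expected FGM% at a given shot distance (feet)
predict_fgm <- function(shot_dist) b0 + b1 * shot_dist

# Illustrative distances: at the rim, 5 ft, 6 ft, and a 24-ft 3-pointer
distances <- c(0, 5, 6, 24)
predictions <- data.frame(
  SHOT_DIST = distances,
  predicted_fgm_pct = predict_fgm(distances)
)
print(predictions)
```

Because the model is a straight line, it is not constrained to stay between 0% and 100% for very long shots, which is one reason to treat its predictions as rough approximations.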
3.71% of the variance in FGM% is explained by Shot Distance. This means that shot distance alone does not strongly predict field goal percentage, suggesting other factors (defender distance, shot type, player ability) contribute to shot success.
β₁ (-1.08) confirms that longer shots are harder to make, with each additional foot reducing expected FGM% by about 1.08 percentage points. R² (0.037) is very low, meaning other variables are crucial for predicting shot success. Next Step: Consider adding more predictors (e.g., defender distance, shot clock time) to improve the model.
Based on the ANOVA and linear regression results, we can confidently conclude that:
3PT shots are significantly harder to make than 2PT shots. Longer shots reduce shooting accuracy, with FGM% decreasing as shot distance increases. However, since shot distance only explains ~3.7% of FGM% variance, additional factors (e.g., defender distance, shot clock time, shot type) should be explored to build a more complete predictive model.
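Because FGM is a made/missed indicator, one option not used in this report is logistic regression, which keeps predicted probabilities between 0 and 1. The sketch below is an illustration on simulated data, not a refit of the shot logs: the data frame `sim`, the generating coefficients (0.4 and -0.045), and `new_shots` are all assumptions.

```r
# Simulated shots whose make probability falls with distance (hypothetical values)
set.seed(42)
n <- 5000
sim <- data.frame(SHOT_DIST = runif(n, min = 0, max = 30))
true_prob <- plogis(0.4 - 0.045 * sim$SHOT_DIST)
sim$FGM <- rbinom(n, size = 1, prob = true_prob)

# Logistic regression: models the log-odds of a make as linear in distance
logit_model <- glm(FGM ~ SHOT_DIST, data = sim, family = binomial)
print(summary(logit_model))

# Predicted make probabilities at a few illustrative distances
new_shots <- data.frame(SHOT_DIST = c(3, 12, 24))
predicted_prob <- predict(logit_model, newdata = new_shots, type = "response")
print(cbind(new_shots, predicted_prob))
```

The same formula interface accepts additional predictors, so this approach would extend to defender distance or shot clock time in the same way as lm().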
In addition to SHOT_DIST, we will add:
CLOSE_DEF_DIST (Defender Distance in feet) - More defensive pressure likely reduces accuracy.
SHOT_CLOCK (Seconds remaining on the shot clock) - Rushed shots may be lower quality.
DRIBBLES (Number of dribbles before the shot) - More dribbles might indicate tougher shots.
#Fit multiple linear regression model#
lm_model_improved <- lm(FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK + DRIBBLES, data = shot_logs)
summary(lm_model_improved)
##
## Call:
## lm(formula = FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK +
## DRIBBLES, data = shot_logs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.78 -44.47 -29.37 49.14 96.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.73634 0.45411 111.73 <2e-16 ***
## SHOT_DIST -1.39298 0.01916 -72.72 <2e-16 ***
## CLOSE_DEF_DIST 2.25997 0.06059 37.30 <2e-16 ***
## SHOT_CLOCK 0.42465 0.02482 17.11 <2e-16 ***
## DRIBBLES -0.48051 0.04133 -11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.52 on 122195 degrees of freedom
## (5553 observations deleted due to missingness)
## Multiple R-squared: 0.0513, Adjusted R-squared: 0.05127
## F-statistic: 1652 on 4 and 122195 DF, p-value: < 2.2e-16
Interpretation of New Predictors
Each coefficient is the expected change in FGM% (in percentage points) for a one-unit increase in that predictor, holding the other predictors constant.
SHOT_DIST (-1.39) - Longer shots reduce accuracy even more than in the first model (β₁ was -1.08 before).
CLOSE_DEF_DIST (+2.26) - More defender space increases shot accuracy (expected).
SHOT_CLOCK (+0.42) - More time on the shot clock increases accuracy, suggesting rushed shots are harder.
DRIBBLES (-0.48) - More dribbles before a shot slightly reduce accuracy, indicating tougher shots.
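Conclusion: Adding more predictors improved the model only slightly (R² rose from 0.037 to 0.051), so FGM% is still influenced mostly by factors not included in this analysis. Note that the second model was fit on 122,200 shots because 5,553 observations with missing values were dropped, so the two R² values come from slightly different samples.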
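For a like-for-like comparison of the two models, both can be fit on the same complete rows and compared with a nested-model F-test. The sketch below shows the idea on simulated data with some shot clock values set to missing. The data frame `sim`, the generating coefficients, and the number of missing values are assumptions for illustration only.

```r
# Simulated shots with a few missing shot clock values (hypothetical setup)
set.seed(7)
n <- 2000
sim <- data.frame(
  SHOT_DIST      = runif(n, min = 0, max = 30),
  CLOSE_DEF_DIST = runif(n, min = 0, max = 10),
  SHOT_CLOCK     = runif(n, min = 0, max = 24)
)
sim$FGM <- rbinom(n, size = 1,
                  prob = plogis(0.2 - 0.05 * sim$SHOT_DIST + 0.08 * sim$CLOSE_DEF_DIST))
sim$SHOT_CLOCK[sample(n, 100)] <- NA

# Keep only rows that are complete for every variable in the larger model
model_vars <- c("FGM", "SHOT_DIST", "CLOSE_DEF_DIST", "SHOT_CLOCK")
complete_rows <- sim[complete.cases(sim[, model_vars]), ]

# Fit both models on the same rows so R-squared values are comparable
small_model <- lm(FGM * 100 ~ SHOT_DIST, data = complete_rows)
large_model <- lm(FGM * 100 ~ SHOT_DIST + CLOSE_DEF_DIST + SHOT_CLOCK,
                  data = complete_rows)

print(c(small = summary(small_model)$adj.r.squared,
        large = summary(large_model)$adj.r.squared))

# Nested-model F-test: do the extra predictors improve the fit?
print(anova(small_model, large_model))
```

Fitting on a shared set of rows matters because anova() requires both models to use the same observations, and because differences in R² can otherwise reflect the change in sample as well as the added predictors.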