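library(dplyr)    # group_by(), summarize(), n()
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")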
head(shot_logs)
##    GAME_ID                   MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT   CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made     Anderson, Alan                     101187            1.3   1
## 2      missed  Bogdanovic, Bojan                     202711            6.1   0
## 3      missed  Bogdanovic, Bojan                     202711            0.9   0
## 4      missed      Brown, Markel                     203900            3.4   0
## 5      missed    Young, Thaddeus                     201152            1.1   0
## 6      missed    Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148
Pair 1: Shot Distance vs. Field Goal Success Rate (FGM Percentage)
Response Variable: FGM Percentage (New Variable)
Explanatory Variable: Shot Distance (SHOT_DIST)
Why?: We expect that as shot distance increases, field goal success decreases.
Calculate FGM Percentage by Shot Distance
#Calculate FGM percentage for each shot distance#
shot_logs_summary <- shot_logs |>
  group_by(SHOT_DIST) |>
  summarize(
    total_shots = n(),
    made_shots = sum(FGM, na.rm = TRUE),
    fgm_percentage = made_shots / total_shots * 100
  )
head(shot_logs_summary)
## # A tibble: 6 × 4
##   SHOT_DIST total_shots made_shots fgm_percentage
##       <dbl>       <int>      <int>          <dbl>
## 1       0             4          2           50
## 2       0.1          58         32           55.2
## 3       0.2         100         64           64
## 4       0.3         163        111           68.1
## 5       0.4         205        132           64.4
## 6       0.5         288        188           65.3
#Scatter plot with trend line#
ggplot(shot_logs_summary, aes(x = SHOT_DIST, y = fgm_percentage)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Shot Distance vs. Field Goal Percentage",
    x = "Shot Distance (ft)",
    y = "FGM Percentage (%)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Conclusions from the Plot:
Negative trend: As SHOT_DIST increases, FGM% decreases.
Possible outliers: Some mid-range distances have unexpectedly high or low success rates.
Why it makes sense: Longer shots are harder to make due to defender pressure and shooting difficulty. Shots beyond 30 ft are long 3s, and shots beyond 35 ft are very long. The data gets sporadic past 30 ft, most likely because each of those distances has only a few attempts, so a single make or miss swings the percentage.
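One way to check whether the scatter at long range comes from sparse bins is to look at the attempt counts and compare the correlation with and without the thin bins. This is a minimal sketch in base R, not the method used in this report: `toy_summary` is a made-up stand-in for `shot_logs_summary`, and the `min_shots` cutoff of 30 is a hypothetical choice.

```r
# Toy stand-in for shot_logs_summary (illustrative values, NOT from the data set)
toy_summary <- data.frame(
  SHOT_DIST   = c(5, 15, 25, 32, 40),
  total_shots = c(400, 350, 300, 6, 2),
  made_shots  = c(240, 140, 105, 1, 1)
)
toy_summary$fgm_percentage <- toy_summary$made_shots / toy_summary$total_shots * 100

# Hypothetical cutoff: keep only distances with enough attempts
# for the percentage to be reasonably stable
min_shots <- 30
stable <- toy_summary[toy_summary$total_shots >= min_shots, ]

# Compare the correlation with and without the sparse bins
cor_all    <- cor(toy_summary$SHOT_DIST, toy_summary$fgm_percentage)
cor_stable <- cor(stable$SHOT_DIST, stable$fgm_percentage)
print(c(all_bins = cor_all, stable_bins = cor_stable))
```

With the real data, the same filter could be applied to `shot_logs_summary` before plotting; the right cutoff is a judgment call.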
#Compute correlation coefficient for SHOT_DIST vs. FGM Percentage#
correlation_1 <- cor(shot_logs_summary$SHOT_DIST, shot_logs_summary$fgm_percentage, use = "complete.obs")
print(correlation_1)
## [1] -0.8767138
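Because the correlation coefficient (-0.88) is close to -1, there is a strong negative linear relationship between shot distance and FGM% in the summarized data: as shot distance increases, FGM% decreases. Note that this correlation is computed on the per-distance percentages, not on individual shots.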
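The correlation coefficient is a point estimate, not a confidence interval. If an interval for the correlation itself is wanted, base R's `cor.test()` reports one. The sketch below uses simulated data (the `dist` and `pct` vectors are assumptions for illustration, not the shot logs), so its numbers say nothing about the real relationship.

```r
set.seed(1)
# Simulated stand-in for the per-distance summary (NOT the real shot data)
dist <- seq(0, 30, by = 0.5)
pct  <- 65 - 1.2 * dist + rnorm(length(dist), sd = 4)

# cor.test() returns Pearson's r together with a confidence interval for r
res <- cor.test(dist, pct, method = "pearson", conf.level = 0.95)
print(res$estimate)  # sample correlation
print(res$conf.int)  # 95% interval for the population correlation
```

On the report's data the call would be `cor.test(shot_logs_summary$SHOT_DIST, shot_logs_summary$fgm_percentage)`.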
#Calculate confidence interval for mean FGM%#
fgm_mean <- mean(shot_logs_summary$fgm_percentage, na.rm = TRUE)
fgm_sd <- sd(shot_logs_summary$fgm_percentage, na.rm = TRUE)
n <- nrow(shot_logs_summary)
error_margin <- qt(0.975, df = n-1) * (fgm_sd / sqrt(n))
#Confidence interval#
ci_lower <- fgm_mean - error_margin
ci_upper <- fgm_mean + error_margin
cat("95% Confidence Interval for FGM%:", ci_lower, "to", ci_upper, "\n")
## 95% Confidence Interval for FGM%: 28.48949 to 32.32078
We construct a 95% confidence interval for the mean FGM% across the distinct shot distances. Each distance contributes one percentage to this mean regardless of how many shots were taken from it, so the interval describes the average per-distance success rate, not the overall success rate of all NBA shots.
Conclusion: We are 95% confident that the mean per-distance FGM% lies between 28.5% and 32.3%. The interval is about 3.8 percentage points wide, so this mean is estimated fairly precisely. The width reflects both how much FGM% varies across distances and how many distinct distances there are; it does not by itself show how strongly distance affects shot success.
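The manual t-interval above can be cross-checked against base R's `t.test()`, which computes the same interval. A minimal sketch, using only the six percentages printed by `head(shot_logs_summary)` as a stand-in vector `x` (so the resulting interval is not the one reported for the full summary):

```r
# First six fgm_percentage values from the head() output; a stand-in for
# shot_logs_summary$fgm_percentage
x <- c(50, 55.2, 64, 68.1, 64.4, 65.3)

# Manual interval, same formula as in the report
n  <- length(x)
me <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual_ci <- c(mean(x) - me, mean(x) + me)

# Built-in equivalent: one-sample t interval for the mean
builtin_ci <- t.test(x, conf.level = 0.95)$conf.int

print(manual_ci)
print(builtin_ci)
```

Both approaches rest on the same assumption that the per-distance percentages behave like independent draws with a roughly normal sampling distribution for their mean.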
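Pair 2: Shot Clock vs. Field Goal Success Rate (FGM Percentage)
Response Variable: FGM Percentage
Explanatory Variable: Shot Clock (SHOT_CLOCK)
#Calculate FGM percentage based on shot clock remaining#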
shot_clock_summary <- shot_logs |>
  group_by(SHOT_CLOCK) |>
  summarize(
    total_shots = n(),
    made_shots = sum(FGM, na.rm = TRUE),
    fgm_percentage = made_shots / total_shots * 100
  )
head(shot_clock_summary)
## # A tibble: 6 × 4
##   SHOT_CLOCK total_shots made_shots fgm_percentage
##        <dbl>       <int>      <int>          <dbl>
## 1        0            75         13           17.3
## 2        0.1          62         18           29.0
## 3        0.2          63         12           19.0
## 4        0.3          70         13           18.6
## 5        0.4          86         20           23.3
## 6        0.5          96         32           33.3
#Scatter plot with trend line#
ggplot(shot_clock_summary, aes(x = SHOT_CLOCK, y = fgm_percentage)) +
  geom_point(color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Shot Clock vs. Field Goal Percentage",
    x = "Shot Clock (Seconds Remaining)",
    y = "FGM Percentage (%)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
Observations:
Positive trend: FGM% is higher when more time remains on the shot clock and falls as SHOT_CLOCK decreases.
Outliers present: Some rushed shots have high FGM%, possibly fast-break layups.
Expected pattern: Players take more forced shots with a low shot clock, reducing accuracy.
#Compute correlation coefficient for SHOT_CLOCK vs. FGM Percentage#
correlation_2 <- cor(shot_clock_summary$SHOT_CLOCK, shot_clock_summary$fgm_percentage, use = "complete.obs")
print(correlation_2)
## [1] 0.8097026
We can see a strong positive correlation here (0.81), though it is slightly weaker in magnitude than the shot distance correlation (-0.88). As the shot clock increases (more time remaining), FGM% generally increases. Because the shot clock counts down, it makes more sense to read it the other way around: as the shot clock runs down, FGM% decreases.
Why the Pearson Method? The cor() function in R uses Pearson’s correlation coefficient by default unless otherwise specified. Pearson correlation measures the strength of the linear relationship between two continuous variables, making it an appropriate choice when:
Both variables are continuous (e.g., SHOT_CLOCK vs. FGM%).
The relationship between the variables is assumed to be linear.
There are no extreme outliers significantly affecting the correlation.
For example, in my analysis of SHOT_CLOCK vs. FGM%, Pearson correlation helps determine whether FGM% increases or decreases in a linear fashion as shot clock time increases.
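If the linearity assumption is in doubt, the `method` argument of `cor()` allows a rank-based comparison. The sketch below contrasts Pearson and Spearman on simulated data with a curved but monotone relationship; the vectors `x` and `y` are made up for illustration and are not the shot clock summary.

```r
set.seed(42)
# Simulated monotone but curved relationship (NOT the real shot data)
x <- seq(1, 24, by = 0.5)
y <- 20 + 10 * log(x) + rnorm(length(x), sd = 1)

pearson  <- cor(x, y, method = "pearson")   # default: linear association
spearman <- cor(x, y, method = "spearman")  # rank-based: monotone association
print(c(pearson = pearson, spearman = spearman))
```

A large gap between the two values would suggest the relationship is monotone but not well described by a straight line.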
#Calculate confidence interval for mean FGM%#
fgm_mean_2 <- mean(shot_clock_summary$fgm_percentage, na.rm = TRUE)
fgm_sd_2 <- sd(shot_clock_summary$fgm_percentage, na.rm = TRUE)
n_2 <- nrow(shot_clock_summary)
error_margin_2 <- qt(0.975, df = n_2-1) * (fgm_sd_2 / sqrt(n_2))
#Confidence interval#
ci_lower_2 <- fgm_mean_2 - error_margin_2
ci_upper_2 <- fgm_mean_2 + error_margin_2
cat("95% Confidence Interval for FGM% (Shot Clock):", ci_lower_2, "to", ci_upper_2, "\n")
## 95% Confidence Interval for FGM% (Shot Clock): 43.8475 to 45.83305
We construct a 95% confidence interval for the mean FGM% across the distinct shot clock values.
Conclusion: We are 95% confident that the mean FGM% across shot clock values lies between 43.8% and 45.8%. The interval is narrow (about 2 percentage points), which means this mean is estimated precisely. It does not measure how much the shot clock affects FGM% or how predictable that effect is; the scatter plot and the correlation (0.81) describe that relationship. Because each shot clock value counts once regardless of how many shots were taken at it, the interval is also not the overall success rate of all shots.
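The interval above is built on an unweighted mean of per-value percentages, which can differ from the overall success rate because each shot clock value counts once no matter how many shots it holds. A short sketch of the difference, using only the six rows printed by `head(shot_clock_summary)` (so the numbers cover shot clock values 0 to 0.5 only, not the full data):

```r
# First six shot-clock rows from the head() output
total_shots <- c(75, 62, 63, 70, 86, 96)
made_shots  <- c(13, 18, 12, 13, 20, 32)
pct <- made_shots / total_shots * 100

unweighted <- mean(pct)                            # each shot-clock value counts once
weighted   <- weighted.mean(pct, w = total_shots)  # each shot counts once
overall    <- sum(made_shots) / sum(total_shots) * 100

print(c(unweighted = unweighted, weighted = weighted, overall = overall))
```

The shot-weighted mean equals the pooled percentage by construction, while the unweighted mean gives sparse and busy shot clock values the same influence.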