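library(dplyr)    # group_by(), summarize()
library(ggplot2)  # ggplot() and the geoms used in the plots
shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")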

head(shot_logs)

##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

1 Building Two Pairs

Pair 1: Shot Distance vs. Field Goal Success Rate (FGM Percentage)

Response Variable: FGM Percentage (new variable)

Explanatory Variable: Shot Distance (SHOT_DIST)

Why?: We expect that as shot distance increases, field goal success decreases.

Calculate FGM Percentage by Shot Distance

#Calculate FGM percentage for each shot distance#
shot_logs_summary <- shot_logs |>
  group_by(SHOT_DIST) |>
  summarize(
    total_shots = n(),
    made_shots = sum(FGM, na.rm = TRUE),
    fgm_percentage = made_shots / total_shots * 100
  )

head(shot_logs_summary)

## # A tibble: 6 × 4
##   SHOT_DIST total_shots made_shots fgm_percentage
##       <dbl>       <int>      <int>          <dbl>
## 1       0             4          2           50  
## 2       0.1          58         32           55.2
## 3       0.2         100         64           64  
## 4       0.3         163        111           68.1
## 5       0.4         205        132           64.4
## 6       0.5         288        188           65.3

2 Visualize

#Scatter plot with trend line#
ggplot(shot_logs_summary, aes(x = SHOT_DIST, y = fgm_percentage)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Shot Distance vs. Field Goal Percentage",
    x = "Shot Distance (ft)",
    y = "FGM Percentage (%)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Conclusions from the Plot:

Negative trend: As SHOT_DIST increases, FGM% decreases.

Possible outliers: Some mid-range shots have unexpectedly high or low success rates.

Why it makes sense: Longer shots are harder to make due to defender pressure and shooting difficulty. Shots beyond 30 ft are long 3s, and shots beyond 35 ft are very long. The data become sporadic beyond 30 ft, likely because few shots are taken from each of those distances.
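One way to tame the sporadic points at long range is to pool distances into wider bins and drop bins with very few attempts. The sketch below uses made-up numbers in a toy table with the same columns as shot_logs_summary; the 1-foot bin width and the min_shots cutoff are assumptions chosen for illustration, not values from this analysis.

```r
# Toy per-distance summary (made-up numbers, same columns as shot_logs_summary)
toy_summary <- data.frame(
  SHOT_DIST   = c(0.5, 0.9, 5.2, 5.7, 24.1, 24.8, 38.3, 41.0),
  total_shots = c(288, 310, 150, 140, 200, 180, 2, 1),
  made_shots  = c(188, 190, 60, 58, 72, 61, 1, 0)
)

# Pool distances into 1-foot bins so sparse distances are combined
toy_summary$dist_bin <- floor(toy_summary$SHOT_DIST)
binned <- aggregate(cbind(total_shots, made_shots) ~ dist_bin,
                    data = toy_summary, FUN = sum)
binned$fgm_percentage <- binned$made_shots / binned$total_shots * 100

# Keep only bins with enough attempts for the percentage to be meaningful
min_shots <- 20  # hypothetical cutoff
reliable <- binned[binned$total_shots >= min_shots, ]
print(reliable)
```

In the toy data the two bins beyond 35 ft hold three shots in total, so they are dropped and the remaining percentages are each based on hundreds of attempts.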

3 Correlation Analysis

#Compute correlation coefficient for SHOT_DIST vs. FGM Percentage#
correlation_1 <- cor(shot_logs_summary$SHOT_DIST, shot_logs_summary$fgm_percentage, use = "complete.obs")
print(correlation_1)

## [1] -0.8767138

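The correlation coefficient of about -0.88 is close to -1, which indicates a strong negative linear relationship: as shot distance increases, FGM% decreases. Note that this is computed on the per-distance summary, so each distinct distance counts equally regardless of how many shots were taken from it.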
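A correlation coefficient is a point estimate; if an interval around it is wanted, base R's cor.test() reports one alongside the estimate. The sketch below runs on simulated data (the dist_sim and pct_sim vectors are invented for illustration), not on the shot logs.

```r
# Simulated distance / percentage pairs with a negative linear trend plus noise
set.seed(1)
dist_sim <- seq(0, 30, by = 0.5)
pct_sim  <- 65 - 1.2 * dist_sim + rnorm(length(dist_sim), sd = 5)

# cor.test() returns the Pearson estimate and a 95% confidence interval for it
ct <- cor.test(dist_sim, pct_sim, method = "pearson", conf.level = 0.95)

r_estimate <- unname(ct$estimate)   # same value as cor(dist_sim, pct_sim)
r_interval <- as.numeric(ct$conf.int)

cat("r =", r_estimate, "\n")
cat("95% CI for r:", r_interval[1], "to", r_interval[2], "\n")
```

Applied to the real summary, the equivalent call would be cor.test(shot_logs_summary$SHOT_DIST, shot_logs_summary$fgm_percentage).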

4 Confidence Interval for FGM Percentage

#Calculate confidence interval for mean FGM%#
fgm_mean <- mean(shot_logs_summary$fgm_percentage, na.rm = TRUE)
fgm_sd <- sd(shot_logs_summary$fgm_percentage, na.rm = TRUE)
n <- nrow(shot_logs_summary)
error_margin <- qt(0.975, df = n-1) * (fgm_sd / sqrt(n))

#Confidence interval#
ci_lower <- fgm_mean - error_margin
ci_upper <- fgm_mean + error_margin

cat("95% Confidence Interval for FGM%:", ci_lower, "to", ci_upper, "\n")

## 95% Confidence Interval for FGM%: 28.48949 to 32.32078

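We construct a 95% confidence interval for the mean FGM% across the distinct shot distances in the summary table. Each distance counts as one observation regardless of how many shots were taken from it, so the interval describes the average of the per-distance percentages, not the overall make rate of all NBA shots. Sparse long-range distances with low FGM% can pull this average down.

Conclusion

We are 95% confident that the mean per-distance FGM% is between about 28.5% and 32.3%. The interval is roughly 3.8 percentage points wide, so the mean is estimated fairly precisely. The width reflects both how much FGM% varies across distances and how many distinct distances are in the summary: more variation widens the interval, and more distances narrow it.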
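The interval above averages one FGM% value per distance, which is not the same quantity as the make rate over all shots. The sketch below uses a made-up toy table to show the difference between the unweighted mean of per-distance percentages and the shot-weighted overall rate, and checks that base R's t.test() reproduces the manual qt()-based interval. All numbers and the toy_* names are illustrative assumptions.

```r
# Toy per-distance summary (made-up numbers)
toy_dist <- data.frame(
  SHOT_DIST   = c(1, 5, 15, 24, 40),
  total_shots = c(1000, 400, 300, 500, 2),
  made_shots  = c(620, 180, 120, 175, 0)
)
toy_dist$fgm_percentage <- toy_dist$made_shots / toy_dist$total_shots * 100

# Unweighted mean: every distance counts once, even the one with 2 shots
unweighted_mean <- mean(toy_dist$fgm_percentage)

# Overall make rate: every shot counts once
overall_rate  <- sum(toy_dist$made_shots) / sum(toy_dist$total_shots) * 100
weighted_mean <- weighted.mean(toy_dist$fgm_percentage, toy_dist$total_shots)

# Manual t-based 95% interval for the unweighted mean
n_toy      <- nrow(toy_dist)
margin_toy <- qt(0.975, df = n_toy - 1) * sd(toy_dist$fgm_percentage) / sqrt(n_toy)
manual_ci  <- c(unweighted_mean - margin_toy, unweighted_mean + margin_toy)

# t.test() computes the same interval
ttest_ci <- as.numeric(t.test(toy_dist$fgm_percentage, conf.level = 0.95)$conf.int)

cat("Unweighted mean of per-distance FGM%:", unweighted_mean, "\n")
cat("Overall (shot-weighted) FGM%:", overall_rate, "\n")
```

In the toy table the 40 ft row contains only two shots, yet it carries the same weight as the 1 ft row in the unweighted mean, which is why the two summaries differ.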

5 Second Variable Pair - Shot Clock vs. FGM Percentage

#Calculate FGM percentage based on shot clock remaining#
shot_clock_summary <- shot_logs |>
  group_by(SHOT_CLOCK) |>
  summarize(
    total_shots = n(),
    made_shots = sum(FGM, na.rm = TRUE),
    fgm_percentage = made_shots / total_shots * 100
  )

head(shot_clock_summary)

## # A tibble: 6 × 4
##   SHOT_CLOCK total_shots made_shots fgm_percentage
##        <dbl>       <int>      <int>          <dbl>
## 1        0            75         13           17.3
## 2        0.1          62         18           29.0
## 3        0.2          63         12           19.0
## 4        0.3          70         13           18.6
## 5        0.4          86         20           23.3
## 6        0.5          96         32           33.3

6 Visualizing Shot Clock vs. FGM Percentage

#Scatter plot with trend line#
ggplot(shot_clock_summary, aes(x = SHOT_CLOCK, y = fgm_percentage)) +
  geom_point(color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Shot Clock vs. Field Goal Percentage",
    x = "Shot Clock (Seconds Remaining)",
    y = "FGM Percentage (%)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 1 rows containing missing values (`geom_point()`).

Observations:

Downward trend: As SHOT_CLOCK decreases, FGM% also decreases.

Outliers present: Some rushed shots have high FGM%, possibly fast-break layups.

Expected pattern: Players take more forced shots with low shot clock, reducing accuracy.

7 Correlation Analysis

#Compute correlation coefficient for SHOT_CLOCK vs. FGM Percentage#
correlation_2 <- cor(shot_clock_summary$SHOT_CLOCK, shot_clock_summary$fgm_percentage, use = "complete.obs")
print(correlation_2)

## [1] 0.8097026

We can see a strong positive correlation here (about 0.81), slightly weaker in magnitude than the -0.88 found for shot distance. As the shot clock value increases (more time remaining), FGM% generally increases. Because the shot clock counts down, it makes more sense to read this the other way around: as the shot clock runs down, FGM% decreases.

Why the Pearson Method?

The cor() function in R uses Pearson’s correlation coefficient by default unless otherwise specified. Pearson correlation measures the linear relationship between two continuous variables, making it the most appropriate choice when:

- Both variables are continuous (e.g., SHOT_CLOCK vs. FGM%).
- You assume a linear relationship between the variables.
- There are no extreme outliers significantly affecting the correlation.

For example, in my analysis of SHOT_CLOCK vs. FGM%, Pearson correlation helps determine whether FGM% increases or decreases in a linear fashion as shot clock time increases.
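To make the Pearson coefficient concrete, the sketch below computes it from its definition, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²), and compares it with cor(). It also computes Spearman's rank correlation on the same points for contrast. The clock_toy and pct_toy vectors are invented values, chosen to rise monotonically but not linearly.

```r
# Toy data: FGM% rises with shot clock, but the increase flattens out
clock_toy <- c(2, 6, 10, 14, 18, 22)
pct_toy   <- c(30, 38, 42, 45, 46, 47)

# Pearson's r from its definition
dx <- clock_toy - mean(clock_toy)
dy <- pct_toy - mean(pct_toy)
pearson_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in Pearson (the default method of cor())
pearson_builtin <- cor(clock_toy, pct_toy)

# Spearman uses ranks, so it only asks whether the relationship is monotonic
spearman <- cor(clock_toy, pct_toy, method = "spearman")

cat("Pearson (manual): ", pearson_manual, "\n")
cat("Pearson (cor()):  ", pearson_builtin, "\n")
cat("Spearman:         ", spearman, "\n")
```

Because the toy relationship is monotonic but curved, Spearman reaches its maximum while Pearson stays below 1, which illustrates why the linearity assumption matters when choosing Pearson.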

8 Confidence Interval for FGM% Based on Shot Clock

#Calculate confidence interval for mean FGM%#
fgm_mean_2 <- mean(shot_clock_summary$fgm_percentage, na.rm = TRUE)
fgm_sd_2 <- sd(shot_clock_summary$fgm_percentage, na.rm = TRUE)
n_2 <- nrow(shot_clock_summary)  # counts every row, including the group with missing SHOT_CLOCK
error_margin_2 <- qt(0.975, df = n_2-1) * (fgm_sd_2 / sqrt(n_2))

#Confidence interval#
ci_lower_2 <- fgm_mean_2 - error_margin_2
ci_upper_2 <- fgm_mean_2 + error_margin_2

cat("95% Confidence Interval for FGM% (Shot Clock):", ci_lower_2, "to", ci_upper_2, "\n")

## 95% Confidence Interval for FGM% (Shot Clock): 43.8475 to 45.83305

We construct a 95% confidence interval for the mean FGM% across the distinct shot clock values in the summary table. As with shot distance, each shot clock value counts as one observation. The interval estimates that mean; it does not by itself measure how shot clock time affects FGM%, which is what the correlation and the trend line describe.

Conclusion

We are 95% confident that the mean FGM% across shot clock values is between about 43.8% and 45.8%. The interval is only about 2 percentage points wide, so the mean is estimated precisely. A wide interval would indicate that FGM% varies a lot between shot clock values relative to the number of values in the summary.
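If the goal is to quantify how much FGM% changes per second of shot clock, a confidence interval for the regression slope is a more direct tool than an interval for the mean. The sketch below fits lm() to simulated data and extracts the slope with confint(); the clock_sim and pct_sim vectors and their generating values are assumptions for illustration only.

```r
# Simulated shot clock / FGM% pairs with a positive linear trend plus noise
set.seed(42)
clock_sim <- seq(0, 24, by = 0.5)
pct_sim   <- 35 + 0.6 * clock_sim + rnorm(length(clock_sim), sd = 3)

# Simple linear regression: the same model geom_smooth(method = "lm") draws
fit <- lm(pct_sim ~ clock_sim)

# Slope = estimated change in FGM% per additional second on the shot clock
slope    <- unname(coef(fit)["clock_sim"])
slope_ci <- as.numeric(confint(fit, "clock_sim", level = 0.95))

cat("Estimated slope:", slope, "percentage points per second\n")
cat("95% CI for slope:", slope_ci[1], "to", slope_ci[2], "\n")
```

On the real summary the analogous model would be lm(fgm_percentage ~ SHOT_CLOCK, data = shot_clock_summary); an interval for the slope that excludes zero would support the claim that FGM% changes with shot clock time.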

Data Dive 6

jace vayhinger

2025-02-18