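library(dplyr)    # filter(), pull(), mutate(), group_by(), summarize()
library(ggplot2)  # ggplot() and related plotting functions
library(pwr)      # pwr.t.test()

shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")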

head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

Introduction

This analysis explores two different null hypotheses based on the NBA Shot Logs dataset. We use:

Neyman-Pearson Hypothesis Testing (calculating power, setting Type I/Type II error levels, and determining sample size).

Fisher’s Significance Testing (interpreting p-values to assess statistical significance).

Hypothesis 1

Null Hypothesis (H₀): The average field goal percentage (FGM%) is the same for short-range shots (≤15 ft) and long-range shots (>15 ft).

Alternative Hypothesis (H₁): The average field goal percentage is lower for long-range shots (>15 ft) than for short-range shots (≤15 ft).

#Calculate sample size required for a two-sample t-test#
effect_size <- 0.5
power_level <- 0.8
alpha_level <- 0.05

sample_size <- pwr.t.test(d = effect_size, power = power_level, sig.level = alpha_level, type = "two.sample", alternative = "two.sided")$n
cat("Required Sample Size per Group:", ceiling(sample_size), "\n")
## Required Sample Size per Group: 64

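I chose an alpha level of 0.05 because it is a commonly accepted threshold for statistical significance and caps the Type I error rate at 5%. The power level was set at 0.8, which gives an 80% chance of detecting a true effect and limits the Type II error rate to 20%. The effect size of 0.5 is Cohen's conventional “medium” standardized difference. Note that the sample size above was calculated for a two-sided test, while the t-test below is one-sided, so 64 shots per group is slightly conservative.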
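As a cross-check on how the test direction affects the required sample size, the same calculation can be sketched with base R's `power.t.test()`. This is a swapped-in function, not the `pwr.t.test()` call used above, and the variable names are illustrative. It assumes a standardized effect of 0.5, expressed as `delta = 0.5` with `sd = 1`.

```r
# Sketch: sample size per group for two-sided vs. one-sided two-sample t-tests.
# Assumes a standardized effect of 0.5, i.e. delta = 0.5 with sd = 1.
n_two_sided <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
                            type = "two.sample", alternative = "two.sided")$n
n_one_sided <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
                            type = "two.sample", alternative = "one.sided")$n

cat("Two-sided, per group:", ceiling(n_two_sided), "\n")
cat("One-sided, per group:", ceiling(n_one_sided), "\n")
```

The one-sided requirement comes out smaller than the two-sided one, so planning with the two-sided figure errs on the side of collecting more data.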

#Create subsets for short-range and long-range shots#
short_range <- shot_logs |> filter(SHOT_DIST <= 15) |> pull(FGM)
long_range <- shot_logs |> filter(SHOT_DIST > 15) |> pull(FGM)

#Perform independent t-test#
t_test_result <- t.test(short_range, long_range, alternative = "greater", var.equal = FALSE)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  short_range and long_range
## t = 55.063, df = 127224, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1469915       Inf
## sample estimates:
## mean of x mean of y 
## 0.5245769 0.3730592

The p-value (< 2.2e-16) is far below 0.05, so we reject the null hypothesis and conclude that long-range shots (>15 ft) have a significantly lower FGM% than short-range shots (≤15 ft): about 37.3% versus 52.5%.
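Because FGM is a 0/1 outcome, the size of the gap can also be expressed as a difference in proportions. The following is a minimal sketch that uses only the two sample means printed in the t-test output above. Cohen's h is an effect-size measure brought in here for illustration; it is not part of the original analysis.

```r
# Sample make rates copied from the t-test output above.
p_short <- 0.5245769  # shots from <= 15 ft
p_long  <- 0.3730592  # shots from > 15 ft

# Raw gap in make rate (as a proportion).
diff_prop <- p_short - p_long

# Cohen's h: difference of arcsine-transformed proportions.
cohens_h <- 2 * asin(sqrt(p_short)) - 2 * asin(sqrt(p_long))

cat("Difference in FGM%:", round(diff_prop * 100, 1), "percentage points\n")
cat("Cohen's h:", round(cohens_h, 2), "\n")
```

The gap is about 15 percentage points, and h comes out at roughly 0.3. By Cohen's conventions that sits between a small (0.2) and a medium (0.5) effect, so the very small p-value mostly reflects the large number of shots in the data.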

Visualization 1: Bar Chart of FGM% for Short vs. Long-Range Shots

#Calculate FGM% for each shot distance category#
fgm_summary <- shot_logs |> 
  mutate(shot_category = ifelse(SHOT_DIST <= 15, "≤15 ft", ">15 ft")) |> 
  group_by(shot_category) |> 
  summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)

#Bar chart of FGM% by shot distance category#
ggplot(fgm_summary, aes(x = shot_category, y = fgm_percentage, fill = shot_category)) +
  geom_bar(stat = "identity", alpha = 0.7, width = 0.5) +
  labs(
    title = "FGM% for Short vs. Long-Range Shots",
    x = "Shot Distance Category",
    y = "Field Goal Percentage (%)"
  ) +
  scale_fill_manual(values = c("≤15 ft" = "blue", ">15 ft" = "red")) +
  theme_minimal() +
  theme(legend.position = "none")

This chart is consistent with the alternative hypothesis: short-range shots have a higher FGM% than long-range shots. Long-range shots are made about 37% of the time and short-range shots about 52% of the time, matching the sample means reported by the t-test.

Hypothesis 2

Null Hypothesis (H₀): There is no relationship between the shot clock time and the field goal percentage.

Alternative Hypothesis (H₁): Shots taken with less time on the shot clock have a lower field goal percentage.

filtered_data <- shot_logs |> filter(!is.na(SHOT_CLOCK))
#Bin SHOT_CLOCK into Early (>=12s) and Late (<12s)#
filtered_data <- filtered_data |> 
  mutate(SHOT_CLOCK_BIN = ifelse(SHOT_CLOCK >= 12, "Early", "Late"))

#Convert FGM to factor (0 = Missed, 1 = Made)#
filtered_data$FGM <- as.factor(filtered_data$FGM)

#Perform Fisher's Exact Test on binned SHOT_CLOCK and FGM#
fisher_test_result <- fisher.test(table(filtered_data$SHOT_CLOCK_BIN, filtered_data$FGM))
print(fisher_test_result)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(filtered_data$SHOT_CLOCK_BIN, filtered_data$FGM)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7575553 0.7927184
## sample estimates:
## odds ratio 
##  0.7749621

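Here the p-value is less than 0.05, so we can reject the null hypothesis that “there is no relationship between the shot clock time and the field goal percentage.” The test as run is two-sided, so the direction comes from the odds ratio: with rows ordered Early, Late and columns ordered 0 (missed), 1 (made), an odds ratio of about 0.77 means the odds of a miss are lower for Early shots than for Late shots. This supports the alternative hypothesis that “shots taken with less time on the shot clock have a lower field goal percentage.”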
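To make the reading of the odds ratio concrete, here is a small sketch on a made-up 2x2 table laid out the same way as the one passed to `fisher.test()` (rows Early/Late, columns 0/1). The counts are hypothetical and chosen only for illustration. The sketch also shows the `alternative = "less"` argument, which would test the directional hypothesis directly.

```r
# Hypothetical counts for illustration only; these are NOT from the shot log data.
toy <- matrix(c(40, 60,
                55, 45),
              nrow = 2, byrow = TRUE,
              dimnames = list(clock = c("Early", "Late"),
                              FGM   = c("0", "1")))

# Sample odds ratio: odds of a miss for Early divided by odds of a miss for Late.
sample_or <- (toy["Early", "0"] * toy["Late", "1"]) /
             (toy["Early", "1"] * toy["Late", "0"])

# Two-sided test (the default) and a one-sided test of odds ratio < 1.
two_sided <- fisher.test(toy)
one_sided <- fisher.test(toy, alternative = "less")

cat("Sample odds ratio:", round(sample_or, 3), "\n")
cat("Two-sided p-value:", signif(two_sided$p.value, 3), "\n")
cat("One-sided p-value:", signif(one_sided$p.value, 3), "\n")
```

An odds ratio below 1 in this layout means Early shots miss less often than Late shots. The estimate that `fisher.test()` prints is a conditional maximum likelihood estimate, so it can differ slightly from the simple cross-product ratio computed by hand.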

Visualization 2: Line Chart of FGM% Over Shot Clock Time

#Group data by shot clock intervals and calculate FGM%#
shot_clock_summary <- shot_logs |> 
  filter(!is.na(SHOT_CLOCK)) |>  # drop shots with no recorded shot clock so no NA bin is plotted
  mutate(shot_clock_bin = cut(SHOT_CLOCK, breaks = seq(0, 24, by = 2), include.lowest = TRUE)) |> 
  group_by(shot_clock_bin) |> 
  summarize(fgm_percentage = mean(FGM, na.rm = TRUE) * 100)

#Line plot of FGM% over shot clock intervals#
ggplot(shot_clock_summary, aes(x = shot_clock_bin, y = fgm_percentage, group = 1)) +
  geom_line(color = "green", linewidth = 1) +
  geom_point(color = "darkgreen", size = 2) +
  labs(
    title = "FGM% vs. Shot Clock Time",
    x = "Shot Clock (Binned in 2-Second Intervals)",
    y = "Field Goal Percentage (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This visual shows FGM% falling as the time left on the shot clock decreases, which supports the alternative hypothesis. There is a large drop-off in FGM% when fewer than 6 seconds remain on the shot clock. The “sweet spot” appears to be when there are 18-24 seconds on the shot clock. These shots could be fast breaks, catch-and-shoot plays, or defensive breakdowns that allowed quick baskets.
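The 2-second bins on the x-axis come from `cut()`. The sketch below runs the same `cut()` call on a few made-up shot clock values (hypothetical, not taken from the data) to show where bin edges and missing values end up.

```r
# Hypothetical shot clock readings, chosen to hit bin edges and a missing value.
clock <- c(0, 1.9, 2, 10.8, 24, NA)

bins <- cut(clock, breaks = seq(0, 24, by = 2), include.lowest = TRUE)
print(data.frame(clock, bin = bins))

# Intervals are closed on the right, so 2.0 falls in [0,2] rather than (2,4].
# A missing shot clock stays NA and would form its own group in group_by().
```

Reading the chart with this in mind, the label `(4,6]` covers shots taken with more than 4 and up to 6 seconds left.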