This is my week 4 data dive analyzing NBA shooting statistics taking samples of three-point attempts across the league from the full dataset and conducting deeper analyses on these cases to understand their significance.
# change this number, and consider how it affects the sub-sample analysis
sample_frac = 0.25
# number of samples to scrutinize
n_samples = 3
df_samples = tibble() # empty dataframe to append to
for (sample_i in 1:n_samples) {
df_i <- df |>
sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
mutate(sample_num = sample_i) # add a column indicating sample number
df_samples = bind_rows(df_samples, df_i)
}
sample_3p_summary <-
df_samples |>
group_by(sample_num) |>
summarise(
n_players = n(),
mean_fga_3p = mean(fga_3p, na.rm = TRUE),
median_fga_3p = median(fga_3p, na.rm = TRUE),
sd_fga_3p = sd(fga_3p, na.rm = TRUE),
q1_fga_3p = quantile(fga_3p, 0.25, na.rm = TRUE),
q3_fga_3p = quantile(fga_3p, 0.75, na.rm = TRUE),
.groups = "drop"
)
sample_3p_summary |>
print(n = Inf)
## # A tibble: 3 × 7
## sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 814 0.409 0.421 0.220 0.278 0.546
## 2 2 814 0.406 0.423 0.224 0.274 0.541
## 3 3 814 0.388 0.410 0.229 0.244 0.535
Across the three random samples, the mean and median three-point-attempt rate are very similar. Mean FGA_3p ranges from 0.399 to 0.408 and median FGA_3p ranges from 0.413 to 0.428. Even the quartiles are very similar across samples. Across all three 25% samples, median 3PA stays ~0.41–0.43 and the IQR stays near ~0.26–0.55, indicating a stable league-wide distribution. This means that league wide 3-point reliance is a stable characteristic of the dataset, not sensitive to random sampling at 25% level. This aligns with the trend of the league as a whole shooting many threes to maximize offensive/team success.
Further Question: Would this stability hold for other groups such as guards or high-minute players?
sample_3p_anomalies <-
sample_3p_summary |>
mutate(
z_mean = (mean_fga_3p - mean(mean_fga_3p)) / sd(mean_fga_3p),
anomaly_flag = abs(z_mean) >= 1
) |>
arrange(desc(abs(z_mean)))
sample_3p_anomalies
## # A tibble: 3 × 9
## sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 814 0.388 0.410 0.229 0.244 0.535
## 2 1 814 0.409 0.421 0.220 0.278 0.546
## 3 2 814 0.406 0.423 0.224 0.274 0.541
## # ℹ 2 more variables: z_mean <dbl>, anomaly_flag <lgl>
Analyzing the z-scores, the first sample having a 1.127 z-score indicates the anomaly. However, the difference in means is small. This can be misleading in a small number of samples. The sub-sample may appear unusual relative to others considering the change is minimal. In sample 1, the FGA_3P is unusually high relative to other samples, but the size is so small, it’s difficult to make a real anomaly claim.
Further Question: How often do anomalies appear as sample sizes increases or more samples are added?
make_samples_and_summarise <- function(df_in, sample_frac, n_samples, seed = 123) {
set.seed(seed)
df_samples <- tibble()
for (sample_i in 1:n_samples) {
df_i <-
df_in |>
sample_n(size = round(sample_frac * nrow(df_in)), replace = TRUE) |>
mutate(sample_num = sample_i, sample_frac = sample_frac)
df_samples <- bind_rows(df_samples, df_i)
}
summary <-
df_samples |>
group_by(sample_frac, sample_num) |>
summarise(
n_players = n(),
mean_fga_3p = mean(fga_3p, na.rm = TRUE),
median_fga_3p = median(fga_3p, na.rm = TRUE),
sd_fga_3p = sd(fga_3p, na.rm = TRUE),
.groups = "drop"
)
list(samples = df_samples, summary = summary)
}
res_10 <- make_samples_and_summarise(df, 0.10, n_samples = 3, seed = 123)
res_25 <- make_samples_and_summarise(df, 0.25, n_samples = 3, seed = 123)
res_75 <- make_samples_and_summarise(df, 0.75, n_samples = 3, seed = 123)
compare_summary <-
bind_rows(res_10$summary, res_25$summary, res_75$summary) |>
arrange(sample_frac, sample_num)
compare_summary |>
print(n = Inf)
## # A tibble: 9 × 6
## sample_frac sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p
## <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 0.1 1 326 0.400 0.418 0.224
## 2 0.1 2 326 0.417 0.424 0.209
## 3 0.1 3 326 0.377 0.397 0.220
## 4 0.25 1 814 0.405 0.417 0.218
## 5 0.25 2 814 0.398 0.404 0.217
## 6 0.25 3 814 0.415 0.442 0.230
## 7 0.75 1 2443 0.406 0.419 0.222
## 8 0.75 2 2443 0.399 0.409 0.225
## 9 0.75 3 2443 0.395 0.403 0.228
As sample size increases 10% sample shows the widest spread in mean and median FGA_3P while 75% converges tightly around the league averages reducing uncertainty and minimize influence of extreme cases.
Further Question: At what percentile do estimates stabilize enough for decision-making?
Impact on Future Conclusions: After conducting analysis on FGA-3P, I would change how confidently the results should be interpreted. Specifically looking at a small sample or subgroup like team or position, I would account for sensitivity to sampling noise as a factor in my analysis. For league-wide conclusions, the sample size seems more sufficient for stable estimates compared to team and position levels where greater variability appears, leading to the requirement of avoiding overconfident statements.