Week 4 Data Dive Overview

This is my Week 4 data dive on NBA shooting statistics. I take random samples of three-point attempts across the league from the full dataset and analyze these samples in more depth to understand their significance.

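library(dplyr)  # sample_n(), mutate(), bind_rows(), summarise(); also re-exports tibble()

# `df` is the full dataset and is assumed to be loaded already

# change this number, and consider how it affects the sub-sample analysis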
sample_frac = 0.25

# number of samples to scrutinize
n_samples = 3

df_samples <- tibble()  # empty data frame to append to

for (sample_i in 1:n_samples) {
  df_i <- df |>
    # sample rows with replacement; round() keeps the sample size a whole number
    sample_n(size = round(sample_frac * nrow(df)), replace = TRUE) |>
    mutate(sample_num = sample_i)  # add a column indicating sample number

  df_samples <- bind_rows(df_samples, df_i)
}

sample_3p_summary <-
  df_samples |>
  group_by(sample_num) |>
  summarise(
    n_players      = n(),
    mean_fga_3p    = mean(fga_3p, na.rm = TRUE),
    median_fga_3p  = median(fga_3p, na.rm = TRUE),
    sd_fga_3p      = sd(fga_3p, na.rm = TRUE),
    q1_fga_3p      = quantile(fga_3p, 0.25, na.rm = TRUE),
    q3_fga_3p      = quantile(fga_3p, 0.75, na.rm = TRUE),
    .groups = "drop"
  )

sample_3p_summary |>
  print(n = Inf)
## # A tibble: 3 × 7
##   sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
##        <int>     <int>       <dbl>         <dbl>     <dbl>     <dbl>     <dbl>
## 1          1       814       0.409         0.421     0.220     0.278     0.546
## 2          2       814       0.406         0.423     0.224     0.274     0.541
## 3          3       814       0.388         0.410     0.229     0.244     0.535

Across the three random samples, the mean and median three-point attempt rates are very similar. Mean FGA_3P ranges from 0.388 to 0.409 and median FGA_3P ranges from 0.410 to 0.423. The quartiles are also close: Q1 stays between about 0.24 and 0.28 and Q3 between about 0.54 and 0.55, so the middle half of the distribution barely moves from sample to sample. This suggests that league-wide three-point reliance is a stable characteristic of the dataset and is not sensitive to random sampling at the 25% level. It is also consistent with the league-wide trend of teams shooting many threes to maximize offensive success.

Further Question: Would this stability hold for other groups such as guards or high-minute players?
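
One way to start on this question is to repeat the resampling inside a subgroup and compare how much the sample means move around. The sketch below uses synthetic data only: the `position` column, the simulated `fga_3p` values, and the helper `resample_means()` are hypothetical stand-ins, not the real dataset or part of the analysis above.

```r
# Synthetic stand-in data. `position` and the simulated `fga_3p` values are
# hypothetical; the beta draw only roughly mimics a rate between 0 and 1.
set.seed(42)
n_rows <- 3000
toy <- data.frame(
  position = sample(c("G", "F", "C"), n_rows, replace = TRUE,
                    prob = c(0.45, 0.40, 0.15)),
  fga_3p   = rbeta(n_rows, 1.6, 2.4)
)

# Draw repeated samples (with replacement) and return the mean of each one
resample_means <- function(x, frac = 0.25, n_samples = 200) {
  size <- round(frac * length(x))
  replicate(n_samples, mean(sample(x, size = size, replace = TRUE)))
}

league_means <- resample_means(toy$fga_3p)
guard_means  <- resample_means(toy$fga_3p[toy$position == "G"])

# Spread of the sample means: a larger value means a less stable estimate
spread <- c(league = sd(league_means), guards = sd(guard_means))
print(round(spread, 4))
```

Because a 25% sample of a subgroup contains fewer rows than a 25% sample of the whole league, the subgroup means should vary more, even when the underlying distribution is the same.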

sample_3p_anomalies <-
  sample_3p_summary |>
  mutate(
    z_mean = (mean_fga_3p - mean(mean_fga_3p)) / sd(mean_fga_3p),
    anomaly_flag = abs(z_mean) >= 1 
  ) |>
  arrange(desc(abs(z_mean)))

sample_3p_anomalies
## # A tibble: 3 × 9
##   sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
##        <int>     <int>       <dbl>         <dbl>     <dbl>     <dbl>     <dbl>
## 1          3       814       0.388         0.410     0.229     0.244     0.535
## 2          1       814       0.409         0.421     0.220     0.278     0.546
## 3          2       814       0.406         0.423     0.224     0.274     0.541
## # ℹ 2 more variables: z_mean <dbl>, anomaly_flag <lgl>

Looking at the z-scores, the top row of the sorted table (sample 3) has the largest magnitude, 1.127, and is the one flagged as an anomaly. However, the difference in means is small: sample 3's mean FGA_3P of 0.388 is only about 0.02 below the other two. With so few samples this flag can be misleading, because each z-score is measured against just two other values, so a sample can look unusual even when the actual change is minimal. Sample 3's FGA_3P is low relative to the other samples, but with only three samples it is difficult to make a real anomaly claim.
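
There is also an arithmetic reason for caution. When z-scores are computed from only three values using the sample standard deviation, they sum to zero and their squares sum to 2, which forces the largest |z| to lie between 1 and 2/sqrt(3) (about 1.155). A threshold of |z| >= 1 therefore flags at least one sample in every run, whatever the data look like. A minimal base-R check, using simulated sample means (the values are synthetic and `max_abs_z()` is a hypothetical helper):

```r
set.seed(1)

# Largest |z| among k simulated "sample means", z-scored against each other
max_abs_z <- function(k) {
  m <- rnorm(k, mean = 0.40, sd = 0.01)  # synthetic sample means
  max(abs((m - mean(m)) / sd(m)))
}

max_z_3 <- replicate(5000, max_abs_z(3))

# With k = 3 the largest |z| is confined to [1, 2 / sqrt(3)]
print(range(max_z_3))
print(2 / sqrt(3))

# Share of runs in which the |z| >= 1 rule flags at least one sample
# (small tolerance guards against floating-point rounding)
print(mean(max_z_3 >= 1 - 1e-9))
```

The same function can be called with a larger `k` to see how the flag rate behaves once more samples are drawn and the bound no longer binds.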

Further Question: How often do anomalies appear as the sample size increases or as more samples are added?

make_samples_and_summarise <- function(df_in, sample_frac, n_samples, seed = 123) {
  set.seed(seed)
  
  df_samples <- tibble()
  
  for (sample_i in 1:n_samples) {
    df_i <-
      df_in |>
      sample_n(size = round(sample_frac * nrow(df_in)), replace = TRUE) |>
      mutate(sample_num = sample_i, sample_frac = sample_frac)
    
    df_samples <- bind_rows(df_samples, df_i)
  }
  
  summary <-
    df_samples |>
    group_by(sample_frac, sample_num) |>
    summarise(
      n_players     = n(),
      mean_fga_3p   = mean(fga_3p, na.rm = TRUE),
      median_fga_3p = median(fga_3p, na.rm = TRUE),
      sd_fga_3p     = sd(fga_3p, na.rm = TRUE),
      .groups = "drop"
    )
  
  list(samples = df_samples, summary = summary)
}

res_10 <- make_samples_and_summarise(df, 0.10, n_samples = 3, seed = 123)
res_25 <- make_samples_and_summarise(df, 0.25, n_samples = 3, seed = 123)
res_75 <- make_samples_and_summarise(df, 0.75, n_samples = 3, seed = 123)


compare_summary <-
  bind_rows(res_10$summary, res_25$summary, res_75$summary) |>
  arrange(sample_frac, sample_num)

compare_summary |>
  print(n = Inf)
## # A tibble: 9 × 6
##   sample_frac sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p
##         <dbl>      <int>     <int>       <dbl>         <dbl>     <dbl>
## 1        0.1           1       326       0.400         0.418     0.224
## 2        0.1           2       326       0.417         0.424     0.209
## 3        0.1           3       326       0.377         0.397     0.220
## 4        0.25          1       814       0.405         0.417     0.218
## 5        0.25          2       814       0.398         0.404     0.217
## 6        0.25          3       814       0.415         0.442     0.230
## 7        0.75          1      2443       0.406         0.419     0.222
## 8        0.75          2      2443       0.399         0.409     0.225
## 9        0.75          3      2443       0.395         0.403     0.228

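The spread generally narrows as sample size increases. The 10% samples show the widest spread in mean FGA_3P (0.377 to 0.417), while the 75% samples cluster most tightly in both mean (0.395 to 0.406) and median (0.403 to 0.419). The median is noisier with only three draws: in this run the 25% samples span a slightly wider median range (0.404 to 0.442) than the 10% samples (0.397 to 0.424). Overall, larger samples reduce uncertainty and minimize the influence of extreme cases.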
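
This pattern matches the usual standard-error rule, under which the standard deviation of a sample mean is roughly sd / sqrt(n). A rough calculation, assuming a per-row standard deviation of about 0.22 (read off the sd_fga_3p column above) and the three sample sizes in the table:

```r
sd_fga_3p <- 0.22  # approximate value taken from the sd_fga_3p column
n <- c(`10%` = 326, `25%` = 814, `75%` = 2443)

# Approximate standard error of the sample mean at each sample size
se_mean <- sd_fga_3p / sqrt(n)
print(round(se_mean, 4))

# How much noisier the 10% sample means are expected to be than the 75% ones
se_ratio <- se_mean[["10%"]] / se_mean[["75%"]]
print(se_ratio)
```

Under these assumptions the 10% sample means are expected to vary roughly 2.7 times as much as the 75% sample means, since the ratio depends only on the square root of the ratio of the sample sizes.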

Further Question: At what sample fraction do estimates stabilize enough for decision-making?

Impact on Future Conclusions: After this analysis of FGA_3P, I would change how confidently the results are interpreted. When looking at a small sample or a subgroup such as a team or position, I would treat sensitivity to sampling noise as a factor in the analysis. For league-wide conclusions, the sample size appears sufficient for stable estimates. At the team and position levels, where groups are smaller and greater variability is expected, overconfident statements should be avoided.