week4datadive

Week 4 Data Dive Overview

This is my week 4 data dive analyzing NBA shooting statistics. I take random samples of three-point attempts across the league from the full dataset and analyze each sample in more depth to understand how much weight its results can carry.

# `df` (the full dataset) and the tidyverse packages are assumed to be loaded already

# change this number, and consider how it affects the sub-sample analysis
sample_frac <- 0.25

# number of samples to scrutinize
n_samples <- 3

df_samples <- tibble()  # empty dataframe to append to

for (sample_i in seq_len(n_samples)) {
  df_i <- df |>
    # round() so the requested size is a whole number of rows
    sample_n(size = round(sample_frac * nrow(df)), replace = TRUE) |>
    mutate(sample_num = sample_i)  # add a column indicating sample number
  
  df_samples <- bind_rows(df_samples, df_i)
}

sample_3p_summary <-
  df_samples |>
  group_by(sample_num) |>
  summarise(
    n_players      = n(),
    mean_fga_3p    = mean(fga_3p, na.rm = TRUE),
    median_fga_3p  = median(fga_3p, na.rm = TRUE),
    sd_fga_3p      = sd(fga_3p, na.rm = TRUE),
    q1_fga_3p      = quantile(fga_3p, 0.25, na.rm = TRUE),
    q3_fga_3p      = quantile(fga_3p, 0.75, na.rm = TRUE),
    .groups = "drop"
  )

sample_3p_summary |>
  print(n = Inf)

## # A tibble: 3 × 7
##   sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
##        <int>     <int>       <dbl>         <dbl>     <dbl>     <dbl>     <dbl>
## 1          1       814       0.409         0.421     0.220     0.278     0.546
## 2          2       814       0.406         0.423     0.224     0.274     0.541
## 3          3       814       0.388         0.410     0.229     0.244     0.535

Across the three random samples, the mean and median three-point-attempt rate are very similar. Mean FGA_3P ranges from 0.388 to 0.409 and median FGA_3P ranges from 0.410 to 0.423. The quartiles are also close: Q1 falls between 0.244 and 0.278 and Q3 between 0.535 and 0.546, so the middle half of each sample covers roughly the same interval. Sample 3 sits slightly below the other two on every measure, but the gaps are small (about 0.02 on the mean). This suggests that league-wide three-point reliance is a stable characteristic of the dataset and is not sensitive to random sampling at the 25% level. It is also consistent with the league-wide trend of teams shooting many threes to maximize offensive success.

Further Question: Would this stability hold for other groups such as guards or high-minute players?
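
One way to start on this question is to repeat the same resampling inside a single subgroup. The sketch below is only an illustration: `df_demo`, its `position` column, and all of its values are made up, because the real dataset's column names for position or minutes are not shown here. It also uses `slice_sample()`, the newer dplyr replacement for `sample_n()`, which is a swapped-in choice rather than what the analysis above uses.

```r
library(dplyr)

set.seed(7)  # arbitrary seed so the sketch is reproducible

# Stand-in data: `position` is a hypothetical column and the values are random
df_demo <- tibble(
  position = sample(c("G", "F", "C"), size = 600, replace = TRUE),
  fga_3p   = runif(600)
)

sample_frac <- 0.25
n_samples   <- 3

# Resample guards only, then summarise each resample on one row
subgroup_summary <- lapply(seq_len(n_samples), function(i) {
  df_demo |>
    filter(position == "G") |>
    slice_sample(prop = sample_frac, replace = TRUE) |>
    summarise(
      sample_num    = i,
      n_players     = n(),
      mean_fga_3p   = mean(fga_3p, na.rm = TRUE),
      median_fga_3p = median(fga_3p, na.rm = TRUE)
    )
}) |>
  bind_rows()

subgroup_summary
```

Because a subgroup has fewer rows than the full league, each 25% resample is smaller, so I would expect the means to move around more than they do in the league-wide table.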

sample_3p_anomalies <-
  sample_3p_summary |>
  mutate(
    z_mean = (mean_fga_3p - mean(mean_fga_3p)) / sd(mean_fga_3p),
    anomaly_flag = abs(z_mean) >= 1 
  ) |>
  arrange(desc(abs(z_mean)))

sample_3p_anomalies

## # A tibble: 3 × 9
##   sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
##        <int>     <int>       <dbl>         <dbl>     <dbl>     <dbl>     <dbl>
## 1          3       814       0.388         0.410     0.229     0.244     0.535
## 2          1       814       0.409         0.421     0.220     0.278     0.546
## 3          2       814       0.406         0.423     0.224     0.274     0.541
## # ℹ 2 more variables: z_mean <dbl>, anomaly_flag <lgl>

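Sorting by the absolute z-score puts sample 3 first, so it is the one flagged as an anomaly: its mean FGA_3P (0.388) is the lowest of the three, about 0.02 below the other two, which gives a z-score of roughly -1.1 when recomputed from the rounded means shown. However, the difference in means is small, and the flag can be misleading with so few samples. The z-score is computed from only three sample means, so the standard deviation in the denominator is itself tiny and unstable, and a sample can look unusual relative to the others even when the actual change is minimal. With a difference this small, it is difficult to make a real anomaly claim.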
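
There is also an arithmetic reason to distrust the flag. When z-scores are computed within a set of exactly three values using the sample standard deviation, the largest absolute z-score always lands between 1 and 2/sqrt(3) (about 1.155), no matter what the values are. A threshold of 1 therefore flags at least one sample every time. The sketch below checks this by simulation; the simulated means (centered on 0.40 with a spread of 0.01) are made-up stand-ins, not values from the dataset.

```r
set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# Largest absolute z-score within one set of values
max_abs_z <- function(x) {
  z <- (x - mean(x)) / sd(x)
  max(abs(z))
}

# 10,000 simulated sets of three sample means (made-up center and spread)
n_sets <- 10000
max_z  <- replicate(n_sets, max_abs_z(rnorm(3, mean = 0.40, sd = 0.01)))

# Theoretical ceiling for n values: (n - 1) / sqrt(n)
upper_bound <- (3 - 1) / sqrt(3)

range(max_z)            # observed smallest and largest "max |z|"
upper_bound             # theoretical ceiling for three values
mean(max_z >= 1 - 1e-9) # share of sets where some sample reaches |z| >= 1
```

Since every set of three produces a maximum |z| in that narrow band, a flag at |z| >= 1 says nothing about whether sample 3 is actually unusual.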

Further Question: How often do anomalies appear as sample size increases or as more samples are added?

make_samples_and_summarise <- function(df_in, sample_frac, n_samples, seed = 123) {
  set.seed(seed)
  
  df_samples <- tibble()
  
  for (sample_i in 1:n_samples) {
    df_i <-
      df_in |>
      sample_n(size = round(sample_frac * nrow(df_in)), replace = TRUE) |>
      mutate(sample_num = sample_i, sample_frac = sample_frac)
    
    df_samples <- bind_rows(df_samples, df_i)
  }
  
  summary <-
    df_samples |>
    group_by(sample_frac, sample_num) |>
    summarise(
      n_players     = n(),
      mean_fga_3p   = mean(fga_3p, na.rm = TRUE),
      median_fga_3p = median(fga_3p, na.rm = TRUE),
      sd_fga_3p     = sd(fga_3p, na.rm = TRUE),
      .groups = "drop"
    )
  
  list(samples = df_samples, summary = summary)
}

res_10 <- make_samples_and_summarise(df, 0.10, n_samples = 3, seed = 123)
res_25 <- make_samples_and_summarise(df, 0.25, n_samples = 3, seed = 123)
res_75 <- make_samples_and_summarise(df, 0.75, n_samples = 3, seed = 123)


compare_summary <-
  bind_rows(res_10$summary, res_25$summary, res_75$summary) |>
  arrange(sample_frac, sample_num)

compare_summary |>
  print(n = Inf)

## # A tibble: 9 × 6
##   sample_frac sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p
##         <dbl>      <int>     <int>       <dbl>         <dbl>     <dbl>
## 1        0.1           1       326       0.400         0.418     0.224
## 2        0.1           2       326       0.417         0.424     0.209
## 3        0.1           3       326       0.377         0.397     0.220
## 4        0.25          1       814       0.405         0.417     0.218
## 5        0.25          2       814       0.398         0.404     0.217
## 6        0.25          3       814       0.415         0.442     0.230
## 7        0.75          1      2443       0.406         0.419     0.222
## 8        0.75          2      2443       0.399         0.409     0.225
## 9        0.75          3      2443       0.395         0.403     0.228

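As sample size increases, the estimates tighten. The 10% samples show the widest spread in mean FGA_3P (0.377 to 0.417), while the 75% samples converge around the league average (0.395 to 0.406), reducing uncertainty and minimizing the influence of extreme cases. The median follows the same overall pattern less cleanly: the 25% samples span 0.404 to 0.442, wider than the 10% samples (0.397 to 0.424), which shows how noisy a comparison based on only three samples per fraction can be.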
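
The narrowing is what the standard error of a mean predicts: the spread of a sample mean is roughly the standard deviation of the data divided by the square root of the sample size. As a rough check, the sketch below plugs in the sample sizes from the table and a standard deviation of 0.22, which is my rounding of the `sd_fga_3p` column rather than an exact figure.

```r
# Approximate SD of fga_3p, rounded from the sd_fga_3p column (an assumption)
sd_fga_3p <- 0.22

# Rows per sample at each fraction, taken from the n_players column
n_players <- c(`10%` = 326, `25%` = 814, `75%` = 2443)

# Standard error of the mean: sd / sqrt(n)
se_mean <- sd_fga_3p / sqrt(n_players)

round(se_mean, 4)
```

Under these assumptions the standard error comes out near 0.012 for the 10% samples, 0.008 for the 25% samples, and 0.004 for the 75% samples, so means from the 10% samples should vary about three times as much as means from the 75% samples.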

Further Question: At what sample fraction do estimates stabilize enough for decision-making?
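
Three samples per fraction are too few to answer this, so one approach would be to draw many resamples at each fraction and measure how much the means vary. The sketch below does that on stand-in data: `fga_3p_demo` is a made-up vector (drawn from a beta distribution), and the fractions and the 200 repetitions are arbitrary choices of mine, not settings from the analysis above.

```r
set.seed(2026)  # arbitrary seed for reproducibility of the sketch

# Stand-in for the real fga_3p column (made-up values between 0 and 1)
fga_3p_demo <- rbeta(3000, 2, 3)

fracs  <- c(0.10, 0.25, 0.50, 0.75)  # sample fractions to compare
n_reps <- 200                        # resamples drawn per fraction

# For each fraction, resample with replacement many times and
# record the standard deviation of the resulting means
spread_by_frac <- sapply(fracs, function(frac) {
  n     <- round(frac * length(fga_3p_demo))
  means <- replicate(n_reps, mean(sample(fga_3p_demo, size = n, replace = TRUE)))
  sd(means)
})

data.frame(sample_frac = fracs, sd_of_means = spread_by_frac)
```

With the real data in place of `fga_3p_demo`, the fraction at which `sd_of_means` drops below whatever tolerance the decision requires would be a reasonable answer to the question.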

Impact on Future Conclusions: After this analysis of FGA_3P, I would be more careful about how confidently the results are interpreted. When looking at a small sample or a subgroup such as a team or position, I would treat sensitivity to sampling noise as a factor in the analysis. For league-wide conclusions, the sample size appears sufficient for stable estimates. At the team and position levels, where fewer observations are likely to mean greater variability, overconfident statements should be avoided.

2026-02-08
