Week 4 Data Dive Overview

This is my week 4 data dive analyzing NBA shooting statistics. I take repeated samples of three-point-attempt data from the full league-wide dataset and conduct deeper analyses on these sub-samples to understand their significance.

# loading the appropriate packages, reading the CSV file, and cleaning column
# names to follow R format.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.2
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
# (ggplot2 is already attached by the tidyverse, so no separate library() call is needed)

df <-
  read_csv("C:/Users/guyon/OneDrive/Desktop/NBA_Shooting_Stats.csv") |>
  clean_names() |>
  filter(!team %in% c("2TM", "3TM", "4TM"))  # remove combined-team rows
## Rows: 3669 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Player, Team, Pos, Season
## dbl (25): Rk, Age, G, GS, MP, FG%, Dist., FGA_2P, FGA_0-3, FGA_3-10, FGA_10-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# change this number, and consider how it affects the sub-sample analysis
sample_frac <- 0.25

# number of samples to scrutinize
n_samples <- 3

df_samples <- tibble()  # empty dataframe to append to

# each sub-sample is drawn with replacement (bootstrap-style), so a player
# can appear more than once within a sample
for (sample_i in 1:n_samples) {
  df_i <- df |>
    sample_n(size = round(sample_frac * nrow(df)), replace = TRUE) |>
    mutate(sample_num = sample_i)  # add a column indicating sample number
  
  df_samples <- bind_rows(df_samples, df_i)
}
sample_3p_summary <-
  df_samples |>
  group_by(sample_num) |>
  summarise(
    n_players      = n(),
    mean_fga_3p    = mean(fga_3p, na.rm = TRUE),
    median_fga_3p  = median(fga_3p, na.rm = TRUE),
    sd_fga_3p      = sd(fga_3p, na.rm = TRUE),
    q1_fga_3p      = quantile(fga_3p, 0.25, na.rm = TRUE),
    q3_fga_3p      = quantile(fga_3p, 0.75, na.rm = TRUE),
    .groups = "drop"
  )

sample_3p_summary |>
  print(n = Inf)
## # A tibble: 3 × 7
##   sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
##        <int>     <int>       <dbl>         <dbl>     <dbl>     <dbl>     <dbl>
## 1          1       814       0.401         0.41      0.229     0.260     0.544
## 2          2       814       0.418         0.427     0.231     0.280     0.569
## 3          3       814       0.400         0.422     0.227     0.258     0.544

Across the three random samples, the mean and median three-point-attempt rates are very similar: mean FGA_3P ranges from 0.400 to 0.418 and median FGA_3P ranges from 0.410 to 0.427. Even the quartiles barely move; across all three 25% samples the IQR stays near ~0.26–0.57. This indicates that league-wide 3-point reliance is a stable characteristic of the dataset, not sensitive to random sampling at the 25% level. It also aligns with the trend of the league as a whole shooting many threes to maximize offensive/team success.
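
One quick visual check of that stability (a sketch reusing the df_samples built above): overlay the fga_3p distributions from the three samples and see whether the curves sit on top of each other.

# overlaid densities: near-identical curves across samples would confirm
# that the 25% sub-samples preserve the league-wide distribution
ggplot(df_samples, aes(x = fga_3p, colour = factor(sample_num))) +
  geom_density(na.rm = TRUE) +
  labs(x = "3PA rate (fga_3p)", colour = "Sample")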

Further Question: Would this stability hold for other groups such as guards or high-minute players?
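
A starting point for that question (a sketch; it assumes the cleaned pos column uses codes like "PG" and "SG" for guards): repeat the per-sample summary within the sub-group.

# same per-sample summary, restricted to guards; stable means/medians
# across samples would suggest the stability holds for this group too
df_samples |>
  filter(pos %in% c("PG", "SG")) |>
  group_by(sample_num) |>
  summarise(
    mean_fga_3p   = mean(fga_3p, na.rm = TRUE),
    median_fga_3p = median(fga_3p, na.rm = TRUE),
    .groups = "drop"
  )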

sample_3p_anomalies <-
  sample_3p_summary |>
  mutate(
    z_mean = (mean_fga_3p - mean(mean_fga_3p)) / sd(mean_fga_3p),
    anomaly_flag = abs(z_mean) >= 1  # flag samples > 1 SD from the mean of means
  ) |>
  arrange(desc(abs(z_mean)))

sample_3p_anomalies
## # A tibble: 3 × 9
##   sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
##        <int>     <int>       <dbl>         <dbl>     <dbl>     <dbl>     <dbl>
## 1          2       814       0.418         0.427     0.231     0.280     0.569
## 2          3       814       0.400         0.422     0.227     0.258     0.544
## 3          1       814       0.401         0.41      0.229     0.260     0.544
## # ℹ 2 more variables: z_mean <dbl>, anomaly_flag <lgl>

Analyzing the z-scores, the top row (sample 2) has a z-score of 1.127, which trips the anomaly flag. However, the underlying difference in means is small (0.418 versus roughly 0.40), and with only three samples a z-score can be misleading: one sub-sample will almost always look unusual relative to the others, even when the variation is minimal. Sample 2's mean FGA_3P is high relative to the other samples, but the effect is so small that it is difficult to make a real anomaly claim.

Further Question: How often do anomalies appear as the sample size increases or more samples are added?
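
Before adding samples, one caveat worth checking: if the sample means are roughly normally distributed, about a third of them will land beyond |z| = 1 by chance alone, so this flag fires regularly even when nothing is truly anomalous. A minimal simulation sketch (the 50 repetitions and the name many_means are arbitrary choices, not part of the analysis above):

set.seed(42)  # reproducibility for this sketch only

# means of 50 resamples, each the same size (814) as the 25% sub-samples
many_means <- replicate(
  50,
  mean(sample(df$fga_3p, size = 814, replace = TRUE), na.rm = TRUE)
)

# share of resamples that the |z| >= 1 rule would flag as "anomalies"
mean(abs(scale(many_means)) >= 1)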

# Draw `n_samples` sub-samples (with replacement) of size
# `sample_frac * nrow(df_in)` and summarise fga_3p for each
make_samples_and_summarise <- function(df_in, sample_frac, n_samples, seed = 123) {
  set.seed(seed)
  
  df_samples <- tibble()
  
  for (sample_i in 1:n_samples) {
    df_i <-
      df_in |>
      sample_n(size = round(sample_frac * nrow(df_in)), replace = TRUE) |>
      mutate(sample_num = sample_i, sample_frac = sample_frac)
    
    df_samples <- bind_rows(df_samples, df_i)
  }
  
  summary <-
    df_samples |>
    group_by(sample_frac, sample_num) |>
    summarise(
      n_players     = n(),
      mean_fga_3p   = mean(fga_3p, na.rm = TRUE),
      median_fga_3p = median(fga_3p, na.rm = TRUE),
      sd_fga_3p     = sd(fga_3p, na.rm = TRUE),
      .groups = "drop"
    )
  
  list(samples = df_samples, summary = summary)
}

res_10 <- make_samples_and_summarise(df, 0.10, n_samples = 3, seed = 123)
res_25 <- make_samples_and_summarise(df, 0.25, n_samples = 3, seed = 123)
res_75 <- make_samples_and_summarise(df, 0.75, n_samples = 3, seed = 123)
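
As an aside, the three calls above could be collapsed with purrr (already attached via the tidyverse); a minimal sketch, with compare_summary2 as a hypothetical name:

compare_summary2 <-
  map_dfr(
    c(0.10, 0.25, 0.75),
    \(f) make_samples_and_summarise(df, f, n_samples = 3, seed = 123)$summary
  )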


compare_summary <-
  bind_rows(res_10$summary, res_25$summary, res_75$summary) |>
  arrange(sample_frac, sample_num)

compare_summary |>
  print(n = Inf)
## # A tibble: 9 × 6
##   sample_frac sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p
##         <dbl>      <int>     <int>       <dbl>         <dbl>     <dbl>
## 1        0.1           1       326       0.400         0.418     0.224
## 2        0.1           2       326       0.417         0.424     0.209
## 3        0.1           3       326       0.377         0.397     0.220
## 4        0.25          1       814       0.405         0.417     0.218
## 5        0.25          2       814       0.398         0.404     0.217
## 6        0.25          3       814       0.415         0.442     0.230
## 7        0.75          1      2443       0.406         0.419     0.222
## 8        0.75          2      2443       0.399         0.409     0.225
## 9        0.75          3      2443       0.395         0.403     0.228

As the sampling fraction increases, the estimates converge: the 10% samples show the widest spread in mean and median FGA_3P (means from 0.377 to 0.417), while the 75% samples cluster tightly around the league average (means from 0.395 to 0.406), reducing uncertainty and minimizing the influence of extreme cases.
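
That convergence can be quantified directly (a sketch using the compare_summary table above):

# range of the three sample means at each sampling fraction; the range
# should shrink as the fraction grows
compare_summary |>
  group_by(sample_frac) |>
  summarise(
    spread_of_means = max(mean_fga_3p) - min(mean_fga_3p),
    .groups = "drop"
  )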

Further Question: At what sampling fraction do estimates stabilize enough for decision-making?
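
A rough way to frame that question (a back-of-envelope sketch, treating draws as independent): compare the standard error of the mean, sd / sqrt(n), at each sub-sample size. With a within-sample sd of about 0.22, the 10% samples carry nearly three times the uncertainty of the 75% samples.

# approximate standard errors of mean fga_3p at each sub-sample size,
# using the typical within-sample sd (~0.22) from the tables above
tibble(
  sample_frac = c(0.10, 0.25, 0.75),
  n           = round(sample_frac * nrow(df)),
  approx_se   = 0.22 / sqrt(n)
)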