This is my week 4 data dive analyzing NBA shooting statistics taking samples of three-point attempts across the league from the full dataset and conducting deeper analyses on these cases to understand their significance.
# loading the appropriate packages, reading the CSV file, and cleaning column
# names to follow R format.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.2
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
df <-
read_csv("C:/Users/guyon/OneDrive/Desktop/NBA_Shooting_Stats.csv") |>
clean_names() |>
filter(!team %in% c("2TM", "3TM", "4TM")) # remove combined-team rows
## Rows: 3669 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Player, Team, Pos, Season
## dbl (25): Rk, Age, G, GS, MP, FG%, Dist., FGA_2P, FGA_0-3, FGA_3-10, FGA_10-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# change this number, and consider how it affects the sub-sample analysis
sample_frac = 0.25
# number of samples to scrutinize
n_samples = 3
df_samples = tibble() # empty dataframe to append to
for (sample_i in 1:n_samples) {
df_i <- df |>
sample_n(size = sample_frac * nrow(df), replace = TRUE) |>
mutate(sample_num = sample_i) # add a column indicating sample number
df_samples = bind_rows(df_samples, df_i)
}
sample_3p_summary <-
df_samples |>
group_by(sample_num) |>
summarise(
n_players = n(),
mean_fga_3p = mean(fga_3p, na.rm = TRUE),
median_fga_3p = median(fga_3p, na.rm = TRUE),
sd_fga_3p = sd(fga_3p, na.rm = TRUE),
q1_fga_3p = quantile(fga_3p, 0.25, na.rm = TRUE),
q3_fga_3p = quantile(fga_3p, 0.75, na.rm = TRUE),
.groups = "drop"
)
sample_3p_summary |>
print(n = Inf)
## # A tibble: 3 × 7
## sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 814 0.401 0.41 0.229 0.260 0.544
## 2 2 814 0.418 0.427 0.231 0.280 0.569
## 3 3 814 0.400 0.422 0.227 0.258 0.544
Across the three random samples, the mean and median three-point-attempt rate are very similar. Mean FGA_3p ranges from 0.399 to 0.408 and median FGA_3p ranges from 0.413 to 0.428. Even the quartiles are very similar across samples. Across all three 25% samples, median 3PA stays ~0.41–0.43 and the IQR stays near ~0.26–0.55, indicating a stable league-wide distribution. This means that league wide 3-point reliance is a stable characteristic of the dataset, not sensitive to random sampling at 25% level. This aligns with the trend of the league as a whole shooting many threes to maximize offensive/team success.
Further Question: Would this stability hold for other groups such as guards or high-minute players?
sample_3p_anomalies <-
sample_3p_summary |>
mutate(
z_mean = (mean_fga_3p - mean(mean_fga_3p)) / sd(mean_fga_3p),
anomaly_flag = abs(z_mean) >= 1
) |>
arrange(desc(abs(z_mean)))
sample_3p_anomalies
## # A tibble: 3 × 9
## sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p q1_fga_3p q3_fga_3p
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 814 0.418 0.427 0.231 0.280 0.569
## 2 3 814 0.400 0.422 0.227 0.258 0.544
## 3 1 814 0.401 0.41 0.229 0.260 0.544
## # ℹ 2 more variables: z_mean <dbl>, anomaly_flag <lgl>
Analyzing the z-scores, the first sample having a 1.127 z-score indicates the anomaly. However, the difference in means is small. This can be misleading in a small number of samples. The sub-sample may appear unusual relative to others considering the change is minimal. In sample 1, the FGA_3P is unusually high relative to other samples, but the size is so small, it’s difficult to make a real anomaly claim.
Further Question: How often do anomalies appear as sample sizes increases or more samples are added?
make_samples_and_summarise <- function(df_in, sample_frac, n_samples, seed = 123) {
set.seed(seed)
df_samples <- tibble()
for (sample_i in 1:n_samples) {
df_i <-
df_in |>
sample_n(size = round(sample_frac * nrow(df_in)), replace = TRUE) |>
mutate(sample_num = sample_i, sample_frac = sample_frac)
df_samples <- bind_rows(df_samples, df_i)
}
summary <-
df_samples |>
group_by(sample_frac, sample_num) |>
summarise(
n_players = n(),
mean_fga_3p = mean(fga_3p, na.rm = TRUE),
median_fga_3p = median(fga_3p, na.rm = TRUE),
sd_fga_3p = sd(fga_3p, na.rm = TRUE),
.groups = "drop"
)
list(samples = df_samples, summary = summary)
}
res_10 <- make_samples_and_summarise(df, 0.10, n_samples = 3, seed = 123)
res_25 <- make_samples_and_summarise(df, 0.25, n_samples = 3, seed = 123)
res_75 <- make_samples_and_summarise(df, 0.75, n_samples = 3, seed = 123)
compare_summary <-
bind_rows(res_10$summary, res_25$summary, res_75$summary) |>
arrange(sample_frac, sample_num)
compare_summary |>
print(n = Inf)
## # A tibble: 9 × 6
## sample_frac sample_num n_players mean_fga_3p median_fga_3p sd_fga_3p
## <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 0.1 1 326 0.400 0.418 0.224
## 2 0.1 2 326 0.417 0.424 0.209
## 3 0.1 3 326 0.377 0.397 0.220
## 4 0.25 1 814 0.405 0.417 0.218
## 5 0.25 2 814 0.398 0.404 0.217
## 6 0.25 3 814 0.415 0.442 0.230
## 7 0.75 1 2443 0.406 0.419 0.222
## 8 0.75 2 2443 0.399 0.409 0.225
## 9 0.75 3 2443 0.395 0.403 0.228
As sample size increases 10% sample shows the widest spread in mean and median FGA_3P while 75% converges tightly around the league averages reducing uncertainty and minimize influence of extreme cases.
Further Question: At what percentile do estimates stabilize enough for decision-making?