library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load Dataset

nba <- read.csv("nba.csv")

Random Subsamples (Replacement)

# change this number, and consider how it affects the sub-sample analysis
sample_frac = 0.1

# number of samples to scrutinize
n_samples = 5

df_samples = tibble()  # empty dataframe to append to

for (sample_i in 1:n_samples) {
  df_i <- nba |>
    sample_n(size = sample_frac * nrow(nba), replace = TRUE) |>
    mutate(sample_num = sample_i)  # add a column indicating sample number
  
  df_samples = bind_rows(df_samples, df_i)
}

df_samples
## # A tibble: 850 × 20
##    bbrID   Date  Tm    Opp     TRB   AST   STL   BLK   PTS  GmSc Season Playoffs
##    <chr>   <chr> <chr> <chr> <int> <int> <int> <int> <int> <dbl> <chr>  <chr>   
##  1 smithm… 1991… BOS   ORL       6     3     1     0    23  20.7 1990-… false   
##  2 thomam… 2020… TOR   BRK       6     1     3     0    15  15.8 2019-… false   
##  3 tinslj… 2007… IND   MIN       6     6     4     0    37  29.8 2006-… false   
##  4 mclemb… 2017… SAC   ATL       9     3     4     0    22  20.1 2016-… false   
##  5 richaj… 2010… PHO   POR       8     2     3     0    42  39.5 2009-… true    
##  6 popema… 2001… MIL   GSW       4     3     1     0    10   9.9 2000-… false   
##  7 smitho… 1990… ORL   SEA       8     4     2     1    33  32.2 1990-… false   
##  8 brewec… 2014… MIN   HOU       2     1     6     1    51  40.8 2013-… false   
##  9 alarim… 1989… WSB   CHH      11     1     2     1    20  20.4 1989-… false   
## 10 doziep… 2021… DEN   HOU       7     3     2     2    23  24.3 2020-… false   
## # ℹ 840 more rows
## # ℹ 8 more variables: Year <int>, GameIndex <int>, GmScMovingZ <dbl>,
## #   GmScMovingZTop2Delta <dbl>, Date2 <chr>, GmSc2 <dbl>, GmScMovingZ2 <dbl>,
## #   sample_num <int>
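A small aside on the code above: sample_n() still works but is superseded in dplyr, and because no seed is set the draws differ on every knit. A minimal sketch of a reproducible variant of the same loop, using set.seed() and slice_sample() (the seed value 1 is an arbitrary choice):

set.seed(1)  # arbitrary seed; any fixed value makes the draws repeatable across knits

# drop-in replacement for the loop above
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  df_i <- nba |>
    slice_sample(prop = sample_frac, replace = TRUE) |>  # same 10% with-replacement draw
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}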

I started by taking 5 samples at a 10% sample fraction, so each subsample contains exactly 170 games. There was a lot of variance between samples, especially in the extremes such as the minimums and maximums, and rare performances disappeared entirely in some subsamples. When I gradually increased the sample fraction up to 75%, the summaries stabilized, including the extreme values. This is consistent with the Law of Large Numbers: as sample size increases, the variability of sample statistics decreases and they settle closer to the population values.
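To see that stabilization directly rather than by eye, one option is to sweep the sample fraction and track how much the subsample means spread out at each fraction. A rough sketch, assuming 20 repeated draws per fraction (both the fractions and the repetition count are arbitrary choices):

fractions <- c(0.10, 0.25, 0.50, 0.75)  # illustrative fractions to compare

spread_of_means <- map_dfr(fractions, function(frac) {
  means <- map_dbl(1:20, function(i) {
    nba |>
      slice_sample(prop = frac, replace = TRUE) |>
      pull(PTS) |>
      mean(na.rm = TRUE)
  })
  tibble(sample_frac = frac, sd_of_sample_means = sd(means))
})

spread_of_means  # sd_of_sample_means should shrink as sample_frac grows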

Compare Average Points by Sample

df_samples |>
  group_by(sample_num) |>
  summarise(
    avg_pts = mean(PTS, na.rm = TRUE),
    sd_pts  = sd(PTS, na.rm = TRUE),
    n_games = n()
  )
## # A tibble: 5 × 4
##   sample_num avg_pts sd_pts n_games
##        <int>   <dbl>  <dbl>   <int>
## 1          1    25.5  10.0      170
## 2          2    26     9.09     170
## 3          3    25.4   9.67     170
## 4          4    27.2  11.3      170
## 5          5    24.5   9.76     170

Each sample contains the same number of games but slightly different means and standard deviations of points scored. Even when sampling from the same population, the average scoring fluctuates from sample to sample. This is sampling variability in action: the conclusions depend on which sample you happen to collect.
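The size of these fluctuations lines up with the usual standard-error formula, sd / sqrt(n): with a standard deviation near 10 points and n = 170 games, sample means should wander by roughly 0.8 points, which is about the size of the differences in the table above. A quick check (the object names here are mine):

# observed spread of the five sample means vs. the textbook standard error
df_samples |>
  group_by(sample_num) |>
  summarise(avg_pts = mean(PTS, na.rm = TRUE)) |>
  summarise(sd_of_means = sd(avg_pts))

sd(nba$PTS, na.rm = TRUE) / sqrt(sample_frac * nrow(nba))  # theoretical SE for n = 170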

Anomalies in High-Scoring Games

df_samples |>
  group_by(sample_num) |>
  summarise(max_pts = max(PTS, na.rm = TRUE))
## # A tibble: 5 × 2
##   sample_num max_pts
##        <int>   <int>
## 1          1      62
## 2          2      54
## 3          3      60
## 4          4      62
## 5          5      60

Since the draws are random and no seed is fixed, which games appear changes from run to run: when I first drew these subsamples, only one of them contained Kobe Bryant's historic 81-point game, a major outlier, while in the run shown above none of the five capture it and the maxima are relatively similar to each other, ranging from 54 to 62 points. This shows how anomaly detection can depend entirely on which observations happen to be sampled in the first place.
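How likely is any single game, such as the 81-point game, to land in a 10% with-replacement subsample at all? With N rows and n = 0.1 * N draws, the inclusion probability is 1 - (1 - 1/N)^n, which is about 1 - e^(-0.1), or roughly 0.095, so any one rare game should be missing from about 90% of these subsamples. A quick sketch of that calculation:

# probability that one specific game appears in a single 10% with-replacement subsample
N <- nrow(nba)               # population size (rows in nba.csv)
n <- round(sample_frac * N)  # draws per subsample (170 here)
1 - (1 - 1 / N)^n            # ~0.095: most subsamples will miss any given rare game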

Consistency Across Samples

df_samples |>
  group_by(sample_num) |>
  summarise(
    mean_pts = mean(PTS, na.rm = TRUE),
    min_pts = min(PTS, na.rm = TRUE),
    median_pts = median(PTS, na.rm = TRUE),
    max_pts = max(PTS, na.rm = TRUE)
  )
## # A tibble: 5 × 5
##   sample_num mean_pts min_pts median_pts max_pts
##        <int>    <dbl>   <int>      <dbl>   <int>
## 1          1     25.5       4         24      62
## 2          2     26         8         25      54
## 3          3     25.4       8         24      60
## 4          4     27.2       5         25      62
## 5          5     24.5       4         23      60

The means and medians of points scored stay fairly consistent, though they still fluctuate from sample to sample. The extremes, the minimum and especially the maximum, vary the most across subsamples. Measures of central tendency are generally more stable than extreme values, which suggests that averages and medians are more reliable summaries when sample sizes are small.
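One way to quantify "central tendency is more stable than the extremes" is to repeat the 10% draw many times and compare how much each summary statistic varies. A sketch, assuming 200 repetitions (an arbitrary choice):

# resample many times and see which summary statistic fluctuates the most
stability <- map_dfr(1:200, function(i) {
  nba |>
    slice_sample(prop = sample_frac, replace = TRUE) |>
    summarise(
      mean_pts   = mean(PTS, na.rm = TRUE),
      median_pts = median(PTS, na.rm = TRUE),
      max_pts    = max(PTS, na.rm = TRUE)
    )
})

stability |>
  summarise(across(everything(), sd))  # expect sd of max_pts to dwarf sd of mean/median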

Future Implications/Conclusions

This data dive has shown me how small samples can misrepresent what is really going on, especially for statistics driven by extremes. Small samples have high variance and may exaggerate rare events or miss them entirely, while larger samples produce far more stable, reliable summaries. In the future, I should avoid making bold claims based on small subsets and should check whether my conclusions hold up across multiple samples.

Further Questions

Is there an optimal sample size, or is it simply a case of the more the better? Why is the median the most consistent statistic across subsamples? Does bootstrapping provide a more stable picture of rare events? Would stratified sampling help reduce variability in small samples? A sketch of a starting point for the bootstrapping question follows below.
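As a first pass at the bootstrapping question, a percentile bootstrap for the average points per game might look like this sketch (1,000 replicates is an arbitrary choice, and this only covers the mean, not rare events):

# percentile bootstrap for mean PTS: full-size resamples drawn with replacement
boot_means <- map_dbl(1:1000, function(i) {
  mean(sample(nba$PTS, size = nrow(nba), replace = TRUE), na.rm = TRUE)
})

quantile(boot_means, c(0.025, 0.975))  # approximate 95% interval for the average points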