library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load Dataset
nba <- read.csv("nba.csv")
# change this number, and consider how it affects the sub-sample analysis
sample_frac <- 0.1
# number of samples to scrutinize
n_samples <- 5
df_samples <- tibble()  # empty data frame to append to
for (sample_i in 1:n_samples) {
  df_i <- nba |>
    sample_n(size = sample_frac * nrow(nba), replace = TRUE) |>
    mutate(sample_num = sample_i)  # add a column flagging the sample number
  df_samples <- bind_rows(df_samples, df_i)
}
df_samples
## # A tibble: 850 × 20
## bbrID Date Tm Opp TRB AST STL BLK PTS GmSc Season Playoffs
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <dbl> <chr> <chr>
## 1 smithm… 1991… BOS ORL 6 3 1 0 23 20.7 1990-… false
## 2 thomam… 2020… TOR BRK 6 1 3 0 15 15.8 2019-… false
## 3 tinslj… 2007… IND MIN 6 6 4 0 37 29.8 2006-… false
## 4 mclemb… 2017… SAC ATL 9 3 4 0 22 20.1 2016-… false
## 5 richaj… 2010… PHO POR 8 2 3 0 42 39.5 2009-… true
## 6 popema… 2001… MIL GSW 4 3 1 0 10 9.9 2000-… false
## 7 smitho… 1990… ORL SEA 8 4 2 1 33 32.2 1990-… false
## 8 brewec… 2014… MIN HOU 2 1 6 1 51 40.8 2013-… false
## 9 alarim… 1989… WSB CHH 11 1 2 1 20 20.4 1989-… false
## 10 doziep… 2021… DEN HOU 7 3 2 2 23 24.3 2020-… false
## # ℹ 840 more rows
## # ℹ 8 more variables: Year <int>, GameIndex <int>, GmScMovingZ <dbl>,
## # GmScMovingZTop2Delta <dbl>, Date2 <chr>, GmSc2 <dbl>, GmScMovingZ2 <dbl>,
## # sample_num <int>
I started by taking 5 samples at a 10% sampling fraction, which gives exactly 170 games in each subsample. There was a lot of variance between samples, especially in the extremes like the minimums and maximums, and rare performances disappeared entirely in some cases. When I gradually increased the sampling fraction up to 75%, the summaries stabilized, including the extreme values. This is consistent with the Law of Large Numbers: as the sample size grows, sample statistics settle closer to their population values and the sample-to-sample variability shrinks.
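To make that comparison concrete, here is a minimal sketch (assuming `nba` is loaded as above) that repeats the sub-sampling at several fractions and reports the same summaries, so the stabilization can be checked directly rather than eyeballed.

# a sketch: re-run the sub-sampling at several fractions and compare summaries
fracs <- c(0.10, 0.25, 0.50, 0.75)
stability <- purrr::map_dfr(fracs, function(f) {
  nba |>
    sample_n(size = round(f * nrow(nba)), replace = TRUE) |>
    summarise(
      frac = f,
      avg_pts = mean(PTS, na.rm = TRUE),
      sd_pts = sd(PTS, na.rm = TRUE),
      max_pts = max(PTS, na.rm = TRUE)
    )
})
stability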
df_samples |>
group_by(sample_num) |>
summarise(
avg_pts = mean(PTS, na.rm = TRUE),
sd_pts = sd(PTS, na.rm = TRUE),
n_games = n()
)
## # A tibble: 5 × 4
## sample_num avg_pts sd_pts n_games
## <int> <dbl> <dbl> <int>
## 1 1 25.5 10.0 170
## 2 2 26 9.09 170
## 3 3 25.4 9.67 170
## 4 4 27.2 11.3 170
## 5 5 24.5 9.76 170
Each sample contains the same number of games but a slightly different mean and standard deviation of points scored. Even when sampling from the same population, the average scoring fluctuates from sample to sample. This is sampling variability in action: the conclusions you draw depend on which sample you happen to collect.
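One way to see that variability at a glance, sketched here under the assumption that `nba` and `df_samples` exist as built above, is to compare each sample mean against the mean of the full dataset.

# how far does each sample mean drift from the full-data mean?
pop_mean <- mean(nba$PTS, na.rm = TRUE)
df_samples |>
  group_by(sample_num) |>
  summarise(avg_pts = mean(PTS, na.rm = TRUE), .groups = "drop") |>
  mutate(diff_from_full_data = avg_pts - pop_mean)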
df_samples |>
group_by(sample_num) |>
summarise(max_pts = max(PTS, na.rm = TRUE))
## # A tibble: 5 × 2
## sample_num max_pts
## <int> <int>
## 1 1 62
## 2 2 54
## 3 3 60
## 4 4 62
## 5 5 60
A truly extreme outlier like Kobe Bryant’s historic 81-point game does not show up in any of these samples; the sample maximums instead range from 54 to 62 points and are otherwise fairly similar to each other. This shows how anomaly detection can depend entirely on which rows happen to be sampled in the first place.
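A quick follow-up check, sketched under the assumption that the single highest-scoring game in the full dataset is the outlier of interest, is to ask which samples (if any) actually drew it.

# which samples, if any, include the highest-scoring game in the full dataset?
overall_max <- max(nba$PTS, na.rm = TRUE)
df_samples |>
  filter(PTS == overall_max) |>
  distinct(sample_num, bbrID, Date, PTS)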
df_samples |>
group_by(sample_num) |>
summarise(
mean_pts = mean(PTS, na.rm = TRUE),
min_pts = min(PTS, na.rm = TRUE),
median_pts = median(PTS, na.rm = TRUE),
max_pts = max(PTS, na.rm = TRUE)
)
## # A tibble: 5 × 5
## sample_num mean_pts min_pts median_pts max_pts
## <int> <dbl> <int> <dbl> <int>
## 1 1 25.5 4 24 62
## 2 2 26 8 25 54
## 3 3 25.4 8 24 60
## 4 4 27.2 5 25 62
## 5 5 24.5 4 23 60
The means and medians of points scored stay fairly consistent, though they still fluctuate slightly from sample to sample. The extremes, the minimum and especially the maximum, show the widest range across subsamples. Measures of central tendency are generally more stable than extreme values, which suggests that averages and medians are the more reliable summaries when sample sizes are small.
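To put a number on that stability, here is a small sketch (using the same `df_samples`) that computes each statistic per sample and then its range across the five samples; a smaller range suggests a more stable statistic.

# range of each summary statistic across the five samples
df_samples |>
  group_by(sample_num) |>
  summarise(
    mean_pts = mean(PTS, na.rm = TRUE),
    median_pts = median(PTS, na.rm = TRUE),
    min_pts = min(PTS, na.rm = TRUE),
    max_pts = max(PTS, na.rm = TRUE),
    .groups = "drop"
  ) |>
  summarise(across(-sample_num, \(x) max(x) - min(x), .names = "range_{.col}"))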
This data dive has shown me how small samples can misrepresent what is really going on, especially for statistics driven by extremes. Small samples have high variance and may exaggerate rare events or miss them entirely, while larger samples produce noticeably more stable, reliable summaries. In the future, I should avoid making bold claims from small subsets and should check whether my conclusions hold up across multiple samples.
What is the optimal sample size, or is it simply a case of the more the better? Why is the median the most consistent statistic across subsamples? Does bootstrapping provide a more stable understanding of rare events? Would stratified sampling help reduce variability in small samples?
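As a first pass at the bootstrapping question, here is a minimal sketch (an assumed follow-up, not part of the analysis above) that resamples the full dataset with replacement and compares how much the median and the maximum vary across bootstrap replicates.

# bootstrap sketch: how stable are the median and maximum across resamples?
n_boot <- 500
boot_stats <- purrr::map_dfr(1:n_boot, function(i) {
  boot <- nba |> sample_n(size = nrow(nba), replace = TRUE)
  tibble(
    rep = i,
    median_pts = median(boot$PTS, na.rm = TRUE),
    max_pts = max(boot$PTS, na.rm = TRUE)
  )
})
boot_stats |>
  summarise(sd_median = sd(median_pts), sd_max = sd(max_pts))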