library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NBA Dataset

Loading in the data:
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl  (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Filtering out unnecessary games:
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')
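
Before dropping rows it can help to see which labels actually occur. The sketch below runs on a small made-up tibble (`toy_games`, a hypothetical stand-in for NBA_Data; the 'Regular Season' and 'Playoffs' labels are assumptions, not values confirmed from the file) and collects the excluded labels in one vector.

```r
library(dplyr)

# Hypothetical stand-in for NBA_Data; the non-excluded labels are assumed
toy_games <- tibble(
  season_type = c("Regular Season", "Playoffs", "All-Star",
                  "All Star", "Pre Season", "Regular Season"),
  pts_home    = c(101, 98, 150, 143, 88, 110)
)

# The three labels the filter above removes
excluded_types <- c("All-Star", "All Star", "Pre Season")

# Tally the labels before filtering
count(toy_games, season_type)

# Keep every game whose season_type is not in the excluded set
kept_games <- toy_games |>
  filter(!season_type %in% excluded_types)

# Tally again to confirm only the wanted labels remain
count(kept_games, season_type)
```

One difference to keep in mind: chained `!=` conditions drop rows where season_type is NA, while `!season_type %in% ...` keeps them.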

The Sampling Process (first iteration)

# change this number, and consider how it affects the sub-sample analysis
sample_frac = 0.25

# number of samples to scrutinize
n_samples = 3

df_samples = tibble()

for (sample_i in 1:n_samples) {
  df_i <- NBA_Data |>
    sample_n(size = sample_frac * nrow(NBA_Data), replace = TRUE) |>
    mutate(sample_num = sample_i)
  
  df_samples <- bind_rows(df_samples, df_i)
}
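
The loop above could also be wrapped in a small function, which makes it easier to rerun with different settings. This is only a sketch under assumptions: `draw_samples()` and `toy_games` are hypothetical names, the data is simulated rather than the NBA file, and `slice_sample()` is swapped in for `sample_n()`. The seed is there so a rerun reproduces the same samples.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for NBA_Data so the sketch runs on its own
set.seed(42)
toy_games <- tibble(pts_home = round(rnorm(1000, mean = 104, sd = 13)))

# Draw n_samples bootstrap-style samples, each a fraction `frac` of the rows
draw_samples <- function(data, frac, n_samples) {
  # Round so the sample size is always a whole number of rows
  n_rows <- round(frac * nrow(data))
  map(seq_len(n_samples), function(sample_i) {
    data |>
      slice_sample(n = n_rows, replace = TRUE) |>
      mutate(sample_num = sample_i)
  }) |>
    bind_rows()
}

toy_samples <- draw_samples(toy_games, frac = 0.25, n_samples = 3)

# Each sample should contain the same number of rows
count(toy_samples, sample_num)
```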

Investigating the Samples (first iteration)

samp1 <- df_samples |>
  filter(sample_num == 1)

samp2 <- df_samples |>
  filter(sample_num == 2)

samp3 <- df_samples |>
  filter(sample_num == 3)

Is there a difference in the distribution of home points in each sample?

summary(samp1$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.0    95.0   104.0   103.8   113.0   157.0
summary(samp2$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    59.0    94.0   103.0   103.8   113.0   173.0
summary(samp3$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    56.0    95.0   104.0   104.2   113.0   173.0

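The three distributions are nearly identical on every summary statistic. The extremes fluctuate (the minimum runs from 55 to 59 and the maximum from 157 to 173), but the samples are large enough that the center barely moves: the means fall between 103.8 and 104.2, the medians are 103 or 104, the first quartiles are 94 or 95, and the third quartile is 113 in all three. I am curious to see what happens when I change the sample size and the number of samples.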
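
To compare the samples side by side rather than reading separate printouts, the same statistics could be computed in one grouped table. A minimal sketch, assuming a data frame shaped like df_samples; `toy_samples` below is simulated and hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for df_samples: three samples of 250 games each
set.seed(1)
toy_samples <- tibble(
  sample_num = rep(1:3, each = 250),
  pts_home   = round(rnorm(750, mean = 104, sd = 13))
)

# One row per sample, one column per summary statistic
sample_summary <- toy_samples |>
  group_by(sample_num) |>
  summarise(
    n      = n(),
    min    = min(pts_home),
    q1     = quantile(pts_home, 0.25, names = FALSE),
    median = median(pts_home),
    mean   = mean(pts_home),
    q3     = quantile(pts_home, 0.75, names = FALSE),
    max    = max(pts_home)
  )

sample_summary
```

With every sample in its own row, differences in a column such as `mean` or `max` are easier to spot than across separate `summary()` calls.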

The Sampling Process (second iteration)

# change this number, and consider how it affects the sub-sample analysis
sample_frac = 0.1

# number of samples to scrutinize
n_samples = 7

df_samples = tibble()

for (sample_i in 1:n_samples) {
  df_i <- NBA_Data |>
    sample_n(size = sample_frac * nrow(NBA_Data), replace = TRUE) |>
    mutate(sample_num = sample_i)
  
  df_samples <- bind_rows(df_samples, df_i)
}

Investigating the Samples (second iteration)

samp1 <- df_samples |>
  filter(sample_num == 1)

samp2 <- df_samples |>
  filter(sample_num == 2)

samp3 <- df_samples |>
  filter(sample_num == 3)

samp4 <- df_samples |>
  filter(sample_num == 4)

samp5 <- df_samples |>
  filter(sample_num == 5)

samp6 <- df_samples |>
  filter(sample_num == 6)

samp7 <- df_samples |>
  filter(sample_num == 7)
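
With seven samples, one filter per sample gets repetitive. As an alternative sketch (base R only, on a simulated and hypothetical `toy_samples` in place of df_samples), the points can be split by sample number and summarised in a single pass.

```r
# Hypothetical stand-in for df_samples: seven samples of 100 games each
set.seed(7)
toy_samples <- data.frame(
  sample_num = rep(1:7, each = 100),
  pts_home   = round(rnorm(700, mean = 104, sd = 13))
)

# One vector of home points per sample, named "1" through "7"
pts_by_sample <- split(toy_samples$pts_home, toy_samples$sample_num)

# Apply summary() to every sample at once
summaries <- lapply(pts_by_sample, summary)

# Look at a single sample by name
summaries[["1"]]
```

This scales to any number of samples without adding a new `sampN` object each time.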

Is there a difference in the distribution of home points in each sample?

summary(samp1$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      59      95     104     104     113     155
summary(samp2$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      49      94     104     104     113     157
summary(samp3$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    57.0    95.0   104.0   104.2   113.0   158.0
summary(samp4$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    56.0    95.0   104.0   103.9   113.0   154.0
summary(samp5$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    57.0    95.0   103.0   103.8   113.0   152.0
summary(samp6$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    57.0    94.5   104.0   104.1   113.0   161.0
summary(samp7$pts_home)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      63      94     104     104     113     155

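The samples still do not differ by much, but with a smaller sample size and more samples there is a little more variation to see. The minimum now runs from 49 to 63, and the first quartile takes three different values (94, 94.5, and 95). The center remains stable, though: the means fall between 103.8 and 104.2 and the medians are 103 or 104, which is the same spread as in the first iteration. At the precision that summary() prints, several of the means are even indistinguishable, so the extra variation shows up mainly in the minimum.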
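
Three or seven samples are too few to see clearly how sample size affects the spread of the sample mean. The sketch below draws many more samples at each fraction and compares the observed spread with the usual standard error, sd / sqrt(n). Everything here is an assumption for illustration: `toy_games` is simulated, and `spread_of_means()` is a hypothetical helper, not part of the analysis above.

```r
# Hypothetical stand-in for the filtered NBA_Data
set.seed(2024)
toy_games <- data.frame(pts_home = round(rnorm(4000, mean = 104, sd = 13)))

# Standard deviation of the sample mean across many resamples of a given fraction
spread_of_means <- function(data, frac, n_samples = 500) {
  n_rows <- round(frac * nrow(data))
  means <- replicate(
    n_samples,
    mean(sample(data$pts_home, size = n_rows, replace = TRUE))
  )
  sd(means)
}

spread_25 <- spread_of_means(toy_games, frac = 0.25)
spread_10 <- spread_of_means(toy_games, frac = 0.10)

# Standard error of the mean predicted from the data: sd / sqrt(n)
se_25 <- sd(toy_games$pts_home) / sqrt(0.25 * nrow(toy_games))
se_10 <- sd(toy_games$pts_home) / sqrt(0.10 * nrow(toy_games))

c(spread_25 = spread_25, se_25 = se_25, spread_10 = spread_10, se_10 = se_10)
```

Because the standard error shrinks with the square root of the sample size, going from 25% to 10% of the rows should widen the spread of the means only modestly, which fits how similar the two iterations look.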

Going forward, I think it is important to draw a larger number of samples so that a wider range of possible results is visible. That is also why I am particularly fond of this dataset: it spans a wide stretch of the league’s history and offers “samples” from back in the day (the 1980s) through the modern era (up to 2022/2023).