library(tidyverse)
NBA_Data <- read_csv("NBA Dataset for Submission.csv")
## Rows: 46977 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): season_type, game_id, team_abbreviation_home, team_name_home, tea...
## dbl (65): season_id, season, team_id_home, team_id_away, fgm_home, fga_home...
## date (1): game_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NBA_Data <- NBA_Data |>
  filter(season_type != 'All-Star',
         season_type != 'All Star',
         season_type != 'Pre Season')
# change this number, and consider how it affects the sub-sample analysis
sample_frac <- 0.25
# number of samples to scrutinize
n_samples <- 3
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  # draw sample_frac of the rows, with replacement, and label the draw
  df_i <- NBA_Data |>
    slice_sample(prop = sample_frac, replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
samp1 <- df_samples |>
  filter(sample_num == 1)
samp2 <- df_samples |>
  filter(sample_num == 2)
samp3 <- df_samples |>
  filter(sample_num == 3)
summary(samp1$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.0 95.0 104.0 103.8 113.0 157.0
summary(samp2$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 94.0 103.0 103.8 113.0 173.0
summary(samp3$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.0 95.0 104.0 104.2 113.0 173.0
The summary statistics are very consistent across the three samples. The extremes fluctuate a bit (minimums of 55 to 59, maximums of 157 to 173), but the samples are large enough that the means (103.8 to 104.2), medians (103 or 104), and quartiles (94 or 95 for the 1st, 113 for the 3rd) barely move. I am curious to see what happens when I change the sample size and number of samples.
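As a side note, the per-sample comparison can be condensed into a single table by grouping on the sample number instead of filtering each sample into its own object. The sketch below is only an illustration under stated assumptions: it uses a synthetic stand-in (`toy_data`, a hypothetical name, with made-up normally distributed points) rather than the real NBA data, so its numbers will not match the output above.

```r
library(dplyr)
library(purrr)

set.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for NBA_Data (assumption: not the real dataset)
toy_data <- tibble(pts_home = round(rnorm(4000, mean = 104, sd = 13)))

sample_frac <- 0.25
n_samples <- 3

# Draw each sample with replacement and tag it with its sample number
toy_samples <- map(seq_len(n_samples), function(sample_i) {
  toy_data |>
    slice_sample(prop = sample_frac, replace = TRUE) |>
    mutate(sample_num = sample_i)
}) |>
  list_rbind()

# One row of summary statistics per sample
sample_summary <- toy_samples |>
  group_by(sample_num) |>
  summarise(
    n      = n(),
    min    = min(pts_home),
    q1     = quantile(pts_home, 0.25, names = FALSE),
    median = median(pts_home),
    mean   = mean(pts_home),
    q3     = quantile(pts_home, 0.75, names = FALSE),
    max    = max(pts_home)
  )

print(sample_summary)
```

Having the statistics side by side in one tibble makes it easier to compare samples, and the same code works unchanged when `n_samples` grows.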
# change this number, and consider how it affects the sub-sample analysis
sample_frac <- 0.1
# number of samples to scrutinize
n_samples <- 7
df_samples <- tibble()
for (sample_i in 1:n_samples) {
  # draw sample_frac of the rows, with replacement, and label the draw
  df_i <- NBA_Data |>
    slice_sample(prop = sample_frac, replace = TRUE) |>
    mutate(sample_num = sample_i)
  df_samples <- bind_rows(df_samples, df_i)
}
samp1 <- df_samples |>
  filter(sample_num == 1)
samp2 <- df_samples |>
  filter(sample_num == 2)
samp3 <- df_samples |>
  filter(sample_num == 3)
samp4 <- df_samples |>
  filter(sample_num == 4)
samp5 <- df_samples |>
  filter(sample_num == 5)
samp6 <- df_samples |>
  filter(sample_num == 6)
samp7 <- df_samples |>
  filter(sample_num == 7)
summary(samp1$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59 95 104 104 113 155
summary(samp2$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49 94 104 104 113 157
summary(samp3$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.0 95.0 104.0 104.2 113.0 158.0
summary(samp4$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.0 95.0 104.0 103.9 113.0 154.0
summary(samp5$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.0 95.0 103.0 103.8 113.0 152.0
summary(samp6$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.0 94.5 104.0 104.1 113.0 161.0
summary(samp7$pts_home)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63 94 104 104 113 155
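There is not a world of change between samples, but with the smaller sample size and the larger number of samples, some statistics vary a little more. The minimum now ranges from 49 to 63 (it was 55 to 59 before), and the 1st quartile takes three values (94, 94.5, and 95) rather than two. The centers are about as stable as in the first run: the means still fall between 103.8 and 104.2, the medians are still 103 or 104, and the 3rd quartile is 113 in every sample. The maximum actually varies less this time (152 to 161, versus 157 to 173). Because summary() rounds its output, the three means printed as 104 may well differ in the decimals that are not shown, so the printed values alone cannot confirm whether any two means are exactly equal.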
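The expected effect of sample size on the mean can be checked more directly. The standard error of a sample mean is roughly sd / sqrt(n), so dropping from 25% to 10% of the rows should widen the spread of the means by a factor of about sqrt(2.5), or roughly 1.58. The sketch below is a rough illustration, not the analysis above: `population` is a synthetic stand-in, and the standard deviation of 13 points is a guess rather than a value computed from the NBA data.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for home-team points (assumption: not the real data)
population <- round(rnorm(40000, mean = 104, sd = 13))

# Standard deviation of the sample mean across repeated resamples
spread_of_means <- function(frac, n_reps = 200) {
  size <- floor(frac * length(population))
  means <- replicate(n_reps, mean(sample(population, size, replace = TRUE)))
  sd(means)
}

spread_25 <- spread_of_means(0.25)
spread_10 <- spread_of_means(0.10)

# Theoretical standard errors: sd / sqrt(n)
se_25 <- sd(population) / sqrt(floor(0.25 * length(population)))
se_10 <- sd(population) / sqrt(floor(0.10 * length(population)))

print(c(observed_25 = spread_25, theory_25 = se_25))
print(c(observed_10 = spread_10, theory_10 = se_10))
```

With samples this large, both standard errors are a fraction of a point, which is consistent with the means above differing only in the first decimal place.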
Looking forward, I think it is important to draw a larger number of samples, because more draws reveal a wider range of possible results. That is also why I am particularly fond of this dataset: it spans a wide stretch of the league’s history and offers “samples” from back in the day (the 1980s) through the modern era (up to 2022/2023).