Start by setting up the packages to manipulate data.
suppressPackageStartupMessages({
  library(tidyverse)
  library(rio)
  source("aptheme.R") # Code that helps format graphs
})
Start by reading in the data and building a function that will sample it.
data <- import("plays.csv")

get_sample <- function(data,
                       samp_fraction,
                       n_samples){
  for(i in 1:n_samples){
    # Draw a with-replacement sample of samp_fraction * nrow(data) rows
    subdata <- data[sample(x = 1:nrow(data), size = nrow(data) * samp_fraction, replace = TRUE), ]
    subdata$sample_id <- i
    # inherits = FALSE keeps exists() from picking up a sample_data
    # object that may already exist in the global environment
    if(exists("sample_data", inherits = FALSE)){
      sample_data <- rbind(sample_data, subdata)
    } else {
      sample_data <- subdata
    }
  }
  return(sample_data)
}
Now that the function is in place, we can start generating samples.
# Set seed for reproducibility
set.seed(100)

small_sample <- get_sample(data,
                           samp_fraction = .15,
                           n_samples = 5) %>%
  mutate(sample_type = "small")

med_sample <- get_sample(data,
                         samp_fraction = .50,
                         n_samples = 5) %>%
  mutate(sample_type = "medium")

large_sample <- get_sample(data,
                           samp_fraction = .75,
                           n_samples = 5) %>%
  mutate(sample_type = "large")

# Combine samples into one data frame
sample_data <- bind_rows(small_sample, med_sample, large_sample)
With these samples in hand, the yards gained variable is likely the most telling, since the other variables need more context to be analyzed meaningfully.
sample_data %>%
  group_by(sample_type, sample_id) %>%
  summarise(avg_yards = mean(yardsGained)) %>%
  ggplot(aes(x = avg_yards)) +
  geom_histogram(aes(fill = sample_type), position = "identity", alpha = .75) +
  geom_vline(xintercept = mean(data$yardsGained)) +
  theme_ap(family = "sans") +
  labs(x = "Yards Gained",
       title = "Average Yards Gained by Sample Size",
       fill = "Sample Size") +
  theme(legend.position = "right")
## `summarise()` has grouped output by 'sample_type'. You can override using the
## `.groups` argument.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Here we can see a couple of interesting things. First, the small samples are the most dispersed of the three sizes, which makes sense given what we know about the law of large numbers. Second, while a medium sample produces the single average closest to the mean of the true data set, the large samples are closer to the true mean as a group.
The good news is that even the average furthest from the true mean is only off by a couple of tenths of a yard. The total number of plays in the data set is relatively small in the grand scheme of an NFL season, but even when we draw a much smaller sample from it, the result is not terribly far off from our “true” data set.
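To put rough numbers on both observations, one quick check (not part of the original write-up) is to summarise the spread of the sample averages and their worst-case deviation from the true mean by sample size. A minimal sketch reusing the `sample_data` and `data` objects built above:
# Spread of the sample averages and largest error vs. the true mean, by sample size
true_mean <- mean(data$yardsGained)

sample_data %>%
  group_by(sample_type, sample_id) %>%
  summarise(avg_yards = mean(yardsGained), .groups = "drop") %>%
  group_by(sample_type) %>%
  summarise(sd_of_avgs = sd(avg_yards),
            max_abs_error = max(abs(avg_yards - true_mean)))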
The other thing we want to consider is the mix of first, second, third, and fourth downs in the samples. Teams make different decisions on different downs, so if one down is underrepresented, it could lead to incorrect conclusions down the road.
sample_data %>%
  filter(sample_type == "large") %>%
  count(sample_id, down) %>%
  group_by(sample_id) %>%
  summarise(down = down,
            down_share = n/sum(n)) %>%
  arrange(desc(down))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'sample_id'. You can override using the
## `.groups` argument.
## # A tibble: 20 × 3
## # Groups: sample_id [5]
## sample_id down down_share
## <int> <int> <dbl>
## 1 1 4 0.0186
## 2 2 4 0.0189
## 3 3 4 0.0218
## 4 4 4 0.0211
## 5 5 4 0.0194
## 6 1 3 0.204
## 7 2 3 0.205
## 8 3 3 0.203
## 9 4 3 0.203
## 10 5 3 0.201
## 11 1 2 0.328
## 12 2 2 0.332
## 13 3 2 0.331
## 14 4 2 0.330
## 15 5 2 0.332
## 16 1 1 0.450
## 17 2 1 0.444
## 18 3 1 0.444
## 19 4 1 0.446
## 20 5 1 0.448
Looking here, I’m not too worried about the sampled data being unbalanced. The share each down makes up of each sample varies by only a couple of tenths of a percentage point. If we had seen a sample or two with, say, 15% fourth-down plays, or with 5%, then maybe there would be a balance or anomaly problem, but looking at this data set, I feel pretty good that our original data is representative of the total population.
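As a sanity check on that claim, one could also compute the down shares in the full plays data and compare them against the sampled shares above. A minimal sketch (again reusing the `data` object, not part of the original analysis):
# Down shares in the full (population) data set, for comparison with the sampled shares
data %>%
  count(down) %>%
  mutate(down_share = n / sum(n)) %>%
  arrange(desc(down))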