The purpose of this week’s data dive is for you to think critically about what might go wrong when it comes time to make conclusions about your data.
Your RMarkdown notebook for this data dive should contain the following:
A collection of 5 random samples of your data (with replacement).
Each subsample should be as long as roughly 50% percent of your data. We are simulating the act of collecting data from a population where the “population” is represented by the data set you already have. *
Store each sample set in a separate data frame (e.g.,
df_2
might be the second of these samples). *
Of course, these subsamples should each include both categorical and continuous (numeric) data. *
Scrutinize these subsamples. Note: you might find
group_by
quite helpful here!
How different are they? *
What would you have called an anomaly in one sub-sample that you wouldn’t in another? *
Are there aspects of the data that are consistent among all sub-samples? *
Try to incorporate Monte Carlo simulations here (see Lab 3) if you can … *
Consider how this investigation affects how you might draw conclusions about the data in the future. *
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
For this weeks data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We want to collect 5 random samples from standings data. To do this I will use the sample() function to randomly select rows.
# x is a vector with the same length as the number of rows of standings
x = 1:nrow(standings)
# n is the number of samples we want, which is half of standings
n = nrow(standings)/2
# Creating 5 random samples, with seed 123
set.seed(1)
sample_1 <- standings[sample(x, n, replace = T),]
sample_2 <- standings[sample(x, n, replace = T),]
sample_3 <- standings[sample(x, n, replace = T),]
sample_4 <- standings[sample(x, n, replace = T),]
sample_5 <- standings[sample(x, n, replace = T),]
There is a chance a team can appear more than one in a sample. This is because we are sampling with replacement.
It would be interesting to compare the relative performances of each team in each sample. To do this I will group by ‘team_name’ and find summary statistics of each and then compare results.
sample_1_summary <- sample_1 |>
group_by(team_name) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
mean_sr = mean(simple_rating),
mean_or = mean(offensive_ranking),
mean_dr = mean(defensive_ranking),
count_playoffs = sum(playoffs == 'Playoffs'),
count_sb = sum(sb_winner == 'Superbowl'),
count_n = n())
sample_2_summary <- sample_2 |>
group_by(team_name) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
mean_sr = mean(simple_rating),
mean_or = mean(offensive_ranking),
mean_dr = mean(defensive_ranking),
count_playoffs = sum(playoffs == 'Playoffs'),
count_sb = sum(sb_winner == 'Superbowl'),
count_n = n())
sample_3_summary <- sample_3 |>
group_by(team_name) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
mean_sr = mean(simple_rating),
mean_or = mean(offensive_ranking),
mean_dr = mean(defensive_ranking),
count_playoffs = sum(playoffs == 'Playoffs'),
count_sb = sum(sb_winner == 'Superbowl'),
count_n = n())
sample_4_summary <- sample_4 |>
group_by(team_name) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
mean_sr = mean(simple_rating),
mean_or = mean(offensive_ranking),
mean_dr = mean(defensive_ranking),
count_playoffs = sum(playoffs == 'Playoffs'),
count_sb = sum(sb_winner == 'Superbowl'),
count_n = n())
sample_5_summary <- sample_5 |>
group_by(team_name) |>
summarise(mean_wins = mean(wins),
mean_loses = mean(loss),
mean_pf = mean(points_for),
mean_pa = mean(points_against),
mean_mov = mean(margin_of_victory),
mean_sos = mean(strength_of_schedule),
mean_sr = mean(simple_rating),
mean_or = mean(offensive_ranking),
mean_dr = mean(defensive_ranking),
count_playoffs = sum(playoffs == 'Playoffs'),
count_sb = sum(sb_winner == 'Superbowl'),
count_n = n())
I now have summaries of the 5 samples. To show that my code has worked I will create a plot of the average points for each team in each sample.
ggplot() +
geom_point(data = sample_1_summary, mapping = aes(x=team_name, y=mean_pf), color = 'red')+
geom_point(data = sample_2_summary, mapping = aes(x=team_name, y=mean_pf), color = 'orange')+
geom_point(data = sample_3_summary, mapping = aes(x=team_name, y=mean_pf), color = 'green')+
geom_point(data = sample_4_summary, mapping = aes(x=team_name, y=mean_pf), color = 'blue')+
geom_point(data = sample_5_summary, mapping = aes(x=team_name, y=mean_pf), color = 'purple')+
labs(title = 'Sampling Averages of Points For by NFL Teams ', x = 'Team Name', y= 'Points For') +
coord_flip() +
theme_minimal()
We can see the average between the samples varies, and is primed for us to investigate their differences.
These samples are likely different in many ways so to I will need to focus my investigation into a particular aspect of the samples. I have decided to look at the distribution of playoff appearances by each team.
I will first create a histogram of each sample to visualize their distribution.
x <- sample_1_summary$count_playoffs
mu_x <- mean(x)
sd_x <- sd(x)
ggplot() +
geom_histogram(data = sample_1_summary, aes(count_playoffs), binwidth = 1)+
labs(title = "Sample 1 Histogram of Playoff Appearances per NFL Teams",
x = "Playoff Appearances",
y = "# of teams") +
theme_minimal()
cat("Sample 1 mean: ",mu_x, '\nSample 1 Standard Deviation: ', sd_x)
## Sample 1 mean: 3.625
## Sample 1 Standard Deviation: 2.744496
From the histogram we will see it resemble a Poisson Distribution. This distribution is used for counting the number of events (playoff appearance) that occur within a fixed interval of time or space, given the event occurs with a known constant mean rate and independently of the time since the last event.
y <- dpois(x, lambda = mu_x)
params <- paste("(lambda = ",mu_x,')')
ggplot() +
geom_segment(mapping = aes(x = x, y = 0, xend = x, yend = y)) +
geom_point(mapping = aes(x = x, y = y)) +
labs(title = paste("PMF for Sample 1 Poisson Distribution", params),
x = "# of playoff appearances by a team",
y = "Probability") +
theme_minimal()
x <- sample_2_summary$count_playoffs
mu_x <- mean(x)
sd_x <- sd(x)
ggplot() +
geom_histogram(data = sample_2_summary, aes(count_playoffs), binwidth = 1)+
labs(title = "Sample 2 Histogram of Playoff Appearances per NFL Teams",
x = "Playoff Appearances",
y = "# of teams") +
theme_minimal()
cat("Sample 2 mean: ",mu_x, '\nSample 2 Standard Deviation: ', sd_x)
## Sample 2 mean: 3.40625
## Sample 2 Standard Deviation: 2.486861
y <- dpois(x, lambda = mu_x)
params <- paste("(lambda = ",mu_x,')')
ggplot() +
geom_segment(mapping = aes(x = x, y = 0, xend = x, yend = y)) +
geom_point(mapping = aes(x = x, y = y)) +
labs(title = paste("PMF for Sample 2 Poisson Distribution", params),
x = "# of playoff appearances by a team",
y = "Probability") +
theme_minimal()
x <- sample_3_summary$count_playoffs
mu_x <- mean(x)
sd_x <- sd(x)
ggplot() +
geom_histogram(data = sample_3_summary, aes(count_playoffs), binwidth = 1)+
labs(title = "Sample 3 Histogram of Playoff Appearances per NFL Teams",
x = "Playoff Appearances",
y = "# of teams") +
theme_minimal()
cat("Sample 3 mean: ",mu_x, '\nSample 3 Standard Deviation: ', sd_x)
## Sample 3 mean: 3.90625
## Sample 3 Standard Deviation: 2.888904
y <- dpois(x, lambda = mu_x)
params <- paste("(lambda = ",mu_x,')')
ggplot() +
geom_segment(mapping = aes(x = x, y = 0, xend = x, yend = y)) +
geom_point(mapping = aes(x = x, y = y)) +
labs(title = paste("PMF for Sample 3 Poisson Distribution", params),
x = "# of playoff appearances by a team",
y = "Probability") +
theme_minimal()
x <- sample_4_summary$count_playoffs
mu_x <- mean(x)
sd_x <- sd(x)
ggplot() +
geom_histogram(data = sample_4_summary, aes(count_playoffs), binwidth = 1)+
labs(title = "Sample 4 Histogram of Playoff Appearances per NFL Teams",
x = "Playoff Appearances",
y = "# of teams") +
theme_minimal()
cat("Sample 4 mean: ",mu_x, '\nSample 4 Standard Deviation: ', sd_x)
## Sample 4 mean: 3.71875
## Sample 4 Standard Deviation: 2.65469
y <- dpois(x, lambda = mu_x)
params <- paste("(lambda = ",mu_x,')')
ggplot() +
geom_segment(mapping = aes(x = x, y = 0, xend = x, yend = y)) +
geom_point(mapping = aes(x = x, y = y)) +
labs(title = paste("PMF for Sample 4 Poisson Distribution", params),
x = "# of playoff appearances by a team",
y = "Probability") +
theme_minimal()
x <- sample_5_summary$count_playoffs
mu_x <- mean(x)
sd_x <- sd(x)
ggplot() +
geom_histogram(data = sample_5_summary, aes(count_playoffs), binwidth = 1)+
labs(title = "Sample 5 Histogram of Playoff Appearances per NFL Teams",
x = "Playoff Appearances",
y = "# of teams") +
theme_minimal()
cat("Sample 5 mean: ",mu_x, '\nSample 5 Standard Deviation: ', sd_x)
## Sample 5 mean: 3.6875
## Sample 5 Standard Deviation: 2.787501
y <- dpois(x, lambda = mu_x)
params <- paste("(lambda = ",mu_x,')')
ggplot() +
geom_segment(mapping = aes(x = x, y = 0, xend = x, yend = y)) +
geom_point(mapping = aes(x = x, y = y)) +
labs(title = paste("PMF for Sample 5 Poisson Distribution", params),
x = "# of playoff appearances by a team",
y = "Probability") +
theme_minimal()
If we think about the distribution of playoff appearances by NFL teams, it makes sense to treat it as a normal distribution with discrete values. Most teams will have roughly the same number of playoff appearances over the same time period, few teams will have less, and a few will have more, but most centered around the mean. If we can treat the samples as being normally distributed, we can find their Z-score.
68.3% of Z-scores fall between (-1,1) standard deviations. 95.5% of Z-scores fall between (-2,2) standard deviations. 99.7% of Z-Scores fall between (-3,3). For the sake of this data dive, I will declare any sample with a Z-score outside 3 standard deviations an anomaly.
mu_1 <- mean(sample_1_summary$count_playoffs)
sd_1 <- sd(sample_1_summary$count_playoffs)
zscore_sample_1 <- (sample_1_summary$count_playoffs - mu_1) / sd_1
sort(zscore_sample_1)
## [1] -1.3208254 -1.3208254 -1.3208254 -0.9564598 -0.9564598 -0.9564598
## [7] -0.9564598 -0.9564598 -0.5920941 -0.5920941 -0.5920941 -0.5920941
## [13] -0.5920941 -0.2277285 -0.2277285 -0.2277285 -0.2277285 -0.2277285
## [19] 0.1366371 0.1366371 0.1366371 0.5010027 0.5010027 0.5010027
## [25] 0.5010027 0.8653684 0.8653684 1.2297340 1.2297340 1.5940996
## [31] 1.9584653 2.6871965
The output of the cell above is the range of z-scores for each teams total number of playoff appearances in sample 1. Sample 1 contains zero data point that falls outside of 3 standard deviations.
sample_1_summary |>
select(team_name,count_playoffs)
## # A tibble: 32 × 2
## team_name count_playoffs
## <chr> <int>
## 1 49ers 2
## 2 Bears 2
## 3 Bengals 5
## 4 Bills 1
## 5 Broncos 3
## 6 Browns 0
## 7 Buccaneers 2
## 8 Cardinals 0
## 9 Chargers 3
## 10 Chiefs 5
## # ℹ 22 more rows
There are three teams that have zero playoff appearances, which is the theoretical minimum they could have. I hypothesize that future samples will not consider 0 appearances as an anomaly.
Using sample 1’s distribution, we can calculate the Z-scores of the other samples to compare them to sample 1.
zscore_sample_2 <- (sample_2_summary$count_playoffs - mu_1) / sd_1
sort(zscore_sample_2)
## [1] -1.3208254 -1.3208254 -1.3208254 -0.9564598 -0.9564598 -0.5920941
## [7] -0.5920941 -0.5920941 -0.5920941 -0.5920941 -0.5920941 -0.5920941
## [13] -0.5920941 -0.5920941 -0.5920941 -0.2277285 -0.2277285 -0.2277285
## [19] -0.2277285 0.1366371 0.1366371 0.1366371 0.5010027 0.5010027
## [25] 0.5010027 0.5010027 0.8653684 0.8653684 0.8653684 0.8653684
## [31] 1.2297340 3.0515622
sample_2_summary |>
select(team_name,count_playoffs) |>
filter(count_playoffs == max(count_playoffs))
## # A tibble: 1 × 2
## team_name count_playoffs
## <chr> <int>
## 1 Packers 12
Sample 2 contains an anomaly with a standard deviation of 3.05. The anomaly in question are the Packers with 12 playoff appearances. Lets compare this sub-sample to the other samples and see if they also classify it as an anomaly. To do this, we will change the mu and sd values to reflect the distribution of each sample, then find the Z-scores of sample 2.
mu_2 <- mean(sample_2_summary$count_playoffs)
sd_2 <- sd(sample_2_summary$count_playoffs)
zscore_sample_2 <- (sample_2_summary$count_playoffs - mu_2) / sd_2
sort(zscore_sample_2)
## [1] -1.3696988 -1.3696988 -1.3696988 -0.9675854 -0.9675854 -0.5654720
## [7] -0.5654720 -0.5654720 -0.5654720 -0.5654720 -0.5654720 -0.5654720
## [13] -0.5654720 -0.5654720 -0.5654720 -0.1633586 -0.1633586 -0.1633586
## [19] -0.1633586 0.2387548 0.2387548 0.2387548 0.6408682 0.6408682
## [25] 0.6408682 0.6408682 1.0429816 1.0429816 1.0429816 1.0429816
## [31] 1.4450951 3.4556621
The sub-sample can again be classified as an anomaly relative to sample 2 distribution.
mu_3 <- mean(sample_3_summary$count_playoffs)
sd_3 <- sd(sample_3_summary$count_playoffs)
zscore_sample_2 <- (sample_2_summary$count_playoffs - mu_3) / sd_3
sort(zscore_sample_2)
## [1] -1.35215640 -1.35215640 -1.35215640 -1.00600436 -1.00600436 -0.65985233
## [7] -0.65985233 -0.65985233 -0.65985233 -0.65985233 -0.65985233 -0.65985233
## [13] -0.65985233 -0.65985233 -0.65985233 -0.31370029 -0.31370029 -0.31370029
## [19] -0.31370029 0.03245175 0.03245175 0.03245175 0.37860379 0.37860379
## [25] 0.37860379 0.37860379 0.72475583 0.72475583 0.72475583 0.72475583
## [31] 1.07090787 2.80166807
And there we have it! Sample 3 would not classify the sample 2 sub-sample as an anomaly. Lets continue to Sample 4.
mu_4 <- mean(sample_4_summary$count_playoffs)
sd_4 <- sd(sample_4_summary$count_playoffs)
zscore_sample_2 <- (sample_2_summary$count_playoffs - mu_4) / sd_4
sort(zscore_sample_2)
## [1] -1.4008227 -1.4008227 -1.4008227 -1.0241309 -1.0241309 -0.6474391
## [7] -0.6474391 -0.6474391 -0.6474391 -0.6474391 -0.6474391 -0.6474391
## [13] -0.6474391 -0.6474391 -0.6474391 -0.2707472 -0.2707472 -0.2707472
## [19] -0.2707472 0.1059446 0.1059446 0.1059446 0.4826364 0.4826364
## [25] 0.4826364 0.4826364 0.8593282 0.8593282 0.8593282 0.8593282
## [31] 1.2360200 3.1194791
The sub-sample can be classified as an anomaly relative to sample 4’s distributino.
mu_5 <- mean(sample_5_summary$count_playoffs)
sd_5 <- sd(sample_5_summary$count_playoffs)
zscore_sample_2 <- (sample_2_summary$count_playoffs - mu_5) / sd_5
sort(zscore_sample_2)
## [1] -1.3228695 -1.3228695 -1.3228695 -0.9641252 -0.9641252 -0.6053810
## [7] -0.6053810 -0.6053810 -0.6053810 -0.6053810 -0.6053810 -0.6053810
## [13] -0.6053810 -0.6053810 -0.6053810 -0.2466367 -0.2466367 -0.2466367
## [19] -0.2466367 0.1121076 0.1121076 0.1121076 0.4708519 0.4708519
## [25] 0.4708519 0.4708519 0.8295961 0.8295961 0.8295961 0.8295961
## [31] 1.1883404 2.9820618
And there it is again! Sample 5 would not classify the sample 2 sub-sample as an anomaly.
We can conclude that some sub-samples will be classified as anomalies in one distribution that wouldn’t be in another.
We also have further evidence that no sample considers zero wins an anomaly, which would answer the question if there are aspects of the data that are consistent among all sub-samples.
Let’s imagine a scenario where I want to calculate the number of playoff appearance by the top 1% of teams. How many playoff appearances would that be?
To do this we can run a Monte Carlo Simulation. The first step is to generate random samples from our distribution. I want this simulation to be representative of all 5 of our samples, so I will calculate the mean of all 5 for our lambda values.
samples_mean <- mean(c(sample_1_summary$count_playoffs, sample_2_summary$count_playoffs, sample_3_summary$count_playoffs, sample_4_summary$count_playoffs, sample_5_summary$count_playoffs))
samples_mean
## [1] 3.66875
mu <- samples_mean
n <- 1e4
sample_1_simulation <- data.frame(value = rpois(n, lambda = mu))
We have now done our simulation, lets plot our result.
ggplot(data = sample_1_simulation) +
geom_histogram(mapping = aes(x = value),binwidth = 1) +
theme_minimal()
x <- sample_1_simulation$value
y <- dpois(sample_1_simulation$value, lambda = mu)
params <- paste("(lambda = ",mu,')')
ggplot() +
geom_segment(mapping = aes(x = x, y = 0, xend = x, yend = y)) +
geom_point(mapping = aes(x = x, y = y)) +
labs(title = paste("PMF for Poisson Distribution of Samples", params),
x = "# of playoff appearances by a team",
y = "Probability") +
theme_minimal()
top_1_percent <- qpois(0.99, lambda = mu)
top_1_percent
## [1] 9
Our results tells us that the top 1% of teams will have 9+ wins.
These results demonstrate the differences samples can have. What might be normal in one sample might be an anomaly in another. It also shows the importance of having large data sets. A large data set will encapsulate much more information and nuance and be closer to the true population, therefore allowing you to make better conclusions.
In this data dive we found the average number of playoff appearances in each sample. In the population data set, there was much more data and represented 20 years of playoff appearances. Our samples are half the size, so we might expect the data to represent about 10 years of playoff appearances. Sample 3 had an average of 3.9 appearances. If we extrapolate our results, we would expect over 20 years the average number of playoff appearance to be around 7.8.
standings_summary <- standings |>
group_by(team_name) |>
summarise(
count_playoffs = sum(playoffs == 'Playoffs')
)
mean(standings_summary$count_playoffs)
## [1] 7.5
As we can see the true population mean is 7.5 which is pretty close to our sample. However, as the sample size goes down, there is more and more of a chance that the sample mean will be different from the population mean.
This is important to remember in future exercise to prevent under fitting our models.