The purpose of this week’s data dive is for you to explore hypothesis testing with your dataset.
Your RMarkdown notebook for this data dive should contain the following:
After having explored your dataset over the past few weeks, you should already have some questions.
Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:
Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value.
Determine if you have enough data to perform a Neyman-Pearson hypothesis test. If you do, show your sample size calculation, perform the test, and interpret results. If not, explain why there isn’t enough data.
Perform a Fisher’s style test for significance on the same hypothesis, and interpret the p-value.
(In the end, you should have two hypothesis tests for each hypothesis, equating two four total tests.)
Build two visualizations that best illustrate your results, one for each null hypothesis.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
For this weeks data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(boot)
For this weeks data dive I must devise at least two different null hypotheses based on two different aspects of the data. Using my experience from the previous data dives I will be investigating the points_for and defensive_ranking attributes.
My first null hypothesis is as follows:
defensive_ranking_data <- standings |>
select(defensive_ranking, playoffs, sb_winner) |>
filter(playoffs == "Playoffs")
This code prepares the data for analysis by isolating teams that reached the playoffs and focusing on their defensive rankings and Superbowl outcomes.
defensive_ranking_data |>
ggplot() +
geom_boxplot(mapping =
aes(x = defensive_ranking,
y = factor(sb_winner, levels = c("No Superbowl", "Won Superbowl"),
labels = c("Non-Winners", "Winners")))) +
labs(title = "Defensive Ranking of Playoff teams",
x = "Defensive Ranking",
y = "Playoff Result") +
theme_minimal()
This visualization provides an initial comparison of defensive rankings between teams that won the Superbowl and those that did not, among those that made the playoffs.
Our initial look at the data suggests that playoff teams who were Superbowl winning teams have a higher defensive ranking than non-Superbowl winning teams.
avg_defensive_ranking <- defensive_ranking_data |>
group_by(sb_winner) |>
summarize(avg_defensive_rank = mean(defensive_ranking)) |>
arrange(sb_winner)
avg_defensive_ranking
## # A tibble: 2 × 2
## sb_winner avg_defensive_rank
## <chr> <dbl>
## 1 No Superbowl 2.10
## 2 Won Superbowl 4.06
This step computes the average defensive ranking for each group (Superbowl winners vs. non-winners) to quantify differences.
observed_diff <- (avg_defensive_ranking$avg_defensive_rank[2] -
avg_defensive_ranking$avg_defensive_rank[1])
paste("Observed Difference: ", observed_diff)
## [1] "Observed Difference: 1.95545454545455"
# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
set.seed(100)
# empty vector to be filled with values from each iteration
func_values <- c(NULL)
# we simulate sampling `n_iter` times
for (i in 1:n_iter) {
# pull the sample (e.g., a vector or data frame)
x_sample <- sample_n(x, size = length(x), replace = TRUE)
# add on this iteration's value to the collection
func_values <- c(func_values, func(x_sample))
}
return(func_values)
}
diff_in_avg <- function (x_data) {
avg_defensive_ranking <- x_data |>
group_by(sb_winner) |>
summarize(avg_defensive_rank = mean(defensive_ranking)) |>
arrange(sb_winner)
# difference = revenue_with - revenue_without
diff <- (avg_defensive_ranking$avg_defensive_rank[2] -
avg_defensive_ranking$avg_defensive_rank[1])
return(diff)
}
diffs_in_avgs <- na.omit(bootstrap(defensive_ranking_data, diff_in_avg, n_iter = 1000))
By bootstrapping differences in average defensive rankings, we assess the robustness of our observations and account for sampling variability.
When running the bootstrap, there were times when no Superbowl winning team was sampled. This caused the difference calculation to be NA. I have included na.omit to remove these runs since its imperative that we are comparing Superbowl winning teams to non-Superbowl winning teams
ggplot() +
geom_function(xlim = c(-20, 20),
fun = function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_avgs))) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff)))) +
labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
x = "Difference in Revenue Calculated",
y = "Probability Density",
color = "") +
scale_x_continuous(breaks = seq(-20, 20, 2)) +
theme_minimal()
This plot shows the distribution of bootstrapped differences and highlights the observed difference, aiding in the visual assessment of hypothesis testing.
My Alternative Hypothesis is that the defensive ranking of Superbowl winning teams vs non-Superbowl winning teams among playoff teams is not equal.
cohen.d(d = filter(defensive_ranking_data, sb_winner == "No Superbowl") |> pluck("defensive_ranking"),
f = filter(defensive_ranking_data, sb_winner == "Won Superbowl") |> pluck("defensive_ranking"))
##
## Cohen's d
##
## d estimate: -0.6565775 (medium)
## 95 percent confidence interval:
## lower upper
## -1.1204383 -0.1927166
Cohen’s D provides a standardized measure of the effect size, indicating the strength and direction of the relationship between defensive rankings and Superbowl victories.
With a Cohen D value of -0.65 we have enough effect to conduct a hypothesis test.
After consideration I have decided
A meaningful difference in defensive ranking between teams is 2.
We can accept an \(\alpha = 0.1\) and power \(1 - \beta = .85\).
Considering NFL teams must balance how many resources the allocate between offense and defense, it is important that they dont over or under estimate the significance of defensive performance. Howver there is a common phrase among sports fans, “Defense wins championships.”
The alpha level indicates the probability of having a Type I error. A typical alpha level is 0.05, so having a slightly higher value indicates a more aggressive stance of finding significant effects that could justify investment in defensive strategies.
The power level indicates the complement of having a Type II error. Setting a high power level of 0.85 ensure a high probability of detecting the true effect of defensive performance on a teams success.
test <- pwrss.t.2means(mu1 = 2,
sd1 = sd(pluck(defensive_ranking_data, "defensive_ranking")),
kappa = 1,
power = .85, alpha = 0.1,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.85
## n1 = 34
## n2 = 34
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 66
## Non-centrality parameter = 2.73
## Type I error rate = 0.1
## Type II error rate = 0.15
plot(test)
This analysis calculates the necessary sample sizes for each group to ensure the statistical test has adequate power to detect a meaningful difference in defensive rankings.
From this we can see n1 and n2 equal 34. This means we need 34 defensive rankings for each group for our hypothesis test to maintain the strength defined. Unfortunately, this data only consists of 20 Superbowl winning teams, meaning we shouldn’t conduct a hypothesis test.
f_sampling <- function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_avgs))
ggplot() +
stat_function(mapping = aes(fill = 'more extreme samples'),
fun = f_sampling,
xlim = c(observed_diff, 15),
geom = "area") +
stat_function(mapping = aes(fill = 'more extreme samples'),
fun = f_sampling,
xlim = c(-15, -observed_diff),
geom = "area") +
geom_function(xlim = c(-15, 15),
fun = f_sampling) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff, 1)))) +
labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
x = "Difference in Revenue Calculated",
y = "Probability Density",
color = "",
fill = "") +
scale_x_continuous(breaks = seq(-15, 15, 2)) +
scale_fill_manual(values = 'lightblue') +
theme_minimal()
This visualization highlights the bootstrapped sampling distribution of the differences, marking areas that represent more extreme samples compared to the observed difference. The areas filled in light blue on both sides of the observed difference denote the regions of more extreme outcomes, helping to visually assess the significance of the observed difference.
# "demean" the bootstrapped samples to simulate mu = 0
diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)
# proportion of times the difference is more extreme
paste("p-value ",
sum(abs(observed_diff) < abs(diffs_in_avgs_d)) /
length(diffs_in_avgs_d))
## [1] "p-value 0.607287449392713"
This step computes the p-value by quantifying how often bootstrapped differences that are as extreme as, or more extreme than, the observed difference occur under the simulated null hypothesis.
Assuming the defensive ranking has no effect on whether a playoff team wins the Superbowl, then 60 of 100 samples this large would yield a difference in defensive ranking of 1.95 or more.
This p-value is larger than our alpha value and therefore we cannot reject our null hypothesis.
My second null hypothesis is as follows:
points_diff_data <- standings |>
select(points_differential, playoffs)
This code prepares the data by extracting the points differential and playoff status of teams.
points_diff_data |>
ggplot() +
geom_boxplot(mapping =
aes(x = points_differential,
y = factor(playoffs, levels = c("No Playoffs", "Playoffs"),
labels = c("No Playoffs", "Playoffs")))) +
labs(title = "Points Differential of Playoff and Non-Playoff teams",
x = "Points Differential",
y = "Playoff Result") +
theme_minimal()
This visualization allows for an initial comparison of points differentials between teams that made the playoffs and those that did not.
avg_points_diff <- points_diff_data |>
group_by(playoffs) |>
summarize(avg_points_diff = mean(points_differential)) |>
arrange(playoffs)
avg_points_diff
## # A tibble: 2 × 2
## playoffs avg_points_diff
## <chr> <dbl>
## 1 No Playoffs -55.7
## 2 Playoffs 92.4
observed_diff <- (avg_points_diff$avg_points_diff[2] -
avg_points_diff$avg_points_diff[1])
paste("Observed Difference: ", observed_diff)
## [1] "Observed Difference: 148.138630653266"
This step quantifies the observed difference in points differentials.
# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
set.seed(100)
# empty vector to be filled with values from each iteration
func_values <- c(NULL)
# we simulate sampling `n_iter` times
for (i in 1:n_iter) {
# pull the sample (e.g., a vector or data frame)
x_sample <- sample_n(x, size = length(x), replace = TRUE)
# add on this iteration's value to the collection
func_values <- c(func_values, func(x_sample))
}
return(func_values)
}
diff_in_avg <- function (x_data) {
avg_points_diff <- x_data |>
group_by(playoffs) |>
summarize(avg_points_diff = mean(points_differential)) |>
arrange(playoffs)
# difference = revenue_with - revenue_without
diff <- (avg_points_diff$avg_points_diff[2] -
avg_points_diff$avg_points_diff[1])
return(diff)
}
diffs_in_avgs <- na.omit(bootstrap(points_diff_data, diff_in_avg, n_iter = 1000))
By bootstrapping differences in average points differential, we assess the robustness of our observations and account for sampling variability.
ggplot() +
geom_function(xlim = c(-500, 500),
fun = function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_avgs))) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff)))) +
labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
x = "Difference in Revenue Calculated",
y = "Probability Density",
color = "") +
scale_x_continuous(breaks = seq(-500, 500, 100)) +
theme_minimal()
This plot aids in visualizing the significance of the observed difference.
My alternative hypothesis is that the average points differential of a playoff teams and a non-playoff teams is not equal.
cohen.d(d = filter(points_diff_data, playoffs == "No Playoffs") |> pluck("points_differential"),
f = filter(points_diff_data, playoffs == "Playoffs") |> pluck("points_differential"))
##
## Cohen's d
##
## d estimate: -2.080678 (large)
## 95 percent confidence interval:
## lower upper
## -2.277754 -1.883602
The Cohen’s D value quantifies the effect size of the difference in points differential between playoff and non-playoff teams. We can see that the difference is large and therefore there is reason to not conduct a Hypothesis test.
After consideration I have decided
A meaningful difference in point differential between teams is 25 points.
I can accept an \(\alpha = 0.1\) and power \(1 - \beta = .85\).
Considering NFL teams must balance how many resources the allocate between offense and defense, it is important that teams understand point differential between playoff teams and non-playoff teams .
Setting the alpha level to 0.1 gives a 10% chance of committing a Type I error, which would mean rejecting the null hypothesis when it is true. Meaning that playoff teams and non-playoff teams have equal average points differential.
A power level of 0.85 means there is an 85% chance of correctly rejecting the null hypothesis when it is false, which is considered a strong level of power in many contexts. This level is chosen to ensure a high likelihood of detecting an effect if it exists. This would be an important effect for NFL teams if they have a particularly weak offensive or defensive team. They would need to ensure they allocate more resources to one side in order to make the playoffs.
test <- pwrss.t.2means(mu1 = 50,
sd1 = sd(pluck(points_diff_data, "points_differential")),
kappa = 1,
power = .85, alpha = 0.1,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.85
## n1 = 60
## n2 = 60
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 118
## Non-centrality parameter = 2.709
## Type I error rate = 0.1
## Type II error rate = 0.15
plot(test)
## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'
This analysis determines the necessary sample size to ensure the hypothesis test has sufficient power to detect a meaningful difference.
From this we can see n1 and n2 equal 60. This means we need 60 teams for each group for our hypothesis test to maintain the strength defined, which we do in fact have.
f_sampling <- function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_avgs))
ggplot() +
stat_function(mapping = aes(fill = 'more extreme samples'),
fun = f_sampling,
xlim = c(observed_diff, 500),
geom = "area") +
stat_function(mapping = aes(fill = 'more extreme samples'),
fun = f_sampling,
xlim = c(-500, -observed_diff),
geom = "area") +
geom_function(xlim = c(-500, 500),
fun = f_sampling) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff, 1)))) +
labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
x = "Difference in Revenue Calculated",
y = "Probability Density",
color = "",
fill = "") +
scale_x_continuous(breaks = seq(-500, 500, 100)) +
scale_fill_manual(values = 'lightblue') +
theme_minimal()
# "demean" the bootstrapped samples to simulate mu = 0
diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)
# proportion of times the difference is more extreme
paste("p-value ",
sum(abs(observed_diff) < abs(diffs_in_avgs_d)) /
length(diffs_in_avgs_d))
## [1] "p-value 0.154708520179372"
This step provides a p-value, indicating the probability of observing a difference as extreme as, or more extreme than, the one observed if the null hypothesis were true.
Assuming the points_differential has no effect on whether a team makes the playoffs, then 15 of 100 samples this large would yield a difference in defensive ranking of 148.1 or more.
This p-value is larger than our alpha value and therefore we cannot reject the null hypothesis.