My Code

Reading and Viewing the Data

For this weeks data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.

standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')

## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

standings

## # A tibble: 638 × 15
##    team         team_name  year  wins  loss points_for points_against
##    <chr>        <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Miami        Dolphins   2000    11     5        323            226
##  2 Indianapolis Colts      2000    10     6        429            326
##  3 New York     Jets       2000     9     7        321            321
##  4 Buffalo      Bills      2000     8     8        315            350
##  5 New England  Patriots   2000     5    11        276            338
##  6 Tennessee    Titans     2000    13     3        346            191
##  7 Baltimore    Ravens     2000    12     4        333            165
##  8 Pittsburgh   Steelers   2000     9     7        321            255
##  9 Jacksonville Jaguars    2000     7     9        367            327
## 10 Cincinnati   Bengals    2000     4    12        185            359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(boot)

For this weeks data dive I must devise at least two different null hypotheses based on two different aspects of the data. Using my experience from the previous data dives I will be investigating the points_for and defensive_ranking attributes.

Null Hypothesis 1: Defensive Ranking of Superbowl Winners vs. Playoff Teams

My first null hypothesis is as follows:

The average defensive ranking of Superbowl winners is equal to teams who made the playoffs but didn’t win the Superbowl.

defensive_ranking_data <- standings |> 
  select(defensive_ranking, playoffs, sb_winner) |>
  filter(playoffs == "Playoffs")

This code prepares the data for analysis by isolating teams that reached the playoffs and focusing on their defensive rankings and Superbowl outcomes.

defensive_ranking_data |>
  ggplot() +
  geom_boxplot(mapping = 
                 aes(x = defensive_ranking, 
                     y = factor(sb_winner, levels = c("No Superbowl", "Won Superbowl"),
                                labels = c("Non-Winners", "Winners")))) +
  labs(title = "Defensive Ranking of Playoff teams",
       x = "Defensive Ranking",
       y = "Playoff Result") +
  theme_minimal()

This visualization provides an initial comparison of defensive rankings between teams that won the Superbowl and those that did not, among those that made the playoffs.

Our initial look at the data suggests that playoff teams who were Superbowl winning teams have a higher defensive ranking than non-Superbowl winning teams.

avg_defensive_ranking <- defensive_ranking_data |>
  group_by(sb_winner) |>
  summarize(avg_defensive_rank = mean(defensive_ranking)) |>
  arrange(sb_winner)

avg_defensive_ranking

## # A tibble: 2 × 2
##   sb_winner     avg_defensive_rank
##   <chr>                      <dbl>
## 1 No Superbowl                2.10
## 2 Won Superbowl               4.06

This step computes the average defensive ranking for each group (Superbowl winners vs. non-winners) to quantify differences.

observed_diff <- (avg_defensive_ranking$avg_defensive_rank[2] -
                  avg_defensive_ranking$avg_defensive_rank[1])
paste("Observed Difference: ", observed_diff)

## [1] "Observed Difference:  1.95545454545455"

# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
  set.seed(100)
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull the sample (e.g., a vector or data frame)
    x_sample <- sample_n(x, size = length(x), replace = TRUE)
    
    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}

diff_in_avg <- function (x_data) {
  avg_defensive_ranking <- x_data |>
    group_by(sb_winner) |>
    summarize(avg_defensive_rank = mean(defensive_ranking)) |>
    arrange(sb_winner)
  
  # difference = revenue_with - revenue_without
  diff <- (avg_defensive_ranking$avg_defensive_rank[2] - 
           avg_defensive_ranking$avg_defensive_rank[1])
  
  return(diff)
}

diffs_in_avgs <- na.omit(bootstrap(defensive_ranking_data, diff_in_avg, n_iter = 1000))

By bootstrapping differences in average defensive rankings, we assess the robustness of our observations and account for sampling variability.

When running the bootstrap, there were times when no Superbowl winning team was sampled. This caused the difference calculation to be NA. I have included na.omit to remove these runs since its imperative that we are comparing Superbowl winning teams to non-Superbowl winning teams

ggplot() +
  geom_function(xlim = c(-20, 20), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = sd(diffs_in_avgs))) +
  geom_vline(mapping = aes(xintercept = observed_diff,
                           color = paste("observed: ",
                                         round(observed_diff)))) +
  labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
       x = "Difference in Revenue Calculated",
       y = "Probability Density",
       color = "") +
  scale_x_continuous(breaks = seq(-20, 20, 2)) +
  theme_minimal()

This plot shows the distribution of bootstrapped differences and highlights the observed difference, aiding in the visual assessment of hypothesis testing.

My Alternative Hypothesis is that the defensive ranking of Superbowl winning teams vs non-Superbowl winning teams among playoff teams is not equal.

cohen.d(d = filter(defensive_ranking_data, sb_winner == "No Superbowl") |> pluck("defensive_ranking"),
        f = filter(defensive_ranking_data, sb_winner == "Won Superbowl") |> pluck("defensive_ranking"))

## 
## Cohen's d
## 
## d estimate: -0.6565775 (medium)
## 95 percent confidence interval:
##      lower      upper 
## -1.1204383 -0.1927166

Cohen’s D provides a standardized measure of the effect size, indicating the strength and direction of the relationship between defensive rankings and Superbowl victories.

With a Cohen D value of -0.65 we have enough effect to conduct a hypothesis test.

After consideration I have decided

A meaningful difference in defensive ranking between teams is 2.
We can accept an \(\alpha = 0.1\) and power \(1 - \beta = .85\).

Considering NFL teams must balance how many resources the allocate between offense and defense, it is important that they dont over or under estimate the significance of defensive performance. Howver there is a common phrase among sports fans, “Defense wins championships.”

The alpha level indicates the probability of having a Type I error. A typical alpha level is 0.05, so having a slightly higher value indicates a more aggressive stance of finding significant effects that could justify investment in defensive strategies.

The power level indicates the complement of having a Type II error. Setting a high power level of 0.85 ensure a high probability of detecting the true effect of defensive performance on a teams success.

test <- pwrss.t.2means(mu1 = 2, 
                       sd1 = sd(pluck(defensive_ranking_data, "defensive_ranking")),
                       kappa = 1,
                       power = .85, alpha = 0.1, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.85 
##   n1 = 34 
##   n2 = 34 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 66 
##  Non-centrality parameter = 2.73 
##  Type I error rate = 0.1 
##  Type II error rate = 0.15

plot(test)

This analysis calculates the necessary sample sizes for each group to ensure the statistical test has adequate power to detect a meaningful difference in defensive rankings.

From this we can see n1 and n2 equal 34. This means we need 34 defensive rankings for each group for our hypothesis test to maintain the strength defined. Unfortunately, this data only consists of 20 Superbowl winning teams, meaning we shouldn’t conduct a hypothesis test.

Fisher’s Significance Test of Hypothesis 1

f_sampling <- function(x) dnorm(x, mean = 0, 
                                sd = sd(diffs_in_avgs))

ggplot() +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(observed_diff, 15),
                geom = "area") +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(-15, -observed_diff),
                geom = "area") +
  geom_function(xlim = c(-15, 15), 
                fun = f_sampling) +
  geom_vline(mapping = aes(xintercept = observed_diff,
                           color = paste("observed: ",
                                         round(observed_diff, 1)))) +
  labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
       x = "Difference in Revenue Calculated",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-15, 15, 2)) +
  scale_fill_manual(values = 'lightblue') +
  theme_minimal()

This visualization highlights the bootstrapped sampling distribution of the differences, marking areas that represent more extreme samples compared to the observed difference. The areas filled in light blue on both sides of the observed difference denote the regions of more extreme outcomes, helping to visually assess the significance of the observed difference.

# "demean" the bootstrapped samples to simulate mu = 0
diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)

# proportion of times the difference is more extreme
paste("p-value ", 
      sum(abs(observed_diff) < abs(diffs_in_avgs_d)) /
        length(diffs_in_avgs_d))

## [1] "p-value  0.607287449392713"

This step computes the p-value by quantifying how often bootstrapped differences that are as extreme as, or more extreme than, the observed difference occur under the simulated null hypothesis.

Assuming the defensive ranking has no effect on whether a playoff team wins the Superbowl, then 60 of 100 samples this large would yield a difference in defensive ranking of 1.95 or more.

This p-value is larger than our alpha value and therefore we cannot reject our null hypothesis.

Null Hypothesis 2: Points Differential of Playoff Teams vs. Non-Playoff Teams

My second null hypothesis is as follows:

The average points differential between playoff teams and non-playoff teams is equal.

points_diff_data <- standings |>
  select(points_differential, playoffs)

This code prepares the data by extracting the points differential and playoff status of teams.

points_diff_data |>
  ggplot() +
  geom_boxplot(mapping = 
                 aes(x = points_differential, 
                     y = factor(playoffs, levels = c("No Playoffs", "Playoffs"),
                                labels = c("No Playoffs", "Playoffs")))) +
  labs(title = "Points Differential of Playoff and Non-Playoff teams",
       x = "Points Differential",
       y = "Playoff Result") +
  theme_minimal()

This visualization allows for an initial comparison of points differentials between teams that made the playoffs and those that did not.

avg_points_diff <- points_diff_data |>
  group_by(playoffs) |>
  summarize(avg_points_diff = mean(points_differential)) |>
  arrange(playoffs)

avg_points_diff

## # A tibble: 2 × 2
##   playoffs    avg_points_diff
##   <chr>                 <dbl>
## 1 No Playoffs           -55.7
## 2 Playoffs               92.4

observed_diff <- (avg_points_diff$avg_points_diff[2] -
                  avg_points_diff$avg_points_diff[1])
paste("Observed Difference: ", observed_diff)

## [1] "Observed Difference:  148.138630653266"

This step quantifies the observed difference in points differentials.

# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
  set.seed(100)
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull the sample (e.g., a vector or data frame)
    x_sample <- sample_n(x, size = length(x), replace = TRUE)
    
    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}

diff_in_avg <- function (x_data) {
  avg_points_diff <- x_data |>
    group_by(playoffs) |>
    summarize(avg_points_diff = mean(points_differential)) |>
    arrange(playoffs)
  
  # difference = revenue_with - revenue_without
  diff <- (avg_points_diff$avg_points_diff[2] - 
           avg_points_diff$avg_points_diff[1])
  
  return(diff)
}

diffs_in_avgs <- na.omit(bootstrap(points_diff_data, diff_in_avg, n_iter = 1000))

By bootstrapping differences in average points differential, we assess the robustness of our observations and account for sampling variability.

ggplot() +
  geom_function(xlim = c(-500, 500), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = sd(diffs_in_avgs))) +
  geom_vline(mapping = aes(xintercept = observed_diff,
                           color = paste("observed: ",
                                         round(observed_diff)))) +
  labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
       x = "Difference in Revenue Calculated",
       y = "Probability Density",
       color = "") +
  scale_x_continuous(breaks = seq(-500, 500, 100)) +
  theme_minimal()

This plot aids in visualizing the significance of the observed difference.

My alternative hypothesis is that the average points differential of a playoff teams and a non-playoff teams is not equal.

cohen.d(d = filter(points_diff_data, playoffs == "No Playoffs") |> pluck("points_differential"),
        f = filter(points_diff_data, playoffs == "Playoffs") |> pluck("points_differential"))

## 
## Cohen's d
## 
## d estimate: -2.080678 (large)
## 95 percent confidence interval:
##     lower     upper 
## -2.277754 -1.883602

The Cohen’s D value quantifies the effect size of the difference in points differential between playoff and non-playoff teams. We can see that the difference is large and therefore there is reason to not conduct a Hypothesis test.

After consideration I have decided

A meaningful difference in point differential between teams is 25 points.
I can accept an \(\alpha = 0.1\) and power \(1 - \beta = .85\).

Considering NFL teams must balance how many resources the allocate between offense and defense, it is important that teams understand point differential between playoff teams and non-playoff teams .

Setting the alpha level to 0.1 gives a 10% chance of committing a Type I error, which would mean rejecting the null hypothesis when it is true. Meaning that playoff teams and non-playoff teams have equal average points differential.

A power level of 0.85 means there is an 85% chance of correctly rejecting the null hypothesis when it is false, which is considered a strong level of power in many contexts. This level is chosen to ensure a high likelihood of detecting an effect if it exists. This would be an important effect for NFL teams if they have a particularly weak offensive or defensive team. They would need to ensure they allocate more resources to one side in order to make the playoffs.

test <- pwrss.t.2means(mu1 = 50, 
                       sd1 = sd(pluck(points_diff_data, "points_differential")),
                       kappa = 1,
                       power = .85, alpha = 0.1, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.85 
##   n1 = 60 
##   n2 = 60 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 118 
##  Non-centrality parameter = 2.709 
##  Type I error rate = 0.1 
##  Type II error rate = 0.15

plot(test)

## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'

This analysis determines the necessary sample size to ensure the hypothesis test has sufficient power to detect a meaningful difference.

From this we can see n1 and n2 equal 60. This means we need 60 teams for each group for our hypothesis test to maintain the strength defined, which we do in fact have.

Fisher’s Significance Test of Hypothesis 2

f_sampling <- function(x) dnorm(x, mean = 0, 
                                sd = sd(diffs_in_avgs))

ggplot() +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(observed_diff, 500),
                geom = "area") +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(-500, -observed_diff),
                geom = "area") +
  geom_function(xlim = c(-500, 500), 
                fun = f_sampling) +
  geom_vline(mapping = aes(xintercept = observed_diff,
                           color = paste("observed: ",
                                         round(observed_diff, 1)))) +
  labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
       x = "Difference in Revenue Calculated",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-500, 500, 100)) +
  scale_fill_manual(values = 'lightblue') +
  theme_minimal()

# "demean" the bootstrapped samples to simulate mu = 0
diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)

# proportion of times the difference is more extreme
paste("p-value ", 
      sum(abs(observed_diff) < abs(diffs_in_avgs_d)) /
        length(diffs_in_avgs_d))

## [1] "p-value  0.154708520179372"

This step provides a p-value, indicating the probability of observing a difference as extreme as, or more extreme than, the one observed if the null hypothesis were true.

Assuming the points_differential has no effect on whether a team makes the playoffs, then 15 of 100 samples this large would yield a difference in defensive ranking of 148.1 or more.

This p-value is larger than our alpha value and therefore we cannot reject the null hypothesis.

Week 7 Data Dive – Hypothesis Testing