Data Dive: A/B and Hypothesis Testing


Introduction

This data dive applies hypothesis testing within an A/B testing framework to evaluate differences between two groups of flights. Two separate hypotheses are tested using different statistical frameworks: the Neyman–Pearson approach for decision-based testing with power considerations, and Fisher’s significance testing framework, which treats the p-value as a measure of evidence. The goal is to evaluate whether the observed differences are both statistically and practically meaningful.


Hypothesis 1: Neyman-Pearson Framework

Do evening flights have higher average departure delays than morning flights?

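library(tidyverse)
library(nycflights13)

# Keep flights with a recorded departure, then label them by actual departure hour.
df1 <- flights |>
  filter(!is.na(dep_delay), !is.na(dep_time)) |>
  mutate(hour = dep_time %/% 100) |>   # dep_time is HHMM, so 1735 becomes 17
  filter(hour >= 5 & hour <= 23) |>
  mutate(
    group = case_when(
      hour >= 5 & hour < 12 ~ "Morning",
      hour >= 17 & hour <= 23 ~ "Evening",
      TRUE ~ NA_character_              # 12:00-16:59 departures fall in neither group
    )
  ) |>
  filter(!is.na(group))                 # drop the unlabelled afternoon flights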
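
The grouping rule above can be checked on a handful of boundary cases. The following is a minimal base R sketch, not part of the original analysis; the `dep_time` values are hypothetical and chosen only to sit on the edges of each group.

```r
# Hypothetical HHMM departure times chosen to sit on the group boundaries
dep_time <- c(517, 1159, 1200, 1659, 1700, 2359, 2400)

# Integer division by 100 drops the minutes and keeps the hour
hour <- dep_time %/% 100

# Same cut-offs as the case_when() call above, written with base R ifelse()
group <- ifelse(hour >= 5 & hour < 12, "Morning",
                ifelse(hour >= 17 & hour <= 23, "Evening", NA_character_))

# Rows with NA in group are the ones the final filter() would drop
data.frame(dep_time, hour, group)
```

Departures between 12:00 and 16:59, and the 2400 edge case, receive no label, so the comparison is strictly between morning and evening departures.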

State Hypotheses

\[ H_0: \mu_{evening} - \mu_{morning} = 0 \]

\[ H_1: \mu_{evening} - \mu_{morning} > 0 \]

Alpha level: 0.05

Desired power: 0.80

Minimum meaningful difference: 5 minutes

Rationale: A 5-minute increase in average delay is practically meaningful because even small increases can affect passenger connections and scheduling reliability. Alpha = 0.05 caps the false-positive (Type I error) rate at 5%, and power = 0.80 gives an 80% chance of detecting a true difference of 5 minutes or more.


power.t.test(delta = 5,
             sd = sd(df1$dep_delay),
             sig.level = 0.05,
             power = 0.80,
             type = "two.sample",
             alternative = "one.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 851.5692
##           delta = 5
##              sd = 41.47708
##       sig.level = 0.05
##           power = 0.8
##     alternative = one.sided
## 
## NOTE: n is number in *each* group
df1 |> count(group)
## # A tibble: 2 × 2
##   group        n
##   <chr>    <int>
## 1 Evening  98852
## 2 Morning 129539

The power calculation calls for about 852 flights in each group. Since the dataset contains 98,852 evening flights and 129,539 morning flights, the sample size is far more than sufficient to perform the test.
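
As a rough cross-check on the power.t.test() output above, the required sample size can be approximated with the standard normal-approximation formula for a one-sided two-sample comparison, n ≈ 2 (z_{1-α} + z_{power})² σ² / δ². This is a sketch under that approximation, not the exact t-based calculation R performs; the standard deviation is copied from the printed output rather than recomputed from the data.

```r
# Inputs copied from the power.t.test() call and its printed output
delta <- 5          # minimum meaningful difference in minutes
sigma <- 41.47708   # sd of dep_delay as reported in the output
alpha <- 0.05
power <- 0.80

# Normal quantiles for a one-sided test at level alpha and the desired power
z_alpha <- qnorm(1 - alpha)
z_power <- qnorm(power)

# Normal-approximation sample size per group
n_approx <- 2 * (z_alpha + z_power)^2 * sigma^2 / delta^2
n_approx
```

The approximation lands within a flight or two of the 851.5692 reported by power.t.test(), which is slightly larger because it uses the t distribution instead of the normal.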

# Groups are ordered alphabetically, so "greater" tests Evening - Morning > 0
t.test(dep_delay ~ group, data = df1, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  dep_delay by group
## t = 142.09, df = 125496, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Evening and group Morning is greater than 0
## 95 percent confidence interval:
##  25.69906      Inf
## sample estimates:
## mean in group Evening mean in group Morning 
##             27.791375              1.791329
df1 |>
  ggplot(aes(x = group, y = dep_delay)) +
  geom_boxplot()+
  coord_cartesian(ylim = c(-20, 100)) +
  labs(
    title = "Departure Delay: Morning VS Evening Flights",
    x = "Group",
    y = "Departure Delay (minutes)"
  )+
  theme_classic()

Interpretation:

The Welch two-sample t-test yields a test statistic of 142.09 with a p-value less than 2.2 × 10⁻¹⁶. Because the p-value is far below the chosen alpha level of 0.05, we reject the null hypothesis and conclude that evening flights have significantly higher average departure delays than morning flights. The estimated mean delay for evening flights is approximately 27.79 minutes, compared to 1.79 minutes for morning flights, a difference of about 26 minutes. This difference greatly exceeds the predefined meaningful threshold of 5 minutes, so the result is not only statistically significant but also practically important. Because the test is one-sided, the 95% confidence interval is a lower bound: the true difference in means is at least 25.7 minutes, which reinforces the magnitude of the effect.

Although the large sample size contributes to the extremely small p-value, the substantial difference in group means indicates that the result reflects a meaningful operational pattern rather than a trivial statistical artifact.
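
The one-sided confidence bound can be reconstructed from the summary numbers printed by t.test(). The sketch below recovers the standard error from the reported t statistic and then applies the usual lower-bound formula, difference minus the 95% t quantile times the standard error. All inputs are copied from the output above, so the result is only as precise as the rounded t value.

```r
# Summary values copied from the t.test() output
mean_evening <- 27.791375
mean_morning <- 1.791329
t_stat       <- 142.09
df_welch     <- 125496

# Observed difference in means (Evening - Morning)
diff_means <- mean_evening - mean_morning

# Under H0 the hypothesised difference is 0, so t = diff / SE
se_diff <- diff_means / t_stat

# One-sided 95% lower confidence bound
lower_bound <- diff_means - qt(0.95, df_welch) * se_diff
lower_bound
```

The reconstructed bound agrees with the reported 25.69906 to within rounding, and it sits well above the 5-minute threshold set before the test.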


Hypothesis 2: Fisher’s Significance Testing Framework

Are long-distance flights more likely to arrive late than short-distance flights?

df2 <- flights |>
  filter(!is.na(arr_delay), !is.na(distance)) |>
  mutate(
    late = arr_delay > 0,
    distance_group = if_else(distance > 1000, "Long", "Short")
  )

State Hypothesis

\[ H_0: p_{long} = p_{short} \]

where p is the proportion of late arrivals in each distance group.


prop.test(
  x = df2 |> 
    group_by(distance_group) |>
    summarise(sum(late)) |>
    pull(),
  
  n = df2 |> group_by(distance_group) |>
    summarise(n()) |>
    pull()
)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  pull(summarise(group_by(df2, distance_group), sum(late))) out of pull(summarise(group_by(df2, distance_group), n()))
## X-squared = 209.23, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.02839680 -0.02161892
## sample estimates:
##    prop 1    prop 2 
## 0.3923607 0.4173686
df2 |>
  group_by(distance_group) |>
  summarise(prop_late = mean(late)) |>
  ggplot(aes(x = distance_group, y= prop_late)) +
  geom_col()+
  labs(
    title = "Proportion of Late Arrivals: Long vs Short Flights",
    x = "Distance Group",
    y = "Proportion Late"
  )+
  theme_classic()

Interpretation

The two-sample test for equality of proportions yields a p-value less than 2.2 × 10⁻¹⁶. In Fisher’s framework the p-value is read as a measure of the strength of evidence against the null hypothesis rather than compared with a pre-set error rate, and a value this small is very strong evidence that the proportion of late arrivals differs between long and short flights. The estimated late-arrival rate is approximately 39.2% for long flights and 41.7% for short flights, so short flights are slightly more likely to arrive late, the opposite of the direction the question asked about. The 95% confidence interval for the difference in proportions (long minus short) is approximately (-0.0284, -0.0216), suggesting that long flights have a late-arrival rate about 2.2 to 2.8 percentage points lower than short flights.

Although this difference is statistically significant due to the large sample size, the practical magnitude of the effect is relatively modest. This highlights the importance of distinguishing between statistical significance and practical significance when interpreting results.
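
To illustrate how sample size drives the p-value when the effect itself is fixed, the sketch below reruns prop.test() on hypothetical counts built from the two reported proportions. The group sizes of 1,000 and 100,000 are assumptions chosen for illustration, not the actual counts in df2, and `p_value_at` is a helper introduced here.

```r
# Proportions copied from the prop.test() output
p_long  <- 0.3923607
p_short <- 0.4173686

# Hypothetical helper: p-value when each group has n flights
# and the observed proportions match the ones above
p_value_at <- function(n) {
  x <- round(c(p_long, p_short) * n)   # hypothetical late-arrival counts
  prop.test(x = x, n = c(n, n))$p.value
}

p_small <- p_value_at(1000)     # same gap, modest sample
p_large <- p_value_at(100000)   # same gap, large sample

c(n_1000 = p_small, n_100000 = p_large)
```

With 1,000 flights per group the same 2.5 percentage point gap is not distinguishable from chance at conventional levels, while with 100,000 per group the p-value becomes vanishingly small. The size of the effect is identical in both cases.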


Final Reflection

This analysis illustrates the differences between the Neyman-Pearson and Fisher approaches to hypothesis testing. In the first test, the Neyman-Pearson framework required intentional decisions about alpha, power, and a minimum meaningful effect size before looking at the data, emphasizing controlled decision-making and practical significance. In the second test, the Fisher framework focused on the strength of evidence conveyed by the p-value, demonstrating how large datasets can produce statistically significant results even for relatively small differences. Together, these results highlight the importance of distinguishing between statistical and practical significance, and show that hypothesis testing is not only about detecting differences but also about interpreting them responsibly within context.