1 Overview

This workbook is a hands-on companion to Confidence Interval (CI) estimation and Hypothesis Testing.
You’ll see how to:

  • compute CIs for means, proportions, differences of means/proportions, and variances;
  • perform one-sample, two-sample (independent/paired) tests, ANOVA, chi-square tests;
  • interpret results with realistic scenarios and publication-quality visuals.

We interleave explanation, commented code, plots, interpretations, extension questions, and real-life context.

1.1 Key Concepts for Confidence Intervals

Confidence intervals (CIs) combine sample information with probability theory to quantify uncertainty about a population parameter. Below are the key building blocks:


1.1.1 Point Estimate

  • Definition: A single best guess of a population parameter, computed from sample data.
  • Examples:
    • Mean engagement time: mean(x)
    • Proportion of successful sign-ups: sum(success)/n

For the symmetric intervals used for means, the point estimate is the center of the CI.


1.1.2 Standard Error (SE)

  • Definition: The estimated standard deviation of the estimator.
  • Formula (for the sample mean):
    \[ SE = \frac{s}{\sqrt{n}} \]
  • Example: If sd(x) = 1.5 and n = 100, then
    \[ SE = 1.5/\sqrt{100} = 0.15 \]

The SE shrinks as sample size increases, yielding tighter CIs.
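
As a quick check of the arithmetic above, the sketch below recomputes the worked example and shows how the SE falls as n grows. The helper name se_mean and the grid of sample sizes are illustrative choices, not part of the workbook's own code.

```r
# Standard error of the sample mean from a sample SD (s) and sample size (n)
se_mean <- function(s, n) s / sqrt(n)

se_example <- se_mean(s = 1.5, n = 100)   # the worked example: 1.5 / sqrt(100)

# Illustrative grid: each fourfold increase in n halves the SE
n_grid  <- c(25, 100, 400, 1600)
se_grid <- se_mean(s = 1.5, n = n_grid)

data.frame(n = n_grid, SE = se_grid)
```

Because the SE scales with 1/sqrt(n), halving the width of a CI takes roughly four times as much data.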


1.1.3 Confidence Level

  • Definition: The long-run probability that the CI covers the true parameter.
  • Common choices: 90%, 95%, 99%.
  • Interpretation: A 95% CI does not mean 95% of your data are inside the interval; rather, if you repeated the sampling many times, about 95% of such intervals would capture the true mean.
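
The repeated-sampling interpretation can be checked by simulation. The sketch below is a minimal illustration under assumed settings (true mean 10, SD 2, n = 30, 2000 repetitions, an arbitrary seed); none of these values come from the workbook's datasets.

```r
set.seed(2024)          # arbitrary seed, used only for reproducibility
true_mean <- 10         # assumed population mean for the simulation
n_reps    <- 2000       # number of repeated samples

# For each repetition: draw a sample, build a 95% t-interval,
# and record whether it covers the true mean
covers <- replicate(n_reps, {
  x  <- rnorm(30, mean = true_mean, sd = 2)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covers)   # proportion of intervals that captured the truth
coverage
```

The coverage should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure, not one interval.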

1.1.4 Margin of Error (MoE)

  • Definition: The half-width of the CI, i.e. the amount added to and subtracted from the point estimate.
  • Formula:
    \[ \text{MoE} = t^* \times SE \quad \text{or} \quad z^* \times SE \]
  • Example: For a 95% CI with SE = 0.15 and t* ≈ 1.984,
    \[ \text{MoE} = 1.984 \times 0.15 \approx 0.30 \]
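
The critical values in the formula can be obtained from R's quantile functions. This sketch assumes n = 100 (so df = 99), which is what SE = 1.5/sqrt(100) and t* ≈ 1.984 imply; the variable names are illustrative.

```r
se <- 0.15
n  <- 100                          # assumed: implied by SE = 1.5 / sqrt(100)

# Two-sided 95% interval: 2.5% in each tail, so use the 0.975 quantile
t_star <- qt(0.975, df = n - 1)    # t critical value
z_star <- qnorm(0.975)             # normal critical value

moe_t <- t_star * se
moe_z <- z_star * se

round(c(t_star = t_star, z_star = z_star, moe_t = moe_t, moe_z = moe_z), 3)
```

The t-based margin is slightly wider than the z-based one; the gap shrinks as df grows.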

1.1.5 Rule of Thumb

  • Higher confidence level → wider interval. (More conservative, less risk of missing the truth.)
  • Lower confidence level → narrower interval. (More precise, but riskier.)

Quick Illustration (95% vs 99% CI):

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)

ci_95 <- t.test(x, conf.level = 0.95)$conf.int
ci_99 <- t.test(x, conf.level = 0.99)$conf.int

data.frame(
  CI_Level = c("95%", "99%"),
  Lower = c(ci_95[1], ci_99[1]),
  Upper = c(ci_95[2], ci_99[2])
)

# Packages used throughout the rest of the workbook
library(tidyverse)
library(scales)
# Optional utilities (uncomment to use)
# install.packages("car")
# library(car)

2 Use Case 1 — One-Sample Mean CI (Engagement Time)

Context. Product Analytics needs to report the average time-on-section (minutes) for a new feature to assess baseline engagement. A 95% CI helps communicate precision to stakeholders.

2.1 Learning goals

  • Compute a 95% confidence interval for a population mean when σ is unknown (use t).
  • Compare the t.test() CI to a manual CI (x̄ ± t*·SE).
  • Visualize the estimate and CI; communicate uncertainty to non-technical audiences.
  • Check assumptions (approx. normal errors or sufficiently large n).
  • Explore sensitivity to sample size and variance.

2.2 Data setup

We’ll simulate a realistic sample of 100 user sessions from a feature’s new section.

set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)  # synthetic but realistic
tt <- t.test(time_spent, conf.level = 0.95)       # t-based CI for mean (σ unknown)
tt$conf.int
## [1] 4.863925 5.407293
## attr(,"conf.level")
## [1] 0.95
mean_time <- mean(time_spent)
ci <- tt$conf.int

ggplot(data.frame(time_spent), aes(x = time_spent)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25, fill = "skyblue", color = "grey30") +
  geom_vline(xintercept = ci[1], linetype = "dashed", color = "red", linewidth = 1) +
  geom_vline(xintercept = ci[2], linetype = "dashed", color = "red", linewidth = 1) +
  geom_vline(xintercept = mean_time, color = "#2E86AB", linewidth = 1) +
  annotate("text", x = ci[1], y = 0.12, label = paste("Lower CI:", round(ci[1], 2)),
           hjust = 1.1, color = "red") +
  annotate("text", x = ci[2], y = 0.12, label = paste("Upper CI:", round(ci[2], 2)),
           hjust = -0.1, color = "red") +
  annotate("text", x = mean_time, y = 0.16, label = paste("Mean:", round(mean_time, 2)),
           hjust = 0, color = "#2E86AB") +
  labs(title = "Time on New Section: Mean and 95% CI",
       x = "Minutes", y = "Density") +
  theme_minimal(base_size = 13)
Histogram with mean and 95% CI overlaid.

Interpretation. The 95% CI, roughly 4.86 to 5.41 minutes, is a plausible range for the true average time across all users.
Extension. How does the CI change for n=50 vs n=500?
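
One of the learning goals is to compare the t.test() CI with a manual x̄ ± t*·SE calculation. The sketch below re-creates the simulated sample so it runs on its own; manual_ci and builtin_ci are illustrative names.

```r
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)   # same simulation as above

n      <- length(time_spent)
x_bar  <- mean(time_spent)                 # point estimate
se     <- sd(time_spent) / sqrt(n)         # standard error of the mean
t_star <- qt(0.975, df = n - 1)            # two-sided 95% critical value

manual_ci  <- c(x_bar - t_star * se, x_bar + t_star * se)
builtin_ci <- as.numeric(t.test(time_spent, conf.level = 0.95)$conf.int)

rbind(manual = manual_ci, t.test = builtin_ci)
all.equal(manual_ci, builtin_ci)
```

The two rows should agree, since t.test() applies the same formula to a one-sample mean.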


3 Use Case 2 — One-Sample Mean Hypothesis Test (Right-Tailed)

Context. The Growth team has set a KPI: average time-on-section ≥ 4.5 minutes. We want to test whether the true mean engagement time exceeds 4.5 minutes, using the observed sample as evidence. This is a one-sample, one-sided (right-tailed) t-test.


3.1 Learning goals

  • Formulate and conduct a right-tailed hypothesis test.
  • Interpret the test statistic, p-value, and decision rule.
  • Visualize the sample distribution relative to the hypothesized mean.
  • Extend the use case to strategic decision-making (whether the KPI target is achieved).

3.2 Hypotheses

  • Null hypothesis (H₀): μ ≤ 4.5 minutes
  • Alternative hypothesis (H₁): μ > 4.5 minutes

We test at α = 0.05 (5% significance level).
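
The decision rule can be written out by hand: compute t = (x̄ − μ₀)/SE and reject H₀ when t exceeds the upper-α critical value. The sketch below re-creates the Use Case 1 sample so it is self-contained; the variable names are illustrative.

```r
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)   # Use Case 1 simulation

mu0   <- 4.5
alpha <- 0.05
n     <- length(time_spent)

t_stat  <- (mean(time_spent) - mu0) / (sd(time_spent) / sqrt(n))
t_crit  <- qt(1 - alpha, df = n - 1)                   # right-tail critical value
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # P(T >= t_stat) under H0

reject_h0 <- t_stat > t_crit
c(t_stat = t_stat, t_crit = t_crit, p_value = p_value)
reject_h0
```

Comparing t_stat with t_crit and comparing p_value with alpha are two views of the same rule, so they always lead to the same decision.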


3.3 Data setup

We continue with the engagement dataset from Use Case 1.

ht <- t.test(time_spent, mu = 4.5, alternative = "greater")
ht
## 
##  One Sample t-test
## 
## data:  time_spent
## t = 4.6421, df = 99, p-value = 5.3e-06
## alternative hypothesis: true mean is greater than 4.5
## 95 percent confidence interval:
##  4.908264      Inf
## sample estimates:
## mean of x 
##  5.135609
dens <- density(time_spent)
ddf  <- tibble(x = dens$x, d = dens$y)

ggplot(ddf, aes(x, d)) +
  geom_histogram(data = data.frame(time_spent),
                 aes(x = time_spent, y = after_stat(density)),
                 binwidth = 0.25, fill = "grey85", color = "grey30") +
  geom_line(color = "#2E86AB") +
  geom_vline(xintercept = 4.5, linetype = "dashed", color = "red", linewidth = 1) +
  geom_area(data = subset(ddf, x > 4.5), aes(x, d), fill = "red", alpha = 0.25) +
  labs(title = "One-Sample Right-Tailed Test: μ > 4.5",
       subtitle = paste0("p-value = ", formatC(ht$p.value, format = "e", digits = 2)),
       x = "Minutes", y = "Density") +
  theme_minimal(base_size = 13)
Right-tailed test: shaded region beyond the hypothesized mean (4.5).

Interpretation. Here p ≈ 5.3e-06, well below α = 0.05, so we reject H₀ and conclude that the mean exceeds 4.5 minutes.
Extension. Would a two-sided test be more appropriate during exploratory analysis?

Use Case A — One-Sample Mean (Engagement Time on New Section)

Scenario. Product wants to quantify average time (minutes) users spend on a new website section and report a 95% CI. Later, we’ll test if it’s more than 4.5 minutes (target).

# Simulate 100 users' session times (mins): bell-shaped with mild spread
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)

# One-sample t CI for the mean (unknown sigma)
tt <- t.test(time_spent, conf.level = 0.95)
tt$conf.int
## [1] 4.863925 5.407293
## attr(,"conf.level")
## [1] 0.95
mean_time <- mean(time_spent)
ci <- tt$conf.int

library(ggplot2)
ggplot(data.frame(time_spent), aes(x = time_spent)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
                 fill = "skyblue", color = "grey30") +
  geom_vline(xintercept = ci[1], linetype = "dashed", color = "red", linewidth = 1) +
  geom_vline(xintercept = ci[2], linetype = "dashed", color = "red", linewidth = 1) +
  geom_vline(xintercept = mean_time, color = "#2E86AB", linewidth = 1) +
  annotate("text", x = ci[1], y = 0.12, label = paste("Lower CI:", round(ci[1], 2)),
           hjust = 1.1, color = "red") +
  annotate("text", x = ci[2], y = 0.12, label = paste("Upper CI:", round(ci[2], 2)),
           hjust = -0.1, color = "red") +
  annotate("text", x = mean_time, y = 0.16, label = paste("Mean:", round(mean_time, 2)),
           hjust = 0, color = "#2E86AB") +
  labs(title = "Time on New Section: Mean and 95% CI",
       x = "Minutes", y = "Density") +
  theme_minimal(base_size = 13)

Interpretation
We are 95% confident that the true average time lies within the CI. This communicates uncertainty around the observed mean and supports stakeholder decisions with a range, not a single number.

Extension Questions
- What does the CI look like for n=50 or n=500 users?
- How does the CI change with increased variability (e.g., sd=2 vs sd=1)?
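
The extension questions can be explored without re-simulating data, because the half-width of a t-interval is t*·s/sqrt(n). The sketch below tabulates that half-width for a few assumed sample sizes and sample SDs; ci_half_width is a hypothetical helper, and treating s as fixed is a simplification (in practice s varies from sample to sample).

```r
# Half-width of a two-sided t-interval for a mean,
# given sample size n, sample SD s, and confidence level conf
ci_half_width <- function(n, s, conf = 0.95) {
  qt(1 - (1 - conf) / 2, df = n - 1) * s / sqrt(n)
}

# Illustrative grid of sample sizes and SDs
grid <- expand.grid(n = c(50, 100, 500), s = c(1, 1.5, 2))
grid$half_width <- ci_half_width(grid$n, grid$s)
grid
```

For a fixed n the half-width is proportional to s, and for a fixed s it shrinks roughly with 1/sqrt(n).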


4 Hypothesis Test (One-Sample Mean): Is Mean > 4.5?

Hypotheses.
H0: μ ≤ 4.5 vs. H1: μ > 4.5

ht <- t.test(time_spent, mu = 4.5, alternative = "greater")
ht
## 
##  One Sample t-test
## 
## data:  time_spent
## t = 4.6421, df = 99, p-value = 5.3e-06
## alternative hypothesis: true mean is greater than 4.5
## 95 percent confidence interval:
##  4.908264      Inf
## sample estimates:
## mean of x 
##  5.135609
dens <- density(time_spent)
ddf  <- tibble::tibble(x = dens$x, d = dens$y)

ggplot(ddf, aes(x, d)) +
  geom_histogram(data = data.frame(time_spent),
                 aes(x = time_spent, y = after_stat(density)),
                 binwidth = 0.25, fill = "grey80", color = "grey30") +
  geom_line(color = "#2E86AB") +
  geom_vline(xintercept = 4.5, linetype = "dashed", color = "red", linewidth = 1) +
  geom_area(data = subset(ddf, x > 4.5), aes(x, d), fill = "red", alpha = 0.25) +
  labs(title = "One-Sample Right-Tailed Test: μ > 4.5",
       subtitle = paste0("p-value = ", formatC(ht$p.value, format = "e", digits = 2)),
       x = "Minutes", y = "Density") +
  theme_minimal(base_size = 13)

Interpretation
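If the p-value < 0.05, we reject H0 and conclude average time is greater than 4.5 min. The shading marks the part of the sample distribution above 4.5; it is not the rejection region, which is defined on the test statistic (t > t* ≈ 1.66 for df = 99).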
Remember: p-value quantifies evidence against H0; it is not the probability H0 is true.

Extension Questions
- Would a two-sided test be more appropriate during exploration?
- Translate the uplift into business value (e.g., additional ads viewed).


5 CI Width vs Confidence Level

ci85 <- t.test(time_spent, conf.level = 0.85)$conf.int
ci90 <- t.test(time_spent, conf.level = 0.90)$conf.int
ci99 <- t.test(time_spent, conf.level = 0.99)$conf.int

conf_data <- tibble::tibble(
  level = c("85%", "90%", "99%"),
  mean  = mean_time,
  lo    = c(ci85[1], ci90[1], ci99[1]),
  hi    = c(ci85[2], ci90[2], ci99[2])
)

ggplot(conf_data, aes(level, mean, group = level)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.15, linewidth = 0.8) +
  labs(title = "Confidence Intervals at Different Confidence Levels",
       x = "Confidence Level", y = "Mean Time (minutes)") +
  theme_minimal(base_size = 13)


6 Use Case B — A/B Test on Conversion (Proportions)

Context.
In digital product experimentation, conversion rate optimization (CRO) is a central metric. A/B tests are commonly used to evaluate whether design or content changes lead to improved user actions. In this scenario, a company is testing a new homepage design (B) against the current version (A). Over the course of one week:

  • A: 8,100 visits, 405 purchases → 5.0% conversion
  • B: 7,900 visits, 474 purchases → 6.0% conversion

At first glance, version B appears to generate a +1.0 percentage point increase (a 20% relative lift). The crucial question is whether this difference is statistically significant or could simply be due to random variation in user behavior.

Confidence intervals and hypothesis testing help quantify this uncertainty. A CI communicates the plausible range of conversion rates for each version, while a hypothesis test assesses whether the difference between A and B is large enough to be unlikely under the assumption of no true effect.

This type of analysis is widely applied in marketing, e-commerce, and product analytics, guiding data-driven decisions on rollout, further experimentation, or design iteration.

A_n <- 8100; A_s <- 405
B_n <- 7900; B_s <- 474

ab <- tibble::tibble(
  design = c("A","B"),
  purchases = c(A_s, B_s),
  visits    = c(A_n, B_n),
  rate      = purchases / visits
)
ab
# prop.test returns a Wilson score CI (continuity-corrected by default), more reliable than the simple Wald CI
A_ci <- prop.test(A_s, A_n)$conf.int
B_ci <- prop.test(B_s, B_n)$conf.int

ab_ci <- ab %>%
  dplyr::mutate(lo = c(A_ci[1], B_ci[1]),
         hi = c(A_ci[2], B_ci[2]))

ggplot(ab_ci, aes(design, rate, fill = design)) +
  geom_col(width = 0.55) +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.12, linewidth = 0.8) +
  geom_text(aes(label = scales::percent(rate, accuracy = 0.01)),
            vjust = -0.7, fontface = "bold") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("A"="#2E86AB","B"="#F18F01")) +
  labs(title = "Conversion with 95% CIs",
       x = NULL, y = "Conversion Rate") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation
B appears +1.0 pp higher (20% relative lift). CIs show uncertainty; overlapping CIs do not automatically mean “no difference”—use the formal two-proportion test.
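
To see what "score-based versus Wald" means in practice, the sketch below computes the simple Wald interval for version A by hand and sets it beside the prop.test() interval. The names wald_ci and score_ci are illustrative; the counts are those given in the scenario.

```r
A_n <- 8100; A_s <- 405

p_hat <- A_s / A_n
se    <- sqrt(p_hat * (1 - p_hat) / A_n)          # Wald standard error

# Wald 95% CI: p_hat +/- z* x SE
wald_ci  <- p_hat + c(-1, 1) * qnorm(0.975) * se

# Score-based CI from prop.test (continuity-corrected by default)
score_ci <- as.numeric(prop.test(A_s, A_n)$conf.int)

round(rbind(wald = wald_ci, score = score_ci), 5)
```

With thousands of visits the two intervals nearly coincide; the difference matters more for small samples or proportions close to 0 or 1.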


7 Hypothesis Test: Two Proportions

Context.
While confidence intervals provide a range of plausible values for each group’s conversion rate, decision-making often requires a direct statistical test of whether the observed difference is likely to be due to chance. In an A/B test, this is done through a two-proportion hypothesis test.

Here, the null hypothesis (H0) states that the conversion rates for Homepage A and Homepage B are equal (pA = pB). The alternative hypothesis (H1) is that the conversion rates are different (pA ≠ pB). This test evaluates whether the 1.0 percentage point absolute lift observed for Design B is statistically significant or simply sampling noise.

In practice, this type of test is widely used in marketing and product optimization to determine if a new design, feature, or campaign truly outperforms the existing one. A small p-value provides evidence to reject the null hypothesis and conclude that the performance difference is real, while a large p-value suggests that the evidence is insufficient to claim improvement.

Hypotheses.
H0: pA = pB vs H1: pA ≠ pB

two_prop <- prop.test(x = c(A_s, B_s), n = c(A_n, B_n), correct = FALSE)
two_prop
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(A_s, B_s) out of c(A_n, B_n)
## X-squared = 7.703, df = 1, p-value = 0.005513
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.017067685 -0.002932315
## sample estimates:
## prop 1 prop 2 
##   0.05   0.06

Interpretation
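Here the p-value (≈ 0.0055) is small and the 95% CI excludes 0, so we have evidence that B improves conversion. Note that prop.test reports the CI for pA − pB (about −1.7 to −0.3 percentage points); for pB − pA, flip the signs (about +0.3 to +1.7 points). Report both effect size and significance.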
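
The uncorrected two-proportion test is equivalent to a pooled z-test, which can be sketched by hand. The code below uses the scenario's counts; the variable names are illustrative, and the interval is built for pB − pA with the unpooled standard error.

```r
A_n <- 8100; A_s <- 405
B_n <- 7900; B_s <- 474

p_A    <- A_s / A_n
p_B    <- B_s / B_n
p_pool <- (A_s + B_s) / (A_n + B_n)        # pooled rate under H0: pA = pB

# Test statistic uses the pooled SE (H0 assumes a common rate)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / A_n + 1 / B_n))
z       <- (p_B - p_A) / se_pool
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided

# CI for pB - pA uses the unpooled SE
se_diff <- sqrt(p_A * (1 - p_A) / A_n + p_B * (1 - p_B) / B_n)
ci_diff <- (p_B - p_A) + c(-1, 1) * qnorm(0.975) * se_diff

c(z = z, z_squared = z^2, p_value = p_value)
ci_diff
```

z² matches the X-squared statistic from prop.test(..., correct = FALSE), and ci_diff is that test's interval with the signs flipped.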

8 Use Case C — One-Population Variance (Chi-Square CI)

Scenario. Ops monitors turnaround time (TAT, minutes) for a lab’s Day shift and wants a CI for process variability (σ, σ²). While means are useful for summarizing central tendency, in operations management and quality control, the spread of the data (variance and standard deviation) is often just as critical. For example, in a clinical laboratory, even if the average turnaround time (TAT) for tests meets service expectations, high variability could mean that some samples take excessively long, causing SLA violations and workflow bottlenecks.

To assess the reliability of the process, we can construct a confidence interval for the variance (σ²) and standard deviation (σ) using the Chi-square distribution. This allows the ops team to quantify uncertainty around process variability and evaluate whether improvements are needed.

In practice, such variance confidence intervals are used in manufacturing quality control, healthcare operations, and IT system latency monitoring, where reducing variability is as important as improving the mean. A wide upper bound on the CI signals that although the average may look fine, the system occasionally produces outliers—something stakeholders must address to maintain consistency and reliability.

# Simulate a stable Day-shift process (no seed is set in this chunk, so the output depends on the RNG state left by earlier chunks)
n_day <- 420
day_minutes <- rnorm(n_day, mean = 56, sd = 11)

n   <- length(day_minutes)
s2  <- var(day_minutes)
alpha <- 0.05

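# Chi-square quantiles (df = n-1); the upper quantile (qL) gives the lower CI bound and the lower quantile (qU) gives the upper bound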
qL <- qchisq(1 - alpha/2, df = n - 1)
qU <- qchisq(alpha/2,     df = n - 1)

ci_var <- c((n - 1)*s2/qL, (n - 1)*s2/qU)
ci_sd  <- sqrt(ci_var)
list(variance_CI = ci_var, sd_CI = ci_sd)
## $variance_CI
## [1] 102.0647 133.8538
## 
## $sd_CI
## [1] 10.10271 11.56952

Interpretation
With ~95% confidence, the true variance lies between about 102 and 134 min² and the true SD between about 10.1 and 11.6 minutes. Even with a satisfactory mean, a large upper SD bound means some samples may run well past the SLA, which makes variance reduction a sensible target.

Extension Questions
- Compare SD CIs for different shifts.
- How sensitive is this interval to non-normality?
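
The sensitivity question can be probed with a small coverage simulation. The sketch below compares normal data with exponential data of the same variance; the sample size, repetition count, seed, and the choice of an exponential distribution are all assumptions made for illustration, and var_ci and coverage are hypothetical helpers.

```r
set.seed(42)   # arbitrary seed, used only for reproducibility

# Chi-square CI for a variance (same formula as above)
var_ci <- function(x, alpha = 0.05) {
  n <- length(x)
  (n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)
}

# Share of simulated intervals that contain the true variance
coverage <- function(rgen, true_var, n = 100, reps = 2000) {
  mean(replicate(reps, {
    ci <- var_ci(rgen(n))
    ci[1] <= true_var && true_var <= ci[2]
  }))
}

# Both generators have variance 11^2 = 121
cov_normal <- coverage(function(n) rnorm(n, sd = 11),     true_var = 121)
cov_skewed <- coverage(function(n) rexp(n, rate = 1 / 11), true_var = 121)

c(normal = cov_normal, exponential = cov_skewed)
```

Coverage should be near 95% for normal data and clearly lower for the skewed data: unlike t-intervals for means, the chi-square variance interval does not become robust as n grows.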


9 Use Case D — Two Means (Independent): Average Session Time (Welch t-test)

Scenario. Compare mean time on Design A vs B (independent users).

set.seed(77)
A_time <- rnorm(200, mean = 5.0, sd = 1.4)
B_time <- rnorm(220, mean = 5.4, sd = 1.6)

t_ind <- t.test(B_time, A_time, var.equal = FALSE)  # Welch by default
t_ind
## 
##  Welch Two Sample t-test
## 
## data:  B_time and A_time
## t = 4.6961, df = 417.93, p-value = 3.603e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3985485 0.9723917
## sample estimates:
## mean of x mean of y 
##  5.582033  4.896563
sum_df <- tibble(
  design = c("A","B"),
  mean   = c(mean(A_time), mean(B_time)),
  sd     = c(sd(A_time), sd(B_time)),
  n      = c(length(A_time), length(B_time))
) %>%
  mutate(se = sd/sqrt(n),
         lo = mean - 1.96*se,
         hi = mean + 1.96*se)

ggplot(sum_df, aes(design, mean, fill = design)) +
  geom_col(width = 0.55) +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.12) +
  scale_fill_manual(values = c("A"="#2E86AB","B"="#F18F01")) +
  labs(title = "Average Session Time by Design",
       x = NULL, y = "Minutes") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation
Here p ≈ 3.6e-06 and the 95% CI for (μ_B − μ_A), about 0.40 to 0.97 minutes, excludes 0, so B likely increases mean session time. Assess practical as well as statistical significance.
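
For readers who want to see what the Welch test computes, the sketch below rebuilds the statistic and the Welch–Satterthwaite degrees of freedom by hand. It re-creates the simulated samples so it runs on its own; the intermediate names (v_A, v_B, df_welch) are illustrative.

```r
set.seed(77)
A_time <- rnorm(200, mean = 5.0, sd = 1.4)
B_time <- rnorm(220, mean = 5.4, sd = 1.6)

n_A <- length(A_time); n_B <- length(B_time)
v_A <- var(A_time) / n_A          # squared SE of mean(A_time)
v_B <- var(B_time) / n_B          # squared SE of mean(B_time)

se_diff <- sqrt(v_A + v_B)        # no pooling: variances may differ
t_stat  <- (mean(B_time) - mean(A_time)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_A + v_B)^2 / (v_A^2 / (n_A - 1) + v_B^2 / (n_B - 1))

p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)
c(t = t_stat, df = df_welch, p_value = p_value)
```

The non-integer df in the t.test() output comes from this approximation.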


10 Use Case E — Paired Means (Pre/Post)

Scenario. The same users completed a tutorial before and after a UI tweak; we compare completion times.

set.seed(88)
n <- 80
pre  <- rnorm(n,  mean = 12.0, sd = 3.0)
post <- pre - rnorm(n, mean = 1.2, sd = 1.5)  # expected improvement

t_pair <- t.test(post, pre, paired = TRUE)  # H0: mean(post - pre) = 0
t_pair
## 
##  Paired t-test
## 
## data:  post and pre
## t = -6.2797, df = 79, p-value = 1.715e-08
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -1.3709441 -0.7110319
## sample estimates:
## mean difference 
##       -1.040988
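df <- tibble(id = 1:n, pre = pre, post = post) %>%
  pivot_longer(cols = c(pre, post), names_to = "phase", values_to = "time") %>%
  mutate(phase = factor(phase, levels = c("pre", "post")))  # plot pre before post

ggplot(df, aes(phase, time)) +
  geom_line(aes(group = id), alpha = 0.3, color = "grey60") +  # one line per user
  # group = 1 pools all users so the summaries are per phase, not per user
  stat_summary(aes(group = 1), fun = mean, geom = "point", size = 3, color = "#2E86AB") +
  # mean_cl_normal requires the Hmisc package to be installed
  stat_summary(aes(group = 1), fun.data = mean_cl_normal, geom = "errorbar", width = 0.15) +
  labs(title = "Task Time: Pre vs Post (Paired Users)",
       x = NULL, y = "Minutes") +
  theme_minimal(base_size = 13)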

Interpretation
Paired design increases power by removing between-user variability. Here the 95% CI for the mean difference (post − pre), about −1.37 to −0.71 minutes, lies entirely below 0, so the tweak reduced completion time.
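
The power gain from pairing can be made concrete. A paired t-test is a one-sample t-test on the within-user differences, and the sketch below also runs an unpaired Welch test on the same data purely for contrast (that would be the wrong analysis for this design). The data are re-created so the code runs on its own; object names are illustrative.

```r
set.seed(88)
n    <- 80
pre  <- rnorm(n, mean = 12.0, sd = 3.0)
post <- pre - rnorm(n, mean = 1.2, sd = 1.5)

d <- post - pre                                  # within-user differences

t_diff     <- t.test(d, mu = 0)                  # one-sample test on differences
t_pair     <- t.test(post, pre, paired = TRUE)   # paired test (same result)
t_unpaired <- t.test(post, pre)                  # ignores pairing, for contrast only

width <- function(tt) diff(as.numeric(tt$conf.int))
c(paired_width = width(t_pair), unpaired_width = width(t_unpaired))
```

The paired interval is much narrower because between-user spread (SD 3 in the simulation) cancels out of the differences.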


11 Use Case F — >2 Means (ANOVA) + Nonparametric Alternative

Scenario. Compare average checkout duration across Desktop, Tablet, Mobile.

set.seed(101)
desk  <- rnorm(120, mean = 90, sd = 15)
tab   <- rnorm(115, mean = 95, sd = 16)
mob   <- rnorm(130, mean = 100, sd = 18)

checkout <- tibble(
  secs  = c(desk, tab, mob),
  group = rep(c("Desktop","Tablet","Mobile"),
              times = c(length(desk), length(tab), length(mob)))
)

anova_model <- aov(secs ~ group, data = checkout)
summary(anova_model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## group         2   4448  2224.1   9.213 0.000125 ***
## Residuals   362  87393   241.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(checkout, aes(group, secs, fill = group)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  stat_summary(fun = mean, geom = "point", size = 3, color = "black") +
  stat_summary(fun = mean, fun.min = function(z) mean(z) - sd(z)/sqrt(length(z)),
               fun.max = function(z) mean(z) + sd(z)/sqrt(length(z)),
               geom = "errorbar", width = 0.12, color = "black") +
  scale_fill_manual(values = c("#2E86AB","#F18F01","#7CB518")) +
  labs(title = "Checkout Duration by Platform", x = NULL, y = "Seconds") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Interpretation
A significant ANOVA suggests at least one mean differs. Follow with Tukey HSD or pairwise tests (with multiplicity control).
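
Before turning to the pairwise comparisons, it can help to see where the F value comes from. The sketch below rebuilds it from between-group and within-group sums of squares in base R; it re-creates the simulated data (as plain vectors rather than a tibble) so it is self-contained, and the variable names are illustrative.

```r
set.seed(101)
desk <- rnorm(120, mean = 90,  sd = 15)
tab  <- rnorm(115, mean = 95,  sd = 16)
mob  <- rnorm(130, mean = 100, sd = 18)

secs  <- c(desk, tab, mob)
group <- factor(rep(c("Desktop", "Tablet", "Mobile"),
                    times = c(length(desk), length(tab), length(mob))))

k <- nlevels(group)                 # number of groups
N <- length(secs)                   # total observations
group_means <- tapply(secs, group, mean)
group_sizes <- tapply(secs, group, length)

# Between-group: spread of group means around the grand mean
ss_between <- sum(group_sizes * (group_means - mean(secs))^2)
# Within-group: spread of observations around their own group mean
ss_within  <- sum((secs - ave(secs, group))^2)

f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
c(F = f_stat, p_value = p_value)
```

A large F means the group means differ by more than the within-group noise would explain.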

TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = secs ~ group, data = checkout)
## 
## $group
##                     diff        lwr       upr     p adj
## Mobile-Desktop  8.319406  3.6904268 12.948386 0.0000879
## Tablet-Desktop  5.607068  0.8353814 10.378755 0.0164273
## Tablet-Mobile  -2.712338 -7.3933525  1.968677 0.3611745

Nonparametric alternative (if normality/homoscedasticity fail):

kruskal.test(secs ~ group, data = checkout)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  secs by group
## Kruskal-Wallis chi-squared = 15.951, df = 2, p-value = 0.0003438

12 Use Case G — Categorical Association (Chi-Square & Fisher)

Scenario. Device × Purchase association.

# Illustrative contingency table (hard-coded counts)
tab <- matrix(c(120, 30,   # Desktop: purchased / not
                90,  35,   # Tablet
                140, 80),  # Mobile
              nrow = 3, byrow = TRUE)
dimnames(tab) <- list(Device = c("Desktop","Tablet","Mobile"),
                      Purchase = c("Yes","No"))

tab
##          Purchase
## Device    Yes No
##   Desktop 120 30
##   Tablet   90 35
##   Mobile  140 80
chisq.test(tab)   # Large counts → chi-square OK
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 11.665, df = 2, p-value = 0.00293

If any expected cell count is small (a common rule of thumb is below 5), use Fisher’s exact test instead; a 2×2 example:

# Example 2x2
small_tab <- matrix(c(12, 3, 4, 11), nrow = 2, byrow = TRUE)
fisher.test(small_tab)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  small_tab
## p-value = 0.009221
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.589058 87.245326
## sample estimates:
## odds ratio 
##   9.952827

Interpretation
A small p-value suggests device type and purchase are not independent (association exists). Inspect standardized residuals for the pattern.
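
Inspecting the residuals takes only a few lines, since chisq.test() returns expected counts and standardized residuals alongside the test. The sketch below rebuilds the table so it runs on its own; treating |residual| > 2 as noteworthy is a common rule of thumb, not a formal test.

```r
tab <- matrix(c(120, 30,
                 90, 35,
                140, 80),
              nrow = 3, byrow = TRUE,
              dimnames = list(Device   = c("Desktop", "Tablet", "Mobile"),
                              Purchase = c("Yes", "No")))

res <- chisq.test(tab)

expected <- res$expected   # counts expected under independence
std_res  <- res$stdres     # standardized residuals, roughly N(0, 1) under H0

round(expected, 1)
round(std_res, 2)
```

Desktop shows more purchases than independence predicts and Mobile fewer, while Tablet sits close to expectation.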


13 Assumption Checks & Alternatives

  • Normality (for means): Shapiro–Wilk on residuals; Q-Q plots.
  • Equal variances: bartlett.test() (normal), leveneTest() (robust; needs car).
shapiro.test(residuals(anova_model))  # normality of residuals
bartlett.test(secs ~ group, data = checkout)  # homoscedasticity
# car::leveneTest(secs ~ group, data = checkout) # robust alternative

Nonparametric tests (no normality needed):

  • Two independent groups: Mann–Whitney U, wilcox.test(x, y, paired = FALSE)
  • Paired groups: Wilcoxon signed-rank, wilcox.test(x, y, paired = TRUE)
  • >2 groups: Kruskal–Wallis, kruskal.test(y ~ group)
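
As one worked instance of the list above, the sketch below applies the Mann–Whitney test to the two independent session-time samples from Use Case D, re-created here so the code is self-contained. Requesting conf.int = TRUE is an optional extra that also returns an estimate of the location shift.

```r
set.seed(77)
A_time <- rnorm(200, mean = 5.0, sd = 1.4)
B_time <- rnorm(220, mean = 5.4, sd = 1.6)

# Rank-based comparison of two independent groups
mw <- wilcox.test(B_time, A_time, paired = FALSE, conf.int = TRUE)

mw$p.value     # two-sided p-value
mw$estimate    # estimated location shift (B minus A)
mw$conf.int    # 95% CI for that shift
```

On roughly normal data like these, the rank-based test usually agrees with the t-test; its value shows up with heavy tails or outliers.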


14 Summary: CI ↔︎ Hypothesis Testing

  • CIs quantify magnitude + uncertainty (effect size with a range).
  • Tests quantify evidence (p-values) against H0.
  • Report both: they tell complementary stories needed for sound, data-driven decisions.

Practice prompts

1. Convert one of the above scenarios into a power question: What n is needed to detect a practical effect at 80% power?
2. Re-run the A/B example with sequential monitoring—how does peeking affect Type I error?
3. For the ANOVA dataset, simulate non-normal errors and compare ANOVA vs Kruskal–Wallis outcomes.
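
As a possible starting point for the first prompt, base R's power.t.test() solves for the per-group sample size. The effect size (0.4 minutes) and SD (1.5) below are assumptions loosely based on the Use Case D simulation, not recommended values.

```r
# Solve for n per group in a two-sample, two-sided t-test
pw <- power.t.test(delta = 0.4,        # assumed smallest effect worth detecting (minutes)
                   sd = 1.5,           # assumed common SD (minutes)
                   sig.level = 0.05,
                   power = 0.80,
                   type = "two.sample",
                   alternative = "two.sided")

n_per_group <- ceiling(pw$n)           # round up to whole users
n_per_group
```

Halving delta roughly quadruples the required n, which is why agreeing on a practical effect size comes before collecting data.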