This workbook is a hands-on companion to Confidence Interval (CI) estimation and Hypothesis Testing. You’ll see how to:
- compute CIs for means, proportions, differences of means/proportions, and variances;
- perform one-sample and two-sample (independent/paired) tests, ANOVA, and chi-square tests;
- interpret results with realistic scenarios and publication-quality visuals.
We interleave explanation, commented code, plots, interpretations, extension questions, and real-life context.
Confidence intervals (CIs) combine sample information with probability theory to quantify uncertainty about a population parameter. Below are the key building blocks:
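- Point estimate: mean(x) for a mean, sum(success)/n for a proportion. The point estimate is the center of the CI.
- Standard error (SE): if sd(x) = 1.5 and n = 100, then SE = 1.5/√100 = 0.15. The SE shrinks as sample size increases, yielding tighter CIs.
- Margin of error: if SE = 0.15 and t* ≈ 1.984, then the margin is 1.984 × 0.15 ≈ 0.30, so the 95% CI is x̄ ± 0.30.
Quick Illustration (95% vs 99% CI):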
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
ci_95 <- t.test(x, conf.level = 0.95)$conf.int
ci_99 <- t.test(x, conf.level = 0.99)$conf.int
data.frame(
CI_Level = c("95%", "99%"),
Lower = c(ci_95[1], ci_99[1]),
Upper = c(ci_95[2], ci_99[2])
)
library(tidyverse)
library(scales)
# Optional utilities (uncomment to use)
# install.packages("car")
# library(car)
Context. Product Analytics needs to report the average time-on-section (minutes) for a new feature to assess baseline engagement. A 95% CI helps communicate precision to stakeholders.
Compare the t.test() CI to a manual CI (x̄ ± t*·SE).
We’ll simulate a realistic sample of 100 user sessions from the feature’s new section.
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5) # synthetic but realistic
tt <- t.test(time_spent, conf.level = 0.95) # t-based CI for mean (σ unknown)
tt$conf.int
## [1] 4.863925 5.407293
## attr(,"conf.level")
## [1] 0.95
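The manual interval x̄ ± t*·SE can be sketched as follows. This is a minimal illustration that re-creates the same simulated sample so it runs on its own; the names `xbar`, `se`, `t_star`, and `manual_ci` are introduced here for illustration only.

```r
# Manual t-based 95% CI for the mean, assuming the same simulated sample
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)

n      <- length(time_spent)
xbar   <- mean(time_spent)            # point estimate
se     <- sd(time_spent) / sqrt(n)    # standard error of the mean
t_star <- qt(0.975, df = n - 1)       # two-sided 95% critical value

manual_ci <- c(lower = xbar - t_star * se, upper = xbar + t_star * se)
manual_ci

# Should agree with t.test() up to floating-point error
all.equal(unname(manual_ci), as.numeric(t.test(time_spent)$conf.int))
```

If the two intervals agree, the only ingredients of the t-based CI are the sample mean, the sample SD, n, and the t quantile.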
mean_time <- mean(time_spent)
ci <- tt$conf.int
ggplot(data.frame(time_spent), aes(x = time_spent)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.25, fill = "skyblue", color = "grey30") +
geom_vline(xintercept = ci[1], linetype = "dashed", color = "red", linewidth = 1) +
geom_vline(xintercept = ci[2], linetype = "dashed", color = "red", linewidth = 1) +
geom_vline(xintercept = mean_time, color = "#2E86AB", linewidth = 1) +
annotate("text", x = ci[1], y = 0.12, label = paste("Lower CI:", round(ci[1], 2)),
hjust = 1.1, color = "red") +
annotate("text", x = ci[2], y = 0.12, label = paste("Upper CI:", round(ci[2], 2)),
hjust = -0.1, color = "red") +
annotate("text", x = mean_time, y = 0.16, label = paste("Mean:", round(mean_time, 2)),
hjust = 0, color = "#2E86AB") +
labs(title = "Time on New Section: Mean and 95% CI",
x = "Minutes", y = "Density") +
theme_minimal(base_size = 13)
Histogram with mean and 95% CI overlaid.
Interpretation. The 95% CI (about 4.86 to 5.41 minutes here) gives a plausible
range for the true average time across all users.
Extension. How does the CI change for
n=50 vs n=500?
Context. The Growth team has set a KPI: average time-on-section ≥ 4.5 minutes. We want to test whether the observed mean engagement time exceeds 4.5 minutes. This is a one-sample, one-sided (right-tailed) t-test.
We test at α = 0.05 (5% significance level).
We continue with the engagement dataset from Use Case 1.
ht <- t.test(time_spent, mu = 4.5, alternative = "greater")
ht
##
## One Sample t-test
##
## data: time_spent
## t = 4.6421, df = 99, p-value = 5.3e-06
## alternative hypothesis: true mean is greater than 4.5
## 95 percent confidence interval:
## 4.908264 Inf
## sample estimates:
## mean of x
## 5.135609
dens <- density(time_spent)
ddf <- tibble(x = dens$x, d = dens$y)
ggplot(ddf, aes(x, d)) +
geom_histogram(data = data.frame(time_spent),
aes(x = time_spent, y = after_stat(density)),
binwidth = 0.25, fill = "grey85", color = "grey30") +
geom_line(color = "#2E86AB") +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "red", linewidth = 1) +
geom_area(data = subset(ddf, x > 4.5), aes(x, d), fill = "red", alpha = 0.25) +
labs(title = "One-Sample Right-Tailed Test: μ > 4.5",
subtitle = paste0("p-value = ", formatC(ht$p.value, format = "e", digits = 2)),
x = "Minutes", y = "Density") +
theme_minimal(base_size = 13)
Right-tailed test: shaded region beyond the hypothesized mean (4.5).
Interpretation. Here p ≈ 5.3e-06, well below α = 0.05, so we reject H0:
the data provide strong evidence that the mean exceeds 4.5 minutes.
Extension. Would a two-sided test be
more appropriate during exploratory analysis?
# Use Case A — One-Sample Mean (Engagement Time on New Section)
Scenario. Product wants to quantify average time (minutes) users spend on a new website section and report a 95% CI. Later, we’ll test if it’s more than 4.5 minutes (target).
# Simulate 100 users' session times (mins): bell-shaped with mild spread
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)
# One-sample t CI for the mean (unknown sigma)
tt <- t.test(time_spent, conf.level = 0.95)
tt$conf.int
## [1] 4.863925 5.407293
## attr(,"conf.level")
## [1] 0.95
mean_time <- mean(time_spent)
ci <- tt$conf.int
library(ggplot2)
ggplot(data.frame(time_spent), aes(x = time_spent)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.25,
fill = "skyblue", color = "grey30") +
geom_vline(xintercept = ci[1], linetype = "dashed", color = "red", linewidth = 1) +
geom_vline(xintercept = ci[2], linetype = "dashed", color = "red", linewidth = 1) +
geom_vline(xintercept = mean_time, color = "#2E86AB", linewidth = 1) +
annotate("text", x = ci[1], y = 0.12, label = paste("Lower CI:", round(ci[1], 2)),
hjust = 1.1, color = "red") +
annotate("text", x = ci[2], y = 0.12, label = paste("Upper CI:", round(ci[2], 2)),
hjust = -0.1, color = "red") +
annotate("text", x = mean_time, y = 0.16, label = paste("Mean:", round(mean_time, 2)),
hjust = 0, color = "#2E86AB") +
labs(title = "Time on New Section: Mean and 95% CI",
x = "Minutes", y = "Density") +
theme_minimal(base_size = 13)
Interpretation
We are 95% confident that the true average time lies within the CI. This
communicates uncertainty around the observed mean and supports
stakeholder decisions with a range, not a single number.
Extension Questions
- What does the CI look like for n=50 or n=500 users?
- How does the CI change with increased variability (e.g., sd=2 vs
sd=1)?
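As a starting point for these questions, the expected margin of error t*·sd/√n can be tabulated over a grid of sample sizes and spreads. This is a sketch that treats sd as a known planning value (an assumption; in a real sample the SD is estimated and varies from draw to draw). The helper `moe()` is hypothetical and not part of any package.

```r
# Expected 95% margin of error for a mean, given n and a planning value for sd
moe <- function(n, sd, conf = 0.95) {
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)  # two-sided critical value
  t_star * sd / sqrt(n)
}

# All combinations of the sample sizes and SDs named in the questions
grid <- expand.grid(n = c(50, 100, 500), sd = c(1, 1.5, 2))
grid$margin <- mapply(moe, grid$n, grid$sd)
grid
```

The margin scales linearly with sd and roughly with 1/√n, so halving the interval width takes about four times as many users.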
Hypotheses.
H0: μ ≤ 4.5 vs. H1: μ > 4.5
ht <- t.test(time_spent, mu = 4.5, alternative = "greater")
ht
##
## One Sample t-test
##
## data: time_spent
## t = 4.6421, df = 99, p-value = 5.3e-06
## alternative hypothesis: true mean is greater than 4.5
## 95 percent confidence interval:
## 4.908264 Inf
## sample estimates:
## mean of x
## 5.135609
dens <- density(time_spent)
ddf <- tibble::tibble(x = dens$x, d = dens$y)
ggplot(ddf, aes(x, d)) +
geom_histogram(data = data.frame(time_spent),
aes(x = time_spent, y = after_stat(density)),
binwidth = 0.25, fill = "grey80", color = "grey30") +
geom_line(color = "#2E86AB") +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "red", linewidth = 1) +
geom_area(data = subset(ddf, x > 4.5), aes(x, d), fill = "red", alpha = 0.25) +
labs(title = "One-Sample Right-Tailed Test: μ > 4.5",
subtitle = paste0("p-value = ", formatC(ht$p.value, format = "e", digits = 2)),
x = "Minutes", y = "Density") +
theme_minimal(base_size = 13)
Interpretation
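If the p-value < 0.05, we reject H0 and conclude that the average time is
greater than 4.5 min; here p ≈ 5.3e-06, so the evidence is strong. Note that
the red shading marks the part of the sample’s estimated density above 4.5;
it is not the rejection region, which is defined on the null distribution of
the t statistic. Remember: the p-value quantifies evidence against H0; it is
not the probability that H0 is true.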
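To make the link between the test statistic, the rejection region, and the p-value explicit, the right-tailed test can be reproduced by hand. This is a minimal sketch that re-creates the same simulated sample; `mu0`, `t_stat`, `t_crit`, and `p_val` are names introduced here for illustration.

```r
# One-sample right-tailed t-test by hand (H0: mu <= 4.5 vs H1: mu > 4.5)
set.seed(123)
time_spent <- rnorm(n = 100, mean = 5, sd = 1.5)

mu0    <- 4.5                                   # hypothesized mean (KPI target)
n      <- length(time_spent)
se     <- sd(time_spent) / sqrt(n)              # standard error of the mean
t_stat <- (mean(time_spent) - mu0) / se         # observed t statistic
t_crit <- qt(0.95, df = n - 1)                  # reject H0 when t_stat > t_crit
p_val  <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # right-tail area

c(t = t_stat, t_crit = t_crit, p = p_val)
```

The rejection region is the set of t values above `t_crit`, and the p-value is the area of the t distribution (99 df) to the right of the observed statistic.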
Extension Questions
- Would a two-sided test be more appropriate during exploration?
- Translate the uplift into business value (e.g., additional ads
viewed).
ci85 <- t.test(time_spent, conf.level = 0.85)$conf.int
ci90 <- t.test(time_spent, conf.level = 0.90)$conf.int
ci99 <- t.test(time_spent, conf.level = 0.99)$conf.int
conf_data <- tibble::tibble(
level = c("85%", "90%", "99%"),
mean = mean_time,
lo = c(ci85[1], ci90[1], ci99[1]),
hi = c(ci85[2], ci90[2], ci99[2])
)
ggplot(conf_data, aes(level, mean, group = level)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.15, linewidth = 0.8) +
labs(title = "Confidence Intervals at Different Confidence Levels",
x = "Confidence Level", y = "Mean Time (minutes)") +
theme_minimal(base_size = 13)
Context.
In digital product experimentation, conversion rate is a central metric,
and conversion rate optimization (CRO) relies on A/B tests to
evaluate whether design or content changes lead to improved user
actions. In this scenario, a company is testing a new homepage
design (B) against the current version (A).
Over the course of one week:
- Homepage A: 8,100 visits, 405 purchases (5.0% conversion)
- Homepage B: 7,900 visits, 474 purchases (6.0% conversion)
At first glance, version B appears to generate a +1.0 percentage point increase (a 20% relative lift). The crucial question is whether this difference is statistically significant or could simply be due to random variation in user behavior.
Confidence intervals and hypothesis testing help quantify this uncertainty. A CI communicates the plausible range of conversion rates for each version, while a hypothesis test assesses whether the difference between A and B is large enough to be unlikely under the assumption of no true effect.
This type of analysis is widely applied in marketing, e-commerce, and product analytics, guiding data-driven decisions on rollout, further experimentation, or design iteration.
A_n <- 8100; A_s <- 405
B_n <- 7900; B_s <- 474
ab <- tibble::tibble(
design = c("A","B"),
purchases = c(A_s, B_s),
visits = c(A_n, B_n),
rate = purchases / visits
)
ab
# For one sample, prop.test() returns a Wilson score CI (continuity-corrected
# by default), which is more reliable than the simple Wald interval
A_ci <- prop.test(A_s, A_n)$conf.int
B_ci <- prop.test(B_s, B_n)$conf.int
ab_ci <- ab %>%
dplyr::mutate(lo = c(A_ci[1], B_ci[1]),
hi = c(A_ci[2], B_ci[2]))
ggplot(ab_ci, aes(design, rate, fill = design)) +
geom_col(width = 0.55) +
geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.12, linewidth = 0.8) +
geom_text(aes(label = scales::percent(rate, accuracy = 0.01)),
vjust = -0.7, fontface = "bold") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_manual(values = c("A"="#2E86AB","B"="#F18F01")) +
labs(title = "Conversion with 95% CIs",
x = NULL, y = "Conversion Rate") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
Interpretation
B appears +1.0 pp higher (20% relative lift). CIs show uncertainty;
overlapping CIs do not automatically mean “no difference”—use the formal
two-proportion test.
Context.
While confidence intervals provide a range of plausible values for each
group’s conversion rate, decision-making often requires a direct
statistical test of whether the observed difference is likely to be due
to chance. In an A/B test, this is done through a
two-proportion hypothesis test.
Here, the null hypothesis (H0) states that the conversion rates for Homepage A and Homepage B are equal (pA = pB). The alternative hypothesis (H1) is that the conversion rates are different (pA ≠ pB). This test evaluates whether the 1 percentage point absolute lift observed for Design B is statistically significant or simply sampling noise.
In practice, this type of test is widely used in marketing and product optimization to determine if a new design, feature, or campaign truly outperforms the existing one. A small p-value provides evidence to reject the null hypothesis and conclude that the performance difference is real, while a large p-value suggests that the evidence is insufficient to claim improvement.
Hypotheses.
H0: pA = pB vs H1: pA ≠ pB
two_prop <- prop.test(x = c(A_s, B_s), n = c(A_n, B_n), correct = FALSE)
two_prop
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(A_s, B_s) out of c(A_n, B_n)
## X-squared = 7.703, df = 1, p-value = 0.005513
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.017067685 -0.002932315
## sample estimates:
## prop 1 prop 2
## 0.05 0.06
Interpretation
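The p-value (0.0055) is below 0.05. R reports the CI for prop 1 − prop 2,
that is pA − pB: (−0.0171, −0.0029). Equivalently, pB − pA lies between about
0.29 and 1.71 percentage points, which excludes 0, so we have evidence that B
improves conversion. Report both effect size and significance.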
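The two-proportion test can also be computed by hand, which shows where the X-squared statistic and the CI come from. This is a sketch using the counts from the scenario; the variable names (`p_pool`, `se_pool`, `se_diff`, `ci_diff`) are introduced here for illustration. It uses the pooled standard error for the test and the unpooled (Wald) standard error for the interval, which is my understanding of what `prop.test(..., correct = FALSE)` does for two groups.

```r
# Two-proportion z-test by hand (H0: pA = pB), no continuity correction
A_n <- 8100; A_s <- 405
B_n <- 7900; B_s <- 474

p_A <- A_s / A_n
p_B <- B_s / B_n

# Test statistic: pooled rate under H0
p_pool  <- (A_s + B_s) / (A_n + B_n)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / A_n + 1 / B_n))
z       <- (p_B - p_A) / se_pool
p_value <- 2 * pnorm(-abs(z))                 # two-sided p-value

# 95% CI for pB - pA: unpooled (Wald) standard error
se_diff <- sqrt(p_A * (1 - p_A) / A_n + p_B * (1 - p_B) / B_n)
ci_diff <- (p_B - p_A) + c(-1, 1) * qnorm(0.975) * se_diff

c(z_squared = z^2, p_value = p_value)
ci_diff
```

Here z² plays the role of the chi-square statistic with 1 degree of freedom, and `ci_diff` is the interval for pB − pA, the mirror image of the pA − pB interval printed by R.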
Scenario. Ops monitors turnaround time (TAT, minutes) for a lab’s Day shift and wants a CI for process variability (σ, σ²). While means are useful for summarizing central tendency, in operations management and quality control, the spread of the data (variance and standard deviation) is often just as critical. For example, in a clinical laboratory, even if the average turnaround time (TAT) for tests meets service expectations, high variability could mean that some samples take excessively long, causing SLA violations and workflow bottlenecks.
To assess the reliability of the process, we can construct a confidence interval for the variance (σ²) and standard deviation (σ) using the Chi-square distribution. This allows the ops team to quantify uncertainty around process variability and evaluate whether improvements are needed.
In practice, such variance confidence intervals are used in manufacturing quality control, healthcare operations, and IT system latency monitoring, where reducing variability is as important as improving the mean. A wide upper bound on the CI signals that although the average may look fine, the system occasionally produces outliers—something stakeholders must address to maintain consistency and reliability.
# Simulate a stable Day-shift process (no seed is set in this chunk, so results depend on the RNG state)
n_day <- 420
day_minutes <- rnorm(n_day, mean = 56, sd = 11)
n <- length(day_minutes)
s2 <- var(day_minutes)
alpha <- 0.05
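# Chi-square quantiles (df = n-1); dividing by the larger quantile gives the lower bound
qL <- qchisq(1 - alpha/2, df = n - 1) # upper quantile -> lower CI limit
qU <- qchisq(alpha/2, df = n - 1) # lower quantile -> upper CI limit
ci_var <- c((n - 1)*s2/qL, (n - 1)*s2/qU) # CI for the variance
ci_sd <- sqrt(ci_var) # CI for the SD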
list(variance_CI = ci_var, sd_CI = ci_sd)
## $variance_CI
## [1] 102.0647 133.8538
##
## $sd_CI
## [1] 10.10271 11.56952
Interpretation
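With ~95% confidence, the true variance lies in variance_CI (about 102 to
134 min²) and the true SD in sd_CI (about 10.1 to 11.6 min). Even with a
satisfactory mean, a large upper SD bound means some samples can take far
longer than average and risk SLA misses, so variance reduction becomes the target.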
Extension Questions
- Compare SD CIs for different shifts.
- How sensitive is this interval to non-normality?
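One way to explore the non-normality question is a small coverage simulation: draw many samples, build the chi-square interval each time, and count how often it contains the true variance. This is a sketch under stated assumptions: the seed, the number of replications, the sample size of 100, and the choice of an exponential distribution as the skewed alternative are all arbitrary illustration choices, and `var_ci_covers()` is a hypothetical helper.

```r
set.seed(2024)  # arbitrary seed, used only for reproducibility

# TRUE if the chi-square CI for the variance contains the true variance
var_ci_covers <- function(x, true_var, alpha = 0.05) {
  n  <- length(x)
  s2 <- var(x)
  lo <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)
  hi <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)
  lo <= true_var && true_var <= hi
}

reps <- 2000
n    <- 100

# Normal data with SD 11 (true variance 121)
cover_normal <- mean(replicate(reps, var_ci_covers(rnorm(n, 56, 11), 11^2)))

# Exponential data with mean 11 also has variance 121, but is strongly skewed
cover_skewed <- mean(replicate(reps, var_ci_covers(rexp(n, rate = 1 / 11), 11^2)))

c(normal = cover_normal, exponential = cover_skewed)
```

Coverage near 0.95 for the normal case and clearly lower for the skewed case would indicate that the chi-square interval leans heavily on the normality assumption, unlike t-based intervals for the mean.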
Scenario. Compare mean time on Design A vs B (independent users).
set.seed(77)
A_time <- rnorm(200, mean = 5.0, sd = 1.4)
B_time <- rnorm(220, mean = 5.4, sd = 1.6)
t_ind <- t.test(B_time, A_time, var.equal = FALSE) # Welch by default
t_ind
##
## Welch Two Sample t-test
##
## data: B_time and A_time
## t = 4.6961, df = 417.93, p-value = 3.603e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3985485 0.9723917
## sample estimates:
## mean of x mean of y
## 5.582033 4.896563
sum_df <- tibble(
design = c("A","B"),
mean = c(mean(A_time), mean(B_time)),
sd = c(sd(A_time), sd(B_time)),
n = c(length(A_time), length(B_time))
) %>%
mutate(se = sd/sqrt(n),
lo = mean - 1.96*se,
hi = mean + 1.96*se)
ggplot(sum_df, aes(design, mean, fill = design)) +
geom_col(width = 0.55) +
geom_errorbar(aes(ymin = lo, ymax = hi), width = 0.12) +
scale_fill_manual(values = c("A"="#2E86AB","B"="#F18F01")) +
labs(title = "Average Session Time by Design",
x = NULL, y = "Minutes") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
Interpretation
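If p-value < 0.05 and the CI for (μ_B − μ_A) excludes 0, B likely
increases mean session time. Here p ≈ 3.6e-06 and the 95% CI is about 0.40
to 0.97 minutes around an observed difference of roughly 0.69 minutes.
Assess practical as well as statistical significance.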
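For the practical-significance side, a standardized effect size is one common summary. The sketch below computes Cohen's d with a pooled SD on the same simulated data; Cohen's d is a technique added here for illustration, not something the workbook itself computes, and the names `sd_pooled` and `cohens_d` are hypothetical.

```r
# Cohen's d for B vs A, assuming the same simulated samples
set.seed(77)
A_time <- rnorm(200, mean = 5.0, sd = 1.4)
B_time <- rnorm(220, mean = 5.4, sd = 1.6)

n_A <- length(A_time)
n_B <- length(B_time)

# Pooled SD weights each group's variance by its degrees of freedom
sd_pooled <- sqrt(((n_A - 1) * var(A_time) + (n_B - 1) * var(B_time)) /
                    (n_A + n_B - 2))

cohens_d <- (mean(B_time) - mean(A_time)) / sd_pooled
cohens_d
```

A d value expresses the difference in units of within-group SD, which helps judge whether a statistically significant difference is large enough to matter.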
Scenario. Same users took a tutorial before and after a UI tweak; we compare time to complete.
set.seed(88)
n <- 80
pre <- rnorm(n, mean = 12.0, sd = 3.0)
post <- pre - rnorm(n, mean = 1.2, sd = 1.5) # expected improvement
t_pair <- t.test(post, pre, paired = TRUE) # H0: mean(post - pre) = 0
t_pair
##
## Paired t-test
##
## data: post and pre
## t = -6.2797, df = 79, p-value = 1.715e-08
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -1.3709441 -0.7110319
## sample estimates:
## mean difference
## -1.040988
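df <- tibble(id = 1:n, pre = pre, post = post) %>%
pivot_longer(cols = c(pre, post), names_to = "phase", values_to = "time") %>%
mutate(phase = factor(phase, levels = c("pre", "post"))) # show pre before post
# group = id applies only to the per-user lines; the summaries pool all users per phase
# mean_cl_normal needs the Hmisc package installed
ggplot(df, aes(phase, time)) +
geom_line(aes(group = id), alpha = 0.3, color = "grey60") +
stat_summary(fun = mean, geom = "point", size = 3, color = "#2E86AB") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.15) +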
labs(title = "Task Time: Pre vs Post (Paired Users)",
x = NULL, y = "Minutes") +
theme_minimal(base_size = 13)
Interpretation
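Paired design increases power by removing between-user variability. Here
the 95% CI for the mean difference (post − pre) is about −1.37 to −0.71
minutes, entirely below 0, so the tweak reduced completion time by roughly
one minute on average.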
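The power gain from pairing can be seen directly: a paired t-test is a one-sample t-test on the within-user differences, and ignoring the pairing gives a much wider interval. This is a minimal sketch on the same simulated data; the object names (`d`, `paired`, `one_samp`, `unpaired`) are introduced here for illustration.

```r
# Paired t-test vs one-sample test on differences vs unpaired (Welch) test
set.seed(88)
n    <- 80
pre  <- rnorm(n, mean = 12.0, sd = 3.0)
post <- pre - rnorm(n, mean = 1.2, sd = 1.5)

d <- post - pre                                  # within-user differences

paired   <- t.test(post, pre, paired = TRUE)
one_samp <- t.test(d, mu = 0)                    # same test, written differently
unpaired <- t.test(post, pre)                    # ignores the pairing

# The paired test and the one-sample test on d should match
all.equal(unname(paired$statistic), unname(one_samp$statistic))

# CI widths: pairing removes between-user variability
c(paired = diff(paired$conf.int), unpaired = diff(unpaired$conf.int))
```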
Scenario. Compare average checkout duration across Desktop, Tablet, Mobile.
set.seed(101)
desk <- rnorm(120, mean = 90, sd = 15)
tab <- rnorm(115, mean = 95, sd = 16)
mob <- rnorm(130, mean = 100, sd = 18)
checkout <- tibble(
secs = c(desk, tab, mob),
group = rep(c("Desktop","Tablet","Mobile"),
times = c(length(desk), length(tab), length(mob)))
)
anova_model <- aov(secs ~ group, data = checkout)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 4448 2224.1 9.213 0.000125 ***
## Residuals 362 87393 241.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(checkout, aes(group, secs, fill = group)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
stat_summary(fun = mean, geom = "point", size = 3, color = "black") +
stat_summary(fun = mean, fun.min = function(z) mean(z) - sd(z)/sqrt(length(z)),
fun.max = function(z) mean(z) + sd(z)/sqrt(length(z)),
geom = "errorbar", width = 0.12, color = "black") +
scale_fill_manual(values = c("#2E86AB","#F18F01","#7CB518")) +
labs(title = "Checkout Duration by Platform", x = NULL, y = "Seconds") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
Interpretation
The ANOVA is significant (F = 9.21 on 2 and 362 df, p ≈ 0.0001), which
suggests at least one platform mean differs. Follow with
Tukey HSD or pairwise tests (with multiplicity control).
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = secs ~ group, data = checkout)
##
## $group
## diff lwr upr p adj
## Mobile-Desktop 8.319406 3.6904268 12.948386 0.0000879
## Tablet-Desktop 5.607068 0.8353814 10.378755 0.0164273
## Tablet-Mobile -2.712338 -7.3933525 1.968677 0.3611745
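As an alternative to Tukey HSD, pairwise t-tests with a multiplicity adjustment can be run with `pairwise.t.test()` from base R's stats package. This is a sketch that re-creates the simulated checkout data in a plain data frame; the Holm adjustment is one reasonable choice among several, not the workbook's prescribed method.

```r
# Pairwise t-tests with Holm adjustment, assuming the same simulated data
set.seed(101)
desk <- rnorm(120, mean = 90,  sd = 15)
tab  <- rnorm(115, mean = 95,  sd = 16)
mob  <- rnorm(130, mean = 100, sd = 18)

checkout <- data.frame(
  secs  = c(desk, tab, mob),
  group = rep(c("Desktop", "Tablet", "Mobile"),
              times = c(length(desk), length(tab), length(mob)))
)

# Holm controls the family-wise error rate across the three comparisons
pw <- pairwise.t.test(checkout$secs, checkout$group, p.adjust.method = "holm")
pw$p.value   # lower-triangular matrix of adjusted p-values
```

By default `pairwise.t.test()` uses a pooled SD across all groups, so it shares ANOVA's equal-variance assumption; `pool.sd = FALSE` relaxes that.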
Nonparametric alternative (if normality/homoscedasticity fail):
kruskal.test(secs ~ group, data = checkout)
##
## Kruskal-Wallis rank sum test
##
## data: secs by group
## Kruskal-Wallis chi-squared = 15.951, df = 2, p-value = 0.0003438
Scenario. Device × Purchase association.
# Simulate a contingency table
tab <- matrix(c(120, 30, # Desktop: purchased / not
90, 35, # Tablet
140, 80), # Mobile
nrow = 3, byrow = TRUE)
dimnames(tab) <- list(Device = c("Desktop","Tablet","Mobile"),
Purchase = c("Yes","No"))
tab
## Purchase
## Device Yes No
## Desktop 120 30
## Tablet 90 35
## Mobile 140 80
chisq.test(tab) # Large counts → chi-square OK
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 11.665, df = 2, p-value = 0.00293
If expected cell counts are small (a common rule of thumb is below 5), use Fisher’s exact test instead (shown here for a 2×2 table):
# Example 2x2
small_tab <- matrix(c(12, 3, 4, 11), nrow = 2, byrow = TRUE)
fisher.test(small_tab)
##
## Fisher's Exact Test for Count Data
##
## data: small_tab
## p-value = 0.009221
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.589058 87.245326
## sample estimates:
## odds ratio
## 9.952827
Interpretation
A small p-value suggests device type and purchase are not
independent (association exists). Inspect standardized
residuals for the pattern.
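The expected counts and standardized residuals are both available on the object returned by `chisq.test()`. A minimal sketch on the same contingency table follows; the object name `res` is introduced here for illustration.

```r
# Expected counts and standardized residuals for the Device x Purchase table
tab <- matrix(c(120, 30,
                 90, 35,
                140, 80),
              nrow = 3, byrow = TRUE,
              dimnames = list(Device   = c("Desktop", "Tablet", "Mobile"),
                              Purchase = c("Yes", "No")))

res <- chisq.test(tab)

res$expected           # counts expected under independence
round(res$stdres, 2)   # standardized residuals; large |values| drive the association
```

Cells with standardized residuals beyond roughly ±2 are the ones that depart most from independence; a positive value means more observations than expected.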
Assumption checks: equal variances via bartlett.test() (assumes normality) or leveneTest() (robust; needs car).
shapiro.test(residuals(anova_model)) # normality of residuals
bartlett.test(secs ~ group, data = checkout) # homoscedasticity
# car::leveneTest(secs ~ group, data = checkout) # robust alternative
Nonparametric tests (no normality needed):
- Two independent groups: Mann–Whitney U → wilcox.test(x, y, paired = FALSE)
- Paired groups: Wilcoxon signed-rank → wilcox.test(x, y, paired = TRUE)
- >2 groups: Kruskal–Wallis → kruskal.test(y ~ group)
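As a concrete instance of the first item, the Mann–Whitney U test can be applied to the two-design session-time data. This is a sketch that re-creates those simulated samples; requesting `conf.int = TRUE` is an optional extra that also returns an estimate of the location shift.

```r
# Mann-Whitney U (Wilcoxon rank-sum) test for B vs A session times
set.seed(77)
A_time <- rnorm(200, mean = 5.0, sd = 1.4)
B_time <- rnorm(220, mean = 5.4, sd = 1.6)

mw <- wilcox.test(B_time, A_time, paired = FALSE, conf.int = TRUE)

mw$p.value    # two-sided p-value
mw$estimate   # estimated location shift (B minus A)
mw$conf.int   # CI for the location shift
```

On roughly normal data like this, the rank-based test and the Welch t-test usually lead to the same conclusion; the difference matters more with heavy tails or outliers.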
Practice prompts
1. Convert one of the above scenarios into a power question: What n is needed to detect a practical effect at 80% power?
2. Re-run the A/B example with sequential monitoring: how does peeking affect Type I error?
3. For the ANOVA dataset, simulate non-normal errors and compare ANOVA vs Kruskal–Wallis outcomes.
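A possible starting point for the first prompt uses `power.t.test()` from base R's stats package. This is a sketch with assumed planning values: a practical effect of 0.5 minutes above the 4.5-minute target and an SD of 1.5 minutes are hypothetical inputs chosen to mirror the engagement example, not values the workbook prescribes.

```r
# Sample size for a one-sample, right-tailed t-test at 80% power
pw <- power.t.test(
  delta       = 0.5,          # assumed practical effect (minutes above target)
  sd          = 1.5,          # assumed SD of session time
  sig.level   = 0.05,
  power       = 0.80,
  type        = "one.sample",
  alternative = "one.sided"
)

n_required <- ceiling(pw$n)   # round up to whole users
n_required
```

Halving the assumed effect roughly quadruples the required n, so the choice of practical effect size dominates the answer.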