A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.
A random sample of 64 users shows an average study time of 116 minutes.
Given:
H₀ (Null Hypothesis): μ = 120 minutes
The true average daily study time is 120 minutes as claimed.
H₁ (Alternative Hypothesis): μ ≠ 120 minutes
The true average daily study time differs from 120 minutes.
Type of Test: Two-tailed test
Significance Level: α = 0.05
Why One-Sample Z-Test?
We use the One-Sample Z-Test because we are testing a single sample mean against a claimed population value, the population standard deviation σ is known (15 minutes), and the sample is large (n = 64 ≥ 30).
Decision Rule: When σ is known and n ≥ 30, use the Z-test instead of the t-test.
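The code chunks below assume a setup chunk along these lines has been run; the package list is inferred from the functions used (kable, kable_styling, the pipe, ggplot, grid.arrange):

```r
# Assumed setup chunk (inferred from the functions used in this report)
library(knitr)      # kable()
library(kableExtra) # kable_styling(), row_spec(), add_header_above()
library(dplyr)      # %>%, filter(), group_by(), mutate()
library(ggplot2)    # all plots
library(gridExtra)  # grid.arrange()
```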
# Given data
mu_0 <- 120
sigma <- 15
n <- 64
x_bar <- 116
alpha <- 0.05
# Calculate Standard Error
se <- sigma / sqrt(n)
# Calculate Z-statistic
z_stat <- (x_bar - mu_0) / se
# Calculate p-value (two-tailed)
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)
# Critical values
z_critical <- qnorm(1 - alpha/2)
# Results
cat("Standard Error (SE):", round(se, 4), "\n")
Standard Error (SE): 1.875
cat("Z-Statistic:", round(z_stat, 4), "\n")
Z-Statistic: -2.1333
cat("P-Value:", round(p_value, 4), "\n")
P-Value: 0.0329
cat("Critical Z-values: ±", round(z_critical, 4), "\n")
Critical Z-values: ± 1.96
Z-Test Formula:

Z = (x̄ - μ₀) / (σ / √n)

Calculation:

SE = σ / √n = 15 / √64 = 15 / 8 = 1.875
Z = (116 - 120) / 1.875 = -4 / 1.875 = -2.1333
P-value = 2 × P(Z > |-2.1333|) = 0.0329
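As a complementary check (our addition, not part of the original output), a 95% confidence interval built from the same quantities excludes the claimed 120 minutes, which agrees with the rejection below:

```r
# 95% CI for the true mean; excluding 120 is consistent with rejecting H0
ci <- x_bar + c(-1, 1) * qnorm(1 - alpha/2) * se
round(ci, 2)  # 112.33 119.67
```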
results_df <- data.frame(
Statistic = c("Standard Error (SE)", "Z-Statistic", "P-Value",
"Critical Z (±)", "Decision"),
Value = c(round(se, 4), round(z_stat, 4), round(p_value, 4),
round(z_critical, 4),
ifelse(abs(z_stat) > z_critical, "Reject H₀", "Fail to Reject H₀"))
)
kable(results_df, caption = "Test Results Summary", align = 'lr') %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
row_spec(2, bold = TRUE, background = "#FFE4E6") %>%
row_spec(3, bold = TRUE, background = "#E0F2FE")
| Statistic | Value |
|---|---|
| Standard Error (SE) | 1.875 |
| Z-Statistic | -2.1333 |
| P-Value | 0.0329 |
| Critical Z (±) | 1.96 |
| Decision | Reject H₀ |
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)
df_plot <- data.frame(x = x, y = y)
ggplot(df_plot, aes(x = x, y = y)) +
geom_line(color = "#0EA5E9", size = 1.5) +
geom_area(data = df_plot %>% filter(x < -z_critical),
aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
geom_area(data = df_plot %>% filter(x > z_critical),
aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
geom_vline(xintercept = z_stat, color = "#8B1538",
linetype = "dashed", size = 1.5) +
geom_vline(xintercept = c(-z_critical, z_critical),
color = "#F97316", linetype = "dotted", size = 1.2) +
annotate("text", x = z_stat, y = max(y) * 0.8,
label = paste0("Z = ", round(z_stat, 3)),
color = "#8B1538", size = 5, fontface = "bold", hjust = 1.2) +
annotate("text", x = -2.5, y = 0.05,
label = "Rejection\nRegion",
color = "#DC2626", size = 4, fontface = "bold") +
annotate("text", x = 2.5, y = 0.05,
label = "Rejection\nRegion",
color = "#DC2626", size = 4, fontface = "bold") +
labs(title = "One-Sample Z-Test: Standard Normal Distribution",
subtitle = paste0("H₀: μ = ", mu_0, " | α = ", alpha, " | p-value = ", round(p_value, 4)),
x = "Z-Score", y = "Density") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", color = "#8B1538", hjust = 0.5),
plot.subtitle = element_text(size = 12, color = "#475569", hjust = 0.5),
panel.background = element_rect(fill = "#F0F9FF", color = NA)
)
Decision Rule: Reject H₀ if |Z| > 1.96 or if p-value < 0.05
Observed: |Z| = 2.1333 > 1.96 and p-value = 0.0329 < 0.05.
Conclusion: We REJECT H₀ at α = 0.05 significance level.
There is sufficient statistical evidence to conclude that the true average daily study time differs from the claimed 120 minutes.
Current Situation: The sample mean of 116 minutes sits 4 minutes below the claimed 120 minutes, and this shortfall is statistically significant.
Business Implications: The platform's "120 minutes per day" claim is not supported by the data and may need to be revised or re-measured.
Practical Significance: While statistical significance is important, consider whether a 4-minute difference has meaningful business impact in your context.
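To quantify that point (our supplement, not part of the original analysis), the standardized effect size is small:

```r
# Standardized effect size for the one-sample test (sigma known)
d <- (x_bar - mu_0) / sigma
round(d, 3)  # -0.267: small by Cohen's conventions, despite p < 0.05
```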
A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.
Data Collected:
n_A <- 25
mean_A <- 4.8
sd_A <- 1.2
n_B <- 25
mean_B <- 5.4
sd_B <- 1.4
alpha <- 0.05
ab_data <- data.frame(
Version = c("Version A", "Version B"),
`Sample Size (n)` = c(n_A, n_B),
`Mean (minutes)` = c(mean_A, mean_B),
`Standard Deviation` = c(sd_A, sd_B),
check.names = FALSE
)
kable(ab_data, caption = "A/B Test Data Summary", align = 'lrrr') %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
row_spec(1, background = "#DBEAFE") %>%
row_spec(2, background = "#FFE4E6")
| Version | Sample Size (n) | Mean (minutes) | Standard Deviation |
|---|---|---|---|
| Version A | 25 | 4.8 | 1.2 |
| Version B | 25 | 5.4 | 1.4 |
H₀ (Null Hypothesis): μ_A = μ_B
There is no difference in average session duration between Version A and Version B.
H₁ (Alternative Hypothesis): μ_A ≠ μ_B
There is a significant difference in average session duration between the two versions.
Type of Test: Two-sample independent t-test (two-tailed)
Significance Level: α = 0.05
Why Two-Sample Independent T-Test?
We use the Two-Sample Independent T-Test because we are comparing the means of two independent groups, the population standard deviations are unknown (only sample standard deviations are available), and both samples are small (n = 25 each).
Test Type: Pooled variance t-test (assuming equal population variances)
# Pooled standard deviation
sp <- sqrt(((n_A - 1) * sd_A^2 + (n_B - 1) * sd_B^2) / (n_A + n_B - 2))
# Standard error of difference
se_diff <- sp * sqrt(1/n_A + 1/n_B)
# T-statistic
t_stat <- (mean_B - mean_A) / se_diff
# Degrees of freedom
df <- n_A + n_B - 2
# P-value (two-tailed)
p_value_t <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
# Critical value
t_critical <- qt(1 - alpha/2, df)
# Effect size (Cohen's d)
cohens_d <- (mean_B - mean_A) / sp
cat("Pooled Standard Deviation (Sp):", round(sp, 4), "\n")
Pooled Standard Deviation (Sp): 1.3038
cat("Standard Error (SE):", round(se_diff, 4), "\n")
Standard Error (SE): 0.3688
cat("T-Statistic:", round(t_stat, 4), "\n")
T-Statistic: 1.627
cat("Degrees of Freedom:", df, "\n")
Degrees of Freedom: 48
cat("P-Value:", round(p_value_t, 4), "\n")
P-Value: 0.1103
cat("Critical T (±):", round(t_critical, 4), "\n")
Critical T (±): 2.0106
cat("Cohen's d:", round(cohens_d, 4), "\n")
Cohen's d: 0.4602
Two-Sample T-Test Formula:

Step 1: Pooled Standard Deviation
Sp = √[((n_A - 1)×s_A² + (n_B - 1)×s_B²) / (n_A + n_B - 2)]
Sp = √[(24 × 1.44 + 24 × 1.96) / 48] = √1.70 = 1.3038

Step 2: Standard Error
SE = Sp × √(1/n_A + 1/n_B) = 1.3038 × √(1/25 + 1/25) = 0.3688

Step 3: T-Statistic
T = (x̄_B - x̄_A) / SE = (5.4 - 4.8) / 0.3688 = 1.627

P-value = 2 × P(t₄₈ > |1.627|) = 0.1103
t_results <- data.frame(
Statistic = c("Mean Difference (B - A)", "Pooled Std Dev", "Standard Error",
"T-Statistic", "Degrees of Freedom", "P-Value", "Critical T (±)",
"Cohen's d"),
Value = c(round(mean_B - mean_A, 4), round(sp, 4), round(se_diff, 4),
round(t_stat, 4), df, round(p_value_t, 4),
paste("±", round(t_critical, 4)), round(cohens_d, 4))
)
kable(t_results, caption = "T-Test Results Summary", align = 'lr') %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
row_spec(4, bold = TRUE, background = "#FFE4E6") %>%
row_spec(6, bold = TRUE, background = "#E0F2FE")
| Statistic | Value |
|---|---|
| Mean Difference (B - A) | 0.6 |
| Pooled Std Dev | 1.3038 |
| Standard Error | 0.3688 |
| T-Statistic | 1.627 |
| Degrees of Freedom | 48 |
| P-Value | 0.1103 |
| Critical T (±) | ± 2.0106 |
| Cohen’s d | 0.4602 |
# Simulated data for visualization
set.seed(123)
sim_A <- rnorm(n_A, mean_A, sd_A)
sim_B <- rnorm(n_B, mean_B, sd_B)
df_sim <- data.frame(
Version = rep(c("Version A", "Version B"), each = 25),
Duration = c(sim_A, sim_B)
)
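As a cross-check on the simulated data (our addition; results will differ slightly from the summary-statistics calculation above because the simulated samples only approximate the reported means and SDs), base R's pooled t-test can be run directly:

```r
# Pooled two-sample t-test on the simulated observations (illustrative only)
t.test(sim_B, sim_A, var.equal = TRUE)
```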
# Boxplot
p1 <- ggplot(df_sim, aes(x = Version, y = Duration, fill = Version)) +
geom_boxplot(alpha = 0.7, outlier.shape = 21) +
geom_jitter(width = 0.2, alpha = 0.4, size = 2) +
stat_summary(fun = mean, geom = "point", shape = 23,
size = 4, fill = "red", color = "darkred") +
scale_fill_manual(values = c("#93C5FD", "#FCA5A5")) +
labs(title = "Session Duration Comparison: A/B Test",
subtitle = "Boxplot with individual data points (Red diamond = mean)",
y = "Session Duration (minutes)", x = "") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
legend.position = "none",
panel.background = element_rect(fill = "#F0F9FF", color = NA)
)
# T-distribution
x_t <- seq(-4, 4, length.out = 1000)
y_t <- dt(x_t, df)
df_t <- data.frame(x = x_t, y = y_t)
p2 <- ggplot(df_t, aes(x = x, y = y)) +
geom_line(color = "#0EA5E9", size = 1.5) +
geom_area(data = df_t %>% filter(x < -t_critical),
aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
geom_area(data = df_t %>% filter(x > t_critical),
aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
geom_vline(xintercept = t_stat, color = "#8B1538",
linetype = "dashed", size = 1.5) +
geom_vline(xintercept = c(-t_critical, t_critical),
color = "#F97316", linetype = "dotted", size = 1.2) +
annotate("text", x = t_stat, y = max(y_t) * 0.8,
label = paste0("T = ", round(t_stat, 3)),
color = "#8B1538", size = 5, fontface = "bold", hjust = -0.2) +
labs(title = "T-Distribution with Test Statistic",
subtitle = paste0("df = ", df, " | p-value = ", round(p_value_t, 4)),
x = "T-Score", y = "Density") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
panel.background = element_rect(fill = "#F0F9FF", color = NA)
)
grid.arrange(p1, p2, ncol = 1)
Decision Rule: Reject H₀ if |T| > 2.0106 or if p-value < 0.05
Observed: |T| = 1.627 < 2.0106 and p-value = 0.1103 > 0.05.
Conclusion: We FAIL TO REJECT H₀ at α = 0.05 significance level.
There is insufficient statistical evidence to conclude that there is a significant difference in average session duration between Version A and Version B.
Effect Size: Cohen’s d = 0.4602 (small to medium effect)
Key Findings: Version B's mean session duration is 0.6 minutes higher than Version A's, but the difference is not statistically significant at α = 0.05; Cohen's d of 0.46 suggests a potentially meaningful effect that this sample is too small to confirm.
Product Recommendation: Do not roll out Version B on this evidence alone; extend the test with larger samples to achieve adequate statistical power before deciding.
Risk Assessment: With p-value = 0.1103, there is an 11.03% probability of observing a difference at least this large by chance alone if there is truly no difference between versions.
An e-commerce company examines whether device type is associated with payment method preference.
# Contingency table
payment_matrix <- matrix(
c(120, 80, 50,
60, 90, 40),
nrow = 2, byrow = TRUE,
dimnames = list(
Device = c("Mobile", "Desktop"),
Payment = c("E-Wallet", "Credit Card", "Cash on Delivery")
)
)
kable(payment_matrix, caption = "Observed Frequencies: Device Type vs Payment Method") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
add_header_above(c(" " = 1, "Payment Method" = 3)) %>%
row_spec(0, bold = TRUE, color = "white", background = "#8B1538")
| E-Wallet | Credit Card | Cash on Delivery | |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
H₀ (Null Hypothesis): Device type and payment method preference are independent
There is no association between device type and payment method choice.
H₁ (Alternative Hypothesis): Device type and payment method preference are dependent
There is an association between device type and payment method choice.
Significance Level: α = 0.05
Why Chi-Square Test of Independence?
We use the Chi-Square (χ²) Test of Independence because both variables (device type and payment method) are categorical, the data are counts in a contingency table, and we want to know whether the two variables are associated. All expected cell frequencies (shown below) exceed 5, so the test's approximation is valid.
# Perform Chi-Square test
chi_test <- chisq.test(payment_matrix)
# Extract results
chi_stat <- chi_test$statistic
p_value_chi <- chi_test$p.value
df_chi <- chi_test$parameter
expected_freq <- chi_test$expected
# Critical value
chi_critical <- qchisq(1 - alpha, df_chi)
cat("Chi-Square Statistic (χ²):", round(chi_stat, 4), "\n")
Chi-Square Statistic (χ²): 13.7736
cat("Degrees of Freedom:", df_chi, "\n")
Degrees of Freedom: 2
cat("P-Value:", round(p_value_chi, 4), "\n")
P-Value: 0.001
cat("Critical Chi-Square:", round(chi_critical, 4), "\n")
Critical Chi-Square: 5.9915
Chi-Square Test Formula:

χ² = Σ [(Observed - Expected)² / Expected]

Degrees of Freedom:

df = (number of rows - 1) × (number of columns - 1) = (2 - 1) × (3 - 1) = 2

Chi-Square Statistic: χ² = 13.7736
P-Value: 0.001
Critical Value (α = 0.05): 5.9915
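The expected frequencies reported by chisq.test can also be reproduced by hand (a sketch we have added for clarity): each cell's expectation is its row total times its column total, divided by the grand total.

```r
# Expected counts under independence: (row total x column total) / grand total
row_tot <- rowSums(payment_matrix)
col_tot <- colSums(payment_matrix)
round(outer(row_tot, col_tot) / sum(payment_matrix), 2)
```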
kable(round(expected_freq, 2),
caption = "Expected Frequencies (under H₀: Independence)") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
add_header_above(c(" " = 1, "Payment Method" = 3)) %>%
row_spec(0, bold = TRUE, color = "white", background = "#6366F1")
| E-Wallet | Credit Card | Cash on Delivery | |
|---|---|---|---|
| Mobile | 102.27 | 96.59 | 51.14 |
| Desktop | 77.73 | 73.41 | 38.86 |
chi_results <- data.frame(
Statistic = c("Chi-Square (χ²)", "Degrees of Freedom", "P-Value",
"Critical Value", "Decision"),
Value = c(round(chi_stat, 4), df_chi, round(p_value_chi, 4),
round(chi_critical, 4),
ifelse(chi_stat > chi_critical, "Reject H₀", "Fail to Reject H₀"))
)
kable(chi_results, caption = "Chi-Square Test Results", align = 'lr') %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
row_spec(1, bold = TRUE, background = "#FFE4E6") %>%
row_spec(3, bold = TRUE, background = "#E0F2FE")
| Statistic | Value |
|---|---|
| Chi-Square (χ²) | 13.7736 |
| Degrees of Freedom | 2 |
| P-Value | 0.001 |
| Critical Value | 5.9915 |
| Decision | Reject H₀ |
Decision Rule: Reject H₀ if χ² > 5.9915 or if p-value < 0.05
Observed: χ² = 13.7736 > 5.9915 and p-value = 0.001 < 0.05.
Conclusion: We REJECT H₀ at α = 0.05 significance level.
There is sufficient statistical evidence to conclude that device type and payment method preference are dependent (associated).
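To gauge the strength of that association (our supplement; Cramér's V is not part of the original output):

```r
# Cramér's V: effect size for a contingency table
n_total <- sum(payment_matrix)
v <- sqrt(as.numeric(chi_stat) / (n_total * (min(dim(payment_matrix)) - 1)))
round(v, 3)  # ~0.177: a modest, though statistically significant, association
```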
# Convert to long format
df_payment <- as.data.frame(as.table(payment_matrix))
colnames(df_payment) <- c("Device", "Payment", "Frequency")
# Grouped bar chart
p1 <- ggplot(df_payment, aes(x = Device, y = Frequency, fill = Payment)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
geom_text(aes(label = Frequency), position = position_dodge(0.9),
vjust = -0.5, size = 4, fontface = "bold") +
scale_fill_manual(values = c("#93C5FD", "#FCA5A5", "#FDE68A")) +
labs(title = "Payment Method Distribution by Device Type",
subtitle = "Observed Frequencies",
y = "Frequency", x = "Device Type") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
legend.position = "top",
panel.background = element_rect(fill = "#F0F9FF", color = NA)
)
# Proportional stacked bar
df_prop <- df_payment %>%
group_by(Device) %>%
mutate(Proportion = Frequency / sum(Frequency) * 100)
p2 <- ggplot(df_prop, aes(x = Device, y = Proportion, fill = Payment)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_text(aes(label = paste0(round(Proportion, 1), "%")),
position = position_stack(vjust = 0.5),
size = 4, fontface = "bold", color = "white") +
scale_fill_manual(values = c("#93C5FD", "#FCA5A5", "#FDE68A")) +
labs(title = "Payment Method Preference by Device (Proportional)",
subtitle = "Percentage Distribution",
y = "Percentage (%)", x = "Device Type") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
legend.position = "top",
panel.background = element_rect(fill = "#F0F9FF", color = NA)
)
grid.arrange(p1, p2, ncol = 1)
Key Findings:
Significant Association Detected! Device type influences payment method choice.
Mobile Users (250 total): 48% prefer E-Wallet (120), 32% Credit Card (80), and 20% Cash on Delivery (50).
Desktop Users (190 total): 31.6% prefer E-Wallet (60), 47.4% Credit Card (90), and 21.1% Cash on Delivery (40).
Strategic Recommendations: Prioritize E-Wallet integrations and promotions in the mobile experience, and streamline credit card checkout on desktop; Cash on Delivery demand is similar (~20%) on both platforms.
Expected Business Impact: Tailoring payment options to each device's dominant preference should reduce checkout friction and may improve conversion.
A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.
Hypotheses:
H₀ (Null Hypothesis): The new algorithm does not reduce fraudulent transactions.
H₁ (Alternative Hypothesis): The new algorithm reduces fraudulent transactions.
Explain Type I Error in This Context
Definition: Type I Error occurs when we reject H₀ when H₀ is actually true (False Positive).
In This Context:
We conclude that the algorithm reduces fraud, but in reality, it does NOT reduce fraud.
Consequences:
Probability: P(Type I Error) = α (typically 0.05 or 5%)
Financial Impact: Moderate - Limited to implementation costs (~$50K-$200K)
Explain Type II Error in This Context
Definition: Type II Error occurs when we fail to reject H₀ when H₁ is actually true (False Negative).
In This Context:
We conclude that the algorithm does NOT reduce fraud, but in reality, it DOES reduce fraud.
Consequences:
Probability: P(Type II Error) = β (depends on sample size and effect size)
Statistical Power: Power = 1 - β (probability of correctly detecting the improvement)
Financial Impact: High - Ongoing fraud losses (potentially $500K-$5M+ annually)
Type II Error is MORE COSTLY in this fraud detection context.
Comparative Analysis:
| Error Type | Immediate Cost | Long-term Cost | Business Risk |
|---|---|---|---|
| Type I (False Positive) | $50K-$200K (Implementation) | Limited (One-time cost) | Low to Moderate (Reversible) |
| Type II (False Negative) | $0 (No action taken) | $500K-$5M+ (Annual fraud losses) | High (Ongoing damage) |
Justification: A Type I error costs a bounded, one-time implementation expense and is reversible, whereas a Type II error lets fraud losses of $500K-$5M+ accumulate every year the superior algorithm goes unused.
Relationship: Sample size and Type II Error are inversely related.
As Sample Size Increases: the standard error shrinks, statistical power (1 - β) rises, and the probability of a Type II error (β) falls.
Why This Happens: The standard error scales as σ/√n, so larger samples produce tighter sampling distributions, making a true improvement easier to distinguish from random noise.
Practical Example: see the power calculation sketched below.
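A minimal sketch with hypothetical numbers (the effect size d = 0.3 and α = 0.05 are our illustrative assumptions, not values from the case study):

```r
# How Type II error (beta) shrinks as the sample grows, all else held fixed
for (n in c(50, 100, 200, 400)) {
  pw <- power.t.test(n = n, delta = 0.3, sd = 1, sig.level = 0.05)$power
  cat(sprintf("n = %3d  power = %.3f  beta = %.3f\n", n, pw, 1 - pw))
}
```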
Business Recommendation: Invest in larger sample sizes when testing critical systems like fraud detection to minimize costly Type II errors.
Explain Relationship Between α, β, and Statistical Power
Key Concepts:
1. Alpha (α) - Significance Level: the probability of a Type I error (rejecting a true H₀); chosen by the analyst in advance, typically 0.05.
2. Beta (β) - Type II Error Probability: the probability of failing to reject a false H₀; it depends on the sample size, the effect size, and α.
3. Statistical Power (1 - β): the probability of correctly rejecting a false H₀, i.e., of detecting an effect that truly exists; 0.80 is a common target.
Relationships:
| If You… | Then… | Trade-off |
|---|---|---|
| Decrease α (more strict) | β increases (Power decreases) | Fewer false positives, more false negatives |
| Increase sample size (n) | β decreases (Power increases) | Better at detecting true effects |
| Larger effect size | β decreases (Power increases) | Easier to detect big differences |
Optimal Balance: A common convention is α = 0.05 with a target power of 0.80, which accepts β = 0.20; when Type II errors are costly (as in fraud detection), aim for higher power.
Formula: Power = 1 - β
If β = 0.20, then Power = 1 - 0.20 = 0.80 (80%)
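For example (our sketch, reusing the Cohen's d ≈ 0.46 from the A/B test above), power.t.test can report the sample size needed to reach 80% power:

```r
# Per-group n needed for 80% power at d = 0.46 (two-sided, alpha = 0.05)
power.t.test(delta = 0.46, sd = 1, sig.level = 0.05, power = 0.80)$n
# ~75 per group, versus the 25 per group actually collected
```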
A churn prediction model evaluation yields the following results:
Test statistic = 2.31, P-Value = 0.021
Definition: The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
In This Context:
If the churn prediction model has NO real predictive power (H₀ is true), there is a 2.1% chance of observing a test statistic as extreme as 2.31 or more extreme, purely by random chance.
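The 2.1% figure can be reproduced directly (assuming, as the text implies, a two-sided Z-test with an observed statistic of 2.31):

```r
# Two-sided tail probability for the observed statistic
2 * pnorm(2.31, lower.tail = FALSE)  # 0.0209, reported as 0.021
```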
Interpretation: Because this probability is small (2.1% < 5%), the observed result would be surprising if the model truly had no predictive power; the data therefore provide evidence against H₀.
Common Misconceptions to Avoid: the p-value is NOT the probability that H₀ is true, NOT the probability that the result is due to chance, and NOT a measure of the effect's size or business importance.
Decision Rule: Reject H₀ if p-value < α
Comparison: p-value = 0.021 < α = 0.05.
Decision: REJECT H₀ at the 5% significance level.
Conclusion: There is statistically significant evidence that the churn prediction model has meaningful predictive power. The observed performance is unlikely to have occurred by chance alone.
Confidence Level: The test was conducted at the 95% confidence level (1 - α = 0.95).
Executive Summary:
“Our new churn prediction model has been rigorously tested and the results show that it works significantly better than random guessing.
What We Found: The model's performance is statistically significant (p = 0.021), meaning results this strong would occur only about 2% of the time if the model were no better than random guessing.
What This Means for the Business: We can act on the model's churn predictions with reasonable confidence, targeting retention offers at the customers it flags.
Recommendation: Deploy the model into production with proper monitoring and validation protocols.
Next Steps: Run a pilot on a representative customer sample, monitor prediction accuracy and false positive rates, and re-validate before full rollout.
”
Discuss the Risk if the Sample is Not Representative
Critical Assumption: Statistical tests assume the sample is representative of the population.
If Sample is NOT Representative:
1. Selection Bias Risks: The evaluation sample may over-represent certain segments (e.g., long-tenured or highly active customers), so the model is scored on customers unlike those it will actually serve.
2. Consequences: Performance estimates are biased, the model may fail on under-represented segments, and deployment decisions rest on misleading numbers.
3. Impact on Statistical Validity: The p-value and confidence statements assume random sampling from the target population; with a biased sample, p = 0.021 no longer means what it claims, and conclusions may not generalize.
4. How to Mitigate: Use random or stratified sampling, compare the sample's composition against the customer base, reweight or resample where they diverge, and validate on a holdout drawn from the true target population.
Business Recommendation:
Before full deployment, conduct a pilot test on a representative sample of the current customer base to validate that the model performs as expected. Monitor key metrics like prediction accuracy, false positive rate, and business KPIs (retention rate, CLV).
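A minimal sketch of such a representative pilot draw (the customers table and segment labels here are hypothetical, purely for illustration):

```r
library(dplyr)
set.seed(42)
# Hypothetical customer table for illustration
customers <- data.frame(
  id = 1:1000,
  segment = sample(c("new", "regular", "vip"), 1000, replace = TRUE)
)
# Stratified 10% sample: the same sampling fraction within every segment
pilot <- customers %>% group_by(segment) %>% slice_sample(prop = 0.10)
table(pilot$segment)
```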
Critical Distinction: P-value measures statistical significance, NOT practical significance (effect size).
What P-Value Tells Us: how incompatible the observed data are with H₀, i.e., the strength of evidence that some effect exists.
What P-Value Does NOT Tell Us: how large the effect is, whether it matters in practice, or the probability that H₀ (or H₁) is true.
Example Scenarios:
| Scenario | P-Value | Effect Size | Business Impact |
|---|---|---|---|
| Large sample, tiny improvement | 0.001 (very significant) | 0.5% churn reduction | Low: not worth implementing |
| Small sample, large improvement | 0.08 (not significant) | 15% churn reduction | High: needs a larger study |
| Optimal, meaningful result | 0.021 (significant) | 8% churn reduction | High: deploy immediately |
Why This Matters:
Scenario 1: With millions of customers, even a 0.1% difference will show p < 0.001, but the business impact is negligible. Statistically significant ≠ Practically important.
Scenario 2: A small pilot shows 15% improvement but p = 0.08. Don’t dismiss it! The effect is large but sample was too small. Not significant ≠ No effect.
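Scenario 1 is easy to demonstrate numerically (the sample size and churn rates here are our hypothetical illustration):

```r
# A 0.1 percentage-point churn difference becomes "significant" at huge n
p1 <- 0.200; p2 <- 0.199; n <- 1e7   # 10 million customers per arm
se <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z <- (p1 - p2) / se
2 * pnorm(abs(z), lower.tail = FALSE)  # ~2e-08: p < 0.001, impact negligible
```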
Best Practice: Always report BOTH the p-value (statistical significance) and the effect size (practical magnitude), ideally with a confidence interval.
For Our Churn Model: p = 0.021 establishes that the effect is real; pair it with the estimated churn reduction and its ROI to establish that it is worth deploying.
Recommendation: Supplement statistical significance testing with effect size estimates and ROI calculations for management decisions.
1. Case Study 1 - One-Sample Z-Test: Rejected H₀ (p = 0.0329); average daily study time differs from the claimed 120 minutes, though the 4-minute gap is practically small.
2. Case Study 3 - Two-Sample T-Test (A/B Testing): Failed to reject H₀ (p = 0.1103); no significant difference between landing page versions, but Cohen's d ≈ 0.46 warrants a larger follow-up test.
3. Case Study 4 - Chi-Square Test: Rejected H₀ (p = 0.001); device type and payment method preference are associated.
4. Case Study 5 - Type I & II Errors: In fraud detection, Type II errors are the costlier risk; larger samples and higher target power mitigate them.
5. Case Study 6 - P-Value Interpretation: p = 0.021 is statistically significant, but statistical significance must be weighed alongside effect size, sample representativeness, and business impact.