Statistical Inference

TASKS WEEK 14

1 Case Study 1: One-Sample Z-Test

1.1 Problem Statement

Study Context

A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.

A random sample of 64 users shows an average study time of 116 minutes.

Given:

  • μ₀ = 120 minutes (claimed population mean)
  • σ = 15 minutes (population standard deviation)
  • n = 64 users (sample size)
  • x̄ = 116 minutes (sample mean)
  • α = 0.05 (significance level)

1.2 Hypotheses

Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁)

H₀ (Null Hypothesis): μ = 120 minutes

The true average daily study time is 120 minutes as claimed.

H₁ (Alternative Hypothesis): μ ≠ 120 minutes

The true average daily study time differs from 120 minutes.

Type of Test: Two-tailed test

Significance Level: α = 0.05

1.3 Appropriate Statistical Test

Why One-Sample Z-Test?

We use the One-Sample Z-Test for the following reasons:

  1. Population standard deviation (σ) is KNOWN: σ = 15 minutes (from historical records)
  2. Large sample size: n = 64 ≥ 30, so Central Limit Theorem applies
  3. Testing a single population mean: We’re comparing sample mean to a claimed population value
  4. Quantitative data: Study time is a continuous numerical variable

Decision Rule: When σ is known and n ≥ 30, use Z-test instead of t-test.

1.4 Compute Test Statistic and P-Value

# Given data
mu_0 <- 120
sigma <- 15
n <- 64
x_bar <- 116
alpha <- 0.05

# Calculate Standard Error
se <- sigma / sqrt(n)

# Calculate Z-statistic
z_stat <- (x_bar - mu_0) / se

# Calculate p-value (two-tailed)
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

# Critical values
z_critical <- qnorm(1 - alpha/2)

# Results
cat("Standard Error (SE):", round(se, 4), "\n")
Standard Error (SE): 1.875 
cat("Z-Statistic:", round(z_stat, 4), "\n")
Z-Statistic: -2.1333 
cat("P-Value:", round(p_value, 4), "\n")
P-Value: 0.0329 
cat("Critical Z-values: ±", round(z_critical, 4), "\n")
Critical Z-values: ± 1.96 

Z-Test Formula:

Z = (x̄ - μ₀) / (σ / √n)

Calculation:
SE = σ / √n = 15 / √64 = 15 / 8 = 1.875

Z = (116 - 120) / 1.875
Z = -4 / 1.875
Z = -2.1333

P-value = 2 × P(Z > |-2.1333|) = 0.0329
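
As a complementary check (not part of the original output), a 95% confidence interval for the mean can be built from the quantities already computed above; because the interval excludes μ₀ = 120, it leads to the same decision as the two-tailed test. A minimal sketch, reusing x_bar, se, and z_critical from the chunk above:

# Hedged sketch: 95% confidence interval for the population mean
ci_lower <- x_bar - z_critical * se   # 116 - 1.96 * 1.875 = 112.325
ci_upper <- x_bar + z_critical * se   # 116 + 1.96 * 1.875 = 119.675
cat("95% CI for mu: [", round(ci_lower, 3), ",", round(ci_upper, 3), "]\n")
# The claimed mean of 120 lies outside the interval, consistent with rejecting H0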

results_df <- data.frame(
  Statistic = c("Standard Error (SE)", "Z-Statistic", "P-Value", 
                "Critical Z (±)", "Decision"),
  Value = c(round(se, 4), round(z_stat, 4), round(p_value, 4),
            round(z_critical, 4), 
            ifelse(abs(z_stat) > z_critical, "Reject H₀", "Fail to Reject H₀"))
)

kable(results_df, caption = "Test Results Summary", align = 'lr') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
  row_spec(2, bold = TRUE, background = "#FFE4E6") %>%
  row_spec(3, bold = TRUE, background = "#E0F2FE")
Test Results Summary
Statistic Value
Standard Error (SE) 1.875
Z-Statistic -2.1333
P-Value 0.0329
Critical Z (±) 1.96
Decision Reject H₀

1.5 Visualization

x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)
df_plot <- data.frame(x = x, y = y)

ggplot(df_plot, aes(x = x, y = y)) +
  geom_line(color = "#0EA5E9", size = 1.5) +
  geom_area(data = df_plot %>% filter(x < -z_critical), 
            aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
  geom_area(data = df_plot %>% filter(x > z_critical), 
            aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
  geom_vline(xintercept = z_stat, color = "#8B1538", 
             linetype = "dashed", size = 1.5) +
  geom_vline(xintercept = c(-z_critical, z_critical), 
             color = "#F97316", linetype = "dotted", size = 1.2) +
  annotate("text", x = z_stat, y = max(y) * 0.8, 
           label = paste0("Z = ", round(z_stat, 3)), 
           color = "#8B1538", size = 5, fontface = "bold", hjust = 1.2) +
  annotate("text", x = -2.5, y = 0.05, 
           label = "Rejection\nRegion", 
           color = "#DC2626", size = 4, fontface = "bold") +
  annotate("text", x = 2.5, y = 0.05, 
           label = "Rejection\nRegion", 
           color = "#DC2626", size = 4, fontface = "bold") +
  labs(title = "One-Sample Z-Test: Standard Normal Distribution",
       subtitle = paste0("H₀: μ = ", mu_0, " | α = ", alpha, " | p-value = ", round(p_value, 4)),
       x = "Z-Score", y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", color = "#8B1538", hjust = 0.5),
    plot.subtitle = element_text(size = 12, color = "#475569", hjust = 0.5),
    panel.background = element_rect(fill = "#F0F9FF", color = NA)
  )

1.6 State the Conclusion

Statistical Decision

Decision Rule: Reject H₀ if |Z| > 1.96 or if p-value < 0.05

Observed:

  • |Z| = 2.1333 > 1.96
  • p-value = 0.0329 < 0.05

Conclusion: We REJECT H₀ at α = 0.05 significance level.

There is sufficient statistical evidence to conclude that the true average daily study time differs from the claimed 120 minutes.

1.7 Business Analytics Context

Interpretation in Business Context

Current Situation:

  • Platform claims: 120 minutes average study time
  • Observed data: 116 minutes average (4 minutes difference)
  • Statistical significance: YES - The difference is significant

Business Implications:

  1. Platform Claim Challenged: The observed difference is statistically significant and suggests the platform’s claim may need revision.
  2. User Engagement: There may be a decline in user engagement that warrants investigation.
  3. Recommendations: Conduct further analysis to identify causes of reduced study time and implement engagement strategies.

Practical Significance: While statistical significance is important, consider whether a 4-minute difference has meaningful business impact in your context.


2 Case Study 3: Two-Sample T-Test (A/B Testing)

2.1 Problem Statement

A/B Test Scenario

A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.

Data Collected:

n_A <- 25
mean_A <- 4.8
sd_A <- 1.2

n_B <- 25
mean_B <- 5.4
sd_B <- 1.4

alpha <- 0.05

ab_data <- data.frame(
  Version = c("Version A", "Version B"),
  `Sample Size (n)` = c(n_A, n_B),
  `Mean (minutes)` = c(mean_A, mean_B),
  `Standard Deviation` = c(sd_A, sd_B),
  check.names = FALSE
)

kable(ab_data, caption = "A/B Test Data Summary", align = 'lrrr') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
  row_spec(1, background = "#DBEAFE") %>%
  row_spec(2, background = "#FFE4E6")
A/B Test Data Summary
Version Sample Size (n) Mean (minutes) Standard Deviation
Version A 25 4.8 1.2
Version B 25 5.4 1.4

2.2 Null and Alternative Hypotheses

Hypotheses Formulation

H₀ (Null Hypothesis): μ_A = μ_B

There is no difference in average session duration between Version A and Version B.

H₁ (Alternative Hypothesis): μ_A ≠ μ_B

There is a significant difference in average session duration between the two versions.

Type of Test: Two-sample independent t-test (two-tailed)

Significance Level: α = 0.05

2.3 Type of T-Test Required

Why Two-Sample Independent T-Test?

We use the Two-Sample Independent T-Test because:

  1. Two separate groups: Version A users vs Version B users
  2. Independent samples: Different users in each group (not paired)
  3. Population standard deviations unknown: Only sample SDs available
  4. Comparing two means: Testing if μ_A ≠ μ_B
  5. Assume equal variances: sd_A = 1.2 and sd_B = 1.4 are reasonably similar

Test Type: Pooled variance t-test (assuming equal population variances)
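
Note on robustness: if the equal-variance assumption were in doubt, a Welch (unequal-variance) t-test could be computed from the same summary statistics. The sketch below is supplementary, reusing the values defined in Section 2.1; with equal group sizes and similar SDs it gives virtually the same answer as the pooled test that follows.

# Hedged sketch: Welch's t-test from summary statistics (no equal-variance assumption)
se_welch <- sqrt(sd_A^2 / n_A + sd_B^2 / n_B)
t_welch  <- (mean_B - mean_A) / se_welch

# Welch-Satterthwaite degrees of freedom
df_welch <- (sd_A^2 / n_A + sd_B^2 / n_B)^2 /
  ((sd_A^2 / n_A)^2 / (n_A - 1) + (sd_B^2 / n_B)^2 / (n_B - 1))

p_welch <- 2 * pt(abs(t_welch), df_welch, lower.tail = FALSE)

cat("Welch t:", round(t_welch, 4),
    "| df:", round(df_welch, 2),
    "| p-value:", round(p_welch, 4), "\n")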

2.4 Compute Test Statistic and P-Value

# Pooled standard deviation
sp <- sqrt(((n_A - 1) * sd_A^2 + (n_B - 1) * sd_B^2) / (n_A + n_B - 2))

# Standard error of difference
se_diff <- sp * sqrt(1/n_A + 1/n_B)

# T-statistic
t_stat <- (mean_B - mean_A) / se_diff

# Degrees of freedom
df <- n_A + n_B - 2

# P-value (two-tailed)
p_value_t <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# Critical value
t_critical <- qt(1 - alpha/2, df)

# Effect size (Cohen's d)
cohens_d <- (mean_B - mean_A) / sp

cat("Pooled Standard Deviation (Sp):", round(sp, 4), "\n")
Pooled Standard Deviation (Sp): 1.3038 
cat("Standard Error (SE):", round(se_diff, 4), "\n")
Standard Error (SE): 0.3688 
cat("T-Statistic:", round(t_stat, 4), "\n")
T-Statistic: 1.627 
cat("Degrees of Freedom:", df, "\n")
Degrees of Freedom: 48 
cat("P-Value:", round(p_value_t, 4), "\n")
P-Value: 0.1103 
cat("Critical T (±):", round(t_critical, 4), "\n")
Critical T (±): 2.0106 
cat("Cohen's d:", round(cohens_d, 4), "\n")
Cohen's d: 0.4602 

Two-Sample T-Test Formula:

Step 1: Pooled Standard Deviation
Sp = √[((n_A-1)×s_A² + (n_B-1)×s_B²) / (n_A + n_B - 2)]
Sp = √[((24)×1.44 + (24)×1.96) / 48]
Sp = 1.3038

Step 2: Standard Error
SE = Sp × √(1/n_A + 1/n_B)
SE = 1.3038 × √(1/25 + 1/25)
SE = 0.3688

Step 3: T-Statistic
T = (x̄_B - x̄_A) / SE
T = (5.4 - 4.8) / 0.3688
T = 1.627

P-value = 2 × P(t > |1.627|) with df = 48
P-value = 0.1103

t_results <- data.frame(
  Statistic = c("Mean Difference (B - A)", "Pooled Std Dev", "Standard Error",
                "T-Statistic", "Degrees of Freedom", "P-Value", "Critical T (±)",
                "Cohen's d"),
  Value = c(round(mean_B - mean_A, 4), round(sp, 4), round(se_diff, 4),
            round(t_stat, 4), df, round(p_value_t, 4), 
            paste("±", round(t_critical, 4)), round(cohens_d, 4))
)

kable(t_results, caption = "T-Test Results Summary", align = 'lr') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
  row_spec(4, bold = TRUE, background = "#FFE4E6") %>%
  row_spec(6, bold = TRUE, background = "#E0F2FE")
T-Test Results Summary
Statistic Value
Mean Difference (B - A) 0.6
Pooled Std Dev 1.3038
Standard Error 0.3688
T-Statistic 1.627
Degrees of Freedom 48
P-Value 0.1103
Critical T (±) ± 2.0106
Cohen’s d 0.4602

2.5 Visualization

# Simulated data for visualization
set.seed(123)
sim_A <- rnorm(n_A, mean_A, sd_A)
sim_B <- rnorm(n_B, mean_B, sd_B)

df_sim <- data.frame(
  Version = rep(c("Version A", "Version B"), each = 25),
  Duration = c(sim_A, sim_B)
)

# Boxplot
p1 <- ggplot(df_sim, aes(x = Version, y = Duration, fill = Version)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 21) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 2) +
  stat_summary(fun = mean, geom = "point", shape = 23, 
               size = 4, fill = "red", color = "darkred") +
  scale_fill_manual(values = c("#93C5FD", "#FCA5A5")) +
  labs(title = "Session Duration Comparison: A/B Test",
       subtitle = "Boxplot with individual data points (Red diamond = mean)",
       y = "Session Duration (minutes)", x = "") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
    legend.position = "none",
    panel.background = element_rect(fill = "#F0F9FF", color = NA)
  )

# T-distribution
x_t <- seq(-4, 4, length.out = 1000)
y_t <- dt(x_t, df)
df_t <- data.frame(x = x_t, y = y_t)

p2 <- ggplot(df_t, aes(x = x, y = y)) +
  geom_line(color = "#0EA5E9", size = 1.5) +
  geom_area(data = df_t %>% filter(x < -t_critical),
            aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
  geom_area(data = df_t %>% filter(x > t_critical),
            aes(x = x, y = y), fill = "#F43F5E", alpha = 0.5) +
  geom_vline(xintercept = t_stat, color = "#8B1538",
             linetype = "dashed", size = 1.5) +
  geom_vline(xintercept = c(-t_critical, t_critical),
             color = "#F97316", linetype = "dotted", size = 1.2) +
  annotate("text", x = t_stat, y = max(y_t) * 0.8,
           label = paste0("T = ", round(t_stat, 3)),
           color = "#8B1538", size = 5, fontface = "bold", hjust = -0.2) +
  labs(title = "T-Distribution with Test Statistic",
       subtitle = paste0("df = ", df, " | p-value = ", round(p_value_t, 4)),
       x = "T-Score", y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
    panel.background = element_rect(fill = "#F0F9FF", color = NA)
  )

grid.arrange(p1, p2, ncol = 1)

2.6 Statistical Conclusion

Draw a Statistical Conclusion at α = 0.05

Decision Rule: Reject H₀ if |T| > 2.0106 or if p-value < 0.05

Observed:

  • |T| = 1.627 < 2.0106
  • p-value = 0.1103 > 0.05

Conclusion: We FAIL TO REJECT H₀ at α = 0.05 significance level.

There is insufficient statistical evidence to conclude that there is a significant difference in average session duration between Version A and Version B.

Effect Size: Cohen’s d = 0.4602 (small to medium effect)

2.7 Product Decision-Making

Interpret the Result for Product Decision-Making

Key Findings:

  • Version B: 5.4 minutes average session duration
  • Version A: 4.8 minutes average session duration
  • Difference: 0.6 minutes (12.5% increase)
  • Statistical significance: NO

Product Recommendation:

  1. No Clear Winner: The difference between versions is not statistically significant.
  2. Recommendation: Either version can be used, or continue testing with larger sample size.
  3. Consider: Test duration, seasonal effects, or segment-specific analysis.
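
To make recommendation 2 concrete, base R's power.t.test() can estimate how many users per version would be needed to detect the observed 0.6-minute difference with 80% power. This is a planning sketch only: it treats the observed means and pooled SD as if they were the true values, which is an optimistic assumption.

# Hedged sketch: required sample size per group to detect a 0.6-minute difference
# (delta and sd taken from the observed data; treat as planning values only)
power.t.test(delta = mean_B - mean_A,   # 0.6 minutes
             sd = sp,                   # pooled SD, about 1.3038
             sig.level = 0.05,
             power = 0.80,
             type = "two.sample",
             alternative = "two.sided")
# Returns n per group (roughly 75-80 users per version under these planning values)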

Risk Assessment: With p-value = 0.1103, there is an 11.03% probability of observing a difference at least this large by chance alone if there is truly no difference between the versions.


3 Case Study 4: Chi-Square Test of Independence

3.1 Problem Statement

E-Commerce Analysis

An e-commerce company examines whether device type is associated with payment method preference.

# Contingency table
payment_matrix <- matrix(
  c(120, 80, 50,
    60, 90, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Device = c("Mobile", "Desktop"),
    Payment = c("E-Wallet", "Credit Card", "Cash on Delivery")
  )
)

kable(payment_matrix, caption = "Observed Frequencies: Device Type vs Payment Method") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Payment Method" = 3)) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#8B1538")
Observed Frequencies: Device Type vs Payment Method
Payment Method
E-Wallet Credit Card Cash on Delivery
Mobile 120 80 50
Desktop 60 90 40

3.2 Null Hypothesis and Alternative Hypothesis

Hypotheses Formulation

H₀ (Null Hypothesis): Device type and payment method preference are independent

There is no association between device type and payment method choice.

H₁ (Alternative Hypothesis): Device type and payment method preference are dependent

There is an association between device type and payment method choice.

Significance Level: α = 0.05

3.3 Appropriate Statistical Test

Why Chi-Square Test of Independence?

We use the Chi-Square (χ²) Test of Independence because:

  1. Two categorical variables: Device type (Mobile/Desktop) and Payment method (3 categories)
  2. Testing association: We want to know if the variables are related
  3. Contingency table format: Data presented as frequency counts
  4. Independent observations: Each transaction is independent
  5. Expected frequencies check: All expected frequencies should be > 5

3.4 Compute Chi-Square Statistic

# Perform Chi-Square test
chi_test <- chisq.test(payment_matrix)

# Extract results
chi_stat <- chi_test$statistic
p_value_chi <- chi_test$p.value
df_chi <- chi_test$parameter
expected_freq <- chi_test$expected

# Critical value
chi_critical <- qchisq(1 - alpha, df_chi)

cat("Chi-Square Statistic (χ²):", round(chi_stat, 4), "\n")
Chi-Square Statistic (χ²): 13.7736 
cat("Degrees of Freedom:", df_chi, "\n")
Degrees of Freedom: 2 
cat("P-Value:", round(p_value_chi, 4), "\n")
P-Value: 0.001 
cat("Critical Chi-Square:", round(chi_critical, 4), "\n")
Critical Chi-Square: 5.9915 

Chi-Square Test Formula:

χ² = Σ [(Observed - Expected)² / Expected]

Degrees of Freedom:
df = (number of rows - 1) × (number of columns - 1)
df = (2 - 1) × (3 - 1) = 2

Chi-Square Statistic: χ² = 13.7736
P-Value: 0.001
Critical Value (α = 0.05): 5.9915

kable(round(expected_freq, 2), 
      caption = "Expected Frequencies (under H₀: Independence)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Payment Method" = 3)) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#6366F1")
Expected Frequencies (under H₀: Independence)
Payment Method
E-Wallet Credit Card Cash on Delivery
Mobile 102.27 96.59 51.14
Desktop 77.73 73.41 38.86
chi_results <- data.frame(
  Statistic = c("Chi-Square (χ²)", "Degrees of Freedom", "P-Value", 
                "Critical Value", "Decision"),
  Value = c(round(chi_stat, 4), df_chi, round(p_value_chi, 4),
            round(chi_critical, 4),
            ifelse(chi_stat > chi_critical, "Reject H₀", "Fail to Reject H₀"))
)

kable(chi_results, caption = "Chi-Square Test Results", align = 'lr') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), 
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#8B1538") %>%
  row_spec(1, bold = TRUE, background = "#FFE4E6") %>%
  row_spec(3, bold = TRUE, background = "#E0F2FE")
Chi-Square Test Results
Statistic Value
Chi-Square (χ²) 13.7736
Degrees of Freedom 2
P-Value 0.001
Critical Value 5.9915
Decision Reject H₀

3.5 Determine P-Value at α = 0.05

Statistical Decision

Decision Rule: Reject H₀ if χ² > 5.9915 or if p-value < 0.05

Observed:

  • χ² = 13.7736 > 5.9915
  • p-value = 0.001 < 0.05

Conclusion: We REJECT H₀ at α = 0.05 significance level.

There is sufficient statistical evidence to conclude that device type and payment method preference are dependent (associated).
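
Echoing the effect-size discussion in the later p-value case study, the strength of this association can be summarized with Cramér's V, computed directly from the chi-square statistic already obtained above. This is a supplementary sketch, not part of the original output.

# Hedged sketch: Cramer's V effect size for the 2 x 3 contingency table
n_total   <- sum(payment_matrix)            # 440 transactions
k_min     <- min(dim(payment_matrix)) - 1   # min(rows, cols) - 1 = 1
cramers_v <- sqrt(as.numeric(chi_stat) / (n_total * k_min))
cat("Cramer's V:", round(cramers_v, 4), "\n")
# Roughly 0.18 here: the association is statistically significant but modest in size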

3.6 Visualization

# Convert to long format
df_payment <- as.data.frame(as.table(payment_matrix))
colnames(df_payment) <- c("Device", "Payment", "Frequency")

# Grouped bar chart
p1 <- ggplot(df_payment, aes(x = Device, y = Frequency, fill = Payment)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = Frequency), position = position_dodge(0.9),
            vjust = -0.5, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("#93C5FD", "#FCA5A5", "#FDE68A")) +
  labs(title = "Payment Method Distribution by Device Type",
       subtitle = "Observed Frequencies",
       y = "Frequency", x = "Device Type") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
    legend.position = "top",
    panel.background = element_rect(fill = "#F0F9FF", color = NA)
  )

# Proportional stacked bar
df_prop <- df_payment %>%
  group_by(Device) %>%
  mutate(Proportion = Frequency / sum(Frequency) * 100)

p2 <- ggplot(df_prop, aes(x = Device, y = Proportion, fill = Payment)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  geom_text(aes(label = paste0(round(Proportion, 1), "%")),
            position = position_stack(vjust = 0.5),
            size = 4, fontface = "bold", color = "white") +
  scale_fill_manual(values = c("#93C5FD", "#FCA5A5", "#FDE68A")) +
  labs(title = "Payment Method Preference by Device (Proportional)",
       subtitle = "Percentage Distribution",
       y = "Percentage (%)", x = "Device Type") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", color = "#8B1538"),
    legend.position = "top",
    panel.background = element_rect(fill = "#F0F9FF", color = NA)
  )

grid.arrange(p1, p2, ncol = 1)

3.7 Digital Payment Strategy

Interpret Results for Digital Payment Strategy

Key Findings:

Significant Association Detected! Device type is associated with payment method choice.

Mobile Users (250 total):

  • E-Wallet: 120 (48%) - Highest preference
  • Credit Card: 80 (32%)
  • Cash on Delivery: 50 (20%)

Desktop Users (190 total):

  • E-Wallet: 60 (31.6%)
  • Credit Card: 90 (47.4%) - Highest preference
  • Cash on Delivery: 40 (21.1%)

Strategic Recommendations:

  1. Mobile-First E-Wallet Optimization:
    • Prioritize e-wallet integration on mobile app
    • Implement one-click payments and biometric authentication
    • Partner with popular mobile payment providers (GoPay, OVO, Dana)
    • Offer mobile-exclusive e-wallet cashback deals
  2. Desktop Credit Card Experience:
    • Streamline credit card checkout flow on desktop
    • Display security badges prominently
    • Offer installment options for high-value purchases
    • Enable card saving for returning customers
  3. Device-Specific Marketing:
    • Promote e-wallet deals in mobile push notifications
    • Highlight credit card benefits in desktop banners
    • A/B test payment option ordering by device
  4. Conversion Rate Optimization:
    • Default to preferred payment method based on device
    • Reduce friction by minimizing form fields
    • Test express checkout options per device

Expected Business Impact:

  • Potential 5-15% increase in conversion rates by optimizing payment flow per device
  • Reduced cart abandonment through device-appropriate payment options
  • Improved customer satisfaction and repeat purchase rates

4 Case Study 5: Type I and Type II Errors (Conceptual)

4.1 Problem Statement

Fraud Detection Algorithm Scenario

A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.

Hypotheses:

  • H₀: The new algorithm does NOT reduce fraud
  • H₁: The new algorithm REDUCES fraud

4.2 Type I Error (α)

Explain Type I Error in This Context

Definition: Type I Error occurs when we reject H₀ when H₀ is actually true (False Positive).

In This Context:

We conclude that the algorithm reduces fraud, but in reality, it does NOT reduce fraud.

Consequences:

  1. Wasted Investment: Company deploys an ineffective algorithm, wasting development and implementation costs
  2. False Confidence: Security team operates with false sense of security
  3. Opportunity Cost: Resources diverted from developing truly effective solutions
  4. Reputation Risk: When discovered, damages credibility of data science team
  5. Implementation Costs: Training, deployment, and maintenance of ineffective system

Probability: P(Type I Error) = α (typically 0.05 or 5%)

Financial Impact: Moderate - Limited to implementation costs (~$50K-$200K)
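
The statement P(Type I Error) = α can be illustrated with a quick simulation: if H₀ is true and the test is run repeatedly at α = 0.05, about 5% of runs will wrongly reject. A small self-contained sketch with made-up data, included here only for illustration:

# Hedged sketch: simulating the Type I error rate under a true null hypothesis
set.seed(42)
n_sims <- 10000
reject <- replicate(n_sims, {
  x <- rnorm(30, mean = 0, sd = 1)    # data generated with NO real effect
  t.test(x, mu = 0)$p.value < 0.05    # does the test reject anyway?
})
mean(reject)   # close to 0.05, i.e. the false-positive rate equals alpha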

4.3 Type II Error (β)

Explain Type II Error in This Context

Definition: Type II Error occurs when we fail to reject H₀ when H₁ is actually true (False Negative).

In This Context:

We conclude that the algorithm does NOT reduce fraud, but in reality, it DOES reduce fraud.

Consequences:

  1. Missed Opportunity: Company doesn’t deploy an effective fraud prevention tool
  2. Continued Fraud Losses: Business continues suffering from preventable fraud (potentially millions in losses)
  3. Competitive Disadvantage: Competitors may deploy better fraud detection first
  4. Customer Trust Erosion: Continued fraud affects customer satisfaction and retention
  5. Revenue Impact: Lost sales, increased chargebacks, and customer attrition
  6. Regulatory Risk: Failure to implement adequate fraud controls

Probability: P(Type II Error) = β (depends on sample size and effect size)

Statistical Power: Power = 1 - β (probability of correctly detecting the improvement)

Financial Impact: High - Ongoing fraud losses (potentially $500K-$5M+ annually)

4.4 Which Error is More Costly?

More Costly Error from Business Perspective

Type II Error is MORE COSTLY in this fraud detection context.

Comparative Analysis:

  • Type I (False Positive): immediate cost $50K-$200K (implementation); long-term cost limited (one-time cost); business risk low to moderate (reversible)
  • Type II (False Negative): immediate cost $0 (no action taken); long-term cost $500K-$5M+ (annual fraud losses); business risk high (ongoing damage)

Justification:

  • Magnitude: Ongoing fraud losses far exceed one-time implementation costs
  • Duration: Type II error leads to continuous losses; Type I is correctable
  • Customer Impact: Type II directly harms customers through fraud
  • Competitive Risk: Falling behind in fraud prevention is strategically dangerous

4.5 Sample Size and Type II Error

How Sample Size Affects Type II Error

Relationship: Sample size and Type II Error are inversely related.

As Sample Size Increases:

  • Type II Error probability (β) DECREASES
  • Statistical Power (1 - β) INCREASES
  • Ability to detect true effects IMPROVES
  • Confidence in test results STRENGTHENS

Why This Happens:

  1. Reduced Standard Error: SE = σ/√n decreases as n increases
  2. Narrower Confidence Intervals: More precise estimates of true effect
  3. Better Signal Detection: Easier to distinguish true effect from noise

Practical Example (illustrative figures for a fixed effect size at α = 0.05):

  • Small sample (n = 50): Power = 60%, β = 40% (a real fraud reduction would be missed 40% of the time)
  • Large sample (n = 500): Power = 95%, β = 5% (a real fraud reduction would be missed only 5% of the time)

Business Recommendation: Invest in larger sample sizes when testing critical systems like fraud detection to minimize costly Type II errors.
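
For planning purposes, the sample size needed to detect a given fraud reduction can be estimated with base R's power.prop.test(). The fraud rates below are purely hypothetical placeholders, not figures from the case study.

# Hedged sketch: transactions per group needed to detect a hypothetical
# drop in fraud rate from 2.0% (current) to 1.5% (with the new algorithm)
power.prop.test(p1 = 0.020,        # assumed baseline fraud rate (hypothetical)
                p2 = 0.015,        # assumed fraud rate with the algorithm (hypothetical)
                sig.level = 0.05,
                power = 0.80,
                alternative = "one.sided")
# Returns n per group; small absolute differences in rare events require very large samples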

4.6 Relationship: α, β, and Statistical Power

Explain Relationship Between α, β, and Statistical Power

Key Concepts:

1. Alpha (α) - Significance Level:

  • Probability of Type I Error (False Positive)
  • Typically set at 0.05 (5%)
  • Controls how strict we are about avoiding false positives

2. Beta (β) - Type II Error Probability:

  • Probability of Type II Error (False Negative)
  • Typically 0.20 (20%) or lower
  • Depends on sample size, effect size, and α

3. Statistical Power (1 - β):

  • Probability of correctly rejecting H₀ when H₁ is true
  • Typically aim for 0.80 (80%) or higher
  • Power = 1 - β

Relationships:

  • Decrease α (more strict) → β increases (power decreases): fewer false positives, more false negatives
  • Increase sample size (n) → β decreases (power increases): better at detecting true effects
  • Larger effect size → β decreases (power increases): easier to detect big differences

Optimal Balance:

  • α = 0.05 (5% Type I Error)
  • Power = 0.80 (80% chance to detect true effect)
  • β = 0.20 (20% Type II Error)

Formula: Power = 1 - β

If β = 0.20, then Power = 1 - 0.20 = 0.80 (80%)


5 Case Study 6: P-Value and Statistical Decision Making

5.1 Problem Statement

Churn Prediction Model Evaluation

A churn prediction model evaluation yields the following results:

  • Test statistic = 2.31
  • P-value = 0.021
  • Significance level: α = 0.05

5.2 Meaning of P-Value

Explain the Meaning of the P-Value

P-Value = 0.021

Definition: The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.

In This Context:

If the churn prediction model has NO real predictive power (H₀ is true), there is a 2.1% chance of observing a test statistic as extreme as 2.31 or more extreme, purely by random chance.

Interpretation:

  • Low p-value (0.021 < 0.05): The observed result is unlikely under H₀
  • Evidence against H₀: The data provides strong evidence that the model has predictive power
  • Not a probability of H₀: p-value is NOT the probability that H₀ is true

Common Misconceptions to Avoid:

  • “There’s a 2.1% chance H₀ is true” → WRONG
  • “There’s a 97.9% chance H₁ is true” → WRONG
  • “If H₀ were true, we’d see results this extreme only 2.1% of the time” → CORRECT
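
Assuming the reported test statistic of 2.31 is referenced against an approximately standard normal distribution (the problem statement does not specify the reference distribution), the p-value of 0.021 can be reproduced in one line:

# Hedged sketch: reproducing the two-tailed p-value from the test statistic,
# assuming a standard normal (large-sample) reference distribution
2 * pnorm(abs(2.31), lower.tail = FALSE)   # about 0.0209, i.e. 0.021 after rounding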

5.3 Make a Statistical Decision

Statistical Decision

Decision Rule: Reject H₀ if p-value < α

Comparison:

  • P-value = 0.021
  • α = 0.05
  • 0.021 < 0.05 ✓

Decision: REJECT H₀ at the 5% significance level.

Conclusion: There is statistically significant evidence that the churn prediction model has meaningful predictive power. The observed performance is unlikely to have occurred by chance alone.

Confidence Level: The decision is made at the 95% confidence level (1 - α = 0.95); if H₀ were true, a rejection like this would occur only 5% of the time.

5.4 Translate to Non-Technical Language

Translate the Decision into Non-Technical Language for Management

Executive Summary:

Our new churn prediction model has been rigorously tested, and the results show it performs significantly better than random guessing.

What We Found:

  • The model’s performance is statistically significant (p-value = 0.021)
  • If the model truly had no predictive power, results this strong would occur only about 2% of the time by chance
  • The evidence clears the 95% confidence threshold we use for calling a result real

What This Means for the Business:

  1. Actionable Insights: The model can reliably identify customers at risk of churning
  2. Proactive Intervention: We can target at-risk customers with retention campaigns before they leave
  3. Resource Optimization: Focus retention efforts on customers most likely to churn
  4. ROI Potential: Expected to reduce churn rate and increase customer lifetime value

Recommendation: Deploy the model into production with proper monitoring and validation protocols.

Next Steps:

  • Integrate model into CRM system
  • Develop automated retention workflows
  • Monitor model performance in production
  • A/B test retention strategies on high-risk segments

5.5 Risk if Sample is Not Representative

Discuss the Risk if the Sample is Not Representative

Critical Assumption: Statistical tests assume the sample is representative of the population.

If Sample is NOT Representative:

1. Selection Bias Risks:

  • Wrong Customer Segments: Model trained on high-value customers won’t work for budget segments
  • Seasonal Bias: Sample from holiday season may not apply to rest of year
  • Geographic Bias: Model trained on urban customers fails in rural markets
  • Temporal Bias: Historical data may not reflect current market conditions

2. Consequences:

  • Model Fails in Production: Predictions don’t work on real customer base
  • Wasted Resources: Investment in ineffective retention campaigns
  • Customer Alienation: Inappropriate messaging to wrong customer segments
  • False Confidence: P-value looks good, but results don’t generalize
  • Business Decisions Based on Flawed Data: Strategy built on unreliable insights

3. Impact on Statistical Validity:

  • External Validity Compromised: Results don’t apply beyond the biased sample
  • Overfitting Risk: Model learns sample-specific patterns, not general trends
  • Confidence Intervals Misleading: Uncertainty estimates are incorrect
  • P-value Unreliable: Statistical significance doesn’t translate to real-world effectiveness

4. How to Mitigate:

  1. Random Sampling: Ensure truly random selection from entire customer base
  2. Stratified Sampling: Include proportional representation from all segments
  3. Sample Size Calculation: Ensure adequate representation of subgroups
  4. Validation Datasets: Test model on separate, representative holdout set
  5. Cross-Validation: Validate across different time periods and segments
  6. A/B Testing: Pilot deployment to verify real-world performance
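
Mitigation item 2 (stratified sampling) can be implemented directly with dplyr, drawing the same proportion of customers from every segment. The customer table and segment labels below are hypothetical placeholders used only to make the sketch runnable.

# Hedged sketch: proportional stratified sample of customers by segment
library(dplyr)
set.seed(123)

# Hypothetical customer table used only for illustration
customers <- data.frame(
  customer_id = 1:1000,
  segment = sample(c("budget", "mid-tier", "premium"), 1000,
                   replace = TRUE, prob = c(0.5, 0.3, 0.2))
)

# Draw 10% within every segment so each segment is represented proportionally
stratified_sample <- customers %>%
  group_by(segment) %>%
  slice_sample(prop = 0.10) %>%
  ungroup()

table(stratified_sample$segment)   # segment mix mirrors the full customer base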

Business Recommendation:

Before full deployment, conduct a pilot test on a representative sample of the current customer base to validate that the model performs as expected. Monitor key metrics like prediction accuracy, false positive rate, and business KPIs (retention rate, CLV).

5.6 P-Value Does Not Measure Effect Size

Explain Why P-Value Does Not Measure Effect Size

Critical Distinction: P-value measures statistical significance, NOT practical significance (effect size).

What P-Value Tells Us:

  • Whether the effect is likely to be real (not due to chance)
  • Strength of evidence against H₀
  • Reliability of the finding

What P-Value Does NOT Tell Us:

  • How big the effect is
  • Whether the effect matters practically
  • Business impact or importance

Example Scenarios:

  • Scenario 1 (large sample, tiny improvement): p-value = 0.001 (very significant); effect size = 0.5% churn reduction; business impact low, not worth implementing
  • Scenario 2 (small sample, large improvement): p-value = 0.08 (not significant); effect size = 15% churn reduction; business impact high, needs a larger study
  • Scenario 3 (optimal, meaningful result): p-value = 0.021 (significant); effect size = 8% churn reduction; business impact high, deploy immediately

Why This Matters:

Scenario 1: With millions of customers, even a tiny improvement such as a 0.5% churn reduction will show p < 0.001, but the business impact is negligible. Statistically significant ≠ Practically important.

Scenario 2: A small pilot shows 15% improvement but p = 0.08. Don’t dismiss it! The effect is large but sample was too small. Not significant ≠ No effect.

Best Practice: Always report BOTH:

  1. P-value: Is the effect real? (Statistical significance)
  2. Effect Size: How big is it? (Practical significance)
    • Cohen’s d for t-tests
    • Odds ratio or risk ratio for categorical outcomes
    • R² for regression models
    • Percentage change in key metrics

For Our Churn Model:

  • P-value = 0.021 → Effect is likely real ✓
  • But we need to know: What’s the actual churn reduction? 2%? 10%? 20%?
  • Business decision depends on: Cost of implementation vs. value of churn reduction

Recommendation: Supplement statistical significance testing with effect size estimates and ROI calculations for management decisions.


6 References and Additional Resources

References

  1. Montgomery, D. C., & Runger, G. C. (2018). Applied Statistics and Probability for Engineers (7th ed.). John Wiley & Sons.
  2. Agresti, A., & Finlay, B. (2018). Statistical Methods for the Social Sciences (5th ed.). Pearson.
  3. Devore, J. L. (2015). Probability and Statistics for Engineering and the Sciences (9th ed.). Cengage Learning.
  4. Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2016). Probability & Statistics for Engineers & Scientists (9th ed.). Pearson.
  5. Field, A. (2017). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE Publications.
  6. Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
  7. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  8. Wasserstein, R. L., & Lazar, N. A. (2016). “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician, 70(2), 129-133.

🔗 Online Resources

Statistical Software & Packages Used

  • R Version: R version 4.5.2 (2025-10-31)
  • ggplot2: Elegant graphics for data visualization
  • dplyr: Data manipulation and transformation
  • knitr & kableExtra: Dynamic report generation and table formatting
  • gridExtra: Multiple plot arrangements

7 Summary and Key Takeaways

Key Learning Points

1. Case Study 1 - One-Sample Z-Test:

  • Use Z-test when population SD is known and sample size is large
  • Two-tailed tests detect differences in either direction
  • Statistical significance doesn’t always equal practical importance

2. Case Study 3 - Two-Sample T-Test (A/B Testing):

  • Independent t-tests compare means between two separate groups
  • Effect size (Cohen’s d) measures practical significance
  • Business decisions should consider both statistical and practical significance

3. Case Study 4 - Chi-Square Test:

  • Chi-square tests association between categorical variables
  • Contingency tables reveal patterns in cross-tabulated data
  • Device-specific optimization can significantly improve conversion rates

4. Case Study 5 - Type I & II Errors:

  • Type I Error: False positive (reject true H₀)
  • Type II Error: False negative (fail to reject false H₀)
  • Sample size directly affects statistical power
  • Business context determines which error is more costly

5. Case Study 6 - P-Value Interpretation:

  • P-value measures evidence against H₀, not probability of H₀
  • Low p-value indicates statistical significance
  • P-value ≠ effect size or practical importance
  • Representative sampling is critical for valid inference

Best Practices for Business Analytics

  1. Context Matters: Always interpret statistical results within business context
  2. Multiple Metrics: Report p-values, effect sizes, confidence intervals, and business KPIs
  3. Sample Quality: Ensure representative sampling before generalizing results
  4. Power Analysis: Calculate required sample sizes before conducting studies
  5. Practical Significance: Consider ROI and implementation costs, not just statistical significance
  6. Transparent Communication: Translate statistical findings into actionable business insights
  7. Validation: Always validate findings with holdout data or A/B tests
  8. Continuous Monitoring: Track performance metrics post-implementation