Study Cases

Statistical Inferences~ Week 14

Carol Dupino Pereira

NIM : 52250051

Student Major in Data Science at
Institut Teknologi Sains Bandung

R Programming
Data Science
Statistics
Supervising Lecturer
Bakti Siregar, M.Sc., CDS.

1 Case Study 1

1.1 One-Sample Z-Test (Statistical Hypotheses)

A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.

A random sample of 64 users shows an average study time of 116 minutes.

\[ \begin{eqnarray*} \mu_0 &=& 120 \\ \sigma &=& 15 \\ n &=& 64 \\ \bar{x} &=& 116 \end{eqnarray*} \]

1.2 Tasks

  1. Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
  2. Identify the appropriate statistical test and justify your choice.
  3. Compute the test statistic and p-value using \(\alpha = 0.05\).
  4. State the statistical decision.
  5. Interpret the result in a business analytics context.

1.3 Answer Case Study 1

  1. Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).

In hypothesis testing, we compare a “status quo” claim against a new observation to see if there is a statistically significant difference.

  • The Null Hypothesis (\(H_0\))

The null hypothesis represents the claim that there is no change or no difference. In this case, it assumes the platform’s claim is correct.

  • Statement: The average daily study time of users is equal to 120 minutes.

  • Mathematical Form: \[H_0: \mu = 120\]

  • The Alternative Hypothesis (\(H_1\))

The alternative hypothesis is what we suspect might actually be true (that the claim is incorrect). Since the problem does not specify if we are looking for “less than” or “greater than,” we test for any significant difference.

  • Statement: The average daily study time of users is not equal to 120 minutes.
  • Mathematical Form: \[H_1: \mu \neq 120\]

To move forward with the calculation (the Z-test), you have the following known values:

Statistical Parameters

| Parameter | Symbol | Value |
|---|---|---|
| Population Mean (claimed) | μ₀ | 120 minutes |
| Population Std. Deviation | σ | 15 minutes |
| Sample Size | n | 64 |
| Sample Mean | x̄ | 116 minutes |

  2. Identify the appropriate statistical test and justify your choice.

The choice of a Z-test over a T-test is based on two primary criteria found in the data:

  • Justification 1: Known Population Standard Deviation

The most critical factor in choosing a Z-test is that the population standard deviation (\(\sigma\)) is known. In this case, the platform provides a historical record showing \(\sigma = 15\).

  • If \(\sigma\) were unknown and we had to rely on the sample standard deviation (\(s\)), a T-test would typically be required to account for the additional uncertainty.

  • Justification 2: Large Sample Size (\(n \ge 30\))

The sample size in this study is 64, which is well above the common threshold of 30.

  • According to the Central Limit Theorem, when the sample size is large (\(n \ge 30\)), the sampling distribution of the mean becomes approximately normal, regardless of the shape of the underlying population distribution. This reinforces the validity of using the standard normal distribution (Z-distribution).

Requirements Check for Statistical Test

| Requirement | Case Study Value | Condition Met? |
|---|---|---|
| Population Std. Dev. (σ) | Known (15) | Yes |
| Sample Size (n) | 64 | Yes (since n ≥ 30) |
| Data Type | Continuous (minutes) | Yes |
| Independence | Random sample | Yes |

Conclusion: Because the population standard deviation is known and the sample size is sufficiently large, the One-Sample Z-Test is the appropriate method for this analysis.

  3. Compute the test statistic and p-value using \(\alpha = 0.05\).

Compute the Test Statistic

The Z-score measures how many standard errors the sample mean (\(\bar{x}\)) is from the population mean (\(\mu_0\)).

  • Step 1: Calculate the Standard Error (\(SE\))

The standard error represents the standard deviation of the sampling distribution of the mean.

\[SE = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{64}} = \frac{15}{8} = 1.875\]

  • Step 2: Calculate the Z-score (\(z\))

\[z = \frac{\bar{x} - \mu_0}{SE} = \frac{116 - 120}{1.875} = \frac{-4}{1.875} \approx -2.1333\]

The test statistic is \(z \approx -2.13\).
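The two steps above can be reproduced with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the original assignment.

```python
import math

mu0 = 120     # claimed population mean (minutes)
sigma = 15    # known population standard deviation (minutes)
n = 64        # sample size
x_bar = 116   # observed sample mean (minutes)

se = sigma / math.sqrt(n)   # standard error of the mean
z = (x_bar - mu0) / se      # one-sample z statistic

print(f"SE = {se:.3f}, z = {z:.4f}")
```

Keeping the unrounded value of `z` for later steps avoids small rounding differences in the p-value.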

Compute the P-Value

Since our alternative hypothesis is two-tailed (\(H_1: \mu \neq 120\)), we look for the probability of observing a Z-score more extreme than \(\pm 2.13\) in both tails of the standard normal distribution.

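  • P-value calculation: \(P(|Z| > 2.1333) = 2 \times P(Z < -2.1333)\)

  • Using software with the unrounded statistic: \(P(Z < -2.1333) \approx 0.01645\)

  • Two-tailed p-value \(= 2 \times 0.01645 \approx 0.0329\)

The p-value is \(0.0329\) (rounded). Looking up the rounded statistic \(z = -2.13\) in a printed table gives \(2 \times 0.0166 = 0.0332\); the decision is the same either way.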
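As a cross-check, the two-tailed p-value can be obtained from the standard normal CDF in Python's standard library. This is a sketch under the same inputs as the case study; `p_two_tailed` is an illustrative name.

```python
from statistics import NormalDist

z = (116 - 120) / (15 / 64 ** 0.5)             # unrounded z statistic
p_two_tailed = 2 * NormalDist().cdf(-abs(z))   # area in both tails

print(f"z = {z:.4f}, p = {p_two_tailed:.4f}")
```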

Z-Test Results Summary

| Statistical Measure | Value |
|---|---|
| Z-Statistic | -2.13 |
| P-value | 0.0329 |
| Threshold (α) | 0.05 |
| Final Decision | Reject H₀ |

Comparison: Since the p-value (\(0.0329\)) < \(\alpha\) (\(0.05\)), the result is statistically significant.

How to Read the Visualization:

  • Black Line: Standard normal distribution curve.

  • Red Area: \(H_0\) rejection region. The boundary is \(\pm 1.96\). If the Z-statistic falls into this area, we reject the platform’s claim.

  • Blue Dashed Line: The position of our sample data at -2.13.

  • Conclusion: Since the blue line is within the Red Area, we can see at a glance that the results of this study are significant (Reject \(H_0\)).

  4. State the statistical decision.

To reach a final decision, we evaluate whether the evidence is strong enough to reject the platform’s initial claim.

The Decision Rule

  • P-value (\(0.0329\)) < Alpha (\(0.05\)): The result is statistically significant.
  • Z-statistic (\(-2.13\)): This falls outside the “Fail to Reject” range (which is between \(-1.96\) and \(+1.96\) for a 95% confidence level).
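The critical value \(\pm 1.96\) used in the rule above is the 97.5th percentile of the standard normal distribution. A short sketch of the critical-value check, with illustrative variable names:

```python
from statistics import NormalDist

alpha = 0.05
z = (116 - 120) / (15 / 64 ** 0.5)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value, about 1.96
reject_h0 = abs(z) > z_crit

print(f"|z| = {abs(z):.3f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```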

Final Decision

Reject the Null Hypothesis (\(H_0\))

Formal Interpretation

There is sufficient evidence at the \(\alpha = 0.05\) significance level to conclude that the average daily study time of the digital learning platform’s users is significantly different from 120 minutes.

The data suggests that the true population mean is likely lower than what the platform claims, given that our sample mean of 116 minutes resulted in a Z-score that reached the “rejection region.”

Hypothesis Test Decision Criteria

| Metric | Observed Value | Decision Rule | Conclusion |
|---|---|---|---|
| Z-Score | −2.13 | z < −1.96 or z > 1.96 | Reject H₀ |
| P-value | 0.0329 | p < 0.05 | Statistically Significant |

Note: The absolute Z-score (|−2.13| = 2.13) exceeds the critical value of 1.96, and the p-value is less than 0.05, providing sufficient evidence to reject the null hypothesis at the 5% significance level.

Conclusion: The platform’s claim is not supported by the sample data.

  5. Interpret the result in a business analytics context.

Discrepancy in Performance Claims

The analysis indicates that the platform’s claim of 120 minutes of study time is not supported by the sample. The observed average of 116 minutes is unlikely to be a random fluctuation alone (p ≈ 0.033); it represents a statistically significant downward departure from the claimed benchmark.

  • Business Impact: If this 120-minute figure is used in marketing materials or investor pitches, it may be considered misleading. The data suggests the “real” user engagement is lower than advertised.

User Engagement Insights

The 4-minute gap (\(120 - 116\)) might seem small, but on a scale of thousands of users, it indicates a shift in engagement patterns.

  • Potential Causes: The test itself does not identify a cause. If engagement has in fact declined, this could be due to increased competition, a change in the difficulty of new content, or a segment of the user base experiencing “learning fatigue.”

Operational Recommendations

Based on this statistical decision, a business analyst would recommend the following actions:

  • Audit Marketing Materials: Update engagement statistics to reflect current data (e.g., “Users average nearly 2 hours daily”) to maintain transparency.

  • Investigate Retention: Since study time is often a leading indicator of subscription renewal, the platform should investigate if this decrease in time correlates with a higher churn rate.

  • A/B Testing: Implement new features (like gamification or reminders) to see if the average study time can be pushed back toward the 120-minute goal.

Business Intelligence: Study Time Analysis

| Analytics Component | Business Finding | Priority |
|---|---|---|
| Statistical Result | The 120-minute claim is rejected at the 5% significance level (p = 0.0329 < 0.05) | High |
| Magnitude of Error | Average deficit of 4 minutes (~3.3% below target) across user base | Medium |
| Actionable Insight | Review user experience to identify engagement barriers | High |
| Business Implication | Potential reputational risk if inaccurate claims continue | Critical |
| Recommendation | Conduct A/B testing on platform features to improve engagement | High |

Note: Based on a sample of 64 users with a mean study time of 116 minutes (σ = 15). Analysis conducted at α = 0.05.

2 Case Study 2

2.1 One-Sample T-Test (σ Unknown, Small Sample)

A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.

The following data are collected from 10 users:

\[ 9.2,\; 10.5,\; 9.8,\; 10.1,\; 9.6,\; 10.3,\; 9.9,\; 9.7,\; 10.0,\; 9.5 \]

2.2 Tasks

  1. Define H₀ and H₁ (two-tailed).
  2. Determine the appropriate hypothesis test.
  3. Calculate the t-statistic and p-value at \(\alpha = 0.05\).
  4. Make a statistical decision.
  5. Explain how sample size affects inferential reliability.

2.3 Answer Case Study 2

  1. Define H₀ and H₁ (two-tailed).

To investigate whether the average task completion time differs from the 10-minute benchmark, we define our statistical hypotheses based on the population mean (\(\mu\)). In a two-tailed test, we are looking for any significant difference (whether the users are faster or slower than the target) rather than a specific direction.

The Null Hypothesis (\(H_0\))

The null hypothesis represents the “status quo” or the assumption that there is no effect or change. In this case, it states that the true average task completion time is exactly 10 minutes.

\[H_0: \mu = 10\]

The Alternative Hypothesis (\(H_1\))

The alternative hypothesis represents the research claim we are testing for. Since this is a two-tailed test, it states that the true average task completion time is not equal to 10 minutes.

\[H_1: \mu \neq 10\]

Contextual Summary

  • \(H_0\): The new application’s average task completion time is 10 minutes.

  • \(H_1\): The new application’s average task completion time is significantly different from 10 minutes.

For a one-sample t-test like this, the next step is usually to calculate the Sample Mean (\(\bar{x}\)) and Sample Standard Deviation (\(s\)) to find the test statistic.

  2. Determine the appropriate hypothesis test.

Selected Test: One-Sample t-Test

For this scenario, the One-Sample t-Test is the appropriate choice. This test is used to determine whether a sample mean significantly differs from a hypothesized population mean when the population standard deviation (\(\sigma\)) is unknown.

The selection is based on four specific criteria present in your case study:

  • Unknown Population Standard Deviation (\(\sigma\)): The problem does not provide the standard deviation for all possible users; we only have the raw data from a small group.

  • Small Sample Size (\(n < 30\)): Only 10 users are tested. When the sample size is small and \(\sigma\) is unknown, the t-distribution is more accurate than the Z-distribution (Normal distribution).

  • Continuous Data: Task completion time is a ratio-level measurement (time), which is required for a t-test.

  • Comparison to a Fixed Standard: You are comparing the sample mean (\(\bar{x}\)) against a specific benchmark value (\(\mu_0 = 10\) minutes).

Statistical Test Parameters: One-Sample t-Test Setup

| Category | Parameter | Symbol | Value | Status |
|---|---|---|---|---|
| Hypothesis | Hypothesized Mean | μ₀ | 10 minutes | Given |
| Sample | Sample Size | n | 10 | Collected |
| Statistics | Degrees of Freedom | df | 9 | Calculated |
| Data | Data Type | | Continuous | Valid |
| Assumptions | Population SD | σ | Unknown | Use t-test |

Note: Since population standard deviation (σ) is unknown, a t-test (not z-test) is appropriate for hypothesis testing.

Statistical Calculation Results

Based on the data from the 10 users:

  • Sample Mean (\(\bar{x}\)): \(9.86\) minutes

  • Sample Standard Deviation (\(s\)): \(0.386\)

  • t-statistic (\(t_{obs}\)): \(-1.146\)

  • Degrees of Freedom (\(df\)): \(9\)

  • Critical Value (\(t_{crit}\)): \(\pm 2.262\) (for \(\alpha = 0.05\), two-tailed)

Conclusion: Since the calculated \(t\) value (\(-1.146\)) is between the critical values (\(-2.262\) and \(+2.262\)), we fail to reject \(H_0\). There is no significant difference between the average task completion time and the target of 10 minutes.

  3. Calculate the t-statistic and p-value at \(\alpha = 0.05\).

Descriptive Statistics

First, we calculate the mean (\(\bar{x}\)) and the sample standard deviation (\(s\)):

  • Sample Mean (\(\bar{x}\)):

\[\bar{x} = \frac{\sum x}{n} = \frac{9.2 + 10.5 + 9.8 + 10.1 + 9.6 + 10.3 + 9.9 + 9.7 + 10.0 + 9.5}{10} = \mathbf{9.86}\]

  • Sample Standard Deviation (\(s\)):

\[s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \approx \mathbf{0.386}\]
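The descriptive statistics can be checked with Python's `statistics` module, whose `stdev` function uses the \(n-1\) denominator shown in the formula. A minimal sketch; the list name `times` is illustrative.

```python
import statistics

# Task completion times (minutes) for the 10 users in the case study
times = [9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5]

x_bar = statistics.mean(times)   # sample mean
s = statistics.stdev(times)      # sample standard deviation (n - 1 denominator)

print(f"mean = {x_bar:.2f}, s = {s:.4f}")
```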

Calculation of t-statistic

The formula for the \(t\)-statistic is:

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

Where:

  • \(\bar{x} = 9.86\)

  • \(\mu_0 = 10\) (Hypothesis value)

  • \(s = 0.386\)

  • \(n = 10\)

Calculation steps:

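  • Calculate Standard Error (\(SE\)): \(SE = 0.3864 / \sqrt{10} = 0.3864 / 3.1623 = \mathbf{0.1222}\)

  • Calculate t-score: \(t = (9.86 - 10) / 0.1222 = -0.14 / 0.1222 = \mathbf{-1.146}\)

Carrying \(s\) to four decimals (\(0.3864\)) avoids the rounding drift that would otherwise give \(-1.147\).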
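The same calculation without intermediate rounding can be sketched in Python. Variable names are illustrative.

```python
import math
import statistics

times = [9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5]
mu0 = 10                      # hypothesized mean (minutes)
n = len(times)

se = statistics.stdev(times) / math.sqrt(n)   # standard error of the mean
t = (statistics.mean(times) - mu0) / se       # one-sample t statistic

print(f"SE = {se:.4f}, t = {t:.3f}, df = {n - 1}")
```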

Determining the p-value

To determine the \(p\)-value, we need to look at the \(t\)-distribution table or use the statistical function with the parameters:

  • t-statistic: \(-1.146\)

  • Degrees of Freedom (\(df\)): \(n - 1 = \mathbf{9}\)

  • Test Type: Two-tailed

Based on the calculation: p-value \(\approx \mathbf{0.281}\)
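Python's standard library has no t-distribution, so one way to verify this p-value without extra packages is to integrate the t density numerically. The sketch below uses Simpson's rule; the helper names `t_pdf` and `two_tailed_p` are hypothetical, and statistical software such as R would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3      # P(0 < T < |t|)
    return 1 - 2 * central_half

p_value = two_tailed_p(-1.146, df=9)
print(f"p = {p_value:.3f}")
```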
One-Sample t-Test: Results Summary

| Statistical Measure | Value | Interpretation |
|---|---|---|
| Test Statistic (t) | -1.146 | Standardized difference from μ₀ |
| Degrees of Freedom (df) | 9 | n - 1 = 10 - 1 |
| P-value (two-tailed) | 0.281 | Probability of a result at least this extreme if H₀ is true |
| Significance Level (α) | 0.05 | Maximum acceptable Type I error rate |
| Critical Value (±t*) | ±2.262 | Boundary of rejection region |
| 95% Confidence Interval | [9.58, 10.14] minutes | Plausible range for true population mean (9.86 ± 2.262 × 0.1222) |
| Statistical Decision | Fail to reject H₀ | p-value > α, so the null hypothesis is retained |
| Practical Conclusion | Insufficient evidence against claim | Data consistent with 10-minute average |

Note: H₀: μ = 10 minutes; sample size n = 10; two-tailed t-test.
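A 95% confidence interval for the mean can be computed directly from the sample. The sketch below reuses the critical value \(2.262\) quoted in this section rather than deriving it; the variable names are illustrative.

```python
import math
import statistics

times = [9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5]
mu0 = 10
t_crit = 2.262   # two-tailed critical value for df = 9, alpha = 0.05 (taken from the text)

x_bar = statistics.mean(times)
se = statistics.stdev(times) / math.sqrt(len(times))
lower, upper = x_bar - t_crit * se, x_bar + t_crit * se
contains_mu0 = lower <= mu0 <= upper   # True means H0 is not rejected

print(f"95% CI: [{lower:.2f}, {upper:.2f}], contains 10: {contains_mu0}")
```

An interval that contains 10 is consistent with the decision to fail to reject \(H_0\).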

Visual Analysis of the Plot

  • The Curve: Represents the range of possible outcomes if the null hypothesis (\(H_0\)) were true.

  • The Red Areas: These are the “Danger Zones” (Rejection Regions). If your blue line falls here, the result is significant.

  • The Blue Dashed Line: This represents your actual research result (\(t = -1.146\)).

  • The Verdict: Because the Blue Line is in the white area (the “Fail to Reject” zone) and not in the red areas, we conclude that the task completion time is not significantly different from 10 minutes.

  4. Make a statistical decision.

Decision Criteria

There are two ways to state the decision, both leading to the same result:

P-value Approach: If \(p\text{-value} \leq \alpha\), reject \(H_0\).

  • Our \(p\text{-value} = 0.281\)

  • \(0.281 > 0.05\)

Critical Value Approach: If \(|t_{obs}| > t_{crit}\), reject \(H_0\).

  • \(|{-1.146}| = 1.146\)

  • \(t_{crit}\) (for \(df=9, \alpha=0.05\), two-tailed) \(= 2.262\)

  • \(1.146 < 2.262\)

The Statistical Decision

Based on the evidence:

Fail to Reject the Null Hypothesis (\(H_0\))

Conclusion in Context

Since we failed to reject \(H_0\), the UX Research Team concludes that there is no statistically significant evidence to suggest that the average task completion time differs from 10 minutes.

The observed mean of \(9.86\) minutes is numerically lower than the target, but the difference is small enough that it could simply be due to random sampling variation rather than a true change in the application’s performance.

  5. Explain how sample size affects inferential reliability.

In inferential statistics, specifically within the context of a One-Sample t-Test, the sample size (\(n\)) is a critical lever that determines the reliability and sensitivity of your findings. Here is how the sample size of 10 users affects the reliability of your UX research:

Influence on the Standard Error

The reliability of an inference depends on the Standard Error (\(SE\)), which measures how much the sample mean is expected to vary from the true population mean.

\[SE = \frac{s}{\sqrt{n}}\]

  • Small Sample (\(n=10\)): Since you are dividing by a smaller number (\(\sqrt{10}\)), the \(SE\) remains relatively large. This creates “noise,” making it harder to distinguish between a real improvement in the app and random chance.

  • Large Sample: As \(n\) increases, the \(SE\) decreases, meaning your sample mean becomes a much more precise estimate of the true population behavior.
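To make the square-root relationship concrete, the sketch below holds the sample standard deviation fixed at the observed \(s = 0.386\) (an assumption, since a new sample would give a different \(s\)) and varies \(n\). The sample sizes other than 10 are hypothetical.

```python
import math

s = 0.386                      # sample SD from the case study, held fixed for illustration
sample_sizes = (10, 40, 160)   # 40 and 160 are hypothetical sample sizes

se_by_n = {n: s / math.sqrt(n) for n in sample_sizes}
for n, se in se_by_n.items():
    print(f"n = {n:>3}: SE = {se:.4f}")
```

Each fourfold increase in \(n\) halves the standard error, which is why precision improves slowly as samples grow.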

The Shape of the t-Distribution

Because \(\sigma\) is unknown and must be estimated from a small sample (\(n < 30\)), we use the t-distribution rather than the Z-distribution (normal distribution).

  • Degrees of Freedom (\(df\)): With only 9 degrees of freedom (\(n - 1\)), the t-distribution has “heavier tails” than the normal distribution.

  • Reliability Impact: Heavier tails mean that extreme values are more likely to occur by chance. To compensate for this uncertainty, the Critical Value becomes higher (in your case, \(\pm 2.262\) instead of \(\pm 1.96\)). This makes it harder to “reject the null hypothesis” unless the effect is large relative to the standard error.

Statistical Power and Type II Errors

The most significant impact on reliability here is Statistical Power.

  • The Risk: With only 10 users, the test has low power for small effects. This means there is a high risk of a Type II Error (failing to detect a real difference when one actually exists).

  • UX Context: If your new app actually reduced average completion time only slightly, for example to 9.8 minutes, a sample of 10 might not be enough to show it statistically. (A larger shift, such as to 9.5 minutes, would be about four standard errors from 10 given \(s \approx 0.386\) and would usually be detected.) You might wrongly conclude the app is “no different” from the benchmark simply because the sample was too small to provide “beyond a reasonable doubt” evidence.
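Power can be estimated by simulation. The sketch below is illustrative only: it assumes completion times are normally distributed, that the population SD equals the sample value \(0.386\), and that the true mean is a hypothetical 9.8 minutes. None of these are established by the case study.

```python
import math
import random

random.seed(0)      # fixed seed so the run is repeatable

mu0 = 10.0          # benchmark from the case study
true_mean = 9.8     # hypothetical true mean (assumption)
sd = 0.386          # assumes the population SD equals the sample SD
n = 10
t_crit = 2.262      # two-tailed critical value for df = 9, alpha = 0.05
trials = 20000

rejections = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    t = (mean - mu0) / (s / math.sqrt(n))
    if abs(t) > t_crit:
        rejections += 1

power = rejections / trials   # share of simulated studies that reject H0
print(f"Estimated power: {power:.2f}")
```

Under these assumptions the estimated power falls well below the conventional 0.80 target, which illustrates the Type II error risk described above.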

Statistical Implications of Sample Size

| Aspect | Small Sample (n=10) | Large Sample (n>30) | Implication |
|---|---|---|---|
| Estimation Precision | Low - High uncertainty | High - Low uncertainty | Interpret with caution |
| Sampling Distribution | t-distribution (df=9) | Approaches normal (z) | Different critical values |
| Effect Detection | Only large effects detectable | Small effects detectable | Practical vs statistical significance |
| Outlier Impact | Significant distortion possible | Minimal impact | Check robustness |
| Statistical Test | t-test mandatory | z-test often acceptable | Test selection matters |
| Confidence Intervals | Wide (e.g., ±3.5 units) | Narrow (e.g., ±1.0 units) | Precision of estimates |
| Type II Error Risk | High (Poor power) | Low (Good power) | False negative risk |
| Cost & Feasibility | Low cost, quick | High cost, time-consuming | Study design trade-off |

Note: The ‘n > 30’ rule is a guideline; in practice, sample size requirements depend on effect size, variability, and desired power.

3 Case Study 3

3.1 Two-Sample T-Test (A/B Testing)

A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.

| Version | Sample Size (n) | Mean | Standard Deviation |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |

3.2 Tasks

  1. Formulate the null and alternative hypotheses.
  2. Identify the type of t-test required.
  3. Compute the test statistic and p-value.
  4. Draw a statistical conclusion at \(\alpha = 0.05\).
  5. Interpret the result for product decision-making.

3.3 Answer Case Study 3

  1. Formulate the null and alternative hypotheses.

Formulating the Hypotheses

To perform a two-sample t-test, we first define what we are testing for. We use \(\mu_A\) to represent the true population mean of Version A and \(\mu_B\) for Version B.

  • Null Hypothesis (\(H_0\))

The null hypothesis assumes there is no difference in mean session duration between the two landing page versions. Any observed difference in the sample means is due to random chance.

  • \(H_0: \mu_A = \mu_B\) (or \(\mu_A - \mu_B = 0\))

  • Alternative Hypothesis (\(H_1\))

The alternative hypothesis states that the two versions differ in mean session duration. Since the task doesn’t specify a direction (e.g., “B is better than A”), we use a two-tailed test.

  • \(H_1: \mu_A \neq \mu_B\) (or \(\mu_A - \mu_B \neq 0\))
A/B Testing: Version Comparison Statistics

| Parameter | Version A (Control Group) | Version B (Test Group) | Difference (Absolute Δ) | % Change (Relative Δ) |
|---|---|---|---|---|
| Sample Mean (x̄) | 4.8 | 5.4 | +0.6 | +12.5% |
| Standard Deviation (s) | 1.2 | 1.4 | +0.2 | +16.7% |
| Sample Size (n) | 25 | 25 | 0 | 0% |
| Standard Error | 0.240 | 0.280 | +0.040 | +16.7% |
| Variance (s²) | 1.44 | 1.96 | +0.52 | +36.1% |
| Minimum* | ~2.4 | ~2.6 | +0.2 | +8.3% |
| Maximum* | ~7.2 | ~8.2 | +1.0 | +13.9% |
| Range* | ~4.8 | ~5.6 | +0.8 | +16.7% |

Note: *Minimum, Maximum, and Range are estimates based on a normal distribution approximation (±2 SD from the mean), not observed values. n = 25 for both groups. The data suggest Version B has a higher mean but also higher variability.

Next Steps in the Analysis

To complete this A/B test, the typical next steps would involve:

  • Calculating the Pooled Standard Deviation (appropriate if the two population variances can be assumed equal).

  • Computing the T-statistic using the formula:

\[t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}\]

With equal group sizes (\(n_A = n_B\)), this standard error is identical to the pooled one.

  • Determining the p-value based on degrees of freedom (\(df = n_A + n_B - 2\)).

  2. Identify the type of t-test required.

Type of T-Test: Independent Two-Sample T-Test

For this A/B test, you should use an Independent Two-Sample T-Test (also known as an Unpaired T-Test).

This is the correct choice because:

  • Independent Groups: The participants in Version A are different individuals from those in Version B. The session duration of one user does not influence or relate to the session duration of another.

  • Comparison of Means: The goal is to compare the average (mean) of a continuous variable (session duration) between two distinct categories (Version A vs. Version B).

  • Sample Size: With \(n = 25\) for each group, the sample size is relatively small (typically \(n < 30\) is the threshold where T-tests are prioritized over Z-tests), and the population standard deviations are unknown, making the T-distribution appropriate.

Sub-type Selection: Student’s vs. Welch’s

Within the category of independent T-tests, there are two common variations based on the variance (Standard Deviation squared) of the samples:

Statistical Test Selection: A/B Testing Options

| Test Type | Key Assumption | Variance Check | Our Case Assessment | Recommendation |
|---|---|---|---|---|
| Pooled (Student’s) t-test | Equal population variances (homoscedasticity) | Required: F-test or Levene’s test for equal variances | SDs: 1.2 vs 1.4 (ratio = 1.17 < 2) - Likely acceptable | Acceptable for preliminary analysis |
| Welch’s (Unequal Variance) t-test | No equal variance assumption (heteroscedasticity allowed) | Not required - automatically adjusts for unequal variances | Default recommendation - More conservative approach | Preferred for final results |
| Mann-Whitney U test | No normality assumption (non-parametric) | Not applicable - works on ranks, not raw values | Fallback if normality violated (n=25 may need checking) | Consider if data non-normal |

Note: Rule of thumb: If larger SD / smaller SD < 2, the pooled t-test is often acceptable. In our case: 1.4/1.2 = 1.17.

Note: Because the sample sizes are exactly equal (\(n=25\) for both), the standard Student’s T-test is highly robust even if the variances differ slightly.

  3. Compute the test statistic and p-value.

Calculate the Standard Error (\(SE\))

Since the sample sizes are equal (\(n_A = n_B = 25\)), the standard error for the difference between means is calculated as:

\[SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\]

\[SE = \sqrt{\frac{1.2^2}{25} + \frac{1.4^2}{25}}\]

\[SE = \sqrt{\frac{1.44}{25} + \frac{1.96}{25}} = \sqrt{0.0576 + 0.0784}\]

\[SE = \sqrt{0.136} \approx 0.3688\]

Compute the Test Statistic (\(t\))

The \(t\)-statistic measures how many standard errors the difference in sample means is from zero:

\[t = \frac{\bar{x}_A - \bar{x}_B}{SE}\]

\[t = \frac{4.8 - 5.4}{0.3688}\]

\[t = \frac{-0.6}{0.3688} \approx -1.627\]
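The standard error and test statistic can be reproduced from the summary statistics alone. A minimal sketch with illustrative variable names:

```python
import math

# Summary statistics from the A/B test table
n_a, mean_a, sd_a = 25, 4.8, 1.2
n_b, mean_b, sd_b = 25, 5.4, 1.4

se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)   # SE of the difference in means
t = (mean_a - mean_b) / se                          # two-sample t statistic (A - B)

print(f"SE = {se:.4f}, t = {t:.3f}")
```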

Determine Degrees of Freedom (\(df\))

For a pooled two-sample t-test:

\[df = n_A + n_B - 2 = 25 + 25 - 2 = 48\]
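For comparison, Welch's test replaces the pooled degrees of freedom with the Welch-Satterthwaite approximation. The sketch below computes both so the difference is visible; it is an illustration, not part of the original calculation.

```python
n_a, sd_a = 25, 1.2
n_b, sd_b = 25, 1.4

va, vb = sd_a ** 2 / n_a, sd_b ** 2 / n_b   # per-group variance of the mean

df_pooled = n_a + n_b - 2
# Welch-Satterthwaite approximation (no equal-variance assumption)
df_welch = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))

print(f"pooled df = {df_pooled}, Welch df = {df_welch:.1f}")
```

With equal group sizes and similar standard deviations, the two values are close, so the choice has little effect on the p-value here.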

Calculate the P-value

Using a \(t\)-distribution table or statistical software for a two-tailed test with \(t = -1.627\) and \(df = 48\):

\(p\text{-value} \approx 0.1103\)
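This p-value can be checked without external packages by numerically integrating the t density. The helper names `t_pdf` and `two_tailed_p` are hypothetical; dedicated statistical software would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

p_pooled = two_tailed_p(-1.627, df=48)     # pooled degrees of freedom
p_welch = two_tailed_p(-1.627, df=46.9)    # approximate Welch degrees of freedom
print(f"pooled p = {p_pooled:.4f}, Welch p = {p_welch:.4f}")
```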

Two-Sample t-Test (Pooled): Results Summary

| Statistical Measure | Value | Interpretation |
|---|---|---|
| Test Statistic (t-value) | -1.627 | Standardized difference between groups (A - B) |
| Degrees of Freedom (df) | 48 | Pooled: n_A + n_B - 2 (the Welch-Satterthwaite approximation gives about 46.9) |
| P-value (two-tailed) | 0.1103 | Probability of a result at least this extreme if H₀ is true |
| Significance Level (α) | 0.05 | Maximum acceptable Type I error rate |
| Critical Value (t*) | ±2.011 | Boundary for rejection region (two-tailed) |
| Mean Difference Estimate | 0.600 | Sample mean difference (B - A) |
| Standard Error | 0.369 | Standard error of the difference |
| Effect Size (Cohen’s d) | 0.460 | Small-to-medium effect size (but not statistically significant) |
| Statistical Decision | Fail to reject H₀ | Insufficient evidence against null hypothesis |
| Practical Significance | Not established | Difference may not be practically important |

Note: H₀: μ_A = μ_B vs H₁: μ_A ≠ μ_B | Test: two-sample t-test with pooled df = 48. Because the group sizes are equal, the pooled and Welch standard errors coincide, so t is identical under both tests; only the degrees of freedom differ slightly.
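Cohen's d divides the mean difference by the pooled standard deviation. A minimal sketch of that calculation from the summary statistics, with illustrative variable names:

```python
import math

n_a, mean_a, sd_a = 25, 4.8, 1.2
n_b, mean_b, sd_b = 25, 5.4, 1.4

# Pooled standard deviation, weighting each variance by its degrees of freedom
sd_pooled = math.sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2)
                      / (n_a + n_b - 2))
cohens_d = (mean_b - mean_a) / sd_pooled   # standardized difference (B - A)

print(f"pooled SD = {sd_pooled:.3f}, d = {cohens_d:.3f}")
```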

Interpretation: At a standard significance level of \(\alpha = 0.05\), since the \(p\)-value (\(0.1103\)) is greater than \(0.05\), we fail to reject the null hypothesis. There is not enough statistical evidence to conclude that there is a significant difference in average session duration between Version A and Version B.

Visualization Explanation:

  • Blue Curve: Shows the t-distribution with 48 degrees of freedom.

  • Red Area: This is the Rejection Region. If the black line falls within this area, the difference is statistically significant.

  • Black Dashed Line: This is the T-statistic (-1.627) we calculated.

Visual Conclusion: Since the black line falls outside the red area (in the white center), we can visually see that the difference in session duration between Versions A and B is not strong enough to be considered significant at the 5% level.

  4. Draw a statistical conclusion at \(\alpha = 0.05\).

Comparison of Values

Based on the calculation in the previous step:

  • P-value: \(0.1103\)

  • Significance Level (\(\alpha\)): \(0.05\)

  • T-statistic: \(-1.627\)

  • T-critical (\(df=48, \alpha=0.05\)): \(\pm 2.011\)

Hypothesis Test

We use the standard decision rule:

  • P-value Criteria: If \(P \leq \alpha\), Reject \(H_0\). If \(P > \alpha\), Fail to Reject \(H_0\). Here \(0.1103 > 0.05 \rightarrow\) Fail to Reject \(H_0\).

  • T-statistic Criteria: If \(|t| > t_{crit}\), Reject \(H_0\). Here \(|-1.627| = 1.627 < 2.011 \rightarrow\) Fail to Reject \(H_0\).

Conclusion: At the \(5\%\) significance level, we do not have sufficient statistical evidence to reject the null hypothesis (\(H_0\)).

Meaning: The difference in mean session duration between Version A (4.8 minutes) and Version B (5.4 minutes) is not statistically significant. The observed difference is compatible with random variation in the sample, so it cannot be attributed to a real difference in landing page performance.

Business Recommendation

Although Version B has a nominally higher mean, the product team should:

  • Increase the sample size (\(n\)): With \(n=25\), statistical power may be too low to detect small differences.

  • Run the test longer: To see if this trend becomes significant as the data grows.

  • Not immediately change the design: Based on this data alone, there is no guarantee that Version B will perform better in the general population.

  5. Interpret the result for product decision-making.

Don’t Over-Read the Raw Difference

Nominally, Version B appears to be 12.5% better (a difference of 0.6 minutes). However, because the test results are not significant, we cannot be sure that this increase will persist if the feature is rolled out to all users. There is a real possibility that this difference is just “noise.”

Evaluate the Effect of Sample Size

The sample size of \(n = 25\) per group is very small for an A/B test on a digital platform.

  • Problem: This statistical test may have Low Power, meaning the test is not sensitive enough to detect a true difference.

  • Action: If the operational costs of Version B are not significantly higher than Version A, do not dismiss Version B out of hand. Continue testing with a larger sample size (e.g., \(n = 100\) or more) to get more conclusive results.
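A rough sample-size target can be obtained from the standard normal-approximation formula \(n \approx 2(z_{1-\alpha/2} + z_{\text{power}})^2 / d^2\) per group. The sketch below assumes the true standardized effect equals the observed \(d \approx 0.46\) and targets 80% power; both are assumptions, and exact t-based planning tools give slightly larger numbers.

```python
import math
from statistics import NormalDist

alpha = 0.05
target_power = 0.80
d = 0.46   # assumes the true effect equals the observed standardized difference

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
z_beta = NormalDist().inv_cdf(target_power)     # quantile matching the target power

n_per_group = math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)
print(f"Approximate users needed per group: {n_per_group}")
```

Under these assumptions the formula gives about 75 users per group, which is in line with the suggestion to test with \(n = 100\) or more.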

Consider Switching Costs

In product management, every change has a cost (developer time, bug risk, user confusion).

  • Decision: If Version B requires significant effort to implement permanently, then stick with Version A (status quo). There is no compelling data-driven reason to take on the technical risk if its performance has not been proven to be significantly superior.

Variability Between Versions

The standard deviation of Version B (1.4) is higher than that of Version A (1.2). This suggests that user behavior in Version B is more inconsistent. It is possible that some users really like it (very long duration), while others are confused (very short duration).
Strategic Recommendations: A/B Testing Follow-up Actions

| Recommendation | Key Action | Rationale | Success Metrics | Owner |
|---|---|---|---|---|
| Status Quo (Hold) | Maintain current deployment | Statistical evidence insufficient (p=0.1103, CI includes 0) | No negative impact on key metrics | Product Manager |
| Test Iteration | Run powered follow-up study | Current study underpowered (35%) - high Type II error risk | Achieve 80% statistical power | Data Scientist |
| Qualitative Analysis | Conduct user experience research | Higher variance in Version B (SD=1.4) suggests inconsistent user experience | Identify root cause of variance increase | UX Researcher |
| Cost-Benefit Analysis | Calculate ROI of potential change | Need business justification for 0.6-minute improvement | Clear ROI calculation | Business Analyst |
| Implementation Roadmap | Develop phased rollout plan | Prepare for potential future deployment based on new evidence | Documented deployment plan | Project Manager |

Note: Based on A/B test results: t(48) = -1.627, p = 0.1103, mean difference = 0.6 minutes
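The statement that the confidence interval includes 0 can be checked with a short calculation. The sketch reuses the critical value \(2.011\) quoted in this case study; variable names are illustrative.

```python
import math

diff = 5.4 - 4.8                                # mean difference (B - A)
se = math.sqrt(1.2 ** 2 / 25 + 1.4 ** 2 / 25)   # SE of the difference
t_crit = 2.011                                  # two-tailed critical value for df = 48 (from the text)

lower, upper = diff - t_crit * se, diff + t_crit * se
includes_zero = lower <= 0 <= upper

print(f"95% CI for B - A: [{lower:.2f}, {upper:.2f}], includes 0: {includes_zero}")
```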

4 Case Study 4

4.1 Chi-Square Test of Independence

An e-commerce company examines whether device type is associated with payment method preference.

| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |

4.2 Tasks

  1. State the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
  2. Identify the appropriate statistical test.
  3. Compute the Chi-Square statistic (χ²).
  4. Determine the p-value at \(\alpha = 0.05\).
  5. Interpret the results in terms of digital payment strategy.

4.3 Answer Case Study 4

  1. State the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).

Null Hypothesis (\(H_0\))

The null hypothesis assumes that there is no relationship between the variables. In other words, the variables are independent.

  • \(H_0\): Device type and payment method preference are independent.

  • (Put simply: The type of device a customer uses does not affect which payment method they choose.)

Alternative Hypothesis (\(H_1\))

The alternative hypothesis assumes that there is a relationship between the variables. This suggests that the distribution of one variable depends on the other.

  • \(H_1\): Device type and payment method preference are dependent (associated).

  • (Put simply: The choice of payment method depends on whether the customer is using a mobile device or a desktop.)

Understanding the Test

To validate these hypotheses, you would typically calculate the Expected Frequencies for each cell in your table using the formula:

\[E = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}}\]

Then, you would use the Chi-Square formula to find the test statistic (\(\chi^2\)):

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

Where \(O\) is the Observed frequency (the numbers provided in your table) and \(E\) is the Expected frequency.

  2. Identify the appropriate statistical test.

The appropriate test is the Chi-Square Test of Independence, for the following reasons.

Categorical Variables: Both variables are categorical (nominal).

  • Variable 1 (Device Type): Mobile or Desktop.

  • Variable 2 (Payment Method): E-Wallet, Credit Card, or Cash on Delivery.

Relationship Testing: The goal is to determine if there is a significant association or dependency between these two variables, rather than comparing means (which would require a t-test or ANOVA).

Contingency Table Format: Your data is organized in a \(2 \times 3\) contingency table (2 rows for devices and 3 columns for payment methods), which is the standard input for this test.

Key Assumptions for this Test

To ensure the results of this test are valid, the following conditions should be met:

  • Independence: The observations are independent of each other (one customer’s choice doesn’t affect another’s).

  • Sample Size: The expected frequency in each cell should generally be 5 or greater.

  • Random Sampling: The data should be collected from a random sample of the population.

  3. Compute the Chi-Square statistic (χ²).

To compute the Chi-Square statistic (\(\chi^2\)), we follow these steps:

Calculate Row and Column Totals

First, we find the sums for each row and column to determine the grand total.

Contingency Table: Device Type × Payment Method (N = 440)

| Device | E-Wallet | Credit Card | COD | Row Total |
|---|---|---|---|---|
| Mobile | 120 | 80 | 50 | 250 |
| Desktop | 60 | 90 | 40 | 190 |
| Column Total | 180 | 170 | 90 | 440 |

Note: N = 440 total transactions.

Calculate Expected Frequencies (\(E\))

The expected frequency for each cell is calculated using the formula:

\[E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total (N)}}\]

  • Mobile / E-Wallet: \(\frac{250 \times 180}{440} \approx 102.27\)

  • Mobile / Credit Card: \(\frac{250 \times 170}{440} \approx 96.59\)

  • Mobile / COD: \(\frac{250 \times 90}{440} \approx 51.14\)

  • Desktop / E-Wallet: \(\frac{190 \times 180}{440} \approx 77.73\)

  • Desktop / Credit Card: \(\frac{190 \times 170}{440} \approx 73.41\)

  • Desktop / COD: \(\frac{190 \times 90}{440} \approx 38.86\)

Expected Frequencies (Under Independence Assumption)

| Device | E-Wallet | Credit Card | COD | Row Total |
|---|---|---|---|---|
| Mobile | 102.27 | 96.59 | 51.14 | 250 |
| Desktop | 77.73 | 73.41 | 38.86 | 190 |
| Column Total | 180.00 | 170.00 | 90.00 | 440 |

Note: Expected = (Row Total × Column Total) / Grand Total
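
The expected frequencies above can be reproduced with a few lines of code. The following is a minimal sketch in plain Python (standard library only); the variable names are illustrative and not part of the original analysis.

```python
# Observed counts from the contingency table
# (rows: Mobile, Desktop; columns: E-Wallet, Credit Card, COD).
observed = [
    [120, 80, 50],
    [60, 90, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_ij = (row total i * column total j) / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Mobile", "Desktop"], expected):
    print(label, [round(e, 2) for e in row])

# The "expected frequency of at least 5" assumption can be checked directly.
print("Smallest expected count:", round(min(min(row) for row in expected), 2))
```

Every expected count is far above 5, so the sample-size assumption of the test is comfortably met for this table.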

Compute the Chi-Square Statistic (\(\chi^2\))

Using the formula \(\chi^2 = \sum \frac{(O - E)^2}{E}\), where \(O\) is the observed frequency:

Step 1: Calculate Each Part

We calculate the contribution of each cell one by one:

Part 1 (Mobile / E-Wallet): \[\frac{(120 - 102.27)^2}{102.27} = \frac{(17.73)^2}{102.27} = \frac{314.35}{102.27} \approx \mathbf{3.074}\]

Part 2 (Mobile / Credit Card): \[\frac{(80 - 96.59)^2}{96.59} = \frac{(-16.59)^2}{96.59} = \frac{275.23}{96.59} \approx \mathbf{2.849}\]

Part 3 (Mobile / COD): \[\frac{(50 - 51.14)^2}{51.14} = \frac{(-1.14)^2}{51.14} = \frac{1.30}{51.14} \approx \mathbf{0.025}\]

Part 4 (Desktop / E-Wallet): \[\frac{(60 - 77.73)^2}{77.73} = \frac{(-17.73)^2}{77.73} = \frac{314.35}{77.73} \approx \mathbf{4.044}\]

Part 5 (Desktop / Credit Card): \[\frac{(90 - 73.41)^2}{73.41} = \frac{(16.59)^2}{73.41} = \frac{275.23}{73.41} \approx \mathbf{3.749}\]

Part 6 (Desktop / COD): \[\frac{(40 - 38.86)^2}{38.86} = \frac{(1.14)^2}{38.86} = \frac{1.30}{38.86} \approx \mathbf{0.033}\]

Step 2: Add Up All the Results

Now add up all the calculated values: \[\chi^2 = 3.074 + 2.849 + 0.025 + 4.044 + 3.749 + 0.033\]

\[\chi^2 \approx 13.774\]

Chi-Square Statistic Calculation

| Cell | Observed (O) | Expected (E) | O - E | (O - E)² | (O - E)² / E |
|---|---|---|---|---|---|
| Mobile / E-Wallet | 120 | 102.27 | 17.73 | 314.35 | 3.074 |
| Mobile / Credit Card | 80 | 96.59 | -16.59 | 275.23 | 2.849 |
| Mobile / COD | 50 | 51.14 | -1.14 | 1.30 | 0.025 |
| Desktop / E-Wallet | 60 | 77.73 | -17.73 | 314.35 | 4.044 |
| Desktop / Credit Card | 90 | 73.41 | 16.59 | 275.23 | 3.749 |
| Desktop / COD | 40 | 38.86 | 1.14 | 1.30 | 0.033 |
| TOTAL (χ²) | | | | | 13.774 |

Note: χ² = Σ [(Oᵢ - Eᵢ)² / Eᵢ] where O = Observed, E = Expected
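
As a cross-check on the hand calculation, the statistic can be computed without intermediate rounding. This is a minimal sketch in plain Python; it recomputes the expected counts from the observed table, so small differences in the third decimal place relative to the rounded hand calculation are expected.

```python
# Observed counts (rows: Mobile, Desktop; columns: E-Wallet, Credit Card, COD).
observed = [[120, 80, 50], [60, 90, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi_square += (o - e) ** 2 / e          # this cell's contribution

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.4f}, df = {df}")
```

The unrounded result agrees with the hand-computed total of about 13.77.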

  4. Determine the p-value at \(\alpha = 0.05\).

To determine the p-value for this Chi-Square Test of Independence, we use the previously calculated Chi-Square statistic (\(\chi^2 \approx 13.774\)) and the degrees of freedom.

Degrees of Freedom (\(df\))

The degrees of freedom for a contingency table are calculated as:

\[df = (\text{Rows} - 1) \times (\text{Columns} - 1)\]

\[df = (2 - 1) \times (3 - 1) = 1 \times 2 = 2\]

P-Value Calculation

Using the Chi-Square distribution with \(2\) degrees of freedom and a test statistic of \(13.774\):

  • Chi-Square Statistic (\(\chi^2\)): \(13.7736\)

  • P-value: \(\approx 0.0010\)

Conclusion at \(\alpha = 0.05\)

  • P-value (\(0.0010\)) < \(\alpha\) (\(0.05\))

Since the p-value is well below the significance level of \(0.05\), we reject the null hypothesis (\(H_0\)).

Result: The p-value is approximately \(0.0010\). This indicates a statistically significant association between the device type used and the customer’s payment method preference.
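
The p-value can be verified without statistical tables. The sketch below relies on a special property of this case: with exactly 2 degrees of freedom, the chi-square distribution is an exponential distribution with mean 2, so the upper-tail probability is \(e^{-x/2}\). This shortcut does not hold for other degrees of freedom, where a statistics library would be needed.

```python
import math

chi_square = 13.7736   # statistic from the previous step
df = 2
alpha = 0.05

# Valid for df = 2 only: P(X > x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)

# The same identity gives the critical value at alpha: x_crit = -2 * ln(alpha).
critical_value = -2 * math.log(alpha)

print(f"p-value        = {p_value:.4f}")
print(f"critical value = {critical_value:.3f}")
print("Decision:", "Reject H0" if p_value < alpha else "Fail to reject H0")
```

Both routes lead to the same decision: the p-value is below 0.05, and the statistic exceeds the critical value of about 5.99.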

Visualization Explanation:

  • X-axis (Horizontal): Displays payment method categories (E-Wallet, Credit Card, COD).

  • Y-axis (Vertical): Shows the number of transactions or observation frequency.

  • Bar Color: Distinguishes between mobile and desktop users.

  • Interactivity: Because Plotly is used, you can hover over each bar in RStudio Viewer to see the exact numbers in detail.

  5. Interpret the results in terms of digital payment strategy.

The Chi-Square test (\(\chi^2 \approx 13.77\), \(df = 2\), \(p \approx 0.001\)) provides strong evidence of an association between device type and payment method. The test establishes that an association exists, not what causes it, so the strategies below should be read as hypotheses to validate. Here is how these results can inform a digital payment strategy for an e-commerce company:

“Mobile-First” Optimization for E-Wallet

Mobile users choose E-Wallets far more often than desktop users do: 120 of 250 mobile transactions (48%) versus 60 of 190 desktop transactions (about 32%).

  • Strategy: Companies should ensure the e-wallet checkout process in mobile apps is very smooth (such as one-click payment features or biometric integration). Since mobile users tend to prefer speed, e-wallet cashback promotional campaigns should be targeted specifically at mobile app users.

Desktop as a High-Value Transaction Hub (Credit Card)

Credit Card is the most common method on desktop (90 of 190 transactions, about 47%), ahead of E-Wallet (60). On mobile, Credit Card accounts for only 80 of 250 transactions (32%).

  • Strategy: Desktop users may feel safer making large transactions or entering card details on a larger screen. The right strategy is to highlight security features (such as the Verified by Visa or Mastercard ID Check logo) and offer a 0% installment program that is more visible on the desktop web display.

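Cash on Delivery (COD) Evaluation

COD has the lowest share on both devices (50 of 250 mobile transactions, 20%; 40 of 190 desktop transactions, about 21%), but it still accounts for roughly one in five transactions overall.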

  • Strategy: Because COD is operationally riskier (high returns), companies can slowly shift COD users on mobile to e-wallets by providing “Free Shipping” incentives only for digital payments.
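
Because the two device groups differ in size (250 vs 190 transactions), raw counts can mislead; within-device shares are a fairer basis for the comparisons above. A minimal sketch in plain Python, using only the observed table:

```python
# Observed counts per device, from the contingency table.
observed = {
    "Mobile": {"E-Wallet": 120, "Credit Card": 80, "COD": 50},
    "Desktop": {"E-Wallet": 60, "Credit Card": 90, "COD": 40},
}

shares = {}
for device, counts in observed.items():
    total = sum(counts.values())
    # Share of each payment method within this device group.
    shares[device] = {method: count / total for method, count in counts.items()}

for device, device_shares in shares.items():
    summary = ", ".join(f"{m}: {s:.1%}" for m, s in device_shares.items())
    print(f"{device:<8} {summary}")
```

The shares show the pattern behind the significant test result: E-Wallet leads on mobile, Credit Card leads on desktop, and COD is nearly identical across devices.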

UI/UX Personalization Based on Device

The results of this independence test show that consumer behavior differs depending on their “entry point”.

Strategy:

  • On Mobile: Display E-Wallet as the main payment option (topmost).

  • On Desktop: Give display priority to Credit Card or Bank Transfer options that require more detailed verification.

Device-Based Marketing & UX Strategy

| Device Segment | Primary Focus | Core Value Proposition | Key Tactics | Expected Impact |
|---|---|---|---|---|
| 📱 Mobile | Speed & Convenience | Quick, frictionless transactions for on-the-go users | E-Wallet integration; Biometric authentication; One-click checkout; Push notifications | ↑ Mobile conversion by 15-20%; ↓ Cart abandonment by 25%; ↑ App engagement by 30% |
| 🖥️ Desktop | Security & Large Transactions | Secure, informed purchasing for high-value transactions | Credit card promotions; Installment plans; Detailed product info; Security badges | ↑ Average order value by 25-30%; ↑ Credit card usage by 40%; ↑ Customer trust scores |
| 📊 Overall Strategy | Personalized User Experience | Seamless cross-device journey with tailored offerings | Unified user profile; Cross-device sync; Personalized recommendations; Loyalty integration | ↑ Customer lifetime value by 35%; ↑ Retention rate by 20%; ↑ Cross-sell success by 25% |

Note: Based on chi-square analysis showing a significant device × payment method association (χ² = 13.77, p < 0.01).

5 Case Study 5

5.1 Type I and Type II Errors (Conceptual)

A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.

  • H₀: The new algorithm does not reduce fraud.
  • H₁: The new algorithm reduces fraud.

5.2 Tasks

  1. Explain a Type I Error (α) in this context.
  2. Explain a Type II Error (β) in this context.
  3. Identify which error is more costly from a business perspective.
  4. Discuss how sample size affects Type II Error.
  5. Explain the relationship between α, β, and statistical power.

5.3 Answer Case Study 5

  1. Explain a Type I Error (α) in this context.

Definition of Type I Error (\(\alpha\))

A Type I Error occurs when the null hypothesis (\(H_0\)) is true, but we incorrectly reject it. It is essentially a “false positive” at the strategic level.

In This Context:

In your fraud detection case study, a Type I Error would look like this:

  • The Reality: The new algorithm is not effective. It does not actually reduce the rate of fraudulent transactions compared to the current system. (\(H_0\) is true).

  • The Decision: Based on the test data, the startup concludes that the algorithm is effective and reduces fraud. (Reject \(H_0\)).

  • The Error: The startup believes they have found a “winner” and implements the new algorithm, when in fact, the results they saw were likely due to random chance or a biased sample.

Business Consequences of a Type I Error

For a fintech startup, making a Type I Error can be quite costly because it leads to unjustified action. The consequences include:

  • Wasted Resources: The company spends time, engineering effort, and money deploying and maintaining an algorithm that provides no real benefit.

  • False Sense of Security: The startup may stop looking for other fraud solutions, believing the problem is solved, while fraudulent transactions continue at the same rate.

  • Operational Risk: If the new algorithm is “stricter” but not “smarter,” it might increase False Positives at the transaction level (blocking legitimate customers), leading to customer frustration and lost revenue, without actually catching more fraud.

  • Key Takeaway: If the startup sets a significance level of \(\alpha = 0.05\), it accepts a 5% chance of rejecting \(H_0\) when the algorithm truly has no effect. In other words, if the algorithm is a “dud,” there is still a 5% chance that the test will recommend adopting it.
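
The meaning of \(\alpha\) as a long-run false-positive rate can be illustrated by simulation. The sketch below is a simplified stand-in for the fraud experiment, not the startup's actual test: it assumes a one-sided z-test on a standardised "fraud reduction" metric with known \(\sigma = 1\), and the sample size and number of repetitions are hypothetical.

```python
import random
from statistics import NormalDist

random.seed(42)  # fixed seed so the sketch is reproducible

ALPHA = 0.05
N_PER_TEST = 30        # hypothetical observations per experiment
N_EXPERIMENTS = 5000   # hypothetical number of repeated experiments

# One-sided critical value: reject H0 when z exceeds it.
z_crit = NormalDist().inv_cdf(1 - ALPHA)

false_positives = 0
for _ in range(N_EXPERIMENTS):
    # H0 is true here: the real mean fraud reduction is 0 (sigma = 1).
    sample = [random.gauss(0, 1) for _ in range(N_PER_TEST)]
    z = (sum(sample) / N_PER_TEST) * N_PER_TEST ** 0.5
    if z > z_crit:
        false_positives += 1   # the test "adopts" a useless algorithm

type_i_rate = false_positives / N_EXPERIMENTS
print(f"Observed Type I error rate: {type_i_rate:.3f} (target alpha = {ALPHA})")
```

Across many repetitions, the share of experiments that wrongly reject \(H_0\) settles close to the chosen \(\alpha\).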

  2. Explain a Type II Error (β) in this context.

Definition of Type II Error (\(\beta\))

A Type II Error occurs when the null hypothesis (\(H_0\)) is false, but we fail to reject it. In simpler terms, the new algorithm actually works, but the test results are not strong enough to show it, so the startup treats it as a failure.

In This Context:

In your fraud detection case study, a Type II Error would look like this:

  • The Reality: The new algorithm is highly effective. It truly reduces fraud and would save the company significant money. (\(H_0\) is false; \(H_1\) is true).

  • The Decision: Based on the test data (perhaps due to a small sample size or a very strict significance level), the startup concludes that the algorithm does not show a significant improvement. (Fail to reject \(H_0\)).

  • The Error: The startup shelves a superior piece of technology and continues using an inferior or outdated fraud detection system.

Business Consequences of a Type II Error

While a Type I error leads to “wasted effort,” a Type II error leads to “missed opportunities.” For a fintech company, this is often the more dangerous error:

  • Financial Loss: The company continues to lose money to fraudulent transactions that the new algorithm could have caught.

  • Competitive Disadvantage: If a competitor develops and successfully implements a similar algorithm, they will have lower fraud costs and can offer better rates to customers, potentially driving the startup out of the market.

  • Stagnation: The startup remains stuck with “legacy” rules that are easily bypassed by sophisticated modern fraud tactics, failing to evolve with the threat landscape.

The Relationship to Statistical Power

The probability of not making a Type II error is called Statistical Power (\(1 - \beta\)).

  • If your test has low power, \(\beta\) is high, meaning a high risk of a Type II error.

  • This usually happens if the sample size is too small or if the effect (the reduction in fraud) is subtle and hard to detect against the “noise” of daily transactions.

  • Key Takeaway: A Type II error is like a fire alarm that fails to go off when there is actually a fire. The “fire” (fraud) continues to burn your capital because you thought everything was fine.

  3. Identify which error is more costly from a business perspective.

Here is a breakdown of why the “False Negative” (Type II) typically carries a higher price tag than the “False Positive” (Type I).

Statistical Error Types: Business Implications & Mitigation

| Error Type | Statistical Concept | Business Scenario | Hypothesis Test Outcome | Business Impact | Recommended Mitigation |
|---|---|---|---|---|---|
| Type I (Alpha) Error | False Positive | Adopting ineffective fraud detection | Reject H₀ when H₀ is true | Sunk Costs: Development resources wasted; Maintenance costs incurred; Customer trust eroded | Conservative α levels (0.01); Extensive testing; Phased rollout |
| Type II (Beta) Error | False Negative | Rejecting effective fraud detection | Fail to reject H₀ when H₀ is false | Opportunity Costs: Ongoing fraud losses; Regulatory penalties; Competitive disadvantage | Adequate statistical power; Continuous monitoring; Sensitivity checks |
| Type I vs Type II Trade-off | α vs β Balance | Choosing risk tolerance level | Setting significance level (α) | Strategic Decision: Risk appetite determination; Resource allocation; Competitive positioning | Cost-benefit analysis; Stakeholder alignment; Iterative refinement |

Note: Type I Error (α): False alarm | Type II Error (β): Missed detection | Power (1-β): Correct detection

Type II is usually “More Costly”

In fintech, fraud is often an existential threat. Here are three reasons why missing out on a working algorithm (Type II) is worse:

  • Direct Capital Loss: Fraud represents a direct hit to the bottom line. If the new algorithm had the potential to reduce fraud by 20%, every day the startup commits a Type II error, it is effectively “burning” money that it didn’t have to.

  • Scalability Risk: Startups need to scale fast. If you fail to implement a working fraud reduction tool, fraud losses will grow along with your user base, potentially leading to bankruptcy or the loss of banking licenses.

  • The “Safety Net” of Type I: If you commit a Type I error and deploy a useless algorithm, you usually find out eventually through performance monitoring. You can then revert to the old system. The cost is limited to the development time. However, if you commit a Type II error, you never know what you missed, and the fraud continues unabated.

The One Exception

A Type I Error could be more costly only if the new algorithm is so aggressive that it triggers a massive wave of False Positives at the transaction level (blocking legitimate customers). This would lead to “churn” (customers leaving the platform), which can be harder to recover from than the fraud itself.

Final Verdict

For most fintech startups, Type II is the greater risk. The goal of the startup should be to maximize Statistical Power, ensuring that if a solution to their fraud problem exists, they are capable of detecting it and putting it into production.

  4. Discuss how sample size affects Type II Error.

The relationship is inverse: for a fixed \(\alpha\) and a fixed true effect, the probability of a Type II Error decreases as the sample size increases.

The Statistical Relationship

The probability of a Type II Error is tied directly to Statistical Power (\(1 - \beta\)).

  1. Small Sample Size: Leads to “noisy” estimates with a large standard error. It is difficult to tell if a slight drop in fraud is a real result of the algorithm or just a random fluke in that day’s transactions. This results in high \(\beta\) (high risk of a False Negative).

  2. Large Sample Size: Reduces the standard error, making the test more “sensitive.” It narrows the sampling distribution of the test statistic, allowing the startup to detect even small improvements in fraud reduction. This results in low \(\beta\) (low risk of a False Negative).

Sample Size Matters for Fraud Detection

Fraudulent transactions are often “rare events” (e.g., only 0.1% of all transactions).

  • The Problem: If the startup only tests the algorithm on 1,000 transactions, it would expect to see only about 1 fraud case. Even if the algorithm is perfect, 1 case isn’t enough data to show that it works.

  • The Solution: By increasing the sample size to 1,000,000 transactions, the startup would expect about 1,000 fraud cases. With this much data, it can test with high confidence whether the algorithm reduced that number to around 800 (a 20% reduction).
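
The effect of sample size on \(\beta\) in this rare-event setting can be sketched numerically. The function below is an illustrative approximation, not part of the original analysis: it assumes a one-sided, one-sample proportion z-test against a known baseline fraud rate and uses the normal approximation, which is rough when the expected number of fraud cases is very small. The function name and the middle sample size are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided(p0: float, p1: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided, one-sample proportion z-test.

    H0: fraud rate = p0;  H1: fraud rate = p1 < p0.
    """
    se_null = (p0 * (1 - p0) / n) ** 0.5   # standard error if H0 is true
    se_alt = (p1 * (1 - p1) / n) ** 0.5    # standard error if H1 is true
    # Reject H0 when the observed fraud rate falls below this threshold.
    threshold = p0 - STD_NORMAL.inv_cdf(1 - alpha) * se_null
    # Power = probability of landing below the threshold when H1 is true.
    return STD_NORMAL.cdf((threshold - p1) / se_alt)

baseline_rate = 0.001   # 0.1% fraud rate, as in the text
reduced_rate = 0.0008   # a 20% reduction

for n in (1_000, 100_000, 1_000_000):
    power = power_one_sided(baseline_rate, reduced_rate, n)
    print(f"n = {n:>9,}: power ~ {power:.3f}, beta ~ {1 - power:.3f}")
```

Under these assumptions, power at 1,000 transactions is barely above \(\alpha\) itself, while at 1,000,000 transactions it is essentially 1, which matches the argument above.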

The Trade-offs of Increasing Sample Size

While a larger sample size reduces the risk of missing a great algorithm (Type II Error), it comes with practical business costs:

Sample Size Effects: Statistical vs Practical Considerations

| Consideration | Large Sample Impact (n > 10,000) | Small Sample Impact (n < 1,000) | Recommendation (Best Practice) |
|---|---|---|---|
| Testing Timeline | Significant Delay: Weeks to months for data collection; Slower iteration cycles; Delayed business decisions | Rapid Testing: Days to weeks for results; Fast iteration possible; Quick business decisions | Use sequential testing |
| Computational Cost | High Expense: Cloud computing costs scale linearly; Storage and processing costs increase; Infrastructure investment needed | Low Cost: Minimal cloud expenses; Standard infrastructure sufficient; Low operational overhead | Conduct cost-benefit analysis |
| Statistical Power | Very High Power (≈0.95+): Low Type II error risk; Can detect very small effects; High confidence in negative results | Moderate Power (≈0.30-0.70): Higher Type II error risk; Only detects substantial effects; Uncertainty in negative results | Target power = 0.80-0.90 |
| Type I Error Rate | No Impact: α is researcher-set threshold; Independent of sample size; Typically fixed at 0.05 | No Impact: α remains fixed by design; Same Type I error control; Consistent decision threshold | Fix α = 0.05 (standard) |
| Business Velocity | Slower Time-to-Market: Delayed feature releases; Slower learning cycles; Missed competitive opportunities | Faster Innovation: Quick feature iterations; Rapid experimentation; Competitive responsiveness | Balance speed with confidence |
| Detection Sensitivity | Extreme Sensitivity: Can detect trivial effects; May find ‘significant’ but unimportant effects; Risk of overfitting | Practical Sensitivity: Detects meaningful effects only; Focuses on important differences; Reduces overfitting risk | Consider practical significance |

Note: Key Insight: Type I error rate (α) is independent of sample size. It is a design choice, not a data property.

Visualization Explanation:

  1. Blue Curve (\(H_0\)): Represents the condition where the new algorithm is actually the same as the old one (no reduction in fraud).

  2. Red Curve (\(H_1\)): Represents the condition where the new algorithm truly reduces fraud.

  3. Type I Error (Red Area on the Right): This is the risk that we reject \(H_0\) when it is actually true. We think the algorithm is successful, when it is just a coincidence (False Positive).

  4. Type II Error (Blue Area on the Left): This is the risk that we fail to reject \(H_0\) when it is actually false. The algorithm is actually good, but our data is not strong enough to prove it (False Negative).

  5. Explain the relationship between α, β, and statistical power.

In statistics, \(\alpha\) (Alpha), \(\beta\) (Beta), and Statistical Power are three interconnected quantities that determine how reliable a test-based decision is. Understanding how they relate is crucial for fintech startups that need to balance operational risk against innovation opportunities.

The following is an explanation of the interrelationship between the three:

Mathematical Definitions and Relationships

These three concepts work on two different reality scenarios (when \(H_0\) is true vs when \(H_1\) is true):

Statistical Hypothesis Testing Parameters

| Parameter | Other Name | Simple Definition | Relationship |
|---|---|---|---|
| α (Alpha) | Type I Error | Risk of ‘Wrongful Accusation’: Assuming an algorithm is effective when it is not. | Set by the researcher (e.g., 0.05). |
| β (Beta) | Type II Error | Risk of ‘Missing’: Assuming an algorithm is failing when it is actually effective. | Complement of Power (β = 1 − Power). |
| Power (1−β) | Statistical Power | The ability to detect a truly present effect. | The ideal target is usually ≥ 0.80. |

Inverse Relationship (Trade-off) between \(\alpha\) and \(\beta\)

For a fixed sample size and a fixed true effect, \(\alpha\) and \(\beta\) move in opposite directions. If you lower the Type I risk (\(\alpha\)), the Type II risk (\(\beta\)) increases.

  • If \(\alpha\) is lowered (e.g., from 0.05 to 0.01): You become very strict and do not want to make mistakes in adopting new algorithms. The consequence: The standard of proof becomes higher, so you are more likely to miss algorithms that are actually good (\(\beta\) increases, Power decreases).

  • If \(\alpha\) is raised (e.g., from 0.05 to 0.10): You are more “lax” and willing to take the risk of wrong adoption. The consequence: You are more likely to detect effective algorithms (\(\beta\) decreases, Power increases).
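
The trade-off can be made concrete with a small calculation. This sketch assumes a one-sided z-test and a hypothetical true effect equal to 2.5 standard errors; under those assumptions, \(\beta = \Phi(z_{1-\alpha} - \text{effect}/SE)\), where \(\Phi\) is the standard normal CDF.

```python
from statistics import NormalDist

std_normal = NormalDist()
effect_in_se_units = 2.5   # hypothetical: true effect divided by its standard error

betas = {}
for alpha in (0.01, 0.05, 0.10):
    z_crit = std_normal.inv_cdf(1 - alpha)        # stricter alpha -> larger cutoff
    # beta = P(fail to reject H0 | H1 is true)
    beta = std_normal.cdf(z_crit - effect_in_se_units)
    betas[alpha] = beta
    print(f"alpha = {alpha:.2f}: beta ~ {beta:.3f}, power ~ {1 - beta:.3f}")
```

With the sample size and effect held fixed, each loosening of \(\alpha\) lowers \(\beta\) and raises power, and each tightening does the reverse.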

The Role of Statistical Power (\(1 - \beta\))

Power is the probability that the test will reject \(H_0\) when \(H_1\) is indeed true. In the fintech context:

  • High Power (0.80 - 0.95): Startups have keen “eyes.” If the new algorithm really reduces fraud, the test will detect it in 80% to 95% of cases.

  • Low Power (< 0.50): It is worse than flipping a coin. Even if your algorithm is great, the test will most likely conclude “no significant results.”

Sample Size Becomes a “Savior”

For a given effect size and level of noise, the way to decrease \(\alpha\) and \(\beta\) at the same time (or to increase Power without increasing \(\alpha\)) is to increase the Sample Size (\(n\)). As \(n\) increases, the sampling distribution of the test statistic becomes narrower (precision increases). This reduces the overlap between the \(H_0\) and \(H_1\) curves, so the risk of a false adoption (\(\alpha\)) and the risk of a missed detection (\(\beta\)) can both be kept small.
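
How much data "enough" is can be estimated in advance. The sketch below uses the standard sample-size formula for a one-sided, one-sample z-test, \(n = \left(\frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\delta}\right)^2\); the effect size \(\delta = 0.2\) and \(\sigma = 1\) are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist

def required_n(effect: float, sigma: float, alpha: float, power: float) -> int:
    """Sample size for a one-sided, one-sample z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # cutoff controlling Type I error
    z_power = NormalDist().inv_cdf(power)       # quantile controlling Type II error
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)

# Hypothetical standardised effect of 0.2 with sigma = 1.
n_standard = required_n(effect=0.2, sigma=1.0, alpha=0.05, power=0.80)
n_strict = required_n(effect=0.2, sigma=1.0, alpha=0.01, power=0.90)

print(f"alpha = 0.05, power = 0.80 -> n = {n_standard}")
print(f"alpha = 0.01, power = 0.90 -> n = {n_strict}")
```

Tightening both error rates at once roughly doubles the required sample in this example, which is the cost of lowering \(\alpha\) and \(\beta\) together.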

Summary for Business Decisions

Fintech startups should choose:

  • Want to be extremely safe from wasted implementation costs? Use low \(\alpha\) (risk: low Power).

  • Want to be sure not to miss opportunities to catch fraud? Go for high Power (risk: need a large sample size or a looser \(\alpha\)).

Read This Visualization:

  • Dashed Line (Decision Line): This is the boundary you defined. If the test results are to the right of this line, you adopt the new algorithm.

  • RED Area (\(\alpha\) - Type I Error): You are on the left curve (\(H_0\)), but the test results cross the boundary line. You think you won, but it was just a fluke.

  • BLUE Area (\(\beta\) - Type II Error): You are on the right curve (\(H_1\)), meaning the algorithm is actually good. But because the test results are to the left of the boundary line, you think the algorithm failed. You missed a golden opportunity.

  • GREEN Area (Power - \(1-\beta\)): This is the ideal situation. The algorithm is good, and your tests show it. The wider the green area, the more sensitive your startup’s tests are.

6 Case Study 6

6.1 P-Value and Statistical Decision Making

A churn prediction model evaluation yields the following results:

  • Test statistic = 2.31
  • p-value = 0.021
  • Significance level: \(\alpha = 0.05\)

6.2 Tasks

  1. Explain the meaning of the p-value.
  2. Make a statistical decision.
  3. Translate the decision into non-technical language for management.
  4. Discuss the risk if the sample is not representative.
  5. Explain why the p-value does not measure effect size.

6.3 Answer Case Study 6

  1. Explain the meaning of the p-value.

The Core Meaning of the P-Value

The p-value of 0.021 represents the probability of obtaining a test statistic as extreme as (or more extreme than) 2.31, assuming that the null hypothesis (\(H_0\)) is true.

In churn prediction, the null hypothesis usually states that there is no effect—for example, that a specific feature does not influence churn, or that the model’s performance is no better than a baseline (random guessing).

Therefore, a p-value of 0.021 means:

  • If the model actually had no predictive power, there would be only a 2.1% chance of seeing a result at least this extreme.

  • Because 2.1% is a low probability, the data are hard to reconcile with the “no effect” assumption.
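
The reported p-value can be reproduced from the test statistic. The case study does not say which test produced the statistic, so this sketch assumes a standard normal (z) reference distribution; under that assumption, the two-sided p-value matches the reported 0.021.

```python
from statistics import NormalDist

z = 2.31  # test statistic from the case study
std_normal = NormalDist()

# Two-sided: probability of a statistic at least this extreme in either tail.
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

# One-sided (right tail), shown for comparison.
p_one_sided = 1 - std_normal.cdf(z)

print(f"two-sided p-value ~ {p_two_sided:.4f}")
print(f"one-sided p-value ~ {p_one_sided:.4f}")
```

The agreement with the two-sided value suggests the reported p-value comes from a two-tailed test, although the source does not state this explicitly.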

Statistical Decision: Comparing P-Value and Alpha (\(\alpha\))

To make a formal decision, we compare the p-value to your pre-defined significance level (\(\alpha\)):

Hypothesis Test Results

| Metric | Value | Comparison/Result |
|---|---|---|
| P-Value | 0.021 | p < α |
| Significance Level (α) | 0.05 | 0.021 < 0.05 |
| Conclusion | | Statistically Significant |

The Decision: Since the p-value (\(0.021\)) is less than the significance level (\(0.05\)), you reject the null hypothesis.

Business Interpretation:

In practical terms for your churn model, this result is statistically significant at the 5% level. The pattern the model has identified is unlikely to be just “noise” in the data, so there is sufficient evidence to move forward with the model’s insights for retention strategies.

  2. Make a statistical decision.

To make a statistical decision for your churn prediction model, we compare the calculated p-value against the pre-determined significance level (\(\alpha\)).

The Statistical Decision

Based on the results provided, the decision is to Reject the Null Hypothesis (\(H_0\)).

Justification

The decision-making rule in hypothesis testing is straightforward:

  • If \(p \leq \alpha\): Reject \(H_0\) (The result is statistically significant).

  • If \(p > \alpha\): Fail to Reject \(H_0\) (The result is not statistically significant).

In this case study:

  • P-value = \(0.021\)

  • Significance Level (\(\alpha\)) = \(0.05\)

Since \(0.021 < 0.05\), the test statistic of \(2.31\) falls within the “rejection region.” This means that, if \(H_0\) were true, the probability of observing a test statistic at least as extreme as \(2.31\) would be lower than the \(5\%\) threshold you set.

Conclusion and Business Impact

By rejecting the null hypothesis, you are concluding that:

  • Statistical Significance: The findings of the churn prediction model are statistically significant at the \(5\%\) significance level (equivalently, the \(95\%\) confidence level).

  • Model Validity: There is strong evidence to suggest that the model’s predictors are actually related to customer churn, rather than the results being a product of random sampling error.

  • Actionable Insight: The business can proceed with using this model to inform retention strategies, as the relationship it has identified is unlikely to be a fluke.

Decision via P-Value Comparison

The standard rule for any hypothesis test is: Reject the null hypothesis if \(p \leq \alpha\).

  • P-Value: 0.021

  • Significance Level (\(\alpha\)): 0.05

  • Comparison: \(0.021 < 0.05\)

Statistical Decision: Reject the Null Hypothesis (\(H_0\)).

Decision via Test Statistic Comparison

We can also visualize this by comparing your test statistic to the critical value. For a standard normal distribution (Z-test) at \(\alpha = 0.05\):

  • Two-Tailed Critical Value: \(\pm 1.96\)

  • One-Tailed (Right) Critical Value: \(+1.645\)

  • Your Test Statistic: 2.31

Since your test statistic of 2.31 is greater than both critical values (\(1.96\) and \(1.645\)), it falls in the rejection region (the “tail” of the curve) under either convention.
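
The critical values quoted above can be obtained from the inverse CDF of the standard normal distribution. A minimal sketch, again assuming a z-test as the text does:

```python
from statistics import NormalDist

alpha = 0.05
z_stat = 2.31
std_normal = NormalDist()

# Two-tailed test: alpha is split across both tails.
crit_two_tailed = std_normal.inv_cdf(1 - alpha / 2)
# One-tailed (right) test: all of alpha sits in the upper tail.
crit_one_tailed = std_normal.inv_cdf(1 - alpha)

reject_two_tailed = abs(z_stat) > crit_two_tailed
reject_one_tailed = z_stat > crit_one_tailed

print(f"two-tailed critical value: {crit_two_tailed:.3f}, reject H0: {reject_two_tailed}")
print(f"one-tailed critical value: {crit_one_tailed:.3f}, reject H0: {reject_one_tailed}")
```

The critical-value approach and the p-value approach always give the same decision; they are two views of the same comparison.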

Visual Explanation:

  • Blue Curve: Shows the standard normal distribution (assuming \(H_0\) is true).

  • Red Area: Rejection region (\(\alpha = 0.05\)). If the green line falls here, we reject \(H_0\).

  • Green Dashed Line: Shows the position of your model’s test statistic (\(2.31\)). Since this line lies inside the red area, it shows visually that the result is statistically significant.

  3. Translate the decision into non-technical language for management.

The Results Are Unlikely to Be a Coincidence

The analysis shows that the patterns identified by our model are unlikely to be the product of luck. Simply put, if there were no real pattern, results this strong would show up only about 2% of the time.

Decision: The Model Is Ready to Use

Because this 2% figure is below the 5% threshold commonly used as the standard for statistical evidence, we accept the model’s results. We have sufficient evidence to state that the factors identified by the model are genuinely related to customers’ decisions to churn.

Business Impact

Prediction Accuracy: We can be more confident in allocating marketing budgets to retain customers identified by the model as “high-risk.”

Next Steps: We can launch retention campaigns (such as special promotions or discounts) for these targeted customers, because the model’s findings are statistically supported.

Conclusion for Management

“If this model were picking up nothing but random noise, we would see results this strong only about 2% of the time. We’re not guessing; the churn patterns we’re seeing are very likely real and can be acted on to prevent lost revenue.”

  4. Discuss the risk if the sample is not representative.

If the data used to calculate that p-value (\(0.021\)) doesn’t reflect the diversity of your real customer base, you face the following risks:

False Sense of Certainty (The “Paper Tiger” Model)

A p-value of \(0.021\) suggests your results are “statistically significant.” However, significance only tells you that the pattern is unlikely to be noise within the data you sampled.

  • The Risk: If your sample only includes long-term premium users but the model is applied to new “freemium” users, the model may predict churn perfectly for the sample but fail miserably in the real world. You are essentially making a highly confident decision based on the wrong information.

Misallocation of Retention Budget

Management uses churn models to decide where to spend money (e.g., discounts, outreach).

  • The Risk: If the non-representative sample over-represents a specific demographic (e.g., users from a certain region), the model will identify “churn signals” unique to that group. When you scale this, you might send expensive retention offers to people who weren’t going to leave, while ignoring the actual “at-risk” customers the sample missed.

Masking Hidden Variables (Omitted Variable Bias)

A non-representative sample often leaves out variables that matter.

  • The Risk: If you only sampled customers who contacted support, your \(2.31\) test statistic might actually be measuring “customer frustration” rather than “churn intent.” You might conclude that a specific app feature is causing churn when, in reality, the driver is slow support response time, which the sample could not reveal across the whole customer base.

Poor Generalization (Low External Validity)

In statistics, we distinguish between internal and external validity.

  • Internal Validity: Did the experiment work for this specific group? (Yes, \(p < 0.05\)).

  • External Validity: Does this apply to the whole customer base? (Not if the sample is biased.)

6.4 Summary of Risks:

RISK TYPE | BUSINESS IMPACT
GENERALIZATION ERROR | The model works in the “lab” environment but fails when deployed to the actual production population.
STRATEGIC MISALIGNMENT | Marketing treats the wrong symptoms based on biased data, leading to ineffective campaigns and lost revenue.
MODEL DRIFT | As the actual customer population naturally evolves, a biased model decays much faster and provides increasingly inaccurate predictions.

  1. Explain why the p-value does not measure effect size.

In statistical decision-making, it is a common misconception that a “small p-value” implies a “large effect.” However, these two metrics answer fundamentally different questions. In your churn prediction case study, the \(p = 0.021\) tells you that the result is statistically significant at \(\alpha = 0.05\), but it says nothing about how much the model actually improves retention or reduces churn.

The p-value does not measure effect size for the following primary reasons:

Confounding with Sample Size

The most critical reason is that p-values are highly dependent on sample size (\(n\)).

  • Large Samples: In a large dataset (e.g., 100,000 customers), even a tiny, trivial difference in churn rates can produce a very small p-value.

  • Small Samples: Conversely, a massive, business-changing effect might produce a high (non-significant) p-value simply because the sample size was too small to provide certainty.
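The dependence on \(n\) can be sketched with a one-sample z-test, where \(z = \text{effect} / (\sigma / \sqrt{n})\). The function name `two_sided_p` and the effect sizes, standard deviation, and sample sizes below are assumptions made up for this illustration, not values from the case study.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for a mean difference `effect`."""
    z = effect / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect (0.02 standard deviations) measured on a very large sample.
p_tiny_effect_big_n = two_sided_p(effect=0.02, sd=1.0, n=100_000)

# Large effect (0.8 standard deviations) measured on a very small sample.
p_big_effect_small_n = two_sided_p(effect=0.8, sd=1.0, n=5)

print(f"Tiny effect, n = 100000: p = {p_tiny_effect_big_n:.2e}")
print(f"Large effect, n = 5:     p = {p_big_effect_small_n:.3f}")
```

With these inputs the trivial effect is highly significant while the large effect is not, so the p-value alone cannot rank the two by importance.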

Difference in Objectives

The two metrics represent different dimensions of your data:

  • P-Value (Statistical Significance): Measures how compatible the data are with pure chance. It answers: “If there were no real difference, how likely is a result at least this extreme?”

  • Effect Size (Practical Significance): Measures the magnitude of the result. It answers: “How large is the difference, and does it matter for the business?”

Example: In your churn model, a p-value of 0.021 suggests the model’s predictors are likely “real” and not just luck. However, the Effect Size (e.g., Cohen’s \(d\) or an Odds Ratio) would tell you if the model reduces churn by 0.1% (trivial) or 10% (substantial).
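To make the trivial-versus-substantial contrast concrete, the sketch below computes two common effect sizes for a binary outcome, the risk difference and the odds ratio. The churn rates are hypothetical, and reading “0.1%” and “10%” as percentage-point reductions from a 20% baseline is an assumption for illustration only.

```python
def effect_sizes(churn_control, churn_treated):
    """Return (risk difference, odds ratio) for two churn rates."""
    risk_difference = churn_control - churn_treated
    odds_control = churn_control / (1 - churn_control)
    odds_treated = churn_treated / (1 - churn_treated)
    odds_ratio = odds_treated / odds_control
    return risk_difference, odds_ratio


# Hypothetical scenarios: churn falls from 20% to 19.9%, or from 20% to 10%.
trivial_rd, trivial_or = effect_sizes(0.200, 0.199)
substantial_rd, substantial_or = effect_sizes(0.20, 0.10)

print(f"Trivial:     risk difference = {trivial_rd:.3f}, odds ratio = {trivial_or:.3f}")
print(f"Substantial: risk difference = {substantial_rd:.3f}, odds ratio = {substantial_or:.3f}")
```

An odds ratio close to 1 signals almost no change, while one well below 1 signals a meaningful reduction in churn. Either scenario could be paired with \(p = 0.021\) given a suitable sample size.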

The P-Value Scale is Not Linear

The p-value is a probability, not a physical measurement. A p-value of 0.01 is not “twice as much effect” as a p-value of 0.02. Because the p-value is a function of both the effect magnitude and the precision (standard error) of the estimate, you cannot decouple the “size” of the effect from the “certainty” of the estimate using only the p-value.
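A short sketch of both points follows, assuming the reported test statistic of \(2.31\) is a standard normal z-score evaluated two-sided (consistent with \(p \approx 0.021\), though the test type is an assumption here). The effect and standard-error pairs are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - std_normal.cdf(abs(z)))


def z_from_p(p):
    """Absolute z-score that produces a given two-sided p-value."""
    return std_normal.inv_cdf(1 - p / 2)


# The reported statistic, assumed to be a two-sided z-score.
p_reported = p_from_z(2.31)

# Halving the p-value from 0.02 to 0.01 does not double the test statistic.
z_ratio = z_from_p(0.01) / z_from_p(0.02)

# Two hypothetical (effect, standard error) pairs that share the same z-score.
p_large_noisy = p_from_z(2.31 / 1.0)       # large effect, imprecise estimate
p_small_precise = p_from_z(0.0231 / 0.01)  # tiny effect, precise estimate

print(f"p for z = 2.31: {p_reported:.4f}")
print(f"z(p = 0.01) / z(p = 0.02): {z_ratio:.3f}")
print(f"Large noisy effect:   p = {p_large_noisy:.4f}")
print(f"Small precise effect: p = {p_small_precise:.4f}")
```

The last two p-values coincide even though the underlying effects differ by a factor of 100, which is why the effect estimate and its standard error should be reported alongside the p-value.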

Summary Comparison:

FEATURE | P-VALUE | EFFECT SIZE
Question | Is there an effect? | How big is the effect?
Sample Size | Highly sensitive to n. | Largely independent of n (only its precision improves with n).
Result Type | Probability (0 to 1). | Magnitude (e.g., mean difference).
Business Use | Decision to trust the model. | Decision on ROI and implementation.
