Study Cases
Assignment Week 14 ~ Statistical Inferences
Paskalis Farelnata Zamasi
NIM : 52250043
Data Science student at
Institut Teknologi Sains Bandung
Case Study 1
One-Sample Z-Test (Statistical Hypotheses)
A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.
A random sample of 64 users shows an average study time of 116 minutes.
\[ \begin{eqnarray*} \mu_0 &=& 120 \\ \sigma &=& 15 \\ n &=& 64 \\ \bar{x} &=& 116 \end{eqnarray*} \]
Tasks
- Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
- Identify the appropriate statistical test and justify your choice.
- Compute the test statistic and p-value using \(\alpha = 0.05\).
- State the statistical decision.
- Interpret the result in a business analytics context.
Answer
- Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
The null hypothesis (H₀) states that the platform’s claim holds: the average daily study time of the user population is equal to 120 minutes, and the gap between the claim and the sample mean reflects sampling variation only. Mathematically, this is formulated as:
\[ H_0: \mu = 120 \]
The alternative hypothesis (H₁) states that the average daily study time of the user population is not equal to 120 minutes. This is a two-tailed test because the problem does not specify a direction for the difference: the claim is simply “is 120 minutes”, and although the sample mean happens to be lower, the direction of a test should not be chosen after seeing the data. Mathematically:
\[ H_1: \mu \neq 120 \]
This formulation follows the convention that H₀ contains the equality, so that a specific claimed value can be tested, while H₁ contains the inequality, so that any deviation can be detected. There is no reason to use a one-sided test here (e.g., checking only whether the mean is lower) because the problem asks us to verify a general claim with no stated direction. A one-sided alternative (e.g., H₁: μ < 120) would halve the p-value when the sample mean falls on the hypothesized side, so the two-sided test is the more conservative choice and is standard practice for claims of the form “the average is X” without additional context.
- Identify the appropriate statistical test and justify your choice.
The appropriate statistical test is the One-Sample Z-Test.
Justification:
This is a one-sample test because we are comparing the mean of a single sample (x̄ = 116) to the claimed population value (μ₀ = 120).
The population standard deviation (σ = 15) is known from historical records, so we use the Z-test rather than the T-test (which is used when σ is unknown and replaced with s from the sample).
The sample size (n = 64) is greater than 30, so by the Central Limit Theorem (CLT) the sampling distribution of the mean is approximately normal even if individual study times are not; a large sample would not be needed if the population itself were normally distributed.
The sample is random and the observations are independent (as stated in the problem), and study times are assumed to be either normally distributed or numerous enough for the CLT to apply.
Alternatives such as the t-test are unnecessary because σ is known; chi-square and similar tests do not apply because this is a test of a mean, not of a variance, a proportion, or categorical counts.
Because the Z-test matches this scenario exactly, its stated significance level (Type I error rate) holds, and no power is lost to estimating σ from the sample.
- Compute the test statistic and p-value using \(\alpha = 0.05\)
The formula for the Z test statistic in a one-sample test is:
\[ Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]
- Identify the input values of the problem:
- μ₀ = 120 (claimed population mean)
- σ = 15 (population standard deviation)
- n = 64 (sample size)
- x̄ = 116 (sample mean)
- Calculate the standard error (SE) of the sample mean:
\[ SE = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{64}} = \frac{15}{8} = 1.875 \]
- Calculate the deviation between the sample and population means:
\[ \bar{x} - \mu_0 = 116 - 120 = -4 \]
- Calculate the Z value:
\[ Z = \frac{-4}{1.875} = -2.1333 \]
- Determine the p-value for a two-tailed test. Since the Z distribution is symmetric, the p-value = 2 × P(Z < -|Z|), where P is calculated from the standard normal distribution table or the cumulative function.
Use the cumulative normal function: P(Z ≤ -2.1333) ≈ 0.0164
Then the p-value = 2 × 0.0164 = 0.0328.
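The arithmetic above can be checked with a short script. This is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original assignment, and the printed p-value may differ from the hand calculation in the last decimal place because the tail probability is not rounded before doubling.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the problem statement
mu0, sigma, n, xbar = 120, 15, 64, 116

se = sigma / sqrt(n)                           # standard error of the mean
z = (xbar - mu0) / se                          # one-sample Z statistic
p_two_tailed = 2 * NormalDist().cdf(-abs(z))   # area in both tails of N(0, 1)

print(f"SE = {se:.4f}, Z = {z:.4f}, two-tailed p = {p_two_tailed:.4f}")
```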
| Z | Density | Cumulative_P | Two_Tail_P |
|---|---|---|---|
| -4.0 | 0.0001 | 0.0000 | 0.0001 |
| -3.9 | 0.0002 | 0.0000 | 0.0001 |
| -3.8 | 0.0003 | 0.0001 | 0.0001 |
| -3.7 | 0.0004 | 0.0001 | 0.0002 |
| -3.6 | 0.0006 | 0.0002 | 0.0003 |
| -3.5 | 0.0009 | 0.0002 | 0.0005 |
| -3.4 | 0.0012 | 0.0003 | 0.0007 |
| -3.3 | 0.0017 | 0.0005 | 0.0010 |
| -3.2 | 0.0024 | 0.0007 | 0.0014 |
| -3.1 | 0.0033 | 0.0010 | 0.0019 |
| -3.0 | 0.0044 | 0.0013 | 0.0027 |
| -2.9 | 0.0060 | 0.0019 | 0.0037 |
| -2.8 | 0.0079 | 0.0026 | 0.0051 |
| -2.7 | 0.0104 | 0.0035 | 0.0069 |
| -2.6 | 0.0136 | 0.0047 | 0.0093 |
| -2.5 | 0.0175 | 0.0062 | 0.0124 |
| -2.4 | 0.0224 | 0.0082 | 0.0164 |
| -2.3 | 0.0283 | 0.0107 | 0.0214 |
| -2.2 | 0.0355 | 0.0139 | 0.0278 |
| -2.1 | 0.0440 | 0.0179 | 0.0357 |
| -2.0 | 0.0540 | 0.0228 | 0.0455 |
| -1.9 | 0.0656 | 0.0287 | 0.0574 |
| -1.8 | 0.0790 | 0.0359 | 0.0719 |
| -1.7 | 0.0940 | 0.0446 | 0.0891 |
| -1.6 | 0.1109 | 0.0548 | 0.1096 |
| -1.5 | 0.1295 | 0.0668 | 0.1336 |
| -1.4 | 0.1497 | 0.0808 | 0.1615 |
| -1.3 | 0.1714 | 0.0968 | 0.1936 |
| -1.2 | 0.1942 | 0.1151 | 0.2301 |
| -1.1 | 0.2179 | 0.1357 | 0.2713 |
| -1.0 | 0.2420 | 0.1587 | 0.3173 |
| -0.9 | 0.2661 | 0.1841 | 0.3681 |
| -0.8 | 0.2897 | 0.2119 | 0.4237 |
| -0.7 | 0.3123 | 0.2420 | 0.4839 |
| -0.6 | 0.3332 | 0.2743 | 0.5485 |
| -0.5 | 0.3521 | 0.3085 | 0.6171 |
| -0.4 | 0.3683 | 0.3446 | 0.6892 |
| -0.3 | 0.3814 | 0.3821 | 0.7642 |
| -0.2 | 0.3910 | 0.4207 | 0.8415 |
| -0.1 | 0.3970 | 0.4602 | 0.9203 |
| 0.0 | 0.3989 | 0.5000 | 1.0000 |
| 0.1 | 0.3970 | 0.5398 | 0.9203 |
| 0.2 | 0.3910 | 0.5793 | 0.8415 |
| 0.3 | 0.3814 | 0.6179 | 0.7642 |
| 0.4 | 0.3683 | 0.6554 | 0.6892 |
| 0.5 | 0.3521 | 0.6915 | 0.6171 |
| 0.6 | 0.3332 | 0.7257 | 0.5485 |
| 0.7 | 0.3123 | 0.7580 | 0.4839 |
| 0.8 | 0.2897 | 0.7881 | 0.4237 |
| 0.9 | 0.2661 | 0.8159 | 0.3681 |
| 1.0 | 0.2420 | 0.8413 | 0.3173 |
| 1.1 | 0.2179 | 0.8643 | 0.2713 |
| 1.2 | 0.1942 | 0.8849 | 0.2301 |
| 1.3 | 0.1714 | 0.9032 | 0.1936 |
| 1.4 | 0.1497 | 0.9192 | 0.1615 |
| 1.5 | 0.1295 | 0.9332 | 0.1336 |
| 1.6 | 0.1109 | 0.9452 | 0.1096 |
| 1.7 | 0.0940 | 0.9554 | 0.0891 |
| 1.8 | 0.0790 | 0.9641 | 0.0719 |
| 1.9 | 0.0656 | 0.9713 | 0.0574 |
| 2.0 | 0.0540 | 0.9772 | 0.0455 |
| 2.1 | 0.0440 | 0.9821 | 0.0357 |
| 2.2 | 0.0355 | 0.9861 | 0.0278 |
| 2.3 | 0.0283 | 0.9893 | 0.0214 |
| 2.4 | 0.0224 | 0.9918 | 0.0164 |
| 2.5 | 0.0175 | 0.9938 | 0.0124 |
| 2.6 | 0.0136 | 0.9953 | 0.0093 |
| 2.7 | 0.0104 | 0.9965 | 0.0069 |
| 2.8 | 0.0079 | 0.9974 | 0.0051 |
| 2.9 | 0.0060 | 0.9981 | 0.0037 |
| 3.0 | 0.0044 | 0.9987 | 0.0027 |
| 3.1 | 0.0033 | 0.9990 | 0.0019 |
| 3.2 | 0.0024 | 0.9993 | 0.0014 |
| 3.3 | 0.0017 | 0.9995 | 0.0010 |
| 3.4 | 0.0012 | 0.9997 | 0.0007 |
| 3.5 | 0.0009 | 0.9998 | 0.0005 |
| 3.6 | 0.0006 | 0.9998 | 0.0003 |
| 3.7 | 0.0004 | 0.9999 | 0.0002 |
| 3.8 | 0.0003 | 0.9999 | 0.0001 |
| 3.9 | 0.0002 | 1.0000 | 0.0001 |
| 4.0 | 0.0001 | 1.0000 | 0.0001 |
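A reference table like the one above can be regenerated programmatically. The sketch below assumes the columns are the standard normal density, the cumulative probability P(Z ≤ z), and the two-tailed probability 2 × P(Z ≤ -|z|), each rounded to four decimals; the function name `z_table_row` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_table_row(z):
    """Return (z, density, cumulative probability, two-tailed probability)."""
    density = std_normal.pdf(z)
    cumulative = std_normal.cdf(z)
    two_tail = 2 * std_normal.cdf(-abs(z))
    return (z, round(density, 4), round(cumulative, 4), round(two_tail, 4))

# Z from -4.0 to 4.0 in steps of 0.1, built from integers to avoid float drift
rows = [z_table_row(k / 10) for k in range(-40, 41)]

for row in rows[18:23]:  # a few rows around Z = -2.0
    print(row)
```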
- State the statistical decision.
Reject H₀ because p-value (0.0328) < α (0.05)
The decision rule for a two-sided test at a significance level of α = 0.05 is: if p-value < α, reject H₀ and accept H₁; otherwise, fail to reject H₀. Here, 0.0328 < 0.05, so there is sufficient statistical evidence to state that the population mean is not equal to 120 minutes. The critical Z value for two sides at α = 0.05 is ±1.96; since |Z| = 2.1333 > 1.96, this is consistent with rejecting H₀. This decision is objectively based on the data, not subjective assumptions.
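The p-value rule and the critical-value rule always lead to the same decision, which can be confirmed numerically. The following is a small sketch with illustrative variable names, again using only the standard library.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
z = (116 - 120) / (15 / sqrt(64))              # test statistic from the problem data

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value, about 1.96
p_value = 2 * NormalDist().cdf(-abs(z))

reject_by_critical_value = abs(z) > z_crit
reject_by_p_value = p_value < alpha
print(reject_by_critical_value, reject_by_p_value)
```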
- Interpret the result in a business analytics context.
In the context of business analytics for digital learning platforms, the rejection of H₀ indicates that the claim of an average daily learning time of 120 minutes is not supported by the sample data, with significant evidence (p-value = 0.0328) that the actual average is lower (based on x̄ = 116). This implies potential issues such as decreased user engagement, which could be caused by factors such as less engaging content, external distractions, or changes in user behavior post-pandemic.
Consequences:
Operational risk: If this claim is used for marketing or investor pitches, this data represents an overstatement, potentially damaging credibility and triggering lawsuits for misleading claims.
Financial impact: Lower learning time means weaker user retention, reducing metrics like lifetime value (LTV) or advertising/subscription revenue. For example, if each additional minute generates incremental revenue, a deviation of 4 minutes per day could accumulate to millions of dollars in annual losses for a large user base.
Strategic consideration: A single test should not be the only basis for a decision. The historical σ = 15 indicates substantial variability between users, so it is advisable to repeat the test with a larger sample or to use Bayesian inference to update the estimate continuously. Overall, these results favor a shift from static claims to ongoing monitoring in support of business growth.
Visualization
Case Study 2
One-Sample T-Test (σ Unknown, Small Sample)
A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.
The following data are collected from 10 users:
\[ 9.2,\; 10.5,\; 9.8,\; 10.1,\; 9.6,\; 10.3,\; 9.9,\; 9.7,\; 10.0,\; 9.5 \]
Tasks
- Define H₀ and H₁ (two-tailed).
- Determine the appropriate hypothesis test.
- Calculate the t-statistic and p-value at \(\alpha = 0.05\).
- Make a statistical decision.
- Explain how sample size affects inferential reliability.
Answer
- Define H₀ and H₁ (two-tailed).
The null hypothesis (H₀) states that the population’s average task completion time is 10 minutes, that is, it does not differ from the benchmark value. Mathematically:
\[ H_0: \mu = 10 \]
The alternative hypothesis (H₁) states that the population’s mean task completion time differs from 10 minutes (two-tailed because the question states “differs from” without a specific direction). Mathematically:
\[ H_1: \mu \neq 10 \]
This formulation follows the basic convention of hypothesis testing, where H₀ contains the equality so that a specific value can be tested (here, 10 minutes as the UX benchmark), while H₁ contains a two-sided inequality so that any deviation can be detected. This suits a UX investigation whose goal is to verify whether a new app meets a time target without assuming a direction (faster or slower). A one-sided test would miss a deviation in the unexpected direction, so the two-sided approach is the more conservative choice.
- Determine the appropriate hypothesis test.
The appropriate hypothesis test is the One-Sample t-Test.
Justification:
This is a one-sample test because we are comparing the sample mean (of 10 users) to a hypothetical population value (μ₀ = 10) without any other population data.
The population standard deviation (σ) is unknown, so a t-test is used instead of a Z-test (which requires a known σ).
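The sample size is small (n = 10 < 30), so the Central Limit Theorem cannot be relied on and the completion times must be assumed to be approximately normally distributed. Under that assumption the t-distribution with n − 1 degrees of freedom is the appropriate reference distribution, because its heavier tails account for the extra uncertainty from estimating σ with s.
Alternatives such as the Z-test are inappropriate because σ is unknown, and ANOVA does not apply because it compares the means of several groups, whereas there is only one sample here. The t-test therefore gives the most accurate inference for a small sample and keeps the Type I error rate at its nominal level, unlike a Z-test with s substituted for σ.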
- Calculate the t-statistic and p-value at \(\alpha = 0.05.\)
The formula for the t-test statistic in a one-sample t-test is:
\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
where:
- \(\bar{x}\) is the sample mean
- \(s\) is the sample standard deviation
- degrees of freedom (df) = n-1
Identify the input data for the problem:
- Data: 9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5
- μ₀ = 10
- n = 10
Calculate the sample mean (\(\bar{x}\)):
- Add the data: 9.2 + 10.5 + 9.8 + 10.1 + 9.6 + 10.3 + 9.9 + 9.7 + 10.0 + 9.5 = 98.6
- \(\bar{x} = 98.6 / 10 = 9.86\)
Calculate the sample standard deviation (s) using the formula:
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]
Calculate the squared deviation of each:
- (9.2 - 9.86)^2 = (-0.66)^2 = 0.4356
- (10.5 - 9.86)^2 = (0.64)^2 = 0.4096
- (9.8 - 9.86)^2 = (-0.06)^2 = 0.0036
- (10.1 - 9.86)^2 = (0.24)^2 = 0.0576
- (9.6 - 9.86)^2 = (-0.26)^2 = 0.0676
- (10.3 - 9.86)^2 = (0.44)^2 = 0.1936
- (9.9 - 9.86)^2 = (0.04)^2 = 0.0016
- (9.7 - 9.86)^2 = (-0.16)^2 = 0.0256
- (10.0 - 9.86)^2 = (0.14)^2 = 0.0196
- (9.5 - 9.86)^2 = (-0.36)^2 = 0.1296
| Observation | Deviation | Squared_Deviation |
|---|---|---|
| 9.2 | -0.66 | 0.4356 |
| 10.5 | 0.64 | 0.4096 |
| 9.8 | -0.06 | 0.0036 |
| 10.1 | 0.24 | 0.0576 |
| 9.6 | -0.26 | 0.0676 |
| 10.3 | 0.44 | 0.1936 |
| 9.9 | 0.04 | 0.0016 |
| 9.7 | -0.16 | 0.0256 |
| 10.0 | 0.14 | 0.0196 |
| 9.5 | -0.36 | 0.1296 |
- Sum of squared deviations: 0.4356 + 0.4096 + 0.0036 + 0.0576 + 0.0676 + 0.1936 + 0.0016 + 0.0256 + 0.0196 + 0.1296 = 1.344
\[ s = \sqrt{\frac{1.344}{9}} = \sqrt{0.149333} \approx 0.3864 \]
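The mean, the sum of squared deviations, and the sample standard deviation can be reproduced in a few lines. This is a sketch under the assumption that the ten observations are entered exactly as listed; `statistics.stdev` uses the same n − 1 denominator as the formula above, so it serves as a cross-check on the manual calculation.

```python
from math import sqrt
from statistics import mean, stdev

times = [9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5]

xbar = mean(times)
sum_sq_dev = sum((x - xbar) ** 2 for x in times)   # sum of squared deviations
s_manual = sqrt(sum_sq_dev / (len(times) - 1))     # sample SD, n - 1 denominator
s_library = stdev(times)                           # same quantity from the library

print(f"mean = {xbar:.2f}, SS = {sum_sq_dev:.3f}, s = {s_manual:.4f}")
```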
Calculate the standard error (SE):
\[ SE = \frac{s}{\sqrt{n}} = \frac{0.3864}{\sqrt{10}} = \frac{0.3864}{3.1623} \approx 0.1222 \]
Calculate the deviation between the sample and population means:
\[ \bar{x} - \mu_0 = 9.86 - 10 = -0.14 \]
Calculate the t value:
\[ t = \frac{-0.14}{0.1222} \approx -1.1456 \]
- df = 10 - 1 = 9
Calculate the p-value for a two-sided test using the t-distribution with df = 9 (from a t-table or the cumulative distribution function): p-value = 2 × P(T ≤ -|t|) ≈ 0.2815.
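The t-statistic and its two-tailed p-value can also be obtained numerically. Python's standard library has no Student's t CDF, so the sketch below integrates the t density with Simpson's rule; the helper names `t_pdf` and `t_two_tailed_p` are hypothetical, and a statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t|."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps                       # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = area * h / 3            # integral of the density from 0 to |t|
    return 1 - 2 * half_area            # the density is symmetric about 0

times = [9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5]
mu0 = 10
n = len(times)

se = stdev(times) / sqrt(n)             # estimated standard error
t_stat = (mean(times) - mu0) / se
df = n - 1
p_value = t_two_tailed_p(t_stat, df)

print(f"t = {t_stat:.4f}, df = {df}, two-tailed p = {p_value:.4f}")
```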
| t | Density | Cumulative_P | Two_Tail_P |
|---|---|---|---|
| -4.0 | 0.0023 | 0.0016 | 0.0031 |
| -3.9 | 0.0028 | 0.0018 | 0.0036 |
| -3.8 | 0.0032 | 0.0021 | 0.0042 |
| -3.7 | 0.0038 | 0.0025 | 0.0049 |
| -3.6 | 0.0045 | 0.0029 | 0.0057 |
| -3.5 | 0.0053 | 0.0034 | 0.0067 |
| -3.4 | 0.0062 | 0.0039 | 0.0079 |
| -3.3 | 0.0074 | 0.0046 | 0.0092 |
| -3.2 | 0.0087 | 0.0054 | 0.0108 |
| -3.1 | 0.0103 | 0.0064 | 0.0127 |
| -3.0 | 0.0121 | 0.0075 | 0.0150 |
| -2.9 | 0.0143 | 0.0088 | 0.0176 |
| -2.8 | 0.0169 | 0.0104 | 0.0207 |
| -2.7 | 0.0200 | 0.0122 | 0.0244 |
| -2.6 | 0.0236 | 0.0144 | 0.0287 |
| -2.5 | 0.0278 | 0.0169 | 0.0339 |
| -2.4 | 0.0327 | 0.0199 | 0.0399 |
| -2.3 | 0.0385 | 0.0235 | 0.0470 |
| -2.2 | 0.0451 | 0.0277 | 0.0553 |
| -2.1 | 0.0528 | 0.0326 | 0.0651 |
| -2.0 | 0.0617 | 0.0383 | 0.0766 |
| -1.9 | 0.0719 | 0.0449 | 0.0899 |
| -1.8 | 0.0834 | 0.0527 | 0.1054 |
| -1.7 | 0.0964 | 0.0617 | 0.1233 |
| -1.6 | 0.1110 | 0.0720 | 0.1441 |
| -1.5 | 0.1272 | 0.0839 | 0.1679 |
| -1.4 | 0.1449 | 0.0975 | 0.1950 |
| -1.3 | 0.1641 | 0.1130 | 0.2259 |
| -1.2 | 0.1847 | 0.1304 | 0.2608 |
| -1.1 | 0.2065 | 0.1499 | 0.2999 |
| -1.0 | 0.2291 | 0.1717 | 0.3434 |
| -0.9 | 0.2522 | 0.1958 | 0.3916 |
| -0.8 | 0.2752 | 0.2222 | 0.4443 |
| -0.7 | 0.2977 | 0.2508 | 0.5016 |
| -0.6 | 0.3189 | 0.2817 | 0.5633 |
| -0.5 | 0.3384 | 0.3145 | 0.6291 |
| -0.4 | 0.3553 | 0.3492 | 0.6985 |
| -0.3 | 0.3692 | 0.3855 | 0.7710 |
| -0.2 | 0.3795 | 0.4230 | 0.8459 |
| -0.1 | 0.3859 | 0.4613 | 0.9225 |
| 0.0 | 0.3880 | 0.5000 | 1.0000 |
| 0.1 | 0.3859 | 0.5387 | 0.9225 |
| 0.2 | 0.3795 | 0.5770 | 0.8459 |
| 0.3 | 0.3692 | 0.6145 | 0.7710 |
| 0.4 | 0.3553 | 0.6508 | 0.6985 |
| 0.5 | 0.3384 | 0.6855 | 0.6291 |
| 0.6 | 0.3189 | 0.7183 | 0.5633 |
| 0.7 | 0.2977 | 0.7492 | 0.5016 |
| 0.8 | 0.2752 | 0.7778 | 0.4443 |
| 0.9 | 0.2522 | 0.8042 | 0.3916 |
| 1.0 | 0.2291 | 0.8283 | 0.3434 |
| 1.1 | 0.2065 | 0.8501 | 0.2999 |
| 1.2 | 0.1847 | 0.8696 | 0.2608 |
| 1.3 | 0.1641 | 0.8870 | 0.2259 |
| 1.4 | 0.1449 | 0.9025 | 0.1950 |
| 1.5 | 0.1272 | 0.9161 | 0.1679 |
| 1.6 | 0.1110 | 0.9280 | 0.1441 |
| 1.7 | 0.0964 | 0.9383 | 0.1233 |
| 1.8 | 0.0834 | 0.9473 | 0.1054 |
| 1.9 | 0.0719 | 0.9551 | 0.0899 |
| 2.0 | 0.0617 | 0.9617 | 0.0766 |
| 2.1 | 0.0528 | 0.9674 | 0.0651 |
| 2.2 | 0.0451 | 0.9723 | 0.0553 |
| 2.3 | 0.0385 | 0.9765 | 0.0470 |
| 2.4 | 0.0327 | 0.9801 | 0.0399 |
| 2.5 | 0.0278 | 0.9831 | 0.0339 |
| 2.6 | 0.0236 | 0.9856 | 0.0287 |
| 2.7 | 0.0200 | 0.9878 | 0.0244 |
| 2.8 | 0.0169 | 0.9896 | 0.0207 |
| 2.9 | 0.0143 | 0.9912 | 0.0176 |
| 3.0 | 0.0121 | 0.9925 | 0.0150 |
| 3.1 | 0.0103 | 0.9936 | 0.0127 |
| 3.2 | 0.0087 | 0.9946 | 0.0108 |
| 3.3 | 0.0074 | 0.9954 | 0.0092 |
| 3.4 | 0.0062 | 0.9961 | 0.0079 |
| 3.5 | 0.0053 | 0.9966 | 0.0067 |
| 3.6 | 0.0045 | 0.9971 | 0.0057 |
| 3.7 | 0.0038 | 0.9975 | 0.0049 |
| 3.8 | 0.0032 | 0.9979 | 0.0042 |
| 3.9 | 0.0028 | 0.9982 | 0.0036 |
| 4.0 | 0.0023 | 0.9984 | 0.0031 |
- Make a statistical decision.
Fail to reject H₀ because p-value (0.2815) > α (0.05).
The decision rule for a two-tailed test is: if p-value < α, reject H₀; otherwise, fail to reject H₀. Here, there is insufficient evidence to declare a significant difference from 10 minutes. This means the sample data are consistent with the hypothesis that the population mean is 10, even though the sample mean is lower; with n = 10 the standard error is large relative to the observed 0.14-minute gap, so the test has little ability to detect a difference this small.
- Explain how sample size affects inferential reliability.
Sample size (n) directly affects the reliability of statistical inference through several key mechanisms.
Effect on Standard Error (SE): SE = s / √n; a small n (such as 10 here) results in a larger SE (0.1222), which widens the confidence interval (CI) and reduces the precision of the μ estimate. For example, the 95% CI here is ≈ 9.86 ± 2.262 × 0.1222 ≈ (9.58, 10.14), wide enough to encompass 10. With n=100, the SE would be ≈0.0386, a narrower CI (e.g., 9.86 ± 1.96 × 0.0386 ≈ (9.78, 9.94)), increasing reliability by reducing uncertainty.
Effect on Test Power: Power (1 - β) is the probability of detecting a true effect. A small n decreases power in two ways: the standard error is larger, and the t-distribution has heavier tails at low df, which raises the critical value. Both increase the risk of a Type II error (failing to reject H₀ when it is false). Here, with a small-to-moderate effect (Cohen’s d ≈ -0.14 / 0.3864 ≈ -0.36), power is low (<0.5); a sufficiently larger n would raise power above 0.8, making the test more sensitive to true differences.
Practical Implications: In a UX context, n=10 may be sufficient for a pilot study but is unreliable for business decisions, since there is a risk of misallocating resources (e.g., redesigning an app based on noise). Precision improves only with √n: the SE is proportional to 1/√n, so n must be quadrupled to cut the SE in half.
Overall, a sample this small cannot capture the full variability between users and may lead to unstable conclusions; prioritize a larger n for more powerful and actionable inference.
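Because SE = s / √n, the standard error shrinks with the square root of the sample size, so n has to grow fourfold for the SE to halve. The sketch below illustrates this, assuming for simplicity that the sample standard deviation stays at the observed s = 0.3864 as n grows; the function name is illustrative.

```python
from math import sqrt

s = 0.3864  # sample standard deviation from the calculation above (held fixed)

def standard_error(s, n):
    """Standard error of the mean for sample SD s and sample size n."""
    return s / sqrt(n)

for n in (10, 40, 100, 160):
    print(f"n = {n:>3}: SE = {standard_error(s, n):.4f}")
```

Going from n = 10 to n = 40 halves the SE, as does going from n = 40 to n = 160.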
Visualization
Case Study 3
Two-Sample T-Test (A/B Testing)
A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.
| Version | Sample Size (n) | Mean | Standard Deviation |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |
Tasks
- Formulate the null and alternative hypotheses.
- Identify the type of t-test required.
- Compute the test statistic and p-value.
- Draw a statistical conclusion at \(\alpha = 0.05\).
- Interpret the result for product decision-making.
Answer
- Formulate the null and alternative hypotheses.
The null hypothesis (H₀) states that there is no significant difference between the average session duration for versions A and B of the landing page, so μ_A = μ_B. Mathematically:
\[ H_0: \mu_A = \mu_B \]
The alternative hypothesis (H₁) states that there is a significant difference between the average session durations (two-tailed because the question states “compare” without a specific direction such as “higher” or “lower”). Mathematically:
\[ H_1: \mu_A \neq \mu_B \]
This formulation reflects the goal of A/B testing, which is to detect a difference in a metric (session duration) between variants: H₀ represents the status quo (no effect) and is tested against evidence of a deviation. A two-sided approach is appropriate because there is no a priori hypothesis about direction (e.g., B might be more engaging or more distracting); a one-sided test would be unable to detect an effect in the unexpected direction, and choosing the direction after seeing the data would inflate the Type I error rate. This is standard practice in product analytics to avoid confirmation bias in rollout decisions.
- Identify the type of t-test required.
The required test is a Two-Sample Independent t-Test assuming unequal variances (Welch’s t-Test).
Justification:
This is a two-sample test because it compares two independent groups (versions A and B, assuming separate random users without pairing).
Independent because the samples are from different populations (A vs. B), not repeated measures.
Equal variances are not assumed: the sample standard deviations differ (SD_A = 1.2 vs SD_B = 1.4), and with n = 25 per group there is too little data to establish that the population variances are equal. Welch’s test adjusts the df to remain valid in either case, avoiding the bias a pooled variance introduces when equal variances are incorrectly assumed.
Alternatives such as the paired t-test are inappropriate because the observations are not matched pairs, and the Z-test is inappropriate because σ is unknown. Levene’s test could be used to check the equal-variance assumption if the raw data were available, but Welch’s test is the safe default here. Compared with the equal-variance t-test, which can inflate the Type I error rate when variances and group sizes both differ, this choice reduces the risk of inference errors.
- Compute the test statistic and p-value.
The formula for the t-statistic in Welch’s t-test is:
\[ t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}} \]
The degrees of freedom (df) are approximated by the Welch-Satterthwaite equation:
\[ df = \frac{\left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2}{\frac{s_A^4}{n_A^2 (n_A - 1)} + \frac{s_B^4}{n_B^2 (n_B - 1)}} \]
The two-tailed p-value is then obtained from the t-distribution with this df.
Identify the input values:
Version A: \(n_A = 25, \bar{x}_A = 4.8, s_A = 1.2\)
Version B: \(n_B = 25, \bar{x}_B = 5.4, s_B = 1.4\)
Calculate the numerator (mean difference):
\[ \bar{x}_A - \bar{x}_B = 4.8 - 5.4 = -0.6 \]
Calculate the combined standard error (SE):
\[ \frac{s_A^2}{n_A} = \frac{1.2^2}{25} = \frac{1.44}{25} = 0.0576 \]
\[\frac{s_B^2}{n_B} = \frac{1.4^2}{25} = \frac{1.96}{25} = 0.0784\]
\[ SE = \sqrt{0.0576 + 0.0784} = \sqrt{0.136} \approx 0.3688 \]
Calculate the t-statistic:
\[ t = \frac{-0.6}{0.3688} \approx -1.627 \]
Calculate df:
\[ \left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2 = (0.0576 + 0.0784)^2 = 0.136^2 = 0.018496 \]
\[ \frac{s_A^4}{n_A^2 (n_A - 1)} = \frac{1.44^2}{25^2 \times 24} = \frac{2.0736}{625 \times 24} = \frac{2.0736}{15000} \approx 0.0001382 \]
\[ \frac{s_B^4}{n_B^2 (n_B - 1)} = \frac{1.96^2}{25^2 \times 24} = \frac{3.8416}{15000} \approx 0.0002561 \]
\[ df = \frac{0.018496}{0.0001382 + 0.0002561} \approx \frac{0.018496}{0.0003943} \approx 46.90 \]
Calculate the p-value (two-tailed): Use the cumulative t-distribution; P(t ≤ -|t|) × 2 ≈ 0.1104
A negative t-value indicates A is lower than B, but a two-tailed test focuses on magnitude. The df is rounded to 47 for the t-table, with the critical value ≈ ±2.012 (α=0.05). Since |t| = 1.627 < 2.012, the p-value > 0.05.
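Welch's test can be computed directly from the summary statistics in the table. The following is a sketch, not a library implementation: `welch_from_summary` and the t-distribution helpers are hypothetical names, and the p-value is obtained by numerically integrating the t density because the standard library has no t CDF.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution (df may be non-integer)."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    area = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (area * h / 3)

def welch_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, df, two-tailed p) for Welch's unequal-variance t-test."""
    va, vb = sd_a ** 2 / n_a, sd_b ** 2 / n_b    # squared standard errors
    se = sqrt(va + vb)
    t = (mean_a - mean_b) / se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df, t_two_tailed_p(t, df)

t_stat, df, p_value = welch_from_summary(4.8, 1.2, 25, 5.4, 1.4, 25)
print(f"t = {t_stat:.3f}, df = {df:.2f}, two-tailed p = {p_value:.4f}")
```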
| t | Density | Cumulative_P | Two_Tail_P |
|---|---|---|---|
| -4.0 | 0.0004 | 0.0001 | 0.0002 |
| -3.9 | 0.0005 | 0.0002 | 0.0003 |
| -3.8 | 0.0006 | 0.0002 | 0.0004 |
| -3.7 | 0.0009 | 0.0003 | 0.0006 |
| -3.6 | 0.0011 | 0.0004 | 0.0008 |
| -3.5 | 0.0015 | 0.0005 | 0.0010 |
| -3.4 | 0.0020 | 0.0007 | 0.0014 |
| -3.3 | 0.0027 | 0.0009 | 0.0018 |
| -3.2 | 0.0035 | 0.0012 | 0.0025 |
| -3.1 | 0.0046 | 0.0016 | 0.0033 |
| -3.0 | 0.0059 | 0.0022 | 0.0043 |
| -2.9 | 0.0076 | 0.0028 | 0.0057 |
| -2.8 | 0.0098 | 0.0037 | 0.0074 |
| -2.7 | 0.0125 | 0.0048 | 0.0096 |
| -2.6 | 0.0158 | 0.0062 | 0.0124 |
| -2.5 | 0.0198 | 0.0080 | 0.0160 |
| -2.4 | 0.0248 | 0.0102 | 0.0204 |
| -2.3 | 0.0307 | 0.0130 | 0.0259 |
| -2.2 | 0.0378 | 0.0164 | 0.0328 |
| -2.1 | 0.0461 | 0.0206 | 0.0411 |
| -2.0 | 0.0559 | 0.0256 | 0.0513 |
| -1.9 | 0.0672 | 0.0318 | 0.0636 |
| -1.8 | 0.0801 | 0.0391 | 0.0783 |
| -1.7 | 0.0948 | 0.0479 | 0.0957 |
| -1.6 | 0.1111 | 0.0581 | 0.1163 |
| -1.5 | 0.1292 | 0.0702 | 0.1403 |
| -1.4 | 0.1489 | 0.0840 | 0.1681 |
| -1.3 | 0.1700 | 0.1000 | 0.1999 |
| -1.2 | 0.1923 | 0.1181 | 0.2362 |
| -1.1 | 0.2156 | 0.1385 | 0.2769 |
| -1.0 | 0.2394 | 0.1612 | 0.3224 |
| -0.9 | 0.2633 | 0.1864 | 0.3727 |
| -0.8 | 0.2868 | 0.2139 | 0.4277 |
| -0.7 | 0.3094 | 0.2437 | 0.4874 |
| -0.6 | 0.3304 | 0.2757 | 0.5514 |
| -0.5 | 0.3494 | 0.3097 | 0.6194 |
| -0.4 | 0.3657 | 0.3455 | 0.6910 |
| -0.3 | 0.3790 | 0.3828 | 0.7655 |
| -0.2 | 0.3888 | 0.4212 | 0.8423 |
| -0.1 | 0.3948 | 0.4604 | 0.9208 |
| 0.0 | 0.3968 | 0.5000 | 1.0000 |
| 0.1 | 0.3948 | 0.5396 | 0.9208 |
| 0.2 | 0.3888 | 0.5788 | 0.8423 |
| 0.3 | 0.3790 | 0.6172 | 0.7655 |
| 0.4 | 0.3657 | 0.6545 | 0.6910 |
| 0.5 | 0.3494 | 0.6903 | 0.6194 |
| 0.6 | 0.3304 | 0.7243 | 0.5514 |
| 0.7 | 0.3094 | 0.7563 | 0.4874 |
| 0.8 | 0.2868 | 0.7861 | 0.4277 |
| 0.9 | 0.2633 | 0.8136 | 0.3727 |
| 1.0 | 0.2394 | 0.8388 | 0.3224 |
| 1.1 | 0.2156 | 0.8615 | 0.2769 |
| 1.2 | 0.1923 | 0.8819 | 0.2362 |
| 1.3 | 0.1700 | 0.9000 | 0.1999 |
| 1.4 | 0.1489 | 0.9160 | 0.1681 |
| 1.5 | 0.1292 | 0.9298 | 0.1403 |
| 1.6 | 0.1111 | 0.9419 | 0.1163 |
| 1.7 | 0.0948 | 0.9521 | 0.0957 |
| 1.8 | 0.0801 | 0.9609 | 0.0783 |
| 1.9 | 0.0672 | 0.9682 | 0.0636 |
| 2.0 | 0.0559 | 0.9744 | 0.0513 |
| 2.1 | 0.0461 | 0.9794 | 0.0411 |
| 2.2 | 0.0378 | 0.9836 | 0.0328 |
| 2.3 | 0.0307 | 0.9870 | 0.0259 |
| 2.4 | 0.0248 | 0.9898 | 0.0204 |
| 2.5 | 0.0198 | 0.9920 | 0.0160 |
| 2.6 | 0.0158 | 0.9938 | 0.0124 |
| 2.7 | 0.0125 | 0.9952 | 0.0096 |
| 2.8 | 0.0098 | 0.9963 | 0.0074 |
| 2.9 | 0.0076 | 0.9972 | 0.0057 |
| 3.0 | 0.0059 | 0.9978 | 0.0043 |
| 3.1 | 0.0046 | 0.9984 | 0.0033 |
| 3.2 | 0.0035 | 0.9988 | 0.0025 |
| 3.3 | 0.0027 | 0.9991 | 0.0018 |
| 3.4 | 0.0020 | 0.9993 | 0.0014 |
| 3.5 | 0.0015 | 0.9995 | 0.0010 |
| 3.6 | 0.0011 | 0.9996 | 0.0008 |
| 3.7 | 0.0009 | 0.9997 | 0.0006 |
| 3.8 | 0.0006 | 0.9998 | 0.0004 |
| 3.9 | 0.0005 | 0.9998 | 0.0003 |
| 4.0 | 0.0004 | 0.9999 | 0.0002 |
- Draw a statistical conclusion at \(\alpha = 0.05\).
Fail to reject H₀ because p-value (0.110) > α (0.05).
From the calculation (t ≈ -1.627, df ≈ 46.9, p-value ≈ 0.1104), the conclusion is that we fail to reject H₀. This is because p-value (0.1104) > α (0.05), indicating insufficient evidence to support H₁ (a significant difference between μ_A and μ_B).
In the Neyman-Pearson framework, α represents the maximum acceptable risk of a Type I error (rejecting H₀ when it is true). The p-value is the probability, computed under H₀, of observing data at least as extreme as those actually observed; if it exceeds α, the data are consistent with H₀ and the status quo is maintained. Here, p = 0.1104 means that if μ_A = μ_B were true, a difference of 0.6 minutes or more in either direction would arise from sampling variation alone about 11.04% of the time, which is too often to reject H₀.
The critical t-value for df=47, α=0.05, two-sided, is ±2.012 (from qt(0.975, 47) in R). Since |t|=1.627 < 2.012, the critical-value approach confirms the decision not to reject H₀. The 95% confidence interval for the difference in means, (-0.6) ± 2.012 × 0.3688 ≈ (-1.34, 0.14), includes 0 and is therefore consistent with H₀.
The conclusion does not depend on the variance assumption: under equal variances (pooled t-test), t ≈ -1.627 (identical, because the two groups have the same size), df = 48, and p ≈ 0.110, so the results are nearly the same, although Welch’s test is more robust in general. If normality were violated (e.g., skewed data), the non-parametric Mann-Whitney test would be an alternative; normality is assumed here.
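The confidence interval and the comparison with the pooled test can be verified from the summary statistics. In this sketch the critical value 2.012 is copied from the text rather than computed, and the variable names are illustrative. With equal group sizes the pooled and Welch standard errors coincide, which is why the two t-statistics match.

```python
from math import sqrt

n_a = n_b = 25
mean_a, sd_a = 4.8, 1.2
mean_b, sd_b = 5.4, 1.4
diff = mean_a - mean_b

# Welch standard error: each group contributes its own variance
se_welch = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

# Pooled variance: sample variances weighted by their degrees of freedom
sp2 = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
se_pooled = sqrt(sp2 * (1 / n_a + 1 / n_b))

t_crit = 2.012  # critical value quoted in the text for df = 47, alpha = 0.05
ci = (diff - t_crit * se_welch, diff + t_crit * se_welch)

print(f"SE (Welch) = {se_welch:.4f}, SE (pooled) = {se_pooled:.4f}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```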
- Interpret the result for product decision-making.
The result (fail to reject H₀, p=0.110) means there is no strong evidence of a difference in session duration between A (4.8 minutes) and B (5.4 minutes), so B should not be assumed superior without further testing. The observed uplift of 0.6 minutes (12.5%) could still matter to the business if it is real and holds at scale (e.g., for retention or ad revenue), but the estimate is unreliable here because the test has low power.
Interpretation:
Session duration is a proxy for engagement; on a landing page, a longer session suggests more engaged users, potentially increasing conversion rates by 5-10% (a figure attributed to industry benchmarks such as Google Analytics data). However, p > 0.05 means the observed difference could be sampling error, which leaves two risks. Rolling out B risks incurring implementation costs (development time, testing) without a proven ROI; ignoring B risks a missed opportunity if the effect is real but went undetected (high β).
Cohen’s d = 0.46 indicates a small-to-moderate effect (above the 0.2 threshold for “small” and close to the 0.5 threshold for “medium”), which would be practically meaningful. However, with n = 25 per group the power to detect an effect of this size is only about 0.36, so the test is underpowered, and drawing a final conclusion without first calculating the minimum n would be premature: for a power of 0.8, roughly n ≈ 75 per group is needed (e.g., via pwr.t.test(d=0.46, sig.level=0.05, power=0.8) in R). Limitation: n=25 is too small to detect a moderate effect reliably, so the experiment risks being wasted.
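The effect size, approximate power, and required sample size can be estimated with a normal approximation. This is a rough sketch only: it replaces the noncentral t-distribution with a normal distribution, so dedicated power software will give somewhat different figures (typically slightly lower power and a slightly larger n). Variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()
alpha, target_power = 0.05, 0.80
n = 25                                      # observations per group

sd_pooled = sqrt((1.2 ** 2 + 1.4 ** 2) / 2)  # pooled SD for equal group sizes
d = (5.4 - 4.8) / sd_pooled                  # Cohen's d

z_alpha = norm.inv_cdf(1 - alpha / 2)        # two-sided critical value
shift = d * sqrt(n / 2)                      # expected value of the test statistic
power_approx = norm.cdf(shift - z_alpha) + norm.cdf(-shift - z_alpha)

z_power = norm.inv_cdf(target_power)
n_required = ceil(2 * ((z_alpha + z_power) / d) ** 2)  # per group

print(f"d = {d:.2f}, approximate power = {power_approx:.2f}, "
      f"n per group for 80% power = {n_required}")
```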
Business Implications:
Financial: Assuming revenue per session minute is ≈$0.01 (from ads), an uplift of 0.6 minutes × 1 million sessions/month would yield an additional $6,000 per month. Because the uplift is uncertain, the decision should rest on expected value: (probability that the effect is real × gain) - cost. Note that p = 0.11 is not the probability that H₀ or H₁ is true; treating a p-value as a posterior probability is a common mistake, and a Bayesian analysis with an explicit prior is needed to estimate that probability.
The interpretation must be actionable: this result is not a “no go” but rather “insufficient evidence: replicate or pivot.”
Visualization
Case Study 4
Chi-Square Test of Independence
An e-commerce company examines whether device type is associated with payment method preference.
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
Tasks
- State the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
- Identify the appropriate statistical test.
- Compute the Chi-Square statistic (χ²).
- Determine the p-value at \(\alpha = 0.05\).
- Interpret the results in terms of digital payment strategy.
Answer
- State the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
The null hypothesis (H₀) states that there is no association between device type (Mobile vs Desktop) and payment method preference (E-Wallet, Credit Card, Cash on Delivery), so both variables are independent. Mathematically:
\[ H_0: \text{Device type and payment method are independent.} \]
The alternative hypothesis (H₁) states that there is an association between device type and payment method preference, so the two variables are dependent. Mathematically:
\[ H_1: \text{Device type and payment method are not independent (associated).} \]
This formulation follows the principle of categorical dependence tests, where H₀ assumes the observed frequency distribution matches the expected based on marginal totals (no interaction), while H₁ detects deviations that indicate patterns (e.g., mobile users prefer E-Wallet due to app-based convenience). This is appropriate for 2x3 contingency data, without specific directionality because the question focuses on “associated” rather than a single direction such as “mobile users prefer E-Wallet”.
- Identify the appropriate statistical test.
The appropriate statistical test is the Chi-Square Test of Independence.
Justification:
This tests for the dependence of two categorical variables (device: 2 levels; payment: 3 levels), comparing observed and expected frequencies under the assumption of independence.
The data are in the form of counts/frequencies in a contingency table, not a single mean or proportion, so the Chi-Square is more appropriate than the t-test (for means) or Z-test (for proportions).
The samples are assumed to be random and independent (e-commerce transaction data), the expected frequencies exceed 5 in every cell (verified in the calculation), and the total of n=440 is large enough for the Chi-Square approximation to hold.
Alternatives such as Fisher’s Exact Test are only needed when expected counts fall below 5 (not the case here); logistic regression is geared toward prediction, whereas the focus here is a simple test of association. The Chi-Square test is therefore the simplest valid choice for this business question and, unlike tests for means, makes no distributional assumption about a numeric outcome.
- Compute the Chi-Square statistic (χ²).
The main formula for the Chi-Square statistic is:
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
where \(O_{ij}\) is the observed frequency and \(E_{ij}\) the expected frequency in row \(i\), column \(j\), with \(r=2\) (device levels) and \(c=3\) (payment methods).
Identify observed frequencies (\(O_{ij}\)) and totals:
Observed table:
\[ \begin{array}{c|ccc|c} \text{Device} & \text{E-Wallet} & \text{Credit Card} & \text{Cash on Delivery} & \text{Row Total} \\ \hline \text{Mobile} & 120 & 80 & 50 & 250 \\ \text{Desktop} & 60 & 90 & 40 & 190 \\ \hline \text{Column Total} & 180 & 170 & 90 & N=440 \\ \end{array} \]
Calculate the expected frequencies (\(E_{ij}\)) using the formula:
\[ E_{ij} = \frac{(\text{row total}_i \times \text{column total}_j)}{N} \]
- Mobile, E-Wallet:
\[E_{11} = \frac{250 \times 180}{440} = \frac{45000}{440} \approx 102.2727\]
- Mobile, Credit Card:
\[E_{12} = \frac{250 \times 170}{440} = \frac{42500}{440} \approx 96.5909\]
- Mobile, Cash:
\[E_{13} = \frac{250 \times 90}{440} = \frac{22500}{440} \approx 51.1364\]
- Desktop, E-Wallet:
\[E_{21} = \frac{190 \times 180}{440} = \frac{34200}{440} \approx 77.7273\]
- Desktop, Credit Card:
\[E_{22} = \frac{190 \times 170}{440} = \frac{32300}{440} \approx 73.4091\]
- Desktop, Cash:
\[E_{23} = \frac{190 \times 90}{440} = \frac{17100}{440} \approx 38.8636\]
- All \(E_{ij} > 5\), so the expected-frequency assumption holds; the expected counts also sum to the observed row and column totals.
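The expected frequencies above can be reproduced with a short script. This is a minimal sketch in plain Python; the function name `expected_frequencies` is an illustrative choice and not part of the original analysis.

```python
# Observed counts from the contingency table
# (rows: Mobile, Desktop; columns: E-Wallet, Credit Card, Cash on Delivery).
observed = [
    [120, 80, 50],
    [60, 90, 40],
]


def expected_frequencies(table):
    """Return E_ij = row_total_i * column_total_j / N for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(("Mobile", "Desktop"), expected):
    print(label, [round(e, 4) for e in row])

# Rule-of-thumb check used in the text: every expected count above 5.
all_above_five = all(e > 5 for row in expected for e in row)
print("All expected counts > 5:", all_above_five)
```

Because the expected counts are built only from the marginal totals, each row of `expected` sums back to the observed row total.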
Calculate the deviation (\(O_{ij} - E_{ij}\)) for each cell:
\[O_{11} - E_{11} = 120 - 102.2727 \approx 17.7273\] \[O_{12} - E_{12} = 80 - 96.5909 \approx -16.5909\] \[O_{13} - E_{13} = 50 - 51.1364 \approx -1.1364\] \[O_{21} - E_{21} = 60 - 77.7273 \approx -17.7273\] \[O_{22} - E_{22} = 90 - 73.4091 \approx 16.5909\] \[O_{23} - E_{23} = 40 - 38.8636 \approx 1.1364\]
Calculate the squared deviation divided by the expected (\(\frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)):
- Cell 11:
\[\frac{(17.7273)^2}{102.2727} = \frac{314.257}{102.2727} \approx 3.0727\]
- Cell 12:
\[\frac{(-16.5909)^2}{96.5909} = \frac{275.258}{96.5909} \approx 2.8497\]
- Cell 13:
\[\frac{(-1.1364)^2}{51.1364} = \frac{1.2914}{51.1364} \approx 0.0253\]
- Cell 21:
\[\frac{(-17.7273)^2}{77.7273} = \frac{314.257}{77.7273} \approx 4.0431\]
- Cell 22:
\[\frac{(16.5909)^2}{73.4091} = \frac{275.258}{73.4091} \approx 3.7496\]
- Cell 23:
\[\frac{(1.1364)^2}{38.8636} = \frac{1.2914}{38.8636} \approx 0.0332\]
Sum all contributions to χ²:
\[ \chi^2 \approx 3.0727 + 2.8497 + 0.0253 + 4.0431 + 3.7496 + 0.0332 = 13.7736 \]
The E-Wallet and Credit Card cells contribute almost all of the statistic, so they drive the rejection of independence; the Cash on Delivery cells contribute very little.
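The cell-by-cell arithmetic can be checked without intermediate rounding. The sketch below recomputes each contribution \((O_{ij} - E_{ij})^2 / E_{ij}\) at full floating-point precision; the helper name `chi_square_contributions` is hypothetical.

```python
observed = [
    [120, 80, 50],   # Mobile:  E-Wallet, Credit Card, Cash on Delivery
    [60, 90, 40],    # Desktop: E-Wallet, Credit Card, Cash on Delivery
]


def chi_square_contributions(table):
    """Return the matrix of (O - E)^2 / E terms for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    contributions = []
    for i, row in enumerate(table):
        contributions.append([])
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            contributions[i].append((o - e) ** 2 / e)
    return contributions


contributions = chi_square_contributions(observed)
chi2 = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(c, 4) for c in row])
print("chi-square:", round(chi2, 4))
```

Working without rounded intermediate values avoids the small drift that appears when each term is truncated to four decimals before summing.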
- Determine the p-value at \(\alpha = 0.05\).
The p-value is the upper-tail probability of the Chi-Square distribution beyond the observed statistic:
\[ p = 1 - F_{\chi^2}(\chi^2 \mid df) \]
where \(F\) is the CDF of Chi-Square and df = (r-1)(c-1)=2
Calculate the degrees of freedom (df):
\[ df = (2-1)(3-1) = 1 \times 2 = 2 \]
Identify the critical value for α=0.05:
- From the Chi-Square table or formula:
\[\chi^2_{\text{crit}} = q_{\chi^2}(1 - 0.05 \mid df=2) \approx 5.9915\]
- Compare: If \(\chi^2 > 5.9915\), reject H₀ at α=0.05.
Calculate the p-value using the CDF:
The p-value is the upper-tail integral \(p = \int_{\chi^2}^{\infty} f(x \mid df=2) \, dx\), where \(f\) is the Chi-Square PDF.
With \(\chi^2 \approx 13.7736\), p ≈ 0.00102 (e.g., 1 - pchisq(13.7736, 2) in R).
Manual approximation (if no tool is available): in the Chi-Square table for df=2, χ²=13.77 lies between 9.21 (p=0.01) and 13.82 (p=0.001), so interpolation gives p ≈ 0.001.
Compare the p-value with α and make a decision:
p ≈ 0.001 < 0.05, so reject H₀.
Verify assumptions: all expected counts exceed 5 and there are no empty cells; if this were violated, Fisher’s Exact Test would be the fallback.
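For df = 2 specifically, the Chi-Square distribution coincides with an exponential distribution with mean 2, so the tail probability and the critical value have closed forms: \(p = e^{-\chi^2/2}\) and \(\chi^2_{\text{crit}} = -2\ln\alpha\). The sketch below relies on that special case only; it is not valid for other degrees of freedom, and the function names are illustrative.

```python
import math


def chi2_sf_df2(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with df = 2."""
    return math.exp(-x / 2.0)


def chi2_critical_df2(alpha):
    """Critical value c with P(X >= c) = alpha for df = 2."""
    return -2.0 * math.log(alpha)


chi2_stat = 13.7736   # statistic computed from the contingency table
alpha = 0.05

p_value = chi2_sf_df2(chi2_stat)
critical = chi2_critical_df2(alpha)
reject_h0 = p_value < alpha

print("critical value:", round(critical, 4))
print("p-value:", round(p_value, 5))
print("reject H0:", reject_h0)
```

The p-value rule and the critical-value rule always agree, since both compare the same statistic against the same tail of the distribution.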
- Interpret the results in terms of digital payment strategy.
Rejecting H₀ indicates a significant association between device and payment method (p ≈ 0.001 < 0.05), with the following pattern: mobile users choose E-Wallet more often than expected (120 vs. 102) and Credit Card less often (80 vs. 97), while desktop users show the reverse (Credit Card 90 vs. 73, E-Wallet 60 vs. 78). For an e-commerce digital payment strategy, this suggests device-based UX optimization to increase conversion: for example, prioritize E-Wallet integration in the mobile app (e.g., seamless QR scan) and Credit Card convenience on desktop (e.g., saved-card auto-fill). The effect size is small to medium, so the association is real but not dominant, and it may be influenced by confounding factors such as demographics (e.g., younger mobile users preferring digital wallets).
Business Consequences:
Operational Risk: Ignoring the association may raise churn risk. For example, mobile users may be frustrated if Cash on Delivery is the default, which reduces impulse purchases; desktop users appear to favor Credit Card, so fraud protection there adds value.
Financial Impact: As an illustrative assumption rather than a result of this sample (440 transactions), a 5% conversion uplift from optimized payment flows could amount to millions of dollars in additional annual revenue on a large platform; the uplift figure should be checked against industry benchmarks. However, because the effect is small, the expected ROI is moderate and should be quantified with further A/B testing.
Improvement Strategy: Segment campaigns (push e-wallet promos via mobile push notifications); integrate this data into ML recommendations (e.g., suggest payments based on device). Overall, the results encourage a shift to mobile-first digital wallets for emerging market expansion, but validate with larger data sets for robustness.
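The interpretation above calls the effect size small to medium without naming a measure. One common choice for contingency tables is Cramér's V, \(V = \sqrt{\chi^2 / (N \cdot \min(r-1, c-1))}\). The sketch below assumes that this is the intended measure, which the original text does not confirm, and uses Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large when \(\min(r-1, c-1) = 1\)).

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an r x c contingency table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Values from the device x payment-method table (2 rows, 3 columns).
v = cramers_v(chi2=13.7736, n=440, n_rows=2, n_cols=3)
print("Cramér's V:", round(v, 3))
```

A value between the small and medium benchmarks is consistent with reading the association as statistically reliable but modest in strength.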
Visualization
Case Study 5
Type I and Type II Errors (Conceptual)
A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.
- H₀: The new algorithm does not reduce fraud.
- H₁: The new algorithm reduces fraud.
Tasks
- Explain a Type I Error (α) in this context.
- Explain a Type II Error (β) in this context.
- Identify which error is more costly from a business perspective.
- Discuss how sample size affects Type II Error.
- Explain the relationship between α, β, and statistical power.
Answer
- Explain a Type I Error (α) in this context.
A Type I error (α) occurs when we reject H₀ when H₀ is actually true—in this context, concluding that a new algorithm reduces fraud when it actually does not. Mathematically, α is the probability of rejecting H₀ | H₀ is true:
\[\alpha = P(\text{Reject } H_0 \mid H_0 \text{ true})\]
In a fintech startup, this means rolling out the new algorithm on the basis of misleading test data (e.g., random fluctuations that look like a fraud reduction), which wastes implementation costs (e.g., system updates, staff training) without any real benefit. This risk is controlled by the significance level (usually 0.05), but it is amplified under multiple testing: if repeated tests are run without an adjustment such as Bonferroni, the effective α rises and so does the number of false positives. This is not just a statistical error but a business error: an over-optimistic assessment of the algorithm can damage the company’s reputation if fraud remains high after rollout, eroding investor and user trust.
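The multiple-testing point can be made concrete. Assuming m independent tests that each use level α, the chance of at least one false positive is \(1 - (1 - \alpha)^m\). The sketch below uses m = 10 as a purely hypothetical number of repeated tests and compares the unadjusted rate with a Bonferroni-adjusted one.

```python
def familywise_error_rate(alpha, m):
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


alpha = 0.05
m = 10  # hypothetical number of repeated tests, not taken from the case

unadjusted = familywise_error_rate(alpha, m)
bonferroni_alpha = alpha / m               # Bonferroni: split alpha across tests
adjusted = familywise_error_rate(bonferroni_alpha, m)

print("unadjusted family-wise rate:", round(unadjusted, 4))
print("per-test alpha after Bonferroni:", bonferroni_alpha)
print("adjusted family-wise rate:", round(adjusted, 4))
```

The independence assumption matters: with correlated tests the unadjusted rate is lower than this formula suggests, while the Bonferroni bound still holds.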
- Explain a Type II Error (β) in this context.
A Type II error (β) occurs when we fail to reject H₀ when H₁ is actually true—in this context, concluding that a new algorithm does not reduce fraud when it actually does. Mathematically, β is the probability of failing to reject H₀ | H₁ is true:
\[ \beta = P(\text{Fail to reject } H_0 \mid H_1 \text{ true}) \]
In a fintech startup, this means discarding an effective algorithm and missing the opportunity to reduce fraud losses (potentially millions of dollars per year in fraudulent transactions). β is driven by factors such as a small effect size (minimal fraud reduction), high data variability (noise from transaction variation), or a small sample size (low detection power). Business consequences: competitors may adopt similar technology sooner and erode market share, and weak fraud controls can expose the company to penalties or legal action under payment-security requirements (e.g., PCI DSS). This is a “missed opportunity” error, often undervalued compared with α because it is not immediately visible, yet it cumulatively undermines long-term growth.
- Identify which error is more costly from a business perspective.
From the perspective of a fintech startup testing a fraud detection algorithm, a Type II error (β) is more costly than a Type I error (α). A Type II error means failing to detect a true fraud reduction (failing to reject H₀ when H₁ is true), so the startup keeps absorbing high fraud losses instead of adopting an effective solution. These losses are immediate and financial: assets lost to fraudulent transactions, manual investigation costs, and regulatory fines from agencies such as the Financial Services Authority (OJK) or the SEC. They also erode reputation in the long term: users lose trust, churn rates increase, and investors withdraw funds because risk management appears weak.
Conversely, a Type I error (rejecting H₀ when it is true) means an ineffective algorithm is rolled out. This causes one-time costs such as system development or training, but it can be reversed quickly without permanent losses, for example by switching back to the old algorithm within weeks. Startup priorities center on scalability and on mitigating existential risk; β is more damaging because of its cumulative opportunity cost (missing efficiencies that could keep the company solvent), while α is a temporary setback. The judgment rests on cost asymmetry: money lost to fraud is gone permanently, whereas an incorrect implementation is reversible.
- Discuss how sample size affects Type II Error.
Sample size (n) and Type II error (β) are inversely related: the larger the n, the smaller the β, because a larger sample increases the test’s ability to detect a real effect. β depends on the overlap between the sampling distributions under H₀ and H₁; a larger n reduces the standard error (SE = σ / √n), which narrows both distributions, reduces their overlap, and makes it easier to reject H₀ when H₁ is true.
In the fintech context, if an algorithm is tested with a small n (e.g., 50 transactions), β is high because the variability of the fraud data (a fluctuating fraud rate) dominates the signal: a real fraud reduction can look like random noise, so the test fails to reject H₀. When n is increased to 500, the SE decreases, power increases, and β decreases, so the test becomes sensitive enough to detect small changes such as a 5% fraud reduction. Mathematically, for a one-sided Z-test (a normal approximation for the fraud rate):
\[ \beta \approx \Phi \left( z_{\alpha} - \frac{\delta}{\sigma / \sqrt{n}} \right) \]
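where \(\delta\) is the true reduction to be detected (the effect size in the original units), \(\sigma\) is the population standard deviation, \(z_{\alpha}\) is the one-sided critical value, and \(\Phi\) is the standard normal CDF.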
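The formula can be evaluated directly for the two sample sizes mentioned above. The sketch below uses Python's `statistics.NormalDist`; the values δ = 0.2 and σ = 1.0 are hypothetical placeholders chosen only to show the trend, not estimates from real fraud data.

```python
from statistics import NormalDist


def type_ii_error(delta, sigma, n, alpha=0.05):
    """beta = Phi(z_alpha - delta * sqrt(n) / sigma) for a one-sided Z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)   # one-sided critical value
    return std_normal.cdf(z_alpha - delta * n ** 0.5 / sigma)


# Hypothetical effect and spread; only the sample size changes.
beta_small_n = type_ii_error(delta=0.2, sigma=1.0, n=50)
beta_large_n = type_ii_error(delta=0.2, sigma=1.0, n=500)

print("beta at n=50: ", round(beta_small_n, 4))
print("beta at n=500:", round(beta_large_n, 4))
```

Under these assumed inputs the test misses the effect more often than not at n = 50, while at n = 500 a miss becomes rare.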
- Explain the relationship between α, β, and statistical power.
Statistical power is defined as 1 - β, the probability of rejecting H₀ when H₁ is true. With n and the effect size fixed, α and β trade off against each other: power can be raised (β lowered) only by accepting a larger α. α (the risk of a false positive) determines the critical region (e.g., |z| > 1.96 for a two-sided α=0.05); β is the area under the H₁ distribution that falls outside the critical region. Lowering α (a tighter critical region, e.g., |z| > 2.58 for α=0.01) enlarges the non-critical region and therefore increases β (reduces power): a conservative choice that risks missed detection.
In fintech, prioritize high power (at least 0.8) to minimize β, because the cost of ignoring an effective algorithm (missed fraud) > the cost of a false alarm (wasted implementation). Mathematically, for the Z test:
\[ \text{Power} = 1 - \beta = 1 - \Phi \left( z_{\alpha} - \frac{\delta \sqrt{n}}{\sigma} \right) \]
Increasing α (decreasing z_α) increases power (reduces β), but the trade-off can be eased by increasing n or by targeting a larger effect size. Ignoring this relationship risks an imbalanced test design: a startup that sets α very low to be safe may end up with a high β, overlook real opportunities, and lose competitive advantage. Run a power analysis before the test to balance the two error rates so that the result is actionable.
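A pre-test power analysis of this kind can be sketched as follows. Solving the one-sided power formula for n gives \(n \geq \left((z_{\alpha} + z_{\text{power}})\,\sigma / \delta\right)^2\). The inputs δ = 0.2, σ = 1.0, and n = 100 are hypothetical and serve only to illustrate the α versus power trade-off and the sample-size calculation.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided(delta, sigma, n, alpha):
    """Power = 1 - Phi(z_alpha - delta * sqrt(n) / sigma) for a one-sided Z-test."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_alpha - delta * n ** 0.5 / sigma)


def required_sample_size(delta, sigma, alpha, target_power):
    """Smallest n whose power reaches target_power for the given effect."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)


# Same hypothetical effect and n, three significance levels.
powers = {a: power_one_sided(0.2, 1.0, 100, a) for a in (0.10, 0.05, 0.01)}
for a, pw in powers.items():
    print(f"alpha={a:.2f} -> power={pw:.3f}")

n_needed = required_sample_size(0.2, 1.0, alpha=0.05, target_power=0.80)
print("n needed for 80% power at alpha=0.05:", n_needed)
```

The loop shows power falling as α is tightened at fixed n, and the last line shows how raising n restores power without loosening α.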
Visualization
Case Study 6
P-Value and Statistical Decision Making
A churn prediction model evaluation yields the following results:
- Test statistic = 2.31
- p-value = 0.021
- Significance level: \(\alpha = 0.05\)
Tasks
- Explain the meaning of the p-value.
- Make a statistical decision.
- Translate the decision into non-technical language for management.
- Discuss the risk if the sample is not representative.
- Explain why the p-value does not measure effect size.
Answer
- Explain the meaning of the p-value.
A p-value of 0.021 is the probability of obtaining a test statistic at least as extreme as the observed 2.31 if the null hypothesis is true. For a two-sided test, which is consistent with the critical value of about 1.96 used in the decision step, the p-value is defined as:
\[ p = P(|T| \geq |t| \mid H_0 \text{ true}) \]
where \(T\) is the test statistic, whose distribution is taken under H₀, and \(t = 2.31\) is the observed value. In the context of churn prediction models, this measures how inconsistent the data are with the assumption that the model provides no significant improvement (H₀: no effect on churn prediction). A low value such as 0.021 indicates that data like these rarely occur under H₀, which is evidence against H₀ but not direct proof of H₁ and not a measure of the strength of the effect.
The p-value is not the probability that H₀ is true (a common misinterpretation); it is computed conditional on H₀. It is also sensitive to sample size: a large n can produce a low p even for trivial effects, while a small n can overlook important effects. In the churn model, p=0.021 < α=0.05, which supports the claim that the model is effective, but the result should be interpreted with caution because it may be influenced by outliers or by violated model assumptions (normality, independence).
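The reported p-value can be cross-checked against the test statistic. The sketch below assumes the statistic follows a standard normal distribution under H₀ and that the test is two-sided; the case does not state either, so both are assumptions.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


z_observed = 2.31
alpha = 0.05

p = two_sided_p_value(z_observed)
print("two-sided p-value:", round(p, 4))
print("reject H0:", p < alpha)
```

Under these assumptions the result rounds to the reported 0.021, which suggests the figure comes from a two-sided Z-test or a t-test with many degrees of freedom.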
- Make a statistical decision.
Reject H₀ because p-value = 0.021 < α = 0.05.
The decision rule in hypothesis testing: if p < α, reject H₀ in favor of H₁; otherwise, fail to reject H₀. Here, the evidence is sufficient to conclude that the model reduces churn at the 5% significance level. The test statistic of 2.31 (a t- or Z-statistic, depending on the test) exceeds the critical value (about 1.96 for a two-sided test with large degrees of freedom), which is consistent with rejection. The decision is objective but depends on the test assumptions (e.g., normality of residuals in the churn model); if they are violated, the decision may be invalid. In practice, account for multiple comparisons if the test is repeated, for example by adjusting α with Bonferroni to avoid Type I error inflation.
- Translate the decision into non-technical language for management.
The statistical test indicates that the new churn prediction model delivers a real improvement: there is only about a 2.1% chance of seeing a result at least this strong if the model were ineffective. We therefore reject the assumption that the model is useless and conclude that it reduces the risk of customer churn.
This translation avoids jargon such as “p-value” or “test statistic” and focuses on the business implications: the model is worth implementing because the evidence supports its effectiveness at the 95% confidence level. For management, emphasize actionable insights: the model can reduce churn, improve retention, and increase revenue, without confusing mathematical details. If p > α, the translation would be: “The data is insufficient to prove the model is effective, so stick with the old approach to avoid wasted spending.” This approach keeps communication clear and avoids overpromising: management understands that the decision is not a guarantee of success but an informed basis for resource allocation.
- Discuss the risk if the sample is not representative.
If the sample is not representative of the population (sampling bias), the main risk is that conclusions do not generalize, which leads to poor business decisions. Mathematically, bias distorts parameter estimates (e.g., the mean churn rate) because the sample does not mirror the population distribution, so the estimator is off target even on average:
\[ E[\hat{\theta}] \neq \theta \]
where \(\hat{\theta}\) is the sample estimate and \(\theta\) is the population parameter.
In churn models, biased samples (e.g., oversampling loyal customers, undersampling high-risk customers) can produce a falsely low p-value and an incorrect rejection of H₀ (inflated Type I error), or the reverse (Type II error).
Business Risk: A model rolled out to the broad population may fail to reduce actual churn, resulting in millions in lost revenue from lost customers or in wasted spending on ineffective strategies. Poor generalization also erodes the credibility of the data team, and management loses trust in analytics.
Mitigation: Use stratified sampling to ensure representation of subgroups (age, location); check for bias with diagnostics (comparing sample vs. population demographics). Ignoring this risk means the model overfits the sample, underperforms in production, and holds back the startup’s growth.
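The stratified sampling suggested above can be sketched in a few lines. The customer records, the `segment` field, and the 10% sampling fraction are all made up for illustration; the point is only that each subgroup is sampled at the same rate, so the sample keeps the population's mix.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one record each)."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic population: 10% high-risk customers, 90% loyal customers.
customers = [
    {"id": i, "segment": "high_risk" if i % 10 == 0 else "loyal"}
    for i in range(1000)
]

sample = stratified_sample(customers, lambda c: c["segment"], fraction=0.1)
counts = Counter(c["segment"] for c in sample)
print(counts)
```

A simple random sample of the same size would match these proportions only on average, whereas the stratified draw guarantees them in every sample.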
- Explain why the p-value does not measure effect size.
The p-value measures the strength of evidence against H₀, not the effect size, because it depends on sample size and variability as well as on the magnitude of the difference. The test statistic is the difference scaled by its standard error (e.g., z = (mean difference)/SE with SE = σ/√n), so for a fixed standardized effect d it grows with √n, and a large n can produce a low p even for a small effect:
\[ z = \frac{\text{mean difference}}{\sigma / \sqrt{n}} = d \sqrt{n} \]
By contrast, the effect size (e.g., Cohen’s d = mean difference / σ) does not depend on n.
In a churn model, p=0.021 indicates statistically significant evidence but does not say how large the churn reduction is (e.g., a 1% or a 20% drop in the churn rate). A small effect can be significant when n is large yet impractical for the business (implementation costs exceed benefits); conversely, a large effect can be non-significant when n is small (high p due to low power). Decisions based on p alone can therefore prioritize trivial findings and ignore impactful but underpowered ones. Report an effect size (d, r²) alongside the p-value to judge practical relevance; as a rule of thumb, Cohen’s conventional threshold for a medium effect is d ≈ 0.5, so p<0.05 alone is not enough. This also discourages p-hacking (manipulating the analysis to obtain a low p without a real effect) and supports sustainable data-driven decisions.
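The dependence of p on n for a fixed effect can be illustrated numerically. The sketch below assumes a one-sample Z-test whose observed standardized effect is d = 0.1 (a hypothetical, small effect), so that z = d√n, and evaluates the two-sided p-value for several sample sizes.

```python
from statistics import NormalDist


def two_sided_p_from_effect(d, n):
    """Two-sided p-value of a Z-test whose observed standardized effect is d."""
    z = d * n ** 0.5                     # test statistic grows with sqrt(n)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


d = 0.1  # hypothetical small effect, identical in every row
p_by_n = {n: two_sided_p_from_effect(d, n) for n in (25, 100, 400, 1600)}
for n, p in p_by_n.items():
    print(f"n={n:5d}  d={d}  p={p:.5f}")
```

The effect size is the same in every row, yet the result moves from clearly non-significant to highly significant as n grows, which is why p cannot stand in for the size of the effect.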
Visualization
References
[1] Testing Statistical Hypotheses, Erich L. Lehmann & Joseph P. Romano, 2005 (3rd edition).
[2] Mathematical Statistics with Applications, Dennis Wackerly, William Mendenhall, Richard L. Scheaffer, 2008 (7th edition).
[3] Probability and Statistical Inference, Nitis Mukhopadhyay, 2000.
[4] Hypothesis Testing: A Visual Introduction to Statistical Significance, Michael F. Gorman, 2017.
[5] Statistics for the Social Sciences, R. Mark Sirkin, 2005.
[6] Introduction to Statistics for the Social Sciences (open textbook), latest edition.
[7] Experimentation in Software Engineering, Wohlin et al., 2012.
[8] Trustworthy Online Controlled Experiments, Kohavi et al., 2020.
[9] Statistics for Experimenters, Box et al., 2005.
[10] Statistics for Business and Economics, Anderson et al., 2020.
[11] Applied Statistics for Engineers and Scientists, Devore et al., 2014.
[12] Marketing Analytics, Winston, 2014. Application of Chi-Square in e-commerce segmentation.
[13] Hypothesis Testing: A Visual Introduction, Gorman, 2017.
[14] Statistics for Business, Anderson et al., 2020.
[15] Practical Statistics for Data Scientists, Bruce et al., 2017.
[16] Statistics for Management and Economics, Keller, 2017.
[17] Practical Statistics for Data Scientists, Bruce et al., 2017.
[18] Introduction to Statistical Learning, James et al., 2021.