Study Cases
Statistical Inference: Week 14
Cahaya Medina Semidang
NIM: 52250053
Data Science Undergraduate at ITSB
1 Case Study 1
One-Sample Z-Test (Statistical Hypotheses)
A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.
A random sample of 64 users shows an average study time of 116 minutes.
\[ \begin{eqnarray*} \mu_0 &=& 120 \\ \sigma &=& 15 \\ n &=& 64 \\ \bar{x} &=& 116 \end{eqnarray*} \]
Tasks
- Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
- Identify the appropriate statistical test and justify your choice.
- Compute the test statistic and p-value using \(\alpha = 0.05\).
- State the statistical decision.
- Interpret the result in a business analytics context.
Answer
1.1 Hypothesis
A digital learning platform claims that the average daily study time of its users is 120 minutes. From historical records, the population standard deviation is known to be 15 minutes. A random sample of 64 users gives a sample mean of \(\bar{x} = 116\) minutes, and the significance level is \(\alpha = 0.05\). Because the claim is about equality and we want to check whether the true mean differs from 120 minutes, we use a two-tailed hypothesis test.
Null Hypothesis \((H_0)\)
\[H_0: \mu = 120\] The average daily study time is exactly 120 minutes.
Alternative Hypothesis \((H_1)\)
\[H_1: \mu \neq 120\] The average daily study time is different from 120 minutes.
1.2 Statistical Test
The statistical test used in this problem is a One-Sample Z-Test, because:
- the population standard deviation (\(\sigma\)) is known (\(\sigma = 15\)),
- the sample size is large (\(n = 64 > 30\)), and
- we are testing one population mean against a known value.
1.3 The Test Statistic and P-Value
To compute the test statistic, we use the \(z\) statistic. Here is the formula:
\[z\ =\ \frac{\bar{x}\ -\mu_0}{\sigma\ / \sqrt{n}}\]
Where
- \(\bar{x}\): sample mean = \(116\)
- \(\mu_0\): hypothesized mean = \(120\)
- \(\sigma\): population standard deviation = \(15\)
- \(n\): sample size = \(64\)
- Compute the test statistic:
\[\begin{array}{rl} z & = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{116 - 120}{15 / \sqrt{64}} \\ & = \frac{-4}{15 / 8} = \frac{-4}{1.875} \\ & \approx -2.13 \end{array}\]
- Compute the p-value:
Because \(z = -2.13\), we look up the probability \(P(Z \le -2.13)\).
From the standard normal table:
\[P(Z \le -2.13) \approx 0.0166\]
Since this is a two-tailed test, we must consider both tails of the distribution:
\[p\text{-value} = 2 \times 0.0166 \approx 0.033\]
Note: we multiply by 2 because deviations in either direction (much smaller or much larger than 120) are evidence against \(H_0\).
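As a cross-check, the same numbers can be reproduced in a few lines of Python; this is a minimal sketch assuming scipy is available (any statistics library would do):

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 120, 15, 64, 116

z = (xbar - mu0) / (sigma / sqrt(n))   # -4 / 1.875 ≈ -2.13
p_value = 2 * norm.cdf(-abs(z))        # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# z = -2.13, p-value = 0.0329 (the hand calculation rounds to 0.033)
```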
1.4 Statistical Decision
- Step 1: Level of Significance
Before making a decision we must first set the decision rule, namely the level of significance (\(\alpha\)). The question specifies \(\alpha = 0.05\), meaning we are willing to accept a \(5\%\) risk of rejecting \(H_0\) when \(H_0\) is actually true (a Type I error).
- Step 2: Determine Decision Rule
Because we are using a two-tailed one-sample Z-test, the p-value-based decision rule is:
\[p\text{-value} \le \alpha \rightarrow \text{reject } H_0\]
\[p\text{-value} > \alpha \rightarrow \text{fail to reject } H_0\]
- Step 3: Comparing p-value with \(\alpha\)
From the previous calculation, we obtain:
- p-value = \(0.033\)
- \(\alpha = 0.05\)
Comparison:
\[0.033 < 0.05\]
This comparison means that the probability of getting a sample result this extreme or more, if \(H_0\) is true, is only \(3.3\%\), which is less than the \(5\%\) error tolerance limit.
Since the p-value is smaller than \(\alpha\), the statistical decision is:
\[\text{Reject } H_0\]
At the \(5\%\) significance level there is sufficient statistical evidence to reject the null hypothesis.
1.5 Interpretation
The analysis provides statistically significant evidence that the average daily study time of users is not equal to 120 minutes. In fact, the sample data suggest that users spend less time studying on average than the platform claims.
From a business analytics perspective, this result indicates that the company’s reported engagement metric may be overestimated. This has practical implications for performance reporting, user engagement strategy, and investor or stakeholder communication. The platform should consider investigating potential causes such as content quality, user motivation, or competing distractions—and take data driven actions to improve actual study time.
2 Case Study 2
One-Sample T-Test (σ Unknown, Small Sample)
A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.
The following data are collected from 10 users:
\[ 9.2,\; 10.5,\; 9.8,\; 10.1,\; 9.6,\; 10.3,\; 9.9,\; 9.7,\; 10.0,\; 9.5 \]
Tasks
- Define H₀ and H₁ (two-tailed).
- Determine the appropriate hypothesis test.
- Calculate the t-statistic and p-value at \(\alpha = 0.05\).
- Make a statistical decision.
- Explain how sample size affects inferential reliability.
Answer
2.1 Hypothesis
A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes. The analysis is conducted using a small sample of users, and the population standard deviation is unknown. Because the research question asks whether the mean differs from 10 minutes, a two-tailed test is used.
Null Hypothesis
\[H_0: \mu = 10\]
Alternative Hypothesis
\[H_1: \mu \neq 10\]
2.2 Statistical Test
The appropriate test is a One-Sample t-test, because:
- the population standard deviation is unknown,
- the sample size is small (\(n = 10 < 30\)),
- we are testing one population mean, and
- the data are continuous and reasonably symmetric.
Under these conditions, the t-distribution is required instead of the normal Z distribution.
2.3 The Test Statistic and P-Value
To compute the test statistic, we use the t-statistic. Here is the formula:
\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\] Where
- \(\bar{x}\): Sample mean = \(9.86\)
- \(\mu_0\): hypothesized mean = \(10\)
- \(s\): Sample standard deviation = \(0.387\)
- \(n\): Sample size = \(10\)
Compute the Test Statistic \[\begin{array}{rl} t & = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{9.86 - 10}{0.387 / \sqrt{10}} \\ & = \frac{-0.14}{0.122} \\ & \approx -1.14 \end{array}\]
Compute P-Value
To find the p-value, we use the t-distribution table with \(n - 1\) degrees of freedom. In this question the sample size is \(n = 10\), so \[df = n - 1 = 9\]
In a t-distribution table, we look at the row that corresponds to \(df = 9\) and search for the absolute value of our test statistic, 1.14. The value 1.14 does not appear in the table, but it falls between the two values 1.100 and 1.383.
This means that the p-value for a one-sided test is between 0.10 and 0.15; call it 0.14. Since our t-test is two-sided, we multiply this value by 2, so our estimated p-value is
\[p\text{-value} = 0.14 \times 2 = 0.28\]
We can plug our test statistic \(t\) and our degrees of freedom into an online p-value calculator (by Zach Bobbitt) to see how close our estimated p-value is to the exact p-value.
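The whole calculation can also be run directly on the raw data; a minimal Python sketch, assuming scipy is available:

```python
import numpy as np
from scipy import stats

times = np.array([9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5])

print(times.mean())        # sample mean: 9.86
print(times.std(ddof=1))   # sample standard deviation: ~0.386

# One-sample t-test against mu0 = 10 (two-tailed by default)
t_stat, p_value = stats.ttest_1samp(times, popmean=10)
print(f"t = {t_stat:.2f}, p-value = {p_value:.2f}")   # t ≈ -1.15, p ≈ 0.28
```

The exact statistic is about \(-1.15\); the hand value \(-1.14\) differs only because \(s\) was rounded to 0.387. Either way, the p-value agrees with the table-based estimate of 0.28.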
2.4 Statistical Decision
- Step 1: Level of Significance
Before making a decision we must first set the decision rule, namely the level of significance (\(\alpha\)). The question specifies \(\alpha = 0.05\), meaning we are willing to accept a \(5\%\) risk of rejecting \(H_0\) when \(H_0\) is actually true (a Type I error).
- Step 2: Determine Decision Rule
Because we are using a two-tailed one-sample t-test, the p-value-based decision rule is:
\[p\text{-value} \le \alpha \rightarrow \text{reject } H_0\]
\[p\text{-value} > \alpha \rightarrow \text{fail to reject } H_0\]
- Step 3: Comparing p-value with \(\alpha\)
From the previous calculation, we obtain:
- p-value = \(0.28\)
- \(\alpha = 0.05\)
Comparison:
\[0.28 > 0.05\]
Since the p-value is larger than \(\alpha\), the statistical decision is:
\[\text{Fail to reject } H_0\]
At the \(5\%\) significance level, there is insufficient statistical evidence to conclude that the average task completion time differs from 10 minutes.
2.5 Effect of Sample Size on Inferential Reliability
Larger Samples Increase Precision
When sample size increases:
- The estimate of a population parameter (example: mean, proportion) becomes more precise
- The sampling variability (random fluctuation from one sample to another) decreases.
This leads to narrower confidence intervals and more reliable estimates—the sample statistic is more likely to be close to the true population value.
Estimating an average from 100 observations will generally yield a more precise estimate than from only 10 observations.
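A quick simulation makes this concrete; in the sketch below (a minimal illustration with an arbitrary population mean of 100 and standard deviation of 15, not tied to any of the cases), the spread of sample means shrinks roughly in proportion to \(1/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 5000 samples of each size from the same population and measure
# how much the sample means fluctuate from one sample to the next.
for n in (10, 100, 1000):
    sample_means = rng.normal(100, 15, size=(5000, n)).mean(axis=1)
    print(f"n = {n:4d}: SD of sample means = {sample_means.std():.2f}")
# Theory predicts 15/sqrt(n): about 4.74, 1.50, and 0.47
```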
Greater Statistical Power
Statistical power is the probability that a hypothesis test will detect a true effect when it actually exists.
- With a larger sample size, the test has greater ability to detect real differences
- Smaller effects become statistically detectable
- The chance of a Type II error (failing to detect a real effect) decreases
Thus, a larger sample size makes the inferential conclusion more trustworthy in identifying true patterns.
Better Generalizability
A larger sample is more likely to be representative of the population, meaning:
- The sample captures more diversity in the population
- Findings can be generalized more confidently to the entire population of interest
This improves inferential reliability because the results are less dependent on idiosyncrasies of a small subset.
Reduced Impact of Outliers
If there are unusual or extreme values (outliers) in the data:
- In a small sample, one or two outliers can heavily skew the mean or test results
- In a large sample, their influence is diluted by the many other observations
This makes the estimates more stable and less sensitive to a few extreme values.
3 Case Study 3
Two-Sample T-Test (A/B Testing)
A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.
| Version | Sample Size (n) | Mean | Standard Deviation |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |
Tasks
- Formulate the null and alternative hypotheses.
- Identify the type of t-test required.
- Compute the test statistic and p-value.
- Draw a statistical conclusion at \(\alpha = 0.05\).
- Interpret the result for product decision-making.
Answer
3.1 Hypothesis
A product analytics team conducts an A/B test to compare the average session duration (in minutes) between two versions of a landing page:
- Version A: existing design
- Version B: new design
The objective is to determine whether there is a statistically significant difference in mean session duration between the two versions. Because the question asks whether the averages are different, we use a two-tailed test.
Null Hypothesis
\[H_0: \mu_A = \mu_B\]There is no difference in average session duration between Version A and Version B.
Alternative Hypothesis
\[H_1: \mu_A \neq \mu_B\]There is a difference in average session duration between Version A and Version B.
3.2 The Type of T-Test
When we want to compare the means of two independent groups, we can choose between two different tests:
- Student's t-test assumes that both groups are sampled from populations that follow a normal distribution and that both populations have the same variance.
- Welch's t-test assumes that both groups are sampled from populations that follow a normal distribution, but it does not assume that the two populations have the same variance.
In practice, when comparing the means of two groups it is unlikely that the standard deviations of the two groups will be identical, so it is a good idea to default to Welch's t-test and avoid the equal-variance assumption. Therefore, the type of t-test we use is Welch's t-test.
3.3 The Test Statistic and P-Value
Since we use Welch's t-test, here are the formulas:
Difference in sample means \[\bar{x}_A - \bar{x}_B\]
Standard Error \[SE = \sqrt{{\frac{{s_A}^2}{n_A}}+ \frac{{s_B}^2}{n_B}}\]
t-statistic \[ t = \frac{\bar{x}_A - \bar{x}_B}{SE} \]
Degrees of Freedom (Welch) \[ df = \frac{\left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2}{\frac{\left( \frac{s_A^2}{n_A} \right)^2}{n_A - 1} + \frac{\left( \frac{s_B^2}{n_B} \right)^2}{n_B - 1}} \]
Where
- \(\bar{x}_A\): sample mean of Version A = \(4.8\)
- \(\bar{x}_B\): sample mean of Version B = \(5.4\)
- \(s_A\): sample standard deviation of Version A = \(1.2\)
- \(s_B\): sample standard deviation of Version B = \(1.4\)
- \(n_A\): sample size of Version A = \(25\)
- \(n_B\): sample size of Version B = \(25\)
Computation:
Standard Error \[\begin{array}{rl} SE & = \sqrt{{\frac{{s_A}^2}{n_A}}+ \frac{{s_B}^2}{n_B}} \\ & = \sqrt{{\frac{{1.2}^2}{25}}+ \frac{{1.4}^2}{25}} \\ & = \sqrt{{0.0576}+{0.0784}} \\ & = \sqrt{0.136} \\ & = 0.3687 \approx 0.369 \end{array}\]
t-statistic \[\begin{array}{rl} t & = \frac{\bar{x}_A - \bar{x}_B}{SE} \\ & = \frac{4.8 - 5.4}{0.369} \\ & = -1.626 \approx -1.63 \end{array}\]
Degrees of Freedom
Step 1: Compute variance terms
\[ \frac{s_A^2}{n_A} = \frac{1.44}{25} = 0.0576 \]
\[\frac{s_B^2}{n_B} = \frac{1.96}{25} = 0.0784 \]
Step 2: Compute numerator \[\begin{array}{rl} {\left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2} & =(0.0576+0.0784)^2 \\ & =(0.136)^2 \\ & =0.018496 \end{array}\]
Step 3: Compute denominator \[\begin{array}{rl} & \frac{\left( \frac{s_A^2}{n_A} \right)^2}{n_A - 1} + \frac{\left( \frac{s_B^2}{n_B} \right)^2}{n_B - 1} \\ & = \frac{0.0576^2}{24} + \frac{0.0784^2}{24} \\ & = \frac{0.00331776}{24} + \frac{0.00614656}{24} \\ & = 0.00013824 + 0.00025611 \\ & = 0.00039435 \end{array}\]
Step 4: Compute df \[\begin{array}{rl} df & = \frac{\left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2}{\frac{\left( \frac{s_A^2}{n_A} \right)^2}{n_A - 1} + \frac{\left( \frac{s_B^2}{n_B} \right)^2}{n_B - 1}} \\ & = \frac{0.018496}{0.00039435} \\ & \approx 46.9 \\ & \approx 47 \end{array}\]
Summary of results:
- t-statistic
\[ \begin{array}{rl} t & = \frac{\bar{x}_A - \bar{x}_B}{SE} \\ & = \frac{4.8 - 5.4}{0.369} \\ & = -1.626 \\ & \approx -1.63 \end{array}\]
- Degrees of Freedom \[df \approx 47\]
- p-value \[p\text{-value} = 0.10979 \approx 0.11\]
We can plug our test statistic \(t\) and our degrees of freedom into an online p-value calculator (by Zach Bobbitt) to see how close our estimated p-value is to the exact p-value.
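These steps can also be verified programmatically; the sketch below, a minimal Python example assuming scipy, reproduces the standard error, t-statistic, Welch degrees of freedom, and p-value from the summary statistics:

```python
from math import sqrt
from scipy.stats import t as t_dist

x_a, s_a, n_a = 4.8, 1.2, 25   # Version A summary statistics
x_b, s_b, n_b = 5.4, 1.4, 25   # Version B summary statistics

v_a, v_b = s_a**2 / n_a, s_b**2 / n_b
se = sqrt(v_a + v_b)                                             # ~0.369
t_stat = (x_a - x_b) / se                                        # ~-1.63
df = (v_a + v_b)**2 / (v_a**2 / (n_a - 1) + v_b**2 / (n_b - 1))  # ~46.9
p_value = 2 * t_dist.sf(abs(t_stat), df)                         # two-tailed

print(f"t = {t_stat:.2f}, df = {df:.1f}, p-value = {p_value:.3f}")
# t = -1.63, df = 46.9, p-value = 0.110
```

The same result comes from `scipy.stats.ttest_ind_from_stats(..., equal_var=False)` in a single call.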
3.4 Statistical Conclusion
From the previous calculation, we obtain:
- p-value = \(0.11\)
- \(\alpha = 0.05\)
So,
\[0.11 > 0.05\]
Therefore, the statistical decision is:
\[\text{Fail to reject } H_0\]
Conclusion
At the 5% significance level, there is insufficient statistical evidence to conclude that the average session duration differs between Version A and Version B. The observed difference of 0.6 minutes could reasonably be explained by random sampling variability.
3.5 Interpreting for Product Decision-Making
Business Interpretation
- Although Version B shows a higher average session duration, the difference is not statistically significant
- There is no strong evidence that Version B truly improves user engagement
- Rolling out Version B based solely on this result would be premature
Recommended Actions
- Collect larger samples to increase statistical power
- Analyze additional KPIs (conversion rate, bounce rate)
- Consider running the experiment for a longer duration
Key Insight: A numerically better result does not automatically imply a real product improvement.
4 Case Study 4
Chi-Square Test of Independence
An e-commerce company examines whether device type is associated with payment method preference.
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
Tasks
- State the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
- Identify the appropriate statistical test.
- Compute the Chi-Square statistic (χ²).
- Determine the p-value at \(\alpha = 0.05\).
- Interpret the results in terms of digital payment strategy.
Answer
4.1 Hypothesis
In a Chi-Square Test of Independence, we test whether two categorical variables are statistically associated.
Null Hypothesis (\(H_0\)):
Device type and payment method preference are independent. In other words, the choice of payment method does not depend on whether users are on mobile or desktop.
Alternative Hypothesis (\(H_1\)):
Device type and payment method preference are associated. This means payment preferences differ by device type.
4.2 Appropriate Statistical Test
The correct test is the Chi-Square Test of Independence. We use this because:
- Both variables are categorical.
- We are comparing frequencies (counts), not means.
- The goal is to test association, not causation.
4.3 Chi-Square Statistic
Before computing the Chi-Square statistic, we need to compute the expected frequencies.
Expected Frequencies
Expected frequencies assume \(H_0\) is true (no association).
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery | Row Total |
|---|---|---|---|---|
| Mobile | 120 | 80 | 50 | 250 |
| Desktop | 60 | 90 | 40 | 190 |
| Column Total | 180 | 170 | 90 | 440 |
For each cell in the contingency table:
\[E = \frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Grand Total}}\]
Example computations:
- Mobile, E-Wallet:
\[E = \frac{250 \times 180}{440} = 102.27\]
- Desktop, E-Wallet:
\[E = \frac{190 \times 180}{440} = 77.73\]
Computing this for every cell in the contingency table gives the following summary:
Table Summary Expected Frequencies
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 102.27 | 96.59 | 51.14 |
| Desktop | 77.73 | 73.41 | 38.86 |
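The expected-frequency table can be generated in one step from the row and column totals; here is a minimal numpy sketch (an assumed tool choice):

```python
import numpy as np

observed = np.array([[120, 80, 50],    # Mobile
                     [60,  90, 40]])   # Desktop

row_totals = observed.sum(axis=1)      # [250, 190]
col_totals = observed.sum(axis=0)      # [180, 170, 90]
grand_total = observed.sum()           # 440

# E = (row total * column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
# [[102.27  96.59  51.14]
#  [ 77.73  73.41  38.86]]
```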
Chi-Square
Concept
The Chi-Square statistic measures how far the observed data deviate from what we would expect under the null hypothesis.
- If Observed \(\approx\) Expected, then \(\chi^2\) is small \(\rightarrow\) no association
- If Observed is far from Expected, then \(\chi^2\) is large \(\rightarrow\) evidence of association
Formula
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
where \(O\) = observed frequency and \(E\) = expected frequency.
Observed Frequency
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
Compute the Chi-Square contributions:
- Mobile, E-Wallet:
\[\frac{(120 - 102.27)^2}{102.27} = \frac{(17.73)^2}{102.27} = 3.07\]
- Desktop, E-Wallet:
\[\frac{(60 - 77.73)^2}{77.73} = \frac{(-17.73)^2}{77.73} = 4.04\]
Computing this cell by cell gives:
Table Chi-Square Contribution per Cell
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 3.07 | 2.85 | 0.03 |
| Desktop | 4.04 | 3.75 | 0.03 |
4.4 Statistical Decision
Total Chi-Square Value
\[\chi^2 = 3.07 + 2.85 + 0.03 + 4.04 + 3.75 + 0.03 \approx 13.77\]
Degrees of Freedom
\[\begin{array}{rl} df & = (\text{number of rows} - 1)(\text{number of columns} - 1) \\ & = (2 - 1)(3 - 1) \\ & = 2 \end{array}\]
For \(\chi^2 \approx 13.77\) and \(df = 2\), the result is:
\[p\text{-value} \approx 0.001\]
We can plug our Chi-Square statistic and degrees of freedom into an online p-value calculator (by Dotmatics) to see how close our estimated p-value is to the exact p-value.
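Equivalently, scipy's `chi2_contingency` (an assumed tool choice) runs the whole test, returning the statistic, p-value, degrees of freedom, and expected frequencies in one call:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[120, 80, 50],
                     [60,  90, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p-value = {p_value:.4f}")
# chi2 = 13.77, df = 2, p-value = 0.0010
```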
Statistical Decision
Because:
\[p\text{-value} < \alpha = 0.05\]
the statistical decision is:
\[\text{Reject } H_0\]
4.5 Interpretation
The Chi-Square test results indicate a statistically significant association between device type and payment method preference, meaning that customers’ payment choices differ systematically between mobile and desktop users rather than being randomly distributed. This finding has direct implications for the company’s digital payment strategy.
Specifically, mobile users use e-wallets more frequently than expected, suggesting that mobile shoppers are more comfortable with fast, app-based, and frictionless payment options. In contrast, desktop users rely more heavily on credit cards than expected, which may reflect different usage contexts, such as higher trust in card entry on larger screens or different purchasing habits during desktop sessions.
From a strategic perspective, the company should optimize payment options by device type. For mobile platforms, prioritizing e-wallet integrations (such as faster checkout flows, default e-wallet selections, or exclusive mobile wallet promotions) could reduce friction and increase conversion rates. For desktop users, ensuring a smooth and secure credit card checkout experience remains critical.
Overall, the results suggest that a one-size-fits-all payment interface may not be optimal. Instead, aligning payment method prominence with device-specific user preferences can improve checkout efficiency, enhance user experience, and ultimately drive higher transaction completion rates.
5 Case Study 5
Type I and Type II Errors (Conceptual)
A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.
H₀: The new algorithm does not reduce fraud.
H₁: The new algorithm reduces fraud.
Tasks
- Explain a Type I Error (α) in this context.
- Explain a Type II Error (β) in this context.
- Identify which error is more costly from a business perspective.
- Discuss how sample size affects Type II Error.
- Explain the relationship between α, β, and statistical power.
Answer
5.1 Type I Error
A Type I error means rejecting the null hypothesis when it’s actually true. It means concluding that results are statistically significant when, in reality, they came about purely by chance or because of unrelated factors. Mathematically:
\[P(\text{reject } H_0 \mid H_0 \text{ is true}) = \alpha\]
Interpretation in This Context
A Type I Error happens when the company concludes that the new fraud detection algorithm reduces fraud, when in reality it does not reduce fraud at all.
Practical Consequences
- The company deploys an ineffective algorithm
- False confidence in fraud prevention
- Possible increase in undetected fraudulent transactions
- Financial losses and reputational risk
Thus, \(\alpha\) represents the risk of making a false positive decision.
5.2 Type II Error
A Type II error, also known as an error of the second kind or a beta error, occurs when a researcher fails to reject a null hypothesis that is actually false; that is, the test misses a relationship or effect that really exists. Mathematically:
\[P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) = \beta\]
Interpretation in This Context
A Type II Error happens when the company concludes that the new algorithm does not reduce fraud, even though it actually does reduce fraud.
Practical Consequences
- A genuinely effective algorithm is not adopted
- Missed opportunity to reduce fraud
- Continued financial losses due to fraud
- Competitive disadvantage
Thus, \(\beta\) represents the risk of a false negative decision.
5.3 More Costly Error From Business Perspective
From a fintech business perspective:
- Type I Error: approving a useless algorithm
- Type II Error: rejecting a useful algorithm
Cost Comparison
Type I Error:
- Deploys ineffective security
- High financial and reputational damage
- Regulatory and compliance risk
Type II Error:
- Missed efficiency gains
- Continued fraud losses
- Opportunity cost
Conclusion
A Type I Error is typically more costly in fraud detection, because allowing fraud to persist due to false confidence in an ineffective algorithm can cause immediate and severe damage.
5.4 Effect of Sample Size on Type II Error
Sample size has a strong and direct effect on the probability of making a Type II error \((β)\), which occurs when a statistical test fails to detect a real effect that actually exists. When the sample size is small, estimates of population parameters tend to be unstable and highly influenced by random variation. This high variability makes it difficult for the test statistic to move far enough away from the null hypothesis, even when a true effect is present. As a result, studies with small samples are more likely to conclude that there is “no effect,” thereby committing a Type II error.
As the sample size increases, the standard error decreases, meaning that estimates become more precise and less noisy. This increased precision allows the sampling distribution under the alternative hypothesis to separate more clearly from the distribution under the null hypothesis. Consequently, real effects are easier to detect, the probability of failing to reject a false null hypothesis decreases, and Type II error is reduced. In statistical terms, increasing the sample size raises statistical power, which is defined as \(1−β\). Therefore, larger samples directly improve a study’s ability to identify meaningful effects and avoid costly false negatives, a concern emphasized in business and experimentation contexts such as product testing and A/B experiments.
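To make the effect of sample size tangible, the Monte Carlo sketch below (illustrative numbers only, not taken from the case) estimates power and Type II error for detecting a modest true shift at several sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, null_mean, sd, alpha = 10.3, 10.0, 1.0, 0.05

for n in (10, 30, 100, 300):
    # Run 2000 simulated experiments of size n, each testing H0: mu = 10
    samples = rng.normal(true_mean, sd, size=(2000, n))
    p_values = stats.ttest_1samp(samples, null_mean, axis=1).pvalue
    power = (p_values < alpha).mean()   # share of experiments detecting the shift
    print(f"n = {n:4d}: power = {power:.2f}, Type II error = {1 - power:.2f}")
```

With these settings, power climbs from well under 0.5 at \(n = 10\) to nearly 1 at \(n = 300\), so \(\beta\) shrinks accordingly.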
Key Takeaways
- Type II Error \((\beta)\) occurs when a real effect exists but the test fails to detect it.
- Small sample sizes increase \(\beta\) because high variability masks true effects.
- Larger sample sizes reduce standard error, leading to more precise estimates.
- As sample size increases, statistical power \((1 - \beta)\) increases, making it easier to detect real effects.
- Underpowered studies with small samples risk false conclusions of "no effect", even when meaningful differences exist.
- In business and experimentation settings, insufficient sample size can lead to missed opportunities, such as failing to adopt genuinely effective products or strategies.
5.5 Relationship Between \(\alpha\), \(\beta\), and Statistical Power
In the context of hypothesis testing, \(α\) (alpha) represents the significance level—the probability of committing a Type I error, which means incorrectly rejecting a true null hypothesis. Researchers commonly set α at 0.05, which reflects a willingness to accept a 5% chance of a false positive result. In contrast, \(β\) (beta) represents the probability of a Type II error, which occurs when a true effect exists but the test fails to detect it. Statistical power is defined as the probability of correctly rejecting the null hypothesis when it is false, and is mathematically described as:
\[\text{Power} = 1 - \beta\] Thus, power and \(\beta\) are directly linked: as \(\beta\) decreases, power increases, meaning the test is more likely to detect a real effect if one exists.
There is an inherent trade-off between \(α\) and \(β\). Lowering α (making the criterion for significance more stringent) reduces the risk of a false positive, but this tightening of the rejection threshold also makes it harder to detect true effects, which tends to increase \(β\) and reduce power if other factors (like sample size and effect size) are held constant. Conversely, increasing α makes it easier to reject the null hypothesis and thus can increase power, but at the expense of a higher Type I error rate.
Because of this balance, researchers often conduct power analysis before collecting data, specifying desired values for α and power (often 0.80 or higher) and using estimates of effect size to determine the required sample size. Power analysis ensures that a study is designed with a high probability of detecting meaningful effects while controlling both Type I and Type II error rates, reflecting the interconnected nature of \(α\), \(β\), and statistical power.
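As a sketch of such a power analysis, assuming statsmodels is available (the effect size of 0.5 is Cohen's conventional "medium" value and the target power of 0.80 matches the text):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group for effect size d = 0.5,
# alpha = 0.05, and target power = 0.80
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required n per group = {n_needed:.0f}")   # about 64

# Tightening alpha to 0.01 while keeping n = 64 reduces power,
# illustrating the alpha-beta trade-off described above.
power_strict = analysis.solve_power(effect_size=0.5, alpha=0.01, nobs1=64)
print(f"power at alpha = 0.01: {power_strict:.2f}")   # noticeably below 0.80
```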
Key relationships summarized:
- \(\alpha\) controls the risk of false positives (Type I error).
- \(\beta\) controls the risk of false negatives (Type II error).
- Statistical power \((1 - \beta)\) is the probability of detecting a real effect.
- Lowering \(\alpha\) generally increases \(\beta\) (reduces power) unless sample size or effect size compensates.
- Designing studies with adequate power involves balancing \(\alpha\), \(\beta\), effect size, and sample size.
6 Case Study 6
P-Value and Statistical Decision Making
A churn prediction model evaluation yields the following results:
- Test statistic = 2.31
- p-value = 0.021
- Significance level: \(\alpha = 0.05\)
Tasks
- Explain the meaning of the p-value.
- Make a statistical decision.
- Translate the decision into non-technical language for management.
- Discuss the risk if the sample is not representative.
- Explain why the p-value does not measure effect size.
Answer
6.1 The Meaning of P-Value
The p-value is a number, calculated from a statistical test, that describes how likely you are to have found a particular set of observations if the null hypothesis were true. P-values are used in hypothesis testing to help decide whether to reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.
| P-value | Evidence against \(H_0\) | Action |
|---|---|---|
| ≤ 0.01 | Very strong | Reject \(H_0\) |
| ≤ 0.05 | Strong | Reject \(H_0\) |
| > 0.05 | Weak or none | Fail to reject \(H_0\) |
In this case:
\[p\text{-value} = 0.021\] This means: if the churn prediction model actually has no real effect, there is only a 2.1% chance that we would observe a result this extreme due to random sampling variability alone.
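For reference, the reported p-value is consistent with a two-tailed test whose statistic follows a standard normal distribution under \(H_0\); the case does not state the reference distribution, so that is an assumption of this sketch:

```python
from scipy.stats import norm

test_statistic = 2.31

# Assumption: two-tailed test with a standard normal reference distribution
p_value = 2 * norm.sf(test_statistic)
print(f"p-value = {p_value:.3f}")   # 0.021
```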
Key Takeaways
- Definition: A p-value measures how likely your observed results (or more extreme ones) would be if the null hypothesis were true. It is a tool for assessing evidence against the null.
- Significance: Statistical significance is determined by comparing the p-value to a chosen cutoff (often 0.05). A smaller p-value suggests stronger evidence against the null hypothesis.
- Misinterpretation: A p-value does not tell you the probability the null hypothesis is true or that your results happened by chance. It only reflects how your data aligns with the null model.
- Limitations: A statistically significant result may have little practical importance, and large samples can produce small p-values even for trivial effects. Always consider effect size and context.
6.2 Statistical Decision
Based on the given significance level \(α=0.05\), the appropriate statistical decision is to reject the null hypothesis, because the p-value (0.021) is smaller than \(α\). This decision means that the evidence from the data is strong enough to conclude that the churn prediction model’s performance is statistically different from what would be expected if there were no real effect. Importantly, rejecting the null hypothesis does not imply certainty, but rather that the observed evidence crosses the predefined threshold for statistical significance.
6.3 Translation For Management (non-technical language)
The analysis indicates that the model’s performance improvement is very unlikely to be due to random chance, suggesting that it is capturing real patterns related to customer churn rather than random noise in the data. From a practical perspective, this makes the model promising and worth considering for business use; however, this conclusion should be understood as statistical confidence rather than a guarantee of success and should be complemented with further business validation, operational testing, and ongoing monitoring before full deployment.
6.4 Risk If the Sample Is Not Representative
A significant risk arises if the sample used to evaluate the model is not representative of the overall customer population. In such cases, the statistical conclusion may accurately describe the sampled data but fail when the model is deployed in real-world conditions. For example, if the sample overrepresents highly active users or a specific demographic group, the model may perform poorly on other segments. This risk highlights that statistical significance alone does not ensure generalizability, and careful attention must be paid to how the data were collected.
6.5 Reason the p-value Doesn’t Measure Effect Size
The p-value does not measure effect size, meaning it does not indicate how large or practically important the model’s improvement is. A small p-value only tells us that the observed effect is unlikely to be due to chance; it does not tell us whether the improvement is large enough to matter for business outcomes. In large samples, even very small and practically irrelevant improvements can produce statistically significant p-values, while in small samples, meaningful improvements may fail to reach significance. Therefore, p-values should always be interpreted alongside effect size measures and business impact metrics to support sound decision-making.
7 References
[1] Bobbitt, Zach. (2023, June 25). How to Calculate a P-Value from a T-Test By Hand.
[2] Bobbitt, Zach. (2025, April 13). T Score to P Value Calculator.
[3] Frost, Jim. Sample Size Essentials: The Foundation of Reliable Statistics.
[4] The Statsig Team. (2025, March 22). Recognizing Type 2 Error and Its Cost in Hypothesis Testing.
[5] DiFrancesco, Vivienne. (2021, April 10). Understanding Alpha, Beta, and Statistical Power.
[6] Statistics Solutions. The Building Blocks of Power Analysis: Demystifying Key Statistical Terms.
[7] Bobbitt, Zach. (2020, December 20). Welch's t-test: When to Use It + Examples.
[8] McLeod, Saul. (2025, August 11). Understanding P-Values and Statistical Significance.
[9] Hayes, Adam. (2025, July 26). Type II Error: Definition, Example, vs. Type I Error.