Data Science Major
Statistical inference is the process of using sample data to make generalizations or inferences about a population. While descriptive statistics summarize the characteristics of a sample, inferential statistics allow us to test hypotheses and estimate population parameters under uncertainty. (Casella and Berger 2024)
Hypothesis testing starts with two opposing statements:
Null Hypothesis (\(H_0\)): Assumes no effect or no difference (\(\mu = \mu_0\)).
Alternative Hypothesis (\(H_1\)): Assumes an effect or difference exists (\(\mu \neq \mu_0\), \(\mu > \mu_0\), or \(\mu < \mu_0\)).
The Z-test is used when (Casella and Berger 2024):
The population standard deviation (\(\sigma\)) is known.
The sample size is large (\(n \geq 30\)), so the Central Limit Theorem ensures the sampling distribution of the mean is approximately normal.
This formula is used to compare a sample mean (\(\bar{x}\)) to a known population mean (\(\mu_0\)).
\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]
Where:
\(Z\): The calculated Z-test statistic.
\(\bar{x}\): Sample mean.
\(\mu_0\): Population mean (under the null hypothesis).
\(\sigma\): Population standard deviation.
\(n\): Sample size.
\(\sigma / \sqrt{n}\): Standard Error (SE) of the mean.
Used to compare the means of two independent groups when both population variances are known.
\[Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\]
A confidence interval is not a test, but it is a core part of inference: it estimates where the true population mean lies.
\[CI = \bar{x} \pm Z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right)\]
where \(Z_{\alpha/2}\) is the critical value of the standard normal distribution for confidence level \(1 - \alpha\).
Decision rules:
If \(P \leq \alpha\): Reject \(H_0\) (Result is statistically significant).
If \(P > \alpha\): Fail to reject \(H_0\) (Insufficient evidence).
If \(|Z_{calc}| > |Z_{crit}|\): Reject \(H_0\).
Common \(Z_{crit}\) for \(\alpha = 0.05\) (Two-tailed): \(\pm 1.96\).
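As an illustration, the test statistic, two-tailed p-value, and confidence interval above can be computed in a few lines of base R; the helper name `z_test_one_sample` below is purely illustrative.

```r
# One-sample Z-test with a two-tailed p-value and confidence interval
z_test_one_sample <- function(x_bar, mu0, sigma, n, alpha = 0.05) {
  se     <- sigma / sqrt(n)                 # standard error of the mean
  z      <- (x_bar - mu0) / se              # Z-test statistic
  p      <- 2 * pnorm(-abs(z))              # two-tailed p-value
  z_crit <- qnorm(1 - alpha / 2)            # critical value (1.96 for alpha = 0.05)
  ci     <- x_bar + c(-1, 1) * z_crit * se  # (1 - alpha) confidence interval
  list(z = z, p_value = p, ci = ci, reject_H0 = p <= alpha)
}
```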
A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.
A random sample of 64 users shows an average study time of 116 minutes.
Given Parameters:
Population Mean (\(\mu_0\)): 120
Population Standard Deviation (\(\sigma\)): 15
Sample Size (\(n\)): 64
Sample Mean (\(\bar{x}\)): 116
Significance Level (\(\alpha\)): 0.05
To determine if the actual study time differs from the claim, we define:
Null Hypothesis (\(H_0\)): \(\mu = 120\) (The true average daily study time is equal to 120 minutes.)
Alternative Hypothesis (\(H_1\)): \(\mu \neq 120\) (The true average daily study time is significantly different from 120 minutes.)
The appropriate test is the One-Sample Z-Test.
Justification:
Known Population Standard Deviation: The value of \(\sigma\) (15 minutes) is provided based on historical records.
Sample Size: The sample size (\(n = 64\)) is greater than 30, which satisfies the Central Limit Theorem, ensuring the sampling distribution of the mean is approximately normal.
Independent Observations: The users are assumed to be randomly and independently sampled.
\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]
P-Value: Since this is a two-tailed test, \(P\text{-value} = 2 \times P(Z < -|Z_{calc}|)\).
| Parameter | Value |
|---|---|
| Null Hypothesis (μ₀) | 120 |
| Sample Mean (x̄) | 116 |
| Standard Deviation (σ) | 15 |
| Sample Size (n) | 64 |
| Standard Error | 1.875 |
| Z-score | -2.1333 |
| P-value | 0.0329 |
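These figures can be reproduced with a short base R sketch (not the original analysis code):

```r
x_bar <- 116; mu0 <- 120; sigma <- 15; n <- 64
se <- sigma / sqrt(n)       # 1.875
z  <- (x_bar - mu0) / se    # -2.1333
p  <- 2 * pnorm(-abs(z))    # 0.0329 (two-tailed)
```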
Calculated \(P\text{-value}\): 0.0329
Significance Level (\(\alpha\)): 0.05
Decision Rule: If \(P\text{-value} < \alpha\), then Reject \(H_0\).
Result: Since 0.0329 < 0.05, we Reject the Null Hypothesis (\(H_0\)).
Business Implications:
Performance Gap: The sample mean (116 minutes) suggests that users are spending significantly less time on the platform than the marketing or operational claim suggests.
Strategic Adjustment: Management should investigate why engagement is lower than historical benchmarks. This could indicate a need for better content, improved user interface, or more effective notification triggers.
Marketing Accuracy: To maintain transparency and brand trust, the platform may need to adjust its external claims or focus on re-engagement campaigns to bring the average back up to the 120-minute target.
A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.
The following data are collected from 10 users:
| Participant | Completion Time (minutes) |
|---|---|
| 1 | 9.2 |
| 2 | 10.5 |
| 3 | 9.8 |
| 4 | 10.1 |
| 5 | 9.6 |
| 6 | 10.3 |
| 7 | 9.9 |
| 8 | 9.7 |
| 9 | 10.0 |
| 10 | 9.5 |
Data Points (minutes): 9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5
Given Parameters:
Hypothesized Mean (\(\mu_0\)): 10.0
Significance Level (\(\alpha\)): 0.05
Sample Size (\(n\)): 10
To test if the completion time significantly deviates from the 10-minute benchmark:
Null Hypothesis (\(H_0\)): \(\mu = 10\) (The average task completion time is equal to 10 minutes.)
Alternative Hypothesis (\(H_1\)): \(\mu \neq 10\) (The average task completion time is not equal to 10 minutes.)
The appropriate test is the One-Sample T-Test.
Justification:
Unknown Population Standard Deviation (\(\sigma\)): We do not have the historical \(\sigma\) for the population; we must estimate it using the sample standard deviation (\(s\)).
Small Sample Size: The sample size is small (\(n = 10 < 30\)), necessitating the use of the T-distribution rather than the Z-distribution.
Normality Assumption: We assume the underlying population of task completion times follows a normal distribution.
| Parameter | Value |
|---|---|
| Sample Mean (x̄) | 9.86 |
| Sample Std Dev (s) | 0.3864 |
| T-Statistic | -1.1456 |
| Degrees of Freedom | 9 |
| P-value | 0.2815 |
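A minimal sketch of the same test in R using the built-in `t.test()` (the vector name `completion_time` is illustrative):

```r
completion_time <- c(9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5)
t.test(completion_time, mu = 10, alternative = "two.sided")
# t ≈ -1.146, df = 9, p-value ≈ 0.28, matching the table above
```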
We compare the calculated \(P\text{-value}\) to the significance level \(\alpha\):
Calculated \(P\text{-value}\): 0.2815
Significance Level (\(\alpha\)): 0.05
Result: Since 0.2815 > 0.05, we Fail to Reject the Null Hypothesis (\(H_0\)).
There is no statistically significant evidence to suggest that the average task completion time differs from 10 minutes.
Standard Error: As \(n\) increases, the standard error (\(s / \sqrt{n}\)) decreases. A smaller standard error means the sample mean is a more precise estimate of the population mean.
Statistical Power: Small samples (like \(n=10\)) have lower statistical power, meaning they are less likely to detect a true difference if one exists (Type II Error). This often leads to a “Fail to Reject” decision even if there is a slight practical difference.
T-Distribution Shape: With a small sample size, the T-distribution has “fatter tails” than the normal distribution to account for the uncertainty in estimating \(\sigma\). As \(n\) grows, the T-distribution approaches the Z-distribution (Normal).
Margin of Error: Larger samples yield narrower confidence intervals, providing more “reliability” and certainty when generalizing findings from the sample to the broader user base.
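A short sketch, reusing the sample standard deviation from this example, of how the t critical value approaches the normal critical value (1.96) and the margin of error shrinks as \(n\) grows:

```r
# t critical values approach qnorm(0.975) = 1.96, and the margin of
# error (t_crit * s / sqrt(n)) shrinks, as the sample size grows
s <- 0.3864; alpha <- 0.05
n <- c(10, 30, 100, 1000)
t_crit <- qt(1 - alpha / 2, df = n - 1)
data.frame(n = n, t_crit = t_crit, margin_of_error = t_crit * s / sqrt(n))
```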
A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.
| Version | Sample Size (n) | Mean (x̄) | Standard Deviation (s) |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |
Significance Level (\(\alpha\)): 0.05
Assumption: Independent samples with unequal variances (Welch’s T-test approach).
Null Hypothesis (\(H_0\)): \(\mu_A = \mu_B\) (There is no difference in the average session duration between Version A and Version B.)
Alternative Hypothesis (\(H_1\)): \(\mu_A \neq \mu_B\) (There is a significant difference in the average session duration between Version A and Version B.)
The appropriate test is the Independent Two-Sample T-Test.
Justification:
Two Independent Groups: The users seeing Version A are different from the users seeing Version B.
Continuous Data: Session duration is a ratio-scale continuous variable.
Unknown Population Variance: We are using sample standard deviations (\(s_A\) and \(s_B\)) to estimate the population parameters.
Welch’s Adjustment: Given that sample sizes are small and variances may differ, Welch’s T-test is the more robust choice compared to the Pooled-variance T-test.
| Parameter | Value |
|---|---|
| Version A Mean (x̄₁) | 4.8 |
| Version B Mean (x̄₂) | 5.4 |
| Sample Size (n) | 25 |
| Standard Deviation (s) | 1.2 (A), 1.4 (B) |
| Standard Error | 0.3688 |
| T-Statistic | -1.627 |
| Degrees of Freedom | 46.9 |
| P-value | 0.1104 |
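Because only summary statistics are available, the Welch statistic can be computed directly; a minimal sketch in base R (`t.test()` itself expects the raw observations):

```r
# Welch's two-sample t-test from summary statistics
x1 <- 4.8; s1 <- 1.2; n1 <- 25   # Version A
x2 <- 5.4; s2 <- 1.4; n2 <- 25   # Version B
se <- sqrt(s1^2 / n1 + s2^2 / n2)               # 0.3688
t  <- (x1 - x2) / se                            # -1.627
df <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) +        # Welch-Satterthwaite df: 46.9
              (s2^2 / n2)^2 / (n2 - 1))
p  <- 2 * pt(-abs(t), df)                       # 0.1104 (two-tailed)
```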
We compare the calculated \(P\text{-value}\) to our significance threshold:
Calculated \(P\text{-value}\): 0.1104
Significance Level (\(\alpha\)): 0.05
Decision: Since 0.1104 > 0.05, we Fail to Reject the Null Hypothesis (\(H_0\)).
The observed difference in means (4.8 vs 5.4) is not statistically significant at the 5% level.
1. Insufficient Evidence for Change: Although Version B showed a higher mean session duration (5.4 minutes) than Version A (4.8 minutes), we cannot confidently say this difference isn't due to random chance.
2. Risk of Implementation: Implementing Version B based solely on this data carries a risk of a "Type I Error" (thinking there's an improvement when there isn't).
3. Recommendations:
Increase Sample Size: The current sample (\(n=50\) total) might be too small to detect a meaningful effect (low power). It is recommended to run the A/B test longer to collect more data.
Review Practical Significance: A 0.6-minute difference might be business-critical. If so, a larger study is justified to verify if this trend holds.
Analyze Sub-segments: Check if Version B performed better for specific user segments (e.g., mobile vs. desktop) before discarding the design.
An e-commerce company examines whether device type is associated with payment method preference.
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
To test the relationship between these two categorical variables:
Null Hypothesis (\(H_0\)): Device type and payment method are independent. (There is no association between the device used and the preferred payment method.)
Alternative Hypothesis (\(H_1\)): Device type and payment method are dependent. (There is a significant association between the device used and the preferred payment method.)
The appropriate test is the Chi-Square Test of Independence.
Justification:
Categorical Variables: Both "Device Type" (Mobile, Desktop) and "Payment Method" (E-Wallet, Credit Card, COD) are nominal categorical variables.
Contingency Table: The data is structured in a frequency table (cross-tabulation).
Independence of Observations: Each customer response represents an independent event.
Expected Frequencies: The sample size is large enough such that the expected frequency in each cell is \(\geq 5\).
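The results below can be obtained with R's built-in `chisq.test()`; a minimal sketch, assuming the observed counts above (the object name `payments` is illustrative):

```r
# Chi-square test of independence on the device x payment contingency table
payments <- matrix(c(120, 80, 50,
                      60, 90, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Device  = c("Mobile", "Desktop"),
                                   Payment = c("E-Wallet", "Credit Card", "COD")))
chisq.test(payments)           # X-squared ≈ 13.77, df = 2, p-value ≈ 0.001
chisq.test(payments)$expected  # verify all expected counts are >= 5
```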
| Statistic | Value |
|---|---|
| Chi-Square Statistic (X²) | 13.7736 |
| Degrees of Freedom | 2 |
| P-value | 0.001021 |
| Significance Level (α) | 0.05 |
Based on the calculation above:
Calculated \(P\text{-value}\): 0.001021
Significance Level (\(\alpha\)): 0.05
Decision: Since 0.001021 < 0.05, we Reject the Null Hypothesis (\(H_0\)).
The results indicate a statistically significant association between the device used and the payment method preferred.
Strategic Insights:
Mobile-Wallet Synergy: Mobile users show a much higher preference for E-Wallets (120 vs 60). This suggests that "one-tap" or integrated mobile payments are highly effective for the mobile customer segment.
Desktop-Credit Card Preference: Desktop users are more likely to use Credit Cards. This may be due to the ease of typing long card numbers on a physical keyboard or a perception of higher security for large transactions on a computer.
Strategy Recommendations:
Mobile UX: Ensure the E-Wallet checkout flow is the primary option (default) on the mobile app/site to reduce friction.
Desktop Marketing: Offer credit card-linked promotions (e.g., cashback or points) specifically to users browsing on desktop.
COD Stability: Cash on Delivery remains relatively consistent proportionally, suggesting it serves a specific trust-based segment across both platforms.
A fintech startup is evaluating a new fraud detection algorithm. The effectiveness of the algorithm is tested using the following hypotheses:
\(H_0\): The new algorithm does not reduce fraud (No effect).
\(H_1\): The new algorithm reduces fraud (Effective).
A Type I Error occurs when we reject a true null hypothesis (a “false positive” from the researcher’s perspective).
In this context: The startup concludes that the new algorithm is effective at reducing fraud when, in reality, it is not.
Consequence: The company may spend significant resources (money, time, and engineering effort) to deploy an algorithm that provides no actual improvement in security.
A Type II Error occurs when we fail to reject a false null hypothesis (a “false negative”).
In this context: The startup concludes that the new algorithm does not reduce fraud, when in reality, it actually is effective.
Consequence: The company misses an opportunity to improve its security posture and continues to lose money to fraudulent transactions that could have been prevented.
Rationale: While a Type I Error leads to wasted implementation costs, a Type II Error leads to unmitigated financial loss from fraud. Fraud can result in direct monetary theft, legal liabilities, and a massive loss of customer trust.
Risk Appetite: Most fintech companies would prefer to accidentally test a “dud” algorithm (Type I) than to dismiss an algorithm that could have stopped a million-dollar security breach (Type II).
Increased Precision: As the sample size increases, the standard error of the estimate decreases, making the test more sensitive to small differences.
Lowering \(\beta\): A larger sample size provides more evidence. If the algorithm truly works, a larger sample makes it much more likely that the statistical test will detect that effect, thereby reducing the probability of a Type II Error.
Reliability: Small samples often lack the “resolution” to see the impact of the algorithm, leading to a high \(\beta\) (failing to see the truth).
Inverse Relationship (\(\alpha\) vs \(\beta\)): There is usually a trade-off. If you make it harder to commit a Type I error (by lowering \(\alpha\) from 0.05 to 0.01), you automatically make it easier to commit a Type II error (\(\beta\) increases), because you are requiring “stronger” evidence to reject \(H_0\).
Statistical Power (\(1 - \beta\)): Power is the probability of correctly rejecting a false null hypothesis (detecting an effect that actually exists).
The Triad:
To increase Power, you can either increase the Sample Size or accept a higher \(\alpha\) (risk of false positive).
In business analytics, we aim for high Power (usually 0.80 or higher) to ensure that if our product improvements (like the fraud algorithm) work, our data will actually show it.
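As an illustration of this trade-off, base R's `power.t.test()` can show how power rises with sample size and falls when \(\alpha\) is tightened; the effect size (`delta`) and standard deviation below are assumed purely for illustration.

```r
# Power (1 - beta) rises with sample size, all else equal
sapply(c(20, 50, 100), function(n)
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power)
# Lowering alpha from 0.05 to 0.01 lowers power for the same n
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.01)$power
# Solving instead for the n per group needed to reach 80% power
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n
```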
A churn prediction model evaluation yields the following results:
Results:
Test Statistic: 2.31
P-value: 0.021
Significance Level (\(\alpha\)): 0.05
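The type of test is not stated; if the 2.31 statistic is treated as approximately standard normal (an assumption made only for illustration), the reported p-value can be reproduced as:

```r
z <- 2.31
2 * pnorm(-abs(z))   # ≈ 0.021, two-tailed p-value under a normal approximation
```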
In this context, a p-value of 0.021 means there is only a 2.1% chance that we would see results at least this extreme if the churn model or intervention actually had no effect at all.
A low p-value suggests that the observed data is inconsistent with the null hypothesis.
To make a decision, we compare the p-value to the pre-determined significance level (\(\alpha\)):
Comparison: \(P\text{-value} (0.021) < \alpha (0.05)\).
Decision: Reject the Null Hypothesis (\(H_0\)).
The result is considered statistically significant, meaning the evidence is strong enough to conclude that the observed effect is likely not due to random chance.
“Our analysis shows that the new churn prediction strategy is working. There is a very low probability (about 2%) that the improvements we are seeing are just a fluke. Based on these results, we can confidently say that the model is effectively identifying or reducing customer churn, and we should proceed with the implementation.”
Sampling Bias: The statistical significance might hold for that specific group but fail when applied to the whole company.
Generalization Failure: Management might invest in a global rollout based on these results, only to find the model performs poorly on other customer segments, leading to wasted resources.
Misleading Certainty: We might have high statistical confidence (low p-value) in a result that is fundamentally biased, essentially giving us a “precisely wrong” answer.
P-value measures ‘Certainty’: It tells us how sure we are that an effect exists, not how large that effect is.
Sample Size Sensitivity: Even a tiny, practically useless difference (e.g., reducing churn by 0.0001%) can yield a very small p-value if the sample size is large enough.
Effect Size: To understand the business impact, we must look at metrics like the magnitude of the churn reduction (e.g., a 5% drop in churn) or Cohen's d. A p-value of 0.021 tells us the effect is unlikely to be due to chance, but it doesn't tell us whether it will save the company $1,000 or $1,000,000.
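As an illustration, a standardized effect size such as Cohen's d can be computed from summary statistics; the sketch below reuses the A/B test figures from the earlier example.

```r
# Cohen's d for the landing-page A/B test (illustrative)
x1 <- 4.8; s1 <- 1.2; n1 <- 25
x2 <- 5.4; s2 <- 1.4; n2 <- 25
sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d <- (x2 - x1) / sd_pooled   # ≈ 0.46, a small-to-medium effect by Cohen's benchmarks
```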
Casella, George, and Roger L. Berger. 2024. Statistical Inference. 2nd ed. New York: Routledge. https://doi.org/10.1201/9781032724546.
DScience Labs. 2025. "Introduction to Statistics: Chapter 9 - Statistical Inference." https://bookdown.org/dsciencelabs/intro_statistics/09-Statistical_Inference.html.
Ismay, Chester, and Albert Y. Kim. 2025. Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC. https://doi.org/10.1201/9781032724546.