A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.
A random sample of 64 users shows an average study time of 116 minutes.
\[ \begin{eqnarray*} \mu_0 &=& 120 \\ \sigma &=& 15 \\ n &=& 64 \\ \bar{x} &=& 116 \end{eqnarray*} \]
In inference, we start by defining what we are testing against.
Null Hypothesis (\(H_0\)): \(\mu = 120\)
Alternative Hypothesis (\(H_1\)): \(\mu \neq 120\)
We need to see how many “standard errors” our sample mean (\(\bar{x} = 116\)) sits away from the claimed mean (\(\mu = 120\)).
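Plugging the numbers from the problem into the Z-statistic:

\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{116 - 120}{15 / \sqrt{64}} = \frac{-4}{1.875} \approx -2.13\]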
Find the p-value: using a standard normal distribution table for \(Z = -2.13\), the two-tailed p-value is \(2 \times P(Z \le -2.13) \approx 2 \times 0.0166 \approx 0.033\).
We compare our p-value to our significance level (\(\alpha = 0.05\)): since \(0.033 < 0.05\), we reject \(H_0\).
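A quick check in base R reproduces this p-value (a minimal sketch; `z_score` simply mirrors the statistic computed above):

```r
# Two-tailed p-value for Z = -2.13 under the standard normal distribution
z_score <- -2.13
p_value <- 2 * pnorm(-abs(z_score))
round(p_value, 4)   # roughly 0.033, below alpha = 0.05
```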
From a business perspective, the platform’s claim that users study for 120 minutes is statistically unsupported by this data.
The sample mean of 116 minutes is low enough that it is unlikely to have happened by random chance if the true average were 120. As a data analyst, you would advise the marketing or product team that their “120-minute” claim is likely an overestimation and should be revised to reflect actual user behavior more accurately.
A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.
The following data are collected from 10 users:
\[ 9.2,\; 10.5,\; 9.8,\; 10.1,\; 9.6,\; 10.3,\; 9.9,\; 9.7,\; 10.0,\; 9.5 \]
Null Hypothesis (H₀):
H₀: μ = 10 minutes
The average task completion time for the new application is 10 minutes
No difference from the target benchmark
Alternative Hypothesis (H₁):
H₁: μ ≠ 10 minutes
The average task completion time differs from 10 minutes
This is a two-tailed test because we’re checking for any difference (faster or slower)
Justification:
Parameter of interest: Population mean (μ)
Population standard deviation: Unknown (we only have sample data)
Sample size: Small (n = 10 < 30)
Conditions check: the users were sampled at random, the observations are independent, and the ten values show no extreme skew or outliers, so the normality assumption behind a small-sample t-test is reasonable.
Why a t-test instead of a z-test? The population standard deviation is unknown and the sample is small (n = 10), so we estimate \(\sigma\) with the sample standard deviation \(s\) and use the t-distribution with \(n - 1 = 9\) degrees of freedom.
\[\begin{aligned} \bar{x} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{98.6}{10} \\ &= 9.86 \text{ minutes} \end{aligned}\]
The sample standard deviation of the ten observations works out to \(s \approx 0.386\) minutes, so

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{9.86 - 10}{0.386 / \sqrt{10}} = \frac{-0.14}{0.122} \approx \mathbf{-1.15}\]
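As a sanity check, the same test can be run directly in R (a minimal sketch; `task_times` is just the ten observations listed above):

```r
# One-sample, two-sided t-test of H0: mu = 10 on the ten observed times
task_times <- c(9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5)
t.test(task_times, mu = 10, alternative = "two.sided")
# Reproduces mean = 9.86, s of about 0.386, t of about -1.15 on 9 degrees of
# freedom; the two-sided p-value comes out around 0.28, well above 0.05
```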
Sample size (\(n\)) plays a critical role in hypothesis testing and the reliability of our inferences:
In this case: The sample mean (9.86) is slightly below 10, but with only n=10 and low variability, the difference is not statistically significant. A larger sample (e.g., n=50) showing the same mean difference would likely yield a much smaller p-value and rejection of H₀.
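To make that point concrete, the sketch below holds the observed mean difference and standard deviation fixed and varies only the (hypothetical) sample size:

```r
# Hypothetical: keep the observed difference (-0.14) and SD (0.386) fixed,
# and watch what happens to the p-value as the sample size grows
s <- 0.386
mean_diff <- 9.86 - 10
for (n in c(10, 50)) {
  t_stat <- mean_diff / (s / sqrt(n))
  p_val  <- 2 * pt(-abs(t_stat), df = n - 1)
  cat(sprintf("n = %3d: t = %5.2f, p = %.3f\n", n, t_stat, p_val))
}
# With n = 10 the p-value stays well above 0.05; with n = 50 the same
# difference would fall below the 0.05 threshold
```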
A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.
| Version | Sample Size (n) | Mean | Standard Deviation |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |
Null Hypothesis (H₀):
\[H_0 : \mu_A = \mu_B\]
Alternative Hypothesis (H₁):
\[H_1 : \mu_A \neq \mu_B\]
Test Selection: Two-Sample Independent T-Test (Welch’s t-test)
Justification:
Comparing two independent groups (Version A vs Version B)
Population standard deviations unknown (only sample SDs provided: 1.2 and 1.4)
Sample sizes are moderate (both n = 25), so we rely on the t-distribution rather than a large-sample z approximation
Equal-variances check: the sample standard deviations (1.2 and 1.4) are similar, but Welch’s t-test does not assume equal variances, so no pooled-variance check is required
To compare the two groups, we calculate how much the difference in means (\(5.4 - 4.8 = 0.6\)) stands out against the combined “noise” (standard error) of both groups.
Calculate the Standard Error (\(SE\))
Plugging the sample standard deviations and sizes (\(n_A = n_B = 25\)) into the standard-error formula:
\[SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\] \[SE = \sqrt{\frac{1.2^2}{25} + \frac{1.4^2}{25}} = \sqrt{\frac{1.44}{25} + \frac{1.96}{25}} \\ = \sqrt{0.0576 + 0.0784} \approx \mathbf{0.369}\]
Calculate the T-Statistic
\[t = \frac{\bar{x}_B - \bar{x}_A}{SE}\\ = \frac{5.4 - 4.8}{0.369} = \frac{0.6}{0.369} \approx \mathbf{1.626}\]
Determine the P-Value
Using the degrees of freedom (\(df \approx 48\), from the simplified \(n_A + n_B - 2\)), the two-tailed p-value for \(t \approx 1.626\) is approximately 0.110.
Comparison: \(p \text{-value} (0.110) > \alpha (0.05)\).
Decision: Fail to Reject the Null Hypothesis (\(H_0\)).
Reasoning: Although Version B had a higher average (5.4 minutes) than Version A (4.8 minutes), the p-value tells us that if there were truly no difference between the two versions, a gap this large would appear about 11% of the time through random variation alone. Since 11% is higher than our 5% threshold, we cannot claim the result is “statistically significant.”
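The full calculation can be reproduced in R from the summary statistics in the table above (a sketch; the variable names are ours, and the exact Welch degrees of freedom come out close to the simplified 48 used above):

```r
# Welch two-sample t-test computed from summary statistics (no raw data available)
n_a <- 25; mean_a <- 4.8; sd_a <- 1.2   # Version A
n_b <- 25; mean_b <- 5.4; sd_b <- 1.4   # Version B

se     <- sqrt(sd_a^2 / n_a + sd_b^2 / n_b)          # standard error, ~0.369
t_stat <- (mean_b - mean_a) / se                     # t statistic, ~1.63
df_w   <- (sd_a^2 / n_a + sd_b^2 / n_b)^2 /
  ((sd_a^2 / n_a)^2 / (n_a - 1) + (sd_b^2 / n_b)^2 / (n_b - 1))  # Welch df, ~47
p_val  <- 2 * pt(-abs(t_stat), df_w)                 # two-sided p, ~0.11

round(c(SE = se, t = t_stat, df = df_w, p = p_val), 3)
```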
Version B shows a higher sample mean session duration (5.4 vs. 4.8 minutes, a +12.5% increase), but this difference is not statistically significant at \(\alpha = 0.05\).
Product Implications:
Key Lesson in A/B Testing: Statistical significance is essential before declaring a “winner.” Many promising variants fail to reach significance due to insufficient sample size or small effect sizes.
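To quantify the sample-size point, here is a rough power calculation with the pwr package (the effect size d is derived from the summary statistics above; treat this as an illustrative sketch, not a formal sample-size plan):

```r
library(pwr)

# Standardized effect size from the observed means and a pooled SD
d <- (5.4 - 4.8) / sqrt((1.2^2 + 1.4^2) / 2)   # about 0.46

# Sample size per version needed to detect d with 80% power at alpha = 0.05
pwr.t.test(d = d, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")
# Suggests roughly 75 users per version, versus the 25 per version collected here
```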
An e-commerce company examines whether device type is associated with payment method preference.
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
Null Hypothesis (H₀):
\[H_0 : \text{Device type and payment} \\ \text{method are independent}\]
Alternative Hypothesis (H₁):
\[H_1 : \text{Device type and payment}\\ \text{method are dependent}\]
Test Selection: Pearson’s Chi-Square Test of Independence
Justification:
Two categorical variables:
Device type (Mobile, Desktop) - 2 categories
Payment method (E-Wallet, Credit Card, Cash on Delivery) - 3 categories
Independent observations: Each user contributes to only one cell
Expected frequencies ≥ 5: We’ll verify this during calculation
Goal: Test association/independence between two categorical variables
To find the test statistic, we compare the Observed values (the data we have) to the Expected values (what the data would look like if there were no relationship).
library(knitr)
data_tabel <- data.frame(
  Payment_Method = c("E-Wallet", "Credit Card", "Cash on Delivery", "Column Total"),
  Mobile = c(120, 80, 50, 250),
  Desktop = c(60, 90, 40, 190),
  Row_Total = c(180, 170, 90, 440)
)
colnames(data_tabel) <- c("Payment Method", "Mobile", "Desktop", "Row Total")
kable(data_tabel,
caption = "Contingency Table: Device Type vs Payment Method",
align = "lccc") # l=left, c=center
| Payment Method | Mobile | Desktop | Row Total |
|---|---|---|---|
| E-Wallet | 120 | 60 | 180 |
| Credit Card | 80 | 90 | 170 |
| Cash on Delivery | 50 | 40 | 90 |
| Column Total | 250 | 190 | 440 |
Using the formula
\[E = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}}\]
Mobile + E-Wallet: \((250 \times 180) / 440 = \mathbf{102.27}\)
Mobile + Credit Card: \((250 \times 170) / 440 = \mathbf{96.59}\)
Mobile + Cash (COD): \((250 \times 90) / 440 = \mathbf{51.14}\)
Desktop + E-Wallet: \((190 \times 180) / 440 = \mathbf{77.73}\)
Desktop + Credit Card: \((190 \times 170) / 440 = \mathbf{73.41}\)
Desktop + Cash (COD): \((190 \times 90) / 440 = \mathbf{38.86}\)
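These six expected counts can be verified in R with one line of matrix arithmetic (a small sketch; `observed` holds the counts with devices as rows, matching the original contingency table):

```r
# Observed counts with devices as rows (same layout as the original table)
observed <- matrix(c(120, 80, 50,
                     60, 90, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Device  = c("Mobile", "Desktop"),
                                   Payment = c("E-Wallet", "Credit Card", "Cash on Delivery")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)   # reproduces the six values above; all are comfortably >= 5
```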
Formula:
\[\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
Calculation:
\[\begin{aligned} \chi^2 &= \frac{(120-102.27)^2}{102.27} + \frac{(80-96.59)^2}{96.59} + \frac{(50-51.14)^2}{51.14} \\ &\quad + \frac{(60-77.73)^2}{77.73} + \frac{(90-73.41)^2}{73.41} + \frac{(40-38.86)^2}{38.86} \\[10pt] &= \frac{17.73^2}{102.27} + \frac{(-16.59)^2}{96.59} + \frac{(-1.14)^2}{51.14} \\ &\quad + \frac{(-17.73)^2}{77.73} + \frac{16.59^2}{73.41} + \frac{1.14^2}{38.86} \\[10pt] &= \frac{314.35}{102.27} + \frac{275.23}{96.59} + \frac{1.30}{51.14} \\ &\quad + \frac{314.35}{77.73} + \frac{275.23}{73.41} + \frac{1.30}{38.86} \\[10pt] &= 3.074 + 2.849 + 0.025 + 4.044 + 3.749 + 0.033 \\ &= 13.774 \end{aligned}\]\[\chi^2 = 13.774\]
Degrees of freedom:
\[df = (2 - 1) \times (3 - 1) = 1 \times 2 = 2\]
Compare to \(\alpha = 0.05\): for \(\chi^2 = 13.774\) with \(df = 2\), the p-value is approximately 0.001, far below 0.05.
Decision: Reject the null hypothesis (H₀). There is strong evidence at the 5% significance level of an association between device type and payment method preference. (Alternatively, using critical value: For \(\alpha = 0.05\), df=2, critical \(\chi^2 = 5.991\). Observed 13.77 > 5.991 → reject H₀.)
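The same decision follows from a single call to R’s built-in `chisq.test` (a self-contained sketch that repeats the observed counts):

```r
# Pearson's chi-square test of independence on the 2 x 3 contingency table
observed <- matrix(c(120, 80, 50,
                     60, 90, 40),
                   nrow = 2, byrow = TRUE)
chisq.test(observed)
# Reports X-squared of about 13.77 on 2 degrees of freedom, p-value of about 0.001
```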
Strategy:
A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.
A Type I error occurs when we reject the null hypothesis when it is actually true. This is a false positive or false alarm.
In the Fraud Detection Context: the team concludes that the new algorithm reduces fraudulent transactions when, in reality, it performs no better than the existing system.
Practical Example:
The startup conducts a test, analyzes the data, and finds a statistically significant reduction in fraudulent transactions (p < 0.05). They implement the new algorithm company-wide, but in reality, the algorithm is no better than the old one. They’ve wasted resources implementing an ineffective solution.
Mathematical Representation:
\[\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})\]
Common α level: 0.05 (5% risk of false positive)
A Type II error occurs when we fail to reject the null hypothesis when it is actually false. This is a false negative or missed detection.
In the Fraud Detection Context: the team concludes that the new algorithm provides no improvement when, in reality, it genuinely reduces fraud.
Practical Example: The startup tests the algorithm, finds no statistically significant improvement (p > 0.05), and decides to abandon it. However, the algorithm actually works and could have significantly reduced fraud losses. They’ve missed an opportunity to improve their system.
Mathematical Representation:
\[\beta = P(\text{Fail to reject } H_0 \mid H_1 \text{ is true})\]
Power: 1 - \(\beta\) (probability of correctly detecting a real effect)
| Error Type | Financial Cost | Reputational Cost | Operational Cost |
|---|---|---|---|
| Type I | Development & implementation costs; maintenance of an unnecessary system; opportunity cost | Loss of trust in results; credibility damage | Wasted engineering resources; added system complexity |
| Type II | Continued fraud losses; lost revenue from fraud; regulatory fines | Perception of being behind competitors; customer dissatisfaction | Missed efficiency gains; use of a suboptimal system |
Business Perspective Assessment for a Fintech Startup:
Conclusion: In this specific context, a Type II error is likely more costly, because continued fraud losses, lost revenue, regulatory fines, and reputational damage generally outweigh the one-time cost of deploying and maintaining an algorithm that turns out to be ineffective.
However, this depends on factors such as the scale of current fraud losses, the cost of implementing and maintaining the new system, and the regulatory environment the startup operates in.
General Rule: In safety-critical or high-risk domains (medicine, fraud detection, security), Type II errors are often more dangerous.
# R code to demonstrate sample size effect on power
library(pwr)
# Fixed parameters: effect size d = 0.5, alpha = 0.05
effect_size <- 0.5
alpha <- 0.05
# Calculate power for different sample sizes
sample_sizes <- c(10, 20, 50, 100, 200)
powers <- sapply(sample_sizes, function(n) {
pwr.t.test(n = n, d = effect_size, sig.level = alpha,
type = "one.sample")$power
})
data.frame(Sample_Size = sample_sizes,
Power = round(powers, 3),
Type_II_Error = round(1 - powers, 3))
Practical Implications for the Startup:
Recommendation:
Sample size (\(n\)), the significance level (\(\alpha\)), and power (\(1 - \beta\)) are a “balancing act”: increasing \(n\) raises power, while tightening \(\alpha\) (fewer false positives) lowers power for a fixed sample size.
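The trade-off can be seen directly by extending the pwr example above (same assumed effect size d = 0.5; the α values are illustrative):

```r
library(pwr)

# Hold n = 50 and d = 0.5 fixed; tightening alpha reduces power (raises beta)
alphas <- c(0.10, 0.05, 0.01)
powers <- sapply(alphas, function(a)
  pwr.t.test(n = 50, d = 0.5, sig.level = a, type = "one.sample")$power)

data.frame(Alpha = alphas,
           Power = round(powers, 3),
           Type_II_Error = round(1 - powers, 3))
```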
A churn prediction model evaluation yields the following results: a test statistic of \(Z = 2.31\), a corresponding two-tailed p-value of 0.021, and a significance level of \(\alpha = 0.05\).
Most students memorize that a p-value is a probability, but few understand what it actually measures: the compatibility of the data with the Null Hypothesis.
Imagine the “Null World”—a world where our churn model is a total fraud and has zero predictive power. In that world, any success we see is just “the luck of the draw.”
The p-value of 0.021 tells us that if we lived in that Null World, the chance of seeing a result as strong as ours is only 2.1%. Because that probability is so low, we conclude that we likely do not live in the Null World. The model is likely doing something “real.”
In statistics, we don’t say “The model works.” We say “The evidence is strong enough to reject the idea that it doesn’t work.” This is why we use the Significance Level (\(\alpha\)).
\(\alpha\) is our “tolerance for being wrong.” By setting \(\alpha = 0.05\), we are saying: “I am willing to accept a 5% risk of being a False Positive (accusing the Null of being false when it’s actually true).”
Decision Logic: Since \(0.021 < 0.05\), the “weight” of our evidence is heavier than our “risk threshold.” Official Decision: REJECT \(H_0\). We have moved from the land of “maybe” into the land of “statistically significant.”
When you talk to a CEO, they don’t want to hear about Z-scores. They want to hear about Reliability.
The Refined Sentence: “We have validated the Churn Prediction Model against a 95% confidence standard. If the model had no real predictive power, results this strong would appear by chance only about 2% of the time. From a strategic standpoint, this model is now ‘Production-Ready’ for our retention campaigns.”
A p-value is only as good as the data it was built on. This is the Representative Sampling Principle.
If our training data only included “Premium Users,” but we are applying the model to “Free Tier Users,” our p-value is a lie. This is known as Selection Bias.
The Danger: If the sample isn’t a “mini-me” of the entire population, your inferential results will not generalize. You might have a “significant” result that only works for a tiny, specific group of people, leading to a massive failure when the model is launched globally.
This is the most common pitfall for junior analysts.
Always pair your p-value with an Effect Size metric. A result can be statistically significant (real) but practically insignificant (too small to care about).
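A small simulation makes the distinction concrete (all numbers below are hypothetical and unrelated to the churn model): with a large enough sample, even a negligible effect can clear the significance threshold.

```r
# Hypothetical example: a tiny true effect and a very large sample
set.seed(42)
n <- 100000
group_a <- rnorm(n, mean = 50.0, sd = 10)   # baseline metric
group_b <- rnorm(n, mean = 50.2, sd = 10)   # +0.2 points: a trivial lift

result   <- t.test(group_b, group_a)        # Welch two-sample t-test
cohens_d <- (mean(group_b) - mean(group_a)) / 10   # standardized effect size

result$p.value   # typically far below 0.05: "statistically significant"
cohens_d         # around 0.02: far too small to matter in practice
```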
Use the code below to turn the raw test statistic into a final decision. This automates the “Z-table” look-up process.
# Input Variables
z_score <- 2.31
alpha_threshold <- 0.05
# 1. Calculate P-value (Two-Tailed)
# pnorm(z) finds the area to the left. 1-pnorm(z) finds the area to the right.
# We multiply by 2 because we are testing for a difference in 'either' direction.
calculated_p <- 2 * (1 - pnorm(abs(z_score)))
# 2. Display the Results with Professional Formatting
cat("--- CHURN MODEL EVALUATION REPORT ---\n")
## --- CHURN MODEL EVALUATION REPORT ---
cat("Test Statistic (Z): ", z_score, "\n")
## Test Statistic (Z): 2.31
cat("P-Value calculated: ", round(calculated_p, 4), "\n")
## P-Value calculated: 0.0209
cat("Alpha Threshold: ", alpha_threshold, "\n")
## Alpha Threshold: 0.05
cat("-------------------------------------\n")
## -------------------------------------
# 3. The 'Decision Engine'
if (calculated_p <= alpha_threshold) {
print("CONCLUSION: STATISTICALLY SIGNIFICANT. REJECT NULL.")
print("STRATEGY: IMPLEMENT MODEL.")
} else {
print("CONCLUSION: NOT SIGNIFICANT. FAIL TO REJECT NULL.")
print("STRATEGY: RE-EVALUATE DATA SOURCE.")
}
## [1] "CONCLUSION: STATISTICALLY SIGNIFICANT. REJECT NULL."
## [1] "STRATEGY: IMPLEMENT MODEL."