A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.
A random sample of 64 users shows an average study time of 116 minutes.
\[ \begin{eqnarray*} \mu_0 &=& 120 \\ \sigma &=& 15 \\ n &=& 64 \\ \bar{x} &=& 116 \end{eqnarray*} \]
In inference, we start by defining what we are testing against.
Null Hypothesis (\(H_0\)): \(\mu = 120\)
Alternative Hypothesis (\(H_1\)): \(\mu \neq 120\)
We need to see how many “standard errors” our sample mean (\(\bar{x} = 116\)) sits away from the claimed mean (\(\mu = 120\)).
Compute the test statistic:

\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{116 - 120}{15 / \sqrt{64}} = \frac{-4}{1.875} \approx \mathbf{-2.13}\]

Find the p-value: using a standard normal distribution table for \(Z = -2.13\), the two-tailed p-value is \(2 \times P(Z \leq -2.13) = 2 \times 0.0166 \approx 0.033\).

We compare our p-value to our significance level (\(\alpha = 0.05\)): since \(0.033 < 0.05\), we reject \(H_0\).
From a business perspective, the platform’s claim that users study for 120 minutes is statistically unsupported by this data.
The sample mean of 116 minutes is low enough that it is unlikely to have happened by random chance if the true average were 120. As a data analyst, you would advise the marketing or product team that their “120-minute” claim is likely an overestimation and should be revised to reflect actual user behavior more accurately.
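The same z-test is easy to reproduce in R from the summary statistics. Base R has no dedicated one-sample z-test, so this minimal sketch computes it directly:

```r
# One-sample two-tailed z-test (population SD known)
mu0   <- 120                    # claimed mean (minutes)
sigma <- 15                     # known population SD
n     <- 64
xbar  <- 116                    # observed sample mean

z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z))   # two-tailed p-value
c(z = z, p = p_value)
# z ≈ -2.13, p ≈ 0.033 -> reject H0 at alpha = 0.05
```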
A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.
The following data are collected from 10 users:
\[ 9.2,\; 10.5,\; 9.8,\; 10.1,\; 9.6,\; 10.3,\; 9.9,\; 9.7,\; 10.0,\; 9.5 \]
Null Hypothesis (H₀):
H₀: μ = 10 minutes
The average task completion time for the new application is 10 minutes
No difference from the target benchmark
Alternative Hypothesis (H₁):
H₁: μ ≠ 10 minutes
The average task completion time differs from 10 minutes
This is a two-tailed test because we’re checking for any difference (faster or slower)
Justification:
Parameter of interest: Population mean (μ)
Population standard deviation: Unknown (we only have sample data)
Sample size: Small (n = 10 < 30)
Conditions check: observations are independent, and with a small sample we assume task completion times are approximately normally distributed.
Why a t-test instead of a z-test? Because \(\sigma\) is unknown we estimate it with the sample standard deviation \(s\), and with \(n = 10\) the statistic \((\bar{x} - \mu_0)/(s/\sqrt{n})\) follows a t-distribution with \(n - 1 = 9\) degrees of freedom rather than the standard normal.
\[\begin{aligned} \bar{x} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{98.6}{10} \\ &= 9.86 \text{ minutes} \end{aligned}\]
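The test statistic below also needs the sample standard deviation, computed from the same ten observations (the sum of squared deviations works out to 1.344):

\[s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{1.344}{9}} \approx 0.386 \text{ minutes}\]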
\[t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{9.86 - 10}{0.386 / \sqrt{10}} = \frac{-0.14}{0.122} \approx \mathbf{-1.15}\]
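With \(df = n - 1 = 9\), the two-tailed p-value for \(t = -1.15\) is approximately 0.28, far above \(\alpha = 0.05\), so we fail to reject \(H_0\). The full test can be run in R on the raw data as a check:

```r
# One-sample two-tailed t-test against the 10-minute benchmark
times <- c(9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5)
t.test(times, mu = 10, alternative = "two.sided")
# t ≈ -1.15, df = 9, p ≈ 0.28 -> fail to reject H0
```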
Sample size (\(n\)) plays a critical role in hypothesis testing and the reliability of our inferences:
In this case: The sample mean (9.86) is slightly below 10, but with only n=10 and low variability, the difference is not statistically significant. A larger sample (e.g., n=50) showing the same mean difference would likely yield a much smaller p-value and rejection of H₀.
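That claim is easy to sanity-check with a hypothetical \(n = 50\) sample sharing the same mean and standard deviation (illustrative numbers only):

```r
# Hypothetical: same effect (-0.14 min) and same SD, but n = 50
n <- 50; xbar <- 9.86; s <- 0.386; mu0 <- 10
t_stat <- (xbar - mu0) / (s / sqrt(n))
2 * pt(-abs(t_stat), df = n - 1)   # ≈ 0.013 < 0.05 -> H0 would be rejected
```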
A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.
| Version | Sample Size (n) | Mean (minutes) | Standard Deviation (minutes) |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |
Null Hypothesis (H₀):
\[H_0 : \mu_A = \mu_B\]

Alternative Hypothesis (H₁):

\[H_1 : \mu_A \neq \mu_B\]

Test Selection: Two-Sample Independent T-Test (Welch’s t-test)
Justification:
Comparing two independent groups (Version A vs Version B)
Population standard deviations unknown (only sample SDs provided: 1.2 and 1.4)
Sample sizes are moderate (both n = 25), not large enough to fall back on a z-test
Equal variances are not assumed: the sample SDs differ (1.2 vs. 1.4), and Welch’s test accounts for this by adjusting the degrees of freedom
To compare the two groups, we calculate how much the difference in means (\(5.4 - 4.8 = 0.6\)) stands out against the combined “noise” (standard error) of both groups.
Calculate the Standard Error (\(SE\))
Using the sample variance of each group, the standard error of the difference in means is:

\[SE = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = \sqrt{\frac{1.2^2}{25} + \frac{1.4^2}{25}} = \sqrt{0.0576 + 0.0784} = \sqrt{0.136} \approx \mathbf{0.369}\]
Calculate the T-Statistic
\[t = \frac{\bar{x}_B - \bar{x}_A}{SE} = \frac{5.4 - 4.8}{0.369} = \frac{0.6}{0.369} \approx \mathbf{1.626}\]
Determine the P-Value
Using \(df = n_A + n_B - 2 = 48\) (a simplified formula; the Welch–Satterthwaite approximation gives \(df \approx 46.9\)), the two-tailed p-value for \(t = 1.626\) is approximately 0.110.
Comparison: \(p \text{-value} (0.110) > \alpha (0.05)\).
Decision: Fail to Reject the Null Hypothesis (\(H_0\)).
Reasoning: Although Version B had a higher average (5.4 minutes) than Version A (4.8 minutes), the p-value tells us there is an 11% chance this difference happened just by random luck. Since 11% is higher than our 5% threshold, we cannot claim the result is “statistically significant.”
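Since only summary statistics are available (no raw data to pass to `t.test`), a short R sketch reproduces Welch’s test directly:

```r
# Welch's two-sample t-test from summary statistics
m_a <- 4.8; s_a <- 1.2; n_a <- 25   # Version A
m_b <- 5.4; s_b <- 1.4; n_b <- 25   # Version B

se <- sqrt(s_a^2 / n_a + s_b^2 / n_b)
t_stat <- (m_b - m_a) / se
# Welch-Satterthwaite degrees of freedom
df <- se^4 / ((s_a^2 / n_a)^2 / (n_a - 1) + (s_b^2 / n_b)^2 / (n_b - 1))
p_value <- 2 * pt(-abs(t_stat), df)
c(t = t_stat, df = df, p = p_value)
# t ≈ 1.63, df ≈ 46.9, p ≈ 0.11 -> fail to reject H0
```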
Version B shows a higher sample mean session duration (5.4 vs. 4.8 minutes, a +12.5% increase), but this difference is not statistically significant at \(\alpha = 0.05\).
Product Implications:
Key Lesson in A/B Testing: Statistical significance is essential before declaring a “winner.” Many promising variants fail to reach significance due to insufficient sample size or small effect sizes.
An e-commerce company examines whether device type is associated with payment method preference.
| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |
Null Hypothesis (H₀):
\[H_0 : \text{Device type and payment method are independent}\]
Alternative Hypothesis (H₁):
\[H_1 : \text{Device type and payment method are dependent}\]
Test Selection: Pearson’s Chi-Square Test of Independence
Justification:
Two categorical variables:
Device type (Mobile, Desktop) - 2 categories
Payment method (E-Wallet, Credit Card, Cash on Delivery) - 3 categories
Independent observations: Each user contributes to only one cell
Expected frequencies ≥ 5: We’ll verify this during calculation
Goal: Test association/independence between two categorical variables
To find the test statistic, we compare the Observed values (the data we have) to the Expected values (what the data would look like if there were no relationship).
```r
library(knitr)

# Observed counts with marginal totals (payment methods as rows)
data_tabel <- data.frame(
  Payment = c("E-Wallet", "Credit Card", "Cash on Delivery", "Column Total"),
  Mobile = c(120, 80, 50, 250),
  Desktop = c(60, 90, 40, 190),
  Row_Total = c(180, 170, 90, 440)
)
colnames(data_tabel) <- c("Payment / Device", "Mobile", "Desktop", "Row Total")

kable(data_tabel,
      caption = "Contingency Table: Device Type vs Payment Method",
      align = "lccc")  # l = left, c = center
```
| Payment / Device | Mobile | Desktop | Row Total |
|---|---|---|---|
| E-Wallet | 120 | 60 | 180 |
| Credit Card | 80 | 90 | 170 |
| Cash on Delivery | 50 | 40 | 90 |
| Column Total | 250 | 190 | 440 |
Using the formula
\(E = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}}\)
Mobile + E-Wallet: \((250 \times 180) / 440 = \mathbf{102.27}\)
Mobile + Credit Card: \((250 \times 170) / 440 = \mathbf{96.59}\)
Mobile + Cash (COD): \((250 \times 90) / 440 = \mathbf{51.14}\)
Desktop + E-Wallet: \((190 \times 180) / 440 = \mathbf{77.73}\)
Desktop + Credit Card: \((190 \times 170) / 440 = \mathbf{73.41}\)
Desktop + Cash (COD): \((190 \times 90) / 440 = \mathbf{38.86}\)
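All six expected counts can also be generated in one step in R, which is handy for larger tables:

```r
# Expected counts under independence: (row total x column total) / grand total
counts <- matrix(c(120, 80, 50,
                   60, 90, 40), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 2)
#        [,1]  [,2]  [,3]
# [1,] 102.27 96.59 51.14
# [2,]  77.73 73.41 38.86
```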
Formula:
\[\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
Calculation:
\[\begin{aligned} \chi^2 &= \frac{(120-102.27)^2}{102.27} + \frac{(80-96.59)^2}{96.59} + \frac{(50-51.14)^2}{51.14} \\ &\quad + \frac{(60-77.73)^2}{77.73} + \frac{(90-73.41)^2}{73.41} + \frac{(40-38.86)^2}{38.86} \\[10pt] &= \frac{17.73^2}{102.27} + \frac{(-16.59)^2}{96.59} + \frac{(-1.14)^2}{51.14} \\ &\quad + \frac{(-17.73)^2}{77.73} + \frac{16.59^2}{73.41} + \frac{1.14^2}{38.86} \\[10pt] &= \frac{314.35}{102.27} + \frac{275.23}{96.59} + \frac{1.30}{51.14} \\ &\quad + \frac{314.35}{77.73} + \frac{275.23}{73.41} + \frac{1.30}{38.86} \\[10pt] &= 3.074 + 2.849 + 0.025 + 4.044 + 3.749 + 0.033 \\ &= \mathbf{13.774} \end{aligned}\]
Degrees of freedom:
\[df = (2 - 1) \times (3 - 1) = 1 \times 2 = 2\]
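At \(\alpha = 0.05\) with \(df = 2\), the critical value is \(\chi^2_{0.05,\,2} = 5.991\). Since \(13.774 > 5.991\) (p ≈ 0.001), we reject \(H_0\): device type and payment method preference are associated. R’s built-in test confirms the hand calculation:

```r
# Chi-square test of independence on the observed 2x3 table
observed <- matrix(c(120, 80, 50,
                     60, 90, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Device  = c("Mobile", "Desktop"),
                                   Payment = c("E-Wallet", "Credit Card",
                                               "Cash on Delivery")))
chisq.test(observed)
# X-squared ≈ 13.77, df = 2, p-value ≈ 0.001 -> reject H0
```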
A fintech startup tests whether a new fraud detection algorithm reduces fraudulent transactions.
A churn prediction model evaluation yields the following results: