Welcome! In our last session, we learned to cast a “net” around a population parameter with confidence intervals. Today, we take a more decisive step. We’re moving from estimating to deciding. This is the powerful world of Hypothesis Testing.
Think of hypothesis testing as a courtroom trial. The null hypothesis \(H_0\) plays the role of the defendant's presumed innocence, the alternative \(H_1\) is the prosecution's claim, and our sample is the evidence. We only "convict" (reject \(H_0\)) if the evidence against innocence is overwhelming.
The crucial question is: “If the null hypothesis were true, how surprising is my evidence?” If your sample result is extremely unlikely to occur just by random chance under the null hypothesis, you have sufficient evidence to reject the null hypothesis and declare that your alternative theory is more plausible.
In this judicial process, we can make two kinds of mistakes:
|  | Truth: \(H_0\) is True (Innocent) | Truth: \(H_0\) is False (Guilty) |
|---|---|---|
| Decision: Don’t Reject \(H_0\) | Correct Decision (Acquit innocent), Probability = \(1-\alpha\) | Type II Error (Acquit guilty), Probability = \(\beta\) |
| Decision: Reject \(H_0\) | Type I Error (Convict innocent), Probability = \(\alpha\) | Correct Decision (Convict guilty), Probability = \(1-\beta\) (Power) |
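In fact, you can watch \(\alpha\) happen. Here is a minimal simulation sketch (the population and sample size are made up for illustration): we draw many samples from a world where \(H_0\) is true and count how often a two-tailed Z-test at \(\alpha = 0.05\) falsely “convicts”.

set.seed(1)
# A world where H0: mu = 0 is TRUE (sigma = 1 known, n = 30)
reps <- 10000
rejections <- replicate(reps, {
  x <- rnorm(30, mean = 0, sd = 1)
  z <- (mean(x) - 0) / (1 / sqrt(30))
  abs(z) > qnorm(0.975)  # two-tailed rejection rule at alpha = 0.05
})
mean(rejections)  # close to 0.05: the Type I error rate alpha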
Let’s start with the simplest case: we have a claim about a single population, and we’ve collected one sample to test it.
This is our foundational case. It’s rare in the real world to know the true population variance, but it’s the perfect place to build our intuition. Here, we test a claim about the mean \(\mu\) of a normally distributed population where we know \(\sigma^2\).
Since we know the population’s true standard deviation \(\sigma\), our evidence is measured with a Z-statistic. This tells us how many standard errors away our sample mean \(\bar{x}\) is from the hypothesized mean \(\mu_0\). Under the null hypothesis, this statistic follows a standard normal distribution: \[Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)\]
A car manufacturer, after modifying an engine, suspects that CO2 emissions have increased from the previous average of 130 g/km. A sample of \(n = 12\) modified engines gives a mean of \(\bar{x} = 135\) g/km; the population standard deviation is known to be \(\sigma = 10\) g/km, and we test at \(\alpha = 0.05\).
First, we state our hypotheses. The suspicion is about an increase, so this is a classic upper-tail test: \[H_0: \mu = 130 \quad \text{vs.} \quad H_1: \mu > 130\]
Next, we calculate our test statistic, which is the Z-score for our sample mean: \[z_{obs} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{135 - 130}{10 / \sqrt{12}} = \frac{5}{2.887} \approx 1.732\] Our sample mean is 1.732 standard errors above the old mean. Is this surprising enough?
The Critical Value Method: We draw a “line in the sand”. For an upper-tail test with \(\alpha = 0.05\), this line is at the Z-value that cuts off the top 5% of the normal distribution. This critical value is \(z_{0.05} = 1.645\). Our decision rule is: if \(z_{obs}\) crosses this line, we reject \(H_0\). Since 1.732 > 1.645, our result is in the rejection region. We reject the null hypothesis.
The p-value Method: This is more informative. We ask, “What’s the probability of getting a result at least as extreme as ours, if \(H_0\) were true?” This probability is the p-value. Here, \(p\text{-value} = P(Z \ge 1.732) \approx 0.0416\). This means there’s only about a 4.2% chance of seeing a sample mean this high if the true mean were still 130. Since this probability (0.0416) is smaller than our significance level (\(\alpha = 0.05\)), our result is “statistically significant.” We reject the null hypothesis.
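Both numbers are quick sanity checks in base R:

qnorm(0.95)       # critical value for the upper 5%: 1.645
1 - pnorm(1.732)  # upper-tail p-value: about 0.0416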
Let’s verify the full test with R.
# We can simulate a sample, then shift it so its mean is exactly 135,
# matching the worked example (with sigma known, only xbar matters here).
set.seed(12)
co2_sample <- rnorm(n = 12, mean = 135, sd = 10)
co2_sample <- co2_sample - mean(co2_sample) + 135
# Use the TEST.mean function with the known sigma.
TEST.mean(x = co2_sample, mu0 = 130, sigma = 10, alternative = "greater", digits = 4)
##  n xbar sigma_X     SE   stat p-value
## 12  135      10 2.8868 1.7321  0.0416
The R output confirms our manual calculations perfectly. The evidence suggests the manufacturer’s suspicion is correct.
This is the more common and realistic scenario where we don’t know the true population variance. We must estimate it using our sample’s standard deviation, \(s\). This extra uncertainty means we can’t use the Z-distribution anymore. Instead, we use the Student’s t-distribution, which is like a Z-distribution but with slightly “fatter” tails to account for our uncertainty about the true standard deviation. The test statistic is: \[t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}\] The shape of the t-distribution depends on the degrees of freedom (df), which here is \(n-1\).
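You can see the “fatter tails” directly: for small degrees of freedom the t critical values are noticeably larger than the normal one, and they shrink toward 1.96 as df grows. A quick base-R check:

qt(0.975, df = c(5, 15, 30, 1000))  # 2.571 2.131 2.042 1.962
qnorm(0.975)                        # 1.960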
A restaurant chain tests a new ordering process to see if it reduces the average waiting time, which was previously 7 minutes. A sample of \(n = 15\) orders under the new process gives \(\bar{x} = 6.2\) minutes with a sample standard deviation of \(s = 1.5\) minutes; we test at \(\alpha = 0.10\).
The hypotheses for this lower-tail test are: \[H_0: \mu = 7 \quad \text{vs.} \quad H_1: \mu < 7\]
The observed t-statistic is: \[t_{obs} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{6.2 - 7}{1.5 / \sqrt{15}} = -2.066\] Our sample result is 2.066 sample standard errors below the old average. The critical value for a lower-tail test with \(df = 15-1=14\) at \(\alpha=0.10\) is \(-t_{14, 0.10} = -1.345\). Since -2.066 is less than -1.345, our result is in the rejection region. We reject \(H_0\). The p-value, \(P(t_{14} \le -2.066)\), is approximately 0.029. Since this is less than our \(\alpha\) of 0.10, we again confirm the decision. There is significant evidence that the new process improved waiting times.
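As before, the critical value and p-value are one-liners in base R:

qt(0.10, df = 14)    # critical value: -1.345
pt(-2.066, df = 14)  # lower-tail p-value: about 0.029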
# We can simulate a sample, then rescale it so that xbar = 6.2 and
# s = 1.5 exactly, matching the worked example.
set.seed(15)
wait_time_sample <- rnorm(n = 15, mean = 6.2, sd = 1.5)
wait_time_sample <- (wait_time_sample - mean(wait_time_sample)) /
  sd(wait_time_sample) * 1.5 + 6.2
# Run the test (unknown variance is the default for TEST.mean)
TEST.mean(x = wait_time_sample, mu0 = 7, alternative = "less", digits = 4)
##                n xbar s_X     se    stat p-value
## Normal.Approx 15  6.2 1.5 0.3873 -2.0656  0.0194
## Student-t     15  6.2 1.5 0.3873 -2.0656  0.0289
The Student-t row reproduces our manual calculation: t = -2.066 with a p-value of about 0.029.
This test is for categorical data (Yes/No, Success/Failure). We want to test a claim about the proportion of “successes” in a population. We use a Z-test, provided the sample is large enough for the normal approximation to be valid (the rule of thumb is \(n \cdot p_0 \cdot (1-p_0) > 5\)). The test statistic is: \[Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \approx \mathcal{N}(0, 1)\] Note that the standard error in the denominator uses the hypothesized proportion \(p_0\), because we build our test assuming the null hypothesis is true.
In the past, 70% of students rated an internship program positively. After some changes, a new sample of \(n=100\) students is taken, and 80 of them rate it positively (\(\hat{p} = 0.80\)). Has the satisfaction rate changed significantly?
Since we are asking about a change in either direction, the hypotheses are \(H_0: p = 0.70\) vs. \(H_1: p \neq 0.70\). The Z-statistic is: \[Z_{obs} = \frac{0.80 - 0.70}{\sqrt{\frac{0.70(1-0.70)}{100}}} = \frac{0.10}{0.0458} \approx 2.18\] For a two-tailed test, the p-value is the probability of being at least this far from the mean in either direction: \[p\text{-value} = P(|Z| \ge 2.18) = 2 \times P(Z \ge 2.18) \approx 2 \times 0.0146 = 0.0292\] If we were testing at \(\alpha=0.05\), since \(0.0292 < 0.05\), we would reject \(H_0\) and conclude that the satisfaction rate has significantly changed.
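The same arithmetic in base R:

se0 <- sqrt(0.70 * 0.30 / 100)  # standard error under H0
z <- (0.80 - 0.70) / se0        # about 2.18
2 * (1 - pnorm(z))              # two-tailed p-value: about 0.029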
# Test if the proportion of clients dissatisfied with fees is less than 25%
# Using the Bank dataset for an R example
TEST.prop(x = Bank$FeesOK, success = "No", p0 = 0.25, alternative = "less")
##    n phat  s_X   se stat p-value
## 1620 0.23 0.43 0.01 -2.3 0.01
This is where hypothesis testing gets really interesting. We are often more interested in comparing two groups than in evaluating a single one.
This powerful design is used for “before-and-after” measurements on the same subject, or for “matched-pair” studies. The trick is to simplify the problem: we calculate the difference for each pair, \(d_i = x_i - y_i\), and then perform a simple one-sample t-test on these differences. The test statistic is: \[t = \frac{\bar{d} - d_0}{s_d / \sqrt{n}} \sim t_{n-1}\] Here, \(d_0\) is the hypothesized difference (usually 0).
A company implements a new software and wants to test if it increases employee efficiency by more than 6.5 points on average. They measure efficiency before (Pre) and after (Post) for a sample of employees. We use the Transition dataset to perform the test.
# Test if the mean difference (Post - Pre) is greater than 6.5
TEST.diffmean(x = Transition$Post, y = Transition$Pre, mdiff0 = 6.5,
              type = "paired", alternative = "greater")
##                 n  xbar  ybar dbar=xbar-ybar   s_D   se stat p-value
## Normal.Approx 130 57.39 47.95 9.44 12.69 1.11 2.64 0.004
## Student-t 130 57.39 47.95 9.44 12.69 1.11 2.64 0.005
The p-value is very small, leading us to reject \(H_0\). The new software didn’t just help; the evidence strongly suggests it increased efficiency by more than 6.5 points on average.
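If you prefer base R, the Student-t row can be reproduced with t.test (a sketch, assuming the Transition data frame is loaded):

# Paired t-test: equivalent to a one-sample t-test on the differences
t.test(Transition$Post, Transition$Pre, paired = TRUE,
       mu = 6.5, alternative = "greater")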
This is the classic test for comparing two separate, unrelated groups. We’ll assume the population variances are unknown but equal, which allows us to “pool” the variance information from both samples to get a better estimate. The test statistic is: \[t = \frac{(\bar{x} - \bar{y}) - d_0}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}} \sim t_{n_x+n_y-2} \quad \text{where} \quad s_p^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}\]
A supermarket wants to know if the average spending of customers arriving by car (\(x\)) is different from those using public transport (\(y\)). The samples give \(n_x = 134\), \(\bar{x} = 83.38\), \(s_x^2 = 157\) and \(n_y = 109\), \(\bar{y} = 81.47\), \(s_y^2 = 199.9\); we test at \(\alpha = 0.01\).
First, state the hypotheses: \(H_0: \mu_x - \mu_y = 0\) vs. \(H_1: \mu_x - \mu_y \neq 0\). Next, calculate the pooled variance: \[s_p^2 = \frac{(134-1) \cdot 157 + (109-1) \cdot 199.9}{134+109-2} \approx 176.24\] Then, calculate the t-statistic: \[t_{obs} = \frac{(83.38 - 81.47) - 0}{\sqrt{\frac{176.24}{134} + \frac{176.24}{109}}} = \frac{1.91}{1.713} \approx 1.11\] The critical value for a two-tailed test at \(\alpha=0.01\) with \(df=241\) is approximately \(\pm t_{241, 0.005} \approx \pm 2.60\). Since \(|1.11| < 2.60\), we fail to reject \(H_0\). There is not enough evidence to conclude that the average spending between the two groups is different.
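These hand calculations are easy to reproduce from the summary statistics alone:

nx <- 134; ny <- 109
sp2 <- ((nx - 1) * 157 + (ny - 1) * 199.9) / (nx + ny - 2)  # pooled variance: 176.24
se <- sqrt(sp2 / nx + sp2 / ny)                             # 1.713
t_obs <- (83.38 - 81.47) / se                               # 1.11
2 * pt(-abs(t_obs), df = nx + ny - 2)                       # two-tailed p-value: about 0.27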
# R Example: Test for a difference in mean time spent between Area A and Area B
TEST.diffmean(x = Time, by = Area, mdiff0 = 0, alternative = "two.sided", data = Time_Social)
##                n_x  n_y  xbar  ybar xbar-ybar  s_X  s_Y  se  stat p-value
## Normal.Approx 3053 2923 35.39 35.83 -0.43 7.66 7.97 0.2 -2.13 0.03
## Student-t 3053 2923 35.39 35.83 -0.43 7.66 7.97 0.2 -2.13 0.03
This is the workhorse of A/B testing. We compare the proportions of success in two independent groups. The test statistic is a Z-score, where we use a pooled proportion \(\hat{p}_0\) to calculate the standard error under the null hypothesis that the proportions are equal. \[Z = \frac{(\hat{p}_x - \hat{p}_y) - d_0}{\sqrt{\frac{\hat{p}_0(1-\hat{p}_0)}{n_x} + \frac{\hat{p}_0(1-\hat{p}_0)}{n_y}}} \approx \mathcal{N}(0, 1) \quad \text{where} \quad \hat{p}_0 = \frac{n_x\hat{p}_x + n_y\hat{p}_y}{n_x+n_y}\]
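As a sketch with made-up A/B-test numbers (say 120 conversions out of 1000 visitors for version A and 150 out of 1000 for version B):

n_x <- 1000; n_y <- 1000
p_x <- 120 / n_x; p_y <- 150 / n_y
p0 <- (n_x * p_x + n_y * p_y) / (n_x + n_y)            # pooled proportion: 0.135
se0 <- sqrt(p0 * (1 - p0) / n_x + p0 * (1 - p0) / n_y) # SE under H0
z <- (p_x - p_y) / se0                                 # about -1.96
2 * pnorm(-abs(z))                                     # two-tailed p-value: about 0.05
# Base R's prop.test(c(120, 150), c(1000, 1000), correct = FALSE)
# runs the equivalent chi-squared test (X-squared equals z^2).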
What if our data isn’t normal? What if we’re working with purely categorical data? Non-parametric tests come to the rescue. They don’t make strong assumptions about the underlying population distribution.
This test is a “reality check” for a single categorical variable. It checks if the observed frequencies in different categories fit a specific, claimed distribution. We compare our Observed frequencies (\(O_i\)) to the Expected frequencies (\(E_i\)) we’d anticipate if the null hypothesis were true. The test statistic measures the overall discrepancy: \[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-1}\] This statistic follows a Chi-Squared distribution, which is always right-skewed. A large \(\chi^2\) value means a poor fit, leading us to reject \(H_0\).
A mall manager believes its 4 entrances are equally used. After a change in local traffic flow, he suspects this is no longer true. A sample of 120 customers is observed: 21, 40, 35, and 24 of them use entrances 1 through 4, respectively. We test at \(\alpha = 0.05\).
The hypotheses are: \[H_0: p_1 = p_2 = p_3 = p_4 = 0.25 \quad \text{vs.} \quad H_1: \text{at least one } p_i \neq 0.25\]
If \(H_0\) is true, we expect \(E_i = 120 \cdot 0.25 = 30\) people for each entrance. The test statistic is: \[\chi^2_{obs} = \frac{(21-30)^2}{30} + \frac{(40-30)^2}{30} + \frac{(35-30)^2}{30} + \frac{(24-30)^2}{30} = 8.067\] The critical value for a \(\chi^2\) test with \(df=3\) at \(\alpha=0.05\) is \(7.81\). Since 8.067 > 7.81, the discrepancy is too large to attribute to chance alone. We reject \(H_0\).
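Both numbers come straight from base R:

qchisq(0.95, df = 3)       # critical value: 7.815
1 - pchisq(8.067, df = 3)  # p-value: about 0.0446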
# We can use the base R chisq.test function directly.
observed_counts <- c(21, 40, 35, 24)
expected_probs <- c(0.25, 0.25, 0.25, 0.25)
chisq.test(x = observed_counts, p = expected_probs)
##
## Chi-squared test for given probabilities
##
## data: observed_counts
## X-squared = 8.0667, df = 3, p-value = 0.04465
This is one of the most useful tests in statistics. It tells us if there is a significant association between two categorical variables. Are voting preference and age group related? Does a customer’s satisfaction level depend on their geographic region?
We want to test if a client’s Class (e.g., Gold, Silver, Bronze) is independent of the number of marketing campaigns (Num) they were exposed to. We use the Class_Campaign dataset.
# The chisq.test function can take two vectors to create a contingency table
# and perform the test of independence.
chisq.test(x = Class_Campaign$Num, y = Class_Campaign$Class)
##
## Pearson's Chi-squared test
##
## data: Class_Campaign$Num and Class_Campaign$Class
## X-squared = 27.921, df = 6, p-value = 9.725e-05
The p-value is extremely small, so we reject the null hypothesis of independence. We conclude there is a strong statistical association between the client’s class and the number of campaigns they received. This is a valuable insight for the marketing team!
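To see which combinations drive the association, it helps to inspect the contingency table and the standardized residuals; cells with residuals beyond roughly ±2 are the ones departing most from independence. A sketch, assuming the Class_Campaign data frame is loaded:

tab <- table(Class_Campaign$Num, Class_Campaign$Class)  # contingency table
res <- chisq.test(tab)
res$stdres  # standardized residuals, approximately N(0,1) under independence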
Remember our courtroom analogy. Power is the ability to convict a
guilty person. In statistics, it’s the probability that our test will
correctly reject the null hypothesis when it is, in fact, false. Let’s
calculate it for a new scenario using our Bank dataset.
Suppose the bank believes the proportion of clients
dissatisfied with fees (FeesOK == "No") is
at most 20%. They want to test if it has increased.
In the Bank dataset, there are 1620 clients. Let’s find how many were dissatisfied.
n_bank <- nrow(Bank)
dissatisfied_count <- sum(Bank$FeesOK == "No")
p_hat_bank <- dissatisfied_count / n_bank
cat("Sample size n =", n_bank, "\n")
## Sample size n = 1620
cat("Observed proportion p_hat =", round(p_hat_bank, 4), "\n")
## Observed proportion p_hat = 0.2253
The observed proportion is 0.2253. Let’s run the test.
TEST.prop(x = Bank$FeesOK, success = "No", p0 = 0.20, alternative = "greater")
##    n phat s_X   se stat p-value
## 1620 0.23 0.4 0.01 2.55 0.01
The exact p-value is about 0.005 (printed as 0.01 after rounding), which is less than 0.05, so we reject \(H_0\). Now, let’s calculate the power of such a test. To keep the arithmetic simple, suppose we run the same upper-tail test of \(H_0: p = 0.20\) at \(\alpha = 0.05\) on a smaller sample of \(n = 400\) clients: what is the probability of detecting a true dissatisfaction rate of \(p_{true} = 0.25\)?
Find the Rejection Rule: We reject \(H_0\) if our observed \(\hat{p}\) is greater than a critical value, \(\hat{p}_{crit}\). \[\hat{p}_{crit} = p_0 + z_{\alpha} \sqrt{\frac{p_0(1-p_0)}{n}} = 0.20 + 1.645 \sqrt{\frac{0.20(0.80)}{400}} = 0.20 + 1.645 \cdot 0.02 \approx 0.2329\] So, we reject \(H_0\) if our sample proportion is greater than 23.29%.
Calculate Power: Power is the probability of rejecting \(H_0\) (i.e., finding \(\hat{p} > 0.2329\)) given that the true proportion is actually 0.25. \[Power = P(\hat{p} > 0.2329 \mid p_{true} = 0.25)\] We standardize this using the true proportion \(p_{true}=0.25\): \[Z = \frac{0.2329 - 0.25}{\sqrt{\frac{0.25(1-0.25)}{400}}} \approx \frac{-0.0171}{0.02165} \approx -0.79\] \[Power = P(Z > -0.79) = 1 - P(Z \le -0.79) \approx 1 - 0.2148 = 0.7852\] The power of our test to detect a true dissatisfaction rate of 25% is about 78.5%. This is a reasonably powerful test.
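The same two steps in base R:

p0 <- 0.20; p_true <- 0.25; n <- 400; alpha <- 0.05
p_crit <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)       # 0.2329
1 - pnorm((p_crit - p_true) / sqrt(p_true * (1 - p_true) / n))  # power: about 0.785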
There is a beautiful, direct link between a two-tailed hypothesis test and a confidence interval.
The Rule: A two-tailed test for \(H_0: \theta = \theta_0\) at a significance level \(\alpha\) will be rejected if and only if the \(100(1-\alpha)\%\) confidence interval for \(\theta\) does not contain the value \(\theta_0\).
Think of the confidence interval as the range of “plausible” values for the parameter. If the hypothesized value \(\theta_0\) falls outside this range of plausible values, we reject the idea that \(\theta_0\) could be the true value.
Example: In our supermarket spending example, we failed to reject \(H_0: \mu_x - \mu_y = 0\) at \(\alpha=0.01\). This rule tells us that the 99% confidence interval for the difference in means, \(\mu_x - \mu_y\), must contain 0. If it didn’t contain 0, we would have rejected \(H_0\).
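We can verify this from the supermarket summary statistics, reusing the pooled standard error computed earlier:

diff <- 83.38 - 81.47          # observed difference: 1.91
se <- 1.713                    # pooled standard error from the test
t_crit <- qt(0.995, df = 241)  # about 2.60
diff + c(-1, 1) * t_crit * se  # 99% CI: about (-2.54, 6.36), which contains 0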