Welcome! In our last session, we learned to cast a “net” around a population parameter with confidence intervals. Today, we take a more decisive step. We’re moving from estimating to deciding. This is the powerful world of Hypothesis Testing.
Think of hypothesis testing as a courtroom trial. The null hypothesis \(H_0\) plays the role of the defendant's presumed innocence, the alternative \(H_1\) is the prosecution's claim, and our sample is the evidence. We only "convict" (reject \(H_0\)) if the evidence against innocence is overwhelming.
The crucial question is: “If the null hypothesis were true, how surprising is my evidence?” If your sample result is extremely unlikely to occur just by random chance under the null hypothesis, you have sufficient evidence to reject the null hypothesis and declare that your alternative theory is more plausible.
In this judicial process, we can make two kinds of mistakes:
|  | Truth: \(H_0\) is True (Innocent) | Truth: \(H_0\) is False (Guilty) |
|---|---|---|
| Decision: Don’t Reject \(H_0\) | Correct Decision (Acquit innocent), Probability = \(1-\alpha\) | Type II Error (Acquit guilty), Probability = \(\beta\) |
| Decision: Reject \(H_0\) | Type I Error (Convict innocent), Probability = \(\alpha\) | Correct Decision (Convict guilty), Probability = \(1-\beta\) (Power) |
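In fact, you can watch \(\alpha\) happen. Here is a minimal simulation sketch (the population and sample size are made up for illustration): we draw many samples from a world where \(H_0\) is true and count how often a two-tailed Z-test at \(\alpha = 0.05\) falsely “convicts”.

set.seed(1)
# A world where H0: mu = 0 is TRUE (sigma = 1 known, n = 30)
reps <- 10000
rejections <- replicate(reps, {
  x <- rnorm(30, mean = 0, sd = 1)
  z <- (mean(x) - 0) / (1 / sqrt(30))
  abs(z) > qnorm(0.975)  # two-tailed rejection rule at alpha = 0.05
})
mean(rejections)  # close to 0.05: the Type I error rate alpha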
Let’s start with the simplest case: we have a claim about a single population, and we’ve collected one sample to test it.
This is our foundational case. It’s rare in the real world to know the true population variance, but it’s the perfect place to build our intuition. Here, we test a claim about the mean \(\mu\) of a normally distributed population where we know \(\sigma^2\).
Since we know the population’s true standard deviation \(\sigma\), our evidence is measured with a Z-statistic. This tells us how many standard errors away our sample mean \(\bar{x}\) is from the hypothesized mean \(\mu_0\). Under the null hypothesis, this statistic follows a standard normal distribution: \[Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)\]
A car manufacturer, after modifying an engine, suspects that CO2 emissions have increased from the previous average of 130 g/km. A sample of \(n = 12\) modified engines gives a mean of \(\bar{x} = 135\) g/km; the population standard deviation is known to be \(\sigma = 10\) g/km, and we test at \(\alpha = 0.05\).
First, we state our hypotheses. The suspicion is about an increase, so this is a classic upper-tail test: \[H_0: \mu = 130 \quad \text{vs.} \quad H_1: \mu > 130\]
Next, we calculate our test statistic, which is the Z-score for our sample mean: \[z_{obs} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{135 - 130}{10 / \sqrt{12}} = \frac{5}{2.887} \approx 1.732\] Our sample mean is 1.732 standard errors above the old mean. Is this surprising enough?
The Critical Value Method: We draw a “line in the sand”. For an upper-tail test with \(\alpha = 0.05\), this line is at the Z-value that cuts off the top 5% of the normal distribution. This critical value is \(z_{0.05} = 1.645\). Our decision rule is: if \(z_{obs}\) crosses this line, we reject \(H_0\). Since 1.732 > 1.645, our result is in the rejection region. We reject the null hypothesis.
The p-value Method: This is more informative. We ask, “What’s the probability of getting a result at least as extreme as ours, if \(H_0\) were true?” This probability is the p-value. Here, \(p\text{-value} = P(Z \ge 1.732) \approx 0.0416\). This means there’s only about a 4.2% chance of seeing a sample mean this high if the true mean were still 130. Since this probability (0.0416) is smaller than our significance level (\(\alpha = 0.05\)), our result is “statistically significant.” We reject the null hypothesis.
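Both numbers are quick sanity checks in base R:

qnorm(0.95)       # critical value for the upper 5%: 1.645
1 - pnorm(1.732)  # upper-tail p-value: about 0.0416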
Let’s verify the full test with R.
# We can simulate a sample, then shift it so its mean is exactly 135,
# matching the worked example (with sigma known, only xbar matters here).
set.seed(12)
co2_sample <- rnorm(n = 12, mean = 135, sd = 10)
co2_sample <- co2_sample - mean(co2_sample) + 135
# Use the TEST.mean function with the known sigma.
TEST.mean(x = co2_sample, mu0 = 130, sigma = 10, alternative = "greater", digits = 4)
##  n xbar sigma_X     SE   stat p-value
## 12  135      10 2.8868 1.7321  0.0416
The R output confirms our manual calculations perfectly. The evidence suggests the manufacturer’s suspicion is correct.
This is the more common and realistic scenario where we don’t know the true population variance. We must estimate it using our sample’s standard deviation, \(s\). This extra uncertainty means we can’t use the Z-distribution anymore. Instead, we use the Student’s t-distribution, which is like a Z-distribution but with slightly “fatter” tails to account for our uncertainty about the true standard deviation. The test statistic is: \[t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}\] The shape of the t-distribution depends on the degrees of freedom (df), which here is \(n-1\).
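You can see the “fatter tails” directly: for small degrees of freedom the t critical values are noticeably larger than the normal one, and they shrink toward 1.96 as df grows. A quick base-R check:

qt(0.975, df = c(5, 15, 30, 1000))  # 2.571 2.131 2.042 1.962
qnorm(0.975)                        # 1.960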
A restaurant chain tests a new ordering process to see if it reduces the average waiting time, which was previously 7 minutes. A sample of \(n = 15\) orders under the new process gives \(\bar{x} = 6.2\) minutes with a sample standard deviation of \(s = 1.5\) minutes; we test at \(\alpha = 0.10\).
The hypotheses for this lower-tail test are: \[H_0: \mu = 7 \quad \text{vs.} \quad H_1: \mu < 7\]
The observed t-statistic is: \[t_{obs} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{6.2 - 7}{1.5 / \sqrt{15}} = -2.066\] Our sample result is 2.066 sample standard errors below the old average. The critical value for a lower-tail test with \(df = 15-1=14\) at \(\alpha=0.10\) is \(-t_{14, 0.10} = -1.345\). Since -2.066 is less than -1.345, our result is in the rejection region. We reject \(H_0\). The p-value, \(P(t_{14} \le -2.066)\), is approximately 0.029. Since this is less than our \(\alpha\) of 0.10, we again confirm the decision. There is significant evidence that the new process improved waiting times.
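As before, the critical value and p-value are one-liners in base R:

qt(0.10, df = 14)    # critical value: -1.345
pt(-2.066, df = 14)  # lower-tail p-value: about 0.029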
# We can simulate a sample, then rescale it so that xbar = 6.2 and
# s = 1.5 exactly, matching the worked example.
set.seed(15)
wait_time_sample <- rnorm(n = 15, mean = 6.2, sd = 1.5)
wait_time_sample <- (wait_time_sample - mean(wait_time_sample)) /
  sd(wait_time_sample) * 1.5 + 6.2
# Run the test (unknown variance is the default for TEST.mean)
TEST.mean(x = wait_time_sample, mu0 = 7, alternative = "less", digits = 4)
##                n xbar s_X     se    stat p-value
## Normal.Approx 15  6.2 1.5 0.3873 -2.0656  0.0194
## Student-t     15  6.2 1.5 0.3873 -2.0656  0.0289
The Student-t row reproduces our manual calculation: t = -2.066 with a p-value of about 0.029.
This test is for categorical data (Yes/No, Success/Failure). We want to test a claim about the proportion of “successes” in a population. We use a Z-test, provided the sample is large enough for the normal approximation to be valid (the rule of thumb is \(n \cdot p_0 \cdot (1-p_0) > 5\)). The test statistic is: \[Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \approx \mathcal{N}(0, 1)\] Note that the standard error in the denominator uses the hypothesized proportion \(p_0\), because we build our test assuming the null hypothesis is true.
In the past, 70% of students rated an internship program positively. After some changes, a new sample of \(n=100\) students is taken, and 80 of them rate it positively (\(\hat{p} = 0.80\)). Has the satisfaction rate changed significantly?
Since we are asking about a change in either direction, the hypotheses are \(H_0: p = 0.70\) vs. \(H_1: p \neq 0.70\). The Z-statistic is: \[Z_{obs} = \frac{0.80 - 0.70}{\sqrt{\frac{0.70(1-0.70)}{100}}} = \frac{0.10}{0.0458} \approx 2.18\] For a two-tailed test, the p-value is the probability of being at least this far from the mean in either direction: \[p\text{-value} = P(|Z| \ge 2.18) = 2 \times P(Z \ge 2.18) \approx 2 \times 0.0146 = 0.0292\] If we were testing at \(\alpha=0.05\), since \(0.0292 < 0.05\), we would reject \(H_0\) and conclude that the satisfaction rate has significantly changed.
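The same arithmetic in base R:

se0 <- sqrt(0.70 * 0.30 / 100)  # standard error under H0
z <- (0.80 - 0.70) / se0        # about 2.18
2 * (1 - pnorm(z))              # two-tailed p-value: about 0.029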
# Test if the proportion of clients dissatisfied with fees is less than 25%
# Using the Bank dataset for an R example
TEST.prop(x = Bank$FeesOK, success = "No", p0 = 0.25, alternative = "less")
##    n phat  s_X   se stat p-value
## 1620 0.23 0.43 0.01 -2.3 0.01
This is where hypothesis testing gets really interesting. We are often more interested in comparing two groups than in evaluating a single one.
This powerful design is used for “before-and-after” measurements on the same subject, or for “matched-pair” studies. The trick is to simplify the problem: we calculate the difference for each pair, \(d_i = x_i - y_i\), and then perform a simple one-sample t-test on these differences. The test statistic is: \[t = \frac{\bar{d} - d_0}{s_d / \sqrt{n}} \sim t_{n-1}\] Here, \(d_0\) is the hypothesized difference (usually 0).
A company implements a new software and wants to test if it increases employee efficiency by more than 6.5 points on average. They measure efficiency before (Pre) and after (Post) for a sample of employees. We use the Transition dataset to perform the test.
# Test if the mean difference (Post - Pre) is greater than 6.5
TEST.diffmean(x = Transition$Post, y = Transition$Pre, mdiff0 = 6.5,
              type = "paired", alternative = "greater")
##                 n  xbar  ybar dbar=xbar-ybar   s_D   se stat p-value
## Normal.Approx 130 57.39 47.95 9.44 12.69 1.11 2.64 0.004
## Student-t 130 57.39 47.95 9.44 12.69 1.11 2.64 0.005
The p-value is very small, leading us to reject \(H_0\). The new software didn’t just help; the evidence strongly suggests it increased efficiency by more than 6.5 points on average.
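If you prefer base R, the Student-t row can be reproduced with t.test (a sketch, assuming the Transition data frame is loaded):

# Paired t-test: equivalent to a one-sample t-test on the differences
t.test(Transition$Post, Transition$Pre, paired = TRUE,
       mu = 6.5, alternative = "greater")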
This is the classic test for comparing two separate, unrelated groups. We’ll assume the population variances are unknown but equal, which allows us to “pool” the variance information from both samples to get a better estimate. The test statistic is: \[t = \frac{(\bar{x} - \bar{y}) - d_0}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}} \sim t_{n_x+n_y-2} \quad \text{where} \quad s_p^2 = \frac{(n_x-1)s_x^2 + (n_y-1)s_y^2}{n_x+n_y-2}\]
A supermarket wants to know if the average spending of customers arriving by car (\(x\)) is different from those using public transport (\(y\)). The samples give \(n_x = 134\), \(\bar{x} = 83.38\), \(s_x^2 = 157\) and \(n_y = 109\), \(\bar{y} = 81.47\), \(s_y^2 = 199.9\); we test at \(\alpha = 0.01\).
First, state the hypotheses: \(H_0: \mu_x - \mu_y = 0\) vs. \(H_1: \mu_x - \mu_y \neq 0\). Next, calculate the pooled variance: \[s_p^2 = \frac{(134-1) \cdot 157 + (109-1) \cdot 199.9}{134+109-2} \approx 176.24\] Then, calculate the t-statistic: \[t_{obs} = \frac{(83.38 - 81.47) - 0}{\sqrt{\frac{176.24}{134} + \frac{176.24}{109}}} = \frac{1.91}{1.713} \approx 1.11\] The critical value for a two-tailed test at \(\alpha=0.01\) with \(df=241\) is approximately \(\pm t_{241, 0.005} \approx \pm 2.60\). Since \(|1.11| < 2.60\), we fail to reject \(H_0\). There is not enough evidence to conclude that the average spending between the two groups is different.
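These hand calculations are easy to reproduce from the summary statistics alone:

nx <- 134; ny <- 109
sp2 <- ((nx - 1) * 157 + (ny - 1) * 199.9) / (nx + ny - 2)  # pooled variance: 176.24
se <- sqrt(sp2 / nx + sp2 / ny)                             # 1.713
t_obs <- (83.38 - 81.47) / se                               # 1.11
2 * pt(-abs(t_obs), df = nx + ny - 2)                       # two-tailed p-value: about 0.27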
# R Example: Test for a difference in mean time spent between Area A and Area B
TEST.diffmean(x = Time, by = Area, mdiff0 = 0, alternative = "two.sided", data = Time_Social)
##                n_x  n_y  xbar  ybar xbar-ybar  s_X  s_Y  se  stat p-value
## Normal.Approx 3053 2923 35.39 35.83 -0.43 7.66 7.97 0.2 -2.13 0.03
## Student-t 3053 2923 35.39 35.83 -0.43 7.66 7.97 0.2 -2.13 0.03
This is the workhorse of A/B testing. We compare the proportions of success in two independent groups. The test statistic is a Z-score, where we use a pooled proportion \(\hat{p}_0\) to calculate the standard error under the null hypothesis that the proportions are equal. \[Z = \frac{(\hat{p}_x - \hat{p}_y) - d_0}{\sqrt{\frac{\hat{p}_0(1-\hat{p}_0)}{n_x} + \frac{\hat{p}_0(1-\hat{p}_0)}{n_y}}} \approx \mathcal{N}(0, 1) \quad \text{where} \quad \hat{p}_0 = \frac{n_x\hat{p}_x + n_y\hat{p}_y}{n_x+n_y}\]
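As a sketch with made-up A/B-test numbers (say 120 conversions out of 1000 visitors for version A and 150 out of 1000 for version B):

n_x <- 1000; n_y <- 1000
p_x <- 120 / n_x; p_y <- 150 / n_y
p0 <- (n_x * p_x + n_y * p_y) / (n_x + n_y)            # pooled proportion: 0.135
se0 <- sqrt(p0 * (1 - p0) / n_x + p0 * (1 - p0) / n_y) # SE under H0
z <- (p_x - p_y) / se0                                 # about -1.96
2 * pnorm(-abs(z))                                     # two-tailed p-value: about 0.05
# Base R's prop.test(c(120, 150), c(1000, 1000), correct = FALSE)
# runs the equivalent chi-squared test (X-squared equals z^2).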
What if our data isn’t normal? What if we’re working with purely categorical data? Non-parametric tests come to the rescue. They don’t make strong assumptions about the underlying population distribution.
This test is a “reality check” for a single categorical variable. It checks if the observed frequencies in different categories fit a specific, claimed distribution. We compare our Observed frequencies (\(O_i\)) to the Expected frequencies (\(E_i\)) we’d anticipate if the null hypothesis were true. The test statistic measures the overall discrepancy: \[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-1}\] This statistic follows a Chi-Squared distribution, which is always right-skewed. A large \(\chi^2\) value means a poor fit, leading us to reject \(H_0\).
A mall manager believes its 4 entrances are equally used. After a change in local traffic flow, he suspects this is no longer true. A sample of 120 customers is observed: 21, 40, 35, and 24 of them use entrances 1 through 4, respectively. We test at \(\alpha = 0.05\).
The hypotheses are: \[H_0: p_1 = p_2 = p_3 = p_4 = 0.25 \quad \text{vs.} \quad H_1: \text{at least one } p_i \neq 0.25\]
If \(H_0\) is true, we expect \(E_i = 120 \cdot 0.25 = 30\) people for each entrance. The test statistic is: \[\chi^2_{obs} = \frac{(21-30)^2}{30} + \frac{(40-30)^2}{30} + \frac{(35-30)^2}{30} + \frac{(24-30)^2}{30} = 8.067\] The critical value for a \(\chi^2\) test with \(df=3\) at \(\alpha=0.05\) is \(7.81\). Since 8.067 > 7.81, the discrepancy is too large to attribute to chance alone. We reject \(H_0\).
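Both numbers come straight from base R:

qchisq(0.95, df = 3)       # critical value: 7.815
1 - pchisq(8.067, df = 3)  # p-value: about 0.0446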
# We can use the base R chisq.test function directly.
observed_counts <- c(21, 40, 35, 24)
expected_probs <- c(0.25, 0.25, 0.25, 0.25)
chisq.test(x = observed_counts, p = expected_probs)
##
## Chi-squared test for given probabilities
##
## data: observed_counts
## X-squared = 8.0667, df = 3, p-value = 0.04465
This is one of the most useful tests in statistics. It tells us if there is a significant association between two categorical variables. Are voting preference and age group related? Does a customer’s satisfaction level depend on their geographic region?
We want to test if a client’s Class (e.g., Gold, Silver, Bronze) is independent of the number of marketing campaigns (Num) they were exposed to. We use the Class_Campaign dataset.
# The chisq.test function can take two vectors to create a contingency table
# and perform the test of independence.
chisq.test(x = Class_Campaign$Num, y = Class_Campaign$Class)
##
## Pearson's Chi-squared test
##
## data: Class_Campaign$Num and Class_Campaign$Class
## X-squared = 27.921, df = 6, p-value = 9.725e-05
The p-value is extremely small, so we reject the null hypothesis of independence. We conclude there is a strong statistical association between the client’s class and the number of campaigns they received. This is a valuable insight for the marketing team!
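To see which combinations drive the association, it helps to inspect the contingency table and the standardized residuals; cells with residuals beyond roughly ±2 are the ones departing most from independence. A sketch, assuming the Class_Campaign data frame is loaded:

tab <- table(Class_Campaign$Num, Class_Campaign$Class)  # contingency table
res <- chisq.test(tab)
res$stdres  # standardized residuals, approximately N(0,1) under independence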
Remember our courtroom analogy. Power is the ability to convict a
guilty person. In statistics, it’s the probability that our test will
correctly reject the null hypothesis when it is, in fact, false. Let’s
calculate it for a new scenario using our Bank dataset.
Suppose the bank believes the proportion of clients
dissatisfied with fees (FeesOK == "No") is
at most 20%. They want to test if it has increased.
In the Bank dataset, there are 1620 clients. Let’s find how many were dissatisfied.
n_bank <- nrow(Bank)
dissatisfied_count <- sum(Bank$FeesOK == "No")
p_hat_bank <- dissatisfied_count / n_bank
cat("Sample size n =", n_bank, "\n")
## Sample size n = 1620
cat("Observed proportion p_hat =", round(p_hat_bank, 4), "\n")
## Observed proportion p_hat = 0.2253
The observed proportion is 0.2253. Let’s run the test.
TEST.prop(x = Bank$FeesOK, success = "No", p0 = 0.20, alternative = "greater")
##    n phat s_X   se stat p-value
## 1620 0.23 0.4 0.01 2.55 0.01
The exact p-value is about 0.005 (printed as 0.01 after rounding), which is less than 0.05, so we reject \(H_0\). Now, let’s calculate the power of such a test. To keep the arithmetic simple, suppose we run the same upper-tail test of \(H_0: p = 0.20\) at \(\alpha = 0.05\) on a smaller sample of \(n = 400\) clients: what is the probability of detecting a true dissatisfaction rate of \(p_{true} = 0.25\)?
Find the Rejection Rule: We reject \(H_0\) if our observed \(\hat{p}\) is greater than a critical value, \(\hat{p}_{crit}\). \[\hat{p}_{crit} = p_0 + z_{\alpha} \sqrt{\frac{p_0(1-p_0)}{n}} = 0.20 + 1.645 \sqrt{\frac{0.20(0.80)}{400}} = 0.20 + 1.645 \cdot 0.02 \approx 0.2329\] So, we reject \(H_0\) if our sample proportion is greater than 23.29%.
Calculate Power: Power is the probability of rejecting \(H_0\) (i.e., finding \(\hat{p} > 0.2329\)) given that the true proportion is actually 0.25. \[Power = P(\hat{p} > 0.2329 \mid p_{true} = 0.25)\] We standardize this using the true proportion \(p_{true}=0.25\): \[Z = \frac{0.2329 - 0.25}{\sqrt{\frac{0.25(1-0.25)}{400}}} \approx \frac{-0.0171}{0.02165} \approx -0.79\] \[Power = P(Z > -0.79) = 1 - P(Z \le -0.79) \approx 1 - 0.2148 = 0.7852\] The power of our test to detect a true dissatisfaction rate of 25% is about 78.5%. This is a reasonably powerful test.
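The same two steps in base R:

p0 <- 0.20; p_true <- 0.25; n <- 400; alpha <- 0.05
p_crit <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)       # 0.2329
1 - pnorm((p_crit - p_true) / sqrt(p_true * (1 - p_true) / n))  # power: about 0.785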
There is a beautiful, direct link between a two-tailed hypothesis test and a confidence interval.
The Rule: A two-tailed test for \(H_0: \theta = \theta_0\) at a significance level \(\alpha\) will be rejected if and only if the \(100(1-\alpha)\%\) confidence interval for \(\theta\) does not contain the value \(\theta_0\).
Think of the confidence interval as the range of “plausible” values for the parameter. If the hypothesized value \(\theta_0\) falls outside this range of plausible values, we reject the idea that \(\theta_0\) could be the true value.
Example: In our supermarket spending example, we failed to reject \(H_0: \mu_x - \mu_y = 0\) at \(\alpha=0.01\). This rule tells us that the 99% confidence interval for the difference in means, \(\mu_x - \mu_y\), must contain 0. If it didn’t contain 0, we would have rejected \(H_0\).
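We can verify this from the supermarket summary statistics, reusing the pooled standard error computed earlier:

diff <- 83.38 - 81.47          # observed difference: 1.91
se <- 1.713                    # pooled standard error from the test
t_crit <- qt(0.995, df = 241)  # about 2.60
diff + c(-1, 1) * t_crit * se  # 99% CI: about (-2.54, 6.36), which contains 0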