Lecture 10: Hypothesis Testing — Null, Alternative, P-Values

2026-03-29

Agenda and Announcements

  • Today

      - **Topic 9: Hypothesis Testing — Null, Alternative, P-Values**
      - t-tests and Chi-Squared Tests: Simple Examples
      - Top Hat Quiz
  • Next Class

      - Topic 12: Experimental Design, Internal Validity & Controls
      - We will discuss two articles next week; I will announce them on Wednesday

The Story of the t-Test

William Sealy Gosset

Watch First (~5 min)

Before we dive into the mechanics, let’s meet the statistician who invented the t-test.

Key Takeaways from Gosset

  • Worked at Guinness Brewery — needed to test small samples of barley
  • Published under the pseudonym “Student” (company secrecy rules)
  • The test is formally called Student’s t-test
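  • Problem he solved: with small samples, the sample standard deviation is itself a noisy estimate, so a mean standardized with it does not follow the normal distribution exactly
  • The t-distribution is wider/flatter than the normal, with heavier tails; this accounts for the extra uncertainty when n is small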
  • As sample size grows, the t-distribution approaches the normal distribution
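A quick numerical check of the last two points, using only base R's `pt()` and `pnorm()`: the share of the distribution beyond ±1.96 is larger for t-distributions with few degrees of freedom and shrinks toward the normal's 5% as df grows. The df values are arbitrary, chosen only for illustration.

```r
# Two-tailed area beyond +/-1.96 under the standard normal distribution
normal_tail <- 2 * pnorm(-1.96)

# The same area under t-distributions with increasing degrees of freedom
dfs <- c(2, 5, 10, 30, 100, 1000)
t_tail <- 2 * pt(-1.96, df = dfs)

# Heavier tails for small df; the gap to the normal closes as df grows
print(data.frame(df = dfs, t_tail = round(t_tail, 4),
                 normal_tail = round(normal_tail, 4)))
```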

Review: Connecting the Dots

Where We’ve Been

  • Lecture 5: Descriptive statistics — mean, variance, standard deviation
  • Lecture 8: Sampling distributions, Central Limit Theorem, Law of Large Numbers
  • Lecture 9: Confidence intervals — the range of plausible values for a parameter

The Bridge to Hypothesis Testing

  • A confidence interval tells us: “The true parameter is probably in this range”
  • A hypothesis test answers: “Is this relationship real, or just random noise?”
  • Both rely on the same machinery:
      - Sample statistics as estimators of population parameters
      - Probability distributions to quantify uncertainty
      - The 68-95-99.7 rule and critical values
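As a reminder of where the 68-95-99.7 rule comes from, a short sketch using base R's normal CDF, `pnorm()`, assuming a standard normal distribution:

```r
# Share of a standard normal distribution within k standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)

# Roughly 0.68, 0.95, and 0.997: the 68-95-99.7 rule
print(round(within_k, 4))
```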

Null and Alternative Hypotheses

What Is a Hypothesis?

  • A falsifiable statement about what we believe based on our theory
  • In statistics, we always test two competing claims simultaneously

The Two Hypotheses

Hypothesis               Symbol     Meaning
Null hypothesis          H0         No relationship; any pattern is due to random chance
Alternative hypothesis   H1 or Ha   A real relationship exists

  • We start from the assumption that H0 is true
  • Our goal: find evidence strong enough to reject H0

Examples in Political Science

  • H0: Democracy has no effect on economic growth
  • H1: Democracies have higher economic growth than autocracies
  • H0: Gender and party identification are independent (unrelated)
  • H1: Gender and party identification are not independent

Important Reminder from Last Lecture

If we reject H0, does that mean H1 is definitely true?

NO!

  • Rejecting H0 means: the data would be unusual if H0 were true, which counts as evidence for H1 (not proof of it)
  • We are working within a probability framework; we are never 100% certain
  • We may be wrong: a Type I error rejects an H0 that is actually true, and a Type II error fails to reject an H0 that is actually false

The P-Value and Alpha Level

What Is a P-Value?

  • The p-value is the probability of observing a result at least this extreme if H0 were true
  • A small p-value means: “It would be very unlikely to see this pattern by random chance alone”
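To make the definition concrete, here is a sketch with a hypothetical poll: 60 of 100 respondents support a referendum, and H0 says true support is 50%. The code simulates many polls in a world where H0 is true and counts how often the result lands at least as far from 50 as the observed one; base R's exact `binom.test()` gives the analytic answer for comparison. The poll numbers are made up for illustration.

```r
set.seed(1)

observed <- 60   # hypothetical: 60 of 100 respondents support the referendum
n <- 100
p_null <- 0.5    # H0: true support is 50%

# Simulate 100,000 polls in a world where H0 is true
sims <- rbinom(100000, size = n, prob = p_null)

# Two-sided p-value: share of simulated polls at least as far from 50 as ours
sim_p <- mean(abs(sims - n * p_null) >= abs(observed - n * p_null))

# Exact two-sided p-value from the binomial distribution
exact_p <- binom.test(observed, n, p = p_null)$p.value

print(c(simulated = sim_p, exact = exact_p))
```

Note that p is just above .05 here, so a 60-40 split in a poll of 100 would not quite reach the conventional threshold.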

Common Misconception

The p-value is NOT the probability that H0 is true.
It is the probability of data at least as extreme as ours, given that H0 is true.

The Alpha Level (α)

  • α (alpha) is our pre-chosen threshold: we reject H0 when the p-value falls below it
  • In social sciences: α = .05 is the standard
  • This means we accept a 5% (1 in 20) chance of rejecting H0 when H0 is actually true (a Type I error)
  • Sometimes researchers use α = .01 (1%) or α = .10 (10%)
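Each α corresponds to a cutoff on the test statistic. Assuming a two-tailed test with a standard normal reference distribution (the large-sample case), base R's `qnorm()` gives the cutoffs:

```r
# Two-tailed critical z values for the alpha levels mentioned above
alphas <- c(0.10, 0.05, 0.01)
crit_z <- qnorm(1 - alphas / 2)

# Smaller alpha -> a larger test statistic is needed to reject H0
print(data.frame(alpha = alphas, critical_z = round(crit_z, 3)))
```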

The Decision Rule

\[\text{If } p < \alpha \text{, reject } H_0\]

\[\text{If } p \geq \alpha \text{, fail to reject } H_0\]

Why “Fail to Reject” and not “Accept”?

We never prove H0 is true. We simply lack sufficient evidence to reject it.
Absence of evidence is not evidence of absence.

Why Does p < .05 Make Sense?

  • Suppose H0 is true and we repeated this study 100 times: by random chance alone, about 5 of those studies would produce p < .05
  • So when p < .05, our result is one that would occur fewer than 1 in 20 times if H0 were true
  • That is unlikely enough that we conclude: “This is probably a real pattern, not noise”
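The “repeat the study many times” idea can be simulated directly. This sketch assumes two groups of 30 drawn from the same normal population (mean 50, SD 10, both arbitrary), so H0 is true by construction; roughly 5% of the simulated studies should still produce p < .05.

```r
set.seed(2026)

# One simulated study: both groups come from the SAME population, so H0 is true
one_study <- function(n = 30) {
  group_a <- rnorm(n, mean = 50, sd = 10)
  group_b <- rnorm(n, mean = 50, sd = 10)
  t.test(group_a, group_b)$p.value
}

# Repeat the study many times and record each p-value
p_values <- replicate(5000, one_study())

# Share of studies that (wrongly) reject H0 at alpha = .05; should be near 0.05
false_rejections <- mean(p_values < 0.05)
print(false_rejections)
```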

The Critical Value Connection

  • Recall from Lecture 8: |Z| ≥ 1.96 corresponds to p ≤ .05 (two-tailed)
  • For a t-test: the critical t-value depends on degrees of freedom (sample size)
  • Larger test statistic (in absolute value) = smaller p-value = stronger evidence against H0
  • When the test statistic exceeds the critical value in absolute value, the p-value is below α
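The critical-value rule and the p-value rule are two views of the same decision. A sketch for a two-tailed t-test at α = .05, with df = 20 and the test statistics chosen arbitrarily for illustration:

```r
alpha <- 0.05
deg_free <- 20                                 # illustrative degrees of freedom
t_crit <- qt(1 - alpha / 2, df = deg_free)     # about 2.09 for df = 20

# A few hypothetical test statistics
t_stats <- c(0.8, 1.9, 2.1, 3.5, -2.6)
p_vals <- 2 * pt(-abs(t_stats), df = deg_free)

# The two rules agree: |t| > t_crit exactly when p < alpha
rule_crit <- abs(t_stats) > t_crit
rule_p <- p_vals < alpha
print(data.frame(t = t_stats, p = round(p_vals, 4), reject = rule_p))
```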

The t-Test

What Is a t-Test?

  • Tests whether a mean differs from a hypothesized value, or whether two means differ from each other, by more than random chance would explain
  • Uses the t-distribution (Gosset’s contribution!), which accounts for small-sample uncertainty
  • When n is large (a common rule of thumb is n ≥ 30), the t-distribution ≈ normal distribution

When Do You Use a t-Test?

Situation                     Test type
One group vs. a known value   One-sample t-test
Two independent groups        Independent samples t-test
Same group measured twice     Paired samples t-test

Tip

Key requirement: The dependent variable (DV) must be continuous (interval or ratio level)

t-Test Formula

The t-statistic for comparing two independent groups (the pooled-variance or “Student” version, which assumes both groups have the same variance):

\[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\]

  • \(\bar{x}_1, \bar{x}_2\) = sample means of the two groups
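  • \(s_p\) = pooled standard deviation, \(s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\), where \(s_1^2, s_2^2\) are the two sample variances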
  • \(n_1, n_2\) = sample sizes

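As consumers: You will read the t-statistic and p-value from software output; you will not calculate this by hand in most research contexts. Note that R’s t.test() uses Welch’s version by default, which drops the equal-variance assumption and replaces the pooled term in the denominator with \(\sqrt{s_1^2/n_1 + s_2^2/n_2}\).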
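To see that the formula is what software computes, this sketch evaluates the pooled t-statistic by hand on two small made-up samples and compares it with R’s `t.test(..., var.equal = TRUE)`, which runs the pooled (Student) version. The data are hypothetical.

```r
# Hypothetical small samples (made up for illustration)
x1 <- c(12, 15, 11, 18, 14, 16)
x2 <- c(9, 11, 8, 13, 10)

n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation: weighted combination of the two sample variances
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# t-statistic from the formula above
t_manual <- (mean(x1) - mean(x2)) / (s_p * sqrt(1 / n1 + 1 / n2))

# R's Student (equal-variance) t-test should give the same statistic
student <- t.test(x1, x2, var.equal = TRUE)
t_r <- unname(student$statistic)
print(c(manual = t_manual, t.test = t_r))
```

The pooled test uses n1 + n2 - 2 degrees of freedom, which is what `student$parameter` reports.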

Political Science Example: t-Test

Research Question: Do democracies have higher GDP per capita than autocracies?

  • H0: Mean GDP per capita is the same in democracies and autocracies
  • H1: Mean GDP per capita differs between democracies and autocracies (two-sided; the directional version, “higher in democracies,” would call for a one-sided test)

Note

This is a real question in comparative politics — the relationship between regime type and economic development has been studied extensively.

t-Test in R

# Illustrative data: GDP per capita (thousands USD) by regime type
# (values are typed in by hand, so no random seed is needed)
democracies <- c(28, 35, 41, 22, 55, 47, 38, 30, 52, 44, 29, 61, 37, 49,
    33, 58, 26, 43, 51, 36)
autocracies <- c(12, 18, 9, 25, 15, 7, 21, 11, 19, 14, 8, 23, 16, 10, 27,
    13, 20, 6, 17, 22)

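# Run an independent samples t-test; var.equal = FALSE (R's default)
# requests Welch's t-test, which does not assume equal group variances
t.test(democracies, autocracies, alternative = "two.sided", var.equal = FALSE)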

    Welch Two Sample t-test

data:  democracies and autocracies
t = 8.7705, df = 29.559, p-value = 1.006e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 19.25163 30.94837
sample estimates:
mean of x mean of y 
    40.75     15.65 

Reading the t-Test Output

When you see t-test output, look for:

  • t = the test statistic (how many standard errors the difference is from zero)
  • df = degrees of freedom (related to sample size; Welch’s version can give a non-integer df, such as 29.559 here)
  • p-value = probability of a difference at least this large, in either direction, if H0 is true
  • 95% confidence interval = plausible range for the true difference in means

Decision

If p < .05: Reject H0 → Evidence supports a real difference between groups
If p ≥ .05: Fail to reject H0 → Insufficient evidence of a difference
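The same pieces can be pulled out of the result object in code. This sketch reruns the lecture’s example, extracts the statistic, df, p-value, and confidence interval from the object returned by `t.test()`, and confirms that the p-value is the two-tailed area beyond |t| under a t-distribution with the Welch df.

```r
democracies <- c(28, 35, 41, 22, 55, 47, 38, 30, 52, 44, 29, 61, 37, 49,
    33, 58, 26, 43, 51, 36)
autocracies <- c(12, 18, 9, 25, 15, 7, 21, 11, 19, 14, 8, 23, 16, 10, 27,
    13, 20, 6, 17, 22)

res <- t.test(democracies, autocracies, var.equal = FALSE)

# Pull out the pieces discussed above
t_stat <- unname(res$statistic)   # t
df_w <- unname(res$parameter)     # Welch df (not a whole number)
p_val <- res$p.value              # p-value
ci <- res$conf.int                # 95% CI for the difference in means

# Recompute the p-value from t and df: two-tailed area beyond |t|
p_check <- 2 * pt(-abs(t_stat), df = df_w)

# Apply the decision rule at alpha = .05
alpha <- 0.05
decision <- if (p_val < alpha) "Reject H0" else "Fail to reject H0"
print(decision)
```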

The Chi-Squared Test

What Is a Chi-Squared (χ²) Test?

  • Tests whether two categorical variables are independent of each other
  • Uses the χ² distribution (we saw this in Lecture 9 for variance CIs)
  • Compares observed frequencies to expected frequencies (what we’d expect if the variables were unrelated)

When Do You Use χ²?

  • Both variables are categorical (nominal or ordinal)
  • You have a contingency table (cross-tabulation) of counts; a common rule of thumb is that every expected count should be at least 5
  • Common in survey-based political science research

Tip

Examples: Party ID × Gender, Vote choice × Education level, Region × Policy opinion

The χ² Formula

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

  • \(O\) = observed frequency in each cell
  • \(E\) = expected frequency if the variables were independent
  • For a given df, larger χ² = greater deviation from independence = stronger evidence against H0

As consumers: Again, software calculates this. Your job is to interpret the test statistic and p-value.
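To connect the formula to the software output, this sketch recomputes the statistic for the lecture’s party ID by gender table. It uses the standard expected-count rule (row total × column total ÷ grand total), then gets df and the upper-tail p-value from `pchisq()`.

```r
party_gender <- matrix(c(65, 55, 40, 60, 30, 50), nrow = 3, byrow = TRUE,
    dimnames = list(Party = c("Democrat", "Republican", "Independent"),
        Gender = c("Women", "Men")))

# Expected count for each cell if party and gender were independent:
# (row total x column total) / grand total
expected <- outer(rowSums(party_gender), colSums(party_gender)) / sum(party_gender)

# Chi-squared statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((party_gender - expected)^2 / expected)

# Degrees of freedom and p-value (upper tail of the chi-squared distribution)
df_chi <- (nrow(party_gender) - 1) * (ncol(party_gender) - 1)
p_val <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

print(round(expected, 1))
print(c(chi_sq = chi_sq, df = df_chi, p = p_val))
```

For tables larger than 2 × 2, `chisq.test()` applies no continuity correction, so its X-squared should match this hand computation.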

Political Science Example: χ² Test

Research Question: Is party identification independent of gender?

  • H0: Party ID and gender are independent (no relationship)
  • H1: Party ID and gender are not independent (a relationship exists)

Note

The “gender gap” in political party affiliation is one of the most studied phenomena in American political behavior.

χ² Test in R

# Contingency table: Party ID by Gender (simulated survey data, n = 300)
party_gender <- matrix(c(65, 55, 40, 60, 30, 50), nrow = 3, byrow = TRUE,
    dimnames = list(Party = c("Democrat", "Republican", "Independent"),
        Gender = c("Women", "Men")))

# Display the contingency table
print(party_gender)
             Gender
Party         Women Men
  Democrat       65  55
  Republican     40  60
  Independent    30  50
# Run chi-squared test
chisq.test(party_gender)

    Pearson's Chi-squared test

data:  party_gender
X-squared = 6.9024, df = 2, p-value = 0.03171

Reading the χ² Output

When you see chi-squared output, look for:

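  • χ² (X-squared) = the test statistic (squared deviations of observed from expected counts, each divided by the expected count, summed over all cells)
  • df = degrees of freedom = (rows - 1) × (columns - 1); here (3 - 1) × (2 - 1) = 2
  • p-value = probability of a χ² at least this large if the variables were truly independent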

Decision

If p < .05: Reject H0 → Evidence of a relationship between the two categorical variables
If p ≥ .05: Fail to reject H0 → Insufficient evidence of a relationship

Visualizing the χ² Test Result

Comparing the Two Tests

t-Test vs. Chi-Squared: At a Glance

Feature              t-Test                        χ² Test
What it tests        Difference in means           Independence of categorical variables
Variable type (DV)   Continuous (interval/ratio)   Categorical (nominal/ordinal)
Variable type (IV)   Categorical (2 groups)        Categorical
Distribution used    t-distribution                χ² distribution
Key statistic        t                             χ²
Decision rule        p < α → reject H0             p < α → reject H0

Choosing the Right Test

Quick Test Selection Guide

Research question                             DV type       IV type                  Test
Is avg. GDP different between regime types?   Continuous    Categorical (2 groups)   t-Test
Is vote choice related to education level?    Categorical   Categorical              Chi-Squared
Does support for policy differ by region?     Continuous    Categorical              t-Test / ANOVA
Is income related to media consumption?       Continuous    Categorical              t-Test / Regression

Putting It All Together

The Full Hypothesis Testing Workflow

  1. State H0 and H1
  2. Choose α (usually .05 in social science)
  3. Select the appropriate test (t-test, χ², etc.)
  4. Calculate the test statistic (software does this)
  5. Find the p-value
  6. Decision: Is p < α?
      - Yes → Reject H0: evidence supports the alternative
      - No → Fail to reject H0: insufficient evidence
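Steps 5 and 6 are mechanical once software has produced a p-value. The helper below, `hypothesis_decision()`, is a hypothetical function written for this illustration (not part of any package). It accepts any base R test result that carries a `$p.value` (such as the objects returned by `t.test()` or `chisq.test()`) and applies the decision rule. Running it on the party ID by gender table at two α levels shows why α must be fixed in step 2, before looking at the data.

```r
# Hypothetical helper: apply steps 5-6 of the workflow to a base R test result
hypothesis_decision <- function(test_result, alpha = 0.05) {
  p <- test_result$p.value
  if (p < alpha) {
    sprintf("p = %.4f < %.2f: reject H0", p, alpha)
  } else {
    sprintf("p = %.4f >= %.2f: fail to reject H0", p, alpha)
  }
}

# Steps 1-3: H0 = party ID and gender are independent; chi-squared test
party_gender <- matrix(c(65, 55, 40, 60, 30, 50), nrow = 3, byrow = TRUE)

# Steps 4-6: software computes the statistic and p-value; we compare p to alpha
res05 <- hypothesis_decision(chisq.test(party_gender))
res01 <- hypothesis_decision(chisq.test(party_gender), alpha = 0.01)
print(res05)
print(res01)
```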

What p < .05 Really Means

The Intuition

If p = .03, it means: “If there were truly no relationship (H0 true), we would observe a result at least this extreme only about 3 times out of 100 by random chance.”
That’s unlikely enough that we conclude the relationship is probably real.

Common Mistakes to Avoid

  • ❌ “p = .03 means there is a 97% chance H1 is true” — Wrong
  • ❌ “p = .06 means there is no relationship” — Wrong (just insufficient evidence)
  • ❌ “Statistical significance = practical importance” — Not necessarily!
  • ✅ p < .05 means: a result this extreme would be unlikely if random chance alone (H0) were at work
  • ✅ Always report effect size alongside p-values in real research
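Why significance is not the same as importance: with a very large sample, even a trivial difference produces a tiny p-value. This sketch assumes two hypothetical groups of 200,000 whose true means differ by only 0.02 standard deviations. It reports Cohen’s d, one common effect size for a difference in means (the mean difference divided by the pooled standard deviation), alongside the p-value.

```r
set.seed(10)

# Hypothetical: two very large groups whose true means differ by only 0.02 SD
n <- 200000
group_a <- rnorm(n, mean = 0, sd = 1)
group_b <- rnorm(n, mean = 0.02, sd = 1)

p_val <- t.test(group_b, group_a)$p.value

# Cohen's d: difference in means divided by the pooled standard deviation
s_pooled <- sqrt((var(group_a) + var(group_b)) / 2)  # equal n, so a simple average works
cohens_d <- (mean(group_b) - mean(group_a)) / s_pooled

# "Statistically significant", yet the effect is tiny
print(c(p_value = p_val, cohens_d = cohens_d))
```

By the usual benchmarks, d = 0.2 is already called a “small” effect, so a d near 0.02 is about a tenth of that, even though p is far below .05.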

Alpha = .05: A Social Science Convention

  • The .05 threshold is a convention, not a law of nature
  • Ronald Fisher originally proposed it as a rough guideline
  • Social sciences typically use α = .05
  • Some fields use much stricter thresholds (particle physics requires “5 sigma” for a discovery, roughly p < 3 × 10⁻⁷)
  • The key: choose α before you collect data
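To see how strict the particle-physics convention is, the “5 sigma” standard can be converted to a p-value with `pnorm()`, assuming a normally distributed test statistic. Both the one-tailed and two-tailed versions are shown, since sources quote either.

```r
# A "5 sigma" result: the test statistic lies 5 standard errors from the H0 value
sigma <- 5
p_one_sided <- pnorm(-sigma)       # area in one tail
p_two_sided <- 2 * pnorm(-sigma)   # area in both tails

# Compare with the social-science convention
print(c(one_sided = p_one_sided, two_sided = p_two_sided, social_science = 0.05))
```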

Summary

Key Concepts from Today

  • H0 (null): No relationship; pattern is due to random chance
  • H1 (alternative): A real relationship exists
  • p-value: Probability of data at least as extreme as ours if H0 is true
  • α = .05: Our threshold; we accept a 1-in-20 chance of rejecting a true H0
  • p < α → Reject H0 — the relationship is unlikely due to chance
  • t-test: Tests differences in means (continuous DV)
  • χ² test: Tests independence of categorical variables

As Consumers of Statistics

When you read political science research, you will see:

  • “t(38) = 3.24, p = .003” → Reject H0, statistically significant
  • “χ²(2) = 9.47, p = .009” → Reject H0, variables are not independent
  • “p = .42” → Fail to reject H0, no significant relationship found

Your job: Understand what these mean, evaluate whether the test was appropriate, and assess whether the conclusions follow from the results.
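One practical check as a consumer is whether a reported statistic and p-value agree. Assuming two-tailed t-tests and the usual upper-tail χ² test, base R reproduces the p-values for the first two examples above from the test statistic and degrees of freedom alone:

```r
# Reported: t(38) = 3.24 -> two-tailed p-value
p_t <- 2 * pt(-3.24, df = 38)

# Reported: chi-squared(2) = 9.47 -> upper-tail p-value
p_chi <- pchisq(9.47, df = 2, lower.tail = FALSE)

print(round(c(p_t = p_t, p_chi = p_chi), 3))
```

If a paper’s reported p-value is far from what its own statistic and df imply, that is worth a closer look.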

Authorship, License, Credits

Creative Commons License