Statistics is, at its most useful, a translator. It takes the noise of the world — patient blood draws, late-arriving flights, fertilizer trials, baseball box scores — and converts it into a yes-or-no answer about whether what we are observing is real or merely a coincidence. This notebook walks through nine such decisions, each told as a story.
Every chart below is interactive. Hover, zoom, click. The code is hidden by default — toggle it with the Code buttons on the right of each block.
Chi-Square asks: “Do these counts look
right?” · ANOVA asks: “Are these averages
really different?”
Both are formal versions of the question
every analyst eventually asks: “Is this signal worth caring about,
or am I fooling myself?”
If you flip a coin 10 times and get 7 heads, is the coin rigged? Probably not — random luck explains it. If you flip it 10,000 times and get 7,000 heads, the coin is definitely rigged. Hypothesis testing is the math that draws the line between those two cases.
What goodness-of-fit answers: “I have one set of counts and an expectation of what the counts should look like. Do they match?”
The setup. A medical researcher samples 50 hospital patients. He wants to know whether their blood-type breakdown looks like the U.S. population — or whether something is unusual about who walks into this hospital.
H₀ (null): Hospital matches the general
population (20% A · 28% B · 36% O · 16% AB).
H₁
(claim): Hospital differs from the general population.
observed_blood <- c(A = 12, B = 8, O = 24, AB = 6)
expected_p_blood <- c(0.20, 0.28, 0.36, 0.16)
blood_test <- chisq.test(observed_blood, p = expected_p_blood)
blood_test##
## Chi-squared test for given probabilities
##
## data: observed_blood
## X-squared = 5.4714, df = 3, p-value = 0.1404
## [1] 6.251389
Imagine you ordered a bag of 50 mixed candies and got more reds than expected. Is the candy company short-changing you on greens, or did you just get a lucky bag? With only 50 candies, it’s almost certainly a lucky bag.
The setup. A major airline’s 200-flight sample is compared against the federal benchmark (BTS): 70.8% on-time, 8.2% NAS delay, 9.0% arriving late, 12.0% other (weather). Of the 200 flights: 125 on-time, 10 NAS, 40 weather/other, 25 late (the leftover).
H₀ (null): The carrier matches BTS
percentages.
H₁ (claim): The carrier
differs from BTS percentages.
observed_air <- c(OnTime = 125, NAS = 10, Late = 25, Other = 40)
expected_p_air <- c(0.708, 0.082, 0.090, 0.120)
air_test <- chisq.test(observed_air, p = expected_p_air)
air_test##
## Chi-squared test for given probabilities
##
## data: observed_air
## X-squared = 17.832, df = 3, p-value = 0.0004763
## [1] 7.814728
This airline is being battered by environmental delays, not by air-traffic-system coordination. The fix isn’t more dispatchers — it’s smarter weather routing.
What independence answers: “I have two characteristics measured on the same people. Are they related to each other, or does one tell me nothing about the other?”
The setup. Did the demographic mix of moviegoers shift between 2013 and 2014?
H₀ (null): Year and ethnicity are
independent — the mix is stable.
H₁
(claim): Movie attendance depends on year — the mix
shifted.
movies <- matrix(c(724, 335, 174, 107,
370, 292, 152, 140), nrow = 2, byrow = TRUE,
dimnames = list(Year = c("2013","2014"),
Ethnicity = c("Caucasian","Hispanic",
"AfricanAm","Other")))
movies## Ethnicity
## Year Caucasian Hispanic AfricanAm Other
## 2013 724 335 174 107
## 2014 370 292 152 140
##
## Pearson's Chi-squared test
##
## data: movies
## X-squared = 60.144, df = 3, p-value = 0.0000000000005478
## [1] 7.814728
If a restaurant’s customer mix changes from year to year — fewer regulars, more new walk-ins — the menu and marketing have to change too. Same idea here, just for movie tickets.
The setup. Across the Army, Navy, Marines, and Air Force, is the officer-to-enlisted ratio for women the same — or do branches differ?
H₀ (null): Rank distribution is
independent of branch.
H₁ (claim): Rank
composition depends on branch.
military <- matrix(c(10791, 62491, 7816, 42750,
932, 9525, 11819, 54344), nrow = 4, byrow = TRUE,
dimnames = list(Branch = c("Army","Navy","Marines","AirForce"),
Rank = c("Officers","Enlisted")))
military## Rank
## Branch Officers Enlisted
## Army 10791 62491
## Navy 7816 42750
## Marines 932 9525
## AirForce 11819 54344
##
## Pearson's Chi-squared test
##
## data: military
## X-squared = 654.27, df = 3, p-value < 0.00000000000000022
## [1] 7.814728
What one-way ANOVA answers: “I have three or more groups. Are their averages really different, or could random sampling explain the gaps?”
The setup. Do condiments, cereals, and desserts contain different amounts of sodium per serving?
H₀ (null): All three population means are
equal.
H₁ (claim): At least one
category differs.
sodium <- data.frame(
amount = c(270,130,230,180,80,70,200, 260,220,290,290,200,320,140,
100,180,250,250,300,360,300,160),
food = factor(c(rep("Condiments",7), rep("Cereals",7), rep("Desserts",8)))
)
summary(aov(amount ~ food, data = sodium))## Df Sum Sq Mean Sq F value Pr(>F)
## food 2 27544 13772 2.399 0.118
## Residuals 19 109093 5742
## [1] 3.521893
If three classrooms have averages of 70%, 75%, and 73%, you might think there’s a difference. But if scores in each class range from 40% to 100%, the averages are wobbling around so much you can’t really tell them apart.
What post-hoc tests do: ANOVA tells you whether the groups differ. Tukey or Scheffé tells you which pairs differ. (Only run them if ANOVA rejects H₀.)
The setup. Cereal, chocolate, and coffee company sales (in millions). Are the industry averages really different?
H₀ (null): Mean sales are equal across
the three industries.
H₁ (claim): At
least one industry mean differs.
sales <- data.frame(
revenue = c(578,320,264,249,237, 311,106,109,125,173, 261,185,302,689),
industry = factor(c(rep("Cereal",5), rep("Chocolate",5), rep("Coffee",4)))
)
summary(aov(revenue ~ industry, data = sales))## Df Sum Sq Mean Sq F value Pr(>F)
## industry 2 103770 51885 2.172 0.16
## Residuals 11 262795 23890
## [1] 7.205713
The setup. Eastern, Middle, and Western U.S. states. Is there a real difference in per-pupil spending across regions?
H₀ (null): Mean spending is equal across
the three regions.
H₁ (claim): At least
one regional mean differs.
pupil <- data.frame(
spend = c(4946,5953,6202,7243,6113, 6149,7451,6000,6479, 5282,8605,6528,6911),
region = factor(c(rep("Eastern",5), rep("Middle",4), rep("Western",4)))
)
summary(aov(spend ~ region, data = pupil))## Df Sum Sq Mean Sq F value Pr(>F)
## region 2 1244588 622294 0.649 0.543
## Residuals 10 9591145 959114
## [1] 4.102821
What two-way ANOVA adds: It tests two factors simultaneously and whether they interact (i.e., does the effect of one depend on the level of the other?).
The setup. A 2×2 factorial experiment: two light strengths, two plant foods, three plants in each combination. Twelve plants total.
H₀ (food): Plant food has no effect
H₁: It does.
H₀
(light): Light strength has no effect
H₁: It does.
H₀
(interaction): Effects are additive
H₁: They interact.
plants <- data.frame(
growth = c(9.2,9.4,8.9, 8.5,9.2,8.9, 7.1,7.2,8.5, 5.5,5.8,7.6),
food = factor(rep(c("A","A","B","B"), each = 3)),
light = factor(rep(c("L1","L2","L1","L2"), each = 3))
)
summary(aov(growth ~ food * light, data = plants))## Df Sum Sq Mean Sq F value Pr(>F)
## food 1 12.813 12.813 24.562 0.00111 **
## light 1 1.920 1.920 3.681 0.09133 .
## food:light 1 0.750 0.750 1.438 0.26482
## Residuals 8 4.173 0.522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 5.317655
If two lines on a chart are parallel, the gap between them stays the same — meaning the two things you’re measuring add up independently. If the lines cross, you have an interaction: the effect of one depends on the other.
This is real Major League Baseball data covering 1962 through 2012 — the kind of dataset that powered the Moneyball revolution and changed how every front office in pro sports thinks about evidence.
## Rows: 1,232
## Columns: 15
## $ Team <chr> "ARI", "ATL", "BAL", "BOS", "CHC", "CHW", "CIN", "CLE", "…
## $ League <chr> "NL", "NL", "AL", "AL", "NL", "AL", "NL", "AL", "NL", "AL…
## $ Year <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
## $ RS <int> 734, 700, 712, 734, 613, 748, 669, 667, 758, 726, 583, 67…
## $ RA <int> 688, 600, 705, 806, 759, 676, 588, 845, 890, 670, 794, 74…
## $ W <int> 81, 94, 93, 69, 61, 85, 97, 68, 64, 88, 55, 72, 89, 86, 6…
## $ OBP <dbl> 0.328, 0.320, 0.311, 0.315, 0.302, 0.318, 0.315, 0.324, 0…
## $ SLG <dbl> 0.418, 0.389, 0.417, 0.415, 0.378, 0.422, 0.411, 0.381, 0…
## $ BA <dbl> 0.259, 0.247, 0.247, 0.260, 0.240, 0.255, 0.251, 0.251, 0…
## $ Playoffs <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ RankSeason <int> NA, 4, 5, NA, NA, NA, 2, NA, NA, 6, NA, NA, NA, NA, NA, N…
## $ RankPlayoffs <int> NA, 5, 4, NA, NA, NA, 4, NA, NA, 2, NA, NA, NA, NA, NA, N…
## $ G <int> 162, 162, 162, 162, 162, 162, 162, 162, 162, 162, 162, 16…
## $ OOBP <dbl> 0.317, 0.306, 0.315, 0.331, 0.335, 0.319, 0.305, 0.336, 0…
## $ OSLG <dbl> 0.415, 0.378, 0.403, 0.428, 0.424, 0.405, 0.390, 0.430, 0…
The win distribution is roughly bell-shaped and centered near 80.9 — exactly half of a 162-game season. That’s the natural gravity of a parity league: every win for one team is a loss for another.
Playoff teams (red) sit consistently above the diagonal — they outscore their opponents over the season. This is the empirical foundation behind Bill James’s Pythagorean Expectation, the formula that powers most modern analytics-driven roster construction.
Mean wins hold steady around 81 — by mathematical necessity. League-wide totals balance out every year.
H₀ (null): Total wins are equally
distributed across decades.
H₁ (claim):
Total wins are not equally distributed across decades.
bb$Decade <- bb$Year - (bb$Year %% 10)
wins <- bb %>% group_by(Decade) %>% summarise(wins = sum(W)) %>% as_tibble()
wins##
## Chi-squared test for given probabilities
##
## data: wins$wins
## X-squared = 9989.5, df = 5, p-value < 0.00000000000000022
## [1] 11.0705
Critical-value vs p-value — same answer. The test statistic exceeds the critical value, and p is far below α. The two procedures are different framings of the same arithmetic; whenever one rejects, the other will too.
A statistical test can be computationally correct yet substantively misleading if its assumptions don’t match how the data were generated. Equal expected frequencies across decades implicitly assumes a stable league and full coverage — neither of which holds here. The right next step would be to weight expected frequencies by team-seasons per decade.
96 plots. 3 fertilizer types. 2 planting densities. 4 randomization blocks. This is what statistics looks like when an experimenter actually controlled the inputs — and the analysis can give a clean recommendation.
crop <- read.csv("crop_data.csv")
crop$density <- factor(crop$density)
crop$fertilizer <- factor(crop$fertilizer)
crop$block <- factor(crop$block)
str(crop)## 'data.frame': 96 obs. of 4 variables:
## $ density : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
## $ block : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
## $ fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ yield : num 177 178 176 178 177 ...
H₀ (fertilizer): Mean yield equal across
fertilizers · H₁: They differ.
H₀ (density): Mean yield equal across
densities · H₁: They differ.
H₀ (interaction): No interaction ·
H₁: Interaction present.
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.068 3.034 9.001 0.000273 ***
## density 1 5.122 5.122 15.195 0.000186 ***
## fertilizer:density 2 0.428 0.214 0.635 0.532500
## Residuals 90 30.337 0.337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = yield ~ fertilizer * density, data = crop)
##
## $fertilizer
## diff lwr upr p adj
## 2-1 0.1761687 -0.16972711 0.5220646 0.4482026
## 3-1 0.5991256 0.25322974 0.9450214 0.0002393
## 3-2 0.4229569 0.07706102 0.7688527 0.0123951
| Problem | Test | Statistic | Critical | p-value | Decision |
|---|---|---|---|---|---|
| 11-1 #6 Blood Types | GoF χ² | 5.47 | 6.25 | 0.140 | Fail to Reject |
| 11-1 #8 Airlines | GoF χ² | 17.83 | 7.81 | <0.001 | Reject |
| 11-2 #8 Movies | Indep. χ² | 60.14 | 7.81 | <0.001 | Reject |
| 11-2 #10 Military | Indep. χ² | 654.27 | 7.81 | <0.001 | Reject |
| 12-1 #8 Sodium | 1-Way ANOVA | 2.40 | 3.52 | 0.118 | Fail to Reject |
| 12-2 #10 Sales | 1-Way ANOVA | 2.17 | 7.21 | 0.160 | Fail to Reject |
| 12-2 #12 Per-Pupil | 1-Way ANOVA | 0.65 | 4.10 | 0.543 | Fail to Reject |
| 12-3 #10 Plants (Food) | 2-Way ANOVA | 24.56 | 5.32 | 0.001 | Reject |
| Baseball Decade GoF | GoF χ² | 9989.54 | 11.07 | <0.001 | Reject |
| Crop Fertilizer × Density | 2-Way ANOVA | 9.00 / 15.19 | — | <0.001 | Reject |
Across nine tests, four returned clear signals and five returned an honest verdict of insufficient evidence. Both outcomes are deliverables.
A ‘fail to reject’ is not a failure of analysis — it is information about the design. The temptation in modern analytics is to surface a ‘significant’ result and amplify it; the more disciplined practice is to acknowledge that small samples and wide variance leave conclusions ambiguous, and that pretending otherwise is exactly how analytics teams lose credibility with the businesses they support.