📊 Reading the Numbers Without Lying With Them

Statistics is, at its most useful, a translator. It takes the noise of the world — patient blood draws, late-arriving flights, fertilizer trials, baseball box scores — and converts it into a yes-or-no answer about whether what we are observing is real or merely a coincidence. This notebook walks through nine such decisions, each told as a story.

Every chart below is interactive. Hover, zoom, click. The code is hidden by default — toggle it with the Code buttons on the right of each block.

What This Notebook Covers

9
Hypothesis Tests
5
Rejected H₀
4
Failed to Reject
3
Real Datasets
The Two Tools, In One Sentence Each

Chi-Square asks: “Do these counts look right?”  ·  ANOVA asks: “Are these averages really different?”
Both are formal versions of the question every analyst eventually asks: “Is this signal worth caring about, or am I fooling myself?”

If you flip a coin 10 times and get 7 heads, is the coin rigged? Probably not — random luck explains it. If you flip it 10,000 times and get 7,000 heads, the coin is definitely rigged. Hypothesis testing is the math that draws the line between those two cases.


1 Section 11-1 — Goodness-of-Fit Tests

What goodness-of-fit answers: “I have one set of counts and an expectation of what the counts should look like. Do they match?”

1.1 🩸 Problem 6 — Blood Types in a Large Hospital

The setup. A medical researcher samples 50 hospital patients. He wants to know whether their blood-type breakdown looks like the U.S. population — or whether something is unusual about who walks into this hospital.

H₀ (null): Hospital matches the general population (20% A · 28% B · 36% O · 16% AB).
H₁ (claim): Hospital differs from the general population.

observed_blood   <- c(A = 12, B = 8, O = 24, AB = 6)
expected_p_blood <- c(0.20, 0.28, 0.36, 0.16)
blood_test <- chisq.test(observed_blood, p = expected_p_blood)
blood_test
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_blood
## X-squared = 5.4714, df = 3, p-value = 0.1404
qchisq(0.90, df = 3)   # critical value at alpha = 0.10
## [1] 6.251389
Decision: Fail to reject H₀. χ² = 5.47 is below the critical value of 6.25 (p = 0.140). The hospital’s patient pool is statistically consistent with the general population, even though Type O looks visibly over-represented. With only 50 patients, the eye is easily fooled — the difference could just be sampling noise.

Imagine you ordered a bag of 50 mixed candies and got more reds than expected. Is the candy company short-changing you on greens, or did you just get a lucky bag? With only 50 candies, it’s almost certainly a lucky bag.


1.2 ✈️ Problem 8 — Airline On-Time Performance

The setup. A major airline’s 200-flight sample is compared against the federal benchmark (BTS): 70.8% on-time, 8.2% NAS delay, 9.0% arriving late, 12.0% other (weather). Of the 200 flights: 125 on-time, 10 NAS, 40 weather/other, 25 late (the leftover).

H₀ (null): The carrier matches BTS percentages.
H₁ (claim): The carrier differs from BTS percentages.

observed_air   <- c(OnTime = 125, NAS = 10, Late = 25, Other = 40)
expected_p_air <- c(0.708, 0.082, 0.090, 0.120)
air_test <- chisq.test(observed_air, p = expected_p_air)
air_test
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_air
## X-squared = 17.832, df = 3, p-value = 0.0004763
qchisq(0.95, df = 3)
## [1] 7.814728
Decision: Reject H₀. χ² = 17.83 is far above the critical value of 7.81 (p < 0.001). The carrier’s delay mix is genuinely different from the federal benchmark — and the chart above shows why: nearly two-thirds of the test value comes from weather/other delays, which were dramatically over-represented.
Operational read-out

This airline is being battered by environmental delays, not by air-traffic-system coordination. The fix isn’t more dispatchers — it’s smarter weather routing.


2 Section 11-2 — Tests of Independence

What independence answers: “I have two characteristics measured on the same people. Are they related to each other, or does one tell me nothing about the other?”

2.1 🎬 Problem 8 — Movie Admissions and Ethnicity

The setup. Did the demographic mix of moviegoers shift between 2013 and 2014?

H₀ (null): Year and ethnicity are independent — the mix is stable.
H₁ (claim): Movie attendance depends on year — the mix shifted.

movies <- matrix(c(724, 335, 174, 107,
                   370, 292, 152, 140), nrow = 2, byrow = TRUE,
                 dimnames = list(Year = c("2013","2014"),
                                 Ethnicity = c("Caucasian","Hispanic",
                                               "AfricanAm","Other")))
movies
##       Ethnicity
## Year   Caucasian Hispanic AfricanAm Other
##   2013       724      335       174   107
##   2014       370      292       152   140
movies_test <- chisq.test(movies); movies_test
## 
##  Pearson's Chi-squared test
## 
## data:  movies
## X-squared = 60.144, df = 3, p-value = 0.0000000000005478
qchisq(0.95, df = 3)
## [1] 7.814728
Decision: Reject H₀. χ² = 60.14 dwarfs the critical value of 7.81 (p < 0.001). Caucasian admissions fell sharply while ‘Other’ rose. For studio marketing teams, audience composition is a moving target — last year’s plan is already stale.

If a restaurant’s customer mix changes from year to year — fewer regulars, more new walk-ins — the menu and marketing have to change too. Same idea here, just for movie tickets.


2.2 🪖 Problem 10 — Women in the Military

The setup. Across the Army, Navy, Marines, and Air Force, is the officer-to-enlisted ratio for women the same — or do branches differ?

H₀ (null): Rank distribution is independent of branch.
H₁ (claim): Rank composition depends on branch.

military <- matrix(c(10791, 62491,  7816, 42750,
                      932,  9525, 11819, 54344), nrow = 4, byrow = TRUE,
                   dimnames = list(Branch = c("Army","Navy","Marines","AirForce"),
                                   Rank   = c("Officers","Enlisted")))
military
##           Rank
## Branch     Officers Enlisted
##   Army        10791    62491
##   Navy         7816    42750
##   Marines       932     9525
##   AirForce    11819    54344
mil_test <- chisq.test(military); mil_test
## 
##  Pearson's Chi-squared test
## 
## data:  military
## X-squared = 654.27, df = 3, p-value < 0.00000000000000022
qchisq(0.95, df = 3)
## [1] 7.814728
Decision: Reject H₀ (decisively). χ² = 654.27 — an extraordinary statistic, with p essentially zero. Air Force runs near 18% officers; Marines near 9%. The branches are structurally different, and any ‘women in the military’ analysis that aggregates them obscures the real picture.

3 Section 12-1 — One-Way ANOVA

What one-way ANOVA answers: “I have three or more groups. Are their averages really different, or could random sampling explain the gaps?”

3.1 🧂 Problem 8 — Sodium Across Food Categories

The setup. Do condiments, cereals, and desserts contain different amounts of sodium per serving?

H₀ (null): All three population means are equal.
H₁ (claim): At least one category differs.

sodium <- data.frame(
  amount = c(270,130,230,180,80,70,200, 260,220,290,290,200,320,140,
             100,180,250,250,300,360,300,160),
  food = factor(c(rep("Condiments",7), rep("Cereals",7), rep("Desserts",8)))
)
summary(aov(amount ~ food, data = sodium))
##             Df Sum Sq Mean Sq F value Pr(>F)
## food         2  27544   13772   2.399  0.118
## Residuals   19 109093    5742
qf(0.95, 2, 19)
## [1] 3.521893
Decision: Fail to reject H₀. F = 2.40, below the critical value of 3.52 (p = 0.118). The means look different (165.7 · 245.7 · 237.5 mg) but the spread within each category — especially condiments and desserts — is wide enough that the gaps could easily be sampling noise. This is an honest ‘we don’t know yet,’ not a clean answer.

If three classrooms have averages of 70%, 75%, and 73%, you might think there’s a difference. But if scores in each class range from 40% to 100%, the averages are wobbling around so much you can’t really tell them apart.


4 Section 12-2 — One-Way ANOVA with Post-Hoc

What post-hoc tests do: ANOVA tells you whether the groups differ. Tukey or Scheffé tells you which pairs differ. (Only run them if ANOVA rejects H₀.)

4.1 💰 Problem 10 — Sales for Leading Companies

The setup. Cereal, chocolate, and coffee company sales (in millions). Are the industry averages really different?

H₀ (null): Mean sales are equal across the three industries.
H₁ (claim): At least one industry mean differs.

sales <- data.frame(
  revenue = c(578,320,264,249,237, 311,106,109,125,173, 261,185,302,689),
  industry = factor(c(rep("Cereal",5), rep("Chocolate",5), rep("Coffee",4)))
)
summary(aov(revenue ~ industry, data = sales))
##             Df Sum Sq Mean Sq F value Pr(>F)
## industry     2 103770   51885   2.172   0.16
## Residuals   11 262795   23890
qf(0.99, 2, 11)
## [1] 7.205713
Decision: Fail to reject H₀. F = 2.17 vs critical 7.21 at α = 0.01 (p = 0.160). The coffee category contains a $689M outlier that inflates its variance enormously — once spread is accounted for, the industry gaps lose statistical force. Because we cannot reject H₀, no Tukey or Scheffé follow-up is needed.

4.2 🏫 Problem 12 — Per-Pupil Expenditures by Region

The setup. Eastern, Middle, and Western U.S. states. Is there a real difference in per-pupil spending across regions?

H₀ (null): Mean spending is equal across the three regions.
H₁ (claim): At least one regional mean differs.

pupil <- data.frame(
  spend = c(4946,5953,6202,7243,6113, 6149,7451,6000,6479, 5282,8605,6528,6911),
  region = factor(c(rep("Eastern",5), rep("Middle",4), rep("Western",4)))
)
summary(aov(spend ~ region, data = pupil))
##             Df  Sum Sq Mean Sq F value Pr(>F)
## region       2 1244588  622294   0.649  0.543
## Residuals   10 9591145  959114
qf(0.95, 2, 10)
## [1] 4.102821
Decision: Fail to reject H₀. F = 0.65, far below critical 4.10 (p = 0.543). Regional means trend upward (East $6,091 → Middle $6,520 → West $6,832) but state-level variation inside each region overwhelms the between-region signal. Headline regional comparisons can mislead — the real story is at the state level.

5 Section 12-3 — Two-Way ANOVA

What two-way ANOVA adds: It tests two factors simultaneously and whether they interact (i.e., does the effect of one depend on the level of the other?).

5.1 🌱 Problem 10 — Plant Growth Under Light × Food

The setup. A 2×2 factorial experiment: two light strengths, two plant foods, three plants in each combination. Twelve plants total.

H₀ (food): Plant food has no effect    H₁: It does.
H₀ (light): Light strength has no effect    H₁: It does.
H₀ (interaction): Effects are additive    H₁: They interact.

plants <- data.frame(
  growth = c(9.2,9.4,8.9, 8.5,9.2,8.9, 7.1,7.2,8.5, 5.5,5.8,7.6),
  food   = factor(rep(c("A","A","B","B"), each = 3)),
  light  = factor(rep(c("L1","L2","L1","L2"), each = 3))
)
summary(aov(growth ~ food * light, data = plants))
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## food         1 12.813  12.813  24.562 0.00111 **
## light        1  1.920   1.920   3.681 0.09133 . 
## food:light   1  0.750   0.750   1.438 0.26482   
## Residuals    8  4.173   0.522                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qf(0.95, 1, 8)
## [1] 5.317655
Mixed result. Food: F = 24.56 — Reject (p = 0.001). Light: F = 3.68 (p = 0.091) — fail to reject. Interaction: F = 1.44 (p = 0.265) — fail to reject. The roughly parallel lines above are the visual signature of additive effects: food matters strongly, light barely matters, and the two don’t amplify each other.

If two lines on a chart are parallel, the gap between them stays the same — meaning the two things you’re measuring add up independently. If the lines cross, you have an interaction: the effect of one depends on the other.


6 Part II — The Baseball Story

⚾ 50 Seasons. 39 Franchises. 1,232 Team-Years.

This is real Major League Baseball data covering 1962 through 2012 — the kind of dataset that powered the Moneyball revolution and changed how every front office in pro sports thinks about evidence.

6.1 📥 Importing the Data

bb <- read.csv("baseball.csv", stringsAsFactors = FALSE)
glimpse(bb)
## Rows: 1,232
## Columns: 15
## $ Team         <chr> "ARI", "ATL", "BAL", "BOS", "CHC", "CHW", "CIN", "CLE", "…
## $ League       <chr> "NL", "NL", "AL", "AL", "NL", "AL", "NL", "AL", "NL", "AL…
## $ Year         <int> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
## $ RS           <int> 734, 700, 712, 734, 613, 748, 669, 667, 758, 726, 583, 67…
## $ RA           <int> 688, 600, 705, 806, 759, 676, 588, 845, 890, 670, 794, 74…
## $ W            <int> 81, 94, 93, 69, 61, 85, 97, 68, 64, 88, 55, 72, 89, 86, 6…
## $ OBP          <dbl> 0.328, 0.320, 0.311, 0.315, 0.302, 0.318, 0.315, 0.324, 0…
## $ SLG          <dbl> 0.418, 0.389, 0.417, 0.415, 0.378, 0.422, 0.411, 0.381, 0…
## $ BA           <dbl> 0.259, 0.247, 0.247, 0.260, 0.240, 0.255, 0.251, 0.251, 0…
## $ Playoffs     <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ RankSeason   <int> NA, 4, 5, NA, NA, NA, 2, NA, NA, 6, NA, NA, NA, NA, NA, N…
## $ RankPlayoffs <int> NA, 5, 4, NA, NA, NA, 4, NA, NA, 2, NA, NA, NA, NA, NA, N…
## $ G            <int> 162, 162, 162, 162, 162, 162, 162, 162, 162, 162, 162, 16…
## $ OOBP         <dbl> 0.317, 0.306, 0.315, 0.331, 0.335, 0.319, 0.305, 0.336, 0…
## $ OSLG         <dbl> 0.415, 0.378, 0.403, 0.428, 0.424, 0.405, 0.390, 0.430, 0…
1,232
Team-Seasons
39
Franchises
1962–2012
Year Range
244
Playoff Berths

6.2 📊 Exploratory Analysis

6.2.1 How are wins distributed?

The win distribution is roughly bell-shaped and centered near 80.9 — exactly half of a 162-game season. That’s the natural gravity of a parity league: every win for one team is a loss for another.

6.2.2 What separates playoff teams from the rest?

Playoff teams (red) sit consistently above the diagonal — they outscore their opponents over the season. This is the empirical foundation behind Bill James’s Pythagorean Expectation, the formula that powers most modern analytics-driven roster construction.

6.2.3 Has the league changed over time?

Mean wins hold steady around 81 — by mathematical necessity. League-wide totals balance out every year.

6.3 🎯 Chi-Square: Are Wins Distributed Equally Across Decades?

H₀ (null): Total wins are equally distributed across decades.
H₁ (claim): Total wins are not equally distributed across decades.

bb$Decade <- bb$Year - (bb$Year %% 10)
wins <- bb %>% group_by(Decade) %>% summarise(wins = sum(W)) %>% as_tibble()
wins
dec_test <- chisq.test(wins$wins); dec_test
## 
##  Chi-squared test for given probabilities
## 
## data:  wins$wins
## X-squared = 9989.5, df = 5, p-value < 0.00000000000000022
qchisq(0.95, df = 5)
## [1] 11.0705
Decision: Reject H₀. χ² = 9,989.54 utterly dwarfs the critical value of 11.07 (p ≈ 0). But — the imbalance reflects the dataset’s structure, not the sport: the 1960s start in 1962, the 2010s end in 2012, and league expansion added franchises across the period. The test is mechanically correct but the equal-frequency assumption doesn’t match how the data were generated.

Critical-value vs p-value — same answer. The test statistic exceeds the critical value, and p is far below α. The two procedures are different framings of the same arithmetic; whenever one rejects, the other will too.

The deeper lesson

A statistical test can be computationally correct yet substantively misleading if its assumptions don’t match how the data were generated. Equal expected frequencies across decades implicitly assumes a stable league and full coverage — neither of which holds here. The right next step would be to weight expected frequencies by team-seasons per decade.


7 Part III — Crop Yield Two-Way ANOVA

🌾 A Designed Agricultural Experiment

96 plots. 3 fertilizer types. 2 planting densities. 4 randomization blocks. This is what statistics looks like when an experimenter actually controlled the inputs — and the analysis can give a clean recommendation.

7.1 Setup

crop <- read.csv("crop_data.csv")
crop$density    <- factor(crop$density)
crop$fertilizer <- factor(crop$fertilizer)
crop$block      <- factor(crop$block)
str(crop)
## 'data.frame':    96 obs. of  4 variables:
##  $ density   : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
##  $ block     : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
##  $ fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ yield     : num  177 178 176 178 177 ...

H₀ (fertilizer): Mean yield equal across fertilizers  ·  H₁: They differ.
H₀ (density): Mean yield equal across densities  ·  H₁: They differ.
H₀ (interaction): No interaction  ·  H₁: Interaction present.

7.2 The ANOVA Table

crop_aov <- aov(yield ~ fertilizer * density, data = crop)
summary(crop_aov)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer          2  6.068   3.034   9.001 0.000273 ***
## density             1  5.122   5.122  15.195 0.000186 ***
## fertilizer:density  2  0.428   0.214   0.635 0.532500    
## Residuals          90 30.337   0.337                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(crop_aov, "fertilizer")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = yield ~ fertilizer * density, data = crop)
## 
## $fertilizer
##          diff         lwr       upr     p adj
## 2-1 0.1761687 -0.16972711 0.5220646 0.4482026
## 3-1 0.5991256  0.25322974 0.9450214 0.0002393
## 3-2 0.4229569  0.07706102 0.7688527 0.0123951

7.3 Interaction Plot — Are the Lines Parallel?

7.4 Distribution by Fertilizer

Decision: Reject H₀ for both main effects. Fertilizer F = 9.00 (p < 0.001) and Density F = 15.19 (p < 0.001) are highly significant. The interaction (F = 0.63, p = 0.533) is not. Tukey’s HSD shows fertilizer 3 outperforms 1 and 2; types 1 and 2 don’t differ from each other. Recommendation: switch to fertilizer 3 at the higher planting density.

8 Putting It All Together

8.1 📊 Results Summary

Problem Test Statistic Critical p-value Decision
11-1 #6 Blood Types GoF χ² 5.47 6.25 0.140 Fail to Reject
11-1 #8 Airlines GoF χ² 17.83 7.81 <0.001 Reject
11-2 #8 Movies Indep. χ² 60.14 7.81 <0.001 Reject
11-2 #10 Military Indep. χ² 654.27 7.81 <0.001 Reject
12-1 #8 Sodium 1-Way ANOVA 2.40 3.52 0.118 Fail to Reject
12-2 #10 Sales 1-Way ANOVA 2.17 7.21 0.160 Fail to Reject
12-2 #12 Per-Pupil 1-Way ANOVA 0.65 4.10 0.543 Fail to Reject
12-3 #10 Plants (Food) 2-Way ANOVA 24.56 5.32 0.001 Reject
Baseball Decade GoF GoF χ² 9989.54 11.07 <0.001 Reject
Crop Fertilizer × Density 2-Way ANOVA 9.00 / 15.19 <0.001 Reject

8.2 🎯 Stakeholder Recommendations

🏥 Hospital
Blood types didn’t differ significantly (χ² = 5.47, p = 0.14), but n = 50 is too small to be conclusive.
Continue collection until n ≥ 200 before publishing claims. Stock based on national averages in the meantime.
✈️ Airline Operations
Delay mix differs from BTS (χ² = 17.83, p < 0.001). ‘Other’ (weather) drives 60%+ of the test value.
Reallocate budget from NAS coordination to weather-routing investment. Target: bring ‘Other’ delays within 25% of BTS benchmark in 12 months.
🎬 Studio Marketing
Audience composition shifted significantly between 2013 and 2014 (χ² = 60.14, p < 0.001).
Refresh demographic models annually. Pilot two campaigns aimed at the growing ‘Other’ segment.
🪖 Defense HR Analytics
Officer-to-enlisted ratio is highly branch-dependent (χ² = 654.27). Air Force ≈18% officers; Marines ≈9%.
Stop reporting aggregated cross-branch female participation. Publish branch-specific dashboards instead.
🧂 Nutrition Labeling
Sodium didn’t differ across categories (F = 2.40, p = 0.118), but sample sizes were small.
Don’t yet publish category-level claims. Expand sample to ≥30 items per category, or stratify within categories.
💰 Industry Benchmarking
Cereal/chocolate/coffee sales didn’t differ (F = 2.17, p = 0.160), driven by one $689M coffee outlier.
Re-run after winsorizing or using medians, and segment coffee (specialty vs. mass-market).
🏫 Education Policy
Per-pupil spending didn’t differ across U.S. regions (F = 0.65, p = 0.543); state variation overwhelms region.
Shift the policy conversation from regional to state-level. Identify high/low spenders within each region.
🌱 Gardening Company
Plant food strongly affects growth (F = 24.56, p = 0.001); light strength and interaction don’t.
Lead product positioning with food formulation. Bundle Food A as the headline SKU; de-emphasize light strength.
⚾ MLB Baseball Operations
Wins differ by decade statistically — but the imbalance is a coverage artefact, not era difference.
Re-run with expected frequencies weighted by team-seasons per decade, or restrict to the four fully-covered decades.
🌾 Agriculture / Crop Producer
Both fertilizer (F = 9.00) and density (F = 15.19) are highly significant; no interaction. Tukey: type 3 wins.
Convert all fields to fertilizer 3 at higher density next cycle. Run a one-season pilot at scale before locking in supply contracts.

Closing Reflection

The Discipline Behind the Decision

Across nine tests, four returned clear signals and five returned an honest verdict of insufficient evidence. Both outcomes are deliverables.

A ‘fail to reject’ is not a failure of analysis — it is information about the design. The temptation in modern analytics is to surface a ‘significant’ result and amplify it; the more disciplined practice is to acknowledge that small samples and wide variance leave conclusions ambiguous, and that pretending otherwise is exactly how analytics teams lose credibility with the businesses they support.