The Shapiro-Wilk p-value is exquisitely sensitive to sample size, not just to how far the data depart from normality. With large n, it flags trivial deviations. With small n, it misses real departures. Researchers who rely on it can therefore draw the wrong conclusion about whether parametric tests are appropriate.
1. Demonstration: Same Data, Different n
Scenario A: Mild Right Skew (Beta distribution, α=2, β=5)
This mimics realistic clinical data, such as patient recovery times or biomarker concentrations.
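To put a number on "mild", the skewness of this Beta distribution can be computed directly. The sketch below uses the standard closed-form skewness of a Beta(α, β) distribution and checks it against a large simulated sample; the helper names `beta_skewness` and `sample_skewness` and the seed are illustrative assumptions, not part of the original analysis.

```r
# Theoretical skewness of a Beta(alpha, beta) distribution (standard formula)
beta_skewness <- function(alpha, beta) {
  2 * (beta - alpha) * sqrt(alpha + beta + 1) /
    ((alpha + beta + 2) * sqrt(alpha * beta))
}

# Moment-based sample skewness, written out so no extra package is needed
sample_skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(1)  # arbitrary seed, for reproducibility only
x <- rbeta(1e5, shape1 = 2, shape2 = 5) * 100  # rescaling does not change skewness

theoretical <- beta_skewness(2, 5)
empirical <- sample_skewness(x)
round(c(theoretical = theoretical, empirical = empirical), 2)
```

A skewness of roughly 0.6 is a moderate right skew: clearly not zero, but far from the heavy skew of, say, an exponential distribution (skewness 2).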
Shapiro-Wilk Results: Identical Mildly-Skewed Data
Only sample size varies

| n | W statistic | p-value | Test Decision |
|---|---|---|---|
| 20 | 0.9487295 | 0.3482 | Cannot reject ✓ |
| 50 | 0.9700771 | 0.2331 | Cannot reject ✓ |
| 100 | 0.9814155 | 0.1711 | Cannot reject ✓ |
| 500 | 0.9682310 | < 0.001 | Reject normality ❌ |
| 1000 | 0.9563790 | < 0.001 | Reject normality ❌ |
| 5000 | 0.9652335 | < 0.001 | Reject normality ❌ |
What happened?
- n = 20: p = 0.35 → “data are normal” ✓
- n = 500: p < 0.001 → “data are NOT normal” ❌
- n = 5000: p < 0.001 → “data are definitively NOT normal” ❌

Same population. Same distribution. Only sample size changed.
The Shapiro-Wilk test became increasingly “sensitive”: the mild skew was always there, but it only becomes statistically detectable with larger samples. This is a statistical power effect, not evidence that the larger samples are any less normal.
2. Visual Proof: Q-Q Plots Tell the Real Story
Figure 1: Q-Q plots (the only visual that matters) show the same mild deviation from the line across all sample sizes. The Shapiro-Wilk p-value changes dramatically, but the visual pattern is identical.
Key observation: The Q-Q plots look identical in their pattern. The mild curvature at the upper tail is present in all three. Yet the Shapiro-Wilk p-values in the table above swing from 0.35 at n = 20 to < 0.001 at n = 500 and beyond.
This is not the data becoming “more non-normal.” It’s the test gaining statistical power to detect the non-normality that was always present.
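The power argument can be checked by simulation rather than from a single draw per sample size. The following is a minimal sketch, assuming fresh samples from the same Beta(2, 5) distribution at each n; the function name `sw_rejection_rate`, the number of replications, and the seed are illustrative choices.

```r
# Estimated probability that Shapiro-Wilk rejects at level alpha
# when samples of size n are drawn by rdist()
sw_rejection_rate <- function(n, rdist, reps = 500, alpha = 0.05) {
  p_values <- replicate(reps, shapiro.test(rdist(n))$p.value)
  mean(p_values < alpha)
}

set.seed(42)  # arbitrary seed, for reproducibility only
draw_beta <- function(n) rbeta(n, shape1 = 2, shape2 = 5)

sizes <- c(20, 100, 500)
power_by_n <- sapply(sizes, sw_rejection_rate, rdist = draw_beta)
names(power_by_n) <- sizes
power_by_n
```

The distribution is fixed throughout, so any increase in the rejection rate across `sizes` is attributable to sample size alone.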
3. The Opposite Problem: Small n and Hidden Non-Normality
Now flip it. What happens when n is too small to detect real non-normality?
Shapiro-Wilk on GENUINELY Non-Normal Data (t₃ distribution)
Heavy tails: truly non-normal

| n | p-value | Test Decision |
|---|---|---|
| 15 | 0.9921 | Cannot reject ✓ |
| 30 | 0.0004 | Reject normality ❌ |
| 60 | 0.0002 | Reject normality ❌ |
| 150 | < 0.0001 | Reject normality ❌ |
| 300 | < 0.0001 | Reject normality ❌ |
Figure 2: Student’s t-distribution (df=3, heavy-tailed) with varying sample sizes. Shapiro-Wilk fails to reject normality at small n, even though the data are genuinely non-normal.
What just happened:
- n = 15: Shapiro-Wilk p = 0.99 → “OK, assume normality” ✓ (FALSE NEGATIVE)
- n = 300: Shapiro-Wilk p < 0.001 → “Reject normality” ❌ (CORRECT)
The Q-Q plots in both cases show obvious heavy tails (points curve away sharply at both ends). At n=15, Shapiro-Wilk lacks power to detect this departure. Researchers with small studies would incorrectly assume their data are “normal enough” for a t-test.
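How often does this false negative occur? A single sample cannot say, but a simulation can estimate the miss rate. This sketch assumes independent draws straight from a t distribution with 3 degrees of freedom (rather than subsampling a fixed pool); the variable names, replication count, and seed are illustrative.

```r
set.seed(7)  # arbitrary seed, for reproducibility only
reps <- 1000

# Shapiro-Wilk p-values for repeated heavy-tailed samples at two sizes
p_small <- replicate(reps, shapiro.test(rt(15, df = 3))$p.value)
p_large <- replicate(reps, shapiro.test(rt(300, df = 3))$p.value)

# Miss rate: share of samples where the test fails to reject at the 5% level
miss_rate_small <- mean(p_small >= 0.05)
miss_rate_large <- mean(p_large >= 0.05)

round(c(n15 = miss_rate_small, n300 = miss_rate_large), 3)
```

A high miss rate at n = 15 means a non-significant Shapiro-Wilk result in a small study says little about whether the tails are actually well behaved.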
4. Practical Decision Framework
| Scenario | Shapiro-Wilk p | Q-Q Plot Says | Recommended Action |
|---|---|---|---|
| Mild skew, large n (n=500) | < 0.001 ❌ | “Mild deviation at tails” | Use Welch’s t-test or permutation test; don’t panic |
| Mild skew, small n (n=20) | 0.35 ✓ | “Mild deviation at tails” | Same distribution, same decision as above; don’t trust Shapiro-Wilk |
| Heavy tails, large n (n=300) | < 0.001 ❌ | “Obvious outliers/spread” | Use Yuen’s trimmed-mean or robust methods |
| Heavy tails, small n (n=15) | 0.99 ✓ | “Obvious outliers/spread” | Visually obvious; use robust methods despite p = 0.99 |
The Key Insight
Never rely on Shapiro-Wilk p-value alone. Always:
1. Look at the Q-Q plot: it shows the actual deviation
2. Check the subject-matter context: is skewness expected? (e.g., recovery times, biomarker concentrations are often right-skewed by nature)
3. Decide based on practical importance, not statistical significance of a normality test
A mild deviation from normality that’s visually clear in the Q-Q plot is the same departure regardless of whether Shapiro-Wilk p = 0.35 or p < 0.001.
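Step 1 of the checklist above needs nothing beyond base R. The sketch below draws a Q-Q plot with the Shapiro-Wilk p-value in the title so the two can be read side by side; `qq_with_shapiro` is a hypothetical helper, and `pdf(NULL)` is used only so the example runs without opening a plot window (drop it, and the matching `dev.off()`, for interactive use).

```r
# Q-Q plot with reference line; returns the Shapiro-Wilk p-value invisibly
qq_with_shapiro <- function(x) {
  p <- shapiro.test(x)$p.value
  qqnorm(x, main = sprintf("n = %d, Shapiro-Wilk p = %s",
                           length(x), format.pval(p, digits = 2, eps = 0.001)))
  qqline(x, lty = 2)
  invisible(p)
}

set.seed(2024)  # arbitrary seed, for reproducibility only
population <- rbeta(10000, shape1 = 2, shape2 = 5) * 100

pdf(NULL)                      # null device: no file is written
op <- par(mfrow = c(1, 3))     # three panels side by side
p_values <- sapply(c(20, 500, 5000),
                   function(n) qq_with_shapiro(sample(population, n)))
par(op)
invisible(dev.off())

p_values
```

Reading the panels left to right, the question to ask is whether the shape of the departure changes, not whether the p-value in the title does.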
5. Bottom Line: When to Use Parametric Tests
You can safely use a t-test (or ANOVA) when:
- The Q-Q plot shows the data hugging the diagonal line, with only minor deviation at the tails
- Your sample size is roughly n > 30 per group (a rule of thumb: the CLT makes tests on means robust to mild non-normality)
- There are no extreme outliers or heavy tails

The Shapiro-Wilk p-value is irrelevant to this decision; use the Q-Q plot instead.
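The robustness claim behind the n > 30 rule of thumb can be checked empirically. This is a minimal sketch, assuming two groups drawn from the same mildly skewed Beta(2, 5) distribution, so every rejection is a false positive; `type1_rate`, the replication count, and the seed are illustrative. Note that R's `t.test()` runs Welch's version by default.

```r
# Share of false positives when both groups share one skewed distribution
type1_rate <- function(n, reps = 2000, alpha = 0.05) {
  rejections <- replicate(reps, {
    a <- rbeta(n, shape1 = 2, shape2 = 5)
    b <- rbeta(n, shape1 = 2, shape2 = 5)
    t.test(a, b)$p.value < alpha   # Welch's t-test (R's default)
  })
  mean(rejections)
}

set.seed(123)  # arbitrary seed, for reproducibility only
rate_n30 <- type1_rate(30)
rate_n30
```

If the t-test is robust here, the estimated rate should sit close to the nominal 0.05 even though a normality test on data like these would eventually reject with enough observations.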
If the Q-Q plot looks questionable:
- Use Welch’s t-test (handles unequal variance + light non-normality)
- Use Yuen’s trimmed-mean t-test (robust to outliers)
- Use a permutation test (no normality assumption; with a difference-in-means statistic it still tests the mean)
- Reserve Mann-Whitney/Brunner-Munzel for when you genuinely want stochastic superiority, not as a “non-normal fallback”
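Two of these alternatives can be sketched in base R. The example below runs Welch's t-test via `t.test()` and a hand-written two-sided permutation test on the difference in means; `perm_test_mean_diff`, the simulated groups, the 20-unit shift, and the seed are all illustrative assumptions. Yuen's trimmed-mean test is not part of base R and is left out here.

```r
# Two-sided permutation test for a difference in means.
# Assumes observations are exchangeable between groups under the null.
perm_test_mean_diff <- function(x, y, n_perm = 5000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_diffs <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)       # random relabelling of group x
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # Add-one correction keeps the estimated p-value strictly above zero
  p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
  list(observed = observed, p_value = p_value)
}

set.seed(99)  # arbitrary seed, for reproducibility only
group_a <- rbeta(40, shape1 = 2, shape2 = 5) * 100
group_b <- rbeta(40, shape1 = 2, shape2 = 5) * 100 + 20  # same shape, shifted mean

welch <- t.test(group_a, group_b)            # Welch's t-test is R's default
perm <- perm_test_mean_diff(group_a, group_b)

c(welch_p = welch$p.value, perm_p = perm$p_value)
```

Both tests target the difference in means, which is why they are natural substitutes for a classical t-test when the Q-Q plot raises doubts.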
Source Code
---
title: "Shapiro-Wilk's Sample Size Trap"
subtitle: "Why p-values ≠ Practical Normality Assessment"
author: "Timothy Achala"
date: today
format:
  html:
    toc: true
    toc-depth: 2
    theme: flatly
    code-fold: true
    code-tools: true
    fig-width: 9
    fig-height: 5
    embed-resources: true
execute:
  warning: false
  message: false
---

```{r}
#| label: setup
library(tidyverse)
library(patchwork)
library(gt)
library(ggdist)

set.seed(2024)

# Generate mildly skewed data (Beta: α=2, β=5)
# This represents realistic non-normal but "acceptable" clinical data
mild_skew <- rbeta(10000, shape1 = 2, shape2 = 5) * 100

# Sample at different sizes
sample_sizes <- c(20, 50, 100, 500, 1000, 5000)

shapiro_results <- tibble(
  sample_size = sample_sizes,
  data = map(sample_sizes, ~ sample(mild_skew, size = .x, replace = FALSE)),
  shapiro_stat = map_dbl(data, ~ shapiro.test(.x)$statistic),
  p_value = map_dbl(data, ~ shapiro.test(.x)$p.value),
  conclusion = map_chr(p_value, ~ ifelse(.x < 0.05, "Reject normality ❌", "Cannot reject ✓"))
)

shapiro_results |>
  select(sample_size, shapiro_stat, p_value, conclusion) |>
  mutate(
    p_value = ifelse(p_value < 0.001, "< 0.001", round(p_value, 4))
  ) |>
  gt() |>
  tab_header(
    title = "Shapiro-Wilk Results: Identical Mildly-Skewed Data",
    subtitle = "Only sample size varies"
  ) |>
  tab_style(
    style = cell_fill(color = "#FADBD8"),
    locations = cells_body(rows = grepl("Reject", conclusion))
  ) |>
  tab_style(
    style = cell_fill(color = "#D5F5E3"),
    locations = cells_body(rows = grepl("Cannot", conclusion))
  ) |>
  cols_label(
    sample_size = "n",
    shapiro_stat = "W statistic",
    p_value = "p-value",
    conclusion = "Test Decision"
  )
```

```{r}
#| label: fig-qq-plots
#| fig-cap: "Q-Q plots (the only visual that matters) show the same mild deviation from the line across all sample sizes. The Shapiro-Wilk p-value changes dramatically, but the visual pattern is identical."

# Label each panel with the p-value of the sample actually plotted,
# so the subtitles cannot drift away from the data
sw_label <- function(x) {
  p <- shapiro.test(x)$p.value
  paste("Shapiro-Wilk p", ifelse(p < 0.001, "< 0.001", paste("=", round(p, 2))))
}

# Extract samples for plotting
sample_20 <- sample(mild_skew, 20)
sample_500 <- sample(mild_skew, 500)
sample_5000 <- sample(mild_skew, 5000)

p_qq_20 <- ggplot(tibble(x = sample_20), aes(sample = x)) +
  stat_qq(color = "#2E86AB", size = 1.2) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 20",
    subtitle = sw_label(sample_20),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  coord_equal()

p_qq_500 <- ggplot(tibble(x = sample_500), aes(sample = x)) +
  stat_qq(color = "#E84855", size = 0.8, alpha = 0.6) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 500",
    subtitle = sw_label(sample_500),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  coord_equal()

p_qq_5000 <- ggplot(tibble(x = sample_5000), aes(sample = x)) +
  stat_qq(color = "#A23B72", size = 0.4, alpha = 0.4) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 5,000",
    subtitle = sw_label(sample_5000),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  coord_equal()

p_qq_20 + p_qq_500 + p_qq_5000 +
  plot_annotation(
    title = "Q-Q Plots: Visual Assessment Stays Constant",
    subtitle = "The departure from normality is visually identical across all three sample sizes.",
    theme = theme(plot.title = element_text(face = "bold", size = 13))
  )
```

```{r}
#| label: fig-heavy-tails
#| fig-cap: "Student's t-distribution (df=3, heavy-tailed) with varying sample sizes. Shapiro-Wilk fails to reject normality at small n, even though the data are genuinely non-normal."

set.seed(2024)

# Generate TRULY non-normal data: t-distribution with df=3 (heavy tails)
heavy_tails <- rt(10000, df = 3)

# Small samples from heavy-tailed distribution
shapiro_heavy <- tibble(
  sample_size = c(15, 30, 60, 150, 300),
  data = map(sample_size, ~ sample(heavy_tails, size = .x, replace = FALSE)),
  p_value = map_dbl(data, ~ shapiro.test(.x)$p.value),
  conclusion = map_chr(p_value, ~ ifelse(.x < 0.05, "Reject normality ❌", "Cannot reject ✓"))
)

shapiro_heavy |>
  select(sample_size, p_value, conclusion) |>
  mutate(p_value = round(p_value, 4)) |>
  gt() |>
  tab_header(
    title = "Shapiro-Wilk on GENUINELY Non-Normal Data (t₃ distribution)",
    subtitle = "Heavy tails: truly non-normal"
  ) |>
  tab_style(
    style = cell_fill(color = "#FADBD8"),
    locations = cells_body(rows = grepl("Reject", conclusion))
  ) |>
  tab_style(
    style = cell_fill(color = "#FEF9E7"),
    locations = cells_body(rows = grepl("Cannot", conclusion))
  ) |>
  cols_label(sample_size = "n", p_value = "p-value", conclusion = "Test Decision")

# Visualize (subtitles reuse sw_label() from the fig-qq-plots chunk)
sample_15_heavy <- sample(heavy_tails, 15)
sample_300_heavy <- sample(heavy_tails, 300)

p_heavy_15 <- ggplot(tibble(x = sample_15_heavy), aes(sample = x)) +
  stat_qq(color = "#F39C12", size = 1.5) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 15 (HEAVY TAILS!)",
    subtitle = sw_label(sample_15_heavy),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  ylim(-6, 6) +
  coord_equal()

p_heavy_300 <- ggplot(tibble(x = sample_300_heavy), aes(sample = x)) +
  stat_qq(color = "#C0392B", size = 0.5, alpha = 0.6) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 300 (SAME DISTRIBUTION)",
    subtitle = sw_label(sample_300_heavy),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  ylim(-6, 6) +
  coord_equal()

p_heavy_15 + p_heavy_300 +
  plot_annotation(
    title = "The Other Direction: Small n Misses Real Non-Normality",
    theme = theme(plot.title = element_text(face = "bold", size = 13))
  )
```