Shapiro-Wilk’s Sample Size Trap

Why p-values ≠ Practical Normality Assessment

Author

Timothy Achala

Published

June 12, 2026

The Problem

The outcome of a Shapiro-Wilk test depends heavily on sample size, not only on how far the data depart from normality. With large n, it flags trivial deviations; with small n, it can miss real departures. Researchers who read the p-value as a measure of normality can therefore reach the wrong conclusion about whether parametric tests are appropriate.


1. Demonstration: Same Data, Different n

Scenario A: Mild Right Skew (Beta distribution, α=2, β=5)

This shape is typical of clinical measurements such as patient recovery times or biomarker concentrations.

Code
library(tidyverse)
library(patchwork)
library(gt)
library(ggdist)

set.seed(2024)

# Generate mildly skewed data (Beta: α=2, β=5)
# This represents realistic non-normal but "acceptable" clinical data
mild_skew <- rbeta(10000, shape1 = 2, shape2 = 5) * 100

# Sample at different sizes
sample_sizes <- c(20, 50, 100, 500, 1000, 5000)

shapiro_results <- tibble(
  sample_size = sample_sizes,
  data = map(sample_sizes, ~sample(mild_skew, size = .x, replace = FALSE)),
  shapiro_stat = map_dbl(data, ~shapiro.test(.x)$statistic),
  p_value = map_dbl(data, ~shapiro.test(.x)$p.value),
  conclusion = map_chr(p_value, ~ifelse(.x < 0.05, "Reject normality ❌", "Cannot reject ✓"))
)

shapiro_results |>
  select(sample_size, shapiro_stat, p_value, conclusion) |>
  mutate(
    p_value = ifelse(p_value < 0.001, "< 0.001", round(p_value, 4))
  ) |>
  gt() |>
  tab_header(
    title = "Shapiro-Wilk Results: Identical Mildly-Skewed Data",
    subtitle = "Only sample size varies"
  ) |>
  tab_style(
    style = cell_fill(color = "#FADBD8"),
    locations = cells_body(rows = grepl("Reject", conclusion))
  ) |>
  tab_style(
    style = cell_fill(color = "#D5F5E3"),
    locations = cells_body(rows = grepl("Cannot", conclusion))
  ) |>
  cols_label(
    sample_size = "n",
    shapiro_stat = "W statistic",
    p_value = "p-value",
    conclusion = "Test Decision"
  )

Shapiro-Wilk Results: Identical Mildly-Skewed Data
Only sample size varies

| n | W statistic | p-value | Test Decision |
|---|---|---|---|
| 20 | 0.9487295 | 0.3482 | Cannot reject ✓ |
| 50 | 0.9700771 | 0.2331 | Cannot reject ✓ |
| 100 | 0.9814155 | 0.1711 | Cannot reject ✓ |
| 500 | 0.9682310 | < 0.001 | Reject normality ❌ |
| 1000 | 0.9563790 | < 0.001 | Reject normality ❌ |
| 5000 | 0.9652335 | < 0.001 | Reject normality ❌ |

What happened?

  • n = 20: p = 0.35 → “data are normal” ✓
  • n = 500: p < 0.001 → “data are NOT normal” ❌
  • n = 5000: p < 0.001 → “data are definitively NOT normal” ❌

Same population. Same distribution. Only the sample size changed.

The Shapiro-Wilk test became more sensitive as n grew: the mild skew was always there, but only the larger samples give the test enough power to detect it. The shrinking p-value reflects statistical power, not a change in how non-normal the data are.
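
One way to see that this is a power effect is to repeat the experiment many times at each sample size and record how often the test rejects. The following is a minimal sketch in base R, assuming fresh draws from Beta(2, 5) rather than from the `mild_skew` pool; the helper name `rejection_rate`, the seed, and the number of repetitions are illustrative choices, not part of the original analysis.

```r
# Estimate how often Shapiro-Wilk rejects normality (alpha = 0.05)
# for samples of size n drawn from the same mildly skewed Beta(2, 5).
set.seed(1)  # arbitrary seed, for reproducibility only

rejection_rate <- function(n, reps = 500) {
  rejected <- replicate(
    reps,
    shapiro.test(rbeta(n, shape1 = 2, shape2 = 5))$p.value < 0.05
  )
  mean(rejected)
}

sizes <- c(20, 50, 100, 500, 1000)
power_by_n <- data.frame(
  n = sizes,
  rejection_rate = vapply(sizes, rejection_rate, numeric(1))
)
print(power_by_n)
```

The distribution never changes inside this loop, so any rise in the rejection rate with n comes from the test's power alone.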


2. Visual Proof: Q-Q Plots Tell the Real Story

Code
# Reuse the exact samples tested in the table above so the plot
# subtitles report the same p-values as the table
get_sample <- function(n) {
  shapiro_results$data[[match(n, shapiro_results$sample_size)]]
}

# Compute the subtitle from the plotted sample instead of hardcoding it
sw_label <- function(x) {
  p <- shapiro.test(x)$p.value
  if (p < 0.001) "Shapiro-Wilk p < 0.001" else sprintf("Shapiro-Wilk p = %.2f", p)
}

# Build one Q-Q panel. coord_equal() is omitted on purpose: the data span
# roughly 0-100 while the theoretical quantiles span only a few units,
# so a 1:1 aspect ratio would squash each panel into a narrow strip.
qq_panel <- function(x, color, size, alpha = 1) {
  ggplot(tibble(x = x), aes(sample = x)) +
    stat_qq(color = color, size = size, alpha = alpha) +
    stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
    labs(
      title = paste("n =", format(length(x), big.mark = ",")),
      subtitle = sw_label(x),
      x = "Theoretical Quantiles", y = "Sample Quantiles"
    ) +
    theme_minimal(base_size = 11)
}

# Smaller, more transparent points at larger n to limit overplotting
p_qq_20   <- qq_panel(get_sample(20),   "#2E86AB", size = 1.2)
p_qq_500  <- qq_panel(get_sample(500),  "#E84855", size = 0.8, alpha = 0.6)
p_qq_5000 <- qq_panel(get_sample(5000), "#A23B72", size = 0.4, alpha = 0.4)

p_qq_20 + p_qq_500 + p_qq_5000 +
  plot_annotation(
    title = "Q-Q Plots: Visual Assessment Stays Constant",
    subtitle = "The departure from normality is visually identical across all three sample sizes.",
    theme = theme(plot.title = element_text(face = "bold", size = 13))
  )

Figure 1: Q-Q plots (the only visual that matters) show the same mild deviation from the line across all sample sizes. The Shapiro-Wilk p-value changes dramatically, but the visual pattern is identical.

Key observation: The Q-Q plots show the same pattern. The mild curvature at the upper tail is present in all three. But the Shapiro-Wilk p-values swing from 0.35 at n = 20 to < 0.001 at n = 5,000.

This is not the data becoming “more non-normal.” It’s the test gaining statistical power to detect the non-normality that was always present.
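
A size-of-departure measure makes the same point numerically. The sketch below, assuming fresh draws from Beta(2, 5), reports the sample skewness and the W statistic next to the p-value. `sample_skewness` and `summarise_departure` are hypothetical helpers written for this illustration (a simple moment-based estimator), not functions from any package loaded above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Moment-based sample skewness: third central moment divided by the
# cubed (population-style) standard deviation
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Draw one sample of size n and summarise its departure from normality
summarise_departure <- function(n) {
  x <- rbeta(n, shape1 = 2, shape2 = 5)
  sw <- shapiro.test(x)
  data.frame(
    n = n,
    skewness = sample_skewness(x),
    W = unname(sw$statistic),
    p_value = sw$p.value
  )
}

departure <- do.call(rbind, lapply(c(20, 500, 5000), summarise_departure))
print(departure)
```

The theoretical skewness of Beta(2, 5) is about 0.6. The estimate is noisy at n = 20 but settles near that value as n grows, while the p-value shrinks toward zero: the size of the departure stays put and only the evidence for it accumulates.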


3. The Opposite Problem: Small n and Hidden Non-Normality

Now flip it. What happens when n is too small to detect real non-normality?

Code
set.seed(2024)

# Generate TRULY non-normal data: t-distribution with df=3 (heavy tails)
heavy_tails <- rt(10000, df = 3)

# Small samples from heavy-tailed distribution
shapiro_heavy <- tibble(
  sample_size = c(15, 30, 60, 150, 300),
  data = map(sample_size, ~sample(heavy_tails, size = .x, replace = FALSE)),
  p_value = map_dbl(data, ~shapiro.test(.x)$p.value),
  conclusion = map_chr(p_value, ~ifelse(.x < 0.05, "Reject normality ❌", "Cannot reject ✓"))
)

shapiro_heavy |>
  select(sample_size, p_value, conclusion) |>
  mutate(p_value = round(p_value, 4)) |>
  gt() |>
  tab_header(
    title = "Shapiro-Wilk on GENUINELY Non-Normal Data (t₃ distribution)",
    subtitle = "Heavy tails: truly non-normal"
  ) |>
  tab_style(
    style = cell_fill(color = "#FADBD8"),
    locations = cells_body(rows = grepl("Reject", conclusion))
  ) |>
  tab_style(
    style = cell_fill(color = "#FEF9E7"),
    locations = cells_body(rows = grepl("Cannot", conclusion))
  ) |>
  cols_label(sample_size = "n", p_value = "p-value", conclusion = "Test Decision")

Shapiro-Wilk on GENUINELY Non-Normal Data (t₃ distribution)
Heavy tails: truly non-normal

| n | p-value | Test Decision |
|---|---|---|
| 15 | 0.9921 | Cannot reject ✓ |
| 30 | 0.0004 | Reject normality ❌ |
| 60 | 0.0002 | Reject normality ❌ |
| 150 | 0.0000 | Reject normality ❌ |
| 300 | 0.0000 | Reject normality ❌ |

Figure 2: Shapiro-Wilk p-values for samples of increasing size from Student’s t-distribution (df = 3, heavy-tailed). The test fails to reject normality at n = 15, even though the data are genuinely non-normal.
Code
# Visualize the same n = 15 and n = 300 samples that were tested in the table,
# and compute the subtitle p-values instead of hardcoding them
sample_15_heavy  <- shapiro_heavy$data[[match(15, shapiro_heavy$sample_size)]]
sample_300_heavy <- shapiro_heavy$data[[match(300, shapiro_heavy$sample_size)]]
p_15  <- shapiro.test(sample_15_heavy)$p.value
p_300 <- shapiro.test(sample_300_heavy)$p.value

# Note: ylim() drops any points beyond +/-6 from the display
p_heavy_15 <- ggplot(tibble(x = sample_15_heavy), aes(sample = x)) +
  stat_qq(color = "#F39C12", size = 1.5) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 15 (HEAVY TAILS!)",
    subtitle = sprintf("Shapiro-Wilk p = %.2f (misleading: looks normal)", p_15),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  ylim(-6, 6) +
  coord_equal()

p_heavy_300 <- ggplot(tibble(x = sample_300_heavy), aes(sample = x)) +
  stat_qq(color = "#C0392B", size = 0.5, alpha = 0.6) +
  stat_qq_line(color = "black", linetype = 2, linewidth = 0.8) +
  labs(
    title = "n = 300 (SAME DISTRIBUTION)",
    subtitle = paste0(
      "Shapiro-Wilk p ",
      ifelse(p_300 < 0.001, "< 0.001", sprintf("= %.3f", p_300)),
      " (now detects it)"
    ),
    x = "Theoretical Quantiles", y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 11) +
  ylim(-6, 6) +
  coord_equal()

p_heavy_15 + p_heavy_300 +
  plot_annotation(
    title = "The Other Direction: Small n Misses Real Non-Normality",
    theme = theme(plot.title = element_text(face = "bold", size = 13))
  )

Figure 3: Student’s t-distribution (df=3, heavy-tailed) with varying sample sizes. Shapiro-Wilk fails to reject normality at small n, even though the data are genuinely non-normal.

What just happened:

  • n = 15: Shapiro-Wilk p = 0.99 → “OK, assume normality” ✓ (FALSE NEGATIVE)
  • n = 300: Shapiro-Wilk p < 0.001 → “Reject normality” ❌ (CORRECT)

The Q-Q plots in both cases show obvious heavy tails (points curve away sharply at both ends). At n=15, Shapiro-Wilk lacks power to detect this departure. Researchers with small studies would incorrectly assume their data are “normal enough” for a t-test.
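
How often this miss happens can be estimated by simulation. This is a rough sketch, assuming direct draws from a t-distribution with df = 3; `miss_rate`, the seed, and the repetition count are illustrative and not part of the original analysis.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Share of samples from t(df = 3) for which Shapiro-Wilk fails to reject
# normality at alpha = 0.05 (a "miss", since t(3) is not normal)
miss_rate <- function(n, reps = 1000) {
  mean(replicate(reps, shapiro.test(rt(n, df = 3))$p.value >= 0.05))
}

miss_15  <- miss_rate(15)
miss_300 <- miss_rate(300)

cat(sprintf("n = 15:  fails to reject in %.0f%% of samples\n", 100 * miss_15))
cat(sprintf("n = 300: fails to reject in %.0f%% of samples\n", 100 * miss_300))
```

A high miss rate at n = 15 means a non-significant result from a small study says little about whether the underlying distribution is normal.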


4. Practical Decision Framework

Code
cat("
| Scenario | Shapiro-Wilk p | Q-Q Plot Says | Recommended Action |
|---|---|---|---|
| **Mild skew, large n (n=500)** | < 0.001 ❌ | \"Mild deviation at tails\" | Use Welch's t-test or permutation test; don't panic |
| **Mild skew, small n (n=20)** | 0.35 ✓ | \"Mild deviation at tails\" | Same distribution, same decision as above; don't trust Shapiro-Wilk |
| **Heavy tails, large n (n=300)** | < 0.001 ❌ | \"Obvious outliers/spread\" | Use Yuen's trimmed-mean or robust methods |
| **Heavy tails, small n (n=15)** | 0.99 ✓ | \"Obvious outliers/spread\" | Visually obvious; use robust methods despite p = 0.99 |

### The Key Insight

**Never rely on Shapiro-Wilk p-value alone.** Always:

1. Look at the **Q-Q plot** — this shows the actual deviation
2. Check the **subject-matter context** — is skewness expected? (e.g., recovery times, biomarker concentrations are often right-skewed by nature)
3. Decide based on **practical importance**, not statistical significance of a normality test

A mild deviation from normality that's visually clear in the Q-Q plot is the *same departure* regardless of whether Shapiro-Wilk p = 0.35 or p < 0.001.
")

| Scenario | Shapiro-Wilk p | Q-Q Plot Says | Recommended Action |
|---|---|---|---|
| Mild skew, large n (n=500) | < 0.001 ❌ | “Mild deviation at tails” | Use Welch’s t-test or permutation test; don’t panic |
| Mild skew, small n (n=20) | 0.35 ✓ | “Mild deviation at tails” | Same distribution, same decision as above; don’t trust Shapiro-Wilk |
| Heavy tails, large n (n=300) | < 0.001 ❌ | “Obvious outliers/spread” | Use Yuen’s trimmed-mean or robust methods |
| Heavy tails, small n (n=15) | 0.99 ✓ | “Obvious outliers/spread” | Visually obvious; use robust methods despite p = 0.99 |

The Key Insight

Never rely on Shapiro-Wilk p-value alone. Always:

  1. Look at the Q-Q plot — this shows the actual deviation
  2. Check the subject-matter context — is skewness expected? (e.g., recovery times, biomarker concentrations are often right-skewed by nature)
  3. Decide based on practical importance, not statistical significance of a normality test

A mild deviation from normality that’s visually clear in the Q-Q plot is the same departure regardless of whether Shapiro-Wilk p = 0.35 or p < 0.001.


5. Bottom Line: When to Use Parametric Tests

You can safely use a t-test (or ANOVA) when:

  • The Q-Q plot shows the data hugging the diagonal line, with only minor deviation at the tails
  • Your sample size is reasonably large (n > 30 is a common rule of thumb; the CLT makes tests of the mean robust to mild non-normality)
  • There are no extreme outliers or heavy tails
  • You are judging normality from the Q-Q plot, not from the Shapiro-Wilk p-value
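
The CLT point in the list above can be checked by simulation. The following is a minimal sketch, assuming a one-sample t-test of the true mean on Beta(2, 5) data scaled by 100 as in the earlier examples; `type1_rate` is a hypothetical helper, and the seed and repetition count are arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

true_mean <- 100 * 2 / (2 + 5)  # mean of 100 * Beta(2, 5)

# Fraction of t-tests that reject a TRUE null hypothesis at alpha = 0.05
type1_rate <- function(n, reps = 2000) {
  mean(replicate(reps, {
    x <- 100 * rbeta(n, shape1 = 2, shape2 = 5)
    t.test(x, mu = true_mean)$p.value < 0.05
  }))
}

rate_30 <- type1_rate(30)
cat(sprintf("Type I error at n = 30: %.3f (nominal 0.05)\n", rate_30))
```

A rate close to 0.05 suggests the mild skew is not distorting the t-test at this sample size. Heavier tails or stronger skew may need a larger n before the same holds.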

If the Q-Q plot looks questionable:

  • Use Welch’s t-test (handles unequal variances and light non-normality)
  • Use Yuen’s trimmed-mean t-test (robust to outliers)
  • Use permutation test (distribution-free, still tests the mean)
  • Reserve Mann-Whitney/Brunner-Munzel for when you genuinely want stochastic superiority, not as a “non-normal fallback”
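
Two of the alternatives listed above need nothing beyond base R. This is a sketch under stated assumptions: the two groups are simulated exponential data with an illustrative shift of 1 unit, and `perm_test_mean_diff` is a hypothetical helper written for this example. Yuen's trimmed-mean test is not in base R; add-on packages such as WRS2 provide an implementation.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Two right-skewed groups; group b is shifted up by 1 unit (illustrative data)
a <- rexp(50, rate = 1)
b <- rexp(50, rate = 1) + 1

# Welch's t-test: t.test() uses var.equal = FALSE by default
welch_p <- t.test(a, b)$p.value

# Permutation test on the difference in means: reshuffle group labels
# and see how often a difference at least as large arises by chance
perm_test_mean_diff <- function(x, y, n_perm = 5000) {
  observed <- mean(x) - mean(y)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_diffs <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  # Two-sided p-value; the +1 counts the observed labelling itself
  (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
}

perm_p <- perm_test_mean_diff(a, b)
cat(sprintf("Welch p = %.4g, permutation p = %.4g\n", welch_p, perm_p))
```

Both tests target the difference in means, so they answer the same question a standard t-test would, without leaning on a normality pre-test.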