Sample Size, Effect Size, and Magnitude of Effect

Alex Crisp

Learning goals

  1. Explain the logic behind null hypothesis significance testing (NHST) and the meaning of the p-value
  2. Distinguish between Type I and Type II errors and understand their relationship with statistical power
  3. Describe how sample size affects the ability to detect effects and the precision of estimates
  4. Differentiate between effect size and magnitude of effect
  5. Recognize the key components needed for a priori sample size calculation

Initial Disclaimer


Statistics can be divided into two main branches:

  • Frequentist Statistics: based on frequencies of events in repeated experiments
  • Bayesian Statistics: incorporates prior knowledge and updates with new data

The focus of this lecture will be the frequentist approach, which is by far the most commonly used in health sciences and sport sciences.

A Common Feature…


If you’ve ever read a scientific paper, you’ve probably encountered:
- The famous p-value
- The expression “statistically significant”
But what do these really mean?


To understand sample size and effect size, we first need to understand the logic behind the p-value.

Brief Historical Context


Ronald Fisher (1890-1962)

  • Proposed the p-value as a measure of evidence against the null hypothesis
  • More flexible and exploratory approach


Jerzy Neyman & Egon Pearson (1930s)

  • Formalized the hypothesis testing framework
  • Introduced concepts of Type I (α) and Type II (β) errors
  • Focus on decision-making


The most widely used model today, NHST (Null Hypothesis Significance Testing), is a fusion of these two approaches.
Fisher saw p as a continuous measure of evidence;
Neyman and Pearson used it as the trigger for a binary decision.


The Logic Behind Statistical Tests


Example: Compare handgrip strength between:

  • Physically active group
  • Physically inactive group


The central question is: Is the observed difference between groups real or just random variation?

The t-test: A Simple Logic


The independent t-test formula:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} \]


Where:

  • \(\bar{x}_1 - \bar{x}_2\) = difference between group means
  • \(SE(\bar{x}_1 - \bar{x}_2)\) = standard error of the mean difference

In essence: signal divided by noise

Where does sample size enter?


The standard error depends on variability and sample size.
For the independent t-test (equal-variance version):

\[ SE(\bar{x}_1 - \bar{x}_2)= s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]


  • Larger \(n_1, n_2\) \(\rightarrow\) smaller \(SE\)
  • Smaller \(SE\) \(\rightarrow\) larger \(|t|\) (if the same mean difference is observed)
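
A minimal sketch of the two formulas above, assuming summary statistics are available. The function names are illustrative, and the handgrip numbers (42 kg vs. 37 kg, SD 8 kg) are borrowed from Case 1 later in this lecture. It keeps the mean difference fixed and varies only n.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation s_p for two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def t_statistic(mean1, mean2, sd1, sd2, n1, n2):
    """Equal-variance independent t statistic: mean difference divided by its SE."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    se = sp * math.sqrt(1 / n1 + 1 / n2)  # standard error of the mean difference
    return (mean1 - mean2) / se

# Same 5 kg difference and same SD; only the sample size changes
for n in (10, 40, 160):
    t = t_statistic(42, 37, 8, 8, n, n)
    print(f"n per group = {n:3d}   t = {t:.2f}")
```

Because the SE shrinks with the square root of n, quadrupling the sample size doubles t for the same observed difference.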

The t-test: Interpretation


  • When the difference is large relative to the error → high t-value
  • The t-value, along with degrees of freedom, determines the p-value

t-Distribution: Signal vs. Noise

Effect of Degrees of Freedom on the t-Distribution

The NHST Framework


NHST has a simple and objective framework:

  1. Formulate null (H₀) and alternative (H₁) hypotheses
  2. Define significance level (α, usually 0.05)
  3. Collect data and calculate test statistic
  4. Compare p-value with α
  5. Make decision: reject or fail to reject H₀

This simplicity explains its popularity. But there’s a catch…

What is the p-value?


Formal definition:

The p-value is the probability of observing a result as extreme or more extreme than the one observed, assuming the null hypothesis is true.
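
One way to make this definition concrete is a permutation test, used here as a stand-in for the t-test and not as the method of this lecture. If H₀ is true, the group labels are arbitrary, so we shuffle them many times and count how often the shuffled difference is at least as extreme as the observed one. The data below are made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=1):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel participants at random (H0: labels do not matter)
        if abs_mean_diff(pooled) >= observed:
            as_extreme += 1
    # The +1 counts the observed labelling as one of the possible labellings
    return (as_extreme + 1) / (n_permutations + 1)

# Hypothetical handgrip strength values (kg)
active = [44, 39, 47, 41, 50, 38, 45, 43]
inactive = [36, 40, 33, 38, 35, 42, 34, 37]
print(permutation_p_value(active, inactive))
```

The returned value is literally the proportion of "worlds where H₀ is true" that produce a difference as extreme as the one observed.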


Even experienced researchers misinterpret this definition.

Let’s see what the p-value is NOT.

Common p-value Misinterpretations


❌ “The p-value is the probability that the null hypothesis is true”


❌ “1 − p is the probability that the alternative hypothesis is true”


❌ “A p < 0.05 means the result is clinically important”


❌ “A p = 0.049 is very different from p = 0.051”


✅ “The p-value tells us how surprising the data are, assuming H₀ is true — nothing more.”


The p-value depends on sample size and variability. It is not a direct measure of effect size!
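
A small sketch of this point, assuming a large-sample normal approximation in place of the t-distribution and a hypothetical 1 kg difference with SD = 8 kg. The effect is identical in every row; only n changes.

```python
import math
from statistics import NormalDist

def approx_two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a two-group comparison (normal approximation)."""
    se = sd * math.sqrt(2 / n_per_group)  # SE of the difference with equal n and equal SD
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny effect (d = 1/8 = 0.125), increasingly large samples
for n in (20, 200, 2000):
    print(f"n per group = {n:4d}   p = {approx_two_sided_p(1, 8, n):.5f}")
```

A trivial difference becomes "statistically significant" once the sample is large enough, which is why p cannot stand in for effect size.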

Popper, falsification, and the null hypothesis


Core idea (Popper): A scientific claim is stronger when it can be falsified by data.
Science progresses by attempting to refute claims, not by “proving” them true.


Bridge to NHST:
In hypothesis testing, we start with a default model:

  • Null hypothesis (H₀): “no effect” or “no difference” (e.g., Active = Inactive)
  • We ask: Are the data unusually incompatible with H₀?
  • If the evidence is strong, we reject H₀ (a controlled attempt at falsification)


NHST does not test whether the null hypothesis is true.
It evaluates how compatible the data are with a model that assumes H₀.


Situation 1: Rejection


Hypothesis: “This bag contains only ping-pong balls”
I open the bag and observe in the first layer:

🥚 An egg!

Conclusion: I can reject the hypothesis immediately.

Situation 2: Failure to Reject


I open the bag and observe in the first layer:

🏓🏓🏓 Only ping-pong balls

But… what about the bottom of the bag?

Conclusion: I cannot be certain without inspecting all the balls. I can only “fail to reject” the hypothesis within the available level of evidence.

Rejecting H₀ does not mean H₁ is proven —
only that the data are incompatible with H₀.

Analogy: The Trial


To understand the logic of rejection in NHST, imagine a trial:

Null hypothesis (H₀): My friend is innocent (did not steal my dessert from the fridge)


Alternative hypothesis (H₁): My friend is guilty

Scenario 1: Weak Evidence


A witness reports seeing someone taking the dessert, but:

  • The suspect had their back turned
  • The location was dark
  • The witness is not certain


Conclusion: The evidence does not reach the level needed to reject innocence.


“Failing to reject H₀” ≠ “H₀ is true”

Scenario 2: Strong Evidence


A camera inside the fridge captured:

  • The exact moment the suspect takes the dessert
  • The face clearly identifiable
  • Forensics confirm: authentic video, no manipulation


Conclusion: The level of evidence is very high.

Type I and Type II Errors


The fundamental point of NHST is error control:

Decision          | H₀ is true       | H₀ is false
Reject H₀         | Type I Error (α) | ✓ Correct
Fail to reject H₀ | ✓ Correct        | Type II Error (β)

Examples




Type I Error (False Positive): It is not raining, but you bring an umbrella anyway ☂️

Type II Error (False Negative): It is raining, but you leave your umbrella at home 🌧️

Multiple Comparisons and Type I Error

Every time you run a statistical test at α = 0.05, you accept a 5% chance of a false positive whenever H₀ is actually true.


What happens when you run multiple tests?


Tests (k) | Probability of ≥ 1 false positive
1         | 5.0%
3         | 14.3%
5         | 22.6%
10        | 40.1%
20        | 64.2%



With 20 independent tests at α = 0.05, you have a 64% chance of at least one false positive when all null hypotheses are true.
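
The table can be reproduced with one line of arithmetic, assuming the k tests are independent and every null hypothesis is true: the chance of no false positive in k tests is (1 − α)^k. The function name is illustrative.

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 5, 10, 20):
    print(f"k = {k:2d}   P(at least one false positive) = {familywise_error(k):.1%}")
```

With correlated tests the exact numbers differ, but the direction is the same: more tests, more chances of a false alarm.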

Statistical Power


Power = 1 − β


- Power is the probability of rejecting H₀ when H₀ is false
- Low power = high risk of “failing to detect” real effects

Power depends on: sample size, effect size, variability, and α
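
A rough sketch of how these ingredients combine for a two-group comparison, assuming equal group sizes and a normal approximation to the t-test (so values will be slightly optimistic for small n). Variability enters through the standardized effect d. The function name is an assumption for illustration.

```python
import math
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for standardized effect d."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)   # rejection threshold under H0
    shift = d * math.sqrt(n_per_group / 2) # expected test statistic under H1
    # Probability that the statistic lands in either rejection region
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

for d in (0.2, 0.5, 0.8):
    for n in (20, 64, 200):
        print(f"d = {d}   n per group = {n:3d}   power ≈ {approx_power(d, n):.2f}")
```

Setting d = 0 returns α, which is a useful sanity check: with no true effect, every rejection is a Type I error.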

The Role of Sample Size


One of the most important ways a researcher can control the balance between errors is to plan sample size a priori.

The calculation considers:

  • Type I error (α)
  • Desired power (1 − β)
  • Expected effect size
  • Variability
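
For a two-group comparison with equal group sizes, these four ingredients combine in a standard large-sample approximation (a sketch, not the exact t-based calculation that power software performs). Here \(\Delta\) is the expected mean difference, \(\sigma\) the common standard deviation, and \(d = \Delta / \sigma\):

\[ n_{\text{per group}} \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\Delta^2} = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2} \]

With \(\alpha = 0.05\) (two-tailed) and power = 0.80, \(z_{1-\alpha/2} \approx 1.96\) and \(z_{1-\beta} \approx 0.84\), so \(n_{\text{per group}} \approx 15.7 / d^2\). Halving the expected effect size roughly quadruples the required sample.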

Effect Size: Cohen’s d


A complement to the p-value (and a requirement of many journals) is to report the effect size.

Cohen’s d:

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{SD_{pooled}} \]


Conventional interpretation (heuristic):

  • d ≈ 0.2 → Small
  • d ≈ 0.5 → Medium
  • d ≈ 0.8 → Large


⚠️ These conventions are context-dependent!
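
A minimal sketch of the formula above computed from raw data. The pooled SD here weights each group's variance by its degrees of freedom, which is one common convention and an assumption of this example.

```python
import math
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample SDs (n - 1 denominator)
    sd_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / sd_pooled

# Hypothetical handgrip strength values (kg)
active = [44, 39, 47, 41, 50, 38, 45, 43]
inactive = [36, 40, 33, 38, 35, 42, 34, 37]
print(round(cohens_d(active, inactive), 2))
```

Note that d carries no units: the same function applies whether the outcome is in kg, mmHg, or seconds.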

What Does Cohen’s d Look Like?


Standardized vs. Unstandardized Effects


Standardized effect (Cohen’s d):

  • Useful for comparing across studies
  • Independent of measurement unit


Unstandardized effect (original units):

  • E.g., “5 kg difference in handgrip strength”
  • More clinically interpretable


Both are important — use what makes most sense for your question!

  • Unstandardized: Active group had 8 kg higher handgrip strength

  • Standardized: d = 0.82 (large effect)

Effect Size vs. Magnitude of Effect


It is important to distinguish two concepts:

Effect Size:

  • Quantitative estimate of the effect
  • E.g., mean difference, ratio, correlation, Cohen’s d


Magnitude of Effect:

  • Substantive interpretation — is the effect practically relevant?
  • Considers: context, costs, risks, benefits, applicability

Statistical significance ≠ Clinical relevance!



Effect Size Measures: Same Logic, Different Scales

Many standardized effect size measures share the same fundamental logic:

\[\text{Effect Size} = \frac{\text{signal}}{\text{noise}}\]


Based on means and standard deviations: continuous outcomes, group comparisons

Measure            | Formula                                 | Used with
Cohen’s d          | \(\frac{\bar{x}_1 - \bar{x}_2}{s_p}\)   | t-test (two groups)
Hedges’ g          | \(d \times \text{correction}\)          | t-test (small samples)
Cohen’s f          | \(\frac{\sigma_m}{\sigma}\)             | ANOVA (3+ groups)
η² (eta-squared)   | \(SS_{effect} / SS_{total}\)            | ANOVA
ω² (omega-squared) | Less biased version of η²               | ANOVA (preferred)
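
Power calculators differ in whether they ask for η² or Cohen's f, so it helps to be able to move between the two. The sketch below uses the standard conversion f = √(η² / (1 − η²)); the sums of squares in the example are hypothetical.

```python
import math

def eta_squared(ss_effect, ss_total):
    """Proportion of total variance attributed to the effect."""
    return ss_effect / ss_total

def eta_squared_to_f(eta2):
    """Convert eta-squared to Cohen's f."""
    return math.sqrt(eta2 / (1 - eta2))

def f_to_eta_squared(f):
    """Convert Cohen's f back to eta-squared."""
    return f**2 / (1 + f**2)

print(eta_squared(12, 200))  # hypothetical ANOVA table: SS_effect = 12, SS_total = 200
for eta2 in (0.06, 0.14, 0.26):
    print(f"eta² = {eta2:.2f}   f ≈ {eta_squared_to_f(eta2):.2f}")
```

The three η² values in the loop are the ones used in the ANOVA practice cases later in this lecture.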


Based on correlation coefficients: strength of association, variance explained

Measure     | Formula                                                                                   | Used with
r (Pearson) | \(\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}\)         | Correlation between two continuous variables
r²          | \(r \times r\)                                                                            | Bivariate correlation (1 predictor)
R²          | \(1 - SS_{residual}/SS_{total}\)                                                          | Multiple regression (1+ predictors)
φ (phi)     | \(\sqrt{\chi^2 / n}\)                                                                     | Chi-square (2×2 tables)
Cramér’s V  | \(\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\)                                              | Chi-square (larger tables)
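
As an illustration of the last two rows, the sketch below computes the Pearson chi-square statistic (without continuity correction, an assumption of this example) from a table of counts and converts it to Cramér's V. For a 2×2 table, min(r − 1, c − 1) = 1, so V equals φ. The counts are invented.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total n for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér's V; reduces to phi for a 2x2 table."""
    chi2, n = chi_square(table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * k))

# Hypothetical 2x2 table: rows = active / inactive, columns = outcome yes / no
table = [[30, 20],
         [10, 40]]
print(round(cramers_v(table), 3))
```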


Based on events and non-events in two groups: binary outcomes, risk

Measure         | Formula                                  | Reference | Used with
RR (Risk Ratio) | \(p_1 / p_2\)                            | RR = 1    | Cohort studies, RCTs
OR (Odds Ratio) | \(\frac{p_1/(1-p_1)}{p_2/(1-p_2)}\)      | OR = 1    | Case-control, logistic regression
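
A short sketch of both formulas from event counts. The counts are hypothetical and the function names are illustrative.

```python
def risk_ratio(events1, n1, events2, n2):
    """Risk in group 1 divided by risk in group 2."""
    return (events1 / n1) / (events2 / n2)

def odds_ratio(events1, n1, events2, n2):
    """Odds in group 1 divided by odds in group 2."""
    p1, p2 = events1 / n1, events2 / n2
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Hypothetical cohort: 30 of 100 events in group 1, 15 of 100 in group 2
print(f"RR = {risk_ratio(30, 100, 15, 100):.2f}")
print(f"OR = {odds_ratio(30, 100, 15, 100):.2f}")
```

When the event is common, as in this example, the OR sits further from 1 than the RR, so the two should not be read interchangeably.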



Criticisms of NHST


P-hacking:
- Manipulating analyses until p < 0.05
- “Torturing the data until it confesses”

Arbitrary dichotomization:
- p = 0.049 → “significant” ✓
- p = 0.051 → “not significant” ✗

Excessive focus on p-value:
- At the expense of effect magnitude and precision

HARKing (Hypothesizing After Results are Known):
- Presenting post-hoc hypotheses as if they were a priori


Cherry Picking:
- Reporting only outcomes or subgroups that reached p < 0.05


File Drawer Problem:
- Null results are rarely published
- The published literature systematically overestimates effect sizes
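
The last point can be checked with a small simulation, sketched here under stated assumptions: a true effect of d = 0.2, 20 participants per group, normally distributed outcomes, and a simple |z| > 1.96 rule standing in for the t-test. "Published" means "reached significance". All names and settings are illustrative.

```python
import math
import random
from statistics import mean, stdev

def simulate_published_effects(true_d=0.2, n_per_group=20, n_studies=2000, seed=7):
    """Return (mean d over all studies, mean d over 'significant' studies only)."""
    rng = random.Random(seed)
    all_d, significant_d = [], []
    for _ in range(n_studies):
        g1 = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        g2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        sd_pooled = math.sqrt((stdev(g1) ** 2 + stdev(g2) ** 2) / 2)
        d = (mean(g1) - mean(g2)) / sd_pooled
        z = d * math.sqrt(n_per_group / 2)  # test statistic, treated as roughly normal
        all_d.append(d)
        if abs(z) > 1.96:                   # only these studies leave the file drawer
            significant_d.append(d)
    return mean(all_d), mean(significant_d)

mean_all, mean_published = simulate_published_effects()
print(f"true d = 0.2   all studies: {mean_all:.2f}   significant only: {mean_published:.2f}")
```

In a small study, only unusually large observed effects cross the significance threshold, so averaging the "published" studies alone inflates the estimate.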

Beyond p-values

Major journals (Nature, JAMA) increasingly require reporting of effect sizes and confidence intervals — not just p-values.

Why?

  • p-value tells you how surprising the data would be if there were no effect
  • Effect size tells you how big it is
  • Confidence interval tells you how precise the estimate is
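
A minimal sketch of the third point, assuming equal group sizes, a common SD, and a normal approximation (a t-based interval would be slightly wider for small n). The numbers are borrowed from Case 1 in the practice section below.

```python
import math
from statistics import NormalDist

def mean_diff_ci(mean1, mean2, sd, n_per_group, level=0.95):
    """Approximate confidence interval for a difference between two group means."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sd * math.sqrt(2 / n_per_group)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

# Same 5 kg difference, different precision
for n in (20, 80, 320):
    low, high = mean_diff_ci(42, 37, 8, n)
    print(f"n per group = {n:3d}   95% CI ≈ [{low:.1f}, {high:.1f}] kg")
```

The point estimate stays at 5 kg throughout; what the larger sample buys is a narrower interval around it.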

Estimation Statistics

🔗 dabestr package

🔗 Estimation Statistics

Sample Size Calculation


To estimate sample size a priori, we generally need:

  1. Effect size of interest (ideally, the smallest clinically relevant effect)
  2. Expected variability (standard deviation)
  3. Significance level (α, usually 0.05)
  4. Desired power (1 − β, usually 0.80 or 0.90)
  5. One-tailed or two-tailed test
  6. Study design and planned model

Simulation:


🔗 Open the Power Simulator


Try this:

  1. Set d = 0.2 and α = 0.05 — how large does n need to be to reach 80% power?
  2. Keep n = 20 and increase d from 0.2 to 0.8 — what happens to power?
  3. Change α from 0.05 to 0.01 — what is the cost in terms of β?

Sample Size Calculation: Practice

🔗 Open pwrss calculator


For each scenario: identify the inputs, calculate the required n, and interpret what happens when you change α or power.


Case 1 — Handgrip Strength (t-test)

A researcher wants to compare handgrip strength between physically active and inactive adults.
Based on the literature:

  • Mean Group 1 (active): 42 kg | Mean Group 2 (inactive): 37 kg
  • SD both groups: 8 kg
  • α = 0.05 | Power = 80% | Two-tailed

How many participants per group are needed?


Case 2 — Exercise Intervention (t-test, higher power)

Same study as Case 1, but the research team wants to publish in a high-impact journal requiring 90% power.

How does the required n change?
What is the cost of increasing power from 80% to 90%?


Case 3 — Small Effect, High Precision (t-test)

A clinical trial evaluates a new intervention expected to produce a small effect (d = 0.2) on resting blood pressure.

  • α = 0.05 | Power = 80% | Two-tailed

How many participants are needed?
What happens if you change α to 0.01?


Case 4 — Rare Adverse Event (strict α)

A study investigates a rare adverse event where a false positive would have serious consequences (e.g., withdrawing a safe treatment).

  • d = 0.5 | Power = 90% | α = 0.01

Compare the n required here vs. Case 1.
Why is α more conservative in this scenario?
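
To cross-check calculator output for Cases 1 to 4, here is a rough sketch based on the normal approximation n ≈ 2(z₁₋α/₂ + z₁₋β)² / d². This is an assumption of the example, not the algorithm used by pwrss; t-based calculators typically return a slightly larger n.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate participants per group for a two-group comparison."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2) if two_tailed else norm.inv_cdf(1 - alpha)
    z_beta = norm.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

d_case1 = (42 - 37) / 8  # Case 1: mean difference divided by SD
print("Case 1:", n_per_group(d_case1))
print("Case 2:", n_per_group(d_case1, power=0.90))
print("Case 3:", n_per_group(0.2), "and with alpha = 0.01:", n_per_group(0.2, alpha=0.01))
print("Case 4:", n_per_group(0.5, alpha=0.01, power=0.90))
```

Use the output as an order-of-magnitude check: if the calculator and this sketch disagree by more than a few participants, an input was probably entered incorrectly.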

Sample Size Calculation: One-Way ANOVA

🔗 Open pwrss calculator


Select: One-way ANOVA (F Test)
Benchmarks used: Cohen (1992) — Small: 0.02–0.13 | Medium: 0.13–0.26 | Large: ≥ 0.26


Case 5 — Physical Activity Groups (small effect)

A researcher compares VO₂max across three groups:

  • Sedentary | Moderately active | Highly active
  • Expected η² = 0.06 (small effect)
  • α = 0.05 | Power = 80% | 3 groups

How many participants in total are needed?
How many per group (assuming equal groups)?


Case 6 — Same study, medium effect

Same design as Case 5, but the researcher finds stronger evidence in the literature supporting a larger difference between groups: η² = 0.14 (medium effect).

  • α = 0.05 | Power = 80% | 3 groups

Compare the total N with Case 5.
How does roughly doubling the expected effect size (η² from 0.06 to 0.14) affect the required N?


Case 7 — Four intervention groups, large effect

A clinical trial compares four exercise protocols on resting blood pressure — a well-established intervention with a strong expected response:

  • Control | Low intensity | Moderate intensity | High intensity
  • Expected η² = 0.26 (large effect)
  • α = 0.05 | Power = 80% | 4 groups

Compare with Cases 5 and 6.
How does a large effect size change the required N?
What happens when you increase power to 90%?

Sample Size Calculation: RM-ANOVA (Time × Group)

🔗 Open pwrss calculator


Select: Repeated Measures ANOVA: Within × Between Interaction
Key parameters:
- n.levels = number of groups (between factor)
- n.rm = number of time points (within factor)
- corr.rm = correlation between repeated measures (default: 0.5)
- epsilon = non-sphericity correction (default: 1)


Case 8 — Exercise intervention: 2 groups × 3 time points

A researcher evaluates the effect of an aerobic exercise program on body fat percentage, comparing intervention vs. control group measured at three time points (baseline, 3 months, 6 months):

  • η² = 0.06 (small effect — interaction)
  • n.levels = 2 | n.rm = 3
  • corr.rm = 0.5 | α = 0.05 | Power = 80%

How many participants in total are needed?
How many per group?


Case 9 — Same study, medium effect

Same design as Case 8, but the researcher expects a stronger group × time interaction based on previous literature:

  • η² = 0.14 (medium effect)
  • n.levels = 2 | n.rm = 3
  • corr.rm = 0.5 | α = 0.05 | Power = 80%

Compare with Case 8.
How does the effect size affect the required N?


Case 10 — Three groups × 4 time points

A clinical trial compares three rehabilitation protocols (control, low dose, high dose) on muscle strength, measured at 4 time points (baseline, 4, 8, 12 weeks):

  • η² = 0.06 (small effect)
  • n.levels = 3 | n.rm = 4
  • corr.rm = 0.5 | α = 0.05 | Power = 80%

Compare with Case 8 — same η², but more groups and time points.
What is the impact of adding groups and measurements on total N?

Sample Size Calculation: Correlation

🔗 Open pwrss calculator


Select: A Correlation against a Constant (z Test)
Key parameters:
- r = expected correlation
- r0 = constant to test against (default: 0 — testing if r ≠ 0)
- alternative = Not equal (two-tailed)


Case 11 — Physical activity and quality of life (medium correlation)

A researcher investigates the association between weekly steps and quality of life score in older adults:

  • Expected r = 0.30 (medium effect)
  • r0 = 0 | α = 0.05 | Power = 80% | Two-tailed

How many participants are needed?


Case 12 — Same study, small correlation

Same study as Case 11, but a more conservative estimate based on a broader population:

  • Expected r = 0.10 (small effect)
  • r0 = 0 | α = 0.05 | Power = 80% | Two-tailed

Compare with Case 11.
What is the cost of planning with a small expected correlation?


Case 13 — Strong association, high power

A study examines the association between VO₂max and performance time in recreational runners, where a strong correlation is expected based on the literature:

  • Expected r = 0.50 (large effect)
  • r0 = 0 | α = 0.05 | Power = 90% | Two-tailed

Compare with Cases 11 and 12.
What is the cost of increasing power from 80% to 90%?


Case 14 — Testing against a known baseline correlation

A researcher wants to test whether the correlation between sedentary time and cardiometabolic risk in pregnant women (r = 0.40) is significantly different from a previously published value in the general population (r₀ = 0.20):

  • r = 0.40 | r0 = 0.20
  • α = 0.05 | Power = 80% | Two-tailed

Compare with Case 11 (r = 0.30 tested against r₀ = 0).
Why does testing against r₀ ≠ 0 require a larger sample?
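
A sketch for checking Cases 11 to 14, assuming the usual Fisher z approach: correlations are transformed with atanh, and the standard error of the transformed value is 1/√(n − 3). Calculator output may differ by a participant or two depending on the exact method.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, r0=0.0, alpha=0.05, power=0.80):
    """Approximate total n to detect r against r0 (two-tailed, Fisher z)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    effect = math.atanh(r) - math.atanh(r0)  # difference on the Fisher z scale
    return math.ceil(((z_alpha + z_beta) / effect) ** 2 + 3)

print("Case 11:", n_for_correlation(0.30))
print("Case 12:", n_for_correlation(0.10))
print("Case 13:", n_for_correlation(0.50, power=0.90))
print("Case 14:", n_for_correlation(0.40, r0=0.20))
```

What drives n is the distance between r and r₀ on the transformed scale. In Case 14 that distance is smaller than testing r = 0.40 against zero, so the required sample is larger.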

The take-home message


A statistically significant result with a trivial effect size tells you very little.

A non-significant result from an underpowered study tells you even less.

Before collecting a single data point, ask yourself:

  • What is the smallest effect that would matter in my context?
  • What power do I need to detect it reliably?
  • How many participants does that require?

Planning sample size is not a bureaucratic step —
it is the first act of a rigorous scientist.


Further Reading


Books:

  • Lakens, D. Improving Your Statistical Inferences. [Read online]

  • Caldwell, A. R., Lakens, D., Parlett-Pelleriti, C. M., Prochilo, G., & Aust, F. (2022). Power Analysis with Superpower. [Read online]


Articles:

  • Greenland, S., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. [DOI]

  • Ho, J., et al. (2019). Moving beyond P values: data analysis with estimation graphics. Nature Methods, 16(7), 565–566. [DOI]

  • Caldwell, A. R., et al. (2020). Moving sport and exercise science forward: a call for the adoption of more transparent research practices. Sports Medicine, 50(3), 449–459. [DOI]

  • Lakens, D. (2022). Sample size justification. Collabra: Psychology, 8(1), 33267. [DOI]


Blog posts:

  • Lakens, D. (2014, December 19). Observed power, and what to do if your editor asks for post-hoc power analyses. The 20% Statistician. [Read online]

Thank You!


Questions?