Sample Size, Effect Size, and Magnitude of Effect

Alex Crisp

Learning goals

Explain the logic behind null hypothesis significance testing (NHST) and the meaning of the p-value
Distinguish between Type I and Type II errors and understand their relationship with statistical power
Describe how sample size affects the ability to detect effects and the precision of estimates
Differentiate between effect size and magnitude of effect
Recognize the key components needed for a priori sample size calculation

Initial Disclaimer

Statistics can be divided into two main branches:

Frequentist Statistics: treats probability as the long-run frequency of events in repeated experiments
Bayesian Statistics: combines prior knowledge with new data to update beliefs

The focus of this lecture will be the frequentist approach, which is by far the most commonly used in health sciences and sport sciences.

A Common Feature…

If you’ve ever read a scientific paper, you’ve probably encountered:
- The famous p-value
- The expression “statistically significant”
But what do these really mean?

To understand sample size and effect size, we first need to understand the logic behind the p-value.

Brief Historical Context

Ronald Fisher (1890-1962)

Proposed the p-value as a measure of evidence against the null hypothesis
More flexible and exploratory approach

Jerzy Neyman & Egon Pearson (1930s)

Formalized the hypothesis testing framework
Introduced concepts of Type I (α) and Type II (β) errors
Focus on decision-making

The most widely used model today, NHST (Null Hypothesis Significance Testing), is a fusion of these two approaches.
Fisher saw p as continuous evidence; Neyman and Pearson saw it as a trigger for a binary decision.

The Logic Behind Statistical Tests

Example: Compare handgrip strength between:

Physically active group
Physically inactive group

The central question is: Is the observed difference between groups real or just random variation?

The t-test: A Simple Logic

The independent t-test formula:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} \]

Where:

\(\bar{x}_1 - \bar{x}_2\) = difference between group means
\(SE(\bar{x}_1 - \bar{x}_2)\) = standard error of the mean difference

In essence: signal divided by noise

Where does sample size enter?

The standard error depends on variability and sample size.
For the independent t-test (equal-variance version):

\[ SE(\bar{x}_1 - \bar{x}_2)= s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]

Larger \(n_1, n_2\) \(\rightarrow\) smaller \(SE\)
Smaller \(SE\) \(\rightarrow\) larger \(|t|\) (if the same mean difference is observed)
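
To make the role of n concrete, the short Python sketch below holds the mean difference and pooled SD fixed and varies only the group size. The values (a 5 kg difference, a pooled SD of 8 kg, equal group sizes) and the function names are illustrative assumptions, not data from a study.

```python
import math

def standard_error(s_p: float, n1: int, n2: int) -> float:
    """Standard error of the mean difference (equal-variance t-test)."""
    return s_p * math.sqrt(1 / n1 + 1 / n2)

def t_statistic(mean_diff: float, s_p: float, n1: int, n2: int) -> float:
    """Signal (mean difference) divided by noise (standard error)."""
    return mean_diff / standard_error(s_p, n1, n2)

# Illustrative values only: same 5 kg difference, same pooled SD of 8 kg.
for n in (10, 20, 40, 80):
    se = standard_error(8.0, n, n)
    t = t_statistic(5.0, 8.0, n, n)
    print(f"n per group = {n:3d}  SE = {se:.2f}  t = {t:.2f}")
```

Because SE shrinks with the square root of n, quadrupling the group size halves the SE and doubles t for the same observed difference.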

The t-test: Interpretation

When the difference is large relative to the error → large |t|
The t-value, along with degrees of freedom, determines the p-value

t-Distribution: Signal vs. Noise

Effect of Degrees of Freedom on the t-Distribution

The NHST Framework

NHST has a simple and objective framework:

Formulate null (H₀) and alternative (H₁) hypotheses
Define significance level (α, usually 0.05)
Collect data and calculate test statistic
Compare p-value with α
Make decision: reject or fail to reject H₀

This simplicity explains its popularity. But there’s a catch…

What is the p-value?

Formal definition:

The p-value is the probability of observing a result as extreme or more extreme than the one observed, assuming the null hypothesis is true.
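
One way to see this definition at work is a permutation test, sketched below in Python. This is a different technique from the t-test used in this lecture, chosen only because it turns the definition into a direct count: if H₀ is true, group labels are interchangeable, so we shuffle them and check how often the shuffled difference is at least as extreme as the observed one. The data and the function name are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=1):
    """Two-sided permutation p-value for a difference in means.

    Under H0 the group labels are exchangeable, so we shuffle them and
    count how often the shuffled difference is at least as extreme as
    the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical handgrip values (kg), invented for illustration.
active = [42, 45, 39, 48, 41, 44, 46, 40]
inactive = [37, 35, 40, 33, 38, 36, 41, 34]
print(permutation_p_value(active, inactive))
```

The returned proportion is a statement about the data under the assumption that H₀ holds. It says nothing directly about the probability that H₀ itself is true.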

Even experienced researchers misinterpret this definition.

Let’s see what the p-value is NOT.

Common p-value Misinterpretations

❌ “The p-value is the probability that the null hypothesis is true”

❌ “1 − p is the probability that the alternative hypothesis is true”

❌ “A p < 0.05 means the result is clinically important”

❌ “A p = 0.049 is very different from p = 0.051”

✅ “The p-value tells us how surprising the data are, assuming H₀ is true — nothing more.”

The p-value depends on sample size and variability. It is not a direct measure of effect size!
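
A small Python sketch can illustrate this dependence. It keeps the observed standardized difference fixed at d = 0.2 and changes only the sample size. It assumes a normal (z) approximation to the two-sample t-test with equal group sizes, and the function name is made up for this example.

```python
import math
from statistics import NormalDist

def approx_two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for an observed standardized difference d.

    Assumption: normal (z) approximation to the two-sample t-test with
    equal group sizes, so the test statistic is d * sqrt(n / 2).
    """
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same observed effect (d = 0.2) at different sample sizes.
for n in (20, 100, 500, 1000):
    print(f"n per group = {n:4d}  p = {approx_two_sided_p(0.2, n):.4f}")
```

The effect is identical in every row; only the amount of data changes, and with it the p-value.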

Popper, falsification, and the null hypothesis

Core idea (Popper): A scientific claim is stronger when it can be falsified by data.
Science progresses by attempting to refute claims, not by “proving” them true.

Bridge to NHST:
In hypothesis testing, we start with a default model:

Null hypothesis (H₀): “no effect” or “no difference” (e.g., Active = Inactive)
We ask: Are the data unusually incompatible with H₀?
If the evidence is strong, we reject H₀ (a controlled attempt at falsification)

NHST does not test whether the null hypothesis is true.
It evaluates how compatible the data are with a model that assumes H₀.

Situation 1: Rejection

Hypothesis: “This bag contains only ping-pong balls”
I open the bag and observe in the first layer:

🥚 An egg!

Conclusion: I can reject the hypothesis immediately.

Situation 2: Failure to Reject

I open the bag and observe in the first layer:

🏓🏓🏓 Only ping-pong balls

But… what about the bottom of the bag?

Conclusion: I cannot be certain without inspecting all the balls. I can only “fail to reject” the hypothesis within the available level of evidence.

Rejecting H₀ does not mean H₁ is proven —
only that the data are incompatible with H₀.

Analogy: The Trial

To understand the logic of rejection in NHST, imagine a trial:

Null hypothesis (H₀): My friend is innocent (did not steal my dessert from the fridge)

Alternative hypothesis (H₁): My friend is guilty

Scenario 1: Weak Evidence

A witness reports seeing someone taking the dessert, but:

The suspect had their back turned
The location was dark
The witness is not certain

Conclusion: The evidence does not reach the level needed to reject innocence.

“Failing to reject H₀” ≠ “H₀ is true”

Scenario 2: Strong Evidence

A camera inside the fridge captured:

The exact moment the suspect takes the dessert
The face clearly identifiable
Forensics confirm: authentic video, no manipulation

Conclusion: The level of evidence is very high.

Type I and Type II Errors

The fundamental point of NHST is error control:

	H₀ is true	H₀ is false
Reject H₀	Type I Error (α)	✓ Correct
Fail to reject H₀	✓ Correct	Type II Error (β)

Examples

Type I Error (False Positive): It is not raining, but you bring an umbrella anyway ☂️

Type II Error (False Negative): It is raining, but you leave your umbrella at home 🌧️
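
The idea that α is an error rate we control can be checked by simulation. The Python sketch below repeatedly draws two groups from the same population, so H₀ is true by construction, and counts how often a two-sided test rejects. To stay within the standard library it assumes a known SD of 1 and uses a z-test instead of a t-test. The function name and settings are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean

def false_positive_rate(n_per_group=30, alpha=0.05, n_experiments=4000, seed=7):
    """Share of simulated experiments that reject H0 although H0 is true.

    Assumption: both groups come from the same normal population with a
    known SD of 1, so a two-sided z-test can be used.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(2 / n_per_group)  # known SD = 1 in both groups
    rejections = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(a) - fmean(b)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments

print(false_positive_rate())  # should land close to alpha
```

Every rejection counted here is a Type I error, since no real difference exists in the simulated population.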

Multiple Comparisons and Type I Error

Every time you run a statistical test at α = 0.05 and H₀ is true, you accept a 5% chance of a false positive.

What happens when you run multiple tests?

Tests (k)	Probability of ≥ 1 false positive
1	5.0%
3	14.3%
5	22.6%
10	40.1%
20	64.2%

With 20 independent tests at α = 0.05, you have a 64% chance of at least one false positive when all null hypotheses are true.
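
The percentages in the table follow from 1 − (1 − α)^k, which assumes the k tests are independent and that every null hypothesis is true. A minimal Python sketch (the function name is an assumption for this example) reproduces them:

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests, all H0 true."""
    return 1 - (1 - alpha) ** k

for k in (1, 3, 5, 10, 20):
    print(f"{k:2d} tests: {familywise_error_rate(0.05, k):.1%}")
```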

Statistical Power

Power = 1 − β: the probability of rejecting H₀ when it is actually false

Low power = high risk of “failing to detect” real effects

Power depends on: sample size, effect size, variability, and the significance level (α)
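
As a rough illustration of how these ingredients interact, the Python sketch below approximates power for a two-sided comparison of two groups. It assumes a normal approximation with equal group sizes, so the numbers will differ slightly from an exact t-based calculation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Assumption: normal approximation with equal group sizes; the small
    probability of rejecting in the wrong direction is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = d * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(noncentrality - z_crit)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: power with n = 20 per group is {approx_power(d, 20):.2f}")
```

Variability enters through d, which is the raw difference divided by the standard deviation: a noisier outcome means a smaller d and therefore lower power.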

The Role of Sample Size

One of the most important ways a researcher can control the balance between errors is to plan sample size a priori.

The calculation considers:

Type I error (α)
Desired power (1 − β)
Expected effect size
Variability

Effect Size: Cohen’s d

A complement to the p-value (and a requirement of many journals) is to report an effect size.

Cohen’s d:

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{SD_{pooled}} \]

Conventional interpretation (heuristic):

d ≈ 0.2 → Small
d ≈ 0.5 → Medium
d ≈ 0.8 → Large

⚠️ These conventions are context-dependent!
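
The formula for d can be computed directly from two samples. The Python sketch below uses the usual pooled standard deviation, weighted by degrees of freedom. The data are hypothetical values invented for illustration, and the function name is an assumption.

```python
import math
from statistics import mean, variance

def cohens_d(group_1, group_2) -> float:
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / math.sqrt(pooled_var)

# Hypothetical handgrip values (kg), invented for illustration.
active = [42, 45, 39, 48, 41]
inactive = [37, 35, 40, 33, 38]
print(round(cohens_d(active, inactive), 2))
```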

What Does Cohen’s d Look Like?

Standardized vs. Unstandardized Effects

Standardized effect (Cohen’s d):

Useful for comparing across studies
Independent of measurement unit

Unstandardized effect (original units):

E.g., “5 kg difference in handgrip strength”
More clinically interpretable

Both are important — use what makes most sense for your question!

Unstandardized: Active group had 8 kg higher handgrip strength
Standardized: d = 0.82 (large effect)

Effect Size vs. Magnitude of Effect

It is important to distinguish two concepts:

Effect Size:

Quantitative estimate of the effect
E.g., mean difference, ratio, correlation, Cohen’s d

Magnitude of Effect:

Substantive interpretation — is the effect practically relevant?
Considers: context, costs, risks, benefits, applicability

Statistical significance ≠ Clinical relevance!

Effect Size Measures: Same Logic, Different Scales

Standardized effect size measures share the same fundamental logic:

\[\text{Effect Size} = \frac{\text{signal}}{\text{noise}}\]

Based on means and standard deviations — continuous outcomes, group comparisons

Measure	Formula	Used with
Cohen’s d	\(\frac{\bar{x}_1 - \bar{x}_2}{s_p}\)	t-test (two groups)
Hedges’ g	\(d \times \text{correction}\)	t-test (small samples)
Cohen’s f	\(\frac{\sigma_m}{\sigma}\)	ANOVA (3+ groups)
η² (eta-squared)	\(SS_{effect} / SS_{total}\)	ANOVA
ω² (omega-squared)	Less biased version of η²	ANOVA (preferred)

Based on correlation coefficients — strength of association, variance explained

Measure	Formula	Used with
r (Pearson)	\(\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}\)	Correlation between two continuous variables
r²	\(r \times r\)	Bivariate correlation (1 predictor)
R²	\(1 - SS_{residual}/SS_{total}\)	Multiple regression (1+ predictors)
φ (phi)	\(\sqrt{\chi^2 / n}\)	Chi-square (2×2 tables)
Cramér’s V	\(\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\)	Chi-square (larger tables)

Based on events and non-events in two groups — binary outcomes, risk

Measure	Formula	Reference	Used with
RR (Risk Ratio)	\(p_1 / p_2\)	RR = 1	Cohort studies, RCTs
OR (Odds Ratio)	\(\frac{p_1/(1-p_1)}{p_2/(1-p_2)}\)	OR = 1	Case-control, logistic regression
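
The two formulas for binary outcomes can be sketched in a few lines of Python. The counts below are hypothetical, and the function names are assumptions made for this example.

```python
def risk_ratio(events_1: int, n_1: int, events_2: int, n_2: int) -> float:
    """RR = p1 / p2, where p is the proportion with the event in each group."""
    return (events_1 / n_1) / (events_2 / n_2)

def odds_ratio(events_1: int, n_1: int, events_2: int, n_2: int) -> float:
    """OR = [p1 / (1 - p1)] / [p2 / (1 - p2)]."""
    p1, p2 = events_1 / n_1, events_2 / n_2
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Hypothetical counts: 20 of 100 events in group 1, 10 of 100 in group 2.
print(risk_ratio(20, 100, 10, 100))
print(odds_ratio(20, 100, 10, 100))
```

With these counts the OR is further from 1 than the RR, which is typical when the event is not rare.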

Criticisms of NHST

P-hacking:
- Manipulating analyses until p < 0.05
- “Torturing the data until it confesses”

Arbitrary dichotomization:
- p = 0.049 → “significant” ✓
- p = 0.051 → “not significant” ✗

Excessive focus on p-value:
- At the expense of effect magnitude and precision

HARKing (Hypothesizing After Results are Known):
- Presenting post-hoc hypotheses as if they were a priori

Cherry Picking:
- Reporting only outcomes or subgroups that reached p < 0.05

File Drawer Problem:
- Null results are rarely published
- The published literature systematically overestimates effect sizes

Beyond p-values

Major journals (Nature, JAMA) increasingly require reporting of effect sizes and confidence intervals — not just p-values.

Why?

p-value tells you how compatible the data are with no effect (H₀)
Effect size tells you how big it is
Confidence interval tells you how precise the estimate is
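
To show what "precision" means here, the Python sketch below builds an approximate 95% confidence interval for a difference between two means from summary statistics. It assumes equal group sizes and a normal critical value, which is reasonable only for moderately large samples. The input values and the function name are illustrative.

```python
import math
from statistics import NormalDist

def mean_difference_ci(mean_1, mean_2, sd_pooled, n_per_group, level=0.95):
    """Approximate confidence interval for a difference between two means.

    Assumption: equal group sizes and a normal critical value, which is
    reasonable for moderately large samples; small samples call for the
    t-distribution instead.
    """
    diff = mean_1 - mean_2
    se = sd_pooled * math.sqrt(2 / n_per_group)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff - z * se, diff + z * se

# Illustrative summary values: 42 kg vs. 37 kg, pooled SD 8 kg.
for n in (50, 200):
    low, high = mean_difference_ci(42, 37, 8, n)
    print(f"n per group = {n}: difference = 5 kg, 95% CI {low:.1f} to {high:.1f} kg")
```

The point estimate is the same in both rows; the larger sample simply gives a narrower interval around it.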

Estimation Statistics

🔗 dabestr package

🔗 Estimation Statistics

Sample Size Calculation

To estimate sample size a priori, we generally need:

Effect size of interest (ideally, the smallest clinically relevant effect)
Expected variability (standard deviation)
Significance level (α, usually 0.05)
Desired power (1 − β, usually 0.80 or 0.90)
One-tailed or two-tailed test
Study design and planned model
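
For a two-tailed comparison of two means with equal group sizes, these inputs combine into a well-known closed-form approximation. It is based on the normal distribution and is offered here as a sketch; calculators that use the t-distribution typically return a slightly larger n.

\[ n_{\text{per group}} \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}, \qquad d = \frac{\bar{x}_1 - \bar{x}_2}{SD_{pooled}} \]

Here \(z_{1-\alpha/2}\) and \(z_{1-\beta}\) are standard normal quantiles (about 1.96 for a two-tailed α = 0.05 and about 0.84 for 80% power). Because \(d\) is squared in the denominator, halving the effect of interest roughly quadruples the required n.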

Simulation:

🔗 Open the Power Simulator

Try this:

Set d = 0.2 and α = 0.05 — how large does n need to be to reach 80% power?
Keep n = 20 and increase d from 0.2 to 0.8 — what happens to power?
Change α from 0.05 to 0.01 — what is the cost in terms of β?

Sample Size Calculation: Practice

🔗 Open pwrss calculator

For each scenario: identify the inputs, calculate the required n, and interpret what happens when you change α or power.

Case 1 — Handgrip Strength (t-test)

A researcher wants to compare handgrip strength between physically active and inactive adults.
Based on the literature:

Mean Group 1 (active): 42 kg | Mean Group 2 (inactive): 37 kg
SD both groups: 8 kg
α = 0.05 | Power = 80% | Two-tailed

How many participants per group are needed?
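
As a cross-check on the calculator, the Python sketch below converts the Case 1 inputs into a standardized effect and applies a normal-approximation formula for n per group. The function name is hypothetical, and the approximation is an assumption: an exact t-based calculation is expected to give a slightly larger value.

```python
import math
from statistics import NormalDist

def n_per_group(mean_1, mean_2, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-tailed comparison of two means.

    Assumption: normal approximation, equal group sizes, common SD.
    """
    d = abs(mean_1 - mean_2) / sd  # standardized effect (Cohen's d)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    return d, math.ceil(n)

d, n = n_per_group(42, 37, 8)  # Case 1 inputs
print(f"d = {d:.3f}, approximate n per group = {n}")
```

Passing a different `power` or `alpha` to the same function gives a quick way to explore how the required n responds to each input.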

Case 2 — Exercise Intervention (t-test, higher power)

Same study as Case 1, but the research team wants to publish in a high-impact journal requiring 90% power.

How does the required n change?
What is the cost of increasing power from 80% to 90%?

Case 3 — Small Effect, High Precision (t-test)

A clinical trial evaluates a new intervention expected to produce a small effect (d = 0.2) on resting blood pressure.

α = 0.05 | Power = 80% | Two-tailed

How many participants are needed?
What happens if you change α to 0.01?

Case 4 — Rare Adverse Event (strict α)

A study investigates a rare adverse event where a false positive would have serious consequences (e.g., withdrawing a safe treatment).

d = 0.5 | Power = 90% | α = 0.01

Compare the n required here vs. Case 1.
Why is a more conservative α chosen in this scenario?

Sample Size Calculation: One-Way ANOVA

🔗 Open pwrss calculator

Select: One-way ANOVA (F Test)
Benchmarks used: Cohen (1992) — Small: 0.02–0.13 | Medium: 0.13–0.26 | Large: ≥ 0.26
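
Power calculations for ANOVA are often expressed in terms of Cohen's f (or f²) instead of η². As a reminder, and assuming the η² entered refers to the effect being tested, the standard conversion is:

\[ f^2 = \frac{\eta^2}{1 - \eta^2}, \qquad f = \sqrt{\frac{\eta^2}{1 - \eta^2}} \]

For example, η² = 0.06 corresponds to f² ≈ 0.064 and f ≈ 0.25.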

Case 5 — Physical Activity Groups (small effect)

A researcher compares VO₂max across three groups:

Sedentary | Moderately active | Highly active
Expected η² = 0.06 (small effect)
α = 0.05 | Power = 80% | 3 groups

How many participants in total are needed?
How many per group (assuming equal groups)?

Case 6 — Same study, medium effect

Same design as Case 5, but the researcher finds stronger evidence in the literature supporting a larger difference between groups: η² = 0.14 (medium effect).

α = 0.05 | Power = 80% | 3 groups

Compare the total N with Case 5.
How does more than doubling the expected η² affect the required N?

Case 7 — Four intervention groups, large effect

A clinical trial compares four exercise protocols on resting blood pressure — a well-established intervention with a strong expected response:

Control | Low intensity | Moderate intensity | High intensity
Expected η² = 0.26 (large effect)
α = 0.05 | Power = 80% | 4 groups

Compare with Cases 5 and 6.
How does a large effect size change the required N?
What happens when you increase power to 90%?

Sample Size Calculation: RM-ANOVA (Time × Group)

🔗 Open pwrss calculator

Select: Repeated Measures ANOVA: Within × Between Interaction
Key parameters:
- n.levels = number of groups (between factor)
- n.rm = number of time points (within factor)
- corr.rm = correlation between repeated measures (default: 0.5)
- epsilon = non-sphericity correction (default: 1)

Case 8 — Exercise intervention: 2 groups × 3 time points

A researcher evaluates the effect of an aerobic exercise program on body fat percentage, comparing intervention vs. control group measured at three time points (baseline, 3 months, 6 months):

η² = 0.06 (small effect — interaction)
n.levels = 2 | n.rm = 3
corr.rm = 0.5 | α = 0.05 | Power = 80%

How many participants in total are needed?
How many per group?

Case 9 — Same study, medium effect

Same design as Case 8, but the researcher expects a stronger group × time interaction based on previous literature:

η² = 0.14 (medium effect)
n.levels = 2 | n.rm = 3
corr.rm = 0.5 | α = 0.05 | Power = 80%

Compare with Case 8.
How does the effect size affect the required N?

Case 10 — Three groups × 4 time points

A clinical trial compares three rehabilitation protocols (control, low dose, high dose) on muscle strength, measured at 4 time points (baseline, 4, 8, 12 weeks):

η² = 0.06 (small effect)
n.levels = 3 | n.rm = 4
corr.rm = 0.5 | α = 0.05 | Power = 80%

Compare with Case 8 — same η², but more groups and time points.
What is the impact of adding groups and measurements on total N?

Sample Size Calculation: Correlation

🔗 Open pwrss calculator

Select: A Correlation against a Constant (z Test)
Key parameters:
- r = expected correlation
- r0 = constant to test against (default: 0 — testing if r ≠ 0)
- alternative = Not equal (two-tailed)
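
A common way to approximate sample size for this kind of test uses the Fisher z-transformation, under which the standard error of a transformed correlation is 1 / sqrt(n − 3). The Python sketch below applies that approach. It is an assumption that it matches the calculator's method, so small differences in the result are possible, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_for_correlation(r: float, r0: float = 0.0, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n to detect r against r0 with a two-tailed z test.

    Assumption: Fisher z-transformation, whose standard error is
    1 / sqrt(n - 3).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    delta = math.atanh(r) - math.atanh(r0)  # difference on the Fisher z scale
    return math.ceil(((z_alpha + z_beta) / delta) ** 2 + 3)

print(n_for_correlation(0.30))            # test against r0 = 0
print(n_for_correlation(0.40, r0=0.20))   # test against a non-zero constant
```

What drives the required n is the distance between r and r0 on the transformed scale, not the size of r alone.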

Case 11 — Physical activity and quality of life (medium correlation)

A researcher investigates the association between weekly steps and quality of life score in older adults:

Expected r = 0.30 (medium effect)
r0 = 0 | α = 0.05 | Power = 80% | Two-tailed

How many participants are needed?

Case 12 — Same study, small correlation

Same study as Case 11, but a more conservative estimate based on a broader population:

Expected r = 0.10 (small effect)
r0 = 0 | α = 0.05 | Power = 80% | Two-tailed

Compare with Case 11.
What is the cost of planning with a small expected correlation?

Case 13 — Strong association, high power

A study examines the association between VO₂max and performance time in recreational runners, where a strong correlation is expected based on the literature:

Expected r = 0.50 (large effect)
r0 = 0 | α = 0.05 | Power = 90% | Two-tailed

Compare with Cases 11 and 12.
What is the cost of increasing power from 80% to 90%?

Case 14 — Testing against a known baseline correlation

A researcher wants to test whether the correlation between sedentary time and cardiometabolic risk in pregnant women (r = 0.40) is significantly different from a previously published value in the general population (r₀ = 0.20):

r = 0.40 | r0 = 0.20
α = 0.05 | Power = 80% | Two-tailed

Compare with Case 11 (r = 0.30 tested against r₀ = 0).
Why does testing against r₀ ≠ 0 require a larger sample?

The take-home message

A statistically significant result with a trivial effect size tells you very little.

A non-significant result from an underpowered study tells you even less.

Before collecting a single data point, ask yourself:

What is the smallest effect that would matter in my context?
What power do I need to detect it reliably?
How many participants does that require?

Planning sample size is not a bureaucratic step —
it is the first act of a rigorous scientist.

Thank You!

Questions?
