Statistics can be divided into two main branches: frequentist and Bayesian.
The focus of this lecture will be the frequentist approach, which is by far the most commonly used in health sciences and sport sciences.
If you’ve ever read a scientific paper, you’ve probably encountered:
- The famous p-value
- The expression “statistically significant”
But what do these really mean?
To understand sample size and effect size, we first need to understand the logic behind the p-value.
Ronald Fisher (1890-1962)
Jerzy Neyman & Egon Pearson (1930s)
The most widely used model today, NHST (Null Hypothesis Significance Testing), is a fusion of these two approaches.
Fisher saw p as continuous evidence;
Neyman-Pearson saw it as a binary decision trigger.
Example: Compare handgrip strength between physically active and inactive adults.
The central question is: Is the observed difference between groups real or just random variation?
The independent t-test formula:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} \]
Where:
In essence: signal divided by noise
The standard error depends on variability and sample size.
For the independent t-test (equal-variance version):
\[ SE(\bar{x}_1 - \bar{x}_2)= s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
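The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for statistical software: the handgrip values are hypothetical, the function name `independent_t` is ours, and the pooled standard deviation \(s_p\) is computed with the usual equal-variance formula \(s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\).

```python
import math
import statistics

def independent_t(group1, group2):
    """Equal-variance independent t statistic: mean difference (signal) / standard error (noise)."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Pooled standard deviation from the two sample variances (n - 1 denominators)
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    # Standard error of the difference between means
    se = s_p * math.sqrt(1 / n1 + 1 / n2)
    return mean_diff / se, se

# Hypothetical handgrip strength values (kg), for illustration only
active = [42.0, 45.0, 39.0, 48.0, 44.0, 41.0]
inactive = [36.0, 40.0, 35.0, 38.0, 33.0, 37.0]

t_stat, se = independent_t(active, inactive)
print(f"SE = {se:.2f} kg, t = {t_stat:.2f}")
```

Larger samples or less variable data shrink the standard error, so the same mean difference produces a larger t.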
NHST has a simple and objective framework:
This simplicity explains its popularity. But there’s a catch…
Formal definition:
The p-value is the probability of observing a result as extreme or more extreme than the one observed, assuming the null hypothesis is true.
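One way to make this definition concrete is an exact permutation test, shown here as an illustrative stand-in for the t-test rather than the method used in this lecture. If H₀ is true, the group labels are arbitrary, so we can list every possible relabeling and count how often the mean difference is at least as extreme as the one observed. The data and the function name `exact_permutation_p` are hypothetical.

```python
from itertools import combinations

def exact_permutation_p(group1, group2):
    """Two-sided p-value: share of relabelings with |mean difference| >= the observed one."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    n2 = len(pooled) - n1
    total = sum(pooled)
    observed = abs(sum(group1) / n1 - sum(group2) / n2)
    extreme = 0
    count = 0
    # Every way of choosing which n1 values are labeled "group 1"
    for idx in combinations(range(len(pooled)), n1):
        s1 = sum(pooled[i] for i in idx)
        diff = abs(s1 / n1 - (total - s1) / n2)
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
        count += 1
    return extreme / count

# Hypothetical handgrip values (kg): every active value exceeds every inactive value
active = [40, 42, 45, 47]
inactive = [30, 33, 35, 38]
print(round(exact_permutation_p(active, inactive), 4))  # 2 of 70 relabelings -> 0.0286
```

Only 2 of the 70 possible relabelings (the observed split and its mirror image) give a difference this large, so p = 2/70. That number describes the data under H₀; it says nothing about the probability that H₀ itself is true.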
Even experienced researchers misinterpret this definition.
Let’s see what the p-value is NOT.
❌ “The p-value is the probability that the null hypothesis is true”
❌ “1 − p is the probability that the alternative hypothesis is true”
❌ “A p < 0.05 means the result is clinically important”
❌ “A p = 0.049 is very different from p = 0.051”
✅ “The p-value tells us how surprising the data are, assuming H₀ is true — nothing more.”
The p-value depends on sample size and variability. It is not a direct measure of effect size!
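A short sketch of this dependence, under stated assumptions: two equal groups, the same standardized difference (d = 0.2) at every sample size, and a normal approximation to the t distribution (reasonable for larger n, slightly optimistic for small n). The function name `approx_two_sided_p` is ours.

```python
import math
from statistics import NormalDist

def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized difference d with equal group sizes."""
    # With equal groups, t = d * sqrt(n / 2); the normal distribution stands in for t
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same effect size, different sample sizes
for n in (20, 100, 500):
    print(n, round(approx_two_sided_p(0.2, n), 4))
```

The effect is identical in all three rows, yet the p-value falls from above 0.5 at n = 20 per group to below 0.01 at n = 500 per group.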
Core idea (Popper): A scientific claim is stronger when it can be falsified by data.
Science progresses by attempting to refute claims, not by “proving” them true.
Bridge to NHST:
In hypothesis testing, we start with a default model:
NHST does not test whether the null hypothesis is true.
It evaluates how compatible the data are with a model that assumes H₀.
Hypothesis: “This bag contains only ping-pong balls”
I open the bag and observe in the first layer:
🥚 An egg!
Conclusion: I can reject the hypothesis immediately.
I open the bag and observe in the first layer:
🏓🏓🏓 Only ping-pong balls
But… what about the bottom of the bag?
Conclusion: I cannot be certain without inspecting all the balls. I can only “fail to reject” the hypothesis within the available level of evidence.
Rejecting H₀ does not mean H₁ is proven, only that the data would be surprising if H₀ were true.
To understand the logic of rejection in NHST, imagine a trial:
Null hypothesis (H₀): My friend is innocent (did not steal my dessert from the fridge)
Alternative hypothesis (H₁): My friend is guilty
A witness reports seeing someone taking the dessert, but:
Conclusion: The evidence does not reach the level needed to reject innocence.
“Failing to reject H₀” ≠ “H₀ is true”
A camera inside the fridge captured:
Conclusion: The level of evidence is very high.
The fundamental point of NHST is error control:
| | H₀ is true | H₀ is false |
|---|---|---|
| Reject H₀ | Type I Error (α) | ✓ Correct |
| Fail to reject H₀ | ✓ Correct | Type II Error (β) |
Type I Error (False Positive):
- It is not raining, but you bring an umbrella anyway ☂️
Type II Error (False Negative):
- It is raining, but you leave your umbrella at home 🌧️
Every time you run a statistical test at α = 0.05 and H₀ is true, you accept a 5% chance of a false positive.
What happens when you run multiple tests?
| Tests (k) | Probability of ≥ 1 false positive |
|---|---|
| 1 | 5.0% |
| 3 | 14.3% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
With 20 independent tests at α = 0.05, there is a 64% chance of at least one false positive when all null hypotheses are true.
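The values in the table above follow from \(1 - (1 - \alpha)^k\), assuming the k tests are independent and every null hypothesis is true. A minimal sketch (the function name `familywise_error` is ours):

```python
def familywise_error(alpha, k):
    """Probability of at least one false positive across k independent tests when all H0 are true."""
    # (1 - alpha) ** k is the chance that none of the k tests gives a false positive
    return 1 - (1 - alpha) ** k

for k in (1, 3, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_error(0.05, k):.1%}")
```

With correlated tests the exact numbers differ, but the pattern is the same: the more tests you run, the more likely at least one false positive becomes.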
Power = 1 − β
- Low power = high risk of “failing to detect” real effects
Power depends on: sample size, effect size, and variability
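To see these dependencies in numbers, here is a rough sketch of power for a two-sided, two-group comparison. It assumes equal group sizes and uses a normal approximation instead of the noncentral t distribution, so dedicated power software will give slightly different values. Variability enters through d, because d is the mean difference divided by the standard deviation. The function name `approx_power` is ours.

```python
import math
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for a standardized effect d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # critical value for the chosen alpha
    shift = abs(d) * math.sqrt(n_per_group / 2)  # expected test statistic under H1
    # Probability that the statistic lands beyond either critical value
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Power grows with sample size (fixed d = 0.5) ...
for n in (20, 64, 100):
    print(f"d = 0.5, n = {n:>3} per group: power = {approx_power(0.5, n):.2f}")

# ... and with effect size (fixed n = 64 per group)
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}, n =  64 per group: power = {approx_power(d, 64):.2f}")
```

When d = 0 the function returns α: with no true effect, the only "detections" are false positives.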
One of the most important ways a researcher can control the balance between errors is to plan sample size a priori.
The calculation considers:
A complement to the p-value (and a requirement of many journals) is to report the effect size.
Cohen’s d:
\[ d = \frac{\bar{x}_1 - \bar{x}_2}{SD_{pooled}} \]
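As a sketch, Cohen's d can be computed directly from two samples. The values below are hypothetical, the function name `cohens_d` is ours, and \(SD_{pooled}\) is taken as the usual equal-variance pooled standard deviation.

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d: difference between means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    sd_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / sd_pooled

# Hypothetical handgrip strength values (kg), for illustration only
active = [42.0, 45.0, 39.0, 48.0, 44.0, 41.0]
inactive = [36.0, 40.0, 35.0, 38.0, 33.0, 37.0]

raw_diff = statistics.mean(active) - statistics.mean(inactive)
print(f"Difference in original units: {raw_diff:.1f} kg")
print(f"Cohen's d: {cohens_d(active, inactive):.2f}")
```

Unlike the t statistic, d does not grow with sample size: it rescales the difference by the spread of the data, not by the standard error.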
Conventional interpretation (heuristic):
⚠️ These conventions are context-dependent!
Standardized effect (Cohen’s d):
Unstandardized effect (original units):
Both are important — use what makes most sense for your question!
Unstandardized: Active group had 8 kg higher handgrip strength
Standardized: d = 0.82 (large effect)
It is important to distinguish two concepts:
Effect Size:
Magnitude of Effect:
Statistical significance ≠ Clinical relevance!
All effect size measures share the same fundamental logic:
\[\text{Effect Size} = \frac{\text{signal}}{\text{noise}}\]
Based on means and standard deviations — continuous outcomes, group comparisons
| Measure | Formula | Used with |
|---|---|---|
| Cohen’s d | \(\frac{\bar{x}_1 - \bar{x}_2}{s_p}\) | t-test (two groups) |
| Hedges’ g | \(d \times \text{correction}\) | t-test (small samples) |
| Cohen’s f | \(\frac{\sigma_m}{\sigma}\) | ANOVA (3+ groups) |
| η² (eta-squared) | \(SS_{effect} / SS_{total}\) | ANOVA |
| ω² (omega-squared) | Less biased version of η² | ANOVA (preferred) |
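Two conversions from this table are often needed in practice. The sketch below assumes the common approximate small-sample correction for Hedges' g, \(1 - \frac{3}{4(n_1+n_2)-9}\), and the standard relation \(f = \sqrt{\eta^2 / (1-\eta^2)}\) between η² and Cohen's f. The function names are ours.

```python
import math

def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d shrunk by an approximate small-sample correction factor."""
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction

def eta_squared_to_f(eta_sq):
    """Convert eta-squared (proportion of variance explained) to Cohen's f."""
    return math.sqrt(eta_sq / (1 - eta_sq))

# The correction matters with small groups and fades as n grows
print(f"d = 0.82, n = 10 per group  -> g = {hedges_g(0.82, 10, 10):.3f}")
print(f"d = 0.82, n = 100 per group -> g = {hedges_g(0.82, 100, 100):.3f}")

# Some power tools ask for f, others for eta-squared
print(f"eta-squared = 0.14 -> f = {eta_squared_to_f(0.14):.3f}")
```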
Based on correlation coefficients — strength of association, variance explained
| Measure | Formula | Used with |
|---|---|---|
| r (Pearson) | \(\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}\) | Correlation between two continuous variables |
| r² | \(r \times r\) | Bivariate correlation (1 predictor) |
| R² | \(1 - SS_{residual}/SS_{total}\) | Multiple regression (1+ predictors) |
| φ (phi) | \(\sqrt{\chi^2 / n}\) | Chi-square (2×2 tables) |
| Cramér’s V | \(\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\) | Chi-square (larger tables) |
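The Pearson formula in the table translates almost literally into code. This is a minimal sketch with hypothetical data (weekly steps in thousands and a quality-of-life score); the function name `pearson_r` is ours.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their individual variation."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
weekly_steps_thousands = [21, 35, 28, 42, 50, 31, 39]
quality_of_life = [55, 62, 58, 70, 74, 65, 63]

r = pearson_r(weekly_steps_thousands, quality_of_life)
print(f"r = {r:.2f}, r² = {r * r:.2f}")
```

r² is read as the proportion of variance in one variable that is shared with the other.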
Based on events and non-events in two groups — binary outcomes, risk
| Measure | Formula | Reference | Used with |
|---|---|---|---|
| RR (Risk Ratio) | \(p_1 / p_2\) | RR = 1 | Cohort studies, RCTs |
| OR (Odds Ratio) | \(\frac{p_1/(1-p_1)}{p_2/(1-p_2)}\) | OR = 1 | Case-control, logistic regression |
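Both measures can be computed from the number of events in each group. The counts below are hypothetical and the function names are ours.

```python
def risk_ratio(events1, n1, events2, n2):
    """Risk ratio: risk in group 1 divided by risk in group 2."""
    p1, p2 = events1 / n1, events2 / n2
    return p1 / p2

def odds_ratio(events1, n1, events2, n2):
    """Odds ratio: odds of the event in group 1 divided by odds in group 2."""
    p1, p2 = events1 / n1, events2 / n2
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Hypothetical example: 20 of 100 events in group 1 vs. 10 of 100 in group 2
print(f"RR = {risk_ratio(20, 100, 10, 100):.2f}")
print(f"OR = {odds_ratio(20, 100, 10, 100):.2f}")
```

The two values agree closely only when the event is rare; here the risk is doubled (RR = 2) while the odds ratio is somewhat larger, so they should not be used interchangeably.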
P-hacking:
- Manipulating analyses until p < 0.05
- “Torturing the data until it confesses”
Arbitrary dichotomization:
- p = 0.049 → “significant” ✓
- p = 0.051 → “not significant” ✗
Excessive focus on p-value:
- At the expense of effect magnitude and precision
HARKing (Hypothesizing After Results are Known):
- Presenting post-hoc hypotheses as if they were a priori
Cherry Picking:
- Reporting only outcomes or subgroups that reached p < 0.05
File Drawer Problem:
- Null results are rarely published
- The published literature systematically overestimates effect sizes
Major journals (Nature, JAMA) increasingly require reporting of effect sizes and confidence intervals — not just p-values.
Why?
To estimate sample size a priori, we generally need:
Try this:
For each scenario: identify the inputs, calculate the required n, and interpret what happens when you change α or power.
A researcher wants to compare handgrip strength between physically active and inactive adults.
Based on the literature:
How many participants per group are needed?
Same study as Case 1, but the research team wants to publish in a high-impact journal requiring 90% power.
How does the required n change?
What is the cost of increasing power from 80% to 90%?
A clinical trial evaluates a new intervention expected to produce a small effect (d = 0.2) on resting blood pressure.
How many participants are needed?
What happens if you change α to 0.01?
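For a rough check of this case, the sketch below uses the normal-approximation formula \(n = 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2\) per group. It assumes a two-sided test, equal groups, and 80% power (the power is our assumption for illustration). Tools based on the t distribution, such as G*Power, typically return a slightly larger n. The function name `n_per_group` is ours.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # protects against Type I error
    z_beta = z.inv_cdf(power)           # protects against Type II error
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print("d = 0.2, alpha = 0.05:", n_per_group(0.2))              # 393 per group
print("d = 0.2, alpha = 0.01:", n_per_group(0.2, alpha=0.01))  # 584 per group
```

Because d is squared in the denominator, halving the expected effect roughly quadruples the required sample; a stricter α or higher power adds to that cost.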
A study investigates a rare adverse event where a false positive would have serious consequences (e.g., withdrawing a safe treatment).
Compare the n required here vs. Case 1.
Why is α more conservative in this scenario?
Select: One-way ANOVA (F Test)
Benchmarks used: Cohen (1992) — Small: 0.02–0.13 | Medium: 0.13–0.26 | Large: ≥ 0.26
A researcher compares VO₂max across three groups:
How many participants in total are needed?
How many per group (assuming equal groups)?
Same design as Case 5, but the researcher finds stronger evidence in the literature supporting a larger difference between groups: η² = 0.14 (medium effect).
Compare the total N with Case 5.
How does doubling the expected effect size affect the required N?
A clinical trial compares four exercise protocols on resting blood pressure — a well-established intervention with a strong expected response:
Compare with Cases 5 and 6.
How does a large effect size change the required N?
What happens when you increase power to 90%?
Select: Repeated Measures ANOVA: Within × Between Interaction
Key parameters:
- n.levels = number of groups (between factor)
- n.rm = number of time points (within factor)
- corr.rm = correlation between repeated measures (default: 0.5)
- epsilon = non-sphericity correction (default: 1)
A researcher evaluates the effect of an aerobic exercise program on body fat percentage, comparing intervention vs. control group measured at three time points (baseline, 3 months, 6 months):
How many participants in total are needed?
How many per group?
Same design as Case 8, but the researcher expects a stronger group × time interaction based on previous literature:
Compare with Case 8.
How does the effect size affect the required N?
A clinical trial compares three rehabilitation protocols (control, low dose, high dose) on muscle strength, measured at 4 time points (baseline, 4, 8, 12 weeks):
Compare with Case 8 — same η², but more groups and time points.
What is the impact of adding groups and measurements on total N?
Select: A Correlation against a Constant (z Test)
Key parameters:
- r = expected correlation
- r0 = constant to test against (default: 0 — testing if r ≠ 0)
- alternative = Not equal (two-tailed)
A researcher investigates the association between weekly steps and quality of life score in older adults:
How many participants are needed?
Same study as Case 11, but a more conservative estimate based on a broader population:
Compare with Case 11.
What is the cost of planning with a small expected correlation?
A study examines the association between VO₂max and performance time in recreational runners, where a strong correlation is expected based on the literature:
Compare with Cases 11 and 12.
What is the cost of increasing power from 80% to 90%?
A researcher wants to test whether the correlation between sedentary time and cardiometabolic risk in pregnant women (r = 0.40) is significantly different from a previously published value in the general population (r₀ = 0.20):
Compare with Case 11 (same r = 0.40 vs. r₀ = 0).
Why does testing against r₀ ≠ 0 require a larger sample?
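A sketch based on Fisher's z transformation shows where the extra participants come from. It assumes a two-sided test, α = 0.05, and 80% power (our assumptions for illustration), and uses the standard approximation \(n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{\operatorname{atanh}(r) - \operatorname{atanh}(r_0)}\right)^2 + 3\). Dedicated software may differ by a participant or two. The function name `n_for_correlation` is ours.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, r0=0.0, alpha=0.05, power=0.80):
    """Approximate n needed to distinguish a correlation r from a constant r0."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Distance between r and r0 on the Fisher z scale
    delta = math.atanh(r) - math.atanh(r0)
    return math.ceil(((z_alpha + z_beta) / delta) ** 2 + 3)

print("r = 0.40 vs r0 = 0:   ", n_for_correlation(0.40))           # 47
print("r = 0.40 vs r0 = 0.20:", n_for_correlation(0.40, r0=0.20))  # 164
```

What drives n is the distance between r and r₀, not the size of r itself: telling 0.40 apart from 0.20 is a much smaller gap than telling 0.40 apart from 0, so far more data are needed.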
A statistically significant result with a trivial effect size tells you very little.
A non-significant result from an underpowered study tells you even less.
Before collecting a single data point, ask yourself:
Planning sample size is not a bureaucratic step —
it is the first act of a rigorous scientist.
Books:
Lakens, D. Improving Your Statistical Inferences.
Caldwell, A. R., Lakens, D., Parlett-Pelleriti, C. M., Prochilo, G., & Aust, F. (2022). Power Analysis with Superpower.
Articles:
Greenland, S., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
Ho, J., et al. (2019). Moving beyond P values: data analysis with estimation graphics. Nature Methods, 16(7), 565–566.
Caldwell, A. R., et al. (2020). Moving sport and exercise science forward: a call for the adoption of more transparent research practices. Sports Medicine, 50(3), 449–459.
Lakens, D. (2022). Sample size justification. Collabra: Psychology, 8(1), 33267.
Blog posts:
Tools:
G*Power (free software for sample size calculation)
pwr package in R
dabestr package in R (estimation plots)
Questions?