The perils of power (analysis)

Phillip M. Alday
22 November 2017 (Nijmegen)

Stats is hard

And if it's not hard, then it's probably not worth a whole presentation. So I probably will say something infelicitous at some point.

If you catch it (either now or later), do let me know.

I'm the one-eyed guy with bad cataracts in the land of the blind. I trip a little bit less often, but I still trip.

Power: old and new

Traditional error analysis framework

Type-I error

(alpha-Fehler in der dt. Tradition)

Incorrect rejection of the null hypothesis
False positive
Related to specifity
spurious results
Type-I error rate often listed as \( \alpha \)
(this is where the 'significance level' comes from, but the nominal rate is often quite different from actual rate due to differences between reality and models)

Type-II error

(beta-Fehler in der dt. Tradition)

Failure to reject the null hypothesis
False negative
Related to sensitivity
null results
Type-II error rate often listed as \( \beta \)
reduce the odds of this happening by increasing power \( 1 - \beta \)
(don't confuse this \( \beta \) with the coefficient \( \beta \)s)

A more modern perspective

Type-M error

In the 'softer' sciences, there are rarely 'false' positives, because everything has some effect if we can measure it well enough.
Instead, there are false magnitude estimates.
Indeed, in a low power situation, we are more likely to overestimate an effect size, because a single extreme value has more influence.

Type-S error

Similarly, it is possible to have an effect estimate that goes in the wrong direction
even when the confidence/credible/uncertainty interval doesn't cross zero!
Again, in a low power situation, we are more likely to overestimate an effect size, because a single extreme value has more influence.

The "What does not kill my statistical significance makes it stronger" Fallacy

Surprising results in “despite low power” are neither surprising nor trustworthy.
Low power implies a higher chance of observing bigger and/or reversed effects.

What does low power even mean?

Your signal is much weaker than your noise.
Power is in some-sense your signal-to-noise ratio, albeit at the abstract level of effects instead of raw measurements.
Kangaroo bouncing on a scale

Null effects are the least of your worries

Type S

plot of chunk unnamed-chunk-2

Type M

plot of chunk unnamed-chunk-3

Literature

There's a ton!

Geoff Cummings' The New Statistics
Gelman and Carlin 2014 on Type-S and Type-M errors
Oldies but goodies from Jacob Cohen (both the book and his commentaries from the early 90s)
Button's various publications on low-power in neuroscience
and many more!

And of course see Andrew Gelman's description of our increasing awareness of the role of power in the replication crisis (search for 'timeline')

Simulation

General Linear Model

\[ Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 \ldots + \beta_{p}X_p + \varepsilon \] \[ \varepsilon \sim N(0,\sigma) \]

All of these terms are additive – they simply add together.
They are often called main effects, although there is some subtelty in your choice of coding scheme if you have categorical variables (see this worked example for a discussion of coding schemes and links to main vs simple effects)

General Linear Model

\[ Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 \ldots + \beta_{p}X_p + \varepsilon \] \[ \varepsilon \sim N(0,\sigma) \]

Categorical predictors with \( n \) levels are broken up into \( n-1 \) model terms (variables) in the usual coding schemes. (NB: This is deeply related to the 'types of sums of squares' in the traditional ANOVA procedure.)
Thus, our categorical effects are specific contrasts:
- we can specify other contrasts (equivalent to choosing a different coding scheme)
- we can compute additional contrasts offline (via e.g. lsmeans, but there are potential multiple-comparisons issues)
- we test that linear combinations of certain variables are (not) equal to a certain value
  - the special case of 'all variables belonging to a given predictor are simultaneously zero'.
  - note that combining these estimates in this way also combines their error estimates, so the power for the 'whole' categorical predictor may differ from the power for the individual model terms

Interactions: what are they?

Interactions occur when the strength of one effect is modulated by another term and thus we have three components:
- modulation effect size (new model coefficient)
- value of modulating predictor
- value of 'modulated' predictor
(We can define higher level interactions recursively/inductively from this definition.)
Interactions are multiplicative effects and are symmetric at the level of model terms.
'Resolving' an interaction only makes sense in an ANOVA context where the linear hypothesis test combines multiple model terms into one predictor.

Additive effect: \[ \beta_{1}X_1 + \beta_{2}X_2 \] Letting the coefficient of \( X_1 \) be modulated by \( X_2 \) \[ (\beta_{1} + \beta_{int}X_2)X_1 + \beta_{2}X_2 \] Distributive Property \[ \beta_{1}X_1 + \beta_{int}X_{2}X_1 + \beta_{2}X_2 \] Commutativity of (element-wise) multiplication \[ \beta_{1}X_1 + \beta_{int}X_{1}X_2 + \beta_{2}X_2 \] Distributive Property + Commutativity of Addition \[ \beta_{1}X_1 + (\beta_{2} + \beta_{int}X_{1})X_2 \]

Note that this gives a nice interpretation of quadratic and other polynomial effects – a model term is interacting with itself, thus providing a bit of a feedback cycle and changing its own slope.

mtcars interaction example

'all' terms

summary(lm(mpg ~ disp * cyl),data=mtcars)


Call:
lm(formula = mpg ~ disp * cyl)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0809 -1.6054 -0.2948  1.0546  5.7981 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.037212   5.004636   9.798 1.51e-10 ***
disp        -0.145526   0.040002  -3.638 0.001099 ** 
cyl         -3.405244   0.840189  -4.053 0.000365 ***
disp:cyl     0.015854   0.004948   3.204 0.003369 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.66 on 28 degrees of freedom
Multiple R-squared:  0.8241,    Adjusted R-squared:  0.8052 
F-statistic: 43.72 on 3 and 28 DF,  p-value: 1.078e-10

enumerate additive and multiplicative terms

summary(lm(mpg ~ disp + cyl + disp:cyl),data=mtcars)


Call:
lm(formula = mpg ~ disp + cyl + disp:cyl)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0809 -1.6054 -0.2948  1.0546  5.7981 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.037212   5.004636   9.798 1.51e-10 ***
disp        -0.145526   0.040002  -3.638 0.001099 ** 
cyl         -3.405244   0.840189  -4.053 0.000365 ***
disp:cyl     0.015854   0.004948   3.204 0.003369 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.66 on 28 degrees of freedom
Multiple R-squared:  0.8241,    Adjusted R-squared:  0.8052 
F-statistic: 43.72 on 3 and 28 DF,  p-value: 1.078e-10

explicitly calculate interacion

summary(lm(mpg ~ disp + cyl + I(disp*cyl)),data=mtcars)


Call:
lm(formula = mpg ~ disp + cyl + I(disp * cyl))

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0809 -1.6054 -0.2948  1.0546  5.7981 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   49.037212   5.004636   9.798 1.51e-10 ***
disp          -0.145526   0.040002  -3.638 0.001099 ** 
cyl           -3.405244   0.840189  -4.053 0.000365 ***
I(disp * cyl)  0.015854   0.004948   3.204 0.003369 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.66 on 28 degrees of freedom
Multiple R-squared:  0.8241,    Adjusted R-squared:  0.8052 
F-statistic: 43.72 on 3 and 28 DF,  p-value: 1.078e-10

see R do it for you

head(model.matrix(~ disp * cyl,data=mtcars))

                  (Intercept) disp cyl disp:cyl
Mazda RX4                   1  160   6      960
Mazda RX4 Wag               1  160   6      960
Datsun 710                  1  108   4      432
Hornet 4 Drive              1  258   6     1548
Hornet Sportabout           1  360   8     2880
Valiant                     1  225   6     1350

Note the format

predictors (including the intercept) are columns
observations are rows.

So we when write the GLM as a matrix operation, we write \[ Y = X \beta + \varepsilon \]

where \( X \) is the model matrix as above, \( \beta \) is a column vector of coefficients/effects (matrices are multipled row by column) and the result is a column vector of predictions for each observation. The error \( \varepsilon \) a column vector whose entires are sampled from \( N(0,\sigma) \), where \( \sigma \) is the residual variance.

This is equivalent to the column-vector representation we used earlier and it's easier to manipulate for computations, but harder to discuss individual terms.

Challenge question

What effect does log-transforming variables have on the interpretation of additive and multiplicative effects?

\[ \log(ab) = \log(a) + \log(b) \]

So additive effects on the log scale are multiplicative effects on the original scale!

Hierarchical models

Extending \[ Y = \beta_{0} + \beta_{1}X_1 + \beta_{2}X_2 \ldots + \beta_{p}X_p + \varepsilon \] \[ \varepsilon \sim N(0,\sigma) \]

per-subject correction for intercepts: \[ Y = \beta_{1}X_1 + \beta_{0} + S_0 + \varepsilon \]
per-subject correction for slopes: \[ Y = \left(\beta_{1} + S_1\right)X_1 + \beta_{0} + \varepsilon \]
per-subject correction for both: \[ Y = \left(\beta_{1} + S_1\right)X_1 + \beta_{0} + S_0 + \varepsilon \]
per-subject and per-item correction for both: \[ Y = \left(\beta_{1} + S_1 +I_1 \right)X_1 + \beta_{0} + I_0 + S_0 + \varepsilon \]

All such that, \[ S_i \stackrel{iid}{\sim} N(0,\sigma_{S_i}) \] \[ I_i \stackrel{iid}{\sim} N(0,\sigma_{I_i}) \]

The grouping variable behind the vertical bar is used so that each member of that group gets an offset attached to each of the fixed effects included before the vertical bar. The exact offsets for the observed groups are predicted (BLUPs/conditional modes), but overall the model only estimates the variance (or equivalently, the standard deviation) of the grouping variable for each effect, under the assumption that the offsets are drawn from a normal distribution with that variance and mean of 0 (remember, since these are offsets from the population-level fixed effect, this is equiv. to the group has the fixef est. as its mean.)