MT5762 Lecture 12

C. Donovan

Power

Power is intimately related to Type II errors (false negatives)

It is our ability to detect differences when they actually exist

Overview

We look to address:

  • What is power? (and why do we care)
  • Intuitively how would you calculate power?
  • How do you calculate power in practice?
  • How can you improve power?

Overview

We've now done a lot of hypothesis testing.

  • Type I error is the probability of incorrectly rejecting \( H_0 \), i.e. a false positive
  • Type II error is the probability of incorrectly failing to reject \( H_0 \), i.e. a false negative

We control Type I error in our tests by setting our threshold \( p \)-value (Type I error rate = \( \alpha \)).

There is nothing explicitly controlling Type II error (\( \beta \)) in our tests.

Overview

  • Perhaps obviously, Type II error is the harder one.
  • Calculating it requires that we know (or speculate about) what \( H_A \) is - something we're usually very vague about.

Power is \( 1-\beta \) - the probability we correctly reject \( H_0 \)

  • To calculate it, we have to specify an explicit \( H_A \).
  • Similar to our other tests, our probability calculations are conditional on some hypothesised state being true.
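
One intuitive way to calculate power is by simulation: pick a concrete \( H_A \), generate many datasets under it, and count how often the test rejects. A minimal sketch (my own illustration, with an assumed effect of 0.5 standard deviations and \( n = 30 \)):

set.seed(42)   # assumed seed, for reproducibility
# simulate 10000 one-sided t-tests with a true shift of 0.5 sd, alpha = 0.05
pvals <- replicate(10000, t.test(rnorm(30, mean = 0.5, sd = 1),
                                 mu = 0, alternative = "greater")$p.value)
mean(pvals < 0.05)   # proportion of correct rejections - roughly 0.85 here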

Motivating example - EIA

  • This cool thing was a prototype current turbine

[Figure: a prototype current turbine]

Motivating example - EIA

  • People were concerned about animals being injured by it (it does look like a blender, though such concerns were largely unfounded)
  • Extensive Environmental Impact Assessments were required


EIA


  • Monitoring various animal populations by visual survey
  • Interested in determining if there is an impact that can be linked to the turbine.
  • The models of abundance include various environmental conditions (tide state, time of day, season, etc.) and are quite complex

EIA


  • No impacts were immediately detectable.

Raises various questions:

  • Is it because there is no impact?
  • Or is it because our monitoring scheme and analysis have low power?

EIA


  • What size impacts are we likely to be able to see under the status quo?
  • Are more samples or a longer observation period likely to change this? (i.e. cost-effectiveness)

EIA


  • This area has a conservation status that means we must concern ourselves about “significant effects”.
  • Suppose it is speculated that reductions in the local animal population size of 10%-20% would be concerning and might jeopardise the turbine licence
  • More money could be spent on collecting data.
  • An investigation of power is required.

EIA

  • Various effect sizes were considered, along with two sampling periods, i.e. amounts of data
  • Results are the percentage chance of detecting the effect with that data

[Table: power analysis results]

EIA


  • These are effectively the effect size \( D \) and sample size \( n \) that we considered in our simple power analyses.
  • The power under various effect sizes and sampling regimes is given as a percentage
  • For example, we expect about a 0.898 probability (89.8% chance) of detecting a 20% reduction if a further 6 months of data are collected.

Some key elements of a power analysis

Power calculations are often done to determine how much data to collect.

  • How big an effect do you need/hope to detect?
  • How much variability is in the system?
  • What level of power is needed?

Armed with these, we can advise on \( n \) to meet specifications
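
As a sketch of how this works in practice, R's built-in stats::power.t.test solves for whichever of \( n \), effect size, and power is left unspecified (the numbers below are illustrative assumptions, not from a real study):

# how many observations to detect a shift of 0.75 with sd 2.4,
# one-sided alpha = 0.05 and power 0.8?
power.t.test(delta = 0.75, sd = 2.4, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")
# the printed n is the required sample size - around 65 under these assumptions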

Some problems

Power calculations are often done to determine how much data to collect.

  • The variability of the system is needed:
    • Perhaps this is unknown, so we have to guess
    • If we estimate it from a small pilot study, the precision of that variance estimate is very low
  • The recommended \( n \) is often large, so the desired power and effect size might need revising
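
To illustrate the first problem with the same illustrative power.t.test call as above: the required \( n \) scales roughly with \( \sigma^2 \), so a poor guess at the variability badly misleads the recommendation.

# doubling the assumed sd roughly quadruples the recommended n
power.t.test(delta = 0.75, sd = 2.4, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")$n
power.t.test(delta = 0.75, sd = 4.8, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")$n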

Another example

(Example modified from Larsen & Marx, 2001)

  • There is a new fuel additive that is expected to increase the fuel efficiency for vehicles. The underlying fuel efficiency with the standard fuel is assumed to be 25 mpg.
  • It is thought an improvement of at least 3% in fuel efficiency would be substantial enough for the additive to be taken to market.
  • Hence we are particularly interested in establishing a change from 25 to 25.75 mpg.

Another example

  • It is proposed that a sample size of 30 would be used to establish this improvement.
  • The standard deviation of efficiency in cars is claimed to be \( \sigma = 2.4 \) mpg.
  • If the true mean was in fact 25.75 mpg, what are the power and Type II error associated with this (one-sided) test scenario?

NB we've been given \( \sigma \) as known. If this were estimated, we'd have to use \( t \)-distributions.
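
As a cross-check (not part of the original example), power.t.test runs the equivalent \( t \)-based calculation for this scenario:

power.t.test(n = 30, delta = 0.75, sd = 2.4, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")$power
# roughly 0.51 - slightly below the known-sigma answer of about 0.53 derived below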

Another example

  • Under \( H_0 \), we'd set the decision boundary at:

qnorm(0.95)
[1] 1.644854

\[ 25 + 1.644854 \times \frac{2.4}{\sqrt{30}} \]

25+1.644854*2.4/sqrt(30)
[1] 25.72074

Another example

  • So a sample mean above 25.72 is required for a positive result

[I'll indulge in some laptop sketching again]

[Figure: the decision boundary]

Another example

  • If \( H_A \) were true with its 3% greater mean, and had the same spread, we'd produce false negative (Type II) errors by falling left of this decision boundary

[Figure: the decision boundary for \( \alpha = 0.05 \) (red), with the false negative area if \( \mu = 25.75 \) shaded blue]

Another example

Calculate the blue false negative area in R:

pnorm(25.72, mean = 25.75, sd = 2.4/sqrt(30), lower.tail = TRUE)
[1] 0.4727076
  • So this occurs about 47.3% of the time
  • Power is \( 1-\beta \) = 0.527, i.e. 52.7%

Another example

  • So we'd fail to find evidence of an improvement, if it were only 3%, almost half the time.
  • These are not great odds! The obvious places to improve this poor power are:
    • Increasing precision through larger samples.
    • Increasing precision through controlling variability (e.g. testing on a standardised track or the like).
    • Accepting a higher Type I error.
    • Having a fuel additive that is expected to give improvements much greater than 3%.

Another example

What happens if \( n \) is doubled?

  • The decision boundary is now at:
decisionBound <- 25+1.644854*2.4/sqrt(60)

decisionBound
[1] 25.50964

Another example

What happens if \( n \) is doubled?

  • Meaning if the reality is actually \( \mu = 25.75 \) (the 3% increase):
pnorm(decisionBound, mean = 25.75, sd = 2.4/sqrt(60))
[1] 0.2189452

Another example

  • This means our Type II error is about 22% - a power of 78% (80% is often the goal in planning)
  • Bigger differences give higher power too - about 94% power for a 4% efficiency improvement:
pnorm(decisionBound, mean = 25*1.04, sd = 2.4/sqrt(60))
[1] 0.05675267
  • Basically, increasing the signal-to-noise ratio improves power
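
To save repeating the boundary-then-tail-area recipe, it can be wrapped in a small helper (a sketch; powerCalc is my own name, not a standard function):

# power of the one-sided z-test above, as a function of n and the true mean
powerCalc <- function(n, muTrue, mu0 = 25, sigma = 2.4, alpha = 0.05) {
  bound <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)      # decision boundary under H0
  1 - pnorm(bound, mean = muTrue, sd = sigma / sqrt(n))  # area right of the boundary under HA
}
powerCalc(30, 25.75)   # about 0.53 - the first scenario
powerCalc(60, 25.75)   # about 0.78 - n doubled
powerCalc(60, 26.00)   # about 0.94 - a 4% improvement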

Another example

  • Accepting a higher Type I error similarly improves power - it lowers the decision boundary in this problem
qnorm(0.9)
[1] 1.281552
decisionBound <- 25+qnorm(0.9)*2.4/sqrt(30)

decisionBound
[1] 25.56155

Another example

pnorm(decisionBound, mean = 25*1.03, sd = 2.4/sqrt(30))
[1] 0.3335682
  • This gives a Type II error of 33% vs 47% before (power moves from 53% to 67%)
  • We're trading the probability of one error for another
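
Using the hypothetical powerCalc sketch from above, the same trade-off in one line:

powerCalc(30, 25.75, alpha = 0.10)   # about 0.67, matching the calculation above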

Multiple comparisons

With this in mind, consider multiple comparison adjustments

  • We make our Type I error thresholds more stringent, i.e. lower threshold \( p \)-values, pushing the decision boundary further out
  • We're trading errors at some level - Type II errors increase, so power goes down
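
For instance, a Bonferroni-style adjustment across \( m \) tests runs each test at \( \alpha/m \); with the hypothetical powerCalc sketch from earlier:

powerCalc(30, 25.75, alpha = 0.05/10)   # about 0.19 with 10 comparisons, down from 0.53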

Multiple comparisons

Implications for cynical exploitation:

  • If you don't want to find significant results, you want low power
  • Take small samples, define only big changes as important, do adjust for multiple comparisons, use sloppy measurements to give large variances, etc.

Some matrix algebra!

We need this a lot:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \]

leading to things like this

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1}\\ 1 & x_{1,2} & x_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & x_{1,n} & x_{2,n}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} \]
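
A small sketch with assumed toy data: in R, the design matrix from model.matrix and the estimated coefficients from coef reproduce the fitted values exactly as the formula says.

set.seed(1)                # assumed toy data
x1 <- runif(10); x2 <- runif(10)
y <- 1 + 2*x1 - x2 + rnorm(10)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)     # n x 3: a column of 1s, then x1, then x2
betaHat <- coef(fit)       # (beta0-hat, beta1-hat, beta2-hat)
all.equal(as.numeric(X %*% betaHat), as.numeric(fitted(fit)))   # TRUE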

Wrap up

We've looked at

  • the complement of Type I error: Type II error (false positives and false negatives respectively)
  • Power is our ability to detect differences (a type of signal) when they exist
  • We can see how to improve power by examining its calculation

“Assessment”

  • Group project (40% - has individual components)
  • An online quiz to do at your leisure to practice for the test
  • Class test, late semester, date TBD (30%)