MT5762 Lecture 12

C. Donovan

Power

Power is intimately related to Type II errors (false negatives)

It is our ability to detect differences when they actually exist

Overview

We look to address:

  • What is power? (and why do we care)
  • Intuitively how would you calculate power?
  • How do you calculate power in practice?
  • How can you improve power?

Overview

We've now done a lot of hypothesis testing.

  • Type I error is the probability of incorrectly rejecting \( H_0 \), i.e. a false positive
  • Type II error is the probability of incorrectly failing to reject \( H_0 \), i.e. a false negative

We control Type I error in our tests by setting our threshold \( p \)-value (Type I error rate = \( \alpha \)).

There is nothing explicitly controlling Type II error (\( \beta \)) in our tests.

Overview

  • Perhaps obviously, Type II error is the harder one.
  • Calculating it requires that we know (or speculate about) what \( H_A \) is - something we're usually very vague about.

Power is \( 1-\beta \) - the probability we correctly reject \( H_0 \)

  • To calculate it, we have to specify an explicit \( H_A \).
  • Similar to our other tests, our probability calculations are conditional on some hypothesised state being true.
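
One intuitive way to calculate power is by simulation: pick a concrete \( H_A \), generate many datasets under it, and count how often the test rejects. A minimal sketch (my own illustration, with an assumed effect of 0.5 standard deviations and \( n = 30 \)):

set.seed(42)   # assumed seed, for reproducibility
# simulate 10000 one-sided t-tests with a true shift of 0.5 sd, alpha = 0.05
pvals <- replicate(10000, t.test(rnorm(30, mean = 0.5, sd = 1),
                                 mu = 0, alternative = "greater")$p.value)
mean(pvals < 0.05)   # proportion of correct rejections - roughly 0.85 here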

Motivating example - EIA

  • This cool thing was a prototype current turbine

[Figure: a prototype current turbine]

Motivating example - EIA

  • People were concerned about animals being injured by it (it does look like a blender, though such concerns were largely unfounded)
  • Extensive Environmental Impact Assessments were required


EIA


  • Monitoring various animal populations by visual survey
  • Interested in determining if there is an impact that can be linked to the turbine.
  • The models of abundance include various environmental conditions (tide state, time of day, season, etc.) and are quite complex

EIA


  • No impacts were immediately detectable.

Raises various questions:

  • Is it because there is no impact?
  • Or is it because our monitoring scheme and analysis have low power?

EIA


  • What size impacts are we likely to be able to see under the status quo?
  • Are more samples or a longer observation period likely to change this? (i.e. cost-effectiveness)

EIA


  • This area has a conservation status that means we must concern ourselves about “significant effects”.
  • Suppose it is speculated that reductions in the local animal population size of 10%-20% would be concerning and might jeopardise the turbine licence
  • More money could be spent on collecting data.
  • An investigation of power is required.

EIA

  • Various effect sizes were considered, along with two sampling periods, i.e. amounts of data
  • Results are the percentage chance of detecting the effect with that data

[Table: power analysis results]

EIA


  • These are effectively the effect size \( D \) and sample size \( n \) that we considered in our simple power analyses.
  • The power under various effect sizes and sampling regimes is given as a percentage
  • For example, we expect about a 0.898 probability (89.8% chance) of detecting a 20% reduction if a further 6 months of data are collected.

Some key elements of a power analysis

Power calculations are often done to determine how much data to collect.

  • How big an effect do you need/hope to detect?
  • How much variability is in the system?
  • What level of power is needed?

Armed with these, we can advise on \( n \) to meet specifications
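
As a sketch of how this works in practice, R's built-in stats::power.t.test solves for whichever of \( n \), effect size, and power is left unspecified (the numbers below are illustrative assumptions, not from a real study):

# how many observations to detect a shift of 0.75 with sd 2.4,
# one-sided alpha = 0.05 and power 0.8?
power.t.test(delta = 0.75, sd = 2.4, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")
# the printed n is the required sample size - around 65 under these assumptions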

Some problems

Power calculations are often done to determine how much data to collect.

  • The variability of the system is needed:
    • Perhaps this is unknown, so we have to guess
    • If we estimate it from a small pilot study, the precision of that variance estimate is very low
  • The recommended \( n \) is often large, so the desired power and effect size might need revising
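
To illustrate the first problem with the same illustrative power.t.test call as above: the required \( n \) scales roughly with \( \sigma^2 \), so a poor guess at the variability badly misleads the recommendation.

# doubling the assumed sd roughly quadruples the recommended n
power.t.test(delta = 0.75, sd = 2.4, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")$n
power.t.test(delta = 0.75, sd = 4.8, sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")$n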

Another example

(Example modified from Larsen & Marx, 2001)

  • There is a new fuel additive that is expected to increase the fuel efficiency for vehicles. The underlying fuel efficiency with the standard fuel is assumed to be 25 mpg.
  • It is thought an improvement of at least 3% in fuel efficiency would be substantial enough for the additive to be taken to market.
  • Hence we are particularly interested in establishing a change from 25 to 25.75 mpg.

Another example

  • It is proposed that a sample size of 30 would be used to establish this improvement.
  • The standard deviation of efficiency in cars is claimed to be \( \sigma = 2.4 \) mpg.
  • If the true mean was in fact 25.75 mpg, what are the power and Type II error associated with this (one-sided) test scenario?

NB we've been given \( \sigma \) as known. If this were estimated, we'd have to use \( t \)-distributions.
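
As a cross-check (not part of the original example), power.t.test runs the equivalent \( t \)-based calculation for this scenario:

power.t.test(n = 30, delta = 0.75, sd = 2.4, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")$power
# roughly 0.51 - slightly below the known-sigma answer of about 0.53 derived below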

Another example

  • Under \( H_0 \), we'd set the decision boundary at:

qnorm(0.95)
[1] 1.644854

\[ 25 + 1.644854 \times \frac{2.4}{\sqrt{30}} \]

25+1.644854*2.4/sqrt(30)
[1] 25.72074

Another example

  • So a sample mean above 25.72 is required for a positive result

[I'll indulge in some laptop sketching again]

[Figure: the decision boundary]

Another example

  • If \( H_A \) were true with its 3% greater mean, and had the same spread, we'd produce false negative (Type II) errors by falling left of this decision boundary

[Figure: the decision boundary for \( \alpha = 0.05 \) (red), with the false negative area if \( \mu = 25.75 \) shaded blue]

Another example

Calculate the blue false negative area in R:

pnorm(25.72, mean = 25.75, sd = 2.4/sqrt(30), lower.tail = TRUE)
[1] 0.4727076
  • So this occurs about 47.3% of the time
  • Power is \( 1-\beta \) = 0.527, i.e. 52.7%

Another example

  • So we'd fail to find evidence of an improvement, if it were only 3%, almost half the time.
  • These are not great odds! The obvious places to improve this poor power are:
    • Increasing precision through larger samples.
    • Increasing precision through controlling variability (e.g. testing on a standardised track or the like).
    • Accepting a higher Type I error.
    • Having a fuel additive that is expected to give improvements much greater than 3%.

Another example

What happens if \( n \) is doubled?

  • The decision boundary is now at:
decisionBound <- 25+1.644854*2.4/sqrt(60)

decisionBound
[1] 25.50964

Another example

What happens if \( n \) is doubled?

  • Meaning if the reality is actually \( \mu = 25.75 \) (the 3% increase):
pnorm(decisionBound, mean = 25.75, sd = 2.4/sqrt(60))
[1] 0.2189452

Another example

  • This means our Type II error is about 22% - a power of 78% (80% is often the goal in planning)
  • Bigger differences give higher power too - about 94% power for a 4% efficiency improvement:
pnorm(decisionBound, mean = 25*1.04, sd = 2.4/sqrt(60))
[1] 0.05675267
  • Basically, increasing the signal-to-noise ratio improves power
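
To save repeating the boundary-then-tail-area recipe, it can be wrapped in a small helper (a sketch; powerCalc is my own name, not a standard function):

# power of the one-sided z-test above, as a function of n and the true mean
powerCalc <- function(n, muTrue, mu0 = 25, sigma = 2.4, alpha = 0.05) {
  bound <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)      # decision boundary under H0
  1 - pnorm(bound, mean = muTrue, sd = sigma / sqrt(n))  # area right of the boundary under HA
}
powerCalc(30, 25.75)   # about 0.53 - the first scenario
powerCalc(60, 25.75)   # about 0.78 - n doubled
powerCalc(60, 26.00)   # about 0.94 - a 4% improvement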

Another example

  • Accepting a higher Type I error similarly improves power - it lowers the decision boundary in this problem
qnorm(0.9)
[1] 1.281552
decisionBound <- 25+qnorm(0.9)*2.4/sqrt(30)

decisionBound
[1] 25.56155

Another example

pnorm(decisionBound, mean = 25*1.03, sd = 2.4/sqrt(30))
[1] 0.3335682
  • This gives a Type II error of 33% vs 47% before (power moves from 53% to 67%)
  • We're trading the probability of one error for another
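
Using the hypothetical powerCalc sketch from above, the same trade-off in one line:

powerCalc(30, 25.75, alpha = 0.10)   # about 0.67, matching the calculation above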

Multiple comparisons

With this in mind, consider multiple comparison adjustments

  • We make our Type I error thresholds more stringent, i.e. lower threshold \( p \)-values, pushing the decision boundary further out
  • We're trading errors at some level - Type II errors increase, so power goes down
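
For instance, a Bonferroni-style adjustment across \( m \) tests runs each test at \( \alpha/m \); with the hypothetical powerCalc sketch from earlier:

powerCalc(30, 25.75, alpha = 0.05/10)   # about 0.19 with 10 comparisons, down from 0.53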

Multiple comparisons

Implications for cynical exploitation:

  • If you don't want to find significant results, you want low power
  • Take small samples, define only big changes as important, do adjust for multiple comparisons, use sloppy measurements to give large variances, etc.

Some matrix algebra!

We need this a lot:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \]

leading to things like this

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1}\\ 1 & x_{1,2} & x_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & x_{1,n} & x_{2,n}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} \]
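
A small sketch with assumed toy data: in R, the design matrix from model.matrix and the estimated coefficients from coef reproduce the fitted values exactly as the formula says.

set.seed(1)                # assumed toy data
x1 <- runif(10); x2 <- runif(10)
y <- 1 + 2*x1 - x2 + rnorm(10)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)     # n x 3: a column of 1s, then x1, then x2
betaHat <- coef(fit)       # (beta0-hat, beta1-hat, beta2-hat)
all.equal(as.numeric(X %*% betaHat), as.numeric(fitted(fit)))   # TRUE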

Wrap up

We've looked at

  • the complement of Type I error: Type II error (false positives and false negatives respectively)
  • Power is our ability to detect differences (a type of signal) when they exist
  • We can see how to improve power by examining its calculation

“Assessment”

  • Group project (40% - has individual components)
  • An online quiz to do at your leisure to practice for the test
  • Class test, late semester, date TBD (30%)