This document provides a framework for performing a power analysis with any given test of statistical significance.
When the grouped statistics of interest are similar, detecting differences between groups may be difficult, so determining a sample size adequate to detect a significant difference (if one exists) is important.
This allows us to avoid Type II errors, in which we fail to reject a null hypothesis that should be rejected (is actually false); power is the probability of avoiding such an error (power = 1 - beta, where beta is the Type II error rate).
For example, if I design a lobster trap sampling survey at two sites where I expect differing catch rates, I want to be confident that, with the number of samples I take, I will be able to detect a difference if it exists.
Overall, a power analysis helps determine the confidence a researcher can have in the results of a given test.
An example power analysis is provided using aggregated catch data from the UMaine Commercial Trapping Survey. The analysis estimates the power to detect mean catch rate differences between a test site and a control site.
The analysis iteratively performs a test on sampled data to determine the percentage of tests in which significant results are detected across varied sample sizes and effect sizes.
Sampled data may come from previous observations or be simulated based on prior knowledge, personal communications, expected values from the literature, and more.
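In outline, the procedure looks something like the sketch below, where simulate_data() and run_test() are hypothetical placeholders for whatever data-generating process and significance test apply to your study:
# generic power-by-simulation skeleton (a sketch; simulate_data() and
# run_test() are hypothetical placeholders, not functions defined here)
power_sim <- function(n, iter = 1000, alpha = 0.05) {
  sig <- replicate(iter, {
    dat <- simulate_data(n)        # draw a sample of size n
    run_test(dat)$p.value <= alpha # was the test significant?
  })
  mean(sig) # proportion of significant results = estimated power
}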
rm(list = ls()) # clean env
# attach necessary packages
library(doParallel)
library(wesanderson)
library(tidyverse)
# import data
catch.data <- tibble(read.csv('~/Documents/R Projects/SMS 598-0004/CTS_gam_data.csv'))
# check out import
str(catch.data); head(catch.data)
The above catch data are from the UMaine Commercial Trapping Survey, Fall 2021. This survey is a before-after control-impact (BACI) study designed to determine whether measurable impacts on legal lobster catch are observed at the site of a floating offshore wind turbine post-deployment. This is done by conducting pre-deployment surveys at a test and a control site to establish a relationship between catch rates at the sites, then repeating the surveys post-deployment to determine whether the established relationship changes.
In this survey it is important to detect differences between the test and control sites now, so we know how (or whether) they differ in legal catch rate. To plan a second survey effectively and economically, we will want to know how many samples to collect.
Likewise, if catch rates change in the future, we should know the sample sizes necessary to observe effects of different sizes.
To begin the power analysis it is important to have some expectation of what the data may look like. Therefore, pre-deployment landing rates and standard deviations are calculated.
# mean seasonal site landing rates
catch.data %>%
  group_by(array, season) %>%
  summarize(mean = mean(land, na.rm = TRUE),
            sd = sd(land, na.rm = TRUE),
            n = n())
## # A tibble: 4 × 5
## # Groups: array [2]
## array season mean sd n
## <chr> <chr> <dbl> <dbl> <int>
## 1 Control Fall 2.62 2.04 324
## 2 Control Spring 1.11 1.17 324
## 3 Test Fall 3.33 2.39 324
## 4 Test Spring 1.30 1.36 252
In this example I will test whether site mean catch rates differ. To do so I would use either a t-test (normally distributed data) or a Kruskal-Wallis test (non-normally distributed data). To assess normality, the distribution is plotted.
# plot distribution of landings
catch.data %>%
  filter(season == 'Fall' & trap.type == 'Vented') %>%
  ggplot(aes(x = land, fill = array)) +
  facet_grid(cols = vars(array)) +
  geom_histogram(bins = 12) +
  geom_hline(yintercept = 0) +
  labs(title = 'Distribution of legal catch per trap',
       subtitle = 'Fall 2021',
       x = 'Legal lobsters / trap',
       y = 'Frequency',
       fill = 'Site') +
  scale_fill_manual(values = wes_palette("GrandBudapest1", n = 2)) +
  theme_linedraw(base_size = 16) +
  theme(strip.background = element_rect(fill = 'gray85'),
        strip.text = element_text(color = 'black'))
# distribution is not normal; test difference in means with Kruskal-Wallis
kruskal.test(land ~ array, catch.data, subset = season == 'Fall')
##
## Kruskal-Wallis rank sum test
##
## data: land by array
## Kruskal-Wallis chi-squared = 15.455, df = 1, p-value = 8.449e-05
The distribution of legal catch is not normal, so a Kruskal-Wallis test is used; it detects significant differences between sites in the observed dataset.
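As a supplementary check (not part of the original workflow), a formal normality test such as Shapiro-Wilk can back up the visual assessment:
# supplementary formal normality check per site (a sketch)
catch.data %>%
  filter(season == 'Fall') %>%
  group_by(array) %>%
  summarize(shapiro.p = shapiro.test(land)$p.value)
# small p-values reject normality, supporting the Kruskal-Wallis choice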
Using the calculated mean catch rate and standard deviation for each site during the Fall survey, we can simulate random distributions to represent our data. Here a Tweedie distribution is used because catch per trap is zero-heavy and right-skewed; mu is set near each site's observed mean, while phi and the power index are chosen so the simulated spread approximates the observed standard deviation.
# simulate catch data
set.seed(245)
test <- tweedie::rtweedie(n = 300, mu = 3.3, phi = 2.3, power = 1.99)
control <- tweedie::rtweedie(n = 300, mu = 2.6, phi = 2.2, power = 1.99)
tibble(land = c(test, control),
       array = rep(c('Test', 'Control'), each = 300)) -> sim.data1
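# quick sanity check (added sketch, not in the original workflow): the simulated
# moments should roughly match the observed Fall summaries
# (Test: mean 3.33, sd 2.39; Control: mean 2.62, sd 2.04)
c(mean = mean(test), sd = sd(test))
c(mean = mean(control), sd = sd(control))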
ggplot(sim.data1, aes(x = land, fill = array)) +
  facet_grid(cols = vars(array)) +
  geom_histogram(bins = 12) +
  geom_hline(yintercept = 0) +
  labs(title = 'Simulated distribution of legal catch per trap',
       subtitle = 'Fall 2021',
       x = 'Legal lobsters / trap',
       y = 'Frequency',
       fill = 'Site') +
  scale_fill_manual(values = wes_palette("GrandBudapest1", n = 2)) +
  theme_linedraw(base_size = 16) +
  theme(strip.background = element_rect(fill = 'gray85'),
        strip.text = element_text(color = 'black'))
kruskal.test(land ~ array, sim.data1) # no significant difference between site means
##
## Kruskal-Wallis rank sum test
##
## data: land by array
## Kruskal-Wallis chi-squared = 1.5032, df = 1, p-value = 0.2202
In the first simulation the Kruskal-Wallis test finds no significant difference between the means.
# simulate catch data again
set.seed(2044)
test <- tweedie::rtweedie(n = 300, mu = 3.3, phi = 2.3, power = 1.99)
control <- tweedie::rtweedie(n = 300, mu = 2.6, phi = 2.2, power = 1.99)
tibble(land = c(test, control),
       array = rep(c('Test', 'Control'), each = 300)) -> sim.data2
ggplot(sim.data2, aes(x = land, fill = array)) + # plot the second simulation (sim.data2, not sim.data1)
  facet_grid(cols = vars(array)) +
  geom_histogram(bins = 12) +
  geom_hline(yintercept = 0) +
  labs(title = 'Simulated distribution of legal catch per trap',
       subtitle = 'Fall 2021',
       x = 'Legal lobsters / trap',
       y = 'Frequency',
       fill = 'Site') +
  scale_fill_manual(values = wes_palette("GrandBudapest1", n = 2)) +
  theme_linedraw(base_size = 16) +
  theme(strip.background = element_rect(fill = 'gray85'),
        strip.text = element_text(color = 'black'))
kruskal.test(land ~ array, sim.data2) # this time the difference in means is significant
##
## Kruskal-Wallis rank sum test
##
## data: land by array
## Kruskal-Wallis chi-squared = 7.3095, df = 1, p-value = 0.006859
However, in the second simulation the means are significantly different.
The distributions are generated using the same parameters but different seeds for the random number generator.
This presents a problem: we want to detect differences when they exist, even small ones, so how can we be confident in the results we read if statistically identical data-generating processes can yield different conclusions?
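To quantify this, we can repeat the simulation many times and record how often the test flags the (real) difference; a quick illustrative sketch, separate from the survey analysis:
# repeat the simulation 100 times; the true means differ (3.3 vs 2.6),
# so the proportion of significant results estimates the test's power here
p.vals <- replicate(100, {
  test <- tweedie::rtweedie(n = 300, mu = 3.3, phi = 2.3, power = 1.99)
  control <- tweedie::rtweedie(n = 300, mu = 2.6, phi = 2.2, power = 1.99)
  kruskal.test(list(test, control))$p.value
})
mean(p.vals <= 0.05) # fraction of runs in which the difference is detected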
By applying a series of effect sizes to our observed data and sampling with a series of sample sizes, we can test iteratively to determine the confidence we have in the results for each (effect, sample) pair.
Below I provide a function that takes simulated data, creates samples from it, tests for a significant difference in mean catch between samples, and stores significance as a binary value.
After the specified number of iterations, the mean of the binary values gives the percentage of the time a significant result is obtained.
This percentage is the power for the given (effect, sample) pair; power >= 0.8 is conventionally considered acceptable.
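In notation, the estimated power for a given (effect, sample) pair over $I$ iterations is

$$\widehat{\mathrm{power}} = \frac{1}{I}\sum_{i=1}^{I} \mathbf{1}\!\left(p_i \le \alpha\right),$$

where $p_i$ is the p-value from iteration $i$ and $\alpha$ is the significance level.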
# write function to use in power analysis
kruskal.power <- function(n, sim.data, alpha = 0.05, iter = 10){
  # sim.data is passed explicitly so the function also resolves it inside parallel workers
  stat <- numeric(iter)
  for(i in 1:iter){
    # print(i) # optional progress printout (not visible from parallel workers)
    # create samples from the simulated data
    test.sample <- slice_sample(sim.data[sim.data$season == 'Fall' &
                                           sim.data$array == 'Test', ],
                                n = n/2, replace = TRUE)
    control.sample <- slice_sample(sim.data[sim.data$season == 'Fall' &
                                              sim.data$array == 'Control', ],
                                   n = n/2, replace = TRUE)
    # then concatenate samples and run the stat test
    sample <- bind_rows(test.sample, control.sample)
    # test difference in site means on the effect-modified catch column (catch.pwr);
    # this step may be any statistical test of significance, for example a GAM
    # where we record the significance value of a specific covariate of interest
    k.tst <- kruskal.test(catch.pwr ~ array, sample)
    # record binary significance values
    if (k.tst$p.value <= alpha) {
      stat[i] <- 1
    } else {
      stat[i] <- 0
    }
  }
  my.power <- mean(stat) # calculate power
  stat.sd <- sd(stat) # other stats may be added as well
  # se <- stat.sd/sqrt(iter) # standard error over iterations
  # hi <- my.power + qt(1 - (0.05 / 2), iter - 1) * se
  # lo <- my.power - qt(1 - (0.05 / 2), iter - 1) * se
  tibble(my.power,
         stat.sd,
         n
         # se,
         # hi,
         # lo
         ) -> res
  return(res)
}
For this analysis we're interested in detecting significant differences at various effect sizes. Therefore, effect sizes must be simulated in the data.
To do this we duplicate the catch dataset so there is one replicate per effect size. In each replicate we multiply catch values at the test site by a simulated effect size between 0.5 and 1.5, scaling catch to between 50% and 150% of its observed value.
# create vector of effect sizes by which to modify test site catch
(effect_sizes <- seq(0.5, 1.5, by = 0.1))
# create vector of n values to test
(n_vals <- seq(100, 1000, by = 50))
# create tibble to modify with effect sizes
(catch.data %>%
   slice(rep(1:n(), times = length(effect_sizes))) %>% # replicate the tibble once per effect size
   mutate(land = as.double(land), # necessary for the last mutate step
          effect_size = rep(effect_sizes, each = nrow(catch.data))) %>% # add effect size column
   mutate(catch.pwr = if_else(array == 'Test', land*effect_size, land)) -> catch.data.pwr) # modify catch values at the test site by the simulated effect size
## [1] 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
## [1] 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800
## [16] 850 900 950 1000
## # A tibble: 13,464 × 22
## X date haul array trap.id trap.…¹ effort land depth slope aspect
## <int> <chr> <int> <chr> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 2021-10-15 1 Contr… 1 Ventle… 4 0 -77.6 15.1 325.
## 2 2 2021-10-19 2 Contr… 1 Ventle… 4 0 -89.8 20.3 79.3
## 3 3 2021-10-23 3 Contr… 1 Ventle… 4 4 -86.6 15.1 343.
## 4 4 2021-10-29 4 Contr… 1 Ventle… 6 2 -93.1 6.90 110.
## 5 5 2021-11-02 5 Contr… 1 Ventle… 4 2 -86.3 29.6 81.7
## 6 6 2021-11-08 6 Contr… 1 Ventle… 6 1 -89.8 20.3 79.3
## 7 7 2021-11-11 7 Contr… 1 Ventle… 3 2 -86.3 29.6 81.7
## 8 8 2021-11-16 8 Contr… 1 Ventle… 5 0 -89.8 20.3 79.3
## 9 9 2021-11-20 9 Contr… 1 Ventle… 4 1 -89.8 20.3 79.3
## 10 10 2021-10-15 1 Contr… 2 Vented 4 2 -77.6 15.1 325.
## # … with 13,454 more rows, 11 more variables: roughness <dbl>, day.yr <int>,
## # trap.temps <dbl>, site.temps <dbl>, temp.anom <dbl>, season <chr>,
## # molt.ratio <dbl>, sex.ratio <dbl>, presence <int>, effect_size <dbl>,
## # catch.pwr <dbl>, and abbreviated variable name ¹​trap.type
To run the power analysis it may be ideal to use parallel computing. R generally uses only one core for computations unless instructed otherwise, while most modern personal computers have several (often 8 or more) cores.
By using multiple cores at once to make many calculations (or exceptionally large computations) we can greatly reduce computation time. In this example analysis the power function collects significance statistics from only 10 iterations of sample generation and Kruskal-Wallis testing. As we scale this type of analysis up it may be ideal to complete 1,000 or 10,000 iterations, which can require enough computation time to make parallel computing necessary.
There are many ways to approach a parallel computing environment in R.
For example, I am using a foreach loop with a nested purrr::map function to speed the calculation, as I have found this faster than nesting for-loops and I prefer the map output.
However, all approaches begin with setting up the computing backend, that is, specifying how you want cores to work on tasks. To do so we:

1. Specify the number of cores we want to use and assign them to a cluster. I have also specified a 'FORK' type cluster, which copies the entire environment and all the variables it contains to each core at the beginning of a computation. This can greatly speed up computations, as there is less communication between the cores in the cluster and the original environment once the job has begun. The limitation of a FORK cluster is that it does not work on Windows (see the sketch after this list for an alternative).
2. Register the cluster. This allows parallel computing to begin. Depending on the approach to parallel computing in R, the registration function varies slightly. For foreach we will use %dopar% and registerDoParallel; however, if we changed the code below and did not include a seed number, we would use doRNG's %dorng% operator and registerDoRNG to prevent the poor random number streams that can occur in parallel.
3. Complete the analysis by applying a series of sample sizes to the power function. As in the provided example, this can be extended to apply a series of sample sizes to the power function across a series of effect sizes.
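For reference, a minimal Windows-compatible setup might look like the sketch below: a PSOCK cluster replaces FORK, and doRNG supplies reproducible parallel random number streams. This is an illustrative assumption, not part of the original analysis; unlike FORK, PSOCK does not copy the environment, so packages and objects must be passed to workers explicitly.
# hypothetical Windows-compatible alternative (sketch; n.cores as defined below)
my.cluster <- parallel::makeCluster(n.cores, type = "PSOCK")
doParallel::registerDoParallel(cl = my.cluster)
doRNG::registerDoRNG(144) # reproducible RNG streams across workers
# with PSOCK, required packages must be passed to the workers, e.g.:
# foreach(i = seq_along(effect_sizes), .combine = rbind,
#         .packages = c('dplyr', 'purrr')) %dopar% { ... }
parallel::stopCluster(my.cluster)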
n.cores <- parallel::detectCores() - 1 # leave one free for other operations
# parallel computing setup, create the cluster
my.cluster <- parallel::makeCluster(
  n.cores,
  type = "FORK"
)
# register cluster
doParallel::registerDoParallel(cl = my.cluster)
foreach::getDoParRegistered() # optional: check registerDoParallel worked
foreach::getDoParWorkers() # optional: check number of cores in cluster
set.seed(144) # reproducible results
# foreach for the outer loop over effect sizes; %dopar% allows parallel computing;
# rbind stacks one row of power-function output per effect size
pwr.res <- foreach(i = 1:length(effect_sizes), .combine = rbind) %dopar% {
  catch.data.pwr %>%
    filter(effect_size == effect_sizes[i]) -> sim.data # select the data for one effect size
  map_dfc(n_vals, kruskal.power, sim.data = sim.data) # map the power function over sample sizes into one row
}
parallel::stopCluster(cl = my.cluster) # stop the cluster
The result is a wide table with one row per effect size and one column for each (sample size, statistic) combination. These results are difficult to work with, so we rearrange them into a tidier tibble.
The dimensions of this table depend on the number of (effect, sample) combinations tested and on the number of statistics the power function returns for each.
pwr.res %>% # wide table of power function output from the foreach / map loops
  pivot_longer(cols = 1:57, # 19 sample sizes x 3 stats = 57 columns
               names_to = 'var',
               values_to = 'values') %>%
  mutate(nval = rep(n_vals, times = length(effect_sizes), each = 3), # 3 stats per function call
         effect = rep(effect_sizes, each = length(n_vals)*3),
         var = rep(c('my.power', # these are the 3 stats returned by the function
                     'stat.sd',
                     'n'), times = 209)) %>% # 209 = 11 effect sizes x 19 sample sizes
  pivot_wider(id_cols = c('nval','effect'), names_from = var, values_from = values) -> pwr.tbl
# the last step tidies the data by putting each (effect, sample) combination's stats
# into one row per test rather than spreading them over three rows
(pwr.tbl %>%
   ggplot(aes(n, my.power, group = effect)) +
   facet_wrap(facets = vars(effect)) +
   geom_line(aes(color = my.power), size = 1.25) +
   # geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.3) +
   geom_smooth(color = 'limegreen', alpha = 0.3, method = 'gam') +
   geom_hline(yintercept = 0.8, lty = 2) +
   scale_color_distiller(palette = 'Greens',
                         direction = 1, limits = c(0,1)) +
   coord_cartesian(ylim = c(0,1)) +
   labs(title = 'Kruskal-Wallis Power Analysis',
        subtitle = '10 iterations / test',
        x = 'Sample size (n)',
        y = 'Power',
        color = 'Power') +
   theme_classic() -> pwr.plot)
pwr.tbl %>%
  group_by(effect) %>%
  summarize(min.sample = min(nval[my.power >= 0.80])) %>% # use the nval column rather than the global n_vals vector
  ggplot(aes(effect, min.sample)) +
  geom_line(size = 1.25) +
  labs(title = 'Acceptable sample sizes',
       subtitle = 'Minimum sample size with power >= 0.8',
       x = 'Effect size',
       y = 'Sample size (n)') +
  coord_cartesian(ylim = c(100, 1000)) +
  theme_classic()
From this plot it is apparent that 450 samples is the minimum we could deploy and still be confident in the results if significant differences between the sites were detected. 250 samples would be acceptable for certain effect sizes, but it would not reliably detect effects in the lower half of the effect size range (0.5 - 1.1).