To obtain a sensible estimate of the sample size required to achieve a given power, we will need to

Also, we may want to consider missing data expectations - magnitude and mechanism.

Surf Therapy study

The research questions concern

will they benefit from the intervention?
can we assess which baseline covariates help predict the best intervention path?

The previous year's intake for the two sites available for sampling was 300 (East) and 700 (West); this should constitute a reasonable ceiling for the maximum possible sample size.

Study design

Several continuous outcomes, possibly to be analysed in a multivariate fashion (if feasible at all). Longitudinal study over three years, 12 time points in two patterns from 0 to 24 months for each cohort:

baseline + every three months
baseline + every month for a year, then every three months until completion.

Two groups, experimental and control, and a post-intervention one year follow-up.

Predictors

We have several predictors:

gender: male/female
setting: rural/urban-rural/urban
race: black/mixed/white
site: eastern/western
age: continuous

Also, a background variable consisting of 5 constructs exists (maybe too much as a predictor, maybe too important? Think about this).

Power analysis

No closed form solution to the problem, need to resort to simulations to construct power curves. Computationally intensive and requires a lot of prior knowledge/educated guesses on the distributions of both outcomes and predictors.

There are four possibilities to set up the simulation study (ordered from the most to the least desirable situation):

We have a dataset from a pilot study with the same outcomes and predictors. We have explicit hypotheses on the expected change or effect size at any given time point. If the only available information concerns the baseline and the end of the study, then some scenarios regarding the patterns of variability and correlations could be devised and investigated, to produce plausible power curves.
We have precise information regarding the distribution of the outcomes (skewness, kurtosis, modes, quantiles, standard deviations, means, …) and their correlations, if they are not to be analysed separately. Also, we have precise information regarding how the outcomes behave within levels of the predictors. This way we can inform a model and generate samples from it as if we were at point 1. We would still need explicit hypotheses on the treatment effect.
We have only accurate univariate information regarding the outcomes of interest and their relationships with the predictors, hence we need to set up several power analyses separately, one for each dependent variable. Following a conservative approach, we set the desired sample size equal to the largest one, for a given level of power: we will be overpowered for some of the outcomes, but at least we are likely to be able to detect differences for all of them. Again, an idea of the magnitude of an effect size of practical significance is fundamental.
We have only partial univariate information regarding some of the outcomes and their relationships with some of the predictors. In this situation, we need to simplify the model to account only for those predictors we have reliable knowledge of (with respect to the outcomes). For example, if we know that CRP levels tend to be higher in males than in females, or tend to differ between racial groups, we could set up a simplified model that accounts for this, and only this. Of course, we would then need to update our sample size determination at baseline, when the study begins. This might lead to needing more or fewer units, to an extent that cannot be anticipated.

Currently, we are sort of at point 4. I have found literature on CRP, tumor necrosis factor alpha, interleukin-1b and -6, and obtained a wealth of numbers I could use, but most of the levels refer to populations that are very different from the target one: older individuals, usually in their 50s and with some kind of medical condition that warranted the analyses in the first place; plus they are Asian, European, hell, even Aboriginal, but no black, no mixed. Moreover, I found some literature suggesting that CRP production may be stimulated by the other three factors, leading to a somewhat more complex structure than anticipated.

As things are, and given that I don’t think we will be able to run a pilot before budgeting, I see three ways out of this impasse (again, in order of preference):

more time, and someone with specific medical/biochemical expertise on this to review the literature and provide sensible estimates of the involved quantities for the target population, together with a reasonable correlation structure and expectations for the evolution of the levels thereof. I would work with this person to focus on the strictly necessary questions
we focus on one outcome only - say CRP, or whatever else you think is most important - possibly the one which we expect to be most variable throughout the study, and base the power calculation on that alone, using a lot of educated guessing (again, something I can’t do by myself, but maybe we wouldn’t need a Nobel laureate to back me up)
we go full conservative, and budget for the maximum possible sample size (300+700 if I recall correctly), without any power calculation (which can’t be done anyway without a reasonable model), and resolve to run a small pilot to measure the involved quantity on the target population OR to do the power analysis with baseline data when the study begins.

Tentative power calculation

The third option, ‘go full conservative’, seems to be the only viable one. However, given budget constraints, it is desirable to find some indication that the expected maximum possible sample size (1000) could be reduced without losing too much power.

According to the study design, at the beginning of each of three years (2019, 2020, 2021), a sample is to be collected from the intake of the W4C centers (T, treatment), together with a sample from the population (C, control) that matches T on known covariates such as age, gender, race, setting (rural/urban), area (western/eastern) and background of adversity. The experimental surf therapy (ST) lasts one year, after which the T groups from 2019 and 2020 will be followed up for an additional year (red lines).

The case of support document highlights the lack of literature on the outcomes of interest for the population at hand (young, black/mixed/white South Africans with a history of adversity). Moreover, I could find no explicit mention of how the scientific questions would translate into hypotheses on said outcomes. This makes a proper and fully reliable power calculation impossible to carry out at this time.

Given the structure of the design, however, it is reasonable to assume that once the 2019 T and C groups have been sampled, the baseline information will allow us to provide a more accurate estimate of the sample size required to achieve a prescribed power in testing the existence of a treatment effect. Moreover, the accuracy of this estimate may be greatly improved once the observations for at least a few time points have been collected, which would help refine our understanding of the relationships between the involved quantities. In practice, this translates to ‘we will likely be able to reduce the sample size needed to achieve the desired power with the second and third group, therefore it may be reasonable to deflate an initial, necessarily rough, estimate of the sample size’. The extent to which this may be reasonable, however, is entirely up to conjecture.

Until that point, the best I can do in the little time available is focus on a single outcome and use all the available information to obtain a sample size. I’ve decided to pick C-reactive protein (CRP), mostly because I’ve managed to find some literature describing its distribution in populations similar to the one under study and because it seemed reasonable to assume that surf therapy, being a combination of cognitive behavioural therapy and physical activity, could contribute to reducing the levels of CRP over time, which provides me with a working hypothesis to test.

Power calculation using CRP as a single outcome

The available literature suggests [12881452, 26033244, 15205215] that the distribution of concentration of CRP in saliva in the general population:

is typically markedly right-skewed
is slightly variable across ethnicities/races in terms of mean levels (after adjusting for metabolic factors), albeit stable in terms of variability
tends to have higher mean levels in females, as opposed to males
tends to increase with age.

Values of CRP concentration (in mg/L) higher than 3 are typically considered high, and can indicate existence of systemic inflammation. I decided to describe the distribution of CRP concentration using a Log-Normal distribution, and made the following assumptions:

the baseline mean level of CRP in T and C should be the same; I set it to \(\mu=4\), following the literature indicating childhood adversity as being likely to increase CRP levels (this assumption is, however, not crucial for the model I use later)
the baseline variance of CRP in T and C should be the same; I set it to \(\sigma^2=0.13\). I have found no convincing literature on how and if said variability should change over time and following physical activity/cognitive behavioural therapy. For this reason, I assume homogeneity of variance throughout the whole study as well (this is a strong assumption, but I know too little of the biology of these processes to venture a guess at a more flexible structure)
the repeated measurements are correlated within the same individuals, and the correlation decays in time following a Toeplitz structure; I assume that the correlation between two measurements on the same individual \(k\) months apart is \(\rho=0.5^k\). I assume that this correlation structure holds for both T and C
no direct effect of time exists on CRP levels, given the short period of time we consider; the literature indicates that CRP concentration tends to increase with age, but I only found practically significant differences between age groups 10-15 and 80+
no direct effect of treatment exists, the only way treatment affects the outcome is through interaction with time (no differences in the intercepts between T and C)
T and C have the same sample size.
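The correlation assumption in the list above can be made concrete with a small sketch. The function name and the two measurement schedules (restricted here to the 12 months of treatment) are illustrative choices, not part of the original analysis; only the rule \(0.5^k\) for measurements \(k\) months apart comes from the assumptions stated above.

```python
# Correlation between two measurements k months apart is rho**k, with rho = 0.5
# (taken from the stated assumptions). Names and schedules are illustrative.

def correlation_matrix(months, rho=0.5):
    """Return the matrix with entries rho**|t_i - t_j| for the given time points."""
    return [[rho ** abs(ti - tj) for tj in months] for ti in months]

# Baseline + every three months, restricted to the treatment year (assumption).
quarterly = [0, 3, 6, 9, 12]
# Baseline + every month, restricted to the treatment year (assumption).
monthly = list(range(13))

R_quarterly = correlation_matrix(quarterly)
R_monthly = correlation_matrix(monthly)

# Print the quarterly matrix, one row per measurement occasion.
for row in R_quarterly:
    print("  ".join(f"{value:.4f}" for value in row))
```

Under this rule, two measurements three months apart have correlation \(0.5^3 = 0.125\), so with the quarterly pattern the repeated measurements are close to independent, whereas adjacent monthly measurements retain a correlation of \(0.5\).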

I use a linear mixed regression approach to model the natural logarithm of CRP concentration. The model includes an intercept, an individual-specific intercept (random effect), and an interaction term of time (in months) and a treatment variable (1 if in T, 0 otherwise).

The hypothesis to be tested is that there exists an interaction between time and treatment; specifically, we postulate that the differences between T and C can be described by the effect size of the treatment effect per time unit on the log-scale (i.e., the mean on the log-scale in the C group is constant, whereas it decreases linearly with time in the T group). For interpretability, the considered effect sizes are eventually presented in terms of percentage changes in the (geometric) mean of the original CRP concentration, per month, for those in the treatment group.
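As a reading aid, the model and the null hypothesis described above could be written as follows. The symbols \(\beta_0\), \(b_i\), \(\beta_1\), \(\tau^2\) and \(\varepsilon_{ij}\) are notation introduced here for illustration and do not come from the original analysis; \(i\) indexes individuals, \(t_j\) is the time of measurement \(j\) in months, \(T_i\) is the treatment indicator, and the within-individual correlation of the errors \(\varepsilon_{ij}\) is left unspecified in this sketch.

```latex
\[
\log(\mathrm{CRP}_{ij}) = \beta_0 + b_i + \beta_1 \, (t_j \times T_i) + \varepsilon_{ij},
\qquad b_i \sim N(0, \tau^2),
\]
\[
H_0 : \beta_1 = 0,
\qquad
\text{percentage change per month in the geometric mean (T group)} = 100 \, \bigl(e^{\beta_1} - 1\bigr).
\]
```

Because the model is linear on the log scale, a slope of \(\beta_1\) per month multiplies the geometric mean of CRP in the T group by \(e^{\beta_1}\) each month, which is where the percentage presentation of the effect size comes from.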

The plot below presents the power curves, for varying effect sizes, as a function of the sample size in each group (meaning that the total sample size will be twice that). The significance level is \(\alpha=0.01\), and at each point \(99\%\) asymptotic confidence intervals are drawn around the Monte Carlo estimate (1000 runs). Interpretation: for a %Effect size of, e.g., \(-0.5\%\), the curve describes the power to detect a maximum difference of roughly \(12 \times (-0.5)\% = -6\%\) at the end of the 12 months. The red lines indicate the usual thresholds of \(80\%\) and \(90\%\) power, for better readability.
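A single point of such a power curve could be estimated along the following lines. This is a minimal sketch under stated assumptions, not the code behind the plot: it swaps the linear mixed model for a simpler two-stage analysis (one least-squares slope per individual, then a z-test comparing mean slopes between T and C), treats \(\sigma^2=0.13\) as the variance on the log scale, assumes monthly measurements over 12 months, and generates the within-individual correlation with a first-order autoregressive recursion. All function and variable names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance

RHO = 0.5                  # correlation between measurements one month apart
SIGMA2 = 0.13              # variance, taken here to be on the log scale (assumption)
MONTHS = list(range(13))   # baseline + 12 monthly measurements (assumption)
T_BAR = fmean(MONTHS)
SXX = sum((t - T_BAR) ** 2 for t in MONTHS)


def simulate_log_crp(slope, rng):
    """One individual's log-CRP series: linear trend plus AR(1) noise."""
    sd = math.sqrt(SIGMA2)
    # Log-scale mean chosen so that the original-scale mean is 4 mg/L (assumption).
    baseline = math.log(4.0) - SIGMA2 / 2
    noise = rng.gauss(0.0, sd)
    series = []
    for t in MONTHS:
        if t > 0:
            # AR(1) step: keeps the variance at SIGMA2 and gives correlation RHO**k at lag k.
            noise = RHO * noise + math.sqrt(1 - RHO ** 2) * rng.gauss(0.0, sd)
        series.append(baseline + slope * t + noise)
    return series


def ols_slope(series):
    """Least-squares slope of one series against MONTHS."""
    return sum((t - T_BAR) * y for t, y in zip(MONTHS, series)) / SXX


def estimate_power(n_per_group, pct_per_month, alpha=0.01, runs=1000, seed=1):
    """Monte Carlo power estimate with a 99% asymptotic confidence interval."""
    rng = random.Random(seed)
    slope_t = math.log(1 + pct_per_month / 100)   # log-scale slope in the T group
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test (assumption)
    rejections = 0
    for _ in range(runs):
        treated = [ols_slope(simulate_log_crp(slope_t, rng)) for _ in range(n_per_group)]
        control = [ols_slope(simulate_log_crp(0.0, rng)) for _ in range(n_per_group)]
        se = math.sqrt((variance(treated) + variance(control)) / n_per_group)
        z = (fmean(treated) - fmean(control)) / se
        rejections += abs(z) > z_crit
    power = rejections / runs
    half_width = NormalDist().inv_cdf(0.995) * math.sqrt(power * (1 - power) / runs)
    return power, max(0.0, power - half_width), min(1.0, power + half_width)
```

A call such as `estimate_power(n_per_group=250, pct_per_month=-0.5)` returns the estimated power together with the lower and upper limits of its confidence interval. Since the analysis model and the data-generating details differ from those used for the plot, the numbers it produces should not be expected to reproduce the curves exactly.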

The use of the simulation results presented in the plot obviously needs to consider the effect size we desire to detect, whose magnitude is currently unknown to me for CRP.

Final remarks

Very briefly:

reliable power calculations need to be based either on a model fitted on available data or on simulations produced from a theoretical model; we currently have neither
the hypotheses to be tested need to be explicitly stated, otherwise no power computation is possible at all
we have a chance of assessing the required sample size more accurately if we decide to postpone the power calculation until after obtaining baseline information on the first wave (ha!)
if anyone had an idea of what effect size would be of practical significance to detect, we could use the power calculation based on my simplistic model to decide an initial sample size and base our budget on it
I could not consider multivariate models linking the many (not even two) outcomes at this stage; this may, however, be feasible at baseline/after a few time points.

Conservative solution: based on these results, we would need approximately 250 individuals per group (total 500) in the first wave to obtain reasonable power (around \(80\%\)) to detect an approximately \(5\%\) decrease in CRP geometric mean levels at the end of the 12 months (this magnitude is absolutely arbitrary). If we decide to revise our power calculation at baseline, however, this could be a conservative approach, so that we could decide to sample somewhat fewer than those 500 (again, arbitrary effect size). How many fewer needs to be discussed based on the desired effect size.
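To read a decrease over 12 months off curves that are indexed by a per-month effect, the two scales need to be converted into each other. The sketch below does this under the assumption that the change is linear on the log scale, so that monthly percentage changes compound; the function names are hypothetical.

```python
import math


def monthly_pct_change(total_pct_change, months=12):
    """Per-month % change in the geometric mean that compounds to total_pct_change."""
    log_slope = math.log(1 + total_pct_change / 100) / months
    return 100 * (math.exp(log_slope) - 1)


def cumulative_pct_change(pct_per_month, months=12):
    """Total % change in the geometric mean after compounding pct_per_month."""
    return 100 * ((1 + pct_per_month / 100) ** months - 1)


print(round(monthly_pct_change(-5.0), 3))     # -0.427
print(round(cumulative_pct_change(-0.5), 2))  # -5.84
```

A \(5\%\) decrease after 12 months thus corresponds to a per-month effect of about \(-0.43\%\), and a per-month effect of \(-0.5\%\) compounds to about \(-5.84\%\), close to the \(-6\%\) obtained by simply multiplying by 12.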

Power calculation for multivariate mixed models - Surf Therapy study

Federico Andreis

22 April 2018