The examples and formulas are taken from Sullivan, L. Power and Sample Size Determination. Boston University School of Public Health. https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html
In studies where the plan is to estimate the mean of a continuous outcome variable in a single population, the formula for determining sample size is given below:
\[ sample\:size = (\frac{z * standard\:deviation}{margin\:of\:error}) ^ 2 \]
The R code for the function is given below:
compute_sample_size_for_one_continuous_outcome <- function(confidence_level, standard_deviation, margin_of_err) {
  # Two-sided critical value for the requested confidence level
  z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
  # Round up so that the margin of error is not exceeded
  sample_size <- ceiling(((z_score * standard_deviation) / margin_of_err) ^ 2)
  return(sample_size)
}
The R code for calling the function is given below. See “Example of use” for example.
compute_sample_size_for_one_continuous_outcome(confidence_level = 0.95, standard_deviation = 20, margin_of_err = 5)
## [1] 62
In order to ensure that the 95% confidence interval (confidence level = 0.95) estimate of the mean systolic blood pressure in children between the ages of 3 and 5 with congenital heart disease is within 5 units of the true mean (margin of error = 5), a sample of size 62 is needed. The standard deviation of systolic blood pressure in this population is unknown, but the investigators conduct a literature search and find that the standard deviation of systolic blood pressures in children with other cardiac defects is between 15 and 20. To estimate the sample size, we use the larger standard deviation (standard deviation = 20) in order to obtain the most conservative (largest) sample size.
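To see why the larger standard deviation is the conservative choice, the calculation can be repeated at both ends of the reported range. The snippet below is a minimal sketch that applies the formula above directly; the variable names are illustrative and are not part of the original function.

```r
# Sample size for estimating a mean, evaluated at both ends of the
# standard deviation range found in the literature (15 and 20).
confidence_level <- 0.95
margin_of_err <- 5
candidate_sds <- c(15, 20)  # illustrative name, not from the original code

z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
sample_sizes <- ceiling((z_score * candidate_sds / margin_of_err) ^ 2)
sample_sizes
## [1] 35 62
```

Planning with a standard deviation of 15 would call for only 35 participants, which would leave the interval too wide if the true standard deviation were closer to 20.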
In studies where the plan is to estimate the proportion of successes in a dichotomous outcome variable (yes/no) in a single population, the formula for determining sample size is:
\[ sample\:size = p * (1-p) * (\frac{z}{margin\:of\:error}) ^ 2 \]
If the computed sample size is too large to be feasible, the formula below gives the margin of error (that is, how precisely the prevalence can be estimated) for a feasible sample size:
\[ margin\:of\:error = z * \sqrt{\frac{\hat{p} * (1 - \hat{p})}{desired\:sample\:size}} \]
The R code for the sample size function is given below:
compute_sample_size_for_one_dichotomous_outcome <- function(confidence_level, p, margin_of_err) {
  # Two-sided critical value for the requested confidence level
  z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
  # Round up so that the margin of error is not exceeded
  sample_size <- ceiling(p * (1 - p) * ((z_score / margin_of_err) ^ 2))
  return(sample_size)
}
The R code for the margin of error function is given below:
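compute_margin_of_err_for_one_dichotomous_outcome <- function(confidence_level, p, n) {
  # Two-sided critical value for the requested confidence level
  z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
  # Margin of error for a sample of size n, rounded to 4 decimal places
  margin_of_err <- round(z_score * sqrt(p * (1 - p) / n), 4)
  return(margin_of_err)
}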
In order to ensure that a 95% confidence interval (confidence level = 0.95) estimate of the prevalence of breast cancer is within 0.001 (or to within 10 women per 10,000) of its true value (margin of error = 0.001), a sample size of 16,448 is needed. National data suggest that 1 in 235 women are diagnosed with breast cancer by age 40. This translates to a proportion of 0.0043, or a prevalence of 43 per 10,000 women (p = 0.0043, i.e., 0.43%).
compute_sample_size_for_one_dichotomous_outcome(confidence_level = 0.95, p = 0.0043, margin_of_err = 0.001)
## [1] 16448
Continuing with the previous example, suppose the investigators decide that a sample of size 16,448 is not feasible and want to know how precisely the prevalence can be estimated with a sample of size n = 5,000. Assume that the prevalence of breast cancer in the sample will be close to the national figure of 0.0043 (0.43%). Then, with n = 5,000 women, a 95% confidence interval would be expected to have a margin of error of 0.0018 (or 18 per 10,000).
compute_margin_of_err_for_one_dichotomous_outcome(confidence_level = 0.95, p = 0.0043, n = 5000)
## [1] 0.0018
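The trade-off between sample size and precision can also be tabulated for several candidate sample sizes at once. The following is a minimal sketch that applies the margin of error formula above to a vector of sample sizes; the candidate values other than 5,000 and 16,448 are arbitrary choices made here for illustration.

```r
# Margin of error for a dichotomous outcome across several feasible
# sample sizes, using the national prevalence estimate p = 0.0043.
confidence_level <- 0.95
p <- 0.0043
candidate_n <- c(1000, 5000, 10000, 16448)  # illustrative sample sizes

z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
margins <- round(z_score * sqrt(p * (1 - p) / candidate_n), 4)

# One row per candidate sample size
data.frame(n = candidate_n, margin_of_err = margins)
```

The margin of error shrinks with the square root of the sample size, so halving it requires roughly four times as many participants.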
In studies where the plan is to estimate the difference in means between two independent populations, the formula for determining the sample sizes required in each comparison group is given below:
\[ sample\:size\:for\:each\:group = 2 * \left(\frac{z * common\:standard\:deviation\:for\:both\:groups}{margin\:of\:error}\right)^2 \]
\[ common\:standard\:deviation\:for\:both\:groups = \sqrt{\frac{(n_1-1)*std_1^2 + (n_2 -1)*std_2^2}{n_1+n_2-2}} \]
\(n_1\) represents the sample size for group 1, and \(n_2\) represents the sample size for group 2. \(std_1\) represents the standard deviation for group 1, and \(std_2\) represents the standard deviation for group 2.
The R code for the function is given below:
compute_sample_size_for_two_independent_continuous_outcomes <- function(confidence_level,
                                                                        standard_deviation = NULL,
                                                                        standard_deviation_1 = NULL,
                                                                        standard_deviation_2 = NULL,
                                                                        sample_size_1 = NULL,
                                                                        sample_size_2 = NULL,
                                                                        margin_of_err) {
  z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
  if (is.null(standard_deviation)) {
    # Error checking
    if (is.null(standard_deviation_1) || is.null(standard_deviation_2) ||
        is.null(sample_size_1) || is.null(sample_size_2)) {
      print(paste("Please provide standard deviations and sample sizes for both groups",
                  "if the common standard deviation is unknown."))
      return(NULL)
    }
    # Pooled estimate of the common standard deviation
    sqrt_numerator <- ((sample_size_1 - 1) * (standard_deviation_1 ^ 2)) +
      ((sample_size_2 - 1) * (standard_deviation_2 ^ 2))
    sqrt_denominator <- sample_size_1 + sample_size_2 - 2
    standard_deviation <- sqrt(sqrt_numerator / sqrt_denominator)
  }
  sample_size <- ceiling(2 * ((z_score * standard_deviation / margin_of_err) ^ 2))
  return(sample_size) # The sample size required in each group
}
An investigator wants to plan a clinical trial to evaluate the efficacy of a new drug designed to increase HDL cholesterol (the “good” cholesterol). The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. In order to ensure that the 95% confidence interval (confidence level = 0.95) estimate of the difference in mean HDL levels between patients taking the new drug and patients taking placebo is within 3 units (margin of error = 3), a sample of size 250 is needed for each group. According to the Framingham Heart Study, the standard deviation of HDL cholesterol is 17.1 (standard deviation = 17.1). The total sample size would be 500 (\(n_1 + n_2 = 500\)).
compute_sample_size_for_two_independent_continuous_outcomes(confidence_level = 0.95, standard_deviation = 17.1, margin_of_err = 3)
## [1] 250
An investigator wants to compare two diet programs (a low fat diet and a low carbohydrate diet) in children who are obese. A previously published study reported a standard deviation in weight lost over 8 weeks of 8.4 pounds on a low fat diet and 7.7 pounds on a low carbohydrate diet (\(std_1\) = 8.4, \(std_2\) = 7.7). That study involved 100 participants in each diet group (\(n_1\) = 100, \(n_2\) = 100). In order to ensure that the 95% confidence interval (confidence level = 0.95) estimate of the difference in weight lost between diets has a margin of error of no more than 3 pounds (margin of error = 3), a sample size of 56 is needed for each group. The total sample size needed for both groups is 56 + 56 = 112.
compute_sample_size_for_two_independent_continuous_outcomes(confidence_level = 0.95, standard_deviation = NULL, standard_deviation_1 = 8.4, standard_deviation_2 = 7.7, sample_size_1 = 100, sample_size_2 = 100, margin_of_err = 3)
## [1] 56
In studies where the plan is to estimate the difference in proportions between two independent populations (i.e., to estimate the risk difference), the formula for determining the sample sizes required in each comparison group is:
\[ sample\:size = [p_1(1-p_1) + p_2(1-p_2)]*(\frac{z}{margin\:of\:error})^2 \]
\(p_1\) and \(p_2\) are the proportions of successes in each comparison group. If there is no information available to approximate \(p_1\) and \(p_2\), then \(p_1=p_2=0.5\) can be used to generate the most conservative, or largest, sample sizes.
The R code for the function is given below:
compute_sample_size_for_two_independent_dichotomous_outcomes <- function(confidence_level,
                                                                         p1,
                                                                         p2,
                                                                         margin_of_err) {
  # Two-sided critical value for the requested confidence level
  z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
  sample_size <- ceiling(((p1 * (1 - p1)) + (p2 * (1 - p2))) * ((z_score / margin_of_err) ^ 2))
  return(sample_size) # The sample size required in each group
}
An investigator wants to estimate the impact of smoking during pregnancy on premature delivery. The 2005 National Vital Statistics report indicates that approximately 12% of infants are born prematurely in the United States. Using 12% as the estimate for both groups (\(p_1 = p_2 = 0.12\)), in order to ensure that the 95% confidence interval (confidence level = 0.95) estimate of the difference in proportions who deliver prematurely is within a margin of error of 4% (margin of error = 0.04), a sample size of 508 is needed for each group. The total sample needed is 508 + 508 = 1016.
compute_sample_size_for_two_independent_dichotomous_outcomes(
  confidence_level = 0.95,
  p1 = 0.12,
  p2 = 0.12,
  margin_of_err = 0.04)
## [1] 508
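When no prior estimate of the proportions is available, \(p_1 = p_2 = 0.5\) gives the most conservative sample size. As a rough sketch, the same formula applied with these values and the same 4% margin of error shows how much larger that sample is than the 508 per group obtained with 12%; the variable names are illustrative.

```r
# Most conservative per-group sample size for estimating a difference in
# proportions: p1 = p2 = 0.5 maximizes p * (1 - p).
confidence_level <- 0.95
margin_of_err <- 0.04
p1 <- 0.5
p2 <- 0.5

z_score <- qnorm((1 - confidence_level) / 2, lower.tail = FALSE)
conservative_n <- ceiling((p1 * (1 - p1) + p2 * (1 - p2)) * (z_score / margin_of_err) ^ 2)
conservative_n
## [1] 1201
```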
In studies where the plan is to perform a test of hypothesis comparing the mean of a continuous outcome variable in a single population to a known mean, the hypotheses of interest are: \(H_0: \mu = \mu_0\) and \(H_1: \mu \neq \mu_0\) where \(\mu_0\) is the known mean. The formula for determining sample size to ensure that the test has a specified power is given below:
\[ sample\:size = (\frac{z_{1-\alpha/2}+z_{1-\beta}}{effect\:size}) ^ 2 \]
\[ effect\:size = \frac{|\mu_1-\mu_0|}{std} \]
where \(\alpha\) is the selected level of significance, \(1-\beta\) is the selected power, \(\mu_1\) is the mean expected under the alternative hypothesis (\(H_1\)), and \(std\) is the standard deviation of the outcome of interest.
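A possible R implementation of this formula is sketched below, in the same style as the functions above. The function name, argument names, and the numbers in the example call are assumptions chosen here for illustration.

```r
# Sketch: sample size for a two-sided test of one mean against a known mean.
# Function and argument names are hypothetical, not from the original source.
compute_sample_size_for_one_sample_mean_test <- function(alpha, power,
                                                         mean_under_h0,
                                                         mean_under_h1,
                                                         standard_deviation) {
  z_alpha <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}
  z_beta <- qnorm(power)           # z_{1 - beta}
  effect_size <- abs(mean_under_h1 - mean_under_h0) / standard_deviation
  sample_size <- ceiling(((z_alpha + z_beta) / effect_size) ^ 2)
  return(sample_size)
}

# Illustrative values: detect a 5-unit difference with 80% power at alpha = 0.05
compute_sample_size_for_one_sample_mean_test(alpha = 0.05, power = 0.80,
                                             mean_under_h0 = 95, mean_under_h1 = 100,
                                             standard_deviation = 9.8)
## [1] 31
```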
In studies where the plan is to perform a test of hypothesis comparing the proportion of successes in a dichotomous outcome variable in a single population to a known proportion, the hypotheses of interest are: \(H_0: p = p_0\) and \(H_1: p \neq p_0\) where \(p_0\) is the known proportion. The formula for determining the sample size to ensure that the test has a specified power is given below:
\[ sample\:size = (\frac{z_{1-\alpha/2}+z_{1-\beta}}{effect\:size}) ^ 2 \]
\[ effect\:size = \frac{|p_1-p_0|}{\sqrt{p_0(1-p_0)}} \]
where \(\alpha\) is the selected level of significance, \(1-\beta\) is the selected power, \(p_0\) is the proportion under \(H_0\), and \(p_1\) is the proportion under \(H_1\).
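The following sketch shows one way this calculation could be written in R. It standardizes the difference by \(\sqrt{p_0(1-p_0)}\), the standard deviation under \(H_0\), which is an assumption of this sketch. The function name, argument names, and example values are also hypothetical.

```r
# Sketch: sample size for a two-sided test of one proportion against a
# known proportion. Names are hypothetical, not from the original source.
compute_sample_size_for_one_sample_proportion_test <- function(alpha, power, p0, p1) {
  z_alpha <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}
  z_beta <- qnorm(power)           # z_{1 - beta}
  # Assumption: the difference is standardized by the null-hypothesis
  # standard deviation sqrt(p0 * (1 - p0)).
  effect_size <- abs(p1 - p0) / sqrt(p0 * (1 - p0))
  sample_size <- ceiling(((z_alpha + z_beta) / effect_size) ^ 2)
  return(sample_size)
}

# Illustrative values: detect a change from 26% to 31% with 90% power
compute_sample_size_for_one_sample_proportion_test(alpha = 0.05, power = 0.90,
                                                   p0 = 0.26, p1 = 0.31)
## [1] 809
```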
In studies where the plan is to perform a test of hypothesis comparing the means of a continuous outcome variable in two independent populations, the hypotheses of interest are \(H_0: \mu_1 = \mu_2\) and \(H_1: \mu_1 \neq \mu_2\) where \(\mu_1\) and \(\mu_2\) are the means in the two comparison populations. The formula for determining sample size to ensure that the test has a specified power is given below:
\[ sample\:size\:for\:each\:group = 2* (\frac{z_{1-\alpha/2}+z_{1-\beta}}{effect\:size}) ^ 2 \]
where \(\alpha\) is the selected level of significance, and \(1-\beta\) is the selected power.
\[ effect\:size = \frac{|\mu_1-\mu_2|}{estimate\:of\:common\:standard\:deviation} \]
where \(|\mu_1-\mu_2|\) is the absolute value of the difference in means between the two groups expected under the alternative hypothesis (\(H_1\)).
\[ estimate\:of\:common\:standard\:deviation=\sqrt{\frac{(n_1-1)std_1^2+(n_2-1)std_2^2}{n_1+n_2-2}} \]
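As a minimal sketch, the two formulas above could be coded as follows. The function names, argument names, and example values are assumptions made for illustration; the helper simply evaluates the pooled standard deviation formula.

```r
# Sketch: pooled estimate of the common standard deviation.
# Names are hypothetical, not from the original source.
compute_pooled_standard_deviation <- function(std_1, std_2, n_1, n_2) {
  sqrt(((n_1 - 1) * std_1 ^ 2 + (n_2 - 1) * std_2 ^ 2) / (n_1 + n_2 - 2))
}

# Sketch: per-group sample size for a two-sided test comparing two means.
compute_sample_size_for_two_sample_mean_test <- function(alpha, power,
                                                         mean_difference,
                                                         common_standard_deviation) {
  z_alpha <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}
  z_beta <- qnorm(power)           # z_{1 - beta}
  effect_size <- abs(mean_difference) / common_standard_deviation
  sample_size <- ceiling(2 * ((z_alpha + z_beta) / effect_size) ^ 2)
  return(sample_size)  # The sample size required in each group
}

# Illustrative values: detect a 5-unit difference, common standard deviation 19
compute_sample_size_for_two_sample_mean_test(alpha = 0.05, power = 0.80,
                                             mean_difference = 5,
                                             common_standard_deviation = 19)
## [1] 227
```

When only group-level standard deviations from an earlier study are available, the result of `compute_pooled_standard_deviation()` can be passed as `common_standard_deviation`.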
In studies where the plan is to perform a test of hypothesis on the mean difference in a continuous outcome variable based on matched data, the hypotheses of interest are \(H_0: \mu_d = 0\) and \(H_1: \mu_d \neq 0\), where \(\mu_d\) is the mean difference in the population. The formula for determining the sample size to ensure that the test has a specified power is given below:
\[ sample\:size = (\frac{z_{1-\alpha/2}+z_{1-\beta}}{effect\:size}) ^ 2 \]
where \(\alpha\) is the selected level of significance, and \(1-\beta\) is the selected power.
\[ effect\:size = \frac{\mu_d}{\sigma_d} \]
where \(\mu_d\) is the mean difference expected under the alternative hypothesis, and \(\sigma_d\) is the standard deviation of the difference in the outcome (e.g., the difference based on measurements over time or the difference between matched pairs).
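The matched-data calculation could be sketched in R as shown below. The function name, argument names, and example values are hypothetical; the result is the number of matched pairs (or participants measured twice).

```r
# Sketch: number of pairs for a two-sided test of a mean difference in
# matched data. Names are hypothetical, not from the original source.
compute_sample_size_for_matched_mean_test <- function(alpha, power,
                                                      mean_difference,
                                                      standard_deviation_of_difference) {
  z_alpha <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}
  z_beta <- qnorm(power)           # z_{1 - beta}
  # abs() so that the direction of the expected difference does not matter
  effect_size <- abs(mean_difference) / standard_deviation_of_difference
  sample_size <- ceiling(((z_alpha + z_beta) / effect_size) ^ 2)
  return(sample_size)
}

# Illustrative values: expected mean difference 3, standard deviation of differences 9
compute_sample_size_for_matched_mean_test(alpha = 0.05, power = 0.80,
                                          mean_difference = 3,
                                          standard_deviation_of_difference = 9)
## [1] 71
```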
In studies where the plan is to perform a test of hypothesis comparing the proportions of successes in two independent populations, the hypotheses of interest are \(H_0: p_1 = p_2\) and \(H_1: p_1 \neq p_2\) where \(p_1\) and \(p_2\) are the proportions in the two comparison populations. The formula for determining the sample size to ensure that the test has a specified power is given below:
\[ sample\:size\:for\:each\:group = 2* (\frac{z_{1-\alpha/2}+z_{1-\beta}}{effect\:size}) ^ 2 \]
where \(\alpha\) is the selected level of significance, and \(1-\beta\) is the selected power.
\[ effect\:size = \frac{|p_1-p_2|}{\sqrt{p(1-p)}} \]
where \(|p_1-p_2|\) is the absolute value of the difference in proportions between the two groups expected under the alternative hypothesis (\(H_1\)). \(p\) is the overall proportion, based on pooling the data from the two comparison groups (p can be computed by taking the mean of the proportions in the two comparison groups, assuming that the groups will be of approximately equal size).
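A sketch of this calculation in R is given below. It assumes groups of approximately equal size, so that the overall proportion \(p\) is the mean of \(p_1\) and \(p_2\) as described above. The function name, argument names, and example values are hypothetical.

```r
# Sketch: per-group sample size for a two-sided test comparing two
# proportions. Names are hypothetical, not from the original source.
compute_sample_size_for_two_sample_proportion_test <- function(alpha, power, p1, p2) {
  z_alpha <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}
  z_beta <- qnorm(power)           # z_{1 - beta}
  # Overall proportion, assuming groups of approximately equal size
  p <- (p1 + p2) / 2
  effect_size <- abs(p1 - p2) / sqrt(p * (1 - p))
  sample_size <- ceiling(2 * ((z_alpha + z_beta) / effect_size) ^ 2)
  return(sample_size)  # The sample size required in each group
}

# Illustrative values: detect a difference between 30% and 20% with 80% power
compute_sample_size_for_two_sample_proportion_test(alpha = 0.05, power = 0.80,
                                                   p1 = 0.30, p2 = 0.20)
## [1] 295
```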