Topic 4: Power Calculations and Revision


These are the solutions for DA Computer Lab 4.

Please note that this lab for DA Topic 4 content was provided for you to complete at your own pace off-campus. Completing this lab prior to starting on DA Topic 5 is recommended.


1 G*Power Sample Size Calculation - One Sample t-test 🌱

No answer required.

In this question we considered the following scenario:

Let \(\mu_{height}\) denote the true population mean height of BIO2POS students.

Let

\[H_0: \mu_{height} = 165 \text{ versus } H_1: \mu_{height} \neq 165\]

1.1

The sample size required here is \(n=15\).

1.2

The sample size required here is \(n=8\).

1.3

By increasing the difference considered meaningful, the sample size required to achieve the desired power level has decreased from 15 to 8. The actual power of the test has also increased very slightly, from approximately 0.8213 to approximately 0.8279.

1.4

By reducing the standard deviation value, the sample size required to achieve the desired power level has decreased from 15 to 7. The actual power of the test has also increased from approximately 0.8213 to approximately 0.8344.

1.5

By increasing the \(\alpha\) value, the sample size required to achieve the desired power level has decreased from 15 to 12. The actual power of the test has also increased slightly from approximately 0.8213 to approximately 0.8290.


2 G*Power Power Calculation - One Sample t-test 🌱

No answer required.

2.1

For the stated conditions, the power value for your one sample \(t\)-test is approximately 0.6162.

2.2

By increasing the \(\alpha\) value to 0.1, the power value for the one sample \(t\)-test has increased from approximately 0.6162 to approximately 0.7545, which is a large change.

2.3

By increasing the sample size to \(n=20\), the power level has increased dramatically from approximately 0.6162 to approximately 0.9239.

2.4

An example response is provided below.

Since the power level is already well above the 0.8 threshold for a sample size of 20, doubling the sample size is not really necessary. If it is easy to collect more data, then increasing the sample size will be beneficial, but if collecting more data were time consuming, or costly, or both, then it would be very reasonable to keep the sample size at 20. If data collection was not easy, it may even be worthwhile considering a slightly smaller sample size that still satisfies the 0.8 power threshold.


3 jpower Power Calculations 🌱

3.1 Q1.1

The effect size calculated was 0.8. For an \(\alpha\) value of \(0.05\), and a minimum power value of \(0.8\), the sample size required is 15.

3.2 Q1.2

The effect size calculated was 1.2. For an \(\alpha\) value of \(0.05\), and a minimum power value of \(0.8\), the sample size required is 8.

3.3 Q1.3

See 1.3 response.

3.4 Q1.4

By reducing the standard deviation value, the effect size has changed from 0.8 to 1.333. This has led the required sample size to be reduced from 15 to 7.

3.5 Q1.5

The effect size calculated was 0.8. For an \(\alpha\) value of \(0.1\), and a minimum power value of \(0.8\), the sample size required is 12 (a reduction from 15 with \(\alpha = 0.05\)).

3.6 Q2.1

For \(n=10\) students, \(\alpha = 0.05\), and a calculated effect size of 0.8, the associated power is 0.616.

3.7 Q2.2

For \(n=10\) students, \(\alpha = 0.1\), and a calculated effect size of 0.8, the associated power has increased to 0.754. We can observe how the power will change by looking at the power curve.

3.8 Q2.3

For \(n=20\) students, \(\alpha = 0.05\), and a calculated effect size of 0.8, the associated power has increased to 0.924.

3.9 Q2.4

See 2.4 response. If we check the power contour plot in jamovi, we can see that, for the effect size of 0.8, increasing the sample size further will not have a huge impact on the power.
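The G*Power and jpower results for the one sample \(t\)-test can be cross-checked with a short script. The sketch below is not how either tool is implemented. It assumes a two-tailed test and computes the power from the noncentral \(t\) distribution by numerical integration, using only the Python standard library. The function names (`one_sample_power`, `required_n`) are illustrative. With SciPy installed, `scipy.stats.nct` would be a more usual route.

```python
import math


def _phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _s_grid(df, steps=2000, upper=6.0):
    """Midpoint-rule nodes and weights for S = sqrt(chi2_df / df)."""
    log_c = math.log(2.0) + 0.5 * df * math.log(0.5 * df) - math.lgamma(0.5 * df)
    h = upper / steps
    grid = []
    for i in range(steps):
        s = (i + 0.5) * h
        log_density = log_c + (df - 1) * math.log(s) - 0.5 * df * s * s
        grid.append((s, h * math.exp(log_density)))
    return grid


def _upper_tail(c, delta, grid):
    """P(T > c) for a noncentral t variable T = (Z + delta) / S."""
    return sum(w * _phi(delta - c * s) for s, w in grid)


def _critical_value(tail_prob, grid):
    """Central t critical value with the given upper-tail area (bisection)."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if _upper_tail(mid, 0.0, grid) > tail_prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def one_sample_power(d, n, alpha=0.05):
    """Power of a two-tailed one sample t-test (assumed test direction)."""
    grid = _s_grid(n - 1)
    crit = _critical_value(alpha / 2.0, grid)
    delta = d * math.sqrt(n)  # noncentrality parameter
    return _upper_tail(crit, delta, grid) + _upper_tail(crit, -delta, grid)


def required_n(d, alpha=0.05, target=0.8):
    """Smallest n whose power reaches the target."""
    n = 2
    while one_sample_power(d, n, alpha) < target:
        n += 1
    return n


# Sample size questions (Q1.1, Q1.2, Q1.4, Q1.5)
for d, alpha in [(0.8, 0.05), (1.2, 0.05), (1.333, 0.05), (0.8, 0.10)]:
    n = required_n(d, alpha)
    print(f"d={d:.3f} alpha={alpha}: n={n}, power={one_sample_power(d, n, alpha):.4f}")

# Power questions (Q2.1, Q2.2, Q2.3), all with d = 0.8
for n, alpha in [(10, 0.05), (10, 0.10), (20, 0.05)]:
    print(f"n={n} alpha={alpha}: power={one_sample_power(0.8, n, alpha):.4f}")
```

If the sketch is faithful to the calculation, the printed sample sizes and power values should agree with the figures quoted in Sections 1 to 3, up to rounding.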


4 G*Power Power Adjustments 🌱

No answer required.

In this question we considered the following scenario:

Suppose that you are planning the design of a study to determine if there is a difference in the mean external parasite loads of Budgerigar nestlings in nest boxes compared to those in natural cavities. You believe that budgerigar nestlings in nest boxes are likely to have a lower mean external parasite load, compared to those in natural cavities.

  • Let \(\mu_{nest}\) denote the true population mean external parasite load of Budgerigar nestlings in nest boxes.
  • Let \(\mu_{cave}\) denote the true population mean external parasite load of Budgerigar nestlings in natural cavities.

You decide to conduct a two sample \(t\)-test, with the hypotheses:

\[H_0: \mu_{nest} = \mu_{cave} \text{ versus } H_1: \mu_{nest} \neq \mu_{cave}\]

Suppose you choose to use the following initial settings and assumptions, based on a pilot study:

  • \(\alpha = 0.05\)
  • Power \((1 - \beta) = 0.8\)
  • Allocation Ratio = 1
  • Mean for nest box group = 0.2
  • Mean for natural cavities group = 0.15
  • SD for nest box group = 0.1
  • SD for natural cavities group = 0.11
  • Effect size \(d \approx 0.4757\)
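As a quick check on the effect size listed above, and assuming (this is not stated in the lab) that \(d\) is the difference in group means divided by a pooled SD taken as the root mean square of the two group SDs, the stated value can be recovered as:

\[d = \frac{|0.2 - 0.15|}{\sqrt{(0.1^2 + 0.11^2)/2}} = \frac{0.05}{\sqrt{0.01105}} \approx \frac{0.05}{0.1051} \approx 0.4757\]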

To achieve the desired power, your initial G*Power analysis suggests you need a total sample size of 142, with 71 birds sampled from each group (screenshot provided below for reference).

4.1

Example options are provided below, but are not exhaustive.

  1. Given that it is stated in the question that you believe that budgerigar nestlings in nest boxes are likely to have a lower mean external parasite load, compared to those in natural cavities, you could change the test from being a non-directional (two-tailed) test to being a directional (one-tailed) test.

If we do this, and keep all the other parameter specifications the same, the total sample size required reduces from 142 to 112, while the power is not meaningfully impacted, and remains above the 0.8 target threshold.

  2. If we conducted a second pilot study, we may be able to obtain potentially more accurate values for the group means and standard deviations. However, at this stage, the other main feasible option would be to consider increasing the level of significance, if this were acceptable. For example, if we were comfortable with a higher accepted rate of Type I error, we could set \(\alpha = 0.1\), which, everything else remaining fixed, would result in the required sample size being 112 (the same result as for changing the test direction - have a think about why this would be the case).

4.2

If the sample size required to achieve the power level of at least 0.8 was \(n=142\), given the other settings, then one way to increase the power level would simply be to increase our sample size.

For example, if we increase the sample size to 200 overall, then the actual power will increase to approximately 0.9173.

As previously, we could also consider changing the test to be directional. Keeping everything else fixed, this would increase the power to approximately 0.8801. Hopefully you can see the benefit here of correctly specifying the null and alternate hypotheses, for a given context.
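The sample sizes and power values in 4.1 and 4.2 can be cross-checked in the same spirit. The sketch below is an assumption-laden check rather than G*Power's own routine: it takes \(d = 0.4757\) as given, treats the one-tailed case as having the true effect in the hypothesised direction, and evaluates the noncentral \(t\) distribution by numerical integration with the Python standard library only. The function name `two_sample_power` and its `tails` argument are illustrative.

```python
import math


def _phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _s_grid(df, steps=2000, upper=6.0):
    """Midpoint-rule nodes and weights for S = sqrt(chi2_df / df)."""
    log_c = math.log(2.0) + 0.5 * df * math.log(0.5 * df) - math.lgamma(0.5 * df)
    h = upper / steps
    grid = []
    for i in range(steps):
        s = (i + 0.5) * h
        log_density = log_c + (df - 1) * math.log(s) - 0.5 * df * s * s
        grid.append((s, h * math.exp(log_density)))
    return grid


def _upper_tail(c, delta, grid):
    """P(T > c) for a noncentral t variable T = (Z + delta) / S."""
    return sum(w * _phi(delta - c * s) for s, w in grid)


def _critical_value(tail_prob, grid):
    """Central t critical value with the given upper-tail area (bisection)."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if _upper_tail(mid, 0.0, grid) > tail_prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def two_sample_power(d, n1, n2, alpha=0.05, tails=2):
    """Power of a two sample t-test with group sizes n1 and n2."""
    grid = _s_grid(n1 + n2 - 2)
    crit = _critical_value(alpha / tails, grid)
    delta = d * math.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    power = _upper_tail(crit, delta, grid)
    if tails == 2:
        power += _upper_tail(crit, -delta, grid)  # rejection in the other tail
    return power


d = 0.4757  # effect size stated in the scenario
for label, n, alpha, tails in [
    ("two-tailed, alpha=0.05, total 142", 71, 0.05, 2),
    ("one-tailed, alpha=0.05, total 112", 56, 0.05, 1),
    ("two-tailed, alpha=0.10, total 112", 56, 0.10, 2),
    ("two-tailed, alpha=0.05, total 200", 100, 0.05, 2),
    ("one-tailed, alpha=0.05, total 142", 71, 0.05, 1),
]:
    print(f"{label}: power={two_sample_power(d, n, n, alpha, tails):.4f}")
```

Comparing the second and third printed rows is one way to explore the question posed in 4.1 about why the one-tailed test at \(\alpha = 0.05\) and the two-tailed test at \(\alpha = 0.1\) lead to the same required sample size.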


5 ANOVA Analysis Interpretation Check 🌱

In this question we critiqued the following analysis and conclusions from an imaginary student:

Conclusion: The ANOVA was statistically significant as the \(p\)-value was small. Therefore we reject the null hypothesis and conclude that HCB levels increase as the Depth level increases. This result is supported by the medium effect size of 0.183, which suggests the findings are clinically significant. The Levene’s test and Shapiro-Wilk test results clearly show that there are no issues with the test assumptions.

jamovi output:

Correct Aspects

  • The student correctly notes that the results of the Levene’s Test \((p = 0.153)\) and the Shapiro-Wilk Normality Test \((p = 0.476)\) suggest there are no violations of the test assumptions.

Errors:

  • While the \(p\)-value of 0.065 is small, it is greater than the specified level of significance \((\alpha = 0.05)\).
  • We should not reject \(H_0\) since \(p > 0.05\).
  • The \(\eta^2\) effect size of 0.183 is considered a large effect size, rather than a medium effect size.
  • The conclusion that HCB levels increase as the Depth level increases is wrong.

Omissions:

  • An introduction for context would be helpful
  • It should be noted that the ANOVA was a one-way ANOVA, for accuracy
  • The test statistic and degrees of freedom are not mentioned
  • No mention of clinical significance is made
  • A Normal Q-Q plot of the residuals could have been included for additional evidence
  • If the student believed the results were statistically significant, they should have then conducted the appropriate post-hoc tests

An example corrected version of the summary is provided below:

A one-way ANOVA was conducted to determine if mean HCB levels differed across different specified Depth levels in Wolf River. Results were not statistically significant at the 5% level of significance \((F(2, 27) = 3.032, p = 0.065 > 0.05)\). Therefore, despite the small \(p\)-value, we cannot reject our null hypothesis that the population mean HCB levels are equal across the different Depth levels considered. However, the large \(\eta^2\) effect size of 0.183 suggested that the result was clinically significant, so further analysis of the data may be warranted.
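The \(\eta^2\) value can be cross-checked against the reported \(F\) statistic. For a one-way ANOVA, \(\eta^2 = SS_{between}/SS_{total}\), which can be rewritten in terms of \(F\) and the degrees of freedom (a standard identity, not part of the jamovi output):

\[\eta^2 = \frac{F \times df_{between}}{F \times df_{between} + df_{within}} = \frac{3.032 \times 2}{3.032 \times 2 + 27} = \frac{6.064}{33.064} \approx 0.183\]

This sits above the commonly quoted benchmark of 0.14 for a large \(\eta^2\) effect, which is consistent with describing the effect size as large rather than medium.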


6 AI jamovi Output Interpretation 🌳

In this question we considered the following AI scenario:

The following prompt and image were provided to the Microsoft AI tool Copilot:

Prompt: Summarise the results of the statistical test output shown in this image and comment on the statistical significance of the results at the 5% level of significance please

Image:

The AI output is included below.

Please note that each time you submit queries to an AI tool, the output you receive will be different. The output shown here was one of several versions obtained - some were more accurate, while some were less accurate.

AI answer

The statistical test output in the image pertains to a One-Way ANOVA analysis. Here are the key takeaways:

One-Way ANOVA Results: The p-value for HCB is 0.044, indicating evidence to reject the null hypothesis at a 5% significance level. The p-value for Fisher’s test is 0.065, suggesting no significant difference at the same level.

Assumption Checks: The Homogeneity of Variances Test (Levene’s) has a p-value of 0.153, implying no significant violation of homogeneity of variances.

Post Hoc Tests: Mean differences and p-values comparing Surface, Middldepth, and Bottom levels are not statistically significant at the 5% level.

In summary, there are complex relationships within the data, with significant differences in means observed in some tests but not others. The results highlight the need for further exploration and context-specific interpretation.


The AI tool has in general correctly identified the tests conducted and the statistical results (\(p\)-values, significant/not significant). However, there are some issues with the AI output, ranging from small things, like using ‘Middldepth’ instead of ‘Middepth’, to more serious oversights, namely:

  • Since the Levene’s test gives \(p = 0.153 > 0.05\), the equal variances assumption holds, so we could use the Fisher’s \(F\) result rather than the Welch’s \(F\) result. The AI makes no mention of which version to choose.
  • If, based on the results of the one-way ANOVA, we decided to conduct post-hoc tests, then it would be better to use the Tukey HSD post-hoc tests, given the results of the Levene’s test.
  • The AI has missed that, if we were to use the Games-Howell post-hoc test results, the \(p\)-value for the pairwise comparison between Surface and Bottom is statistically significant \((p = 0.038 < 0.05)\).

Hopefully you can see here that while the AI tool has been useful, we have to be very careful about blindly trusting the results provided. We need to exercise critical thinking when assessing any AI output, and it is important that we understand the subject matter ourselves.


7 G*Power Calculation - Pea Plant Data 🌳

Results from the \(t\)-tests conducted in Computer Lab 2 are shown below for reference, along with the corresponding G*Power calculations.

For reference, a descriptives table with the sample sizes for each group is also included below:

1. The one sample \(t\)-test conducted in part 5.2

For an effect size of \(d = 0.157\), with \(n = df + 1 = 85\), and \(\alpha = 0.05\), the power of the one-sample \(t\)-test is only approximately 0.2987. This suggests the test is not powerful enough for us to be confident in our conclusions.


2. The two sample \(t\)-test conducted in part 5.3

For an effect size of \(d = 2.731\) in magnitude, and a total sample size of \(n = df + 2 = 60\), with \(n_1 = 28\) and \(n_2 = 32\), the power of the two-sample \(t\)-test comparing the mean seedling heights of the C and TA groups is extremely high, at approximately 1.

3. The two sample \(t\)-test conducted in part 5.4

For an effect size of \(d = 0.119\) in magnitude, and a total sample size of \(n = df + 2 = 57\), with \(n_1 = 32\) and \(n_2 = 25\), the power of the two-sample \(t\)-test comparing the mean seedling heights of the TA and TB groups is extremely low, at approximately 0.0723.
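To see why these three power values differ so sharply, it can help to look at the noncentrality parameter \(\delta\) that underlies a \(t\)-test power calculation. The expressions below are the standard ones for one-sample and two-sample \(t\)-tests rather than anything taken from the lab sheet, and the values are rounded:

\[\delta_{5.2} = d\sqrt{n} = 0.157 \times \sqrt{85} \approx 1.447\]

\[\delta_{5.3} = d\sqrt{\frac{n_1 n_2}{n_1 + n_2}} = 2.731 \times \sqrt{\frac{28 \times 32}{60}} \approx 10.55\]

\[\delta_{5.4} = d\sqrt{\frac{n_1 n_2}{n_1 + n_2}} = 0.119 \times \sqrt{\frac{32 \times 25}{57}} \approx 0.446\]

Assuming two-tailed tests at \(\alpha = 0.05\), the critical \(t\) value for these degrees of freedom is close to 2. A \(\delta\) far above that value (part 5.3) gives power near 1, while a \(\delta\) below it (parts 5.2 and 5.4) gives low power.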


References

Faul, F., Erdfelder, E., Buchner, A., and Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.

Faul, F., Erdfelder, E., Lang, A.-G., and Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.