Drummond and Vowler address a few considerations regarding ANOVA tests:
There are a few assumptions that ANOVA makes: that the residuals are normally distrubuted, and that the variation between grooups is relatively similar. In addition, the power of ANOVA is significantly limited by outliers, so excluding these may provide a more accurate test.
ANOVA tests differences in measures of central tendence between 2 or more groups. This means it assesses the differences between groups, and determines if this is due to error by comparing it with the differences within the groups.
One-Way ANOVA is reserved for one independent variable, whereas Two-Way ANOVA is used when there are two or more independent variables. Two-Way is especially when useful when trying to consider the interactions two independent variables might have, especially in the context of experimental design.
Using repeated measures increases the chances of false positives, especially when the subject is measured more than once. In cases where there are multiple measurements, then ANOVA might not be the best test for that dataset.
There are two important points in particular:
ANOVAs measure the variation between two or more groups. Two-Way ANOVAs are supposed to be used when there is more than one independent variable for every measured value. This is calculated in the form of the F-statistic, which the the ratio of the sum of squares between groups to the sum of squares within groups. The higher the F-statistic, the greater relative variance there is between two groups, and therefore the dependent variable is more likely to be affected by the independent variables. An F-statistic of 1 means that the variance between as the variance within the groups, which is indicative of the samples coming from the same populations. The aforementioned assumptions must be considered, but a large sample size is the most integral to generating these values.
There is a possibility that two independent variables can interact, which would affect the ANOVA. As such, a Two-Way ANOVA will determine the F-statistic of the interactions, and if the additive model is sufficient to determine the variance. One-Way ANOVA is used when assessing only one variable, and can only be used on a dataset with more than one independent variable if it is determined that the interactions are significant. Distinguishing between the two types of ANOVA is essential due to the different information given by each test. One-Way ANOVA is limited to one factor, but a Two-Way can perform on two or more factors (with more than two factors, this must be a repeated measure).
Figure 1 represents the four subgroups of quoll aversion data using box plots. J-U indicates the Juvenile-Untrained group, J-T represents the Juvenile-Trained group, A-U stands for Adult-Untrained, and A-T shows the measurements for the Adult-Trained group. The box indicates the 25th and 75th percentiles, with the black line indicating the median of the data. The dashed line plungers indicate the most extreme data point that is 1.5 times the interquartile range from the quartiles, and the blank dots indicate the most extreme data that doesnt lie in that range.
Figure 2 represents the four subgroups of quoll aversion data using box plots. J-U indicates the Juvenile-Untrained group, J-T represents the Juvenile-Trained group, A-U stands for Adult-Untrained, and A-T shows the measurements for the Adult-Trained group. The box indicates the 25th and 75th percentiles, with the black line indicating the median of the data. The dashed line plungers indicate the most extreme data point that is 1.5 times the interquartile range from the quartiles, and the blank dots indicate the most extreme data that doesnt lie in that range. Figures 2A and 2C show all four subgroups, colored according to age and training, respectively. Figures 2B and 2D show the data grouped into the Juvenile-Adult and Untrained-Trained groups, respectively.
Figure 3 illustrates the interaction plots of both Age and Training with Time. Figure 3A represents the interaction Training has with Age, and Figure 3B represents the interaction Age has with Training.
Based on the boxplots in Figures 1 and 2, distribution of the subgroups (Juvenile vs. Adult, Untrained vs. Trained) are different, indicating that these two factors might have an effect. The interaction plots in Figure 3 look very similar to each other in that they indicate a large interaction between the factors, and that there are effects due to both factors. It seems that both age and training may cause more aversion to the toxic cane toad. In both graphs, there was a higher mean inspection time in one group (juvenile and untrained, respectively). More tests are needed to quantitatively determine these effects.
results.2way = anova2way.boot(quolldat$Time, quolldat$Age, quolldat$Training, center = mean, agg = ssq)
results.round <- round(x = results.2way$p, digits = 4)
theory = lm(Time ~ Age*Training, data=quolldat)
summary(theory)
##
## Call:
## lm(formula = Time ~ Age * Training, data = quolldat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8300 -4.5800 0.1225 4.9200 10.1850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.680 1.501 28.427 < 2e-16 ***
## AgeJuvenile 9.945 2.123 4.684 1.21e-05 ***
## TrainingUntrained 8.835 2.123 4.161 8.26e-05 ***
## AgeJuvenile:TrainingUntrained -7.130 3.003 -2.374 0.0201 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.714 on 76 degrees of freedom
## Multiple R-squared: 0.3215, Adjusted R-squared: 0.2947
## F-statistic: 12.01 on 3 and 76 DF, p-value: 1.62e-06
Figure 4 is the histogram of the bootstrapped 2-way F-statistic. The calculated F-statistics are shown by the purple, green, and pink lines for age, training, and interactions, respectively. The p-values for each statistic are shown at the top of the line.
Test Variable | F-statistic | p-value |
---|---|---|
Age | 0.2376027 | 0 |
Training | 0.162118 | 310^{-4} |
Interactions | 0.0741872 | 0.0192 |
These values were not expected, which indicated a potential issue with the code (see Formulae Appendix). Seeing this, I chose to look at the built-in ANOVA function in R. While flawed, it could potentially give F-statistics that were more in line with what was expected.
model <- lm(Time ~ Training * Age, data=quolldat)
result <- round(anova(model),4)
result$Fval <- result$`F value`
result$Pval <- result$`Pr(>F)`
Test Variable | F-statistic | p-value |
---|---|---|
Age | 12.321 | 810^{-4} |
Training | 18.0578 | 10^{-4} |
Interactions | 5.6382 | 0.0201 |
These values were more aligned with the expected variance; even so, I chose to assess the p-values of the handmade function, as their relative comparisons seemed to show similar results, and the handmade function corrects for the discrepancies and assumptions found within the built-in function.
One must first consider the F statistic of the interactions, to see if they have an effect, before performing any other ANOVA tests. The \(\alpha\) was 0.0741872, which is greater than the p-value of 0.0192. Therefore, one may conclude that the additive model can be rejected; that means that both Age and Training have an effect on outcome. Therefore, the two-way F-statistics generated from these analyses will not be used, as they could provide results that are inaccurate. Therefore, we must perform one-way ANOVA tests on both age and training separately.
#see formula appendix
age <- fstat.all(quolldat$Time, quolldat$Age)
training <- fstat.all(quolldat$Time, quolldat$Training)
Figure 5 illustrates the histograms of the bootstrapped one-way F-statistics. The calculated F-statistics are shown by the purple and green lines for age and training, respectively. The p-values for each statistic are shown at the top of the line.
Test Variable | F-statistic | p-value |
---|---|---|
Age | 0.1921878 | 610^{-4} |
Training | 0.1235853 | 0.0032 |
As shown in Table 2, the F-statistics are less than one in both tests, but the p-values are significant (610^{-4} and 0.0032, for age and training respectively). Because the F-statistics are less than one, that indicates that there is a greater amount of variance within the groups than between the groups. Because of this, it seems that there might not be a large effect of these two (as opposed to an F-statisic that is greater than one).
After further inspection, it seems that went awry with the code (see Formulae appendix). As such, I chose to verify the information with the built in ANOVA modeling. My results were as follows:
resultT <- anova(lm(Time ~ Training, data=quolldat))
TF <- round(resultT$`F value`[1], 4)
TP <- round(resultT$`Pr(>F)`[1], 4)
resultA <- anova(lm(Time ~ Age, data=quolldat))
AF <- round(resultA$`F value`[1], 4)
AP <- round(resultA$`Pr(>F)`[1], 4)
Test Variable | F-statistic | p-value |
---|---|---|
Age | 14.9906 | 210^{-4} |
Training | 9.6397 | 0.0027 |
The p-values from these tests are also significant, but the primary difference is that the F-statistics are actually greater than one, which is much closer to the expected results. Based on visual inspection of the data, it seemed likely that age had a greater effect on time spent inspecting cane toads than training did, and these F-statistics do point to that information; the F-statistic is higher for the comparison of the age samples than it is for the training sample groups. However, as both of the values seem large, one can conclude that these both have an effect on the time spent inspecting the toxic toads.
Figure 6 compares age subpopulations and training subpopulations in a violin plot. The white dot indicates the median, with the black box indicating the values of the upper and lower quantiles. Figure 6A compares the age subgroups, while Figure 6B compares those of the training characteristics.
Post-Hoc testing can be performed on the two types of variables (age, training). Inspecting the violin plots, which were used in order to see the distributions of the data, led to the conclusion that the subgroups did not come from the same underlying populations, and as such, two-box comparisons were used for both of the post-hoc comparisons. By inspecting the distribution of the data, it seemed that the best measure of central tendency to use was the median, as there was a fairly large amount of skew in all four subgroups. After performing these tests, the p-values must be adjusted to account for the chance of false positivies when performing multiple evaluations.
JA <- twobox.test(quoll.j$Time, quoll.a$Time)
UT <- twobox.test(quoll.u$Time, quoll.t$Time)
#pvalues <- c(JA$p, AJ$p, UT$p, TU$p)
pvalues <- c(JA[4], UT[4])
qvalues <- p.adjust(pvalues, method = p.adjust.methods, n= length(pvalues))
The two-group comparisons yielded the following results:
Comparison of Juvenile vs. Adult:
Comparison of Untrained vs. Trained:
Age and training both have a statistically significant negative effect on the median time inspecting a cane toad, according to the two-box tests. The difference in medians for the age subgroups had a positive effect, while the confidence intervals for the median differences for the training subgroups contained zero effect, meaning that there was a possibility that there was no effect. The training seems to have a minimal effect on making the quolls averse to the toxic toads, but it seems that age also has an effect. This highlights the importance of learning when it comes to survival behaviors. Overall, in order to conserve this species, it would seem that training would help. However, given the amount of effort required train the quolls, it would be incredibly costly to try and train the quoll populations for an effect that can also be seen just by comparing juvenile to adult quolls.
ssq <- function(x) sum(x^2)
sab <- function(x) sum(abs(x))
#inner function
Fstat.2way <- function(X, f1, f2, center = mean, agg = sab, rank = F){
if (rank == F) {Y = X} else {Y = rank(-X)}
#take the grand mean
mu = center(Y)
#generate f1, f2, and interaction groupmeans
groupmeans.f1 = ave(Y, f1, FUN = center)
groupmeans.f2 = ave(Y, f2, FUN = center)
groupmeans.int = ave(Y, f1, f2, FUN = center)
#calculate the aggregate functions
SSB.f1 <- agg(groupmeans.f1 - mu)
SSB.f2 <- agg(groupmeans.f2 - mu)
SSB.int <- agg(groupmeans.int - (groupmeans.f1 + groupmeans.f2 - mu))
SSW <- agg(Y - groupmeans.int)
return( c(f1 = SSB.f1, f2 = SSB.f2, int = SSB.int) / SSW)
}
#outer function
anova2way.boot <- function(X, f1, f2, deals = 10000, center= mean, agg = sab, rank = FALSE){
dataF <- Fstat.2way(X, f1, f2, center = center, agg = agg, rank = rank)
#create the vector
bootF = matrix(NA, nrow = deals, ncol = 3)
#create the bootstrap within this function
for(i in 1:deals) {
bootX = sample (X, replace = T)
bootF[i,] = Fstat.2way(bootX, f1, f2, center = center, agg = agg, rank = rank)
}
pval = c( p1 = sum(bootF[,1] > dataF[1])/deals,
p2 = sum(bootF[,2] > dataF[2])/deals,
pint = sum(bootF[,3] > dataF[3])/deals)
return(list(Fstat = dataF, bootF = bootF, p = pval))
}
#1 way fstat function
Fstat.1way <- function(Y, group, center = mean, agg = sumsquares) {
GroupMeans <- ave(Y, group, FUN = center)
SSB <- agg(GroupMeans - center(Y))
SSW <- agg(Y - GroupMeans)
FF <- SSB/SSW
return(FF=FF)
}
#bootstrap for p-value of fstat
Fstat.1way.p = function(Y, group, center = mean, agg = sumsquares){
deals = 10000
bootF <- rep(NA, deals)
for (i in 1:deals){
quolldat$boottime <- sample(Y, replace = TRUE)
bootF[i] = Fstat.1way(quolldat$boottime, group)
}
return(bootF)
}
twobox.test <- function(datasetA, datasetB){
#Effect size
deals = 10000
difference = rep(NA,deals)
for (i in 1:deals) {
bootdataA = sample(datasetA,length(datasetA),replace=TRUE)
bootdataB = sample(datasetB,length(datasetB),replace=TRUE)
difference[i] = median(bootdataA) - median(bootdataB)
}
est.median = median(difference)
#calculate confidence intervals
CI = sort(difference)[c(0.025*deals,0.975*deals)]
#NHST
boxA = datasetA - median(datasetA)
boxB = datasetB - median(datasetB)
deals = 10000
difference = rep(NA,deals)
for (i in 1:deals) {
bootdataA = sample(boxA,length(datasetA),replace=TRUE)
bootdataB = sample(boxB,length(datasetB),replace=TRUE)
bootmedianA = median(bootdataA)
bootmedianB = median(bootdataB)
difference[i] = bootmedianA-bootmedianB
}
diff <- median(datasetA)-median(datasetB)
#generate the p-value thresholds
HP <- sum(difference > abs(diff))
LP <- sum(difference < -abs(diff))
#calculate the pvalue
pval = (HP+LP)/deals
return(c(diff = diff, CI=CI, p=pval, est= est.median))
}
onebox.test <- function(datasetA, datasetB){
#Effect size
deals = 10000
difference = rep(NA,deals)
for (i in 1:deals) {
bootdataA = sample(datasetA,length(datasetA),replace=TRUE)
bootdataB = sample(datasetB,length(datasetB),replace=TRUE)
difference[i] = median(bootdataA) - median(bootdataB)
}
est.median = median(difference)
#calculate confidence intervals
CI = sort(difference)[c(0.025*deals,0.975*deals)]
#NHST
deals = 10000
datajoined = c(datasetA, datasetB)
difference = rep(NA, deals)
for (i in 1 : deals) {
bootdataA = sample(datajoined, length(datasetA), replace = TRUE)
bootdataB = sample(datajoined, length(datasetB), replace = TRUE)
bootmedianA = median (bootdataA)
bootmedianB = median (bootdataB)
difference[i] = bootmedianA - bootmedianB
}
diff <- median(datasetA)-median(datasetB)
#generate the p-value thresholds
HP <- sum(difference > abs(diff))
LP <- sum(difference < -abs(diff))
#calculate the pvalue
pval = (HP+LP)/deals
return(c(diff = diff, CI=CI, p=pval, est= est.median))
}