An analysis of variance (ANOVA) is appropriate when we seek to compare the means of three or more groups.
ANOVA compares the means of three or more independent groups. It requires a continuous outcome and a categorical factor of interest that distinguishes the independent groups from each other.
It is an extension of the two-sample t-test to more than two groups (\(k > 2\)).
ANOVA is widely used in various fields, including social sciences, biology, medicine, engineering, and business.
ANOVA relies on estimates of spread, or dispersion. In other words, the procedure analyzes the variances in the data. There are two sources of variation in the data: within-group and between-group.
If the variation between groups is substantially larger than the variation within groups, it suggests that the group means are not all equal.
The two variance estimates used by ANOVA are:
Within-Group Variation \((s_w^2)\): The variation of individual values around their group mean; an estimate of the common variance \(\sigma^2\).
Between-Group Variation \((s_B^2)\): The variation of the group means around the grand mean. If the null hypothesis of equal group means holds, this is also an estimate of the common variance \(\sigma^2\).
An alternative way to calculate them is:
\(SS_T = SS_B + SS_W = \sum_{i=1}^{k} \sum_{j=1}^{n_i}(x_{ij} - \bar{x})^2\)
\(SS_B = SS_M = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2\)
\(SS_W = SS_E = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2\)
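To connect these sums of squares to the two variance estimates, each sum of squares is divided by its degrees of freedom to give a mean square, and the ratio of the two mean squares is the F statistic. This is the standard one-way ANOVA formulation; the symbol \(N = \sum_{i=1}^{k} n_i\) for the total sample size is introduced here for convenience and does not appear in the original text:

\[
s_B^2 = MS_B = \frac{SS_B}{k - 1}, \qquad
s_w^2 = MS_W = \frac{SS_W}{N - k}, \qquad
F = \frac{MS_B}{MS_W}
\]

When all group means are equal, \(F\) follows an F distribution with \(k - 1\) and \(N - k\) degrees of freedom, so a large \(F\) indicates that the between-group variation exceeds what the within-group variation alone would explain.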
Analysis of variance (ANOVA) is a powerful statistical technique with several advantages that make it a valuable tool for data analysis in various fields.
Overall, ANOVA is a versatile and robust tool that helps uncover significant differences among groups and provides a structured approach for exploring relationships in data.
# Load medical data
# `directory` is assumed to be defined earlier and to end with a path separator
data <- read.csv(paste0(directory, "medical_data.csv"))
# Display the first few rows of the data
head(data)
##   PatientID Group RecoveryTime
## 1         1     A         10.2
## 2         2     A          9.8
## 3         3     A         11.5
## 4         4     A         10.7
## 5         5     A         10.0
## 6         6     B         15.3
Treatment Examples:
library(ggplot2)
library(dplyr)
library(rstatix)
library(ggpubr)
H0: \(\mu_1 = \mu_2 = \cdots = \mu_k\)
Ha: \(\mu_i \neq \mu_j\) for at least one pair \(i \neq j\)
Our null hypothesis states that the mean recovery time is the same in all treatment groups.
The alternative hypothesis, on the other hand, states that at least one group has a different mean recovery time.
# Create a boxplot
ggplot(data, aes(x = Group, y = RecoveryTime)) +
  geom_boxplot() +
  labs(x = "Treatment Group", y = "Recovery Time") +
  ggtitle("Comparison of Recovery Times across Treatment Groups")
# Mean comparison between groups
data %>%
  group_by(Group) %>%
  summarize(mean = mean(RecoveryTime),
            sd = sd(RecoveryTime))
## # A tibble: 3 × 3
##   Group  mean    sd
##   <chr> <dbl> <dbl>
## 1 A      10.4 0.680
## 2 B      16.0 0.950
## 3 C       9   0.524
# Test for normality
shapiro.test(data$RecoveryTime)
##
## Shapiro-Wilk normality test
##
## data: data$RecoveryTime
## W = 0.83418, p-value = 0.01045
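From the output above, the p-value (0.01045) is below 0.05, so the Shapiro-Wilk test rejects normality for the pooled recovery times. This is not unexpected when the group means differ, because pooling groups with different means produces a mixture that is not normally distributed. The normality assumption of ANOVA applies to the observations within each group (equivalently, to the model residuals), so the test is more informative when it is run on the residuals of the fitted model or separately for each group.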
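The normality assumption in ANOVA concerns the residuals rather than the pooled outcome, so a residual-based check may be a better fit here. The sketch below illustrates the idea on simulated stand-in data: the data frame `sim` and its values are hypothetical (drawn from normal distributions with roughly the group means and standard deviations reported in this tutorial) and are not the contents of medical_data.csv, so the resulting p-values say nothing about the real data.

```r
# Simulated stand-in data (hypothetical, NOT the real medical_data.csv).
# Assumption: 5 observations per group, as suggested by the first rows shown.
set.seed(42)
sim <- data.frame(
  Group = rep(c("A", "B", "C"), each = 5),
  RecoveryTime = c(rnorm(5, mean = 10.4, sd = 0.68),
                   rnorm(5, mean = 16.0, sd = 0.95),
                   rnorm(5, mean = 9.0,  sd = 0.52))
)

# Fit the one-way model, then test the residuals for normality
fit <- aov(RecoveryTime ~ Group, data = sim)
res_test <- shapiro.test(residuals(fit))

# Alternative: one Shapiro-Wilk p-value per group
by_group <- tapply(sim$RecoveryTime, sim$Group,
                   function(x) shapiro.test(x)$p.value)

print(res_test)
print(by_group)
```

With the tutorial's own data, the same two calls would be applied to `residuals(model)` after fitting, or to `data$RecoveryTime` split by `data$Group`. With only five observations per group, the per-group tests have little power, so a Q-Q plot of the residuals is a common complement.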
# Test for homogeneity of variances
bartlett.test(RecoveryTime ~ Group, data = data)
##
## Bartlett test of homogeneity of variances
##
## data: RecoveryTime by Group
## Bartlett's K-squared = 1.2713, df = 2, p-value = 0.5296
From the output above, the p-value (0.5296) is greater than 0.05, so there is no significant evidence that the variances differ across groups.
Therefore, we can assume homogeneity of variances across the treatment groups.
# Run one-way ANOVA
model <- aov(RecoveryTime ~ Group, data = data)
summary(model)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Group        2 134.98   67.49   123.4 9.95e-09 ***
## Residuals   12   6.56    0.55
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table above, there are significant differences between the groups: F(2, 12) = 123.4, p = 9.95e-09, which is flagged with "***" (p < 0.001).
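As a cross-check, the entries of the ANOVA table can be rebuilt from the group means and standard deviations printed earlier. The sketch below assumes a balanced design with 5 observations per group (inferred from the 12 residual degrees of freedom with 3 groups; the original does not state it explicitly) and uses the rounded summary values, so the results should land close to, but not exactly on, the reported figures.

```r
# Group summaries as printed in the tutorial's output (rounded values)
means <- c(A = 10.44, B = 15.96, C = 9.00)
sds   <- c(A = 0.680, B = 0.950, C = 0.524)
n     <- c(A = 5, B = 5, C = 5)  # assumption: 5 observations per group

k <- length(means)               # number of groups
N <- sum(n)                      # total sample size
grand_mean <- sum(n * means) / N

# Sums of squares: between groups and within groups
ss_between <- sum(n * (means - grand_mean)^2)
ss_within  <- sum((n - 1) * sds^2)

# Mean squares, F statistic, and upper-tail p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (N - k)
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

print(c(SS_B = ss_between, SS_W = ss_within, F = f_stat, p = p_value))
```

The within-group sum of squares is recovered from the standard deviations because each group's sum of squared deviations equals \((n_i - 1) s_i^2\).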
## Mean Value per group compared with reference
model$coefficients
## (Intercept) GroupB GroupC
## 10.44 5.52 -1.44
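As calculated before, the mean recovery time differs by treatment (the factor). The intercept (10.44) is the mean of the reference group A, and the GroupB and GroupC coefficients are the differences between the means of groups B and C and the mean of group A.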
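The coefficients relate to the group means through R's default treatment contrasts: the intercept is the mean of the reference level (group A here), and each remaining coefficient is an offset from it. A minimal sketch, with the coefficient values copied by hand from the output above rather than taken from a fitted model:

```r
# Coefficients copied from the model output above
coefs <- c("(Intercept)" = 10.44, GroupB = 5.52, GroupC = -1.44)

# Reference-group mean plus each offset gives the per-group means
group_means <- c(
  A = coefs[["(Intercept)"]],
  B = coefs[["(Intercept)"]] + coefs[["GroupB"]],
  C = coefs[["(Intercept)"]] + coefs[["GroupC"]]
)
print(group_means)
```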
## Mean
print("Average Recovery Time Per group:")
## [1] "Average Recovery Time Per group:"
with(data, tapply(RecoveryTime ,Group ,mean ) )
## A B C
## 10.44 15.96 9.00
It is therefore no surprise that the p-value leads us to conclude that the differences in mean recovery time between treatments are significant.
## extract the p-value for Group from the ANOVA table by name
p_value <- summary(model)[[1]][["Pr(>F)"]][1]
if (p_value < 0.05) {
  print("Reject Null hypothesis H0 --> Significant statistical evidence found")
} else {
  print("No evidence to reject H0 --> No significant statistical difference found")
}
## [1] "Reject Null hypothesis H0 --> Significant statistical evidence found"
A significant one-way ANOVA is generally followed up by Tukey post-hoc tests to perform multiple pairwise comparisons between groups.
Here we use the function tukey_hsd() from the rstatix package:
# Pairwise comparisons
pwc <- data %>% tukey_hsd(RecoveryTime ~ Group)
pwc
## # A tibble: 3 × 9
## term group1 group2 null.value estimate conf.low conf.high p.adj p.adj…¹
## * <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Group A B 0 5.52 4.27 6.77 1.63e-7 ****
## 2 Group A C 0 -1.44 -2.69 -0.192 2.41e-2 *
## 3 Group B C 0 -6.96 -8.21 -5.71 1.18e-8 ****
## # … with abbreviated variable name ¹p.adj.signif
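The confidence limits in this table can be reproduced by hand, which helps show what the Tukey procedure does. For a balanced design, the half-width of each interval is the critical value of the studentized range distribution times \(\sqrt{MS_W / n}\). The sketch below assumes 5 observations per group (inferred from the degrees of freedom, not stated in the original) and uses the rounded values from the ANOVA table, so small rounding differences are expected.

```r
# Inputs taken from the ANOVA table and group means above (rounded)
ms_within   <- 6.56 / 12   # residual mean square
n_per_group <- 5           # assumption: balanced design, 5 per group
k           <- 3           # number of groups
df_resid    <- 12          # residual degrees of freedom

# Critical value of the studentized range at the 95% level
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# Half-width shared by all pairwise intervals in a balanced design
half_width <- q_crit * sqrt(ms_within / n_per_group)

# Interval for the difference between the means of groups B and A
diff_BA <- 15.96 - 10.44
ci_BA <- diff_BA + c(-1, 1) * half_width
print(round(ci_BA, 2))
```

The same half-width applied to the A versus C difference (-1.44) gives an interval that excludes zero only narrowly, which is consistent with its larger adjusted p-value.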
Based on our analysis, because the ANOVA p-value (9.95e-09) is below the 0.05 significance level, we can conclude that at least one group has a significantly different mean recovery time.
Moreover, the multiple-comparison output shows that all pairwise differences are significant at the 0.05 level: the adjusted p-values are 1.63e-7 (A vs. B), 0.0241 (A vs. C), and 1.18e-8 (B vs. C). The A vs. C difference is not significant at the 0.01 level.
Treatment C has the shortest mean recovery time (9.0), and treatment B the longest (16.0).
In the context of medical applications, this could lead to further investigations or changes in treatment approaches.
res.aov <- data %>% anova_test(RecoveryTime ~ Group) ## from rstatix
res.aov
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Group 2 12 123.378 9.95e-09 * 0.954
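The `ges` column is the generalized eta squared effect size. For a one-way between-subjects design such as this one, it reduces to the share of the total variation explained by the grouping factor, which can be checked against the sums of squares in the earlier ANOVA table (rounded values, so the match is approximate):

\[
\eta_G^2 = \frac{SS_B}{SS_B + SS_W} = \frac{134.98}{134.98 + 6.56} \approx 0.954
\]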
# Visualization: box plots with p-values
pwc <- pwc %>% add_xy_position(x = "Group")
ggboxplot(data, x = "Group", y = "RecoveryTime") +
  stat_pvalue_manual(pwc, hide.ns = TRUE) +
  labs(
    subtitle = get_test_label(res.aov, detailed = TRUE),
    caption = get_pwc_label(pwc)
  )