Analysis of Variance One Way

An analysis of variance (ANOVA) is appropriate when we seek to compare the means of three or more groups.

ANOVA compares the means of three or more independent groups. It requires a continuous outcome and a categorical factor of interest that distinguishes the groups from each other.

It is an extension of the two-sample t-test to more than two groups (\(k > 2\)).

MEDICAL APPLICATIONS.

ANOVA is widely used in various fields, including social sciences, biology, medicine, engineering, and business. Some common applications of ANOVA include:

Comparing means of different treatment groups in a scientific experiment.
Assessing the impact of various factors on product performance in manufacturing.
Analyzing the differences in test scores among students in different schools.
Determining if there are significant variations in customer satisfaction ratings across different service providers.

Explanation of ANOVA

ANOVA is dependent on estimates of spread or dispersion. In other words, the procedure analyzes the variances of the data. There are two sources of variation in the data: Within-group and between-group.

If the variation between groups is significantly larger than the variation within groups, it suggests that the means of the groups are different.

ANOVA COMPARISON.

The formulas to apply:

ANOVA TABLE.

Let’s understand the formulas…

\(k\) is the number of groups (levels of the factor Group)
\(n\) is the total number of observations
\(s_i^2\) is the variance of the continuous variable in group i
\(n_i\) is the number of observations in group i

Within-Group Variation \((s_w^2)\): The variation of individual values around their group mean; an estimate of the common variance \(\sigma^2\).

Between-Group Variation \((s_B^2)\): The variation of the group means around the grand mean; it also estimates \(\sigma^2\) when H0 is true, and tends to exceed it when the group means differ.
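The formulas referred to above are not reproduced in this text. A standard textbook form, written with the symbols just defined (with \(\bar{x}_i\) the mean of group i and \(\bar{x}\) the grand mean), is sketched below; the notation of the original table may differ.

```latex
\[
s_w^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{n - k},
\qquad
s_B^2 = \frac{\sum_{i=1}^{k} n_i\, (\bar{x}_i - \bar{x})^2}{k - 1},
\qquad
F = \frac{s_B^2}{s_w^2}
\]
```

Under H0, the ratio F follows an F distribution with \(k - 1\) and \(n - k\) degrees of freedom, so large values of F are evidence against equal means.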

An alternative way to calculate them is:

\(SS_T = SS_B + SS_W = \sum_{i=1}^{k} \sum_{j=1}^{n_i}(x_{ij} - \bar{x})^2\)

\(SS_B = SS_M = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2\)

\(SS_W = SS_E = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2\)
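As a quick numerical check of this decomposition, the sketch below computes the three sums of squares in base R on a small made-up dataset and confirms that the total equals the sum of the two parts. The values and the names `x` and `g` are illustrative only and are not taken from the medical data.

```r
# Hypothetical toy data: 3 groups with 3 observations each
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
g <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(x)               # mean of all observations
group_means <- tapply(x, g, mean)    # one mean per group
group_sizes <- tapply(x, g, length)  # n_i for each group

# Total, between-group and within-group sums of squares
ss_total   <- sum((x - grand_mean)^2)
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((x - group_means[as.character(g)])^2)

c(SS_T = ss_total, SS_B = ss_between, SS_W = ss_within)
all.equal(ss_total, ss_between + ss_within)
```

For this toy data the between-group part dominates (54 out of 60), which is the pattern that leads to a large F statistic.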

Advantages of ANOVA

The Analysis of Variance (ANOVA) is a powerful statistical technique with several advantages that make it a valuable tool for data analysis in various fields. Here are some advantages of using ANOVA:

Comparison of Multiple Groups
Efficient Use of Resources
Control of Experiment-Wide Error Rate
Insights into Group Relationships
Detection of Complex Patterns
Less Chance of Simpson’s Paradox
Useful for Experimental Designs
Flexible and Adaptable
Inferential Insights
Supports Scientific Hypotheses

Overall, ANOVA is a versatile and robust tool that helps uncover significant differences among groups and provides a structured approach for exploring relationships in data.

Loading and Exploring Data

# Load medical data (`directory` must already hold the folder path, ending in a path separator)
data <- read.csv(paste0(directory, "medical_data.csv"))

# Display the first few rows of the data
head(data)

##   PatientID Group RecoveryTime
## 1         1     A         10.2
## 2         2     A          9.8
## 3         3     A         11.5
## 4         4     A         10.7
## 5         5     A         10.0
## 6         6     B         15.3

Treatment Examples:

Surgery Techniques: Laparoscopic, Open, Robotic
Medication Types: Antibiotics, Painkillers, Anti-inflammatory
Therapy Approaches: Physical Therapy, Occupational Therapy, Cognitive Behavioral Therapy

Load necessary libraries

library(ggplot2)

library(dplyr)

library(rstatix)

library(ggpubr)

Setting Hypothesis

H0: \(\mu_1 = \mu_2 = \dots = \mu_k\)

Ha: \(\mu_i \neq \mu_j\) for at least one pair \(i \neq j\)

The null hypothesis states that the mean recovery time is the same in every treatment group.

The alternative hypothesis, on the other hand, states that at least one group has a different mean recovery time.

Visualization for Insight

# Create a boxplot
ggplot(data, aes(x = Group, y = RecoveryTime)) +
  geom_boxplot() +
  labs(x = "Treatment Group", y = "Recovery Time") +
  ggtitle("Comparison of Recovery Times across Treatment Groups")

# Mean Comparison between groups
data %>%
  group_by(Group) %>%
  summarize( mean = mean(RecoveryTime),
             sd = sd(RecoveryTime))

## # A tibble: 3 × 3
##   Group  mean    sd
##   <chr> <dbl> <dbl>
## 1 A      10.4 0.680
## 2 B      16.0 0.950
## 3 C       9   0.524

Assumptions of ANOVA

Samples from the k populations are independent.
The k populations are normally distributed.
Variances in the k populations are equal (i.e., \(\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2\)).

# Test for normality
shapiro.test(data$RecoveryTime)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$RecoveryTime
## W = 0.83418, p-value = 0.01045

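From the output above, the p-value (0.010) is below 0.05, so the test rejects normality for the pooled recovery times. That is expected when the group means differ, because pooling groups with different means produces a non-normal mixture even if each group is itself normal. The ANOVA assumption concerns the distribution within each group, so normality is better assessed on the model residuals or separately per group.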
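Because the normality assumption applies within groups, a common alternative is to run the Shapiro-Wilk test on the residuals of the fitted model, or on each group separately. The sketch below shows both on simulated data; the data frame `toy`, its group means and the seed are hypothetical and only stand in for the medical dataset.

```r
# Hypothetical data: three groups of five, simulated for illustration only
set.seed(42)
toy <- data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 5)),
  RecoveryTime = rnorm(15, mean = rep(c(10, 16, 9), each = 5), sd = 0.7)
)

# Option 1: test the residuals of the one-way model
toy_model <- aov(RecoveryTime ~ Group, data = toy)
res_check <- shapiro.test(residuals(toy_model))
res_check

# Option 2: test each group separately (low power with only 5 values per group)
by_group <- tapply(toy$RecoveryTime, toy$Group,
                   function(x) shapiro.test(x)$p.value)
by_group
```

With the real data, the same two calls would use `data` in place of `toy`.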

# Test for homogeneity of variances
bartlett.test(RecoveryTime ~ Group, data = data)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  RecoveryTime by Group
## Bartlett's K-squared = 1.2713, df = 2, p-value = 0.5296

From the output above, the p-value (0.53) is greater than 0.05, so there is no significant difference between the variances across groups.

Therefore, we can assume the homogeneity of variances in the different treatment groups.

Running the One-Way ANOVA

# Run one-way ANOVA
model <- aov(RecoveryTime ~ Group, data = data)
summary(model)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 134.98   67.49   123.4 9.95e-09 ***
## Residuals   12   6.56    0.55                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table above shows a significant difference between groups, F(2, 12) = 123.4, p < 0.001 (flagged with “***”).
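The table can be approximately reconstructed from the group means and standard deviations reported in this analysis, which is a useful sanity check on the formulas. The sketch below assumes five observations per group (inferred from the 2 and 12 degrees of freedom and the first rows of the data, not stated explicitly) and uses rounded summaries, so it only matches the `aov()` output up to rounding.

```r
# Rounded group summaries from this analysis; n_i = 5 per group is an assumption
n_i    <- c(A = 5, B = 5, C = 5)
mean_i <- c(A = 10.44, B = 15.96, C = 9.00)
sd_i   <- c(A = 0.680, B = 0.950, C = 0.524)

k <- length(n_i)                       # number of groups
n <- sum(n_i)                          # total number of observations
grand_mean <- sum(n_i * mean_i) / n

# Sums of squares from the summary statistics
ss_between <- sum(n_i * (mean_i - grand_mean)^2)
ss_within  <- sum((n_i - 1) * sd_i^2)

# Mean squares, F statistic and upper-tail p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

round(c(SS_B = ss_between, SS_W = ss_within, F = f_stat), 2)
p_value
```

The sums of squares agree with the table to two decimals, and the F statistic differs slightly only because the standard deviations were rounded to three digits.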

Interpreting Results

## Mean Value per group compared with reference
model$coefficients

## (Intercept)      GroupB      GroupC 
##       10.44        5.52       -1.44

As calculated before, the mean recovery time differs by treatment group (the factor).

## Mean

print("Average Recovery Time Per group:")

## [1] "Average Recovery Time Per group:"

with(data, tapply(RecoveryTime ,Group ,mean ) )

##     A     B     C 
## 10.44 15.96  9.00
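With R's default treatment coding, the intercept is the mean of the reference level (group A) and each remaining coefficient is that group's difference from the reference. The short check below rebuilds the group means from the coefficients printed above; the vector `coefs` is typed in by hand for illustration rather than extracted from the fitted model.

```r
# Coefficients copied from the model output above
coefs <- c("(Intercept)" = 10.44, GroupB = 5.52, GroupC = -1.44)

# Reference mean, then reference mean plus each offset
group_means <- c(
  A = coefs[["(Intercept)"]],
  B = coefs[["(Intercept)"]] + coefs[["GroupB"]],
  C = coefs[["(Intercept)"]] + coefs[["GroupC"]]
)
group_means
```

The result matches the per-group averages from `tapply()`, which confirms that the two outputs describe the same quantities.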

It is therefore no surprise that the p-value leads us to conclude that the mean recovery time differs significantly between treatments.

## Extract the p-value for Group from the ANOVA table (by name rather than by position)

p_value <- summary(model)[[1]][["Pr(>F)"]][1]

if (p_value < 0.05) {
  print("Reject Null hypothesis H0 --> Significant statistical evidence found")
} else {
  print("No evidence to reject H0 --> No significant statistical difference found")
}

## [1] "Reject Null hypothesis H0 --> Significant statistical evidence found"

Post-hoc tests

A significant one-way ANOVA is generally followed up by Tukey post-hoc tests to perform multiple pairwise comparisons between groups.

Using the function tukey_hsd() in the rstatix package:

# Pairwise comparisons
pwc <- data %>% tukey_hsd(RecoveryTime ~ Group)
pwc

## # A tibble: 3 × 9
##   term  group1 group2 null.value estimate conf.low conf.high       p.adj p.adj…¹
## * <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>       <dbl> <chr>  
## 1 Group A      B               0     5.52     4.27     6.77      1.63e-7 ****   
## 2 Group A      C               0    -1.44    -2.69    -0.192     2.41e-2 *      
## 3 Group B      C               0    -6.96    -8.21    -5.71      1.18e-8 ****   
## # … with abbreviated variable name ¹p.adj.signif
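For equal group sizes, the Tukey interval for each pair is the difference in means plus or minus the studentized range quantile times sqrt(MS_W / n_i). The sketch below reproduces the interval and the adjusted p-value for A vs C from the rounded ANOVA table values, using base R's `qtukey()` and `ptukey()`. The group size of 5 is an assumption inferred from the degrees of freedom, and the results match the output above only up to rounding.

```r
ms_within   <- 6.56 / 12  # residual mean square from the ANOVA table
n_per_group <- 5          # assumed equal group size
k           <- 3          # number of groups
df_resid    <- 12         # residual degrees of freedom

# Standard error on the studentized range scale and the 95% half-width
se_q <- sqrt(ms_within / n_per_group)
half_width <- qtukey(0.95, nmeans = k, df = df_resid) * se_q

# A vs C: estimate, confidence interval and adjusted p-value
diff_ac  <- -1.44
ci_ac    <- diff_ac + c(-1, 1) * half_width
p_adj_ac <- ptukey(abs(diff_ac) / se_q, nmeans = k, df = df_resid,
                   lower.tail = FALSE)

round(ci_ac, 2)
round(p_adj_ac, 3)
```

The same half-width applies to all three pairs here because every group has the same size.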

Conclusions

Based on our analysis, because the ANOVA p-value (9.95e-09) is well below the 0.05 significance level, we reject the null hypothesis and conclude that at least one group has a different mean recovery time.

Moreover, the multiple comparisons show that all pairwise differences are significant (adjusted p-value < 0.05); the A vs C difference is the weakest (adjusted p = 0.024).

Treatment C has the shortest mean recovery time (9.0), followed by A (10.4) and B (16.0).

In the context of medical applications, this could lead to further investigations or changes in treatment approaches.

res.aov <- data %>% anova_test(RecoveryTime ~ Group)  ## from rstatix
res.aov

## ANOVA Table (type II tests)
## 
##   Effect DFn DFd       F        p p<.05   ges
## 1  Group   2  12 123.378 9.95e-09     * 0.954
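The `ges` column is the generalized eta squared. In a one-way between-subjects design it reduces to the ordinary eta squared, SS_B / (SS_B + SS_W), so it can be checked against the sums of squares in the `aov()` table. The values below are copied by hand from that table.

```r
ss_between <- 134.98  # Sum Sq for Group
ss_within  <- 6.56    # Sum Sq for Residuals

# Share of the total variation explained by the grouping factor
eta_sq <- ss_between / (ss_between + ss_within)
round(eta_sq, 3)
```

A value this close to 1 means that almost all of the variation in recovery time is associated with the treatment group.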

# Visualization: box plots with p-values
pwc <- pwc %>% add_xy_position(x = "Group")
ggboxplot(data, x = "Group", y = "RecoveryTime") +
  stat_pvalue_manual(pwc, hide.ns = TRUE) +
  labs(
    subtitle = get_test_label(res.aov, detailed = TRUE),
    caption = get_pwc_label(pwc)
    )

Unveiling the Power of One-Way ANOVA: Medical Insights with R

Raul V - NutrInsight.ch

2023-08-10