In this blog post, we will explore the analysis of variance (ANOVA) in the context of a synthetic dataset. ANOVA helps us assess whether the means of two or more groups are statistically different from each other.
Let’s generate a synthetic dataset for illustration purposes.
set.seed(123) # for reproducibility
num_groups <- 3
group_sizes <- c(30, 25, 35)
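# Creating a synthetic categorical variable (e.g., treatment groups)
# Caution: `each` takes a single number, so R uses only group_sizes[1] = 30 and
# every group gets 30 rows (see the warning below). `times = group_sizes` would
# give the intended 30/25/35 split.
groups <- rep(1:num_groups, each = group_sizes)
## Warning in rep(1:num_groups, each = group_sizes): first element used of 'each'
## argument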
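# Generating synthetic values for a quantitative variable (e.g., dependent variable)
# Caution: `mean` is recycled observation by observation (10, 15, 20, 10, ...),
# not group by group, so every group mixes all three means and averages about 15.
# `mean = c(10, 15, 20)[groups]` would give each group its own mean.
values <- rnorm(sum(group_sizes), mean = c(10, 15, 20), sd = 3)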
# Creating a data frame
mydata <- data.frame(Group = factor(groups), DependentVar = values)
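For comparison, here is a minimal sketch of how the dataset could be generated so that the groups really have sizes 30, 25, and 35 and means near 10, 15, and 20. The names `group_means` and `mydata_fixed` are introduced here for illustration only. The results shown in the rest of this post come from the original code, so running this variant will give different numbers.

```r
set.seed(123)  # for reproducibility
group_sizes <- c(30, 25, 35)
group_means <- c(10, 15, 20)  # assumed target mean for each group

# `times` repeats each label by its own count: 30 ones, 25 twos, 35 threes
groups <- rep(seq_along(group_sizes), times = group_sizes)

# Indexing the means by group label yields one mean per observation
values <- rnorm(length(groups), mean = group_means[groups], sd = 3)

mydata_fixed <- data.frame(Group = factor(groups), DependentVar = values)

table(mydata_fixed$Group)                                    # group sizes
tapply(mydata_fixed$DependentVar, mydata_fixed$Group, mean)  # group means
```

Checking the group sizes and group means right after simulating data is a cheap way to catch recycling problems before they reach the model.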
Now, let’s run ANOVA and interpret the results.
# Load required libraries
library(car)
## Loading required package: carData
# Run ANOVA
model <- lm(DependentVar ~ Group, data = mydata)
anova_result <- Anova(model, type = "II")
# Print the ANOVA table (summary() would only report column-wise quantiles of it)
anova_result
The table has two rows: Group (Sum Sq 7.167, Df 2, F value 0.1555, Pr(>F) 0.8562) and Residuals (Sum Sq 2004.486, Df 87). With p = 0.8562 we cannot reject the null hypothesis that the three group means are equal.
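To see where the F value comes from, the sketch below recomputes it by hand from the sums of squares and degrees of freedom reported above (7.167 on 2 Df for Group, 2004.486 on 87 Df for the residuals). The variable names are illustrative, and because the inputs are rounded, the results match the reported values only to about three decimals.

```r
# Values copied from the ANOVA output in this post
ss_group <- 7.167      # Sum Sq for Group
df_group <- 2
ss_resid <- 2004.486   # Sum Sq for Residuals
df_resid <- 87

ms_group <- ss_group / df_group   # mean square between groups
ms_resid <- ss_resid / df_resid   # mean square error (within groups)
f_value  <- ms_group / ms_resid   # F statistic

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

round(c(MSE = ms_resid, F = f_value, p = p_value), 4)
```

The mean square error computed here is the same quantity that the post hoc output later reports as "Mean Square Error".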
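Post hoc tests for pairwise comparisons normally follow only a significant ANOVA result. Here the F-test is not significant (p = 0.8562), so we run Fisher's LSD test purely to illustrate the workflow.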
# Load required libraries
library(agricolae)
# Run post hoc tests
posthoc_result <- LSD.test(model, "Group", console = TRUE)
##
## Study: model ~ "Group"
##
## LSD t Test for DependentVar
##
## Mean Square Error: 23.04007
##
## Group, means and individual ( 95 %) CI
##
## DependentVar std r se LCL UCL Min Max
## 1 14.85869 4.986279 30 0.8763574 13.11683 16.60054 8.124882 25.14519
## 2 15.53502 5.048204 30 0.8763574 13.79316 17.27687 6.203811 24.10581
## 3 15.07326 4.332767 30 0.8763574 13.33141 16.81512 6.944274 23.44642
## Q25 Q50 Q75
## 1 11.20294 14.20481 17.90791
## 2 11.90035 14.84283 19.50612
## 3 12.05657 13.95511 17.53644
##
## Alpha: 0.05 ; DF Error: 87
## Critical Value of t: 1.987608
##
## least Significant Difference: 2.463355
##
## Treatments with the same letter are not significantly different.
##
## DependentVar groups
## 2 15.53502 a
## 3 15.07326 a
## 1 14.85869 a
# Print the full result object returned by LSD.test
print(posthoc_result)
## $statistics
## MSerror Df Mean CV t.value LSD
## 23.04007 87 15.15565 31.67139 1.987608 2.463355
##
## $parameters
## test p.ajusted name.t ntr alpha
## Fisher-LSD none Group 3 0.05
##
## $means
## DependentVar std r se LCL UCL Min Max
## 1 14.85869 4.986279 30 0.8763574 13.11683 16.60054 8.124882 25.14519
## 2 15.53502 5.048204 30 0.8763574 13.79316 17.27687 6.203811 24.10581
## 3 15.07326 4.332767 30 0.8763574 13.33141 16.81512 6.944274 23.44642
## Q25 Q50 Q75
## 1 11.20294 14.20481 17.90791
## 2 11.90035 14.84283 19.50612
## 3 12.05657 13.95511 17.53644
##
## $comparison
## NULL
##
## $groups
## DependentVar groups
## 2 15.53502 a
## 3 15.07326 a
## 1 14.85869 a
##
## attr(,"class")
## [1] "group"
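The least significant difference in the output above can be reproduced from the mean square error, the error degrees of freedom, and the group size. The sketch below does this with values copied from the output, assuming equal group sizes of 30 as reported in the `r` column; the variable names are illustrative.

```r
# Values copied from the LSD.test output in this post
mse      <- 23.04007   # Mean Square Error
df_error <- 87         # DF Error
r        <- 30         # observations per group
alpha    <- 0.05

t_crit <- qt(1 - alpha / 2, df_error)   # two-sided critical value of t
se     <- sqrt(mse / r)                 # standard error of one group mean
lsd    <- t_crit * sqrt(2 * mse / r)    # least significant difference

# Group means from the output, and all pairwise absolute differences
means <- c(`1` = 14.85869, `2` = 15.53502, `3` = 15.07326)
pair_diffs <- abs(outer(means, means, "-"))

round(c(t = t_crit, se = se, LSD = lsd), 6)
max(pair_diffs) < lsd   # TRUE means no pair of groups differs significantly
```

Two groups receive different letters only when the gap between their means exceeds the LSD, which is why all three groups share the letter a here.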
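All three groups share the letter a, so no pairwise difference is significant at the 5% level: the largest gap between group means (15.53502 - 14.85869 ≈ 0.68) is well below the least significant difference of 2.463355. Let’s create visualizations to better understand our data.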
# Load required library
library(ggplot2)
# Bar chart
ggplot(mydata, aes(x = Group, y = DependentVar)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
  labs(title = "Mean Values Across Groups", x = "Group", y = "Mean Dependent Variable")
# Box plot
ggplot(mydata, aes(x = Group, y = DependentVar, fill = Group)) +
  geom_boxplot() +
  labs(title = "Box Plot of Dependent Variable Across Groups", x = "Group", y = "Dependent Variable")
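# Scatter plot with trend line
# Note: as.numeric(Group) converts the group labels to the codes 1, 2, 3, so the
# fitted line treats the groups as ordered and equally spaced.
ggplot(mydata, aes(x = as.numeric(Group), y = DependentVar)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Scatter Plot with Trend Line", x = "Group", y = "Dependent Variable")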
## `geom_smooth()` using formula = 'y ~ x'
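When the groups have no natural order, an alternative to the trend line is to keep `Group` as a factor and overlay the group means on the raw observations. The sketch below uses a small hypothetical data frame, `demo`, with the same column names as `mydata`; the data and styling choices are assumptions for illustration, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical stand-in data with the same column names as mydata
set.seed(1)
demo <- data.frame(
  Group = factor(rep(1:3, each = 20)),
  DependentVar = rnorm(60, mean = rep(c(10, 15, 20), each = 20), sd = 3)
)

p <- ggplot(demo, aes(x = Group, y = DependentVar)) +
  # Raw observations, jittered horizontally so overlapping points stay visible
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
  # One red point per group marking the group mean
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  labs(title = "Observations and Group Means", x = "Group", y = "Dependent Variable")

print(p)
```

Replacing `demo` with `mydata` draws the same plot for the dataset used in this post.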