Time: ~30 minutes
Goal: Practice one-way ANOVA analysis from start to finish using real public health data
Learning Objectives:
Your Task: Complete the same 9-step analysis workflow you just practiced, but now on a different outcome and predictor.
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(car) # For Levene's test
library(NHANES) # NHANES dataset
# Load the NHANES data
data(NHANES)
Create analysis dataset:
# Prepare the dataset
set.seed(553)
mental_health_data <- NHANES %>%
  filter(Age >= 18) %>%
  filter(!is.na(DaysMentHlthBad) & !is.na(PhysActive)) %>%
  mutate(
    activity_level = case_when(
      PhysActive == "No" ~ "None",
      PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays < 3 ~ "Moderate",
      PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays >= 3 ~ "Vigorous",
      TRUE ~ NA_character_
    ),
    activity_level = factor(activity_level,
                            levels = c("None", "Moderate", "Vigorous"))
  ) %>%
  filter(!is.na(activity_level)) %>%
  select(ID, Age, Gender, DaysMentHlthBad, PhysActive, activity_level)
# YOUR TURN: Display the first 6 rows and check sample sizes
# Display the first 6 rows
head(mental_health_data) %>%
  kable(caption = "Mental Health and Physical Activity Dataset (first 6 rows)")

| ID | Age | Gender | DaysMentHlthBad | PhysActive | activity_level |
|---|---|---|---|---|---|
| 51624 | 34 | male | 15 | No | None |
| 51624 | 34 | male | 15 | No | None |
| 51624 | 34 | male | 15 | No | None |
| 51630 | 49 | female | 10 | No | None |
| 51647 | 45 | female | 3 | Yes | Vigorous |
| 51647 | 45 | female | 3 | Yes | Vigorous |
# Check sample sizes by activity level
table(mental_health_data$activity_level)
##
##     None Moderate Vigorous
##     3139      768     1850
YOUR TURN - Answer these questions:
# YOUR TURN: Calculate summary statistics by activity level
# Hint: Follow the same structure as the guided example
# Variables to summarize: n, Mean, SD, Median, Min, Max
summary_statistics <- mental_health_data %>%
  group_by(activity_level) %>%
  summarise(
    n = n(),
    Mean = mean(DaysMentHlthBad),
    SD = sd(DaysMentHlthBad),
    Median = median(DaysMentHlthBad),
    Min = min(DaysMentHlthBad),
    Max = max(DaysMentHlthBad)
  )
summary_statistics %>%
  kable(digits = 2,
        caption = "Descriptive Statistics: Days Mental Health Bad by Activity level")

| activity_level | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| None | 3139 | 5.08 | 9.01 | 0 | 0 | 30 |
| Moderate | 768 | 3.81 | 6.87 | 0 | 0 | 30 |
| Vigorous | 1850 | 3.54 | 7.17 | 0 | 0 | 30 |
YOUR TURN - Interpret:
# YOUR TURN: Create boxplots comparing DaysMentHlthBad across activity levels
# Hint: Use the same ggplot code structure as the example
# Change variable names and labels appropriately
ggplot(mental_health_data,
       aes(x = activity_level, y = DaysMentHlthBad, fill = activity_level)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Days Mental Health Bad by Activity Level",
    # The dataset is filtered with Age >= 18 only, so there is no upper age limit
    subtitle = "NHANES Data, Adults Aged 18 and Older",
    x = "Activity Level",
    y = "Days Mental Health Bad",
    fill = "Activity Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
YOUR TURN - Describe what you see:
YOUR TURN - Write the hypotheses:
Null Hypothesis (H₀): μ_None = μ_Moderate = μ_Vigorous
Alternative Hypothesis (H₁): At least one population mean differs from the others
Significance level: α = 0.05
# YOUR TURN: Fit the ANOVA model
# Outcome: DaysMentHlthBad
# Predictor: activity_level
anova_model <- aov(DaysMentHlthBad ~ activity_level, data = mental_health_data)
summary(anova_model)
##                  Df Sum Sq Mean Sq F value  Pr(>F)
## activity_level    2   3109    1555    23.2 9.5e-11 ***
## Residuals      5754 386089      67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
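As a quick check on how the table fits together, the F statistic can be rebuilt from the printed sums of squares and degrees of freedom. This is a minimal sketch that uses only the rounded values from the output above, so the result matches the printout only to the displayed precision. The variable names are illustrative.

```r
# Values copied from the printed ANOVA table (Sum Sq is rounded in the printout)
ss_between <- 3109     # Sum Sq for activity_level
ss_within  <- 386089   # Sum Sq for Residuals
df_between <- 2
df_within  <- 5754

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

cat("F =", round(f_stat, 1), "\n")
cat("p =", signif(p_value, 2), "\n")
```

The denominator degrees of freedom used when reporting the test, F(2, 5754), come from the Residuals row.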
YOUR TURN - Extract and interpret the results:
# YOUR TURN: Conduct Tukey HSD test
# Only if your ANOVA p-value < 0.05
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = DaysMentHlthBad ~ activity_level, data = mental_health_data)
##
## $activity_level
##                      diff    lwr     upr  p adj
## Moderate-None     -1.2726 -2.046 -0.4995 0.0003
## Vigorous-None     -1.5465 -2.109 -0.9836 0.0000
## Vigorous-Moderate -0.2739 -1.098  0.5504 0.7160
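The Tukey output can also be turned into a table programmatically instead of being copied by hand. The sketch below is one possible way to do it. It uses R's built-in PlantGrowth data as a stand-in so that it runs without the NHANES package, and the names toy_model, toy_tukey, and tukey_table are hypothetical. With the lab data, the same steps would apply to tukey_results$activity_level.

```r
# Stand-in data: PlantGrowth ships with base R (plant weight in three groups)
toy_model <- aov(weight ~ group, data = PlantGrowth)
toy_tukey <- TukeyHSD(toy_model)

# TukeyHSD() returns one matrix per factor with columns diff, lwr, upr, p adj
tukey_table <- as.data.frame(toy_tukey$group)

# Move the comparison labels from row names into a regular column
tukey_table$Comparison <- rownames(tukey_table)
rownames(tukey_table) <- NULL

# Flag comparisons whose adjusted p-value is below alpha = 0.05
tukey_table$Significant <- ifelse(tukey_table$`p adj` < 0.05, "Yes", "No")

print(tukey_table[, c("Comparison", "diff", "lwr", "upr", "p adj", "Significant")])
```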
YOUR TURN - Complete the table:
| Comparison | Mean Difference | 95% CI Lower | 95% CI Upper | p-value | Significant? |
|---|---|---|---|---|---|
| Moderate - None | -1.2726 | -2.046 | -0.4995 | 0.0003 | Yes |
| Vigorous - None | -1.5465 | -2.109 | -0.9836 | < 0.0001 | Yes |
| Vigorous - Moderate | -0.2739 | -1.098 | 0.5504 | 0.7160 | No |
Interpretation:
Which specific groups differ significantly? Two comparisons are significant: "Moderate - None" and "Vigorous - None". Adults in both the moderate and vigorous activity groups reported fewer poor mental health days on average than adults with no activity. The moderate and vigorous groups did not differ significantly from each other (p = 0.716).
# YOUR TURN: Calculate eta-squared
# Hint: Extract Sum Sq from the ANOVA summary
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.008
# Percentage of variance explained
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 0.8 %
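For reference, the same calculation can be written out by hand from the sums of squares in the ANOVA table. The subscripts "between" and "within" are labels chosen here for the activity_level and Residuals rows, and the inputs are the rounded printed values.

```latex
\eta^2
  = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}}
  = \frac{3109}{3109 + 386089}
  = \frac{3109}{389198}
  \approx 0.008
```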
YOUR TURN - Interpret:
YOUR TURN - Evaluate each plot:
# YOUR TURN: Conduct Levene's test
levene_test <- leveneTest(DaysMentHlthBad ~ activity_level, data = mental_health_data)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)
## group    2    23.2 9.5e-11 ***
##       5754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
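The Levene's test output above is significant, which suggests the group variances are not equal. One common sensitivity check in that situation is Welch's one-way test, which does not assume equal variances. It is not one of the lab's nine steps, so treat this only as a sketch of a possible follow-up. It uses R's built-in PlantGrowth data as a stand-in, and welch_result is a hypothetical name. With the lab data the formula would be DaysMentHlthBad ~ activity_level.

```r
# Stand-in data: PlantGrowth ships with base R (plant weight in three groups)
# oneway.test() with var.equal = FALSE runs Welch's ANOVA
welch_result <- oneway.test(weight ~ group,
                            data = PlantGrowth,
                            var.equal = FALSE)

# The result is an "htest" object: F statistic, degrees of freedom, p-value
print(welch_result)

# If the Welch and classical ANOVA p-values lead to the same conclusion,
# the unequal variances are less of a threat to that conclusion
cat("Welch p-value:", signif(welch_result$p.value, 3), "\n")
```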
YOUR TURN - Overall assessment:
YOUR TURN - Write a complete 2-3 paragraph results section:
Include: 1. Sample description and descriptive statistics 2. F-test results 3. Post-hoc comparisons (if applicable) 4. Effect size interpretation 5. Public health significance
Your Results Section:
We conducted a one-way ANOVA to examine whether the mean number of poor mental health days (DaysMentHlthBad) differed across physical activity levels (None, Moderate, Vigorous) among 5,757 NHANES adults aged 18 and older. The mean number of poor mental health days was 5.08 (SD = 9.01) in the None group, 3.81 (SD = 6.87) in the Moderate group, and 3.54 (SD = 7.17) in the Vigorous group.
The ANOVA revealed a statistically significant difference in the mean number of poor mental health days across activity levels, F(2, 5754) = 23.2, p < 0.001. Tukey's HSD post-hoc tests indicated that two of the three pairwise comparisons were significant. Adults in the Moderate group reported on average 1.3 fewer poor mental health days than adults in the None group (95% CI: 0.5 to 2.0, p < 0.001), and adults in the Vigorous group reported on average 1.5 fewer (95% CI: 1.0 to 2.1, p < 0.001). The Vigorous and Moderate groups did not differ significantly (difference = 0.3 days, p = 0.72).
The effect size (η² = 0.008) indicates that activity level explains about 0.8% of the variance in poor mental health days, a small practical effect. These findings are consistent with the well-established association between higher physical activity and fewer poor mental health days, though other factors account for most of the variation.
1. How does the effect size help you understand the practical vs. statistical significance?
Statistical significance does not guarantee practical importance. With a large sample, even a small difference can produce p < 0.05, so η² (eta-squared) should always be reported alongside the p-value because it shows how much of the variance in the outcome the factor explains. Here the result is statistically significant, but activity level alone explains less than 1% of the variation in poor mental health days.
2. Why is it important to check ANOVA assumptions? What might happen if they’re violated?
It is important to check ANOVA assumptions because the validity of the p-values and inference depends on them. If independence, normality, or homogeneity of variance is violated, the p-value from the F-test and the confidence intervals may be inaccurate, so conclusions drawn from them can be misleading.
3. In public health practice, when might you choose to use ANOVA?
In public health practice, one might choose ANOVA to compare the means of three or more groups when there is one categorical predictor, a continuous outcome, and independent observations within each group. It is also a good choice for comparing multiple treatments, exposure levels, or populations without inflating the chance of false discoveries.
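The point about inflated false discoveries can be shown with a short calculation. With three groups there are three pairwise comparisons. If each were tested separately at α = 0.05, the chance of at least one false positive would exceed 5%. The sketch below assumes the comparisons are independent, which is a simplification, and its variable names are illustrative.

```r
alpha <- 0.05
n_groups <- 3

# Number of pairwise comparisons among the groups
n_comparisons <- choose(n_groups, 2)

# Probability of at least one false positive across all comparisons,
# assuming independent tests (a simplifying assumption)
family_wise_error <- 1 - (1 - alpha)^n_comparisons

cat("Pairwise comparisons:", n_comparisons, "\n")
cat("Family-wise error rate:", round(family_wise_error, 3), "\n")
```

Under this assumption the family-wise error rate is about 0.143, nearly three times the nominal 0.05. A single overall F-test followed by Tukey's HSD keeps the family-wise rate at the chosen level.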
4. What was the most challenging part of this lab activity?
The most challenging part of this lab activity was explaining whether the assumptions were met and whether any violations threatened my conclusions.
Before submitting, verify you have:
To submit: Upload both your .Rmd file and the HTML output to Brightspace.
Lab completed on: February 05, 2026