knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.align = 'center',
  fig.width = 8,
  fig.height = 6
)

# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(knitr) # For nice tables
library(car) # For Levene's test
library(NHANES) # NHANES dataset
# Load the NHANES data
data(NHANES)

Your Task: Complete the same 9-step analysis workflow you just practiced, but now on a different outcome and predictor.
# Prepare the dataset
set.seed(553)
mental_health_data <- NHANES %>%
  filter(Age >= 18) %>%
  filter(!is.na(DaysMentHlthBad) & !is.na(PhysActive)) %>%
  mutate(
    activity_level = case_when(
      PhysActive == "No" ~ "None",
      PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays < 3 ~ "Moderate",
      PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays >= 3 ~ "Vigorous",
      TRUE ~ NA_character_
    ),
    activity_level = factor(activity_level,
                            levels = c("None", "Moderate", "Vigorous"))
  ) %>%
  filter(!is.na(activity_level)) %>%
  select(ID, Age, Gender, DaysMentHlthBad, PhysActive, activity_level)
# YOUR TURN: Display the first 6 rows and check sample sizes
head(mental_health_data) %>%
  kable(caption = "")

| ID | Age | Gender | DaysMentHlthBad | PhysActive | activity_level |
|---|---|---|---|---|---|
| 51624 | 34 | male | 15 | No | None |
| 51624 | 34 | male | 15 | No | None |
| 51624 | 34 | male | 15 | No | None |
| 51630 | 49 | female | 10 | No | None |
| 51647 | 45 | female | 3 | Yes | Vigorous |
| 51647 | 45 | female | 3 | Yes | Vigorous |
# Check sample sizes by activity level
table(mental_health_data$activity_level)
##
##     None Moderate Vigorous
##     3139      768     1850
YOUR TURN - Answer these questions:
# YOUR TURN: Calculate summary statistics by activity level
# Hint: Follow the same structure as the guided example
# Variables to summarize: n, Mean, SD, Median, Min, Max

# Calculate summary statistics by physical activity status (PhysActive: No vs. Yes)
summary_stats <- mental_health_data %>%
  group_by(PhysActive) %>%
  summarise(
    n = n(),
    Mean = mean(DaysMentHlthBad),
    SD = sd(DaysMentHlthBad),
    Median = median(DaysMentHlthBad),
    Min = min(DaysMentHlthBad),
    Max = max(DaysMentHlthBad)
  )
summary_stats %>%
  kable(digits = 2,
        caption = "Descriptive Statistics: Bad Mental Health Days by Physical Activity")

| PhysActive | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| No | 3139 | 5.08 | 9.01 | 0 | 0 | 30 |
| Yes | 2618 | 3.62 | 7.09 | 0 | 0 | 30 |
YOUR TURN - Interpret:
# YOUR TURN: Create boxplots comparing DaysMentHlthBad across activity levels
# Hint: Use the same ggplot code structure as the example
# Change variable names and labels appropriately

# Create boxplots with individual points
ggplot(mental_health_data,
       aes(x = activity_level, y = DaysMentHlthBad, fill = activity_level)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Bad Mental Health Days by Activity Level Category",
    subtitle = "NHANES Data, Adults Aged 18 and Older",
    x = "Activity Level Category",
    y = "Bad Mental Health Days",
    fill = "Activity Level Category"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

YOUR TURN - Describe what you see:
YOUR TURN - Write the hypotheses:
Null Hypothesis (H₀): μ_none = μ_moderate = μ_vigorous
Alternative Hypothesis (H₁): At least one pair of groups has different means
Significance level: α = 0.05
# Fit the one-way ANOVA model
anova_model <- aov(DaysMentHlthBad ~ activity_level, data = mental_health_data)
# Display the ANOVA table
summary(anova_model)
##                  Df Sum Sq Mean Sq F value   Pr(>F)
## activity_level    2   3109  1554.6   23.17 9.52e-11 ***
## Residuals      5754 386089    67.1
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
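As a quick check on the table above, the F statistic and p-value can be reproduced from the printed sums of squares and degrees of freedom. This is a minimal sketch that hard-codes the rounded values from the output (the variable names are illustrative), so it agrees with the table only up to rounding.

```r
# Values copied (rounded) from the ANOVA table above
ss_between <- 3109    # Sum Sq for activity_level
ss_within  <- 386089  # Sum Sq for Residuals
df_between <- 2
df_within  <- 5754

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# Upper-tail probability from the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

cat("F =", round(f_stat, 2), "\n")
cat("p =", signif(p_value, 3), "\n")
```

The F value should round to 23.17, matching the `F value` column, and the p-value should be on the order of 1e-10.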
YOUR TURN - Extract and interpret the results:
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = DaysMentHlthBad ~ activity_level, data = mental_health_data)
##
## $activity_level
##                         diff       lwr        upr     p adj
## Moderate-None     -1.2725867 -2.045657 -0.4995169 0.0003386
## Vigorous-None     -1.5464873 -2.109345 -0.9836298 0.0000000
## Vigorous-Moderate -0.2739006 -1.098213  0.5504114 0.7159887
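Output of this form is what base R's `TukeyHSD()` prints for an `aov` fit. The sketch below runs it on a small made-up dataset (the `toy` data frame and its values are purely illustrative, not NHANES data) to show how the comparison rows and columns arise.

```r
# Toy data: hypothetical values, not taken from NHANES
toy <- data.frame(
  days  = c(0, 0, 2, 10, 30, 30,   0, 0, 1, 3, 5, 15,   0, 0, 0, 2, 4, 10),
  group = factor(rep(c("None", "Moderate", "Vigorous"), each = 6),
                 levels = c("None", "Moderate", "Vigorous"))
)

# Fit the one-way ANOVA, then run Tukey's HSD on the fitted model
toy_model <- aov(days ~ group, data = toy)
toy_tukey <- TukeyHSD(toy_model, conf.level = 0.95)

# One row per pair of groups; columns are diff, lwr, upr, p adj
print(toy_tukey$group)
```

Each `diff` is the first-named group's mean minus the second's, so a negative value for `Moderate-None` means the Moderate group has the lower mean. A pair is significant at the 5% level when its interval (`lwr`, `upr`) excludes 0.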
YOUR TURN - Complete the table:
| Comparison | Mean Difference | 95% CI Lower | 95% CI Upper | p-value | Significant? |
|---|---|---|---|---|---|
| Moderate - None | -1.27 | -2.05 | -0.50 | 0.0003 | Yes |
| Vigorous - None | -1.55 | -2.11 | -0.98 | < 0.0001 | Yes |
| Vigorous - Moderate | -0.27 | -1.10 | 0.55 | 0.7160 | No |
Interpretation:
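Which specific groups differ significantly? Both the Moderate and Vigorous groups report significantly fewer bad mental health days than the None group (about 1.27 and 1.55 fewer days, respectively). The Moderate and Vigorous groups do not differ significantly from each other (p = 0.716; the 95% CI includes 0).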
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%\n")
## Eta-squared (η²): 0.008
## Percentage of variance explained: 0.8 %
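Written out with the sums of squares from the ANOVA table (rounded as printed), the eta-squared calculation is:

```latex
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}}
       = \frac{3109}{3109 + 386089}
       = \frac{3109}{389198}
       \approx 0.008
```

In other words, activity level accounts for a little under 1% of the total variation in bad mental health days.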
YOUR TURN - Interpret:
YOUR TURN - Evaluate each plot:
Residuals vs Fitted: Points randomly scattered around zero, no clear pattern → Assumptions met.
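Q-Q Plot: Points follow the diagonal with an upward deviation in the upper tail → Some right skew, as expected for an outcome with a median of 0 and a maximum of 30; with n = 5,757 the F-test is fairly robust to this departure from normality.
Scale-Location: Spread looks roughly constant in the plot, but Levene's test (p < 0.001) indicates unequal variances → Homoscedasticity is questionable.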
Residuals vs Leverage: All points inside Cook’s distance lines → No influential outliers.
# Levene's test for homogeneity of variance
levene_test <- leveneTest(DaysMentHlthBad ~ activity_level, data = mental_health_data)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)
## group    2  23.168 9.517e-11 ***
##       5754
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
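The Levene F value here matches the ANOVA F value (23.17), which is worth a note. With `center = median`, Levene's test is a one-way ANOVA on the absolute deviations from each group's median. When every group median is 0 and the outcome is non-negative, those deviations equal the outcome itself, so the two tests coincide. The summary table shows a median of 0 in both PhysActive groups, which is consistent with this explanation, although the three activity-level medians are not printed. The sketch below uses made-up data and base R only (the values and the helper name `f_from_aov` are hypothetical) to illustrate the idea.

```r
# Toy, zero-heavy outcome: hypothetical values, not NHANES data
days  <- c(0, 0, 0, 0, 2, 30,   0, 0, 0, 0, 5, 15,   0, 0, 0, 0, 4, 10)
group <- factor(rep(c("None", "Moderate", "Vigorous"), each = 6),
                levels = c("None", "Moderate", "Vigorous"))

# Helper (illustrative): F statistic from a one-way ANOVA
f_from_aov <- function(response, grouping) {
  summary(aov(response ~ grouping))[[1]][["F value"]][1]
}

# Median-centered Levene test = ANOVA on |y - group median|
group_medians <- ave(days, group, FUN = median)
abs_dev <- abs(days - group_medians)

f_levene <- f_from_aov(abs_dev, group)
f_anova  <- f_from_aov(days, group)

print(tapply(days, group, median))  # every group median is 0 in this toy data
cat("Levene-type F:", f_levene, "\n")
cat("ANOVA F:      ", f_anova, "\n")
```

In this situation the significant Levene result reflects the same group differences the ANOVA detects: groups with higher means also have larger spread.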
YOUR TURN - Overall assessment:
YOUR TURN - Write a complete 2-3 paragraph results section:
Include:

1. Sample description and descriptive statistics
2. F-test results
3. Post-hoc comparisons (if applicable)
4. Effect size interpretation
5. Public health significance
Your Results Section:
A one-way ANOVA was conducted to examine the association between physical activity level and the number of poor mental health days reported in the past month. The analytic sample comprised 5,757 adults aged 18 and older in three activity groups: None (n = 3,139), Moderate (n = 768), and Vigorous (n = 1,850). Inactive adults reported more poor mental health days on average (M = 5.08, SD = 9.01) than physically active adults overall (n = 2,618; M = 3.62, SD = 7.09). The median was 0 days in both groups, indicating a strongly right-skewed outcome.
The ANOVA revealed a statistically significant difference in mean poor mental health days across activity levels, F(2, 5754) = 23.17, p < 0.001, with a small effect size (η² = 0.008). Post-hoc Tukey tests showed that both the Moderate (p = 0.0003) and Vigorous (p < 0.001) groups reported significantly fewer poor mental health days than the None group, with mean differences of -1.27 (95% CI: -2.05, -0.50) and -1.55 (95% CI: -2.11, -0.98) days, respectively. No significant difference was found between the Moderate and Vigorous groups (p = 0.716). Levene's test indicated unequal variances across groups (F = 23.17, p < 0.001), so these results should be interpreted with some caution.
These findings suggest that any level of physical activity is associated with fewer poor mental health days compared with inactivity. Activity level explains less than 1% of the variance, and the cross-sectional design does not support causal conclusions. Even so, a difference of roughly 1.3 to 1.5 days per month may be meaningful at the population level, and encouraging even moderate activity is a reasonable public health message.
1. How does the effect size help you understand the practical vs. statistical significance?
The effect size (η² = 0.008) indicates a small practical impact, meaning that while activity level is statistically significant, it explains only a small portion of the variance in mental health days. This highlights the need to consider other factors in public health planning.
2. Why is it important to check ANOVA assumptions? What might happen if they’re violated?
Checking assumptions ensures the validity of the F-test and post-hoc results. Violations—such as unequal variances or non-normality—can increase Type I or Type II error rates, leading to unreliable conclusions.
3. In public health practice, when might you choose to use ANOVA?
ANOVA is useful when comparing means across three or more groups—such as evaluating intervention effectiveness across different dosage levels, assessing health outcomes by socioeconomic categories, or comparing regional health indicators.
4. What was the most challenging part of this lab activity?
Before submitting, verify you have:
To submit: Upload both your .Rmd file and the HTML output to Brightspace.
Lab completed on: February 05, 2026
Total Points: 15
| Category | Criteria | Points | Notes |
|---|---|---|---|
| Code Execution | All code chunks run without errors | 4 | Deduct 1 pt per major error; deduct 0.5 pt per minor warning |
| Completion | All “YOUR TURN” sections attempted | 4 | Part B Steps 1-9 completed; all fill-in-the-blank answered; Tukey table filled in |
| Interpretation | Correct statistical interpretation | 4 | Hypotheses correctly stated (1 pt); ANOVA results interpreted (1 pt); post-hoc results interpreted (1 pt); assumptions evaluated (1 pt) |
| Results Section | Professional, complete write-up | 3 | Includes descriptive stats (1 pt); reports F-test & post-hoc (1 pt); effect size & significance (1 pt) |