# Prepare the dataset
set.seed(553)
mental_health_data <- NHANES %>%
  filter(Age >= 18) %>%
  filter(!is.na(DaysMentHlthBad) & !is.na(PhysActive)) %>%
  mutate(
    activity_level = case_when(
      PhysActive == "No" ~ "None",
      PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays < 3 ~ "Moderate",
      PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays >= 3 ~ "Vigorous",
      TRUE ~ NA_character_
    ),
    activity_level = factor(activity_level,
                            levels = c("None", "Moderate", "Vigorous"))
  ) %>%
  filter(!is.na(activity_level)) %>%
  select(ID, Age, Gender, DaysMentHlthBad, PhysActive, activity_level)
# YOUR TURN: Display the first 6 rows and check sample sizes
head(mental_health_data, n = 6)
## # A tibble: 6 × 6
## ID Age Gender DaysMentHlthBad PhysActive activity_level
## <int> <int> <fct> <int> <fct> <fct>
## 1 51624 34 male 15 No None
## 2 51624 34 male 15 No None
## 3 51624 34 male 15 No None
## 4 51630 49 female 10 No None
## 5 51647 45 female 3 Yes Vigorous
## 6 51647 45 female 3 Yes Vigorous
table(mental_health_data$activity_level)
##
## None Moderate Vigorous
## 3139 768 1850
YOUR TURN - Answer these questions:
# YOUR TURN: Calculate summary statistics by activity level
# Hint: Follow the same structure as the guided example
# Variables to summarize: n, Mean, SD, Median, Min, Max
summary_stats <- mental_health_data %>%
  group_by(activity_level) %>%
  summarise(
    n = n(),
    Mean = mean(DaysMentHlthBad),
    SD = sd(DaysMentHlthBad),
    Median = median(DaysMentHlthBad),
    Min = min(DaysMentHlthBad),
    Max = max(DaysMentHlthBad)
  )
summary_stats %>%
  kable(digits = 3,
        caption = "Descriptive statistics for Mental Health by physical activity level")

| activity_level | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| None | 3139 | 5.084 | 9.010 | 0 | 0 | 30 |
| Moderate | 768 | 3.811 | 6.873 | 0 | 0 | 30 |
| Vigorous | 1850 | 3.537 | 7.171 | 0 | 0 | 30 |
YOUR TURN - Interpret:
# YOUR TURN: Create boxplots comparing DaysMentHlthBad across activity levels
# Hint: Use the same ggplot code structure as the example
# Change variable names and labels appropriately
ggplot(mental_health_data, aes(x = activity_level, y = DaysMentHlthBad, fill = activity_level)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Days of Bad Mental Health compared to Physical Activity Levels",
    subtitle = "NHANES 2009-2012, Adults aged 18+",
    x = "Physical Activity Level",
    y = "Bad Mental Health Days (past 30 days)",
    fill = "Activity Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
YOUR TURN - Describe what you see:
YOUR TURN - Write the hypotheses:
Null Hypothesis (H₀):
There is no difference in the number of bad mental health days across physical activity levels (none, moderate, vigorous).
Alternative Hypothesis (H₁):
There is a difference in the number of bad mental health days across physical activity levels (none, moderate, vigorous).
Significance level: α = 0.05
# YOUR TURN: Fit the ANOVA model
# Outcome: DaysMentHlthBad
# Predictor: activity_level
# Fit one-way ANOVA
anova_model <- aov(DaysMentHlthBad ~ activity_level, data = mental_health_data)
# Display ANOVA table
anova_table <- summary(anova_model)
print(anova_table)
## Df Sum Sq Mean Sq F value Pr(>F)
## activity_level 2 3109 1554.6 23.17 9.52e-11 ***
## Residuals 5754 386089 67.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
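As a quick check on the table above, the F statistic can be reproduced from the printed sums of squares and degrees of freedom. The inputs are the rounded values shown in the output, so the last digit can differ slightly (for example, 3109 / 2 gives 1554.5 rather than the printed 1554.6):

$$
MS_{\text{between}} = \frac{SS_{\text{between}}}{df_{\text{between}}} = \frac{3109}{2} \approx 1554.5,
\qquad
MS_{\text{within}} = \frac{SS_{\text{within}}}{df_{\text{within}}} = \frac{386089}{5754} \approx 67.1
$$

$$
F = \frac{MS_{\text{between}}}{MS_{\text{within}}} \approx \frac{1554.6}{67.1} \approx 23.17
$$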
YOUR TURN - Extract and interpret the results:
# YOUR TURN: Conduct Tukey HSD test
# Only if your ANOVA p-value < 0.05
# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DaysMentHlthBad ~ activity_level, data = mental_health_data)
##
## $activity_level
## diff lwr upr p adj
## Moderate-None -1.2725867 -2.045657 -0.4995169 0.0003386
## Vigorous-None -1.5464873 -2.109345 -0.9836298 0.0000000
## Vigorous-Moderate -0.2739006 -1.098213 0.5504114 0.7159887
# Extract and format results
tukey_summary <- as.data.frame(tukey_results$activity_level)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]
# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)
kable(tukey_summary, digits = 4, row.names = FALSE,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")

| Comparison | diff | lwr | upr | p adj | Significant | Direction |
|---|---|---|---|---|---|---|
| Moderate-None | -1.2726 | -2.0457 | -0.4995 | 0.0003 | Yes | Second group higher |
| Vigorous-None | -1.5465 | -2.1093 | -0.9836 | 0.0000 | Yes | Second group higher |
| Vigorous-Moderate | -0.2739 | -1.0982 | 0.5504 | 0.7160 | No | No difference |
YOUR TURN - Complete the table:
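| Comparison | Mean Difference | 95% CI Lower | 95% CI Upper | p-value | Significant? |
|---|---|---|---|---|---|
| Moderate - None | -1.27 | -2.05 | -0.50 | 0.0003 | Yes |
| Vigorous - None | -1.55 | -2.11 | -0.98 | < 0.0001 | Yes |
| Vigorous - Moderate | -0.27 | -1.10 | 0.55 | 0.7160 | No |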
Interpretation:
Which specific groups differ significantly?
Moderate vs. None: The none group has significantly more bad mental health days than the moderate group.
Vigorous vs. None: The none group has significantly more bad mental health days than the vigorous group.
Vigorous vs. Moderate: There is no significant difference.
# YOUR TURN: Calculate eta-squared
# Hint: Extract Sum Sq from the ANOVA summary
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%\n")
## Eta-squared (η²): 0.008
## Percentage of variance explained: 0.8 %
YOUR TURN - Interpret:
An η² of 0.008 indicates that physical activity level explains only 0.8% of the variation in bad mental health days. So while physical activity level is statistically significantly associated with bad mental health days, it accounts for only a small portion of the variation in them.
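For reference, the eta-squared value can be reproduced by hand from the rounded sums of squares printed in the ANOVA table (3109 for activity level, 386089 for residuals):

$$
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
       = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}}
       = \frac{3109}{3109 + 386089}
       = \frac{3109}{389198}
       \approx 0.0080
$$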
YOUR TURN - Evaluate each plot:
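The residuals are roughly centered around zero with no visible pattern, which suggests that the linearity assumption for this model is reasonably met. Independence cannot be confirmed from this plot because it depends on the sampling design, and the repeated IDs in the first rows of the data (for example, ID 51624 appears three times) suggest that some observations are not independent.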
The residuals deviate from the diagonal line, especially in the upper tail. This indicates that normality is violated.
The spread of the residuals is not constant across the fitted values, which suggests some non-constant variance, although it may not have a severe effect.
# YOUR TURN: Conduct Levene's test
library(car)
levene_test <- leveneTest(DaysMentHlthBad ~ activity_level, data = mental_health_data)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 23.168 9.517e-11 ***
## 5754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
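The Levene F value above (23.168) matches the ANOVA F value. A likely explanation is that the median is 0 in every group: Levene's test with `center = median` is a one-way ANOVA on the absolute deviations from each group's median, and for a non-negative outcome with a median of 0 those deviations equal the original values. The sketch below illustrates this in base R with a small made-up data set (`toy`, `g`, and `y` are hypothetical names and values, not NHANES data):

```r
# Hypothetical toy data: non-negative outcome, median of 0 in every group
toy <- data.frame(
  g = factor(rep(c("A", "B", "C"), each = 7)),
  y = c(0, 0, 0, 0, 2, 5, 30,
        0, 0, 0, 0, 1, 3, 10,
        0, 0, 0, 0, 0, 4, 8)
)

# Absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$g, FUN = median))

# Median-centered Levene's test is a one-way ANOVA on those deviations
f_levene <- summary(aov(abs_dev ~ g, data = toy))[[1]]$`F value`[1]

# Ordinary one-way ANOVA on the raw outcome
f_anova <- summary(aov(y ~ g, data = toy))[[1]]$`F value`[1]

# With all group medians at 0, the two F statistics coincide
same_f <- isTRUE(all.equal(f_levene, f_anova))
print(same_f)
```

So in this situation the significant Levene result says the same thing as the ANOVA itself: the groups with higher means also have larger spread.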
YOUR TURN - Overall assessment:
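The assumptions are only partly met. The Q-Q plot shows a clear departure from normality, and Levene's test is significant (F = 23.17, p < 0.001), which indicates unequal variances across the groups.
Because the sample is large, the F-test is fairly robust to non-normality, so these violations are probably not severe enough to change our overall conclusion, but the results should be interpreted with caution.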
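Since Levene's test above is significant, one common robustness check is to rerun the comparison with methods that relax the assumptions. The sketch below uses base R's `oneway.test()` (Welch's test, no equal-variance assumption) and `kruskal.test()` (rank-based, no normality assumption). It runs on simulated stand-in data, not the NHANES sample: the data frame `sim` and its columns `group` and `days` are assumptions for illustration only, with group means and spreads loosely based on the descriptive table.

```r
# Simulated stand-in data (NOT the NHANES sample): three groups with
# unequal sizes and unequal spreads
set.seed(553)
sim <- data.frame(
  group = factor(rep(c("None", "Moderate", "Vigorous"), times = c(300, 80, 180)),
                 levels = c("None", "Moderate", "Vigorous")),
  days = c(rnorm(300, mean = 5.1, sd = 9.0),
           rnorm(80,  mean = 3.8, sd = 6.9),
           rnorm(180, mean = 3.5, sd = 7.2))
)

# Welch's one-way test: does not assume equal variances across groups
welch <- oneway.test(days ~ group, data = sim, var.equal = FALSE)

# Kruskal-Wallis test: rank-based, does not assume normal residuals
kw <- kruskal.test(days ~ group, data = sim)

print(welch)
print(kw)
```

On the real data, the same two calls could be applied to `DaysMentHlthBad ~ activity_level` with `data = mental_health_data`. If they agree with the standard ANOVA, the conclusion is less dependent on the violated assumptions.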
YOUR TURN - Write a complete 2-3 paragraph results section:
Include:
1. Sample description and descriptive statistics
2. F-test results
3. Post-hoc comparisons (if applicable)
4. Effect size interpretation
5. Public health significance
Your Results Section:
The analysis included 5,757 adults aged 18 and older, divided into three groups by physical activity level: none (n = 3,139), moderate (n = 768), and vigorous (n = 1,850). The mean number of bad mental health days was highest in the none group (M = 5.08, SD = 9.01), followed by the moderate group (M = 3.81, SD = 6.87) and the vigorous group (M = 3.54, SD = 7.17). The median was 0 days in all three groups.

A one-way ANOVA was conducted to examine the association between physical activity level and bad mental health days. The groups differed significantly, F(2, 5754) = 23.17, p = 9.52e-11. Tukey HSD post-hoc comparisons showed that the none group had significantly more bad mental health days than both the moderate group (mean difference = 1.27 days, p = 0.0003) and the vigorous group (mean difference = 1.55 days, p < 0.0001). The moderate and vigorous groups did not differ significantly (p = 0.716). The eta-squared of 0.008 shows that physical activity level explains only 0.8% of the variance in bad mental health days. This is a small effect, so the difference is statistically significant but modest in practical terms.

The results suggest that physical activity is associated with fewer bad mental health days. From a public health perspective, interventions that promote physical activity could help improve the population's mental health, although the small effect size and the observational nature of the data mean that physical activity is only one of many contributing factors. Such interventions could help improve long-term health outcomes and reduce the burden of poor mental health at both the individual and the community level.
1. How does the effect size help you understand the practical vs. statistical significance?
Effect size describes the magnitude of a difference in real-world terms, not just whether it is statistically significant. In this example, η² = 0.008 shows that physical activity level accounts for only a small portion of the variation in bad mental health days, so the result is statistically significant but of limited practical relevance.
2. Why is it important to check ANOVA assumptions? What might happen if they’re violated?
ANOVA assumes independent observations, approximately normal residuals, and equal variances across groups. If these assumptions are violated, the Type I and Type II error rates can be inflated, which can lead to incorrect conclusions.
3. In public health practice, when might you choose to use ANOVA?
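ANOVA is used to compare the mean of a continuous outcome across three or more groups, for example comparing bad mental health days across physical activity levels as in this lab.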
4. What was the most challenging part of this lab activity?
The most challenging part of this lab was interpreting the diagnostic plots and making sure that I was reading them correctly.