# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr) # For nice tables
## Warning: package 'knitr' was built under R version 4.4.3
library(car) # For Levene's test
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(NHANES) # NHANES dataset
## Warning: package 'NHANES' was built under R version 4.4.3
# Load the NHANES data
data(NHANES)
Your Task: Complete the same 9-step analysis workflow you just practiced, but now on a different outcome and predictor.
# Prepare the dataset
set.seed(553)
mental_health_data <- NHANES %>%
filter(Age >= 18) %>%
filter(!is.na(DaysMentHlthBad) & !is.na(PhysActive)) %>%
mutate(
activity_level = case_when(
PhysActive == "No" ~ "None",
PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays < 3 ~ "Moderate",
PhysActive == "Yes" & !is.na(PhysActiveDays) & PhysActiveDays >= 3 ~ "Vigorous",
TRUE ~ NA_character_
),
activity_level = factor(activity_level,
levels = c("None", "Moderate", "Vigorous"))
) %>%
filter(!is.na(activity_level)) %>%
select(ID, Age, Gender, DaysMentHlthBad, PhysActive, activity_level)
# YOUR TURN: Display the first 6 rows and check sample sizes
head(mental_health_data)%>%
kable(caption="Physical Activity Level (first 6 rows)")
| ID | Age | Gender | DaysMentHlthBad | PhysActive | activity_level |
|---|---|---|---|---|---|
| 51624 | 34 | male | 15 | No | None |
| 51624 | 34 | male | 15 | No | None |
| 51624 | 34 | male | 15 | No | None |
| 51630 | 49 | female | 10 | No | None |
| 51647 | 45 | female | 3 | Yes | Vigorous |
| 51647 | 45 | female | 3 | Yes | Vigorous |
#
table(mental_health_data$activity_level)
##
## None Moderate Vigorous
## 3139 768 1850
YOUR TURN - Answer these questions:
# YOUR TURN: Calculate summary statistics by activity level
summary_stats_activity <- mental_health_data %>%
group_by(activity_level) %>%
summarise (
n=n(),
Mean = mean(DaysMentHlthBad),
SD = sd(DaysMentHlthBad),
median = median(DaysMentHlthBad),
Min = min(DaysMentHlthBad),
Max = max(DaysMentHlthBad)
)
summary_stats_activity%>%
kable(digits = 2,
caption= "Descriptive Statistics: Number of Bad Mental Health Days by Physical Activity" )
| activity_level | n | Mean | SD | median | Min | Max |
|---|---|---|---|---|---|---|
| None | 3139 | 5.08 | 9.01 | 0 | 0 | 30 |
| Moderate | 768 | 3.81 | 6.87 | 0 | 0 | 30 |
| Vigorous | 1850 | 3.54 | 7.17 | 0 | 0 | 30 |
# Hint: Follow the same structure as the guided example
# Variables to summarize: n, Mean, SD, Median, Min, Max
YOUR TURN - Interpret:
# YOUR TURN: Create boxplots comparing DaysMentHlthBad across activity levels
ggplot(mental_health_data,
aes(x = activity_level, y = DaysMentHlthBad, fill = activity_level)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Number of Bad Mental Health Days by Activity Level",
x = "Activity Level",
y = "Bad Mental Health Days",
fill = "Activity Level"
)
# Hint: Use the same ggplot code structure as the example
# Change variable names and labels appropriately
YOUR TURN - Describe what you see:
YOUR TURN - Write the hypotheses:
Null Hypothesis (H₀): μ_None = μ_Moderate = μ_Vigorous
Alternative Hypothesis (H₁): At least one population mean differs from the others
Significance level: α = 0.05
# YOUR TURN: Fit the ANOVA model
anova_model <- aov(DaysMentHlthBad ~ activity_level, data=mental_health_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## activity_level 2 3109 1554.6 23.17 9.52e-11 ***
## Residuals 5754 386089 67.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Outcome: DaysMentHlthBad
# Predictor: activity_level
YOUR TURN - Extract and interpret the results:
# YOUR TURN: Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DaysMentHlthBad ~ activity_level, data = mental_health_data)
##
## $activity_level
## diff lwr upr p adj
## Moderate-None -1.2725867 -2.045657 -0.4995169 0.0003386
## Vigorous-None -1.5464873 -2.109345 -0.9836298 0.0000000
## Vigorous-Moderate -0.2739006 -1.098213 0.5504114 0.7159887
# Only if your ANOVA p-value < 0.05
plot(tukey_results, las = 1)
YOUR TURN - Complete the table:
| Comparison | Mean Difference | 95% CI Lower | 95% CI Upper | p-value | Significant? |
|---|---|---|---|---|---|
| Moderate - None | -1.2725867 | -2.045657 | -0.4995169 | 0.0003386 | Yes |
| Vigorous - None | -1.5464873 | -2.109345 | -0.9836298 | 0.0000000 | Yes |
| Vigorous - Moderate | -0.2739006 | -1.098213 | 0.5504114 | 0.7159887 | No |
Interpretation:
Which specific groups differ significantly? Both the Moderate-None and Virgorous-None are statistically significant. —
# YOUR TURN: Calculate eta-squared
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.008
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 0.8 %
# Hint: Extract Sum Sq from the ANOVA summary
YOUR TURN - Interpret:
# YOUR TURN: Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_model)
par(mfrow = c(1, 1))
YOUR TURN - Evaluate each plot:
Residuals vs Fitted: The relationship between activity level and mental health is not seen clearly.
Q-Q Plot: There are strong deviations from the diagonal line.
Scale-Location: The red line is not stable, showing non-constant variance.
Residuals vs Leverage: There are influential points beyond Cook’s distance line.
# YOUR TURN: Conduct Levene's test
levene_test <- leveneTest(DaysMentHlthBad ~ activity_level, data = mental_health_data)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 23.168 9.517e-11 ***
## 5754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
YOUR TURN - Overall assessment:
YOUR TURN - Write a complete 2-3 paragraph results section:
Include: 1. Sample description and descriptive statistics 2. F-test results 3. Post-hoc comparisons (if applicable) 4. Effect size interpretation 5. Public health significance
Your Results Section:
A one-way ANOVA was conducted to examine whether the mean number of bad mental health days differed across physical activity levels (None, Moderate, and Vigorous) from the NHANES dataset. Descriptive statistics indicated a clear pattern of individuals reporting no physical activity had the highest mean number of bad mental health days, followed by moderate activity, and individuals engaging in vigorous physical activity levels had the lowest number of bad mental health days.
ANOVA revealed a statistically significant difference in mean bad mental health days across activity levels, with the F-test result being 23.17. The post-hoc test showed both moderate vs none and vigorous vs none comparisons were statistically significant. This indicated that individuals with any level of physical activity reported fewer bad mental health days than individual with no activity. The difference between vigorous and moderate was not statistically significant.
The effect size (η² = 0.008) indicated that physical activity level explains 0.8% of the variance in bad mental health days, representing a small practical effect. Although the statistical association is clear, physical activity only accounts for a small portion of individual difference in mental health.
1. How does the effect size help you understand the practical vs. statistical significance? The effect size allows your to understand the meaning of the difference. A small p value will tell you groups differ and are statistically significant. The effect size will indicate whether the difference is important, or how big of an effect the difference has.
2. Why is it important to check ANOVA assumptions? What might happen if they’re violated?
Checking the assumptions will ensure the F-test and p-value are trusted. ANOVA can give misleading results when normality and homogeneity of variance are violated. Violations weaken the confidence in the conclusion.
3. In public health practice, when might you choose to use ANOVA?
In public health practice, ANOVA can be used to compare the continuous outcome of three or more groups. You can evaluate the dietary intake across income groups, or blood pressure across BMI categories as shown in the example.
4. What was the most challenging part of this lab activity? The most challenging part of the lab was to full understanding what the data showed. Understanding what the different tests show and how they effect the final results is important and will take practice.