Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Create a course folder on your computer like:
EPI553/
├── Assignments/
│ └── HW01/
│ ├── bmd.csv
│ └── EPI553_HW01_LastName_FirstName.Rmd
Required filename for your submission:
EPI553_HW01_LastName_FirstName.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Load required packages
library(tidyverse)
library(patchwork)
library(car)
library(kableExtra)
library(broom)# Import data (adjust path as needed)
bmd <- readr::read_csv("bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.]
The initial dataset included 2,898 participants. After excluding observations with missing values for bone mineral density or ethnicity, 612 observations were removed. The final analytic sample for the ANOVA analysis consisted of 2,286 participants with complete data.
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnicity | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
Mean bone mineral density (BMD) appears to differ across ethnicity groups, with Non-Hispanic Black participants generally showing the highest average BMD, followed by Non-Hispanic White, Mexican American and Other Hispanic groups, and the lowest values often observed among individuals in the Other group. Overall, the pattern suggests noticeable variation in BMD across ethnic groups rather than similar averages across all categories.
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
We reject H₀. The ANOVA test shows a statistically significant difference in mean BMD across ethnicity groups, F = 30.18, p < 0.001, indicating that at least one ethnic group has a different mean BMD compared to the others.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
The Q-Q plot shows that the data points mostly follow a straight line, which means the residuals look close to normally distributed, with only small differences at the extremes. Levene’s test was not significant (p = 0.1788), suggesting that the variation in BMD is similar across the groups. These results indicate that the assumptions required for ANOVA are reasonably met, so the analysis results can be considered reliable.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
library(dplyr)
library(knitr)
library(kableExtra)
# Run Tukey test
tuk <- TukeyHSD(fit_aov)
# Convert to data frame
tuk_tbl <- as.data.frame(tuk$ethnicity)
tuk_tbl$Comparison <- rownames(tuk_tbl)
# Reorder + format numbers
tuk_tbl <- tuk_tbl %>%
select(Comparison, diff, lwr, upr, `p adj`) %>%
mutate(
diff = round(diff, 3),
lwr = round(lwr, 3),
upr = round(upr, 3),
`p adj` = signif(`p adj`, 3)
)
# Print table
kable(
tuk_tbl,
col.names = c("Comparison", "diff", "lwr", "upr", "p adj")
) %>%
kable_styling(full_width = FALSE) %>%
row_spec(0, bold = TRUE, hline_after = TRUE)| Comparison | diff | lwr | upr | p adj | |
|---|---|---|---|---|---|
| Other Hispanic-Mexican American | Other Hispanic-Mexican American | -0.004 | -0.043 | 0.034 | 9.98e-01 |
| Non-Hispanic White-Mexican American | Non-Hispanic White-Mexican American | -0.062 | -0.093 | -0.032 | 3.00e-07 |
| Non-Hispanic Black-Mexican American | Non-Hispanic Black-Mexican American | 0.022 | -0.010 | 0.055 | 3.28e-01 |
| Other-Mexican American | Other-Mexican American | -0.053 | -0.087 | -0.019 | 1.92e-04 |
| Non-Hispanic White-Other Hispanic | Non-Hispanic White-Other Hispanic | -0.058 | -0.089 | -0.027 | 3.60e-06 |
| Non-Hispanic Black-Other Hispanic | Non-Hispanic Black-Other Hispanic | 0.027 | -0.006 | 0.060 | 1.72e-01 |
| Other-Other Hispanic | Other-Other Hispanic | -0.049 | -0.083 | -0.014 | 1.07e-03 |
| Non-Hispanic Black-Non-Hispanic White | Non-Hispanic Black-Non-Hispanic White | 0.085 | 0.061 | 0.108 | 0.00e+00 |
| Other-Non-Hispanic White | Other-Non-Hispanic White | 0.009 | -0.017 | 0.035 | 8.72e-01 |
| Other-Non-Hispanic Black | Other-Non-Hispanic Black | -0.076 | -0.104 | -0.048 | 0.00e+00 |
Write a short paragraph: “The groups that differed were …”
The groups that differed in mean BMD were Non-Hispanic White and Mexican American, Other and Mexican American, Non-Hispanic White and Other Hispanic, Other and Other Hispanic, Non-Hispanic Black and Non-Hispanic White, and Other and Non-Hispanic Black. The groups that did not differ significantly in mean BMD were Mexican American vs Other Hispanic, Mexican American vs Non-Hispanic Black, and Other Hispanic vs Non-Hispanic Black, indicating that these groups had statistically similar mean BMD values after adjustment for multiple comparisons.
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
# Extract sums of squares
ss_between <- anova_tbl$sumsq[anova_tbl$term == "ethnicity"]
ss_total <- sum(anova_tbl$sumsq)
# Compute eta-squared
eta_sq <- ss_between / ss_total
eta_sq## [1] 0.05027196
Interpret in 1–2 sentences:
Ethnicity explained approximately 5% of the variance in BMD (η² ≈ 0.05). According to Cohen’s guidelines, this represents a small-to-moderate effect size, indicating that ethnicity accounts for a modest proportion of variability in bone mineral density.
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "BMI Distribution")
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
labs(title = "Total Calcium Intake")
p3 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(bins = 30, fill = "plum", color = "black") +
labs(title = "Total Vitamin D Intake")
p4 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(bins = 30, fill = "orange", color = "black") +
labs(title = "Total MET")
(p1 | p2) / (p3 | p4)Based on your assessment, apply transformations as needed:
Document your reasoning: “I applied/did not apply transformations to [variables] because…”
I applied a square-root transformation to totmet because its histogram showed a right-skewed distribution with some high values. The square-root transformation reduces skewness and minimizes the influence of extreme values, making the variable more suitable for correlation analysis. I did not apply transformations to BMXBMI, calcium_total, or vitd_total because their distributions appeared reasonably symmetric and interpretable on their original scales.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "BMXBMI", "DXXOFBMD"),
corr_pair(bmd2, "calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
knitr::kable(
tableA,
col.names = c("Variable", "Outcome", "N", "Correlation", "p-value")
) %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::row_spec(0, bold = TRUE, hline_after = TRUE)| Variable | Outcome | N | Correlation | p-value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.000 |
| BMXBMI | DXXOFBMD | 2275 | 0.410 | 0.000 |
| calcium_total | DXXOFBMD | 2129 | -0.009 | 0.663 |
| vitd_total | DXXOFBMD | 2129 | -0.027 | 0.216 |
Interpret the results:
Age and BMI showed statistically significant correlations with BMD (p < 0.001). Age was negatively correlated with BMD (r = −0.232), indicating that bone mineral density decreases as age increases; this represents a weak negative relationship. BMI was positively correlated with BMD (r = 0.410), indicating that individuals with higher BMI tend to have higher BMD; this represents a moderate positive relationship. Total calcium intake (r = −0.009, p = 0.663) and total vitamin D intake (r = −0.027, p = 0.216) were not significantly correlated with BMD, indicating no statistically significant linear association.
Examine correlations between potential confounders and physical activity:
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "BMXBMI", "sqrt_totmet"),
corr_pair(bmd2, "calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
knitr::kable(
tableB,
col.names = c("Variable", "Outcome", "N", "Correlation", "p-value")
) %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::row_spec(0, bold = TRUE, hline_after = TRUE)| Variable | Outcome | N | Correlation | p-value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1934 | -0.097 | 1.96e-05 |
| BMXBMI | sqrt_totmet | 1912 | -0.007 | 7.71e-01 |
| calcium_total | sqrt_totmet | 1777 | 0.073 | 2.18e-03 |
| vitd_total | sqrt_totmet | 1777 | 0.030 | 2.03e-01 |
Interpret the results:
Age and calcium intake were related to physical activity, but the relationships were very small. Age had a weak negative relationship with activity (r = −0.097, p < 0.001), meaning older people tended to be slightly less active. Calcium intake had a weak positive relationship with activity (r = 0.073, p = 0.002), meaning people who were more active tended to consume a little more calcium. BMI (r = −0.007, p = 0.771) and vitamin D intake (r = 0.030, p = 0.203) were not related to activity levels, since their p-values show no statistical evidence of a relationship. Overall, all relationships were very weak (r < 0.10), suggesting they are unlikely to be important in real-world terms.
Examine correlations among all predictor variables (all pairwise combinations):
preds <- c("RIDAGEYR", "BMXBMI", "calcium_total", "vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- purrr::map_dfr(pairs, ~ corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
knitr::kable(
tableC,
col.names = c("Variable 1", "Variable 2", "N", "Correlation", "p-value")
) %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::row_spec(0, bold = TRUE, hline_after = TRUE)| Variable 1 | Variable 2 | N | Correlation | p-value |
|---|---|---|---|---|
| RIDAGEYR | BMXBMI | 2834 | -0.104 | 0.000000 |
| RIDAGEYR | calcium_total | 2605 | 0.051 | 0.008690 |
| RIDAGEYR | vitd_total | 2605 | 0.064 | 0.000998 |
| BMXBMI | calcium_total | 2569 | -0.019 | 0.332000 |
| BMXBMI | vitd_total | 2569 | 0.012 | 0.550000 |
| calcium_total | vitd_total | 2605 | 0.117 | 0.000000 |
Interpret the results:
The predictors were not strongly related to each other. The relationships between age, BMI, calcium intake, and vitamin D intake were all very small, meaning the variables mostly vary independently. Because of this, there are no multicollinearity concerns, and these variables should be safe to include together in a regression model.
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = BMXBMI, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "BMI",
y = "BMD (g/cm²)",
title = "BMD vs BMI"
)# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Total MET minutes",
y = "BMD (g/cm²)",
title = "BMD vs Physical Activity"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
A. Comparing Methods (ANOVA vs Correlation)
Correlation is used when examining the strength and direction of a linear relationship between two continuous variables, whereas ANOVA is used to compare mean differences in a continuous outcome across categories of a categorical exposure. Correlation tells us whether variables move together and how strongly, but it does not compare groups or test differences in means. In contrast, ANOVA evaluates whether observed group differences are statistically significant but does not measure linear association. For example, a question such as “Is physical activity associated with bone mineral density?” is best addressed using correlation. A question like “Does mean bone mineral density differ by physical activity category (low, moderate, high)?” would be better suited to ANOVA.
B. Assumptions and Limitations
The most challenging assumptions to meet in this assignment were normality and linearity, particularly for variables like total MET minutes and nutrient intake, which were right-skewed. Violations of these assumptions can lead to biased, inaccurate correlation estimates or inflated significance levels. Additionally, cross-sectional correlation analyses cannot establish causality or temporal relationships. For example, even if higher calcium intake is correlated with higher activity levels, we cannot determine whether diet influences activity, activity influences diet, or both are affected by another factor such as socioeconomic status. Residual confounding is also a limitation because important variables may not have been measured or included.
C. R Programming Challenge
The most difficult part of the coding process was troubleshooting errors related to variable creation and chunk execution order when knitting the R Markdown document. Specifically, I encountered errors when a transformed variable did not exist at the time it was referenced in later analyses. I resolved this by carefully reviewing chunk order, ensuring transformations were created before use, and rerunning all chunks sequentially. This process strengthened my understanding of reproducible workflows and improved my ability to debug tidyverse pipelines and R Markdown execution logic.
Before submitting, verify:
Good luck! Remember to start early and use office hours if you need help.