Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Required filename for your submission:
EPI553_HW01_LastName_FirstName.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
library(dplyr)
library(patchwork)# Import data (adjust path as needed)
bmd <- readr::read_csv("/Users/jingjunyang/Desktop/EPI553 Project/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
[Total sample size: 2898; 612 observations were removed due to missing data; the final analytic sample size is 2286 for the ANOVA analysis.]
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnicity | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
[Different ethnicity groups have different mean BMD; The plot shows that Non-Hispanic Black have the highest median BMD compare to any other groups, while non-hispanic white participants have the lowest. Variability (box heights) looks similar across groups except for Non-Hispanic White]
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
[Since p<0.05, we reject Ho; there is statistically significant evidence that mean BMD differs across at least two ethnicity groups.]
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
[The Q-Q plot of residuals shows that the points fall along the diagonal reference line, which suggests that the residuals are approximately normally distributed. Levene’s test for homogeneity of variaty was not statistically significant. Since p value 0.1788 >0.05, we fail to reject the null hypothesis. F (4,2281)=1.57. This indicates that there is no evidence of unequal variances across ethnicity groups. Therefore, the assumption of homogeneity of variance is satisfied. ]
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significance
tukey_df <- as.data.frame(tukey_results$ethnicity)
tukey_table <- tukey_df %>%
rownames_to_column(var = "Comparison") %>%
rename(
'Difference in Means' =diff,
'Lower CI' = lwr,
'Upper CI' = upr,
'Adjusted P-value' = 'p adj'
)
tukey_table %>% kable()| Comparison | Difference in Means | Lower CI | Upper CI | Adjusted P-value |
|---|---|---|---|---|
| Other Hispanic-Mexican American | -0.0044413 | -0.0426441 | 0.0337616 | 0.9978049 |
| Non-Hispanic White-Mexican American | -0.0623772 | -0.0928526 | -0.0319019 | 0.0000003 |
| Non-Hispanic Black-Mexican American | 0.0223916 | -0.0101001 | 0.0548833 | 0.3276613 |
| Other-Mexican American | -0.0533193 | -0.0873573 | -0.0192814 | 0.0001919 |
| Non-Hispanic White-Other Hispanic | -0.0579360 | -0.0889347 | -0.0269373 | 0.0000036 |
| Non-Hispanic Black-Other Hispanic | 0.0268329 | -0.0061502 | 0.0598159 | 0.1722466 |
| Other-Other Hispanic | -0.0488781 | -0.0833853 | -0.0143708 | 0.0010720 |
| Non-Hispanic Black-Non-Hispanic White | 0.0847689 | 0.0611644 | 0.1083733 | 0.0000000 |
| Other-Non-Hispanic White | 0.0090579 | -0.0166334 | 0.0347492 | 0.8719109 |
| Other-Non-Hispanic Black | -0.0757110 | -0.1037645 | -0.0476574 | 0.0000000 |
Write a short paragraph: “The groups that differed were …”
[ The groups that differed were Non-Hispanic Black and Non-Hispanic White, with Non-Hispanic Black adults showing higher mean BMD. Non-Hispanic White Adults also had lower BMD than Mexican American and other Hispanic adults. Additionally, other adults had lower BMD than Mexican American, Other Hispanic, and Non-Hispanic Black Adults. All other pairwise comparisons did not show statistically significant differences in mean BMD. ]
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
anova_summary <- summary(fit_aov) [[1]]
ss_between <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
eta_squared<- ss_between / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
Interpret in 1–2 sentences:
[The effect size for ethnicity was 5.03%, indicating that approximately 5% of the total variance in BMD is explained by ethnicity. This represents a small-to-moderate effect, suggesting that while differences between groups are statistically significant, other factors contribute substantially to BMD variation]
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(bins = 30, fill= "steelblue", color = "black") +
labs (title = "BMI Distribution", x="BMI (kg/m^2)", y = "Count")
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(bins = 30, fill ="salmon", color = "black")+
labs (title = "Total Calcium Intake", x="mg/day", y = "Count")
p3 <- ggplot(bmd2, aes( x=vitd_total)) +
geom_histogram(bins = 30, fill = "green", color = "black") +
labs (title = "Total Vitamin D Intake", x= "mcg/day", y= "Count")
p4 <-ggplot(bmd2, aes(x= totmet)) +
geom_histogram(bins = 30, fill = "purple", color = "black") +
labs (title = "Total Physical Activity", x="MET-min/week", y="Count")
(p1 | p2) / (p3 | p4)Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
log_bmi= log(BMXBMI),
log_calcium_total = log1p(calcium_total),
sqrt_vitd_total = sqrt(vitd_total),
sqrt_totmet = sqrt(totmet),
)Document your reasoning: “I applied a log transformations to BMI and total calcium because their distributions were right-skewed with some extreme values in the histograms, I applied a square root transforamtion to total Vitamin D and Total MET to reduce skewness while handling zeros..”
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
# YOUR CODE HERE
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableA %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.00000 |
| log_bmi | DXXOFBMD | 2275 | 0.425 | 0.00000 |
| log_calcium_total | DXXOFBMD | 2129 | 0.012 | 0.59200 |
| sqrt_vitd_total | DXXOFBMD | 2129 | -0.062 | 0.00426 |
Interpret the results:
[Age: Negatively correlated with BMD (r = -0.232, p < 0.001). This indicates that as age increases, bone mineral density tends to decrease. The strength of this relationship is small to moderate. BMI (log-transformed): Positively correlated with BMD (r = 0.425, p < 0.001). Individuals with higher BMI tend to have higher BMD. This is a moderate positive association and is the strongest correlation observed in the table. Total calcium intake (log-transformed): Not significantly correlated with BMD (r = 0.012, p = 0.592). This suggests no meaningful relationship between calcium intake and BMD in this dataset. Total vitamin D intake (square-root transformed): Negatively correlated with BMD (r = -0.062, p = 0.004). Although statistically significant, the relationship is very weak and likely not practically meaningful. Summary: Age and BMI are the most relevant predictors of BMD, with BMI showing a moderate positive association. Calcium intake does not show a significant association, and vitamin D intake shows a very weak negative correlation.]
Examine correlations between potential confounders and physical activity:
# YOUR CODE HERE
# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variable
bmd2 <- bmd2 %>%
mutate(sqrt_totmet = sqrt(totmet))
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
)%>%
mutate(
estimate= round(estimate, 3),
p_value =signif(p_value, 3)
)
tableB %>% knitr::kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1934 | -0.097 | 1.96e-05 |
| log_bmi | sqrt_totmet | 1912 | 0.001 | 9.51e-01 |
| log_calcium_total | sqrt_totmet | 1777 | 0.086 | 2.82e-04 |
| sqrt_vitd_total | sqrt_totmet | 1777 | -0.002 | 9.32e-01 |
Interpret the results:
[Age and total calcium intake are associated with the physical activity levels. Age is negatively correlated with physical activity (r=-0.097, p<0.001), meaning older participants tend to be slightly less active. Total calcium intake is positively correlated with physical activity (r=0.086, p<0.001), meaning that participants with higher calcium intake tend to be slightly more active. BMI and total vitamin intake are not significantly associated with physical activity in this sample]
Examine correlations among all predictor variables (all pairwise combinations):
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableC %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | log_bmi | 2834 | -0.098 | 2.00e-07 |
| RIDAGEYR | log_calcium_total | 2605 | 0.048 | 1.35e-02 |
| RIDAGEYR | sqrt_vitd_total | 2605 | 0.153 | 0.00e+00 |
| log_bmi | log_calcium_total | 2569 | 0.000 | 9.83e-01 |
| log_bmi | sqrt_vitd_total | 2569 | 0.007 | 7.31e-01 |
| log_calcium_total | sqrt_vitd_total | 2605 | 0.314 | 0.00e+00 |
Interpret the results: [Most predictor pairs show weak correlations |r|<0.2, suggests they are independent and unlikely cause multicollinearity problems in regression models. The only moderate correlation is between log_calcium_total and log_vita_total r=0.314, which indicates some shared variation but generally would not be a serious concern.]
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)
# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
[A. Comparing Methods (ANOVA vs Correlation) ANOVA and correlation serve different purposes in epidemiological research. ANOVA is used to compare the means of a continuous outcome across two or more categorical groups. Correlation assesses the strength and direction of a linear relationship between two continuous variables. For example, a research question like “Does mean bone mineral density differ by ethnicity?” is best addressed with ANOVA. On the other hand, a question such as “Is higher BMI associated with higher bone mineral density?” is better suited for correlation analysis.
B. The most challenging assumptions in this assignment were normality and linearity for the correlation analyses. Many variables were skewed, which required transformations to better meet these assumptions. Violating assumptions, such as non-normal distributions or strong outliers, could bias the correlation coefficients or p-values, leading to misleading conclusions. A major limitation of cross-sectional correlation analysis is that it cannot establish causality. While I can observe associations between nutrition, physical activity, and BMD, I cannot determine whether changes in calcium intake or exercise cause changes in bone health..
C. The hardest part of the R coding was applying variable transformations consistently across tables and plots. I addressed this by checking each variable’s distribution with histograms, choosing the right transformation, and using mutate() systematically. This helped me work better with dplyr, use custom functions like corr_pair(), and create reproducible tables and plots. I also became more confident fixing errors from missing or misnamed columns.]
Before submitting, verify:
Good luck! Remember to start early and use office hours if you need help.