Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Create a course folder on your computer like:
EPI553/
├── Assignments/
│ └── HW01/
│ ├── bmd.csv
│ └── EPI553_HW01_LastName_FirstName.Rmd
Required filename for your submission:
EPI553_HW01_LastName_FirstName.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 1/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
The total sample size is N=2898. After removing 612 samples due to missing data in the variables DXXOFBMD and ethnicity, the analytic sample size is N=2286.
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnicity | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
The race/ethnicity groups have similar variances, but Non-Hispanic white has the lowest median BMD while Non-Hispanic black has the highest median BMD.
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here: Based on the results of the ANOVA, I will reject H0 and conclude that not all group means are equal. Thus, not all race/ethnicity groups have the same mean BMD.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences: The Q-Q residuals plot indicates that the data are normally distributed; points follow the red line with almost no deviation. The results of Levene’s Test were F = 1.5728, p = 0.1788. Thus, I do not reject the null hypothesis and assume homogeneity of variance.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| ethnicity | Other Hispanic-Mexican American | 0 | -0.0044 | -0.0426 | 0.0338 | 0.9978 |
| ethnicity | Non-Hispanic White-Mexican American | 0 | -0.0624 | -0.0929 | -0.0319 | 0.0000 |
| ethnicity | Non-Hispanic Black-Mexican American | 0 | 0.0224 | -0.0101 | 0.0549 | 0.3277 |
| ethnicity | Other-Mexican American | 0 | -0.0533 | -0.0874 | -0.0193 | 0.0002 |
| ethnicity | Non-Hispanic White-Other Hispanic | 0 | -0.0579 | -0.0889 | -0.0269 | 0.0000 |
| ethnicity | Non-Hispanic Black-Other Hispanic | 0 | 0.0268 | -0.0062 | 0.0598 | 0.1722 |
| ethnicity | Other-Other Hispanic | 0 | -0.0489 | -0.0834 | -0.0144 | 0.0011 |
| ethnicity | Non-Hispanic Black-Non-Hispanic White | 0 | 0.0848 | 0.0612 | 0.1084 | 0.0000 |
| ethnicity | Other-Non-Hispanic White | 0 | 0.0091 | -0.0166 | 0.0347 | 0.8719 |
| ethnicity | Other-Non-Hispanic Black | 0 | -0.0757 | -0.1038 | -0.0477 | 0.0000 |
Write a short paragraph: “The groups that differed were …”
The groups that have significantly different mean BMD are:
Non-Hispanic White - Mexican American Other - Mexican American
Non-Hispanic White - Non-Hispanic Black Other - Non-Hispanic Black
Other Hispanic - Non-Hispanic White Other Hispanic - Other
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
# YOUR CODE HERE to compute eta-squared
anova_summary <- summary(fit_aov)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
Interpret in 1–2 sentences:
The effect size is 0.0503, meaning 5.03% of the variance in BMD is explained by race/ethnicity. According to Cohen’s guidelines, this is a small effect size, so race/ethnicity is a not a major predictor of BMD. It is likely there are other variables that contribute more to BMD.
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
# YOUR CODE HERE: Create histograms for key continuous variables
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
library(patchwork)
ggplot(bmd2, aes(x = BMXBMI))+geom_histogram(bins=30) +
labs(
title = "Histogram of BMI",
subtitle = "NHANES Data",
x = "BMI (kg/m2)",
y = "Frequency",
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")ggplot(bmd2, aes(x = calcium_total))+geom_histogram(bins=30) +
labs(
title = "Histogram of Total Calcium",
subtitle = "NHANES Data",
x = "Total Calcium (mg/day)",
y = "Frequency",
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")ggplot(bmd2, aes(x = vitd_total))+geom_histogram(bins=30) +
labs(
title = "Histogram of Total Vitamin D",
subtitle = "NHANES Data",
x = "Total Vitamin D (mcg/day)",
y = "Frequency",
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")ggplot(bmd2, aes(x = totmet))+geom_histogram(bins=30) +
labs(
title = "Histogram of Total Physical Activity",
subtitle = "NHANES Data",
x = "Total Physical Activity (MET-min/week)",
y = "Frequency",
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
# Add your transformation code here
# Examples (modify as needed based on your assessment):
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total),
sqrt_totmet = sqrt(totmet),
sqrt_vitd_total = sqrt(vitd_total)
)Document your reasoning: “I applied/did not apply transformations to [variables] because…”
I applied transformation to all these variables as they all have a substantial positive skew, as well as some outliers to the right of the median.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
# YOUR CODE HERE
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()
# Example structure:
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
#
tableA %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.00000 |
| log_bmi | DXXOFBMD | 2275 | 0.425 | 0.00000 |
| log_calcium_total | DXXOFBMD | 2129 | 0.012 | 0.59200 |
| sqrt_vitd_total | DXXOFBMD | 2129 | -0.062 | 0.00426 |
Interpret the results: Age, BMI, and total vitamin D intake showed statistically significant correlations with BMD.
Age and BMD exhibited a weak negative correlation (r=-0.232, p = 0.000, n = 2286) BMI and BMD exhibited a moderate positive correlation (r=0.425, p = 0.000, n = 2275) Vitamin D and BMD exhibited a weak negative correlation (r=-0.062, p = 0.004, n=2129)
Calcium and BMD exhibited no correlation (r=0.012, p = 0.592, n=2129)
Examine correlations between potential confounders and physical activity:
# YOUR CODE HERE
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 2)
)
#
tableB %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1934 | -0.097 | 0.00002 |
| log_bmi | sqrt_totmet | 1912 | 0.001 | 0.95000 |
| log_calcium_total | sqrt_totmet | 1777 | 0.086 | 0.00028 |
| sqrt_vitd_total | sqrt_totmet | 1777 | -0.002 | 0.93000 |
# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variableInterpret the results:
Age and Total calcium intake are associated with physical activity level.
Age exhibits a weak negative correlation with physical activity level (r=-0.097, p=0.000, n=1934) Total calcium intake exhibits a weak positive correlation with physical activity level (r=0.086, p=0.000, n=1777)
BMI was not associated with physical activity level (r=0.001, p=0.950, n=1912) Vitamin D intake was not associated with physical activity level (r=-0.002, p=0.930, n=1777)
Examine correlations among all predictor variables (all pairwise combinations):
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()
# Example structure:
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 2)
)
#
tableC %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | log_bmi | 2834 | -0.098 | 2.0e-07 |
| RIDAGEYR | log_calcium_total | 2605 | 0.048 | 1.4e-02 |
| RIDAGEYR | sqrt_vitd_total | 2605 | 0.153 | 0.0e+00 |
| log_bmi | log_calcium_total | 2569 | 0.000 | 9.8e-01 |
| log_bmi | sqrt_vitd_total | 2569 | 0.007 | 7.3e-01 |
| log_calcium_total | sqrt_vitd_total | 2605 | 0.314 | 0.0e+00 |
Interpret the results:
There are some statistically significant correlations between Age-BMI (n=2834, r=-0.098, p=2.0e-07), Age-Calcium (n=2605, r=0.048, p=1.4e-02), Age-VitaminD (n=2605, r=0.153, p=0.0e+00), and Calcium-VitaminD (n=2605, r=0.314, p=0.0e+00). The correlations between BMI-Calcium (n=2569, r=0.000, p=9.8e-01) and BMI-VitaminD (n=2569, r=0.007, p=7.3e-01) are not statistically significant. However, even among the statistically significant results, there are no strong correlations between these predictors. Only Calcium-VitaminD exhibits a moderate positive correlation (r=0.314)
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)
# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
I would use ANOVA when the predictor variable is categorical with more than two levels and the outcome variable is continuous. I would use correlation when both variables are continuous. ANOVA tells us if there are significant differences in the dependent variable between different levels of the independent variable. Correlation tells us if the independent and dependent variable exhibit a linear relationship. An example of a research question best suited for ANOVA is “Is there a significant difference in mean systolic blood pressure between different income categories (low, medium, high)?” A research question best suited for correlation is “Is there an association between white blood cell count and BMI?”
The assumption most challenging to meet was the assumption of normality for the correlation analyses, as the key variables BMI, Total calcium, Total Vitamin D, and Total physical activity exhibited positively-skewed distributions. Violations of assumptions for a statistical test might mean that the test is not giving us an accurate picture of the true relationship in the population, and lead to a Type I or Type II error. Cross sectional correlation analyses are limited because a cross sectional study only provides a snapshot of a moment in time. Therefore, if we find an association between variable A and variable B, we do not know which variable is leading to / causing the other, or if an unknown Variable C is actually responsible for the observed relationship.
The most difficult part of the assignment for me was creating the histograms of BMI, Calcium, Vitamin D, and total physical activity. I solved it by using the ggplot code from a previous assignment and combining it with the provided code, making sure the dataset, variables, and labels were updated to this assignment. In solving this problem, I strengthened my ability to modify previously written code to complete a new task. This is important so as to not “reinvent the wheel” for every new coding task.
Before submitting, verify:
Good luck! Remember to start early and use office hours if you need help.