Due: Friday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Create a course folder on your computer like:
EPI553/
├── Assignments/
│ └── HW01/
│ ├── bmd.csv
│ └── EPI553_HW01_LastName_FirstName.Rmd
Required filename for your submission:
EPI553_HW01_LastName_FirstName.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
The total sample size of the data is 2898. After removing 612 observations due to missing data, the final sample size for the ANOVA analysis is 2,286 adults.
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnicity | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(bmd,
aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.25, size = 0.5) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Bone Mineral Density by Ethnicity Category",
subtitle = "NHANES Data, Adults aged 50-80",
x = "Ethnicity Category",
y = "Bone mineral density (g/cm²)",
fill = "Ethnicity Category"
) +
#theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
[YOUR INTERPRETATION: Comment on whether residuals appear normally distributed based on the QQ plot, and whether variances appear equal based on Levene’s test. State whether assumptions are met or violated.]
Diagnostic Plot Interpretation:
Levene’s Test Interpretation:
Overall Assessment: With n > 2000, ANOVA is robust to minor violations. Our assumptions are reasonably satisfied.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Table and Interpretation:
| Comparison | Mean Diff | 95% CI | p-value | Significant? |
|---|---|---|---|---|
| Other Hispanic - Mexican American | 0 | [-0.04, 0.03] | 0.998 | No |
| Non-Hispanic White - Mexican American | -0.06 | [-0.09, -0.03] | < 0.001 | Yes |
| Non-Hispanic Black - Mexican American | 0.02 | [-0.01, 0.05] | 0.328 | No |
| Other - Mexican American | -0.05 | [-0.09, -0.02] | < 0.001 | Yes |
| Non-Hispanic White - Other Hispanic | -0.06 | [-0.09, -0.03] | < 0.001 | Yes |
| Non-Hispanic Black - Other Hispanic | 0.03 | [-0.01, 0.06] | 0.172 | No |
| Other - Other Hispanic | -0.05 | [-0.08, -0.01] | < 0.001 | Yes |
| Non-Hispanic Black - Non-Hispanic White | 0.08 | [0.06, 0.11] | < 0.001 | Yes |
| Other - Non-Hispanic White | 0.01 | [-0.02, 0.03] | 0.872 | No |
| Other - Non-Hispanic Black | -0.08 | [-0.1, -0.05] | < 0.001 | Yes |
Conclusion: Not all 5 pairwise comparisons are statistically significant. The groups that were statistically significant were Non-Hispanic White compared to Mexican American, Other compared to Mexican American, Non-Hispanic White compared to Other Hispanic,Other compared to Other Hispanic, Non-Hispanic Black compared to Non-Hispanic White, and Other compared to Non-Hispanic Black.
Non-Hispanic WHite, Non-Hispanic Black and Other all differed from Hispanic and/or Other Hispanic.
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
# YOUR CODE HERE to compute eta-squared
# Hint: Extract SS for ethnicity and Residuals, then calculate proportion
# Extract sum of squares from ANOVA table
ss_treatment_hw <- anova_tbl$sumsq [1]
ss_total_hw <- sum(anova_tbl$sumsq)
# Calculate eta-squared
eta_squared_hw <- ss_treatment_hw / ss_total_hw
cat("Eta-squared (η²):", round(eta_squared_hw, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
## Percentage of variance explained: 5.03 %
Interpret in 1–2 sentences:
[YOUR INTERPRETATION: What proportion of variance in BMD is explained by ethnicity? Use Cohen’s guidelines: small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14]
Interpretation: Ethnicity category explains 5.03% of the variance in Bone Mean Density.
Our calculated effect size is η² = 0.0503. This means Ethnicty categories explains approximately 5.03% of the variance in BMD.
To interpret this effect size, we use conventional guidelines: η² = 0.01 (small), η² = 0.06 (medium), and η² = 0.14 (large). In our case, the effect is small.
While the F-test indicates a statistically significant difference across activity groups, the practical magnitude of this effect is modest. Other factors not included in our model—such as genetics, age, gender, calcium intake, and hormonal status—likely play substantial roles in determining bone mineral density.
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
# YOUR CODE HERE: Create histograms for key continuous variables
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
library(patchwork)
p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
p3 <- ggplot(bmd2, aes(x = vitd_total)) + geom_histogram(bins = 30)
p4 <- ggplot(bmd2, aes(x = totmet)) + geom_histogram(bins = 30)
(p1 | p2) / (p3 | p4)Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
# Add your transformation code here
# Examples (modify as needed based on your assessment):
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total),
sqrt_totmet = sqrt(totmet),
sqrt_vitd_total = sqrt(vitd_total)
)
p5 <- ggplot(bmd2, aes(x = log_bmi)) + geom_histogram(bins = 30)
p6 <- ggplot(bmd2, aes(x = log_calcium_total)) + geom_histogram(bins = 30)
p7 <- ggplot(bmd2, aes(x = sqrt_vitd_total)) + geom_histogram(bins = 30)
p8 <- ggplot(bmd2, aes(x = sqrt_totmet)) + geom_histogram(bins = 30)
(p5 | p6) / (p7 | p8)Document your reasoning: “I applied/did not apply transformations to [variables] because…”
After reviewing the orginal variables, I decided to transform the BMI and total Calcium variables since they were right skewed. After this transfromation the distribution appears to be more normal. I applied the square root transformation to the vitamin D and Total MET varaibles because it slightly helped with skewness and accounting for some of the 0s.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
# YOUR CODE HERE
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableA %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.00000 |
| log_bmi | DXXOFBMD | 2275 | 0.425 | 0.00000 |
| log_calcium_total | DXXOFBMD | 2129 | 0.012 | 0.59200 |
| sqrt_vitd_total | DXXOFBMD | 2129 | -0.062 | 0.00426 |
Interpret the results:
[YOUR INTERPRETATION: Which variables show statistically significant correlations with BMD? What is the direction and strength of these relationships?]
Age and Vitamin D Total show a startically significant negative relationship with bone mineral density, while BMI shows a significant positive correlation with bone mineral density.
Examine correlations between potential confounders and physical activity:
# YOUR CODE HERE
# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variable
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableB %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1934 | -0.097 | 1.96e-05 |
| log_bmi | sqrt_totmet | 1912 | 0.001 | 9.51e-01 |
| log_calcium_total | sqrt_totmet | 1777 | 0.086 | 2.82e-04 |
| sqrt_vitd_total | sqrt_totmet | 1777 | -0.002 | 9.32e-01 |
Interpret the results:
[YOUR INTERPRETATION: Which variables are associated with physical activity levels?]
All four variables (AGE, BMI, Calcium Total and Vitamn D) are statistically signficant with physical activity levels.
Examine correlations among all predictor variables (all pairwise combinations):
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()
# Example structure:
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableC %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | log_bmi | 2834 | -0.098 | 2.00e-07 |
| RIDAGEYR | log_calcium_total | 2605 | 0.048 | 1.36e-02 |
| RIDAGEYR | sqrt_vitd_total | 2605 | 0.153 | 0.00e+00 |
| log_bmi | log_calcium_total | 2569 | 0.000 | 9.81e-01 |
| log_bmi | sqrt_vitd_total | 2569 | 0.007 | 7.31e-01 |
| log_calcium_total | sqrt_vitd_total | 2605 | 0.314 | 0.00e+00 |
Interpret the results:
Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses?]
There are strong correlations between age and BMI, Calcium Total, and Vitamin D. There is no correlation between BMI and Calcium total and Vitamin D total. There is a strong correlation between calcium total and vitamin D total.
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)
# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
I would use correlation if I was analyzing two continuous or ordinal variables, interested in knowing the strength and direction of a linear relationship and want to explore the data before doing a regression. . While ANOVA would be used if I had 1 categorical predictor and a continous outcome, 3 or more groups means, and the observations are independent of each other.
Performing an ANOVA statistic would assess if there are difference in the group means, while correlation is used to describe the relastionship but not the cause.
One example of a research question better suited for ANOVA would be: Does BMI differ between across different regions or counties in New York State. One example of when to use correlation would be if I was interested in understanding if there is a relationship of hours of physial activity affect BMI among New York State residents.
B. Assumptions and Limitations
Addrssing Normality was the most challenging to meet in this assignment, we would want are data to be normally distributed and in order to do so we had to transform some of the variables so that they would fit better and be more normally distributed.
Violations of assumptions can skew the results and either make it seem that there is a relationship or not. Outliers can also distort the conclusions as well as not having a large enough sample size. Even though not assumptions were met, due to our large sample size we can make the assumption that the distribution is normal.
C. R Programming Challenge
The most difficult part of the R coding for this assignment was getting some of my ANOVA plots to run because the plots were to large. The document would knit but for some reason my code was not running as it should so made it challenging to intrepret the findings. Another challenge was creating the Tukey’s table, took me some time to really understand the code and understand how to set it up correctly so right values fell with the right variables.
Ended up being some triak and error and playing around with the code to see what the problem was generating from,
An R skill that I strengthen while completing this assignment was defenitely the table codes. I was able to troubleshoot and figure out what the issue was and why it was happening. Moving forward I will continue to work on my final presentation of tables and visuals.
YOUR REFLECTION HERE (200–400 words)
[Write your reflection addressing all three components above]
Before submitting, verify:
Good luck! Remember to start early and use office hours if you need help.