Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

  1. Your .Rmd file
  2. Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.


File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname


Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable   Description                              Type
DXXOFBMD   Total femur bone mineral density         Continuous (g/cm²)
RIDRETH1   Race/ethnicity                           Categorical (1–5)
RIAGENDR   Sex                                      Categorical (1 = Male, 2 = Female)
RIDAGEYR   Age in years                             Continuous
BMXBMI     Body mass index                          Continuous (kg/m²)
smoker     Smoking status                           Categorical (1 = Current, 2 = Past, 3 = Never)
calcium    Dietary calcium intake                   Continuous (mg/day)
DSQTCALC   Supplemental calcium                     Continuous (mg/day)
vitd       Dietary vitamin D intake                 Continuous (mcg/day)
DSQTVD     Supplemental vitamin D                   Continuous (mcg/day)
totmet     Total physical activity (MET-min/week)   Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/abbym/OneDrive/Desktop/STATS553/Assignments/Homework #1/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )
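
A quick optional check that the recoding mapped each numeric code to the intended label (a minimal sketch using dplyr::count(), already available via the tidyverse):

# Optional check: each numeric code should line up with exactly one label
count(bmd, RIDRETH1, ethnicity)
count(bmd, smoker, smoker_f)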

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total
## [1] 2898
n_anova
## [1] 2286
mean(is.na(bmd))*100
## [1] 14.11521
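
For reference, the number of observations excluded from the ANOVA sample can be computed directly from the objects created above (a minimal sketch):

# Observations excluded for missing BMD or ethnicity
n_total - n_anova   # 2898 - 2286 = 612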

Write 2–4 sentences summarizing your analytic sample here.

The total sample includes 2,898 participants. After removing 612 observations with missing BMD or ethnicity, the analytic sample for the ANOVA is 2,286. Before cleaning, 14.11% of all data cells in the dataset are missing.


Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

  • H₀: μ_Mexican American = μ_Other Hispanic = μ_Non-Hispanic White = μ_Non-Hispanic Black = μ_Other
  • Hₐ: μ_i ≠ μ_j for at least one pair of ethnicity groups i and j

Plain language:

  • H₀: The mean BMD values are the same across the different ethnicities.
  • Hₐ: At least one ethnicity has a different mean BMD than the other ethnicities.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)
ethnicity n mean_bmd sd_bmd
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

The spread of BMD appears similar across ethnicity groups. The Non-Hispanic White and Other groups have the lowest median BMD, the Non-Hispanic Black group has the highest median BMD, and the Mexican American and Other Hispanic groups fall in between with similar medians.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

  • Outcome: BMD (DXXOFBMD)
  • Predictor: Ethnicity (5 groups)
# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)
term df sumsq meansq statistic p.value
ethnicity 4 2.9482 0.7371 30.185 0
Residuals 2281 55.6973 0.0244 NA NA

Write your interpretation here:

The F-statistic is 30.18 with 4 and 2281 degrees of freedom, and p < 0.001. Since p < 0.05, we reject H₀: there is statistically significant evidence that mean bone mineral density differs between at least two ethnicity groups.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot
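
The assignment also asks for formal tests; one option for normality is a Shapiro–Wilk test on the model residuals (a sketch, not part of the original output). With large samples this test flags even trivial departures from normality, so the Q-Q plot should carry more weight.

# Formal normality check on the ANOVA residuals (sketch);
# shapiro.test() accepts up to 5000 observations, and n = 2286 is within that limit
shapiro.test(residuals(fit_aov))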

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

Based on the Q-Q plot, the residuals appear approximately normally distributed, as the points fall close to the reference line without major departures. Levene's test statistic was 1.5728 with p = 0.179; since p > 0.05, we fail to reject the null hypothesis of homogeneity of variance, so the variance of bone mineral density is approximately equal across ethnicity groups.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

  • Pairwise comparisons
  • Mean differences
  • 95% confidence intervals
  • Adjusted p-values
tuk <- TukeyHSD(fit_aov)
tuk
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# Plot the Tukey family-wise confidence intervals
plot(tuk, las = 1)

# Extract and format results
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]

# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")
Tukey HSD post-hoc comparisons with 95% confidence intervals
Comparison                              diff      lwr       upr      p adj    Significant   Direction
Other Hispanic-Mexican American        -0.0044   -0.0426    0.0338   0.9978   No            No difference
Non-Hispanic White-Mexican American    -0.0624   -0.0929   -0.0319   0.0000   Yes           Second group higher
Non-Hispanic Black-Mexican American     0.0224   -0.0101    0.0549   0.3277   No            No difference
Other-Mexican American                 -0.0533   -0.0874   -0.0193   0.0002   Yes           Second group higher
Non-Hispanic White-Other Hispanic      -0.0579   -0.0889   -0.0269   0.0000   Yes           Second group higher
Non-Hispanic Black-Other Hispanic       0.0268   -0.0062    0.0598   0.1722   No            No difference
Other-Other Hispanic                   -0.0489   -0.0834   -0.0144   0.0011   Yes           Second group higher
Non-Hispanic Black-Non-Hispanic White   0.0848    0.0612    0.1084   0.0000   Yes           First group higher
Other-Non-Hispanic White                0.0091   -0.0166    0.0347   0.8719   No            No difference
Other-Non-Hispanic Black               -0.0757   -0.1038   -0.0477   0.0000   Yes           Second group higher
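
An alternative to building the table by hand is broom's tidy method for TukeyHSD objects (a sketch; assumes a broom version that supports TukeyHSD, and the exact column names may differ slightly across versions):

# Alternative: tidy the TukeyHSD object directly with broom (sketch)
broom::tidy(tuk) %>%
  kable(digits = 4)
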
# Interpret each comparison
cat("POST-HOC TEST INTERPRETATION\n")
## POST-HOC TEST INTERPRETATION
cat("============================\n\n")
## ============================
for (i in 1:nrow(tukey_summary)) {
  row <- tukey_summary[i, ]
  cat("Comparison:", row$Comparison, "\n")
  cat("  Estimated mean difference:", round(row$diff, 4), "g/cm²\n")
  cat("  95% Confidence Interval: [", 
      round(row$lwr, 4), ", ", round(row$upr, 4), "]\n")
  cat("  Adjusted p-value:", format(row$`p adj`, digits = 4), "\n")
  cat("  Statistical Significance:", row$Significant, "\n")
  cat("  Direction:", row$Direction, "\n\n")
}
## Comparison: Other Hispanic-Mexican American 
##   Estimated mean difference: -0.0044 g/cm²
##   95% Confidence Interval: [ -0.0426 ,  0.0338 ]
##   Adjusted p-value: 0.9978 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Non-Hispanic White-Mexican American 
##   Estimated mean difference: -0.0624 g/cm²
##   95% Confidence Interval: [ -0.0929 ,  -0.0319 ]
##   Adjusted p-value: 2.568e-07 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Mexican American 
##   Estimated mean difference: 0.0224 g/cm²
##   95% Confidence Interval: [ -0.0101 ,  0.0549 ]
##   Adjusted p-value: 0.3277 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Mexican American 
##   Estimated mean difference: -0.0533 g/cm²
##   95% Confidence Interval: [ -0.0874 ,  -0.0193 ]
##   Adjusted p-value: 0.0001919 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic White-Other Hispanic 
##   Estimated mean difference: -0.0579 g/cm²
##   95% Confidence Interval: [ -0.0889 ,  -0.0269 ]
##   Adjusted p-value: 3.607e-06 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Other Hispanic 
##   Estimated mean difference: 0.0268 g/cm²
##   95% Confidence Interval: [ -0.0062 ,  0.0598 ]
##   Adjusted p-value: 0.1722 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Other Hispanic 
##   Estimated mean difference: -0.0489 g/cm²
##   95% Confidence Interval: [ -0.0834 ,  -0.0144 ]
##   Adjusted p-value: 0.001072 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Non-Hispanic White 
##   Estimated mean difference: 0.0848 g/cm²
##   95% Confidence Interval: [ 0.0612 ,  0.1084 ]
##   Adjusted p-value: 3.936e-11 
##   Statistical Significance: Yes 
##   Direction: First group higher 
## 
## Comparison: Other-Non-Hispanic White 
##   Estimated mean difference: 0.0091 g/cm²
##   95% Confidence Interval: [ -0.0166 ,  0.0347 ]
##   Adjusted p-value: 0.8719 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Non-Hispanic Black 
##   Estimated mean difference: -0.0757 g/cm²
##   95% Confidence Interval: [ -0.1038 ,  -0.0477 ]
##   Adjusted p-value: 4.176e-11 
##   Statistical Significance: Yes 
##   Direction: Second group higher

Write a short paragraph:

The following pairs differ significantly in mean BMD: Non-Hispanic White vs. Mexican American, Other vs. Mexican American, Non-Hispanic White vs. Other Hispanic, Other vs. Other Hispanic, Non-Hispanic Black vs. Non-Hispanic White, and Other vs. Non-Hispanic Black. In every significant comparison, the Non-Hispanic Black group had higher mean BMD than the group it was compared with, while the Non-Hispanic White and Other groups had lower mean BMD than the Mexican American and Other Hispanic groups. The remaining pairs did not differ significantly.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl
## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA
# Extract sum of squares from ANOVA table
anova_summary <- summary(fit_aov)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
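
As a cross-check, the same value can be computed from the tidy ANOVA table created above (a minimal sketch using the anova_tbl object defined earlier):

# Eta-squared from the broom output; should match the value above
eta_sq_check <- anova_tbl$sumsq[1] / sum(anova_tbl$sumsq)
round(eta_sq_check, 4)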

Interpret in 1–2 sentences:

η² = 0.0503, meaning that ethnicity explains only about 5.03% of the variance in bone mineral density. Although the overall F-test is statistically significant, this is a small effect: ethnicity alone accounts for little of the variation in BMD.

Part 2 — Correlation analysis (40 points)

In this section you will:

  1. Create total intake variables (dietary + supplemental)
  2. Assess whether transformations are needed
  3. Produce three correlation tables examining relationships between:
    • Table A: Predictors vs. outcome (BMD)
    • Table B: Predictors vs. exposure (physical activity)
    • Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

#Create histograms for key continuous variables

library(patchwork)
p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
p3 <- ggplot(bmd2, aes(x = vitd_total)) + geom_histogram(bins = 30) 
p4 <- ggplot(bmd2, aes(x = totmet)) + geom_histogram(bins = 30)
(p1 | p2) / (p3 | p4)
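
To back up the visual assessment with a number, skewness can be computed for each variable; positive values indicate right skew. Below is a minimal base-R sketch (no skewness package assumed):

# Simple moment-based skewness (sketch): positive values indicate right skew
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

sapply(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")], skew)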

Based on your assessment, apply transformations as needed:

# Transform variables
bmd2 <- bmd2 %>%
  mutate(
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning:

BMI, total calcium, total vitamin D, and physical activity (total MET-minutes) are all right-skewed based on the histograms. BMI and total calcium were log-transformed because they are right-skewed and contain essentially no zero values, whereas total MET-minutes and total vitamin D were square-root transformed because they contain zeros, which a log transformation cannot handle (log(0) is undefined).

2.3 Create three correlation tables (Pearson)

For each correlation, report:

  • Correlation coefficient (r)
  • p-value
  • Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

  • Age vs. BMD
  • BMI vs. BMD
  • Total calcium vs. BMD
  • Total vitamin D vs. BMD

Use transformed versions where appropriate.

# Create Table A

tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
  corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

tableA %>% kable()
x y n estimate p_value
RIDAGEYR DXXOFBMD 2286 -0.232 0.00000
log_bmi DXXOFBMD 2275 0.425 0.00000
log_calcium_total DXXOFBMD 2129 0.012 0.59200
sqrt_vitd_total DXXOFBMD 2129 -0.062 0.00426

Interpret the results: Age, BMI, and total vitamin D intake all show statistically significant correlations with BMD, while total calcium intake does not (p = 0.592). Age and vitamin D intake show weak negative correlations, whereas BMI shows a moderate positive correlation: as age and vitamin D intake increase, BMD tends to decrease, and as BMI increases, BMD tends to increase.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

  • Age vs. MET
  • BMI vs. MET
  • Total calcium vs. MET
  • Total vitamin D vs. MET

# Create Table B

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
  corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

tableB %>% kable()
x y n estimate p_value
RIDAGEYR sqrt_totmet 1934 -0.097 1.96e-05
log_bmi sqrt_totmet 1912 0.001 9.51e-01
log_calcium_total sqrt_totmet 1777 0.086 2.82e-04
sqrt_vitd_total sqrt_totmet 1777 -0.002 9.32e-01

Interpret the results: Age and total calcium intake have statistically significant correlations with physical activity level. Age has a weak negative correlation and calcium intake has a weak positive correlation. This means that as age increases, physical activity tends to decrease, and that as calcium intake increases, physical activity tends to increase.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

  • Age, BMI, Total calcium, Total vitamin D

# Create Table C

preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)

tableC <- map_dfr(pairs, ~ corr_pair(bmd2, .x[1], .x[2])) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

tableC %>% kable()
x y n estimate p_value
RIDAGEYR log_bmi 2834 -0.098 2.00e-07
RIDAGEYR log_calcium_total 2605 0.048 1.36e-02
RIDAGEYR sqrt_vitd_total 2605 0.153 0.00e+00
log_bmi log_calcium_total 2569 0.000 9.81e-01
log_bmi sqrt_vitd_total 2569 0.007 7.31e-01
log_calcium_total sqrt_vitd_total 2605 0.314 0.00e+00

Interpret the results: There are no strong correlations among the predictors that would suggest multicollinearity. Age and BMI, age and total calcium intake, age and vitamin D intake, and calcium intake and vitamin D intake are statistically significantly correlated, but only weakly to moderately (the largest is r = 0.314 between total calcium and total vitamin D).

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )


Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

  • When would you use ANOVA versus correlation in epidemiological research?
  • What are the key differences in what these methods tell us?
  • Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

  • Which assumptions were most challenging to meet in this assignment?
  • How might violations of assumptions affect your conclusions?
  • What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

  • What was the most difficult part of the R coding for this assignment?
  • How did you solve it? (Describe your problem-solving process)
  • What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

ANOVA should be used when there is a categorical predictor and a continuous outcome, and correlation should be used when both variables are continuous (or at least ordinal). ANOVA tells us whether a continuous outcome differs across the levels of a categorical predictor (for example, a difference in means across categories), whereas correlation tells us the strength and direction of an association between two continuous variables. An example of a research question for ANOVA would be "How does BMI vary by food security level in New York State?" and an example research question for correlation would be "Is there a correlation between weight and cholesterol levels in US adults?"

The normality and linearity assumptions were the most challenging to meet, which is why transformations were applied to the variables used in the correlation analysis. Violations can affect conclusions by invalidating the results of the tests: if you assume the assumptions are met when they actually are not, the reported p-values and conclusions may not be trustworthy. A key limitation of cross-sectional correlation analysis for causal inference is that it cannot establish temporality. Because the variables are measured at the same time, you cannot tell which one preceded, and therefore may have caused, the other.

The most difficult part of coding for me was understanding what transformations should be completed based on the histograms made for each variable. I solved it by reading material on why you would complete transformations and when to know when you would transform to log or square roots. I strengthened my understanding of different ways variables can be presented as well as what transformations do and when to perform them. I also strengthened my ability to find resources to help me understand certain coding and statistical information.


Submission checklist

Before submitting, verify:

  • Your .Rmd file is named EPI553_HW01_LastName_FirstName.Rmd
  • Your RPubs document is published with the title epi553_hw01_lastname_firstname
  • You are submitting both items: the .Rmd file and the RPubs link

Good luck! Remember to start early and use office hours if you need help.