Homework 1 overview

Due: Friday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(gridExtra)
library(patchwork)
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data
bmd <- readr::read_csv("/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/Assignments/HW01/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Display total sample size
data.frame(
  Metric = "Total Sample Size",
  Value = paste(nrow(bmd), "adults")
) %>%
  kable()

Metric	Value
Total Sample Size	2898 adults

# Count missing values in each column
missing_summary <- data.frame(
  Variable = names(bmd), 
  Missing_Count = (colSums(is.na(bmd))),
  Missing_Percent = paste0(round(colSums(is.na(bmd)) / nrow(bmd) * 100, 2), "%")
)

# Display results in a clean table
kable(missing_summary[order(-missing_summary$Missing_Count), ][1:15, -1], digits = 4, align = "r",
      caption = "Missing Count and Percentages for Variables")

Missing Count and Percentages for Variables
	Missing_Count	Missing_Percent
DSQTCALC	1633	56.35%
DSQTVD	1515	52.28%
totmet	964	33.26%
metcat	964	33.26%
DXXOFBMD	612	21.12%
tbmdcat	612	21.12%
calcium	293	10.11%
vitd	293	10.11%
BMXBMI	64	2.21%
smoker	2	0.07%
smoker_f	2	0.07%
SEQN	0	0%
RIAGENDR	0	0%
RIDAGEYR	0	0%
RIDRETH1	0	0%

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display Analytic sample size
data.frame(
  Metric = "Analytic Sample Size",
  Value = paste(nrow(anova_df), "adults")
) %>%
  kable()

Metric	Value
Analytic Sample Size	2286 adults

Total and Analytic Sample Size: The total sample was n = 2898 individuals. 612 observations were removed due to missing data from incomplete cases for BMD and ethnicity. This leaves n = 2286 individuals in the analytic sample size for the ANOVA analysis.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ_MexicanAmerican = μ_OtherHispanic = μ_Non-HispanicWhite = μ_Non-HispanicBlack = μ_Other
Hₐ: μ_i ≠ μ_j

Plain language:

H₀: Mean BMD is equal across all five ethnicity groups
Hₐ: At least one ethnicity’s mean BMD differs from the other ethnicities

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Bone Mineral Density by Ethnicity",
    subtitle = "NHANES 2017-2018, Adults aged 50+",
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)",
    fill = "Ethnicity"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

What patterns do you observe in BMD across ethnicity groups?:

Initial Observations: The median BMD appears highest in non-Hispanic Blacks and lowest in either non-Hispanic whites or those identifying as other. Mexican American and other Hispanic appear to be sit in the middle of the ethnicity groups for median BMD. Variability looks similar across the five groups of ethnicity with a few potential outliers in each group.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

We should make the decision to reject H₀ because the p-value < 0.001 and that is below the threshold of α = 0.05. This suggests that the means across ethnicity are not the same and we should consider the evidence for the alternative hypothesis that at least one mean for ethnicity is different from the other means. We also see a very large f-statistic (30.186) meaning we have a large difference in the variance between group means and within group variance adding further evidence to reject the null hypothesis. In addition, we have a very small p-value (p < .001) suggesting that this outcome is unlikely to have been caused by chance.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

The residuals vs. fitted appear to be randomly scattered with no visible pattern and equally spread above and below zero suggesting independence as well as the red line is staying relatively straight around the 0.0 mark. In the Q-Q residuals, the points fall close to the dotted diagonal line with slight deviations at the tail suggesting normality. Levene’s test resulted in a p values of 0.1788 which indicates homogenous variances across groups, ultimately supporting the equal variance assumption of ANOVA. These conclusions lead me to suggest that assumptions of independence, normality, and homogeneity of variance are met.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# Extract and Format Results
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]

# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"), "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")

Tukey HSD post-hoc comparisons with 95% confidence intervals
	Comparison	diff	lwr	upr	p adj	Significant	Direction
Other Hispanic-Mexican American	Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978	No	No difference
Non-Hispanic White-Mexican American	Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000	Yes	Second group higher
Non-Hispanic Black-Mexican American	Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277	No	No difference
Other-Mexican American	Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002	Yes	Second group higher
Non-Hispanic White-Other Hispanic	Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000	Yes	Second group higher
Non-Hispanic Black-Other Hispanic	Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722	No	No difference
Other-Other Hispanic	Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011	Yes	Second group higher
Non-Hispanic Black-Non-Hispanic White	Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000	Yes	First group higher
Other-Non-Hispanic White	Other-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719	No	No difference
Other-Non-Hispanic Black	Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000	Yes	Second group higher

Write a short paragraph: “The groups that differed were …”

The groups that demonstrated statistically significant differences in mean BMD were Non-Hispanic White vs. Mexican American (p adj < .001), other vs. Mexican American (p adj = 0.0002), non-Hispanic white vs. other Hispanic (p adj < .001), other vs. other Hispanic (p adj = 0.0011), non-Hispanic Black vs. non-Hispanic White (p adj < .001), and other vs. non-Hispanic Black (p adj < .001). The groups that did not demonstrate statistically significant differences in mean BMD were other Hispanic vs. Mexican American (p adj = 0.9978), non-Hispanic Black vs. Mexican American (p adj = 0.3277), non-Hispanic Black vs. other Hispanic (p adj = 0.1722), and other vs. non_Hispanic White (p adj = 0.8719).

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

ss_treatment <- anova_tbl$sumsq[1]
ss_total <- sum(anova_tbl$sumsq)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

Interpret in 1–2 sentences: Our calculated effect size is η² = 0.0503. This means ethnicity explains approximately 5.03% of the variance in BMD. While the F-test indicates a statistically significant difference across ethnicity, the practical magnitude of this effect is small nearing medium meaning that other factors have larger roles in deciding bone mineral density.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# Create histograms for key continuous variables
# Recommended: age, BMXBMI, calcium_total, vitd_total, totmet

# BMI Histogram 
p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30, fill = "skyblue") + ggtitle("BMI Histogram")

# Total Calcium Histogram 
p2 <- ggplot(bmd2, aes(x = calcium_total)) + 
geom_histogram(bins = 30, fill = "salmon") + ggtitle("Total Calcium Histogram")

# Total Vitamin D Histogram
p3 <- ggplot(bmd2, aes (x = vitd_total)) + geom_histogram(bins = 30, fill = "seagreen") + ggtitle("Total Vitamin D Histogram")

# Physical Activity Histogram 
p4 <- ggplot(bmd2, aes (x = totmet)) + geom_histogram(bins = 30, fill = "orchid") + ggtitle("Physical Activity Histogram")

# Age Histogram 
p5 <- ggplot(bmd2, aes(x = RIDAGEYR)) + geom_histogram(bins = 30, fill = "khaki") + ggtitle("Age Histogram")

# Arrange plots in a 2 x 3 grid 
grid.arrange(p1, p2, p3, p4, p5,
             layout_matrix = matrix(c(1, 2, 3, 4, 5, NA), nrow = 2, byrow = TRUE))

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning:

I applied log transformations for both BMXBMI and calcium_total because they are both relatively right-skewed with possible outliers. I applied square root transformations to totmet and vitd_total because it appears that they are also right skewed but have much more 0 values which square root handles better. I did not transform age to ensure preservation of all data within that variable.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# Use the corr_pair function to create correlations and use bind_rows() to combine multiple correlation results
tableA <- bind_rows(
   corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),  
   corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
   corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
   corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
  mutate(
    estimate = round(estimate, 3),
     p_value = formatC(p_value, format = "f", digits = 5)
    )

# Create table A 
tableA %>% kable(align = "r", caption = "Predictors vs. Outcome - BMD")

Predictors vs. Outcome - BMD
x	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.00000
log_bmi	DXXOFBMD	2275	0.425	0.00000
log_calcium_total	DXXOFBMD	2129	0.012	0.59164
sqrt_vitd_total	DXXOFBMD	2129	-0.062	0.00426

Interpret the results:

[Which variables show statistically significant correlations with BMD? What is the direction and strength of these relationships?]

Age (RIDAGEYR) and BMI (log-transformed) demonstrated statistically significant correlations with BMD (p-value < .001). Total Vitamin D (square-root transformed) also demonstrated a statistically significant correlation with BMD (p-values = 0.00426). The relationship between age and BMD is a weak negative relationship (r = -0.232). The relationship between BMI and BMD is a moderate positive correlation (r = 0.425). The relationship between total vitamin D and BMD is a very weak negative relationship (r = -0.062). Calcium did not demonstrate a correlation with BMD (p-value = 0.59164).

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# Use the corr_pair function to create correlations and use bind_rows() to combine multiple correlation results
tableB <- bind_rows(
   corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),  
   corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
   corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
   corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate = round(estimate, 3),
     p_value = formatC(p_value, format = "f", digits = 5)
    )

# Create table B
tableB %>% kable(align = "r", caption = "Predictor vs. Exposure - Physical Activity")

Predictor vs. Exposure - Physical Activity
x	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.097	0.00002
log_bmi	sqrt_totmet	1912	0.001	0.95115
log_calcium_total	sqrt_totmet	1777	0.086	0.00028
sqrt_vitd_total	sqrt_totmet	1777	-0.002	0.93222

Interpret the results:

Age appears to demonstrate a statistically significant correlation with square-root-transformed physical activity (p-value = 0.00002). The relationship is a weak negative correlation (r = -0.097). Total calcium (log-transformed) also appears to demonstrate a statistically significant correlation with square-root-transformed physical activity (p-value = 0.00029). The relationship is a weak positive relationship (r = 0.086). BMI (log-transformed) does not demonstrate a significant correlation (p-value = 0.96115), and neither does total vitamin D (square-root-transformed) with a p-value with 0.93222.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# Create all pairwise combinations of predictors

# Use combn() to generate pairs, then map_dfr() with corr_pair()
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
 pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
 estimate = round(estimate, 3),
 p_value = formatC(p_value, format = "f", digits = 5)
 )
 
# Create table C
tableC %>% kable(align = "r", caption = "Predictors vs. Each Other")

Predictors vs. Each Other
x	y	n	estimate	p_value
RIDAGEYR	log_bmi	2834	-0.098	0.00000
RIDAGEYR	log_calcium_total	2605	0.048	0.01360
RIDAGEYR	sqrt_vitd_total	2605	0.153	0.00000
log_bmi	log_calcium_total	2569	0.000	0.98112
log_bmi	sqrt_vitd_total	2569	0.007	0.73115
log_calcium_total	sqrt_vitd_total	2605	0.314	0.00000

Interpret the results: There are not any strong correlations among predictors that may indicate multicollinearity concerns in future regressions analyses. None of the pairwise combinations demonstrate an |r| > 0.7 suggesting strong correlations.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

ANOVA allows us to address a multiple comparison problem that arises when using numerous t-tests to compare more than three groups in epidemiological research. If we were to conduct numerous t-tests, it would inflate the type I error. ANOVA is useful because it is a single omnibus test that controls for error at α = 0.05. ANOVA specifically is used when comparing three or more group means, there is one categorical predictor, a continuous outcome, and independent observations within each group. ANOVA can only explain differences between means, whereas correlation can test for relationships. Correlation allows us to measure the strength and direction of a linear relationship between two continuous variables and should be employed in epidemiological research when exploration of data before regression is wanted, and describing associations, not causation. ANOVA is better suited for a research question such as does blood pressure reduction differ across three different medications. Correlation is better suited for a research question, such as whether there is a correlation between weight and average systolic blood pressure among US adults.

The assumption of independence was met through the random scatter in the residual vs. fitted plot. Equal variances were met through the non-significant result in the Levene’s test. Normality was met when the dots in the Q-Q plot followed the dotted line without too heavy tails; none were particularly hard to meet. If working with a small sample size in ANOVA, assumptions are critically important because with small, unbalanced samples, the violations can impact the F-test validity, as outliers have large effects on it. The limitations of cross-sectional correlation analysis for understanding relationships between nutrition health/activity and bone health are that correlation does not equal causation. Although a correlation may be present between two variables, it does not mean one causes the other because cross-sectional studies can not establish temporal relationships.

The most difficult part of this R coding was formatting each table and graph, such as the histograms, to look as I wanted. I originally referred back to the numerous lectures that we had to take ideas from there. When that code achieved everything it could, I then went to Google and utilized r resources that were available to learn how to edit colors, bins, fills, alignments, and themes to make the data appealing. I strengthened my ability to conduct ANOVA and correlation analyses and present visually attractive data.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Zoya Hayes

2026-02-20