Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(patchwork)
library(car)
library(kableExtra)
library(broom)

# Import data (adjust path as needed)
bmd <- readr::read_csv("bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.]

The initial dataset included 2,898 participants. After excluding observations with missing values for bone mineral density or ethnicity, 612 observations were removed. The final analytic sample for the ANOVA analysis consisted of 2,286 participants with complete data.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ (Non-Hispanic White) = μ (Non-Hispanic Black) = μ (Other) = μ (Mexican American) = μ (Other Hispanic)
Hₐ: At least one group mean differs

Plain language:

H₀: The mean BMD is the same across all ethnicity groups
Hₐ: The mean BMD differs for at least one ethnicity group.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

Mean bone mineral density (BMD) appears to differ across ethnicity groups, with Non-Hispanic Black participants generally showing the highest average BMD, followed by Non-Hispanic White, Mexican American and Other Hispanic groups, and the lowest values often observed among individuals in the Other group. Overall, the pattern suggests noticeable variation in BMD across ethnic groups rather than similar averages across all categories.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

We reject H₀. The ANOVA test shows a statistically significant difference in mean BMD across ethnicity groups, F = 30.18, p < 0.001, indicating that at least one ethnic group has a different mean BMD compared to the others.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

The Q-Q plot shows that the data points mostly follow a straight line, which means the residuals look close to normally distributed, with only small differences at the extremes. Levene’s test was not significant (p = 0.1788), suggesting that the variation in BMD is similar across the groups. These results indicate that the assumptions required for ANOVA are reasonably met, so the analysis results can be considered reliable.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

library(dplyr)
library(knitr)
library(kableExtra)

# Run Tukey test
tuk <- TukeyHSD(fit_aov)

# Convert to data frame
tuk_tbl <- as.data.frame(tuk$ethnicity)
tuk_tbl$Comparison <- rownames(tuk_tbl)

# Reorder + format numbers
tuk_tbl <- tuk_tbl %>%
  select(Comparison, diff, lwr, upr, `p adj`) %>%
  mutate(
    diff = round(diff, 3),
    lwr  = round(lwr, 3),
    upr  = round(upr, 3),
    `p adj` = signif(`p adj`, 3)
  )

# Print table
kable(
  tuk_tbl,
  col.names = c("Comparison", "diff", "lwr", "upr", "p adj")
) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE, hline_after = TRUE)

	Comparison	diff	lwr	upr	p adj
Other Hispanic-Mexican American	Other Hispanic-Mexican American	-0.004	-0.043	0.034	9.98e-01
Non-Hispanic White-Mexican American	Non-Hispanic White-Mexican American	-0.062	-0.093	-0.032	3.00e-07
Non-Hispanic Black-Mexican American	Non-Hispanic Black-Mexican American	0.022	-0.010	0.055	3.28e-01
Other-Mexican American	Other-Mexican American	-0.053	-0.087	-0.019	1.92e-04
Non-Hispanic White-Other Hispanic	Non-Hispanic White-Other Hispanic	-0.058	-0.089	-0.027	3.60e-06
Non-Hispanic Black-Other Hispanic	Non-Hispanic Black-Other Hispanic	0.027	-0.006	0.060	1.72e-01
Other-Other Hispanic	Other-Other Hispanic	-0.049	-0.083	-0.014	1.07e-03
Non-Hispanic Black-Non-Hispanic White	Non-Hispanic Black-Non-Hispanic White	0.085	0.061	0.108	0.00e+00
Other-Non-Hispanic White	Other-Non-Hispanic White	0.009	-0.017	0.035	8.72e-01
Other-Non-Hispanic Black	Other-Non-Hispanic Black	-0.076	-0.104	-0.048	0.00e+00

Write a short paragraph: “The groups that differed were …”

The groups that differed in mean BMD were Non-Hispanic White and Mexican American, Other and Mexican American, Non-Hispanic White and Other Hispanic, Other and Other Hispanic, Non-Hispanic Black and Non-Hispanic White, and Other and Non-Hispanic Black. The groups that did not differ significantly in mean BMD were Mexican American vs Other Hispanic, Mexican American vs Non-Hispanic Black, and Other Hispanic vs Non-Hispanic Black, indicating that these groups had statistically similar mean BMD values after adjustment for multiple comparisons.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

# Extract sums of squares
ss_between <- anova_tbl$sumsq[anova_tbl$term == "ethnicity"]
ss_total <- sum(anova_tbl$sumsq)

# Compute eta-squared
eta_sq <- ss_between / ss_total
eta_sq

## [1] 0.05027196

Interpret in 1–2 sentences:

Ethnicity explained approximately 5% of the variance in BMD (η² ≈ 0.05). According to Cohen’s guidelines, this represents a small-to-moderate effect size, indicating that ethnicity accounts for a modest proportion of variability in bone mineral density.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "BMI Distribution")

p2 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(title = "Total Calcium Intake")

p3 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(bins = 30, fill = "plum", color = "black") +
  labs(title = "Total Vitamin D Intake")

p4 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(bins = 30, fill = "orange", color = "black") +
  labs(title = "Total MET")

(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    sqrt_totmet = sqrt(totmet)
  )

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

I applied a square-root transformation to totmet because its histogram showed a right-skewed distribution with some high values. The square-root transformation reduces skewness and minimizes the influence of extreme values, making the variable more suitable for correlation analysis. I did not apply transformations to BMXBMI, calcium_total, or vitd_total because their distributions appeared reasonably symmetric and interpretable on their original scales.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "BMXBMI", "DXXOFBMD"),
  corr_pair(bmd2, "calcium_total", "DXXOFBMD"),
  corr_pair(bmd2, "vitd_total", "DXXOFBMD")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

knitr::kable(
  tableA,
  col.names = c("Variable", "Outcome", "N", "Correlation", "p-value")
) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::row_spec(0, bold = TRUE, hline_after = TRUE)

Variable	Outcome	N	Correlation	p-value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.000
BMXBMI	DXXOFBMD	2275	0.410	0.000
calcium_total	DXXOFBMD	2129	-0.009	0.663
vitd_total	DXXOFBMD	2129	-0.027	0.216

Interpret the results:

Age and BMI showed statistically significant correlations with BMD (p < 0.001). Age was negatively correlated with BMD (r = −0.232), indicating that bone mineral density decreases as age increases; this represents a weak negative relationship. BMI was positively correlated with BMD (r = 0.410), indicating that individuals with higher BMI tend to have higher BMD; this represents a moderate positive relationship. Total calcium intake (r = −0.009, p = 0.663) and total vitamin D intake (r = −0.027, p = 0.216) were not significantly correlated with BMD, indicating no statistically significant linear association.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd2, "BMXBMI", "sqrt_totmet"),
  corr_pair(bmd2, "calcium_total", "sqrt_totmet"),
  corr_pair(bmd2, "vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

knitr::kable(
  tableB,
  col.names = c("Variable", "Outcome", "N", "Correlation", "p-value")
) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::row_spec(0, bold = TRUE, hline_after = TRUE)

Variable	Outcome	N	Correlation	p-value
RIDAGEYR	sqrt_totmet	1934	-0.097	1.96e-05
BMXBMI	sqrt_totmet	1912	-0.007	7.71e-01
calcium_total	sqrt_totmet	1777	0.073	2.18e-03
vitd_total	sqrt_totmet	1777	0.030	2.03e-01

Interpret the results:

Age and calcium intake were related to physical activity, but the relationships were very small. Age had a weak negative relationship with activity (r = −0.097, p < 0.001), meaning older people tended to be slightly less active. Calcium intake had a weak positive relationship with activity (r = 0.073, p = 0.002), meaning people who were more active tended to consume a little more calcium. BMI (r = −0.007, p = 0.771) and vitamin D intake (r = 0.030, p = 0.203) were not related to activity levels, since their p-values show no statistical evidence of a relationship. Overall, all relationships were very weak (r < 0.10), suggesting they are unlikely to be important in real-world terms.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

preds <- c("RIDAGEYR", "BMXBMI", "calcium_total", "vitd_total")

pairs <- combn(preds, 2, simplify = FALSE)

tableC <- purrr::map_dfr(pairs, ~ corr_pair(bmd2, .x[1], .x[2])) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

knitr::kable(
  tableC,
  col.names = c("Variable 1", "Variable 2", "N", "Correlation", "p-value")
) %>%
  kableExtra::kable_styling(full_width = FALSE) %>%
  kableExtra::row_spec(0, bold = TRUE, hline_after = TRUE)

Variable 1	Variable 2	N	Correlation	p-value
RIDAGEYR	BMXBMI	2834	-0.104	0.000000
RIDAGEYR	calcium_total	2605	0.051	0.008690
RIDAGEYR	vitd_total	2605	0.064	0.000998
BMXBMI	calcium_total	2569	-0.019	0.332000
BMXBMI	vitd_total	2569	0.012	0.550000
calcium_total	vitd_total	2605	0.117	0.000000

Interpret the results:

The predictors were not strongly related to each other. The relationships between age, BMI, calcium intake, and vitamin D intake were all very small, meaning the variables mostly vary independently. Because of this, there are no multicollinearity concerns, and these variables should be safe to include together in a regression model.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = BMXBMI, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "BMI",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Total MET minutes",
    y = "BMD (g/cm²)",
    title = "BMD vs Physical Activity"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

A. Comparing Methods (ANOVA vs Correlation)

Correlation is used when examining the strength and direction of a linear relationship between two continuous variables, whereas ANOVA is used to compare mean differences in a continuous outcome across categories of a categorical exposure. Correlation tells us whether variables move together and how strongly, but it does not compare groups or test differences in means. In contrast, ANOVA evaluates whether observed group differences are statistically significant but does not measure linear association. For example, a question such as “Is physical activity associated with bone mineral density?” is best addressed using correlation. A question like “Does mean bone mineral density differ by physical activity category (low, moderate, high)?” would be better suited to ANOVA.

B. Assumptions and Limitations

The most challenging assumptions to meet in this assignment were normality and linearity, particularly for variables like total MET minutes and nutrient intake, which were right-skewed. Violations of these assumptions can lead to biased, inaccurate correlation estimates or inflated significance levels. Additionally, cross-sectional correlation analyses cannot establish causality or temporal relationships. For example, even if higher calcium intake is correlated with higher activity levels, we cannot determine whether diet influences activity, activity influences diet, or both are affected by another factor such as socioeconomic status. Residual confounding is also a limitation because important variables may not have been measured or included.

C. R Programming Challenge

The most difficult part of the coding process was troubleshooting errors related to variable creation and chunk execution order when knitting the R Markdown document. Specifically, I encountered errors when a transformed variable did not exist at the time it was referenced in later analyses. I resolved this by carefully reviewing chunk order, ensuring transformations were created before use, and rerunning all chunks sequentially. This process strengthened my understanding of reproducible workflows and improved my ability to debug tidyverse pipelines and R Markdown execution logic.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Jessica Okoroji

2026-02-20

Homework 1 overview

File organization and naming (required)

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Optional: Visualization examples

Part 3 — Reflection (10 points)

Submission checklist