Homework 1 overview

Due: Friday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

The total sample size of the data is 2898. After removing 612 observations due to missing data, the final sample size for the ANOVA analysis is 2,286 adults.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀:μ_MexicanAmerican = μ_OtherHispanic = μ_Non-Hispanic White = μ_Non-Hispanic Black = μ_Other (All five population means are equal)
Hₐ: At least one population mean differs from the others

Plain language:

H₀: All the mean population mean is the same.
Hₐ: At least one population mean differs from the others

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(bmd, 
  aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.25, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Bone Mineral Density by Ethnicity Category",
    subtitle = "NHANES Data, Adults aged 50-80",
    x = "Ethnicity Category",
    y = "Bone mineral density (g/cm²)",
    fill = "Ethnicity Category"
  ) +
  #theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

Non-Hispanic Black appear to have a higher BMD compared to the other groups with Non-Hispanic White having the lowest BMD across groups.
Mexican American and Other Hispanic have the close means for BMD and appear to be similar.have a higher BMD compared to the other groups with Non-Hispanic White having the lowest BMD across groups.
Variability (box heights) looks similar across groups.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

levene_test <- car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

[YOUR INTERPRETATION: Comment on whether residuals appear normally distributed based on the QQ plot, and whether variances appear equal based on Levene’s test. State whether assumptions are met or violated.]

Diagnostic Plot Interpretation:

Residuals vs Fitted: Points show random scatter around zero with no clear pattern → Good!
Q-Q Plot: Points follow the diagonal line reasonably well → Normality assumption is reasonable
Scale-Location: Red line is relatively flat → Equal variance assumption is reasonable
Residuals vs Leverage: No points beyond Cook’s distance lines → No highly influential outliers

Levene’s Test Interpretation:

p-value: 0.179
If p < 0.05, we would fail to reject equal variances
Here: Equal variance assumption is met

Overall Assessment: With n > 2000, ANOVA is robust to minor violations. Our assumptions are reasonably satisfied.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Table and Interpretation:

Comparison	Mean Diff	95% CI	p-value	Significant?
Other Hispanic - Mexican American	0	[-0.04, 0.03]	0.998	No
Non-Hispanic White - Mexican American	-0.06	[-0.09, -0.03]	< 0.001	Yes
Non-Hispanic Black - Mexican American	0.02	[-0.01, 0.05]	0.328	No
Other - Mexican American	-0.05	[-0.09, -0.02]	< 0.001	Yes
Non-Hispanic White - Other Hispanic	-0.06	[-0.09, -0.03]	< 0.001	Yes
Non-Hispanic Black - Other Hispanic	0.03	[-0.01, 0.06]	0.172	No
Other - Other Hispanic	-0.05	[-0.08, -0.01]	< 0.001	Yes
Non-Hispanic Black - Non-Hispanic White	0.08	[0.06, 0.11]	< 0.001	Yes
Other - Non-Hispanic White	0.01	[-0.02, 0.03]	0.872	No
Other - Non-Hispanic Black	-0.08	[-0.1, -0.05]	< 0.001	Yes

Conclusion: Not all 5 pairwise comparisons are statistically significant. The groups that were statistically significant were Non-Hispanic White compared to Mexican American, Other compared to Mexican American, Non-Hispanic White compared to Other Hispanic,Other compared to Other Hispanic, Non-Hispanic Black compared to Non-Hispanic White, and Other compared to Non-Hispanic Black.

Non-Hispanic WHite, Non-Hispanic Black and Other all differed from Hispanic and/or Other Hispanic.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

# YOUR CODE HERE to compute eta-squared
# Hint: Extract SS for ethnicity and Residuals, then calculate proportion

# Extract sum of squares from ANOVA table

ss_treatment_hw <- anova_tbl$sumsq [1]
ss_total_hw <- sum(anova_tbl$sumsq)

# Calculate eta-squared
eta_squared_hw <- ss_treatment_hw / ss_total_hw

cat("Eta-squared (η²):", round(eta_squared_hw, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared_hw * 100, 2), "%")

## Percentage of variance explained: 5.03 %

cat("Percentage of variance explained:", round(eta_squared_hw * 100, 2), "%")

## Percentage of variance explained: 5.03 %

Interpret in 1–2 sentences:

[YOUR INTERPRETATION: What proportion of variance in BMD is explained by ethnicity? Use Cohen’s guidelines: small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14]

Interpretation: Ethnicity category explains 5.03% of the variance in Bone Mean Density.

Our calculated effect size is η² = 0.0503. This means Ethnicty categories explains approximately 5.03% of the variance in BMD.

To interpret this effect size, we use conventional guidelines: η² = 0.01 (small), η² = 0.06 (medium), and η² = 0.14 (large). In our case, the effect is small.

While the F-test indicates a statistically significant difference across activity groups, the practical magnitude of this effect is modest. Other factors not included in our model—such as genetics, age, gender, calcium intake, and hormonal status—likely play substantial roles in determining bone mineral density.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# YOUR CODE HERE: Create histograms for key continuous variables
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
library(patchwork)
p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
p3 <- ggplot(bmd2, aes(x = vitd_total)) + geom_histogram(bins = 30)
p4 <- ggplot(bmd2, aes(x = totmet)) + geom_histogram(bins = 30)

(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    # Add your transformation code here
    # Examples (modify as needed based on your assessment):
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

p5 <- ggplot(bmd2, aes(x = log_bmi)) + geom_histogram(bins = 30)
p6 <- ggplot(bmd2, aes(x = log_calcium_total)) + geom_histogram(bins = 30)
p7 <- ggplot(bmd2, aes(x = sqrt_vitd_total)) + geom_histogram(bins = 30)
p8 <- ggplot(bmd2, aes(x = sqrt_totmet)) + geom_histogram(bins = 30)

(p5 | p6) / (p7 | p8)

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

After reviewing the orginal variables, I decided to transform the BMI and total Calcium variables since they were right skewed. After this transfromation the distribution appears to be more normal. I applied the square root transformation to the vitamin D and Total MET varaibles because it slightly helped with skewness and accounting for some of the 0s.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# YOUR CODE HERE
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

tableA <- bind_rows(
   corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
   corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
   corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
   corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
 ) %>%
   mutate(
     estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableA %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.00000
log_bmi	DXXOFBMD	2275	0.425	0.00000
log_calcium_total	DXXOFBMD	2129	0.012	0.59200
sqrt_vitd_total	DXXOFBMD	2129	-0.062	0.00426

Interpret the results:

[YOUR INTERPRETATION: Which variables show statistically significant correlations with BMD? What is the direction and strength of these relationships?]

Age and Vitamin D Total show a startically significant negative relationship with bone mineral density, while BMI shows a significant positive correlation with bone mineral density.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# YOUR CODE HERE
# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variable

tableB <- bind_rows(
   corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
   corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
   corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
   corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
 ) %>%
   mutate(
     estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableB %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.097	1.96e-05
log_bmi	sqrt_totmet	1912	0.001	9.51e-01
log_calcium_total	sqrt_totmet	1777	0.086	2.82e-04
sqrt_vitd_total	sqrt_totmet	1777	-0.002	9.32e-01

Interpret the results:

[YOUR INTERPRETATION: Which variables are associated with physical activity levels?]

All four variables (AGE, BMI, Calcium Total and Vitamn D) are statistically signficant with physical activity levels.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

# Example structure:
 preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
 pairs <- combn(preds, 2, simplify = FALSE)
 tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
   mutate(
     estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableC %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	log_bmi	2834	-0.098	2.00e-07
RIDAGEYR	log_calcium_total	2605	0.048	1.36e-02
RIDAGEYR	sqrt_vitd_total	2605	0.153	0.00e+00
log_bmi	log_calcium_total	2569	0.000	9.81e-01
log_bmi	sqrt_vitd_total	2569	0.007	7.31e-01
log_calcium_total	sqrt_vitd_total	2605	0.314	0.00e+00

Interpret the results:

Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses?]

There are strong correlations between age and BMI, Calcium Total, and Vitamin D. There is no correlation between BMI and Calcium total and Vitamin D total. There is a strong correlation between calcium total and vitamin D total.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

I would use correlation if I was analyzing two continuous or ordinal variables, interested in knowing the strength and direction of a linear relationship and want to explore the data before doing a regression. . While ANOVA would be used if I had 1 categorical predictor and a continous outcome, 3 or more groups means, and the observations are independent of each other.

Performing an ANOVA statistic would assess if there are difference in the group means, while correlation is used to describe the relastionship but not the cause.

One example of a research question better suited for ANOVA would be: Does BMI differ between across different regions or counties in New York State. One example of when to use correlation would be if I was interested in understanding if there is a relationship of hours of physial activity affect BMI among New York State residents.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

Addrssing Normality was the most challenging to meet in this assignment, we would want are data to be normally distributed and in order to do so we had to transform some of the variables so that they would fit better and be more normally distributed.

Violations of assumptions can skew the results and either make it seem that there is a relationship or not. Outliers can also distort the conclusions as well as not having a large enough sample size. Even though not assumptions were met, due to our large sample size we can make the assumption that the distribution is normal.

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

The most difficult part of the R coding for this assignment was getting some of my ANOVA plots to run because the plots were to large. The document would knit but for some reason my code was not running as it should so made it challenging to intrepret the findings. Another challenge was creating the Tukey’s table, took me some time to really understand the code and understand how to set it up correctly so right values fell with the right variables.

Ended up being some triak and error and playing around with the code to see what the problem was generating from,

An R skill that I strengthen while completing this assignment was defenitely the table codes. I was able to troubleshoot and figure out what the issue was and why it was happening. Moving forward I will continue to work on my final presentation of tables and visuals.

YOUR REFLECTION HERE (200–400 words)

[Write your reflection addressing all three components above]

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Arielle Coq

2026-02-20

Homework 1 overview

File organization and naming (required)

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Optional: Visualization examples

Part 3 — Reflection (10 points)

Submission checklist