Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_Yatsiv_Viktoriya.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_Yatsiv_Viktoriya

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data (adjust path as needed)
bmd <- readr::read_csv("/Users/vikya/OneDrive - University at Albany - SUNY/Documents/553 Principles of Statistical Inference II/Assignments/Assignment 1 ANOVA Correlation/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

The original dataset contained 2,898 observations from the NHANES 2017-2018 DEXA sample. After removing observations with missing values for bone mineral density or ethnicity, the final sample contains 2,286 participants. A total of 612 observations were excluded due to missing data on these key variables.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅
Hₐ: At least one μᵢ ≠ μⱼ

Plain language:

H₀: All ethnic groups have equal mean BMD. here is no difference in mean bone mineral density across the five ethnicity groups in the NHANES 2017-2018 DEXA sample.
Hₐ: At least one ethnicity group has a different mean bone mineral density compared to the other groups.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3, col.names = c("Ethnicity", "N", "Mean BMD (g/cm²)", "SD"),
        caption = "Descriptive Statistics: BMD by Ethnicity")

Descriptive Statistics: BMD by Ethnicity
Ethnicity	N	Mean BMD (g/cm²)	SD
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: The boxplot shows that Non-Hispanic Black group has the highest median BMD (~0.97 g/cm²), while Non-Hispanic White participants have the lowest median BMD (~0.89 g/cm²).All groups show similar variability and some outliers.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

The one-way ANOVA shows a statistically significant difference in mean bone mineral density across ethnicity groups (F(4, 2281) = 30.18, p < 0.001). We reject the null hypothesis.This means that at least one ethnicity group has significantly different mean BMD.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

The Q-Q plot shows that residuals mostly follow the diagonal line. The Residuals vs Fitted plot shows random scatter with no particular patterns. Levene’s test yields F = 1.57, p = 0.179 (p > 0.05), therefore we fail to reject the null hypothesis of equal variances.The homogeneity of variance assumption is satisfied. There is a large sample (n=2286) and balanced groups. ANOVA is robust to minor violations.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# Create table from Tukey results
tuk_df <- tuk$ethnicity %>%
  as.data.frame() %>%
  rownames_to_column("Comparison") %>%
  arrange(`p adj`) %>%
  mutate(
    Significant = ifelse(`p adj` < 0.05, "Yes", "No")
  )

# Display all comparisons
tuk_df %>%
  kable(digits = 4, caption = "Tukey HSD Results")

Tukey HSD Results
Comparison	diff	lwr	upr	p adj	Significant
Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000	Yes
Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000	Yes
Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000	Yes
Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000	Yes
Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002	Yes
Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011	Yes
Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722	No
Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277	No
Other-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719	No
Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978	No

# Show only significant comparisons
tuk_df %>%
  filter(`p adj` < 0.05) %>%
  kable(digits = 4, caption = "Significant Pairwise Differences")

Significant Pairwise Differences
Comparison	diff	lwr	upr	p adj	Significant
Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000	Yes
Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000	Yes
Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000	Yes
Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000	Yes
Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002	Yes
Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011	Yes

Write a short paragraph: “The groups that differed were …”

The groups that differed were Non-Hispanic Black, Non-Hispanic White, Mexican American, Other Hispanic, and Other. Non-Hispanic Black participants had much higher BMD compared to Non-Hispanic White (+0.085 g/cm², p < 0.001) and Other (+0.076 g/cm², p < 0.001). Non-Hispanic White had much lower BMD compared to Mexican American (-0.062 g/cm², p < 0.001) and Other Hispanic (-0.058 g/cm², p < 0.001). The Other group also differed significantly from Mexican American (-0.053 g/cm², p < 0.001) and Other Hispanic (-0.049 g/cm², p = 0.001).

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

anova_tbl <- broom::tidy(fit_aov)

ss_between <- anova_tbl$sumsq[1]
ss_total <- sum(anova_tbl$sumsq)
eta_squared <- ss_between / ss_total
data.frame(
  Measure = c("SS Between", "SS Total", "Eta-squared", "% Variance"),
  Value = c(round(ss_between, 3), round(ss_total, 3), 
            round(eta_squared, 4), paste0(round(eta_squared*100, 2), "%"))
) %>%
  kable(caption = "Effect Size")

Effect Size
Measure	Value
SS Between	2.948
SS Total	58.646
Eta-squared	0.0503
% Variance	5.03%

Interpret in 1–2 sentences:

The eta-squared value is 0.050. This indicates that ethnicity explains 5.0% of the variance in bone mineral density.This is a small to medium effect.While statistically significant, ethnicity accounts for a modest portion of BMD variance. Therefore it suggests that there are other factors having an effect.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

summary(bmd2[, c("calcium_total", "vitd_total")])

##  calcium_total      vitd_total      
##  Min.   :  55.0   Min.   :   0.000  
##  1st Qu.: 596.8   1st Qu.:   2.900  
##  Median : 895.5   Median :   8.533  
##  Mean   : 997.6   Mean   :  28.676  
##  3rd Qu.:1285.0   3rd Qu.:  30.400  
##  Max.   :5399.0   Max.   :2574.650  
##  NA's   :293      NA's   :293

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

p1 <- ggplot(bmd2, aes(x = BMXBMI)) + 
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(title = "BMI", x = "BMI") +
  theme_minimal()

p2 <- ggplot(bmd2, aes(x = calcium_total)) + 
  geom_histogram(bins = 30, fill = "darkgreen") +
  labs(title = "Total Calcium", x = "Calcium (mg/day)") +
  theme_minimal()

p3 <- ggplot(bmd2, aes(x = vitd_total)) + 
  geom_histogram(bins = 30, fill = "red") +
  labs(title = "Total Vitamin D", x = "Vitamin D (mcg/day)") +
  theme_minimal()

p4 <- ggplot(bmd2, aes(x = totmet)) + 
  geom_histogram(bins = 30, fill = "purple") +
  labs(title = "Physical Activity", x = "MET-min/week") +
  theme_minimal()

library(patchwork)
(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
  log_bmi = log(BMXBMI),
  log_calcium_total = log(calcium_total + 1),
  log_vitd_total = log(vitd_total + 1),
  sqrt_totmet = sqrt(totmet)
  )
p1t <- ggplot(bmd2, aes(x = log_bmi)) + 
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(title = "log(BMI)") + theme_minimal()

p2t <- ggplot(bmd2, aes(x = log_calcium_total)) + 
  geom_histogram(bins = 30, fill = "darkgreen") +
  labs(title = "log(Calcium+1)") + theme_minimal()

p3t <- ggplot(bmd2, aes(x = log_vitd_total)) + 
  geom_histogram(bins = 30, fill = "red") +
  labs(title = "log(VitD+1)") + theme_minimal()

p4t <- ggplot(bmd2, aes(x = sqrt_totmet)) + 
  geom_histogram(bins = 30, fill = "purple") +
  labs(title = "sqrt(MET)") + theme_minimal()

(p1t | p2t) / (p3t | p4t)

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

I applied transformations to improve normality for correlation analysis. Total calcium and vitamin D showed right skew with outliers.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
  corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd2, "log_vitd_total", "DXXOFBMD")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = format.pval(p_value, digits = 3)
  )

tableA %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.232	<2e-16
log_bmi	DXXOFBMD	2275	0.425	<2e-16
log_calcium_total	DXXOFBMD	2129	0.012	0.5919
log_vitd_total	DXXOFBMD	2129	-0.070	0.0013

Interpret the results:

Age shows a moderate negative correlation with BMD (r = -0.232, p < 0.001), consistent with age-related bone loss. Log-BMI demonstrates a moderate positive correlation (r = 0.425, p < 0.001). This suggests higher body mass is associated with greater bone density.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
  corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd2, "log_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = format.pval(p_value, digits = 3)
  )
tableB %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.097	1.96e-05
log_bmi	sqrt_totmet	1912	0.001	0.951153
log_calcium_total	sqrt_totmet	1777	0.086	0.000282
log_vitd_total	sqrt_totmet	1777	-0.018	0.437819

Interpret the results:

Age shows a weak negative correlation with physical activity (r = -0.097, p < 0.001), indicating older adults are less active. BMI showed no significant correlation with activity (r = 0.001, p = 0.95). Log-calcium showed a weak positive correlation (r = 0.086, p < 0.001). This suggests that more active individuals may have better overall nutrition. ### Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "log_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)

tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = format.pval(p_value, digits = 3)
  )

 tableC %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	log_bmi	2834	-0.098	1.83e-07
RIDAGEYR	log_calcium_total	2605	0.048	0.0135
RIDAGEYR	log_vitd_total	2605	0.190	< 2e-16
log_bmi	log_calcium_total	2569	0.000	0.9831
log_bmi	log_vitd_total	2569	-0.008	0.7005
log_calcium_total	log_vitd_total	2605	0.445	< 2e-16

Interpret the results:

The predictor correlation matrix reveals weak to moderate correlations among variables. The strongest correlation is between calcium and vitamin D (r = 0.445, p < 0.001). All other correlations are very weak (|r| < 0.2). These predictors capture mostly independent aspects of health and demographics

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

ANOVA compares means across categorical groups to test whether groups differ, while correlation measures the strength and direction of linear relationships between continuous variables. ANOVA tells us which groups differ through post-hoc tests, while correlation quantifies association strength through r and variance explained through r². For example, “Do mean blood pressure levels differ between normal weight, overweight, and obese individuals?” requires ANOVA because BMI category is categorical. “What is the relationship between daily sodium intake and blood pressure?” requires correlation because both are continuous.

Normality was the most challenging assumption. Nutritional variables showed extreme right skew. This could mean that many people take no supplements while some take very high doses. Physical activity required square root transformation due to many zeros.Assumption violations can reduce statistical power and produce incorrect p-values for correlations. However, the large size of this sample helped with these issues. The biggest limitation is the inability to establish causation. It’s seen through no calcium-BMD correlation and negative vitamin D correlation

Creating Table C with all pairwise correlations was most difficult. First I tried creating manually each correlation. But then I used combn() and map_dfr(). This helped generate combinations of correlations and then mapping the helper function. Transformation and missing data handling were my primary target skills that i solidified during this assignment.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Viktoriya Yatsiv

2026-02-20

Homework 1 overview

File organization and naming (required)

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Optional: Visualization examples

Part 3 — Reflection (10 points)

Submission checklist