Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data 
bmd <- readr::read_csv("/Users/sarah/OneDrive/Documents/EPI 553/data/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(
      RIDRETH1,
    levels= c(1,2,3,4,5),
    labels = c("Mexican American", "Other Hispanic", "Non-Hispanic White", "Non-Hispanic Black", "Other")
  ),
    RIAGENDR = factor(
      RIAGENDR,
      levels= c (1,2),
      labels = c ("Male", "Female")
    ),
  smoker = factor(
    smoker,
    levels = c(1,2,3),
    labels = c("Current", "Past", "Never")
  )
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
total_n <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(RIDRETH1))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
total_n

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

The total sample size was 2898. When observations were removed due to missing data, the final analytic sample size is 2286. This new final analytic sample size will be used for the ANOVA analysis.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ_1 = μ_2 = μ_3 = μ_4 = μ_5
Hₐ: At least one population mean differs from the others. Plain language:
H₀: Bone mineral density is the same across all ethnicity groups.
Hₐ: At least one ethnicity group has a different average bone mineral density. This indicates the bone health is not the same across all ethinic groups

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(RIDRETH1) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

RIDRETH1	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = RIDRETH1, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

There are clear differences in bone mineral density across ethnicity groups. Non-Hispanic Black adults have the highest BMD, with a noticeably higher median than all other groups. Both Mexican American and Other Hispanic adults show intermediate BMD levels, slightly higher than Non-Hispanic White.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ RIDRETH1, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## RIDRETH1       4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
RIDRETH1	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here: F-statistic = 30.18 p-value = <2e-16 Since the p-value is < .05, we reject the null hypothesis. There is statistically significant evidence that bone mineral density differs across at least one pair of ethnic groups.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  
plot(fit_aov, which = 2)

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ RIDRETH1, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

The Q-Q plot shows that the residuals fall close to the reference line, with only minor deviations at the tails. This indicates that the normality assumption is reasonably satisfied. Levene’s test for homogeneity of variance was not statistically significant (F value= 1.5728,p=0.1788).

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tukey_results <- TukeyHSD(fit_aov)
tukey_results

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ RIDRETH1, data = anova_df)
## 
## $RIDRETH1
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
tukey_summary <- as.data.frame(tukey_results$RIDRETH1)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")

Tukey HSD post-hoc comparisons with 95% confidence intervals
	Comparison	diff	lwr	upr	p adj	Significant	Direction
Other Hispanic-Mexican American	Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978	No	No difference
Non-Hispanic White-Mexican American	Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000	Yes	Second group higher
Non-Hispanic Black-Mexican American	Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277	No	No difference
Other-Mexican American	Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002	Yes	Second group higher
Non-Hispanic White-Other Hispanic	Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000	Yes	Second group higher
Non-Hispanic Black-Other Hispanic	Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722	No	No difference
Other-Other Hispanic	Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011	Yes	Second group higher
Non-Hispanic Black-Non-Hispanic White	Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000	Yes	First group higher
Other-Non-Hispanic White	Other-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719	No	No difference
Other-Non-Hispanic Black	Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000	Yes	Second group higher

Interpretation

Non-Hispanic White vs. Mexican American: Mexican American individuals had a higher mean BMD than Non-Hispanic White individuals (p= 2.567750e-07).

Other vs. Mexican American: Mexican American ethnicity had a higher mean BMD than the “Other” ethnic category (p =1.919211e-04).

Non-Hispanic White vs. Other Hispanic: The “Other Hispanic” group had a higher mean BMD than the Non-hispanic white group (p= 3.606571e-06).

Other vs. Other Hispanic: The “Other Hispanic” ethnic group had a higher mean BMD than the “Other” ethnic category (p= 1.071982e-03).

Non-Hispanic Black vs.Non-Hispanic White: Non-Hispanic Black individuals had a hgiher mean BMD than Non-Hispanic WHite individuals (p= 3.935607e-11).

Other vs.Non-Hispanic Black: Non-Hispanic Black individuals had a higher mean BMD level than the “Other” ethnic category (p=4.175826e-11). ## 1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Compute eta-squared
anova_table <- summary(fit_aov)

ss_treatment <- anova_table[[1]]$`Sum Sq`[1]
ss_total <- sum(anova_table[[1]]$`Sum Sq`)
eta_squared <- ss_treatment / ss_total
eta_squared

## [1] 0.05027196

# Hint: Extract SS for ethnicity and Residuals, then calculate proportion

Interpret in 1–2 sentences: The calculated effect size is η²= 0.0502. This indicates that ethnicity explains 5.02% of the variance in bone mineral density (BMD). According to Cohen’s guidelines (small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14), this represents a small effect size, suggesting that while ethnicity contributes to differences in BMD, other factors account for most of the variability.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# Create histograms for key continuous variables

ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30) + labs(title = "Histogram of BMI", x = "BMI", y = "Count")

ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30) + labs(title = "Histogram of Total Calcium Intake", x= "Calcium Intake (mg/day)", y = "Count")

ggplot(bmd2, aes(x = vitd_total)) + geom_histogram(bins = 30) + labs(title = "Histogram of Total Vitamin D Intake", x = "Vitamin D Intake (mcg/day)", y= "Count")

ggplot(bmd2, aes(x=totmet)) + geom_histogram(bins=30) + labs(title = "Histogram of Physical Activity", x = "Physical Activity (min/wk)", y = "Count")

# (p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
 
  log_bmi = log(BMXBMI),
  log_calcium_total = log(calcium_total),
  sqrt_totmet = sqrt(totmet),
  sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning: Based on the visual inspection of the histograms, I applied log transformations to BMI and total calcium intake, as they were right-skewed without zero values. I applied square root transformations to total vitamin D intake and physical activity since they were also right-skewed but included zero values. These transformations help to reduce skewness and allows the variables to be more suitable for correlation analysis.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# Table A
tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "log_bmi","DXXOFBMD" ),
  corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
)


tableA %>% kable(digits = 4,
        caption = "Table A: Correlations Between Predictors and BMD")

Table A: Correlations Between Predictors and BMD
x	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.2324	0.0000
log_bmi	DXXOFBMD	2275	0.4247	0.0000
log_calcium_total	DXXOFBMD	2129	0.0116	0.5916
sqrt_vitd_total	DXXOFBMD	2129	-0.0619	0.0043

Interpret the results:

Age (RIDAGEYR), log transformed BMI, and square-root transformed total vitamin D intake all show statistically significant correlations with BMD. Age and vitamin D both have weak negative correlations to BMD. Log transformed BMI shows a moderate positive correlation. Log transformed total calcium intake does not have a statistically significant relationship with BMD.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd2, "log_bmi","sqrt_totmet" ),
  corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
)


tableB %>% kable(digits = 4,
        caption = "Table B: Correlations Between Potential Confounders and Physical Activity")

Table B: Correlations Between Potential Confounders and Physical Activity
x	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.0969	0.0000
log_bmi	sqrt_totmet	1912	0.0014	0.9512
log_calcium_total	sqrt_totmet	1777	0.0860	0.0003
sqrt_vitd_total	sqrt_totmet	1777	-0.0020	0.9322

Interpret the results:

Age (RIDAGEYR) and log transformed calcium levels both show statistically significant correlation to physical activity levels, as there p-values are below 0.05. Age has a weak negative correlation to physical activity, and calcium intake has a weak positive correlation. Log transformed BMI and square root transformed vitamin D do not show significant correlations to physical activity.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
TableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
     estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 TableC %>% kable(digits= 4, caption = "Table C: Correlation between Predictors")

Table C: Correlation between Predictors
x	y	n	estimate	p_value
RIDAGEYR	log_bmi	2834	-0.098	0.0000
RIDAGEYR	log_calcium_total	2605	0.048	0.0136
RIDAGEYR	sqrt_vitd_total	2605	0.153	0.0000
log_bmi	log_calcium_total	2569	0.000	0.9810
log_bmi	sqrt_vitd_total	2569	0.007	0.7310
log_calcium_total	sqrt_vitd_total	2605	0.314	0.0000

Interpret the results:

According the Table C, there are no strong correlations among the predictors. Age has a weak negative correlation with log_bmi and a weak positive correlation with log_calcium_total and sqrt_vitd_total. The strongest correlation is between log_calcium_total and sqrt_vitd_total where r= 0.314, which is a moderate positve correlation.

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

ANOVA and correlation address different types of epidemiologic questions. ANVOA is used when comparing three or more group means, where there are categorical predictors and continuous outcomes. Correlation on the other hand, assesses the strength and direction of linear relationships between two continuous variables. ANOVA answers the question of “Are these groups different?” whereas correlation answers “Do these variables move together?”. A research question for ANOVA would be “Does BMD differ across levels of physical activity?” A question more suited for correlation would be “Is there an association between age and BMD?”

The most challenging assumptions in this assignment were normality and linearity, since several variables were right-skewed and required transformations. Violations of these assumptions would weaken the validity of p-values and affect the estimates, resulting in incorrect conclusions about the association.

The difficult part of the code was troubleshooting the correlation tables and understanding how corr_pair() handled variable names. I had errors within my code with unquoted column names when I created the tables. I wasn’t sure why I continued to get an error message saying a specific variable didn’t exist, even after double checking. To solve this, I put the variable names in quotations to see if the problem would be resolved. I strengthened my ability to interpret correlation tables and assess potential multicollinearity.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

SARAH ANDRES

2026-02-20

Homework 1 overview

File organization and naming (required)

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

The total sample size was 2898. When observations were removed due to missing data, the final analytic sample size is 2286. This new final analytic sample size will be used for the ANOVA analysis.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Part 3 — Reflection (10 points)

Submission checklist