Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 1/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

The total sample size is N=2898. After removing 612 samples due to missing data in the variables DXXOFBMD and ethnicity, the analytic sample size is N=2286.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ1 = μ2 = μ3 = μ4 = μ5
Hₐ: not all population means are equal

Plain language:

H₀: There is no difference in mean BMD between race/ethnicity groups.
Hₐ: There is a difference in mean BMD between race/ethnicity groups.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

The race/ethnicity groups have similar variances, but Non-Hispanic white has the lowest median BMD while Non-Hispanic black has the highest median BMD.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here: Based on the results of the ANOVA, I will reject H0 and conclude that not all group means are equal. Thus, not all race/ethnicity groups have the same mean BMD.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences: The Q-Q residuals plot indicates that the data are normally distributed; points follow the red line with almost no deviation. The results of Levene’s Test were F = 1.5728, p = 0.1788. Thus, I do not reject the null hypothesis and assume homogeneity of variance.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
broom::tidy(tuk) %>%
  kable(digits = 4)

term	contrast	estimate	conf.low	conf.high	adj.p.value
ethnicity	Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978
ethnicity	Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000
ethnicity	Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277
ethnicity	Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002
ethnicity	Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000
ethnicity	Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722
ethnicity	Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011
ethnicity	Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000
ethnicity	Other-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719
ethnicity	Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000

# Hint: Convert to data frame, add rownames as column, filter for significance

Write a short paragraph: “The groups that differed were …”

The groups that have significantly different mean BMD are: Non-Hispanic White - Mexican American Other - Mexican American Non-Hispanic White - Non-Hispanic Black Other - Non-Hispanic Black
Other Hispanic - Non-Hispanic White Other Hispanic - Other

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

# YOUR CODE HERE to compute eta-squared
anova_summary <- summary(fit_aov)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

# Hint: Extract SS for ethnicity and Residuals, then calculate proportion

Interpret in 1–2 sentences:

The effect size is 0.0503, meaning 5.03% of the variance in BMD is explained by race/ethnicity. According to Cohen’s guidelines, this is a small effect size, so race/ethnicity is a not a major predictor of BMD. It is likely there are other variables that contribute more to BMD.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# YOUR CODE HERE: Create histograms for key continuous variables
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
library(patchwork)
ggplot(bmd2, aes(x = BMXBMI))+geom_histogram(bins=30) +
  labs(
    title = "Histogram of BMI",
    subtitle = "NHANES Data",
    x = "BMI (kg/m2)",
    y = "Frequency",
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

ggplot(bmd2, aes(x = calcium_total))+geom_histogram(bins=30) +
  labs(
    title = "Histogram of Total Calcium",
    subtitle = "NHANES Data",
    x = "Total Calcium (mg/day)",
    y = "Frequency",
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

ggplot(bmd2, aes(x = vitd_total))+geom_histogram(bins=30) +
  labs(
    title = "Histogram of Total Vitamin D",
    subtitle = "NHANES Data",
    x = "Total Vitamin D (mcg/day)",
    y = "Frequency",
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

ggplot(bmd2, aes(x = totmet))+geom_histogram(bins=30) +
  labs(
    title = "Histogram of Total Physical Activity",
    subtitle = "NHANES Data",
    x = "Total Physical Activity (MET-min/week)",
    y = "Frequency",
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

# ...
# (p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    # Add your transformation code here
    # Examples (modify as needed based on your assessment):
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total),
sqrt_totmet = sqrt(totmet),
sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

I applied transformation to all these variables as they all have a substantial positive skew, as well as some outliers to the right of the median.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# YOUR CODE HERE
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

# Example structure:
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
# 
tableA %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.00000
log_bmi	DXXOFBMD	2275	0.425	0.00000
log_calcium_total	DXXOFBMD	2129	0.012	0.59200
sqrt_vitd_total	DXXOFBMD	2129	-0.062	0.00426

Interpret the results: Age, BMI, and total vitamin D intake showed statistically significant correlations with BMD.

Age and BMD exhibited a weak negative correlation (r=-0.232, p = 0.000, n = 2286) BMI and BMD exhibited a moderate positive correlation (r=0.425, p = 0.000, n = 2275) Vitamin D and BMD exhibited a weak negative correlation (r=-0.062, p = 0.004, n=2129)

Calcium and BMD exhibited no correlation (r=0.012, p = 0.592, n=2129)

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# YOUR CODE HERE

tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 2)
)
# 
tableB %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.097	0.00002
log_bmi	sqrt_totmet	1912	0.001	0.95000
log_calcium_total	sqrt_totmet	1777	0.086	0.00028
sqrt_vitd_total	sqrt_totmet	1777	-0.002	0.93000

# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variable

Interpret the results:

Age and Total calcium intake are associated with physical activity level.

Age exhibits a weak negative correlation with physical activity level (r=-0.097, p=0.000, n=1934) Total calcium intake exhibits a weak positive correlation with physical activity level (r=0.086, p=0.000, n=1777)

BMI was not associated with physical activity level (r=0.001, p=0.950, n=1912) Vitamin D intake was not associated with physical activity level (r=-0.002, p=0.930, n=1777)

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

# Example structure:
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 2)
)
# 
tableC %>% kable()

x	y	n	estimate	p_value
RIDAGEYR	log_bmi	2834	-0.098	2.0e-07
RIDAGEYR	log_calcium_total	2605	0.048	1.4e-02
RIDAGEYR	sqrt_vitd_total	2605	0.153	0.0e+00
log_bmi	log_calcium_total	2569	0.000	9.8e-01
log_bmi	sqrt_vitd_total	2569	0.007	7.3e-01
log_calcium_total	sqrt_vitd_total	2605	0.314	0.0e+00

Interpret the results:

There are some statistically significant correlations between Age-BMI (n=2834, r=-0.098, p=2.0e-07), Age-Calcium (n=2605, r=0.048, p=1.4e-02), Age-VitaminD (n=2605, r=0.153, p=0.0e+00), and Calcium-VitaminD (n=2605, r=0.314, p=0.0e+00). The correlations between BMI-Calcium (n=2569, r=0.000, p=9.8e-01) and BMI-VitaminD (n=2569, r=0.007, p=7.3e-01) are not statistically significant. However, even among the statistically significant results, there are no strong correlations between these predictors. Only Calcium-VitaminD exhibits a moderate positive correlation (r=0.314)

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

I would use ANOVA when the predictor variable is categorical with more than two levels and the outcome variable is continuous. I would use correlation when both variables are continuous. ANOVA tells us if there are significant differences in the dependent variable between different levels of the independent variable. Correlation tells us if the independent and dependent variable exhibit a linear relationship. An example of a research question best suited for ANOVA is “Is there a significant difference in mean systolic blood pressure between different income categories (low, medium, high)?” A research question best suited for correlation is “Is there an association between white blood cell count and BMI?”

The assumption most challenging to meet was the assumption of normality for the correlation analyses, as the key variables BMI, Total calcium, Total Vitamin D, and Total physical activity exhibited positively-skewed distributions. Violations of assumptions for a statistical test might mean that the test is not giving us an accurate picture of the true relationship in the population, and lead to a Type I or Type II error. Cross sectional correlation analyses are limited because a cross sectional study only provides a snapshot of a moment in time. Therefore, if we find an association between variable A and variable B, we do not know which variable is leading to / causing the other, or if an unknown Variable C is actually responsible for the observed relationship.

The most difficult part of the assignment for me was creating the histograms of BMI, Calcium, Vitamin D, and total physical activity. I solved it by using the ggplot code from a previous assignment and combining it with the provided code, making sure the dataset, variables, and labels were updated to this assignment. In solving this problem, I strengthened my ability to modify previously written code to complete a new task. This is important so as to not “reinvent the wheel” for every new coding task.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Josh Macera

2026-02-20