Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
library(ggplot2)
library(ggstats)
library(readr)

# Import data (adjust path as needed)
bmd <- readr::read_csv("~/R/site-library/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.]

The total sample included 2,898 participants from the dataset. For the ANOVA analysis, 612 observations were excluded due to missing data on key variables, including sex, ethnicity, and smoking status. After removing these incomplete cases, the final analytic sample consisted of 2,286 participants with complete data on all variables required for the analysis

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

# Fit the one-way ANOVA model
anova_model <- aov(DXXOFBMD ~ ethnicity, data = bmd)

# Display the ANOVA table
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 612 observations deleted due to missingness

Research question: Do mean BMD values differ across ethnicity groups?

Yes the mean BMD values differ significantly across ethnicity groups.

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: mu_1 = mu_2 = mu_3 = mu_4 = mu_5
Hₐ: At least one mu differs

Plain language:

H₀: The mean bone mineral density (BMD) is the same across all ethnicity groups.
Hₐ: At least one ethnicity group has a different mean BMD compared to the others.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

[BMD appears to vary across ethnicity groups, with Non-Hispanic Black participants showing the highest median BMD and Non-Hispanic White participants showing the lowest median BMD. Mexican American, Other Hispanic, and Other groups fall in between these two groups. Although the distributions overlap substantially, there is a shift in central tendency across groups.]

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table: #| Source | df | F-statistic | p-value | | ——— | —- | ———– | ——- | | Ethnicity | 4 | 30.18 | < 0.001 | | Residuals | 2281 | — | — |

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]

#Interpretation: We reject the null hypothesis (H₀). there is strong statistical evidence that mean bone mineral density (BMD) differs across ethnicity groups,therefore at least one ethnic group has a significantly different average BMD compared to the others.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests. ### Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

[YOUR INTERPRETATION: Comment on whether residuals appear normally distributed based on the QQ plot, and whether variances appear equal based on Levene’s test. State whether assumptions are met or violated.]

#Interpretation: The Q-Q plot shows that the residuals follow the reference line reasonably well, with only minor deviations in the tails, indicating approximate normality. Levene’s test for homogeneity of variance was not statistically significant (F(4, 2281) = 1.57, p = 0.1788), indicating that the variances are equal across ethnicity groups. Therefore, the normality and equal variance assumptions appear to be reasonably met.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

tuk <- TukeyHSD(fit_aov)
print(tuk)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

plot(tuk, las = 0)

Write a short paragraph: “The groups that differed were …”

#YOUR INTERPRETATION:

The groups that differed were primarily those involving Non-Hispanic White and Non-Hispanic Black participants. Non-Hispanic Whites had significantly lower mean BMD compared to Mexican Americans, Other Hispanics, and Non-Hispanic Blacks (all adjusted p < 0.001). The “Other” group also had significantly lower BMD compared to Mexican Americans and Non-Hispanic Blacks. Additionally, Non-Hispanic Blacks had significantly higher mean BMD than Non-Hispanic Whites. No statistically significant differences were observed between Mexican Americans and Other Hispanics, Mexican Americans and Non-Hispanic Blacks, Other Hispanics and Non-Hispanic Blacks, or Other and Non-Hispanic Whites.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

# YOUR CODE HERE to compute eta-squared
ss_between <- anova_tbl$sumsq[anova_tbl$term == "ethnicity"]
ss_within  <- anova_tbl$sumsq[anova_tbl$term == "Residuals"]

# Compute total SS
ss_total <- ss_between + ss_within

# Compute eta-squared
eta_sq <- ss_between / ss_total

eta_sq

## [1] 0.05027196

# Hint: Extract SS for ethnicity and Residuals, then calculate proportion

Interpret in 1–2 sentences:

[YOUR INTERPRETATION: What proportion of variance in BMD is explained by ethnicity? Use Cohen’s guidelines: small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14]

#Ethnicity explains approximately 95% of the variance in BMD (η² = 55.697313 / 58.64554 ≈ 0.95), which would be considered an extremely large effect size according to Cohen’s guidelines

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

bmd2$vitd_total <- bmd2$vitd + 
  ifelse(is.na(bmd2$DSQTVD), 0, bmd2$DSQTVD)

# Verify
summary(bmd2[, c("calcium_total", "vitd_total")])

##  calcium_total      vitd_total      
##  Min.   :  55.0   Min.   :   0.000  
##  1st Qu.: 596.8   1st Qu.:   2.900  
##  Median : 895.5   Median :   8.533  
##  Mean   : 997.6   Mean   :  28.675  
##  3rd Qu.:1285.0   3rd Qu.:  30.400  
##  Max.   :5399.0   Max.   :2574.650  
##  NA's   :293      NA's   :293

# Check for any remaining NAs
sum(is.na(bmd2$calcium_total))

## [1] 293

sum(is.na(bmd2$vitd_total))

## [1] 293

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# YOUR CODE HERE: Create histograms for key continuous variables
# # Create histograms for all continuous variables
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(fill = "purple", color = "black", bins = 30) +
  labs(title = "BMI Distribution", x = "BMI", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Calcium Distribution", x = "Calcium (mg)", y = "Frequency") +
  theme_minimal()

p3 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(fill = "darkgreen", color = "black", bins = 30) +
  labs(title = "Total Vitamin D Distribution", x = "Vitamin D (mcg)", y = "Frequency") +
  theme_minimal()

p4 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(fill = "orange2", color = "black", bins = 30) +
  labs(title = "Total MET Distribution", x = "MET (min/week)", y = "Frequency") +
  theme_minimal()

# Display all plots together
print(p1)

print(p2)

print(p3)

print(p4)

# Check missing data for each variable
sum(is.na(bmd2$BMXBMI))        # 11 missing

## [1] 64

sum(is.na(bmd2$calcium_total)) # 157 missing

## [1] 293

sum(is.na(bmd2$vitd_total))    # 157 missing

## [1] 293

sum(is.na(bmd2$totmet))        # 698 missing

## [1] 964

# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases)

## Complete cases for all variables: 1765

Interpretation:

The bone mineral density (BMD) values of the five ethnic groups are similar.
Because the boxes are roughly the same height and location, the majority of individuals in each group have BMD values that are comparable (0.9–1.0 g/cm²).
Black non-Hispanic individuals have a somewhat higher BMD; their box is positioned slightly higher than the others.
Mexican Americans’ box sits a little lower than the others, indicating a slightly lower BMD.

*The boxes overlap a lot, indicating that there are not many distinctions between the groups. Numerous dots (data points) indicate that each group has a large number of members, which increases the reliability of our findings. Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    bmi_log = log(BMXBMI),
    calcium_log = log(calcium_total + 1),
    vitd_sqrt = sqrt(vitd_total),
    met_sqrt = sqrt(totmet)
    # Add your transformation code here
    
    # Examples (modify as needed based on your assessment):
    # log_bmi = log(BMXBMI),
    # log_calcium_total = log(calcium_total),
    # sqrt_totmet = sqrt(totmet),
    # sqrt_vitd_total = sqrt(vitd_total)
  )
# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases)

## Complete cases for all variables: 1765

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

[YOUR JUSTIFICATION HERE]

BMI Distribution (p1): Right-skewed distribution with a long tail extending to higher BMI values Most data clustered around 25-30, but some extend to 50-60

Decision: Apply LOG transformation

Justification: Right-skewed shape needs correction No zeros in BMI data, so log is safe to use For improved correlation analysis, log will compress the right tail and make the distribution more symmetrical.

Total Calcium Distribution (p2): Strong right-skewed with most values clustered at the low end (0-1000 mg) Long right tail extending to 4000+ mg

Decision: Apply LOG transformation

Justification: Strong right-skewed distribution Possible zero values (some individuals might not consume any calcium) Log will increase symmetry by compressing the really high numbers.

Total Vitamin D Distribution (p3): Extremely right-skewed, spike at zero, some values extend to 2000+ mcg

Decision: SQUARE ROOT transformation

Justification: Many zeros present (people without supplements), sqrt handles zeros without adding constants, useful for heavily zero-inflated data

Total MET Distribution (p4): Severely right-skewed, large spike at zero (sedentary people), long tail to 7500+ min/week

Decision: SQUARE ROOT transformation

Justification: Many zeros (sedentary individuals), extremely skewed, sqrt handles zeros naturally, appropriate for count-like physical activity data

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# Create the correlation helper function

corr_pair <- function(data, var1, var2) {
  # Remove missing values
  complete_data <- data[complete.cases(data[[var1]], data[[var2]]), ]
  
  # Calculate correlation test
  cor_test <- cor.test(complete_data[[var1]], complete_data[[var2]], 
                       method = "pearson")
  
  # Extract results
  result <- data.frame(
    Variable_1 = var1,
    Variable_2 = var2,
    r = round(cor_test$estimate, 3),
    p_value = format.pval(cor_test$p.value, digits = 3),
    n = nrow(complete_data)
  )
  
  return(result)
  
}

Table A: Predictors vs. Outcome (BMD)

#Table A: Predictors vs. Outcome (BMD)
table_a <- rbind(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),      # Age vs BMD
  corr_pair(bmd2, "bmi_log", "DXXOFBMD"),       # BMI vs BMD
  corr_pair(bmd2, "calcium_log", "DXXOFBMD"),   # Calcium vs BMD
  corr_pair(bmd2, "vitd_sqrt", "DXXOFBMD")     # Vitamin D vs BMD
)


table_a <- table_a %>%
  mutate(
    Variable_1 = case_when(
      Variable_1 == "RIDAGEYR" ~ "Age (years)",
      Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_1
    ),
    Variable_2 = "BMD (g/cm²)"
  )

kable(table_a,
      caption = "Table A: Correlations between Predictors and BMD (Outcome)",
      col.names = c("Predictor", "Outcome", "r", "p-value", "n"),
      align = c('l', 'l', 'r', 'r', 'r'))

Table A: Correlations between Predictors and BMD (Outcome)
	Predictor	Outcome	r	p-value	n
cor	Age (years)	BMD (g/cm²)	-0.232	<2e-16	2286
cor1	BMI (log-transformed)	BMD (g/cm²)	0.425	<2e-16	2275
cor2	Total Calcium (log-transformed)	BMD (g/cm²)	0.012	0.592	2129
cor3	Total Vitamin D (sqrt-transformed)	BMD (g/cm²)	-0.062	0.00426	2129

Interpret the results:

[YOUR INTERPRETATION: Age has a slight negative correlation with BMD, although BMI has a significant (moderately positive) correlation.the cross-sectional research does not reveal any significant positive correlations between BMD with either calcium or vitamin D intake.]

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# YOUR CODE HERE
# Table B: Predictors vs. Exposure (Physical Activity)
table_b <- rbind(
  corr_pair(bmd2, "RIDAGEYR", "met_sqrt"),  # Age vs MET
  corr_pair(bmd2, "bmi_log", "met_sqrt"),   # BMI vs MET
  corr_pair(bmd2, "calcium_log", "met_sqrt"), # Calcium vs MET
  corr_pair(bmd2, "vitd_sqrt", "met_sqrt")  # Vitamin D vs MET
)

# Add descriptive labels
table_b <- table_b %>%
  mutate(
    Variable_1 = case_when(
      Variable_1 == "RIDAGEYR" ~ "Age (years)",
      Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_1
    ),
    Variable_2 = "Physical Activity (MET, sqrt-transformed)"
  )

# Display formatted table
kable(table_b,
      digits = 3,
      col.names = c("Predictor", "Exposure", "r", "p-value", "n"),
      caption = "Table B: Correlations between Predictors and Physical Activity (Exposure)",
      align = c('l', 'l', 'r', 'r', 'r'))

Table B: Correlations between Predictors and Physical Activity (Exposure)
	Predictor	Exposure	r	p-value	n
cor	Age (years)	Physical Activity (MET, sqrt-transformed)	-0.097	1.96e-05	1934
cor1	BMI (log-transformed)	Physical Activity (MET, sqrt-transformed)	0.001	0.951	1912
cor2	Total Calcium (log-transformed)	Physical Activity (MET, sqrt-transformed)	0.086	0.000282	1777
cor3	Total Vitamin D (sqrt-transformed)	Physical Activity (MET, sqrt-transformed)	-0.002	0.932	1777

Interpret the results:

YOUR INTERPRETATION: Physical activity and age have a slightly negative correlation (older = less active). There are no significant correlations between vitamin D and BMI. Although statistically significant, the calcium-MET link is fairly weak (r = 0.068) and has little practical significance. There is no apparent correlation between vitamin D intake and levels of physical exercise.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# #Table C: Predictors vs. Each Other (Intercorrelations)
table_c <- rbind(
  corr_pair(bmd2, "RIDAGEYR", "bmi_log"),
  corr_pair(bmd2, "RIDAGEYR", "calcium_log"),
  corr_pair(bmd2, "RIDAGEYR", "vitd_sqrt"),
  corr_pair(bmd2, "bmi_log", "calcium_log"),
  corr_pair(bmd2, "bmi_log", "vitd_sqrt"),
  corr_pair(bmd2, "calcium_log", "vitd_sqrt")
)

# Descriptive labels
table_c <- table_c %>%
  mutate(
    Variable_1 = case_when(
      Variable_1 == "RIDAGEYR" ~ "Age (years)",
      Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_1
    ),
    Variable_2 = case_when(
      Variable_2 == "RIDAGEYR" ~ "Age (years)",
      Variable_2 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_2 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_2 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_2
    )
  )

# Display table
kable(table_c,
      digits = 3,
      col.names = c("Variable 1", "Variable 2", "r", "p-value", "n"),
      caption = "Table C: Intercorrelations Among Predictors",
      align = c('l', 'l', 'r', 'r', 'r'))

Table C: Intercorrelations Among Predictors
	Variable 1	Variable 2	r	p-value	n
cor	Age (years)	BMI (log-transformed)	-0.098	1.83e-07	2834
cor1	Age (years)	Total Calcium (log-transformed)	0.048	0.0135	2605
cor2	Age (years)	Total Vitamin D (sqrt-transformed)	0.153	4.68e-15	2605
cor3	BMI (log-transformed)	Total Calcium (log-transformed)	0.000	0.983	2569
cor4	BMI (log-transformed)	Total Vitamin D (sqrt-transformed)	0.007	0.731	2569
cor5	Total Calcium (log-transformed)	Total Vitamin D (sqrt-transformed)	0.314	<2e-16	2605

Interpret the results:

[YOUR INTERPRETATION: The largest correlation is 0.314 between Total Calcium and Total Vitamin D, which is only moderate, not strong. We can conclude that there are no strong intercorrelations among predictors, and based on this table alone, multicollinearity is unlikely to be a serious concern in future regression analyses.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

[A. Comparing Methods (ANOVA vs Correlation) ANOVA and correlation are two methods used in epidemiological research to address many kinds of concerns. When comparing the mean of a continuous result across three or more categorical groups, an ANOVA is suitable. To ascertain whether mean bone mineral density varies among BMI categories (normal weight, overweight, and obese), for instance, an ANOVA would be utilized. The strength and direction of a linear link between two continuous variables, on the other hand, are evaluated using correlation. For example, correlation analysis would be more appropriate for determining if blood vitamin D levels are linked to total calcium levels.The primary difference is that correlation measures the degree to which two continuous variables move together, while ANOVA assesses mean differences between groups.

B. Assumptions & Limitations The assignment’s most difficult assumptions to meet were homogeneity of variance and normality. For a number of variables that were closer to normal distributions, log or square-root adjustments were necessary. Statistical tests may yield biased p-values if these presumptions are broken, which raises the possibility of Type I or Type II errors and reduces confidence in the findings.The fact that cross-sectional correlation analysis cannot prove causation or timing is one of its main drawbacks. theremay becorrelations between measures of bone health, activity, and nutrition, but we are unable to establish a trend. Observed associations may also be distorted by confounding variables, such as age, supplement use, and chronic illness. As a result, results should be read carefully.

C. R programing Challenges Correctly organizing the ANOVA workflow, including handling missing data and making sure categorical variables were represented as factors, was the most difficult aspect of the R coding. In order to calculate η² without misinterpreting the model components, I also had to carefully extract the right sums of squares from the ANOVA result. By methodically examining model output, comparing degrees of freedom to the analytical sample size, and using tidy() to clearly arrange results, I was able to overcome these difficulties. This technique increased my confidence in calculating and analyzing effect sizes in R and strengthened my ability to translate statistical results into meaningful interpretation.]

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Karo Ekor

2026-02-21

Homework 1 overview

File organization and naming (required)

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Yes the mean BMD values differ significantly across ethnicity groups.

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

#Ethnicity explains approximately 95% of the variance in BMD (η² = 55.697313 / 58.64554 ≈ 0.95), which would be considered an extremely large effect size according to Cohen’s guidelines

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

Interpretation:

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Optional: Visualization examples

Part 3 — Reflection (10 points)

Submission checklist