Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

The total observations before removing missing data is 2898, after removing the missing data there is as an observation size of 2286.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

Statistical notation: - H₀: (H_0: mu_Mexican American = mu_Other Hispanic = mu_Non-Hispanic White = mu_Non-Hispanic Black = mu_Other) - Hₐ: (H_A: One BMD differs from another) Plain language: - H₀: Mean bone mineral density is the same across all ethnicity groups - Hₐ: Mean bone mineral density differs for at least one or more ethnicity groups, suggesting ethnicity may be associated with differences in bone health.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

The box plot shows that Non-Hispanic Black individuals have the highest median BMD overall and a relatively wide spread, indicating higher bone density on average compared with other groups. Mexican American and Other Hispanic groups have similar median BMD’s. Non-Hispanic White individuals show the lowest median BMD among the groups. The Other group also has a lower BMD.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

F-statistic= 30.185, p-value = 0 The results indicate that there is a statistically significant difference in mean bone mineral density across ethnicity groups. There is at least one ethnicity group that has a different average BMD level compared to the others. We reject the null hypothesis because the p-value is less than 0.05.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

Interpretation: The Q-Q plot assesses normality by comparing quantiles of the residuals to quantiles of a standard normal distribution. Points fall exactly on the diagonal line. The Q-Q plot follows normality. The p-value of Levenes test was 0.1788 which is greater than 0.05, so a non-significant result indicates homogeneous variances across groups, supporting the equal variance assumption underlying ANOVA.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significance

# Extract and format results
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]

# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")

Tukey HSD post-hoc comparisons with 95% confidence intervals
	Comparison	diff	lwr	upr	p adj	Significant	Direction
Other Hispanic-Mexican American	Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978	No	No difference
Non-Hispanic White-Mexican American	Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000	Yes	Second group higher
Non-Hispanic Black-Mexican American	Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277	No	No difference
Other-Mexican American	Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002	Yes	Second group higher
Non-Hispanic White-Other Hispanic	Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000	Yes	Second group higher
Non-Hispanic Black-Other Hispanic	Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722	No	No difference
Other-Other Hispanic	Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011	Yes	Second group higher
Non-Hispanic Black-Non-Hispanic White	Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000	Yes	First group higher
Other-Non-Hispanic White	Other-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719	No	No difference
Other-Non-Hispanic Black	Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000	Yes	Second group higher

cat("POST-HOC TEST INTERPRETATION\n")

## POST-HOC TEST INTERPRETATION

for (i in 1:nrow(tukey_summary)) {
  row <- tukey_summary[i, ]
  cat("Comparison:", row$Comparison, "\n")
  cat("  Estimated mean difference:", round(row$diff, 4), "g/cm²\n")
  cat("  95% Confidence Interval: [", 
      round(row$lwr, 4), ", ", round(row$upr, 4), "]\n")
  cat("  Adjusted p-value:", format(row$`p adj`, digits = 4), "\n")
  cat("  Statistical Significance:", row$Significant, "\n")
  cat("  Direction:", row$Direction, "\n\n")
}

## Comparison: Other Hispanic-Mexican American 
##   Estimated mean difference: -0.0044 g/cm²
##   95% Confidence Interval: [ -0.0426 ,  0.0338 ]
##   Adjusted p-value: 0.9978 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Non-Hispanic White-Mexican American 
##   Estimated mean difference: -0.0624 g/cm²
##   95% Confidence Interval: [ -0.0929 ,  -0.0319 ]
##   Adjusted p-value: 2.568e-07 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Mexican American 
##   Estimated mean difference: 0.0224 g/cm²
##   95% Confidence Interval: [ -0.0101 ,  0.0549 ]
##   Adjusted p-value: 0.3277 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Mexican American 
##   Estimated mean difference: -0.0533 g/cm²
##   95% Confidence Interval: [ -0.0874 ,  -0.0193 ]
##   Adjusted p-value: 0.0001919 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic White-Other Hispanic 
##   Estimated mean difference: -0.0579 g/cm²
##   95% Confidence Interval: [ -0.0889 ,  -0.0269 ]
##   Adjusted p-value: 3.607e-06 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Other Hispanic 
##   Estimated mean difference: 0.0268 g/cm²
##   95% Confidence Interval: [ -0.0062 ,  0.0598 ]
##   Adjusted p-value: 0.1722 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Other Hispanic 
##   Estimated mean difference: -0.0489 g/cm²
##   95% Confidence Interval: [ -0.0834 ,  -0.0144 ]
##   Adjusted p-value: 0.001072 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Non-Hispanic White 
##   Estimated mean difference: 0.0848 g/cm²
##   95% Confidence Interval: [ 0.0612 ,  0.1084 ]
##   Adjusted p-value: 3.936e-11 
##   Statistical Significance: Yes 
##   Direction: First group higher 
## 
## Comparison: Other-Non-Hispanic White 
##   Estimated mean difference: 0.0091 g/cm²
##   95% Confidence Interval: [ -0.0166 ,  0.0347 ]
##   Adjusted p-value: 0.8719 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Non-Hispanic Black 
##   Estimated mean difference: -0.0757 g/cm²
##   95% Confidence Interval: [ -0.1038 ,  -0.0477 ]
##   Adjusted p-value: 4.176e-11 
##   Statistical Significance: Yes 
##   Direction: Second group higher

Write a short paragraph: Interpretation: The groups that significantly differ were Non-Hispanic white-Mexican American,Other-Mexican American, Non-Hispanic White-Other Hispanic, Other-Other Hispanic,Non-Hispanic Black-Non-Hispanic White, and Other-Non-Hispanic Black.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

ss_total <- sum(anova_tbl$sumsq)

anova_eta <- anova_tbl %>%
  mutate(eta_sq = sumsq / ss_total)

anova_eta

## # A tibble: 2 × 7
##   term         df sumsq meansq statistic   p.value eta_sq
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>  <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24 0.0503
## 2 Residuals  2281 55.7  0.0244      NA   NA        0.950

Interpret in 1–2 sentences: The calculated effect size is η² = 0.0503. This means ethnicity explains approximately 5.03% of the variance in BMD.To interpret this effect size, we use conventional guidelines: η² = 0.01 (small), η² = 0.06 (medium), and η² = 0.14 (large). In our case, the effect is small.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

library(ggplot2)
library(patchwork)

p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p2 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p3 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p4 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(bins = 30, na.rm = TRUE)

#Arrange plots
(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning: I applied transformations to ALL variables because all variables are right skewed. By applying transformations the variables skewness will decrease.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

bmd2 <- bmd2 %>%
  mutate(
    log_BMXBMI = log(BMXBMI),                 
    log_calcium_total = log(calcium_total),     
    log_vitd_total = log(vitd_total + 1)        
  )

tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
 ) %>%
   mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
    )
 
tableA %>% kable(caption = "Table A: Correlations between predictors and BMD")

Table A: Correlations between predictors and BMD
x	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.00000
log_bmi	DXXOFBMD	2275	0.425	0.00000
log_calcium_total	DXXOFBMD	2129	0.012	0.59200
sqrt_vitd_total	DXXOFBMD	2129	-0.062	0.00426

Interpret the results:

Interpretation:

Age vs. BMD: The relationship between age and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.0000. Age is negatively correlated with BMD. As people get older, their BMD tends to decrease. The strength is weak with an estimate of -0.232.

BMI vs. BMD: The relationship between BMI and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.0000. BMI is positively correlated with BMD. As BMI increase, BMD increases. The strength is moderate with an estimate of 0.425.

Total calcium vs. BMD: The relationship between total calcium and BMD is not statistically significant. The p-value was greater than 0.05 with a value of 0.59200. There is no meaningful correlation between total calcium and BMD.

Total Vitamin D vs. BMD: The relationship between total vitamin D and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.00426. Vitamin D is negative correlated with BMD. The strength is weak with an estimate of -0.062.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

bmd2 <- bmd2 %>%
  mutate(
    log_BMXBMI = log(BMXBMI),                 
    log_calcium_total = log(calcium_total),     
    log_vitd_total = log(vitd_total + 1)        
  )

tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
 ) %>%
   mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
    )
 
tableA %>% kable(caption = "Table B: Correlations between potential confounders and MET")

Table B: Correlations between potential confounders and MET
x	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.097	1.96e-05
log_bmi	sqrt_totmet	1912	0.001	9.51e-01
log_calcium_total	sqrt_totmet	1777	0.086	2.82e-04
sqrt_vitd_total	sqrt_totmet	1777	-0.002	9.32e-01

Interpret the results:

Age vs. MET: The relationship between age and MET is statistically significant. The p-value was less than 0.05 with a value of 1.96e-05.

BMI vs. MET: The relationship between total BMI and MET is not statistically significant. The p-value was greater than 0.05 with a value of 0.951.

Total Calcium vs. MET: The relationship between total calcium and MET is statistically significant. The p-value was less than 0.05 with a value of 2.82e-04.

Vitamin D vs. MET: The relationship between total vitamin D and MET is not statistically significant. The p-value was greater than 0.05 with a value of 0.932.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

# Example structure:
preds <- c("RIDAGEYR", "log_BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableC %>% kable(caption = "Table C: Pairwise correlations among predictors")

Table C: Pairwise correlations among predictors
x	y	n	estimate	p_value
RIDAGEYR	log_BMXBMI	2834	-0.098	2.00e-07
RIDAGEYR	log_calcium_total	2605	0.048	1.36e-02
RIDAGEYR	sqrt_vitd_total	2605	0.153	0.00e+00
log_BMXBMI	log_calcium_total	2569	0.000	9.81e-01
log_BMXBMI	sqrt_vitd_total	2569	0.007	7.31e-01
log_calcium_total	sqrt_vitd_total	2605	0.314	0.00e+00

Interpret the results:

There are no strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses. Age vs. BMI, calcium, and vitamin D, all showed weak correlations whether they were positive or negative associations. BMI vs. calcium and vitamin, showed no correlation. Calcium and Vitamin D showed a moderate positive correlation with r= 0.314. As calcium intake increases, vitamin D levels also increase.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

In epidemiological research, you want to use ANOVA when you are comparing 3 or more group means, has one categorical predictor, has continuous outcomes, and independent observations within each group. ANOVA tells you whether there is statistical significance among group means, but it does not measure the magnitude or direction of pairwise relationship. You use correlation when you want to measure the strength or direction of linear relationship, you want to describe associations, you have continuous variables, and you are exploring data before regression. Correlation provides a standardized estimate of linear association between variables but does compare group differences. A research question that is suited for ANOVA is ” Do average BMD levels differ across ethnic groups?” while a research question suited for correlation is ” Is higher BMI associated with higher BMD?”

A major learning outcome was challenging was assessing normality, linearity, and homoscedasticity because the variables ended up being skewed which required transformations to be added. This showed me how important it is to visually inspect the data and understand it before applying the transformations. Additionally, our analyses was cross-sectional, so correlation cannot establish causality. Observed relationships may have been influenced by unmeasured confounding factors. For example, we see the association between higher BMI and Higher BMD, but we cannot conclude BMI causes increase bone density without conducting a longitudinal study or using longitudinal data.

The most challenging part of this assignment was creating and combining multiple histograms. When I first ran the code, i got errors saying that something was wrong with my ggplot code and the syntax (p1|p2)/(p3|p4) was not arranging my plots to sit on top of each other. I solved this problem by using google and searching up how to “arrange plots with histograms” and I found an analytics website that had the steps. https://www.analytics-tuts.com/arrange-multiple-plots-using-patchwork-in-r/ In this link, the author had a library called “patchwork” and a library called “ggplot2”, I ended up adding these because besides these packages, the rest of my code looked similar to what the author had, and it ended up working and allowing my histograms to sit on top and side by side of each other. This first assignment has allowed me to strengthen by skills in ANOVA and correlation. We have done labs previously but this assignment allowed me to apply both concepts in the same assignment. This assignment also helped me with steps that will be important for our final paper. For example, removing missing data, import a csv file, create histograms, and compare statistical results.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Grace Beal

2026-02-20

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Optional: Visualization examples

Part 3 — Reflection (10 points)