knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 5
)

Homework 1 overview

Due: Friday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

  1. Your .Rmd file
  2. Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.


File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname


Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable Description Type
DXXOFBMD Total femur bone mineral density Continuous (g/cm²)
RIDRETH1 Race/ethnicity Categorical (1–5)
RIAGENDR Sex Categorical (1=Male, 2=Female)
RIDAGEYR Age in years Continuous
BMXBMI Body mass index Continuous (kg/m²)
smoker Smoking status Categorical (1=Current, 2=Past, 3=Never)
calcium Dietary calcium intake Continuous (mg/day)
DSQTCALC Supplemental calcium Continuous (mg/day)
vitd Dietary vitamin D intake Continuous (mcg/day)
DSQTVD Supplemental vitamin D Continuous (mcg/day)
totmet Total physical activity (MET-min/week) Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
# Import data (adjust path as needed)
bmd <- readr::read_csv("/Users/morganwheat/Downloads/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total
## [1] 2898
n_anova
## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

[The total sample size is n = 2898. After the removal of missing data for BMD and ethnicity, the final analytic sample size that remained for the ANOVA analysis was n = 2286. Overall, 612 observations were removed during this process.]


Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

  • H₀: [mu_Mexican American = mu_Other Hispanic = mu_Non-Hispanic White = mu_Non-Hispanic Black = mu_Other]
  • Hₐ: [mu_i != mu_j]

Plain language:

  • H₀: [All ethnic groups mean BMD are equal to each other]
  • Hₐ: [At least one ethnic groups mean BMD differs from the others. This suggests bone mineral density varies by ethnicity, though other factors may explain these differences]

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)
ethnicity n mean_bmd sd_bmd
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

[The median BMD appears highest in the non-Hispanic Black group, while the median BMD in the Mexican American and Other-Hispanic groups appear to be roughly equal to each other. The median BMD appears to be roughly equal between the Non-Hispanic White and Other group as well. Variability (IQR width) looks similar across group. The variability is especially similar between the Mexican American, Other Hispanic, and Other groups and in the Non-Hispanic White and Non-Hispanic Black groups. There are a few potential outliers in each group.]

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

  • Outcome: BMD (DXXOFBMD)
  • Predictor: Ethnicity (5 groups)
# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)
term df sumsq meansq statistic p.value
ethnicity 4 2.9482 0.7371 30.185 0
Residuals 2281 55.6973 0.0244 NA NA

Write your interpretation here:

[F-statistic = 30.19, p-value = 0. Conclusion: since p<0.05, we reject the null hypothesis. There is statistically significant evidence that mean BMD differs across at least one pair of ethnic groups.The F-statistic of 30.19 means that the between-group variation is about 30 times larger than the within-group variation of the ethnic groups.The p-value (0) indicates this difference is extremely unlikely to have occurred by chance if all groups truly had the same mean.]

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

par(mfrow = c(1, 1))

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

[Based on the QQ residuals plot which assesses normality by comparing quantiles of the residuals to the quantiles of a standard normal distribution, the points follow the diagonal line well meaning the normality assumption is met. Looking at Levene’s test, which evaluates whether variance is approximately equal across the five ethnicity groups, there is a non-significant result (p = 0.1788), indicating a homogeneous variances across groups, meeting the equal variance assumption.]

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

  • Pairwise comparisons
  • Mean differences
  • 95% confidence intervals
  • Adjusted p-values
tuk <- TukeyHSD(fit_aov)
tuk
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significance
# Extract and format results

# Extract and format results
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]

# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")
Tukey HSD post-hoc comparisons with 95% confidence intervals
Comparison diff lwr upr p adj Significant Direction
Other Hispanic-Mexican American Other Hispanic-Mexican American -0.0044 -0.0426 0.0338 0.9978 No No difference
Non-Hispanic White-Mexican American Non-Hispanic White-Mexican American -0.0624 -0.0929 -0.0319 0.0000 Yes Second group higher
Non-Hispanic Black-Mexican American Non-Hispanic Black-Mexican American 0.0224 -0.0101 0.0549 0.3277 No No difference
Other-Mexican American Other-Mexican American -0.0533 -0.0874 -0.0193 0.0002 Yes Second group higher
Non-Hispanic White-Other Hispanic Non-Hispanic White-Other Hispanic -0.0579 -0.0889 -0.0269 0.0000 Yes Second group higher
Non-Hispanic Black-Other Hispanic Non-Hispanic Black-Other Hispanic 0.0268 -0.0062 0.0598 0.1722 No No difference
Other-Other Hispanic Other-Other Hispanic -0.0489 -0.0834 -0.0144 0.0011 Yes Second group higher
Non-Hispanic Black-Non-Hispanic White Non-Hispanic Black-Non-Hispanic White 0.0848 0.0612 0.1084 0.0000 Yes First group higher
Other-Non-Hispanic White Other-Non-Hispanic White 0.0091 -0.0166 0.0347 0.8719 No No difference
Other-Non-Hispanic Black Other-Non-Hispanic Black -0.0757 -0.1038 -0.0477 0.0000 Yes Second group higher

Write a short paragraph: “The groups that differed were …”

[The groups that differed were the Non-Hispanic White-Mexican American (p adj = 0.0000), Other-Mexican American (p adj = 0.0002), Non-Hispanic White-Other Hispanic (p adj = 0.0000), Other-Other Hispanic (p adj = 0.0011), Non-Hispanic Black-Non-Hispanic White (p adj = 0.0000), and the Other-Non-Hispanic Black (p adj = 0.0000).]

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl
## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA
# YOUR CODE HERE to compute eta-squared
# Hint: Extract SS for ethnicity and Residuals, then calculate proportion
ss_treatment <- anova_tbl$`sumsq`[1]
ss_total <- sum(anova_tbl$`sumsq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %

Interpret in 1–2 sentences: [Percentage of variance explained: Ethnicity explains 5.03% of the variance in BMD.This effect size classification is small since it is less than 0.06. The practical effect is small-Ethnicity alone doesn’t explain most of the variation in BMD.]


Part 2 — Correlation analysis (40 points)

In this section you will:

  1. Create total intake variables (dietary + supplemental)
  2. Assess whether transformations are needed
  3. Produce three correlation tables examining relationships between:
    • Table A: Predictors vs. outcome (BMD)
    • Table B: Predictors vs. exposure (physical activity)
    • Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# YOUR CODE HERE: Create histograms for key continuous variables
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
# library(patchwork)
# p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
# p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
# ...
# (p1 | p2) / (p3 | p4)

library(ggplot2)
library(patchwork)

p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
p3 <- ggplot(bmd2, aes(x = vitd_total)) + geom_histogram(bins = 30)
p4 <- ggplot(bmd2, aes(x = totmet)) + geom_histogram(bins = 30)

(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    log_totmet = sqrt(totmet),
    log_vitd_total = sqrt(vitd_total),
    sqrt_BMXBMI = sqrt(BMXBMI),
    sqrt_calcium_total = sqrt(calcium_total)
  )

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

[I applied a log transformation to the physical activity and Total Vitamin D variables because their histograms show they were heavily skewed to the right. In comparison, I applied a square-root transformation to the BMI and total calcium variables because the histograms show that they are moderately skewed to the right]

2.3 Create three correlation tables (Pearson)

For each correlation, report:

  • Correlation coefficient (r)
  • p-value
  • Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

  • Age vs. BMD
  • BMI vs. BMD
  • Total calcium vs. BMD
  • Total vitamin D vs. BMD

Use transformed versions where appropriate.

 # Pearson correlation Age vs. BMD
cor_age_bmd <- cor.test(bmd2$RIDAGEYR, bmd2$DXXOFBMD)

# Display results
tidy(cor_age_bmd) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Age and BMD"
  )
Pearson Correlation: Age and BMD
r t-statistic p-value 95% CI Lower 95% CI Upper
-0.232 -11.42 0 -0.271 -0.193
 # Pearson correlation BMI vs BMD
cor_bmi_bmd <- cor.test(bmd2$sqrt_BMXBMI, bmd2$DXXOFBMD)

# Display results
tidy(cor_bmi_bmd) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: BMI and BMD"
  )
Pearson Correlation: BMI and BMD
r t-statistic p-value 95% CI Lower 95% CI Upper
0.418 21.955 0 0.384 0.452
 # Pearson correlation Total calcium vs BMD
cor_totalcalc_bmd <- cor.test(bmd2$sqrt_calcium_total, bmd2$DXXOFBMD)

# Display results
tidy(cor_totalcalc_bmd) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Total calcium and BMD"
  )
Pearson Correlation: Total calcium and BMD
r t-statistic p-value 95% CI Lower 95% CI Upper
0.002 0.092 0.927 -0.041 0.044
 # Pearson correlation Total vitamin D vs BMD
cor_totalvitd_bmd <- cor.test(bmd2$log_vitd_total, bmd2$DXXOFBMD)

# Display results
tidy(cor_totalvitd_bmd) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Total vitamin D and BMD"
  )
Pearson Correlation: Total vitamin D and BMD
r t-statistic p-value 95% CI Lower 95% CI Upper
-0.062 -2.862 0.004 -0.104 -0.019

Interpret the results:

[According to the results, the variables Age (p = 0), BMI (p = 0), and Total vitamin D (p = 0.004) show statistically significant correlations with BMD (none of their 95% CI’s contain zero). Age shows a weak negative correlation (r = -0.232) with BMD, BMI shows a moderate positive correlation (r=0.418) with BMD, and Total vitamin D shows a very weak negative correlation (r = -0.062) with BMD.]

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

  • Age vs. MET
  • BMI vs. MET
  • Total calcium vs. MET
  • Total vitamin D vs. MET
# YOUR CODE HERE
# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variable

 # Pearson correlation Age vs. MET
cor_age_met <- cor.test(bmd2$RIDAGEYR, bmd2$log_totmet)

# Display results
tidy(cor_age_met) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Age and MET"
  )
Pearson Correlation: Age and MET
r t-statistic p-value 95% CI Lower 95% CI Upper
-0.097 -4.28 0 -0.141 -0.053
 # Pearson correlation BMI vs. MET
cor_bmi_met <- cor.test(bmd2$sqrt_BMXBMI, bmd2$log_totmet)

# Display results
tidy(cor_bmi_met) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: BMI and MET"
  )
Pearson Correlation: BMI and MET
r t-statistic p-value 95% CI Lower 95% CI Upper
-0.003 -0.11 0.912 -0.047 0.042
 # Pearson correlation Total calcium vs. MET
cor_totalcalc_met <- cor.test(bmd2$sqrt_calcium_total, bmd2$log_totmet)

# Display results
tidy(cor_totalcalc_met) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Total calcium and MET"
  )
Pearson Correlation: Total calcium and MET
r t-statistic p-value 95% CI Lower 95% CI Upper
0.082 3.475 0.001 0.036 0.128
 # Pearson correlation Total vitamin D vs. MET
cor_totalvitd_met <- cor.test(bmd2$log_vitd_total, bmd2$log_totmet)

# Display results
tidy(cor_totalvitd_met) %>%
  select(estimate, statistic, p.value, conf.low, conf.high) %>%
  kable(
    digits = 3,
    col.names = c("r", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    caption = "Pearson Correlation: Total vitamin D and MET"
  )
Pearson Correlation: Total vitamin D and MET
r t-statistic p-value 95% CI Lower 95% CI Upper
-0.002 -0.085 0.932 -0.049 0.044

Interpret the results:

[According to the results, the variables Age (p = 0) and Total calcium (p = 0.001) show a statistically significant correlation with physical activity (none of their 95% CI’s contain zero). Age shows a weak negative correlation (r = -0.097) with physical activity and Total calcium shows a weak positive correlation (r = 0.082) with physical activity.]

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

  • Age, BMI, Total calcium, Total vitamin D
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

# Example structure:
preds <- c("RIDAGEYR", "sqrt_BMXBMI", "sqrt_calcium_total", "log_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
 tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
   mutate(
     estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableC %>% kable()
x y n estimate p_value
RIDAGEYR sqrt_BMXBMI 2834 -0.101 1.00e-07
RIDAGEYR sqrt_calcium_total 2605 0.053 6.93e-03
RIDAGEYR log_vitd_total 2605 0.153 0.00e+00
sqrt_BMXBMI sqrt_calcium_total 2569 -0.012 5.53e-01
sqrt_BMXBMI log_vitd_total 2569 0.006 7.68e-01
sqrt_calcium_total log_vitd_total 2605 0.318 0.00e+00

Interpret the results:

[There is no strong correlation among the predictors (Age, BMI, Total calcium, and Total vitamin D) that would indicate serious multicollinearity concerns for future regression analyses. This is because none of the pairwise correlation coefficients exceed 0.7, which is a common threshold for identifying strong linear relationships between predictors. Therefore, multicollinearity does not appear to be a major concern based on these pairwise correlations.]

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = sqrt_BMXBMI, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(BMXBMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = log_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )


Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

  • When would you use ANOVA versus correlation in epidemiological research?
  • What are the key differences in what these methods tell us?
  • Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

  • Which assumptions were most challenging to meet in this assignment?
  • How might violations of assumptions affect your conclusions?
  • What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

  • What was the most difficult part of the R coding for this assignment?
  • How did you solve it? (Describe your problem-solving process)
  • What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

[In epidemiological research, ANOVA is used to compare the means of three or more groups, typically with one categorical predictor and one continuous outcome. In contrast, correlation is used when both variables are continuous, and the goal is to describe their association. A key difference between these methods is the type of question they answer. ANOVA addresses whether group means differ, while correlation assesses whether two variables are related by measuring the strength and direction of their linear relationship. For example, a research question suited for ANOVA might be, “Does mean BMI differ across ethnic groups?” whereas a question suited for correlation would be, “Is age associated with BMI?”

None of the ANOVA assumptions were challenging to meet in this assignment. Since meeting these assumptions ensures the validity of your p-values and inferences, if they were violated, there could have been an inflation of type I errors (false positives). The main limitation of using cross-sectional correlation analysis to understand relationships between nutrition/activity and bone health is that, because it captures only a single point in time, we cannot establish a temporal relationship. In other words, we cannot claim that nutritional status and physical activity levels directly impact bone health.

The most difficult part of the R coding for this assignment was coding for Tables A and B in Part 2. Originally, I was having trouble using the provided Helper function for correlation pairs, as the code wasn’t producing the output I wanted. After trying it a few times, I eventually decided to use the cor.test formatting from our corresponding correlation lab. The biggest R skill I strengthened through completing this assignment was learning how to break down code and recognize patterns rather than relying on the pre-existing code we were given and running it.]


Submission checklist

Before submitting, verify:


Good luck! Remember to start early and use office hours if you need help.