Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable Description Type
DXXOFBMD Total femur bone mineral density Continuous (g/cm²)
RIDRETH1 Race/ethnicity Categorical (1–5)
RIAGENDR Sex Categorical (1=Male, 2=Female)
RIDAGEYR Age in years Continuous
BMXBMI Body mass index Continuous (kg/m²)
smoker Smoking status Categorical (1=Current, 2=Past, 3=Never)
calcium Dietary calcium intake Continuous (mg/day)
DSQTCALC Supplemental calcium Continuous (mg/day)
vitd Dietary vitamin D intake Continuous (mcg/day)
DSQTVD Supplemental vitamin D Continuous (mcg/day)
totmet Total physical activity (MET-min/week) Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total
## [1] 2898
n_anova
## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

The total observations before removing missing data is 2898, after removing the missing data there is as an observation size of 2286.


Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

Statistical notation: - H₀: (H_0: mu_Mexican American = mu_Other Hispanic = mu_Non-Hispanic White = mu_Non-Hispanic Black = mu_Other) - Hₐ: (H_A: One BMD differs from another) Plain language: - H₀: Mean bone mineral density is the same across all ethnicity groups - Hₐ: Mean bone mineral density differs for at least one or more ethnicity groups, suggesting ethnicity may be associated with differences in bone health.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)
ethnicity n mean_bmd sd_bmd
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

The box plot shows that Non-Hispanic Black individuals have the highest median BMD overall and a relatively wide spread, indicating higher bone density on average compared with other groups. Mexican American and Other Hispanic groups have similar median BMD’s. Non-Hispanic White individuals show the lowest median BMD among the groups. The Other group also has a lower BMD.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

  • Outcome: BMD (DXXOFBMD)
  • Predictor: Ethnicity (5 groups)
# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)
term df sumsq meansq statistic p.value
ethnicity 4 2.9482 0.7371 30.185 0
Residuals 2281 55.6973 0.0244 NA NA

Write your interpretation here:

F-statistic= 30.185, p-value = 0 The results indicate that there is a statistically significant difference in mean bone mineral density across ethnicity groups. There is at least one ethnicity group that has a different average BMD level compared to the others. We reject the null hypothesis because the p-value is less than 0.05.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

Interpretation: The Q-Q plot assesses normality by comparing quantiles of the residuals to quantiles of a standard normal distribution. Points fall exactly on the diagonal line. The Q-Q plot follows normality. The p-value of Levenes test was 0.1788 which is greater than 0.05, so a non-significant result indicates homogeneous variances across groups, supporting the equal variance assumption underlying ANOVA.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

  • Pairwise comparisons
  • Mean differences
  • 95% confidence intervals
  • Adjusted p-values
tuk <- TukeyHSD(fit_aov)
tuk
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significance

# Extract and format results
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]

# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")
Tukey HSD post-hoc comparisons with 95% confidence intervals
Comparison diff lwr upr p adj Significant Direction
Other Hispanic-Mexican American Other Hispanic-Mexican American -0.0044 -0.0426 0.0338 0.9978 No No difference
Non-Hispanic White-Mexican American Non-Hispanic White-Mexican American -0.0624 -0.0929 -0.0319 0.0000 Yes Second group higher
Non-Hispanic Black-Mexican American Non-Hispanic Black-Mexican American 0.0224 -0.0101 0.0549 0.3277 No No difference
Other-Mexican American Other-Mexican American -0.0533 -0.0874 -0.0193 0.0002 Yes Second group higher
Non-Hispanic White-Other Hispanic Non-Hispanic White-Other Hispanic -0.0579 -0.0889 -0.0269 0.0000 Yes Second group higher
Non-Hispanic Black-Other Hispanic Non-Hispanic Black-Other Hispanic 0.0268 -0.0062 0.0598 0.1722 No No difference
Other-Other Hispanic Other-Other Hispanic -0.0489 -0.0834 -0.0144 0.0011 Yes Second group higher
Non-Hispanic Black-Non-Hispanic White Non-Hispanic Black-Non-Hispanic White 0.0848 0.0612 0.1084 0.0000 Yes First group higher
Other-Non-Hispanic White Other-Non-Hispanic White 0.0091 -0.0166 0.0347 0.8719 No No difference
Other-Non-Hispanic Black Other-Non-Hispanic Black -0.0757 -0.1038 -0.0477 0.0000 Yes Second group higher
cat("POST-HOC TEST INTERPRETATION\n")
## POST-HOC TEST INTERPRETATION
for (i in 1:nrow(tukey_summary)) {
  row <- tukey_summary[i, ]
  cat("Comparison:", row$Comparison, "\n")
  cat("  Estimated mean difference:", round(row$diff, 4), "g/cm²\n")
  cat("  95% Confidence Interval: [", 
      round(row$lwr, 4), ", ", round(row$upr, 4), "]\n")
  cat("  Adjusted p-value:", format(row$`p adj`, digits = 4), "\n")
  cat("  Statistical Significance:", row$Significant, "\n")
  cat("  Direction:", row$Direction, "\n\n")
}
## Comparison: Other Hispanic-Mexican American 
##   Estimated mean difference: -0.0044 g/cm²
##   95% Confidence Interval: [ -0.0426 ,  0.0338 ]
##   Adjusted p-value: 0.9978 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Non-Hispanic White-Mexican American 
##   Estimated mean difference: -0.0624 g/cm²
##   95% Confidence Interval: [ -0.0929 ,  -0.0319 ]
##   Adjusted p-value: 2.568e-07 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Mexican American 
##   Estimated mean difference: 0.0224 g/cm²
##   95% Confidence Interval: [ -0.0101 ,  0.0549 ]
##   Adjusted p-value: 0.3277 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Mexican American 
##   Estimated mean difference: -0.0533 g/cm²
##   95% Confidence Interval: [ -0.0874 ,  -0.0193 ]
##   Adjusted p-value: 0.0001919 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic White-Other Hispanic 
##   Estimated mean difference: -0.0579 g/cm²
##   95% Confidence Interval: [ -0.0889 ,  -0.0269 ]
##   Adjusted p-value: 3.607e-06 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Other Hispanic 
##   Estimated mean difference: 0.0268 g/cm²
##   95% Confidence Interval: [ -0.0062 ,  0.0598 ]
##   Adjusted p-value: 0.1722 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Other Hispanic 
##   Estimated mean difference: -0.0489 g/cm²
##   95% Confidence Interval: [ -0.0834 ,  -0.0144 ]
##   Adjusted p-value: 0.001072 
##   Statistical Significance: Yes 
##   Direction: Second group higher 
## 
## Comparison: Non-Hispanic Black-Non-Hispanic White 
##   Estimated mean difference: 0.0848 g/cm²
##   95% Confidence Interval: [ 0.0612 ,  0.1084 ]
##   Adjusted p-value: 3.936e-11 
##   Statistical Significance: Yes 
##   Direction: First group higher 
## 
## Comparison: Other-Non-Hispanic White 
##   Estimated mean difference: 0.0091 g/cm²
##   95% Confidence Interval: [ -0.0166 ,  0.0347 ]
##   Adjusted p-value: 0.8719 
##   Statistical Significance: No 
##   Direction: No difference 
## 
## Comparison: Other-Non-Hispanic Black 
##   Estimated mean difference: -0.0757 g/cm²
##   95% Confidence Interval: [ -0.1038 ,  -0.0477 ]
##   Adjusted p-value: 4.176e-11 
##   Statistical Significance: Yes 
##   Direction: Second group higher

Write a short paragraph: Interpretation: The groups that significantly differ were Non-Hispanic white-Mexican American,Other-Mexican American, Non-Hispanic White-Other Hispanic, Other-Other Hispanic,Non-Hispanic Black-Non-Hispanic White, and Other-Non-Hispanic Black.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl
## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA
ss_total <- sum(anova_tbl$sumsq)

anova_eta <- anova_tbl %>%
  mutate(eta_sq = sumsq / ss_total)

anova_eta
## # A tibble: 2 × 7
##   term         df sumsq meansq statistic   p.value eta_sq
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>  <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24 0.0503
## 2 Residuals  2281 55.7  0.0244      NA   NA        0.950

Interpret in 1–2 sentences: The calculated effect size is η² = 0.0503. This means ethnicity explains approximately 5.03% of the variance in BMD.To interpret this effect size, we use conventional guidelines: η² = 0.01 (small), η² = 0.06 (medium), and η² = 0.14 (large). In our case, the effect is small.


Part 2 — Correlation analysis (40 points)

In this section you will:

  1. Create total intake variables (dietary + supplemental)
  2. Assess whether transformations are needed
  3. Produce three correlation tables examining relationships between:
    • Table A: Predictors vs. outcome (BMD)
    • Table B: Predictors vs. exposure (physical activity)
    • Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

library(ggplot2)
library(patchwork)

p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p2 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p3 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p4 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(bins = 30, na.rm = TRUE)

#Arrange plots
(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning: I applied transformations to ALL variables because all variables are right skewed. By applying transformations the variables skewness will decrease.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

  • Correlation coefficient (r)
  • p-value
  • Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

  • Age vs. BMD
  • BMI vs. BMD
  • Total calcium vs. BMD
  • Total vitamin D vs. BMD

Use transformed versions where appropriate.

bmd2 <- bmd2 %>%
  mutate(
    log_BMXBMI = log(BMXBMI),                 
    log_calcium_total = log(calcium_total),     
    log_vitd_total = log(vitd_total + 1)        
  )

tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
 ) %>%
   mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
    )
 
tableA %>% kable(caption = "Table A: Correlations between predictors and BMD")
Table A: Correlations between predictors and BMD
x y n estimate p_value
RIDAGEYR DXXOFBMD 2286 -0.232 0.00000
log_bmi DXXOFBMD 2275 0.425 0.00000
log_calcium_total DXXOFBMD 2129 0.012 0.59200
sqrt_vitd_total DXXOFBMD 2129 -0.062 0.00426

Interpret the results:

Interpretation:

Age vs. BMD: The relationship between age and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.0000. Age is negatively correlated with BMD. As people get older, their BMD tends to decrease. The strength is weak with an estimate of -0.232.

BMI vs. BMD: The relationship between BMI and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.0000. BMI is positively correlated with BMD. As BMI increase, BMD increases. The strength is moderate with an estimate of 0.425.

Total calcium vs. BMD: The relationship between total calcium and BMD is not statistically significant. The p-value was greater than 0.05 with a value of 0.59200. There is no meaningful correlation between total calcium and BMD.

Total Vitamin D vs. BMD: The relationship between total vitamin D and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.00426. Vitamin D is negative correlated with BMD. The strength is weak with an estimate of -0.062.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

  • Age vs. MET
  • BMI vs. MET
  • Total calcium vs. MET
  • Total vitamin D vs. MET
bmd2 <- bmd2 %>%
  mutate(
    log_BMXBMI = log(BMXBMI),                 
    log_calcium_total = log(calcium_total),     
    log_vitd_total = log(vitd_total + 1)        
  )

tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
 ) %>%
   mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
    )
 
tableA %>% kable(caption = "Table B: Correlations between potential confounders and MET")
Table B: Correlations between potential confounders and MET
x y n estimate p_value
RIDAGEYR sqrt_totmet 1934 -0.097 1.96e-05
log_bmi sqrt_totmet 1912 0.001 9.51e-01
log_calcium_total sqrt_totmet 1777 0.086 2.82e-04
sqrt_vitd_total sqrt_totmet 1777 -0.002 9.32e-01

Interpret the results:

Age vs. MET: The relationship between age and MET is statistically significant. The p-value was less than 0.05 with a value of 1.96e-05.

BMI vs. MET: The relationship between total BMI and MET is not statistically significant. The p-value was greater than 0.05 with a value of 0.951.

Total Calcium vs. MET: The relationship between total calcium and MET is statistically significant. The p-value was less than 0.05 with a value of 2.82e-04.

Vitamin D vs. MET: The relationship between total vitamin D and MET is not statistically significant. The p-value was greater than 0.05 with a value of 0.932.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

  • Age, BMI, Total calcium, Total vitamin D
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

# Example structure:
preds <- c("RIDAGEYR", "log_BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableC %>% kable(caption = "Table C: Pairwise correlations among predictors")
Table C: Pairwise correlations among predictors
x y n estimate p_value
RIDAGEYR log_BMXBMI 2834 -0.098 2.00e-07
RIDAGEYR log_calcium_total 2605 0.048 1.36e-02
RIDAGEYR sqrt_vitd_total 2605 0.153 0.00e+00
log_BMXBMI log_calcium_total 2569 0.000 9.81e-01
log_BMXBMI sqrt_vitd_total 2569 0.007 7.31e-01
log_calcium_total sqrt_vitd_total 2605 0.314 0.00e+00

Interpret the results:

There are no strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses. Age vs. BMI, calcium, and vitamin D, all showed weak correlations whether they were positive or negative associations. BMI vs. calcium and vitamin, showed no correlation. Calcium and Vitamin D showed a moderate positive correlation with r= 0.314. As calcium intake increases, vitamin D levels also increase.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )


Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

  • When would you use ANOVA versus correlation in epidemiological research?
  • What are the key differences in what these methods tell us?
  • Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

  • Which assumptions were most challenging to meet in this assignment?
  • How might violations of assumptions affect your conclusions?
  • What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

  • What was the most difficult part of the R coding for this assignment?
  • How did you solve it? (Describe your problem-solving process)
  • What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

In epidemiological research, you want to use ANOVA when you are comparing 3 or more group means, has one categorical predictor, has continuous outcomes, and independent observations within each group. ANOVA tells you whether there is statistical significance among group means, but it does not measure the magnitude or direction of pairwise relationship. You use correlation when you want to measure the strength or direction of linear relationship, you want to describe associations, you have continuous variables, and you are exploring data before regression. Correlation provides a standardized estimate of linear association between variables but does compare group differences. A research question that is suited for ANOVA is ” Do average BMD levels differ across ethnic groups?” while a research question suited for correlation is ” Is higher BMI associated with higher BMD?”

A major learning outcome was challenging was assessing normality, linearity, and homoscedasticity because the variables ended up being skewed which required transformations to be added. This showed me how important it is to visually inspect the data and understand it before applying the transformations. Additionally, our analyses was cross-sectional, so correlation cannot establish causality. Observed relationships may have been influenced by unmeasured confounding factors. For example, we see the association between higher BMI and Higher BMD, but we cannot conclude BMI causes increase bone density without conducting a longitudinal study or using longitudinal data.

The most challenging part of this assignment was creating and combining multiple histograms. When I first ran the code, i got errors saying that something was wrong with my ggplot code and the syntax (p1|p2)/(p3|p4) was not arranging my plots to sit on top of each other. I solved this problem by using google and searching up how to “arrange plots with histograms” and I found an analytics website that had the steps. https://www.analytics-tuts.com/arrange-multiple-plots-using-patchwork-in-r/ In this link, the author had a library called “patchwork” and a library called “ggplot2”, I ended up adding these because besides these packages, the rest of my code looked similar to what the author had, and it ended up working and allowing my histograms to sit on top and side by side of each other. This first assignment has allowed me to strengthen by skills in ANOVA and correlation. We have done labs previously but this assignment allowed me to apply both concepts in the same assignment. This assignment also helped me with steps that will be important for our final paper. For example, removing missing data, import a csv file, create histograms, and compare statistical results.