Homework 1

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import data 

bmd <- readr::read_csv("D:/fizza documents/Epi 553/Assignments/HW01/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

cat("Total sample size:", n_total, "\n")

## Total sample size: 2898

cat("ANOVA analytic sample size:", n_anova, "\n")

## ANOVA analytic sample size: 2286

cat("Observations removed:", n_total - n_anova, "\n")

## Observations removed: 612

cat("Percent missing:", round((n_total - n_anova)/n_total * 100, 1), "%")

## Percent missing: 21.1 %

Write 2–4 sentences summarizing your analytic sample here.

The total sample size in the original dataset is 2,898 participants. After removing observations with missing values for either BMD (DXXOFBMD) or ethnicity (RIDRETH1), the analytic sample for the ANOVA analysis is 2,286 participants. This represents a loss of 612 observations (21.1% of the original sample), primarily due to missing DEXA scan data since not all NHANES participants completed the DEXA component.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅ (All ethnicity groups have the same mean BMD)
Hₐ: At least one μᵢ ≠ μⱼ (At least two ethnicity groups have different mean BMD)

Plain language:

H₀: There is no difference in mean bone mineral density across the five ethnicity groups.
Hₐ: There is a difference in mean bone mineral density across at least two ethnicity groups.

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3,
        col.names = c("Ethnicity", "N", "Mean BMD (g/cm²)", "SD"),
        caption = "Table 1: BMD Summary Statistics by Ethnicity") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 1: BMD Summary Statistics by Ethnicity
Ethnicity	N	Mean BMD (g/cm²)	SD
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.2, size = 0.5) +
  labs(
    title = "Distribution of Bone Mineral Density by Ethnicity",
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)",
    fill = "Ethnicity"
  ) +
   theme_minimal() +
 theme(axis.text.x = element_text(angle = 25, hjust = 1),
        legend.position = "none")

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

The boxplot shows that Non-Hispanic Black individuals appear to have the highest median BMD, followed by Mexican American and Other Hispanic groups. Non-Hispanic White individuals and the “Other” ethnicity category appear to have lower median BMD values. There is considerable variability within each group, as shown by the spread of points and the height of the boxes. Some groups, particularly Non-Hispanic Black and Non-Hispanic White, have more extreme outliers.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4,
        col.names = c("Term", "df", "Sum Sq", "Mean Sq", "F-statistic", "p-value"),
        caption = "Table 2: One-way ANOVA Results") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 2: One-way ANOVA Results
Term	df	Sum Sq	Mean Sq	F-statistic	p-value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

The ANOVA F-test shows a statistically significant result (F = 30.19, df = 4, p < 0.0001). Therefore, we reject the null hypothesis and conclude that there is a significant difference in mean bone mineral density across at least two of the five ethnicity groups. The very small p-value (less than 0.0001) provides strong evidence against the null hypothesis.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

The Residuals vs Fitted plot shows no clear pattern, suggesting homogeneity of variances, which is confirmed by the non-significant Levene’s test (F = 1.57, df = 4, p = 0.1788). The Q-Q plot indicates some deviation from normality in the tails, but ANOVA is robust to this violation given our large sample size (n = 2,286). Overall, the assumptions are reasonably satisfied for this analysis.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# Create tidy table from Tukey results
tukey_tidy <- as.data.frame(tuk$ethnicity) %>%
  rownames_to_column("comparison") %>%
  rename(
    diff = `diff`,
    lwr = `lwr`,
    upr = `upr`,
    p_adj = `p adj`
  ) %>%
  mutate(
    significant = ifelse(p_adj < 0.05, "Yes", "No"),
    diff = round(diff, 3),
    lwr = round(lwr, 3),
    upr = round(upr, 3),
    p_adj = signif(p_adj, 3)
  )

tukey_tidy %>%
  kable(caption = "Table 3: Tukey HSD Post-hoc Comparisons") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table 3: Tukey HSD Post-hoc Comparisons
comparison	diff	lwr	upr	p_adj	significant
Other Hispanic-Mexican American	-0.004	-0.043	0.034	9.98e-01	No
Non-Hispanic White-Mexican American	-0.062	-0.093	-0.032	3.00e-07	Yes
Non-Hispanic Black-Mexican American	0.022	-0.010	0.055	3.28e-01	No
Other-Mexican American	-0.053	-0.087	-0.019	1.92e-04	Yes
Non-Hispanic White-Other Hispanic	-0.058	-0.089	-0.027	3.60e-06	Yes
Non-Hispanic Black-Other Hispanic	0.027	-0.006	0.060	1.72e-01	No
Other-Other Hispanic	-0.049	-0.083	-0.014	1.07e-03	Yes
Non-Hispanic Black-Non-Hispanic White	0.085	0.061	0.108	0.00e+00	Yes
Other-Non-Hispanic White	0.009	-0.017	0.035	8.72e-01	No
Other-Non-Hispanic Black	-0.076	-0.104	-0.048	0.00e+00	Yes

Write a short paragraph: “The groups that differed were …”

The Tukey post-hoc tests revealed that several ethnicity pairs had significantly different mean BMD. Non-Hispanic Black individuals had significantly higher BMD compared to Non-Hispanic White (diff = 0.085, p < 0.001) and Other groups (diff = -0.076, p < 0.001). Non-Hispanic White individuals had significantly lower BMD than Mexican American (diff = -0.062, p < 0.001) and Other Hispanic groups (diff = -0.058, p < 0.001). However, Non-Hispanic Black did not differ significantly from Mexican American or Other Hispanic groups, and Non-Hispanic White did not differ from the Other category.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

# Calculate eta-squared
ss_between <- anova_tbl$sumsq[1]  # Sum of squares for ethnicity
ss_total <- sum(anova_tbl$sumsq)   # Total sum of squares
eta_squared <- ss_between / ss_total

# Display result
cat("Eta-squared (η²) =", round(eta_squared, 4), "\n")

## Eta-squared (η²) = 0.0503

cat("This means that ethnicity explains", round(eta_squared * 100, 1), 
    "% of the variance in BMD.")

## This means that ethnicity explains 5 % of the variance in BMD.

Interpret in 1–2 sentences:

The calculated eta-squared is 0.0503, meaning ethnicity explains 5% of the variance in BMD. This represents a small-to-medium effect size according to Cohen’s guidelines.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# Load patchwork for combining plots
library(patchwork)

# Create histograms
p1 <- ggplot(bmd2, aes(x = BMXBMI)) + 
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "BMI Distribution", x = "BMI (kg/m²)", y = "Count") +
  theme_minimal()

p2 <- ggplot(bmd2, aes(x = calcium_total)) + 
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Total Calcium Intake", x = "Calcium (mg/day)", y = "Count") +
  theme_minimal()

p3 <- ggplot(bmd2, aes(x = vitd_total)) + 
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Total Vitamin D Intake", x = "Vitamin D (mcg/day)", y = "Count") +
  theme_minimal()

p4 <- ggplot(bmd2, aes(x = totmet)) + 
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Physical Activity (MET-min/week)", x = "MET-min/week", y = "Count") +
  theme_minimal()

# Arrange plots
(p1 | p2) / (p3 | p4)

bmd2 <- bmd2 %>%
  mutate(
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total + 1),  # +1 to handle zeros
    sqrt_vitd_total = sqrt(vitd_total),
    sqrt_totmet = sqrt(totmet)
  )

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total + 1),  # +1 to handle zeros
    sqrt_vitd_total = sqrt(vitd_total),
    sqrt_totmet = sqrt(totmet)
  )

Document your reasoning:

I applied transformations to address positive skewness observed in the histograms. BMI shows a roughly normal distribution with slight right skew, so I used a log transformation to improve symmetry. Total calcium intake is highly right-skewed with many values clustered near zero, so I applied log(calcium+1) to handle zeros and reduce skewness. Vitamin D intake is extremely right-skewed with most values concentrated near zero, so I used a square root transformation. Physical activity (MET-min/week) is also highly right-skewed with a long tail, so I applied a square root transformation to better approximate normality for correlation analysis.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# YOUR CODE HERE
tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
  corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
  mutate(
    predictor = case_when(
      x == "RIDAGEYR" ~ "Age",
      x == "log_bmi" ~ "BMI (log-transformed)",
      x == "log_calcium_total" ~ "Total Calcium (log-transformed)",
      x == "sqrt_vitd_total" ~ "Total Vitamin D (sqrt-transformed)"
    ),
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3),
    significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(predictor, n, estimate, p_value, significant)

tableA %>%
  kable(caption = "Table A: Correlations with BMD (Outcome)",
        col.names = c("Predictor", "N", "r", "p-value", "p < 0.05?")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table A: Correlations with BMD (Outcome)
Predictor	N	r	p-value	p < 0.05?
Age	2286	-0.232	0.00000	Yes
BMI (log-transformed)	2275	0.425	0.00000	Yes
Total Calcium (log-transformed)	2129	0.012	0.59200	No
Total Vitamin D (sqrt-transformed)	2129	-0.062	0.00426	Yes

Interpret the results:

Age shows a weak negative correlation with BMD (r = -0.232, p < 0.001), indicating that BMD tends to decrease slightly with age. BMI shows a moderate positive correlation with BMD (r = 0.425, p < 0.001), meaning higher BMI is associated with higher BMD. Total calcium intake shows no significant correlation with BMD (r = 0.012, p = 0.592). Total vitamin D intake shows a very weak negative but statistically significant correlation with BMD (r = -0.062, p = 0.004), though the practical significance is minimal given the small effect size. The sample sizes vary due to missing data in different variables.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# YOUR CODE HERE
tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
  corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    predictor = case_when(
      x == "RIDAGEYR" ~ "Age",
      x == "log_bmi" ~ "BMI (log-transformed)",
      x == "log_calcium_total" ~ "Total Calcium (log-transformed)",
      x == "sqrt_vitd_total" ~ "Total Vitamin D (sqrt-transformed)"
    ),
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3),
    significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(predictor, n, estimate, p_value, significant)

tableB %>%
  kable(caption = "Table B: Correlations with Physical Activity (sqrt-transformed)",
        col.names = c("Predictor", "N", "r", "p-value", "p < 0.05?")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table B: Correlations with Physical Activity (sqrt-transformed)
Predictor	N	r	p-value	p < 0.05?
Age	1934	-0.097	1.96e-05	Yes
BMI (log-transformed)	1912	0.001	9.51e-01	No
Total Calcium (log-transformed)	1777	0.086	2.82e-04	Yes
Total Vitamin D (sqrt-transformed)	1777	-0.002	9.32e-01	No

Interpret the results:

Age shows a weak negative correlation with physical activity (r = -0.097, p < 0.001), indicating that activity levels decrease slightly with age. BMI shows no correlation with physical activity (r = 0.001, p = 0.951). Total calcium intake has a weak positive correlation with activity (r = 0.086, p < 0.001), suggesting higher calcium intake is associated with slightly higher activity levels. Total vitamin D intake shows no significant correlation with physical activity (r = -0.002, p = 0.932). While some associations are statistically significant due to large sample sizes, the correlations are very weak in magnitude.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# YOUR CODE HERE
# Create all pairwise combinations of predictors
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)

tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
  mutate(
    var1 = case_when(
      x == "RIDAGEYR" ~ "Age",
      x == "log_bmi" ~ "BMI (log)",
      x == "log_calcium_total" ~ "Calcium (log)",
      x == "sqrt_vitd_total" ~ "Vitamin D (sqrt)"
    ),
    var2 = case_when(
      y == "RIDAGEYR" ~ "Age",
      y == "log_bmi" ~ "BMI (log)",
      y == "log_calcium_total" ~ "Calcium (log)",
      y == "sqrt_vitd_total" ~ "Vitamin D (sqrt)"
    ),
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3),
    significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(var1, var2, n, estimate, p_value, significant)

tableC %>%
  kable(caption = "Table C: Pairwise Correlations Among Predictors",
        col.names = c("Variable 1", "Variable 2", "N", "r", "p-value", "p < 0.05?")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Table C: Pairwise Correlations Among Predictors
Variable 1	Variable 2	N	r	p-value	p < 0.05?
Age	BMI (log)	2834	-0.098	2.00e-07	Yes
Age	Calcium (log)	2605	0.048	1.35e-02	Yes
Age	Vitamin D (sqrt)	2605	0.153	0.00e+00	Yes
BMI (log)	Calcium (log)	2569	0.000	9.83e-01	No
BMI (log)	Vitamin D (sqrt)	2569	0.007	7.31e-01	No
Calcium (log)	Vitamin D (sqrt)	2605	0.314	0.00e+00	Yes

Interpret the results:

Most correlations among predictors are weak. The strongest correlation is between calcium and vitamin D intake (r = 0.314, p < 0.001), which makes sense as these nutrients often come from similar food sources. Age shows weak correlations with all variables: negatively with BMI (r = -0.098) and positively with calcium (r = 0.048) and vitamin D (r = 0.153). BMI shows essentially no correlation with either nutrient (r = 0.000 and r = 0.007, both non-significant). These weak correlations suggest that multicollinearity is unlikely to be a major concern in future regression analyses using these variables.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

A. Comparing Methods (ANOVA vs Correlation)

ANOVA is used when comparing means across three or more categorical groups, while correlation is used to assess the strength and direction of linear relationships between two continuous variables. The key difference is that ANOVA tells us whether group means differ (categorical predictor → continuous outcome), while correlation tells us about the association between two continuous variables. An example research question for ANOVA would be: “Does mean blood pressure differ across smoking status groups (never, former, current)?” An example for correlation would be: “Is there a linear relationship between physical activity level and bone mineral density?”

B. Assumptions and Limitations

The most challenging assumptions to meet in this assignment were the normality of residuals and homogeneity of variances for ANOVA. Both were violated in our analysis, as shown by the Q-Q plot and Levene’s test. These violations could affect our conclusions by increasing the Type I error rate or reducing statistical power. However, ANOVA is fairly robust to these violations with large sample sizes. The cross-sectional correlation analysis has important limitations for understanding causality between nutrition/activity and bone health. Correlations cannot establish temporal sequence, are susceptible to confounding, and cannot rule out reverse causation (e.g., people with better bone health might be more physically active).

C. R Programming Challenge

The most difficult part of the R coding was creating all three correlation tables, particularly Table C with the pairwise combinations. I initially struggled with generating all possible pairs efficiently. I solved this by studying the combn() function and understanding how to combine it with map_dfr() from the purrr package. I also found the corr_pair() helper function very useful once I understood how it worked. Through this assignment, I strengthened my skills in data transformation, creating reproducible tables with kable(), and writing functions to automate repetitive tasks. I now feel more confident in my ability to conduct ANOVA and correlation analyses independently.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Fizza Zaheer

02/20/2026

Homework 1

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Optional: Visualization examples

Part 3 — Reflection (10 points)

Submission checklist