Downloading Packages

# Load necessary libraries
library(tidyverse)
library(kableExtra)     
library(car)        
library(broom) 

Part 0: Data Preparation

bmd <- readr::read_csv("~/Documents/Albany/Courses/Spring 2026/Epi 553/Homework Assignments/HW1/bmd.csv", show_col_types = FALSE)

#creating new labels
bmd_new <- bmd %>% 
  mutate(
    sex = factor (RIAGENDR, levels = c(1,2), labels = c("Male", "Female")), 
         ethnicity = factor (RIDRETH1, levels = c(1,2,3,4,5), 
                             labels = c("Mexican American", "Other Hispanic",
                                        "Non-Hispanic White", "Non-Hispanic Black", "Other")),
                            smokernew = factor (smoker, levels = c(1,2,3), 
                                                labels = c("Current","Past","Never"))
  )

# Display sample
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(bmd_new), "bmd_new")
  ) %>%
  kable()
Metric Value
Sample Size 2898 bmd_new
#Removing missing BMD and ethnicity 
bmd_new1 <- bmd_new %>% 
filter(!is.na(DXXOFBMD) & !is.na(ethnicity))

# Display sample
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(bmd_new1), "bmd_new1")
  ) %>%
  kable()
Metric Value
Sample Size 2286 bmd_new1

The initial dataset bmd.csv included N = 2898 observations and 14 variables. Due to missing data, 612 observations were removed. Therfore, for the ANOVA analysis, the final analytic sample size is 2286 observations.


Part 1: One-Way ANOVA

Research Question: Do mean BMD values differ across ethnicity groups?

Step 1: State Hypothesis

Null Hypothesis (H₀): μ_Mexican_American = μ_Other_Hispanic = μ_Non-Hispanic_White = μ_Non-Hispanic_Black = μ_Other (All five BMD population means are equal)

Alternative Hypothesis (H₁): At least one population mean differs from the others

Significance level: α = 0.05

#Step 2: Exploratory Analysis
summary_stats <- bmd_new1 %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    Mean = mean(DXXOFBMD),
    SD = sd(DXXOFBMD),
    Median = median(DXXOFBMD),
    Min = min(DXXOFBMD),
    Max = max(DXXOFBMD)
  )

summary_stats %>%
  kable(digits = 2,
        caption = "Decriptive Statistics: BMD by Ethnicity Categories")
Decriptive Statistics: BMD by Ethnicity Categories
ethnicity n Mean SD Median Min Max
Mexican American 255 0.95 0.15 0.96 0.44 1.46
Other Hispanic 244 0.95 0.16 0.94 0.55 1.39
Non-Hispanic White 846 0.89 0.16 0.88 0.37 1.46
Non-Hispanic Black 532 0.97 0.16 0.98 0.44 1.55
Other 409 0.90 0.15 0.89 0.54 1.34
# Create boxplots with individual points
ggplot(bmd_new1, 
  aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "BMD Populations Means Across Ethnicity Groups",
    subtitle = "Bone Mineral Density Data",
    x = "Ethnicity",
    y = "BMD Means",
    fill = "Ethnicity"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

# **Interpret the plot:** What patterns do you observe in BMD across ethnicity groups?

# There are clear differences in BMD values among the ethnic groups. Some groups have  higher average BMD, while others are lower, such as the Mexican American and Non-Hispanic Black vategories having higher BMD, and Non-Hispanic White and Other having lower BMD. 

# Step 3: ANOVA F-Test 
# Fit the one-way ANOVA model

anova_model <- aov(DXXOFBMD ~ethnicity, data = bmd_new1)

# Display ANOVA table
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# State Decision and Interpret: The p-value is <2e-16 which is <0.05. Therefore, we are rejecting the null hypothesis and there is a statistical significance between the mean BMD across different ehtnicity groups.  

# Step 4: Assumption Checks
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_model)

par(mfrow = c(1, 1))

# Step 4B: Levene Test 
# Levene's test for homogeneity of variance

levene_test <- leveneTest(DXXOFBMD ~ ethnicity, data = bmd_new1)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281
# Interpretation: Since the p-value is 0.1788 which is greater than 0.05, we do not reject equal variances. 
  • F-statistic: 30.18
  • Degrees of freedom: 4
  • p-value: <2e-16
  • Decision (reject or fail to reject H₀): reject the null (H₀)
  • Statistical conclusion in words: Since the p-value, <2e-16, is less than 0.05, the result is statistically significant and we fail to reject the null hypothesis (H₀). Thus, we will accept the alternative hypothesis (H₁) meaning that at least one population mean differs from the others. At least one BMD population mean is different among the ethnicity categories.

Normality 1. Residuals vs Fitted: Points show random scatter around zero with no clear pattern → Good! 2. Q-Q Plot: Points follow the diagonal line reasonably well. 3. Scale-Location: Red line is relatively flat → Equal variance assumption is reasonable 4. Residuals vs Leverage: No points beyond Cook’s distance lines → No highly influential outliers

# Step 5: Post-Hoc Comparisons
# Only if your ANOVA p-value < 0.05

# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = bmd_new1)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000
# Visualize the confidence intervals
plot(tukey_results, las = 0)

Table:

Comparison Mean Difference 95% CI Lower 95% CI Upper p-value Significant?
Other Hispanic - Mexican American -0.004441273 -0.042644130 0.03376158 0.9978049 No
Non-Hispanic White - Mexican American 0.022391619 -0.010100052 0.05488329 0.0000003 Yes
Non-Hispanic Black - Mexican American 0.022391619 -0.010100052 0.05488329 0.3276613 No
Other - Mexican American -0.053319344 -0.087357253 -0.01928144 0.0001919 0.0001919
Non-Hispanic White - Other Hispanic -0.053319344 -0.088934694 -0.02693726 0.0000036 Yes
Non-Hispanic Black - Other Hispanic 0.026832892 -0.006150151 0.05981593 0.1722466 No
Other - Other Hispanic -0.048878071 -0.083385341 -0.01437080 0.0010720 Yes
Non-Hispanic Black - Non-Hispanic White 0.084768868 0.061164402 0.10837333 0 Yes
Other - Non-Hispanic White 0.009057905 -0.016633367 0.03474918 0.8719109 No
Other - Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741 0 Yes

Interpretation:

Which specific groups differ significantly? Non-Hispanic White - Mexican American, Other - Mexican American, Non-Hispanic White - Other Hispanic, Other - Other Hispanic, Non-Hispanic Black - Non-Hispanic White, Other - Non-Hispanic Black all have a p-value less than 0.05, therefore the are statistically significant.

# Calculate eta-squared
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
  • η² = 0.0503
  • Percentage of variance explained: Ethnicity explains only 5.03% of the variance of bmd. Other unmeasured factors explain the remaining 94.97%
  • Effect size classification (small/medium/large): small
  • What this means practically: The effect is real but modest in magnitude. Ethnicity explains 5.03% of the variance in bone mineral density (bmd).

Part 2: Correlation Analysis

#Step 1: Create Total Intake Variables

bmd_new2 <- bmd_new1 %>%
  mutate(
    calcium_total = coalesce(calcium, 0) + coalesce(DSQTCALC, 0),
    vitd_total    = coalesce(vitd, 0) + coalesce(DSQTVD, 0)
  )

Step 2: Assess and Apply Transformations

# Visual checks
par(mfrow = c(1, 2))

# Normal data 

# BMI
hist(bmd_new2$BMXBMI, 
     main = "Histogram: BMI", 
     xlab = "BMI",
     col = "lightblue")

qqnorm(bmd_new2$BMXBMI, 
       main = "Q-Q Plot: BMI")
qqline(bmd_new2$BMXBMI, 
       col = "red", 
       lwd = 2)

#Total Calcium
hist(bmd_new2$calcium_total,
     main = "Histogram: Total Calcium",
     xlab = "Calcium Intake",
     col = "lightgreen")

qqnorm(bmd_new2$calcium_total,
       main = "Q-Q Plot: Total Calcium")
qqline(bmd_new2$calcium_total,
       col = "red",
       lwd = 2)

# Total Vitamin D
hist (bmd_new2$vitd_total,
      main = "Histogram: Total Vitamin D",
      xlab = "Vitamin D Intake",
      col = "lightgreen")

qqnorm(bmd_new2$vitd_total,
       main = "Q-Q Plot: Total Vitamin D")
qqline(bmd_new2$vitd_total,
       col = "red",
       lwd = 2)

# MET
hist(bmd_new2$totmet,
     main = "Histogram: MET",
     xlab = "MET",
     col = "lightgray")

qqnorm(bmd_new2$totmet,
       main = "Q-Q Plot: MET")
qqline(bmd_new2$totmet,
       col = "red",
       lwd = 2)

# Apply Transformations

# Log transformation (add +1 to calcium in case of zeros)
bmd_new2$log_BMXBMI <- log(bmd_new2$BMXBMI)
bmd_new2$log_calcium_total <- log(bmd_new2$calcium_total + 1)

# Square root transformation
bmd_new2$sqrt_totmet <- sqrt(bmd_new2$totmet)
bmd_new2$sqrt_vitd_total <- sqrt(bmd_new2$vitd_total)


# Re-check Normality After Transformation

par(mfrow = c(1, 2))

qqnorm(bmd_new2$log_BMXBMI, main = "Q-Q Plot: Log BMI")
qqline(bmd_new2$log_BMXBMI, col = "red", lwd = 2)

qqnorm(bmd_new2$log_calcium_total, main = "Q-Q Plot: Log Calcium")
qqline(bmd_new2$log_calcium_total, col = "red", lwd = 2)

qqnorm(bmd_new2$sqrt_totmet, main = "Q-Q Plot: Sqrt MET")
qqline(bmd_new2$sqrt_totmet, col = "red", lwd = 2)

qqnorm(bmd_new2$sqrt_vitd_total, main = "Q-Q Plot: Sqrt Vitamin D")
qqline(bmd_new2$sqrt_vitd_total, col = "red", lwd = 2)

I applied transformations to the variables bmi, total calcium, total vitamin D, and MET because I wanted to make sure the data was distributed more symmetrically and not right-skewed.


Step 3: Three Correlation Tables

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

  • Age vs. BMD
  • BMI vs. BMD
  • Total calcium vs. BMD
  • Total vitamin D vs. BMD
tableA <- bind_rows(
  corr_pair(bmd_new2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd_new2, "log_BMXBMI", "DXXOFBMD"),
  corr_pair(bmd_new2, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd_new2, "sqrt_vitd_total", "DXXOFBMD")
  ) %>% 
 
   mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
    )

tableA %>% kable()
x y n estimate p_value
RIDAGEYR DXXOFBMD 2286 -0.232 0.0000
log_BMXBMI DXXOFBMD 2275 0.425 0.0000
log_calcium_total DXXOFBMD 2286 0.030 0.1450
sqrt_vitd_total DXXOFBMD 2286 -0.051 0.0151

Interpret the results:

Which variables show statistically significant correlations with BMD? What is the direction and strength of these relationships?

The variables showing statistically significant correlations with BMD are age, BMI, and vitamin D. Age and BMD, as well as vitamin D and BMD, exhibit inverse correlations, indicating that BMD decreases as age or vitamin D levels increase. In contrast, BMI and BMD have a positive correlation, meaning that higher BMI is associated with higher BMD.


Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

  • Age vs. MET
  • BMI vs. MET
  • Total calcium vs. MET
  • Total vitamin D vs. MET
tableB <- bind_rows(
  corr_pair(bmd_new2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd_new2, "log_BMXBMI", "sqrt_totmet"),
  corr_pair(bmd_new2, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd_new2, "sqrt_vitd_total", "sqrt_totmet")
) %>% 
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )

tableB %>% kable()
x y n estimate p_value
RIDAGEYR sqrt_totmet 1588 -0.105 2.67e-05
log_BMXBMI sqrt_totmet 1582 0.009 7.32e-01
log_calcium_total sqrt_totmet 1588 0.072 4.36e-03
sqrt_vitd_total sqrt_totmet 1588 -0.002 9.34e-01

Interpret the results:

Which variables are associated with physical activity levels?

All of the variables are associated with physical activity levels due to the p-values being less than 0.05.


Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

  • Age, BMI, Total calcium, Total vitamin D
preds <- c("RIDAGEYR", "BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd_new2, .x[1], .x[2])) %>%
 
  mutate(
  estimate = round(estimate, 3),
  p_value = signif(p_value, 3)
  )

  tableC %>% kable()
x y n estimate p_value
RIDAGEYR BMXBMI 2275 -0.098 2.70e-06
RIDAGEYR log_calcium_total 2286 -0.024 2.58e-01
RIDAGEYR sqrt_vitd_total 2286 0.146 0.00e+00
BMXBMI log_calcium_total 2275 0.032 1.23e-01
BMXBMI sqrt_vitd_total 2275 0.027 2.01e-01
log_calcium_total sqrt_vitd_total 2286 0.224 0.00e+00

Interpret the results:

Based on the correlations from table C (looking at the estimates), there is no strong evidence of multicollinearity among these predictors. Therefore, we should be able to include them in a regression model without worrying about inflated standard errors or unstable coefficient estimates due to multicollinearity.


Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd_new2, aes(x = log_BMXBMI, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd_new2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

  • When would you use ANOVA versus correlation in epidemiological research?
  • What are the key differences in what these methods tell us?
  • Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

  • Which assumptions were most challenging to meet in this assignment?
  • How might violations of assumptions affect your conclusions?
  • What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

  • What was the most difficult part of the R coding for this assignment?
  • How did you solve it? (Describe your problem-solving process)
  • What R skill did you strengthen through completing this assignment?

ANOVA and correlation are both useful statistical tools in epidemiological research, but they answer different types of questions. ANOVA is used when comparing the mean outcome across two or more categorical groups. For example, it would be appropriate to use ANOVA to examine whether mean BMI differs across different physical activity categories (low, moderate, high). In contrast, correlation is used to measure the strength and direction of a linear relationship between two continuous variables. Correlation would be more appropriate for examining the relationship between age and blood pressure, since both variables are continuous.

The key difference is that ANOVA tests for differences in group means, while correlation assesses the strength of association between continuous variables. ANOVA helps determine whether groups differ significantly, whereas correlation tells us how strongly two variables move together and whether the relationship is positive or negative.

The most challenging assumptions to meet in this assignment were normality and linearity. Some variables, such as BMI and calcium intake, were skewed and required log transformations. Violations of normality or homoscedasticity could lead to biased estimates, incorrect p-values, and misleading conclusions.

A major limitation of cross-sectional correlation analysis is that it does not establish causation. Because all data are collected at one point point in time, we cannot determine whether higher calcium intake leads to improved bone health or whether individuals with better bone health engage in healthier behaviors. Confounding and reverse causation are also concerns.

The most difficult part of the R coding was managing variable names and ensuring transformations were correctly applied in the correlation tables and plots. Errors such as mismatched column names and pipe syntax issues required careful debugging. I solved these issues by re-checking variable names and testing codes in smaller sections before combining them. Through this assignment, I strengthened my understanding of writing functions, using bind_rows(), and applying transformations properly within tidyverse package.