Missing Summary Before Cleaning
Metric	Value
Total Sample Size (raw)	2898 participants

Final Sample Size After Cleaning
Variable	Missing_n	Missing_percent
BMD	612	21.11801
Ethnicity	0	0.00000

Metric	Value
Final Analytic Sample Size	2898 participants

Part 1

Step 1: State Hypotheses

Null Hypothesis (H₀): μ_MexicanAmerican = μ_OtherHispanic = μ_NonHispanicWhite = μ_NonHispanicBlack = μ_Other

Alternative Hypothesis (H₁): At least one population mean differs from the others

Significance level: α = 0.05

The null hypothesis states that mean total femur bone density is the same across all race/ethnicity groups, while the alternative states that at least one group differs.

anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = bmd_clean)

# Boxplot with individual point BMD by Ethnicity
ggplot(bmd_clean,
       aes(x = factor(RIDRETH1),
           y = DXXOFBMD,
           fill = factor(RIDRETH1))) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Total Femur BMD by Ethnicity",
    subtitle = "Analytic Sample",
    x = "Race/Ethnicity",
    y = "Total Femur BMD (g/cm^2)",
    fill = "Race/Ethnicity"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

library(broom)

anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = bmd_clean)
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## RIDRETH1       4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 612 observations deleted due to missingness

anova_table <- tidy(anova_model)

anova_table %>%
  kable(caption = "ANOVA Results: Total Femur BMD by Ethnicity") %>%
  kable_styling(full_width = FALSE)

ANOVA Results: Total Femur BMD by Ethnicity
term	df	sumsq	meansq	statistic	p.value
RIDRETH1	4	2.948226	0.7370565	30.18504	0
Residuals	2281	55.697313	0.0244179	NA	NA

# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

F statistic = 30.18 df₁ = 4 (ethnicity groups − 1) df₂ = 2281 (residual df) p < 2e-16 ?(very small) Decision: Since p < 0.05, we reject H₀ Conclusion: There is statistically significant evidence that mean total femur bone density differs across race/ethnicity groups.

# Create diagnostic plots
par(mfrow = c(2, 2))
par(mfrow = c(1, 1))
plot(anova_model)

par(mfrow = c(1,2))

qqnorm(residuals(anova_model), main = "Q-Q Plot: Residuals")
qqline(residuals(anova_model), col = "red")

hist(residuals(anova_model), main = "Histogram of Residuals",
     xlab = "Residuals")

# Levene's test
levene_test <- leveneTest(DXXOFBMD ~ RIDRETH1, data = bmd_clean)

print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

# Levene’s test was not significant p = 0.1788, suggesting equal variances assumption is met.

Diagnostic Plot Interpretation:

Residuals vs Fitted: Plot showed random scatter around zero with no clear pattern, indicating no major violations of linearity → Good! Q-Q Plot: Points follow the diagonal line reasonably well, Slight deviation at tails → Normality assumption is reasonable Scale-Location: Flat trend, suggesting consistent variance across fitted values → Equal variance assumption is reasonable Residuals vs Leverage: No points beyond Cook’s distance lines → No highly influential outliers

Levene’s Test Interpretation:

p-value: 0.1788 If p > 0.05, we would not reject equal variances Here: Equal variance assumption is met

Overall Assessment: With n > 2000, ANOVA is robust to minor violations. Our assumptions are reasonably satisfied.

tuk <- TukeyHSD(anova_model)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ RIDRETH1, data = bmd_clean)
## 
## $RIDRETH1
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

library(tidyverse)
library(knitr)
library(kableExtra)

tukey_tbl <- as.data.frame(tuk$RIDRETH1) %>%
  tibble::rownames_to_column("Comparison") %>%
  rename(
    Mean_Difference = diff,
    CI_Lower = lwr,
    CI_Upper = upr,
    Adjusted_p = `p adj`
  ) %>%
  mutate(
    Mean_Difference = round(Mean_Difference, 4),
    CI_Lower = round(CI_Lower, 4),
    CI_Upper = round(CI_Upper, 4),
    Adjusted_p = signif(Adjusted_p, 3),
    Significant = ifelse(Adjusted_p < 0.05, "Yes", "No")
  )

# Tukey table 
tukey_tbl %>%
  kable(
    caption = "Tukey HSD Comparisons: BMD by Ethnicity",
    col.names = c("Comparison", "Mean Diff", "95% CI Lower", "95% CI Upper", "Adj p-value", "Significant?")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Tukey HSD Comparisons: BMD by Ethnicity
Comparison	Mean Diff	95% CI Lower	95% CI Upper	Adj p-value	Significant?
Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	9.98e-01	No
Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	3.00e-07	Yes
Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	3.28e-01	No
Other-Mexican American	-0.0533	-0.0874	-0.0193	1.92e-04	Yes
Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	3.60e-06	Yes
Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	1.72e-01	No
Other-Other Hispanic	-0.0489	-0.0834	-0.0144	1.07e-03	Yes
Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.00e+00	Yes
Other-Non-Hispanic White	0.0091	-0.0166	0.0347	8.72e-01	No
Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.00e+00	Yes

# Tukey’s test indicated that the mean total femur bone density differed between ethnicity groups, particularly between Non-Hispanic White and multiple other groups, while some group comparisons were not significant.

# Effect Size
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

# The effect size (η² = 0.08) indicates a medium effect.

Interpretation: Ethnicity explains 5.03% of the variance in total femur bone mineral density.

Effect size guidelines: Small (0.01), Medium (0.06), Large (0.14) Our effect: Small

Part 2: Correlation

# Packages
library(tidyverse)
library(broom)
library(kableExtra)

# Creating calcium_total/vitd_total
bmd_clean <- bmd_clean %>%
  mutate(
    calcium_total = replace_na(calcium, 0) + replace_na(DSQTCALC, 0),
    vitd_total = replace_na(vitd, 0) + replace_na(DSQTVD, 0)
  )

bmd_clean <- bmd_clean %>%
  mutate(
    log_BMI = log(BMXBMI),
    log_calcium = log(calcium_total + 1),
    sqrt_MET = sqrt(totmet),
    sqrt_vitd = sqrt(vitd_total)
  )
summary(bmd_clean$calcium_total)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   503.6   840.1   911.8  1229.5  5399.0

summary(bmd_clean$vitd_total)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.45    7.85   27.81   29.35 2574.65

# Log transformations was to reduce skewness, since total vitamin D and MET had many zero values and were right skewed.

# Factoring RIDRETH1
bmd_clean <- bmd_clean %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = c(1, 2, 3, 4, 5),
                      labels = c("Mexican American",
                                 "Other Hispanic",
                                 "Non-Hispanic White",
                                 "Non-Hispanic Black",
                                 "Other"))
  )

# Histograms

# BMI
ggplot(bmd_clean, aes(x = BMXBMI)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of BMI", x = "BMI", y = "Number") +
  theme_minimal()

# Calcium

ggplot(bmd_clean, aes(x = calcium_total)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of Total Calcium Intake",
       x = "Calcium total (mg/day)", y = "Number") +
  theme_minimal()

# Vitamin D

ggplot(bmd_clean, aes(x = vitd_total)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of Total Vitamin D Intake",
       x = "Vitamin D total (mcg/day)", y = "Number") +
  theme_minimal()

# MET

ggplot(bmd_clean, aes(x = totmet)) +
  geom_histogram(bins = 30) +
  labs(title = "Histogram of Physical Activity",
       x = "MET-min/week", y = "Number") +
  theme_minimal()

library(tidyverse)

corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()

  # Need at least 3 observations for correlation
  if (nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }

  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")

  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

# Table A: Age vs BMD
tableA <- bind_rows(
  corr_pair(bmd_clean, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd_clean, "log_BMI", "DXXOFBMD"),
  corr_pair(bmd_clean, "log_calcium", "DXXOFBMD"),
  corr_pair(bmd_clean, "sqrt_vitd", "DXXOFBMD")
) %>%
  mutate(
    r = round(estimate, 3),
    p_value = signif(p_value, 3)
  ) %>%
  select(x, y, n, r, p_value) %>%
  rename(Predictor = x, Outcome = y, `p-value` = p_value)

tableA %>%
  kable(caption = "Table A: Predictors vs Outcome (BMD)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Table A: Predictors vs Outcome (BMD)
Predictor	Outcome	n	r	p-value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.0000
log_BMI	DXXOFBMD	2275	0.425	0.0000
log_calcium	DXXOFBMD	2286	0.030	0.1450
sqrt_vitd	DXXOFBMD	2286	-0.051	0.0151

# Table B: Log BMI vs BMD
tableB <- bind_rows(
  corr_pair(bmd_clean, "RIDAGEYR", "sqrt_MET"),
  corr_pair(bmd_clean, "log_BMI", "sqrt_MET"),
  corr_pair(bmd_clean, "log_calcium", "sqrt_MET"),
  corr_pair(bmd_clean, "sqrt_vitd", "sqrt_MET")
) %>%
  mutate(
    r = round(estimate, 3),
    p_value = signif(p_value, 3)
  ) %>%
  select(x, y, n, r, p_value) %>%
  rename(Predictor = x, Exposure = y, `p-value` = p_value)

tableB %>%
  kable(caption = "Table B: Predictors vs Exposure (Physical Activity, sqrt_MET)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Table B: Predictors vs Exposure (Physical Activity, sqrt_MET)
Predictor	Exposure	n	r	p-value
RIDAGEYR	sqrt_MET	1934	-0.097	1.96e-05
log_BMI	sqrt_MET	1912	0.001	9.51e-01
log_calcium	sqrt_MET	1934	0.096	2.47e-05
sqrt_vitd	sqrt_MET	1934	0.007	7.58e-01

# Table C: Predictors vs Each Other
preds <- c("RIDAGEYR", "log_BMI", "log_calcium", "sqrt_vitd")
pairs <- combn(preds, 2, simplify = FALSE)

tableC <- map_dfr(pairs, ~ corr_pair(bmd_clean, .x[1], .x[2])) %>%
  mutate(
    r = round(estimate, 3),
    p_value = signif(p_value, 3)
  ) %>%
  select(x, y, n, r, p_value) %>%
  rename(Var1 = x, Var2 = y, `p-value` = p_value)

tableC %>%
  kable(caption = "Table C: Pairwise Correlations Among Predictors") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Table C: Pairwise Correlations Among Predictors
Var1	Var2	n	r	p-value
RIDAGEYR	log_BMI	2834	-0.098	2.00e-07
RIDAGEYR	log_calcium	2898	-0.029	1.16e-01
RIDAGEYR	sqrt_vitd	2898	0.149	0.00e+00
log_BMI	log_calcium	2834	0.021	2.73e-01
log_BMI	sqrt_vitd	2834	-0.003	8.76e-01
log_calcium	sqrt_vitd	2898	0.272	0.00e+00

Interpretation Age vs. BMD: Age showed a moderate negative correlation with BMD, showing that older individuals tend to have lower bone mineral density. Log-transformed BMI showed a moderate positive correlation with BMD, meaning that individuals with higher BMI, tend to have higher BMD. Square-root transformation displayed a weak negative correlation with BMD, which is statistically significant. Overall, BMI and age appear to be the strongest predictors of BMD.

Predictors vs. Exposure Age had a very weak negative correlation with BMD proving that physical activity decreases slightly with age. Log-transformed calcium intake showed a weak positive correlation with physical activity. BMI was not significantly associated with physical activity. Square root transformation vitamin D intake was not significantly associated with physical activity. Physical activity appears only weakly associated with these predictors.

Predictors vs. Each Other Age had a weak negative correlation with BMI and its calcium intake. Age also had a small positive correlation with vitamin D intake. BMI was did not have a specific correlation with calcium or vitamin D intake. The strongest association was between calcium and vitamin D intake, showing a small-to-moderate positive relationship. Since correlations among predictors were small, multicollinearity is unlikely to be a concern.

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?
What are the key differences in what these methods tell us?
Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?
How might violations of assumptions affect your conclusions?
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?
How did you solve it? (Describe your problem-solving process)
What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

In epidemiological research, ANOVA is used to compare the mean value of an outcome across groups, while correlation is used to determine the relationship between two continuous variables. ANOVA helps us determine whether group means are different, but it does not show the strength of a relationship, which correlation does. Correlation shows the direction and strength of association between variables. Another research question suited to ANOVA is whether physical activity differs across BMI categories, since it’s comparing average values amoung groups. A research question suited to correlation is whether vitamin D intake is associated with bone mineral density, since it examines relationship between two continuous variables.

A challenging assumption was linearity, which assumes that relationships between variables follow a straight-line pattern. Health data sometimes show more complex relationships, making this assumption difficult to meet. If linearity is violated, correlations may underestimate the true relationship and can contribute to incorrect conclusions about the strength of associations. A key limitation of cross-sectional correlation analysis is that it can’t establish cause and effect. Because data were collected at one point in time, it is unclear whether nutrition or physical activity influences bone health or whether other factors explain the observed relationships.

The most difficult part of this assignment was understanding how to run the Tukey HSD test and present the results in a clean, properly formatted table. At first, the output appeared as multiple values in the console rather than one organized table, which made it confusing to interpret and report the pairwise comparisons. I solved this by utilizing the template and learning how to convert the Tukey output into a data frame, add the comparison names as a column, and then format the results using functions like kable() to create a readable table. I also had to make sure the ANOVA model object was created before running the Tukey test, since missing objects caused errors. Through this process, I strengthened my skills in troubleshooting R errors, managing object names, and transforming statistical output into well-formatted tables for reporting. This experience helped me better understand how statistical results move from raw output to presentation-ready summaries.

Submission checklist

Before submitting, verify:

I went through this checklist.

Assignment 1

Sarah Bediako

2026-02-17

Part 1

Step 1: State Hypotheses

Part 3 — Reflection (10 points)

Submission checklist