#Part 0
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(kableExtra) # Enhanced tables
library(car) # For Levene's test
library(broom)
# Read the CSV file
library(readr)
bmd <- read_csv("~/Downloads/bmd.csv")
# Recode variables
bmd_clean <- bmd %>%
mutate(
# Ethnicity
RIDRETH1 = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American","Other Hispanic","Non-Hispanic White","Non-Hispanic Black","Other")
),
# Sex
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")
),
# Smoking status
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never")
)
)
| Metric | Value |
|---|---|
| Total Sample Size (raw) | 2898 participants |
| Variable | Missing_n | Missing_percent |
|---|---|---|
| BMD | 612 | 21.11801 |
| Ethnicity | 0 | 0.00000 |
| Metric | Value |
|---|---|
| Final Analytic Sample Size | 2898 participants |
Null Hypothesis (H₀): μ_MexicanAmerican = μ_OtherHispanic = μ_NonHispanicWhite = μ_NonHispanicBlack = μ_Other
Alternative Hypothesis (H₁): At least one population mean differs from the others
Significance level: α = 0.05
The null hypothesis states that mean total femur bone density is the same across all race/ethnicity groups, while the alternative states that at least one group differs.
anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = bmd_clean)
# Boxplot with individual point BMD by Ethnicity
ggplot(bmd_clean,
aes(x = factor(RIDRETH1),
y = DXXOFBMD,
fill = factor(RIDRETH1))) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Total Femur BMD by Ethnicity",
subtitle = "Analytic Sample",
x = "Race/Ethnicity",
y = "Total Femur BMD (g/cm^2)",
fill = "Race/Ethnicity"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
library(broom)
anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = bmd_clean)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## RIDRETH1 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 612 observations deleted due to missingness
anova_table <- tidy(anova_model)
anova_table %>%
kable(caption = "ANOVA Results: Total Femur BMD by Ethnicity") %>%
kable_styling(full_width = FALSE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| RIDRETH1 | 4 | 2.948226 | 0.7370565 | 30.18504 | 0 |
| Residuals | 2281 | 55.697313 | 0.0244179 | NA | NA |
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
F statistic = 30.18 df₁ = 4 (ethnicity groups − 1) df₂ = 2281 (residual df) p < 2e-16 ?(very small) Decision: Since p < 0.05, we reject H₀ Conclusion: There is statistically significant evidence that mean total femur bone density differs across race/ethnicity groups.
# Create diagnostic plots
par(mfrow = c(2, 2))
par(mfrow = c(1, 1))
plot(anova_model)
par(mfrow = c(1,2))
qqnorm(residuals(anova_model), main = "Q-Q Plot: Residuals")
qqline(residuals(anova_model), col = "red")
hist(residuals(anova_model), main = "Histogram of Residuals",
xlab = "Residuals")
# Levene's test
levene_test <- leveneTest(DXXOFBMD ~ RIDRETH1, data = bmd_clean)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
# Levene’s test was not significant p = 0.1788, suggesting equal variances assumption is met.
Diagnostic Plot Interpretation:
Residuals vs Fitted: Plot showed random scatter around zero with no clear pattern, indicating no major violations of linearity → Good! Q-Q Plot: Points follow the diagonal line reasonably well, Slight deviation at tails → Normality assumption is reasonable Scale-Location: Flat trend, suggesting consistent variance across fitted values → Equal variance assumption is reasonable Residuals vs Leverage: No points beyond Cook’s distance lines → No highly influential outliers
Levene’s Test Interpretation:
p-value: 0.1788 If p > 0.05, we would not reject equal variances Here: Equal variance assumption is met
Overall Assessment: With n > 2000, ANOVA is robust to minor violations. Our assumptions are reasonably satisfied.
tuk <- TukeyHSD(anova_model)
tuk
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ RIDRETH1, data = bmd_clean)
##
## $RIDRETH1
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
library(tidyverse)
library(knitr)
library(kableExtra)
tukey_tbl <- as.data.frame(tuk$RIDRETH1) %>%
tibble::rownames_to_column("Comparison") %>%
rename(
Mean_Difference = diff,
CI_Lower = lwr,
CI_Upper = upr,
Adjusted_p = `p adj`
) %>%
mutate(
Mean_Difference = round(Mean_Difference, 4),
CI_Lower = round(CI_Lower, 4),
CI_Upper = round(CI_Upper, 4),
Adjusted_p = signif(Adjusted_p, 3),
Significant = ifelse(Adjusted_p < 0.05, "Yes", "No")
)
# Tukey table
tukey_tbl %>%
kable(
caption = "Tukey HSD Comparisons: BMD by Ethnicity",
col.names = c("Comparison", "Mean Diff", "95% CI Lower", "95% CI Upper", "Adj p-value", "Significant?")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Comparison | Mean Diff | 95% CI Lower | 95% CI Upper | Adj p-value | Significant? |
|---|---|---|---|---|---|
| Other Hispanic-Mexican American | -0.0044 | -0.0426 | 0.0338 | 9.98e-01 | No |
| Non-Hispanic White-Mexican American | -0.0624 | -0.0929 | -0.0319 | 3.00e-07 | Yes |
| Non-Hispanic Black-Mexican American | 0.0224 | -0.0101 | 0.0549 | 3.28e-01 | No |
| Other-Mexican American | -0.0533 | -0.0874 | -0.0193 | 1.92e-04 | Yes |
| Non-Hispanic White-Other Hispanic | -0.0579 | -0.0889 | -0.0269 | 3.60e-06 | Yes |
| Non-Hispanic Black-Other Hispanic | 0.0268 | -0.0062 | 0.0598 | 1.72e-01 | No |
| Other-Other Hispanic | -0.0489 | -0.0834 | -0.0144 | 1.07e-03 | Yes |
| Non-Hispanic Black-Non-Hispanic White | 0.0848 | 0.0612 | 0.1084 | 0.00e+00 | Yes |
| Other-Non-Hispanic White | 0.0091 | -0.0166 | 0.0347 | 8.72e-01 | No |
| Other-Non-Hispanic Black | -0.0757 | -0.1038 | -0.0477 | 0.00e+00 | Yes |
# Tukey’s test indicated that the mean total femur bone density differed between ethnicity groups, particularly between Non-Hispanic White and multiple other groups, while some group comparisons were not significant.
# Effect Size
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
# The effect size (η² = 0.08) indicates a medium effect.
Interpretation: Ethnicity explains 5.03% of the variance in total femur bone mineral density.
Effect size guidelines: Small (0.01), Medium (0.06), Large (0.14) Our effect: Small
Part 2: Correlation
# Packages
library(tidyverse)
library(broom)
library(kableExtra)
# Creating calcium_total/vitd_total
bmd_clean <- bmd_clean %>%
mutate(
calcium_total = replace_na(calcium, 0) + replace_na(DSQTCALC, 0),
vitd_total = replace_na(vitd, 0) + replace_na(DSQTVD, 0)
)
bmd_clean <- bmd_clean %>%
mutate(
log_BMI = log(BMXBMI),
log_calcium = log(calcium_total + 1),
sqrt_MET = sqrt(totmet),
sqrt_vitd = sqrt(vitd_total)
)
summary(bmd_clean$calcium_total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 503.6 840.1 911.8 1229.5 5399.0
summary(bmd_clean$vitd_total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.45 7.85 27.81 29.35 2574.65
# Log transformations was to reduce skewness, since total vitamin D and MET had many zero values and were right skewed.
# Factoring RIDRETH1
bmd_clean <- bmd_clean %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other"))
)
# Histograms
# BMI
ggplot(bmd_clean, aes(x = BMXBMI)) +
geom_histogram(bins = 30) +
labs(title = "Histogram of BMI", x = "BMI", y = "Number") +
theme_minimal()
# Calcium
ggplot(bmd_clean, aes(x = calcium_total)) +
geom_histogram(bins = 30) +
labs(title = "Histogram of Total Calcium Intake",
x = "Calcium total (mg/day)", y = "Number") +
theme_minimal()
# Vitamin D
ggplot(bmd_clean, aes(x = vitd_total)) +
geom_histogram(bins = 30) +
labs(title = "Histogram of Total Vitamin D Intake",
x = "Vitamin D total (mcg/day)", y = "Number") +
theme_minimal()
# MET
ggplot(bmd_clean, aes(x = totmet)) +
geom_histogram(bins = 30) +
labs(title = "Histogram of Physical Activity",
x = "MET-min/week", y = "Number") +
theme_minimal()
library(tidyverse)
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if (nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}
# Table A: Age vs BMD
tableA <- bind_rows(
corr_pair(bmd_clean, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd_clean, "log_BMI", "DXXOFBMD"),
corr_pair(bmd_clean, "log_calcium", "DXXOFBMD"),
corr_pair(bmd_clean, "sqrt_vitd", "DXXOFBMD")
) %>%
mutate(
r = round(estimate, 3),
p_value = signif(p_value, 3)
) %>%
select(x, y, n, r, p_value) %>%
rename(Predictor = x, Outcome = y, `p-value` = p_value)
tableA %>%
kable(caption = "Table A: Predictors vs Outcome (BMD)") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Predictor | Outcome | n | r | p-value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.0000 |
| log_BMI | DXXOFBMD | 2275 | 0.425 | 0.0000 |
| log_calcium | DXXOFBMD | 2286 | 0.030 | 0.1450 |
| sqrt_vitd | DXXOFBMD | 2286 | -0.051 | 0.0151 |
# Table B: Log BMI vs BMD
tableB <- bind_rows(
corr_pair(bmd_clean, "RIDAGEYR", "sqrt_MET"),
corr_pair(bmd_clean, "log_BMI", "sqrt_MET"),
corr_pair(bmd_clean, "log_calcium", "sqrt_MET"),
corr_pair(bmd_clean, "sqrt_vitd", "sqrt_MET")
) %>%
mutate(
r = round(estimate, 3),
p_value = signif(p_value, 3)
) %>%
select(x, y, n, r, p_value) %>%
rename(Predictor = x, Exposure = y, `p-value` = p_value)
tableB %>%
kable(caption = "Table B: Predictors vs Exposure (Physical Activity, sqrt_MET)") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Predictor | Exposure | n | r | p-value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_MET | 1934 | -0.097 | 1.96e-05 |
| log_BMI | sqrt_MET | 1912 | 0.001 | 9.51e-01 |
| log_calcium | sqrt_MET | 1934 | 0.096 | 2.47e-05 |
| sqrt_vitd | sqrt_MET | 1934 | 0.007 | 7.58e-01 |
# Table C: Predictors vs Each Other
preds <- c("RIDAGEYR", "log_BMI", "log_calcium", "sqrt_vitd")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~ corr_pair(bmd_clean, .x[1], .x[2])) %>%
mutate(
r = round(estimate, 3),
p_value = signif(p_value, 3)
) %>%
select(x, y, n, r, p_value) %>%
rename(Var1 = x, Var2 = y, `p-value` = p_value)
tableC %>%
kable(caption = "Table C: Pairwise Correlations Among Predictors") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Var1 | Var2 | n | r | p-value |
|---|---|---|---|---|
| RIDAGEYR | log_BMI | 2834 | -0.098 | 2.00e-07 |
| RIDAGEYR | log_calcium | 2898 | -0.029 | 1.16e-01 |
| RIDAGEYR | sqrt_vitd | 2898 | 0.149 | 0.00e+00 |
| log_BMI | log_calcium | 2834 | 0.021 | 2.73e-01 |
| log_BMI | sqrt_vitd | 2834 | -0.003 | 8.76e-01 |
| log_calcium | sqrt_vitd | 2898 | 0.272 | 0.00e+00 |
Interpretation Age vs. BMD: Age showed a moderate negative correlation with BMD, showing that older individuals tend to have lower bone mineral density. Log-transformed BMI showed a moderate positive correlation with BMD, meaning that individuals with higher BMI, tend to have higher BMD. Square-root transformation displayed a weak negative correlation with BMD, which is statistically significant. Overall, BMI and age appear to be the strongest predictors of BMD.
Predictors vs. Exposure Age had a very weak negative correlation with BMD proving that physical activity decreases slightly with age. Log-transformed calcium intake showed a weak positive correlation with physical activity. BMI was not significantly associated with physical activity. Square root transformation vitamin D intake was not significantly associated with physical activity. Physical activity appears only weakly associated with these predictors.
Predictors vs. Each Other Age had a weak negative correlation with BMI and its calcium intake. Age also had a small positive correlation with vitamin D intake. BMI was did not have a specific correlation with calcium or vitamin D intake. The strongest association was between calcium and vitamin D intake, showing a small-to-moderate positive relationship. Since correlations among predictors were small, multicollinearity is unlikely to be a concern.
Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
In epidemiological research, ANOVA is used to compare the mean value of an outcome across groups, while correlation is used to determine the relationship between two continuous variables. ANOVA helps us determine whether group means are different, but it does not show the strength of a relationship, which correlation does. Correlation shows the direction and strength of association between variables. Another research question suited to ANOVA is whether physical activity differs across BMI categories, since it’s comparing average values amoung groups. A research question suited to correlation is whether vitamin D intake is associated with bone mineral density, since it examines relationship between two continuous variables.
A challenging assumption was linearity, which assumes that relationships between variables follow a straight-line pattern. Health data sometimes show more complex relationships, making this assumption difficult to meet. If linearity is violated, correlations may underestimate the true relationship and can contribute to incorrect conclusions about the strength of associations. A key limitation of cross-sectional correlation analysis is that it can’t establish cause and effect. Because data were collected at one point in time, it is unclear whether nutrition or physical activity influences bone health or whether other factors explain the observed relationships.
The most difficult part of this assignment was understanding how to run the Tukey HSD test and present the results in a clean, properly formatted table. At first, the output appeared as multiple values in the console rather than one organized table, which made it confusing to interpret and report the pairwise comparisons. I solved this by utilizing the template and learning how to convert the Tukey output into a data frame, add the comparison names as a column, and then format the results using functions like kable() to create a readable table. I also had to make sure the ANOVA model object was created before running the Tukey test, since missing objects caused errors. Through this process, I strengthened my skills in troubleshooting R errors, managing object names, and transforming statistical output into well-formatted tables for reporting. This experience helped me better understand how statistical results move from raw output to presentation-ready summaries.
Before submitting, verify:
I went through this checklist.