bmd <- readr::read_csv("~/Documents/Albany/Courses/Spring 2026/Epi 553/Homework Assignments/HW1/bmd.csv", show_col_types = FALSE)
#creating new labels
bmd_new <- bmd %>%
mutate(
sex = factor (RIAGENDR, levels = c(1,2), labels = c("Male", "Female")),
ethnicity = factor (RIDRETH1, levels = c(1,2,3,4,5),
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
smokernew = factor (smoker, levels = c(1,2,3),
labels = c("Current","Past","Never"))
)
# Display sample
# Display sample size
data.frame(
Metric = "Sample Size",
Value = paste(nrow(bmd_new), "bmd_new")
) %>%
kable()| Metric | Value |
|---|---|
| Sample Size | 2898 bmd_new |
#Removing missing BMD and ethnicity
bmd_new1 <- bmd_new %>%
filter(!is.na(DXXOFBMD) & !is.na(ethnicity))
# Display sample
# Display sample size
data.frame(
Metric = "Sample Size",
Value = paste(nrow(bmd_new1), "bmd_new1")
) %>%
kable()| Metric | Value |
|---|---|
| Sample Size | 2286 bmd_new1 |
The initial dataset bmd.csv included N = 2898 observations and 14 variables. Due to missing data, 612 observations were removed. Therfore, for the ANOVA analysis, the final analytic sample size is 2286 observations.
Step 1: State Hypothesis
Null Hypothesis (H₀): μ_Mexican_American = μ_Other_Hispanic = μ_Non-Hispanic_White = μ_Non-Hispanic_Black = μ_Other (All five BMD population means are equal)
Alternative Hypothesis (H₁): At least one population mean differs from the others
Significance level: α = 0.05
#Step 2: Exploratory Analysis
summary_stats <- bmd_new1 %>%
group_by(ethnicity) %>%
summarise(
n = n(),
Mean = mean(DXXOFBMD),
SD = sd(DXXOFBMD),
Median = median(DXXOFBMD),
Min = min(DXXOFBMD),
Max = max(DXXOFBMD)
)
summary_stats %>%
kable(digits = 2,
caption = "Decriptive Statistics: BMD by Ethnicity Categories")| ethnicity | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Mexican American | 255 | 0.95 | 0.15 | 0.96 | 0.44 | 1.46 |
| Other Hispanic | 244 | 0.95 | 0.16 | 0.94 | 0.55 | 1.39 |
| Non-Hispanic White | 846 | 0.89 | 0.16 | 0.88 | 0.37 | 1.46 |
| Non-Hispanic Black | 532 | 0.97 | 0.16 | 0.98 | 0.44 | 1.55 |
| Other | 409 | 0.90 | 0.15 | 0.89 | 0.54 | 1.34 |
# Create boxplots with individual points
ggplot(bmd_new1,
aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "BMD Populations Means Across Ethnicity Groups",
subtitle = "Bone Mineral Density Data",
x = "Ethnicity",
y = "BMD Means",
fill = "Ethnicity"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")# **Interpret the plot:** What patterns do you observe in BMD across ethnicity groups?
# There are clear differences in BMD values among the ethnic groups. Some groups have higher average BMD, while others are lower, such as the Mexican American and Non-Hispanic Black vategories having higher BMD, and Non-Hispanic White and Other having lower BMD.
# Step 3: ANOVA F-Test
# Fit the one-way ANOVA model
anova_model <- aov(DXXOFBMD ~ethnicity, data = bmd_new1)
# Display ANOVA table
summary(anova_model)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# State Decision and Interpret: The p-value is <2e-16 which is <0.05. Therefore, we are rejecting the null hypothesis and there is a statistical significance between the mean BMD across different ehtnicity groups.
# Step 4: Assumption Checks
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_model)par(mfrow = c(1, 1))
# Step 4B: Levene Test
# Levene's test for homogeneity of variance
levene_test <- leveneTest(DXXOFBMD ~ ethnicity, data = bmd_new1)
print(levene_test)## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
# Interpretation: Since the p-value is 0.1788 which is greater than 0.05, we do not reject equal variances. Normality 1. Residuals vs Fitted: Points show random scatter around zero with no clear pattern → Good! 2. Q-Q Plot: Points follow the diagonal line reasonably well. 3. Scale-Location: Red line is relatively flat → Equal variance assumption is reasonable 4. Residuals vs Leverage: No points beyond Cook’s distance lines → No highly influential outliers
# Step 5: Post-Hoc Comparisons
# Only if your ANOVA p-value < 0.05
# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = bmd_new1)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Table:
| Comparison | Mean Difference | 95% CI Lower | 95% CI Upper | p-value | Significant? |
|---|---|---|---|---|---|
| Other Hispanic - Mexican American | -0.004441273 | -0.042644130 | 0.03376158 | 0.9978049 | No |
| Non-Hispanic White - Mexican American | 0.022391619 | -0.010100052 | 0.05488329 | 0.0000003 | Yes |
| Non-Hispanic Black - Mexican American | 0.022391619 | -0.010100052 | 0.05488329 | 0.3276613 | No |
| Other - Mexican American | -0.053319344 | -0.087357253 | -0.01928144 | 0.0001919 | 0.0001919 |
| Non-Hispanic White - Other Hispanic | -0.053319344 | -0.088934694 | -0.02693726 | 0.0000036 | Yes |
| Non-Hispanic Black - Other Hispanic | 0.026832892 | -0.006150151 | 0.05981593 | 0.1722466 | No |
| Other - Other Hispanic | -0.048878071 | -0.083385341 | -0.01437080 | 0.0010720 | Yes |
| Non-Hispanic Black - Non-Hispanic White | 0.084768868 | 0.061164402 | 0.10837333 | 0 | Yes |
| Other - Non-Hispanic White | 0.009057905 | -0.016633367 | 0.03474918 | 0.8719109 | No |
| Other - Non-Hispanic Black | -0.075710963 | -0.103764519 | -0.04765741 | 0 | Yes |
Interpretation:
Which specific groups differ significantly? Non-Hispanic White - Mexican American, Other - Mexican American, Non-Hispanic White - Other Hispanic, Other - Other Hispanic, Non-Hispanic Black - Non-Hispanic White, Other - Non-Hispanic Black all have a p-value less than 0.05, therefore the are statistically significant.
# Calculate eta-squared
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
#Step 1: Create Total Intake Variables
bmd_new2 <- bmd_new1 %>%
mutate(
calcium_total = coalesce(calcium, 0) + coalesce(DSQTCALC, 0),
vitd_total = coalesce(vitd, 0) + coalesce(DSQTVD, 0)
)Step 2: Assess and Apply Transformations
# Visual checks
par(mfrow = c(1, 2))
# Normal data
# BMI
hist(bmd_new2$BMXBMI,
main = "Histogram: BMI",
xlab = "BMI",
col = "lightblue")
qqnorm(bmd_new2$BMXBMI,
main = "Q-Q Plot: BMI")
qqline(bmd_new2$BMXBMI,
col = "red",
lwd = 2)#Total Calcium
hist(bmd_new2$calcium_total,
main = "Histogram: Total Calcium",
xlab = "Calcium Intake",
col = "lightgreen")
qqnorm(bmd_new2$calcium_total,
main = "Q-Q Plot: Total Calcium")
qqline(bmd_new2$calcium_total,
col = "red",
lwd = 2)# Total Vitamin D
hist (bmd_new2$vitd_total,
main = "Histogram: Total Vitamin D",
xlab = "Vitamin D Intake",
col = "lightgreen")
qqnorm(bmd_new2$vitd_total,
main = "Q-Q Plot: Total Vitamin D")
qqline(bmd_new2$vitd_total,
col = "red",
lwd = 2)# MET
hist(bmd_new2$totmet,
main = "Histogram: MET",
xlab = "MET",
col = "lightgray")
qqnorm(bmd_new2$totmet,
main = "Q-Q Plot: MET")
qqline(bmd_new2$totmet,
col = "red",
lwd = 2)# Apply Transformations
# Log transformation (add +1 to calcium in case of zeros)
bmd_new2$log_BMXBMI <- log(bmd_new2$BMXBMI)
bmd_new2$log_calcium_total <- log(bmd_new2$calcium_total + 1)
# Square root transformation
bmd_new2$sqrt_totmet <- sqrt(bmd_new2$totmet)
bmd_new2$sqrt_vitd_total <- sqrt(bmd_new2$vitd_total)
# Re-check Normality After Transformation
par(mfrow = c(1, 2))
qqnorm(bmd_new2$log_BMXBMI, main = "Q-Q Plot: Log BMI")
qqline(bmd_new2$log_BMXBMI, col = "red", lwd = 2)
qqnorm(bmd_new2$log_calcium_total, main = "Q-Q Plot: Log Calcium")
qqline(bmd_new2$log_calcium_total, col = "red", lwd = 2)qqnorm(bmd_new2$sqrt_totmet, main = "Q-Q Plot: Sqrt MET")
qqline(bmd_new2$sqrt_totmet, col = "red", lwd = 2)
qqnorm(bmd_new2$sqrt_vitd_total, main = "Q-Q Plot: Sqrt Vitamin D")
qqline(bmd_new2$sqrt_vitd_total, col = "red", lwd = 2)
I applied transformations to the variables bmi, total calcium, total
vitamin D, and MET because I wanted to make sure the data was
distributed more symmetrically and not right-skewed.
Step 3: Three Correlation Tables
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
tableA <- bind_rows(
corr_pair(bmd_new2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd_new2, "log_BMXBMI", "DXXOFBMD"),
corr_pair(bmd_new2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd_new2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableA %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.0000 |
| log_BMXBMI | DXXOFBMD | 2275 | 0.425 | 0.0000 |
| log_calcium_total | DXXOFBMD | 2286 | 0.030 | 0.1450 |
| sqrt_vitd_total | DXXOFBMD | 2286 | -0.051 | 0.0151 |
Interpret the results:
Which variables show statistically significant correlations with BMD? What is the direction and strength of these relationships?
The variables showing statistically significant correlations with BMD are age, BMI, and vitamin D. Age and BMD, as well as vitamin D and BMD, exhibit inverse correlations, indicating that BMD decreases as age or vitamin D levels increase. In contrast, BMI and BMD have a positive correlation, meaning that higher BMI is associated with higher BMD.
Examine correlations between potential confounders and physical activity:
tableB <- bind_rows(
corr_pair(bmd_new2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd_new2, "log_BMXBMI", "sqrt_totmet"),
corr_pair(bmd_new2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd_new2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableB %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1588 | -0.105 | 2.67e-05 |
| log_BMXBMI | sqrt_totmet | 1582 | 0.009 | 7.32e-01 |
| log_calcium_total | sqrt_totmet | 1588 | 0.072 | 4.36e-03 |
| sqrt_vitd_total | sqrt_totmet | 1588 | -0.002 | 9.34e-01 |
Interpret the results:
Which variables are associated with physical activity levels?
All of the variables are associated with physical activity levels due to the p-values being less than 0.05.
Examine correlations among all predictor variables (all pairwise combinations):
preds <- c("RIDAGEYR", "BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd_new2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableC %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | BMXBMI | 2275 | -0.098 | 2.70e-06 |
| RIDAGEYR | log_calcium_total | 2286 | -0.024 | 2.58e-01 |
| RIDAGEYR | sqrt_vitd_total | 2286 | 0.146 | 0.00e+00 |
| BMXBMI | log_calcium_total | 2275 | 0.032 | 1.23e-01 |
| BMXBMI | sqrt_vitd_total | 2275 | 0.027 | 2.01e-01 |
| log_calcium_total | sqrt_vitd_total | 2286 | 0.224 | 0.00e+00 |
Interpret the results:
Based on the correlations from table C (looking at the estimates), there is no strong evidence of multicollinearity among these predictors. Therefore, we should be able to include them in a regression model without worrying about inflated standard errors or unstable coefficient estimates due to multicollinearity.
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd_new2, aes(x = log_BMXBMI, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)# Example: BMD vs Physical Activity
ggplot(bmd_new2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
ANOVA and correlation are both useful statistical tools in epidemiological research, but they answer different types of questions. ANOVA is used when comparing the mean outcome across two or more categorical groups. For example, it would be appropriate to use ANOVA to examine whether mean BMI differs across different physical activity categories (low, moderate, high). In contrast, correlation is used to measure the strength and direction of a linear relationship between two continuous variables. Correlation would be more appropriate for examining the relationship between age and blood pressure, since both variables are continuous.
The key difference is that ANOVA tests for differences in group means, while correlation assesses the strength of association between continuous variables. ANOVA helps determine whether groups differ significantly, whereas correlation tells us how strongly two variables move together and whether the relationship is positive or negative.
The most challenging assumptions to meet in this assignment were normality and linearity. Some variables, such as BMI and calcium intake, were skewed and required log transformations. Violations of normality or homoscedasticity could lead to biased estimates, incorrect p-values, and misleading conclusions.
A major limitation of cross-sectional correlation analysis is that it does not establish causation. Because all data are collected at one point point in time, we cannot determine whether higher calcium intake leads to improved bone health or whether individuals with better bone health engage in healthier behaviors. Confounding and reverse causation are also concerns.
The most difficult part of the R coding was managing variable names and ensuring transformations were correctly applied in the correlation tables and plots. Errors such as mismatched column names and pipe syntax issues required careful debugging. I solved these issues by re-checking variable names and testing codes in smaller sections before combining them. Through this assignment, I strengthened my understanding of writing functions, using bind_rows(), and applying transformations properly within tidyverse package.