#2. Import bmd.csv

bmd <- readr::read_csv("C:/Users/MY789914/OneDrive - University at Albany - SUNY/Desktop/Stat 553 (R)/Assignmen 1/bmd.csv")

#3 #4 #5. Convert RIDRETH1, RIAGENDR, smoker to factor

bmd_1=bmd %>%
  mutate(
    sex = factor(RIAGENDR, levels = c(1, 2), labels = c("Male", "Female")),     ethnicity=factor(RIDRETH1, levels = c(1,2,3,4,5), labels = c("Mexican American", "Other hispanic", "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    smoking_status=factor(smoker, levels=c(1,2,3), labels = c("current", "past", "never")))


#7. Create analytic dataset removing missing BMD and ethnicity

data <- bmd_1 %>% filter(!is.na(DXXOFBMD), !is.na(ethnicity))

Step 6: Total sample size = 2,898

Step 8: Final analytic sample size =2,286

##Part 1: One-Way ANOVA

Research Question: Do mean BMD values differ across ethnicity groups?

#Step 1 State Hypotheses

Statistical notation: H0:μ1=μ2=μ3=μ4=μ5

HA: At least one ethnic group has a mean BMD different from the others

Plain language:

Null Hypothesis (H₀): Average bone mineral density (BMD) is the same across all ethnicity groups, indicating no ethnic differences in bone health.

Alternative Hypothesis (H₁): At least one ethnicity group has a different average BMD, suggesting potential disparities in bone health across ethnic groups.

Significance level: α = 0.05

#Step 2 Exploratory Analysis

library(dplyr)
library(knitr)

bmd_summary <- data %>%
  group_by(ethnicity) %>%
  summarize(
    n = n(),
    mean_BMD = mean(DXXOFBMD, na.rm = TRUE),
    sd_BMD = sd(DXXOFBMD, na.rm = TRUE)
  )

kable(bmd_summary, digits = 3,
      caption = "Summary of Total Femur BMD by Ethnicity")
Summary of Total Femur BMD by Ethnicity
ethnicity n mean_BMD sd_BMD
Mexican American 255 0.950 0.149
Other hispanic 244 0.946 0.156
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148
#create plots

library(ggplot2)

ggplot(data, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_violin(fill = "lightblue", alpha = 0.6) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  labs(
    title = "Distribution of Total Femur Bone Mineral Density by Ethnicity",
    x = "Ethnicity",
    y = "Total Femur BMD (g/cm²)"
  ) +
  theme_minimal()

#Step 3: ANOVA F-test

anova_model <- aov(DXXOFBMD ~ ethnicity, data = data)
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_table <- summary(anova_model)[[1]]

kable(anova_table, digits = 3,
      caption = "One-Way ANOVA Results for Total Femur BMD by Ethnicity")
One-Way ANOVA Results for Total Femur BMD by Ethnicity
Df Sum Sq Mean Sq F value Pr(>F)
ethnicity 4 2.948 0.737 30.185 0
Residuals 2281 55.697 0.024 NA NA

#Step 4 Assumption Checks

qqnorm(residuals(anova_model))
qqline(residuals(anova_model))

library(car)
leveneTest(DXXOFBMD ~ ethnicity, data = data)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Interpretation: A.Normality

The Q–Q plot indicates that residuals are approximately normally distributed, supporting the normality assumption.

Interpretation: B.Equal Variances:

The Levene’s test for Homogeneity of Variance was not statistically significant, F(4, 2281) = 1.57, p = 0.18, indicating that the variances of total femur BMD are not significantly different across groups.

#Step 5. Post-hoc Comparisons

tukey_results <- TukeyHSD(anova_model)

kable(tukey_results$ethnicity, digits = 3,
      caption = "Tukey HSD Post-hoc Comparisons for Total Femur BMD by Ethnicity")
Tukey HSD Post-hoc Comparisons for Total Femur BMD by Ethnicity
diff lwr upr p adj
Other hispanic-Mexican American -0.004 -0.043 0.034 0.998
Non-Hispanic White-Mexican American -0.062 -0.093 -0.032 0.000
Non-Hispanic Black-Mexican American 0.022 -0.010 0.055 0.328
Other-Mexican American -0.053 -0.087 -0.019 0.000
Non-Hispanic White-Other hispanic -0.058 -0.089 -0.027 0.000
Non-Hispanic Black-Other hispanic 0.027 -0.006 0.060 0.172
Other-Other hispanic -0.049 -0.083 -0.014 0.001
Non-Hispanic Black-Non-Hispanic White 0.085 0.061 0.108 0.000
Other-Non-Hispanic White 0.009 -0.017 0.035 0.872
Other-Non-Hispanic Black -0.076 -0.104 -0.048 0.000

Interpretation:

A p-value < 0.001 indicates a statistically significant difference between groups, while the sign of the mean difference indicates direction; In this analysis, the results indicated that Non-Hispanic Black participants had significantly higher mean BMD compared with Non-Hispanic White participants, while Non-Hispanic White and Other ethnicity groups generally exhibited lower mean BMD compared with Mexican American and Other Hispanic groups.

#Step 6. Effect Size

# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %

Interpretation:

##Part 2. Correlation Analysis

#Step 1: Create Total Intake Variables #Step 2: Assess and Apply Transformations

library(dplyr)

data <- data %>%
  mutate(
    calcium_total = calcium + ifelse(is.na(DSQTCALC), 0, DSQTCALC),
    vitd_total    = vitd + ifelse(is.na(DSQTVD), 0, DSQTVD)
  )

par(mfrow = c(2, 2))

hist(data$BMXBMI, main = "BMI", xlab = "BMI")
hist(data$calcium_total, main = "Total Calcium", xlab = "Calcium")
hist(data$vitd_total, main = "Total Vitamin D", xlab = "Vitamin D")
hist(data$totmet, main = "Total MET", xlab = "MET")

par(mfrow = c(1, 1))

data <- data %>%
  mutate(
    log_BMI      = log(BMXBMI),
    log_calcium  = log(calcium_total + 1),  # +1 handles zeros
    sqrt_vitd    = sqrt(vitd_total),
    sqrt_MET     = sqrt(totmet)
  )

Interpretation:

Histograms indicated right-skewed distributions for BMI, total calcium intake, vitamin D intake, and total MET minutes. Log transformations were applied to BMI and calcium to reduce skewness, while square-root transformations were applied to vitamin D and MET because these variables contained zero values and moderate skewness. These transformations improve normality assumptions for Pearson correlation analysis.

#Step 3. Three Correlation Tables

library(dplyr)
library(purrr)
library(tibble)
library(knitr)

# Define corr_pair function
corr_pair <- function(data, var1, var2) {
  tmp <- data %>%
    select(all_of(c(var1, var2))) %>%
    na.omit()
  
  test <- cor.test(tmp[[1]], tmp[[2]], method = "pearson")
  
  tibble(
    var1 = var1,
    var2 = var2,
    r    = round(test$estimate, 3),
    p    = round(test$p.value, 4),
    n    = nrow(tmp)
  )
}

# Table A
table_A <- bind_rows(
  corr_pair(data, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(data, "log_BMI", "DXXOFBMD"),
  corr_pair(data, "log_calcium", "DXXOFBMD"),
  corr_pair(data, "sqrt_vitd", "DXXOFBMD")
)
kable(table_A, caption = "Table A. Correlations Between Predictors and BMD")
Table A. Correlations Between Predictors and BMD
var1 var2 r p n
RIDAGEYR DXXOFBMD -0.232 0.0000 2286
log_BMI DXXOFBMD 0.425 0.0000 2275
log_calcium DXXOFBMD 0.012 0.5919 2129
sqrt_vitd DXXOFBMD -0.062 0.0043 2129
# Table B
table_B <- bind_rows(
  corr_pair(data, "RIDAGEYR", "sqrt_MET"),
  corr_pair(data, "log_BMI", "sqrt_MET"),
  corr_pair(data, "log_calcium", "sqrt_MET"),
  corr_pair(data, "sqrt_vitd", "sqrt_MET")
)
kable(table_B, caption = "Table B. Correlations Between Predictors and Physical Activity (MET)")
Table B. Correlations Between Predictors and Physical Activity (MET)
var1 var2 r p n
RIDAGEYR sqrt_MET -0.105 0.0000 1588
log_BMI sqrt_MET 0.009 0.7317 1582
log_calcium sqrt_MET 0.068 0.0088 1500
sqrt_vitd sqrt_MET -0.003 0.8927 1500
# Table C
vars <- c("RIDAGEYR", "log_BMI", "log_calcium", "sqrt_vitd")
pairs <- combn(vars, 2, simplify = FALSE)
table_C <- map_dfr(pairs, function(pair) {
  corr_pair(data, pair[1], pair[2])
})
kable(table_C, caption = "Table C. Pairwise Correlations Among Predictors")
Table C. Pairwise Correlations Among Predictors
var1 var2 r p n
RIDAGEYR log_BMI -0.093 0.0000 2275
RIDAGEYR log_calcium 0.067 0.0021 2129
RIDAGEYR sqrt_vitd 0.152 0.0000 2129
log_BMI log_calcium 0.009 0.6860 2122
log_BMI sqrt_vitd 0.026 0.2256 2122
log_calcium sqrt_vitd 0.291 0.0000 2129

Interpretation:

BMI showed the strongest positive correlation with BMD (r = 0.425, p < .001), while age was negatively correlated (r = -0.232, p < .001), indicating older individuals and those with lower BMI have lower bone density. Vitamin D and calcium intake had weak or nonsignificant correlations with BMD. For physical activity (MET), age had a small negative correlation (r = -0.105, p < .001), calcium a very weak positive correlation (r = 0.068, p = 0.009), and BMI and vitamin D were not significant. Among predictors, only calcium and vitamin D were moderately correlated (r = 0.291, p < .001); Overall, BMI and age are the most important correlates of BMD, while nutrient intake and physical activity have limited practical associations. Significant p-values indicate statistical relationships, but effect sizes reveal that most correlations are small.

Part 3: Reflection

ANOVA and correlation serve different purposes in statistical analysis. ANOVA is used to compare mean differences across categorical groups, such as examining whether BMD differs by ethnicity. It determines whether at least one group mean is significantly different. In contrast, correlation assesses the strength and direction of a linear relationship between two continuous variables, such as BMI and BMD. While ANOVA tests group differences, correlation evaluates how variables change together.

Several assumptions required attention. For ANOVA, homogeneity of variance and approximate normality were important. Levene’s test indicated equal variances, but visual inspection was necessary to assess distribution shape. For correlation, linearity and absence of extreme outliers were key. Some variables were skewed and required log or square-root transformations to improve normality. Violations of these assumptions could bias results or increase Type I error. Additionally, because the data are cross-sectional, correlations cannot imply causation.

For me the most challenging R programming task was creating the pairwise correlation table using combn() and map_dfr() with corr_pair(), particularly resolving tidy evaluation issues with variable names. By modifying the function to accept character strings, I improved my debugging skills and strengthened my understanding of functional programming in R.