HW1_Yerdenova

#2. Import bmd.csv

bmd <- readr::read_csv("C:/Users/MY789914/OneDrive - University at Albany - SUNY/Desktop/Stat 553 (R)/Assignmen 1/bmd.csv")

#3 #4 #5. Convert RIDRETH1, RIAGENDR, smoker to factor

bmd_1=bmd %>%
  mutate(
    sex = factor(RIAGENDR, levels = c(1, 2), labels = c("Male", "Female")),     ethnicity=factor(RIDRETH1, levels = c(1,2,3,4,5), labels = c("Mexican American", "Other hispanic", "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    smoking_status=factor(smoker, levels=c(1,2,3), labels = c("current", "past", "never")))


#7. Create analytic dataset removing missing BMD and ethnicity

data <- bmd_1 %>% filter(!is.na(DXXOFBMD), !is.na(ethnicity))

Step 6: Total sample size = 2,898

Step 8: Final analytic sample size =2,286

##Part 1: One-Way ANOVA

Research Question: Do mean BMD values differ across ethnicity groups?

#Step 1 State Hypotheses

Statistical notation: H0:μ1=μ2=μ3=μ4=μ5

HA: At least one ethnic group has a mean BMD different from the others

Plain language:

Null Hypothesis (H₀): Average bone mineral density (BMD) is the same across all ethnicity groups, indicating no ethnic differences in bone health.

Alternative Hypothesis (H₁): At least one ethnicity group has a different average BMD, suggesting potential disparities in bone health across ethnic groups.

Significance level: α = 0.05

#Step 2 Exploratory Analysis

library(dplyr)
library(knitr)

bmd_summary <- data %>%
  group_by(ethnicity) %>%
  summarize(
    n = n(),
    mean_BMD = mean(DXXOFBMD, na.rm = TRUE),
    sd_BMD = sd(DXXOFBMD, na.rm = TRUE)
  )

kable(bmd_summary, digits = 3,
      caption = "Summary of Total Femur BMD by Ethnicity")

Summary of Total Femur BMD by Ethnicity
ethnicity	n	mean_BMD	sd_BMD
Mexican American	255	0.950	0.149
Other hispanic	244	0.946	0.156
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148

#create plots

library(ggplot2)

ggplot(data, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_violin(fill = "lightblue", alpha = 0.6) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  labs(
    title = "Distribution of Total Femur Bone Mineral Density by Ethnicity",
    x = "Ethnicity",
    y = "Total Femur BMD (g/cm²)"
  ) +
  theme_minimal()

#Step 3: ANOVA F-test

anova_model <- aov(DXXOFBMD ~ ethnicity, data = data)
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova_table <- summary(anova_model)[[1]]

kable(anova_table, digits = 3,
      caption = "One-Way ANOVA Results for Total Femur BMD by Ethnicity")

One-Way ANOVA Results for Total Femur BMD by Ethnicity
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
ethnicity	4	2.948	0.737	30.185	0
Residuals	2281	55.697	0.024	NA	NA

F-statistic: 30.18
Degrees of freedom: 4 (ethnicity), 2281 (Residuals)
p-value: <2e-16 (very small)
Decision (reject or fail to reject H₀): Since p < 0.05, we reject H₀
Statistical conclusion in words: This finding suggests that mean BMD varies significantly across ethnic groups, providing evidence of ethnic differences in bone mineral density.

#Step 4 Assumption Checks

qqnorm(residuals(anova_model))
qqline(residuals(anova_model))

library(car)
leveneTest(DXXOFBMD ~ ethnicity, data = data)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Interpretation: A.Normality

The Q–Q plot indicates that residuals are approximately normally distributed, supporting the normality assumption.

Interpretation: B.Equal Variances:

The Levene’s test for Homogeneity of Variance was not statistically significant, F(4, 2281) = 1.57, p = 0.18, indicating that the variances of total femur BMD are not significantly different across groups.

#Step 5. Post-hoc Comparisons

tukey_results <- TukeyHSD(anova_model)

kable(tukey_results$ethnicity, digits = 3,
      caption = "Tukey HSD Post-hoc Comparisons for Total Femur BMD by Ethnicity")

Tukey HSD Post-hoc Comparisons for Total Femur BMD by Ethnicity
	diff	lwr	upr	p adj
Other hispanic-Mexican American	-0.004	-0.043	0.034	0.998
Non-Hispanic White-Mexican American	-0.062	-0.093	-0.032	0.000
Non-Hispanic Black-Mexican American	0.022	-0.010	0.055	0.328
Other-Mexican American	-0.053	-0.087	-0.019	0.000
Non-Hispanic White-Other hispanic	-0.058	-0.089	-0.027	0.000
Non-Hispanic Black-Other hispanic	0.027	-0.006	0.060	0.172
Other-Other hispanic	-0.049	-0.083	-0.014	0.001
Non-Hispanic Black-Non-Hispanic White	0.085	0.061	0.108	0.000
Other-Non-Hispanic White	0.009	-0.017	0.035	0.872
Other-Non-Hispanic Black	-0.076	-0.104	-0.048	0.000

Interpretation:

A p-value < 0.001 indicates a statistically significant difference between groups, while the sign of the mean difference indicates direction; In this analysis, the results indicated that Non-Hispanic Black participants had significantly higher mean BMD compared with Non-Hispanic White participants, while Non-Hispanic White and Other ethnicity groups generally exhibited lower mean BMD compared with Mexican American and Other Hispanic groups.

#Step 6. Effect Size

# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

Interpretation:

η² = 0.0503
Percentage of variance explained: 5.03%
Effect size classification (small/medium/large): medium
What does this mean practically? Although the effect is statistically significant (p < 0.001), the proportion of variance explained is moderate. This indicates that other factors besides ethnicity (age, sex, body composition, lifestyle, nutrition, genetics) also contribute substantially to BMD differences.

##Part 2. Correlation Analysis

#Step 1: Create Total Intake Variables #Step 2: Assess and Apply Transformations

library(dplyr)

data <- data %>%
  mutate(
    calcium_total = calcium + ifelse(is.na(DSQTCALC), 0, DSQTCALC),
    vitd_total    = vitd + ifelse(is.na(DSQTVD), 0, DSQTVD)
  )

par(mfrow = c(2, 2))

hist(data$BMXBMI, main = "BMI", xlab = "BMI")
hist(data$calcium_total, main = "Total Calcium", xlab = "Calcium")
hist(data$vitd_total, main = "Total Vitamin D", xlab = "Vitamin D")
hist(data$totmet, main = "Total MET", xlab = "MET")

par(mfrow = c(1, 1))

data <- data %>%
  mutate(
    log_BMI      = log(BMXBMI),
    log_calcium  = log(calcium_total + 1),  # +1 handles zeros
    sqrt_vitd    = sqrt(vitd_total),
    sqrt_MET     = sqrt(totmet)
  )

Interpretation:

Histograms indicated right-skewed distributions for BMI, total calcium intake, vitamin D intake, and total MET minutes. Log transformations were applied to BMI and calcium to reduce skewness, while square-root transformations were applied to vitamin D and MET because these variables contained zero values and moderate skewness. These transformations improve normality assumptions for Pearson correlation analysis.

#Step 3. Three Correlation Tables

library(dplyr)
library(purrr)
library(tibble)
library(knitr)

# Define corr_pair function
corr_pair <- function(data, var1, var2) {
  tmp <- data %>%
    select(all_of(c(var1, var2))) %>%
    na.omit()
  
  test <- cor.test(tmp[[1]], tmp[[2]], method = "pearson")
  
  tibble(
    var1 = var1,
    var2 = var2,
    r    = round(test$estimate, 3),
    p    = round(test$p.value, 4),
    n    = nrow(tmp)
  )
}

# Table A
table_A <- bind_rows(
  corr_pair(data, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(data, "log_BMI", "DXXOFBMD"),
  corr_pair(data, "log_calcium", "DXXOFBMD"),
  corr_pair(data, "sqrt_vitd", "DXXOFBMD")
)
kable(table_A, caption = "Table A. Correlations Between Predictors and BMD")

Table A. Correlations Between Predictors and BMD
var1	var2	r	p	n
RIDAGEYR	DXXOFBMD	-0.232	0.0000	2286
log_BMI	DXXOFBMD	0.425	0.0000	2275
log_calcium	DXXOFBMD	0.012	0.5919	2129
sqrt_vitd	DXXOFBMD	-0.062	0.0043	2129

# Table B
table_B <- bind_rows(
  corr_pair(data, "RIDAGEYR", "sqrt_MET"),
  corr_pair(data, "log_BMI", "sqrt_MET"),
  corr_pair(data, "log_calcium", "sqrt_MET"),
  corr_pair(data, "sqrt_vitd", "sqrt_MET")
)
kable(table_B, caption = "Table B. Correlations Between Predictors and Physical Activity (MET)")

Table B. Correlations Between Predictors and Physical Activity (MET)
var1	var2	r	p	n
RIDAGEYR	sqrt_MET	-0.105	0.0000	1588
log_BMI	sqrt_MET	0.009	0.7317	1582
log_calcium	sqrt_MET	0.068	0.0088	1500
sqrt_vitd	sqrt_MET	-0.003	0.8927	1500

# Table C
vars <- c("RIDAGEYR", "log_BMI", "log_calcium", "sqrt_vitd")
pairs <- combn(vars, 2, simplify = FALSE)
table_C <- map_dfr(pairs, function(pair) {
  corr_pair(data, pair[1], pair[2])
})
kable(table_C, caption = "Table C. Pairwise Correlations Among Predictors")

Table C. Pairwise Correlations Among Predictors
var1	var2	r	p	n
RIDAGEYR	log_BMI	-0.093	0.0000	2275
RIDAGEYR	log_calcium	0.067	0.0021	2129
RIDAGEYR	sqrt_vitd	0.152	0.0000	2129
log_BMI	log_calcium	0.009	0.6860	2122
log_BMI	sqrt_vitd	0.026	0.2256	2122
log_calcium	sqrt_vitd	0.291	0.0000	2129

Interpretation:

BMI showed the strongest positive correlation with BMD (r = 0.425, p < .001), while age was negatively correlated (r = -0.232, p < .001), indicating older individuals and those with lower BMI have lower bone density. Vitamin D and calcium intake had weak or nonsignificant correlations with BMD. For physical activity (MET), age had a small negative correlation (r = -0.105, p < .001), calcium a very weak positive correlation (r = 0.068, p = 0.009), and BMI and vitamin D were not significant. Among predictors, only calcium and vitamin D were moderately correlated (r = 0.291, p < .001); Overall, BMI and age are the most important correlates of BMD, while nutrient intake and physical activity have limited practical associations. Significant p-values indicate statistical relationships, but effect sizes reveal that most correlations are small.

Part 3: Reflection

ANOVA and correlation serve different purposes in statistical analysis. ANOVA is used to compare mean differences across categorical groups, such as examining whether BMD differs by ethnicity. It determines whether at least one group mean is significantly different. In contrast, correlation assesses the strength and direction of a linear relationship between two continuous variables, such as BMI and BMD. While ANOVA tests group differences, correlation evaluates how variables change together.

Several assumptions required attention. For ANOVA, homogeneity of variance and approximate normality were important. Levene’s test indicated equal variances, but visual inspection was necessary to assess distribution shape. For correlation, linearity and absence of extreme outliers were key. Some variables were skewed and required log or square-root transformations to improve normality. Violations of these assumptions could bias results or increase Type I error. Additionally, because the data are cross-sectional, correlations cannot imply causation.

For me the most challenging R programming task was creating the pairwise correlation table using combn() and map_dfr() with corr_pair(), particularly resolving tidy evaluation issues with variable names. By modifying the function to accept character strings, I improved my debugging skills and strengthened my understanding of functional programming in R.

HW1_Yerdenova_Maral

Maral Yerdenova

2026-02-18