Part 0: Data Preparation

Tasks:

Load required packages (tidyverse, car, kableExtra, broom)
Import bmd.csv
Convert RIDRETH1 to factor with meaningful ethnicity labels
Convert RIAGENDR to factor (Male/Female)
Convert smoker to factor (Current/Past/Never)
Report total sample size
Create analytic dataset removing missing BMD and ethnicity
Report final analytic sample size

#reading the bmd csv file

bmd_data <- read.csv("C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/HW/HW1/bmd.csv")

table(bmd_data$RIDRETH1)

## 
##    1    2    3    4    5 
##  331  286 1086  696  499

#Race

bmd_data <- bmd_data %>%
mutate(RIDRETH1 = factor(RIDRETH1,levels = c(1, 2, 3, 4, 5),labels = c(
"Mexican American","Other Hispanic","Non-Hispanic White","Non-Hispanic Black","Other Race - Including Multi-Racial")))
#end

table(bmd_data$RIAGENDR)

## 
##    1    2 
## 1431 1467

#Gender

bmd_data <- bmd_data %>%
mutate(RIAGENDR = factor(RIAGENDR,levels = c(1, 2),labels = c("Male","Female")))
#end

table(bmd_data$smoker)

## 
##    1    2    3 
##  449  894 1553

#Smoker

bmd_data <- bmd_data %>%
mutate(smoker = factor(smoker,levels = c(1, 2, 3),labels = c("Current","Past","Never")))
#end

nrow(bmd_data)

## [1] 2898

Total Sample Size of main dataset is, N= 2898

analytic_data <- bmd_data %>%
  filter(!is.na(DXXOFBMD),!is.na(RIDRETH1))

nrow(analytic_data)

## [1] 2286

Total sample size of analytical dataset is, n = 2286

Part 1: One-Way ANOVA

Research Question: Do mean BMD values differ across ethnicity groups?

Step 1: State Hypotheses

Statistical notation:

Let µ1, µ2, µ3, µ4, µ5 be the population mean BMD for the 5 ethnicity groups in RIDRETH1. • Null hypothesis (H₀): H0: µ1= µ2= µ3= µ4= µ5 • Alternative hypothesis (H₁): H1: At least one µi is different than other groups

Plain language:

• H₀ (null): Average bone mineral density (BMD) is the same across all five ethnicity groups. Any differences we see in our sample are just due to random chance. • H₁ (alternative): Average BMD is not the same for all ethnicity groups, meaning at least one ethnicity group has higher or lower average BMD, suggesting bone health may differ by ethnicity in the population.

Significance level: α = 0.05

Step 2: Exploratory Analysis

bmd_summary <- analytic_data %>%
group_by(RIDRETH1) %>%
summarize(n = n(),mean_BMD = mean(DXXOFBMD, na.rm = TRUE),sd_BMD = sd(DXXOFBMD, na.rm = TRUE))

bmd_summary %>%
kable(digits = 3, caption = "Descriptive Statistics of BMD by Ethnicity") %>%
kable_styling(full_width = FALSE)

Descriptive Statistics of BMD by Ethnicity
RIDRETH1	n	mean_BMD	sd_BMD
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other Race - Including Multi-Racial	409	0.897	0.148

ggplot(analytic_data, aes(x = RIDRETH1, y = DXXOFBMD)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2, alpha = 0.3) + labs(title = "Distribution of Bone Mineral Density by Ethnicity",x = "Ethnicity",y = "Bone Mineral Density (BMD)") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Non-Hispanic Black participants had the highest mean BMD (0.973), while Non-Hispanic White participants had the lowest (0.888). The remaining groups, Mexican American (0.950), Other Hispanic (0.946), and Other Race including multi-racial (0.897) had intermediate values. Standard deviations were similar across groups (0.148–0.161), indicating comparable variability.

Step 3: ANOVA F-test

anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = analytic_data)

summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## RIDRETH1       4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# F test

anova_results <- tidy(anova_model)

anova_results %>%
  kable(digits = 4, caption = " One way ANOVA Results: BMD by Ethnicity") %>%
  kable_styling(full_width = FALSE)

One way ANOVA Results: BMD by Ethnicity
term	df	sumsq	meansq	statistic	p.value
RIDRETH1	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

The one-way ANOVA shows a statistically significant difference in mean bone mineral density (BMD) across ethnicity groups, F(4, 2281) = 30.19, p < 0.001. The between-group variability (mean sqr = 0.7371) was much larger than the within-group variability (mean sqr = 0.0244), indicating real differences across groups. Therefore, we reject the null hypothesis and conclude that at least one ethnicity group differs in mean BMD.

Step 4: Assumption Checks

qqnorm(residuals(anova_model))
qqline(residuals(anova_model), col = "magenta")

leveneTest(DXXOFBMD ~ RIDRETH1, data = analytic_data)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

A. Normality

The Q-Q plot of residuals showed that the points generally followed the reference line, with only minor deviations at the tails. Given the large sample size (n = 2286), the residuals appear approximately normally distributed, and the normality assumption is reasonably satisfied.

B. Homogeneity of Variance

Levene’s test was not statistically significant, F(4, 2281) = 1.57, p = 0.179, indicating that variances do not significantly differ across ethnicity groups. Therefore, the assumption of equal variances is met.

Step 5: Post-hoc Comparisons

tukey_results <- TukeyHSD(anova_model)

tukey_results

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ RIDRETH1, data = analytic_data)
## 
## $RIDRETH1
##                                                                diff
## Other Hispanic-Mexican American                        -0.004441273
## Non-Hispanic White-Mexican American                    -0.062377249
## Non-Hispanic Black-Mexican American                     0.022391619
## Other Race - Including Multi-Racial-Mexican American   -0.053319344
## Non-Hispanic White-Other Hispanic                      -0.057935976
## Non-Hispanic Black-Other Hispanic                       0.026832892
## Other Race - Including Multi-Racial-Other Hispanic     -0.048878071
## Non-Hispanic Black-Non-Hispanic White                   0.084768868
## Other Race - Including Multi-Racial-Non-Hispanic White  0.009057905
## Other Race - Including Multi-Racial-Non-Hispanic Black -0.075710963
##                                                                 lwr         upr
## Other Hispanic-Mexican American                        -0.042644130  0.03376158
## Non-Hispanic White-Mexican American                    -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American                    -0.010100052  0.05488329
## Other Race - Including Multi-Racial-Mexican American   -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic                      -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic                      -0.006150151  0.05981593
## Other Race - Including Multi-Racial-Other Hispanic     -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White                   0.061164402  0.10837333
## Other Race - Including Multi-Racial-Non-Hispanic White -0.016633367  0.03474918
## Other Race - Including Multi-Racial-Non-Hispanic Black -0.103764519 -0.04765741
##                                                            p adj
## Other Hispanic-Mexican American                        0.9978049
## Non-Hispanic White-Mexican American                    0.0000003
## Non-Hispanic Black-Mexican American                    0.3276613
## Other Race - Including Multi-Racial-Mexican American   0.0001919
## Non-Hispanic White-Other Hispanic                      0.0000036
## Non-Hispanic Black-Other Hispanic                      0.1722466
## Other Race - Including Multi-Racial-Other Hispanic     0.0010720
## Non-Hispanic Black-Non-Hispanic White                  0.0000000
## Other Race - Including Multi-Racial-Non-Hispanic White 0.8719109
## Other Race - Including Multi-Racial-Non-Hispanic Black 0.0000000

#Creating table

tukey_table <- tidy(tukey_results)

tukey_table %>%
  kable(digits = 4, caption = "Tukey HSD Pairwise Comparisons of BMD by Ethnicity") %>%
  kable_styling(full_width = FALSE)

Tukey HSD Pairwise Comparisons of BMD by Ethnicity
term	contrast	estimate	conf.low	conf.high	adj.p.value
RIDRETH1	Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978
RIDRETH1	Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000
RIDRETH1	Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277
RIDRETH1	Other Race - Including Multi-Racial-Mexican American	-0.0533	-0.0874	-0.0193	0.0002
RIDRETH1	Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000
RIDRETH1	Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722
RIDRETH1	Other Race - Including Multi-Racial-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011
RIDRETH1	Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000
RIDRETH1	Other Race - Including Multi-Racial-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719
RIDRETH1	Other Race - Including Multi-Racial-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000

Tukey Test Interpretation: Post-hoc Comparisons

Tukey’s post-hoc test showed that Non-Hispanic White participants had significantly lower mean BMD than Mexican American (mean difference = −0.0624, 95% CI: −0.0929 to −0.0319, p < 0.001) and Other Hispanic participants (mean difference = −0.0579, 95% CI: −0.0889 to −0.0269, p < 0.001). Other Race including multi-racial participants also had significantly lower BMD compared to Mexican American (mean difference = −0.0533, p = 0.0002) and Other Hispanic participants (mean difference = −0.0489, p = 0.0011).

Non-Hispanic Black participants had significantly higher mean BMD than Non-Hispanic White participants (mean difference = 0.0848, 95% CI: 0.0612 to 0.1084, p < 0.001) and Other Race including multi-racial participants (mean difference = −0.0757 when reversed comparison, p < 0.001). No significant differences were observed between Mexican American and Other Hispanic participants, or between Non-Hispanic Black and Mexican American or Other Hispanic participants.

Step 6: Effect Size

From the ANOVA model table, we know that SS_between (RIDRETH1)= 2.9482 and SS_within (Residuals)= 55.6973

ss_between <- 2.9482
ss_within <- 55.6973

ss_total <- ss_between + ss_within
eta_sq <- ss_between / ss_total

eta_sq

## [1] 0.05027155

Interpretation:

The eta-squared effect size is η² = 0.05, meaning that approximately 5% of the variability in BMD is explained by ethnicity.

According to Cohen’s guidelines (0.01 = small, 0.06 = medium, 0.14 = large), this represents a small-to-moderate effect size, suggesting that while ethnicity is statistically significant, it explains a modest proportion of variation in bone mineral density.

Part two: Correlation

Step 1: Create Total Intake Variables

analytic_data <- analytic_data %>%
  
mutate(DSQTCALC = ifelse(is.na(DSQTCALC), 0, DSQTCALC),DSQTVD   = ifelse(is.na(DSQTVD), 0, DSQTVD),
    
       calcium_total = calcium + DSQTCALC,vitd_total    = vitd + DSQTVD)


summary(analytic_data$calcium_total)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    55.0   604.5   901.5  1001.5  1288.5  5399.0     157

summary(analytic_data$vitd_total)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.95    8.60   29.12   30.40 2574.65     157

Step 2: Assess and Apply Transformations

Creating Histogram

vars_to_check <- analytic_data %>%
  select(BMXBMI, calcium_total, vitd_total, totmet) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

ggplot(vars_to_check, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of Continuous Variables", x = NULL, y = "Count") +
  theme_minimal()

Handling zero

analytic_data <- analytic_data %>%
  mutate(
    # suggested
    BMI_log = log1p(BMXBMI),
    calcium_log = log1p(calcium_total),
    vitd_sqrt = sqrt(vitd_total),
    MET_sqrt = sqrt(totmet)
  )

Checking histogram after adjusting zero

vars_transformed <- analytic_data %>%
  select(BMI_log, calcium_log, vitd_sqrt, MET_sqrt) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

ggplot(vars_transformed, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of Continuous Variables (Transformed)", x = NULL, y = "Count") +
  theme_minimal()

Assessment of Distributions and Transformations

The original histograms showed that calcium_total, vitd_total, and MET were strongly right-skewed, with long upper tails and many low or zero values. BMI (BMXBMI) showed mild right skew but was closer to symmetric compared to the intake variables.

After transformation, the distributions improved substantially. The log transformation of BMI and calcium reduced right skew and produced more symmetric, approximately normal shapes. The square-root transformation of MET and vitamin D also reduced skewness and handled zero values appropriately, resulting in more balanced distributions. Overall, the transformed variables appear better suited for correlation analysis.

Rationale for Transformation

Several variables, especially calcium, vitamin D, and MET, were strongly right-skewed, which can violate the normality assumption and allow extreme values to overly influence correlations. Therefore, transformations were applied to reduce skewness and improve symmetry. Log transformations were used for BMI and calcium to compress large values, while square-root transformations were used for MET and vitamin D because they handle zero values better. After transformation, the distributions were more suitable for correlation analysis.

##Step 3: Three Correlation Tables

Creating correlation helper

# Create the correlation helper function

corr_pair <- function(data, var1, var2) {
  # Remove missing values
  complete_data <- data[complete.cases(data[[var1]], data[[var2]]), ]
  
  # Calculate correlation test
  cor_test <- cor.test(complete_data[[var1]], complete_data[[var2]], 
                       method = "pearson")
  
  # Extract results
  result <- data.frame(
    Variable_1 = var1,
    Variable_2 = var2,
    r = round(cor_test$estimate, 3),
    p_value = format.pval(cor_test$p.value, digits = 3),
    n = nrow(complete_data)
  )
  
  return(result)
  
}

Creating table A

table_A <- bind_rows(
  corr_pair(analytic_data, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(analytic_data, "BMI_log", "DXXOFBMD"),
  corr_pair(analytic_data, "calcium_log", "DXXOFBMD"),
  corr_pair(analytic_data, "vitd_sqrt", "DXXOFBMD"))

table_A

##          Variable_1 Variable_2      r p_value    n
## cor...1    RIDAGEYR   DXXOFBMD -0.232  <2e-16 2286
## cor...2     BMI_log   DXXOFBMD  0.424  <2e-16 2275
## cor...3 calcium_log   DXXOFBMD  0.012   0.592 2129
## cor...4   vitd_sqrt   DXXOFBMD -0.062 0.00426 2129

#cleaning table A


table_A %>%
  kable(digits = 4, caption = "Table A: Pearson Correlations Between Predictors and BMD") %>%
  kable_styling(full_width = FALSE)

Table A: Pearson Correlations Between Predictors and BMD
	Variable_1	Variable_2	r	p_value	n
cor…1	RIDAGEYR	DXXOFBMD	-0.232	<2e-16	2286
cor…2	BMI_log	DXXOFBMD	0.424	<2e-16	2275
cor…3	calcium_log	DXXOFBMD	0.012	0.592	2129
cor…4	vitd_sqrt	DXXOFBMD	-0.062	0.00426	2129

Creating table B

table_B <- bind_rows(
  corr_pair(analytic_data, "RIDAGEYR", "MET_sqrt"),
  corr_pair(analytic_data, "BMI_log", "MET_sqrt"),
  corr_pair(analytic_data, "calcium_log", "MET_sqrt"),
  corr_pair(analytic_data, "vitd_sqrt", "MET_sqrt")
)

table_B

##          Variable_1 Variable_2      r  p_value    n
## cor...1    RIDAGEYR   MET_sqrt -0.105 2.67e-05 1588
## cor...2     BMI_log   MET_sqrt  0.008     0.74 1582
## cor...3 calcium_log   MET_sqrt  0.068  0.00875 1500
## cor...4   vitd_sqrt   MET_sqrt -0.003    0.893 1500

#cleaning table B

table_B %>%
  kable(digits = 4, caption = "Table B: Pearson Correlations Between Predictors and Physical Activity (MET)") %>%
  kable_styling(full_width = FALSE)

Table B: Pearson Correlations Between Predictors and Physical Activity (MET)
	Variable_1	Variable_2	r	p_value	n
cor…1	RIDAGEYR	MET_sqrt	-0.105	2.67e-05	1588
cor…2	BMI_log	MET_sqrt	0.008	0.74	1582
cor…3	calcium_log	MET_sqrt	0.068	0.00875	1500
cor…4	vitd_sqrt	MET_sqrt	-0.003	0.893	1500

Creting table C

# Variables to correlate
vars_C <- c("RIDAGEYR", "BMI_log", "calcium_log", "vitd_sqrt")

# Pretty names
var_names <- c(
  RIDAGEYR = "Age",
  BMI_log = "BMI (log)",
  calcium_log = "Total Calcium (log)",
  vitd_sqrt = "Total Vitamin D (sqrt)"
)

# Generate all unique pairs
pairs_C <- combn(vars_C, 2, simplify = FALSE)

# Run correlations
table_C <- map_dfr(pairs_C, function(pair) {
  result <- corr_pair(analytic_data, pair[1], pair[2])
  result %>%
    mutate(
      Pair = paste(var_names[pair[1]], "vs", var_names[pair[2]])
    )
}) %>%
  select(Pair, r, p_value, n)

# Print formatted table
table_C %>%
  kable(digits = 4, caption = "Table C: Pearson Correlations Among Predictors") %>%
  kable_styling(full_width = FALSE)

Table C: Pearson Correlations Among Predictors
	Pair	r	p_value	n
cor…1	Age vs BMI (log)	-0.093	8.8e-06	2275
cor…2	Age vs Total Calcium (log)	0.067	0.0021	2129
cor…3	Age vs Total Vitamin D (sqrt)	0.152	2.02e-12	2129
cor…4	BMI (log) vs Total Calcium (log)	0.008	0.699	2122
cor…5	BMI (log) vs Total Vitamin D (sqrt)	0.026	0.225	2122
cor…6	Total Calcium (log) vs Total Vitamin D (sqrt)	0.291	<2e-16	2129

Part 3: Reflection

A. Comparing Methods

ANOVA is used when comparing the mean of a continuous outcome across two or more categorical groups, such as whether mean BMD differs by ethnicity. Correlation is used to assess the strength and direction of a linear relationship between two continuous variables, such as age and BMD. ANOVA tells us whether group means differ, while correlation quantifies how strongly two variables move together. For example, ANOVA addresses: “Does mean BMD differ across ethnic groups?” Correlation addresses: “Is age associated with changes in BMD?”

B. Assumptions & Limitations

Some variables were right-skewed, so the normality assumption was somewhat challenging and required transformation. If assumptions like normality or equal variances are seriously violated, the results (especially p-values) may not be fully reliable. In correlation analysis, outliers or non-linear relationships could make associations appear stronger or weaker than they truly are. Because the data are cross-sectional, we cannot determine cause and effect. A correlation between two variables does not mean that one causes the other, and other factors may explain the relationship.

C. R Programming Challenge

The most difficult part of this assignment was using the corr_pair() helper function. At first, I received an error because the function had not been defined in my session, and I was unsure how to correctly apply it to multiple variable pairs. I solved this by carefully defining the function, reviewing how it removes missing values, and using combn() with map_dfr() to automate the pairwise correlations. This process strengthened my understanding of writing custom functions and applying them efficiently across multiple variables in R.

Homework 1: ANOVA and Correlation

EPI 553: Principles of Statistical Inference II

Tahia Sufyani

2026-02-20