#reading the bmd csv file
bmd_data <- read.csv("C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/HW/HW1/bmd.csv")##
## 1 2 3 4 5
## 331 286 1086 696 499
#Race
bmd_data <- bmd_data %>%
mutate(RIDRETH1 = factor(RIDRETH1,levels = c(1, 2, 3, 4, 5),labels = c(
"Mexican American","Other Hispanic","Non-Hispanic White","Non-Hispanic Black","Other Race - Including Multi-Racial")))
#end
table(bmd_data$RIAGENDR)##
## 1 2
## 1431 1467
#Gender
bmd_data <- bmd_data %>%
mutate(RIAGENDR = factor(RIAGENDR,levels = c(1, 2),labels = c("Male","Female")))
#end
table(bmd_data$smoker)##
## 1 2 3
## 449 894 1553
#Smoker
bmd_data <- bmd_data %>%
mutate(smoker = factor(smoker,levels = c(1, 2, 3),labels = c("Current","Past","Never")))
#end## [1] 2898
Total Sample Size of main dataset is, N= 2898
## [1] 2286
Total sample size of analytical dataset is, n = 2286
Statistical notation:
Let µ1, µ2, µ3, µ4, µ5 be the population mean BMD for the 5 ethnicity groups in RIDRETH1. • Null hypothesis (H₀): H0: µ1= µ2= µ3= µ4= µ5 • Alternative hypothesis (H₁): H1: At least one µi is different than other groups
Plain language:
• H₀ (null): Average bone mineral density (BMD) is the same across all five ethnicity groups. Any differences we see in our sample are just due to random chance. • H₁ (alternative): Average BMD is not the same for all ethnicity groups, meaning at least one ethnicity group has higher or lower average BMD, suggesting bone health may differ by ethnicity in the population.
Significance level: α = 0.05
bmd_summary <- analytic_data %>%
group_by(RIDRETH1) %>%
summarize(n = n(),mean_BMD = mean(DXXOFBMD, na.rm = TRUE),sd_BMD = sd(DXXOFBMD, na.rm = TRUE))
bmd_summary %>%
kable(digits = 3, caption = "Descriptive Statistics of BMD by Ethnicity") %>%
kable_styling(full_width = FALSE)| RIDRETH1 | n | mean_BMD | sd_BMD |
|---|---|---|---|
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other Race - Including Multi-Racial | 409 | 0.897 | 0.148 |
ggplot(analytic_data, aes(x = RIDRETH1, y = DXXOFBMD)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2, alpha = 0.3) + labs(title = "Distribution of Bone Mineral Density by Ethnicity",x = "Ethnicity",y = "Bone Mineral Density (BMD)") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))Non-Hispanic Black participants had the highest mean BMD (0.973), while Non-Hispanic White participants had the lowest (0.888). The remaining groups, Mexican American (0.950), Other Hispanic (0.946), and Other Race including multi-racial (0.897) had intermediate values. Standard deviations were similar across groups (0.148–0.161), indicating comparable variability.
## Df Sum Sq Mean Sq F value Pr(>F)
## RIDRETH1 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# F test
anova_results <- tidy(anova_model)
anova_results %>%
kable(digits = 4, caption = " One way ANOVA Results: BMD by Ethnicity") %>%
kable_styling(full_width = FALSE)| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| RIDRETH1 | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
The one-way ANOVA shows a statistically significant difference in mean bone mineral density (BMD) across ethnicity groups, F(4, 2281) = 30.19, p < 0.001. The between-group variability (mean sqr = 0.7371) was much larger than the within-group variability (mean sqr = 0.0244), indicating real differences across groups. Therefore, we reject the null hypothesis and conclude that at least one ethnicity group differs in mean BMD.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
A. Normality
The Q-Q plot of residuals showed that the points generally followed the reference line, with only minor deviations at the tails. Given the large sample size (n = 2286), the residuals appear approximately normally distributed, and the normality assumption is reasonably satisfied.
B. Homogeneity of Variance
Levene’s test was not statistically significant, F(4, 2281) = 1.57, p = 0.179, indicating that variances do not significantly differ across ethnicity groups. Therefore, the assumption of equal variances is met.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ RIDRETH1, data = analytic_data)
##
## $RIDRETH1
## diff
## Other Hispanic-Mexican American -0.004441273
## Non-Hispanic White-Mexican American -0.062377249
## Non-Hispanic Black-Mexican American 0.022391619
## Other Race - Including Multi-Racial-Mexican American -0.053319344
## Non-Hispanic White-Other Hispanic -0.057935976
## Non-Hispanic Black-Other Hispanic 0.026832892
## Other Race - Including Multi-Racial-Other Hispanic -0.048878071
## Non-Hispanic Black-Non-Hispanic White 0.084768868
## Other Race - Including Multi-Racial-Non-Hispanic White 0.009057905
## Other Race - Including Multi-Racial-Non-Hispanic Black -0.075710963
## lwr upr
## Other Hispanic-Mexican American -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American -0.010100052 0.05488329
## Other Race - Including Multi-Racial-Mexican American -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic -0.006150151 0.05981593
## Other Race - Including Multi-Racial-Other Hispanic -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.061164402 0.10837333
## Other Race - Including Multi-Racial-Non-Hispanic White -0.016633367 0.03474918
## Other Race - Including Multi-Racial-Non-Hispanic Black -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other Race - Including Multi-Racial-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other Race - Including Multi-Racial-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other Race - Including Multi-Racial-Non-Hispanic White 0.8719109
## Other Race - Including Multi-Racial-Non-Hispanic Black 0.0000000
#Creating table
tukey_table <- tidy(tukey_results)
tukey_table %>%
kable(digits = 4, caption = "Tukey HSD Pairwise Comparisons of BMD by Ethnicity") %>%
kable_styling(full_width = FALSE)| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| RIDRETH1 | Other Hispanic-Mexican American | 0 | -0.0044 | -0.0426 | 0.0338 | 0.9978 |
| RIDRETH1 | Non-Hispanic White-Mexican American | 0 | -0.0624 | -0.0929 | -0.0319 | 0.0000 |
| RIDRETH1 | Non-Hispanic Black-Mexican American | 0 | 0.0224 | -0.0101 | 0.0549 | 0.3277 |
| RIDRETH1 | Other Race - Including Multi-Racial-Mexican American | 0 | -0.0533 | -0.0874 | -0.0193 | 0.0002 |
| RIDRETH1 | Non-Hispanic White-Other Hispanic | 0 | -0.0579 | -0.0889 | -0.0269 | 0.0000 |
| RIDRETH1 | Non-Hispanic Black-Other Hispanic | 0 | 0.0268 | -0.0062 | 0.0598 | 0.1722 |
| RIDRETH1 | Other Race - Including Multi-Racial-Other Hispanic | 0 | -0.0489 | -0.0834 | -0.0144 | 0.0011 |
| RIDRETH1 | Non-Hispanic Black-Non-Hispanic White | 0 | 0.0848 | 0.0612 | 0.1084 | 0.0000 |
| RIDRETH1 | Other Race - Including Multi-Racial-Non-Hispanic White | 0 | 0.0091 | -0.0166 | 0.0347 | 0.8719 |
| RIDRETH1 | Other Race - Including Multi-Racial-Non-Hispanic Black | 0 | -0.0757 | -0.1038 | -0.0477 | 0.0000 |
Tukey Test Interpretation: Post-hoc Comparisons
Tukey’s post-hoc test showed that Non-Hispanic White participants had significantly lower mean BMD than Mexican American (mean difference = −0.0624, 95% CI: −0.0929 to −0.0319, p < 0.001) and Other Hispanic participants (mean difference = −0.0579, 95% CI: −0.0889 to −0.0269, p < 0.001). Other Race including multi-racial participants also had significantly lower BMD compared to Mexican American (mean difference = −0.0533, p = 0.0002) and Other Hispanic participants (mean difference = −0.0489, p = 0.0011).
Non-Hispanic Black participants had significantly higher mean BMD than Non-Hispanic White participants (mean difference = 0.0848, 95% CI: 0.0612 to 0.1084, p < 0.001) and Other Race including multi-racial participants (mean difference = −0.0757 when reversed comparison, p < 0.001). No significant differences were observed between Mexican American and Other Hispanic participants, or between Non-Hispanic Black and Mexican American or Other Hispanic participants.
From the ANOVA model table, we know that SS_between (RIDRETH1)= 2.9482 and SS_within (Residuals)= 55.6973
ss_between <- 2.9482
ss_within <- 55.6973
ss_total <- ss_between + ss_within
eta_sq <- ss_between / ss_total
eta_sq## [1] 0.05027155
Interpretation:
The eta-squared effect size is η² = 0.05, meaning that approximately 5% of the variability in BMD is explained by ethnicity.
According to Cohen’s guidelines (0.01 = small, 0.06 = medium, 0.14 = large), this represents a small-to-moderate effect size, suggesting that while ethnicity is statistically significant, it explains a modest proportion of variation in bone mineral density.
analytic_data <- analytic_data %>%
mutate(DSQTCALC = ifelse(is.na(DSQTCALC), 0, DSQTCALC),DSQTVD = ifelse(is.na(DSQTVD), 0, DSQTVD),
calcium_total = calcium + DSQTCALC,vitd_total = vitd + DSQTVD)
summary(analytic_data$calcium_total)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 55.0 604.5 901.5 1001.5 1288.5 5399.0 157
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.95 8.60 29.12 30.40 2574.65 157
vars_to_check <- analytic_data %>%
select(BMXBMI, calcium_total, vitd_total, totmet) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
ggplot(vars_to_check, aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms of Continuous Variables", x = NULL, y = "Count") +
theme_minimal()vars_transformed <- analytic_data %>%
select(BMI_log, calcium_log, vitd_sqrt, MET_sqrt) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
ggplot(vars_transformed, aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms of Continuous Variables (Transformed)", x = NULL, y = "Count") +
theme_minimal()The original histograms showed that calcium_total, vitd_total, and MET were strongly right-skewed, with long upper tails and many low or zero values. BMI (BMXBMI) showed mild right skew but was closer to symmetric compared to the intake variables.
After transformation, the distributions improved substantially. The log transformation of BMI and calcium reduced right skew and produced more symmetric, approximately normal shapes. The square-root transformation of MET and vitamin D also reduced skewness and handled zero values appropriately, resulting in more balanced distributions. Overall, the transformed variables appear better suited for correlation analysis.
Several variables, especially calcium, vitamin D, and MET, were strongly right-skewed, which can violate the normality assumption and allow extreme values to overly influence correlations. Therefore, transformations were applied to reduce skewness and improve symmetry. Log transformations were used for BMI and calcium to compress large values, while square-root transformations were used for MET and vitamin D because they handle zero values better. After transformation, the distributions were more suitable for correlation analysis.
##Step 3: Three Correlation Tables
# Create the correlation helper function
corr_pair <- function(data, var1, var2) {
# Remove missing values
complete_data <- data[complete.cases(data[[var1]], data[[var2]]), ]
# Calculate correlation test
cor_test <- cor.test(complete_data[[var1]], complete_data[[var2]],
method = "pearson")
# Extract results
result <- data.frame(
Variable_1 = var1,
Variable_2 = var2,
r = round(cor_test$estimate, 3),
p_value = format.pval(cor_test$p.value, digits = 3),
n = nrow(complete_data)
)
return(result)
}table_A <- bind_rows(
corr_pair(analytic_data, "RIDAGEYR", "DXXOFBMD"),
corr_pair(analytic_data, "BMI_log", "DXXOFBMD"),
corr_pair(analytic_data, "calcium_log", "DXXOFBMD"),
corr_pair(analytic_data, "vitd_sqrt", "DXXOFBMD"))
table_A## Variable_1 Variable_2 r p_value n
## cor...1 RIDAGEYR DXXOFBMD -0.232 <2e-16 2286
## cor...2 BMI_log DXXOFBMD 0.424 <2e-16 2275
## cor...3 calcium_log DXXOFBMD 0.012 0.592 2129
## cor...4 vitd_sqrt DXXOFBMD -0.062 0.00426 2129
#cleaning table A
table_A %>%
kable(digits = 4, caption = "Table A: Pearson Correlations Between Predictors and BMD") %>%
kable_styling(full_width = FALSE)| Variable_1 | Variable_2 | r | p_value | n | |
|---|---|---|---|---|---|
| cor…1 | RIDAGEYR | DXXOFBMD | -0.232 | <2e-16 | 2286 |
| cor…2 | BMI_log | DXXOFBMD | 0.424 | <2e-16 | 2275 |
| cor…3 | calcium_log | DXXOFBMD | 0.012 | 0.592 | 2129 |
| cor…4 | vitd_sqrt | DXXOFBMD | -0.062 | 0.00426 | 2129 |
table_B <- bind_rows(
corr_pair(analytic_data, "RIDAGEYR", "MET_sqrt"),
corr_pair(analytic_data, "BMI_log", "MET_sqrt"),
corr_pair(analytic_data, "calcium_log", "MET_sqrt"),
corr_pair(analytic_data, "vitd_sqrt", "MET_sqrt")
)
table_B## Variable_1 Variable_2 r p_value n
## cor...1 RIDAGEYR MET_sqrt -0.105 2.67e-05 1588
## cor...2 BMI_log MET_sqrt 0.008 0.74 1582
## cor...3 calcium_log MET_sqrt 0.068 0.00875 1500
## cor...4 vitd_sqrt MET_sqrt -0.003 0.893 1500
#cleaning table B
table_B %>%
kable(digits = 4, caption = "Table B: Pearson Correlations Between Predictors and Physical Activity (MET)") %>%
kable_styling(full_width = FALSE)| Variable_1 | Variable_2 | r | p_value | n | |
|---|---|---|---|---|---|
| cor…1 | RIDAGEYR | MET_sqrt | -0.105 | 2.67e-05 | 1588 |
| cor…2 | BMI_log | MET_sqrt | 0.008 | 0.74 | 1582 |
| cor…3 | calcium_log | MET_sqrt | 0.068 | 0.00875 | 1500 |
| cor…4 | vitd_sqrt | MET_sqrt | -0.003 | 0.893 | 1500 |
# Variables to correlate
vars_C <- c("RIDAGEYR", "BMI_log", "calcium_log", "vitd_sqrt")
# Pretty names
var_names <- c(
RIDAGEYR = "Age",
BMI_log = "BMI (log)",
calcium_log = "Total Calcium (log)",
vitd_sqrt = "Total Vitamin D (sqrt)"
)
# Generate all unique pairs
pairs_C <- combn(vars_C, 2, simplify = FALSE)
# Run correlations
table_C <- map_dfr(pairs_C, function(pair) {
result <- corr_pair(analytic_data, pair[1], pair[2])
result %>%
mutate(
Pair = paste(var_names[pair[1]], "vs", var_names[pair[2]])
)
}) %>%
select(Pair, r, p_value, n)
# Print formatted table
table_C %>%
kable(digits = 4, caption = "Table C: Pearson Correlations Among Predictors") %>%
kable_styling(full_width = FALSE)| Pair | r | p_value | n | |
|---|---|---|---|---|
| cor…1 | Age vs BMI (log) | -0.093 | 8.8e-06 | 2275 |
| cor…2 | Age vs Total Calcium (log) | 0.067 | 0.0021 | 2129 |
| cor…3 | Age vs Total Vitamin D (sqrt) | 0.152 | 2.02e-12 | 2129 |
| cor…4 | BMI (log) vs Total Calcium (log) | 0.008 | 0.699 | 2122 |
| cor…5 | BMI (log) vs Total Vitamin D (sqrt) | 0.026 | 0.225 | 2122 |
| cor…6 | Total Calcium (log) vs Total Vitamin D (sqrt) | 0.291 | <2e-16 | 2129 |
ANOVA is used when comparing the mean of a continuous outcome across two or more categorical groups, such as whether mean BMD differs by ethnicity. Correlation is used to assess the strength and direction of a linear relationship between two continuous variables, such as age and BMD. ANOVA tells us whether group means differ, while correlation quantifies how strongly two variables move together. For example, ANOVA addresses: “Does mean BMD differ across ethnic groups?” Correlation addresses: “Is age associated with changes in BMD?”
Some variables were right-skewed, so the normality assumption was somewhat challenging and required transformation. If assumptions like normality or equal variances are seriously violated, the results (especially p-values) may not be fully reliable. In correlation analysis, outliers or non-linear relationships could make associations appear stronger or weaker than they truly are. Because the data are cross-sectional, we cannot determine cause and effect. A correlation between two variables does not mean that one causes the other, and other factors may explain the relationship.
The most difficult part of this assignment was using the corr_pair() helper function. At first, I received an error because the function had not been defined in my session, and I was unsure how to correctly apply it to multiple variable pairs. I solved this by carefully defining the function, reviewing how it removes missing values, and using combn() with map_dfr() to automate the pairwise correlations. This process strengthened my understanding of writing custom functions and applying them efficiently across multiple variables in R.