#2. Import bmd.csv
bmd <- readr::read_csv("C:/Users/MY789914/OneDrive - University at Albany - SUNY/Desktop/Stat 553 (R)/Assignmen 1/bmd.csv")
#3 #4 #5. Convert RIDRETH1, RIAGENDR, smoker to factor
bmd_1=bmd %>%
mutate(
sex = factor(RIAGENDR, levels = c(1, 2), labels = c("Male", "Female")), ethnicity=factor(RIDRETH1, levels = c(1,2,3,4,5), labels = c("Mexican American", "Other hispanic", "Non-Hispanic White", "Non-Hispanic Black", "Other")),
smoking_status=factor(smoker, levels=c(1,2,3), labels = c("current", "past", "never")))
#7. Create analytic dataset removing missing BMD and ethnicity
data <- bmd_1 %>% filter(!is.na(DXXOFBMD), !is.na(ethnicity))
Step 6: Total sample size = 2,898
Step 8: Final analytic sample size =2,286
##Part 1: One-Way ANOVA
Research Question: Do mean BMD values differ across ethnicity groups?
#Step 1 State Hypotheses
Statistical notation: H0:μ1=μ2=μ3=μ4=μ5
HA: At least one ethnic group has a mean BMD different from the others
Plain language:
Null Hypothesis (H₀): Average bone mineral density (BMD) is the same across all ethnicity groups, indicating no ethnic differences in bone health.
Alternative Hypothesis (H₁): At least one ethnicity group has a different average BMD, suggesting potential disparities in bone health across ethnic groups.
Significance level: α = 0.05
#Step 2 Exploratory Analysis
library(dplyr)
library(knitr)
bmd_summary <- data %>%
group_by(ethnicity) %>%
summarize(
n = n(),
mean_BMD = mean(DXXOFBMD, na.rm = TRUE),
sd_BMD = sd(DXXOFBMD, na.rm = TRUE)
)
kable(bmd_summary, digits = 3,
caption = "Summary of Total Femur BMD by Ethnicity")
| ethnicity | n | mean_BMD | sd_BMD |
|---|---|---|---|
| Mexican American | 255 | 0.950 | 0.149 |
| Other hispanic | 244 | 0.946 | 0.156 |
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
#create plots
library(ggplot2)
ggplot(data, aes(x = ethnicity, y = DXXOFBMD)) +
geom_violin(fill = "lightblue", alpha = 0.6) +
geom_jitter(width = 0.15, alpha = 0.4) +
labs(
title = "Distribution of Total Femur Bone Mineral Density by Ethnicity",
x = "Ethnicity",
y = "Total Femur BMD (g/cm²)"
) +
theme_minimal()
#Step 3: ANOVA F-test
anova_model <- aov(DXXOFBMD ~ ethnicity, data = data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_table <- summary(anova_model)[[1]]
kable(anova_table, digits = 3,
caption = "One-Way ANOVA Results for Total Femur BMD by Ethnicity")
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.948 | 0.737 | 30.185 | 0 |
| Residuals | 2281 | 55.697 | 0.024 | NA | NA |
#Step 4 Assumption Checks
qqnorm(residuals(anova_model))
qqline(residuals(anova_model))
library(car)
leveneTest(DXXOFBMD ~ ethnicity, data = data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Interpretation: A.Normality
The Q–Q plot indicates that residuals are approximately normally distributed, supporting the normality assumption.
Interpretation: B.Equal Variances:
The Levene’s test for Homogeneity of Variance was not statistically significant, F(4, 2281) = 1.57, p = 0.18, indicating that the variances of total femur BMD are not significantly different across groups.
#Step 5. Post-hoc Comparisons
tukey_results <- TukeyHSD(anova_model)
kable(tukey_results$ethnicity, digits = 3,
caption = "Tukey HSD Post-hoc Comparisons for Total Femur BMD by Ethnicity")
| diff | lwr | upr | p adj | |
|---|---|---|---|---|
| Other hispanic-Mexican American | -0.004 | -0.043 | 0.034 | 0.998 |
| Non-Hispanic White-Mexican American | -0.062 | -0.093 | -0.032 | 0.000 |
| Non-Hispanic Black-Mexican American | 0.022 | -0.010 | 0.055 | 0.328 |
| Other-Mexican American | -0.053 | -0.087 | -0.019 | 0.000 |
| Non-Hispanic White-Other hispanic | -0.058 | -0.089 | -0.027 | 0.000 |
| Non-Hispanic Black-Other hispanic | 0.027 | -0.006 | 0.060 | 0.172 |
| Other-Other hispanic | -0.049 | -0.083 | -0.014 | 0.001 |
| Non-Hispanic Black-Non-Hispanic White | 0.085 | 0.061 | 0.108 | 0.000 |
| Other-Non-Hispanic White | 0.009 | -0.017 | 0.035 | 0.872 |
| Other-Non-Hispanic Black | -0.076 | -0.104 | -0.048 | 0.000 |
Interpretation:
A p-value < 0.001 indicates a statistically significant difference between groups, while the sign of the mean difference indicates direction; In this analysis, the results indicated that Non-Hispanic Black participants had significantly higher mean BMD compared with Non-Hispanic White participants, while Non-Hispanic White and Other ethnicity groups generally exhibited lower mean BMD compared with Mexican American and Other Hispanic groups.
#Step 6. Effect Size
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
Interpretation:
##Part 2. Correlation Analysis
#Step 1: Create Total Intake Variables #Step 2: Assess and Apply Transformations
library(dplyr)
data <- data %>%
mutate(
calcium_total = calcium + ifelse(is.na(DSQTCALC), 0, DSQTCALC),
vitd_total = vitd + ifelse(is.na(DSQTVD), 0, DSQTVD)
)
par(mfrow = c(2, 2))
hist(data$BMXBMI, main = "BMI", xlab = "BMI")
hist(data$calcium_total, main = "Total Calcium", xlab = "Calcium")
hist(data$vitd_total, main = "Total Vitamin D", xlab = "Vitamin D")
hist(data$totmet, main = "Total MET", xlab = "MET")
par(mfrow = c(1, 1))
data <- data %>%
mutate(
log_BMI = log(BMXBMI),
log_calcium = log(calcium_total + 1), # +1 handles zeros
sqrt_vitd = sqrt(vitd_total),
sqrt_MET = sqrt(totmet)
)
Interpretation:
Histograms indicated right-skewed distributions for BMI, total calcium intake, vitamin D intake, and total MET minutes. Log transformations were applied to BMI and calcium to reduce skewness, while square-root transformations were applied to vitamin D and MET because these variables contained zero values and moderate skewness. These transformations improve normality assumptions for Pearson correlation analysis.
#Step 3. Three Correlation Tables
library(dplyr)
library(purrr)
library(tibble)
library(knitr)
# Define corr_pair function
corr_pair <- function(data, var1, var2) {
tmp <- data %>%
select(all_of(c(var1, var2))) %>%
na.omit()
test <- cor.test(tmp[[1]], tmp[[2]], method = "pearson")
tibble(
var1 = var1,
var2 = var2,
r = round(test$estimate, 3),
p = round(test$p.value, 4),
n = nrow(tmp)
)
}
# Table A
table_A <- bind_rows(
corr_pair(data, "RIDAGEYR", "DXXOFBMD"),
corr_pair(data, "log_BMI", "DXXOFBMD"),
corr_pair(data, "log_calcium", "DXXOFBMD"),
corr_pair(data, "sqrt_vitd", "DXXOFBMD")
)
kable(table_A, caption = "Table A. Correlations Between Predictors and BMD")
| var1 | var2 | r | p | n |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | -0.232 | 0.0000 | 2286 |
| log_BMI | DXXOFBMD | 0.425 | 0.0000 | 2275 |
| log_calcium | DXXOFBMD | 0.012 | 0.5919 | 2129 |
| sqrt_vitd | DXXOFBMD | -0.062 | 0.0043 | 2129 |
# Table B
table_B <- bind_rows(
corr_pair(data, "RIDAGEYR", "sqrt_MET"),
corr_pair(data, "log_BMI", "sqrt_MET"),
corr_pair(data, "log_calcium", "sqrt_MET"),
corr_pair(data, "sqrt_vitd", "sqrt_MET")
)
kable(table_B, caption = "Table B. Correlations Between Predictors and Physical Activity (MET)")
| var1 | var2 | r | p | n |
|---|---|---|---|---|
| RIDAGEYR | sqrt_MET | -0.105 | 0.0000 | 1588 |
| log_BMI | sqrt_MET | 0.009 | 0.7317 | 1582 |
| log_calcium | sqrt_MET | 0.068 | 0.0088 | 1500 |
| sqrt_vitd | sqrt_MET | -0.003 | 0.8927 | 1500 |
# Table C
vars <- c("RIDAGEYR", "log_BMI", "log_calcium", "sqrt_vitd")
pairs <- combn(vars, 2, simplify = FALSE)
table_C <- map_dfr(pairs, function(pair) {
corr_pair(data, pair[1], pair[2])
})
kable(table_C, caption = "Table C. Pairwise Correlations Among Predictors")
| var1 | var2 | r | p | n |
|---|---|---|---|---|
| RIDAGEYR | log_BMI | -0.093 | 0.0000 | 2275 |
| RIDAGEYR | log_calcium | 0.067 | 0.0021 | 2129 |
| RIDAGEYR | sqrt_vitd | 0.152 | 0.0000 | 2129 |
| log_BMI | log_calcium | 0.009 | 0.6860 | 2122 |
| log_BMI | sqrt_vitd | 0.026 | 0.2256 | 2122 |
| log_calcium | sqrt_vitd | 0.291 | 0.0000 | 2129 |
Interpretation:
BMI showed the strongest positive correlation with BMD (r = 0.425, p < .001), while age was negatively correlated (r = -0.232, p < .001), indicating older individuals and those with lower BMI have lower bone density. Vitamin D and calcium intake had weak or nonsignificant correlations with BMD. For physical activity (MET), age had a small negative correlation (r = -0.105, p < .001), calcium a very weak positive correlation (r = 0.068, p = 0.009), and BMI and vitamin D were not significant. Among predictors, only calcium and vitamin D were moderately correlated (r = 0.291, p < .001); Overall, BMI and age are the most important correlates of BMD, while nutrient intake and physical activity have limited practical associations. Significant p-values indicate statistical relationships, but effect sizes reveal that most correlations are small.
Part 3: Reflection
ANOVA and correlation serve different purposes in statistical analysis. ANOVA is used to compare mean differences across categorical groups, such as examining whether BMD differs by ethnicity. It determines whether at least one group mean is significantly different. In contrast, correlation assesses the strength and direction of a linear relationship between two continuous variables, such as BMI and BMD. While ANOVA tests group differences, correlation evaluates how variables change together.
Several assumptions required attention. For ANOVA, homogeneity of variance and approximate normality were important. Levene’s test indicated equal variances, but visual inspection was necessary to assess distribution shape. For correlation, linearity and absence of extreme outliers were key. Some variables were skewed and required log or square-root transformations to improve normality. Violations of these assumptions could bias results or increase Type I error. Additionally, because the data are cross-sectional, correlations cannot imply causation.
For me the most challenging R programming task was creating the pairwise correlation table using combn() and map_dfr() with corr_pair(), particularly resolving tidy evaluation issues with variable names. By modifying the function to accept character strings, I improved my debugging skills and strengthened my understanding of functional programming in R.