title: “EPI553_HW01_Beru_Isaac.Rmd” author: “Isaac Beru” date: “2026-02-20” output: html_document
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(broom)
PART 0 — Data Preparation
bmd <- read_csv("~/Downloads/bmd.csv")
## Rows: 2898 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, BMXBMI, smoker, totmet, metcat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bmd)
## # A tibble: 6 × 14
## SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 93705 2 66 4 31.7 2 240 0 1.06 0
## 2 93708 2 66 5 23.7 3 120 0 0.801 1
## 3 93709 2 75 4 38.9 1 720 1 0.88 0
## 4 93711 1 56 5 21.3 3 840 1 0.851 1
## 5 93713 1 67 3 23.5 1 360 0 0.778 1
## 6 93714 2 54 4 39.9 2 NA NA 0.994 0
## # ℹ 4 more variables: calcium <dbl>, vitd <dbl>, DSQTVD <dbl>, DSQTCALC <dbl>
dim(bmd)
## [1] 2898 14
bmd <- bmd %>%
mutate(
ethnicity = factor(RIDRETH1,
levels = c(1,2,3,4,5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other"))
)
table(bmd$ethnicity)
##
## Mexican American Other Hispanic Non-Hispanic White Non-Hispanic Black
## 331 286 1086 696
## Other
## 499
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1,2),
labels = c("Male","Female"))
)
table(bmd$sex)
##
## Male Female
## 1431 1467
bmd <- bmd %>%
mutate(
smoker = factor(smoker,
levels = c(1,2,3),
labels = c("Current","Past","Never"))
)
missing_summary <- bmd %>%
summarise(
total_n = n(),
missing_BMD = sum(is.na(DXXOFBMD)),
missing_ethnicity = sum(is.na(ethnicity))
)
missing_summary
## # A tibble: 1 × 3
## total_n missing_BMD missing_ethnicity
## <int> <int> <int>
## 1 2898 612 0
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD),
!is.na(ethnicity))
nrow(anova_df)
## [1] 2286
Statistical notation: H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅ H₁: At least one group mean differs plain language: There is no difference in mean bone mineral density across ethnicity groups.
summary_table <- anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_BMD = mean(DXXOFBMD),
sd_BMD = sd(DXXOFBMD)
)
summary_table %>%
kable(digits = 3)
| ethnicity | n | mean_BMD | sd_BMD |
|---|---|---|---|
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot() +
geom_jitter(alpha = 0.2) +
labs(title = "BMD by Ethnicity",
x = "Ethnicity",
y = "Bone Mineral Density (g/cm²)") +
theme_minimal()
anova_model <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA showed a statistically significant difference in mean bone mineral density across ethnicity groups, F(4, 2281) = 30.18, p < 0.001. Therefore, we would reject the null hypothesis and conclude that at least one ethnicity group differs in mean BMD.
#Assumptions
#Assumption check 1: Normality of residuals (Q-Q plot)
qqnorm(residuals(anova_model))
qqline(residuals(anova_model))
#Assumption check 2: Homogeneity of variance (Levene’s test)
leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Overall, both ANOVA assumptions appear to be reasonably met, so the results of the F-test can be interpreted with confidence.
#Post-hoc comparisons:
tukey_results <- TukeyHSD(anova_model)
tukey_results
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Tukey post-hoc comparisons revealed several statistically significant differences between ethnicity groups. Non-Hispanic Whites had significantly lower BMD compared to Mexican Americans and Other Hispanics. Non-Hispanic Blacks had significantly higher BMD than Non-Hispanic Whites. The “Other” ethnicity group generally had lower BMD compared to several groups, including Mexican Americans and Non-Hispanic Blacks. These results indicate that differences in bone mineral density are not uniform across ethnic groups.
anova_tab <- summary(anova_model)[[1]]
SS_between <- anova_tab[1, "Sum Sq"]
SS_total <- sum(anova_tab[, "Sum Sq"])
eta_sq <- SS_between / SS_total
eta_sq
## [1] 0.05027196
The eta-squared value was 0.050, indicating that approximately 5.0% of the variability in bone mineral density is explained by ethnicity. According to Cohen’s guidelines, this represents a small-to-moderate effect size.
anova_df <- anova_df %>%
mutate(
calcium_total = replace_na(calcium, 0) + replace_na(DSQTCALC, 0),
vitd_total = replace_na(vitd, 0) + replace_na(DSQTVD, 0)
)
anova_df <- anova_df %>%
mutate(
log_BMI = log(BMXBMI),
log_calcium = log(calcium_total + 1),
sqrt_vitd = sqrt(vitd_total),
sqrt_met = sqrt(totmet)
)
anova_df %>%
summarise(
min_BMI = min(BMXBMI, na.rm = TRUE),
min_calcium_total = min(calcium_total, na.rm = TRUE),
min_vitd_total = min(vitd_total, na.rm = TRUE),
min_totmet = min(totmet, na.rm = TRUE)
)
## # A tibble: 1 × 4
## min_BMI min_calcium_total min_vitd_total min_totmet
## <dbl> <dbl> <dbl> <dbl>
## 1 14.2 0 0 0
Histograms indicated right-skew for calcium, vitamin D, and MET variables. Log and square-root transformations were applied to reduce skewness and stabilize variance
# Function to extract r, p, and n
corr_summary <- function(x, y, data) {
temp <- data %>%
select({{x}}, {{y}}) %>%
drop_na()
test <- cor.test(temp[[1]], temp[[2]])
tibble(
Predictor = deparse(substitute(x)),
r = round(test$estimate, 3),
p_value = round(test$p.value, 4),
n = nrow(temp)
)
}
tableA <- bind_rows(
corr_summary(RIDAGEYR, DXXOFBMD, anova_df),
corr_summary(log_BMI, DXXOFBMD, anova_df),
corr_summary(log_calcium, DXXOFBMD, anova_df),
corr_summary(sqrt_vitd, DXXOFBMD, anova_df)
)
tableA %>%
kable()
| Predictor | r | p_value | n |
|---|---|---|---|
| x | -0.232 | 0.0000 | 2286 |
| x | 0.425 | 0.0000 | 2275 |
| x | 0.030 | 0.1452 | 2286 |
| x | -0.051 | 0.0151 | 2286 |
Age was moderately negatively correlated with BMD (r = -0.23, p < 0.001), indicating that older individuals tend to have lower bone mineral density. BMI showed a moderate positive correlation with BMD (r = 0.43, p < 0.001), suggesting higher BMI is associated with higher BMD. Total calcium intake was not significantly correlated with BMD (p = 0.145). Vitamin D intake showed a statistically significant but very weak negative correlation with BMD (r = -0.05, p = 0.015). As this is cross-sectional data, these associations do not imply causation.
tableB <- bind_rows(
corr_summary(RIDAGEYR, sqrt_met, anova_df),
corr_summary(log_BMI, sqrt_met, anova_df),
corr_summary(log_calcium, sqrt_met, anova_df),
corr_summary(sqrt_vitd, sqrt_met, anova_df)
)
tableB %>%
kable()
| Predictor | r | p_value | n |
|---|---|---|---|
| x | -0.105 | 0.0000 | 1588 |
| x | 0.009 | 0.7317 | 1582 |
| x | 0.072 | 0.0044 | 1588 |
| x | -0.002 | 0.9340 | 1588 |
Age was weakly negatively correlated with physical activity (r = -0.11, p < 0.001), indicating that older individuals reported slightly lower levels of physical activity. BMI was not significantly associated with physical activity (p = 0.732). Total calcium intake showed a statistically significant but very weak positive correlation with physical activity (r = 0.07, p = 0.004). Vitamin D intake was not significantly associated with physical activity. Overall, associations were small in magnitude.
tableC <- bind_rows(
corr_summary(RIDAGEYR, log_BMI, anova_df),
corr_summary(RIDAGEYR, log_calcium, anova_df),
corr_summary(RIDAGEYR, sqrt_vitd, anova_df),
corr_summary(log_BMI, log_calcium, anova_df),
corr_summary(log_BMI, sqrt_vitd, anova_df),
corr_summary(log_calcium, sqrt_vitd, anova_df)
)
tableC %>%
kable()
| Predictor | r | p_value | n |
|---|---|---|---|
| x | -0.093 | 0.0000 | 2275 |
| x | -0.024 | 0.2577 | 2286 |
| x | 0.146 | 0.0000 | 2286 |
| x | 0.039 | 0.0658 | 2275 |
| x | 0.024 | 0.2498 | 2275 |
| x | 0.224 | 0.0000 | 2286 |
Among predictors, most correlations were weak. Age showed a very weak negative correlation with BMI (r = -0.09, p < 0.001) and a weak positive correlation with vitamin D intake (r = 0.15, p < 0.001). Calcium and vitamin D intake were moderately positively correlated (r = 0.22, p < 0.001), suggesting individuals consuming more calcium also tended to consume more vitamin D. Other predictor pairs were not significantly correlated. All associations were modest in magnitude and do not imply causation.
#Reflection
This assignment helped clarify the difference between ANOVA and correlation for me. ANOVA is useful when comparing mean differences across groups, like looking at whether bone mineral density varies by ethnicity. Correlation, on the other hand, is better for understanding how two continuous variables are related, such as age and BMD. While ANOVA tells us whether groups differ, correlation tells us how strongly two variables move together and in what direction. One thing that stood out to me was how important assumptions are. For ANOVA, checking normality and equal variances gave me more confidence in interpreting the F-test results. For correlation, I had to think carefully about skewed variables and apply transformations. Seeing how log and square-root transformations changed the distributions made the reasoning behind them more concrete. It also reinforced that statistical significance doesn’t always mean practical significance, especially when correlations are very small. A major limitation of this analysis is that the data are cross-sectional. Even when associations are statistically significant, we cannot say that one variable causes another. For example, BMI is correlated with BMD, but that does not mean BMI directly causes higher bone density without considering other factors. The most challenging part for me was organizing the correlation tables and making sure missing values were handled correctly. Working through that process improved my confidence in data cleaning and structuring results clearly.