title: “EPI553_HW01_Beru_Isaac.Rmd” author: “Isaac Beru” date: “2026-02-20” output: html_document

Load required libraries (data manipulation, stats tests, tables)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(broom)

PART 0 — Data Preparation

Import dataset

bmd <- read_csv("~/Downloads/bmd.csv")
## Rows: 2898 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, BMXBMI, smoker, totmet, metcat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bmd)
## # A tibble: 6 × 14
##    SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
##   <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
## 1 93705        2       66        4   31.7      2    240      0    1.06        0
## 2 93708        2       66        5   23.7      3    120      0    0.801       1
## 3 93709        2       75        4   38.9      1    720      1    0.88        0
## 4 93711        1       56        5   21.3      3    840      1    0.851       1
## 5 93713        1       67        3   23.5      1    360      0    0.778       1
## 6 93714        2       54        4   39.9      2     NA     NA    0.994       0
## # ℹ 4 more variables: calcium <dbl>, vitd <dbl>, DSQTVD <dbl>, DSQTCALC <dbl>
dim(bmd)
## [1] 2898   14

Recode ethnicity from numeric codes

bmd <- bmd %>%
  mutate(
    ethnicity = factor(RIDRETH1,
      levels = c(1,2,3,4,5),
      labels = c("Mexican American",
                 "Other Hispanic",
                 "Non-Hispanic White",
                 "Non-Hispanic Black",
                 "Other"))
  )

Recode sex as labeled factor

table(bmd$ethnicity)
## 
##   Mexican American     Other Hispanic Non-Hispanic White Non-Hispanic Black 
##                331                286               1086                696 
##              Other 
##                499

Recode smoking status as labeled factor

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR,
                 levels = c(1,2),
                 labels = c("Male","Female"))
  )
table(bmd$sex)
## 
##   Male Female 
##   1431   1467

Recode smoking status as labeled factor

bmd <- bmd %>%
  mutate(
    smoker = factor(smoker,
                    levels = c(1,2,3),
                    labels = c("Current","Past","Never"))
  )

Summarize missingness for thee key variables used in Part 1

missing_summary <- bmd %>%
  summarise(
    total_n = n(),
    missing_BMD = sum(is.na(DXXOFBMD)),
    missing_ethnicity = sum(is.na(ethnicity))
  )

missing_summary
## # A tibble: 1 × 3
##   total_n missing_BMD missing_ethnicity
##     <int>       <int>             <int>
## 1    2898         612                 0

Create analytic dataset by removing missing BMD or ethnicity

anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD),
         !is.na(ethnicity))

nrow(anova_df)
## [1] 2286

Part 1: One-Way ANOVA

Research question: Do mean BMD differ across ethnicity groups?

Statistical notation: H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅ H₁: At least one group mean differs plain language: There is no difference in mean bone mineral density across ethnicity groups.

Descriptive statistics by ethnicity:

summary_table <- anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_BMD = mean(DXXOFBMD),
    sd_BMD = sd(DXXOFBMD)
  )

summary_table %>%
  kable(digits = 3)
ethnicity n mean_BMD sd_BMD
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148

Visualization:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2) +
  labs(title = "BMD by Ethnicity",
       x = "Ethnicity",
       y = "Bone Mineral Density (g/cm²)") +
  theme_minimal()

Fit ANOVA model and table:

anova_model <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The one-way ANOVA showed a statistically significant difference in mean bone mineral density across ethnicity groups, F(4, 2281) = 30.18, p < 0.001. Therefore, we would reject the null hypothesis and conclude that at least one ethnicity group differs in mean BMD.

#Assumptions

#Assumption check 1: Normality of residuals (Q-Q plot)

qqnorm(residuals(anova_model))
qqline(residuals(anova_model))

#Assumption check 2: Homogeneity of variance (Levene’s test)

leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Overall, both ANOVA assumptions appear to be reasonably met, so the results of the F-test can be interpreted with confidence.

#Post-hoc comparisons:

tukey_results <- TukeyHSD(anova_model)
tukey_results
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Tukey post-hoc comparisons revealed several statistically significant differences between ethnicity groups. Non-Hispanic Whites had significantly lower BMD compared to Mexican Americans and Other Hispanics. Non-Hispanic Blacks had significantly higher BMD than Non-Hispanic Whites. The “Other” ethnicity group generally had lower BMD compared to several groups, including Mexican Americans and Non-Hispanic Blacks. These results indicate that differences in bone mineral density are not uniform across ethnic groups.

Effect size (eta-squared):

anova_tab <- summary(anova_model)[[1]]
SS_between <- anova_tab[1, "Sum Sq"]
SS_total <- sum(anova_tab[, "Sum Sq"])

eta_sq <- SS_between / SS_total
eta_sq
## [1] 0.05027196

The eta-squared value was 0.050, indicating that approximately 5.0% of the variability in bone mineral density is explained by ethnicity. According to Cohen’s guidelines, this represents a small-to-moderate effect size.

Part 2: Correlation

Create total intake variables:

anova_df <- anova_df %>%
  mutate(
    calcium_total = replace_na(calcium, 0) + replace_na(DSQTCALC, 0),
    vitd_total = replace_na(vitd, 0) + replace_na(DSQTVD, 0)
  )

Transformations:

anova_df <- anova_df %>%
  mutate(
    log_BMI = log(BMXBMI),
    log_calcium = log(calcium_total + 1),
    sqrt_vitd = sqrt(vitd_total),
    sqrt_met = sqrt(totmet)
  )
anova_df %>%
  summarise(
    min_BMI = min(BMXBMI, na.rm = TRUE),
    min_calcium_total = min(calcium_total, na.rm = TRUE),
    min_vitd_total = min(vitd_total, na.rm = TRUE),
    min_totmet = min(totmet, na.rm = TRUE)
  )
## # A tibble: 1 × 4
##   min_BMI min_calcium_total min_vitd_total min_totmet
##     <dbl>             <dbl>          <dbl>      <dbl>
## 1    14.2                 0              0          0

Histograms indicated right-skew for calcium, vitamin D, and MET variables. Log and square-root transformations were applied to reduce skewness and stabilize variance

Helper function:

# Function to extract r, p, and n
corr_summary <- function(x, y, data) {
  temp <- data %>%
    select({{x}}, {{y}}) %>%
    drop_na()
  
  test <- cor.test(temp[[1]], temp[[2]])
  
  tibble(
    Predictor = deparse(substitute(x)),
    r = round(test$estimate, 3),
    p_value = round(test$p.value, 4),
    n = nrow(temp)
  )
}

Table A: Predictors vs BMD

tableA <- bind_rows(
  corr_summary(RIDAGEYR, DXXOFBMD, anova_df),
  corr_summary(log_BMI, DXXOFBMD, anova_df),
  corr_summary(log_calcium, DXXOFBMD, anova_df),
  corr_summary(sqrt_vitd, DXXOFBMD, anova_df)
)

tableA %>%
  kable()
Predictor r p_value n
x -0.232 0.0000 2286
x 0.425 0.0000 2275
x 0.030 0.1452 2286
x -0.051 0.0151 2286

Age was moderately negatively correlated with BMD (r = -0.23, p < 0.001), indicating that older individuals tend to have lower bone mineral density. BMI showed a moderate positive correlation with BMD (r = 0.43, p < 0.001), suggesting higher BMI is associated with higher BMD. Total calcium intake was not significantly correlated with BMD (p = 0.145). Vitamin D intake showed a statistically significant but very weak negative correlation with BMD (r = -0.05, p = 0.015). As this is cross-sectional data, these associations do not imply causation.

Table B: Predictors vs Physical activity (METs)

tableB <- bind_rows(
  corr_summary(RIDAGEYR, sqrt_met, anova_df),
  corr_summary(log_BMI, sqrt_met, anova_df),
  corr_summary(log_calcium, sqrt_met, anova_df),
  corr_summary(sqrt_vitd, sqrt_met, anova_df)
)

tableB %>%
  kable()
Predictor r p_value n
x -0.105 0.0000 1588
x 0.009 0.7317 1582
x 0.072 0.0044 1588
x -0.002 0.9340 1588

Age was weakly negatively correlated with physical activity (r = -0.11, p < 0.001), indicating that older individuals reported slightly lower levels of physical activity. BMI was not significantly associated with physical activity (p = 0.732). Total calcium intake showed a statistically significant but very weak positive correlation with physical activity (r = 0.07, p = 0.004). Vitamin D intake was not significantly associated with physical activity. Overall, associations were small in magnitude.

Table C: Predictor vs predictor correlations

tableC <- bind_rows(
  corr_summary(RIDAGEYR, log_BMI, anova_df),
  corr_summary(RIDAGEYR, log_calcium, anova_df),
  corr_summary(RIDAGEYR, sqrt_vitd, anova_df),
  corr_summary(log_BMI, log_calcium, anova_df),
  corr_summary(log_BMI, sqrt_vitd, anova_df),
  corr_summary(log_calcium, sqrt_vitd, anova_df)
)

tableC %>%
  kable()
Predictor r p_value n
x -0.093 0.0000 2275
x -0.024 0.2577 2286
x 0.146 0.0000 2286
x 0.039 0.0658 2275
x 0.024 0.2498 2275
x 0.224 0.0000 2286

Among predictors, most correlations were weak. Age showed a very weak negative correlation with BMI (r = -0.09, p < 0.001) and a weak positive correlation with vitamin D intake (r = 0.15, p < 0.001). Calcium and vitamin D intake were moderately positively correlated (r = 0.22, p < 0.001), suggesting individuals consuming more calcium also tended to consume more vitamin D. Other predictor pairs were not significantly correlated. All associations were modest in magnitude and do not imply causation.

#Reflection

This assignment helped clarify the difference between ANOVA and correlation for me. ANOVA is useful when comparing mean differences across groups, like looking at whether bone mineral density varies by ethnicity. Correlation, on the other hand, is better for understanding how two continuous variables are related, such as age and BMD. While ANOVA tells us whether groups differ, correlation tells us how strongly two variables move together and in what direction. One thing that stood out to me was how important assumptions are. For ANOVA, checking normality and equal variances gave me more confidence in interpreting the F-test results. For correlation, I had to think carefully about skewed variables and apply transformations. Seeing how log and square-root transformations changed the distributions made the reasoning behind them more concrete. It also reinforced that statistical significance doesn’t always mean practical significance, especially when correlations are very small. A major limitation of this analysis is that the data are cross-sectional. Even when associations are statistically significant, we cannot say that one variable causes another. For example, BMI is correlated with BMD, but that does not mean BMI directly causes higher bone density without considering other factors. The most challenging part for me was organizing the correlation tables and making sure missing values were handled correctly. Working through that process improved my confidence in data cleaning and structuring results clearly.