library(tidyverse)
library(car)
library(kableExtra)
library(broom)
bmd <- readr::read_csv("~/Downloads/bmd.csv", show_col_types = FALSE)
bmd_new <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, levels = c(1, 2), labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", "Other Hispanic",
                                  "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    smokernew = factor(smoker, levels = c(1, 2, 3), labels = c("Current", "Past", "Never"))
  )
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(bmd_new), "bmd new labels")
) %>%
  kable()

| Metric | Value |
|---|---|
| Sample Size | 2898 bmd new labels |
# Remove observations with missing BMD or ethnicity
bmd_new2 <- bmd_new %>%
  filter(!is.na(DXXOFBMD) & !is.na(ethnicity))
# Display sample size
data.frame(
  Metric = "Sample Size",
  Value = paste(nrow(bmd_new2), "final bmd new labels")
) %>%
  kable()

| Metric | Value |
|---|---|
| Sample Size | 2286 final bmd new labels |
Interpretation: The initial sample contains 2898 observations with BMD data. After removing 612 observations with missing BMD or ethnicity, the final analytic sample is 2286.
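As a quick consistency check on the exclusions (a sketch using the objects created above; output not shown):

# Number of observations dropped for missing BMD or ethnicity (should equal 2898 - 2286 = 612)
nrow(bmd_new) - nrow(bmd_new2)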
##Part 1: One-Way ANOVA
#Research Question: Do mean BMD values differ across ethnicity groups? The ANOVA indicates that mean BMD differs significantly across groups, although the group means are numerically close to one another.
#Step 1: State Hypotheses
##Null Hypothesis
#(H0): μ_Mexican American = μ_Other Hispanic = μ_Non-Hispanic White = μ_Non-Hispanic Black = μ_Other
#Plain Language: All five ethnicity groups have equal population mean BMD.
##Alternative Hypothesis
#(H1): Not all group means are equal; at least one μ differs.
#Plain Language: At least one population mean differs from the others.
##Significance level: α = 0.05
#Step 2: Exploratory Analysis
# Calculate summary statistics for BMD by ethnicity group
summary_stats <- bmd_new2 %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    Mean = mean(DXXOFBMD),
    SD = sd(DXXOFBMD),
    Median = median(DXXOFBMD),
    Min = min(DXXOFBMD),
    Max = max(DXXOFBMD)
  )
summary_stats %>%
  kable(digits = 2,
        caption = "Descriptive Statistics: BMD by Ethnicity")

| ethnicity | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Mexican American | 255 | 0.95 | 0.15 | 0.96 | 0.44 | 1.46 |
| Other Hispanic | 244 | 0.95 | 0.16 | 0.94 | 0.55 | 1.39 |
| Non-Hispanic White | 846 | 0.89 | 0.16 | 0.88 | 0.37 | 1.46 |
| Non-Hispanic Black | 532 | 0.97 | 0.16 | 0.98 | 0.44 | 1.55 |
| Other | 409 | 0.90 | 0.15 | 0.89 | 0.54 | 1.34 |
# Create boxplots with individual points
ggplot(bmd_new2,
       aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "BMD Across Ethnicities",
    subtitle = "BMD Data",
    x = "Ethnicity",
    y = "BMD",
    fill = "Ethnicity"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

#Interpretation: What patterns do you observe in BMD across ethnicity groups?
#The medians are very similar across the ethnicity groups.
#The Non-Hispanic Black group is the only one whose boxplot sits slightly higher than the others.
#The remaining groups have similar medians and interquartile ranges.
#Step 3: ANOVA F-test
# Fit the one-way ANOVA model
anova_model <- aov(DXXOFBMD ~ ethnicity, data = bmd_new2)
# Display the ANOVA table
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Decision and interpretation: The p-value (< 2e-16) is below α = 0.05, so we reject the null hypothesis. There is a statistically significant difference in mean BMD across the ethnicity groups; at least one group mean differs from the others.
#For ethnicity, the degrees of freedom = 4 and F = 30.18.
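As a hand check against the ANOVA table above: F = MS_ethnicity / MS_residual = 0.7371 / 0.0244 ≈ 30.2, which matches the reported F value of 30.18 up to rounding.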
#Step 4: Assumption Checks
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_model)
par(mfrow = c(1, 1))
#Step 4 B - Equal Variances
# Levene's test for homogeneity of variance
levene_test <- leveneTest(DXXOFBMD ~ ethnicity, data = bmd_new2)
print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281
#Interpretation: The p-value = 0.1788. Since this is greater than 0.05, we fail to reject the null hypothesis of equal variances, so the homogeneity-of-variance assumption is reasonable.
#The Q-Q plot suggests the residuals are approximately normally distributed, since the points roughly follow the reference line.
#With approximately normal residuals and equal variances, the ANOVA assumptions are not violated.
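As an optional supplementary check of normality (not part of the original output; a sketch assuming anova_model from Step 3), a Shapiro-Wilk test can be run on the model residuals to complement the Q-Q plot:

# Supplementary normality check on the ANOVA residuals (output not shown)
resid_vals <- residuals(anova_model)
shapiro.test(resid_vals)  # applicable here since n = 2286 is within the test's 5000-sample limit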
#Step 5: Post-hoc Comparisons
# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)

##   Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = bmd_new2)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
| Comparison | Mean Diff | 95% CI | p-value | Significant? |
|---|---|---|---|---|
| Other Hispanic-Mexican American | -0.004 | (-0.043, 0.034) | 0.998 | No |
| Non-Hispanic White-Mexican American | -0.062 | (-0.093, -0.032) | <0.001 | Yes |
| Non-Hispanic Black-Mexican American | 0.022 | (-0.010, 0.055) | 0.328 | No |
| Other-Mexican American | -0.053 | (-0.087, -0.019) | <0.001 | Yes |
| Non-Hispanic White-Other Hispanic | -0.058 | (-0.089, -0.027) | <0.001 | Yes |
| Non-Hispanic Black-Other Hispanic | 0.027 | (-0.006, 0.060) | 0.172 | No |
| Other-Other Hispanic | -0.049 | (-0.083, -0.014) | 0.001 | Yes |
| Non-Hispanic Black-Non-Hispanic White | 0.085 | (0.061, 0.108) | <0.001 | Yes |
| Other-Non-Hispanic White | 0.009 | (-0.017, 0.035) | 0.872 | No |
| Other-Non-Hispanic Black | -0.076 | (-0.104, -0.048) | <0.001 | Yes |
#Interpret: The adjusted p-values for "Non-Hispanic White-Mexican American", "Other-Mexican American", "Non-Hispanic White-Other Hispanic", "Other-Other Hispanic", "Non-Hispanic Black-Non-Hispanic White", and "Other-Non-Hispanic Black" are below 0.05, so we reject the null hypothesis of equal means for those pairs; these group differences are statistically significant. The remaining comparisons are not statistically significant.
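Since broom is loaded but otherwise unused, one way to build the comparison table above programmatically is sketched below (column names follow broom's tidy() output for TukeyHSD objects; output not shown):

# Tidy the Tukey results and flag comparisons significant at alpha = 0.05
tukey_table <- broom::tidy(tukey_results) %>%
  mutate(significant = adj.p.value < 0.05)
tukey_table %>% kable(digits = 3)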
#Step 6: Effect Size
# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
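As a quick hand check against the ANOVA table: η² = SS_ethnicity / (SS_ethnicity + SS_residual) = 2.95 / (2.95 + 55.70) ≈ 0.050, which agrees with the value computed in R.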
Interpretation: η² = 0.0503, so ethnicity explains about 5.03% of the variance in BMD. By conventional benchmarks this is a small effect size. Practically, the effect is real but modest in magnitude: ethnicity accounts for only about 5% of the variation in BMD.
###Part 2: Correlation Analysis
#Step 1: Create Total Intake Variables
bmd_new3 <- bmd_new2 %>%
  mutate(
    calcium_total = coalesce(calcium, 0) + coalesce(DSQTCALC, 0),
    vitd_total = coalesce(vitd, 0) + coalesce(DSQTVD, 0)
  )
#Step 2: Assess and Apply Transformations
#Create histograms and Q-Q plots for the continuous variables (BMI, calcium, vitamin D, MET)
hist(bmd_new3$BMXBMI, main = "Histogram: BMI",
     xlab = "Value", col = "lightblue")

hist(bmd_new3$calcium_total, main = "Histogram: Calcium Total",
     xlab = "Value", col = "lightblue")

qqnorm(bmd_new3$calcium_total, main = "Q-Q Plot: Calcium Total")
qqline(bmd_new3$calcium_total, col = "red", lwd = 2)

hist(bmd_new3$vitd_total, main = "Histogram: Vitamin D Total",
     xlab = "Value", col = "lightblue")

qqnorm(bmd_new3$vitd_total, main = "Q-Q Plot: Vitamin D Total")
qqline(bmd_new3$vitd_total, col = "red", lwd = 2)

#Log transformations
bmd_new3$log_BMXBMI <- log(bmd_new3$BMXBMI)
bmd_new3$log_calcium_total <- log(bmd_new3$calcium_total+1)
#SQRT transformations
bmd_new3$sqrt_totmet <- sqrt(bmd_new3$totmet)
bmd_new3$sqrt_vitd_total <- sqrt(bmd_new3$vitd_total)
#Re-check normality of the transformed variables
qqnorm(bmd_new3$log_BMXBMI, main = "Q-Q Plot: Log BMI")
qqline(bmd_new3$log_BMXBMI, col = "red", lwd = 2)

qqnorm(bmd_new3$log_calcium_total, main = "Q-Q Plot: Log Calcium Total")
qqline(bmd_new3$log_calcium_total, col = "red", lwd = 2)

qqnorm(bmd_new3$sqrt_totmet, main = "Q-Q Plot: sqrt MET")
qqline(bmd_new3$sqrt_totmet, col = "red", lwd = 2)

qqnorm(bmd_new3$sqrt_vitd_total, main = "Q-Q Plot: sqrt Vitamin D Total")
qqline(bmd_new3$sqrt_vitd_total, col = "red", lwd = 2)
Documentation: I applied transformations to the four variables BMI, calcium, vitamin D, and MET, using log for BMI and total calcium and square root for total MET and total vitamin D. The goal of the transformations was to reduce right skew and bring the distributions closer to normality.
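To quantify how much the transformations reduce skew, one option is to compare sample skewness before and after transforming (a supplementary sketch; skewness_check() is a hypothetical helper, not part of the assignment code; output not shown):

# Sample skewness (third standardized moment); hypothetical helper for illustration
skewness_check <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Compare raw vs. transformed skewness (assumes bmd_new3 from the steps above)
c(
  calcium_raw = skewness_check(bmd_new3$calcium_total),
  calcium_log = skewness_check(bmd_new3$log_calcium_total),
  vitd_raw    = skewness_check(bmd_new3$vitd_total),
  vitd_sqrt   = skewness_check(bmd_new3$sqrt_vitd_total)
)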
#Step 3: Three Correlation Tables
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  # Need at least 3 observations for correlation
  if (nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}
#Table A: Predictors vs. Outcome (BMD)
#Examine correlations between potential predictors and BMD:
#Age vs. BMD
#BMI vs. BMD
#Total calcium vs. BMD
#Total vitamin D vs. BMD
#Use transformed versions where appropriate.
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()
# Example structure:
tableA <- bind_rows(
  corr_pair(bmd_new3, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd_new3, "log_BMXBMI", "DXXOFBMD"),
  corr_pair(bmd_new3, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd_new3, "sqrt_vitd_total", "DXXOFBMD")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableA %>% kable()

| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.0000 |
| log_BMXBMI | DXXOFBMD | 2275 | 0.425 | 0.0000 |
| log_calcium_total | DXXOFBMD | 2286 | 0.030 | 0.1450 |
| sqrt_vitd_total | DXXOFBMD | 2286 | -0.051 | 0.0151 |
Interpret the results: Which variables show statistically significant correlations with BMD, and what is the direction and strength of these relationships? Age (RIDAGEYR, r = -0.232), log BMI (r = 0.425), and sqrt total vitamin D (r = -0.051) have p-values below 0.05, so we reject the null hypothesis of no correlation for those variables; log total calcium (p = 0.145) is not statistically significant. Age and vitamin D show weak inverse associations with BMD, while BMI shows a moderate positive association.
#Table B: Predictors vs. Exposure (Physical Activity)
#Examine correlations between potential confounders and physical activity:
#Age vs. MET
#BMI vs. MET
#Total calcium vs. MET
#Total vitamin D vs. MET
tableB <- bind_rows(
  corr_pair(bmd_new3, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd_new3, "log_BMXBMI", "sqrt_totmet"),
  corr_pair(bmd_new3, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd_new3, "sqrt_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableB %>% kable()

| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1588 | -0.105 | 2.67e-05 |
| log_BMXBMI | sqrt_totmet | 1582 | 0.009 | 7.32e-01 |
| log_calcium_total | sqrt_totmet | 1588 | 0.072 | 4.36e-03 |
| sqrt_vitd_total | sqrt_totmet | 1588 | -0.002 | 9.34e-01 |
Interpret the results: Which variables are associated with physical activity level?
Age (p = 2.67e-05) and log total calcium (p = 0.00436) are significantly associated with sqrt MET, although both correlations are weak (r = -0.105 and r = 0.072). Log BMI (p = 0.732) and sqrt total vitamin D (p = 0.934) are not significantly associated with physical activity.
#Table C: Predictors vs. Each Other
#Examine correlations among all predictor variables (all pairwise combinations):
#- Age, BMI, Total calcium, Total vitamin D
preds <- c("RIDAGEYR", "BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~ corr_pair(bmd_new3, .x[1], .x[2])) %>%
  mutate(
    estimate = round(estimate, 3),
    p_value = signif(p_value, 3)
  )
tableC %>% kable()

| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | BMXBMI | 2275 | -0.098 | 2.70e-06 |
| RIDAGEYR | log_calcium_total | 2286 | -0.024 | 2.58e-01 |
| RIDAGEYR | sqrt_vitd_total | 2286 | 0.146 | 0.00e+00 |
| BMXBMI | log_calcium_total | 2275 | 0.032 | 1.23e-01 |
| BMXBMI | sqrt_vitd_total | 2275 | 0.027 | 2.01e-01 |
| log_calcium_total | sqrt_vitd_total | 2286 | 0.224 | 0.00e+00 |
Interpret the results: Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses? The largest correlations among predictors are age with sqrt vitamin D (r = 0.146) and log calcium with sqrt vitamin D (r = 0.224). Both are statistically significant but modest in magnitude, so severe multicollinearity is unlikely, though these pairs are worth monitoring in future regression analyses.
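Looking ahead to regression, one way to screen for multicollinearity is with variance inflation factors via car::vif(), since car is already loaded. This is an illustrative sketch only (the model specification is an assumption, not part of the assignment; output not shown):

# Illustrative VIF check on the candidate predictors
vif_model <- lm(DXXOFBMD ~ RIDAGEYR + log_BMXBMI + log_calcium_total + sqrt_vitd_total,
                data = bmd_new3)
car::vif(vif_model)  # values near 1 indicate little multicollinearity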
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd_new3, aes(x = BMXBMI, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "BMI (BMXBMI)",
    y = "BMD",
    title = "BMD vs BMI"
  )

# Example: BMD vs Physical Activity
ggplot(bmd_new3, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(Total MET)",
    y = "BMD",
    title = "BMD vs MET (transformed)"
  )

Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
A. Comparing Methods (ANOVA vs Correlation): I would use ANOVA when I have one categorical predictor and a continuous outcome, and correlation when both variables are continuous. For example, in epidemiological research on cancer, I would use ANOVA to compare an outcome across cancer stages, but I would use correlation to study how the outcome relates to participants' age. The key difference is that ANOVA compares means across multiple treatments, exposure levels, or populations in public health without inflating the chance of false discoveries from many pairwise tests, while correlation describes the strength and direction of a linear relationship between two continuous variables in a health context. An ANOVA-style question would be: do COVID-19 vaccination rates differ across geographic areas? A correlation-style question would be: how are age-related changes associated with diabetes risk, to help improve prevention programs for at-risk populations?

B. Assumptions and Limitations: The assumptions that were most challenging to meet in this assignment were linearity and normality; BMI and calcium were skewed and needed transformations. Violations of these assumptions can affect my conclusions: with a small, unbalanced sample and severe violations, estimates and p-values become unreliable, which can lead to false conclusions. A limitation of cross-sectional correlation analysis for understanding the relationship between nutrition/activity and bone health is that both variables are measured at the same time, so causality cannot be determined. Self-reported nutrition and activity data can also introduce bias, since participants may not report their intake or activity accurately.

C. R Programming Challenge: The most difficult part of the R coding was converting ethnicity, sex, and smoker status correctly. I was unsure whether the labels should come from the codebook in the assignment or be written on my own; this was before the professor's template was available, and after spending hours on it I asked the TA for assistance. Asking for help is important because R code can be complex and hard to understand. The R skills I strengthened through Assignment 1 were recognizing which variables needed to be recoded, using the correct variables from the codebook, and catching small details such as missing commas, which prevents errors when running the code.

# Submission checklist
Before submitting, verify: