Part 0: Data Preparation

Load Required Packages

# 1. Load necessary libraries
library(tidyverse)   # For data manipulation and visualization
library(knitr)       # For nice tables
library(car)         # For Levene's test
library(NHANES)      # NHANES dataset
library(dplyr)
library (ggplot2)
library(kableExtra)
library(broom)
library(readr)
library(ggstats)

# Set seed
set.seed(553)

Set working directory and Import bmd.csv file

# Set to the folder containing bmd.csv
setwd("~/Desktop/EPI 553 ASSIGNMENT ")

# 2. Import CSV
bmd_data <- read_csv("bmd.csv")
## Rows: 2898 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): SEQN, RIAGENDR, RIDAGEYR, RIDRETH1, BMXBMI, smoker, totmet, metcat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first few rows
head(bmd_data)
## # A tibble: 6 × 14
##    SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
##   <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
## 1 93705        2       66        4   31.7      2    240      0    1.06        0
## 2 93708        2       66        5   23.7      3    120      0    0.801       1
## 3 93709        2       75        4   38.9      1    720      1    0.88        0
## 4 93711        1       56        5   21.3      3    840      1    0.851       1
## 5 93713        1       67        3   23.5      1    360      0    0.778       1
## 6 93714        2       54        4   39.9      2     NA     NA    0.994       0
## # ℹ 4 more variables: calcium <dbl>, vitd <dbl>, DSQTVD <dbl>, DSQTCALC <dbl>
# See all column names
names(bmd_data)
##  [1] "SEQN"     "RIAGENDR" "RIDAGEYR" "RIDRETH1" "BMXBMI"   "smoker"  
##  [7] "totmet"   "metcat"   "DXXOFBMD" "tbmdcat"  "calcium"  "vitd"    
## [13] "DSQTVD"   "DSQTCALC"
# Get summary statistics

summary(bmd_data)
##       SEQN           RIAGENDR        RIDAGEYR        RIDRETH1    
##  Min.   : 93705   Min.   :1.000   Min.   :50.00   Min.   :1.000  
##  1st Qu.: 95902   1st Qu.:1.000   1st Qu.:58.00   1st Qu.:3.000  
##  Median : 98233   Median :2.000   Median :64.00   Median :3.000  
##  Mean   : 98257   Mean   :1.506   Mean   :65.21   Mean   :3.257  
##  3rd Qu.:100581   3rd Qu.:2.000   3rd Qu.:73.00   3rd Qu.:4.000  
##  Max.   :102952   Max.   :2.000   Max.   :80.00   Max.   :5.000  
##                                                                  
##      BMXBMI          smoker          totmet         metcat      
##  Min.   :14.20   Min.   :1.000   Min.   :   0   Min.   :0.0000  
##  1st Qu.:25.20   1st Qu.:2.000   1st Qu.: 240   1st Qu.:0.0000  
##  Median :28.70   Median :3.000   Median : 480   Median :0.0000  
##  Mean   :29.76   Mean   :2.381   Mean   :1015   Mean   :0.7048  
##  3rd Qu.:33.40   3rd Qu.:3.000   3rd Qu.:1430   3rd Qu.:1.0000  
##  Max.   :84.40   Max.   :3.000   Max.   :9120   Max.   :2.0000  
##  NA's   :64      NA's   :2       NA's   :964    NA's   :964     
##     DXXOFBMD         tbmdcat          calcium            vitd       
##  Min.   :0.3700   Min.   :0.0000   Min.   :  55.0   Min.   : 0.000  
##  1st Qu.:0.8123   1st Qu.:0.0000   1st Qu.: 525.0   1st Qu.: 1.750  
##  Median :0.9165   Median :0.0000   Median : 773.0   Median : 3.200  
##  Mean   :0.9223   Mean   :0.3985   Mean   : 840.5   Mean   : 4.471  
##  3rd Qu.:1.0320   3rd Qu.:1.0000   3rd Qu.:1049.0   3rd Qu.: 5.500  
##  Max.   :1.5500   Max.   :2.0000   Max.   :5199.0   Max.   :57.900  
##  NA's   :612      NA's   :612      NA's   :293      NA's   :293     
##      DSQTVD            DSQTCALC      
##  Min.   :   0.004   Min.   :   0.74  
##  1st Qu.:  15.709   1st Qu.:  93.33  
##  Median :  25.000   Median : 211.67  
##  Mean   :  49.851   Mean   : 357.98  
##  3rd Qu.:  50.000   3rd Qu.: 500.00  
##  Max.   :2570.000   Max.   :3220.00  
##  NA's   :1515       NA's   :1633

Create analysis dataset

# 3. Convert RIDRETH1 to factor with meaningful ethnicity labels
bmd_data$RIDRETH1 <- factor(bmd_data$RIDRETH1,
                        levels = c(1, 2, 3, 4, 5),
                        labels = c("Mexican American", 
                                   "Other Hispanic", 
                                   "Non-Hispanic White", 
                                   "Non-Hispanic Black", 
                                   "Other"))

# Check the conversion worked
table(bmd_data$RIDRETH1)
## 
##   Mexican American     Other Hispanic Non-Hispanic White Non-Hispanic Black 
##                331                286               1086                696 
##              Other 
##                499
# Or view structure
str(bmd_data$RIDRETH1)
##  Factor w/ 5 levels "Mexican American",..: 4 5 4 5 3 4 5 5 1 3 ...
# First 20 rows
head(bmd_data$RIDRETH1, 20)
##  [1] Non-Hispanic Black Other              Non-Hispanic Black Other             
##  [5] Non-Hispanic White Non-Hispanic Black Other              Other             
##  [9] Mexican American   Non-Hispanic White Non-Hispanic White Mexican American  
## [13] Other              Non-Hispanic Black Other Hispanic     Non-Hispanic White
## [17] Other Hispanic     Non-Hispanic Black Non-Hispanic Black Non-Hispanic White
## 5 Levels: Mexican American Other Hispanic ... Other
# 4. Convert RIAGENDR to factor (Male/Female) labels
bmd_data$RIAGENDR <- factor(bmd_data$RIAGENDR,
                        levels = c(1, 2),
                        labels = c("Male", "Female"))

# Check the conversion
head(bmd_data$RIAGENDR)
## [1] Female Female Female Male   Male   Female
## Levels: Male Female
# Check the frequency
table(bmd_data$RIAGENDR)
## 
##   Male Female 
##   1431   1467
# 5. Convert smoker to factor with Current/Past/Never labels
bmd_data$smoker <- factor(bmd_data$smoker,
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))

# Check the conversion
head(bmd_data$smoker)
## [1] Past    Never   Current Never   Current Past   
## Levels: Current Past Never
# Or see the frequency
table(bmd_data$smoker)
## 
## Current    Past   Never 
##     449     894    1553
# 6. Report total sample size
total_n <- nrow(bmd_data)
cat("Total sample size:", total_n, "\n")
## Total sample size: 2898
# Or simply
nrow(bmd_data)
## [1] 2898
# 7. Create analytic dataset removing missing BMD and ethnicity
# First, check which column is your BMD variable (likely DXXFEBMD or similar)
names(bmd_data)
##  [1] "SEQN"     "RIAGENDR" "RIDAGEYR" "RIDRETH1" "BMXBMI"   "smoker"  
##  [7] "totmet"   "metcat"   "DXXOFBMD" "tbmdcat"  "calcium"  "vitd"    
## [13] "DSQTVD"   "DSQTCALC"
# Remove rows with missing BMD and ethnicity
# Replace 'BMD_column_name' with your actual BMD column name
analytic_data <- bmd_data[complete.cases(bmd_data$DXXOFBMD, bmd_data$RIDRETH1), ]

# Alternative using na.omit for specific columns
analytic_data <- bmd_data[!is.na(bmd_data$DXXOFBMD) & !is.na(bmd_data$RIDRETH1), ]

analytic_data <- bmd_data %>%
  filter(!is.na(DXXOFBMD) & !is.na(RIDRETH1))



# 8. Report final analytic sample size
final_n <- nrow(analytic_data)
cat("Final analytic sample size:", final_n, "\n")
## Final analytic sample size: 2286
# Show how many were excluded
excluded <- total_n - final_n
cat("Number excluded:", excluded, "\n")
## Number excluded: 612

Part 1: One-Way ANOVA

** Research Question: Do mean BMD values differ across ethnicity groups? **

Step 1: State Hypothesis - tate null and alternative hypotheses in: - Statistical notation

(H_0: mu_1 = mu_2 = mu_3 = mu_4 = mu_5) - Plain language (what this means for bone health)

Null Hypothesis (H₀): μ_1 = μ_2 = μ_3 = μ_4 = μ_5
(All five population means are equal)

Alternative Hypothesis (H₁): At least one population mean differs from the others

Significance level: α = 0.05

Step 1a: Fit the ANOVA Model

#Fit the ONE-WAY ANOVA model
anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = analytic_data)

# View the ANOVA results
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## RIDRETH1       4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

f = 4 Mean = 0.7371 Pr(>F) = <2e-16 F-statistic = 30.18

Interpretation:

  • F-statistic: 30.19
  • Degrees of freedom: df₁ = 2 (k-1 groups), df₂ = 2281 (n-k)
  • p-value: < 2e-16 (very small)
  • Decision: reject H₀ if p < 0.05
  • Conclusion: There is statistically significant evidence that mean BMD differs across at least two ethnicity groups.

Step 2: Exploratory Analysis (10 points) Create: - Table: n, mean, SD of BMD by ethnicity (use

group_by(), summarize(), kable()) - Plot: Boxplot or violin plot with jittered points showing BMD distributions

# Create summary table
bmd_summary_stats <- analytic_data %>%
  group_by(RIDRETH1) %>%
  summarize(
    n = n(),
    Mean = round(mean(DXXOFBMD, na.rm = TRUE), 3),
    SD = round(sd(DXXOFBMD, na.rm = TRUE), 3)
  )

# Display as formatted table
kable(bmd_summary_stats, 
      caption = "Bone Mineral Density by Ethnicity",
      col.names = c("Ethnicity", "n", "Mean BMD", "SD"))
Bone Mineral Density by Ethnicity
Ethnicity n Mean BMD SD
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148

Step 2a: Data Visualization

# Boxplot with jittered points
ggplot(analytic_data, 
       aes(x = RIDRETH1, y = DXXOFBMD, fill = RIDRETH1)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Bone Mineral Density by Ethnicity",
    subtitle = "NHANES Data",
    x = "Ethnicity",
    y = "BMD (g/cm²)",
    fill = "Ethnicity"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

What the plot tells us:

The differences are small - you can see the boxes overlap a lot, meaning the groups are more similar than different Lots of dots (data points) - this shows we have many people in each group, making our results more reliable

Step 3: ANOVA F-test

# Fit ANOVA model
anova_model <- aov(DXXOFBMD ~ RIDRETH1, data = analytic_data)

# Get ANOVA results
anova_results <- summary(anova_model)

# Create formatted table
anova_table <- data.frame(
  Source = c("Ethnicity (Between Groups)", "Residuals (Within Groups)"),
  Df = c(4, 2281),
  `Sum Sq` = c(2.95, 55.70),
  `Mean Sq` = c(0.7371, 0.0244),
  `F value` = c(30.18, NA),
  `p-value` = c("<0.001", NA)
)

kable(anova_table, 
      caption = "One-Way ANOVA Results: BMD by Ethnicity",
      digits = 4,
      col.names = c("Source", "df", "Sum of Squares", "Mean Square", "F-statistic", "p-value"))
One-Way ANOVA Results: BMD by Ethnicity
Source df Sum of Squares Mean Square F-statistic p-value
Ethnicity (Between Groups) 4 2.95 0.7371 30.18 <0.001
Residuals (Within Groups) 2281 55.70 0.0244 NA NA
# Print results
print(anova_results)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## RIDRETH1       4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Df = 4 Mean = 0.7371 Pr(>F) = <2e-16 F-statistic = 30.18 f = 4 Mean = 0.7371 Pr(>F) = <2e-16 F-statistic = 30.18

Interpretation:

Decision and Interpretation: Decision: We reject the null hypothesis (p < 0.001).

Interpretation: There is statistically significant evidence that mean bone mineral density differs across ethnicity groups (F(4, 2281) = 30.18, p < 0.001). At least one ethnic group has a significantly different mean BMD compared to the others.

#Step 4: Assumption Checks

ANOVA Assumptions:

  1. Independence: Observations are independent (assumed based on study design)
  2. Normality: Residuals are approximately normally distributed
  3. Homogeneity of variance: Equal variances across groups

A. Normality Check (Q-Q Plot)

# Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_model)

par(mfrow = c(1, 1))

1.Residuals vs Fitted: Points show vertical bands at distinct fitted values (corresponding to the 5 ethnic groups) with random scatter around zero within each band. No clear pattern or funnel shape → Good! This suggests the model fits reasonably well. 2. Q-Q Plot: Points follow the diagonal line quite well through the middle, with slight deviations at the extreme ends (tails). This is acceptable, especially with a large sample size → Normality assumption is reasonable. 3. Scale-Location: Shows vertical bands at each fitted value with relatively consistent spread. The pattern shows similar variability across groups → Equal variance assumption is reasonable. 4. Residuals vs Leverage: All points fall well within Cook’s distance boundaries (the dashed lines). No points extend beyond these lines → No highly influential outliers that would unduly affect our ANOVA results.

B. Equal Variances (Levene’s Test)

# Conduct Levene's test
levene_test <- leveneTest(DXXOFBMD ~ RIDRETH1, data = analytic_data)

# Display results
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281
# Create formatted table
levene_table <- data.frame(
  Test = "Levene's Test",
  Df1 = levene_test$Df[1],
  Df2 = levene_test$Df[2],
  `F value` = round(levene_test$`F value`[1], 3),
  `p-value` = round(levene_test$`Pr(>F)`[1], 4)
)

kable(levene_table,
      caption = "Levene's Test for Homogeneity of Variance",
      col.names = c("Test", "df1", "df2", "F-statistic", "p-value"))
Levene’s Test for Homogeneity of Variance
Test df1 df2 F-statistic p-value
Levene’s Test 4 2281 1.573 0.1788

Levene’s Test Interpretation: Test Results: F-statistic: 1.573 Degrees of freedom: df1 = 4, df2 = 2281 p-value: 0.1788

Interpretation: Since p = 0.179 > 0.05, we fail to reject the null hypothesis of equal variances. This means the equal variance assumption is met - the variability in BMD is similar across all five ethnic groups.

Conclusion: The homogeneity of variance assumption required for ANOVA is satisfied. The Scale-Location plot showed consistent spread, we can be confident that our ANOVA assumptions are valid.

#Step 5: Post-hoc Comparisons (Tukey HSD)

# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_model)
print(tukey_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ RIDRETH1, data = analytic_data)
## 
## $RIDRETH1
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000
# Visualize the confidence intervals
plot(tukey_results, las = 0)

Tukey HSD Post-hoc Comparisons: Pairwise Differences in BMD
Comparison Mean Difference 95% CI Lower 95% CI Upper Adjusted p-value Significant
Other Hispanic-Mexican American Other Hispanic-Mexican American -0.0044 -0.0426 0.0338 0.9978 No
Non-Hispanic White-Mexican American Non-Hispanic White-Mexican American -0.0624 -0.0929 -0.0319 0.0000 Yes
Non-Hispanic Black-Mexican American Non-Hispanic Black-Mexican American 0.0224 -0.0101 0.0549 0.3277 No
Other-Mexican American Other-Mexican American -0.0533 -0.0874 -0.0193 0.0002 Yes
Non-Hispanic White-Other Hispanic Non-Hispanic White-Other Hispanic -0.0579 -0.0889 -0.0269 0.0000 Yes
Non-Hispanic Black-Other Hispanic Non-Hispanic Black-Other Hispanic 0.0268 -0.0062 0.0598 0.1722 No
Other-Other Hispanic Other-Other Hispanic -0.0489 -0.0834 -0.0144 0.0011 Yes
Non-Hispanic Black-Non-Hispanic White Non-Hispanic Black-Non-Hispanic White 0.0848 0.0612 0.1084 0.0000 Yes
Other-Non-Hispanic White Other-Non-Hispanic White 0.0091 -0.0166 0.0347 0.8719 No
Other-Non-Hispanic Black Other-Non-Hispanic Black -0.0757 -0.1038 -0.0477 0.0000 Yes

Interpretation: Based on the Tukey-adjusted 95% confidence intervals, ethnic group comparisons whose intervals exclude zero (do not cross the vertical dashed line) show statistically significant differences in mean BMD, whereas comparisons whose intervals include zero indicate no significant difference between those groups.

Pair-wise Interpretation: Non-Hispanic White vs Mexican American: Mean difference = -0.0624 g/cm² (95% CI: -0.0929, -0.0319, p < 0.001). Non-Hispanic White individuals have significantly lower BMD than Mexican Americans.

Other vs Mexican American: Mean difference = -0.0533 g/cm² (95% CI: -0.0874, -0.0193, p = 0.0002). The Other ethnicity group has significantly lower BMD than Mexican Americans.

Non-Hispanic White vs Other Hispanic: Mean difference = -0.0579 g/cm² (95% CI: -0.0889, -0.0269, p < 0.001). Non-Hispanic White individuals have significantly lower BMD than Other Hispanic individuals.

Other vs Other Hispanic: Mean difference = -0.0489 g/cm² (95% CI: -0.0834, -0.0144, p = 0.0011). The Other ethnicity group has significantly lower BMD than Other Hispanic individuals.

Non-Hispanic Black vs Non-Hispanic White: Mean difference = 0.0848 g/cm² (95% CI: 0.0612, 0.1084, p < 0.001). Non-Hispanic Black individuals have significantly higher BMD than Non-Hispanic White individuals.

Other vs Non-Hispanic Black: Mean difference = -0.0757 g/cm² (95% CI: -0.1038, -0.0477, p < 0.001). The Other ethnicity group has significantly lower BMD than Non-Hispanic Black individuals.

Step 6: Effect Size (Eta-squared)

# Hint: Extract Sum Sq from the ANOVA summary

# Extract sum of squares from ANOVA table
anova_summary <- summary(anova_model)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
# Create formatted output
effect_size_table <- data.frame(
  Measure = "Eta-squared (η²)",
  Value = round(eta_squared, 4),
  `Percent Variance` = paste0(round(eta_squared * 100, 2), "%"),
  Interpretation = "Small to Medium Effect"
)

kable(effect_size_table,
      caption = "Effect Size: Proportion of Variance Explained by Ethnicity",
      col.names = c("Effect Size Measure", "Value", "% Variance Explained", "Interpretation"))
Effect Size: Proportion of Variance Explained by Ethnicity
Effect Size Measure Value % Variance Explained Interpretation
Eta-squared (η²) 0.0503 5.03% Small to Medium Effect

-Effect Size Interpretation: η² = r round(eta_squared, 4) or r round(eta_squared * 100, 2)% Small

Effect Size Measure Value % Variance Explained Interpretation
Eta-squared (η²) 0.0503 5.03% Small to Medium Effect

Interpretation: Race/Ethnicity explains 5.03% of the variance in bone mineral density. This indicates a small to moderate effect size, meaning that while ethnicity is a statistically significant predictor of BMD, it accounts for only a small portion of the total variation in bone density. Other factors (age, sex, lifestyle, genetics, nutrition, etc.) explain the remaining 95% of the variance.

This represents a small to medium effect size, indicating that while ethnicity is a statistically significant predictor of BMD, it accounts for only a modest portion of the total variation. Other factors (genetics, lifestyle, nutrition, physical activity, age, etc.) explain the remaining ~95% of variance in bone mineral density.

Part 2: Correlation Analysis (40 points)

Step 1: Create Total Intake Variables (5 points) Calculate: - calcium_total = calcium +

DSQTCALC (replace NA with 0) - vitd_total = vitd + DSQTVD (replace NA with 0)

# Replace NA with 0 and create total variables
analytic_data$calcium_total <- analytic_data$calcium + 
                                ifelse(is.na(analytic_data$DSQTCALC), 0, analytic_data$DSQTCALC)

analytic_data$vitd_total <- analytic_data$vitd + 
                            ifelse(is.na(analytic_data$DSQTVD), 0, analytic_data$DSQTVD)

# Verify
summary(analytic_data[, c("calcium_total", "vitd_total")])
##  calcium_total      vitd_total     
##  Min.   :  55.0   Min.   :   0.00  
##  1st Qu.: 604.5   1st Qu.:   2.95  
##  Median : 901.5   Median :   8.60  
##  Mean   :1001.5   Mean   :  29.12  
##  3rd Qu.:1288.5   3rd Qu.:  30.40  
##  Max.   :5399.0   Max.   :2574.65  
##  NA's   :157      NA's   :157
# Check for any remaining NAs
sum(is.na(analytic_data$calcium_total))
## [1] 157
sum(is.na(analytic_data$vitd_total))
## [1] 157

Step 2: Assess and Apply Transformations (5 points)

• Create histograms for continuous variables (BMI, calcium, vitamin D, MET) • Decide on transformations based on skewness • Document reasoning

Suggested transformations: - Log transformation: BMI, calcium - Square root: MET, vitamin D (handles zeros better)

A. Create Histograms for continuous variables Variables (BMI, calcium, vitamin D, MET)

# Create histograms for all continuous variables
p1 <- ggplot(analytic_data, aes(x = BMXBMI)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "BMI Distribution", x = "BMI", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(analytic_data, aes(x = calcium_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Calcium Distribution", x = "Calcium (mg)", y = "Frequency") +
  theme_minimal()

p3 <- ggplot(analytic_data, aes(x = vitd_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Vitamin D Distribution", x = "Vitamin D (mcg)", y = "Frequency") +
  theme_minimal()

p4 <- ggplot(analytic_data, aes(x = totmet)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total MET Distribution", x = "MET (min/week)", y = "Frequency") +
  theme_minimal()

# Display all plots together
print(p1)
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(p2)
## Warning: Removed 157 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(p3)
## Warning: Removed 157 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(p4)
## Warning: Removed 698 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Check missing data for each variable
sum(is.na(analytic_data$BMXBMI))        # 11 missing
## [1] 11
sum(is.na(analytic_data$calcium_total)) # 157 missing
## [1] 157
sum(is.na(analytic_data$vitd_total))    # 157 missing
## [1] 157
sum(is.na(analytic_data$totmet))        # 698 missing
## [1] 698
# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(analytic_data[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases)
## Complete cases for all variables: 1496

Interpretation

  1. BMI Distribution (p1): Right-skewed distribution with a long tail extending to higher BMI values Most data clustered around 25-30, but some extend to 50-60

Decision: Apply LOG transformation

Justification: Right-skewed shape needs correction No zeros in BMI data, so log is safe to use Log will compress the right tail and make distribution more symmetric for better correlation analysis

  1. Total Calcium Distribution (p2): Highly right-skewed with most values clustered at the low end (0-1000 mg) Long right tail extending to 4000+ mg Possible spike near zero

Decision: Apply LOG transformation

Justification: Severely right-skewed distribution Potential zero values (some people may have no calcium intake) Log will compress the extreme high values and create more symmetry

  1. Total Vitamin D Distribution (p3): Extremely right-skewed, massive spike at zero, few values extend to 2000+ mcg

Decision: SQUARE ROOT transformation

Justification: Many zeros present (people without supplements), sqrt handles zeros naturally without adding constants, appropriate for heavily zero-inflated data

  1. Total MET Distribution (p4): Severely right-skewed, large spike at zero (sedentary people), long tail to 7500+ min/week

Decision: SQUARE ROOT transformation

Justification: Many zeros (sedentary individuals), extremely skewed, sqrt handles zeros naturally, appropriate for count-like physical activity data

# Apply transformations based on visual assessment
analytic_data <- analytic_data %>%
  mutate(
    bmi_log = log(BMXBMI),
    calcium_log = log(calcium_total + 1),
    vitd_sqrt = sqrt(vitd_total),
    met_sqrt = sqrt(totmet)
  )

# Check that variables were created
names(analytic_data)
##  [1] "SEQN"          "RIAGENDR"      "RIDAGEYR"      "RIDRETH1"     
##  [5] "BMXBMI"        "smoker"        "totmet"        "metcat"       
##  [9] "DXXOFBMD"      "tbmdcat"       "calcium"       "vitd"         
## [13] "DSQTVD"        "DSQTCALC"      "calcium_total" "vitd_total"   
## [17] "bmi_log"       "calcium_log"   "vitd_sqrt"     "met_sqrt"
# Check missing data for each variable
sum(is.na(analytic_data$BMXBMI))        # 11 missing
## [1] 11
sum(is.na(analytic_data$calcium_total)) # 157 missing
## [1] 157
sum(is.na(analytic_data$vitd_total))    # 157 missing
## [1] 157
sum(is.na(analytic_data$totmet))        # 698 missing
## [1] 698
# Create comparison plots
# BMI
p5 <- ggplot(analytic_data, aes(x = BMXBMI)) +
  geom_histogram(fill = "coral", color = "black", bins = 30) +
  labs(title = "Original BMI (Right-Skewed)", x = "BMI") +
  theme_minimal()

p6 <- ggplot(analytic_data, aes(x = bmi_log)) +
  geom_histogram(fill = "lightgreen", color = "black", bins = 30) +
  labs(title = "Log-Transformed BMI (More Normal)", x = "log(BMI)") +
  theme_minimal()

print(p5)
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(p6)
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Calcium
p7 <- ggplot(analytic_data, aes(x = calcium_total)) +
  geom_histogram(fill = "coral", color = "black", bins = 30) +
  labs(title = "Original Calcium (Highly Skewed)", x = "Calcium (mg)") +
  theme_minimal()

p8 <- ggplot(analytic_data, aes(x = calcium_log)) +
  geom_histogram(fill = "lightgreen", color = "black", bins = 30) +
  labs(title = "Log-Transformed Calcium (Improved)", x = "log(Calcium+1)") +
  theme_minimal()

print(p7)
## Warning: Removed 157 rows containing non-finite outside the scale range
## (`stat_bin()`).

print(p8)
## Warning: Removed 157 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interpretation BMI Transformation Summary: Before (Coral): Right-skewed with long tail extending to high BMI values (50-60) After (Green): More symmetric and bell-shaped distribution Result: Transformation successful - reduced skewness, improved normality for correlation analysis

Calcium Transformation Summary: Before (Coral): Highly right-skewed, clustered at low values After (Green): Bell-shaped, symmetric, approximately normal Result: Transformation successful - distribution now suitable for correlation analysis

Step 3: Correlation Analysis

# Create the correlation helper function

corr_pair <- function(data, var1, var2) {
  # Remove missing values
  complete_data <- data[complete.cases(data[[var1]], data[[var2]]), ]
  
  # Calculate correlation test
  cor_test <- cor.test(complete_data[[var1]], complete_data[[var2]], 
                       method = "pearson")
  
  # Extract results
  result <- data.frame(
    Variable_1 = var1,
    Variable_2 = var2,
    r = round(cor_test$estimate, 3),
    p_value = format.pval(cor_test$p.value, digits = 3),
    n = nrow(complete_data)
  )
  
  return(result)
  
}

Table A: Predictors vs. Outcome (BMD)

# Calculate correlations between predictors and BMD
table_a <- rbind(
  corr_pair(analytic_data, "RIDAGEYR", "DXXOFBMD"),      # Age vs BMD
  corr_pair(analytic_data, "bmi_log", "DXXOFBMD"),       # BMI vs BMD
  corr_pair(analytic_data, "calcium_log", "DXXOFBMD"),   # Calcium vs BMD
  corr_pair(analytic_data, "vitd_sqrt", "DXXOFBMD")     # Vitamin D vs BMD
)

# Add descriptive labels
table_a <- table_a %>%
  mutate(
    Variable_1 = case_when(
      Variable_1 == "RIDAGEYR" ~ "Age (years)",
      Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_1
    ),
    Variable_2 = "BMD (g/cm²)"
  )

# Display formatted table
kable(table_a,
      caption = "Table A: Correlations between Predictors and BMD (Outcome)",
      col.names = c("Predictor", "Outcome", "r", "p-value", "n"),
      align = c('l', 'l', 'r', 'r', 'r'))
Table A: Correlations between Predictors and BMD (Outcome)
Predictor Outcome r p-value n
cor Age (years) BMD (g/cm²) -0.232 <2e-16 2286
cor1 BMI (log-transformed) BMD (g/cm²) 0.425 <2e-16 2275
cor2 Total Calcium (log-transformed) BMD (g/cm²) 0.012 0.592 2129
cor3 Total Vitamin D (sqrt-transformed) BMD (g/cm²) -0.062 0.00426 2129
Table A: Correlations between Predictors and BMD (Outcome)
Predictor Outcome r p-value n
cor Age (years) BMD (g/cm²) -0.232 <2e-16 2286
cor1 BMI (log-transformed) BMD (g/cm²) 0.425 <2e-16 2275
cor2 Total Calcium (log-transformed) BMD (g/cm²) 0.012 0.592 2129
cor3 Total Vitamin D (sqrt-transformed) BMD (g/cm²) -0.062 0.00426 2129

Interpretation: BMI shows the strongest association with BMD (moderate positive), while age shows a weak negative association. Surprisingly, neither calcium nor vitamin D intake show meaningful positive associations with BMD in this cross-sectional analysis.

Table B: Predictors vs. Exposure (Physical Activity)

# Table B: Predictors vs. Exposure (Physical Activity)
table_b <- rbind(
  corr_pair(analytic_data, "RIDAGEYR", "met_sqrt"),  # Age vs MET
  corr_pair(analytic_data, "bmi_log", "met_sqrt"),   # BMI vs MET
  corr_pair(analytic_data, "calcium_log", "met_sqrt"), # Calcium vs MET
  corr_pair(analytic_data, "vitd_sqrt", "met_sqrt")  # Vitamin D vs MET
)

# Add descriptive labels
table_b <- table_b %>%
  mutate(
    Variable_1 = case_when(
      Variable_1 == "RIDAGEYR" ~ "Age (years)",
      Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_1
    ),
    Variable_2 = "Physical Activity (MET, sqrt-transformed)"
  )

# Display formatted table
kable(table_b,
      digits = 3,
      col.names = c("Predictor", "Exposure", "r", "p-value", "n"),
      caption = "Table B: Correlations between Predictors and Physical Activity (Exposure)",
      align = c('l', 'l', 'r', 'r', 'r'))
Table B: Correlations between Predictors and Physical Activity (Exposure)
Predictor Exposure r p-value n
cor Age (years) Physical Activity (MET, sqrt-transformed) -0.105 2.67e-05 1588
cor1 BMI (log-transformed) Physical Activity (MET, sqrt-transformed) 0.009 0.732 1582
cor2 Total Calcium (log-transformed) Physical Activity (MET, sqrt-transformed) 0.068 0.00875 1500
cor3 Total Vitamin D (sqrt-transformed) Physical Activity (MET, sqrt-transformed) -0.003 0.893 1500
Table B: Correlations between Predictors and Physical Activity (Exposure)
Predictor Exposure r p-value n
cor Age (years) Physical Activity (MET, sqrt-transformed) -0.105 2.67e-05 1588
cor1 BMI (log-transformed) Physical Activity (MET, sqrt-transformed) 0.009 0.732 1582
cor2 Total Calcium (log-transformed) Physical Activity (MET, sqrt-transformed) 0.068 0.00875 1500
cor3 Total Vitamin D (sqrt-transformed) Physical Activity (MET, sqrt-transformed) -0.003 0.893 1500

Interpretation Age shows a weak negative association with physical activity (older = less active). BMI and vitamin D show no meaningful associations. The calcium-MET relationship is statistically significant but extremely weak (r = 0.068) with minimal practical importance. Vitamin D intake is not associated with physical activity levels - no significant association..

#Table C: Predictors vs. Each Other (Intercorrelations)
table_c <- rbind(
  corr_pair(analytic_data, "RIDAGEYR", "bmi_log"),
  corr_pair(analytic_data, "RIDAGEYR", "calcium_log"),
  corr_pair(analytic_data, "RIDAGEYR", "vitd_sqrt"),
  corr_pair(analytic_data, "bmi_log", "calcium_log"),
  corr_pair(analytic_data, "bmi_log", "vitd_sqrt"),
  corr_pair(analytic_data, "calcium_log", "vitd_sqrt")
)

# Add descriptive labels
table_c <- table_c %>%
  mutate(
    Variable_1 = case_when(
      Variable_1 == "RIDAGEYR" ~ "Age (years)",
      Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_1
    ),
    Variable_2 = case_when(
      Variable_2 == "RIDAGEYR" ~ "Age (years)",
      Variable_2 == "bmi_log" ~ "BMI (log-transformed)",
      Variable_2 == "calcium_log" ~ "Total Calcium (log-transformed)",
      Variable_2 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
      TRUE ~ Variable_2
    )
  )

# Display formatted table
kable(table_c,
      digits = 3,
      col.names = c("Variable 1", "Variable 2", "r", "p-value", "n"),
      caption = "Table C: Intercorrelations Among Predictors",
      align = c('l', 'l', 'r', 'r', 'r'))
Table C: Intercorrelations Among Predictors
Variable 1 Variable 2 r p-value n
cor Age (years) BMI (log-transformed) -0.093 9.28e-06 2275
cor1 Age (years) Total Calcium (log-transformed) 0.067 0.0021 2129
cor2 Age (years) Total Vitamin D (sqrt-transformed) 0.152 2.02e-12 2129
cor3 BMI (log-transformed) Total Calcium (log-transformed) 0.009 0.686 2122
cor4 BMI (log-transformed) Total Vitamin D (sqrt-transformed) 0.026 0.226 2122
cor5 Total Calcium (log-transformed) Total Vitamin D (sqrt-transformed) 0.291 <2e-16 2129
Table C: Intercorrelations Among Predictors
Variable 1 Variable 2 r p-value n
cor Age (years) BMI (log-transformed) -0.093 9.28e-06 2275
cor1 Age (years) Total Calcium (log-transformed) 0.067 0.0021 2129
cor2 Age (years) Total Vitamin D (sqrt-transformed) 0.152 2.02e-12 2129
cor3 BMI (log-transformed) Total Calcium (log-transformed) 0.009 0.686 2122
cor4 BMI (log-transformed) Total Vitamin D (sqrt-transformed) 0.026 0.226 2122
cor5 Total Calcium (log-transformed) Total Vitamin D (sqrt-transformed) 0.291 <2e-16 2129
  • Age vs BMI: r = -0.093, p < 0.001, n = 2275 Very weak negative association. Older individuals tend to have slightly lower BMI in this sample. Statistically significant but weak effect.

  • Age vs Calcium: r = 0.067, p = 0.002, n = 2129 Very weak positive association. Older individuals tend to have slightly higher calcium intake. Statistically significant but minimal practical importance.

  • Age vs Vitamin D: r = 0.152, p < 0.001, n = 2129 Weak positive association. Older individuals tend to have higher vitamin D intake. This is the strongest age-related correlation among nutrients.

  • BMI vs Calcium: r = 0.009, p = 0.686, n = 2122 No significant association. BMI is not related to calcium intake.

  • BMI vs Vitamin D: r = 0.026, p = 0.226, n = 2122 No significant association. BMI is not related to vitamin D intake.

  • Calcium vs Vitamin D: r = 0.291, p < 0.001, n = 2129 Weak to moderate positive association. Individuals with higher calcium intake tend to have higher vitamin D intake. This is the strongest correlation in Table C, suggesting people who consume more of one nutrient tend to consume more of the other, likely reflecting overall dietary patterns or supplement use.

Part 3: Reflection (10 points)

Write 200–400 words addressing:

A. Comparing Methods (3 points) - When to use ANOVA vs. correlation? - Key differences in what they tell us - Example research questions for each

Answer:

When to use each method: ANOVA is most appropriate when comparing the mean values of a continuous outcome across two or more categorical groups, such as examining differences in BMD among ethnic groups. In contrast, correlation is used to assess the linear relationship between two continuous variables, such as the association between age and BMD.

Key differences: ANOVA evaluates whether statistically significant differences exist between group means, whereas correlation measures the strength and direction of a linear relationship between variables. In other words, ANOVA answers questions about whether groups differ from one another, while correlation focuses on how strongly two variables are related. For example, ANOVA demonstrated significant differences in BMD across ethnic groups (F = 30.18, p < 0.001), whereas correlation analysis indicated that age is negatively associated with BMD (r = -0.232, p < 0.001).

Example research questions:

B. Assumptions & Limitations (4 points) - Which assumptions were challenging? - How might violations affect conclusions? - Limitations of cross-sectional correlation for causal inference

Answer:

Challenging assumptions: Meeting the normality assumption was particularly difficult for variables such as calcium intake and MET, both of which exhibited pronounced right-skewness. Logarithmic and square root transformations were applied to improve normality. In contrast, the assumption of equal variances was more readily satisfied, as supported by Levene’s test (p = 0.179).

Impact of violations: Departures from normality can reduce the accuracy of correlation coefficients and compromise the validity of hypothesis testing. Substantial skewness may inflate Type I error rates and lead to misleading correlation estimates. Although data transformations helped mitigate these issues, some residual non-normality persisted in the distribution tails.

Cross-sectional limitations: The primary limitation of this study is that correlation does not establish causation. Because the design is cross-sectional, we cannot determine whether low calcium intake leads to poor bone mineral density (BMD) or if individuals with existing bone health concerns adjust their calcium intake in response. Furthermore, unmeasured confounding variables—such as genetic factors, lifetime physical activity, and medication use—may influence the results. The unexpected negative correlation between vitamin D and BMD (r = -0.062) likely reflects reverse causation, as individuals with poorer bone health may be more likely to increase their vitamin D supplementation.

C. R Programming Challenge (3 points) - What was most difficult? - How did you solve it? - What skill did you strengthen?

Answer:

Most difficult aspect: Because the data from previous work was set and imported for us ready to use I never took time to work on my self how to set the working directory so I had to learn how to do it this time which took much time although it was very easy. Also it was a bit difficult identifying the warnings when they pop up.

Solution approach: I had to revisit our first slide and study exactly how you started with us. I went to Youtube also and saw that I had to save it on my destop precisely in a folder. Went to my RStudio, click session (top menu), click set working directory-choose directory, select the folder where i saved my bmd.csv, click open, then R will automatically run the data for me.I also had to go back and check on the warning messages and understand them properly before recoding. Some were missen packages I had to restall, not sure why they left my RStudio because I knew I had already installed them already.

Skills strengthened: This is a good exercise because it made use of all what had been studied so far which has strengthened my coding skills. Also the need for patient and critically thinking through my codes and visualizing how the outputs look like