##Read CSV from workspace
bmd = read.csv('bmd.csv', header = T)
##Preview dataset
head(bmd)
## SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
## 1 93705 2 66 4 31.7 2 240 0 1.058 0
## 2 93708 2 66 5 23.7 3 120 0 0.801 1
## 3 93709 2 75 4 38.9 1 720 1 0.880 0
## 4 93711 1 56 5 21.3 3 840 1 0.851 1
## 5 93713 1 67 3 23.5 1 360 0 0.778 1
## 6 93714 2 54 4 39.9 2 NA NA 0.994 0
## calcium vitd DSQTVD DSQTCALC
## 1 503.5 1.85 20.557 211.67
## 2 473.5 5.85 25.000 820.00
## 3 NA NA NA NA
## 4 1248.5 3.85 25.000 35.00
## 5 660.5 2.35 NA 13.33
## 6 776.0 5.65 NA NA
bmd <- bmd %>%
mutate(
##Recode RIDRETH1 to V_ETH
V_ETH = factor(RIDRETH1,
levels = c(1,2,3,4,5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")
),
##Recode RIAGENDR to V_GEN
V_GEN = factor(RIAGENDR,
levels = c(1,2),
labels = c("Male",
"Female")
),
##Recode smoker to V_SMK
V_SMK = factor(smoker,
levels = c(1,2,3),
labels = c("Current","Past","Never"))
)
#QA - Print top 3 rows, output for new var
print (bmd[1:3, (ncol(bmd)-2):ncol(bmd)])
## V_ETH V_GEN V_SMK
## 1 Non-Hispanic Black Female Past
## 2 Other Female Never
## 3 Non-Hispanic Black Female Current
# Store the total size of the dataset
M_completesamplesize <- nrow(bmd)
#a_bmd is new df, not missing any values in V_ETH or BMXBMI
a_bmd <- bmd[(!is.na(bmd$V_ETH) & !is.na(bmd$DXXOFBMD)),]
#Store total size of nonmissing dataset
M_total <- nrow(a_bmd)
Research Question: Do mean BMD values differ across ethnicity groups?
Null Hypothesis (H₀): μ1 = μ2 = μ3 = μ4 = μ5
The null hypothesis states that the BMD mean of all 5 ethnicity groups (Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other) are equal
Alternative Hypothesis (H₁): i,j | μi≠μj
The alternative hypothesis states that at least one ethnicity group’s population mean (μ) differs from the others
Significance level: α = 0.05
# Calculate summary statistics by ethnic group
s_bmd <- a_bmd %>%
group_by(V_ETH) %>%
summarise(
n = n(),
Mean = mean(DXXOFBMD),
SD = sd(DXXOFBMD),
Median = median(DXXOFBMD),
Min = min(DXXOFBMD),
Max = max(DXXOFBMD)
)
##clean table
s_bmd %>%
rename("Ethnicity Group" = V_ETH) %>%
kable(digits = 2,
caption = "Descriptive Statistics: Total Femur Bone Mineral Density (g/cm^2^) by Ethnicity Group",
)
| Ethnicity Group | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Mexican American | 255 | 0.95 | 0.15 | 0.96 | 0.44 | 1.46 |
| Other Hispanic | 244 | 0.95 | 0.16 | 0.94 | 0.55 | 1.39 |
| Non-Hispanic White | 846 | 0.89 | 0.16 | 0.88 | 0.37 | 1.46 |
| Non-Hispanic Black | 532 | 0.97 | 0.16 | 0.98 | 0.44 | 1.55 |
| Other | 409 | 0.90 | 0.15 | 0.89 | 0.54 | 1.34 |
##plot
ggplot(a_bmd,
aes(x = V_ETH, y = DXXOFBMD, fill = V_ETH)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.1, size = 0.5) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Total Femur Bone Mineral Density by Ethnicity Group",
subtitle = "NHANES Data",
x = "Ethnicity Group",
y = "Total Femur Bone Mineral Density (g/cm^2^)",
fill = "V_ETH"
) +
theme_classic(base_size = 12) +
theme(legend.position = "none")
# Fit ANOVA model
anova_bmd <- aov(DXXOFBMD ~ V_ETH, data = a_bmd)
# Display the ANOVA table
s_anova <- summary(anova_bmd)[[1]]
rownames(s_anova)[1] <- "Treatment: Ethnicity"
rownames(s_anova)[2] <- "Error"
s_anova %>%
rownames_to_column(var= " ") %>%
kable(digits = 3)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Treatment: Ethnicity | 4 | 2.948 | 0.737 | 30.185 | 0 |
| Error | 2281 | 55.697 | 0.024 | NA | NA |
Interpretation: Based on an α = 0.05, at a p value of <0.001, we reject the null hypothesis - indicating that there is statistically significant difference in the mean total femur bone density across ethnicity groups. The F-Statistic is F(4, 2281) = 30.19.
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(anova_bmd)
Residuals vs. Fitted: Linearity 👍
Five distinct columns of points, one for each ethnicity group. The red line is flat on 0.0, which indicates that the model is unbiased and centered around group means.
Normal Q-Q: Normality 👍
The points follow the diagonal very closely, minimal skew away from the line indicating that the distribution is not stretched.
Scale-Location: Homoscedasticity ⚠️
This is only relatively stable, with some variation in height among the groups. It may be worth further investigating with Levene’s Test to get a specific number to interpet.
Residuals vs Leverage: Absence of Influential Outliers 👍
A handful of items are far enough to be labeled, but none pass Cook’s Distance lines, indicating that the outliers are not influential enough to warp the mean.
# Levene's test for homogeneity of variance
levene <- leveneTest(DXXOFBMD ~ V_ETH, data = a_bmd)
print(levene)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Levene’s Test for Homogeneity of Variance
P value of 0.179 indicates that we fail to reject the null hypothesis, and we can proceed with the ANOVA without any violations of assumptions.
Tukey HSD
# Conduct Tukey HSD test
tukey_results <- TukeyHSD(anova_bmd)
| Comparison | Mean Diff | 95% CI | p-value | Significant? |
|---|---|---|---|---|
| Other Hispanic - Mexican American | -0.004 | [-0.04, 0.03] | 0.998 | No |
| Non-Hispanic White - Mexican American | -0.062 | [-0.09, -0.03] | 2.57e-07 | Yes |
| Non-Hispanic Black - Mexican American | 0.022 | [-0.01, 0.05] | 0.328 | No |
| Other - Mexican American | -0.053 | [-0.09, -0.02] | 0.000192 | Yes |
| Non-Hispanic White - Other Hispanic | -0.058 | [-0.09, -0.03] | 3.61e-06 | Yes |
| Non-Hispanic Black - Other Hispanic | 0.027 | [-0.01, 0.06] | 0.172 | No |
| Other - Other Hispanic | -0.049 | [-0.08, -0.01] | 0.00107 | Yes |
| Non-Hispanic Black - Non-Hispanic White | 0.085 | [0.06, 0.11] | 3.94e-11 | Yes |
| Other - Non-Hispanic White | 0.009 | [-0.02, 0.03] | 0.872 | No |
| Other - Non-Hispanic Black | -0.076 | [-0.1, -0.05] | 4.18e-11 | Yes |
Interpretation
Effect Size
ss_treatment <- s_anova$`Sum Sq`[1]
ss_total <- sum(s_anova$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
cat("Effect Size: ",
ifelse(eta_squared < 0.06,
"Small", ifelse(eta_squared < 0.14,
"Medium",
"Large")
)
)
## Effect Size: Small
Research Question: Are there associations between BMI, Calcium Vitamin D, MET?
Calculate:
a_bmd <- a_bmd %>%
mutate(
##Missing value handling
DSQTCALC_0 = replace_na(DSQTCALC,0),
## Downstream, I run into -inf off the logs, and then NaN in table A, so-
#I see int he student's template (that I hadn't seen when I wrote this lol)
#does not transform the dietary variables
##CALCIUM_0 = replace_na(calcium,0),
DSQTVD_0 = replace_na(DSQTVD, 0),
##VITD_0 = replace_na(vitd, 0),
##Totals off MV cols
calcium_total = calcium + DSQTCALC_0,
vitd_total = vitd + DSQTVD_0
)
Create histograms for continuous variables (BMI, calcium, vitamin D, MET)
Suggested transformations:
##start with plots
p1 <- ggplot(a_bmd, aes(x=a_bmd$BMXBMI, fill="coral")) +
geom_histogram() +
labs(
title = "BMI",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$BMXBMI, na.rm=TRUE),3)),
x = "BMI",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p2 <- ggplot(a_bmd, aes(x=a_bmd$calcium_total, fill = "coral")) +
geom_histogram() +
labs(
title = "Calcium",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$calcium_total, na.rm=TRUE),3)),
x = "Total Calcium",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p3 <- ggplot(a_bmd, aes(x=a_bmd$totmet, fill = "coral")) +
geom_histogram() +
labs(
title = "MET",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$totmet, na.rm=TRUE),3)),
x = "Total MET",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p4 <- ggplot(a_bmd, aes(x=a_bmd$vitd_total, fill = "coral")) +
geom_histogram() +
labs(
title = "Vitamin D",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$vitd_total, na.rm=TRUE),3)),
x = "Total Vitamin D",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p1+p2+p3+p4
##Based on these tables, likely do not need to transform BMI, this variable has a slight right skew, but is very close to normal. Calcium has a slight larger right skew, but is still fairly close to normal - this is worth testing to see if a transformation may help, but if the adjustment is not substantial, then we will proceed without it.
## MET and Vitamin D do appear to require transformation, and with the large number of 0s, a square root transform may help.
a_bmd <- a_bmd %>%
mutate(
##we'll do em all and then decide
log_vitd = log(vitd_total),
log_calc = log(calcium_total),
log_bmi = log(BMXBMI),
log_met = log(totmet),
sr_vitd = sqrt(vitd_total),
sr_met = sqrt(totmet)
)
p1 <- ggplot(a_bmd, aes(x=a_bmd$log_bmi, fill="coral")) +
geom_histogram() +
labs(
title = "BMI - Log Transformed",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$log_bmi, na.rm=TRUE),3)),
x = "BMI",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p2 <- ggplot(a_bmd, aes(x=a_bmd$log_calc, fill = "coral")) +
geom_histogram() +
labs(
title = "Calcium - Log Transformed",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$log_calc, na.rm=TRUE),3)),
x = "Total Calcium",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p3 <- ggplot(a_bmd, aes(x=a_bmd$log_met, fill = "coral")) +
geom_histogram() +
labs(
title = "MET - Log Transformed",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$log_met, na.rm=TRUE),3)),
x = "Total MET",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p4 <- ggplot(a_bmd, aes(x=a_bmd$log_vitd, fill = "coral")) +
geom_histogram() +
labs(
title = "Vitamin D - Log Transformed",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$log_vitd, na.rm=TRUE),3)),
x = "Total Vitamin D",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
##BMI, Calcium somewhat improved but not so substantially - may be worth seeing how the correlations look if I test both.
## MET and VITD somewhat improved but still remain quite far from normal.
p1+p2+p3+p4
p1 <- ggplot(a_bmd, aes(x=a_bmd$sr_met, fill = "coral")) +
geom_histogram() +
labs(
title = "MET - SqRt Transformed",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$sr_met, na.rm=TRUE),3)),
x = "Total MET",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
p2<- ggplot(a_bmd, aes(x=a_bmd$sr_vitd, fill = "coral")) +
geom_histogram() +
labs(
title = "Vitamin D - SqRt Transformed",
subtitle = paste0("Skewness: ", round(skewness(a_bmd$sr_vitd, na.rm=TRUE),3)),
x = "Total Vitamin D",
y = "Frequency",
) +
theme_classic() +
theme(legend.position = "none")
## These ones look workable
p1 + p2
While BMI and Calcium were not too far from normal, I will proceed with a LOG transformation with both for the Pearson Correlation, because this did improve their normality.
Both VitD and MET will be transformed using the SQRT functions to use for the Pearson Correlation.
Table A: Predictors vs. Outcome (BMD)
Create Corr_Helper function to use to generate the three tables:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}
Table A Generation
##- Age (RIDAGEYR) vs. BMD
##- BMI (log-transformed) vs. BMD
##- Total calcium (log-transformed) vs. BMD
##- Total vitamin D (sqrt-transformed) vs. BMD
x_list <- c("Age" = "RIDAGEYR",
"BMI (log-transformed)" = "log_bmi",
"Total Calcium (log-transformed)" = "log_calc",
"Total Vitamin D (sqrt-transformed)" = "sr_vitd"
)
##loop through that list
table_a <- lapply(x_list, function(col) {
corr_pair(a_bmd, x = col, y = "DXXOFBMD")
}) %>%
##convert lists to clean table
bind_rows(.id="Predictor") %>%
mutate(
outcome = "BMD"
) %>%
select(Predictor, outcome, everything())
table_a %>%
select(Predictor, outcome, n, estimate, p_value) %>%
kable(digits = 3,
caption = "**Table A:** Predictors vs. Outcome",
col.names = c("Predictor", "Outcome", "n", "Estimate", "P-value")
)
| Predictor | Outcome | n | Estimate | P-value |
|---|---|---|---|---|
| Age | BMD | 2286 | -0.232 | 0.000 |
| BMI (log-transformed) | BMD | 2275 | 0.425 | 0.000 |
| Total Calcium (log-transformed) | BMD | 2129 | 0.012 | 0.592 |
| Total Vitamin D (sqrt-transformed) | BMD | 2129 | -0.062 | 0.004 |
Interpretation: The initial Pearson Correlations indicate that there is a statistically significant negative association between BMD and Age (estimate = -0.232; p = <2e-16) and BMD and total Vitamin D (estimate = -0.0619;p = 0.00426), and a statistically significant positive association between BMD and BMI (log-transformed) (estimate = 0.425;p = <2e-16). The association between BMD and total calcium intake is not statistically significant.
Table B: Predictors vs. Exposure (Physical Activity)
Table B Generation
## X-List can actually be the same for this - the outcome is different (MET), but predictors are the same.
x_list <- c("Age" = "RIDAGEYR",
"BMI (log-transformed)" = "log_bmi",
"Total Calcium (log-transformed)" = "log_calc",
"Total Vitamin D (sqrt-transformed)" = "sr_vitd"
)
##loop through that list
table_b <- lapply(x_list, function(col) {
corr_pair(a_bmd, x = col, y = "sr_met")
}) %>%
##convert lists to clean table
bind_rows(.id="Predictor") %>%
mutate(
outcome = "MET"
) %>%
select(Predictor, outcome, everything())
table_b %>%
select(Predictor, outcome, n, estimate, p_value) %>%
kable(digits = 3,
caption = "**Table B:** Predictors vs. Outcome",
col.names = c("Predictor", "Outcome", "n", "Estimate", "P-value")
)
| Predictor | Outcome | n | Estimate | P-value |
|---|---|---|---|---|
| Age | MET | 1588 | -0.105 | 0.000 |
| BMI (log-transformed) | MET | 1582 | 0.009 | 0.732 |
| Total Calcium (log-transformed) | MET | 1500 | 0.068 | 0.009 |
| Total Vitamin D (sqrt-transformed) | MET | 1500 | -0.003 | 0.893 |
The initial Pearson Correlations indicate that there is a statistically significant negative association between MET and Age (estimate = -0.105; p = 2.67e-05) and MET and total Vitamin D (estimate = -0.00349;p = 0.893), and a statistically significant positive association between MET and Calcium (log-transformed) (estimate = 0.0677;p = 0.00876). The association between MET and BMI is not statistically significant.
Table C: Predictors vs. Each Other All pairwise correlations among:
Table C Generation
##List of Predictors remains the same
x_list <- c("Age" = "RIDAGEYR",
"BMI (log-transformed)" = "log_bmi",
"Total Calcium (log-transformed)" = "log_calc",
"Total Vitamin D (sqrt-transformed)" = "sr_vitd"
)
##I made this worse by just keeping the same named vector instead of making two lists
###all_pairs <- combn(x_list, 2, simplify=FALSE)
##Create a DF from the names
all_pairs <- as.data.frame(
##Transpose
t(
##just doing names here
combn(names(x_list), 2)),
stringsAsFactors = FALSE
)
# Add the actual variable codes
all_pairs <- all_pairs %>%
rename(var1_name = V1, var2_name = V2) %>%
mutate(
var1_value = x_list[var1_name],
var2_value = x_list[var2_name]
)
##I think for this, it makes the most sense to just, do a regular loop 1-row, count, access rows?
table_c <- lapply(1:nrow(all_pairs), function(i) {
##yeah, this is whatever row we're on
test <- all_pairs[i, ]
##dynamic function call
corr_pair(a_bmd,
x = test$var1_value[[1]],
y = test$var2_value[[1]]) %>%
mutate(Predictor = test$var1_name[[1]], outcome = test$var2_name[[1]])}) %>%
bind_rows()
table_c %>%
select(Predictor, outcome, n, estimate, p_value) %>%
kable(digits = 3,
caption = "**Table C:** Predictors vs. Each Other",
col.names = c("Predictor", "Outcome", "n", "Estimate", "P-value")
)
| Predictor | Outcome | n | Estimate | P-value |
|---|---|---|---|---|
| Age | BMI (log-transformed) | 2275 | -0.093 | 0.000 |
| Age | Total Calcium (log-transformed) | 2129 | 0.067 | 0.002 |
| Age | Total Vitamin D (sqrt-transformed) | 2129 | 0.152 | 0.000 |
| BMI (log-transformed) | Total Calcium (log-transformed) | 2122 | 0.009 | 0.685 |
| BMI (log-transformed) | Total Vitamin D (sqrt-transformed) | 2122 | 0.026 | 0.226 |
| Total Calcium (log-transformed) | Total Vitamin D (sqrt-transformed) | 2129 | 0.291 | 0.000 |
Age has a statistically significant negative association with BMI (estimate = -0.0928; p = 9.28e-06), and statistically significant positive associations with Calcium (0.0666; p = 0.00211) and Vitamin D (0.152; p = 2.02e-12).
BMI has a statistically significant negative association with Age (estimate = -0.0928; p = 9.28e-06), and no other predictors.
Calcium has a statistically significant positive association with Age (0.0666; p = 0.00211) and Vitamin D (0.291; p = <2e-16).
Vitamin D has statistically significant positive associations with Age(0.152; p = 2.02e-12) and Calcium (0.291; p = <2e-16).
No other pairs of predictors are statistically significant.
All of these predictors have very low correlations with each other, indicating that there are no major concerns for multicollinearity.
Both ANOVA and Correlation are informative. ANOVA is relevant when you’re looking to compare the differences in a continuous variable across a categorical variable, whereas Correlations are useful for identifying associations in two continuous variables. With ANOVA, we might be able to dig into which other categorical variables affect the mean BMD values – sex or education level may be a factor. With correlation, we can identify the strength and direction of any association between continuous variables, like diastolic blood pressure and BMD.
In this dataset, the normality needed for a linear correlation was the toughest assumption we needed to reach. We were able to transform the data to be relatively close to normal but if we weren’t able to, we would likely need to do a different type of correlation – perhaps the Spearman Correlation. ANOVA in general is quite robust at large enough sample sizes, because of the Central Limit Theorem.
Additionally, because NHANES data is cross-sectional, it cannot be used to establish temporality or causation. We can only see if there’s an association between variables. With BMD, which was the focus of this analysis, in particular, this measure is something that is linked to very long, years-long, factors in someone’s life, whereas something like dietary calcium is recently measured and may not reflect historical calcium intake. In particular – it might be confounded by indication, given that high dietary calcium now could be linked to specifically taking calcium supplements as a result of bone issues.
I do wish that I had found the template to work from ahead of time, but working without the template was very helpful for my coding skills in general. The toughest part was figuring out how to structure the loop to execute Table C. I actually decided to give up on using the loop, and rbind the function call output 6 times, and then when I ran into the same error (needing a numeric for the correlation function), I realized that I was calling the column and I needed to access the contents of the row with double brackets. So I was able to go back in and correct it by adding in the [[1]] into the loop.