This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
The total observations before removing missing data is 2898, after removing the missing data there is as an observation size of 2286.
Research question: Do mean BMD values differ across ethnicity groups?
Statistical notation: - H₀: (H_0: mu_Mexican American = mu_Other Hispanic = mu_Non-Hispanic White = mu_Non-Hispanic Black = mu_Other) - Hₐ: (H_A: One BMD differs from another) Plain language: - H₀: Mean bone mineral density is the same across all ethnicity groups - Hₐ: Mean bone mineral density differs for at least one or more ethnicity groups, suggesting ethnicity may be associated with differences in bone health.
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnicity | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
The box plot shows that Non-Hispanic Black individuals have the highest median BMD overall and a relatively wide spread, indicating higher bone density on average compared with other groups. Mexican American and Other Hispanic groups have similar median BMD’s. Non-Hispanic White individuals show the lowest median BMD among the groups. The Other group also has a lower BMD.
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
F-statistic= 30.185, p-value = 0 The results indicate that there is a statistically significant difference in mean bone mineral density across ethnicity groups. There is at least one ethnicity group that has a different average BMD level compared to the others. We reject the null hypothesis because the p-value is less than 0.05.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
Interpretation: The Q-Q plot assesses normality by comparing quantiles of the residuals to quantiles of a standard normal distribution. Points fall exactly on the diagonal line. The Q-Q plot follows normality. The p-value of Levenes test was 0.1788 which is greater than 0.05, so a non-significant result indicates homogeneous variances across groups, supporting the equal variance assumption underlying ANOVA.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significance
# Extract and format results
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]
# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
tukey_summary$Significant == "Yes",
ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
"No difference"
)
kable(tukey_summary, digits = 4,
caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")| Comparison | diff | lwr | upr | p adj | Significant | Direction | |
|---|---|---|---|---|---|---|---|
| Other Hispanic-Mexican American | Other Hispanic-Mexican American | -0.0044 | -0.0426 | 0.0338 | 0.9978 | No | No difference |
| Non-Hispanic White-Mexican American | Non-Hispanic White-Mexican American | -0.0624 | -0.0929 | -0.0319 | 0.0000 | Yes | Second group higher |
| Non-Hispanic Black-Mexican American | Non-Hispanic Black-Mexican American | 0.0224 | -0.0101 | 0.0549 | 0.3277 | No | No difference |
| Other-Mexican American | Other-Mexican American | -0.0533 | -0.0874 | -0.0193 | 0.0002 | Yes | Second group higher |
| Non-Hispanic White-Other Hispanic | Non-Hispanic White-Other Hispanic | -0.0579 | -0.0889 | -0.0269 | 0.0000 | Yes | Second group higher |
| Non-Hispanic Black-Other Hispanic | Non-Hispanic Black-Other Hispanic | 0.0268 | -0.0062 | 0.0598 | 0.1722 | No | No difference |
| Other-Other Hispanic | Other-Other Hispanic | -0.0489 | -0.0834 | -0.0144 | 0.0011 | Yes | Second group higher |
| Non-Hispanic Black-Non-Hispanic White | Non-Hispanic Black-Non-Hispanic White | 0.0848 | 0.0612 | 0.1084 | 0.0000 | Yes | First group higher |
| Other-Non-Hispanic White | Other-Non-Hispanic White | 0.0091 | -0.0166 | 0.0347 | 0.8719 | No | No difference |
| Other-Non-Hispanic Black | Other-Non-Hispanic Black | -0.0757 | -0.1038 | -0.0477 | 0.0000 | Yes | Second group higher |
## POST-HOC TEST INTERPRETATION
for (i in 1:nrow(tukey_summary)) {
row <- tukey_summary[i, ]
cat("Comparison:", row$Comparison, "\n")
cat(" Estimated mean difference:", round(row$diff, 4), "g/cm²\n")
cat(" 95% Confidence Interval: [",
round(row$lwr, 4), ", ", round(row$upr, 4), "]\n")
cat(" Adjusted p-value:", format(row$`p adj`, digits = 4), "\n")
cat(" Statistical Significance:", row$Significant, "\n")
cat(" Direction:", row$Direction, "\n\n")
}## Comparison: Other Hispanic-Mexican American
## Estimated mean difference: -0.0044 g/cm²
## 95% Confidence Interval: [ -0.0426 , 0.0338 ]
## Adjusted p-value: 0.9978
## Statistical Significance: No
## Direction: No difference
##
## Comparison: Non-Hispanic White-Mexican American
## Estimated mean difference: -0.0624 g/cm²
## 95% Confidence Interval: [ -0.0929 , -0.0319 ]
## Adjusted p-value: 2.568e-07
## Statistical Significance: Yes
## Direction: Second group higher
##
## Comparison: Non-Hispanic Black-Mexican American
## Estimated mean difference: 0.0224 g/cm²
## 95% Confidence Interval: [ -0.0101 , 0.0549 ]
## Adjusted p-value: 0.3277
## Statistical Significance: No
## Direction: No difference
##
## Comparison: Other-Mexican American
## Estimated mean difference: -0.0533 g/cm²
## 95% Confidence Interval: [ -0.0874 , -0.0193 ]
## Adjusted p-value: 0.0001919
## Statistical Significance: Yes
## Direction: Second group higher
##
## Comparison: Non-Hispanic White-Other Hispanic
## Estimated mean difference: -0.0579 g/cm²
## 95% Confidence Interval: [ -0.0889 , -0.0269 ]
## Adjusted p-value: 3.607e-06
## Statistical Significance: Yes
## Direction: Second group higher
##
## Comparison: Non-Hispanic Black-Other Hispanic
## Estimated mean difference: 0.0268 g/cm²
## 95% Confidence Interval: [ -0.0062 , 0.0598 ]
## Adjusted p-value: 0.1722
## Statistical Significance: No
## Direction: No difference
##
## Comparison: Other-Other Hispanic
## Estimated mean difference: -0.0489 g/cm²
## 95% Confidence Interval: [ -0.0834 , -0.0144 ]
## Adjusted p-value: 0.001072
## Statistical Significance: Yes
## Direction: Second group higher
##
## Comparison: Non-Hispanic Black-Non-Hispanic White
## Estimated mean difference: 0.0848 g/cm²
## 95% Confidence Interval: [ 0.0612 , 0.1084 ]
## Adjusted p-value: 3.936e-11
## Statistical Significance: Yes
## Direction: First group higher
##
## Comparison: Other-Non-Hispanic White
## Estimated mean difference: 0.0091 g/cm²
## 95% Confidence Interval: [ -0.0166 , 0.0347 ]
## Adjusted p-value: 0.8719
## Statistical Significance: No
## Direction: No difference
##
## Comparison: Other-Non-Hispanic Black
## Estimated mean difference: -0.0757 g/cm²
## 95% Confidence Interval: [ -0.1038 , -0.0477 ]
## Adjusted p-value: 4.176e-11
## Statistical Significance: Yes
## Direction: Second group higher
Write a short paragraph: Interpretation: The groups that significantly differ were Non-Hispanic white-Mexican American,Other-Mexican American, Non-Hispanic White-Other Hispanic, Other-Other Hispanic,Non-Hispanic Black-Non-Hispanic White, and Other-Non-Hispanic Black.
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
ss_total <- sum(anova_tbl$sumsq)
anova_eta <- anova_tbl %>%
mutate(eta_sq = sumsq / ss_total)
anova_eta## # A tibble: 2 × 7
## term df sumsq meansq statistic p.value eta_sq
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24 0.0503
## 2 Residuals 2281 55.7 0.0244 NA NA 0.950
Interpret in 1–2 sentences: The calculated effect size is η² = 0.0503. This means ethnicity explains approximately 5.03% of the variance in BMD.To interpret this effect size, we use conventional guidelines: η² = 0.01 (small), η² = 0.06 (medium), and η² = 0.14 (large). In our case, the effect is small.
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
library(ggplot2)
library(patchwork)
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(bins = 30, na.rm = TRUE)
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(bins = 30, na.rm = TRUE)
p3 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(bins = 30, na.rm = TRUE)
p4 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(bins = 30, na.rm = TRUE)
#Arrange plots
(p1 | p2) / (p3 | p4)Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total),
sqrt_totmet = sqrt(totmet),
sqrt_vitd_total = sqrt(vitd_total)
)Document your reasoning: I applied transformations to ALL variables because all variables are right skewed. By applying transformations the variables skewness will decrease.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
bmd2 <- bmd2 %>%
mutate(
log_BMXBMI = log(BMXBMI),
log_calcium_total = log(calcium_total),
log_vitd_total = log(vitd_total + 1)
)
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableA %>% kable(caption = "Table A: Correlations between predictors and BMD")| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | 0.00000 |
| log_bmi | DXXOFBMD | 2275 | 0.425 | 0.00000 |
| log_calcium_total | DXXOFBMD | 2129 | 0.012 | 0.59200 |
| sqrt_vitd_total | DXXOFBMD | 2129 | -0.062 | 0.00426 |
Interpret the results:
Interpretation:
Age vs. BMD: The relationship between age and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.0000. Age is negatively correlated with BMD. As people get older, their BMD tends to decrease. The strength is weak with an estimate of -0.232.
BMI vs. BMD: The relationship between BMI and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.0000. BMI is positively correlated with BMD. As BMI increase, BMD increases. The strength is moderate with an estimate of 0.425.
Total calcium vs. BMD: The relationship between total calcium and BMD is not statistically significant. The p-value was greater than 0.05 with a value of 0.59200. There is no meaningful correlation between total calcium and BMD.
Total Vitamin D vs. BMD: The relationship between total vitamin D and BMD is statistically significant. The p-value was less than 0.05 with a value of 0.00426. Vitamin D is negative correlated with BMD. The strength is weak with an estimate of -0.062.
Examine correlations between potential confounders and physical activity:
bmd2 <- bmd2 %>%
mutate(
log_BMXBMI = log(BMXBMI),
log_calcium_total = log(calcium_total),
log_vitd_total = log(vitd_total + 1)
)
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableA %>% kable(caption = "Table B: Correlations between potential confounders and MET")| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1934 | -0.097 | 1.96e-05 |
| log_bmi | sqrt_totmet | 1912 | 0.001 | 9.51e-01 |
| log_calcium_total | sqrt_totmet | 1777 | 0.086 | 2.82e-04 |
| sqrt_vitd_total | sqrt_totmet | 1777 | -0.002 | 9.32e-01 |
Interpret the results:
Age vs. MET: The relationship between age and MET is statistically significant. The p-value was less than 0.05 with a value of 1.96e-05.
BMI vs. MET: The relationship between total BMI and MET is not statistically significant. The p-value was greater than 0.05 with a value of 0.951.
Total Calcium vs. MET: The relationship between total calcium and MET is statistically significant. The p-value was less than 0.05 with a value of 2.82e-04.
Vitamin D vs. MET: The relationship between total vitamin D and MET is not statistically significant. The p-value was greater than 0.05 with a value of 0.932.
Examine correlations among all predictor variables (all pairwise combinations):
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()
# Example structure:
preds <- c("RIDAGEYR", "log_BMXBMI", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = signif(p_value, 3)
)
tableC %>% kable(caption = "Table C: Pairwise correlations among predictors")| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | log_BMXBMI | 2834 | -0.098 | 2.00e-07 |
| RIDAGEYR | log_calcium_total | 2605 | 0.048 | 1.36e-02 |
| RIDAGEYR | sqrt_vitd_total | 2605 | 0.153 | 0.00e+00 |
| log_BMXBMI | log_calcium_total | 2569 | 0.000 | 9.81e-01 |
| log_BMXBMI | sqrt_vitd_total | 2569 | 0.007 | 7.31e-01 |
| log_calcium_total | sqrt_vitd_total | 2605 | 0.314 | 0.00e+00 |
Interpret the results:
There are no strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses. Age vs. BMI, calcium, and vitamin D, all showed weak correlations whether they were positive or negative associations. BMI vs. calcium and vitamin, showed no correlation. Calcium and Vitamin D showed a moderate positive correlation with r= 0.314. As calcium intake increases, vitamin D levels also increase.
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
In epidemiological research, you want to use ANOVA when you are comparing 3 or more group means, has one categorical predictor, has continuous outcomes, and independent observations within each group. ANOVA tells you whether there is statistical significance among group means, but it does not measure the magnitude or direction of pairwise relationship. You use correlation when you want to measure the strength or direction of linear relationship, you want to describe associations, you have continuous variables, and you are exploring data before regression. Correlation provides a standardized estimate of linear association between variables but does compare group differences. A research question that is suited for ANOVA is ” Do average BMD levels differ across ethnic groups?” while a research question suited for correlation is ” Is higher BMI associated with higher BMD?”
A major learning outcome was challenging was assessing normality, linearity, and homoscedasticity because the variables ended up being skewed which required transformations to be added. This showed me how important it is to visually inspect the data and understand it before applying the transformations. Additionally, our analyses was cross-sectional, so correlation cannot establish causality. Observed relationships may have been influenced by unmeasured confounding factors. For example, we see the association between higher BMI and Higher BMD, but we cannot conclude BMI causes increase bone density without conducting a longitudinal study or using longitudinal data.
The most challenging part of this assignment was creating and combining multiple histograms. When I first ran the code, i got errors saying that something was wrong with my ggplot code and the syntax (p1|p2)/(p3|p4) was not arranging my plots to sit on top of each other. I solved this problem by using google and searching up how to “arrange plots with histograms” and I found an analytics website that had the steps. https://www.analytics-tuts.com/arrange-multiple-plots-using-patchwork-in-r/ In this link, the author had a library called “patchwork” and a library called “ggplot2”, I ended up adding these because besides these packages, the rest of my code looked similar to what the author had, and it ended up working and allowing my histograms to sit on top and side by side of each other. This first assignment has allowed me to strengthen by skills in ANOVA and correlation. We have done labs previously but this assignment allowed me to apply both concepts in the same assignment. This assignment also helped me with steps that will be important for our final paper. For example, removing missing data, import a csv file, create histograms, and compare statistical results.