This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Import data
bmd <- readr::read_csv("D:/fizza documents/Epi 553/Assignments/HW01/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
## Total sample size: 2898
## ANOVA analytic sample size: 2286
## Observations removed: 612
## Percent missing: 21.1 %
Write 2–4 sentences summarizing your analytic sample here.
The total sample size in the original dataset is 2,898 participants. After removing observations with missing values for either BMD (DXXOFBMD) or ethnicity (RIDRETH1), the analytic sample for the ANOVA analysis is 2,286 participants. This represents a loss of 612 observations (21.1% of the original sample), primarily due to missing DEXA scan data since not all NHANES participants completed the DEXA component.
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3,
col.names = c("Ethnicity", "N", "Mean BMD (g/cm²)", "SD"),
caption = "Table 1: BMD Summary Statistics by Ethnicity") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Ethnicity | N | Mean BMD (g/cm²) | SD |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD, fill = ethnicity)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.2, size = 0.5) +
labs(
title = "Distribution of Bone Mineral Density by Ethnicity",
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)",
fill = "Ethnicity"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "none")Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
The boxplot shows that Non-Hispanic Black individuals appear to have the highest median BMD, followed by Mexican American and Other Hispanic groups. Non-Hispanic White individuals and the “Other” ethnicity category appear to have lower median BMD values. There is considerable variability within each group, as shown by the spread of points and the height of the boxes. Some groups, particularly Non-Hispanic Black and Non-Hispanic White, have more extreme outliers.
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
kable(digits = 4,
col.names = c("Term", "df", "Sum Sq", "Mean Sq", "F-statistic", "p-value"),
caption = "Table 2: One-way ANOVA Results") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Term | df | Sum Sq | Mean Sq | F-statistic | p-value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
The ANOVA F-test shows a statistically significant result (F = 30.19, df = 4, p < 0.0001). Therefore, we reject the null hypothesis and conclude that there is a significant difference in mean bone mineral density across at least two of the five ethnicity groups. The very small p-value (less than 0.0001) provides strong evidence against the null hypothesis.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
The Residuals vs Fitted plot shows no clear pattern, suggesting homogeneity of variances, which is confirmed by the non-significant Levene’s test (F = 1.57, df = 4, p = 0.1788). The Q-Q plot indicates some deviation from normality in the tails, but ANOVA is robust to this violation given our large sample size (n = 2,286). Overall, the assumptions are reasonably satisfied for this analysis.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
# Create tidy table from Tukey results
tukey_tidy <- as.data.frame(tuk$ethnicity) %>%
rownames_to_column("comparison") %>%
rename(
diff = `diff`,
lwr = `lwr`,
upr = `upr`,
p_adj = `p adj`
) %>%
mutate(
significant = ifelse(p_adj < 0.05, "Yes", "No"),
diff = round(diff, 3),
lwr = round(lwr, 3),
upr = round(upr, 3),
p_adj = signif(p_adj, 3)
)
tukey_tidy %>%
kable(caption = "Table 3: Tukey HSD Post-hoc Comparisons") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| comparison | diff | lwr | upr | p_adj | significant |
|---|---|---|---|---|---|
| Other Hispanic-Mexican American | -0.004 | -0.043 | 0.034 | 9.98e-01 | No |
| Non-Hispanic White-Mexican American | -0.062 | -0.093 | -0.032 | 3.00e-07 | Yes |
| Non-Hispanic Black-Mexican American | 0.022 | -0.010 | 0.055 | 3.28e-01 | No |
| Other-Mexican American | -0.053 | -0.087 | -0.019 | 1.92e-04 | Yes |
| Non-Hispanic White-Other Hispanic | -0.058 | -0.089 | -0.027 | 3.60e-06 | Yes |
| Non-Hispanic Black-Other Hispanic | 0.027 | -0.006 | 0.060 | 1.72e-01 | No |
| Other-Other Hispanic | -0.049 | -0.083 | -0.014 | 1.07e-03 | Yes |
| Non-Hispanic Black-Non-Hispanic White | 0.085 | 0.061 | 0.108 | 0.00e+00 | Yes |
| Other-Non-Hispanic White | 0.009 | -0.017 | 0.035 | 8.72e-01 | No |
| Other-Non-Hispanic Black | -0.076 | -0.104 | -0.048 | 0.00e+00 | Yes |
Write a short paragraph: “The groups that differed were …”
The Tukey post-hoc tests revealed that several ethnicity pairs had significantly different mean BMD. Non-Hispanic Black individuals had significantly higher BMD compared to Non-Hispanic White (diff = 0.085, p < 0.001) and Other groups (diff = -0.076, p < 0.001). Non-Hispanic White individuals had significantly lower BMD than Mexican American (diff = -0.062, p < 0.001) and Other Hispanic groups (diff = -0.058, p < 0.001). However, Non-Hispanic Black did not differ significantly from Mexican American or Other Hispanic groups, and Non-Hispanic White did not differ from the Other category.
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
# Calculate eta-squared
ss_between <- anova_tbl$sumsq[1] # Sum of squares for ethnicity
ss_total <- sum(anova_tbl$sumsq) # Total sum of squares
eta_squared <- ss_between / ss_total
# Display result
cat("Eta-squared (η²) =", round(eta_squared, 4), "\n")## Eta-squared (η²) = 0.0503
## This means that ethnicity explains 5 % of the variance in BMD.
Interpret in 1–2 sentences:
The calculated eta-squared is 0.0503, meaning ethnicity explains 5% of the variance in BMD. This represents a small-to-medium effect size according to Cohen’s guidelines.
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
# Load patchwork for combining plots
library(patchwork)
# Create histograms
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "BMI Distribution", x = "BMI (kg/m²)", y = "Count") +
theme_minimal()
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Total Calcium Intake", x = "Calcium (mg/day)", y = "Count") +
theme_minimal()
p3 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Total Vitamin D Intake", x = "Vitamin D (mcg/day)", y = "Count") +
theme_minimal()
p4 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Physical Activity (MET-min/week)", x = "MET-min/week", y = "Count") +
theme_minimal()
# Arrange plots
(p1 | p2) / (p3 | p4)bmd2 <- bmd2 %>%
mutate(
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total + 1), # +1 to handle zeros
sqrt_vitd_total = sqrt(vitd_total),
sqrt_totmet = sqrt(totmet)
)Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total + 1), # +1 to handle zeros
sqrt_vitd_total = sqrt(vitd_total),
sqrt_totmet = sqrt(totmet)
)Document your reasoning:
I applied transformations to address positive skewness observed in the histograms. BMI shows a roughly normal distribution with slight right skew, so I used a log transformation to improve symmetry. Total calcium intake is highly right-skewed with many values clustered near zero, so I applied log(calcium+1) to handle zeros and reduce skewness. Vitamin D intake is extremely right-skewed with most values concentrated near zero, so I used a square root transformation. Physical activity (MET-min/week) is also highly right-skewed with a long tail, so I applied a square root transformation to better approximate normality for correlation analysis.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
# YOUR CODE HERE
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
mutate(
predictor = case_when(
x == "RIDAGEYR" ~ "Age",
x == "log_bmi" ~ "BMI (log-transformed)",
x == "log_calcium_total" ~ "Total Calcium (log-transformed)",
x == "sqrt_vitd_total" ~ "Total Vitamin D (sqrt-transformed)"
),
estimate = round(estimate, 3),
p_value = signif(p_value, 3),
significant = ifelse(p_value < 0.05, "Yes", "No")
) %>%
select(predictor, n, estimate, p_value, significant)
tableA %>%
kable(caption = "Table A: Correlations with BMD (Outcome)",
col.names = c("Predictor", "N", "r", "p-value", "p < 0.05?")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Predictor | N | r | p-value | p < 0.05? |
|---|---|---|---|---|
| Age | 2286 | -0.232 | 0.00000 | Yes |
| BMI (log-transformed) | 2275 | 0.425 | 0.00000 | Yes |
| Total Calcium (log-transformed) | 2129 | 0.012 | 0.59200 | No |
| Total Vitamin D (sqrt-transformed) | 2129 | -0.062 | 0.00426 | Yes |
Interpret the results:
Age shows a weak negative correlation with BMD (r = -0.232, p < 0.001), indicating that BMD tends to decrease slightly with age. BMI shows a moderate positive correlation with BMD (r = 0.425, p < 0.001), meaning higher BMI is associated with higher BMD. Total calcium intake shows no significant correlation with BMD (r = 0.012, p = 0.592). Total vitamin D intake shows a very weak negative but statistically significant correlation with BMD (r = -0.062, p = 0.004), though the practical significance is minimal given the small effect size. The sample sizes vary due to missing data in different variables.
Examine correlations between potential confounders and physical activity:
# YOUR CODE HERE
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
mutate(
predictor = case_when(
x == "RIDAGEYR" ~ "Age",
x == "log_bmi" ~ "BMI (log-transformed)",
x == "log_calcium_total" ~ "Total Calcium (log-transformed)",
x == "sqrt_vitd_total" ~ "Total Vitamin D (sqrt-transformed)"
),
estimate = round(estimate, 3),
p_value = signif(p_value, 3),
significant = ifelse(p_value < 0.05, "Yes", "No")
) %>%
select(predictor, n, estimate, p_value, significant)
tableB %>%
kable(caption = "Table B: Correlations with Physical Activity (sqrt-transformed)",
col.names = c("Predictor", "N", "r", "p-value", "p < 0.05?")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Predictor | N | r | p-value | p < 0.05? |
|---|---|---|---|---|
| Age | 1934 | -0.097 | 1.96e-05 | Yes |
| BMI (log-transformed) | 1912 | 0.001 | 9.51e-01 | No |
| Total Calcium (log-transformed) | 1777 | 0.086 | 2.82e-04 | Yes |
| Total Vitamin D (sqrt-transformed) | 1777 | -0.002 | 9.32e-01 | No |
Interpret the results:
Age shows a weak negative correlation with physical activity (r = -0.097, p < 0.001), indicating that activity levels decrease slightly with age. BMI shows no correlation with physical activity (r = 0.001, p = 0.951). Total calcium intake has a weak positive correlation with activity (r = 0.086, p < 0.001), suggesting higher calcium intake is associated with slightly higher activity levels. Total vitamin D intake shows no significant correlation with physical activity (r = -0.002, p = 0.932). While some associations are statistically significant due to large sample sizes, the correlations are very weak in magnitude.
Examine correlations among all predictor variables (all pairwise combinations):
# YOUR CODE HERE
# Create all pairwise combinations of predictors
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
var1 = case_when(
x == "RIDAGEYR" ~ "Age",
x == "log_bmi" ~ "BMI (log)",
x == "log_calcium_total" ~ "Calcium (log)",
x == "sqrt_vitd_total" ~ "Vitamin D (sqrt)"
),
var2 = case_when(
y == "RIDAGEYR" ~ "Age",
y == "log_bmi" ~ "BMI (log)",
y == "log_calcium_total" ~ "Calcium (log)",
y == "sqrt_vitd_total" ~ "Vitamin D (sqrt)"
),
estimate = round(estimate, 3),
p_value = signif(p_value, 3),
significant = ifelse(p_value < 0.05, "Yes", "No")
) %>%
select(var1, var2, n, estimate, p_value, significant)
tableC %>%
kable(caption = "Table C: Pairwise Correlations Among Predictors",
col.names = c("Variable 1", "Variable 2", "N", "r", "p-value", "p < 0.05?")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Variable 1 | Variable 2 | N | r | p-value | p < 0.05? |
|---|---|---|---|---|---|
| Age | BMI (log) | 2834 | -0.098 | 2.00e-07 | Yes |
| Age | Calcium (log) | 2605 | 0.048 | 1.35e-02 | Yes |
| Age | Vitamin D (sqrt) | 2605 | 0.153 | 0.00e+00 | Yes |
| BMI (log) | Calcium (log) | 2569 | 0.000 | 9.83e-01 | No |
| BMI (log) | Vitamin D (sqrt) | 2569 | 0.007 | 7.31e-01 | No |
| Calcium (log) | Vitamin D (sqrt) | 2605 | 0.314 | 0.00e+00 | Yes |
Interpret the results:
Most correlations among predictors are weak. The strongest correlation is between calcium and vitamin D intake (r = 0.314, p < 0.001), which makes sense as these nutrients often come from similar food sources. Age shows weak correlations with all variables: negatively with BMI (r = -0.098) and positively with calcium (r = 0.048) and vitamin D (r = 0.153). BMI shows essentially no correlation with either nutrient (r = 0.000 and r = 0.007, both non-significant). These weak correlations suggest that multicollinearity is unlikely to be a major concern in future regression analyses using these variables.
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)
# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)A. Comparing Methods (ANOVA vs Correlation)
ANOVA is used when comparing means across three or more categorical groups, while correlation is used to assess the strength and direction of linear relationships between two continuous variables. The key difference is that ANOVA tells us whether group means differ (categorical predictor → continuous outcome), while correlation tells us about the association between two continuous variables. An example research question for ANOVA would be: “Does mean blood pressure differ across smoking status groups (never, former, current)?” An example for correlation would be: “Is there a linear relationship between physical activity level and bone mineral density?”
B. Assumptions and Limitations
The most challenging assumptions to meet in this assignment were the normality of residuals and homogeneity of variances for ANOVA. Both were violated in our analysis, as shown by the Q-Q plot and Levene’s test. These violations could affect our conclusions by increasing the Type I error rate or reducing statistical power. However, ANOVA is fairly robust to these violations with large sample sizes. The cross-sectional correlation analysis has important limitations for understanding causality between nutrition/activity and bone health. Correlations cannot establish temporal sequence, are susceptible to confounding, and cannot rule out reverse causation (e.g., people with better bone health might be more physically active).
C. R Programming Challenge
The most difficult part of the R coding was creating all three correlation tables, particularly Table C with the pairwise combinations. I initially struggled with generating all possible pairs efficiently. I solved this by studying the combn() function and understanding how to combine it with map_dfr() from the purrr package. I also found the corr_pair() helper function very useful once I understood how it worked. Through this assignment, I strengthened my skills in data transformation, creating reproducible tables with kable(), and writing functions to automate repetitive tasks. I now feel more confident in my ability to conduct ANOVA and correlation analyses independently.
Before submitting, verify: