Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Create a course folder on your computer like:
EPI553/
├── Assignments/
│ └── HW01/
│ ├── bmd.csv
│ └── EPI553_HW01_LastName_FirstName.Rmd
Required filename for your submission:
EPI553_HW01_LastName_FirstName.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
library (knitr)
library (ggplot2)
library (readr)
library(patchwork)
library (ggstats)# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
## # A tibble: 10 × 14
## SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 93705 2 66 4 31.7 2 240 0 1.06 0
## 2 93708 2 66 5 23.7 3 120 0 0.801 1
## 3 93709 2 75 4 38.9 1 720 1 0.88 0
## 4 93711 1 56 5 21.3 3 840 1 0.851 1
## 5 93713 1 67 3 23.5 1 360 0 0.778 1
## 6 93714 2 54 4 39.9 2 NA NA 0.994 0
## 7 93715 1 71 5 22.5 1 6320 2 0.952 0
## 8 93716 1 61 5 30.7 2 2400 2 1.12 0
## 9 93721 2 60 1 35.9 3 NA NA NA NA
## 10 93722 2 60 3 23.8 3 NA NA 0.752 1
## # ℹ 4 more variables: calcium <dbl>, vitd <dbl>, DSQTVD <dbl>, DSQTCALC <dbl>
## spc_tbl_ [2,898 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ SEQN : num [1:2898] 93705 93708 93709 93711 93713 ...
## $ RIAGENDR: num [1:2898] 2 2 2 1 1 2 1 1 2 2 ...
## $ RIDAGEYR: num [1:2898] 66 66 75 56 67 54 71 61 60 60 ...
## $ RIDRETH1: num [1:2898] 4 5 4 5 3 4 5 5 1 3 ...
## $ BMXBMI : num [1:2898] 31.7 23.7 38.9 21.3 23.5 39.9 22.5 30.7 35.9 23.8 ...
## $ smoker : num [1:2898] 2 3 1 3 1 2 1 2 3 3 ...
## $ totmet : num [1:2898] 240 120 720 840 360 NA 6320 2400 NA NA ...
## $ metcat : num [1:2898] 0 0 1 1 0 NA 2 2 NA NA ...
## $ DXXOFBMD: num [1:2898] 1.058 0.801 0.88 0.851 0.778 ...
## $ tbmdcat : num [1:2898] 0 1 0 1 1 0 0 0 NA 1 ...
## $ calcium : num [1:2898] 504 474 NA 1248 660 ...
## $ vitd : num [1:2898] 1.85 5.85 NA 3.85 2.35 5.65 3.75 4.45 6.05 6.45 ...
## $ DSQTVD : num [1:2898] 20.6 25 NA 25 NA ...
## $ DSQTCALC: num [1:2898] 211.7 820 NA 35 13.3 ...
## - attr(*, "spec")=
## .. cols(
## .. SEQN = col_double(),
## .. RIAGENDR = col_double(),
## .. RIDAGEYR = col_double(),
## .. RIDRETH1 = col_double(),
## .. BMXBMI = col_double(),
## .. smoker = col_double(),
## .. totmet = col_double(),
## .. metcat = col_double(),
## .. DXXOFBMD = col_double(),
## .. tbmdcat = col_double(),
## .. calcium = col_double(),
## .. vitd = col_double(),
## .. DSQTVD = col_double(),
## .. DSQTCALC = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] 2898 14
## [1] "SEQN" "RIAGENDR" "RIDAGEYR" "RIDRETH1" "BMXBMI" "smoker"
## [7] "totmet" "metcat" "DXXOFBMD" "tbmdcat" "calcium" "vitd"
## [13] "DSQTVD" "DSQTCALC"
Create readable labels for the main categorical variables:
bmd_data <- bmd %>%
mutate(
ethnic_category = case_when(
RIDRETH1 == 1 ~ "Mexican American",
RIDRETH1 == 2 ~ "Other Hispanic",
RIDRETH1 == 3 ~ "Non-Hispanic White",
RIDRETH1 == 4 ~ "Non-Hispanic Black",
RIDRETH1 == 5 ~ "Other (including Multi)",
),
gender_category = case_when(
RIAGENDR == 1 ~ "Male",
RIAGENDR == 2 ~ "Female",
),
smoke_category = case_when(
smoker == 1 ~ "Current smoker",
smoker == 2 ~ "Past smoker",
smoker == 3 ~ "Never smoked",
),
)
# Check sample sizes
table(bmd_data$ethnic_category)##
## Mexican American Non-Hispanic Black Non-Hispanic White
## 331 696 1086
## Other (including Multi) Other Hispanic
## 499 286
##
## Female Male
## 1467 1431
##
## Current smoker Never smoked Past smoker
## 449 1553 894
# Display first few rows
head(bmd_data) %>%
kable(caption = " BMD Dataset with recoded variables (first 6 rows)")| SEQN | RIAGENDR | RIDAGEYR | RIDRETH1 | BMXBMI | smoker | totmet | metcat | DXXOFBMD | tbmdcat | calcium | vitd | DSQTVD | DSQTCALC | ethnic_category | gender_category | smoke_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 93705 | 2 | 66 | 4 | 31.7 | 2 | 240 | 0 | 1.058 | 0 | 503.5 | 1.85 | 20.557 | 211.67 | Non-Hispanic Black | Female | Past smoker |
| 93708 | 2 | 66 | 5 | 23.7 | 3 | 120 | 0 | 0.801 | 1 | 473.5 | 5.85 | 25.000 | 820.00 | Other (including Multi) | Female | Never smoked |
| 93709 | 2 | 75 | 4 | 38.9 | 1 | 720 | 1 | 0.880 | 0 | NA | NA | NA | NA | Non-Hispanic Black | Female | Current smoker |
| 93711 | 1 | 56 | 5 | 21.3 | 3 | 840 | 1 | 0.851 | 1 | 1248.5 | 3.85 | 25.000 | 35.00 | Other (including Multi) | Male | Never smoked |
| 93713 | 1 | 67 | 3 | 23.5 | 1 | 360 | 0 | 0.778 | 1 | 660.5 | 2.35 | NA | 13.33 | Non-Hispanic White | Male | Current smoker |
| 93714 | 2 | 54 | 4 | 39.9 | 2 | NA | NA | 0.994 | 0 | 776.0 | 5.65 | NA | NA | Non-Hispanic Black | Female | Past smoker |
Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd_data)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd_data %>%
filter(!is.na(DXXOFBMD), !is.na(ethnic_category))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.] ## Analytic Sample Summary
Write 2–4 sentences summarizing your analytic sample here. The total sample consisted of 2,898 participants whose original data source is the National Health and Nutrition Examination Survey (NHANES). After removing participants with missing bone mineral density (BMD) or ethnicity data, 612 observations were excluded, resulting in a final analytic sample of 2,286 participants for the ANOVA analysis. This represents 78.9% of the total sample. The analytic sample includes adults with complete BMD measurements distributed across five ethnicity categories: Mexican American (n=331), Other Hispanic (n=286), Non-Hispanic White (n=1,086), Non-Hispanic Black (n=696), and Other (including Multi-racial, n=499). —
Part 1 — One-way ANOVA: BMD by ethnicity (50 points)
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
H₀: μ_ethniccategory1 = μ_ethniccategory2 = μ_ethniccategory3 = μ_ethniccategory4 = μ_ethniccategory3
Hₐ: At least one population group mean differs from the others. At least one μi ≠ μj
Plain language:
###Fit the anova table
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnic_category) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnic_category | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other (including Multi) | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnic_category, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
All five ethnic group seem to have similar median BMD values at around 1 gm/cm2. Non- hispanic black have slightly higher median BMD whereas non-hispanic white has slightly lower median BMD. Outliers are present in all groups, mostly below.
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnic_category, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnic_category 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnic_category | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]
The one-way ANOVA shows a statistically significant difference in mean BMD across ethnic categories (F(4, 2281) = 30.19, p < 0.001). Because the p-value is far below 0.05 ( <2e-16), we reject the null hypothesis (H₀) that mean BMD is equal across all ethnic groups. There is strong statistical evidence that at least one ethnic group has a different mean BMD compared to the others. However, since ANOVA does not tell us which specific groups differ, we will use post-hoc tests (e.g., Tukey’s HSD) to determine where those differences lie.
This could mean meaningful ethnic variation in BMD that we could look into biological, social, or structural determinants.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
# Conduct Levene's test in neater table
levene_test <- leveneTest(DXXOFBMD ~ ethnic_category, data = anova_df)
# Display results
print(levene_test)## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
# Create formatted table
levene_table <- data.frame(
Test = "Levene's Test",
Df1 = levene_test$Df[1],
Df2 = levene_test$Df[2],
`F value` = round(levene_test$`F value`[1], 3),
`p-value` = round(levene_test$`Pr(>F)`[1], 4)
)
kable(levene_table,
caption = "Levene's Test for Homogeneity of Variance",
col.names = c("Test", "df1", "df2", "F-statistic", "p-value"))| Test | df1 | df2 | F-statistic | p-value |
|---|---|---|---|---|
| Levene’s Test | 4 | 2281 | 1.573 | 0.1788 |
Summarize what you see in 2–4 sentences:
YOUR INTERPRETATION: Based on the diagnostic plots and levene test, we can see the following: Independence: Observations are independent (assumed based on study design) 1. For Normality,the QQ plot show that the residuals follow normal distribution line with only few deviation at each tails. Most lie on diagonal line neatly, so normality is satisfied for this sample and n is also fairly large.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnic_category, data = anova_df)
##
## $ethnic_category
## diff lwr
## Non-Hispanic Black-Mexican American 0.022391619 -0.01010005
## Non-Hispanic White-Mexican American -0.062377249 -0.09285262
## Other (including Multi)-Mexican American -0.053319344 -0.08735725
## Other Hispanic-Mexican American -0.004441273 -0.04264413
## Non-Hispanic White-Non-Hispanic Black -0.084768868 -0.10837333
## Other (including Multi)-Non-Hispanic Black -0.075710963 -0.10376452
## Other Hispanic-Non-Hispanic Black -0.026832892 -0.05981593
## Other (including Multi)-Non-Hispanic White 0.009057905 -0.01663337
## Other Hispanic-Non-Hispanic White 0.057935976 0.02693726
## Other Hispanic-Other (including Multi) 0.048878071 0.01437080
## upr p adj
## Non-Hispanic Black-Mexican American 0.054883289 0.3276613
## Non-Hispanic White-Mexican American -0.031901881 0.0000003
## Other (including Multi)-Mexican American -0.019281435 0.0001919
## Other Hispanic-Mexican American 0.033761585 0.9978049
## Non-Hispanic White-Non-Hispanic Black -0.061164402 0.0000000
## Other (including Multi)-Non-Hispanic Black -0.047657407 0.0000000
## Other Hispanic-Non-Hispanic Black 0.006150151 0.1722466
## Other (including Multi)-Non-Hispanic White 0.034749177 0.8719109
## Other Hispanic-Non-Hispanic White 0.088934694 0.0000036
## Other Hispanic-Other (including Multi) 0.083385341 0.0010720
Create and present a readable table (you can tidy the Tukey output):
# Creating a readable table from Tukey results
# First, tidy the Tukey results
tukey_results <- broom::tidy(tukey_results)
# Then create the readable table with proper renaming
tukey_table <- tukey_results %>%
rename(
Comparison = contrast,
Mean_Difference = estimate,
Lower_CI = conf.low,
Upper_CI = conf.high,
Adj_p_value = adj.p.value
) %>%
##filtering for only significant results
mutate(
Significant = ifelse(Adj_p_value < 0.05, "Yes *", "No"),
Mean_Difference = round(Mean_Difference, 4),
Lower_CI = round(Lower_CI, 4),
Upper_CI = round(Upper_CI, 4),
Adj_p_value = signif(Adj_p_value, 4)
) %>%
select(Comparison, Mean_Difference, Lower_CI, Upper_CI, Adj_p_value, Significant)
# Display all comparisons
tukey_table %>%
kable(caption = "Tukey HSD Post-hoc Pairwise Comparisons of BMD by Ethnicity")| Comparison | Mean_Difference | Lower_CI | Upper_CI | Adj_p_value | Significant |
|---|---|---|---|---|---|
| Non-Hispanic Black-Mexican American | 0.0224 | -0.0101 | 0.0549 | 0.3277000 | No |
| Non-Hispanic White-Mexican American | -0.0624 | -0.0929 | -0.0319 | 0.0000003 | Yes * |
| Other (including Multi)-Mexican American | -0.0533 | -0.0874 | -0.0193 | 0.0001919 | Yes * |
| Other Hispanic-Mexican American | -0.0044 | -0.0426 | 0.0338 | 0.9978000 | No |
| Non-Hispanic White-Non-Hispanic Black | -0.0848 | -0.1084 | -0.0612 | 0.0000000 | Yes * |
| Other (including Multi)-Non-Hispanic Black | -0.0757 | -0.1038 | -0.0477 | 0.0000000 | Yes * |
| Other Hispanic-Non-Hispanic Black | -0.0268 | -0.0598 | 0.0062 | 0.1722000 | No |
| Other (including Multi)-Non-Hispanic White | 0.0091 | -0.0166 | 0.0347 | 0.8719000 | No |
| Other Hispanic-Non-Hispanic White | 0.0579 | 0.0269 | 0.0889 | 0.0000036 | Yes * |
| Other Hispanic-Other (including Multi) | 0.0489 | 0.0144 | 0.0834 | 0.0010720 | Yes * |
# Filter for only significant comparisons
sig_comparisons <- tukey_table %>%
filter(Significant == "Yes *")
sig_comparisons %>%
kable(caption = "Statistically Significant Pairwise Differences (p < 0.05)")| Comparison | Mean_Difference | Lower_CI | Upper_CI | Adj_p_value | Significant |
|---|---|---|---|---|---|
| Non-Hispanic White-Mexican American | -0.0624 | -0.0929 | -0.0319 | 0.0000003 | Yes * |
| Other (including Multi)-Mexican American | -0.0533 | -0.0874 | -0.0193 | 0.0001919 | Yes * |
| Non-Hispanic White-Non-Hispanic Black | -0.0848 | -0.1084 | -0.0612 | 0.0000000 | Yes * |
| Other (including Multi)-Non-Hispanic Black | -0.0757 | -0.1038 | -0.0477 | 0.0000000 | Yes * |
| Other Hispanic-Non-Hispanic White | 0.0579 | 0.0269 | 0.0889 | 0.0000036 | Yes * |
| Other Hispanic-Other (including Multi) | 0.0489 | 0.0144 | 0.0834 | 0.0010720 | Yes * |
Write a short paragraph: The Tukey HSD post-hoc test reveals statistically significant pairwise differences in mean BMD across ethnicity groups. These findings suggest that ethnicity is a meaningful predictor of bone mineral density, with Non-Hispanic Black individuals demonstrating the highest BMD values in this sample.
INTERPRETATION: Post-hoc Pairwise Comparisons
The Tukey HSD post-hoc test revealed 6 out of 10 pairwise comparisons showed statistically significant differences in mean BMD (adjusted p < 0.05): Non-Hispanic White vs. other groups:
Non-Hispanic White had significantly lower BMD than Mexican American (diff = -0.062, 95% CI [-0.093, -0.032], p < 0.001) Non-Hispanic White had significantly lower BMD than Non-Hispanic Black (diff = -0.085, 95% CI [-0.108, -0.061], p < 0.001) Non-Hispanic White had significantly lower BMD than Other Hispanic (diff = -0.058, 95% CI [-0.089, -0.027], p < 0.001)
Other (including Multi-Racial) vs. other groups:
Other (including Multi) had significantly lower BMD than Mexican American (diff = -0.053, 95% CI [-0.087, -0.019], p < 0.001) Other (including Multi) had significantly lower BMD than Non-Hispanic Black (diff = -0.076, 95% CI [-0.104, -0.048], p < 0.001) Other (including Multi) had significantly lower BMD than Other Hispanic (diff = -0.049, 95% CI [-0.083, -0.014], p = 0.001)
Key findings: Non-Hispanic Black and Mexican American participants had the highest mean BMD, while Non-Hispanic White and Other (including Multi-Racial) groups had the lowest. Other Hispanic fell in between. No significant difference was found between Non-Hispanic Black and Mexican American (p = 0.328), nor between Non-Hispanic White and Other (including Multi) (p = 0.872).
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
# Hint: Extract Sum Sq from the ANOVA summary
# Extract sum of squares from ANOVA table
anova_summary <- summary(fit_aov)[[1]]
ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)
# Calculate eta-squared
eta_squared <- ss_treatment / ss_total
cat("Eta-squared (η²):", round(eta_squared, 4), "\n")## Eta-squared (η²): 0.0503
## Percentage of variance explained: 5.03 %
# Create formatted output
effect_size_table <- data.frame(
Measure = "Eta-squared (η²)",
Value = round(eta_squared, 4),
`Percent Variance` = paste0(round(eta_squared * 100, 2), "%"),
Interpretation = "Small to Medium Effect"
)
kable(effect_size_table,
caption = "Effect Size: Proportion of Variance Explained by Ethnicity",
col.names = c("Effect Size Measure", "Value", "% Variance Explained", "Interpretation"))| Effect Size Measure | Value | % Variance Explained | Interpretation |
|---|---|---|---|
| Eta-squared (η²) | 0.0503 | 5.03% | Small to Medium Effect |
Effect Size Interpretation: η² = r round(eta_squared, 4) or r round(eta_squared * 100, 2)% Small
Effect size guidelines: Small effect: η² ≈ 0.01 (1%) Medium effect: η² ≈ 0.06 (6%) Large effect: η² ≈ 0.14 (14%)
Our effect: Table: Effect Size: Proportion of Variance Explained by Ethnicity
| Effect Size Measure | Value | % Variance Explained | Interpretation |
|---|---|---|---|
| Eta-squared (η²) | 0.0503 | 5.03% | Small to Medium Effect |
Interpret in 1–2 sentences: Ethnicity explains about 5.0% of the variance in bone mineral density (η² = 0.050), which is a small to moderate effect according to Cohen’s guidelines (small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14). While the ANOVA test was statistically significant with a large sample size, and ethnicity does predict BMD, other factors (such as genetics, lifestyle, nutrition, and physical activity) may influence bone density in large part. —
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
mutate(
DSQTCALC_0 = replace_na(DSQTCALC, 0),
DSQTVD_0 = replace_na(DSQTVD, 0),
calcium_total = calcium + DSQTCALC_0,
vitd_total = vitd + DSQTVD_0
)
# Display first few rows
head(bmd2) %>%
kable(caption = " BMD Dataset 2 with mutated variables (first 6 rows)")| SEQN | RIAGENDR | RIDAGEYR | RIDRETH1 | BMXBMI | smoker | totmet | metcat | DXXOFBMD | tbmdcat | calcium | vitd | DSQTVD | DSQTCALC | DSQTCALC_0 | DSQTVD_0 | calcium_total | vitd_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 93705 | 2 | 66 | 4 | 31.7 | 2 | 240 | 0 | 1.058 | 0 | 503.5 | 1.85 | 20.557 | 211.67 | 211.67 | 20.557 | 715.17 | 22.407 |
| 93708 | 2 | 66 | 5 | 23.7 | 3 | 120 | 0 | 0.801 | 1 | 473.5 | 5.85 | 25.000 | 820.00 | 820.00 | 25.000 | 1293.50 | 30.850 |
| 93709 | 2 | 75 | 4 | 38.9 | 1 | 720 | 1 | 0.880 | 0 | NA | NA | NA | NA | 0.00 | 0.000 | NA | NA |
| 93711 | 1 | 56 | 5 | 21.3 | 3 | 840 | 1 | 0.851 | 1 | 1248.5 | 3.85 | 25.000 | 35.00 | 35.00 | 25.000 | 1283.50 | 28.850 |
| 93713 | 1 | 67 | 3 | 23.5 | 1 | 360 | 0 | 0.778 | 1 | 660.5 | 2.35 | NA | 13.33 | 13.33 | 0.000 | 673.83 | 2.350 |
| 93714 | 2 | 54 | 4 | 39.9 | 2 | NA | NA | 0.994 | 0 | 776.0 | 5.65 | NA | NA | 0.00 | 0.000 | 776.00 | 5.650 |
# Verify the new variables
summary(bmd2 %>% select(calcium, DSQTCALC, calcium_total, vitd, DSQTVD, vitd_total, BMXBMI, totmet))## calcium DSQTCALC calcium_total vitd
## Min. : 55.0 Min. : 0.74 Min. : 55.0 Min. : 0.000
## 1st Qu.: 525.0 1st Qu.: 93.33 1st Qu.: 596.8 1st Qu.: 1.750
## Median : 773.0 Median : 211.67 Median : 895.5 Median : 3.200
## Mean : 840.5 Mean : 357.98 Mean : 997.6 Mean : 4.471
## 3rd Qu.:1049.0 3rd Qu.: 500.00 3rd Qu.:1285.0 3rd Qu.: 5.500
## Max. :5199.0 Max. :3220.00 Max. :5399.0 Max. :57.900
## NA's :293 NA's :1633 NA's :293 NA's :293
## DSQTVD vitd_total BMXBMI totmet
## Min. : 0.004 Min. : 0.000 Min. :14.20 Min. : 0
## 1st Qu.: 15.709 1st Qu.: 2.900 1st Qu.:25.20 1st Qu.: 240
## Median : 25.000 Median : 8.533 Median :28.70 Median : 480
## Mean : 49.851 Mean : 28.676 Mean :29.76 Mean :1015
## 3rd Qu.: 50.000 3rd Qu.: 30.400 3rd Qu.:33.40 3rd Qu.:1430
## Max. :2570.000 Max. :2574.650 Max. :84.40 Max. :9120
## NA's :1515 NA's :293 NA's :64 NA's :964
##
## Missing values in total intake variables:
## calcium_total vitd_total BMXBMI totmet
## 293 293 64 964
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
# Create histograms for all continuous variables
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
labs(title = "BMI Distribution", x = "BMI", y = "Frequency") +
theme_minimal()
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
labs(title = "Total Calcium Distribution", x = "Calcium (mg)", y = "Frequency") +
theme_minimal()
p3 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
labs(title = "Total Vitamin D Distribution", x = "Vitamin D (mcg)", y = "Frequency") +
theme_minimal()
p4 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
labs(title = "Total MET Distribution", x = "MET (min/week)", y = "Frequency") +
theme_minimal()
## Combine plots in a 2x2 grid
(p1 | p2) / (p3 | p4)# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases) ###1765 complete cases## Complete cases for all variables: 1765
*Interpretation**
BMI Distribution (p1): The right-skewed distribution with a long tail extends to higher BMI values and most data are clustered around 25-30, but some extend to 50-60.Hence, we will apply LOG transformation because right-skewed tails needs correction and since there are no zeros in BMI data, so log is safe to use. Use of Log will compress the right tail and make distribution more symmetric for better analysis.
Total Calcium Distribution (p2):The plot is highly right-skewed with most values clustered at the low end (0-1000 mg) and has long right tail extending to 4000+ mg. There is some spike near zero. We will apply LOG transformation here as well due to the extreme right-skewed distribution with potential zero values. Use of Log will compress the extreme high values and create more symmetry in the data.
Total Vitamin D Distribution (p3):The plot shows extremely right-skew, massive spike at zero. We will use SQUARE ROOT transformation because there are many zeros present (people without supplements), square root will handle zero without adding constants, as it is appropriate for heavily zero-inflated data.
Total MET Distribution (p4):Fourth plot shows another right-skewed, with large data at zero (sedentary people). Hence, similar to P3, we will use SQUARE ROOT transformation due to data have many zero values and right skewed.
Based on your assessment, apply transformations as needed:
# Apply transformations based on visual assessment
bmd2 <- bmd2 %>%
mutate(
bmi_log = log(BMXBMI),
calcium_log = log(calcium_total + 1),
vitd_sqrt = sqrt(vitd_total),
met_sqrt = sqrt(totmet)
)
# Check that variables were created
names(bmd2)## [1] "SEQN" "RIAGENDR" "RIDAGEYR" "RIDRETH1"
## [5] "BMXBMI" "smoker" "totmet" "metcat"
## [9] "DXXOFBMD" "tbmdcat" "calcium" "vitd"
## [13] "DSQTVD" "DSQTCALC" "DSQTCALC_0" "DSQTVD_0"
## [17] "calcium_total" "vitd_total" "bmi_log" "calcium_log"
## [21] "vitd_sqrt" "met_sqrt"
## [1] 64
## [1] 293
## [1] 293
## [1] 964
# Create comparison plots
# BMI
p5 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(fill ="red", color = "black", bins = 30) +
labs(title = "Original BMI (Right-Skewed)", x = "BMI") +
theme_minimal()
p6 <- ggplot(bmd2, aes(x = bmi_log)) +
geom_histogram(fill = "lightblue", color = "black", bins = 30) +
labs(title = "Log-Transformed BMI (More Normal)", x = "log(BMI)") +
theme_minimal()
print(p5)# Calcium
p7 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(fill = "red", color = "black", bins = 30) +
labs(title = "Original Calcium (Highly Skewed)", x = "Calcium (mg)") +
theme_minimal()
p8 <- ggplot(bmd2, aes(x = calcium_log)) +
geom_histogram(fill = "lightblue", color = "black", bins = 30) +
labs(title = "Log-Transformed Calcium (Improved)", x = "log(Calcium+1)") +
theme_minimal()
print(p7)# Vitamin D
p9 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(fill = "red", color = "black", bins = 30) +
labs(title = "Original Vitamin D (Highly Skewed)", x = "Vitamin D ") +
theme_minimal()
p10 <- ggplot(bmd2, aes(x = vitd_sqrt)) +
geom_histogram(fill = "lightblue", color = "black", bins = 30) +
labs(title = "Log-Transformed Vitamin D (Improved)", x = "(Sqrt Vit D +1)") +
theme_minimal()
print(p9)# Total MET
p11 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(fill = "red", color = "black", bins = 30) +
labs(title = "Total MET (Highly Skewed)", x = "Total MET ") +
theme_minimal()
p12 <- ggplot(bmd2, aes(x = met_sqrt)) +
geom_histogram(fill = "lightblue", color = "black", bins = 30) +
labs(title = "Total MET (Improved)", x = "(Sqrt Total MET +1)") +
theme_minimal()
print(p11)I applied log transformations to BMI and total calcium intake because both variables showed right-skewed distributions with positive skewness values (>1), indicating heavy right tails. The log transformation helps normalize these distributions and stabilize variance. For vitamin D and physical activity (MET), I applied square root transformations to address right skewness while preserving interpretability better than log transformations.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
# YOUR CODE HERE
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "bmi_log", "DXXOFBMD"),
corr_pair(bmd2, "calcium_log", "DXXOFBMD"),
corr_pair(bmd2, "vitd_sqrt", "DXXOFBMD"),
corr_pair(bmd2, "met_sqrt", "DXXOFBMD")
) %>%
mutate(
Variable = c("Age (years)", "BMI (log)", "Total Calcium (log)",
"Total Vitamin D (sqrt)", "Physical Activity"),
r = round(estimate, 3),
p_value = signif(p_value, 3),
Significant = ifelse(p_value < 0.05, "Yes", "No")
) %>%
select(Variable, r, p_value, Significant, n)
tableA %>% kable(caption = "Table A: Correlations between Predictors and BMD (Outcome)")| Variable | r | p_value | Significant | n |
|---|---|---|---|---|
| Age (years) | -0.232 | 0.00e+00 | Yes | 2286 |
| BMI (log) | 0.425 | 0.00e+00 | Yes | 2275 |
| Total Calcium (log) | 0.012 | 5.92e-01 | No | 2129 |
| Total Vitamin D (sqrt) | -0.062 | 4.26e-03 | Yes | 2129 |
| Physical Activity | 0.102 | 4.59e-05 | Yes | 1588 |
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()Interpret the results:
All four variables showed statistically significant correlations with BMD (p < 0.05) except BMI. Age demonstrated a moderate negative correlation (r = -0.232), indicating that older participants tended to have lower BMD. BMI showed a strong positive correlation (r = 0.425), suggesting higher body mass index is associated with greater bone mineral density. Total calcium intake had a weak correlation (r = 0.012). Total vitamin D intake showed a weak negative correlation (r = -0.062), which is unexpected but could be due to residual confounding. Physical activity demonstrated a weak positive correlation (r = 0.102), supporting the known beneficial effects of exercise on bone health.
Examine correlations between potential confounders and physical activity:
# YOUR CODE HERE
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "met_sqrt"),
corr_pair(bmd2, "bmi_log", "met_sqrt"),
corr_pair(bmd2, "calcium_log", "met_sqrt"),
corr_pair(bmd2, "vitd_sqrt", "met_sqrt"),
) %>%
mutate(
Variable = c("Age (years)", "BMI (log)", "Total Calcium (log)",
"Total Vitamin D (sqrt)"),
r = round(estimate, 3),
p_value = signif(p_value, 3),
Significant = ifelse(p_value < 0.05, "Yes", "No")
) %>%
select(Variable, r, p_value, Significant, n)
tableB %>% kable(caption = "Table B: Correlations between Potential Cofounders and Physical Activity")| Variable | r | p_value | Significant | n |
|---|---|---|---|---|
| Age (years) | -0.097 | 1.96e-05 | Yes | 1934 |
| BMI (log) | 0.001 | 9.51e-01 | No | 1912 |
| Total Calcium (log) | 0.086 | 2.82e-04 | Yes | 1777 |
| Total Vitamin D (sqrt) | -0.002 | 9.32e-01 | No | 1777 |
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()Interpret the results:
Based on Table B, age and calcium intake show the strongest evidence of being confounders, as they are significantly associated with physical activity. BMI and vitamin D may still be important covariates if they’re associated with BMD, but they don’t appear to confound the exposure-outcome relationship based on their weak associations with physical activity.
Examine correlations among all predictor variables (all pairwise combinations):
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()
### TABLE C: Pairwise correlations among predictors (multicollinearity assessment)
# Define predictor variables
preds <- c("RIDAGEYR", "bmi_log", "calcium_log", "vitd_sqrt")
pred_labels <- c("Age (years)", "BMI (log)", "Total Calcium (log)",
"Total Vitamin D (sqrt)")
# Generate all pairwise combinations
pairs_list <- combn(preds, 2, simplify = FALSE)
# Create correlation table
tableC <- map_dfr(pairs_list, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
# Create proper labels for each pair
Pair_Index = 1:n(),
Variable_1 = c("Age (years)", "Age (years)", "Age (years)",
"BMI (log)", "BMI (log)",
"Total Calcium (log)"),
Variable_2 = c("BMI (log)", "Total Calcium (log)", "Total Vitamin D (sqrt)",
"Total Calcium (log)", "Total Vitamin D (sqrt)",
"Total Vitamin D (sqrt)"),
r = round(estimate, 3),
p_value = signif(p_value, 3),
Abs_r = abs(r),
Significant = ifelse(p_value < 0.05, "Yes", "No")
) %>%
arrange(desc(Abs_r)) %>%
select(Variable_1, Variable_2, r, p_value, Significant, n)
# Display table
tableC %>%
kable(caption = "Table C: Pairwise Correlations Among Predictors (Multicollinearity Assessment)")| Variable_1 | Variable_2 | r | p_value | Significant | n |
|---|---|---|---|---|---|
| Total Calcium (log) | Total Vitamin D (sqrt) | 0.314 | 0.00e+00 | Yes | 2605 |
| Age (years) | Total Vitamin D (sqrt) | 0.153 | 0.00e+00 | Yes | 2605 |
| Age (years) | BMI (log) | -0.098 | 2.00e-07 | Yes | 2834 |
| Age (years) | Total Calcium (log) | 0.048 | 1.35e-02 | Yes | 2605 |
| BMI (log) | Total Vitamin D (sqrt) | 0.007 | 7.31e-01 | No | 2569 |
| BMI (log) | Total Calcium (log) | 0.000 | 9.83e-01 | No | 2569 |
Interpret the results:
[YOUR INTERPRETATION: Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses?]
All pairwise correlations among predictors were weak to moderate. The strongest correlation is between total calcium intake and total vitamin D intake (r = 0.314, p < 0.001). Individuals who supplement their diet with one nutrient often supplement with the other. However, this correlation is still below the multicollinearity threshold of r= 0.7. Age show very weak correlations with BMI (r = -0.098), calcium intake (r = 0.048), and vitamin D intake (r = 0.153). BMI is not correlated with both calcium (r = 0.000) and vitamin D (r = 0.007) intake.
Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
ANOVA and correlation analysis are complementary but also have distinct purposes in research and should be selected based on the research question and study design. ANOVA is used to test whether categorical variables predict differences in a continuous outcome by comparing mean values across groups, while correlation analysis examines the strength and direction of linear relationships between two continuous variables.
The key methodological difference is that ANOVA tells us if group differences exist (via p-value, the null hypothesis of equal means) and how much variance is explained overall, while correlation quantifies the strength of association between two variables on a -1 to +1 scale. ANOVA assumes categorical, while correlation does not require such. In epidemiology, ANOVA results often prompt post-hoc comparisons to identify specific group differences, whereas correlation results lead to regression modeling to adjust for confounders.
In epidemiology, ANOVA is ideal when your primary exposure of interest is categorical (e.g., ethnicity, disease status, treatment groups), and we want to determine whether significant differences in health outcomes exist across these groups. For example, our research question “Do mean BMD values differ across ethnicity groups?” for ANOVA because ethnicity is categorical with five distinct groups.
Correlation analysis is better when both variables are continuous or when we want to quantify the strength of associations. An example research question would be: “What is the linear relationship between dietary calcium intake and BMD?” where both variables are measured on continuous scales.
B. Assumptions and Limitations
The most challenging assumption to fully satisfy in this assignment was normality of residuals, particularly given the distribution of nutrient intake variables (calcium and vitamin D), which are extremely right-skewed.
How might violations of assumptions affect your conclusions? Violations of assumptions can affect conclusions in meaningful ways. For instance, if we had not addressed non-normality in the correlation analyses, confidence intervals might be too narrow and p-values unreliable, potentially leading to false discoveries.
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?
C. R Programming Challenge
The most difficult part of the R coding involved creating the Tukey HSD post-hoc table and the tidying up the tables. Somehow that always tripped me over and over. Initially, transformation part- interpreting when and why to apply transformations was the most difficult.
My problem-solving process was: 1) reading error message carefully 2) try to debug with more information 3) test each part separately 4) then i looked for other resources.
YOUR REFLECTION HERE (200–400 words)
[Write your reflection addressing all three components above]
This assignment really helped me see how ANOVA and correlation serve different but complementary roles in epidemiologic analysis. ANOVA help examine whether mean BMD differed across ethnic groups, which makes sense when the exposure is categorical. Correlation, on the other hand, allowed us to assess the strength of linear relationships between continuous variables such as calcium intake and physical activity. Working through both approaches clarified for me that the method depends entirely on the research question—ANOVA asks whether groups differ, while correlation asks how closely two variables move together.
One of the more difficult aspects was checking and addressing statistical assumptions, especially normality for skewed variables. Applying log and square root transformations required understanding why skewed distributions can distort regression estimates and inference. Even after transformations, I was thinking cross-sectional data come with limitations. We can observe associations, but we cannot establish temporality or causality—for example, whether low calcium intake contributes to bone loss or whether individuals with lower BMD modify their diet.
The R coding portion was particularly challenging. Creating clean Tukey HSD tables and debugging syntax errors in longer piped commands pushed me out of my comfort zone. At times it felt overwhelming. Over time, though, I shifted from reacting to error messages with frustration to approaching them more systematically—breaking code into smaller pieces, checking objects step by step, and reading documentation more carefully. I strengthened practical skills in data understanding, handling missing values, applying transformations, and producing clear visualizations. The workload was substantial and at moments genuinely felt tough, but working through those challenges made the concepts stick more deeply and improved both my analytical confidence and technical fluency. Hopefully!
Before submitting, verify:
Good luck! Remember to start early and use office hours if you need help.