Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Create a course folder on your computer like:
EPI553/
├── Assignments/
│ └── HW01/
│ ├── bmd.csv
│ └── EPI553_HW01_LastName_FirstName.Rmd
Required filename for your submission:
EPI553_HW01_LastName_FirstName.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
library(ggplot2)
library(ggstats)
library(readr)# Import data (adjust path as needed)
bmd <- readr::read_csv("~/R/site-library/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.]
The total sample included 2,898 participants from the dataset. For the ANOVA analysis, 612 observations were excluded due to missing data on key variables, including sex, ethnicity, and smoking status. After removing these incomplete cases, the final analytic sample consisted of 2,286 participants with complete data on all variables required for the analysis
# Fit the one-way ANOVA model
anova_model <- aov(DXXOFBMD ~ ethnicity, data = bmd)
# Display the ANOVA table
summary(anova_model)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 612 observations deleted due to missingness
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3)| ethnicity | n | mean_bmd | sd_bmd |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: What patterns do you observe in BMD across ethnicity groups?
[BMD appears to vary across ethnicity groups, with Non-Hispanic Black participants showing the highest median BMD and Non-Hispanic White participants showing the lowest median BMD. Mexican American, Other Hispanic, and Other groups fall in between these two groups. Although the distributions overlap substantially, there is a shift in central tendency across groups.]
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table: #| Source | df | F-statistic | p-value | | ——— | —- | ———– | ——- | | Ethnicity | 4 | 30.18 | < 0.001 | | Residuals | 2281 | — | — |
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
Write your interpretation here:
[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]
#Interpretation: We reject the null hypothesis (H₀). there is strong statistical evidence that mean bone mineral density (BMD) differs across ethnicity groups,therefore at least one ethnic group has a significantly different average BMD compared to the others.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests. ### Normality of residuals
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
[YOUR INTERPRETATION: Comment on whether residuals appear normally distributed based on the QQ plot, and whether variances appear equal based on Levene’s test. State whether assumptions are met or violated.]
#Interpretation: The Q-Q plot shows that the residuals follow the reference line reasonably well, with only minor deviations in the tails, indicating approximate normality. Levene’s test for homogeneity of variance was not statistically significant (F(4, 2281) = 1.57, p = 0.1788), indicating that the variances are equal across ethnicity groups. Therefore, the normality and equal variance assumptions appear to be reasonably met.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Write a short paragraph: “The groups that differed were …”
#YOUR INTERPRETATION:
The groups that differed were primarily those involving Non-Hispanic White and Non-Hispanic Black participants. Non-Hispanic Whites had significantly lower mean BMD compared to Mexican Americans, Other Hispanics, and Non-Hispanic Blacks (all adjusted p < 0.001). The “Other” group also had significantly lower BMD compared to Mexican Americans and Non-Hispanic Blacks. Additionally, Non-Hispanic Blacks had significantly higher mean BMD than Non-Hispanic Whites. No statistically significant differences were observed between Mexican Americans and Other Hispanics, Mexican Americans and Non-Hispanic Blacks, Other Hispanics and Non-Hispanic Blacks, or Other and Non-Hispanic Whites.
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
# YOUR CODE HERE to compute eta-squared
ss_between <- anova_tbl$sumsq[anova_tbl$term == "ethnicity"]
ss_within <- anova_tbl$sumsq[anova_tbl$term == "Residuals"]
# Compute total SS
ss_total <- ss_between + ss_within
# Compute eta-squared
eta_sq <- ss_between / ss_total
eta_sq## [1] 0.05027196
Interpret in 1–2 sentences:
[YOUR INTERPRETATION: What proportion of variance in BMD is explained by ethnicity? Use Cohen’s guidelines: small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14]
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
mutate(
DSQTCALC_0 = replace_na(DSQTCALC, 0),
DSQTVD_0 = replace_na(DSQTVD, 0),
calcium_total = calcium + DSQTCALC_0,
vitd_total = vitd + DSQTVD_0
)
bmd2$vitd_total <- bmd2$vitd +
ifelse(is.na(bmd2$DSQTVD), 0, bmd2$DSQTVD)
# Verify
summary(bmd2[, c("calcium_total", "vitd_total")])## calcium_total vitd_total
## Min. : 55.0 Min. : 0.000
## 1st Qu.: 596.8 1st Qu.: 2.900
## Median : 895.5 Median : 8.533
## Mean : 997.6 Mean : 28.675
## 3rd Qu.:1285.0 3rd Qu.: 30.400
## Max. :5399.0 Max. :2574.650
## NA's :293 NA's :293
## [1] 293
## [1] 293
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
# YOUR CODE HERE: Create histograms for key continuous variables
# # Create histograms for all continuous variables
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(fill = "purple", color = "black", bins = 30) +
labs(title = "BMI Distribution", x = "BMI", y = "Frequency") +
theme_minimal()
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
labs(title = "Total Calcium Distribution", x = "Calcium (mg)", y = "Frequency") +
theme_minimal()
p3 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(fill = "darkgreen", color = "black", bins = 30) +
labs(title = "Total Vitamin D Distribution", x = "Vitamin D (mcg)", y = "Frequency") +
theme_minimal()
p4 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(fill = "orange2", color = "black", bins = 30) +
labs(title = "Total MET Distribution", x = "MET (min/week)", y = "Frequency") +
theme_minimal()
# Display all plots together
print(p1)## [1] 64
## [1] 293
## [1] 293
## [1] 964
# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases)## Complete cases for all variables: 1765
The bone mineral density (BMD) values of the five ethnic groups are similar.
Because the boxes are roughly the same height and location, the majority of individuals in each group have BMD values that are comparable (0.9–1.0 g/cm²).
Black non-Hispanic individuals have a somewhat higher BMD; their box is positioned slightly higher than the others.
Mexican Americans’ box sits a little lower than the others, indicating a slightly lower BMD.
*The boxes overlap a lot, indicating that there are not many distinctions between the groups. Numerous dots (data points) indicate that each group has a large number of members, which increases the reliability of our findings. Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
bmi_log = log(BMXBMI),
calcium_log = log(calcium_total + 1),
vitd_sqrt = sqrt(vitd_total),
met_sqrt = sqrt(totmet)
# Add your transformation code here
# Examples (modify as needed based on your assessment):
# log_bmi = log(BMXBMI),
# log_calcium_total = log(calcium_total),
# sqrt_totmet = sqrt(totmet),
# sqrt_vitd_total = sqrt(vitd_total)
)
# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases)## Complete cases for all variables: 1765
Document your reasoning: “I applied/did not apply transformations to [variables] because…”
[YOUR JUSTIFICATION HERE]
Decision: Apply LOG transformation
Justification: Right-skewed shape needs correction No zeros in BMI data, so log is safe to use For improved correlation analysis, log will compress the right tail and make the distribution more symmetrical.
Decision: Apply LOG transformation
Justification: Strong right-skewed distribution Possible zero values (some individuals might not consume any calcium) Log will increase symmetry by compressing the really high numbers.
Decision: SQUARE ROOT transformation
Justification: Many zeros present (people without supplements), sqrt handles zeros without adding constants, useful for heavily zero-inflated data
Decision: SQUARE ROOT transformation
Justification: Many zeros (sedentary individuals), extremely skewed, sqrt handles zeros naturally, appropriate for count-like physical activity data
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
# Create the correlation helper function
corr_pair <- function(data, var1, var2) {
# Remove missing values
complete_data <- data[complete.cases(data[[var1]], data[[var2]]), ]
# Calculate correlation test
cor_test <- cor.test(complete_data[[var1]], complete_data[[var2]],
method = "pearson")
# Extract results
result <- data.frame(
Variable_1 = var1,
Variable_2 = var2,
r = round(cor_test$estimate, 3),
p_value = format.pval(cor_test$p.value, digits = 3),
n = nrow(complete_data)
)
return(result)
}Table A: Predictors vs. Outcome (BMD)
#Table A: Predictors vs. Outcome (BMD)
table_a <- rbind(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"), # Age vs BMD
corr_pair(bmd2, "bmi_log", "DXXOFBMD"), # BMI vs BMD
corr_pair(bmd2, "calcium_log", "DXXOFBMD"), # Calcium vs BMD
corr_pair(bmd2, "vitd_sqrt", "DXXOFBMD") # Vitamin D vs BMD
)
table_a <- table_a %>%
mutate(
Variable_1 = case_when(
Variable_1 == "RIDAGEYR" ~ "Age (years)",
Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
TRUE ~ Variable_1
),
Variable_2 = "BMD (g/cm²)"
)
kable(table_a,
caption = "Table A: Correlations between Predictors and BMD (Outcome)",
col.names = c("Predictor", "Outcome", "r", "p-value", "n"),
align = c('l', 'l', 'r', 'r', 'r'))| Predictor | Outcome | r | p-value | n | |
|---|---|---|---|---|---|
| cor | Age (years) | BMD (g/cm²) | -0.232 | <2e-16 | 2286 |
| cor1 | BMI (log-transformed) | BMD (g/cm²) | 0.425 | <2e-16 | 2275 |
| cor2 | Total Calcium (log-transformed) | BMD (g/cm²) | 0.012 | 0.592 | 2129 |
| cor3 | Total Vitamin D (sqrt-transformed) | BMD (g/cm²) | -0.062 | 0.00426 | 2129 |
Interpret the results:
[YOUR INTERPRETATION: Age has a slight negative correlation with BMD, although BMI has a significant (moderately positive) correlation.the cross-sectional research does not reveal any significant positive correlations between BMD with either calcium or vitamin D intake.]
Examine correlations between potential confounders and physical activity:
# YOUR CODE HERE
# Table B: Predictors vs. Exposure (Physical Activity)
table_b <- rbind(
corr_pair(bmd2, "RIDAGEYR", "met_sqrt"), # Age vs MET
corr_pair(bmd2, "bmi_log", "met_sqrt"), # BMI vs MET
corr_pair(bmd2, "calcium_log", "met_sqrt"), # Calcium vs MET
corr_pair(bmd2, "vitd_sqrt", "met_sqrt") # Vitamin D vs MET
)
# Add descriptive labels
table_b <- table_b %>%
mutate(
Variable_1 = case_when(
Variable_1 == "RIDAGEYR" ~ "Age (years)",
Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
TRUE ~ Variable_1
),
Variable_2 = "Physical Activity (MET, sqrt-transformed)"
)
# Display formatted table
kable(table_b,
digits = 3,
col.names = c("Predictor", "Exposure", "r", "p-value", "n"),
caption = "Table B: Correlations between Predictors and Physical Activity (Exposure)",
align = c('l', 'l', 'r', 'r', 'r'))| Predictor | Exposure | r | p-value | n | |
|---|---|---|---|---|---|
| cor | Age (years) | Physical Activity (MET, sqrt-transformed) | -0.097 | 1.96e-05 | 1934 |
| cor1 | BMI (log-transformed) | Physical Activity (MET, sqrt-transformed) | 0.001 | 0.951 | 1912 |
| cor2 | Total Calcium (log-transformed) | Physical Activity (MET, sqrt-transformed) | 0.086 | 0.000282 | 1777 |
| cor3 | Total Vitamin D (sqrt-transformed) | Physical Activity (MET, sqrt-transformed) | -0.002 | 0.932 | 1777 |
Interpret the results:
YOUR INTERPRETATION: Physical activity and age have a slightly negative correlation (older = less active). There are no significant correlations between vitamin D and BMI. Although statistically significant, the calcium-MET link is fairly weak (r = 0.068) and has little practical significance. There is no apparent correlation between vitamin D intake and levels of physical exercise.
Examine correlations among all predictor variables (all pairwise combinations):
# #Table C: Predictors vs. Each Other (Intercorrelations)
table_c <- rbind(
corr_pair(bmd2, "RIDAGEYR", "bmi_log"),
corr_pair(bmd2, "RIDAGEYR", "calcium_log"),
corr_pair(bmd2, "RIDAGEYR", "vitd_sqrt"),
corr_pair(bmd2, "bmi_log", "calcium_log"),
corr_pair(bmd2, "bmi_log", "vitd_sqrt"),
corr_pair(bmd2, "calcium_log", "vitd_sqrt")
)
# Descriptive labels
table_c <- table_c %>%
mutate(
Variable_1 = case_when(
Variable_1 == "RIDAGEYR" ~ "Age (years)",
Variable_1 == "bmi_log" ~ "BMI (log-transformed)",
Variable_1 == "calcium_log" ~ "Total Calcium (log-transformed)",
Variable_1 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
TRUE ~ Variable_1
),
Variable_2 = case_when(
Variable_2 == "RIDAGEYR" ~ "Age (years)",
Variable_2 == "bmi_log" ~ "BMI (log-transformed)",
Variable_2 == "calcium_log" ~ "Total Calcium (log-transformed)",
Variable_2 == "vitd_sqrt" ~ "Total Vitamin D (sqrt-transformed)",
TRUE ~ Variable_2
)
)
# Display table
kable(table_c,
digits = 3,
col.names = c("Variable 1", "Variable 2", "r", "p-value", "n"),
caption = "Table C: Intercorrelations Among Predictors",
align = c('l', 'l', 'r', 'r', 'r'))| Variable 1 | Variable 2 | r | p-value | n | |
|---|---|---|---|---|---|
| cor | Age (years) | BMI (log-transformed) | -0.098 | 1.83e-07 | 2834 |
| cor1 | Age (years) | Total Calcium (log-transformed) | 0.048 | 0.0135 | 2605 |
| cor2 | Age (years) | Total Vitamin D (sqrt-transformed) | 0.153 | 4.68e-15 | 2605 |
| cor3 | BMI (log-transformed) | Total Calcium (log-transformed) | 0.000 | 0.983 | 2569 |
| cor4 | BMI (log-transformed) | Total Vitamin D (sqrt-transformed) | 0.007 | 0.731 | 2569 |
| cor5 | Total Calcium (log-transformed) | Total Vitamin D (sqrt-transformed) | 0.314 | <2e-16 | 2605 |
Interpret the results:
[YOUR INTERPRETATION: The largest correlation is 0.314 between Total Calcium and Total Vitamin D, which is only moderate, not strong. We can conclude that there are no strong intercorrelations among predictors, and based on this table alone, multicollinearity is unlikely to be a serious concern in future regression analyses.
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)
# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
[A. Comparing Methods (ANOVA vs Correlation) ANOVA and correlation are two methods used in epidemiological research to address many kinds of concerns. When comparing the mean of a continuous result across three or more categorical groups, an ANOVA is suitable. To ascertain whether mean bone mineral density varies among BMI categories (normal weight, overweight, and obese), for instance, an ANOVA would be utilized. The strength and direction of a linear link between two continuous variables, on the other hand, are evaluated using correlation. For example, correlation analysis would be more appropriate for determining if blood vitamin D levels are linked to total calcium levels.The primary difference is that correlation measures the degree to which two continuous variables move together, while ANOVA assesses mean differences between groups.
B. Assumptions & Limitations The assignment’s most difficult assumptions to meet were homogeneity of variance and normality. For a number of variables that were closer to normal distributions, log or square-root adjustments were necessary. Statistical tests may yield biased p-values if these presumptions are broken, which raises the possibility of Type I or Type II errors and reduces confidence in the findings.The fact that cross-sectional correlation analysis cannot prove causation or timing is one of its main drawbacks. theremay becorrelations between measures of bone health, activity, and nutrition, but we are unable to establish a trend. Observed associations may also be distorted by confounding variables, such as age, supplement use, and chronic illness. As a result, results should be read carefully.
C. R programing Challenges Correctly organizing the ANOVA workflow, including handling missing data and making sure categorical variables were represented as factors, was the most difficult aspect of the R coding. In order to calculate η² without misinterpreting the model components, I also had to carefully extract the right sums of squares from the ANOVA result. By methodically examining model output, comparing degrees of freedom to the analytical sample size, and using tidy() to clearly arrange results, I was able to overcome these difficulties. This technique increased my confidence in calculating and analyzing effect sizes in R and strengthened my ability to translate statistical results into meaningful interpretation.]
Before submitting, verify:
kable()EPI553_HW01_LastName_FirstName.RmdGood luck! Remember to start early and use office hours if you need help.