Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
library (knitr)
library (ggplot2)
library (readr)
library(patchwork)
library (ggstats)

# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Examine the first few rows
head(bmd, n = 10)

## # A tibble: 10 × 14
##     SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
##    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
##  1 93705        2       66        4   31.7      2    240      0    1.06        0
##  2 93708        2       66        5   23.7      3    120      0    0.801       1
##  3 93709        2       75        4   38.9      1    720      1    0.88        0
##  4 93711        1       56        5   21.3      3    840      1    0.851       1
##  5 93713        1       67        3   23.5      1    360      0    0.778       1
##  6 93714        2       54        4   39.9      2     NA     NA    0.994       0
##  7 93715        1       71        5   22.5      1   6320      2    0.952       0
##  8 93716        1       61        5   30.7      2   2400      2    1.12        0
##  9 93721        2       60        1   35.9      3     NA     NA   NA          NA
## 10 93722        2       60        3   23.8      3     NA     NA    0.752       1
## # ℹ 4 more variables: calcium <dbl>, vitd <dbl>, DSQTVD <dbl>, DSQTCALC <dbl>

# View data structure
str(bmd)

## spc_tbl_ [2,898 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SEQN    : num [1:2898] 93705 93708 93709 93711 93713 ...
##  $ RIAGENDR: num [1:2898] 2 2 2 1 1 2 1 1 2 2 ...
##  $ RIDAGEYR: num [1:2898] 66 66 75 56 67 54 71 61 60 60 ...
##  $ RIDRETH1: num [1:2898] 4 5 4 5 3 4 5 5 1 3 ...
##  $ BMXBMI  : num [1:2898] 31.7 23.7 38.9 21.3 23.5 39.9 22.5 30.7 35.9 23.8 ...
##  $ smoker  : num [1:2898] 2 3 1 3 1 2 1 2 3 3 ...
##  $ totmet  : num [1:2898] 240 120 720 840 360 NA 6320 2400 NA NA ...
##  $ metcat  : num [1:2898] 0 0 1 1 0 NA 2 2 NA NA ...
##  $ DXXOFBMD: num [1:2898] 1.058 0.801 0.88 0.851 0.778 ...
##  $ tbmdcat : num [1:2898] 0 1 0 1 1 0 0 0 NA 1 ...
##  $ calcium : num [1:2898] 504 474 NA 1248 660 ...
##  $ vitd    : num [1:2898] 1.85 5.85 NA 3.85 2.35 5.65 3.75 4.45 6.05 6.45 ...
##  $ DSQTVD  : num [1:2898] 20.6 25 NA 25 NA ...
##  $ DSQTCALC: num [1:2898] 211.7 820 NA 35 13.3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SEQN = col_double(),
##   ..   RIAGENDR = col_double(),
##   ..   RIDAGEYR = col_double(),
##   ..   RIDRETH1 = col_double(),
##   ..   BMXBMI = col_double(),
##   ..   smoker = col_double(),
##   ..   totmet = col_double(),
##   ..   metcat = col_double(),
##   ..   DXXOFBMD = col_double(),
##   ..   tbmdcat = col_double(),
##   ..   calcium = col_double(),
##   ..   vitd = col_double(),
##   ..   DSQTVD = col_double(),
##   ..   DSQTCALC = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Dimensions: rows (observations) and columns (variables)
dim(bmd)

## [1] 2898   14

# What variables do we have?
names(bmd)

##  [1] "SEQN"     "RIAGENDR" "RIDAGEYR" "RIDRETH1" "BMXBMI"   "smoker"  
##  [7] "totmet"   "metcat"   "DXXOFBMD" "tbmdcat"  "calcium"  "vitd"    
## [13] "DSQTVD"   "DSQTCALC"

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd_data <- bmd %>%
  mutate(
    ethnic_category = case_when(
      RIDRETH1 == 1 ~ "Mexican American",
      RIDRETH1 == 2 ~ "Other Hispanic",
      RIDRETH1 == 3 ~ "Non-Hispanic White",
      RIDRETH1 == 4 ~ "Non-Hispanic Black",
      RIDRETH1 == 5 ~ "Other (including Multi)",
    ),
    gender_category = case_when(
      RIAGENDR == 1 ~ "Male",
      RIAGENDR == 2 ~ "Female",
    ),
    
    smoke_category = case_when(
    smoker == 1 ~ "Current smoker",
    smoker == 2 ~ "Past smoker",
    smoker == 3 ~ "Never smoked",
    ),
  )

# Check sample sizes
table(bmd_data$ethnic_category)

## 
##        Mexican American      Non-Hispanic Black      Non-Hispanic White 
##                     331                     696                    1086 
## Other (including Multi)          Other Hispanic 
##                     499                     286

table(bmd_data$gender_category)

## 
## Female   Male 
##   1467   1431

table(bmd_data$smoke_category)

## 
## Current smoker   Never smoked    Past smoker 
##            449           1553            894

# Display first few rows
head(bmd_data) %>% 
  kable(caption = " BMD Dataset with recoded variables (first 6 rows)")

BMD Dataset with recoded variables (first 6 rows)
SEQN	RIAGENDR	RIDAGEYR	RIDRETH1	BMXBMI	smoker	totmet	metcat	DXXOFBMD	tbmdcat	calcium	vitd	DSQTVD	DSQTCALC	ethnic_category	gender_category	smoke_category
93705	2	66	4	31.7	2	240	0	1.058	0	503.5	1.85	20.557	211.67	Non-Hispanic Black	Female	Past smoker
93708	2	66	5	23.7	3	120	0	0.801	1	473.5	5.85	25.000	820.00	Other (including Multi)	Female	Never smoked
93709	2	75	4	38.9	1	720	1	0.880	0	NA	NA	NA	NA	Non-Hispanic Black	Female	Current smoker
93711	1	56	5	21.3	3	840	1	0.851	1	1248.5	3.85	25.000	35.00	Other (including Multi)	Male	Never smoked
93713	1	67	3	23.5	1	360	0	0.778	1	660.5	2.35	NA	13.33	Non-Hispanic White	Male	Current smoker
93714	2	54	4	39.9	2	NA	NA	0.994	0	776.0	5.65	NA	NA	Non-Hispanic Black	Female	Past smoker

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd_data)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd_data %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnic_category))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.] ## Analytic Sample Summary

Write 2–4 sentences summarizing your analytic sample here. The total sample consisted of 2,898 participants whose original data source is the National Health and Nutrition Examination Survey (NHANES). After removing participants with missing bone mineral density (BMD) or ethnicity data, 612 observations were excluded, resulting in a final analytic sample of 2,286 participants for the ANOVA analysis. This represents 78.9% of the total sample. The analytic sample includes adults with complete BMD measurements distributed across five ethnicity categories: Mexican American (n=331), Other Hispanic (n=286), Non-Hispanic White (n=1,086), Non-Hispanic Black (n=696), and Other (including Multi-racial, n=499). —

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ_ethniccategory1 = μ_ethniccategory2 = μ_ethniccategory3 = μ_ethniccategory4 = μ_ethniccategory3
Hₐ: At least one population group mean differs from the others. At least one μi ≠ μj

Plain language:

H₀: All five ethnicity groups have equal mean bone mineral density.
Hₐ: At least one ethnicity group has a different mean BMD.

1.2 Exploratory plot and group summaries

###Fit the anova table

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnic_category) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnic_category	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other (including Multi)	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnic_category, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

All five ethnic group seem to have similar median BMD values at around 1 gm/cm2. Non- hispanic black have slightly higher median BMD whereas non-hispanic white has slightly lower median BMD. Outliers are present in all groups, mostly below.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnic_category, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## ethnic_category    4   2.95  0.7371   30.18 <2e-16 ***
## Residuals       2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnic_category	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here:

[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]

The one-way ANOVA shows a statistically significant difference in mean BMD across ethnic categories (F(4, 2281) = 30.19, p < 0.001). Because the p-value is far below 0.05 ( <2e-16), we reject the null hypothesis (H₀) that mean BMD is equal across all ethnic groups. There is strong statistical evidence that at least one ethnic group has a different mean BMD compared to the others. However, since ANOVA does not tell us which specific groups differ, we will use post-hoc tests (e.g., Tukey’s HSD) to determine where those differences lie.

This could mean meaningful ethnic variation in BMD that we could look into biological, social, or structural determinants.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnic_category, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

# Conduct Levene's test in neater table
levene_test <- leveneTest(DXXOFBMD ~ ethnic_category, data = anova_df)

# Display results
print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

# Create formatted table
levene_table <- data.frame(
  Test = "Levene's Test",
  Df1 = levene_test$Df[1],
  Df2 = levene_test$Df[2],
  `F value` = round(levene_test$`F value`[1], 3),
  `p-value` = round(levene_test$`Pr(>F)`[1], 4)
)

kable(levene_table,
      caption = "Levene's Test for Homogeneity of Variance",
      col.names = c("Test", "df1", "df2", "F-statistic", "p-value"))

Levene’s Test for Homogeneity of Variance
Test	df1	df2	F-statistic	p-value
Levene’s Test	4	2281	1.573	0.1788

Summarize what you see in 2–4 sentences:

YOUR INTERPRETATION: Based on the diagnostic plots and levene test, we can see the following: Independence: Observations are independent (assumed based on study design) 1. For Normality,the QQ plot show that the residuals follow normal distribution line with only few deviation at each tails. Most lie on diagonal line neatly, so normality is satisfied for this sample and n is also fairly large.

Homogeniety: From the Levene test, we can see F(4, 2281) = 1.57 and P value is 0.179. Since p > 0.05, we fail to reject the null hypothesis of equal variances.Hence the variances differ across ethnic categories so the assumption is met.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

# Conduct Tukey HSD test
tukey_results <- TukeyHSD(fit_aov)
print(tukey_results)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnic_category, data = anova_df)
## 
## $ethnic_category
##                                                    diff         lwr
## Non-Hispanic Black-Mexican American         0.022391619 -0.01010005
## Non-Hispanic White-Mexican American        -0.062377249 -0.09285262
## Other (including Multi)-Mexican American   -0.053319344 -0.08735725
## Other Hispanic-Mexican American            -0.004441273 -0.04264413
## Non-Hispanic White-Non-Hispanic Black      -0.084768868 -0.10837333
## Other (including Multi)-Non-Hispanic Black -0.075710963 -0.10376452
## Other Hispanic-Non-Hispanic Black          -0.026832892 -0.05981593
## Other (including Multi)-Non-Hispanic White  0.009057905 -0.01663337
## Other Hispanic-Non-Hispanic White           0.057935976  0.02693726
## Other Hispanic-Other (including Multi)      0.048878071  0.01437080
##                                                     upr     p adj
## Non-Hispanic Black-Mexican American         0.054883289 0.3276613
## Non-Hispanic White-Mexican American        -0.031901881 0.0000003
## Other (including Multi)-Mexican American   -0.019281435 0.0001919
## Other Hispanic-Mexican American             0.033761585 0.9978049
## Non-Hispanic White-Non-Hispanic Black      -0.061164402 0.0000000
## Other (including Multi)-Non-Hispanic Black -0.047657407 0.0000000
## Other Hispanic-Non-Hispanic Black           0.006150151 0.1722466
## Other (including Multi)-Non-Hispanic White  0.034749177 0.8719109
## Other Hispanic-Non-Hispanic White           0.088934694 0.0000036
## Other Hispanic-Other (including Multi)      0.083385341 0.0010720

# Visualize the confidence intervals
plot(tukey_results, las = 0)

Create and present a readable table (you can tidy the Tukey output):

# Creating a readable table from Tukey results

# First, tidy the Tukey results
tukey_results <- broom::tidy(tukey_results)

# Then create the readable table with proper renaming
tukey_table <- tukey_results %>%
  rename(
    Comparison = contrast,
    Mean_Difference = estimate,
    Lower_CI = conf.low,
    Upper_CI = conf.high,
    Adj_p_value = adj.p.value
  ) %>%
  
  ##filtering for only significant results 
  
  mutate(
    Significant = ifelse(Adj_p_value < 0.05, "Yes *", "No"),
    Mean_Difference = round(Mean_Difference, 4),
    Lower_CI = round(Lower_CI, 4),
    Upper_CI = round(Upper_CI, 4),
    Adj_p_value = signif(Adj_p_value, 4)
  ) %>%
  select(Comparison, Mean_Difference, Lower_CI, Upper_CI, Adj_p_value, Significant)

# Display all comparisons
tukey_table %>% 
  kable(caption = "Tukey HSD Post-hoc Pairwise Comparisons of BMD by Ethnicity")

Tukey HSD Post-hoc Pairwise Comparisons of BMD by Ethnicity
Comparison	Mean_Difference	Lower_CI	Upper_CI	Adj_p_value	Significant
Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277000	No
Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000003	Yes *
Other (including Multi)-Mexican American	-0.0533	-0.0874	-0.0193	0.0001919	Yes *
Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978000	No
Non-Hispanic White-Non-Hispanic Black	-0.0848	-0.1084	-0.0612	0.0000000	Yes *
Other (including Multi)-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000000	Yes *
Other Hispanic-Non-Hispanic Black	-0.0268	-0.0598	0.0062	0.1722000	No
Other (including Multi)-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719000	No
Other Hispanic-Non-Hispanic White	0.0579	0.0269	0.0889	0.0000036	Yes *
Other Hispanic-Other (including Multi)	0.0489	0.0144	0.0834	0.0010720	Yes *

# Filter for only significant comparisons
sig_comparisons <- tukey_table %>%
  filter(Significant == "Yes *")

sig_comparisons %>% 
  kable(caption = "Statistically Significant Pairwise Differences (p < 0.05)")

Statistically Significant Pairwise Differences (p < 0.05)
Comparison	Mean_Difference	Lower_CI	Upper_CI	Adj_p_value	Significant
Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000003	Yes *
Other (including Multi)-Mexican American	-0.0533	-0.0874	-0.0193	0.0001919	Yes *
Non-Hispanic White-Non-Hispanic Black	-0.0848	-0.1084	-0.0612	0.0000000	Yes *
Other (including Multi)-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000000	Yes *
Other Hispanic-Non-Hispanic White	0.0579	0.0269	0.0889	0.0000036	Yes *
Other Hispanic-Other (including Multi)	0.0489	0.0144	0.0834	0.0010720	Yes *

Write a short paragraph: The Tukey HSD post-hoc test reveals statistically significant pairwise differences in mean BMD across ethnicity groups. These findings suggest that ethnicity is a meaningful predictor of bone mineral density, with Non-Hispanic Black individuals demonstrating the highest BMD values in this sample.

INTERPRETATION: Post-hoc Pairwise Comparisons

The Tukey HSD post-hoc test revealed 6 out of 10 pairwise comparisons showed statistically significant differences in mean BMD (adjusted p < 0.05): Non-Hispanic White vs. other groups:

Non-Hispanic White had significantly lower BMD than Mexican American (diff = -0.062, 95% CI [-0.093, -0.032], p < 0.001) Non-Hispanic White had significantly lower BMD than Non-Hispanic Black (diff = -0.085, 95% CI [-0.108, -0.061], p < 0.001) Non-Hispanic White had significantly lower BMD than Other Hispanic (diff = -0.058, 95% CI [-0.089, -0.027], p < 0.001)

Other (including Multi-Racial) vs. other groups:

Other (including Multi) had significantly lower BMD than Mexican American (diff = -0.053, 95% CI [-0.087, -0.019], p < 0.001) Other (including Multi) had significantly lower BMD than Non-Hispanic Black (diff = -0.076, 95% CI [-0.104, -0.048], p < 0.001) Other (including Multi) had significantly lower BMD than Other Hispanic (diff = -0.049, 95% CI [-0.083, -0.014], p = 0.001)

Key findings: Non-Hispanic Black and Mexican American participants had the highest mean BMD, while Non-Hispanic White and Other (including Multi-Racial) groups had the lowest. Other Hispanic fell in between. No significant difference was found between Non-Hispanic Black and Mexican American (p = 0.328), nor between Non-Hispanic White and Other (including Multi) (p = 0.872).

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Hint: Extract Sum Sq from the ANOVA summary

# Extract sum of squares from ANOVA table
anova_summary <- summary(fit_aov)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

# Create formatted output
effect_size_table <- data.frame(
  Measure = "Eta-squared (η²)",
  Value = round(eta_squared, 4),
  `Percent Variance` = paste0(round(eta_squared * 100, 2), "%"),
  Interpretation = "Small to Medium Effect"
)

kable(effect_size_table,
      caption = "Effect Size: Proportion of Variance Explained by Ethnicity",
      col.names = c("Effect Size Measure", "Value", "% Variance Explained", "Interpretation"))

Effect Size: Proportion of Variance Explained by Ethnicity
Effect Size Measure	Value	% Variance Explained	Interpretation
Eta-squared (η²)	0.0503	5.03%	Small to Medium Effect

Effect Size Interpretation: η² = r round(eta_squared, 4) or r round(eta_squared * 100, 2)% Small

Effect size guidelines: Small effect: η² ≈ 0.01 (1%) Medium effect: η² ≈ 0.06 (6%) Large effect: η² ≈ 0.14 (14%)
Our effect: Table: Effect Size: Proportion of Variance Explained by Ethnicity

Effect Size Measure	Value	% Variance Explained	Interpretation
Eta-squared (η²)	0.0503	5.03%	Small to Medium Effect

Interpret in 1–2 sentences: Ethnicity explains about 5.0% of the variance in bone mineral density (η² = 0.050), which is a small to moderate effect according to Cohen’s guidelines (small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14). While the ANOVA test was statistically significant with a large sample size, and ethnicity does predict BMD, other factors (such as genetics, lifestyle, nutrition, and physical activity) may influence bone density in large part. —

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

# Display first few rows
head(bmd2) %>% 
  kable(caption = " BMD Dataset 2 with mutated variables (first 6 rows)")

BMD Dataset 2 with mutated variables (first 6 rows)
SEQN	RIAGENDR	RIDAGEYR	RIDRETH1	BMXBMI	smoker	totmet	metcat	DXXOFBMD	tbmdcat	calcium	vitd	DSQTVD	DSQTCALC	DSQTCALC_0	DSQTVD_0	calcium_total	vitd_total
93705	2	66	4	31.7	2	240	0	1.058	0	503.5	1.85	20.557	211.67	211.67	20.557	715.17	22.407
93708	2	66	5	23.7	3	120	0	0.801	1	473.5	5.85	25.000	820.00	820.00	25.000	1293.50	30.850
93709	2	75	4	38.9	1	720	1	0.880	0	NA	NA	NA	NA	0.00	0.000	NA	NA
93711	1	56	5	21.3	3	840	1	0.851	1	1248.5	3.85	25.000	35.00	35.00	25.000	1283.50	28.850
93713	1	67	3	23.5	1	360	0	0.778	1	660.5	2.35	NA	13.33	13.33	0.000	673.83	2.350
93714	2	54	4	39.9	2	NA	NA	0.994	0	776.0	5.65	NA	NA	0.00	0.000	776.00	5.650

# Verify the new variables
summary(bmd2 %>% select(calcium, DSQTCALC, calcium_total, vitd, DSQTVD, vitd_total, BMXBMI, totmet))

##     calcium          DSQTCALC       calcium_total         vitd       
##  Min.   :  55.0   Min.   :   0.74   Min.   :  55.0   Min.   : 0.000  
##  1st Qu.: 525.0   1st Qu.:  93.33   1st Qu.: 596.8   1st Qu.: 1.750  
##  Median : 773.0   Median : 211.67   Median : 895.5   Median : 3.200  
##  Mean   : 840.5   Mean   : 357.98   Mean   : 997.6   Mean   : 4.471  
##  3rd Qu.:1049.0   3rd Qu.: 500.00   3rd Qu.:1285.0   3rd Qu.: 5.500  
##  Max.   :5199.0   Max.   :3220.00   Max.   :5399.0   Max.   :57.900  
##  NA's   :293      NA's   :1633      NA's   :293      NA's   :293     
##      DSQTVD           vitd_total           BMXBMI          totmet    
##  Min.   :   0.004   Min.   :   0.000   Min.   :14.20   Min.   :   0  
##  1st Qu.:  15.709   1st Qu.:   2.900   1st Qu.:25.20   1st Qu.: 240  
##  Median :  25.000   Median :   8.533   Median :28.70   Median : 480  
##  Mean   :  49.851   Mean   :  28.676   Mean   :29.76   Mean   :1015  
##  3rd Qu.:  50.000   3rd Qu.:  30.400   3rd Qu.:33.40   3rd Qu.:1430  
##  Max.   :2570.000   Max.   :2574.650   Max.   :84.40   Max.   :9120  
##  NA's   :1515       NA's   :293        NA's   :64      NA's   :964

# Check for any remaining NAs
cat("\nMissing values in total intake variables:\n")

## 
## Missing values in total intake variables:

colSums(is.na(bmd2 %>% select(calcium_total, vitd_total, BMXBMI, totmet)))

## calcium_total    vitd_total        BMXBMI        totmet 
##           293           293            64           964

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# Create histograms for all continuous variables
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "BMI Distribution", x = "BMI", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Calcium Distribution", x = "Calcium (mg)", y = "Frequency") +
  theme_minimal()

p3 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Vitamin D Distribution", x = "Vitamin D (mcg)", y = "Frequency") +
  theme_minimal()

p4 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total MET Distribution", x = "MET (min/week)", y = "Frequency") +
  theme_minimal()

## Combine plots in a 2x2 grid
(p1 | p2) / (p3 | p4)

# Display all plots together
print(p1)

print(p2)

print(p3)

print(p4)

# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases) ###1765 complete cases

## Complete cases for all variables: 1765

*Interpretation**

BMI Distribution (p1): The right-skewed distribution with a long tail extends to higher BMI values and most data are clustered around 25-30, but some extend to 50-60.Hence, we will apply LOG transformation because right-skewed tails needs correction and since there are no zeros in BMI data, so log is safe to use. Use of Log will compress the right tail and make distribution more symmetric for better analysis.
Total Calcium Distribution (p2):The plot is highly right-skewed with most values clustered at the low end (0-1000 mg) and has long right tail extending to 4000+ mg. There is some spike near zero. We will apply LOG transformation here as well due to the extreme right-skewed distribution with potential zero values. Use of Log will compress the extreme high values and create more symmetry in the data.
Total Vitamin D Distribution (p3):The plot shows extremely right-skew, massive spike at zero. We will use SQUARE ROOT transformation because there are many zeros present (people without supplements), square root will handle zero without adding constants, as it is appropriate for heavily zero-inflated data.
Total MET Distribution (p4):Fourth plot shows another right-skewed, with large data at zero (sedentary people). Hence, similar to P3, we will use SQUARE ROOT transformation due to data have many zero values and right skewed.

Based on your assessment, apply transformations as needed:

# Apply transformations based on visual assessment
bmd2 <- bmd2 %>%
  mutate(
    bmi_log = log(BMXBMI),
    calcium_log = log(calcium_total + 1),
    vitd_sqrt = sqrt(vitd_total),
    met_sqrt = sqrt(totmet)
  )

# Check that variables were created
names(bmd2)

##  [1] "SEQN"          "RIAGENDR"      "RIDAGEYR"      "RIDRETH1"     
##  [5] "BMXBMI"        "smoker"        "totmet"        "metcat"       
##  [9] "DXXOFBMD"      "tbmdcat"       "calcium"       "vitd"         
## [13] "DSQTVD"        "DSQTCALC"      "DSQTCALC_0"    "DSQTVD_0"     
## [17] "calcium_total" "vitd_total"    "bmi_log"       "calcium_log"  
## [21] "vitd_sqrt"     "met_sqrt"

# Check missing data for each variable
sum(is.na(bmd2$BMXBMI))

## [1] 64

sum(is.na(bmd2$calcium_total))

## [1] 293

sum(is.na(bmd2$vitd_total))

## [1] 293

sum(is.na(bmd2$totmet))

## [1] 964

# Create comparison plots
# BMI
p5 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(fill ="red", color = "black", bins = 30) +
  labs(title = "Original BMI (Right-Skewed)", x = "BMI") +
  theme_minimal()

p6 <- ggplot(bmd2, aes(x = bmi_log)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Log-Transformed BMI (More Normal)", x = "log(BMI)") +
  theme_minimal()

print(p5)

print(p6)

# Calcium
p7 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(fill = "red", color = "black", bins = 30) +
  labs(title = "Original Calcium (Highly Skewed)", x = "Calcium (mg)") +
  theme_minimal()

p8 <- ggplot(bmd2, aes(x = calcium_log)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Log-Transformed Calcium (Improved)", x = "log(Calcium+1)") +
  theme_minimal()

print(p7)

print(p8)

# Vitamin D
p9 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(fill = "red", color = "black", bins = 30) +
  labs(title = "Original Vitamin D (Highly Skewed)", x = "Vitamin D ") +
  theme_minimal()

p10 <- ggplot(bmd2, aes(x = vitd_sqrt)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Log-Transformed Vitamin D (Improved)", x = "(Sqrt Vit D +1)") +
  theme_minimal()

print(p9)

print(p10)

# Total MET
p11 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(fill = "red", color = "black", bins = 30) +
  labs(title = "Total MET (Highly Skewed)", x = "Total MET ") +
  theme_minimal()

p12 <- ggplot(bmd2, aes(x = met_sqrt)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Total MET (Improved)", x = "(Sqrt Total MET +1)") +
  theme_minimal()

print(p11)

print(p12)

I applied log transformations to BMI and total calcium intake because both variables showed right-skewed distributions with positive skewness values (>1), indicating heavy right tails. The log transformation helps normalize these distributions and stabilize variance. For vitamin D and physical activity (MET), I applied square root transformations to address right skewness while preserving interpretability better than log transformations.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

# YOUR CODE HERE

tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "bmi_log", "DXXOFBMD"),
  corr_pair(bmd2, "calcium_log", "DXXOFBMD"),
  corr_pair(bmd2, "vitd_sqrt", "DXXOFBMD"),
  corr_pair(bmd2, "met_sqrt", "DXXOFBMD")
) %>%
  mutate(
    Variable = c("Age (years)", "BMI (log)", "Total Calcium (log)", 
                 "Total Vitamin D (sqrt)", "Physical Activity"),
    r = round(estimate, 3),
    p_value = signif(p_value, 3),
    Significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(Variable, r, p_value, Significant, n)

tableA %>% kable(caption = "Table A: Correlations between Predictors and BMD (Outcome)")

Table A: Correlations between Predictors and BMD (Outcome)
Variable	r	p_value	Significant	n
Age (years)	-0.232	0.00e+00	Yes	2286
BMI (log)	0.425	0.00e+00	Yes	2275
Total Calcium (log)	0.012	5.92e-01	No	2129
Total Vitamin D (sqrt)	-0.062	4.26e-03	Yes	2129
Physical Activity	0.102	4.59e-05	Yes	1588

# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

Interpret the results:

All four variables showed statistically significant correlations with BMD (p < 0.05) except BMI. Age demonstrated a moderate negative correlation (r = -0.232), indicating that older participants tended to have lower BMD. BMI showed a strong positive correlation (r = 0.425), suggesting higher body mass index is associated with greater bone mineral density. Total calcium intake had a weak correlation (r = 0.012). Total vitamin D intake showed a weak negative correlation (r = -0.062), which is unexpected but could be due to residual confounding. Physical activity demonstrated a weak positive correlation (r = 0.102), supporting the known beneficial effects of exercise on bone health.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# YOUR CODE HERE

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "met_sqrt"),
  corr_pair(bmd2, "bmi_log", "met_sqrt"),
  corr_pair(bmd2, "calcium_log", "met_sqrt"),
  corr_pair(bmd2, "vitd_sqrt", "met_sqrt"),
) %>%
  mutate(
    Variable = c("Age (years)", "BMI (log)", "Total Calcium (log)", 
                 "Total Vitamin D (sqrt)"),
    r = round(estimate, 3),
    p_value = signif(p_value, 3),
    Significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(Variable, r, p_value, Significant, n)

tableB %>% kable(caption = "Table B: Correlations between Potential Cofounders and Physical Activity")

Table B: Correlations between Potential Cofounders and Physical Activity
Variable	r	p_value	Significant	n
Age (years)	-0.097	1.96e-05	Yes	1934
BMI (log)	0.001	9.51e-01	No	1912
Total Calcium (log)	0.086	2.82e-04	Yes	1777
Total Vitamin D (sqrt)	-0.002	9.32e-01	No	1777

# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

Interpret the results:

Based on Table B, age and calcium intake show the strongest evidence of being confounders, as they are significantly associated with physical activity. BMI and vitamin D may still be important covariates if they’re associated with BMD, but they don’t appear to confound the exposure-outcome relationship based on their weak associations with physical activity.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

### TABLE C: Pairwise correlations among predictors (multicollinearity assessment)

# Define predictor variables
preds <- c("RIDAGEYR", "bmi_log", "calcium_log", "vitd_sqrt")
pred_labels <- c("Age (years)", "BMI (log)", "Total Calcium (log)", 
                 "Total Vitamin D (sqrt)")

# Generate all pairwise combinations
pairs_list <- combn(preds, 2, simplify = FALSE)

# Create correlation table
tableC <- map_dfr(pairs_list, ~corr_pair(bmd2, .x[1], .x[2])) %>%
  mutate(
    
# Create proper labels for each pair
    Pair_Index = 1:n(),
    Variable_1 = c("Age (years)", "Age (years)", "Age (years)",
                   "BMI (log)", "BMI (log)",
                   "Total Calcium (log)"),
    Variable_2 = c("BMI (log)", "Total Calcium (log)", "Total Vitamin D (sqrt)",
                   "Total Calcium (log)", "Total Vitamin D (sqrt)",
                   "Total Vitamin D (sqrt)"),
    r = round(estimate, 3),
    p_value = signif(p_value, 3),
    Abs_r = abs(r),
    Significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  arrange(desc(Abs_r)) %>%
  select(Variable_1, Variable_2, r, p_value, Significant, n)

# Display table
tableC %>% 
  kable(caption = "Table C: Pairwise Correlations Among Predictors (Multicollinearity Assessment)")

Table C: Pairwise Correlations Among Predictors (Multicollinearity Assessment)
Variable_1	Variable_2	r	p_value	Significant	n
Total Calcium (log)	Total Vitamin D (sqrt)	0.314	0.00e+00	Yes	2605
Age (years)	Total Vitamin D (sqrt)	0.153	0.00e+00	Yes	2605
Age (years)	BMI (log)	-0.098	2.00e-07	Yes	2834
Age (years)	Total Calcium (log)	0.048	1.35e-02	Yes	2605
BMI (log)	Total Vitamin D (sqrt)	0.007	7.31e-01	No	2569
BMI (log)	Total Calcium (log)	0.000	9.83e-01	No	2569

Interpret the results:

[YOUR INTERPRETATION: Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses?]

All pairwise correlations among predictors were weak to moderate. The strongest correlation is between total calcium intake and total vitamin D intake (r = 0.314, p < 0.001). Individuals who supplement their diet with one nutrient often supplement with the other. However, this correlation is still below the multicollinearity threshold of r= 0.7. Age show very weak correlations with BMI (r = -0.098), calcium intake (r = 0.048), and vitamin D intake (r = 0.153). BMI is not correlated with both calcium (r = 0.000) and vitamin D (r = 0.007) intake.

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

When would you use ANOVA versus correlation in epidemiological research?

ANOVA and correlation analysis are complementary but also have distinct purposes in research and should be selected based on the research question and study design. ANOVA is used to test whether categorical variables predict differences in a continuous outcome by comparing mean values across groups, while correlation analysis examines the strength and direction of linear relationships between two continuous variables.

What are the key differences in what these methods tell us?

The key methodological difference is that ANOVA tells us if group differences exist (via p-value, the null hypothesis of equal means) and how much variance is explained overall, while correlation quantifies the strength of association between two variables on a -1 to +1 scale. ANOVA assumes categorical, while correlation does not require such. In epidemiology, ANOVA results often prompt post-hoc comparisons to identify specific group differences, whereas correlation results lead to regression modeling to adjust for confounders.

Give one example of a research question better suited to ANOVA and one better suited to correlation.

In epidemiology, ANOVA is ideal when your primary exposure of interest is categorical (e.g., ethnicity, disease status, treatment groups), and we want to determine whether significant differences in health outcomes exist across these groups. For example, our research question “Do mean BMD values differ across ethnicity groups?” for ANOVA because ethnicity is categorical with five distinct groups.

Correlation analysis is better when both variables are continuous or when we want to quantify the strength of associations. An example research question would be: “What is the linear relationship between dietary calcium intake and BMD?” where both variables are measured on continuous scales.

B. Assumptions and Limitations

Which assumptions were most challenging to meet in this assignment?

The most challenging assumption to fully satisfy in this assignment was normality of residuals, particularly given the distribution of nutrient intake variables (calcium and vitamin D), which are extremely right-skewed.

How might violations of assumptions affect your conclusions? Violations of assumptions can affect conclusions in meaningful ways. For instance, if we had not addressed non-normality in the correlation analyses, confidence intervals might be too narrow and p-values unreliable, potentially leading to false discoveries.
What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

cannot establish temporality
Reverse causality 3.healthy user bias
recall bias in dietary intake assessment

C. R Programming Challenge

What was the most difficult part of the R coding for this assignment?

The most difficult part of the R coding involved creating the Tukey HSD post-hoc table and the tidying up the tables. Somehow that always tripped me over and over. Initially, transformation part- interpreting when and why to apply transformations was the most difficult.

How did you solve it? (Describe your problem-solving process)

My problem-solving process was: 1) reading error message carefully 2) try to debug with more information 3) test each part separately 4) then i looked for other resources.

What R skill did you strengthen through completing this assignment?

multiple ways of use like mutate (), filter(), select()
handled missing datas
data transforming
coorelation and anova
debugging (a lot)
data visualization.

YOUR REFLECTION HERE (200–400 words)

[Write your reflection addressing all three components above]

This assignment really helped me see how ANOVA and correlation serve different but complementary roles in epidemiologic analysis. ANOVA help examine whether mean BMD differed across ethnic groups, which makes sense when the exposure is categorical. Correlation, on the other hand, allowed us to assess the strength of linear relationships between continuous variables such as calcium intake and physical activity. Working through both approaches clarified for me that the method depends entirely on the research question—ANOVA asks whether groups differ, while correlation asks how closely two variables move together.

One of the more difficult aspects was checking and addressing statistical assumptions, especially normality for skewed variables. Applying log and square root transformations required understanding why skewed distributions can distort regression estimates and inference. Even after transformations, I was thinking cross-sectional data come with limitations. We can observe associations, but we cannot establish temporality or causality—for example, whether low calcium intake contributes to bone loss or whether individuals with lower BMD modify their diet.

The R coding portion was particularly challenging. Creating clean Tukey HSD tables and debugging syntax errors in longer piped commands pushed me out of my comfort zone. At times it felt overwhelming. Over time, though, I shifted from reacting to error messages with frustration to approaching them more systematically—breaking code into smaller pieces, checking objects step by step, and reading documentation more carefully. I strengthened practical skills in data understanding, handling missing values, applying transformations, and producing clear visualizations. The workload was substantial and at moments genuinely felt tough, but working through those challenges made the concepts stick more deeply and improved both my analytical confidence and technical fluency. Hopefully!

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Suruchi

2026-02-20

Homework 1 overview

File organization and naming (required)

Data description

Part 0 — Load and prepare the data (10 points)

0.1 Import

0.2 Recode to readable factors

0.3 Missingness + analytic sample

1.1 Hypotheses (required)

1.2 Exploratory plot and group summaries

1.3 Fit ANOVA model + report the F-test

1.4 Assumption checks (normality + equal variances)

Normality of residuals

Equal variances (Levene’s test)

1.5 Post-hoc comparisons (pairwise tests)

1.6 Effect size (eta-squared)

Part 2 — Correlation analysis (40 points)

2.1 Create total intake variables

2.2 Decide on transformations (show evidence)

2.3 Create three correlation tables (Pearson)

Helper function for correlation pairs

Table A: Variables associated with the outcome (BMD)

Table B: Variables associated with the exposure (MET)

Table C: Predictors associated with each other

Part 3 — Reflection (10 points)

data visualization.

Submission checklist