Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

  1. Your .Rmd file
  2. Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.


File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname


Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable Description Type
DXXOFBMD Total femur bone mineral density Continuous (g/cm²)
RIDRETH1 Race/ethnicity Categorical (1–5)
RIAGENDR Sex Categorical (1=Male, 2=Female)
RIDAGEYR Age in years Continuous
BMXBMI Body mass index Continuous (kg/m²)
smoker Smoking status Categorical (1=Current, 2=Past, 3=Never)
calcium Dietary calcium intake Continuous (mg/day)
DSQTCALC Supplemental calcium Continuous (mg/day)
vitd Dietary vitamin D intake Continuous (mcg/day)
DSQTVD Supplemental vitamin D Continuous (mcg/day)
totmet Total physical activity (MET-min/week) Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
library (knitr)
library (ggplot2)
library (readr)
library(patchwork)
library (ggstats)
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Examine the first few rows
head(bmd, n = 10)
## # A tibble: 10 × 14
##     SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat
##    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>
##  1 93705        2       66        4   31.7      2    240      0    1.06        0
##  2 93708        2       66        5   23.7      3    120      0    0.801       1
##  3 93709        2       75        4   38.9      1    720      1    0.88        0
##  4 93711        1       56        5   21.3      3    840      1    0.851       1
##  5 93713        1       67        3   23.5      1    360      0    0.778       1
##  6 93714        2       54        4   39.9      2     NA     NA    0.994       0
##  7 93715        1       71        5   22.5      1   6320      2    0.952       0
##  8 93716        1       61        5   30.7      2   2400      2    1.12        0
##  9 93721        2       60        1   35.9      3     NA     NA   NA          NA
## 10 93722        2       60        3   23.8      3     NA     NA    0.752       1
## # ℹ 4 more variables: calcium <dbl>, vitd <dbl>, DSQTVD <dbl>, DSQTCALC <dbl>
# View data structure
str(bmd)
## spc_tbl_ [2,898 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SEQN    : num [1:2898] 93705 93708 93709 93711 93713 ...
##  $ RIAGENDR: num [1:2898] 2 2 2 1 1 2 1 1 2 2 ...
##  $ RIDAGEYR: num [1:2898] 66 66 75 56 67 54 71 61 60 60 ...
##  $ RIDRETH1: num [1:2898] 4 5 4 5 3 4 5 5 1 3 ...
##  $ BMXBMI  : num [1:2898] 31.7 23.7 38.9 21.3 23.5 39.9 22.5 30.7 35.9 23.8 ...
##  $ smoker  : num [1:2898] 2 3 1 3 1 2 1 2 3 3 ...
##  $ totmet  : num [1:2898] 240 120 720 840 360 NA 6320 2400 NA NA ...
##  $ metcat  : num [1:2898] 0 0 1 1 0 NA 2 2 NA NA ...
##  $ DXXOFBMD: num [1:2898] 1.058 0.801 0.88 0.851 0.778 ...
##  $ tbmdcat : num [1:2898] 0 1 0 1 1 0 0 0 NA 1 ...
##  $ calcium : num [1:2898] 504 474 NA 1248 660 ...
##  $ vitd    : num [1:2898] 1.85 5.85 NA 3.85 2.35 5.65 3.75 4.45 6.05 6.45 ...
##  $ DSQTVD  : num [1:2898] 20.6 25 NA 25 NA ...
##  $ DSQTCALC: num [1:2898] 211.7 820 NA 35 13.3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SEQN = col_double(),
##   ..   RIAGENDR = col_double(),
##   ..   RIDAGEYR = col_double(),
##   ..   RIDRETH1 = col_double(),
##   ..   BMXBMI = col_double(),
##   ..   smoker = col_double(),
##   ..   totmet = col_double(),
##   ..   metcat = col_double(),
##   ..   DXXOFBMD = col_double(),
##   ..   tbmdcat = col_double(),
##   ..   calcium = col_double(),
##   ..   vitd = col_double(),
##   ..   DSQTVD = col_double(),
##   ..   DSQTCALC = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Dimensions: rows (observations) and columns (variables)
dim(bmd)
## [1] 2898   14
# What variables do we have?
names(bmd)
##  [1] "SEQN"     "RIAGENDR" "RIDAGEYR" "RIDRETH1" "BMXBMI"   "smoker"  
##  [7] "totmet"   "metcat"   "DXXOFBMD" "tbmdcat"  "calcium"  "vitd"    
## [13] "DSQTVD"   "DSQTCALC"

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd_data <- bmd %>%
  mutate(
    ethnic_category = case_when(
      RIDRETH1 == 1 ~ "Mexican American",
      RIDRETH1 == 2 ~ "Other Hispanic",
      RIDRETH1 == 3 ~ "Non-Hispanic White",
      RIDRETH1 == 4 ~ "Non-Hispanic Black",
      RIDRETH1 == 5 ~ "Other (including Multi)",
    ),
    gender_category = case_when(
      RIAGENDR == 1 ~ "Male",
      RIAGENDR == 2 ~ "Female",
    ),
    
    smoke_category = case_when(
    smoker == 1 ~ "Current smoker",
    smoker == 2 ~ "Past smoker",
    smoker == 3 ~ "Never smoked",
    ),
  )

# Check sample sizes
table(bmd_data$ethnic_category)
## 
##        Mexican American      Non-Hispanic Black      Non-Hispanic White 
##                     331                     696                    1086 
## Other (including Multi)          Other Hispanic 
##                     499                     286
table(bmd_data$gender_category)
## 
## Female   Male 
##   1467   1431
table(bmd_data$smoke_category)
## 
## Current smoker   Never smoked    Past smoker 
##            449           1553            894
# Display first few rows
head(bmd_data) %>% 
  kable(caption = " BMD Dataset with recoded variables (first 6 rows)")
BMD Dataset with recoded variables (first 6 rows)
SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat calcium vitd DSQTVD DSQTCALC ethnic_category gender_category smoke_category
93705 2 66 4 31.7 2 240 0 1.058 0 503.5 1.85 20.557 211.67 Non-Hispanic Black Female Past smoker
93708 2 66 5 23.7 3 120 0 0.801 1 473.5 5.85 25.000 820.00 Other (including Multi) Female Never smoked
93709 2 75 4 38.9 1 720 1 0.880 0 NA NA NA NA Non-Hispanic Black Female Current smoker
93711 1 56 5 21.3 3 840 1 0.851 1 1248.5 3.85 25.000 35.00 Other (including Multi) Male Never smoked
93713 1 67 3 23.5 1 360 0 0.778 1 660.5 2.35 NA 13.33 Non-Hispanic White Male Current smoker
93714 2 54 4 39.9 2 NA NA 0.994 0 776.0 5.65 NA NA Non-Hispanic Black Female Past smoker

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd_data)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd_data %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnic_category))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total
## [1] 2898
n_anova
## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

[YOUR TEXT HERE: Describe the total sample, how many observations were removed due to missing data, and what the final analytic sample size is for the ANOVA analysis.] ## Analytic Sample Summary

Write 2–4 sentences summarizing your analytic sample here. The total sample consisted of 2,898 participants whose original data source is the National Health and Nutrition Examination Survey (NHANES). After removing participants with missing bone mineral density (BMD) or ethnicity data, 612 observations were excluded, resulting in a final analytic sample of 2,286 participants for the ANOVA analysis. This represents 78.9% of the total sample. The analytic sample includes adults with complete BMD measurements distributed across five ethnicity categories: Mexican American (n=331), Other Hispanic (n=286), Non-Hispanic White (n=1,086), Non-Hispanic Black (n=696), and Other (including Multi-racial, n=499). —

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

  • H₀: μ_ethniccategory1 = μ_ethniccategory2 = μ_ethniccategory3 = μ_ethniccategory4 = μ_ethniccategory3

  • Hₐ: At least one population group mean differs from the others. At least one μi ≠ μj

Plain language:

  • H₀: All five ethnicity groups have equal mean bone mineral density.
  • Hₐ: At least one ethnicity group has a different mean BMD.

1.2 Exploratory plot and group summaries

###Fit the anova table

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnic_category) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)
ethnic_category n mean_bmd sd_bmd
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other (including Multi) 409 0.897 0.148
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnic_category, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

All five ethnic group seem to have similar median BMD values at around 1 gm/cm2. Non- hispanic black have slightly higher median BMD whereas non-hispanic white has slightly lower median BMD. Outliers are present in all groups, mostly below.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

  • Outcome: BMD (DXXOFBMD)
  • Predictor: Ethnicity (5 groups)
# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnic_category, data = anova_df)

# Display ANOVA table
summary(fit_aov)
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## ethnic_category    4   2.95  0.7371   30.18 <2e-16 ***
## Residuals       2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)
term df sumsq meansq statistic p.value
ethnic_category 4 2.9482 0.7371 30.185 0
Residuals 2281 55.6973 0.0244 NA NA

Write your interpretation here:

[YOUR INTERPRETATION: State your decision (reject or fail to reject H₀) and what this means for ethnic differences in BMD.]

The one-way ANOVA shows a statistically significant difference in mean BMD across ethnic categories (F(4, 2281) = 30.19, p < 0.001). Because the p-value is far below 0.05 ( <2e-16), we reject the null hypothesis (H₀) that mean BMD is equal across all ethnic groups. There is strong statistical evidence that at least one ethnic group has a different mean BMD compared to the others. However, since ANOVA does not tell us which specific groups differ, we will use post-hoc tests (e.g., Tukey’s HSD) to determine where those differences lie.

This could mean meaningful ethnic variation in BMD that we could look into biological, social, or structural determinants.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnic_category, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281
# Conduct Levene's test in neater table
levene_test <- leveneTest(DXXOFBMD ~ ethnic_category, data = anova_df)

# Display results
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281
# Create formatted table
levene_table <- data.frame(
  Test = "Levene's Test",
  Df1 = levene_test$Df[1],
  Df2 = levene_test$Df[2],
  `F value` = round(levene_test$`F value`[1], 3),
  `p-value` = round(levene_test$`Pr(>F)`[1], 4)
)

kable(levene_table,
      caption = "Levene's Test for Homogeneity of Variance",
      col.names = c("Test", "df1", "df2", "F-statistic", "p-value"))
Levene’s Test for Homogeneity of Variance
Test df1 df2 F-statistic p-value
Levene’s Test 4 2281 1.573 0.1788

Summarize what you see in 2–4 sentences:

YOUR INTERPRETATION: Based on the diagnostic plots and levene test, we can see the following: Independence: Observations are independent (assumed based on study design) 1. For Normality,the QQ plot show that the residuals follow normal distribution line with only few deviation at each tails. Most lie on diagonal line neatly, so normality is satisfied for this sample and n is also fairly large.

  1. Homogeniety: From the Levene test, we can see F(4, 2281) = 1.57 and P value is 0.179. Since p > 0.05, we fail to reject the null hypothesis of equal variances.Hence the variances differ across ethnic categories so the assumption is met.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

  • Pairwise comparisons
  • Mean differences
  • 95% confidence intervals
  • Adjusted p-values
# Conduct Tukey HSD test
tukey_results <- TukeyHSD(fit_aov)
print(tukey_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnic_category, data = anova_df)
## 
## $ethnic_category
##                                                    diff         lwr
## Non-Hispanic Black-Mexican American         0.022391619 -0.01010005
## Non-Hispanic White-Mexican American        -0.062377249 -0.09285262
## Other (including Multi)-Mexican American   -0.053319344 -0.08735725
## Other Hispanic-Mexican American            -0.004441273 -0.04264413
## Non-Hispanic White-Non-Hispanic Black      -0.084768868 -0.10837333
## Other (including Multi)-Non-Hispanic Black -0.075710963 -0.10376452
## Other Hispanic-Non-Hispanic Black          -0.026832892 -0.05981593
## Other (including Multi)-Non-Hispanic White  0.009057905 -0.01663337
## Other Hispanic-Non-Hispanic White           0.057935976  0.02693726
## Other Hispanic-Other (including Multi)      0.048878071  0.01437080
##                                                     upr     p adj
## Non-Hispanic Black-Mexican American         0.054883289 0.3276613
## Non-Hispanic White-Mexican American        -0.031901881 0.0000003
## Other (including Multi)-Mexican American   -0.019281435 0.0001919
## Other Hispanic-Mexican American             0.033761585 0.9978049
## Non-Hispanic White-Non-Hispanic Black      -0.061164402 0.0000000
## Other (including Multi)-Non-Hispanic Black -0.047657407 0.0000000
## Other Hispanic-Non-Hispanic Black           0.006150151 0.1722466
## Other (including Multi)-Non-Hispanic White  0.034749177 0.8719109
## Other Hispanic-Non-Hispanic White           0.088934694 0.0000036
## Other Hispanic-Other (including Multi)      0.083385341 0.0010720
# Visualize the confidence intervals
plot(tukey_results, las = 0)

Create and present a readable table (you can tidy the Tukey output):

# Creating a readable table from Tukey results

# First, tidy the Tukey results
tukey_results <- broom::tidy(tukey_results)

# Then create the readable table with proper renaming
tukey_table <- tukey_results %>%
  rename(
    Comparison = contrast,
    Mean_Difference = estimate,
    Lower_CI = conf.low,
    Upper_CI = conf.high,
    Adj_p_value = adj.p.value
  ) %>%
  
  ##filtering for only significant results 
  
  mutate(
    Significant = ifelse(Adj_p_value < 0.05, "Yes *", "No"),
    Mean_Difference = round(Mean_Difference, 4),
    Lower_CI = round(Lower_CI, 4),
    Upper_CI = round(Upper_CI, 4),
    Adj_p_value = signif(Adj_p_value, 4)
  ) %>%
  select(Comparison, Mean_Difference, Lower_CI, Upper_CI, Adj_p_value, Significant)

# Display all comparisons
tukey_table %>% 
  kable(caption = "Tukey HSD Post-hoc Pairwise Comparisons of BMD by Ethnicity")
Tukey HSD Post-hoc Pairwise Comparisons of BMD by Ethnicity
Comparison Mean_Difference Lower_CI Upper_CI Adj_p_value Significant
Non-Hispanic Black-Mexican American 0.0224 -0.0101 0.0549 0.3277000 No
Non-Hispanic White-Mexican American -0.0624 -0.0929 -0.0319 0.0000003 Yes *
Other (including Multi)-Mexican American -0.0533 -0.0874 -0.0193 0.0001919 Yes *
Other Hispanic-Mexican American -0.0044 -0.0426 0.0338 0.9978000 No
Non-Hispanic White-Non-Hispanic Black -0.0848 -0.1084 -0.0612 0.0000000 Yes *
Other (including Multi)-Non-Hispanic Black -0.0757 -0.1038 -0.0477 0.0000000 Yes *
Other Hispanic-Non-Hispanic Black -0.0268 -0.0598 0.0062 0.1722000 No
Other (including Multi)-Non-Hispanic White 0.0091 -0.0166 0.0347 0.8719000 No
Other Hispanic-Non-Hispanic White 0.0579 0.0269 0.0889 0.0000036 Yes *
Other Hispanic-Other (including Multi) 0.0489 0.0144 0.0834 0.0010720 Yes *
# Filter for only significant comparisons
sig_comparisons <- tukey_table %>%
  filter(Significant == "Yes *")

sig_comparisons %>% 
  kable(caption = "Statistically Significant Pairwise Differences (p < 0.05)")
Statistically Significant Pairwise Differences (p < 0.05)
Comparison Mean_Difference Lower_CI Upper_CI Adj_p_value Significant
Non-Hispanic White-Mexican American -0.0624 -0.0929 -0.0319 0.0000003 Yes *
Other (including Multi)-Mexican American -0.0533 -0.0874 -0.0193 0.0001919 Yes *
Non-Hispanic White-Non-Hispanic Black -0.0848 -0.1084 -0.0612 0.0000000 Yes *
Other (including Multi)-Non-Hispanic Black -0.0757 -0.1038 -0.0477 0.0000000 Yes *
Other Hispanic-Non-Hispanic White 0.0579 0.0269 0.0889 0.0000036 Yes *
Other Hispanic-Other (including Multi) 0.0489 0.0144 0.0834 0.0010720 Yes *

Write a short paragraph: The Tukey HSD post-hoc test reveals statistically significant pairwise differences in mean BMD across ethnicity groups. These findings suggest that ethnicity is a meaningful predictor of bone mineral density, with Non-Hispanic Black individuals demonstrating the highest BMD values in this sample.

INTERPRETATION: Post-hoc Pairwise Comparisons

The Tukey HSD post-hoc test revealed 6 out of 10 pairwise comparisons showed statistically significant differences in mean BMD (adjusted p < 0.05): Non-Hispanic White vs. other groups:

Non-Hispanic White had significantly lower BMD than Mexican American (diff = -0.062, 95% CI [-0.093, -0.032], p < 0.001) Non-Hispanic White had significantly lower BMD than Non-Hispanic Black (diff = -0.085, 95% CI [-0.108, -0.061], p < 0.001) Non-Hispanic White had significantly lower BMD than Other Hispanic (diff = -0.058, 95% CI [-0.089, -0.027], p < 0.001)

Other (including Multi-Racial) vs. other groups:

Other (including Multi) had significantly lower BMD than Mexican American (diff = -0.053, 95% CI [-0.087, -0.019], p < 0.001) Other (including Multi) had significantly lower BMD than Non-Hispanic Black (diff = -0.076, 95% CI [-0.104, -0.048], p < 0.001) Other (including Multi) had significantly lower BMD than Other Hispanic (diff = -0.049, 95% CI [-0.083, -0.014], p = 0.001)

Key findings: Non-Hispanic Black and Mexican American participants had the highest mean BMD, while Non-Hispanic White and Other (including Multi-Racial) groups had the lowest. Other Hispanic fell in between. No significant difference was found between Non-Hispanic Black and Mexican American (p = 0.328), nor between Non-Hispanic White and Other (including Multi) (p = 0.872).

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Hint: Extract Sum Sq from the ANOVA summary

# Extract sum of squares from ANOVA table
anova_summary <- summary(fit_aov)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")
## Eta-squared (η²): 0.0503
cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")
## Percentage of variance explained: 5.03 %
# Create formatted output
effect_size_table <- data.frame(
  Measure = "Eta-squared (η²)",
  Value = round(eta_squared, 4),
  `Percent Variance` = paste0(round(eta_squared * 100, 2), "%"),
  Interpretation = "Small to Medium Effect"
)

kable(effect_size_table,
      caption = "Effect Size: Proportion of Variance Explained by Ethnicity",
      col.names = c("Effect Size Measure", "Value", "% Variance Explained", "Interpretation"))
Effect Size: Proportion of Variance Explained by Ethnicity
Effect Size Measure Value % Variance Explained Interpretation
Eta-squared (η²) 0.0503 5.03% Small to Medium Effect

Effect Size Interpretation: η² = r round(eta_squared, 4) or r round(eta_squared * 100, 2)% Small

  • Effect size guidelines: Small effect: η² ≈ 0.01 (1%) Medium effect: η² ≈ 0.06 (6%) Large effect: η² ≈ 0.14 (14%)

  • Our effect: Table: Effect Size: Proportion of Variance Explained by Ethnicity

Effect Size Measure Value % Variance Explained Interpretation
Eta-squared (η²) 0.0503 5.03% Small to Medium Effect

Interpret in 1–2 sentences: Ethnicity explains about 5.0% of the variance in bone mineral density (η² = 0.050), which is a small to moderate effect according to Cohen’s guidelines (small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14). While the ANOVA test was statistically significant with a large sample size, and ethnicity does predict BMD, other factors (such as genetics, lifestyle, nutrition, and physical activity) may influence bone density in large part. —

Part 2 — Correlation analysis (40 points)

In this section you will:

  1. Create total intake variables (dietary + supplemental)
  2. Assess whether transformations are needed
  3. Produce three correlation tables examining relationships between:
    • Table A: Predictors vs. outcome (BMD)
    • Table B: Predictors vs. exposure (physical activity)
    • Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

# Display first few rows
head(bmd2) %>% 
  kable(caption = " BMD Dataset 2 with mutated variables (first 6 rows)")
BMD Dataset 2 with mutated variables (first 6 rows)
SEQN RIAGENDR RIDAGEYR RIDRETH1 BMXBMI smoker totmet metcat DXXOFBMD tbmdcat calcium vitd DSQTVD DSQTCALC DSQTCALC_0 DSQTVD_0 calcium_total vitd_total
93705 2 66 4 31.7 2 240 0 1.058 0 503.5 1.85 20.557 211.67 211.67 20.557 715.17 22.407
93708 2 66 5 23.7 3 120 0 0.801 1 473.5 5.85 25.000 820.00 820.00 25.000 1293.50 30.850
93709 2 75 4 38.9 1 720 1 0.880 0 NA NA NA NA 0.00 0.000 NA NA
93711 1 56 5 21.3 3 840 1 0.851 1 1248.5 3.85 25.000 35.00 35.00 25.000 1283.50 28.850
93713 1 67 3 23.5 1 360 0 0.778 1 660.5 2.35 NA 13.33 13.33 0.000 673.83 2.350
93714 2 54 4 39.9 2 NA NA 0.994 0 776.0 5.65 NA NA 0.00 0.000 776.00 5.650
# Verify the new variables
summary(bmd2 %>% select(calcium, DSQTCALC, calcium_total, vitd, DSQTVD, vitd_total, BMXBMI, totmet))
##     calcium          DSQTCALC       calcium_total         vitd       
##  Min.   :  55.0   Min.   :   0.74   Min.   :  55.0   Min.   : 0.000  
##  1st Qu.: 525.0   1st Qu.:  93.33   1st Qu.: 596.8   1st Qu.: 1.750  
##  Median : 773.0   Median : 211.67   Median : 895.5   Median : 3.200  
##  Mean   : 840.5   Mean   : 357.98   Mean   : 997.6   Mean   : 4.471  
##  3rd Qu.:1049.0   3rd Qu.: 500.00   3rd Qu.:1285.0   3rd Qu.: 5.500  
##  Max.   :5199.0   Max.   :3220.00   Max.   :5399.0   Max.   :57.900  
##  NA's   :293      NA's   :1633      NA's   :293      NA's   :293     
##      DSQTVD           vitd_total           BMXBMI          totmet    
##  Min.   :   0.004   Min.   :   0.000   Min.   :14.20   Min.   :   0  
##  1st Qu.:  15.709   1st Qu.:   2.900   1st Qu.:25.20   1st Qu.: 240  
##  Median :  25.000   Median :   8.533   Median :28.70   Median : 480  
##  Mean   :  49.851   Mean   :  28.676   Mean   :29.76   Mean   :1015  
##  3rd Qu.:  50.000   3rd Qu.:  30.400   3rd Qu.:33.40   3rd Qu.:1430  
##  Max.   :2570.000   Max.   :2574.650   Max.   :84.40   Max.   :9120  
##  NA's   :1515       NA's   :293        NA's   :64      NA's   :964
# Check for any remaining NAs
cat("\nMissing values in total intake variables:\n")
## 
## Missing values in total intake variables:
colSums(is.na(bmd2 %>% select(calcium_total, vitd_total, BMXBMI, totmet)))
## calcium_total    vitd_total        BMXBMI        totmet 
##           293           293            64           964

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# Create histograms for all continuous variables
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "BMI Distribution", x = "BMI", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Calcium Distribution", x = "Calcium (mg)", y = "Frequency") +
  theme_minimal()

p3 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total Vitamin D Distribution", x = "Vitamin D (mcg)", y = "Frequency") +
  theme_minimal()

p4 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Total MET Distribution", x = "MET (min/week)", y = "Frequency") +
  theme_minimal()

## Combine plots in a 2x2 grid
(p1 | p2) / (p3 | p4)

# Display all plots together
print(p1)

print(p2)

print(p3)

print(p4)

# Check complete cases for correlation analysis
complete_cases <- sum(complete.cases(bmd2[, c("BMXBMI", "calcium_total", "vitd_total", "totmet")]))
cat("Complete cases for all variables:", complete_cases) ###1765 complete cases
## Complete cases for all variables: 1765

*Interpretation**

  1. BMI Distribution (p1): The right-skewed distribution with a long tail extends to higher BMI values and most data are clustered around 25-30, but some extend to 50-60.Hence, we will apply LOG transformation because right-skewed tails needs correction and since there are no zeros in BMI data, so log is safe to use. Use of Log will compress the right tail and make distribution more symmetric for better analysis.

  2. Total Calcium Distribution (p2):The plot is highly right-skewed with most values clustered at the low end (0-1000 mg) and has long right tail extending to 4000+ mg. There is some spike near zero. We will apply LOG transformation here as well due to the extreme right-skewed distribution with potential zero values. Use of Log will compress the extreme high values and create more symmetry in the data.

  3. Total Vitamin D Distribution (p3):The plot shows extremely right-skew, massive spike at zero. We will use SQUARE ROOT transformation because there are many zeros present (people without supplements), square root will handle zero without adding constants, as it is appropriate for heavily zero-inflated data.

  4. Total MET Distribution (p4):Fourth plot shows another right-skewed, with large data at zero (sedentary people). Hence, similar to P3, we will use SQUARE ROOT transformation due to data have many zero values and right skewed.

Based on your assessment, apply transformations as needed:

# Apply transformations based on visual assessment
bmd2 <- bmd2 %>%
  mutate(
    bmi_log = log(BMXBMI),
    calcium_log = log(calcium_total + 1),
    vitd_sqrt = sqrt(vitd_total),
    met_sqrt = sqrt(totmet)
  )

# Check that variables were created
names(bmd2)
##  [1] "SEQN"          "RIAGENDR"      "RIDAGEYR"      "RIDRETH1"     
##  [5] "BMXBMI"        "smoker"        "totmet"        "metcat"       
##  [9] "DXXOFBMD"      "tbmdcat"       "calcium"       "vitd"         
## [13] "DSQTVD"        "DSQTCALC"      "DSQTCALC_0"    "DSQTVD_0"     
## [17] "calcium_total" "vitd_total"    "bmi_log"       "calcium_log"  
## [21] "vitd_sqrt"     "met_sqrt"
# Check missing data for each variable
sum(is.na(bmd2$BMXBMI))        
## [1] 64
sum(is.na(bmd2$calcium_total)) 
## [1] 293
sum(is.na(bmd2$vitd_total))    
## [1] 293
sum(is.na(bmd2$totmet))        
## [1] 964
# Create comparison plots
# BMI
p5 <- ggplot(bmd2, aes(x = BMXBMI)) +
  geom_histogram(fill ="red", color = "black", bins = 30) +
  labs(title = "Original BMI (Right-Skewed)", x = "BMI") +
  theme_minimal()

p6 <- ggplot(bmd2, aes(x = bmi_log)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Log-Transformed BMI (More Normal)", x = "log(BMI)") +
  theme_minimal()

print(p5)

print(p6)

# Calcium
p7 <- ggplot(bmd2, aes(x = calcium_total)) +
  geom_histogram(fill = "red", color = "black", bins = 30) +
  labs(title = "Original Calcium (Highly Skewed)", x = "Calcium (mg)") +
  theme_minimal()

p8 <- ggplot(bmd2, aes(x = calcium_log)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Log-Transformed Calcium (Improved)", x = "log(Calcium+1)") +
  theme_minimal()

print(p7)

print(p8)

# Vitamin D
p9 <- ggplot(bmd2, aes(x = vitd_total)) +
  geom_histogram(fill = "red", color = "black", bins = 30) +
  labs(title = "Original Vitamin D (Highly Skewed)", x = "Vitamin D ") +
  theme_minimal()

p10 <- ggplot(bmd2, aes(x = vitd_sqrt)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Log-Transformed Vitamin D (Improved)", x = "(Sqrt Vit D +1)") +
  theme_minimal()

print(p9)

print(p10)

# Total MET
p11 <- ggplot(bmd2, aes(x = totmet)) +
  geom_histogram(fill = "red", color = "black", bins = 30) +
  labs(title = "Total MET (Highly Skewed)", x = "Total MET ") +
  theme_minimal()

p12 <- ggplot(bmd2, aes(x = met_sqrt)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 30) +
  labs(title = "Total MET (Improved)", x = "(Sqrt Total MET +1)") +
  theme_minimal()

print(p11)

print(p12)

I applied log transformations to BMI and total calcium intake because both variables showed right-skewed distributions with positive skewness values (>1), indicating heavy right tails. The log transformation helps normalize these distributions and stabilize variance. For vitamin D and physical activity (MET), I applied square root transformations to address right skewness while preserving interpretability better than log transformations.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

  • Correlation coefficient (r)
  • p-value
  • Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

  • Age vs. BMD
  • BMI vs. BMD
  • Total calcium vs. BMD
  • Total vitamin D vs. BMD

Use transformed versions where appropriate.

# YOUR CODE HERE

tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "bmi_log", "DXXOFBMD"),
  corr_pair(bmd2, "calcium_log", "DXXOFBMD"),
  corr_pair(bmd2, "vitd_sqrt", "DXXOFBMD"),
  corr_pair(bmd2, "met_sqrt", "DXXOFBMD")
) %>%
  mutate(
    Variable = c("Age (years)", "BMI (log)", "Total Calcium (log)", 
                 "Total Vitamin D (sqrt)", "Physical Activity"),
    r = round(estimate, 3),
    p_value = signif(p_value, 3),
    Significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(Variable, r, p_value, Significant, n)

tableA %>% kable(caption = "Table A: Correlations between Predictors and BMD (Outcome)")
Table A: Correlations between Predictors and BMD (Outcome)
Variable r p_value Significant n
Age (years) -0.232 0.00e+00 Yes 2286
BMI (log) 0.425 0.00e+00 Yes 2275
Total Calcium (log) 0.012 5.92e-01 No 2129
Total Vitamin D (sqrt) -0.062 4.26e-03 Yes 2129
Physical Activity 0.102 4.59e-05 Yes 1588
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

Interpret the results:

All four variables showed statistically significant correlations with BMD (p < 0.05) except BMI. Age demonstrated a moderate negative correlation (r = -0.232), indicating that older participants tended to have lower BMD. BMI showed a strong positive correlation (r = 0.425), suggesting higher body mass index is associated with greater bone mineral density. Total calcium intake had a weak correlation (r = 0.012). Total vitamin D intake showed a weak negative correlation (r = -0.062), which is unexpected but could be due to residual confounding. Physical activity demonstrated a weak positive correlation (r = 0.102), supporting the known beneficial effects of exercise on bone health.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

  • Age vs. MET
  • BMI vs. MET
  • Total calcium vs. MET
  • Total vitamin D vs. MET
# YOUR CODE HERE

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "met_sqrt"),
  corr_pair(bmd2, "bmi_log", "met_sqrt"),
  corr_pair(bmd2, "calcium_log", "met_sqrt"),
  corr_pair(bmd2, "vitd_sqrt", "met_sqrt"),
) %>%
  mutate(
    Variable = c("Age (years)", "BMI (log)", "Total Calcium (log)", 
                 "Total Vitamin D (sqrt)"),
    r = round(estimate, 3),
    p_value = signif(p_value, 3),
    Significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  select(Variable, r, p_value, Significant, n)

tableB %>% kable(caption = "Table B: Correlations between Potential Cofounders and Physical Activity")
Table B: Correlations between Potential Cofounders and Physical Activity
Variable r p_value Significant n
Age (years) -0.097 1.96e-05 Yes 1934
BMI (log) 0.001 9.51e-01 No 1912
Total Calcium (log) 0.086 2.82e-04 Yes 1777
Total Vitamin D (sqrt) -0.002 9.32e-01 No 1777
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

Interpret the results:

Based on Table B, age and calcium intake show the strongest evidence of being confounders, as they are significantly associated with physical activity. BMI and vitamin D may still be important covariates if they’re associated with BMD, but they don’t appear to confound the exposure-outcome relationship based on their weak associations with physical activity.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

  • Age, BMI, Total calcium, Total vitamin D
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

### TABLE C: Pairwise correlations among predictors (multicollinearity assessment)

# Define predictor variables
preds <- c("RIDAGEYR", "bmi_log", "calcium_log", "vitd_sqrt")
pred_labels <- c("Age (years)", "BMI (log)", "Total Calcium (log)", 
                 "Total Vitamin D (sqrt)")

# Generate all pairwise combinations
pairs_list <- combn(preds, 2, simplify = FALSE)

# Create correlation table
tableC <- map_dfr(pairs_list, ~corr_pair(bmd2, .x[1], .x[2])) %>%
  mutate(
    
# Create proper labels for each pair
    Pair_Index = 1:n(),
    Variable_1 = c("Age (years)", "Age (years)", "Age (years)",
                   "BMI (log)", "BMI (log)",
                   "Total Calcium (log)"),
    Variable_2 = c("BMI (log)", "Total Calcium (log)", "Total Vitamin D (sqrt)",
                   "Total Calcium (log)", "Total Vitamin D (sqrt)",
                   "Total Vitamin D (sqrt)"),
    r = round(estimate, 3),
    p_value = signif(p_value, 3),
    Abs_r = abs(r),
    Significant = ifelse(p_value < 0.05, "Yes", "No")
  ) %>%
  arrange(desc(Abs_r)) %>%
  select(Variable_1, Variable_2, r, p_value, Significant, n)

# Display table
tableC %>% 
  kable(caption = "Table C: Pairwise Correlations Among Predictors (Multicollinearity Assessment)")
Table C: Pairwise Correlations Among Predictors (Multicollinearity Assessment)
Variable_1 Variable_2 r p_value Significant n
Total Calcium (log) Total Vitamin D (sqrt) 0.314 0.00e+00 Yes 2605
Age (years) Total Vitamin D (sqrt) 0.153 0.00e+00 Yes 2605
Age (years) BMI (log) -0.098 2.00e-07 Yes 2834
Age (years) Total Calcium (log) 0.048 1.35e-02 Yes 2605
BMI (log) Total Vitamin D (sqrt) 0.007 7.31e-01 No 2569
BMI (log) Total Calcium (log) 0.000 9.83e-01 No 2569

Interpret the results:

[YOUR INTERPRETATION: Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses?]

All pairwise correlations among predictors were weak to moderate. The strongest correlation is between total calcium intake and total vitamin D intake (r = 0.314, p < 0.001). Individuals who supplement their diet with one nutrient often supplement with the other. However, this correlation is still below the multicollinearity threshold of r= 0.7. Age show very weak correlations with BMI (r = -0.098), calcium intake (r = 0.048), and vitamin D intake (r = 0.153). BMI is not correlated with both calcium (r = 0.000) and vitamin D (r = 0.007) intake.

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

  • When would you use ANOVA versus correlation in epidemiological research?

ANOVA and correlation analysis are complementary but also have distinct purposes in research and should be selected based on the research question and study design. ANOVA is used to test whether categorical variables predict differences in a continuous outcome by comparing mean values across groups, while correlation analysis examines the strength and direction of linear relationships between two continuous variables.

  • What are the key differences in what these methods tell us?

The key methodological difference is that ANOVA tells us if group differences exist (via p-value, the null hypothesis of equal means) and how much variance is explained overall, while correlation quantifies the strength of association between two variables on a -1 to +1 scale. ANOVA assumes categorical, while correlation does not require such. In epidemiology, ANOVA results often prompt post-hoc comparisons to identify specific group differences, whereas correlation results lead to regression modeling to adjust for confounders.

  • Give one example of a research question better suited to ANOVA and one better suited to correlation.

In epidemiology, ANOVA is ideal when your primary exposure of interest is categorical (e.g., ethnicity, disease status, treatment groups), and we want to determine whether significant differences in health outcomes exist across these groups. For example, our research question “Do mean BMD values differ across ethnicity groups?” for ANOVA because ethnicity is categorical with five distinct groups.

Correlation analysis is better when both variables are continuous or when we want to quantify the strength of associations. An example research question would be: “What is the linear relationship between dietary calcium intake and BMD?” where both variables are measured on continuous scales.

B. Assumptions and Limitations

  • Which assumptions were most challenging to meet in this assignment?

The most challenging assumption to fully satisfy in this assignment was normality of residuals, particularly given the distribution of nutrient intake variables (calcium and vitamin D), which are extremely right-skewed.

  • How might violations of assumptions affect your conclusions? Violations of assumptions can affect conclusions in meaningful ways. For instance, if we had not addressed non-normality in the correlation analyses, confidence intervals might be too narrow and p-values unreliable, potentially leading to false discoveries.

  • What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

  1. cannot establish temporality
  2. Reverse causality 3.healthy user bias
  3. recall bias in dietary intake assessment

C. R Programming Challenge

  • What was the most difficult part of the R coding for this assignment?

The most difficult part of the R coding involved creating the Tukey HSD post-hoc table and the tidying up the tables. Somehow that always tripped me over and over. Initially, transformation part- interpreting when and why to apply transformations was the most difficult.

  • How did you solve it? (Describe your problem-solving process)

My problem-solving process was: 1) reading error message carefully 2) try to debug with more information 3) test each part separately 4) then i looked for other resources.

  • What R skill did you strengthen through completing this assignment?
  1. multiple ways of use like mutate (), filter(), select()
  2. handled missing datas
  3. data transforming
  4. coorelation and anova
  5. debugging (a lot)
  6. data visualization.

YOUR REFLECTION HERE (200–400 words)

[Write your reflection addressing all three components above]

This assignment really helped me see how ANOVA and correlation serve different but complementary roles in epidemiologic analysis. ANOVA help examine whether mean BMD differed across ethnic groups, which makes sense when the exposure is categorical. Correlation, on the other hand, allowed us to assess the strength of linear relationships between continuous variables such as calcium intake and physical activity. Working through both approaches clarified for me that the method depends entirely on the research question—ANOVA asks whether groups differ, while correlation asks how closely two variables move together.

One of the more difficult aspects was checking and addressing statistical assumptions, especially normality for skewed variables. Applying log and square root transformations required understanding why skewed distributions can distort regression estimates and inference. Even after transformations, I was thinking cross-sectional data come with limitations. We can observe associations, but we cannot establish temporality or causality—for example, whether low calcium intake contributes to bone loss or whether individuals with lower BMD modify their diet.

The R coding portion was particularly challenging. Creating clean Tukey HSD tables and debugging syntax errors in longer piped commands pushed me out of my comfort zone. At times it felt overwhelming. Over time, though, I shifted from reacting to error messages with frustration to approaching them more systematically—breaking code into smaller pieces, checking objects step by step, and reading documentation more carefully. I strengthened practical skills in data understanding, handling missing values, applying transformations, and producing clear visualizations. The workload was substantial and at moments genuinely felt tough, but working through those challenges made the concepts stick more deeply and improved both my analytical confidence and technical fluency. Hopefully!

Submission checklist

Before submitting, verify:


Good luck! Remember to start early and use office hours if you need help.