Homework 1 overview

Due: Friday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

Your .Rmd file
Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.

File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname

Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable	Description	Type
`DXXOFBMD`	Total femur bone mineral density	Continuous (g/cm²)
`RIDRETH1`	Race/ethnicity	Categorical (1–5)
`RIAGENDR`	Sex	Categorical (1=Male, 2=Female)
`RIDAGEYR`	Age in years	Continuous
`BMXBMI`	Body mass index	Continuous (kg/m²)
`smoker`	Smoking status	Categorical (1=Current, 2=Past, 3=Never)
`calcium`	Dietary calcium intake	Continuous (mg/day)
`DSQTCALC`	Supplemental calcium	Continuous (mg/day)
`vitd`	Dietary vitamin D intake	Continuous (mcg/day)
`DSQTVD`	Supplemental vitamin D	Continuous (mcg/day)
`totmet`	Total physical activity (MET-min/week)	Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)

# Import dataset
bmd <- readr::read_csv("~/Desktop/EPI553/data/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables: RIDRETH1, RIAGENDR, and smoker recoded

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )
table(bmd$sex)

## 
##   Male Female 
##   1431   1467

table(bmd$ethnicity)

## 
##   Mexican American     Other Hispanic Non-Hispanic White Non-Hispanic Black 
##                331                286               1086                696 
##              Other 
##                499

table(bmd$smoker_f)

## 
## Current    Past   Never 
##     449     894    1553

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total

## [1] 2898

n_anova

## [1] 2286

Write 2–4 sentences summarizing your analytic sample here. The total sample was 2898. There were 612 observations that were removed due to missing data so the final analytic sample size is 2286.

Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

H₀: μ_MexicanAmerican = μ_OtherHispanic = μ_Non-HispanicWhite = μ_NonHispanicBlack = μ_Other
Hₐ: At least one mean differs from the others

Plain language:

H₀: All means are equal
Hₐ: At least one mean differs from the others

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)

ethnicity	n	mean_bmd	sd_bmd
Non-Hispanic White	846	0.888	0.160
Non-Hispanic Black	532	0.973	0.161
Other	409	0.897	0.148
Mexican American	255	0.950	0.149
Other Hispanic	244	0.946	0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

Non-Hispanic Black individuals have the highest median BMD among the ethnicity groups and Non-Hispanic White individuals have the lowest median BMD. The Non-Hispanic White group has a more points spread toward lower BMD but there is substantial overlap between all groups. Each group had outliers in both directions.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

Outcome: BMD (DXXOFBMD)
Predictor: Ethnicity (5 groups)

# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)

term	df	sumsq	meansq	statistic	p.value
ethnicity	4	2.9482	0.7371	30.185	0
Residuals	2281	55.6973	0.0244	NA	NA

Write your interpretation here: F-statistic= 30.185 p< 0.001 Since the p-value is 0 (p<0.001), we reject the null hypothesis. This means there is statistically significant evidence that at least one mean BMD differs from the others across ethnicity groups.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

Based on the QQ plot, the residuals appear to be normally distributed since they follow the diagonal line. Based on Levene’s test, the p-value is 0.1788 which is greater than 0.05 so we accept equal variances. So based on the plot and test the assumption of normality and equal variances are met

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

Pairwise comparisons
Mean differences
95% confidence intervals
Adjusted p-values

tuk <- TukeyHSD(fit_aov)
tuk

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significances
tukey_summary <- as.data.frame(tuk$ethnicity)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]

# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
  tukey_summary$Significant == "Yes",
  ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
  "No difference"
)

kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")

Tukey HSD post-hoc comparisons with 95% confidence intervals
	Comparison	diff	lwr	upr	p adj	Significant	Direction
Other Hispanic-Mexican American	Other Hispanic-Mexican American	-0.0044	-0.0426	0.0338	0.9978	No	No difference
Non-Hispanic White-Mexican American	Non-Hispanic White-Mexican American	-0.0624	-0.0929	-0.0319	0.0000	Yes	Second group higher
Non-Hispanic Black-Mexican American	Non-Hispanic Black-Mexican American	0.0224	-0.0101	0.0549	0.3277	No	No difference
Other-Mexican American	Other-Mexican American	-0.0533	-0.0874	-0.0193	0.0002	Yes	Second group higher
Non-Hispanic White-Other Hispanic	Non-Hispanic White-Other Hispanic	-0.0579	-0.0889	-0.0269	0.0000	Yes	Second group higher
Non-Hispanic Black-Other Hispanic	Non-Hispanic Black-Other Hispanic	0.0268	-0.0062	0.0598	0.1722	No	No difference
Other-Other Hispanic	Other-Other Hispanic	-0.0489	-0.0834	-0.0144	0.0011	Yes	Second group higher
Non-Hispanic Black-Non-Hispanic White	Non-Hispanic Black-Non-Hispanic White	0.0848	0.0612	0.1084	0.0000	Yes	First group higher
Other-Non-Hispanic White	Other-Non-Hispanic White	0.0091	-0.0166	0.0347	0.8719	No	No difference
Other-Non-Hispanic Black	Other-Non-Hispanic Black	-0.0757	-0.1038	-0.0477	0.0000	Yes	Second group higher

Write a short paragraph: “The groups that differed were …” Based on the tukey HSD poc-hoc analysis, there were statistically significant differences in mean BMD for several pairs. Mexican American individuals had higher mean BMD in comparison to Non-Hispanic Whites and the Other group. Other Hispanic individuals had a higher mean BMD than Non-Hispanic White individuals and the Other group. Non-Hispanic Black individuals had a higher mean BMD in comparison to the Non-White Hispanic group and the Other group. The rest of the pairs showed no statistically significant differences in mean BMD.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl

## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA

# YOUR CODE HERE to compute eta-squared

anova_summary <- summary(fit_aov)[[1]]

ss_treatment <- anova_summary$`Sum Sq`[1]
ss_total <- sum(anova_summary$`Sum Sq`)

# Calculate eta-squared
eta_squared <- ss_treatment / ss_total

cat("Eta-squared (η²):", round(eta_squared, 4), "\n")

## Eta-squared (η²): 0.0503

cat("Percentage of variance explained:", round(eta_squared * 100, 2), "%")

## Percentage of variance explained: 5.03 %

# Hint: Extract SS for ethnicity and Residuals, then calculate proportion

Interpret in 1–2 sentences: 5% of variance in BMD is explained by ethnicity. Since effect size is 0.0503 the classification is small according to Cohen’s guidelines.

Part 2 — Correlation analysis (40 points)

In this section you will:

Create total intake variables (dietary + supplemental)
Assess whether transformations are needed
Produce three correlation tables examining relationships between:
- Table A: Predictors vs. outcome (BMD)
- Table B: Predictors vs. exposure (physical activity)
- Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    calcium_total = replace_na(calcium,0) + replace_na(DSQTCALC,0),
    vitd_total = replace_na(vitd, 0) + replace_na(DSQTVD, 0)
  )

2.2 Decide on transformations (show evidence)

Create histograms to assess distributions:

library(patchwork)
p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
p3 <- ggplot(bmd2, aes(x=vitd_total))+ geom_histogram(bins = 30)
p4 <- ggplot(bmd2, aes(x=totmet))+ geom_histogram(bins = 30)
(p1 | p2) / (p3 | p4)

Based on your assessment, apply transformations as needed:

bmd2 %>%
  summarise(
    min_BMI = min(BMXBMI, na.rm = TRUE),
    min_calcium_total = min(calcium_total, na.rm = TRUE),
    min_vitd_total = min(vitd_total, na.rm = TRUE),
    min_totmet = min(totmet, na.rm = TRUE)
  )

## # A tibble: 1 × 4
##   min_BMI min_calcium_total min_vitd_total min_totmet
##     <dbl>             <dbl>          <dbl>      <dbl>
## 1    14.2                 0              0          0

bmd2 <- bmd2 %>%
  mutate(
    log_BMI = log(BMXBMI),
    log_calcium_total = log(calcium_total+1),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

I applied transformations to total physical activity, BMI, total calcium intake, and total vitamin D intake because the histograms were skewed to the right. By applying transformations skewness can be reduced.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

Correlation coefficient (r)
p-value
Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observ[or correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    Predictor = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

Age vs. BMD
BMI vs. BMD
Total calcium vs. BMD
Total vitamin D vs. BMD

Use transformed versions where appropriate.

tableA <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
  corr_pair(bmd2, "log_BMI", "DXXOFBMD"),
  corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
  corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
) %>%
  
  mutate(
    estimate= round(estimate, 3),
    p_value= signif(p_value, 3)
  )

tableA %>% kable()

Predictor	y	n	estimate	p_value
RIDAGEYR	DXXOFBMD	2286	-0.232	0.0000
log_BMI	DXXOFBMD	2275	0.425	0.0000
log_calcium_total	DXXOFBMD	2286	0.030	0.1450
sqrt_vitd_total	DXXOFBMD	2286	-0.051	0.0151

Interpret the results: Age and BMI show statistically significant correlations with BMD. Age is moderately negatively correlated with BMD (r=-0.232, p< 0.001) which indicates that as age increases BMD tends to increase. Also, BMI is moderately positively correlated with BMD (r=0.425, p< 0.001), which indicates that higher BMI is associated with higher BMD. Total calcium intake does not have a statistically significant relationship with BMD. Vitamin D intake is weakly associated with BMD indicating that it is likely due to the large sample size although it is statistically significant (r=-0.051, p=0.015).

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

Age vs. MET
BMI vs. MET
Total calcium vs. MET
Total vitamin D vs. MET

# YOUR CODE HERE

tableB <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
  corr_pair(bmd2, "log_BMI", "sqrt_totmet"),
  corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
  corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
) %>%
  mutate(
    estimate= round(estimate, 3),
    p_value= signif(p_value, 3)
  )

tableB %>%
  kable()

Predictor	y	n	estimate	p_value
RIDAGEYR	sqrt_totmet	1934	-0.097	1.96e-05
log_BMI	sqrt_totmet	1912	0.001	9.51e-01
log_calcium_total	sqrt_totmet	1934	0.096	2.47e-05
sqrt_vitd_total	sqrt_totmet	1934	0.007	7.58e-01

Interpret the results: Vitamin D intake and BMI both did not have a statistically significant correlation with physical activity (p= 0.489 and p= 0.959, respectively). Age has a weak negative association with physical activity that is statistically significant (r=-0.094, p<0.001), which indicates that lower physical activity is associated with older age. Total calcium intake also showed a statistically significant weak positive association with physical activity (r=0.081, p<0.001). Overall, there was either no association between the potential confounders and physical activity.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

Age, BMI, Total calcium, Total vitamin D

tableC <- bind_rows(
  corr_pair(bmd2, "RIDAGEYR", "log_BMI"),
  corr_pair(bmd2, "RIDAGEYR", "log_calcium_total"),
  corr_pair(bmd2, "RIDAGEYR", "sqrt_vitd_total"),
  corr_pair(bmd2, "sqrt_vitd_total", "log_calcium_total"),
  corr_pair(bmd2, "sqrt_vitd_total", "log_BMI"),
  corr_pair(bmd2, "log_calcium_total", "log_BMI")
)

 tableC %>% kable()

Predictor	y	n	estimate	p_value
RIDAGEYR	log_BMI	2834	-0.0977876	0.0000002
RIDAGEYR	log_calcium_total	2898	-0.0292405	0.1155426
RIDAGEYR	sqrt_vitd_total	2898	0.1487316	0.0000000
sqrt_vitd_total	log_calcium_total	2898	0.2722949	0.0000000
sqrt_vitd_total	log_BMI	2834	-0.0029275	0.8762060
log_calcium_total	log_BMI	2834	0.0206142	0.2726247

Interpret the results: There are no strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses. The strongest association is a weak positive correlation between total calcium intake and total vitamin intake (r=0.272, p<0.001).

Part 3 — Reflection (10 points)

REFLECTION ANOVA is used in epidemiologic research to compare the mean of a continuous outcome across multiple categorical groups (3 or more) defined by one predictor. Correlation allows epidemiologists to see the strength and direction of the linear relationship between two variables. ANOVA does not allow epidemiologists to calculate the strength or direction of a linear relationship, like correlation can, but it can determine if the means differ significantly (not by chance). Correlation does not mean causation though, so even if there is a strong relationship between two variables it cannot be assumed that one causes the other. ANOVA better suits the question if bone mineral density varies by ethnicity while Correlation is best for looking at how total calcium intake and BMD are related. Normality and equal variances were important assumptions to check and this assignment reinforced in me the practice of checking when using ANOVA. Violating these assumptions can result in distorted results if extreme outliers go unchecked. With cross-sectional correlation analysis it is important to understand that nutrition, physical activity, and BMD may be influenced by other confounding factors. In this assignment I saw that while there was a statistically significant association between continous variables the r value was quite weak indicating that the association seen is likely due to other factors. During this assignment, I struggled with a lot of writing my own code. I had gotten used to being provided the code in 552 and the first few labs so interpreting was easy but making changes/writing the code was difficult. When I was stuck I would first revisit the labs we have already done and if I was still stuck I would use Google to look up what I was trying to do. Through this assignment I strengthened my problem solving and now feel more confident on certain aspects of R coding.

EPI 553 Homework 1: ANOVA & Correlation

EPI 553/STA 553: Principles of Statistical Inference II (Spring 2026)

Jackie Korosec

2026-02-24