Homework 1 overview

Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA subsample)
What to submit (2 items):

  1. Your .Rmd file
  2. Your RPubs link (published HTML)

AI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.


File organization and naming (required)

Create a course folder on your computer like:

EPI553/
  ├── Assignments/
  │   └── HW01/
  │       ├── bmd.csv
  │       └── EPI553_HW01_LastName_FirstName.Rmd

Required filename for your submission:

EPI553_HW01_LastName_FirstName.Rmd

Required RPubs title (use this exact pattern):

epi553_hw01_lastname_firstname

This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_lastname_firstname


Data description

This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.

Key variables:

Variable Description Type
DXXOFBMD Total femur bone mineral density Continuous (g/cm²)
RIDRETH1 Race/ethnicity Categorical (1–5)
RIAGENDR Sex Categorical (1=Male, 2=Female)
RIDAGEYR Age in years Continuous
BMXBMI Body mass index Continuous (kg/m²)
smoker Smoking status Categorical (1=Current, 2=Past, 3=Never)
calcium Dietary calcium intake Continuous (mg/day)
DSQTCALC Supplemental calcium Continuous (mg/day)
vitd Dietary vitamin D intake Continuous (mcg/day)
DSQTVD Supplemental vitamin D Continuous (mcg/day)
totmet Total physical activity (MET-min/week) Continuous

Part 0 — Load and prepare the data (10 points)

0.1 Import

Load required packages and import the dataset.

# Load required packages
library(tidyverse)
library(car)
library(kableExtra)
library(broom)
# Import data (adjust path as needed)
bmd <- readr::read_csv("C:/Users/Alyss/OneDrive/Desktop/epi553/bmd.csv", show_col_types = FALSE)

# Inspect the data
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

0.2 Recode to readable factors

Create readable labels for the main categorical variables:

bmd <- bmd %>%
  mutate(
    sex = factor(RIAGENDR, 
                 levels = c(1, 2), 
                 labels = c("Male", "Female")),
    ethnicity = factor(RIDRETH1, 
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Mexican American", 
                                  "Other Hispanic",
                                  "Non-Hispanic White", 
                                  "Non-Hispanic Black", 
                                  "Other")),
    smoker_f = factor(smoker, 
                      levels = c(1, 2, 3),
                      labels = c("Current", "Past", "Never"))
  )

0.3 Missingness + analytic sample

Report the total sample size and create the analytic dataset for ANOVA by removing missing values:

# Total observations
n_total <- nrow(bmd)

# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(ethnicity))

# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)

# Display sample sizes
n_total
## [1] 2898
n_anova
## [1] 2286

Write 2–4 sentences summarizing your analytic sample here.

There were originally 2,898 observations, 612 missing entries were removed, leaving 2,286 observations for the ANOVA analysis.


Part 1 — One-way ANOVA: BMD by ethnicity (50 points)

Research question: Do mean BMD values differ across ethnicity groups?

1.1 Hypotheses (required)

State your null and alternative hypotheses in both statistical notation and plain language.

Statistical notation:

  • H₀: 𝛍_Mexican American = 𝛍_Other Hispanic = 𝛍_Non-Hispanic White = 𝛍_Non-Hispanic Black = 𝛍_Other
  • Hₐ: Not all 𝛍 are equal

Plain language:

  • H₀: All ethnicity population means are equal
  • Hₐ: At least one ethnicity population mean is different

1.2 Exploratory plot and group summaries

Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:

anova_df %>%
  group_by(ethnicity) %>%
  summarise(
    n = n(),
    mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
    sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
  ) %>%
  arrange(desc(n)) %>%
  kable(digits = 3)
ethnicity n mean_bmd sd_bmd
Non-Hispanic White 846 0.888 0.160
Non-Hispanic Black 532 0.973 0.161
Other 409 0.897 0.148
Mexican American 255 0.950 0.149
Other Hispanic 244 0.946 0.156

Create a visualization comparing BMD distributions across ethnicity groups:

ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.25) +
  labs(
    x = "Ethnicity group",
    y = "Bone mineral density (g/cm²)"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Interpret the plot: What patterns do you observe in BMD across ethnicity groups?

The median BMD looks to be the highest in Non-Hispanic blacks and lowest in non-Hispanic whites and other. Mexican Americans and Other Hispanics look similar when it comes to median BMD. Variability looks similar in each group. There are some outliers in each group.

1.3 Fit ANOVA model + report the F-test

Fit a one-way ANOVA with:

  • Outcome: BMD (DXXOFBMD)
  • Predictor: Ethnicity (5 groups)
# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)

# Display ANOVA table
summary(fit_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## ethnicity      4   2.95  0.7371   30.18 <2e-16 ***
## Residuals   2281  55.70  0.0244                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report the F-statistic, degrees of freedom, and p-value in a formatted table:

# Create tidy ANOVA table
broom::tidy(fit_aov) %>%
  kable(digits = 4)
term df sumsq meansq statistic p.value
ethnicity 4 2.9482 0.7371 30.185 0
Residuals 2281 55.6973 0.0244 NA NA

Write your interpretation here:

F-statistic: 30.19

p-value: <2e-16

Conclusion: Since the p-value is less than 0.05, we reject the null hypothesis. Mean BMD differs significantly across at least one group of ethnicity.

1.4 Assumption checks (normality + equal variances)

ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.

If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.

Normality of residuals

Create diagnostic plots:

par(mfrow = c(1, 2))
plot(fit_aov, which = 1)  # Residuals vs fitted
plot(fit_aov, which = 2)  # QQ plot

Equal variances (Levene’s test)

car::leveneTest(DXXOFBMD ~ ethnicity, data = anova_df)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.5728 0.1788
##       2281

Summarize what you see in 2–4 sentences:

The residuals vs fitted plot shows random scatter with no patterns. There may be some minor outliers. Based on the Q-Q plot, we can assume normality from where the points follow the diagonal line practically perfectly so we can assume independence. The Levene’s test had a value of p = 0.1788, which fails the significance test. We can assume equal variances.

1.5 Post-hoc comparisons (pairwise tests)

Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.

Use Tukey HSD and report:

  • Pairwise comparisons
  • Mean differences
  • 95% confidence intervals
  • Adjusted p-values
tuk <- TukeyHSD(fit_aov)
print(tuk)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
## 
## $ethnicity
##                                               diff          lwr         upr
## Other Hispanic-Mexican American       -0.004441273 -0.042644130  0.03376158
## Non-Hispanic White-Mexican American   -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American    0.022391619 -0.010100052  0.05488329
## Other-Mexican American                -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic     -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic      0.026832892 -0.006150151  0.05981593
## Other-Other Hispanic                  -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White  0.084768868  0.061164402  0.10837333
## Other-Non-Hispanic White               0.009057905 -0.016633367  0.03474918
## Other-Non-Hispanic Black              -0.075710963 -0.103764519 -0.04765741
##                                           p adj
## Other Hispanic-Mexican American       0.9978049
## Non-Hispanic White-Mexican American   0.0000003
## Non-Hispanic Black-Mexican American   0.3276613
## Other-Mexican American                0.0001919
## Non-Hispanic White-Other Hispanic     0.0000036
## Non-Hispanic Black-Other Hispanic     0.1722466
## Other-Other Hispanic                  0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White              0.8719109
## Other-Non-Hispanic Black              0.0000000

Create and present a readable table (you can tidy the Tukey output):

# YOUR CODE HERE to create a tidy table from Tukey results
# Hint: Convert to data frame, add rownames as column, filter for significance


# Extract and format results
tukey_table <- as.data.frame(tuk$ethnicity)
tukey_table$Comparison <- rownames(tukey_table)

# Add interpretation columns
tukey_table$Significant <- ifelse(tukey_table$`p adj` < 0.05, "Yes", "No")
tukey_table$Direction <- ifelse(
  tukey_table$Significant == "Yes", 
  ifelse(tukey_table$diff > 0, "First group Higher", "Second Group Higher"), "No Difference"
)

#Display Table
kable(tukey_table, digits = 4, 
      caption = "Tukey HSD post hoc comparisons w/ 95% confidence intervals")
Tukey HSD post hoc comparisons w/ 95% confidence intervals
diff lwr upr p adj Comparison Significant Direction
Other Hispanic-Mexican American -0.0044 -0.0426 0.0338 0.9978 Other Hispanic-Mexican American No No Difference
Non-Hispanic White-Mexican American -0.0624 -0.0929 -0.0319 0.0000 Non-Hispanic White-Mexican American Yes Second Group Higher
Non-Hispanic Black-Mexican American 0.0224 -0.0101 0.0549 0.3277 Non-Hispanic Black-Mexican American No No Difference
Other-Mexican American -0.0533 -0.0874 -0.0193 0.0002 Other-Mexican American Yes Second Group Higher
Non-Hispanic White-Other Hispanic -0.0579 -0.0889 -0.0269 0.0000 Non-Hispanic White-Other Hispanic Yes Second Group Higher
Non-Hispanic Black-Other Hispanic 0.0268 -0.0062 0.0598 0.1722 Non-Hispanic Black-Other Hispanic No No Difference
Other-Other Hispanic -0.0489 -0.0834 -0.0144 0.0011 Other-Other Hispanic Yes Second Group Higher
Non-Hispanic Black-Non-Hispanic White 0.0848 0.0612 0.1084 0.0000 Non-Hispanic Black-Non-Hispanic White Yes First group Higher
Other-Non-Hispanic White 0.0091 -0.0166 0.0347 0.8719 Other-Non-Hispanic White No No Difference
Other-Non-Hispanic Black -0.0757 -0.1038 -0.0477 0.0000 Other-Non-Hispanic Black Yes Second Group Higher

Write a short paragraph: “The groups that differed were …”

[YOUR INTERPRETATION: Which specific ethnicity pairs showed statistically significant differences in mean BMD?] Groups that showed significant differences in mean BMD were Non-hispanic whites ~ Mexican american, Other ~ Mexican american, Non-Hispanic whites ~ Other hispanics, Other ~ Other hispanics, Non-hispanic blacks ~ Non-hispanic whites, Other ~ Non-hispanic blacks.

1.6 Effect size (eta-squared)

Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).

Formula: η² = SS_between / SS_total

# Get ANOVA table with sum of squares
anova_tbl <- broom::tidy(fit_aov)
anova_tbl
## # A tibble: 2 × 6
##   term         df sumsq meansq statistic   p.value
##   <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1 ethnicity     4  2.95 0.737       30.2  1.65e-24
## 2 Residuals  2281 55.7  0.0244      NA   NA
# YOUR CODE HERE to compute eta-squared
# Hint: Extract SS for ethnicity and Residuals, then calculate proportion


ss_treatment <- anova_tbl$sumsq[1]
ss_total <- sum(anova_tbl$sumsq)


eta_squared <- ss_treatment / ss_total
cat("Eta-squared:", round(eta_squared, 4), "\n")
## Eta-squared: 0.0503

Interpret in 1–2 sentences:

[Ethnicity explains 5.03% of the variance in BMD. The effect size is small; so there are many other things that could explain the variation in BMD.


Part 2 — Correlation analysis (40 points)

In this section you will:

  1. Create total intake variables (dietary + supplemental)
  2. Assess whether transformations are needed
  3. Produce three correlation tables examining relationships between:
    • Table A: Predictors vs. outcome (BMD)
    • Table B: Predictors vs. exposure (physical activity)
    • Table C: Predictors vs. each other

2.1 Create total intake variables

Calculate total nutrient intake by adding dietary and supplemental sources:

# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
  mutate(
    DSQTCALC_0 = replace_na(DSQTCALC, 0),
    DSQTVD_0 = replace_na(DSQTVD, 0),
    calcium_total = calcium + DSQTCALC_0,
    vitd_total = vitd + DSQTVD_0
  )

2.2 Decide on transformations (show evidence)

Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.

Create histograms to assess distributions:

# YOUR CODE HERE: Create histograms for key continuous variables
# Recommended: BMXBMI, calcium_total, vitd_total, totmet
# Example structure:
# library(patchwork)
# p1 <- ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)
# p2 <- ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)
# ...
# (p1 | p2) / (p3 | p4)

 ggplot(bmd2, aes(x = BMXBMI)) + geom_histogram(bins = 30)

  ggplot(bmd2, aes(x = calcium_total)) + geom_histogram(bins = 30)

  ggplot(bmd2, aes(x = vitd_total)) + geom_histogram(bins = 30)

  ggplot(bmd2, aes(x = totmet)) + geom_histogram(bins = 30)

Based on your assessment, apply transformations as needed:

bmd2 <- bmd2 %>%
  mutate(
    # Add your transformation code here
    # Examples (modify as needed based on your assessment):
    # log_bmi = log(BMXBMI),
    # log_calcium_total = log(calcium_total),
    # sqrt_totmet = sqrt(totmet),
    # sqrt_vitd_total = sqrt(vitd_total)
    log_bmi = log(BMXBMI),
    log_calcium_total = log(calcium_total),
    sqrt_totmet = sqrt(totmet),
    sqrt_vitd_total = sqrt(vitd_total)
  )

Document your reasoning: “I applied/did not apply transformations to [variables] because…”

I applied transformations to BMXBMI, calcium_total, totmet, and vitd_total because they were all highly skewed.

2.3 Create three correlation tables (Pearson)

For each correlation, report:

  • Correlation coefficient (r)
  • p-value
  • Sample size (n)

Helper function for correlation pairs

# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
  # Select variables and remove missing values
  d <- data %>%
    select(all_of(c(x, y))) %>%
    drop_na()
  
  # Need at least 3 observations for correlation
  if(nrow(d) < 3) {
    return(tibble(
      x = x,
      y = y,
      n = nrow(d),
      estimate = NA_real_,
      p_value = NA_real_
    ))
  }
  
  # Calculate Pearson correlation
  ct <- cor.test(d[[x]], d[[y]], method = "pearson")
  
  # Return tidy results
  tibble(
    x = x,
    y = y,
    n = nrow(d),
    estimate = unname(ct$estimate),
    p_value = ct$p.value
  )
}

Table A: Variables associated with the outcome (BMD)

Examine correlations between potential predictors and BMD:

  • Age vs. BMD
  • BMI vs. BMD
  • Total calcium vs. BMD
  • Total vitamin D vs. BMD

Use transformed versions where appropriate.

# YOUR CODE HERE
# Use the corr_pair function to create correlations
# Use bind_rows() to combine multiple correlation results
# Format with kable()

# Example structure:
 tableA <- bind_rows(
   corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
   corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
  corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
   corr_pair(bmd2, "sqrt_vitd_total", "DXXOFBMD")
 ) %>%
   mutate(
    estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableA %>% kable()
x y n estimate p_value
RIDAGEYR DXXOFBMD 2286 -0.232 0.00000
log_bmi DXXOFBMD 2275 0.425 0.00000
log_calcium_total DXXOFBMD 2129 0.012 0.59200
sqrt_vitd_total DXXOFBMD 2129 -0.062 0.00426

Interpret the results:

RIDAGEYR, log_bmi, and sqrt_vitd_total all had statistically significant correlations with BMD. RIDAGEYR had a weak negative relationship. log_bmi had a moderately positive relationship. sqrt_vitd_total had a weak negative relationship.

Table B: Variables associated with the exposure (MET)

Examine correlations between potential confounders and physical activity:

  • Age vs. MET
  • BMI vs. MET
  • Total calcium vs. MET
  • Total vitamin D vs. MET
# YOUR CODE HERE
# Follow the same structure as Table A
# Use sqrt_totmet (or transformed version) as the outcome variable
 tableB <- bind_rows(
   corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
   corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
  corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
   corr_pair(bmd2, "sqrt_vitd_total", "sqrt_totmet")
 ) %>%
   mutate(
    estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableB %>% kable()
x y n estimate p_value
RIDAGEYR sqrt_totmet 1934 -0.097 1.96e-05
log_bmi sqrt_totmet 1912 0.001 9.51e-01
log_calcium_total sqrt_totmet 1777 0.086 2.82e-04
sqrt_vitd_total sqrt_totmet 1777 -0.002 9.32e-01

Interpret the results:

RIDAGEYR and log_calcium_total are significantly correlated with physical activity level.

Table C: Predictors associated with each other

Examine correlations among all predictor variables (all pairwise combinations):

  • Age, BMI, Total calcium, Total vitamin D
# YOUR CODE HERE
# Create all pairwise combinations of predictors
# Hint: Use combn() to generate pairs, then map_dfr() with corr_pair()

# Example structure:
 preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "sqrt_vitd_total")
 pairs <- combn(preds, 2, simplify = FALSE)
 tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
   mutate(
     estimate = round(estimate, 3),
     p_value = signif(p_value, 3)
   )
 
 tableC %>% kable()
x y n estimate p_value
RIDAGEYR log_bmi 2834 -0.098 2.00e-07
RIDAGEYR log_calcium_total 2605 0.048 1.36e-02
RIDAGEYR sqrt_vitd_total 2605 0.153 0.00e+00
log_bmi log_calcium_total 2569 0.000 9.81e-01
log_bmi sqrt_vitd_total 2569 0.007 7.31e-01
log_calcium_total sqrt_vitd_total 2605 0.314 0.00e+00

Interpret the results:

[YOUR INTERPRETATION: Are there strong correlations among predictors that might indicate multicollinearity concerns in future regression analyses?] The highest statistically significant pairwise predictors were log_calcium_total and sqrt_vitd_total with a positive moderate correlation of 0.314.

Optional: Visualization examples

Create scatterplots to visualize key relationships:

# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(BMI)",
    y = "BMD (g/cm²)",
    title = "BMD vs BMI (transformed)"
  )

# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "sqrt(totmet)",
    y = "BMD (g/cm²)",
    title = "BMD vs MET (transformed)"
  )

Part 3 — Reflection (10 points)

Write a structured reflection (200–400 words) addressing the following:

A. Comparing Methods (ANOVA vs Correlation)

  • When would you use ANOVA versus correlation in epidemiological research?
  • What are the key differences in what these methods tell us?
  • Give one example of a research question better suited to ANOVA and one better suited to correlation.

B. Assumptions and Limitations

  • Which assumptions were most challenging to meet in this assignment?
  • How might violations of assumptions affect your conclusions?
  • What are the limitations of cross-sectional correlation analysis for understanding relationships between nutrition/activity and bone health?

C. R Programming Challenge

  • What was the most difficult part of the R coding for this assignment?
  • How did you solve it? (Describe your problem-solving process)
  • What R skill did you strengthen through completing this assignment?

YOUR REFLECTION HERE (200–400 words)

[Write your reflection addressing all three components above] ANOVA should be used when comparing 3 or more groups means. There should a categorical predictor (i.e. ethnicity group) and a continuous outcome (i.e. BMD). Correlation should be used when trying to find the strength/direction of a linear relationship. Both varables should be continuous but it is okay if they are ordinal. Anova tells us whether there is a difference in mean of a variable of interest when looked at in different categories. Correlation tells us whether there is a relationship between two variables and what the strength and direction of the relationship is. A research question where Anova is best used when determining if there is a mean difference of BMI across age groups (young, middle-aged, old). A research question where correlation might be used is determining if BMI is associated with hours of sleep per night.

I think the most challenging assumptions to meet were the linearity and having to apply transformations. If linearity was not met then I could not do the tests and could not determine correlation. A limitation to cross-secrtional correaltion analysis is not being able to determine causation. We can not say for certain that nutrition/activity directly causes better or worse bone health.

The most challenging part of this assignment was determining how to make the eta-squared test work. I tried to follow the labs and lectures for its structure but they kept throwing error messages. To solve this problem, I went onto an R help website(s) (Rpubs and R-Bloggers) and was able to find the problem and fix it. The R skill I streghtened during this assignment is create histograms and conduct linear transformations.


Submission checklist

Before submitting, verify:


Good luck! Remember to start early and use office hours if you need help.