Due: Thursday, February 20, 2026 (11:59 pm ET)
Topics: One-way ANOVA + Correlation (based on
Lectures/Labs 02–03)
Dataset: bmd.csv (NHANES 2017–2018 DEXA
subsample)
What to submit (2 items):
.Rmd fileAI policy (Homework 1): AI tools (ChatGPT, Gemini, Claude, Copilot, etc.) are not permitted for coding assistance on HW1. You must write/debug your code independently.
Create a course folder on your computer like:
EPI553/
├── Assignments/
│ └── HW01/
│ ├── bmd.csv
│ └── EPI553_HW01_LastName_FirstName.Rmd
Required filename for your submission:
EPI553_HW01_Yatsiv_Viktoriya.Rmd
Required RPubs title (use this exact pattern):
epi553_hw01_lastname_firstname
This will create an RPubs URL that ends in something like:
https://rpubs.com/your_username/epi553_hw01_Yatsiv_Viktoriya
This homework uses bmd.csv, a subset of NHANES 2017–2018 respondents who completed a DEXA scan.
Key variables:
| Variable | Description | Type |
|---|---|---|
DXXOFBMD |
Total femur bone mineral density | Continuous (g/cm²) |
RIDRETH1 |
Race/ethnicity | Categorical (1–5) |
RIAGENDR |
Sex | Categorical (1=Male, 2=Female) |
RIDAGEYR |
Age in years | Continuous |
BMXBMI |
Body mass index | Continuous (kg/m²) |
smoker |
Smoking status | Categorical (1=Current, 2=Past, 3=Never) |
calcium |
Dietary calcium intake | Continuous (mg/day) |
DSQTCALC |
Supplemental calcium | Continuous (mg/day) |
vitd |
Dietary vitamin D intake | Continuous (mcg/day) |
DSQTVD |
Supplemental vitamin D | Continuous (mcg/day) |
totmet |
Total physical activity (MET-min/week) | Continuous |
Load required packages and import the dataset.
# Import data (adjust path as needed)
bmd <- readr::read_csv("/Users/vikya/OneDrive - University at Albany - SUNY/Documents/553 Principles of Statistical Inference II/Assignments/Assignment 1 ANOVA Correlation/bmd.csv", show_col_types = FALSE)
# Inspect the data
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
Create readable labels for the main categorical variables:
bmd <- bmd %>%
mutate(
sex = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(RIDRETH1,
levels = c(1, 2, 3, 4, 5),
labels = c("Mexican American",
"Other Hispanic",
"Non-Hispanic White",
"Non-Hispanic Black",
"Other")),
smoker_f = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)Report the total sample size and create the analytic dataset for ANOVA by removing missing values:
# Total observations
n_total <- nrow(bmd)
# Create analytic dataset for ANOVA (complete cases for BMD and ethnicity)
anova_df <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(ethnicity))
# Sample size for ANOVA analysis
n_anova <- nrow(anova_df)
# Display sample sizes
n_total## [1] 2898
## [1] 2286
Write 2–4 sentences summarizing your analytic sample here.
The original dataset contained 2,898 observations from the NHANES 2017-2018 DEXA sample. After removing observations with missing values for bone mineral density or ethnicity, the final sample contains 2,286 participants. A total of 612 observations were excluded due to missing data on these key variables.
Research question: Do mean BMD values differ across ethnicity groups?
State your null and alternative hypotheses in both statistical notation and plain language.
Statistical notation:
Plain language:
Create a table showing sample size, mean, and standard deviation of BMD for each ethnicity group:
anova_df %>%
group_by(ethnicity) %>%
summarise(
n = n(),
mean_bmd = mean(DXXOFBMD, na.rm = TRUE),
sd_bmd = sd(DXXOFBMD, na.rm = TRUE)
) %>%
arrange(desc(n)) %>%
kable(digits = 3, col.names = c("Ethnicity", "N", "Mean BMD (g/cm²)", "SD"),
caption = "Descriptive Statistics: BMD by Ethnicity")| Ethnicity | N | Mean BMD (g/cm²) | SD |
|---|---|---|---|
| Non-Hispanic White | 846 | 0.888 | 0.160 |
| Non-Hispanic Black | 532 | 0.973 | 0.161 |
| Other | 409 | 0.897 | 0.148 |
| Mexican American | 255 | 0.950 | 0.149 |
| Other Hispanic | 244 | 0.946 | 0.156 |
Create a visualization comparing BMD distributions across ethnicity groups:
ggplot(anova_df, aes(x = ethnicity, y = DXXOFBMD)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.25) +
labs(
x = "Ethnicity group",
y = "Bone mineral density (g/cm²)"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))Interpret the plot: The boxplot shows that Non-Hispanic Black group has the highest median BMD (~0.97 g/cm²), while Non-Hispanic White participants have the lowest median BMD (~0.89 g/cm²).All groups show similar variability and some outliers.
Fit a one-way ANOVA with:
DXXOFBMD)# Fit ANOVA model
fit_aov <- aov(DXXOFBMD ~ ethnicity, data = anova_df)
# Display ANOVA table
summary(fit_aov)## Df Sum Sq Mean Sq F value Pr(>F)
## ethnicity 4 2.95 0.7371 30.18 <2e-16 ***
## Residuals 2281 55.70 0.0244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report the F-statistic, degrees of freedom, and p-value in a formatted table:
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ethnicity | 4 | 2.9482 | 0.7371 | 30.185 | 0 |
| Residuals | 2281 | 55.6973 | 0.0244 | NA | NA |
The one-way ANOVA shows a statistically significant difference in mean bone mineral density across ethnicity groups (F(4, 2281) = 30.18, p < 0.001). We reject the null hypothesis.This means that at least one ethnicity group has significantly different mean BMD.
ANOVA requires three assumptions: independence, normality of residuals, and equal variances. Check the latter two graphically and with formal tests.
If assumptions are not satisfied, say so clearly, but still proceed with the ANOVA and post-hoc tests.
Create diagnostic plots:
par(mfrow = c(1, 2))
plot(fit_aov, which = 1) # Residuals vs fitted
plot(fit_aov, which = 2) # QQ plot## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.5728 0.1788
## 2281
Summarize what you see in 2–4 sentences:
The Q-Q plot shows that residuals mostly follow the diagonal line. The Residuals vs Fitted plot shows random scatter with no particular patterns. Levene’s test yields F = 1.57, p = 0.179 (p > 0.05), therefore we fail to reject the null hypothesis of equal variances.The homogeneity of variance assumption is satisfied. There is a large sample (n=2286) and balanced groups. ANOVA is robust to minor violations.
Because ethnicity has 5 groups, you must do a multiple-comparisons procedure.
Use Tukey HSD and report:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DXXOFBMD ~ ethnicity, data = anova_df)
##
## $ethnicity
## diff lwr upr
## Other Hispanic-Mexican American -0.004441273 -0.042644130 0.03376158
## Non-Hispanic White-Mexican American -0.062377249 -0.092852618 -0.03190188
## Non-Hispanic Black-Mexican American 0.022391619 -0.010100052 0.05488329
## Other-Mexican American -0.053319344 -0.087357253 -0.01928144
## Non-Hispanic White-Other Hispanic -0.057935976 -0.088934694 -0.02693726
## Non-Hispanic Black-Other Hispanic 0.026832892 -0.006150151 0.05981593
## Other-Other Hispanic -0.048878071 -0.083385341 -0.01437080
## Non-Hispanic Black-Non-Hispanic White 0.084768868 0.061164402 0.10837333
## Other-Non-Hispanic White 0.009057905 -0.016633367 0.03474918
## Other-Non-Hispanic Black -0.075710963 -0.103764519 -0.04765741
## p adj
## Other Hispanic-Mexican American 0.9978049
## Non-Hispanic White-Mexican American 0.0000003
## Non-Hispanic Black-Mexican American 0.3276613
## Other-Mexican American 0.0001919
## Non-Hispanic White-Other Hispanic 0.0000036
## Non-Hispanic Black-Other Hispanic 0.1722466
## Other-Other Hispanic 0.0010720
## Non-Hispanic Black-Non-Hispanic White 0.0000000
## Other-Non-Hispanic White 0.8719109
## Other-Non-Hispanic Black 0.0000000
Create and present a readable table (you can tidy the Tukey output):
# Create table from Tukey results
tuk_df <- tuk$ethnicity %>%
as.data.frame() %>%
rownames_to_column("Comparison") %>%
arrange(`p adj`) %>%
mutate(
Significant = ifelse(`p adj` < 0.05, "Yes", "No")
)
# Display all comparisons
tuk_df %>%
kable(digits = 4, caption = "Tukey HSD Results")| Comparison | diff | lwr | upr | p adj | Significant |
|---|---|---|---|---|---|
| Non-Hispanic Black-Non-Hispanic White | 0.0848 | 0.0612 | 0.1084 | 0.0000 | Yes |
| Other-Non-Hispanic Black | -0.0757 | -0.1038 | -0.0477 | 0.0000 | Yes |
| Non-Hispanic White-Mexican American | -0.0624 | -0.0929 | -0.0319 | 0.0000 | Yes |
| Non-Hispanic White-Other Hispanic | -0.0579 | -0.0889 | -0.0269 | 0.0000 | Yes |
| Other-Mexican American | -0.0533 | -0.0874 | -0.0193 | 0.0002 | Yes |
| Other-Other Hispanic | -0.0489 | -0.0834 | -0.0144 | 0.0011 | Yes |
| Non-Hispanic Black-Other Hispanic | 0.0268 | -0.0062 | 0.0598 | 0.1722 | No |
| Non-Hispanic Black-Mexican American | 0.0224 | -0.0101 | 0.0549 | 0.3277 | No |
| Other-Non-Hispanic White | 0.0091 | -0.0166 | 0.0347 | 0.8719 | No |
| Other Hispanic-Mexican American | -0.0044 | -0.0426 | 0.0338 | 0.9978 | No |
# Show only significant comparisons
tuk_df %>%
filter(`p adj` < 0.05) %>%
kable(digits = 4, caption = "Significant Pairwise Differences")| Comparison | diff | lwr | upr | p adj | Significant |
|---|---|---|---|---|---|
| Non-Hispanic Black-Non-Hispanic White | 0.0848 | 0.0612 | 0.1084 | 0.0000 | Yes |
| Other-Non-Hispanic Black | -0.0757 | -0.1038 | -0.0477 | 0.0000 | Yes |
| Non-Hispanic White-Mexican American | -0.0624 | -0.0929 | -0.0319 | 0.0000 | Yes |
| Non-Hispanic White-Other Hispanic | -0.0579 | -0.0889 | -0.0269 | 0.0000 | Yes |
| Other-Mexican American | -0.0533 | -0.0874 | -0.0193 | 0.0002 | Yes |
| Other-Other Hispanic | -0.0489 | -0.0834 | -0.0144 | 0.0011 | Yes |
Write a short paragraph: “The groups that differed were …”
The groups that differed were Non-Hispanic Black, Non-Hispanic White, Mexican American, Other Hispanic, and Other. Non-Hispanic Black participants had much higher BMD compared to Non-Hispanic White (+0.085 g/cm², p < 0.001) and Other (+0.076 g/cm², p < 0.001). Non-Hispanic White had much lower BMD compared to Mexican American (-0.062 g/cm², p < 0.001) and Other Hispanic (-0.058 g/cm², p < 0.001). The Other group also differed significantly from Mexican American (-0.053 g/cm², p < 0.001) and Other Hispanic (-0.049 g/cm², p = 0.001).
Compute η² and interpret it (small/moderate/large is okay as a qualitative statement).
Formula: η² = SS_between / SS_total
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ethnicity 4 2.95 0.737 30.2 1.65e-24
## 2 Residuals 2281 55.7 0.0244 NA NA
anova_tbl <- broom::tidy(fit_aov)
ss_between <- anova_tbl$sumsq[1]
ss_total <- sum(anova_tbl$sumsq)
eta_squared <- ss_between / ss_total
data.frame(
Measure = c("SS Between", "SS Total", "Eta-squared", "% Variance"),
Value = c(round(ss_between, 3), round(ss_total, 3),
round(eta_squared, 4), paste0(round(eta_squared*100, 2), "%"))
) %>%
kable(caption = "Effect Size")| Measure | Value |
|---|---|
| SS Between | 2.948 |
| SS Total | 58.646 |
| Eta-squared | 0.0503 |
| % Variance | 5.03% |
Interpret in 1–2 sentences:
The eta-squared value is 0.050. This indicates that ethnicity explains 5.0% of the variance in bone mineral density.This is a small to medium effect.While statistically significant, ethnicity accounts for a modest portion of BMD variance. Therefore it suggests that there are other factors having an effect.
In this section you will:
Calculate total nutrient intake by adding dietary and supplemental sources:
# Create total calcium and vitamin D variables
# Note: Replace NA in supplement variables with 0 (no supplementation)
bmd2 <- bmd %>%
mutate(
DSQTCALC_0 = replace_na(DSQTCALC, 0),
DSQTVD_0 = replace_na(DSQTVD, 0),
calcium_total = calcium + DSQTCALC_0,
vitd_total = vitd + DSQTVD_0
)
summary(bmd2[, c("calcium_total", "vitd_total")])## calcium_total vitd_total
## Min. : 55.0 Min. : 0.000
## 1st Qu.: 596.8 1st Qu.: 2.900
## Median : 895.5 Median : 8.533
## Mean : 997.6 Mean : 28.676
## 3rd Qu.:1285.0 3rd Qu.: 30.400
## Max. :5399.0 Max. :2574.650
## NA's :293 NA's :293
Before calculating correlations, examine the distributions of continuous variables. Based on skewness and the presence of outliers, you may need to apply transformations.
Create histograms to assess distributions:
p1 <- ggplot(bmd2, aes(x = BMXBMI)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(title = "BMI", x = "BMI") +
theme_minimal()
p2 <- ggplot(bmd2, aes(x = calcium_total)) +
geom_histogram(bins = 30, fill = "darkgreen") +
labs(title = "Total Calcium", x = "Calcium (mg/day)") +
theme_minimal()
p3 <- ggplot(bmd2, aes(x = vitd_total)) +
geom_histogram(bins = 30, fill = "red") +
labs(title = "Total Vitamin D", x = "Vitamin D (mcg/day)") +
theme_minimal()
p4 <- ggplot(bmd2, aes(x = totmet)) +
geom_histogram(bins = 30, fill = "purple") +
labs(title = "Physical Activity", x = "MET-min/week") +
theme_minimal()
library(patchwork)
(p1 | p2) / (p3 | p4)Based on your assessment, apply transformations as needed:
bmd2 <- bmd2 %>%
mutate(
log_bmi = log(BMXBMI),
log_calcium_total = log(calcium_total + 1),
log_vitd_total = log(vitd_total + 1),
sqrt_totmet = sqrt(totmet)
)
p1t <- ggplot(bmd2, aes(x = log_bmi)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(title = "log(BMI)") + theme_minimal()
p2t <- ggplot(bmd2, aes(x = log_calcium_total)) +
geom_histogram(bins = 30, fill = "darkgreen") +
labs(title = "log(Calcium+1)") + theme_minimal()
p3t <- ggplot(bmd2, aes(x = log_vitd_total)) +
geom_histogram(bins = 30, fill = "red") +
labs(title = "log(VitD+1)") + theme_minimal()
p4t <- ggplot(bmd2, aes(x = sqrt_totmet)) +
geom_histogram(bins = 30, fill = "purple") +
labs(title = "sqrt(MET)") + theme_minimal()
(p1t | p2t) / (p3t | p4t)Document your reasoning: “I applied/did not apply transformations to [variables] because…”
I applied transformations to improve normality for correlation analysis. Total calcium and vitamin D showed right skew with outliers.
For each correlation, report:
# Function to calculate Pearson correlation for a pair of variables
corr_pair <- function(data, x, y) {
# Select variables and remove missing values
d <- data %>%
select(all_of(c(x, y))) %>%
drop_na()
# Need at least 3 observations for correlation
if(nrow(d) < 3) {
return(tibble(
x = x,
y = y,
n = nrow(d),
estimate = NA_real_,
p_value = NA_real_
))
}
# Calculate Pearson correlation
ct <- cor.test(d[[x]], d[[y]], method = "pearson")
# Return tidy results
tibble(
x = x,
y = y,
n = nrow(d),
estimate = unname(ct$estimate),
p_value = ct$p.value
)
}Examine correlations between potential predictors and BMD:
Use transformed versions where appropriate.
tableA <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "DXXOFBMD"),
corr_pair(bmd2, "log_bmi", "DXXOFBMD"),
corr_pair(bmd2, "log_calcium_total", "DXXOFBMD"),
corr_pair(bmd2, "log_vitd_total", "DXXOFBMD")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = format.pval(p_value, digits = 3)
)
tableA %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | DXXOFBMD | 2286 | -0.232 | <2e-16 |
| log_bmi | DXXOFBMD | 2275 | 0.425 | <2e-16 |
| log_calcium_total | DXXOFBMD | 2129 | 0.012 | 0.5919 |
| log_vitd_total | DXXOFBMD | 2129 | -0.070 | 0.0013 |
Interpret the results:
Age shows a moderate negative correlation with BMD (r = -0.232, p < 0.001), consistent with age-related bone loss. Log-BMI demonstrates a moderate positive correlation (r = 0.425, p < 0.001). This suggests higher body mass is associated with greater bone density.
Examine correlations between potential confounders and physical activity:
tableB <- bind_rows(
corr_pair(bmd2, "RIDAGEYR", "sqrt_totmet"),
corr_pair(bmd2, "log_bmi", "sqrt_totmet"),
corr_pair(bmd2, "log_calcium_total", "sqrt_totmet"),
corr_pair(bmd2, "log_vitd_total", "sqrt_totmet")
) %>%
mutate(
estimate = round(estimate, 3),
p_value = format.pval(p_value, digits = 3)
)
tableB %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | sqrt_totmet | 1934 | -0.097 | 1.96e-05 |
| log_bmi | sqrt_totmet | 1912 | 0.001 | 0.951153 |
| log_calcium_total | sqrt_totmet | 1777 | 0.086 | 0.000282 |
| log_vitd_total | sqrt_totmet | 1777 | -0.018 | 0.437819 |
Interpret the results:
Age shows a weak negative correlation with physical activity (r = -0.097, p < 0.001), indicating older adults are less active. BMI showed no significant correlation with activity (r = 0.001, p = 0.95). Log-calcium showed a weak positive correlation (r = 0.086, p < 0.001). This suggests that more active individuals may have better overall nutrition. ### Table C: Predictors associated with each other
Examine correlations among all predictor variables (all pairwise combinations):
preds <- c("RIDAGEYR", "log_bmi", "log_calcium_total", "log_vitd_total")
pairs <- combn(preds, 2, simplify = FALSE)
tableC <- map_dfr(pairs, ~corr_pair(bmd2, .x[1], .x[2])) %>%
mutate(
estimate = round(estimate, 3),
p_value = format.pval(p_value, digits = 3)
)
tableC %>% kable()| x | y | n | estimate | p_value |
|---|---|---|---|---|
| RIDAGEYR | log_bmi | 2834 | -0.098 | 1.83e-07 |
| RIDAGEYR | log_calcium_total | 2605 | 0.048 | 0.0135 |
| RIDAGEYR | log_vitd_total | 2605 | 0.190 | < 2e-16 |
| log_bmi | log_calcium_total | 2569 | 0.000 | 0.9831 |
| log_bmi | log_vitd_total | 2569 | -0.008 | 0.7005 |
| log_calcium_total | log_vitd_total | 2605 | 0.445 | < 2e-16 |
Interpret the results:
The predictor correlation matrix reveals weak to moderate correlations among variables. The strongest correlation is between calcium and vitamin D (r = 0.445, p < 0.001). All other correlations are very weak (|r| < 0.2). These predictors capture mostly independent aspects of health and demographics
Create scatterplots to visualize key relationships:
# Example: BMD vs BMI
ggplot(bmd2, aes(x = log_bmi, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "log(BMI)",
y = "BMD (g/cm²)",
title = "BMD vs BMI (transformed)"
)
# Example: BMD vs Physical Activity
ggplot(bmd2, aes(x = sqrt_totmet, y = DXXOFBMD)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "sqrt(totmet)",
y = "BMD (g/cm²)",
title = "BMD vs MET (transformed)"
)Write a structured reflection (200–400 words) addressing the following:
A. Comparing Methods (ANOVA vs Correlation)
B. Assumptions and Limitations
C. R Programming Challenge
YOUR REFLECTION HERE (200–400 words)
ANOVA compares means across categorical groups to test whether groups differ, while correlation measures the strength and direction of linear relationships between continuous variables. ANOVA tells us which groups differ through post-hoc tests, while correlation quantifies association strength through r and variance explained through r². For example, “Do mean blood pressure levels differ between normal weight, overweight, and obese individuals?” requires ANOVA because BMI category is categorical. “What is the relationship between daily sodium intake and blood pressure?” requires correlation because both are continuous.
Normality was the most challenging assumption. Nutritional variables showed extreme right skew. This could mean that many people take no supplements while some take very high doses. Physical activity required square root transformation due to many zeros.Assumption violations can reduce statistical power and produce incorrect p-values for correlations. However, the large size of this sample helped with these issues. The biggest limitation is the inability to establish causation. It’s seen through no calcium-BMD correlation and negative vitamin D correlation
Creating Table C with all pairwise correlations was most difficult. First I tried creating manually each correlation. But then I used combn() and map_dfr(). This helped generate combinations of correlations and then mapping the helper function. Transformation and missing data handling were my primary target skills that i solidified during this assignment.
Before submitting, verify:
Good luck! Remember to start early and use office hours if you need help.