Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Import the dataset — update the path if needed
bmd <- readr::read_csv("C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/bmd.csv", show_col_types = FALSE)
# Quick check
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_hw <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_hw), "\n")## Final analytic N: 2129
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_hw, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Scatter Plot of Calcium (mg/day) and Total Femur Bone Mineral Density (g/cm²)",
x = "Calcium (mg/day)",
y = "Total Femur Bone Mineral Density (g/cm²)"
) +
theme_minimal()Written interpretation (3–5 sentences):
[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]
Response: There appears to be a cluster of observations between Calcium and Total Femur Bone Mineral Density. The relationship appears to be a weak positive trend. There also appears to be some outliers in the dataset as well.
# Fit the simple linear regression model
model_hw <- lm(DXXOFBMD ~ calcium, data = bmd_hw)
# Display the full model summary
summary(model_hw)##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_hw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]
Response: The estimated mean Bone Mineral Density when Calcium = 0. The intercept is not directly interpret able in this context as calcium is always present in bone, but is necessary to anchor the line.
B. Slope (β₁):
[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]
Response: For each 1-day increase in calcium, Bone Mineral Density is estimated to increase by 3.1^{-5} g/cm², on average, holding all else constant (though there are no other variables in this simple model).
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between Calcium and Bone Mineral Density)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]
n <- nrow(bmd_hw)
# Extract slope test statistics
slope_test_hw <- tidy(model_hw, conf.int = TRUE) %>% filter(term == "calcium")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test_hw$estimate, 6),
round(slope_test_hw$std.error, 6),
round(slope_test_hw$statistic, 3),
n - 2,
format.pval(slope_test_hw$p.value, digits = 3),
round(slope_test_hw$conf.low, 6),
round(slope_test_hw$conf.high, 6)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value |
|---|---|
| Slope (b₁) | 3.1e-05 |
| SE(b₁) | 7e-06 |
| t-statistic | 4.131 |
| Degrees of freedom | 2127 |
| p-value | 3.75e-05 |
| 95% CI Lower | 1.6e-05 |
| 95% CI Upper | 4.5e-05 |
Report the test results:
[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]
Decision: With p < 0.001, we reject \(H_0\) at the \(\alpha = 0.05\) level. There is statistically significant evidence of a linear association between Calcium and Total Bone Mineral Density.
Interpret the 95% confidence interval for β₁:
[Interpret the CI in plain language — what range of values is plausible for the true slope?]
R² (coefficient of determination):
[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]
# Extract R-squared from model
r_sq_hw <- summary(model_hw)$r.squared
adj_r_sq_hw <- summary(model_hw)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq_hw, 4),
round(adj_r_sq_hw, 4),
paste0(round(r_sq_hw * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.008 |
| Adjusted R² | 0.0075 |
| Variance Explained | 0.8% |
Residual Standard Error (RSE):
# Augment dataset with fitted values and residuals
augmented_hw <- augment(model_hw)
# Show a sample of fitted values and residuals
augmented_hw %>%
select(DXXOFBMD, calcium, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Observed BMD (Y)", "Calcium (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Observed BMD (Y) | Calcium (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 1.058 | 503.5 | 0.915 | 0.143 |
| 0.801 | 473.5 | 0.914 | -0.113 |
| 0.851 | 1248.5 | 0.938 | -0.087 |
| 0.778 | 660.5 | 0.920 | -0.142 |
| 0.994 | 776.0 | 0.923 | 0.071 |
| 0.952 | 452.0 | 0.913 | 0.039 |
| 1.121 | 853.5 | 0.925 | 0.196 |
| 0.752 | 1183.5 | 0.936 | -0.184 |
| 1.085 | 752.5 | 0.922 | 0.163 |
| 0.899 | 570.5 | 0.917 | -0.018 |
# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample_hw <- augmented_hw %>% slice_sample(n = 80)
p_resid_hw <- ggplot(resid_sample_hw, aes(x = calcium, y = DXXOFBMD)) +
geom_segment(aes(xend = calcium, yend = .fitted),
color = "tomato", alpha = 0.5, linewidth = 0.5) +
geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
labs(
title = "Residuals Illustrated on the Regression Line",
subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
x = "calcium (mg/day)",
y = "BMD (g/cm²)"
) +
theme_minimal(base_size = 13)
p_resid_hwn <- nrow(bmd_hw)
ss_resid_hw <- sum(augmented_hw$.resid^2)
mse_hw <- ss_resid_hw / (n - 2)
sigma_hat_hw <- sqrt(mse_hw)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid_hw, 2), round(mse_hw, 3), round(sigma_hat_hw, 3)),
Units = c("", "", "kg/m²")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value | Units |
|---|---|---|
| SS Residual | 53.260 | |
| MSE (σ̂²) |
0.02
|
[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]
Interpretation: On average, our model’s predictions are off by about 0.16 Bone Mineral Density units.
# Create a new data frame with the target predictor value
new_data_hw <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
ci_pred_lab <- predict(model_hw, newdata = new_data_hw, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
# 95% prediction interval for a new individual at calcium = 1000
pi_pred_lab <- predict(model_hw, newdata = new_data_hw, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_data_hw, ci_pred_lab, pi_pred_lab) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals for Calcium = 1000",
col.names = c("Calcium", "Fitted BMD", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| Calcium | Fitted BMD | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 1000 | 0.93 | 0.92 | 0.94 | 0.62 | 1.24 |
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
For someone with calcium of 1000, the estimated mean number of bone mineral density is 093 days. The 95% confidence interval for this mean is .92 to .94, which means that we can say with 95 percent confidence that the true mean lies between these two values. The predicted confidence interval lies between .62 and 1.24. The predicted interval range is wider because it accounts for more variability. Given the data, the 1000 mg/day is not a meaningful value to predict BMD because the predicted confidence interval crosses over 1 indicating that the 1000 mg/day is not meaningful.
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]
B. From ANOVA to Regression (5 points)
[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]
C. R Programming Growth (4 points)
[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]
Response: The simple linear regression model showed there was a statistically significance between calcium and BMD. I figured that there would be an association between the two variables however I thought that calcium would explain more than .8% of bone mineral density. The key limitation of interpreting cross-sectional survey is that you cannot establish temporarily between the outcome variable and the key predictor variable.
Compared to homework 1 and ANOVA, the SLR is looking at something different. In ANOVA we were interested if there are differences across groups whereas in SLR we are testing to see if there is a association between two variables. ANOVA we can only determine if there is a difference between groups but we cannot determine if there is a relationship.You would use ANOVA if you have a continuous outcome and a categorical predictor that had independent observations. SLR would be used if you have a continuous outcome and wanted to quantify the relationship with a single predictor.
When it comes to programming skills, I didn’t have many difficulties with this assignment. I definitely feel more confident with my programming skills after this assignment and working through the progress report check in 1 for the final project.
End of Homework 2