Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)# Import the dataset — update the path if needed
bmd <- read.csv("/Users/emmanuelsmac/Desktop/bmd.csv")
# Quick check
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
Answer:
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")## Final analytic N: 2129
Answer:
| Quantity | Value |
|---|---|
| Missing DXXOFBMD only | 476 |
| Missing calcium only | 157 |
| Missing BOTH (overlap) | 136 |
| Rows dropped (missing either) | 769 |
| Final analytic N | 2,129 |
769 rows were dropped in total. The overlap of 136 rows (missing on both variables) is why the dropped rows (769) is less than the sum of individual missingness counts (612 + 293 = 905). The correct Final analytic N is exactly 2,129.
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Association Between Dietary Calcium Intake and Total Femur BMD",
x = "Dietary Calcium Intake (mg/day)",
y = "Total Femur Bone Mineral Density (g/cm²)"
) +
theme_minimal(base_size = 13)Written interpretation (3–5 sentences):
[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]
Answer
Interpretation:
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]
Answer:
B. Slope (β₁):
Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?
Answer:
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
MODEL HYPOTHESES:
Report the test results:
[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]
MODEL TEST RESULTS:
The t-statistic for the slope is t = 4.131 on 2,127 degrees of freedom, with a p-value of 3.75 × 10⁻⁵ (p < 0.001). Because p < 0.05, we reject H₀. This provides statistically significant evidence that dietary calcium intake is positively associated with total femur BMD in this sample.
Interpret the 95% confidence interval for β₁:
[Interpret the CI in plain language — what range of values is plausible for the true slope?]
Answer:
The 95% confidence interval for the slope is (1.617 × 10⁻⁵, 4.541 × 10⁻⁵) g/cm² per mg/day. This means we are 95% confident that the true population slope falls within this range. In practical terms, for each additional 1 mg/day increase in dietary calcium intake, the true average increase in total femur BMD is likely between 0.00001617 and 0.00004541 g/cm². Because the entire interval lies above zero, the result is consistent with rejecting the null hypothesis (H₀) of no association.
R² (coefficient of determination):
[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]
Answer: R² INTERPRETATION:
The R² is 0.007959 (approximately 0.80%), meaning that dietary calcium intake explains less than 1% of the variance in total femur BMD in this sample. While the association is statistically significant (p < 0.001), the model fit is very poor by conventional standards. This low R² strongly suggests that BMD is influenced by many other factors beyond dietary calcium including age, sex, body mass index, physical activity, smoking status, race/ethnicity, and supplemental calcium none of which are accounted for in this simple model.
Residual Standard Error (RSE):
[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]
Answer RSE INTERPRETATION:
The residual standard error is 0.1582 g/cm². This represents the average distance between observed BMD values and the model’s predicted values the typical prediction error of the model is about 0.1582 g/cm². Given that BMD values in this population range from roughly 0.4 to 1.6 g/cm² (as visible in the scatterplot), an RSE of 0.1582 g/cm² represents substantial unexplained variability, consistent with the near-zero R².
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
Answer:
The model predicts a total femur BMD of 0.9229112 g/cm² for an individual with a dietary calcium intake of 1,000 mg/day.
Answer:
The 95% confidence interval for the mean BMD at calcium = 1,000 mg/day is (0.9229, 0.9371) g/cm². This interval estimates the average BMD across all individuals in the population who consume 1,000 mg/day of dietary calcium. It is narrow (width ≈ 0.014 g/cm²) because it reflects only the uncertainty in estimating the mean regression line at this point, not variability between individuals.
Answer:
The 95% prediction interval is (0.6196, 1.2404) g/cm². This is substantially wider than the confidence interval (width ≈ 0.621 g/cm²) because it must account for two sources of uncertainty simultaneously: the uncertainty in estimating the mean response, and the natural individual-to-individual variability in BMD. While the CI tells us where the average BMD falls for people consuming 1,000 mg/day, the PI tells us where a single new individual’s BMD would likely fall and people vary considerably even at the same calcium intake level.
Answer:
A calcium intake of 1,000 mg/day is a clinically meaningful and appropriate value to predict at. It corresponds to the Recommended Dietary Allowance (RDA) for adults aged 19–50, and it falls well within the observed range of calcium values in the dataset making this an interpolation rather than an extrapolation, so the prediction is on solid footing.
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]
Answer:
The regression model indicates a statistically significant but practically very small positive relationship between dietary calcium intake and total femur BMD. Dietary calcium accounts for less than 1% of the variation in BMD, which is notable given that calcium is often emphasized in bone health recommendations. However, this finding is not entirely unexpected, as BMD is influenced by many factors, including genetics, age, hormonal status, physical activity, body composition, and other biological and lifestyle determinants that cannot be captured by a single dietary variable.
An important limitation of this analysis is its cross-sectional design. Because calcium intake and BMD were measured at the same time, it is not possible to determine the temporal order of these variables, and therefore causal conclusions cannot be drawn about whether calcium intake affects BMD. In addition, several potential confounding factors could influence the observed association. For example, age may play a role, since older adults often have both lower BMD and reduced appetite or caloric intake. Sex may also confound the relationship, as men generally have higher BMD and may consume greater amounts of calories and calcium. Physical activity is another relevant factor, because individuals who are more physically active tend to have higher bone density and may also maintain healthier dietary patterns overall.
B. From ANOVA to Regression (5 points)
[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]
Answer
One-way ANOVA and simple linear regression are both commonly used methods for analyzing a continuous outcome, but they address different types of research questions. One-way ANOVA evaluates whether the mean BMD differs across distinct categories, such as racial or ethnic groups, and provides an overall (omnibus) F-test that indicates whether any differences exist among the groups, without specifying the direction or size of those differences. In contrast, simple linear regression examines how BMD changes as a function of a continuous predictor and estimates a slope that quantifies how much BMD is expected to change for each unit increase in that predictor. Regression also offers additional information not provided by ANOVA, such as predicted values at specific levels of the predictor, confidence intervals for those predictions, and an R² value that summarizes how well the model explains variation in the outcome. ANOVA is most appropriate when the predictor variable is categorical and the goal is to compare group means, whereas regression is more suitable when the predictor is continuous and the objective is to model a dose–response relationship or generate predictions.
C. R Programming Growth (4 points)
[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]
Answer:
The most challenging aspect of this assignment was distinguishing
between the confidence interval and the prediction interval produced by
the predict() function. Because both intervals are
generated using very similar code and produce nearly identical-looking
output, it was initially easy to confuse them. Clarifying the conceptual
difference helped resolve this issue: the confidence interval estimates
where the average BMD for a group of individuals with a given calcium
intake is expected to fall, whereas the prediction interval estimates
where the BMD of a single new individual might fall. Because individual
outcomes vary more than group averages, the prediction interval is
naturally wider.
Another concept that required careful interpretation was the residual standard error. At first, the value of 0.1582 g/cm² was not particularly informative on its own. However, placing it in context made its meaning clearer. Given that BMD values in this population range from approximately 0.4 to 1.6 g/cm², a typical prediction error of 0.1582 g/cm² represents a substantial margin of variability. This highlights how much variation in BMD remains unexplained by dietary calcium alone.
After completing this assignment, I feel much more confident using
lm(), interpreting the output from summary(),
and applying functions such as confint() and
predict() to obtain inferential results that extend beyond
the default regression output.
End of Homework 2