Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Import the dataset — update the path if needed
bmd <- read.csv('bmd.csv')
# Quick check
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")## Final analytic N: 2129
The raw dataset contains 2,898 observations. After excluding 612 records missing DXXOFBMD and 293 records missing calcium, the final analytic sample is n = 2,129.
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Dietary Calcium Intake and Total Femur Bone Mineral Density",
x = "Dietary Calcium Intake (mg/day)",
y = "Total Femur BMD (g/cm\u00b2)"
) +
theme_minimal()Written interpretation (3–5 sentences):
The scatterplot shows a weak positive linear trend between dietary calcium intake and total femur BMD. As calcium intake increases, BMD increases slightly. This is shown by the upward-sloping fitted regression line. However, the relationship is not strong. There is a vertical scatter at every level of calcium intake, especially at lower intakes where most observations are concentrated. The data are right-skewed along the x-axis. Also there is a dense cluster below 2,000 mg/day and a sparse tail of very high intakes extending to ~ 5,000 mg/day. No extreme outliers, and the relationship appears approximately linear. This supports the use of a simple linear regression model.
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
Fitted equation: \[\widehat{\text{DXXOFBMD}} = 0.89920 + 0.00003079 \times \text{calcium}\]
A. Intercept (β₀):
The intercept is the estimated mean total femur BMD for an individual with a dietary calcium intake of 0 mg/day. The model predicts a BMD of 0.8992 g/cm² when Calcium is at zero. This value is not meaningful because a calcium intake of exactly zero is unlikely and outside the observed range. The intercept serves as an anchor for the regression line and not an estimate.
B. Slope (β₁):
For every 1 mg/day increase in dietary calcium intake, total femur BMD is estimated to increase by 0.00003079 g/cm² on average. This is given that everything else is held equal. The direction of the association is positive. This is a likely outcome as the biological expectation is that higher calcium intake would correlate to better bone density. A 500 mg/day difference in calcium intake corresponds to an estimated BMD difference of 0.0154 g/cm². This is small compared to the typical BMD range of approximately 0.4 to 1.6 g/cm². This suggests that dietary calcium has a modest effect on femur BMD by itself.
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
Report the test results:
The t-statistic for the slope is t(2127) = 4.131, with p = 3.75 × 10^-5. Since p < 0.05, we reject H0 at alpha = 0.05. There is statistically significant evidence of a positive linear association between dietary calcium intake and total femur BMD.
Interpret the 95% confidence interval for β₁:
The 95% confidence interval for the slope is (0.00001617, 0.00004541) g/cm² per mg/day. Because the entire interval lies above zero, this is consistent with the rejection of H0. It provides more evidence for a positive association. The interval is narrow.It reflects a small estimated effect of calcium on BMD.
glance(model) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 2. Model Fit Statistics",
col.names = c("R-squared", "Adj. R-squared", "RSE", "F-statistic", "p-value", "N")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| R-squared | Adj. R-squared | RSE | F-statistic | p-value | N |
|---|---|---|---|---|---|
| 0.008 | 0.0075 | 0.1582 | 17.0654 | 0 | 2129 |
R² (coefficient of determination)=0.0080:
Dietary calcium intake explains only 0.80% of the variance in total femur BMD. This is an extremely low R² provides a poor fit. The vast majority of the variation in BMD is attributable to other factors that are not included in this model. This includes age, sex, body weight, physical activity, smoking status, and genetic predisposition to bone density. This does not necessarily mean calcium is not important. However, it does indicate that a single dietary predictor is insufficient to model BMD adequately. In addition, a multivariable approach including other covariates would be needed to explain meaningful variance.
Residual Standard Error (RSE) = 0.1582 g/cm²:
The RSE represents the average distance between observed BMD values and the values predicted by the fitted regression line. The model’s predictions are off by approximately 0.1582 g/cm². Given that total femur BMD ranges from roughly 0.4 to 1.6 g/cm², a prediction error of 0.1582 g/cm² represents a big proportion of that range. This adds on to the fact that dietary calcium alone is not a precise predictor of individual BMD.
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
The regression model confirms a statistically significant positive association between dietary calcium intake and total femur BMD (β₁ = 0.00003079, p = 3.75 × 10^-5). However, R² = 0.0080 shows that calcium explains less than 1% of the variance in bone density. BMD is determined by various factors, and dietary calcium intake is just one detail. The results cannot be interpreted causally because this is cross-sectional observational data. Individuals with higher calcium intake could differ in other ways that affect BMD. For example, they could be more physically active, have better diet, be of different age or sex, or have different BMI. Reverse causation is also possible if those with diagnosed with low bone density were prescribed to increase their calcium intake. This would attenuate or reverse the association.
B. From ANOVA to Regression (5 points)
Homework 1 used one-way ANOVA to compare mean BMD across various ethnic groups. It asked if the group membership is associated with differences in the mean outcome. That approach is appropriate for categorical predictors but cannot describe the functional relationship between BMD and a continuous variable like calcium intake. This is due ethnicity being treated as a set of categories without a specific order. This would be different with a quantitative predictor. Simple linear regression models BMD as a continuous linear function of calcium. This estimates a slope that determines the rate of change in BMD/unit increase in intake. In addition, it makes the prediction at any calcium point between the observed range possible. ANOVA answers the question about the difference across groups. In contrast, regression answers by how much the outcome changes per unit increase in the predictor. For a continuous dose-response question regression is more informative. ANOVA would be used when the predictor is categorical or when we compare groups and not continuous associations.
C. R Programming Growth (4 points)
The most challenging part of this assignment from a programming
perspective was the knitting process as it requires the
.Rmd file and the .csv to be in the same
working directory. I always seem to encounter issues with this
particular process. Working through this required understanding how
RStudio sets the working directory during knitting and how to
troubleshoot it. After this assignment, I feel more confident using
tidy() and glance() from the
broom package to extract clean model output, and using
predict() with both interval types.
End of Homework 2