Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.
AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Load the packages used below
library(tidyverse)
library(knitr)
library(kableExtra)
# Import the dataset; update the path if needed
bmd <- read_csv("~/Downloads/epi552/bmd(in).csv")
# Quick check
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),
    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )
## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
total_n <- nrow(bmd)
missing_bmd <- sum(is.na(bmd$DXXOFBMD))
missing_calcium <- sum(is.na(bmd$calcium))
missing_table <- tibble(
  Measure = c("Total sample size", "Missing DXXOFBMD", "Missing calcium"),
  Value = c(total_n, missing_bmd, missing_calcium)
)
missing_table %>%
  kable(caption = "Sample Size and Missingness Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Measure | Value |
|---|---|
| Total sample size | 2898 |
| Missing DXXOFBMD | 612 |
| Missing calcium | 293 |
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")
## Final analytic N: 2129
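The analytic N is larger than the total minus the two missing counts, which implies that some participants are missing both variables. A minimal sketch of that arithmetic, using only the counts printed above (the variable names here are illustrative and not part of the original analysis):

```r
# Counts copied from the output above
n_total        <- 2898
n_miss_bmd     <- 612
n_miss_calcium <- 293
n_analytic     <- 2129

# Rows dropped by the filter (missing DXXOFBMD, calcium, or both)
n_excluded <- n_total - n_analytic

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
n_miss_both <- n_miss_bmd + n_miss_calcium - n_excluded

cat("Excluded:", n_excluded, "\n")
cat("Missing both variables:", n_miss_both, "\n")
```

With the data loaded, the same overlap could be checked directly with `sum(is.na(bmd$DXXOFBMD) & is.na(bmd$calcium))`.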
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Association Between Dietary Calcium Intake and Total Femur BMD",
    x = "Calcium Intake (mg/day)",
    y = "Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()
Written interpretation (3–5 sentences):
[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)
##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.55653 -0.10570 -0.00561  0.10719  0.62624
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
The intercept is 0.8992 g/cm^2. This is the predicted mean total femur bone mineral density for a person with a dietary calcium intake of 0 mg/day. It is not meaningful in a real-world context because it is unrealistic for anyone to consume zero calcium.
B. Slope (β₁):
The slope is about 0.00003079 g/cm^2 per mg/day of calcium. This means that for every 1 mg/day increase in dietary calcium intake, predicted BMD increases by 0.00003079 g/cm^2, on average. The positive slope indicates a positive association between the two variables. However, the magnitude is very small: even a large increase in calcium intake corresponds to only a slight increase in BMD, which indicates a relatively weak relationship.
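Because a 1 mg/day change is a very small unit, the slope may be easier to read on a larger scale. The sketch below rescales the reported estimate to 100 and 1,000 mg/day. The slope value is copied from the `summary()` output, and the variable names are illustrative.

```r
# Slope copied from the summary() output (g/cm^2 per mg/day)
slope_per_mg <- 3.079e-05

# The same slope expressed per 100 and per 1,000 mg/day
slope_per_100mg  <- slope_per_mg * 100
slope_per_1000mg <- slope_per_mg * 1000

cat("Per 100 mg/day: ", slope_per_100mg, "g/cm^2\n")
cat("Per 1000 mg/day:", slope_per_1000mg, "g/cm^2\n")
```

Refitting the model with a rescaled predictor, for example `lm(DXXOFBMD ~ I(calcium / 100), data = bmd_analytic)`, should give the per-100 mg slope directly without changing the t-statistic, p-value, or R².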
# 95% confidence intervals for the intercept and slope
confint(model)
##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05
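The interval for the slope can be reproduced approximately by hand as estimate ± t critical value × standard error. This is a sketch using the rounded values printed in the `summary()` output, so the result agrees with the interval above only to about three significant digits.

```r
# Values copied from the summary() output
slope_est <- 3.079e-05   # estimated slope
slope_se  <- 7.453e-06   # standard error of the slope
df_resid  <- 2127        # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- t_crit * SE
slope_ci <- slope_est + c(-1, 1) * t_crit * slope_se
print(slope_ci)
```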
State your hypotheses:
H₀: β₁ = 0. There is no linear association between dietary calcium intake and total femur bone mineral density.
H₁: β₁ ≠ 0. There is a linear association between dietary calcium intake and total femur BMD.
Report the test results:
[Report the t-statistic, degrees of freedom, and p-value from the
summary() output above. State your conclusion: do you reject H₀? What
does this mean for the association between calcium and BMD?]
t-statistic: 4.131
degrees of freedom: 2127
p-value: 3.75 × 10^-5
Because the p-value is less than 0.05, we reject the null hypothesis. There is
statistically significant evidence of a linear association between dietary
calcium intake and total femur BMD.
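The reported test statistic and p-value follow directly from the slope and its standard error. The sketch below recomputes them from the rounded values in the `summary()` output, and also shows that in simple linear regression the F-statistic equals the squared t-statistic. The variable names are illustrative.

```r
# Values copied from the summary() output
slope_est <- 3.079e-05
slope_se  <- 7.453e-06
df_resid  <- 2127

# t-statistic for H0: beta1 = 0
t_stat <- slope_est / slope_se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# With one predictor, the overall F-statistic is t squared
f_stat <- t_stat^2

cat("t =", round(t_stat, 3), " p =", signif(p_value, 3), " F =", round(f_stat, 2), "\n")
```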
Interpret the 95% confidence interval for β₁:
[Interpret the CI in plain language — what range of values is plausible for the true slope?]
The 95% confidence interval for the slope, (0.000016, 0.000045) g/cm^2 per mg/day, gives the range of plausible values for the true change in mean BMD associated with a 1 mg/day increase in calcium intake. Because the entire interval lies above 0, it supports the conclusion that the association is positive, which is consistent with rejecting H₀ at the 0.05 level.
R² (coefficient of determination):
[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]
The R^2 value is 0.007959, meaning that dietary calcium intake explains about 0.80% of the variability in total femur BMD. The model therefore fits the data very weakly, and calcium intake alone accounts for little of the variation in BMD. This suggests that other factors, such as age, sex, BMI, and physical activity, likely play a much larger role in explaining differences in BMD.
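Two quick checks on the reported R² may help put it in context. In simple linear regression, the square root of R² is the absolute value of the Pearson correlation, and the adjusted R² follows from R² and the sample size. This sketch uses only numbers printed above, and the variable names are illustrative.

```r
# Values copied from the output above
r_squared  <- 0.007959
n_analytic <- 2129

# Pearson correlation implied by R^2 (positive because the slope is positive)
r <- sqrt(r_squared)

# Adjusted R^2 with one predictor: 1 - (1 - R^2) * (n - 1) / (n - 2)
adj_r_squared <- 1 - (1 - r_squared) * (n_analytic - 1) / (n_analytic - 2)

cat("r =", round(r, 4), " adjusted R^2 =", round(adj_r_squared, 6), "\n")
```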
Residual Standard Error (RSE):
[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]
The residual standard error is 0.1582 g/cm^2. It estimates the standard deviation of the residuals, that is, the typical distance between the observed BMD values and the values predicted by the model. On average, the model’s predictions are off by about 0.158 g/cm^2, indicating a moderate amount of unexplained variability in BMD.
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")
##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")
##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description.]
The model predicts that someone with a calcium intake of 1,000 mg/day would have a BMD of about 0.930 g/cm^2. The 95% confidence interval (0.923 to 0.937 g/cm^2) gives the range of plausible values for the mean BMD among all people with this level of calcium intake. The 95% prediction interval (0.620 to 1.240 g/cm^2) gives the range in which a single new individual’s BMD is expected to fall. It is much wider because it includes both the uncertainty in the estimated mean and the variability of individuals around that mean. A calcium intake of 1,000 mg/day is a reasonable value to predict at because it falls within the typical range in the data.
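The gap between the two intervals can be traced to one extra variance term. The sketch below rebuilds the prediction interval from the fitted value, the confidence interval, and the residual standard error reported above. It relies on rounded printed values, so it matches the `predict()` output only to about three decimal places. The variable names are illustrative.

```r
# Values copied from the output above
b0       <- 0.8992      # intercept
b1       <- 3.079e-05   # slope
rse      <- 0.1582      # residual standard error
df_resid <- 2127
ci_lwr   <- 0.9229112   # confidence interval at calcium = 1000
ci_upr   <- 0.937076

# Fitted value at calcium = 1000 mg/day
fit <- b0 + b1 * 1000

# Recover the standard error of the mean response from the CI half-width
t_crit  <- qt(0.975, df = df_resid)
se_mean <- (ci_upr - ci_lwr) / (2 * t_crit)

# A new individual adds residual variance on top of the uncertainty in the mean
se_pred  <- sqrt(rse^2 + se_mean^2)
pred_int <- fit + c(-1, 1) * t_crit * se_pred

cat("SE of mean:", round(se_mean, 4), " SE of prediction:", round(se_pred, 4), "\n")
print(pred_int)
```

Because the residual standard error is far larger than the standard error of the mean, nearly all of the prediction interval's width comes from individual variation rather than from uncertainty in the regression line.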
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]
This regression model shows a statistically significant but very weak positive association between dietary calcium intake and total femur bone mineral density. As calcium intake increases, BMD increases slightly, but the effect size is very small and the R^2 value shows that calcium explains less than 1% of the variation in BMD. This was not surprising, since bone health is influenced by many factors beyond calcium intake alone. A key limitation of this analysis is that the data are cross-sectional, meaning calcium intake and BMD were measured at the same time. Because of this, we cannot conclude that calcium intake causes changes in BMD. The observed association may also be confounded by variables such as age, sex, BMI, physical activity, smoking, and vitamin D intake, which can influence both calcium intake and BMD.
B. From ANOVA to Regression (5 points)
[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]
ANOVA was used to compare mean BMD across ethnic groups, which addresses whether the group averages differ. Regression, in contrast, allows us to examine the relationship between a continuous predictor (calcium) and a continuous outcome (BMD) and to estimate how much BMD changes for a given increase in calcium. Regression therefore provides more detailed information, such as the direction and magnitude of the relationship. I would use ANOVA when comparing categories, and regression when working with continuous predictors or when I want to model relationships.
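The two methods are also closely related: a one-way ANOVA is a linear regression with a categorical predictor. The sketch below illustrates this on simulated data, not on the homework dataset. The group labels, means, and sample sizes are hypothetical values chosen only for the demonstration.

```r
# Simulated example (hypothetical groups and means, not NHANES data)
set.seed(553)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y     = c(rnorm(30, mean = 0.90, sd = 0.15),
            rnorm(30, mean = 0.95, sd = 0.15),
            rnorm(30, mean = 1.00, sd = 0.15))
)

# One-way ANOVA and the equivalent regression on a factor
aov_fit <- aov(y ~ group, data = sim)
lm_fit  <- lm(y ~ group, data = sim)

# Overall F-statistic from each approach
f_aov <- summary(aov_fit)[[1]][["F value"]][1]
f_lm  <- unname(summary(lm_fit)$fstatistic["value"])

cat("ANOVA F:", f_aov, " Regression F:", f_lm, "\n")

# Regression coefficients: reference-group mean and differences from it
print(coef(lm_fit))
```

The F-statistics agree, and the regression additionally reports each group's difference from the reference group, which is one sense in which regression gives more than ANOVA.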
C. R Programming Growth (4 points)
[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]
End of Homework 2