Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Import the dataset — update the path if needed
bmd <- read.csv('~/Desktop/EPI553/data/bmd(in).csv')
# Quick check
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")## Final analytic N: 2129
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Relationship Between Dietary Calcium Intake and Total Femur Bone Mineral Density",
x = "Calcium Intake (mg/day)",
y = "Bone Mineral Density (g/cm^2)"
) +
theme_minimal()Written interpretation (3–5 sentences): [Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]
The intercept is 0.8992 g/cm^2 which represents the expected BMD when calcium intake is zero. In this context, this isn’t the most meaningful quanitity since most individuals don’t have a calcium intake of zero.
B. Slope (β₁):
[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]
The slope is 3.079e-05. This means for every 1 unit increase in calcium intake, the predicted BMD increases by 3.079e-05 (0.00003079) g/cm^2 on average. There is a positive relationship. The effect is small.
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
Report the test results:
[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]
t-statistic= 4.1310 df= 2127 p-value= 3.751e-05 (0.00003751) Since the p-value is less than 0.05, I reject the null hypothesis. This means there is positive statistically significant association between calcium intake and BMD. Therefore there is a positive linear relationship between dietary calcium intake and BMD.
Interpret the 95% confidence interval for β₁:
[Interpret the CI in plain language — what range of values is plausible for the true slope?]
The 95% CI for the slope of calcium intake is [0.00001617334, 0.00004540649]. Therefore, the true average increase for BMD for each 1 unit increase of calcium falls between 0.000016 and 0.000045.
R² (coefficient of determination):
[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]
R^2= 0.007959 0.8% of the variance in BMD is explained by dietary calcium intake. Based on the R^2 value the model does not fit the data very well since it is small. This suggests that there are other predictors that influence the variance of BMD more than calcium intake, including age, BMI, physical activity, genetics.
Residual Standard Error (RSE):
[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]
RSE= 0.1582 g/cm^2. This tells me that the predicted BMD values from the model differ from the observed BMD values by 0.1582 g/cm^2 on average.
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]
B. From ANOVA to Regression (5 points)
[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]
C. R Programming Growth (4 points)
[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]
This regression model shows a statistically significant positive relationship between calcium intake and BMD, indicating that as calcium intake increases so does BMD. Since R^2 is low, we can assume that little variation in BMD is explained by calcium intake. This result is not surprising because while greater calcium intake can improve bone mineral density it is also affected by other factors such as physical activity and genetics. A key limitation of interpreting SLR from a cross-sectional survey is causality can not be established. Because cross-sectional data is collected at one point in time, it cannot be assumed that calcium intake causes higher BMD in comparsion to confounders. Other confounders that might explain the observed association include age, sex, physical activity, diet.
The one way ANOVA used in homework 1 to compare BMD across ethnic groups measures if one categorical group mean differs from the other groups. The SLR examines the linear relationship between the mean of a continuous outcome and a single predictor. The single linear regression tells us the magnitude and direction of the relationship. ANOVA only tests for differences and cannot test for a linear trend which shows the direction and magnitude. ANOVA would be preferred when the predictor is categorical and group means want to be compared. SLR is preferred when the predictor is continuous.
End of Homework 2