Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.
AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
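# Load the packages used below (assumed: dplyr for glimpse() and %>%, ggplot2 for plotting)
library(dplyr)
library(ggplot2)
# Import the dataset (update the path if needed)
bmd <- read.csv("bmd2.csv")
# Quick check
glimpse(bmd)
## Rows: 2,898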
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),
    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )
## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")
## Final analytic N: 2129
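The code that printed the total and missing counts is not shown in this rendering. As a rough sketch of how such counts can be produced in base R, the example below uses a small made-up data frame (`toy` is hypothetical and is not the homework data), so its numbers are illustrative only.

```r
# Hypothetical toy data standing in for bmd (values are made up)
toy <- data.frame(
  DXXOFBMD = c(1.058, NA, 0.880, 0.851, NA, 0.994),
  calcium  = c(503.5, 473.5, NA, 1248.5, NA, 776.0)
)

n_total    <- nrow(toy)
n_miss_bmd <- sum(is.na(toy$DXXOFBMD))
n_miss_cal <- sum(is.na(toy$calcium))

# Rows missing either variable; this is smaller than the sum of the two
# counts above whenever some rows are missing both
n_miss_either <- sum(is.na(toy$DXXOFBMD) | is.na(toy$calcium))

# Keep only rows with both variables observed
toy_analytic <- toy[complete.cases(toy[, c("DXXOFBMD", "calcium")]), ]

cat("Total N:", n_total, "\n")
cat("Missing DXXOFBMD:", n_miss_bmd, "\n")
cat("Missing calcium:", n_miss_cal, "\n")
cat("Missing either:", n_miss_either, "\n")
cat("Final analytic N:", nrow(toy_analytic), "\n")
```

The same logic explains the homework counts: 612 + 293 = 905 missing values, but only 2898 - 2129 = 769 rows were dropped, which implies that 136 rows were missing both variables.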
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Linear Regression Between Dietary Calcium Intake\n(calcium, mg/day) And Total Femur Bone Mineral Density (DXXOFBMD, g/cm²)",
    x = "Dietary Calcium Intake (calcium, mg/day)",
    y = "Total Femur Bone Mineral Density (DXXOFBMD, g/cm²)"
  ) +
  theme_minimal()
Written interpretation (3–5 sentences):
The scatterplot shows a slight positive linear trend between dietary calcium intake and total femur BMD. The relationship appears weak: the fitted line rises only slightly and the points are widely scattered around it. There are some notable outliers, but no obvious non-linearity.
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)
##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
The intercept (b₀ = 0.8992) is the predicted total femur bone mineral density (DXXOFBMD, g/cm²) when dietary calcium intake (calcium, mg/day) equals 0. This is not a meaningful quantity in this context: a calcium intake of 0 mg/day is unrealistic, because people consume at least some calcium in their food.
B. Slope (β₁):
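The slope is b₁ = 3.079e-05. For every 1 mg/day increase in dietary calcium, the expected BMD increases by 3.079e-05 g/cm², on average. Because this model has a single predictor, no other variables are being held constant. The direction is positive, but the effect is small relative to typical calcium intake: a 100 mg/day increase corresponds to an expected difference of only about 0.003 g/cm².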
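To make the size of the slope easier to judge, the fitted line can be evaluated by hand. The sketch below copies the rounded coefficients from the summary output; the intake values in `intake` are chosen purely for illustration, and the results are approximate because the coefficients are rounded.

```r
b0 <- 8.992e-01   # intercept, copied from the summary output (rounded)
b1 <- 3.079e-05   # slope per 1 mg/day, copied from the summary output (rounded)

# Slope re-expressed per 100 mg/day of calcium
b1_per_100 <- b1 * 100

# Predicted mean BMD (g/cm²) at a few illustrative intakes (mg/day)
intake <- c(500, 1000, 1500)
pred   <- b0 + b1 * intake

print(b1_per_100)
print(data.frame(intake = intake, predicted_bmd = pred))
```

Across a 1000 mg/day range of intake, the predicted mean BMD moves by only about 0.03 g/cm², which is small next to the residual standard error of 0.1582 reported in the summary.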
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
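The hypotheses themselves are blank in this rendering. For a two-sided test of the slope in a simple linear regression, which is what the reported t-statistic and p-value correspond to, the standard statement would presumably be the following (the notation is an assumption, chosen to match the β₁ used elsewhere in this document):

```latex
H_0:\ \beta_1 = 0 \qquad \text{versus} \qquad H_A:\ \beta_1 \neq 0,
\qquad
t = \frac{b_1}{\mathrm{SE}(b_1)} \sim t_{n-2} \ \text{under } H_0 .
```

Here n - 2 = 2127, matching the residual degrees of freedom in the model summary.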
Report the test results:
t-statistic: 4.131
degrees of freedom: 2127
p-value: 3.75e-05
Conclusion: Because p = 3.75e-05 is below 0.05, we reject H₀. There is statistically significant evidence of a linear association between dietary calcium intake and BMD.
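The reported test results can be checked by hand from the coefficient table. This is a minimal sketch that uses the rounded estimate and standard error from the summary output, so the results agree with the reported values only up to rounding.

```r
est <- 3.079e-05   # slope estimate from the summary output (rounded)
se  <- 7.453e-06   # its standard error from the summary output (rounded)
df  <- 2127        # residual degrees of freedom (n - 2)

# t-statistic for H0: slope = 0
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

cat("t =", round(t_stat, 3), "\n")
cat("p =", signif(p_val, 3), "\n")
```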
Interpret the 95% confidence interval for β₁:
We are 95% confident that the true change in mean BMD associated with a 1 mg/day increase in calcium intake lies between 1.617334e-05 and 4.540649e-05 g/cm². Because the entire interval is above 0, it supports a positive association between calcium intake and BMD.
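The interval table above was presumably produced by `confint(model)`; that call is not shown in this rendering. As a sketch of what it computes, the same interval can be rebuilt from the rounded slope estimate and standard error, so the match is approximate.

```r
est <- 3.079e-05   # slope estimate from the summary output (rounded)
se  <- 7.453e-06   # its standard error from the summary output (rounded)
df  <- 2127        # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df)

# estimate +/- critical value * standard error
ci <- est + c(-1, 1) * t_crit * se

print(ci)
```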
R² (coefficient of determination):
R² = 0.007959. The model explains about 0.8% of the variability in BMD. Because the proportion of variance explained is so low, calcium intake alone predicts BMD poorly, even though the association is statistically significant. This suggests that other predictors play a larger role in the variation in BMD.
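In simple linear regression, R², the overall F-statistic, and the slope's t-statistic all carry the same information. The sketch below recovers R² from the rounded values in the summary output, which is a quick way to see why a highly significant slope can coexist with a tiny R² when n is large.

```r
t_stat   <- 4.131   # slope t-statistic from the summary output (rounded)
f_stat   <- 17.07   # overall F-statistic from the summary output (rounded)
df_resid <- 2127    # residual degrees of freedom

# With one predictor, F equals t squared (up to rounding)
t_squared <- t_stat^2

# R² recovered from F and the residual degrees of freedom
r2 <- f_stat / (f_stat + df_resid)

# Pearson correlation; taken as positive because the slope is positive
r <- sqrt(r2)

cat("t^2 =", round(t_squared, 2), "\n")
cat("R^2 =", round(r2, 6), "\n")
cat("r   =", round(r, 4), "\n")
```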
Residual Standard Error (RSE):
RSE = 0.1582 g/cm²
Observed BMD values typically deviate from the model's predicted values by about 0.1582 g/cm².
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")
## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")
## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
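The two intervals share the same center but differ greatly in width. The sketch below, which works only from the printed output and the rounded residual standard error, shows where the difference comes from: the prediction interval adds the person-to-person scatter (RSE) to the uncertainty in the estimated mean. Because the inputs are rounded, the rebuilt interval matches the printed one only approximately.

```r
fit    <- 0.9299936   # predicted BMD at calcium = 1000, from the output above
ci_upr <- 0.937076    # upper confidence limit, from the output above
rse    <- 0.1582      # residual standard error from the model summary (rounded)
df     <- 2127        # residual degrees of freedom

t_crit <- qt(0.975, df = df)

# Standard error of the estimated mean, backed out of the confidence interval
se_fit <- (ci_upr - fit) / t_crit

# A new individual adds residual scatter on top of that uncertainty
se_pred <- sqrt(rse^2 + se_fit^2)

pred_int <- fit + c(-1, 1) * t_crit * se_pred

cat("SE of mean:      ", round(se_fit, 4), "\n")
cat("SE of prediction:", round(se_pred, 4), "\n")
print(pred_int)
```

The residual term dominates: the standard error of the mean is tiny next to the RSE, so collecting more data would narrow the confidence interval but leave the prediction interval almost unchanged.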
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
The regression model tells me that calcium intake and BMD are positively associated. This is not surprising, because higher calcium intake is generally expected to support bone density, which may in turn lower the risk of bone-related health problems. The key limitation of interpreting SLR from a cross-sectional survey as causal evidence is temporality: exposure and outcome are measured at a single point in time, so we cannot establish that the calcium intake preceded the observed BMD. Confounders that might explain the observed association include age and race.
B. From ANOVA to Regression (5 points)
ANOVA compares the mean outcome across the levels of a categorical predictor, whereas simple linear regression estimates how the mean outcome changes per unit of a continuous predictor. Regression therefore yields a slope that describes the continuous relationship, which ANOVA does not. If my predictor is categorical I would use ANOVA; if my predictor is continuous I would use regression.
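The two methods are closely related: one-way ANOVA is a linear regression with a categorical predictor. As an illustration, the sketch below fits both to R's built-in `PlantGrowth` data (not the homework data; it is used here only because it ships with R and has a three-level factor) and checks that the overall F-statistics agree.

```r
# Built-in example data: weight (numeric) by group (factor with 3 levels)
data(PlantGrowth)

# One-way ANOVA
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# The same model written as a linear regression with a factor predictor
fit_lm <- lm(weight ~ group, data = PlantGrowth)

# Overall F-statistic from each fit
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
f_lm  <- summary(fit_lm)$fstatistic[["value"]]

cat("F from aov():", f_aov, "\n")
cat("F from lm(): ", f_lm, "\n")
cat("Same F? ", isTRUE(all.equal(f_aov, f_lm)), "\n")
```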
C. R Programming Growth (4 points)
The most challenging part of this assignment from a programming perspective was interpreting the difference between the confidence interval and the prediction interval. I read through the lecture materials to understand the difference. I feel more confident reading a 95% confidence interval table after completing this homework.
End of Homework 2