Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Import the dataset — update the path if needed
bmd <- read.csv('C:/Users/Alyss/OneDrive/Desktop/epi553/data/bmd.csv')
# Quick check
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")## Final analytic N: 2129
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Dietary Calcium Intake vs. Femur Bone Mineral Density",
x = "Dietary Calcium Intake (mg/day)",
y = "Total Femur Bone Mineral Density (g/cm²)"
) +
theme_minimal()Written interpretation (3–5 sentences):
There is a positive linear trend between calcium intake and femur bone density. The relationship appears to be weak. Most of the points are centered on the lower end of calcium intake with a small number of points that vary as calcium intake increases.
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀ = 8.992e-01):
The intercept represents what the estimated mean total femur bone mineral density is when dietary calcium intake is 0. The intercept is there to anchor the line but is can not be directly interpreted.
B. Slope (β₁ = 3.079e-05):
For every 1 unit increase in calcium (mg/day), femur BMD is estimated to increase by 3.079e-05 g/cm², when holding everything else constant. The direction is positive. Given typical calcium intake ranges, the effect is small.
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
Report the test results:
t-statistic: 4.131
Degrees of freedom: 2128
p-value: 3.75e-05
With a p-value of 3.75e-05, we reject H₀ at the ⍺ = 0.05 level. There is statistically significant evidence of a linear association between dietary calcium intake and femur BMD.
Interpret the 95% confidence interval for β₁:
If we conducted this study many times, 95% of the confidence intervals would contain the true slope.
# Extract R-squared from model
r_sq <- summary(model)$r.squared
adj_r_sq <- summary(model)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.008 |
| Adjusted R² | 0.0075 |
| Variance Explained | 0.8% |
## Residual Standard Error: 0.16 g/cm²
R² (coefficient of determination):
The proportion of variance in BMD that is explained by dietary calcium intake is 0.8%. Based on R², the model fits the data poorly as over 99% of the data’s variance remains unexplained. This suggests that other predictors may play a crucial role in explaining the variance in BMD
Residual Standard Error (RSE):
RSE = 0.16 g/cm². On average, the model’s prediction is off by 0.16 g/cm² for BMD.
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
The regression model tells us that there is a weak positive relationship between dietary calcium intake and total femur bone mineral density. The results were a little surprising that dietary calcium intake does not have more effect on BMD. Key limitations of interpreting SLR from a cross-sectional survey as causal evidence are we cannot determine the direction of the association, and we cannot determine if the data is still applicable to current trends. Confounders that could explain the observed association are differences in gender, ethnicity, or smoking status.
One-way ANOVA answers the question of whether categorical groups have different means and simple linear regression answers the question of whether there is a linear association between variables of interest and predicts the outcome value based on a predictor variable. Regression gives us the strength and direction of the relationship, while ANOVA can only show whether there is a difference in groups. ANOVA should be used to compare categorical group and SLR should be used for looking into relationships between variables.
The most challenging part of this assignment was recoding the variables in the dataset. I had never recoded any variables in r before. I worked through it by referencing the lectures. I feel more confident in recoding variables and interpreting outputs from R.
End of Homework 2