Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.
AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Load the tidyverse (provides %>%, mutate(), filter(), glimpse(), and ggplot2)
library(tidyverse)
# Import the dataset (update the path if needed)
bmd <- read.csv('bmd.csv')
# Quick check
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)
## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")
## Final analytic N: 2129
The original dataset contained N = 2,898 observations. After excluding the 769 individuals with missing values for calcium intake or femur BMD, the analytic sample included N = 2,129 participants.
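The sample-size accounting can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts printed in the output above; the variable names are illustrative and are not part of the assignment code.

```r
# Counts copied from the printed output (variable names are illustrative)
n_total        <- 2898   # rows in bmd
n_missing_bmd  <- 612    # missing DXXOFBMD
n_missing_calc <- 293    # missing calcium
n_analytic     <- 2129   # rows kept by filter()

# Participants dropped by the filter() step
n_excluded <- n_total - n_analytic

# The two missing counts overlap, so their sum exceeds the number excluded.
# By inclusion-exclusion, the difference is the number missing both variables.
n_missing_both <- n_missing_bmd + n_missing_calc - n_excluded

cat("Excluded:", n_excluded, "\n")
cat("Missing both DXXOFBMD and calcium:", n_missing_both, "\n")
```

The two missing counts do not simply add up to the number excluded because some participants are missing both variables and are only dropped once.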
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "Association Between Dietary Calcium Intake and Femur Bone Mineral Density",
x = "Dietary Calcium Intake (mg/day)",
y = "Total Femur Bone Mineral Density (g/cm²)"
) +
theme_minimal()
Written interpretation (3–5 sentences):
The scatterplot suggests a positive linear association between dietary calcium intake and total femur bone mineral density (BMD). As calcium intake increases, femur BMD tends to increase slightly, although the points are widely dispersed. This indicates that the relationship appears relatively weak, with substantial variability in BMD across different levels of calcium intake. There do not appear to be strong non-linear patterns, though a few observations with very high calcium intake may represent potential outliers. Overall, the plot suggests a modest positive relationship between calcium intake and femur BMD.
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)
##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
The intercept (β₀) represents the estimated total femur bone mineral density (BMD) when dietary calcium intake is 0 mg/day. Based on the regression model, the predicted BMD at zero calcium intake is approximately 0.899 g/cm². In reality, consuming 0 mg/day of calcium is unlikely, so this value is not particularly meaningful in a practical sense. Instead, the intercept mainly serves as the baseline point where the regression line crosses the y-axis.
B. Slope (β₁):
The slope (β₁) represents the estimated difference in mean total femur bone mineral density associated with a 1 mg/day higher dietary calcium intake. According to the model, each additional 1 mg/day of calcium intake is associated with an increase in femur BMD of approximately 0.0000308 g/cm². This indicates a positive association, meaning higher calcium intake is associated with slightly higher BMD. The per-milligram slope looks tiny partly because of the unit: calcium intake typically varies by hundreds of milligrams per day, and on that scale the estimate is about 0.003 g/cm² per 100 mg/day, which is still a small effect.
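Because the slope is expressed per 1 mg/day, it can help to rescale it to intake differences that people actually experience. The sketch below multiplies the slope reported in the summary(model) output by a few increments; the increments themselves are illustrative choices, not values from the assignment.

```r
# Slope estimate copied from the summary(model) output (g/cm^2 per mg/day)
b1 <- 3.079e-05

# Intake differences in mg/day (illustrative choices)
increments <- c(1, 100, 500)

# Expected difference in mean BMD for each intake difference
bmd_diff <- b1 * increments

data.frame(calcium_increase_mg = increments, bmd_difference = bmd_diff)
```

With the fitted model object available, the same rescaling could be done with coef(model)["calcium"] * 100 instead of the hard-coded estimate.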
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
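State your hypotheses:
H₀: β₁ = 0 (there is no linear association between dietary calcium intake and total femur BMD)
H₁: β₁ ≠ 0 (there is a linear association between dietary calcium intake and total femur BMD)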
Report the test results:
From the regression output, the estimated slope for calcium intake has a t-statistic of 4.131 with 2127 degrees of freedom and a p-value of 3.75 × 10⁻⁵. Because the p-value is much smaller than the conventional significance level of 0.05, we reject the null hypothesis. This provides strong statistical evidence of a linear association between dietary calcium intake and total femur bone mineral density. Specifically, higher calcium intake is associated with slightly higher BMD.
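The test statistic and p-value can be reproduced by hand from the coefficient table. This is a minimal sketch that hard-codes the rounded estimate, standard error, and degrees of freedom from the output above, so the results should match the reported values only up to rounding.

```r
# Values copied (rounded) from the coefficient table in summary(model)
b1    <- 3.079e-05   # slope estimate for calcium
se_b1 <- 7.453e-06   # standard error of the slope
df    <- 2127        # residual degrees of freedom (n - 2)

# t-statistic for H0: beta1 = 0
t_stat <- b1 / se_b1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

cat("t =", round(t_stat, 3), "\n")
cat("p =", signif(p_value, 3), "\n")
```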
Interpret the 95% confidence interval for β₁:
The 95% confidence interval for the slope ranges from 1.62 × 10⁻⁵ to 4.54 × 10⁻⁵ g/cm² per mg/day of calcium intake. This means that we are 95% confident that the true increase in femur BMD associated with a 1 mg/day increase in calcium intake lies within this range. Because the entire interval is positive and does not include zero, it further supports the conclusion that calcium intake is positively associated with femur BMD. However, the magnitude of the effect is relatively small.
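The interval reported by confint() follows the usual form, estimate ± critical t value × standard error. The sketch below rebuilds it from the rounded values in the coefficient table, so small differences in the last digits are expected.

```r
# Values copied (rounded) from the coefficient table in summary(model)
b1    <- 3.079e-05   # slope estimate for calcium
se_b1 <- 7.453e-06   # standard error of the slope
df    <- 2127        # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df)

# Lower and upper limits of the 95% confidence interval for the slope
ci_slope <- b1 + c(-1, 1) * t_crit * se_b1

print(ci_slope)
```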
R² (coefficient of determination):
The R² value for this model is 0.00796, meaning that approximately 0.8% of the variability in total femur bone mineral density (BMD) is explained by dietary calcium intake. This indicates that the model has very limited explanatory power, as calcium intake alone accounts for only a small proportion of the variation in BMD. Although the association is statistically significant, the low R² suggests that many other factors likely influence BMD. These may include variables such as age, sex, body mass index, physical activity, vitamin D levels, and other health or lifestyle factors.
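In a simple linear regression, R², the F-statistic, and the slope's t-statistic are tied together: F equals t², and R² = F / (F + residual df). The sketch below checks this using the rounded values printed in the model summary; agreement is therefore approximate.

```r
# Values copied (rounded) from the summary(model) output
f_stat   <- 17.07    # overall F-statistic on 1 and 2127 df
df_resid <- 2127     # residual degrees of freedom
t_slope  <- 4.131    # t-statistic for the calcium slope

# With one predictor, the overall F-statistic is the square of the slope's t
t_squared <- t_slope^2

# R-squared recovered from F and the residual degrees of freedom
r_squared <- f_stat / (f_stat + df_resid)

# Absolute value of the correlation between calcium and DXXOFBMD
r_abs <- sqrt(r_squared)

cat("t^2 =", round(t_squared, 2), " F =", f_stat, "\n")
cat("R-squared =", signif(r_squared, 3), " |r| =", round(r_abs, 3), "\n")
```

A correlation of roughly 0.09 is consistent with the wide scatter seen in the plot.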
Residual Standard Error (RSE):
The residual standard error (RSE) of the model is 0.158 g/cm², which represents the typical difference between the observed BMD values and the values predicted by the regression model. In other words, the model’s predictions for BMD are off by about 0.158 g/cm² on average. Because this error is relatively large compared with the range of BMD values, it further indicates that calcium intake alone does not strongly predict femur bone mineral density. Additional predictors would likely improve the model’s accuracy.
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")
## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")
## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
The regression model predicts that the total femur bone mineral density (BMD) for an individual with a dietary calcium intake of 1,000 mg/day is approximately 0.930 g/cm². The 95% confidence interval (0.923 to 0.937 g/cm²) represents the range in which the true mean BMD for individuals consuming 1,000 mg/day of calcium is expected to fall with 95% confidence. The 95% prediction interval (0.620 to 1.240 g/cm²) represents the range in which the BMD of a single new individual with calcium intake of 1,000 mg/day is likely to fall. The prediction interval is wider than the confidence interval because it accounts for both uncertainty in estimating the mean response and the natural variability of individual BMD values. A value of 1,000 mg/day is a meaningful value for prediction because it falls within the typical recommended and observed range of daily calcium intake.
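The difference in width between the two intervals can be reconstructed from the printed output. In this sketch, the standard error of the mean response is backed out of the reported confidence interval (my own shortcut, not how predict() computes it internally), and the prediction standard error then adds the residual variance. All inputs are rounded values from the output above, so the result should agree with predict() only to about three decimal places.

```r
# Values copied (rounded) from the model output
b0  <- 0.8992       # intercept
b1  <- 3.079e-05    # slope for calcium
rse <- 0.1582       # residual standard error
df  <- 2127         # residual degrees of freedom

# Point prediction at calcium = 1000 mg/day
fit <- b0 + b1 * 1000

# Critical value for 95% intervals
t_crit <- qt(0.975, df = df)

# SE of the mean response, backed out of the reported confidence interval
se_mean <- (0.937076 - 0.9229112) / (2 * t_crit)

# Prediction SE combines uncertainty in the mean with individual variability
se_pred <- sqrt(rse^2 + se_mean^2)

# Reconstructed 95% prediction interval
pred_int <- fit + c(-1, 1) * t_crit * se_pred

cat("fit =", round(fit, 4), "\n")
cat("prediction interval =", round(pred_int, 3), "\n")
```

Because the residual standard error is far larger than the standard error of the mean response, the prediction interval is dominated by person-to-person variability rather than by uncertainty in the fitted line.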
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
The simple linear regression model suggests that there is a statistically significant but very small positive association between dietary calcium intake and total femur bone mineral density (BMD). As calcium intake increases, BMD tends to increase slightly; however, the model explains less than 1% of the variability in BMD, indicating that calcium intake alone is not a strong predictor. This result was not entirely surprising because bone density is influenced by many biological and lifestyle factors beyond diet alone. A major limitation of interpreting this simple linear regression as causal evidence is that the data come from a cross-sectional survey, meaning that exposure and outcome are measured at the same time. Because of this, we cannot determine whether higher calcium intake actually causes higher BMD. Additionally, the observed association may be influenced by confounding variables such as age, sex, body mass index, physical activity, vitamin D status, hormonal factors, or overall health behaviors.
B. From ANOVA to Regression (5 points)
This assignment also highlights the difference between one-way ANOVA and regression analysis. In Homework 1, one-way ANOVA was used to compare the mean BMD across different ethnic groups, which allowed us to determine whether there were statistically significant differences in average BMD between categories of a categorical variable. In contrast, simple linear regression models the relationship between BMD and a continuous predictor, such as calcium intake, and estimates how much the outcome changes with each unit increase in the predictor. Regression therefore provides additional information, including the slope of the relationship and the ability to make predictions for specific values of the predictor. ANOVA is most appropriate when comparing group means for categorical variables, whereas regression is preferable when examining relationships involving continuous predictors or when prediction is of interest.
C. R Programming Growth (4 points)
From a programming perspective, one of the most challenging parts of
this assignment was correctly structuring the R Markdown code chunks and
ensuring that the regression model and prediction commands ran without
errors. Initially, interpreting the output and formatting the results
within the document required careful attention. Working through these
challenges helped reinforce my understanding of how to run regression
models in R, interpret model summaries, and generate predictions and
confidence intervals using the predict() function. As a
result, I now feel more confident in fitting linear regression models in
R and interpreting their statistical output.
End of Homework 2