Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.
AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)
library(knitr)
library(gtsummary)
library(plotly)
library(ggeffects)
library(haven)
library(here)
# Import the dataset — update the path if needed
bmd <- read.csv("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Assignment\\2\\bmd.csv")
# Quick check
glimpse(bmd)
## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),
    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),
    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )
# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")
## Total N: 2898
cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")
## Missing DXXOFBMD: 612
cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")
## Final analytic N: 2129
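The counts above do not subtract cleanly: 2,898 − 612 − 293 = 1,993, not 2,129, because some participants are missing both variables. The following is a minimal sketch of that inclusion-exclusion check, using only the counts printed above; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the output above
n_total    <- 2898
n_miss_bmd <- 612
n_miss_cal <- 293
n_analytic <- 2129

# Rows dropped by filter() = rows missing DXXOFBMD OR calcium
n_excluded <- n_total - n_analytic

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
n_miss_both <- n_miss_bmd + n_miss_cal - n_excluded

cat("Excluded from analytic sample:", n_excluded, "\n")
cat("Missing both variables:", n_miss_both, "\n")
```

On the actual data, `sum(is.na(bmd$DXXOFBMD) & is.na(bmd$calcium))` would be a direct way to confirm the overlap.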
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Association Between Dietary Calcium Intake and Total Femur BMD",
    x = "Calcium Intake (calcium, mg/day)",
    y = "Femur Bone Mineral Density (DXXOFBMD, g/cm²)"
  ) +
  theme_minimal()
Written interpretation (3–5 sentences):
[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]
Ans: The fitted line slopes slightly upward, indicating a positive linear trend between dietary calcium intake and total femur BMD. The relationship is weak: the points form a wide, diffuse cloud around the line, with BMD values spanning roughly 0.4 to 1.6 g/cm² across the range of calcium intake. This visual impression agrees with the regression output, in which calcium intake explains less than 1% of the variation in BMD (R² ≈ 0.008). The slope is therefore detectable in a sample of this size, but calcium intake alone says little about any one person's BMD.
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)
##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]
Ans : The intercept (0.8992 g/cm²) represents the predicted total femur BMD when dietary calcium intake is zero. While it serves as the baseline of the regression equation, it has limited clinical or biological meaning because no participants in the dataset reported consuming zero dietary calcium. As this value lies outside the observed range of the data, interpreting it as a real-world prediction would require extrapolation beyond the available evidence. Therefore, the intercept should mainly be considered a mathematical component that establishes the starting point of the regression line rather than a meaningful substantive estimate.
B. Slope (β₁):
[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]
Ans: The estimated slope is 3.079 × 10⁻⁵ g/cm² per mg/day: each additional 1 mg/day of dietary calcium is associated with an average increase of about 0.00003079 g/cm² in total femur BMD. The direction is positive, but the magnitude per unit is very small. For a more meaningful difference in intake, such as 500 mg/day, the predicted increase in BMD is approximately 0.0154 g/cm² (500 × 3.079 × 10⁻⁵). Even this change is modest compared with the spread of BMD around the fitted line (residual interquartile range ≈ 0.21 g/cm², from the residual summary above).
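Because the model is linear, the predicted change in BMD for any difference in intake is the slope multiplied by that difference. A small sketch of the rescaling, using the slope estimate from the summary() output above (the chosen intake differences are illustrative):

```r
# Slope copied from the summary() output (g/cm² per mg/day)
slope <- 3.079e-05

# Illustrative differences in calcium intake (mg/day)
delta_mg <- c(1, 100, 500, 1000)

# Predicted change in BMD (g/cm²) for each difference
delta_bmd <- slope * delta_mg

print(data.frame(delta_mg = delta_mg, delta_bmd = delta_bmd))
```

Refitting with a rescaled predictor, for example `lm(DXXOFBMD ~ I(calcium / 500), data = bmd_analytic)`, should report the per-500-mg slope directly without changing the t-statistic or R².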
# 95% confidence interval for the slope
confint(model)
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
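State your hypotheses:
H₀: β₁ = 0 (no linear association between dietary calcium intake and total femur BMD)
H₁: β₁ ≠ 0 (a linear association exists)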
Report the test results:
[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]
Ans: The t-statistic for the slope is t = 4.131 on 2,127 degrees of freedom, with a p-value of 3.75 × 10⁻⁵ (p < 0.001). Because p < 0.05, we reject H₀. This provides statistically significant evidence that dietary calcium intake is positively associated with total femur BMD in this sample.
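The reported test can be reproduced by hand from the coefficient table. This sketch uses the rounded estimate and standard error printed by summary(), so the results should match the reported values only up to rounding:

```r
# Values copied from the coefficient table above
estimate  <- 3.079e-05   # slope for calcium
std_error <- 7.453e-06   # its standard error
df_resid  <- 2127        # residual degrees of freedom (n - 2)

# t-statistic for H0: beta1 = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

cat("t =", round(t_stat, 3), "\n")
cat("p =", signif(p_value, 3), "\n")
```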
Interpret the 95% confidence interval for β₁:
[Interpret the CI in plain language — what range of values is plausible for the true slope?]
Ans: The 95% confidence interval for the slope ranges from 1.617 × 10⁻⁵ to 4.541 × 10⁻⁵ g/cm² per mg/day. This indicates that we are 95% confident that the true population slope lies within this interval. Practically, this suggests that for every additional 1 mg/day increase in dietary calcium intake, the average total femur BMD is expected to increase by approximately 0.00001617 to 0.00004541 g/cm². Since the entire confidence interval is greater than zero, the finding supports rejecting the null hypothesis (H₀) of no association between calcium intake and femur BMD.
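The interval returned by confint() is the estimate plus or minus a t critical value times the standard error. A minimal sketch with the rounded values from the coefficient table, so agreement with the confint() output is approximate:

```r
# Values copied from the coefficient table above
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# estimate +/- t_crit * SE
ci <- estimate + c(-1, 1) * t_crit * std_error

cat("95% CI for the slope:", signif(ci, 4), "\n")
```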
R² (coefficient of determination):
[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]
Ans : The R² value is 0.007959 (about 0.80%), indicating that dietary calcium intake explains less than 1% of the variation in total femur BMD in this sample. Although the association is statistically significant (p < 0.001), the model demonstrates very poor explanatory power according to conventional standards. This extremely low R² suggests that bone mineral density is influenced by numerous other factors not included in this model, such as age, sex, body mass index, physical activity, smoking status, race/ethnicity, and calcium supplementation. Therefore, relying solely on dietary calcium intake provides limited ability to explain variability in BMD in this population.
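In simple linear regression, R², the overall F-statistic, and the slope's t-statistic are tied together: F = t² and R² = F / (F + residual df). The sketch below checks this with the rounded values from the summary() output; it also recovers the implied Pearson correlation, which takes the sign of the slope.

```r
# Values copied from the summary() output above
t_stat   <- 4.131
df_resid <- 2127

# With one predictor, the overall F equals the squared slope t-statistic
f_stat <- t_stat^2

# R-squared implied by F and the residual degrees of freedom
r_squared <- f_stat / (f_stat + df_resid)

# Pearson correlation: positive because the slope is positive
r <- sqrt(r_squared)

cat("F =", round(f_stat, 2), " R-squared =", signif(r_squared, 4),
    " r =", round(r, 3), "\n")
```

A correlation of roughly 0.09 is another way to see why a highly significant slope can coexist with very little explained variance in a sample of this size.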
Residual Standard Error (RSE):
[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]
Ans: The residual standard error is 0.1582 g/cm² on 2,127 degrees of freedom. It estimates the standard deviation of the residuals, so observed BMD values typically fall about 0.16 g/cm² from the value the model predicts. Considering that BMD values in this population range from about 0.4 to 1.6 g/cm², as shown in the scatterplot, this level of error reflects a considerable amount of unexplained variability. This finding is consistent with the model's near-zero R²: calcium intake alone explains very little of the variation in BMD.
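The RSE is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below illustrates the definition on simulated data (not the NHANES sample; the simulation parameters are arbitrary assumptions) and compares the manual calculation with sigma(), which extracts the same quantity from a fitted lm object.

```r
set.seed(553)

# Simulated data for illustration only (not the homework dataset)
n <- 200
x <- runif(n, min = 200, max = 2000)
y <- 0.9 + 3e-05 * x + rnorm(n, sd = 0.16)

fit <- lm(y ~ x)

# Manual RSE: sqrt(RSS / (n - 2))
rss        <- sum(residuals(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

# Same quantity as reported by summary(fit) and sigma(fit)
print(c(manual = rse_manual, from_lm = sigma(fit)))
```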
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")
## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")
## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
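Both intervals share the same fitted value; the prediction interval is wider because it adds the residual variance to the uncertainty in the estimated mean. The sketch below rebuilds the prediction interval from the confidence interval and the RSE reported above. Because it uses rounded printed values, the match with predict() is approximate.

```r
# Values copied from the output above
fit_value <- 0.9299936
ci_lower  <- 0.9229112
ci_upper  <- 0.937076
rse       <- 0.1582
df_resid  <- 2127

t_crit <- qt(0.975, df = df_resid)

# Standard error of the estimated mean, recovered from the CI width
se_mean <- (ci_upper - ci_lower) / (2 * t_crit)

# Standard error for a new individual adds the residual variance
se_pred <- sqrt(se_mean^2 + rse^2)

pi_manual <- fit_value + c(-1, 1) * t_crit * se_pred

cat("CI width:", round(ci_upper - ci_lower, 4), "\n")
cat("PI (manual):", round(pi_manual, 4), "\n")
```

The prediction interval is dominated by the residual term: the standard error of the mean is far smaller than the RSE, so collecting more data would narrow the confidence interval but leave the prediction interval nearly unchanged.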
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description.]
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]
Ans: The regression analysis shows a statistically significant but very small positive association between dietary calcium intake and total femur bone mineral density (BMD). Dietary calcium explains less than 1% of the variation in BMD, which is noteworthy given the strong emphasis placed on calcium in bone health recommendations. However, this result is not entirely surprising, as BMD is determined by a wide range of factors. These include genetics, age, hormonal status, physical activity, body composition, and other biological and lifestyle influences that cannot be captured by a single dietary variable. An important limitation of this analysis is its cross-sectional design. Because calcium intake and BMD were measured at the same time, the temporal relationship between these variables cannot be established. As a result, causal conclusions about the effect of calcium intake on BMD cannot be drawn. Additionally, several potential confounding factors may influence the observed association. Age, for example, may play a role because older adults often experience lower BMD and may also consume fewer calories and less calcium due to reduced appetite. Sex may also confound the relationship, since men typically have higher BMD and may consume greater amounts of calories and calcium. Physical activity is another important factor, as individuals who are more physically active tend to have higher bone density and may also follow healthier dietary patterns overall.
B. From ANOVA to Regression (5 points)
[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]
Ans: One-way ANOVA and simple linear regression are both statistical methods used to analyze a continuous outcome, but they address different types of research questions. One-way ANOVA is used to determine whether the mean value of an outcome, such as BMD, differs across several distinct categories, such as racial or ethnic groups. It provides an overall (omnibus) F-test that indicates whether at least one group mean differs from the others, but the F-test by itself does not identify which groups differ or by how much; that requires post hoc comparisons. In contrast, simple linear regression examines the relationship between a continuous predictor and a continuous outcome. It estimates a slope that represents the expected change in the outcome (e.g., BMD) for each one-unit increase in the predictor variable. Linear regression also provides predicted values of the outcome at specific levels of the predictor, confidence intervals for those predictions, and an R² statistic that describes how much of the variation in the outcome is explained by the model. Overall, ANOVA is most appropriate when the independent variable is categorical and the goal is to compare mean values across groups. Linear regression, on the other hand, is more suitable when the predictor is continuous and the objective is to evaluate a dose–response relationship or make predictions about the outcome.
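The two methods are closely related: a one-way ANOVA is equivalent to a linear regression on a categorical predictor, and the omnibus F-test is the same in both. The sketch below illustrates this on simulated data (three hypothetical groups with arbitrary means, not the Homework 1 ethnic groups).

```r
set.seed(553)

# Simulated outcome for three hypothetical groups (illustration only)
group <- factor(rep(c("A", "B", "C"), each = 40))
y     <- rnorm(120, mean = c(0.90, 0.95, 1.00)[as.integer(group)], sd = 0.15)

aov_fit <- aov(y ~ group)
lm_fit  <- lm(y ~ group)

# Omnibus F from the ANOVA table and from the regression summary
f_aov <- summary(aov_fit)[[1]][["F value"]][1]
f_lm  <- unname(summary(lm_fit)$fstatistic["value"])

print(c(anova = f_aov, regression = f_lm))

# Regression coefficients: reference-group mean plus group differences
print(coef(lm_fit))
```

The regression coefficients give the mean of the reference group and each other group's difference from it, which is the direction and magnitude information that the omnibus F-test alone does not report.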
C. R Programming Growth (4 points)
[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]
Ans: The most challenging part of this assignment was distinguishing between the confidence interval and the prediction interval produced by the predict() function. Because both intervals are generated using very similar code and produce output that looks almost identical, it was initially easy to confuse them. Understanding the conceptual difference helped resolve this confusion. A confidence interval estimates where the average BMD for a group of individuals with a given calcium intake is expected to fall, whereas a prediction interval estimates where the BMD of a single new individual might fall. Since individual outcomes vary more than group averages, the prediction interval is naturally wider. Another concept that required careful interpretation was the residual standard error. At first, the value of 0.1582 g/cm² was not very meaningful on its own. However, its significance became clearer when placed in context. Considering that BMD values in this population range from approximately 0.4 to 1.6 g/cm², a typical prediction error of 0.1582 g/cm² represents a considerable amount of variability. This indicates that a substantial portion of variation in BMD remains unexplained by dietary calcium alone. After completing this assignment, I feel much more confident using lm(), interpreting the results from summary(), and applying functions such as confint() and predict() to obtain inferential results that go beyond the default regression output.
End of Homework 2