Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Import the dataset — update the path if needed
bmd <- read.csv("C:/Users/abbym/OneDrive/Desktop/STATS553/Assignments/Homework #2/bmd(in).csv")
# Quick check
glimpse(bmd)## Rows: 2,898
## Columns: 14
## $ SEQN <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …
# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
mutate(
RIDRETH1 = factor(RIDRETH1,
levels = 1:5,
labels = c("Mexican American", "Other Hispanic",
"Non-Hispanic White", "Non-Hispanic Black", "Other")),
# Recode RIAGENDR as a labeled factor
RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("Male", "Female")),
# Recode smoker as a labeled factor
smoker = factor(smoker,
levels = c(1, 2, 3),
labels = c("Current", "Past", "Never"))
)## Total N: 2898
## Missing DXXOFBMD: 612
## Missing calcium: 293
# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
filter(!is.na(DXXOFBMD), !is.na(calcium))
cat("Final analytic N:", nrow(bmd_analytic), "\n")## Final analytic N: 2129
Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?
# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(
title = "The Association between Daily Calcium Intake and Bone Mineral Density",
x = "Calcium (mg/day)",
y = "Bone Mineral Density (g/cm²)"
) +
theme_minimal()Written interpretation (3–5 sentences):
[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?] The points on the scatterplot do not show a strong linear trend. The fitted line given is in the positive direction. The relationship appears weak. There are a a few outliers when it comes to calcium intake, as well as bone mineral density. —
# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)
# Display the full model summary
summary(model)##
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55653 -0.10570 -0.00561 0.10719 0.62624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.992e-01 7.192e-03 125.037 < 2e-16 ***
## calcium 3.079e-05 7.453e-06 4.131 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared: 0.007959, Adjusted R-squared: 0.007493
## F-statistic: 17.07 on 1 and 2127 DF, p-value: 3.751e-05
A. Intercept (β₀):
[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?] The intercept represents what bone mineral density would be if calcium intake was 0 mg/day. The intercept of 8.992e-01 g/cm² is not very meaningful, because it would be hard to not have any calcium intake during a day, especially during a long period of time.
B. Slope (β₁):
[Interpret the slope in 2–4 sentences. For every 1-unit increase
in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State
the direction. Is the effect large or small given typical calcium intake
ranges?] For every 1 unit increase in calcium, bone mineral density
increases by 3.079e-05 g/cm². The direction is positive, and even though
it is an increase by a very small amount, the effect is fairly large
given typical calcium intake ranges and the ranges of BMD
measures.
### Step 3 — Statistical Inference (15 points)
## 2.5 % 97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium 1.617334e-05 4.540649e-05
State your hypotheses:
Report the test results:
[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?] The t-statistic is 4.131, the degrees of freedom are 2127, and the p-value is 3.75e-05. You reject the null hypothesis. This means that there is a linear relationship between bone mineral density and calcium intake. Interpret the 95% confidence interval for β₁:
[Interpret the CI in plain language — what range of values is plausible for the true slope?] The range of values possible for the true slope are 1.617334e-05 to 4.540649e-05. We are 95% confident that the true linear relationship between calcium intake and BMD is within this range of slope. ### Step 4 — Model Fit: R² and Residual Standard Error (10 points)
R² (coefficient of determination):
[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?] 0.796% of the variance in BMD is explained by dietary calcium intake. This means that the model doesn’t fit the data very well, and that other predictors are important for determining what can predict the bone mineral density of a person. Residual Standard Error (RSE):
[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.] The RSE is 0.1582. This means that the model is off, on average, by 0.1582 g/cm² when predicting bone mineral density. This also tells us that compared to the prediction value of calcium intake on BMD, the possibility and magnitude of the error is high. —
# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)
# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")## fit lwr upr
## 1 0.9299936 0.9229112 0.937076
# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")## fit lwr upr
## 1 0.9299936 0.6195964 1.240391
Written interpretation (3–6 sentences):
[Answer all four questions from the assignment description:
Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.
A. Statistical Insight (6 points)
[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]
B. From ANOVA to Regression (5 points)
[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]
C. R Programming Growth (4 points)
[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]
The regression models tells us that daily calcium intake may affect the bone mineral density of a person, but adding more variables may help to explain the association more, and make the model stronger. The results were somewhat surprising, as I expected calcium intake to have more of an association with bone mineral density. The key limitations of interpreting SLR from a cross-sectional survey as causal evidence is that you are unable to establish temporality, meaning you are unsure if the calcium intake or bone mineral density occurred first, or which was measured first. There also may be independence issues, as with a cross-sectional study like BRFSS, there may be geographic or household clusters included. Confounders such as age, gender, physical activity, BMI, and more might explain the observed association. ANOVA answered whether BMD differed amongst different groups of people, whereas SLR measures the linear relationship between calcium intake and BMD. Regression helps to give a quantifiable number regarding the strength of the relationship between calcium intake and BMD, whereas ANOVA tells us whether BMD varied by different groups, which groups it varied by, and how much it varied by. You would prefer ANOVA when you are trying to decide if categories of certain things, like age group, gender, race/ethnicity, and more, differ when it comes to a continuous outcome. Simple linear regression is good at comparing two continuous variables and quantifying the relationship between the two variables. Since all of coding was completed for us, the most challenging part of the assignment was making sure I was interpreting the correct values and interpreting them correctly. I have had troubles in the past with R coding for assignments where certain values or variables had to be changed in order to receive the correct outcome. Throughout these assignments, and this assignment, I have become more confident in proofreading the code in order to obtain the results and outcomes I am looking for. I make sure to read the code carefully and determine what needs to be changed to fit my data and what I am looking for and what can stay the same. —
End of Homework 2