EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- readr::read_csv("C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/bmd.csv", show_col_types = FALSE)

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_hw <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_hw), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_hw, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Scatter Plot of Calcium (mg/day) and Total Femur Bone Mineral Density (g/cm²)",
    x       = "Calcium (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

Response: There appears to be a cluster of observations between Calcium and Total Femur Bone Mineral Density. The relationship appears to be a weak positive trend. There also appears to be some outliers in the dataset as well.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model_hw <- lm(DXXOFBMD ~ calcium, data = bmd_hw)

# Display the full model summary
summary(model_hw)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_hw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

b0_hw <- round(coef(model_hw)[1],  digits = 6)
b1_hw <- round(coef(model_hw)[2],  digits = 6)

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

Response: The estimated mean Bone Mineral Density when Calcium = 0. The intercept is not directly interpret able in this context as calcium is always present in bone, but is necessary to anchor the line.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

Response: For each 1-day increase in calcium, Bone Mineral Density is estimated to increase by 3.1^{-5} g/cm², on average, holding all else constant (though there are no other variables in this simple model).

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model_hw)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between Calcium and Bone Mineral Density)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]

n <- nrow(bmd_hw)

# Extract slope test statistics
slope_test_hw <- tidy(model_hw, conf.int = TRUE) %>% filter(term == "calcium")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test_hw$estimate, 6),
    round(slope_test_hw$std.error, 6),
    round(slope_test_hw$statistic, 3),
    n - 2,
    format.pval(slope_test_hw$p.value, digits = 3),
    round(slope_test_hw$conf.low, 6),
    round(slope_test_hw$conf.high, 6)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b₁)	3.1e-05
SE(b₁)	7e-06
t-statistic	4.131
Degrees of freedom	2127
p-value	3.75e-05
95% CI Lower	1.6e-05
95% CI Upper	4.5e-05

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

Decision: With p < 0.001, we reject \(H_0\) at the \(\alpha = 0.05\) level. There is statistically significant evidence of a linear association between Calcium and Total Bone Mineral Density.

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

# Extract R-squared from model
r_sq_hw <- summary(model_hw)$r.squared
adj_r_sq_hw <- summary(model_hw)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq_hw, 4),
    round(adj_r_sq_hw, 4),
    paste0(round(r_sq_hw * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.008
Adjusted R²	0.0075
Variance Explained	0.8%

Residual Standard Error (RSE):

# Augment dataset with fitted values and residuals
augmented_hw <- augment(model_hw)

# Show a sample of fitted values and residuals
augmented_hw %>%
  select(DXXOFBMD, calcium, .fitted, .resid) %>%
  slice_head(n = 10) %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "First 10 Observations: Observed, Fitted, and Residual Values",
    col.names = c("Observed BMD (Y)", "Calcium (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

First 10 Observations: Observed, Fitted, and Residual Values
Observed BMD (Y)	Calcium (X)	Fitted (Ŷ)	Residual (e = Y − Ŷ)
1.058	503.5	0.915	0.143
0.801	473.5	0.914	-0.113
0.851	1248.5	0.938	-0.087
0.778	660.5	0.920	-0.142
0.994	776.0	0.923	0.071
0.952	452.0	0.913	0.039
1.121	853.5	0.925	0.196
0.752	1183.5	0.936	-0.184
1.085	752.5	0.922	0.163
0.899	570.5	0.917	-0.018

# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample_hw <- augmented_hw %>% slice_sample(n = 80)

p_resid_hw <- ggplot(resid_sample_hw, aes(x = calcium, y = DXXOFBMD)) +
  geom_segment(aes(xend = calcium, yend = .fitted),
               color = "tomato", alpha = 0.5, linewidth = 0.5) +
  geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
  geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
  labs(
    title = "Residuals Illustrated on the Regression Line",
    subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
    x = "calcium (mg/day)",
    y = "BMD (g/cm²)"
  ) +
  theme_minimal(base_size = 13)

p_resid_hw

n <- nrow(bmd_hw)
ss_resid_hw <- sum(augmented_hw$.resid^2)
mse_hw <- ss_resid_hw / (n - 2)
sigma_hat_hw <- sqrt(mse_hw)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid_hw, 2), round(mse_hw, 3), round(sigma_hat_hw, 3)),
  Units    = c("", "", "kg/m²")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value	Units
SS Residual	53.260
MSE (σ̂²)	0.02

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

Interpretation: On average, our model’s predictions are off by about 0.16 Bone Mineral Density units.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data_hw <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
ci_pred_lab <- predict(model_hw, newdata = new_data_hw, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

# 95% prediction interval for a new individual at calcium = 1000
pi_pred_lab <- predict(model_hw, newdata = new_data_hw, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)


results_table <- bind_cols(new_data_hw, ci_pred_lab, pi_pred_lab) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals for Calcium = 1000",
    col.names = c("Calcium", "Fitted BMD", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals for Calcium = 1000
		95% CI for Mean		95% PI for Individual
Calcium	Fitted BMD	CI Lower	CI Upper	PI Lower	PI Upper
1000	0.93	0.92	0.94	0.62	1.24

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
What does the 95% confidence interval represent?
What is the 95% prediction interval, and why is it wider than the CI?
Is 1,000 mg/day a meaningful value to predict at, given the data?]

For someone with calcium of 1000, the estimated mean number of bone mineral density is 093 days. The 95% confidence interval for this mean is .92 to .94, which means that we can say with 95 percent confidence that the true mean lies between these two values. The predicted confidence interval lies between .62 and 1.24. The predicted interval range is wider because it accounts for more variability. Given the data, the 1000 mg/day is not a meaningful value to predict BMD because the predicted confidence interval crosses over 1 indicating that the 1000 mg/day is not meaningful.

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

Response: The simple linear regression model showed there was a statistically significance between calcium and BMD. I figured that there would be an association between the two variables however I thought that calcium would explain more than .8% of bone mineral density. The key limitation of interpreting cross-sectional survey is that you cannot establish temporarily between the outcome variable and the key predictor variable.

Compared to homework 1 and ANOVA, the SLR is looking at something different. In ANOVA we were interested if there are differences across groups whereas in SLR we are testing to see if there is a association between two variables. ANOVA we can only determine if there is a difference between groups but we cannot determine if there is a relationship.You would use ANOVA if you have a continuous outcome and a categorical predictor that had independent observations. SLR would be used if you have a continuous outcome and wanted to quantify the relationship with a single predictor.

When it comes to programming skills, I didn’t have many difficulties with this assignment. I definitely feel more confident with my programming skills after this assignment and working through the progress report check in 1 for the final project.

End of Homework 2