EPI 553 — Homework 2: Simple Linear Regression

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('/Users/nataliasmall/Downloads/EPI 553/bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Total N: 2898 Missing DXXOFBMD: 612 Missing calcium: 293 Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Calcium vs Total Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

As daily dietary calcium intake increases there appears to be a slight increase in total femur bone mineral density (g/cm²). The scatterplot shows a slight positive linear trend and a weak relationship. Most observed data points are clustered towards the left side of the scatterplot (0 to 2000 calcium intake mg/day) with fewer observations at higher intake levels. Thus, notable outliers are on the far right side of the scatterplot, towards the highest amounts of daily dietary calcium intake mg/day.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?] Intercept (β₀) = 8.992e-01. Numerically this represents the estimated mean Total Femur Bone Mineral Density (g/cm²) when daily dietary calcium intake = 0. This is a mathematical artifact — it is biologically impossible for a living person to have daily dietary calcium intake = 0 and maintain any stable Bone Mineral Density. The intercept is not directly interpretable in this context, but is necessary to anchor the line.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?] Slope (β₁) = 3.079e-05. For every 1-unit increase in calcium (mg/day), there is an estimated 3.079e-05 g/cm² increase in BMD. The direction is positive. Given typical calcium intake ranges, 1,000 to 1,300 mg/day for most adults, the effect is very small.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: [complete in notation and plain language] 𝐻0:𝛽1=0 (no linear relationship between calcium and DXXOFBMD)
H₁: [complete in notation and plain language] 𝐻𝐴:𝛽1≠0 (there is a linear relationship between calcium and DXXOFBMD)

Report the test results:

# Report the test results
n <- nrow(bmd_analytic)

slope_test <- tidy(model, conf.int = TRUE) %>% filter(term == "calcium")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    n - 2,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b₁)	0
SE(b₁)	0
t-statistic	4.131
Degrees of freedom	2127
p-value	3.75e-05
95% CI Lower	0
95% CI Upper	0

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?] t-statistic: 4.131 degrees of freedom: 2127 p-value: 3.75e-05 (< 0.05) conclusion: Since the p-value is 3.75e-05 (< 0.05) we reject 𝐻0 at the 𝛼=0.05 level. There is statistically significant evidence of a linear association between dietary calcium intake (mg/day) and total femur BMD (g/cm²).

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?] The the 95% confidence interval for β₁ is 1.617334e-05 mg/day to 4.540649e-05 mg/day. We are 95% confident that for every 1-unit increase in dietary calcium intake (mg/day) average increase in total femur BMD in the population falls between 1.617334e-05 and 4.540649e-05 g/cm².

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

# Extract R-squared from model
r_sq <- summary(model)$r.squared
adj_r_sq <- summary(model)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.008
Adjusted R²	0.0075
Variance Explained	0.8%

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?] 0.8% of the variance in BMD is explained by dietary calcium intake. Based on R² (0.008), the model is a poor fit for the data. calcium intake explains 0.8% of the variability in BMD. This suggest that while calcium has a statistically significant association, alone it is a weak predictor. Other predictors are important in explaining variability in BMD.

Residual Standard Error (RSE):

# Residual Standard Error (RSE)
n <- nrow(bmd_analytic)
augmented <- augment(model)

ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
  Units    = c("", "", "g/cm²")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value	Units
SS Residual	53.260
MSE (σ̂²)	0.02

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.] The Residual Standard Error (RSE) is 0.15 g/cm², indicating that the model’s predictions typically deviate from the actual observed total femur BMD by 0.15 g/cm².

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
What does the 95% confidence interval represent?
What is the 95% prediction interval, and why is it wider than the CI?
Is 1,000 mg/day a meaningful value to predict at, given the data?] The predicted BMD at calcium = 1,000 mg/day is 0.93 g/cm². The 95% Confidence Interval (0.923 to 0.937 g/cm²) indicates that if we repeated this study many times, 95% of CIs would contain the true population mean BMD for calcium = 1,000 mg/day. The 95% Prediction Interval is (0.620 to 1.240 g/cm²), representing the range in which we expect the BMD of a single, new individual consuming 1,000 mg of calcium per day to fall. The 95% Prediction Interval is wider than the Confidence Interval because it accounts for uncertainty and individual-level variability. Yes, calcium = 1,000 mg/day is a meaningful value to predict at, in fact looking back at the scatterplot (Part 1) most of the datapoints are heavily concentrated within this range, representing a common dietary target.

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

Reflection: Answered

The simple linear regression model suggests that dietary calcium intake has a statistically significant (p-value < 0.05) but a weak association with total femur bone mineral density (BMD). These results were not surprising because bone mineral density is influenced by many biological and lifestyle factors beyond calcium intake. One important limitation of interpreting SLR from a cross-sectional survey as causal evidence is that the variables are measured at a single point in time. As a result, the model cannot establish a causal relationship between calcium intake and BMD. Potential confounders that might explain the observed association are age, sex, body mass index (BMI), physical activity, and potential genetic factors.

ANOVA answers whether at least one group mean is statistically different from the others. Regression answers questions about the direction, magnitude, and predictive relationship between variables. It also allows for predictions of the outcome at specific values of the predictor. Regression can describe how an outcome changes as a predictor increases while ANOVA cannot. ANOVA only tests for differences, not relationships. Regression is preferred when you want to quantify the linear relationship between a continuous outcome and relevant predictors, predict values of an outcome given a predictor value, and test hypotheses about whether a predictor is associated with an outcome. ANOVA is preferred when comparing discrete categories and answering questions about differences between categories or groups.

There wasn’t really any challenging part of this assignment from a programming perspective, however, in part 3, after running the code the output was printed directly under the code chunk, because the output wasn’t in a structured table I was momentarily confused when interpreting. I worked through this by taking a further look at the code notes, determining the first row of the output to be the CI and the second to be the PI. After completing this homework I feel more confident in my ability to comprehend R code and if any errors arise, I feel confident in saying that I am able to debug and pinpoint the problem.

End of Homework 2