EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)

# Import the dataset — update the path if needed
bmd <- read.csv("C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/bmd.csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

Answer:

Total N: 2,898
Missing DXXOFBMD: 612
Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Answer:

Quantity	Value
Missing DXXOFBMD only	476
Missing calcium only	157
Missing BOTH (overlap)	136
Rows dropped (missing either)	769
Final analytic N	2,129

769 rows were dropped in total. The overlap of 136 rows (missing on both variables) is why the dropped rows (769) is less than the sum of individual missingness counts (612 + 293 = 905). The correct Final analytic N is exactly 2,129.

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
 title = "Association Between Dietary Calcium Intake and Total Femur BMD",
    x     = "Dietary Calcium Intake (mg/day)",
    y     = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal(base_size = 13)

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

Answer

Interpretation:

The scatterplot reveals a modest positive linear trend between dietary calcium intake and total femur BMD individuals with higher calcium intake tend to have slightly higher BMD values. The relationship appears weak in magnitude.
The regression line has a shallow positive slope and there is substantial scatter around it. The data display considerable right-skew in calcium intake, with a cluster of observations below 2,000 mg/day and several extreme values exceeding 4,000 mg/day that may represent influential points.
No obvious non-linearity is evident in the bulk of the data, suggesting a linear model is a reasonable first approximation, though the extreme outliers in calcium warrant attention in sensitivity analyses.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

Answer:

The intercept (0.8992 g/cm²) represents the model’s predicted total femur BMD for an individual with zero dietary calcium intake per day. While mathematically necessary as the baseline of the regression line, this value is not clinically or biologically meaningful in isolation: no person in the dataset consumes zero dietary calcium, and extrapolating this far outside the observed data range is unreliable. The intercept should be treated as an anchoring constant rather than a substantively interpretable quantity.

B. Slope (β₁):

Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?

Answer:

The slope (3.079 × 10⁻⁵ g/cm² per mg/day, or 0.00003079 g/cm² per mg/day) indicates that for each additional 1 mg/day increase in dietary calcium intake, total femur BMD is estimated to increase by approximately 0.00003079 g/cm², on average. The direction of the association is positive. This per-unit effect is extremely small. However, over a more meaningful range (e.g., a 500 mg/day difference in calcium intake), the estimated BMD difference is approximately 0.0154 g/cm², which remains modest relative to the typical BMD spread in this population (IQR ≈ 0.21 g/cm² from the residuals).

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: [complete in notation and plain language]
H₁: [complete in notation and plain language]

MODEL HYPOTHESES:

H₀: β₁ = 0 — There is no linear association between dietary calcium intake and total femur BMD in the population; the slope equals zero.
H₁: β₁ ≠ 0 — There is a linear association between dietary calcium intake and total femur BMD; the slope is different from zero (two-sided test).

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

MODEL TEST RESULTS:

The t-statistic for the slope is t = 4.131 on 2,127 degrees of freedom, with a p-value of 3.75 × 10⁻⁵ (p < 0.001). Because p < 0.05, we reject H₀. This provides statistically significant evidence that dietary calcium intake is positively associated with total femur BMD in this sample.

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

Answer:

The 95% confidence interval for the slope is (1.617 × 10⁻⁵, 4.541 × 10⁻⁵) g/cm² per mg/day. We are 95% confident that the true population slope lies within this range that is, for every 1 mg/day increase in dietary calcium, the true mean increase in BMD is plausibly between 0.00001617 and 0.00004541 g/cm². Because the entire interval lies above zero, this is consistent with our rejection of H₀.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

Answer: R² INTERPRETATION:

The R² is 0.007959 (approximately 0.80%), meaning that dietary calcium intake explains less than 1% of the variance in total femur BMD in this sample. While the association is statistically significant (p < 0.001), the model fit is very poor by conventional standards. This low R² strongly suggests that BMD is influenced by many other factors beyond dietary calcium including age, sex, body mass index, physical activity, smoking status, race/ethnicity, and supplemental calcium none of which are accounted for in this simple model.

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

Answer RSE INTERPRETATION:

The residual standard error is 0.1582 g/cm². This represents the average distance between observed BMD values and the model’s predicted values the typical prediction error of the model is about 0.1582 g/cm². Given that BMD values in this population range from roughly 0.4 to 1.6 g/cm² (as visible in the scatterplot), an RSE of 0.1582 g/cm² represents substantial unexplained variability, consistent with the near-zero R².

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)

Answer:

The model predicts a total femur BMD of 0.9229112 g/cm² for an individual with a dietary calcium intake of 1,000 mg/day.

What does the 95% confidence interval represent?

Answer:

The 95% confidence interval for the mean BMD at calcium = 1,000 mg/day is (0.9229, 0.9371) g/cm². This interval estimates the average BMD across all individuals in the population who consume 1,000 mg/day of dietary calcium. It is narrow (width ≈ 0.014 g/cm²) because it reflects only the uncertainty in estimating the mean regression line at this point, not variability between individuals.

What is the 95% prediction interval, and why is it wider than the CI?

Answer:

The 95% prediction interval is (0.6196, 1.2404) g/cm². This is substantially wider than the confidence interval (width ≈ 0.621 g/cm²) because it must account for two sources of uncertainty simultaneously: the uncertainty in estimating the mean response, and the natural individual-to-individual variability in BMD. While the CI tells us where the average BMD falls for people consuming 1,000 mg/day, the PI tells us where a single new individual’s BMD would likely fall and people vary considerably even at the same calcium intake level.

Is 1,000 mg/day a meaningful value to predict at, given the data?]

Answer:

A calcium intake of 1,000 mg/day is a clinically meaningful and appropriate value to predict at. It corresponds to the Recommended Dietary Allowance (RDA) for adults aged 19–50, and it falls well within the observed range of calcium values in the dataset making this an interpolation rather than an extrapolation, so the prediction is on solid footing.

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

Answer:

The regression model reveals a statistically significant but practically very small positive association between dietary calcium intake and total femur BMD. Calcium explains less than 1% of the variance in BMD, which is humbling given that dietary calcium is one of the most commonly cited nutrients in bone health guidelines. In some ways the result is not surprising: BMD is a complex outcome shaped by genetics, age, hormonal status, physical activity, body composition, and many other factors that a single dietary variable simply cannot capture on its own. The key limitation of this analysis is its cross-sectional design. Because calcium intake and BMD were measured at the same point in time, we cannot establish which came first, and therefore we cannot claim that calcium intake causes changes in BMD. Several confounders could explain or distort the observed association. Age is one example, since older adults tend to have both lower BMD and lower appetite and caloric intake. Sex is another, since men generally have higher BMD and may also consume more calories and calcium. Physical activity is a third, since individuals who are more physically active tend to have denser bones and may also follow healthier diets overall.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

Answer

One-way ANOVA and simple linear regression are both foundational methods for analyzing a continuous outcome, but they answer fundamentally different questions. ANOVA asks whether mean BMD differs across categories, such as racial and ethnic groups, and produces an omnibus F-test that tells us whether any group differences exist without quantifying their direction or magnitude. Simple linear regression asks how BMD changes as a continuous function of a predictor, and it produces a slope that tells us exactly how much BMD is estimated to change per unit increase in that predictor. Regression also provides quantities that ANOVA cannot, including a predicted value at any level of the predictor, a confidence interval for that prediction, and an R² that summarizes overall model fit. ANOVA is the better choice when the predictor is genuinely categorical and the scientific question is about group mean differences. Regression is preferable when the predictor is continuous and the goal is to model a dose response relationship or generate predictions.

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

Answer:

The most challenging part of this assignment was understanding the difference between the confidence interval and the prediction interval produced by the predict() function. Both intervals are generated with nearly identical code and produce output that looks exactly the same, which made it easy to confuse them at first. Working through the conceptual distinction helped resolve this: the confidence interval estimates where the average BMD falls for a group of people at a given calcium level, while the prediction interval estimates where a single new individual’s BMD would likely fall, which is inherently more uncertain and therefore wider. Another concept that required careful thought was the residual standard error. At first glance the value of 0.1582 g/cm² did not mean much on its own, but placing it in context made it interpretable: since BMD values in this population range from roughly 0.4 to 1.6 g/cm², a typical prediction error of 0.1582 g/cm² represents a meaningful margin, reinforcing just how much variability in BMD remains unexplained by calcium alone. After completing this assignment, I feel considerably more confident working with lm(), reading and interpreting summary() output, and using confint() and predict() to extract inferential information that goes beyond what the default model output provides.

End of Homework 2

EPI 553 — Homework 2: Simple Linear Regression

Marina Frimpong

March 2, 2026

Part 0: Data Preparation (10 points)

Part 1: Exploratory Visualization (15 points)

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

Step 2 — Interpret the Coefficients (10 points)

Step 3 — Statistical Inference (15 points)

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

Part 3: Prediction (20 points)

Part 4: Reflection (15 points)