EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read_csv("~/Downloads/epi552/bmd(in).csv")
# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

total_n <- nrow(bmd)
missing_bmd <- sum(is.na(bmd$DXXOFBMD))
missing_calcium <- sum(is.na(bmd$calcium))

missing_table <- tibble(
  Measure = c("Total sample size", "Missing DXXOFBMD", "Missing calcium"),
  Value = c(total_n, missing_bmd, missing_calcium)
)

missing_table %>%
  kable(caption = "Sample Size and Missingness Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Sample Size and Missingness Summary
Measure	Value
Total sample size	2898
Missing DXXOFBMD	612
Missing calcium	293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
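
The two missingness counts cannot simply be subtracted from the total, because some participants are missing both variables. From the numbers reported above, 2898 - 2129 = 769 rows were excluded, while 612 + 293 = 905, so 136 rows must be missing both DXXOFBMD and calcium. The sketch below shows one way to tabulate that overlap directly. It uses a small made-up data frame (`toy`, a hypothetical stand-in for `bmd`), so the printed counts are illustrative only.

```r
# Toy stand-in for bmd (values are made up for illustration only)
toy <- data.frame(
  DXXOFBMD = c(1.05, NA, 0.88, NA, 0.78, 0.99),
  calcium  = c(503.5, 473.5, NA, NA, 660.5, 776.0)
)

# Flag missingness in each variable
miss_bmd <- is.na(toy$DXXOFBMD)
miss_cal <- is.na(toy$calcium)

# Cross-tabulate: the TRUE/TRUE cell counts rows missing both variables
overlap_table <- table(miss_bmd, miss_cal)
print(overlap_table)

# Inclusion-exclusion: excluded = missing BMD + missing calcium - missing both
n_both     <- sum(miss_bmd & miss_cal)
n_excluded <- sum(miss_bmd) + sum(miss_cal) - n_both
n_complete <- nrow(toy) - n_excluded

# Same answer as counting complete cases directly
stopifnot(n_complete == sum(complete.cases(toy)))
```

On the real data, the equivalent call would be `table(is.na(bmd$DXXOFBMD), is.na(bmd$calcium))`.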

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Association Between Dietary Calcium Intake and Total Femur BMD",
    x       = "Calcium Intake (mg/day)",
    y       = "Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

The scatterplot shows a weak positive linear relationship between dietary calcium intake and total femur bone mineral density. BMD tends to be slightly higher at higher calcium intakes, but the points are widely scattered around the fitted line. A few observations have very high calcium intake (above 3,000 mg/day), but they do not change the overall pattern. The relationship appears approximately linear, with no strong evidence of non-linearity.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept is 0.8992 g/cm^2. This is the predicted total femur bone mineral density for a person with a dietary calcium intake of 0 mg/day. It is not meaningful in a real-world context, because it is unrealistic for anyone to consume zero calcium.

B. Slope (β₁):

The slope is 0.00003079 g/cm^2 per mg/day of calcium. For every 1 mg/day higher dietary calcium intake, predicted BMD is higher by 0.00003079 g/cm^2, on average. The positive slope indicates a positive association between the two variables. However, the magnitude is very small: even large differences in calcium intake correspond to only slight differences in BMD, indicating a relatively weak relationship.
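
A per-mg slope is hard to read because 1 mg/day is a tiny change in intake. Rescaling the predictor makes the same estimate easier to interpret: the reported slope of 3.079e-05 corresponds to about 0.0031 g/cm^2 per 100 mg/day. The sketch below illustrates this on simulated data (the data frame `sim` and the variable `calcium_100` are hypothetical names, not part of the assignment data).

```r
# Simulated data for illustration only (not the NHANES values)
set.seed(553)
sim <- data.frame(calcium = runif(200, min = 100, max = 3000))
sim$bmd <- 0.90 + 0.00003 * sim$calcium + rnorm(200, sd = 0.15)

# Original scale: slope is per 1 mg/day
fit_mg <- lm(bmd ~ calcium, data = sim)

# Rescaled predictor: slope is per 100 mg/day
sim$calcium_100 <- sim$calcium / 100
fit_100 <- lm(bmd ~ calcium_100, data = sim)

slope_mg  <- unname(coef(fit_mg)["calcium"])
slope_100 <- unname(coef(fit_100)["calcium_100"])

# Rescaling multiplies the slope by 100 and leaves the t-statistic unchanged
print(all.equal(slope_100, 100 * slope_mg))
t_mg  <- summary(fit_mg)$coefficients["calcium", "t value"]
t_100 <- summary(fit_100)$coefficients["calcium_100", "t value"]
print(all.equal(t_mg, t_100))

# The same scaling applied to the estimate reported in the summary above
reported_slope <- 3.079e-05
per_100 <- reported_slope * 100   # g/cm^2 per 100 mg/day
per_500 <- reported_slope * 500   # g/cm^2 per 500 mg/day
print(c(per_100 = per_100, per_500 = per_500))
```

Only the units of the slope change. The fit, the p-value, and R² are identical under either scaling.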

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0. There is no linear association between dietary calcium intake and total femur bone mineral density.
H₁: β₁ ≠ 0. There is a linear association between dietary calcium intake and total femur bone mineral density.

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

The t-statistic for the slope is 4.131 on 2127 degrees of freedom, with a p-value of 3.75 × 10^-5. Because the p-value is less than 0.05, we reject the null hypothesis. There is statistically significant evidence of a linear association between dietary calcium intake and total femur BMD in this sample.
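
As a check on where these numbers come from, the test statistic can be rebuilt from the estimate and standard error printed by summary(). The sketch below copies the rounded values from the output above, so its results should agree with the printed ones only up to rounding.

```r
# Values copied from the summary(model) output above (rounded as printed)
estimate <- 3.079e-05
std_err  <- 7.453e-06
df_resid <- 2127

# t-statistic: the estimate divided by its standard error
t_stat <- estimate / std_err

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# In simple linear regression the overall F-statistic equals t squared
f_stat <- t_stat^2

print(round(t_stat, 3))
print(signif(p_value, 3))
print(round(f_stat, 2))
```

The squared t-statistic matching the F-statistic reflects that, with a single predictor, the slope test and the overall model test are the same test.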

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

We are 95% confident that the true slope, the average difference in BMD associated with a 1 mg/day higher calcium intake, lies between 0.000016 and 0.000045 g/cm^2 per mg/day. Because the entire interval lies above 0, it is consistent with rejecting H₀ and supports the conclusion that the association is positive.
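
The interval from confint() is the estimate plus or minus a t critical value times the standard error. A minimal sketch of that calculation, using the rounded values printed in the summary output (so agreement with confint() is approximate):

```r
# Values copied from the summary(model) output above (rounded as printed)
estimate <- 3.079e-05
std_err  <- 7.453e-06
df_resid <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Estimate plus or minus the margin of error
ci <- estimate + c(-1, 1) * t_crit * std_err
print(signif(ci, 4))

# The same interval expressed per 100 mg/day, which is easier to read
ci_per_100 <- ci * 100
print(signif(ci_per_100, 3))
```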

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

The R^2 value is 0.007959, meaning that dietary calcium intake explains about 0.80% of the variability in total femur BMD. The model therefore fits the data very weakly: calcium intake alone accounts for almost none of the variation in BMD. This suggests that other factors, such as age, sex, BMI, and physical activity, likely play a much larger role in explaining differences in BMD.
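
In simple linear regression, R² is the square of the Pearson correlation between predictor and outcome, so the reported R² of 0.007959 implies a correlation of about 0.089. The sketch below demonstrates the identity on simulated data (the vectors `x` and `y` are hypothetical, not the assignment data).

```r
# Simulated data for illustration only
set.seed(553)
x <- runif(300, min = 100, max = 3000)
y <- 0.90 + 0.00003 * x + rnorm(300, sd = 0.15)

fit <- lm(y ~ x)

# R-squared as reported by summary()
r_squared <- summary(fit)$r.squared

# R-squared as 1 - RSS / TSS
rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)
r_squared_manual <- 1 - rss / tss

# R-squared as the squared Pearson correlation (simple regression only)
r_pearson <- cor(x, y)

print(all.equal(r_squared, r_squared_manual))
print(all.equal(r_squared, r_pearson^2))

# Correlation implied by the R-squared reported above
implied_r <- sqrt(0.007959)
print(round(implied_r, 3))
```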

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

The residual standard error is 0.1582 g/cm^2. It estimates the typical distance between the observed BMD values and the values predicted by the model. In other words, the model’s predictions are off by about 0.158 g/cm^2 on average, indicating a moderate amount of unexplained variability in BMD.
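
The RSE is the square root of the residual sum of squares divided by the residual degrees of freedom (n - 2 in simple regression). The sketch below verifies this on simulated data (hypothetical `x` and `y`), then uses the reported RSE and adjusted R² to back out the standard deviation of BMD, relying on the identity adjusted R² = 1 - RSE² / var(y).

```r
# Simulated data for illustration only
set.seed(553)
x <- runif(300, min = 100, max = 3000)
y <- 0.90 + 0.00003 * x + rnorm(300, sd = 0.15)

fit <- lm(y ~ x)

# RSE by hand: sqrt(RSS / (n - 2))
rss        <- sum(resid(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

# RSE as extracted by sigma()
rse_builtin <- sigma(fit)
print(all.equal(rse_manual, rse_builtin))

# Adjusted R-squared equals 1 - RSE^2 / var(y)
adj_r2 <- summary(fit)$adj.r.squared
print(all.equal(adj_r2, 1 - rse_builtin^2 / var(y)))

# Outcome SD implied by the reported RSE (0.1582) and adjusted R-squared (0.007493)
implied_sd <- 0.1582 / sqrt(1 - 0.007493)
print(round(implied_sd, 4))
```

The implied standard deviation is nearly the same as the RSE, which is another way of seeing that calcium removes very little of the spread in BMD.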

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
What does the 95% confidence interval represent?
What is the 95% prediction interval, and why is it wider than the CI?
Is 1,000 mg/day a meaningful value to predict at, given the data?]

The model predicts a BMD of about 0.930 g/cm^2 for someone with a calcium intake of 1,000 mg/day. The 95% confidence interval (0.923 to 0.937 g/cm^2) is the range of plausible values for the mean BMD among all people with this level of calcium intake. The 95% prediction interval (0.620 to 1.240 g/cm^2) is the range in which a single new individual’s BMD is expected to fall. It is wider because it reflects both the uncertainty in the estimated mean and the variation of individuals around that mean. A calcium intake of 1,000 mg/day is a reasonable value to predict at, since it falls within the typical range in the data.
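
The difference between the two intervals comes down to one extra variance term. The confidence interval uses only the standard error of the fitted mean, while the prediction interval adds the residual variance. A minimal sketch on simulated data (the data frame `sim` is hypothetical) that rebuilds both intervals from predict(..., se.fit = TRUE):

```r
# Simulated data for illustration only (not the NHANES values)
set.seed(553)
sim <- data.frame(calcium = runif(300, min = 100, max = 3000))
sim$bmd <- 0.90 + 0.00003 * sim$calcium + rnorm(300, sd = 0.15)

fit      <- lm(bmd ~ calcium, data = sim)
new_data <- data.frame(calcium = 1000)

# Fitted value, its standard error, residual SD, and residual df
pred   <- predict(fit, newdata = new_data, se.fit = TRUE)
y_hat  <- unname(pred$fit)
se_fit <- unname(pred$se.fit)
t_crit <- qt(0.975, df = pred$df)

# Confidence interval: uncertainty in the estimated mean only
ci_manual <- y_hat + c(-1, 1) * t_crit * se_fit

# Prediction interval: adds the residual variance for a single new individual
pi_manual <- y_hat + c(-1, 1) * t_crit * sqrt(se_fit^2 + pred$residual.scale^2)

# Compare with the built-in intervals
ci_builtin <- predict(fit, newdata = new_data, interval = "confidence")
pi_builtin <- predict(fit, newdata = new_data, interval = "prediction")
print(all.equal(ci_manual, unname(ci_builtin[1, c("lwr", "upr")])))
print(all.equal(pi_manual, unname(pi_builtin[1, c("lwr", "upr")])))
```

With a large sample, the standard error of the mean shrinks toward zero but the residual standard deviation does not, so the prediction interval stays wide no matter how much data is collected.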

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

This regression model shows a statistically significant but very weak positive association between dietary calcium intake and total femur bone mineral density. BMD is slightly higher at higher calcium intakes, but the effect size is very small, and the R^2 value shows that calcium explains less than 1% of the variation in BMD. This was not surprising, since bone health is influenced by many factors beyond calcium intake alone. A key limitation of this analysis is that the data are cross-sectional, meaning calcium intake and BMD were measured at the same time. Because of this, we cannot say that calcium intake causes changes in BMD. There may also be confounding variables, such as age, sex, BMI, physical activity, smoking, and vitamin D intake, that influence both calcium intake and BMD.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

ANOVA was used to compare mean BMD across ethnic groups, which addresses whether group averages differ. Regression, by contrast, examines the relationship between a continuous predictor (calcium) and a continuous outcome (BMD), and estimates how much BMD changes per unit increase in calcium. Regression therefore provides more detailed information, such as the direction and magnitude of the relationship. I would use ANOVA when comparing categories, and regression when working with continuous predictors or when I want to model relationships.
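
The two methods are more closely related than the comparison above might suggest: one-way ANOVA is a linear regression with a categorical predictor, and both give the same F-test. A minimal sketch on simulated data (the data frame `sim` and its three groups are hypothetical, not the Homework 1 data):

```r
# Simulated data for illustration only: three groups with different means
set.seed(553)
sim <- data.frame(group = factor(rep(c("A", "B", "C"), each = 40)))
sim$bmd <- c(0.90, 0.95, 1.00)[as.integer(sim$group)] + rnorm(120, sd = 0.15)

# One-way ANOVA and the equivalent regression on a factor
aov_fit <- aov(bmd ~ group, data = sim)
lm_fit  <- lm(bmd ~ group, data = sim)

# The overall F-statistic is identical in both
f_aov <- summary(aov_fit)[[1]][["F value"]][1]
f_lm  <- unname(summary(lm_fit)$fstatistic["value"])
print(all.equal(f_aov, f_lm))

# Regression additionally reports group contrasts: the intercept is the mean
# of the reference group, and each coefficient is a difference from it
mean_a    <- mean(sim$bmd[sim$group == "A"])
mean_b    <- mean(sim$bmd[sim$group == "B"])
intercept <- unname(coef(lm_fit)["(Intercept)"])
diff_b    <- unname(coef(lm_fit)["groupB"])
print(all.equal(intercept, mean_a))
print(all.equal(diff_b, mean_b - mean_a))
```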

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

The most challenging part of this assignment was making sure the R Markdown file would knit without errors. I had to carefully check that I used the correct working directory and that my interpretations for each part of the assignment could be applied to real-world public health scenarios. Debugging these issues helped me better understand how R Markdown runs code from top to bottom. I feel more confident organizing my code, fixing errors, and making sure the analysis runs properly from the beginning.

End of Homework 2