EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv("bmd2.csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
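
The two missing-value counts add up to more than the number of rows actually dropped, so some participants must be missing both variables. The sketch below works this out from the printed counts alone; the variable names are illustrative and are not part of the assignment template.

```r
# Counts copied from the console output above
total_n         <- 2898
missing_bmd     <- 612
missing_calcium <- 293
analytic_n      <- 2129

# Rows excluded because DXXOFBMD or calcium (or both) was missing
excluded <- total_n - analytic_n

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
missing_both <- missing_bmd + missing_calcium - excluded

cat("Excluded:", excluded, "\n")
cat("Missing both DXXOFBMD and calcium:", missing_both, "\n")
```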

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Linear Regression Between Dietary Calcium Intake\n(calcium, mg/day) And Total Femur Bone Mineral Density (DXXOFBMD, g/cm²)",
    x       = "Dietary Calcium Intake (calcium, mg/day)",
    y       = "Total Femur Bone Mineral Density (DXXOFBMD, g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

The scatterplot shows a slight positive linear trend between dietary calcium intake and total femur BMD. The association looks weak: the points are widely scattered around the fitted line, which rises only slightly across the range of calcium intake. There are some notable outliers, but there is no obvious non-linear pattern.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept (b₀ = 0.8992 g/cm²) is the predicted mean total femur bone mineral density (DXXOFBMD) when dietary calcium intake (calcium) is 0 mg/day. It is not a meaningful quantity in this context, because it is unrealistic for a person to consume no calcium at all; almost everyone gets some calcium from food.

B. Slope (β₁):

Slope: b₁ = 3.079e-05 g/cm² per mg/day. For every 1 mg/day increase in dietary calcium intake, the expected BMD increases by 3.079e-05 g/cm², on average. (There are no other predictors in a simple linear regression, so nothing is being held constant.) The direction is positive, and the effect is small over the typical range of calcium intake.
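
A 1 mg/day difference is too small to picture, so it can help to rescale the slope to larger differences in intake. This is a minimal sketch using the slope printed by summary(model); the 100 and 1,000 mg/day increments are illustrative choices, not values from the assignment.

```r
# Slope copied from summary(model): g/cm^2 per 1 mg/day of calcium
b1 <- 3.079e-05

# Expected difference in mean BMD for larger differences in intake
b1_per_100  <- b1 * 100
b1_per_1000 <- b1 * 1000

cat("Per 100 mg/day: ", b1_per_100, "g/cm^2\n")
cat("Per 1000 mg/day:", b1_per_1000, "g/cm^2\n")
```

Even a 1,000 mg/day difference in intake corresponds to an expected difference of only about 0.03 g/cm², which is small next to the residual standard error of 0.1582 g/cm².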

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0. There is no linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²).
H₁: β₁ ≠ 0. There is a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²).

Report the test results:

t-statistic: 4.131
degrees of freedom: 2127
p-value: 3.75e-05

Conclusion: Because the p-value (3.75e-05) is below α = 0.05, we reject H₀. There is statistically significant evidence of a positive linear association between calcium intake and BMD.

Interpret the 95% confidence interval for β₁:

We are 95% confident that the true mean change in BMD associated with a 1 mg/day increase in calcium intake lies between 1.617e-05 and 4.541e-05 g/cm². Because the entire interval is above 0, it supports a positive association between calcium intake and BMD, which is consistent with rejecting H₀.
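
The test statistic, p-value, and confidence interval all come from the same two numbers in the coefficient table. The sketch below rebuilds them by hand from the rounded estimate and standard error, so the results match summary(model) and confint(model) only up to rounding; the variable names are illustrative.

```r
# Values copied from the calcium row of summary(model)
b1    <- 3.079e-05   # slope estimate
se_b1 <- 7.453e-06   # standard error of the slope
df    <- 2127        # residual degrees of freedom (n - 2)

# t statistic for H0: beta1 = 0, and its two-sided p-value
t_stat  <- b1 / se_b1
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% CI: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df)
ci     <- b1 + c(-1, 1) * t_crit * se_b1

cat("t =", round(t_stat, 3), " p =", signif(p_value, 3), "\n")
cat("95% CI:", signif(ci, 4), "\n")
```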

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

R² = 0.007959. The model explains about 0.8% of the variability in BMD. Because the proportion of variance explained is so low, calcium intake alone accounts for very little of the differences in BMD between individuals. While the association is statistically significant, calcium intake by itself is not a strong predictor of BMD. This suggests that other predictors play a larger role in the variation.
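
With a single predictor, R² can be recovered from the F-statistic, and its square root is the size of the Pearson correlation. The sketch below uses the rounded values printed by summary(model), so it agrees with the reported R² only approximately; the variable names are illustrative.

```r
# Values copied from the last line of summary(model)
f_stat <- 17.07   # F-statistic on 1 and 2127 DF
df_res <- 2127    # residual degrees of freedom

# With one numerator df: R^2 = F / (F + residual df)
r2 <- f_stat / (f_stat + df_res)

# In simple linear regression, |r| = sqrt(R^2); the sign follows the slope,
# which is positive here
r <- sqrt(r2)

cat("R-squared:", round(r2, 6), "\n")
cat("Correlation r:", round(r, 4), "\n")
```

A correlation of roughly 0.09 is a concrete way to see how a highly significant slope can coexist with a weak association when the sample is large.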

Residual Standard Error (RSE):

RSE: 0.1582 g/cm²

Observed BMD values typically deviate from the values predicted by the model by about 0.1582 g/cm².

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD at a calcium intake of 1,000 mg/day is 0.930 g/cm².
What does the 95% confidence interval represent? We are 95% confident that the true mean BMD among individuals consuming 1,000 mg/day of calcium lies between 0.923 and 0.937 g/cm².
What is the 95% prediction interval, and why is it wider than the CI? The 95% prediction interval is 0.620 to 1.240 g/cm², the range expected to contain the BMD of a single new individual consuming 1,000 mg/day. It is wider than the CI because it accounts for both the uncertainty in the estimated mean BMD and the variability of individuals around that mean.
Is 1,000 mg/day a meaningful value to predict at, given the data? Yes. 1,000 mg/day lies within the recommended dietary guidelines for calcium intake, and it falls among the intakes seen in the data (the first records shown in Part 0 range from 452.0 to 1,248.5 mg/day), so the prediction does not extrapolate beyond the observed values.
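
The link between the two intervals can be checked by hand: the prediction interval uses the standard error of the mean response plus the residual variance. The sketch below backs out the standard error of the mean from the printed confidence interval and then rebuilds the prediction interval. It relies on rounded output, so it matches predict() only to about three decimal places, and the variable names are illustrative.

```r
# Values copied from the predict() and summary(model) output above
fit    <- 0.9299936   # predicted BMD at calcium = 1000
ci_lwr <- 0.9229112   # 95% CI for the mean response
ci_upr <- 0.937076
rse    <- 0.1582      # residual standard error
df     <- 2127        # residual degrees of freedom

t_crit <- qt(0.975, df = df)

# Standard error of the estimated mean, recovered from the CI half-width
se_mean <- (ci_upr - ci_lwr) / (2 * t_crit)

# A new individual adds residual variance on top of that uncertainty
se_pred <- sqrt(rse^2 + se_mean^2)

pred_int <- fit + c(-1, 1) * t_crit * se_pred
cat("Rebuilt 95% prediction interval:", round(pred_int, 3), "\n")
```

Almost all of the prediction interval's width comes from the residual standard error, which is why it is so much wider than the confidence interval.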

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

The regression model tells me that calcium intake and BMD are positively associated. This is not surprising, because higher calcium intake is expected to support stronger bones, which lowers the risk of later health problems. The key limitation of interpreting SLR from a cross-sectional survey as causal evidence is temporality: because both variables are measured at a single point in time rather than at multiple points, we cannot determine whether the exposure came before the outcome. Confounders that might explain the observed association include age and race.

B. From ANOVA to Regression (5 points)

ANOVA compares the mean outcome across the levels of a categorical predictor, whereas simple linear regression describes how the mean outcome changes along a continuous exposure. Regression therefore gives a slope, an estimate of the change in the outcome per unit of the predictor, which ANOVA does not. If my predictor is categorical I would use ANOVA; if my predictor is continuous I would use regression.
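
The two methods are closely related: a one-way ANOVA gives the same overall F-test as a regression on the same categorical predictor. The sketch below illustrates this with PlantGrowth, a dataset that ships with base R and stands in for the homework data only as an assumption for demonstration.

```r
# PlantGrowth (datasets package): plant weight across three treatment groups
fit_aov <- aov(weight ~ group, data = PlantGrowth)
fit_lm  <- lm(weight ~ group, data = PlantGrowth)

# Overall F statistic from the ANOVA table
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

# Overall F statistic from the regression summary
f_lm <- unname(summary(fit_lm)$fstatistic["value"])

cat("ANOVA F:     ", f_aov, "\n")
cat("Regression F:", f_lm, "\n")
```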

C. R Programming Growth (4 points)

The most challenging part of this assignment from a programming perspective was interpreting the difference between the confidence interval and the prediction interval returned by predict(). I read through the lecture materials to understand the difference. I feel more confident reading a 95% confidence interval table after completing this homework.

End of Homework 2