EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv("data/bmd.csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Association between dietary calcium intake and total femur bone mineral density",
    x       = "Calcium (mg/day)",
    y       = "Bone mineral density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

There appears to be a positive linear association between dietary calcium intake and total femur bone mineral density. The relationship appears to be weak with most observations scattering around 800-1,000 with a BMD ranging between 0.7-1.2. There are many outliers with extreme high values for calicum and low or high values for BMD.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept for this model is 0.8992. This represents the BMD when calcium is equal to zero. This is a mathematical artifact, as calcium cannot be zero for a living individual, and is meant to anchor the regression line.

B. Slope (β₁):

The slope represents the estimated change in BMD for every unit increase in calcium. For every 1-unit increase in calcium (mg/day), there is an estimated 3.079e-05 change in BMD (g/cm²). This effect is moderate considering calcium can be over 5,000, and it is positive, meaning as calcium increases BMD is expected to increase.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: (β₁ = 0) There is no evidence of a linear relationship between calcium and BMD.
H₁: (β₁ /= 0) There is evidence of a linear relationship between calcium and BMD.

Report the test results:

The test show a t-statistic of 125.037, with 2127 degrees of freedom and a p-value of <2e-16. This represents a statistically significant finding so we can reject the null and report that there is evidence of a linear relationship between calcium and BMD.

Interpret the 95% confidence interval for β₁:

The 95% CI of the slope= [0.885, 0.913]. This means that if the sampling process were repeated many times, 95% of the calculated intervals would contain the true slope.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

The R² for this model is 0.007959, meaning that only ~0.8% of the variance in BMD is explained by dietary calcium intake. Therefore, 99.2% of the variance in BMD is not explained in this model; however, small factors can still be significant and have clinical or public health importance. Including other measures to better explain the BMD variance would improve this model.

Residual Standard Error (RSE):

The residual standard error for this model is 0.1582. This tells me that the average prediction error of this model is off by approximately 0.153 BMD units.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD at 1,000 mg/day calcium is 0.93 g/cm^2.
What does the 95% confidence interval represent? The 95% CI is [0.923, 0.937]. This means that if the sampling process were repeated many times, 95% of the calculated intervals would contain the true value.
What is the 95% prediction interval, and why is it wider than the CI? The 95% PI is [0.620, 1.240]. This means that if a new individual was added to the sampling frame, 95% of the calculated intervals would contain their true value.
Is 1,000 mg/day a meaningful value to predict at, given the data?] Yes, as this is between the 1st and 3rd quartile for sampled values. A value closer to the mean, 848, or median, 781, may be a better value to predict. —

Part 4: Reflection (15 points)

The results for this regression model were slightly surprising. Calcium is known to be a critical component of BMD; however, this model suggests that other factors may be much more predictive of low BMD. While calcium is significantly associated with BMD, other factors such as age, sex, and BMI likely account for much more variance for BMD. Using a MLR model would be more complete for analyzing this association. In addition, an ANOVA model could further enhance this analysis by comparing the mean BMD between groups, such as those with varying levels of calcium. This test would show whether the mean BMD does differ between groups with different levels of dietary calcium. This differs from a regression model which seeks to understand the relationship between calcium and BMD and model the effect size and predict values. An ANOVA or MLR model would be effective at analyzing the association between calcium and BMD. This homework was helpful in performing an example analysis on public health data. I would like to further analyze this relationship by including more variables to improve the fit of my linear model. In addition, I would like to see how an ANOVA test could enhance my overall analysis. The most challenging part was interpreting the model results.