EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_Samriddhi Ranjan, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)

# Import the dataset — update the path if needed
bmd <- read.csv("/Users/samriddhi/Downloads/bmd(in).csv")
# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

Answer:

Total N: 2,898
Missing DXXOFBMD: 612
Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Answer:

Quantity	Value
Missing DXXOFBMD only	476
Missing calcium only	157
Missing BOTH (overlap)	136
Rows dropped (missing either)	769
Final analytic N	2,129

In total, 769 rows were dropped. Because 136 rows are missing both variables, the number of dropped rows (769) is smaller than the sum of the individual missingness counts (612 + 293 = 905): 905 - 136 = 769. The final analytic N is therefore 2,898 - 769 = 2,129.
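The overlap counts in the table can be cross-checked with a few logical sums. The sketch below is one possible way to do it in base R; `missing_overlap()` is a hypothetical helper written for illustration, and `toy` is made-up data standing in for `bmd`.

```r
# Hypothetical helper: count missingness patterns on two columns of a data frame
missing_overlap <- function(df, x, y) {
  miss_x <- is.na(df[[x]])
  miss_y <- is.na(df[[y]])
  c(
    x_only   = sum(miss_x & !miss_y),   # missing x but not y
    y_only   = sum(!miss_x & miss_y),   # missing y but not x
    both     = sum(miss_x & miss_y),    # missing on both (the overlap)
    dropped  = sum(miss_x | miss_y),    # rows removed by the filter
    complete = sum(!miss_x & !miss_y)   # rows kept in the analytic data
  )
}

# Toy data, made up for illustration (not the homework dataset)
toy <- data.frame(
  DXXOFBMD = c(1.05, NA, 0.88, NA, 0.95, 0.91),
  calcium  = c(500, 470, NA, NA, 660, 780)
)

overlap <- missing_overlap(toy, "DXXOFBMD", "calcium")
print(overlap)
```

With the homework data, the equivalent call would be `missing_overlap(bmd, "DXXOFBMD", "calcium")`, and the reported counts should satisfy 476 + 157 + 136 = 769.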

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Association Between Dietary Calcium Intake and Total Femur BMD",
    x     = "Dietary Calcium Intake (mg/day)",
    y     = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal(base_size = 13)

Written interpretation (3–5 sentences):

The scatterplot shows a slight positive linear trend between dietary calcium intake and total femur bone mineral density (BMD). As calcium intake increases, bone mineral density tends to increase modestly. However, the points are widely dispersed around the regression line, suggesting that the relationship is relatively weak. A few observations with very high calcium intake appear somewhat distant from the main cluster and may represent potential outliers, but there is no clear evidence of strong non-linearity.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept (0.899 g/cm²) is the model’s predicted total femur BMD when dietary calcium intake is zero. It anchors the regression line, but it has limited clinical or biological relevance because it is not plausible for an individual to have zero dietary calcium intake.

B. Slope (β₁):

The slope (3.079 × 10^-5 g/cm² per mg/day) indicates that each additional 1 mg/day of dietary calcium intake is associated with an average increase in total femur BMD of about 3.079 × 10^-5 g/cm², or roughly 0.003 g/cm² per additional 100 mg/day. This reflects a positive association between calcium intake and BMD. However, R² is very low (0.007959), meaning that calcium explains less than 1% of the variability in BMD. Even though the association is statistically significant (p < 0.001), it is very small in practical terms.
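Because a per-mg slope is hard to read, it can help to express it per 100 mg/day. The sketch below uses simulated data (an assumption made only for illustration, not the homework dataset) to show that dividing the predictor by 100 multiplies the slope by 100 and leaves R² unchanged; the last lines apply the same rescaling to the reported estimate.

```r
# Simulated data for illustration only (not the homework dataset)
set.seed(553)
calcium <- runif(200, min = 100, max = 2500)             # mg/day
bmd_sim <- 0.90 + 3e-05 * calcium + rnorm(200, sd = 0.15)

fit_mg    <- lm(bmd_sim ~ calcium)            # slope per 1 mg/day
fit_100mg <- lm(bmd_sim ~ I(calcium / 100))   # slope per 100 mg/day

slope_mg    <- unname(coef(fit_mg)[2])
slope_100mg <- unname(coef(fit_100mg)[2])

# Rescaling the predictor rescales the slope; the fit itself is unchanged
print(c(per_mg = slope_mg, per_100mg = slope_100mg))
print(all.equal(slope_100mg, 100 * slope_mg))
print(all.equal(summary(fit_mg)$r.squared, summary(fit_100mg)$r.squared))

# Same rescaling applied to the estimate reported in summary(model)
reported_per_100mg <- 3.079e-05 * 100
print(reported_per_100mg)
```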

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0. There is no linear association between dietary calcium intake and total femur bone mineral density.
H₁: β₁ ≠ 0. There is a linear association between dietary calcium intake and total femur bone mineral density.

Report the test results:

t-statistic: 4.131
Degrees of freedom: 2127
p-value: 3.751 × 10^-5 (< 0.001)

Conclusion: Since the p-value is smaller than 0.05, we reject H₀. This indicates that there is statistical evidence of a linear association between calcium intake and femur BMD. Therefore, higher calcium intake is associated with slightly higher BMD.
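The test statistic and p-value can be reproduced from the coefficient table as a sanity check. This is a minimal sketch that copies the rounded estimate, standard error, and degrees of freedom from the `summary(model)` output, so the results agree with the printed values only up to rounding.

```r
# Values copied (rounded) from the summary(model) output
estimate <- 3.079e-05   # slope for calcium
std_err  <- 7.453e-06   # its standard error
df_resid <- 2127        # residual degrees of freedom (n - 2)

# t statistic for H0: slope = 0, and its two-sided p-value
t_stat  <- estimate / std_err
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(round(t_stat, 3))
print(signif(p_value, 3))
```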

Interpret the 95% confidence interval for β₁:

95% CI: (1.617 × 10^-5, 4.541 × 10^-5) g/cm² per mg/day

We are 95% confident that the true slope, that is, the average change in total femur BMD per 1 mg/day increase in dietary calcium intake, lies between 0.00001617 and 0.00004541 g/cm². Because the interval does not include 0, it supports the conclusion that calcium intake has a statistically significant positive association with BMD.
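The interval reported by `confint(model)` is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds it from the rounded values in the coefficient table, so it matches the printed interval only up to rounding.

```r
# Values copied (rounded) from the summary(model) output
estimate <- 3.079e-05
std_err  <- 7.453e-06
df_resid <- 2127

# 95% CI: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df_resid)
ci_slope <- estimate + c(-1, 1) * t_crit * std_err

print(ci_slope)         # per 1 mg/day
print(ci_slope * 100)   # per 100 mg/day
```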

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

The R² from the model is 0.00796, meaning that dietary calcium intake explains about 0.8% (less than 1%) of the variance in total femur bone mineral density. While the association is statistically significant (p < 0.001), it accounts for very little of the variability in BMD. This low R² suggests that BMD is influenced by many factors beyond dietary calcium, such as age, sex, body mass index, physical activity, smoking status, and race/ethnicity.
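In simple linear regression, the overall F statistic is the square of the slope's t statistic, and R² can be written as F / (F + residual df). The sketch below checks both identities against the reported output, using the rounded t value, so agreement is approximate.

```r
# Values copied (rounded) from the summary(model) output
t_stat   <- 4.131
df_resid <- 2127

# With one predictor, F = t^2 and R^2 = F / (F + residual df)
f_stat    <- t_stat^2
r_squared <- f_stat / (f_stat + df_resid)

print(round(f_stat, 2))       # compare with the reported F-statistic
print(signif(r_squared, 3))   # compare with the reported Multiple R-squared
```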

Residual Standard Error (RSE):

The residual standard error is 0.1582 g/cm². It estimates the standard deviation of the residuals, so observed BMD values typically differ from the model’s predicted values by about 0.16 g/cm².
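The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom (n - 2 for one predictor). The sketch below verifies this on simulated data, which is an assumption used only for illustration; the same lines would work on `model`.

```r
# Simulated data for illustration only (not the homework dataset)
set.seed(553)
x <- runif(100, min = 100, max = 2500)
y <- 0.90 + 3e-05 * x + rnorm(100, sd = 0.15)
fit <- lm(y ~ x)

# RSE = sqrt(SSE / (n - 2)); sigma() returns the same quantity from the fit
sse         <- sum(residuals(fit)^2)
rse_by_hand <- sqrt(sse / df.residual(fit))

print(c(by_hand = rse_by_hand, from_lm = sigma(fit)))
```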

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)

The predicted total femur bone mineral density at 1,000 mg/day of calcium intake is 0.93 g/cm².

What does the 95% confidence interval represent?

The 95% confidence interval (0.92–0.94 g/cm²) represents the range in which the true mean BMD for individuals consuming 1,000 mg/day of calcium is expected to lie with 95% confidence.

What is the 95% prediction interval, and why is it wider than the CI?

The 95% prediction interval (0.62–1.24 g/cm²) is the range in which the BMD of a single new individual with a calcium intake of 1,000 mg/day is expected to fall. It is substantially wider than the confidence interval because it must account for both the uncertainty in estimating the mean response and the natural individual variability in BMD around that mean.
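The width difference can be made concrete: the prediction interval uses a standard error that adds the residual variance to the variance of the estimated mean. The sketch below backs out the standard error of the mean from the reported confidence interval and then rebuilds the prediction interval; it relies on the rounded RSE of 0.1582, so it agrees with the `predict()` output only approximately.

```r
# Values copied from the predict() and summary(model) output
fit_val  <- 0.9299936   # predicted BMD at calcium = 1000
ci_upper <- 0.937076    # upper limit of the 95% CI for the mean
rse      <- 0.1582      # residual standard error (rounded)
df_resid <- 2127

t_crit  <- qt(0.975, df = df_resid)
se_mean <- (ci_upper - fit_val) / t_crit   # SE of the estimated mean response
se_pred <- sqrt(rse^2 + se_mean^2)         # adds individual-level variability

pred_int <- fit_val + c(-1, 1) * t_crit * se_pred
print(c(se_mean = se_mean, se_pred = se_pred))
print(pred_int)
```

Almost all of the prediction standard error comes from the residual term, which is why the prediction interval is so much wider than the confidence interval.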

Is 1,000 mg/day a meaningful value to predict at, given the data?]

Yes, 1,000 mg/day is a meaningful value because it falls within the typical range of dietary calcium intake in adults, so predicting BMD at this level is reasonable.
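Whether a prediction point is an interpolation can be checked directly against the observed predictor values. The sketch below is one possible check; `in_observed_range()` is a hypothetical helper, and `toy_calcium` is made-up data rather than the homework sample.

```r
# Hypothetical helper: is `value` inside the observed range and the central
# 5th-95th percentile band of x?
in_observed_range <- function(x, value, probs = c(0.05, 0.95)) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = probs, names = FALSE)
  list(
    range          = range(x),
    central        = q,
    within_range   = value >= min(x) && value <= max(x),
    within_central = value >= q[1] && value <= q[2]
  )
}

# Toy calcium intakes in mg/day, made up for illustration
toy_calcium <- c(250, 480, 610, 775, 850, 930, 1100, 1250, 1600, 2400)

check <- in_observed_range(toy_calcium, 1000)
print(check)
```

On the analytic sample, the corresponding call would be `in_observed_range(bmd_analytic$calcium, 1000)`.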

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

The simple linear regression model suggests that there is a positive association between dietary calcium intake and total femur bone mineral density (BMD). Individuals with higher calcium intake tend to have slightly higher predicted BMD values. However, the magnitude of the association is very small, and the model explains less than 1% of the variance in BMD, which indicates that calcium intake alone is not a strong predictor of BMD. Because these data come from a cross-sectional survey, the temporal ordering of calcium intake and BMD cannot be established, so no causal conclusion can be drawn about whether calcium intake affects BMD. Several potential confounders could also explain the association, including age, sex, physical activity, vitamin D intake, hormonal status, and genetic factors.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

One-way ANOVA and simple linear regression are both commonly used methods for analyzing a continuous outcome, but they address different types of research questions. In Homework 1, one-way ANOVA was used to compare mean BMD across different ethnic groups, which is appropriate when the predictor variable is categorical. In contrast, simple linear regression models BMD as a function of a continuous predictor such as calcium intake. It estimates a slope that quantifies how much BMD is expected to change for each unit increase in that predictor, and it can generate predicted values at specific levels of the predictor, which ANOVA does not provide. ANOVA is preferable when the goal is to compare group means, whereas regression is more useful when studying continuous exposures or trends in the data.
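The two methods are closely related: a one-way ANOVA is a linear model with a categorical predictor. The sketch below illustrates this on simulated data with three made-up groups (an assumption for illustration only), showing that `aov()` and `lm()` give the same F statistic and that the regression coefficients are the reference-group mean and the differences from it.

```r
# Simulated data for illustration only: three hypothetical groups
set.seed(553)
group <- factor(rep(c("A", "B", "C"), each = 30))
y     <- rnorm(90, mean = c(0.90, 0.95, 1.00)[as.integer(group)], sd = 0.15)

# Same F statistic from one-way ANOVA and from a linear model with a factor
f_aov <- summary(aov(y ~ group))[[1]][["F value"]][1]
f_lm  <- anova(lm(y ~ group))[["F value"]][1]
print(c(aov = f_aov, lm = f_lm))

# Regression coefficients: reference-group mean, then differences from it
group_means <- tapply(y, group, mean)
coefs       <- coef(lm(y ~ group))
print(group_means)
print(coefs)
```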

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

From a programming standpoint, the most difficult aspect of this assignment was interpreting the regression output and identifying the key statistics from the model summary. Carefully working through the code step by step enhanced my understanding of how to calculate confidence intervals, prediction intervals, and assess model fit in R.

End of Homework 2