EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/bmd(in).csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd_lab <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd_lab), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd_lab$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd_lab$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd_lab %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Association between dietary calcium intake and total femur bone mineral density",
    x       = "Calcium (mg/day)",
    y       = "DXXOFBMD (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

The scatterplot shows a weak positive linear relationship between calcium intake and bone mineral density. As calcium intake increases, bone mineral density appears to increase slightly. However, the points are widely scattered around the line, suggesting the association is relatively weak.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The intercept is 0.8992, which represents the predicted value of bone mineral density when calcium intake is zero. It suggests that if a person consumed no calcium then their bone mineral density would be about 0.8992 units. An intercept at zero calcium intake is likely not meaningful because it is unrealistic for someone to have zero calcium intake in a day. The intercept in this case mainly serves as a mathematical baseline for the regression equation.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

The slope for calcium is reported as 3.079e-05. This means that for every 1 mg/day increase in calcium intake, the estimated bone mineral density increases by 3.079e-05 which is an extremely small amount. The direction of the effect is positive, so higher calcium intake is associated with slightly higher bone mineral density. Due to peoples calcium intake varying by hundreds of mg/day, the increase in BMD across realistic intake differences would be relatively small. The relationship is postive but the practical effect is small.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

slope_test <- tidy(model, conf.int = TRUE) %>% filter(term == "calcium")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
     2129-2,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b₁)	0
SE(b₁)	0
t-statistic	4.131
Degrees of freedom	2127
p-value	3.75e-05
95% CI Lower	0
95% CI Upper	0

State your hypotheses:

H₀: There is no linear association between calcium intake and bone mineral density.B=0
H₁:There is an linear association between calcium intake and bone mineral density. B≠0

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

T-statistic: 4.131 Degrees of freedom: 2127 p-value: 3.75e-05

Conclusion: There is statistically significant evidence that calcium intake is associated with bone mineral density.The p-value is much smaller than 0.05 so we can reject the null hypothesis, this shows that there is a linear relationship between calcium intake and BMD.

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

CI: [1.617334e-05,4.540649e-05] Interpretation: We are 95% confident that the true effect of calcium on bone mineral density is very small but positive.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

#R²
augmented <- augment(model)

r_sq <- summary(model)$r.squared
adj_r_sq <- summary(model)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.008
Adjusted R²	0.0075
Variance Explained	0.8%

#Residual Standard Error
n <- nrow(bmd_lab)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
  Units    = c("", "", "kg/m²")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value	Units
SS Residual	53.260
MSE (σ̂²)	0.01

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

The R^2 value is 0.008 meaning that 0.8% of the variance in bone mineral density is explained by calcium intake. The R^2 indicates that the model does not fit the data well since calcium intake alone only explains very little of the variation in bone mineral density. This suggests that other confounding factors are likely play a role in determining bone mineral density rather than calcium intake alone.

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

The RSE is 0.1582 kg/m^2. This means that even though the model finds a statistical significance between calcium intake and bone mineral density, the predicted error is relatively large compared to the effect of calcium.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_bmd <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_bmd, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_bmd, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD at calcium=1,000 mg/day is approximately [0.9229112 g/cm^2, 0.937076 g/cm^2].
What does the 95% confidence interval represent? The 95% CI represents the range of plausible values for the average BMD among populations who consume 1,000 mg/day of calcium.
What is the 95% prediction interval, and why is it wider than the CI? The 95% PI is the expected range for BMD of a single individual with that calcium intake. The PI value for the dataset is [0.6195964, 1.240391]. The PI is wider than the IC because it accounts for uncertainty in estimating the mean and natural variability in the individual BMD values.
Is 1,000 mg/day a meaningful value to predict at, given the data?] Predicting at 1,000 mg/day is meaningful because it falls within the main range of observed calcium intake values in the data set. —

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

The regression model tells us that there is a statistically significant relationship between calcium and BMD but it appears to be a weak positive association. As calcium intake increase, BMD increases but the magnitude of this effect is extremely small. I found the results to be somewhat surprising. I knew that there was an association between calcium and BMD but I thought the relationship would be stronger. Some limitations to interpreting SLR from a cross sectional study is that cross sectional data measures exposure and outcome at the same time, so it is not possible to determine whether calcium intake actually causes higher BMD. In addition, there may be confounding factors influencing both calcium intake and BMD. Confounding factors could be age, sex, BMI, physical activity, other vitamin intake, and dietary patterns. These confounding factors could be contributing to stronger bones which would explain the association. B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

ANOVA is used when the predictor variable is categorical and the goal is to test whether the mean outcome differ between groups. SLR is used when the predictor variable is continuous and allows us to estimate how much the outcome changes for each unit increase in the predictor. While ANOVA, focuses on differences between group means, SLR provides a quantitative estimate of the strength and direction of the association. ANOVA should be used when comparing several discrete groups, whereas regression is better for examining relationships involving continuous predictors.

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

Since this assignment provided a lot of the code for us, I didn’t struggle with creating my own code. The main challenge was understanding the scatterplot and the tables that were produced. I focused on understanding what I was looking at. I have had challenges in the past with interpreting/understanding the direction, strength, and variability, so this assignment helped me better understand because I could soley focus on interpreting the data instead of having to write the code and interprete the data.

End of Homework 2