EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/hw02/bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Dietary Cacium intake vs. Total femur bone mineral density",
    x       = "Dietary Calcium intake (mg/day)",
    y       = "Total femur bone mineral density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences): 1) Is there a visible linear trend? In which direction? Yes, according to the scatterplot there is a positive linear relation between Dietary Calcium intake (mg/day) and Total femur mineral density (g/cm²), meaning the more you have calcium in your diet per day, the higher your Bone mineral density 2) Does the relationship appear weak? There is a weak relationship, most of the points are around 1000 mg/day, and they are located in the wide range of BMD axis , therefore high variability, low strength of association 3)Are there any notable outliers? yes, dietary calcium intake 3000-5000 mg(day)

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

augmented <- augment(model)

#Tidy coefficient table
tidy(model, conf.int = TRUE) %>%
  mutate(across(where(is.numeric))) %>%
  kable(
    caption = "Simple Linear Regression: BMD ~ calcium",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: BMD ~ calcium
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.8992037	0.0071915	125.036678	0.00e+00	0.8851006	0.9133069
calcium	0.0000308	0.0000075	4.131029	3.75e-05	0.0000162	0.0000454

b0 <- round(coef(model)[1], 3)
b1 <- round(coef(model)[2], 4)

Step 2 — Interpret the Coefficients (10 points)

Fitted regression equation:

\[\widehat{\text{BMD}} = 0.899 + 0 \times \text{calcium (mg/day)}\] A. Intercept (β₀): The estimated Bone Mineral density = = 0.89, when calcium intake is 0(mg/day) ; The intercept is not practically interpretable in this context, but is necessary to anchor the line.

B. Slope (β₁): 1)Slope interpretation: For every 1-unit increase in calcium (mg/day), the estimated change in BMD (g/cm²)= 0.0000308. 2)State the direction. This is a direct relation between calcium intake and BMD. 3) Is the effect large or small given typical calcium intake ranges? *Given the typical calcium intake range: 0-5000 mg/day, for every 1000 mg/day increase in dietary calcium intake estimated increase in BMD 0.0308 (g/cm²). As the range of BMD 0-1.6, effect is still small.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: [0=0, There is no linear relationship between BMD and daily calcium intake (mg/day)]
H₁: [0≠ 0. There is a linear relationship between BMD and daily calcium intake (mg/day)]

Report the test results:

[t = 4.13, df = 2127, p < 0.001. We reject H₀, there is a statistically significant positive linear association between daily calcium intake and total femur BMD, though the relationship is weak]

Interpret the 95% confidence interval for β₁:

95% confidence interval for [0.0000162 0.0000454], CI does not include null value, meaning that the relationship is statistically significant

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

n <- nrow(bmd_analytic)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
  Units    = c("", "", "g/cm²")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value	Units
SS Residual	53.260
MSE (σ̂²)	0.02

# Extract R-squared from model
r_sq <- summary(model)$r.squared
adj_r_sq <- summary(model)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.008
Adjusted R²	0.0075
Variance Explained	0.8%

R² (coefficient of determination):

[0.8% of the BMD is explained by the dietary calcium intake (mg/day); The model fits poorly.This means that more that 99% of BMD variability explained by other predictors]

Residual Standard Error (RSE):

[RSE = 8.32(g/cm²). On average, our model’s predictions are off by about 8.32 (g/cm²) of Total femur Bone mineral density.]

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) Predicted BMD at calcium = 1,000 mg/day is 0.93 (g/cm²)
What does the 95% confidence interval represent? The 95% confidence interval represents that for people in our sample, who have calcium intake 1000 mg per day, the mean BMD lie within 95% CI range (0.92 - 0.94)
What is the 95% prediction interval, and why is it wider than the CI? The 95% prediction interval means that some new specific individual who takes 1000 mg of calcium per day, will have BMD that will lie within 95% PI range (0.62-1.24). PI is wider that CI because of the individual variability in BMD among people with the same calcium intake and uncertainty in estimating the mean BMD (on one person).
Is 1,000 mg/day a meaningful value to predict at, given the data?] 1000 mg/day is not a meaningful value to predict at, because PI is crosses the null value and quite wide, means that BMD will vary at calcium intake 1000 mg/day

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

What does the regression model tell you about the calcium–BMD relationship? There is a weak positive linear relationship between Total femur Bone mineral density (g/cm^2) and Dietary calcium intake (mg/day). Although the linear relationship is statistically significant, calcium intake only explains 0.8% of variance in BMD, meaning that other factors (age, sex, genetics etc.) explains more than 99.2% of BMD.

Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? Results were somewhat surprising, and do not fully align with literature suggesting calcium intake improves BMD. But as this is the data from cross-sectional study, there is no temporality (exposure and outcome were measured at the same time), so we can not claim that calcium intake caused changes in the BMD What confounders might explain the observed association? The age, sex, physical activity and ethnicity could explain the observed association at some level

B. From ANOVA to Regression (5 points)

Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: 1)what kinds of questions does each method answer? One-way ANOVA method answer the question like “Is there any difference in the mean BMD levels across 3 or more ethnicity groups?” So Anova tests the difference in the mean of continuous variable (BMD) by categorical variable (ethnicity). Linear regression explains if there is a linear association between continuous variable (BMD) and other predictors.Meaning how outcome variable changes when predictors changes for 1 unit.

2)What does regression give you that ANOVA does not? Regression can estimate the direction and strength of changes in the outcome by 1 unit change in predictor, if the design is longitudinal. Whereas ANOVE only compares group means

3)When would you prefer one over the other? We would use ANOVA if we simply want to establish if there are difference in means of 3 or more groups, and linear regression if we want to calculate the estimated effect of a predictor on outcome, and potentially explore causal relationship, if the design is longitudinal

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?] This was not most challenging assignment from a programming perspective, especially after completing labs on simple linear regression, multiple linear regression and the Data Checkin#1 homework. I feel more confident in reading and understanding chunks of R codes than at the beginning of the semester. I also understand better theory behind Linear regression: concepts of slopes and predicted values and Residual standard error.

End of Homework 2