EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('C:/Users/Alyss/OneDrive/Desktop/epi553/data/bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Dietary Calcium Intake vs. Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

There is a positive linear trend between calcium intake and femur bone density. The relationship appears to be weak. Most of the points are centered on the lower end of calcium intake with a small number of points that vary as calcium intake increases.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀ = 8.992e-01):

The intercept represents what the estimated mean total femur bone mineral density is when dietary calcium intake is 0. The intercept is there to anchor the line but is can not be directly interpreted.

B. Slope (β₁ = 3.079e-05):

For every 1 unit increase in calcium (mg/day), femur BMD is estimated to increase by 3.079e-05 g/cm², when holding everything else constant. The direction is positive. Given typical calcium intake ranges, the effect is small.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0 ( There is no linear relationship between dietary calcium intake and femur BMD.)
H₁: β₁ \(\neq\) 0 (There is a linear relationship between dietary calcium intake and femur BMD.)

Report the test results:

t-statistic: 4.131
Degrees of freedom: 2128
p-value: 3.75e-05

With a p-value of 3.75e-05, we reject H₀ at the ⍺ = 0.05 level. There is statistically significant evidence of a linear association between dietary calcium intake and femur BMD.

Interpret the 95% confidence interval for β₁:

If we conducted this study many times, 95% of the confidence intervals would contain the true slope.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

# Extract R-squared from model
r_sq <- summary(model)$r.squared
adj_r_sq <- summary(model)$adj.r.squared

tibble(
   Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.008
Adjusted R²	0.0075
Variance Explained	0.8%

rse <- round(summary(model)$sigma, 2)
cat("Residual Standard Error:", rse, " g/cm²\n")

## Residual Standard Error: 0.16  g/cm²

R² (coefficient of determination):

The proportion of variance in BMD that is explained by dietary calcium intake is 0.8%. Based on R², the model fits the data poorly as over 99% of the data’s variance remains unexplained. This suggests that other predictors may play a crucial role in explaining the variance in BMD

Residual Standard Error (RSE):

RSE = 0.16 g/cm². On average, the model’s prediction is off by 0.16 g/cm² for BMD.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
1. The predicted BMD when calcium intake = 1,000 mg/day is 0.9230 g/cm².
What does the 95% confidence interval represent?
1. The confidence interval represents how likely it is that the true value of BMD falls between the intervals if the study was repeated many times.
What is the 95% prediction interval, and why is it wider than the CI?
1. The prediction interval represents how likely a new observation not included in the study of BMD would fall between the intervals if the study designs were repeated many times. The prediction interval is wider than the confidence interval because it has to account for the uncertainty of a new observation and for the individual variability around the mean.
Is 1,000 mg/day a meaningful value to predict at, given the data?]
1. Yes, calcium intake of 1,000 mg/day is a meaningful value to predict at because those value is used as a current recommendation for adults to maintain and/or improve bone health.

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

The regression model tells us that there is a weak positive relationship between dietary calcium intake and total femur bone mineral density. The results were a little surprising that dietary calcium intake does not have more effect on BMD. Key limitations of interpreting SLR from a cross-sectional survey as causal evidence are we cannot determine the direction of the association, and we cannot determine if the data is still applicable to current trends. Confounders that could explain the observed association are differences in gender, ethnicity, or smoking status.

One-way ANOVA answers the question of whether categorical groups have different means and simple linear regression answers the question of whether there is a linear association between variables of interest and predicts the outcome value based on a predictor variable. Regression gives us the strength and direction of the relationship, while ANOVA can only show whether there is a difference in groups. ANOVA should be used to compare categorical group and SLR should be used for looking into relationships between variables.

The most challenging part of this assignment was recoding the variables in the dataset. I had never recoded any variables in r before. I worked through it by referencing the lectures. I feel more confident in recoding variables and interpreting outputs from R.

End of Homework 2