EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/Assignments/HW02/bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables and print out total sample size
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

# Print out final analytic sample size
cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Figure1: Scatterplot of Dietary Calcium Intake on Total Bone Mineral Density with Fitted Regression Line",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?] There is a visible trend in the scatterplot. Points appear to be scattered around the line but demonstrating not a perfect relationship. It does appear to be roughly linear with a positive direction and weak association. There is a few points toward the top of the plot near 1.6, a few points near 0.4, and some all the way to the right of the graph that could be potential outliers; however, there do not appear to be any suggestions of non-linearities. —

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

# Display a tidy coefficient table with 95% CIs and slope, intercept, t-statistic, p-value
tidy(model, conf.int = TRUE) %>% 
  kable(
    caption = "Simple Linear Regression: DXXOFBMD ~ Calcium (BRFSS 2020)",
    col.names = c ("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>% 
  row_spec(0, bold = TRUE)

Simple Linear Regression: DXXOFBMD ~ Calcium (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.8992037	0.0071915	125.036678	0.00e+00	0.8851006	0.9133069
calcium	0.0000308	0.0000075	4.131029	3.75e-05	0.0000162	0.0000454

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?] The current intercept would suggest that the estimated total femur bone mineral density when calcium is 0 is 0.8992037 g/cm². This is not a meaningful quantity in this context because calcium levels are rarely ever 0 unless someone has a deficiency in calcium.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

For each 1 mg/day increase in calcium, total femur bone mineral density is estimated to increase by 0.0000308 g/cm², on average. This is considered a positive change in direction, it would be a negative relationship if there was a negative sign in front of the slope. The effect size is rather small; however, most individuals take in 1,000 mg of calcium per day which makes sense why the effect size is so small since it is only calculated for 1 mg increase in calcium. If we were to think of 1,000 mg of calcium the effect size would be 0.03 g/cm² which is still small.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

\(H_0: \beta_1 = 0 \quad \text{(no linear relationship between calcium and bone density)}\)
\(H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship between calcium and bone density)}\)

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

The t-statistic is 4.121029 and the p-value is 0.0000375. The degrees of freedom is 2127. This very small p-value which is below the threshold of 0.05 suggests that we should reject the null hypothesis and consider the evidence for the alternative hypothesis. This means that there is a linear relationship between calcium and bone density.

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

The confidence intervals demonstrate that 0.0000162 is the lower bound of the true slope and 0.0000454 is the upper bound of the true slope.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

# Extract R-squared from model
r_sq <- summary(model)$r.squared
adj_r_sq <- summary(model)$adj.r.squared

# Create a clean, styled table
tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.008
Adjusted R²	0.0075
Variance Explained	0.8%

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

It appears that dietary calcium intake explains 0.8% of the variance in bone mineral density. The model does not fit the data as well as it could because calcium intake explains only a small amount of variance and suggests that there are many other integrated factors that influence dietary calcium intake.

Residual Standard Error (RSE):

# Calculate the Redisual Standard Error
rmse_model <- round(summary(model)$sigma, 2)
cat("Root MSE or Residual Standard Error (RSE):", rmse_model, "g/cm² bone femur density\n")

## Root MSE or Residual Standard Error (RSE): 0.16 g/cm² bone femur density

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.] The residual standard error for this model is 0.16 g/cm² of bone femur density. This suggests that the model has a typical prediction error of 0.16 g/cm².

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD at calcium = 1,000 mg/day is 0.9299 g/cm².
What does the 95% confidence interval represent? The 95% confidence interval represents the range of values plausible for the true slope. The 95% lower confidence interval is 0.9299936 and the 95% upper confidence interval is 0.937076.
What is the 95% prediction interval, and why is it wider than the CI? The 95% prediction interval is always going to be wider than the confidence interval because it takes into account uncertainty in E(Y) and individual variability around the mean. The prediction interval has a 95% lower prediction interval of 0.6195964 and a 95% upper prediction interval of 1.240391.
Is 1,000 mg/day a meaningful value to predict at, given the data?] 1,000 mg/day is a not a fully meaningful value to predict at because the National Institute of Health suggest that the recommended calcium intake for adults 19-50 is 1,000 mg/day (Mayo Clinic, 2025). This means that this is a typical value for a normal individual to be consuming for calcium. It would be more meaningful to employ 1,500 mg/day or 2,000 mg/day. Furthermore, if you look at the BMD data, it appears that numerous individuals demonstrate a DXXOFBMD near the predicted value of 0.9299 g/cm².

References Mayo Clinic. (2025, April 24). Calcium intake and absorption: Are you getting enough? Mayo Clinic Health System. https://www.mayoclinichealthsystem.org/hometown-health/speaking-of-health/calcium-intake-and-absorption —

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

The regression model tells us that the relationship between calcium and BMD is positive and linear in nature. The results to me were not surprising because there has been countless information and research done regarding calcium intake and bone density. One limitation of interpreting SLR from a cross-section survey is that observations from the same household or geographic cluster may not fully be independent. One confounder that may explain the observed association are income because it may determine how much money there is to be spend on groceries and calcium intake. Another confounder could be age because it has been suggested that as individuals age, they lose bone mineral density.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

ANOVA is preferred when we are interested in comparing three or more group means and want to employ an omnibus test that controls for error at α = 0.05. ANOVA answers questions such as if blood pressure reduction differs across three different medications. Simple linear regression is preferred when the purpose of research is to quantify the linear relationship between a continuous outcome and a single predictor, predict values of an outcome given a predictor value, and test hypotheses about whether a predictor is associated with an outcome. It is allows answers to questions regarding predicting specific variables such as BMI of patients as well as estimates for the average BMI of a group in the population because it provides confidence and prediction intervals.

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

I did not personally find any part of the programming components to be explicitly challenging because we have done a three labs on modeling (model/simple linear regression/ multiple linear regression) and this assignment, so I felt that I had a very strong foundation from the programming perspective. The part I did struggle with was interpreting the coefficient with the intercept and the slope. I always tend to confuse which one pertains to which component. To work through it, I went back through our other lab lectures and took time to full comprehend what the intercept and slope mean in a simple linear regression model. I feel more confident about debugging after this homework. I ran into a few errors and was able to fix them completely on my own.

End of Homework 2