EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('E:/College WORK MS in EPI/SEM 2 MS EPI/Epi553 STATS 2/bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never")),

#recode metcat as a labeled factor
metcat = factor(metcat,
                levels = c(0, 1, 2),
                labels = c("Inactive", "Insufficiently active", "Active")),

#recode BMD category as a labeled factor
tbmdcat = factor(tbmdcat, 
                 levels = c(0,1),
                 labels = c("Normal", "Low bone mass")),
)

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Association Between Dietary Calcium Intake and Bone Mineral Density",
    subtitle = "NHANES 2017-2018 Adults (N = 2,605)",
    x = "Dietary Calcium Intake (mg/day)",
    y = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

The scatterplot shows a weak, slightly positive relationship between dietary calcium intake and total femur bone mineral density (BMD). The trend looks positive as BMD is slightly increasing with calcium intake increase. Some individuals have extremely high calcium intake (3,000–5,000 mg/day), but their BMD values do not differ much from the rest. There are no extreme vertical outliers to distort the trend.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The intercept is 0.8992 g/cm² for the predicted bone mineral density when dietary calcium intake is 0 mg/day. From a statistical standpoint, this is the y-intercept where the regression line crosses the vertical axis. However, this value is not practical at all because even individuals with very poor diets obtain some calcium from their food. It provides a baseline from which the effect of calcium intake is measured, to anchor down the slope.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

The slope is 0.00003079 g/cm² per mg/day of calcium, which means for every 1 mg/day increase in dietary calcium intake, bone mineral density is estimated to increase by approximately 0.00003 g/cm². This means that increase of 100 mg/day in calcium intake corresponds to an estimated BMD increase of 0.0031 g/cm². The association is positive but small in magnitude. Given that typical calcium intake ranges from 400-1600 mg/day in this sample, the total predicted difference in BMD across this range would be only about 0.0123 - 0.049 g/cm², which represents a modest effect relative to the typical variability in BMD values.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

Null Hypothesis (H₀): β₁ = 0
There is no linear association between dietary calcium intake and bone mineral density in the population. Any observed association in the sample is due to chance.

**Alternative Hypothesis (H₁): β₁ ≠ 0
There is a linear association between dietary calcium intake and bone mineral density in the population.

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

t-statistic: 4.131 - Degrees of freedom: 2127 - p-value: 3.751e-05

Conclusion: We reject the null hypothesis at the α = 0.05 significance level (p = 3.751e-05 < 0.05). There is statistically significant evidence of a positive linear association between dietary calcium intake and bone mineral density. We are 95% confident that the observed relationship is unlikely to have occurred by chance alone in the population.

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

The 95% confidence interval for the slope (β₁) is [1.617334e-05 to 4.540649e-05] g/cm² per mg/day. This means we are 95% confident that the true population slope lies within this range. In simple words, for every 100 mg/day increase in calcium intake, we estimate the true increase in BMD to be between 0.0016 and 0.0045 g/cm². Because this interval does not include 0, it confirms our hypothesis test conclusion that there is a statistically significant positive association. However, the narrow range and small magnitude suggest the effect, while real, is modest.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

R² = 0.007959

This indicates that only 0.80% of the variance in bone mineral density is explained by dietary calcium intake alone. This is an extremely low R², suggesting that the simple linear model with calcium as the sole predictor provides a poor fit to the data. While the association between calcium and BMD is statistically significant, calcium intake explains almost none of the variability in BMD across individuals. This strongly suggests that other factors like age, sex, body composition, physical activity, vitamin D status, hormonal factors, genetics, and overall dietary patterns probably affect bone mineral density more than dietary calcium intake alone. A multivariable model including these additional predictors would be necessary to adequately explain BMD variation.

Residual Standard Error (RSE):

*[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

Residual Standard Error = 0.1582 g/cm² on 2127 degrees of freedom. The RSE represents the average distance that observed BMD values deviate from the fitted regression line. In other words, the model’s predictions are off by approximately 0.15 g/cm² on average. Given that BMD values in this sample typically range from about 0.7 to 1.2 g/cm², an average prediction error of 0.15 g/cm² is substantial—roughly equivalent to 30% of the typical range. This large RSE, combined with the very low R², confirms that a simple linear model with calcium alone does not predict individual BMD values with much precision. Most of the variation in BMD remains unexplained and is captured in the residual error term.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)

The predicted bone mineral density for an individual consuming 1,000 mg of calcium per day is 0.930 g/cm². Alternatively, we can also put calcium = 1,000 mg/day in our regression equation: BMD = 0.8992 + (0.00003079 × 1,000), giving us 0.930 g/cm².

What does the 95% confidence interval represent?

The 95% confidence interval for the mean response is lwr 0.923 to upr 0.937 g/cm². This interval estimates the average (mean) BMD for all individuals in the population who consume 1,000 mg/day of calcium. We are 95% confident that the true population mean BMD at this calcium intake level falls within this range. This is a statement about the population parameter (the mean), not about individual predictions.

What is the 95% prediction interval, and why is it wider than the CI?

The 95% prediction interval is lwr 0.620 to upr 1.240 g/cm², which is wider than the confidence interval. This interval estimates where a single individual’s BMD value is likely to fall if they consume 1,000 mg/day of calcium. The prediction interval is wider because it accounts for two sources of uncertainty: (1) uncertainty in estimating the population mean (same as the CI), plus (2) natural variability among individuals around that mean. Even if we knew the true population mean BMD perfectly, individuals would still vary considerably due to factors like genetics, physical activity, age, sex, and other dietary factors. The prediction interval accounts for this individual-level variation, making it much wider than the confidence interval.

Is 1,000 mg/day a meaningful value to predict at, given the data?]

Yes, 1,000 mg/day is a meaningful value to predict at.

It falls well within the range of observed calcium intakes in our dataset (which ranges from approximately 100 to 5,000 mg/day, with most values between 400-1,600 mg/day). This means we are interpolating(estimating unknown values between known data), not extrapolating (predicts values outside that range), which makes our prediction reliable.
1,000 mg/day is the Recommended Dietary Allowance (RDA)** for calcium for most adults. Predictions at this value is applicable to public health recommendations and can inform whether meeting the RDA is associated with better bone health outcomes.

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

Answer The simple linear regression model reveals a statistically significant positive association between dietary calcium intake and bone mineral density (β₁ = 0.00003 g/cm² per mg/day, p < 0.001). However, the clinical significance of this relationship is modest with a 100 mg/day increase in calcium associated with only a 0.003 g/cm² increase in BMD. These results is surprising because while I expected a positive association based on biological mechanisms as calcium is a key mineral in bone formation, the effect size was much smaller than I thought. The extremely low R² (0.8%) suggests that calcium intake alone explains almost none of the variation in BMD across individuals.

The key limitations of interpreting SLR from a cross-sectional survey as causal evidence is that this cross-sectional analysis cannot establish causation. The potential confounders that might explain the observed association. Like for example, individuals who consume more calcium may also engage in more physical activity, have healthier overall diets, take vitamin D supplements, or have higher socioeconomic status, all of which may affect bone health.

Reverse causation maybe is possible. individuals diagnosed with low bone mass may intentionally increase calcium intake following medical advice. Additionally, genetic factors, hormonal status (especially estrogen in postmenopausal women), body weight, smoking history, and medication use could all confound the calcium-BMD relationship. A randomized controlled trial or longitudinal study may be helpful.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

In Homework 1, we used one-way ANOVA to compare mean BMD across ethnic groups (categorical), trying to answer the question: “Do different ethnic groups have different average bone densities?”

ANOVA tests for differences in means between groups but treats ethnicity as discrete categories without any inherent ordering. In contrast, simple linear regression models BMD as a continuous function of a continuous predictor (calcium intake), answering: “How much does BMD change for each unit increase in calcium?”

Regression provides several advantages that ANOVA does not. First, regression quantifies the magnitude and direction of the association through the slope coefficient, allowing us to make precise predictions. Second, regression can give us predicted values for specific predictor levels like calcium levels, which ANOVA cannot. Third, we can also do the regression with multiple predictors in multivariable models, allowing us to control for confounders. Finally, regression provides both confidence intervals (for mean responses) and prediction intervals (for individual predictions), where we can have nuanced inference than ANOVA’s simple group comparisons.

However, ANOVA remains preferable when the predictor of interest is inherently categorical (like ethnicity) and we want to compare discrete groups without assuming any ordering or linear trend. Regression would be the better choice when studying continuous predictors (age, calcium intake, BMI) or when we need to make quantitative predictions. In practice, both methods have their strengths that helps us choose the right tool for the research question at hand.

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

Honestly, this assignment felt much more manageable than previous homeworks, which I think speaks to how much the weekly labs have prepared me for this kind of work. The repetition of fitting models, interpreting output, and creating visualizations in lab has made these tasks feel easy-ish now. I didn’t encounter any major roadblocks with the prediction intervals or model fitting—the concepts that seemed abstract a few weeks ago now feel more intuitive because I’ve practiced them repeatedly.

The most valuable skill I’ve developed isn’t necessarily learning new functions, but rather learning how to troubleshoot my own code systematically. When something doesn’t work, I’ve gotten better at methodically going through my code line by line to figure out where things went wrong. Sometimes it’s a missing comma, sometimes it’s a capitalization issue, and sometimes I’ve just piped something incorrectly. I’ve learned that most errors aren’t mysterious—they’re usually something small and fixable if I slow down and actually read the error message instead of panicking. And when I truly can’t figure it out, I’ve gotten more comfortable Googling the error message and reading through Stack Overflow or R documentation until I find some answers.

Another habit that’s served me well is keeping my code organized and well-commented—not for the grader, but for my own sake. I’ve learned that code that makes perfect sense while I’m writing it becomes completely incomprehensible three days later when I’m trying to knit the document and fix errors. Writing clear chunk labels and adding inline comments with “####” helps me retrace my own steps when I inevitably need to debug something. It’s almost like leaving breadcrumbs for my future self.

What I appreciate most about this homework is that I’m getting comfortable with using the same functions repeatedly in different contexts—lm(), summary(), predict(), ggplot(). Each time I use them, I understand them a little better. I’m not just memorizing syntax; I’m actually building muscle memory for the analytical workflow: fit model → check assumptions → interpret coefficients → make predictions. That consistent practice across labs and homeworks has been the real learning curve, and I can genuinely say I feel more confident approaching regression problems now than I did at the start of the semester.

End of Homework 2