EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

The raw dataset contains 2,898 observations. After excluding 612 records missing DXXOFBMD and 293 records missing calcium, the final analytic sample is n = 2,129.

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Dietary Calcium Intake and Total Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur BMD (g/cm\u00b2)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

The scatterplot shows a weak positive linear trend between dietary calcium intake and total femur BMD. As calcium intake increases, BMD increases slightly. This is shown by the upward-sloping fitted regression line. However, the relationship is not strong. There is a vertical scatter at every level of calcium intake, especially at lower intakes where most observations are concentrated. The data are right-skewed along the x-axis. Also there is a dense cluster below 2,000 mg/day and a sparse tail of very high intakes extending to ~ 5,000 mg/day. No extreme outliers, and the relationship appears approximately linear. This supports the use of a simple linear regression model.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Fitted equation: \[\widehat{\text{DXXOFBMD}} = 0.89920 + 0.00003079 \times \text{calcium}\]

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept is the estimated mean total femur BMD for an individual with a dietary calcium intake of 0 mg/day. The model predicts a BMD of 0.8992 g/cm² when Calcium is at zero. This value is not meaningful because a calcium intake of exactly zero is unlikely and outside the observed range. The intercept serves as an anchor for the regression line and not an estimate.

B. Slope (β₁):

For every 1 mg/day increase in dietary calcium intake, total femur BMD is estimated to increase by 0.00003079 g/cm² on average. This is given that everything else is held equal. The direction of the association is positive. This is a likely outcome as the biological expectation is that higher calcium intake would correlate to better bone density. A 500 mg/day difference in calcium intake corresponds to an estimated BMD difference of 0.0154 g/cm². This is small compared to the typical BMD range of approximately 0.4 to 1.6 g/cm². This suggests that dietary calcium has a modest effect on femur BMD by itself.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H0: β₁ = 0;dietary calcium intake is not linearly associated with total femur BMD; the population slope is zero.
H1: β₁ ≠ 0; dietary calcium intake is linearly associated with total femur BMD; the population slope differs from zero.

Report the test results:

The t-statistic for the slope is t(2127) = 4.131, with p = 3.75 × 10^-5. Since p < 0.05, we reject H0 at alpha = 0.05. There is statistically significant evidence of a positive linear association between dietary calcium intake and total femur BMD.

Interpret the 95% confidence interval for β₁:

The 95% confidence interval for the slope is (0.00001617, 0.00004541) g/cm² per mg/day. Because the entire interval lies above zero, this is consistent with the rejection of H0. It provides more evidence for a positive association. The interval is narrow.It reflects a small estimated effect of calcium on BMD.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption   = "Table 2. Model Fit Statistics",
    col.names = c("R-squared", "Adj. R-squared", "RSE", "F-statistic", "p-value", "N")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2. Model Fit Statistics
R-squared	Adj. R-squared	RSE	F-statistic	p-value	N
0.008	0.0075	0.1582	17.0654	0	2129

R² (coefficient of determination)=0.0080:

Dietary calcium intake explains only 0.80% of the variance in total femur BMD. This is an extremely low R² provides a poor fit. The vast majority of the variation in BMD is attributable to other factors that are not included in this model. This includes age, sex, body weight, physical activity, smoking status, and genetic predisposition to bone density. This does not necessarily mean calcium is not important. However, it does indicate that a single dietary predictor is insufficient to model BMD adequately. In addition, a multivariable approach including other covariates would be needed to explain meaningful variance.

Residual Standard Error (RSE) = 0.1582 g/cm²:

The RSE represents the average distance between observed BMD values and the values predicted by the fitted regression line. The model’s predictions are off by approximately 0.1582 g/cm². Given that total femur BMD ranges from roughly 0.4 to 1.6 g/cm², a prediction error of 0.1582 g/cm² represents a big proportion of that range. This adds on to the fact that dietary calcium alone is not a precise predictor of individual BMD.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

Predicted BMD: At a dietary calcium intake of 1,000 mg/day, the model predicts a mean total femur BMD of approximately 0.9300 g/cm².
95% Confidence Interval: The confidence interval at calcium = 1,000 mg/day estimates the range within which the true mean BMD falls for all individuals in the population with this level of calcium intake. It is relatively narrow because it describes the mean of many individuals and not than a single person.
95% Prediction Interval: The prediction interval estimates within which the range the BMD of a new individual with calcium = 1,000 mg/day would likely fall. It is wider than the confidence interval because it incorporates both the uncertainty the mean estimation and the individual to individual variability around that it (RSE = 0.1582 g/cm²). The prediction interval is always wider than the confidence interval because of this.
Is 1,000 mg/day a meaningful value to predict at, given the data? Yes, 1,000 mg/day is within the observed range of the data and clinically meaningful.It corresponds to the Recommended Dietary Allowance for calcium in adults aged 19–50. Predicting at this value is appropriate and relevant to a public health interpretation of the model.

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

The regression model confirms a statistically significant positive association between dietary calcium intake and total femur BMD (β₁ = 0.00003079, p = 3.75 × 10^-5). However, R² = 0.0080 shows that calcium explains less than 1% of the variance in bone density. BMD is determined by various factors, and dietary calcium intake is just one detail. The results cannot be interpreted causally because this is cross-sectional observational data. Individuals with higher calcium intake could differ in other ways that affect BMD. For example, they could be more physically active, have better diet, be of different age or sex, or have different BMI. Reverse causation is also possible if those with diagnosed with low bone density were prescribed to increase their calcium intake. This would attenuate or reverse the association.

B. From ANOVA to Regression (5 points)

Homework 1 used one-way ANOVA to compare mean BMD across various ethnic groups. It asked if the group membership is associated with differences in the mean outcome. That approach is appropriate for categorical predictors but cannot describe the functional relationship between BMD and a continuous variable like calcium intake. This is due ethnicity being treated as a set of categories without a specific order. This would be different with a quantitative predictor. Simple linear regression models BMD as a continuous linear function of calcium. This estimates a slope that determines the rate of change in BMD/unit increase in intake. In addition, it makes the prediction at any calcium point between the observed range possible. ANOVA answers the question about the difference across groups. In contrast, regression answers by how much the outcome changes per unit increase in the predictor. For a continuous dose-response question regression is more informative. ANOVA would be used when the predictor is categorical or when we compare groups and not continuous associations.

C. R Programming Growth (4 points)

The most challenging part of this assignment from a programming perspective was the knitting process as it requires the .Rmd file and the .csv to be in the same working directory. I always seem to encounter issues with this particular process. Working through this required understanding how RStudio sets the working directory during knitting and how to troubleshoot it. After this assignment, I feel more confident using tidy() and glance() from the broom package to extract clean model output, and using predict() with both interval types.

End of Homework 2