EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv("bmd(in).csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )
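
A quick way to confirm that a recode like this maps codes to the intended labels is to try it on a small vector first. The sketch below uses the first few RIAGENDR codes shown in the glimpse() output, plus one hypothetical out-of-range code (9) that is not in the data and is included only to show that unlisted codes become NA.

```r
# First five RIAGENDR codes from the glimpse() output, plus a
# hypothetical code 9 (an assumption, not a value from the dataset)
codes <- c(2, 2, 2, 1, 1, 9)

sex <- factor(codes,
              levels = c(1, 2),
              labels = c("Male", "Female"))

# Codes not listed in `levels` are silently converted to NA
table(sex, useNA = "ifany")
```

Running the same table() call on the recoded columns in bmd would show whether any unexpected codes were turned into NA.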

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
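
As a rough check on the exclusions, the counts reported above can be reconciled by inclusion-exclusion. The sketch below only re-uses the four numbers printed in the output; the variable names are illustrative.

```r
# Counts copied from the output above
n_total    <- 2898
n_miss_bmd <- 612
n_miss_cal <- 293
n_analytic <- 2129

# Participants dropped for missing DXXOFBMD or calcium
n_excluded <- n_total - n_analytic

# Participants missing both variables (inclusion-exclusion)
n_both <- n_miss_bmd + n_miss_cal - n_excluded

c(excluded = n_excluded, missing_both = n_both)
```

This gives 769 excluded participants, 136 of whom were missing both DXXOFBMD and calcium, which is why the final N is larger than 2898 - 612 - 293.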

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Relationship Between Dietary Calcium Intake and Bone Mineral Density",
    x = "Dietary Calcium Intake (mg/day)",
    y = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

The scatterplot shows a weak positive association between dietary calcium intake and total femur bone mineral density. As calcium intake increases, BMD appears to increase slightly on average. However, the points are widely scattered around the regression line, indicating that the relationship is relatively weak. There are a few observations with very high calcium intake that may be considered potential outliers, but there is no strong evidence of a nonlinear pattern. Overall, the plot suggests that a simple linear model is reasonable but that calcium alone does not strongly predict BMD.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept estimate is 0.8992 g/cm², which represents the predicted total femur bone mineral density for an individual with a dietary calcium intake of 0 mg/day. While this interpretation is mathematically correct, it is not particularly meaningful in practice because it is unrealistic for individuals to consume zero calcium. Therefore, the intercept mainly serves as the starting point of the regression equation rather than representing a biologically meaningful scenario.

B. Slope (β₁):

The slope estimate is 0.00003079 g/cm² per mg/day of calcium intake. This means that for every 1 mg/day higher dietary calcium intake, the model predicts total femur bone mineral density that is 0.00003079 g/cm² higher on average. This indicates a positive association between calcium intake and BMD. Although the slope appears small on a per-milligram scale, larger differences in calcium intake correspond to more noticeable differences in predicted BMD. For example, a difference of 500 mg/day in calcium intake corresponds to an estimated difference of approximately 0.015 g/cm² in BMD.
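
The rescaling can be sketched directly from the reported slope. The increments of 100, 500, and 1,000 mg/day are illustrative choices; only the slope comes from the model output.

```r
# Slope copied from the summary(model) output (g/cm² per mg/day)
slope_per_mg <- 3.079e-05

# Illustrative differences in calcium intake (mg/day)
increments <- c(100, 500, 1000)

data.frame(
  increase_mg_per_day  = increments,
  predicted_change_bmd = slope_per_mg * increments
)
```

Because the model is linear, the predicted difference in BMD scales in direct proportion to the difference in calcium intake.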

Step 3 — Statistical Inference (15 points)

# 95% confidence intervals for the intercept and slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0 There is no linear association between dietary calcium intake and total femur bone mineral density.
H₁: β₁ ≠ 0 There is a linear association between dietary calcium intake and total femur bone mineral density.

Report the test results:

The hypothesis test for the slope produced a t-statistic of 4.131 with 2127 degrees of freedom and a p-value of 3.75 × 10⁻⁵. Because the p-value is much smaller than 0.05, we reject the null hypothesis. This indicates that dietary calcium intake is statistically significantly associated with total femur bone mineral density in this sample.
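
The test statistic and p-value can be reproduced by hand from the coefficient table. This is a minimal sketch that uses the rounded estimate and standard error from the printed output, so the results match only to rounding.

```r
# Values copied from the summary(model) output
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# t statistic for H0: slope = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# With a single predictor, the overall F statistic equals t squared
f_stat <- t_stat^2

c(t = t_stat, p = p_value, F = f_stat)
```

The t statistic comes out at about 4.13, and its square is close to the F statistic of 17.07 reported in the model summary, since both test the same hypothesis in a simple linear regression.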

Interpret the 95% confidence interval for β₁:

The 95% confidence interval for the slope ranges from 0.00001617 to 0.00004541 g/cm² per mg/day of calcium intake. Because this interval does not include zero, it agrees with the hypothesis test and provides additional evidence that the association between calcium intake and bone mineral density is statistically significant at the 0.05 level. We can be 95% confident that the true average difference in BMD associated with a 1 mg/day higher calcium intake lies within this range.
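
The interval returned by confint() follows the usual form, estimate ± t critical value × standard error. The sketch below rebuilds it from the rounded values in the coefficient table, so small rounding differences from the confint() output are expected.

```r
# Values copied from the summary(model) output
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Confidence interval for the slope (per 1 mg/day)
ci_slope <- estimate + c(-1, 1) * t_crit * std_error
ci_slope

# Same interval expressed per 500 mg/day, which is easier to read
ci_slope * 500
```

On the 500 mg/day scale the interval is roughly 0.008 to 0.023 g/cm².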

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

The model’s R² value is 0.00796, meaning that dietary calcium intake explains approximately 0.8% of the variation in total femur bone mineral density in the dataset. This is a very small proportion, indicating that calcium intake alone is not a strong predictor of BMD. Many other factors—such as age, sex, body mass index, physical activity, smoking, and other dietary factors—likely contribute much more to explaining variation in bone density.
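
In a simple linear regression, R² can be recovered from the F statistic, and its square root is the absolute value of the Pearson correlation. The sketch below uses the rounded F statistic and sample size from the output, so agreement with the printed R² is approximate.

```r
# Values copied from the summary(model) output
f_stat   <- 17.07
df_resid <- 2127
n        <- 2129

# R-squared implied by the F statistic (one predictor)
r_squared <- f_stat / (f_stat + df_resid)

# Pearson correlation; positive because the slope is positive
r <- sqrt(r_squared)

# Adjusted R-squared for a model with one predictor
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

c(r_squared = r_squared, r = r, adj_r_squared = adj_r_squared)
```

The implied correlation is about 0.089, which matches the weak positive pattern seen in the scatterplot.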

Residual Standard Error (RSE):

The residual standard error is 0.1582 g/cm² on 2,127 degrees of freedom. It estimates the standard deviation of the residuals, that is, the typical distance between the observed bone mineral density values and the values predicted by the regression model. In practical terms, predictions of BMD based on calcium intake alone miss the observed values by roughly 0.158 g/cm².
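
To put the RSE in context, it can be compared with the change in BMD the model predicts for a sizeable difference in calcium intake. This sketch uses the rounded values from the output; the 500 mg/day difference is the same illustrative value used in the slope interpretation.

```r
# Values copied from the summary(model) output
rse      <- 0.1582
slope    <- 3.079e-05
df_resid <- 2127

# RSE = sqrt(RSS / df), so the implied residual sum of squares is
rss <- rse^2 * df_resid

# Predicted difference in BMD for a 500 mg/day difference in calcium
change_500 <- slope * 500

# How many times larger the typical residual is than that difference
rse / change_500
```

The typical residual is roughly ten times larger than the predicted difference for 500 mg/day, which is consistent with the very low R².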

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

The regression model predicts that an individual with a dietary calcium intake of 1,000 mg/day would have an expected total femur bone mineral density of approximately 0.930 g/cm². The 95% confidence interval for the mean BMD is 0.923 to 0.937 g/cm², which represents the plausible range for the average BMD among individuals consuming 1,000 mg/day of calcium. In contrast, the 95% prediction interval is 0.620 to 1.240 g/cm², which represents the range where the BMD of a single individual with this calcium intake level is likely to fall. The prediction interval is wider than the confidence interval because it accounts for both uncertainty in the estimated mean and natural variability between individuals. A value of 1,000 mg/day is a meaningful level to predict at because it is a realistic calcium intake level and falls within the range of values observed in the dataset.
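
The link between the two intervals can be sketched by rebuilding the prediction interval from the confidence interval and the residual standard error. The inputs are the rounded values printed above, so the result should agree with the predict() output only to about three decimal places.

```r
# Values copied from the predict() and summary(model) output
fit      <- 0.9299936
ci_lwr   <- 0.9229112
ci_upr   <- 0.937076
rse      <- 0.1582
df_resid <- 2127

t_crit <- qt(0.975, df = df_resid)

# Standard error of the estimated mean, backed out of the CI width
se_mean <- (ci_upr - ci_lwr) / (2 * t_crit)

# Prediction SE adds individual-level variability (RSE) to se_mean
se_pred <- sqrt(rse^2 + se_mean^2)

pi_manual <- fit + c(-1, 1) * t_crit * se_pred
pi_manual
```

Almost all of the prediction interval's width comes from the RSE term, because the standard error of the mean is very small with more than 2,000 observations.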

Part 4: Reflection (15 points)

The simple linear regression analysis suggests that dietary calcium intake is positively associated with total femur bone mineral density in this sample. Individuals with higher calcium intake tend to have slightly higher BMD values on average. However, the magnitude of the association is relatively small, and the model explains less than 1% of the variation in bone mineral density. This indicates that although calcium intake is associated with BMD, it accounts for very little of the difference in BMD between individuals. Many additional variables, including age, sex, body mass index, hormonal status, physical activity, smoking behavior, and overall dietary patterns, likely contribute to differences in bone density. Because this analysis is based on cross-sectional survey data, it cannot establish a causal relationship between calcium intake and bone mineral density. The association observed may be influenced by confounding variables such as physical activity levels, supplement use, or general health behaviors.

In Homework 1, one-way ANOVA was used to compare mean bone mineral density across different racial and ethnic groups. ANOVA is appropriate when comparing average outcomes across categorical groups. In contrast, simple linear regression models the relationship between a continuous predictor and a continuous outcome. Regression provides additional information, including the direction and magnitude of the relationship between the variables, and it allows for prediction at specific values of the predictor. ANOVA is useful when testing differences between groups, whereas regression is more appropriate when examining continuous relationships.
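
The two methods are also closely related: a one-way ANOVA gives the same F test as a linear model with a categorical predictor. The sketch below illustrates this with R's built-in PlantGrowth dataset, which is an assumption made only to keep the example self-contained; it is not the homework data.

```r
# Built-in example data: plant weight by treatment group
data(PlantGrowth)

# One-way ANOVA, as in Homework 1
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# The same comparison fitted as a linear model with a factor predictor
fit_lm <- lm(weight ~ group, data = PlantGrowth)

# Extract the overall F statistic from each fit
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
f_lm  <- anova(fit_lm)[["F value"]][1]

all.equal(f_aov, f_lm)
```

The difference between the two approaches is therefore mainly the type of predictor (categorical groups versus a continuous variable) rather than the underlying model.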

From a programming perspective, one of the most challenging aspects of this assignment was learning how to interpret the regression output produced by the lm() function. Understanding the meaning of the coefficients, p-values, and model fit statistics required careful review of the lecture materials. Working through the assignment helped me become more comfortable fitting regression models in R and using functions such as confint() and predict() to generate confidence and prediction intervals.

End of Homework 2