EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('C:/Users/MY789914/OneDrive - University at Albany - SUNY/Desktop/Stat 553 (R)/Assignment 2 LRM/bmd(in).csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
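The two missing counts above sum to more than the number of rows the filter removed, so some participants must be missing both variables. The sketch below checks this with inclusion-exclusion, using only the counts printed above; the variable names are illustrative and not part of the assignment template.

```r
# Counts copied from the printed output above
n_total      <- 2898   # rows in bmd
n_analytic   <- 2129   # rows in bmd_analytic
miss_bmd     <- 612    # missing DXXOFBMD
miss_calcium <- 293    # missing calcium

# Rows dropped by the filter (missing DXXOFBMD or calcium)
n_excluded <- n_total - n_analytic

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
miss_both <- miss_bmd + miss_calcium - n_excluded

cat("Excluded rows:", n_excluded, "\n")
cat("Missing both DXXOFBMD and calcium:", miss_both, "\n")
```

On the real data the same number could be confirmed directly with sum(is.na(bmd$DXXOFBMD) & is.na(bmd$calcium)).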

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Association Between Dietary Calcium Intake and Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

The scatterplot shows a slight positive linear trend between dietary calcium intake and total femur bone mineral density. As calcium intake increases, bone mineral density tends to increase modestly. However, the points are widely dispersed around the regression line, suggesting that the relationship is relatively weak. A few observations with very high calcium intake appear somewhat distant from the main cluster and may represent potential outliers, but there is no clear evidence of strong non-linear patterns.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept is 0.899 g/cm², which is the predicted total femur bone mineral density when calcium intake is 0 mg/day. This value has little practical meaning because an intake of 0 mg/day is unrealistic, so the intercept mainly anchors the regression line rather than describing a realistic participant.

B. Slope (β₁):

The slope is 0.00003079 g/cm² per mg/day, meaning that each 1 mg/day increase in calcium intake is associated with an average increase in bone mineral density of about 0.00003 g/cm². This indicates a positive but very small association between calcium intake and bone mineral density.
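A 1 mg/day change is too small to be meaningful in practice, so it can help to express the same slope over a larger increment. This is a minimal sketch that only rescales the estimate reported by summary(model); the increments of 100 and 1,000 mg/day are my own choice for illustration.

```r
# Slope copied from summary(model), in g/cm^2 per mg/day
b1 <- 3.079e-05

# The same slope expressed per 100 mg/day and per 1,000 mg/day
slope_per_100  <- b1 * 100
slope_per_1000 <- b1 * 1000

cat("Per 100 mg/day: ", slope_per_100, "g/cm^2\n")
cat("Per 1000 mg/day:", slope_per_1000, "g/cm^2\n")
```

Rescaling changes only the units of the slope; the t-statistic, p-value, and R² stay the same.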

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0; There is no linear association between dietary calcium intake and femur bone mineral density.

H₁: β₁ ≠ 0; There is a linear association between dietary calcium intake and femur bone mineral density.

Report the test results:

t-statistic: 4.131

Degrees of freedom: 2127

p-value: 3.751e-05 (< 0.001)

Conclusion: Since the p-value is much smaller than 0.05, we reject H₀. This indicates that there is statistical evidence of a linear association between calcium intake and femur bone mineral density. Specifically, higher calcium intake is associated with slightly higher BMD.
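As a check on the test results, the t-statistic and p-value can be reproduced by hand from the slope estimate and its standard error. The sketch below uses the rounded values printed by summary(model), so small rounding differences from the R output are expected.

```r
# Values copied from summary(model)
b1    <- 3.079e-05   # slope estimate
se_b1 <- 7.453e-06   # standard error of the slope
df    <- 2127        # residual degrees of freedom (n - 2)

# t statistic for H0: beta1 = 0
t_stat <- b1 / se_b1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# In simple linear regression the overall F statistic equals t^2
f_stat <- t_stat^2

cat("t =", round(t_stat, 3),
    " p =", signif(p_value, 3),
    " F =", round(f_stat, 2), "\n")
```

The squared t-statistic matches the F-statistic of 17.07 in the summary, which is why the slope test and the overall F-test report the same p-value.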

Interpret the 95% confidence interval for β₁:

95% CI: (1.617e-05, 4.541e-05) g/cm² per mg/day

We are 95% confident that the true average change in femur bone mineral density associated with a 1 mg/day increase in calcium intake lies between 0.000016 and 0.000045 g/cm². Because the interval does not include 0, this supports the conclusion that calcium intake is positively associated with BMD.
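The interval reported by confint(model) is the estimate plus or minus a critical t value times the standard error. A minimal sketch of that calculation, using the rounded estimate and standard error from summary(model):

```r
# Values copied from summary(model)
b1    <- 3.079e-05   # slope estimate
se_b1 <- 7.453e-06   # standard error of the slope
df    <- 2127        # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df)

# Estimate +/- critical value * standard error
ci_slope <- b1 + c(-1, 1) * t_crit * se_b1

print(ci_slope)
```

The result agrees with the confint() output to the precision of the rounded inputs.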

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

The R² from the model is 0.00796, meaning that dietary calcium intake explains about 0.8% of the variance in femur bone mineral density. This very small proportion indicates that the model does not explain much of the variability in BMD. Therefore, while calcium intake is statistically associated with BMD, many other factors (such as age, sex, physical activity, vitamin D intake, and genetics) likely play a much larger role.
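In simple linear regression, R² is the square of the Pearson correlation between the predictor and the outcome, and it can also be recovered from the F-statistic. The sketch below uses only numbers printed by summary(model); the object names are illustrative.

```r
# Values copied from summary(model)
r_squared <- 0.007959
f_stat    <- 17.07
df_resid  <- 2127

# With one predictor, R^2 = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_resid)

# Magnitude of the Pearson correlation; the sign follows the slope (positive here)
r <- sqrt(r_squared)

cat("R^2 from F:", round(r2_from_f, 5), "\n")
cat("Correlation r:", round(r, 3), "\n")
```

A correlation of roughly 0.09 is consistent with the wide scatter around the fitted line in the Part 1 plot.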

Residual Standard Error (RSE):

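The residual standard error is 0.158 g/cm². This value is roughly the standard deviation of the residuals, so it represents the typical distance between an observed BMD value and the value predicted by the regression model. Compared with predicted BMD values of about 0.9 g/cm², this is a substantial amount of unexplained scatter.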

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:]

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)

The predicted total femur bone mineral density at 1,000 mg/day of calcium intake is 0.93 g/cm².
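The fitted value returned by predict() is the regression equation evaluated at calcium = 1000. A short check using the rounded coefficients from summary(model):

```r
# Coefficients copied from summary(model)
b0 <- 8.992e-01   # intercept, g/cm^2
b1 <- 3.079e-05   # slope, g/cm^2 per mg/day

# Predicted BMD at a calcium intake of 1,000 mg/day
calcium_value <- 1000
pred_bmd <- b0 + b1 * calcium_value

cat("Predicted BMD:", round(pred_bmd, 3), "g/cm^2\n")
```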

What does the 95% confidence interval represent?

The 95% confidence interval (0.923–0.937 g/cm²) represents the range in which the true mean BMD for individuals consuming 1,000 mg/day of calcium is expected to lie with 95% confidence.

What is the 95% prediction interval, and why is it wider than the CI?

The 95% prediction interval (0.620–1.240 g/cm²) represents the range where the BMD of a single new individual with 1,000 mg/day calcium intake is expected to fall. It is wider than the confidence interval because it includes both the uncertainty in estimating the mean and the person-to-person variation around that mean, which is captured by the residual standard error of 0.158 g/cm².
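The difference in width can be reproduced from the printed output. The sketch below backs out the standard error of the fitted mean from the confidence interval, then adds the residual variance to get the standard error for a new individual. It relies on rounded values, so agreement with predict() is approximate.

```r
# Values copied from the predict() and summary(model) output
fit    <- 0.9299936
ci_lwr <- 0.9229112
ci_upr <- 0.937076
rse    <- 0.1582     # residual standard error
df     <- 2127       # residual degrees of freedom

t_crit <- qt(0.975, df = df)

# Standard error of the fitted mean, recovered from the CI half-width
se_fit <- (ci_upr - ci_lwr) / (2 * t_crit)

# Standard error for a new observation: mean uncertainty plus residual variance
se_pred <- sqrt(rse^2 + se_fit^2)

# Approximate 95% prediction interval
pred_int <- fit + c(-1, 1) * t_crit * se_pred

cat("se_fit:", round(se_fit, 4), " se_pred:", round(se_pred, 4), "\n")
print(round(pred_int, 3))
```

Because se_fit is tiny compared with the residual standard error, almost all of the prediction interval's width comes from individual variation, and a larger sample would not narrow it much.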

Is 1,000 mg/day a meaningful value to predict at, given the data?

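Yes. 1,000 mg/day falls within the typical range of dietary calcium intake in adults and within the values observed in this dataset (the first rows printed by glimpse() already span roughly 450 to 1,250 mg/day), so the prediction is an interpolation rather than an extrapolation.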
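One way to confirm that a prediction is an interpolation is to compare the target value with the observed range of the predictor. The sketch below uses only the calcium values visible in the glimpse() output in Part 0, so it is a partial check on a handful of rows and not a summary of the full dataset.

```r
# First calcium values shown by glimpse(bmd) in Part 0 (mg/day)
calcium_head <- c(503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0)

# Observed range, ignoring missing values
obs_range <- range(calcium_head, na.rm = TRUE)

# Is the prediction target inside the observed range?
target <- 1000
within_range <- target >= obs_range[1] && target <= obs_range[2]

cat("Observed range:", obs_range[1], "to", obs_range[2], "mg/day\n")
cat("Target within range:", within_range, "\n")

# On the full analytic data the equivalent check would be:
# range(bmd_analytic$calcium)
```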

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

The simple linear regression model suggests that there is a positive association between dietary calcium intake and femur bone mineral density (BMD). Individuals with higher calcium intake tend to have slightly higher predicted BMD values. However, the magnitude of the association is very small, and the model explains less than 1% of the variance in BMD. This indicates that calcium intake alone is not a strong predictor of bone density. Because the data come from a cross-sectional survey, the results should not be interpreted as causal evidence. Cross-sectional studies measure exposure and outcome at the same time, so it is not possible to determine whether higher calcium intake leads to higher BMD or whether other factors explain the association. Several potential confounders could influence this relationship, including age, sex, physical activity, vitamin D intake, hormonal status, and genetic factors.

In Homework 1, one-way ANOVA was used to compare mean BMD across different ethnic groups, which is appropriate when the predictor variable is categorical. In contrast, simple linear regression examines the relationship between BMD and a continuous predictor such as calcium intake. Regression allows us to estimate the expected change in BMD for each unit increase in calcium intake and to generate predicted values. While ANOVA focuses on comparing group means, regression is more useful when studying continuous exposures or trends in the data.

From a programming perspective, one of the most challenging parts of this assignment was interpreting the regression output and extracting the relevant statistics from the model summary. Working through the code step by step helped me better understand how to obtain confidence intervals, prediction intervals, and model fit statistics in R. After completing this assignment, I feel more confident using ggplot2 for visualization and fitting simple linear regression models in R.

End of Homework 2