EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('~/Desktop/EPI553/data/bmd(in).csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Relationship Between Dietary Calcium Intake and Total Femur Bone Mineral Density",
    x       = "Calcium Intake (mg/day)",
    y       = "Bone Mineral Density (g/cm^2)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences): [Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

The scatterplot reveals a visible weak positive linear trend between calcium intake and BMD. The points are scattered along the regression line with a higher density focused around lower calcium intake levels. There are a few notable outliers near higher calcium intake levels.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The intercept is 0.8992 g/cm^2 which represents the expected BMD when calcium intake is zero. In this context, this isn’t the most meaningful quanitity since most individuals don’t have a calcium intake of zero.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

The slope is 3.079e-05. This means for every 1 unit increase in calcium intake, the predicted BMD increases by 3.079e-05 (0.00003079) g/cm^2 on average. There is a positive relationship. The effect is small.

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: 𝛽1=0 (no linear relationship between calcium intake and BMD)
H₁: 𝛽1≠0 (There is a linear relationship between calcium intake and BMD)

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

t-statistic= 4.1310 df= 2127 p-value= 3.751e-05 (0.00003751) Since the p-value is less than 0.05, I reject the null hypothesis. This means there is positive statistically significant association between calcium intake and BMD. Therefore there is a positive linear relationship between dietary calcium intake and BMD.

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

The 95% CI for the slope of calcium intake is [0.00001617334, 0.00004540649]. Therefore, the true average increase for BMD for each 1 unit increase of calcium falls between 0.000016 and 0.000045.

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

R^2= 0.007959 0.8% of the variance in BMD is explained by dietary calcium intake. Based on the R^2 value the model does not fit the data very well since it is small. This suggests that there are other predictors that influence the variance of BMD more than calcium intake, including age, BMI, physical activity, genetics.

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

RSE= 0.1582 g/cm^2. This tells me that the predicted BMD values from the model differ from the observed BMD values by 0.1582 g/cm^2 on average.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD at a calcium intake of 1000 mg/day is 0.9299936 g/cm^2
What does the 95% confidence interval represent? The 95% CI [0.9229, 0.9371] represents the true possible values for the true mean BMD for those who have a calcium intake of 1000 mg/day.
What is the 95% prediction interval, and why is it wider than the CI? The PI is [0.6196, 0.1240], it is wider than CI because it accounts for greater variability of individual levels and uncertainity.
Is 1,000 mg/day a meaningful value to predict at, given the data?] 1,000 mg/day of calcium is a meaningful value to predict at given the data. —

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

This regression model shows a statistically significant positive relationship between calcium intake and BMD, indicating that as calcium intake increases so does BMD. Since R^2 is low, we can assume that little variation in BMD is explained by calcium intake. This result is not surprising because while greater calcium intake can improve bone mineral density it is also affected by other factors such as physical activity and genetics. A key limitation of interpreting SLR from a cross-sectional survey is causality can not be established. Because cross-sectional data is collected at one point in time, it cannot be assumed that calcium intake causes higher BMD in comparsion to confounders. Other confounders that might explain the observed association include age, sex, physical activity, diet.

The one way ANOVA used in homework 1 to compare BMD across ethnic groups measures if one categorical group mean differs from the other groups. The SLR examines the linear relationship between the mean of a continuous outcome and a single predictor. The single linear regression tells us the magnitude and direction of the relationship. ANOVA only tests for differences and cannot test for a linear trend which shows the direction and magnitude. ANOVA would be preferred when the predictor is categorical and group means want to be compared. SLR is preferred when the predictor is continuous.

This assignment was easier than the past few since the code was included in the skeleton of this assignment. The hardest part was reading the SLR model output and making the correct interpretation off the data. I worked through it by reading through the SLR lecture alongside my coding to ensure I was looking at the right data in the model. After completing this homework I feel more confident in making scatter plots since this reinforced the practice of doing them and doing an SLR model.

End of Homework 2