EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('/Users/jingjunyang/Desktop/EPI553 Project/bmd(in).csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
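
The two missing-value counts sum to more than the number of rows dropped, which implies some participants are missing both variables. A minimal check using only the counts printed above (the variable names are illustrative and not part of the original analysis):

```r
# Counts copied from the output above
n_total           <- 2898
n_missing_bmd     <- 612
n_missing_calcium <- 293
n_analytic        <- 2129

# Rows removed by the filter (missing DXXOFBMD or calcium)
n_excluded <- n_total - n_analytic

# Inclusion-exclusion: rows missing both variables
n_missing_both <- n_missing_bmd + n_missing_calcium - n_excluded

cat("Excluded:", n_excluded, "\n")          # 769
cat("Missing both:", n_missing_both, "\n")  # 136
```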

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Association between dietary calcium intake and total femur bone mineral density",
    x       = "Dietary calcium intake (calcium, mg/day)",
    y       = "Total femur bone mineral density (DXXOFBMD, g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

There is a weak positive linear trend between dietary calcium intake and total femur bone mineral density. As dietary calcium intake increases, total femur bone mineral density also tends to increase. The relationship is weak, and there are no notable outliers.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The intercept is 0.8992 g/cm². Numerically, it is the predicted total femur BMD (DXXOFBMD) when calcium intake is 0 mg/day. A calcium intake of 0 mg/day is not realistic, so the intercept is not a practically meaningful value in this context.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

The slope for calcium is 0.00003079 g/cm² per mg/day. For every 1 mg/day increase in calcium intake, the model predicts an increase of about 0.00003079 g/cm² in BMD. The positive sign indicates that higher calcium intake is associated with slightly higher BMD. The effect is small: most people's daily calcium intake falls between about 400 and 1,200 mg/day, and across that 800 mg/day range the predicted difference in BMD is only about 0.025 g/cm².
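
Because a 1 mg/day change is tiny, it can help to rescale the slope to a more practical increment. The sketch below uses only the slope estimate printed by summary(model); the 100 mg/day and 800 mg/day increments are illustrative choices, not values required by the assignment.

```r
# Slope estimate copied from the summary(model) output (g/cm² per mg/day)
slope <- 3.079e-05

# Predicted change in BMD for larger, more interpretable increments
per_100_mg <- slope * 100   # per 100 mg/day
per_800_mg <- slope * 800   # across a 400 to 1,200 mg/day range

cat("Per 100 mg/day:", per_100_mg, "g/cm²\n")  # 0.003079
cat("Per 800 mg/day:", per_800_mg, "g/cm²\n")  # 0.024632
```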

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0. There is no linear association between calcium intake (mg/day) and bone mineral density (BMD). In other words, mean BMD does not change as calcium intake changes.
H₁: β₁ ≠ 0. There is a linear association between calcium intake (mg/day) and bone mineral density (BMD). In other words, mean BMD changes as calcium intake changes.

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

For the calcium slope, the t-statistic is 4.131 with 2,127 degrees of freedom, and the p-value is 3.75e-05. Because p < 0.05, we reject the null hypothesis that β₁ = 0. There is statistical evidence of a linear association between calcium intake and BMD: higher calcium intake is associated with slightly higher BMD, although the effect is small.
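
The reported test statistic can be reproduced by hand as the slope estimate divided by its standard error. This is a minimal sketch using the rounded values printed by summary(model), so the results agree with the output only up to rounding.

```r
# Values copied from the summary(model) coefficient table
estimate <- 3.079e-05   # slope for calcium
se       <- 7.453e-06   # standard error of the slope
df       <- 2127        # residual degrees of freedom

# t-statistic for H0: beta1 = 0
t_stat <- estimate / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

cat("t =", round(t_stat, 3), "\n")
cat("p =", signif(p_value, 3), "\n")
```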

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language: what range of values is plausible for the true slope?]

The 95% confidence interval for β₁ is [0.00001617, 0.00004541] g/cm² per mg/day. We are 95% confident that the true slope lies within this range. In other words, for each additional 1 mg/day of calcium intake, mean BMD is plausibly higher by between 0.00001617 and 0.00004541 g/cm². The interval excludes 0, which is consistent with rejecting H₀.
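
The interval from confint(model) can be rebuilt as the estimate plus or minus a t critical value times the standard error. The sketch below assumes the rounded estimate and standard error printed by summary(model), so it matches confint() only to about four significant digits.

```r
# Values copied from the summary(model) coefficient table
estimate <- 3.079e-05
se       <- 7.453e-06
df       <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df)

# estimate +/- t_crit * se
ci_slope <- estimate + c(-1, 1) * t_crit * se

print(signif(ci_slope, 4))
```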

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

Dietary calcium intake explains about 0.8% of the variance in BMD (R² = 0.007959). The model fits the data poorly, which means calcium alone does not predict BMD well. More than 99% of the variance is left unexplained, which suggests that other factors, such as age, sex, and vitamin D intake, are likely more important in determining BMD.
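
In simple linear regression, R², the slope's t-statistic, and the overall F-statistic are tied together: F = t² and R² = t² / (t² + df). A short sketch using the rounded values from summary(model) illustrates this; small differences from the printed output are due to rounding.

```r
# Values copied from the summary(model) output
t_stat <- 4.131
df     <- 2127

# With a single predictor, the overall F-statistic equals t squared
f_stat <- t_stat^2

# R-squared recovered from the t-statistic and residual df
r_squared <- t_stat^2 / (t_stat^2 + df)

# Magnitude of the Pearson correlation between calcium and DXXOFBMD
r <- sqrt(r_squared)

cat("F =", round(f_stat, 2), "\n")
cat("R-squared =", round(r_squared, 6), "\n")
cat("|r| =", round(r, 4), "\n")
```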

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

RSE = 0.1582 g/cm². The model's predicted BMD values typically differ from the observed BMD values by about 0.1582 g/cm². Since BMD ranges from about 0.4 to 1.6 g/cm², this level of error is large. This suggests that dietary calcium alone is not a precise predictor of individual BMD.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description.]

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
The predicted BMD at calcium = 1,000 mg/day is 0.93 g/cm².

What does the 95% confidence interval represent?
The 95% confidence interval [0.9229, 0.9371] g/cm² is a range of plausible values for the mean BMD of all people consuming 1,000 mg/day of calcium.

What is the 95% prediction interval, and why is it wider than the CI?
The 95% prediction interval is [0.6196, 1.2404] g/cm². It is wider because it includes both the uncertainty in the estimated mean and the variation of individuals around that mean.

Is 1,000 mg/day a meaningful value to predict at, given the data?
Predicting at 1,000 mg/day is meaningful because the value falls within the range of calcium intake observed in the data.
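
The difference in width between the two intervals can be checked by hand. The sketch below uses only numbers printed earlier in the report (the fit, the confidence limits, and the RSE); it backs out the standard error of the fitted mean from the confidence interval and then adds the residual variance to get the prediction interval. Because the RSE is rounded, the result matches predict() only approximately.

```r
# Values copied from the predict() and summary(model) output
fit    <- 0.9299936
ci_lwr <- 0.9229112
ci_upr <- 0.937076
rse    <- 0.1582
df     <- 2127

t_crit <- qt(0.975, df = df)

# Standard error of the fitted mean, recovered from the CI half-width
se_fit <- (ci_upr - ci_lwr) / (2 * t_crit)

# A new individual adds residual variance on top of the mean's uncertainty
se_pred <- sqrt(rse^2 + se_fit^2)

# Approximate 95% prediction interval
pred_int <- fit + c(-1, 1) * t_crit * se_pred

cat("SE of mean:", signif(se_fit, 3), "\n")
cat("SE of prediction:", signif(se_pred, 3), "\n")
print(round(pred_int, 3))
```

Nearly all of the prediction interval's width comes from the residual standard error rather than from uncertainty in the fitted line, which is why collecting more data would narrow the confidence interval much more than the prediction interval.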

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

The regression model shows a positive linear relationship between dietary calcium intake and BMD, meaning that as calcium intake increases, BMD tends to increase slightly. While this association is statistically significant, the effect size is very small, which was surprising. A key limitation of interpreting simple linear regression from a cross-sectional survey as causal evidence is that confounding variables may explain the observed association. Factors such as sex, age, vitamin D intake, sun exposure, and overall dietary habits can all influence BMD, so the relationship between calcium and BMD cannot be interpreted as causal.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

Compared to one-way ANOVA, which I used in HW1 to compare BMD across ethnic groups, SLR addresses a different type of question. ANOVA tests whether the mean of a continuous outcome differs across two or more categorical groups, such as whether mean BMD differs by ethnicity. In contrast, regression models the relationship between a continuous predictor and a continuous outcome, allowing us to estimate the direction, magnitude, and significance of the association. Regression can also adjust for multiple predictors, which one-way ANOVA cannot do. Therefore, while ANOVA is useful for group comparisons, regression answers questions about linear relationships and is preferable when the predictor is continuous or when adjustment for confounders is needed.
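
The two methods are also closely related: a one-way ANOVA gives the same overall F-test as a linear regression on a categorical predictor. The sketch below illustrates this with R's built-in PlantGrowth data, which is not the homework dataset and was chosen only because it ships with R.

```r
# One-way ANOVA: does mean plant weight differ across treatment groups?
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# The same comparison expressed as a regression on a categorical predictor
fit_lm <- lm(weight ~ group, data = PlantGrowth)

# Overall F-statistic from each approach
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
f_lm  <- unname(summary(fit_lm)$fstatistic["value"])

# The two F-statistics agree
print(all.equal(f_aov, f_lm))
```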

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

The most challenging part of the assignment from a programming perspective was deciding which R functions and code structure were most appropriate for different tasks, such as predicting values and calculating confidence intervals. I worked through it by reviewing the class notes. I feel more confident in fitting linear regression models in R, interpreting model summaries, and generating predictions.

End of Homework 2