EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 2/bmd.csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
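
As a quick consistency check on the counts above, the two missing-value totals (612 and 293) sum to more than the number of rows actually dropped, which implies that some participants are missing both variables. The sketch below works this out by inclusion-exclusion using only the numbers printed above; the variable names are illustrative and not part of the assignment template.

```r
# Counts copied from the console output above
n_total        <- 2898   # rows in bmd
n_missing_bmd  <- 612    # rows with missing DXXOFBMD
n_missing_calc <- 293    # rows with missing calcium
n_analytic     <- 2129   # rows in bmd_analytic

# Rows removed by the filter (missing at least one of the two variables)
n_excluded <- n_total - n_analytic

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
n_missing_both <- n_missing_bmd + n_missing_calc - n_excluded

cat("Excluded:", n_excluded, "\n")
cat("Missing both DXXOFBMD and calcium:", n_missing_both, "\n")
```

Under these counts, 769 rows are excluded and 136 of them are missing both DXXOFBMD and calcium.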

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Bone Mineral Density and Calcium Intake",
    y       = "Total femur bone mineral density (g/cm²)",
    x       = "Calcium intake (mg/day)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

The scatterplot reveals a weak positive linear trend: the fitted line slopes upward only slightly, and the points are widely scattered around it. A few observations have unusually high or unusually low bone mineral density. There are also some outliers in calcium intake, with a few individuals consuming several times the average amount. I do not observe any obvious non-linearity.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The estimated intercept is 0.8992 g/cm². It is the predicted total femur BMD for someone whose dietary calcium intake is 0 mg/day. It is not a meaningful quantity in this context, because an intake of exactly 0 mg/day is not realistic; the intercept mainly serves to anchor the regression line.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

The estimated slope is 0.00003079 g/cm² per mg/day: for every 1 mg/day increase in calcium intake, predicted BMD increases by 0.00003079 g/cm². The direction is positive. The effect is small in practical terms: calcium intake ranges from just above 0 mg/day to a few thousand mg/day, yet even a 1,000 mg/day difference corresponds to a predicted difference of only about 0.031 g/cm², while BMD itself ranges from roughly 0.4 to 1.6 g/cm².
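
To put the slope on a more readable scale, the following sketch re-expresses it per 1,000 mg/day and compares that change with the residual standard error. It uses only the rounded values printed by summary(model); the variable names are illustrative.

```r
# Values copied (rounded) from summary(model) above
slope_per_mg <- 3.079e-05   # g/cm² per 1 mg/day of calcium
rse          <- 0.1582      # residual standard error, g/cm²

# Predicted change in BMD for a 1,000 mg/day difference in intake
slope_per_1000mg <- slope_per_mg * 1000

# The same change expressed as a fraction of the residual standard error
ratio_to_rse <- slope_per_1000mg / rse

cat("Change per 1,000 mg/day:", slope_per_1000mg, "g/cm²\n")
cat("As a fraction of the RSE:", round(ratio_to_rse, 2), "\n")
```

A 1,000 mg/day difference corresponds to about 0.031 g/cm², or roughly one fifth of the residual standard error.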

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0; there is no linear relationship between calcium intake and total femur bone mineral density.
H₁: β₁ ≠ 0; there is a linear relationship between calcium intake and total femur bone mineral density.

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

t-statistic: 4.131; df: 2127; p-value: 0.0000375

Because p < 0.05, I reject H₀ and conclude that there is a statistically significant positive linear association between calcium intake and total femur bone mineral density.
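
The reported test can be reproduced by hand from the coefficient table: the t-statistic is the estimate divided by its standard error, and the two-sided p-value comes from the t distribution with the residual degrees of freedom. This is a minimal sketch using the rounded values printed above, so the results agree with summary(model) only up to rounding.

```r
# Values copied (rounded) from the calcium row of summary(model)
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# t-statistic for H0: beta1 = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

cat("t =", round(t_stat, 3), "\n")
cat("p =", signif(p_value, 3), "\n")
```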

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

We are 95% confident that the true population slope lies between 0.0000162 and 0.0000454 g/cm² per mg/day. More precisely, if we repeatedly drew samples of the same size from the population and computed a 95% confidence interval from each, about 95% of those intervals would contain the true slope. Because the interval excludes 0, it is consistent with rejecting H₀.
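
The interval reported by confint(model) follows the usual form: estimate ± critical t value × standard error. The sketch below rebuilds it from the rounded coefficient table, so it matches the output above only to about four significant digits.

```r
# Values copied (rounded) from the calcium row of summary(model)
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper confidence limits for the slope
ci_slope <- estimate + c(-1, 1) * t_crit * std_error

print(signif(ci_slope, 4))
```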

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

Multiple R-squared: 0.007959 (adjusted R-squared: 0.007493)

About 0.8% of the variance in BMD is explained by calcium intake. The model therefore fits the data poorly: calcium intake alone leaves more than 99% of the variability in BMD unexplained, which suggests that other predictors are likely more informative.
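
In simple linear regression, R², the overall F-statistic, and the slope's t-statistic are tied together: F = R² / (1 − R²) × (n − 2), and F equals t². The sketch below checks these relationships with the rounded values from summary(model); small discrepancies are rounding error.

```r
# Values copied (rounded) from summary(model)
r_squared <- 0.007959
df_resid  <- 2127      # n - 2
t_slope   <- 4.131

# F-statistic implied by R² (one predictor, so numerator df = 1)
f_from_r2 <- (r_squared / (1 - r_squared)) * df_resid

# With a single predictor, F is the square of the slope's t-statistic
f_from_t <- t_slope^2

# Magnitude of the Pearson correlation between calcium and BMD
r_abs <- sqrt(r_squared)

cat("F from R²:", round(f_from_r2, 2), "\n")
cat("F from t²:", round(f_from_t, 2), "\n")
cat("|r|:", round(r_abs, 3), "\n")
```

The implied correlation of about 0.089 is another way of seeing how weak the linear association is, even though it is statistically significant.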

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

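Residual standard error: 0.1582 g/cm² on 2127 degrees of freedom

The RSE is the square root of the residual sum of squares divided by the residual degrees of freedom, so it estimates the standard deviation of the residuals, that is, the typical distance between an observed BMD value and the fitted line. Here, predictions are typically off by about 0.16 g/cm², which is sizable given that observed BMD ranges from roughly 0.4 to 1.6 g/cm². An RSE that is high relative to the scale of the outcome indicates poor model fit, because the points deviate substantially from the line; a low RSE indicates that the points lie close to the regression line.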
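
The definition of the RSE can be checked numerically. The sketch below uses simulated data (hypothetical values chosen only to resemble the scale of this problem, not the NHANES data) and confirms that the square root of RSS / (n − 2) matches the value that R reports through sigma().

```r
# Simulated data for illustration only; not the bmd dataset
set.seed(553)
n <- 200
x <- runif(n, min = 0, max = 3000)              # hypothetical calcium intake
y <- 0.9 + 3e-05 * x + rnorm(n, sd = 0.16)      # hypothetical BMD

fit <- lm(y ~ x)

# Residual sum of squares and residual degrees of freedom (n - 2)
rss      <- sum(residuals(fit)^2)
df_resid <- df.residual(fit)

# RSE by hand, compared with R's own value
rse_by_hand <- sqrt(rss / df_resid)
rse_from_r  <- sigma(fit)

print(all.equal(rse_by_hand, rse_from_r))
```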

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.)
What does the 95% confidence interval represent?
What is the 95% prediction interval, and why is it wider than the CI?
Is 1,000 mg/day a meaningful value to predict at, given the data?]

The predicted BMD at a calcium intake of 1,000 mg/day is 0.930 g/cm². The 95% confidence interval is 0.923 to 0.937 g/cm²; it is a range of plausible values for the population mean BMD among people with a calcium intake of 1,000 mg/day. The 95% prediction interval is 0.620 to 1.240 g/cm²; it is the range in which the BMD of a single new individual with a calcium intake of 1,000 mg/day is expected to fall. It is wider than the CI because it accounts for the variability of individual BMD values around the mean in addition to the uncertainty in the estimated mean, whereas the CI reflects only the uncertainty in the mean. Given these data, 1,000 mg/day is a meaningful value to predict at, because it falls well within the observed range of calcium intake and does not require extrapolation.
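
The difference in width between the two intervals can be made concrete. The standard error for a new observation adds the residual variance to the variance of the estimated mean: sqrt(RSE² + SE_mean²). The sketch below recovers SE_mean from the printed confidence interval and rebuilds the prediction interval; because it starts from rounded output, it reproduces the interval above only to about three decimal places.

```r
# Values copied from the predict() output and summary(model) above
fit_1000 <- 0.9299936
ci_lwr   <- 0.9229112
ci_upr   <- 0.937076
rse      <- 0.1582
df_resid <- 2127

t_crit <- qt(0.975, df = df_resid)

# Standard error of the estimated mean, recovered from the CI half-width
se_mean <- (ci_upr - ci_lwr) / (2 * t_crit)

# Standard error for a single new observation adds the residual variance
se_pred <- sqrt(rse^2 + se_mean^2)

# Rebuilt 95% prediction interval
pred_int <- fit_1000 + c(-1, 1) * t_crit * se_pred

print(round(pred_int, 3))
```

Almost all of the prediction interval's width comes from the residual variability (0.1582 g/cm²) rather than from uncertainty in the mean, which is why the interval barely narrows even with a large sample.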

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

The regression model tells us that BMD and calcium intake have a statistically significant positive association, but that calcium intake explains less than 1% of the variance in BMD. The direction of the result is not particularly surprising, because calcium is well known to contribute to healthy bones. However, a cross-sectional survey provides data from only a single point in time, and a simple linear regression does not account for other variables, so we cannot infer from this analysis that higher calcium intake causes higher BMD. Confounders might explain the observed association; for example, higher calcium intake may be associated with a healthier lifestyle and more physical activity, and physical activity itself increases BMD.

ANOVA tells you whether the mean of a continuous outcome differs between discrete groups, while SLR tells you whether there is a linear relationship between a continuous predictor and a continuous outcome. ANOVA answers whether groups (e.g. different drugs, different income levels) differ significantly from one another on the outcome of interest, whereas SLR answers whether two continuous measures are linearly associated. Unlike ANOVA, SLR gives a slope that quantifies how much the outcome changes per unit of the predictor, so the model can be used to predict one variable from the other. ANOVA is preferred when the predictor is categorical, while SLR is preferred when both the predictor and the outcome are continuous.

The most challenging part of this assignment was learning how to create prediction and confidence intervals from a simple linear regression model. I worked through this by learning from the provided code. After this homework, I feel more confident using R to produce confidence and prediction intervals from an SLR model.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

End of Homework 2