EPI 553 — Homework 2: Simple Linear Regression

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read_csv("/Users/morganwheat/Downloads/bmd.csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <dbl> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <dbl> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <dbl> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <dbl> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <dbl> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <dbl> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <dbl> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <dbl> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
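
The missing-value counts above do not simply add up to the number of excluded rows, because some participants are missing both variables. A minimal sketch of the inclusion-exclusion arithmetic, using only the counts printed above (the variable names are illustrative, not part of the assignment):

```r
# Counts copied from the output above
total_n         <- 2898
missing_bmd     <- 612
missing_calcium <- 293
analytic_n      <- 2129

# Rows dropped by the filter (missing DXXOFBMD or calcium)
excluded <- total_n - analytic_n

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
missing_both <- missing_bmd + missing_calcium - excluded

cat("Excluded rows:", excluded, "\n")            # 769
cat("Missing both variables:", missing_both, "\n")  # 136
```

With the real data, the same number should come from sum(is.na(bmd$DXXOFBMD) & is.na(bmd$calcium)).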

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Dietary Calcium Intake vs. Total Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

The scatterplot shows a weak, positive, roughly linear relationship between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²): the fitted line rises only slightly as calcium intake increases. The weakness of the association is indicated by the wide spread of the data points around the fitted line. There are also some notable outliers.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

The intercept estimate of 0.90 g/cm² (rounded) is the estimated mean total femoral bone mineral density when dietary calcium intake is 0 mg/day. It is not a meaningful quantity in this context, because most foods contain at least trace amounts of calcium, so an intake of exactly 0 mg/day is rare.

B. Slope (β₁):

The estimated slope of 0.00003 (rounded from 3.079e-05) means that each 1 mg/day increase in dietary calcium intake is associated with a 0.00003 g/cm² higher mean total femoral bone mineral density. This association is small. Typical daily calcium intake is around 1,000 mg (Hoy & Goldman, 2014), and if this estimate holds, a 1,000 mg/day difference in intake corresponds to only about a 0.03 g/cm² difference in mean total femoral bone mineral density.

Hoy, M. K., & Goldman, J. D. (2014). Calcium intake of the U.S. population: What we eat in America, NHANES 2009–2010 (Dietary Data Brief No. 13). United States Department of Agriculture, Food Surveys Research Group. https://www.ncbi.nlm.nih.gov/books/NBK589560/
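
Because the slope is expressed per 1 mg/day, it can be easier to read after rescaling. The sketch below only rescales the estimate printed by summary(model); the variable names are illustrative.

```r
# Slope estimate copied from the summary(model) output (g/cm² per mg/day)
slope_per_mg <- 3.079e-05

# Rescale to more interpretable increments of calcium intake
slope_per_100mg  <- slope_per_mg * 100    # difference in mean BMD per 100 mg/day
slope_per_1000mg <- slope_per_mg * 1000   # difference in mean BMD per 1,000 mg/day

cat("Per 100 mg/day:  ", slope_per_100mg, "g/cm²\n")
cat("Per 1,000 mg/day:", slope_per_1000mg, "g/cm²\n")
```

Refitting the model with the predictor rescaled, for example lm(DXXOFBMD ~ I(calcium / 1000), data = bmd_analytic), should give the same rescaled slope directly, with an unchanged t statistic and p-value.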

Step 3 — Statistical Inference (15 points)

# 95% confidence intervals for the intercept and the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0 (there is no linear association between dietary calcium intake (mg/day) and total femoral bone mineral density (g/cm²) in the population)
H₁: β₁ ≠ 0 (there is a linear association between dietary calcium intake (mg/day) and total femoral bone mineral density (g/cm²) in the population)

Report the test results:

According to the results from summary(), the test of the slope gives t = 4.131 on 2127 degrees of freedom, with a p-value of 3.75e-05. The equivalent overall F-test gives F = 17.07 on 1 and 2127 degrees of freedom, with the same p-value. I reject the null hypothesis because the p-value is far less than 0.05. This means there is evidence of a positive linear association between dietary calcium intake and BMD in the population.
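
As a check on the reported test, the t statistic and two-sided p-value can be rebuilt from the estimate and standard error in the coefficient table. This is a sketch using the rounded values printed by summary(model), so the results agree with the output only up to rounding.

```r
# Values copied from the coefficient table in summary(model)
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# In simple linear regression, the overall F statistic equals t squared
f_stat <- t_stat^2

cat("t =", round(t_stat, 3), " p =", signif(p_value, 3), " F =", round(f_stat, 2), "\n")
```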

Interpret the 95% confidence interval for β₁:

Since the 95% CI [1.617e-05, 4.541e-05] does not contain zero, it supports the conclusion that the association between total femoral bone mineral density and dietary calcium intake is statistically significant at the 0.05 level. In plain language, we are 95% confident that the true slope lies between 1.617e-05 and 4.541e-05 g/cm² per mg/day. If this study were repeated many times, about 95% of the intervals constructed this way would contain the true slope.
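
The interval reported by confint() follows the usual form estimate ± t critical value × standard error. A minimal sketch that rebuilds it from the rounded values in the coefficient table (variable names are illustrative; small differences from the confint() output are due to rounding):

```r
# Values copied from the coefficient table in summary(model)
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Confidence limits for the slope
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

cat("95% CI for the slope:", signif(ci_lower, 4), "to", signif(ci_upper, 4), "\n")
```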

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

According to the summary results, the model produced an R² value of 0.007959, indicating that approximately 0.8% of the variance in total femoral bone mineral density (BMD) is explained by dietary calcium intake. This small R² suggests that the model explains very little of the variability in BMD and therefore provides a poor fit to the data. It also suggests that other predictors not included in the model may explain additional variability in BMD and should be considered in future analyses.
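
In simple linear regression, R² is the square of the Pearson correlation between the predictor and the outcome, and it can also be recovered from the F statistic. The sketch below uses only values printed by summary(model). The sign of r is taken as positive because the estimated slope is positive.

```r
# Values copied from the summary(model) output
r_squared <- 0.007959
f_stat    <- 17.07
df_resid  <- 2127

# Pearson correlation implied by R² (positive because the slope is positive)
r <- sqrt(r_squared)

# R² recovered from the F statistic: F / (F + residual df) with one predictor
r_squared_from_f <- f_stat / (f_stat + df_resid)

cat("Implied correlation r:", round(r, 3), "\n")
cat("R² from F statistic:  ", round(r_squared_from_f, 4), "\n")
```

A correlation of roughly 0.09 is consistent with the wide scatter around the fitted line in Part 1.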

Residual Standard Error (RSE):

The residual standard error (RSE) is 0.1582 on 2127 degrees of freedom. In other words, observed values of total femoral bone mineral density typically deviate from the model's predicted values by about 0.1582 g/cm², which is the typical size of the model's prediction error for an individual.

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

When dietary calcium intake is 1,000 mg/day, the predicted BMD is 0.93 g/cm². The 95% CI [0.923, 0.937] is the range of plausible values for the population mean BMD among individuals with this intake. In comparison, the 95% PI [0.620, 1.240] is the range in which the BMD of a single new individual with this intake is expected to fall. The PI is always wider than the CI because it accounts for both the uncertainty in estimating E(Y) and the variability of individuals around that mean. 1,000 mg/day is a meaningful value to predict at, as it is around the average daily calcium intake of most adults (Hoy & Goldman, 2014).
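
The difference in width between the two intervals can be traced to the residual standard error. The sketch below rebuilds the fitted value and the prediction interval from the rounded output shown earlier; the standard error of the mean response is backed out from the reported CI rather than computed from the data, so agreement with predict() holds only up to rounding.

```r
# Values copied from the model output above
intercept <- 0.8992
slope     <- 3.079e-05
rse       <- 0.1582
df_resid  <- 2127
ci_lwr    <- 0.9229112
ci_upr    <- 0.937076

# Fitted mean BMD at calcium = 1000 mg/day
fit <- intercept + slope * 1000

# Standard error of the mean response, backed out from the CI half-width
t_crit  <- qt(0.975, df = df_resid)
se_mean <- ((ci_upr - ci_lwr) / 2) / t_crit

# A prediction adds individual-level variance (RSE²) to the variance of the mean
se_pred <- sqrt(rse^2 + se_mean^2)
pi_half <- t_crit * se_pred

cat("Fitted value:", round(fit, 4), "\n")
cat("95% PI:", round(fit - pi_half, 3), "to", round(fit + pi_half, 3), "\n")
```

Because se_mean is much smaller than the RSE here, the prediction interval is driven almost entirely by individual variability, which a larger sample would not reduce.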

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

This regression model indicates a positive association between BMD and dietary calcium intake. However, dietary calcium intake explains very little of the variance (0.8%) in BMD. While calcium intake is very important for bone health, these results were not entirely surprising, as it is widely known that BMD is affected by an array of factors. The key limitations of interpreting a simple linear regression from a cross-sectional survey are that you cannot make temporal or causal claims, since the data come from a single point in time, and that observations from the same household or geographic cluster may not be fully independent. Potential confounders that may explain the observed association include age, sex, and weight.

B. From ANOVA to Regression (5 points)

The simple linear regression model represents the mean of a continuous outcome as a linear function of a single predictor. In contrast, a one-way ANOVA compares the mean of a continuous outcome across the categories of a categorical predictor, typically three or more groups. While ANOVA focuses on testing whether the group means differ overall, simple linear regression estimates the slope of the relationship between a predictor and the outcome, allowing interpretation of how the mean outcome differs per one-unit difference in the predictor. In practice, the choice between these methods depends largely on the type of predictor variable being analyzed.
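
The two methods are closely related: a one-way ANOVA can be fitted as a linear regression with a categorical predictor. The sketch below is illustrative only. It uses R's built-in PlantGrowth dataset, which is an assumption and not part of this assignment, to show that aov() and lm() give the same overall F statistic.

```r
# Illustrative only: PlantGrowth is a built-in R dataset, not the homework data
fit_aov <- aov(weight ~ group, data = PlantGrowth)
fit_lm  <- lm(weight ~ group, data = PlantGrowth)

# Overall F statistic from the ANOVA table
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

# Overall F statistic from the regression summary
f_lm <- unname(summary(fit_lm)$fstatistic["value"])

# The two F statistics agree because both fit the same linear model
print(all.equal(f_aov, f_lm))
```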

C. R Programming Growth (4 points)

This assignment was very easy from a programming perspective, because the code was already complete aside from adding a relevant title and axis labels to the scatterplot in Part 1. While interpreting the code’s output, if I found myself forgetting the proper terminology, I would simply refer to our previous lectures. After completing this homework, I feel much more confident in my ability to interpret the key output of a simple linear regression.

End of Homework 2