EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv("bmd(in).csv")

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129
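
The counts above imply that some participants are missing both variables: 2898 - 2129 = 769 rows were dropped, whereas 612 + 293 = 905, so 136 rows must be missing both DXXOFBMD and calcium. A minimal sketch of that bookkeeping, using a small made-up data frame (`toy_df` is hypothetical and only stands in for `bmd`):

```r
# Toy data (hypothetical) standing in for the bmd data frame
toy_df <- data.frame(
  DXXOFBMD = c(1.05, NA, 0.88, NA, 0.95),
  calcium  = c(500, 470, NA, NA, 450)
)

missing_bmd     <- sum(is.na(toy_df$DXXOFBMD))
missing_calcium <- sum(is.na(toy_df$calcium))
missing_both    <- sum(is.na(toy_df$DXXOFBMD) & is.na(toy_df$calcium))
missing_either  <- sum(is.na(toy_df$DXXOFBMD) | is.na(toy_df$calcium))

# Inclusion-exclusion: rows dropped = missing A + missing B - missing both
stopifnot(missing_either == missing_bmd + missing_calcium - missing_both)

# Rows that survive a filter on both variables
analytic_n <- nrow(toy_df) - missing_either
```

Running the same four counts on `bmd` would be one way to confirm that the analytic N of 2129 is consistent with the reported missingness.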

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title   = "Linear Association Between Dietary Calcium Intake and Total Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

Based on the scatterplot, there is a positive association between dietary calcium intake and total femur bone mineral density, with a visible linear trend. The association appears weak, with wide scatter around the fitted line. There are notable outliers but no obvious non-linearities.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The intercept (β₀) = 8.992e-01 (about 0.899 g/cm²) is the predicted total femur bone mineral density when dietary calcium intake is 0 mg/day. This is not a meaningful quantity in this context because an intake of 0 mg/day of calcium is unrealistic: most people consume at least some calcium through food. Therefore, the intercept does not have a meaningful real-world interpretation.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

For each 1 mg/day increase in dietary calcium intake, the expected BMD increases by 3.079e-05 g/cm² on average. The direction of the association is positive, meaning higher calcium intake is associated with higher bone mineral density. However, the effect size is very small for a 1 mg/day increase.
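
Because a 1 mg/day change is tiny relative to typical intakes, the slope is easier to read after rescaling. A short sketch, with the slope copied by hand from the summary() output above (the variable names are illustrative):

```r
# Slope copied from the summary(model) output: g/cm² per 1 mg/day
slope_per_mg <- 3.079e-05

# Same slope expressed per 100 and per 1,000 mg/day
slope_per_100mg  <- slope_per_mg * 100    # about 0.0031 g/cm²
slope_per_1000mg <- slope_per_mg * 1000   # about 0.031 g/cm²

c(per_100mg = slope_per_100mg, per_1000mg = slope_per_1000mg)
```

Even a 1,000 mg/day difference in intake corresponds to only about 0.03 g/cm² of BMD, which is small next to the residual standard error of 0.1582 g/cm².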

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: β₁ = 0; there is no linear association between dietary calcium intake (mg/day) and bone mineral density (g/cm²).
H₁: β₁ ≠ 0; there is a linear association between dietary calcium intake (mg/day) and bone mineral density (g/cm²).

Report the test results:

[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?]

t-statistic = 4.131, degrees of freedom = 2127, p-value = 3.75e-05

Because the p-value is far below 0.05, we reject H₀. There is statistically significant evidence of an association between dietary calcium intake and bone mineral density (BMD). The relationship is positive, meaning higher calcium intake is associated with higher BMD.
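
The reported test can be reproduced from the coefficient table. This sketch assumes the rounded estimate and standard error printed by summary(), so the results match the output only approximately:

```r
# Values copied from the calcium row of the coefficient table
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127   # n - 2 = 2129 - 2

# t statistic for H0: slope = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```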

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language — what range of values is plausible for the true slope?]

We are 95% confident that the true change in BMD (g/cm²) associated with a 1 mg/day increase in calcium intake lies between 1.617334e-05 and 4.540649e-05 g/cm². Because the entire interval lies above 0, it supports a positive association between calcium intake and BMD.
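
The interval from confint() follows the usual form, estimate ± critical t value × standard error. A sketch using the rounded values from the summary() output, so the endpoints agree with confint() only to about three significant figures:

```r
# Values copied from the calcium row of the coefficient table
estimate  <- 3.079e-05
std_error <- 7.453e-06
df_resid  <- 2127

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper limits for the slope
ci_slope <- estimate + c(-1, 1) * t_crit * std_error
ci_slope
```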

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?]

R² = 0.007959. About 0.8% of the variation in bone mineral density (BMD) is explained by dietary calcium intake in this model. The model therefore does not fit the data well, because calcium intake alone explains very little of the variability in BMD. While the association is statistically significant, calcium intake by itself is not a strong predictor of BMD. This suggests that other factors likely play a larger role in the variation of BMD.
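
In simple linear regression, R² is the squared correlation between predictor and outcome, and the overall F statistic equals both R² / (1 - R²) × residual df and the squared t statistic for the slope. A sketch checking this against the reported output (inputs are the rounded printed values, so agreement is approximate):

```r
# Values copied from the summary(model) output
r_squared <- 0.007959
df_resid  <- 2127
t_stat    <- 4.131

# Magnitude of the correlation between calcium and BMD (about 0.089)
r_abs <- sqrt(r_squared)

# Two equivalent routes to the F statistic in simple linear regression
f_from_r2 <- r_squared / (1 - r_squared) * df_resid
f_from_t  <- t_stat^2

c(r = r_abs, F_from_R2 = f_from_r2, F_from_t = f_from_t)
```

Both routes land near the reported F of 17.07, which shows how a correlation of roughly 0.09 can still be statistically significant with more than 2,000 observations.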

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.]

The RSE = 0.1582 g/cm². This represents the typical difference between the observed bone mineral density (BMD) values and the values predicted by the model, meaning that the model's predictions of BMD are off by about 0.1582 g/cm² on average. This indicates there is still substantial unexplained variability in BMD that is not captured by calcium intake alone.
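
The RSE is the square root of the residual sum of squares divided by the residual degrees of freedom (n - 2 for simple linear regression). The sketch below illustrates the formula on R's built-in `cars` data, which is only a stand-in so the example runs without the homework file; `fit` is an illustrative name:

```r
# Stand-in model on a built-in dataset (not the homework data)
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares and residual degrees of freedom (n - 2)
rss      <- sum(residuals(fit)^2)
df_resid <- df.residual(fit)

# RSE computed by hand versus the value stored in the summary
rse_by_hand  <- sqrt(rss / df_resid)
rse_reported <- summary(fit)$sigma

all.equal(rse_by_hand, rse_reported)
```

Applying the same steps to `model` should reproduce the 0.1582 shown in the summary above.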

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted BMD is 0.9299936 g/cm² (about 0.930 g/cm²).
What does the 95% confidence interval represent? We are 95% confident that the true mean BMD among individuals consuming 1,000 mg/day of calcium lies between 0.9229112 and 0.937076 g/cm².
What is the 95% prediction interval, and why is it wider than the CI? The 95% prediction interval of 0.6195964 to 1.240391 g/cm² is wider than the CI because it accounts for both the uncertainty in the estimated mean BMD and the individual variability around that mean.
Is 1,000 mg/day a meaningful value to predict at, given the data?] 1,000 mg/day is a meaningful value to predict at because it lies within the range of the observed data and is consistent with the recommended daily calcium intake.
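
The fitted value and the extra width of the prediction interval can be approximated by hand. This sketch assumes the rounded coefficients, RSE, and confidence limits printed above, so it matches predict() only to about three decimal places:

```r
# Values copied from the outputs above
b0        <- 8.992e-01
b1        <- 3.079e-05
sigma_hat <- 0.1582      # residual standard error
df_resid  <- 2127

# Point prediction at calcium = 1000 mg/day
fit_1000 <- b0 + b1 * 1000

# Back out the standard error of the mean response from the reported CI
t_crit        <- qt(0.975, df = df_resid)
ci_half_width <- (0.937076 - 0.9229112) / 2
se_mean       <- ci_half_width / t_crit

# A new individual adds residual variance on top of the mean's uncertainty
se_pred       <- sqrt(se_mean^2 + sigma_hat^2)
pi_half_width <- t_crit * se_pred

c(fit = fit_1000, lwr = fit_1000 - pi_half_width, upr = fit_1000 + pi_half_width)
```

The residual variance term dominates here, which is why the prediction interval is roughly 40 times wider than the confidence interval.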

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

The regression model suggests that there is a positive association between dietary calcium intake (mg/day) and bone mineral density (g/cm²), meaning higher calcium intake is associated with higher BMD. The results are not surprising because calcium is an important nutrient for bone mineral density and overall bone health. One limitation of interpreting SLR models fitted to cross-sectional survey data is that they cannot establish a causal relationship between calcium intake and BMD: because the exposure and outcome are measured at the same time, we cannot determine the direction of the relationship. Also, confounding variables may influence both calcium intake and BMD. Potential confounders include age, sex, level of physical activity, vitamin D intake, body mass index (BMI), and overall diet quality, which could partially explain the observed association.

ANOVA is used to compare the mean outcome across the levels of a categorical variable and to test whether at least one group mean differs from the others. It answers questions about differences between group averages but does not model the association with a continuous predictor. Simple linear regression looks at the association between a single continuous predictor and a continuous outcome. For example, it would be appropriate to use ANOVA to examine whether mean BMI varies across physical activity categories (low, moderate, high), and it would be appropriate to use regression to test the relationship between calcium intake (predictor) and bone mineral density (BMD) (outcome), allowing us to estimate how much BMD changes for each unit increase in calcium intake. Regression gives us the direction and magnitude of the relationship and lets us make predictions, which ANOVA does not provide. We would prefer ANOVA when the predictor is categorical and we want to compare group means, while regression is preferred when the predictor is continuous or when we want to estimate and interpret the change in the outcome associated with changes in a predictor.

The most challenging part of this assignment from a programming perspective was reading in the dataset. I often have difficulty with the file path, even when I think I entered it correctly. Because of this, I usually end up manually selecting the file from my computer, which has worked every time so far. I also struggled with interpreting the residual standard error (RSE) and the prediction interval, but I worked through those challenges by reviewing explanations and examples from previous lab activities. After completing this homework, I feel most confident about running and interpreting a simple linear regression model in R.

End of Homework 2