EPI 553 — Homework 2: Simple Linear Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(kableExtra)
library(broom)

# Import the dataset — update the path if needed
bmd <- read.csv('C:/Users/userp/OneDrive/Рабочий стол/HW1_HEPI553 Nursultan/bmd.csv')

# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
    RIDRETH1 = factor(RIDRETH1,
                      levels = 1:5,
                      labels = c("Mexican American", "Other Hispanic",
                                 "Non-Hispanic White", "Non-Hispanic Black", "Other")),

    # Recode RIAGENDR as a labeled factor
    RIAGENDR = factor(RIAGENDR,
                      levels = c(1, 2),
                      labels = c("Male", "Female")),

    # Recode smoker as a labeled factor
    smoker = factor(smoker,
                    levels = c(1, 2, 3),
                    labels = c("Current", "Past", "Never"))
  )

# Report missing values for the key variables
cat("Total N:", nrow(bmd), "\n")

## Total N: 2898

cat("Missing DXXOFBMD:", sum(is.na(bmd$DXXOFBMD)), "\n")

## Missing DXXOFBMD: 612

cat("Missing calcium:", sum(is.na(bmd$calcium)), "\n")

## Missing calcium: 293

# Create the analytic dataset (exclude missing DXXOFBMD or calcium)
bmd_analytic <- bmd %>%
  filter(!is.na(DXXOFBMD), !is.na(calcium))

cat("Final analytic N:", nrow(bmd_analytic), "\n")

## Final analytic N: 2129

Part 1: Exploratory Visualization (15 points)

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(bmd_analytic, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Association Between Dietary Calcium Intake and Total Femur Bone Mineral Density",
    x = "Dietary Calcium Intake (mg/day)",
    y = "Total Femur Bone Mineral Density (g/cm²)"
  ) +
  theme_minimal()

Written interpretation (3–5 sentences):

[Describe what the scatterplot reveals. Is there a visible linear trend? In which direction? Does the relationship appear strong or weak? Are there any notable outliers or non-linearities?]

The scatterplot shows a slight upward (positive) linear trend: as dietary calcium intake increases, total femur bone mineral density increases slightly. The relationship is very weak, however. The points are widely scattered around the regression line, and BMD varies substantially at nearly every level of calcium intake, so calcium intake alone explains very little of the variation in BMD. A few observations at very high calcium intakes (roughly 3,000 to 5,000 mg/day) could be considered potential outliers, but they do not appear to change the overall pattern. There is no obvious curvature, so a linear model is a reasonable description of the data even though the association is weak.

Part 2: Simple Linear Regression (40 points)

Step 1 — Fit the Model (5 points)

# Fit the simple linear regression model
model <- lm(DXXOFBMD ~ calcium, data = bmd_analytic)

# Display the full model summary
summary(model)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = bmd_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

Step 2 — Interpret the Coefficients (10 points)

A. Intercept (β₀):

[Interpret the intercept in 2–4 sentences. What does it represent numerically? Is it a meaningful quantity in this context?]

The intercept (β̂₀ = 0.899 g/cm²) is the predicted total femur bone mineral density when dietary calcium intake is 0 mg/day. Numerically, it is the estimated mean BMD for individuals who consume no calcium. This value is not meaningful in this context, because a daily calcium intake of 0 mg is unrealistic in a free-living population. The intercept is needed to define the regression line mathematically, but it has no practical or clinical interpretation in this study.

B. Slope (β₁):

[Interpret the slope in 2–4 sentences. For every 1-unit increase in calcium (mg/day), what is the estimated change in BMD (g/cm²)? State the direction. Is the effect large or small given typical calcium intake ranges?]

The slope (β̂₁ = 3.079 × 10⁻⁵ g/cm² per mg/day) is the estimated change in mean total femur bone mineral density for every 1 mg/day increase in dietary calcium intake. The coefficient is positive, so higher calcium intake is associated with slightly higher BMD. The magnitude is extremely small: a 100 mg/day increase in calcium intake corresponds to an estimated increase of only about 0.003 g/cm² in BMD. Given that typical daily calcium intake ranges from several hundred to over 1,000 mg/day, the association is statistically significant but practically minimal.
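
To make the size of the slope concrete, the per-mg estimate can be rescaled to larger increments of calcium. This is a minimal sketch that uses the coefficient as printed in the summary() output rather than the fitted object; the names slope_per_mg and increments are illustrative and not part of the original analysis.

```r
# Slope copied from the summary() output (g/cm^2 per 1 mg/day of calcium)
slope_per_mg <- 3.079e-05

# Illustrative increments in dietary calcium (mg/day); chosen for this sketch
increments <- c(100, 500, 1000)

# Estimated difference in mean BMD (g/cm^2) associated with each increment
bmd_change <- slope_per_mg * increments
data.frame(calcium_increase_mg = increments, bmd_change_g_cm2 = bmd_change)
```

Even a 1,000 mg/day difference in intake corresponds to roughly 0.03 g/cm², which is about one fifth of the residual standard error of 0.158 g/cm².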

Step 3 — Statistical Inference (15 points)

# 95% confidence interval for the slope
confint(model)

##                    2.5 %       97.5 %
## (Intercept) 8.851006e-01 9.133069e-01
## calcium     1.617334e-05 4.540649e-05

State your hypotheses:

H₀: [complete in notation and plain language] β₁ = 0. There is no linear association between dietary calcium intake (mg/day) and total femur bone mineral density (g/cm²); calcium intake does not predict BMD.
H₁: [complete in notation and plain language] β₁ ≠ 0. There is a linear association between dietary calcium intake and total femur bone mineral density; calcium intake is associated with BMD (either positively or negatively).

Report the test results:

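[Report the t-statistic, degrees of freedom, and p-value from the summary() output above. State your conclusion: do you reject H₀? What does this mean for the association between calcium and BMD?] The test of the slope for calcium gave t = 4.131 on 2,127 degrees of freedom, with p = 3.75 × 10⁻⁵ (p < 0.001). Because the p-value is below α = 0.05, we reject H₀ and conclude that there is a statistically significant positive linear association between dietary calcium intake and total femur bone mineral density.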
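
The reported test statistic can be checked by hand, since the t value is the slope estimate divided by its standard error. The sketch below uses the rounded values printed by summary(), so the results should agree with the output only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary() output
estimate  <- 3.079e-05   # slope for calcium
std_error <- 7.453e-06   # standard error of the slope
df_resid  <- 2127        # residual degrees of freedom (n - 2)

# t statistic for H0: beta1 = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```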

Interpret the 95% confidence interval for β₁:

[Interpret the CI in plain language: what range of values is plausible for the true slope?] The 95% confidence interval for the slope is approximately 0.000016 to 0.000045 g/cm² per 1 mg/day increase in calcium intake. In other words, we are 95% confident that each additional 1 mg/day of calcium is associated with a true mean increase in total femur BMD of between 0.000016 and 0.000045 g/cm². Because the entire interval lies above zero, it is consistent with the statistically significant positive association found by the t-test. Every value in the interval is extremely small, however, so the magnitude of the effect is minimal even at the upper bound.
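
The interval from confint() follows the usual form, estimate ± critical t value × standard error. A minimal sketch using the rounded values from the summary() output (so agreement with confint() is approximate) is shown here; rescaling to a 100 mg/day increment is an illustrative choice, not part of the assignment.

```r
# Values copied from the summary() output
estimate  <- 3.079e-05   # slope for calcium
std_error <- 7.453e-06   # standard error of the slope
df_resid  <- 2127        # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# 95% CI for the slope, per 1 mg/day
ci <- estimate + c(-1, 1) * t_crit * std_error
ci

# The same interval expressed per 100 mg/day of calcium
ci * 100
```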

Step 4 — Model Fit: R² and Residual Standard Error (10 points)

R² (coefficient of determination):

[What proportion of the variance in BMD is explained by dietary calcium intake? Based on this R², how well does the model fit the data? What does this suggest about the importance of other predictors?] Multiple R-squared = 0.007959, so dietary calcium intake explains about 0.8% of the variance in BMD. This very small R² indicates that the model fits the data poorly: although the association is statistically significant, calcium intake accounts for only a tiny fraction of the differences in BMD across individuals. More than 99% of the variance is left unexplained, which suggests that predictors not included in this model are needed to account for most of the variation in BMD.
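
In simple linear regression the F statistic, R², and the slope's t value are tied together: F = t² and R² = t² / (t² + residual df). The sketch below recovers the reported values from the rounded t value in the summary() output, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the summary() output
t_stat   <- 4.131   # t value for calcium
df_resid <- 2127    # residual degrees of freedom

# With a single predictor, F equals the squared t statistic
f_stat <- t_stat^2

# R^2 implied by the t statistic and the residual degrees of freedom
r_squared <- f_stat / (f_stat + df_resid)

# Absolute Pearson correlation between calcium and BMD
abs_r <- sqrt(r_squared)

c(F = f_stat, R2 = r_squared, abs_r = abs_r)
```

The implied correlation of roughly 0.09 is another way to see how weak the linear association is.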

Residual Standard Error (RSE):

[Report the RSE. Express it in the units of the outcome and explain what it tells you about the average prediction error of the model.] The residual standard error is 0.158 g/cm², meaning that observed total femur BMD values typically differ from the values predicted by the model by about 0.158 g/cm².

Part 3: Prediction (20 points)

# Create a new data frame with the target predictor value
new_data <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 1000
predict(model, newdata = new_data, interval = "confidence")

##         fit       lwr      upr
## 1 0.9299936 0.9229112 0.937076

# 95% prediction interval for a new individual at calcium = 1000
predict(model, newdata = new_data, interval = "prediction")

##         fit       lwr      upr
## 1 0.9299936 0.6195964 1.240391

Written interpretation (3–6 sentences):

[Answer all four questions from the assignment description:]

What is the predicted BMD at calcium = 1,000 mg/day? (Report with units.) The predicted total femur bone mineral density at a calcium intake of 1,000 mg/day is approximately 0.93 g/cm².
What does the 95% confidence interval represent? The 95% CI (0.9229 to 0.9371 g/cm²) is the range of plausible values for the mean BMD among all individuals consuming 1,000 mg/day of calcium.
What is the 95% prediction interval, and why is it wider than the CI? The 95% prediction interval (0.6196 to 1.2404 g/cm²) is the range in which the BMD of a single new individual with a calcium intake of 1,000 mg/day is expected to fall. It is wider than the CI because it reflects not only the uncertainty in the estimated mean but also the person-to-person variability around that mean (the residual standard error of 0.158 g/cm²).
Is 1,000 mg/day a meaningful value to predict at, given the data? Yes. 1,000 mg/day lies within the observed range of calcium intake, so the prediction does not require extrapolation, and it is in line with common dietary recommendations.
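
The gap between the two intervals can be reproduced from the printed output. A prediction interval uses a standard error that adds the residual variance to the variance of the estimated mean, se_pred = sqrt(se_fit² + σ̂²). The sketch below backs se_fit out of the confidence interval and uses the rounded residual standard error, so it matches predict() only approximately; the variable names are illustrative.

```r
# Values copied from the predict() and summary() output
fit      <- 0.9299936   # predicted BMD at calcium = 1000
ci_lwr   <- 0.9229112   # lower bound of the 95% CI for the mean
ci_upr   <- 0.937076    # upper bound of the 95% CI for the mean
sigma    <- 0.1582      # residual standard error (rounded)
df_resid <- 2127        # residual degrees of freedom

t_crit <- qt(0.975, df = df_resid)

# Standard error of the estimated mean, backed out of the CI half-width
se_fit <- (ci_upr - ci_lwr) / (2 * t_crit)

# A new individual adds residual variance on top of the mean's uncertainty
se_pred <- sqrt(se_fit^2 + sigma^2)

# Approximate 95% prediction interval
pred_int <- fit + c(-1, 1) * t_crit * se_pred
pred_int
```

Because sigma is far larger than se_fit, almost all of the prediction interval's width comes from individual variability rather than from uncertainty about the regression line.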

Part 4: Reflection (15 points)

Write 200–400 words in continuous prose (not bullet points) addressing all three areas below.

A. Statistical Insight (6 points)

[What does the regression model tell you about the calcium–BMD relationship? Were the results surprising? What are the key limitations of interpreting SLR from a cross-sectional survey as causal evidence? What confounders might explain the observed association?]

The simple linear regression model indicates a statistically significant positive association between dietary calcium intake and total femur bone mineral density: higher calcium intake is associated with slightly higher BMD. However, the magnitude of this association is extremely small (β̂₁ = 3.08 × 10⁻⁵ g/cm² per mg/day), and calcium intake explains less than 1% of the variability in BMD (R² ≈ 0.008). Thus, while statistically significant, the relationship is very weak and has limited practical or clinical significance. The positive direction is biologically plausible and therefore not surprising, as calcium plays an important role in bone health. The very small effect size is more surprising, given the common perception that calcium intake strongly influences bone density; the results suggest that calcium intake alone is not a major determinant of BMD in this population.

These results cannot be interpreted as causal evidence. Because the data come from a cross-sectional survey, exposure and outcome were measured at the same time, so temporality cannot be established and reverse causation is possible (e.g., individuals with low BMD may increase their calcium intake). Measurement error in dietary recall may bias the estimate, and residual confounding is likely. Potential confounders include age, sex, BMI, vitamin D levels, physical activity, smoking status, hormonal status (e.g., menopause), and chronic disease. These factors may influence both calcium intake and BMD and could explain part or all of the observed association.

B. From ANOVA to Regression (5 points)

[Homework 1 used one-way ANOVA to compare mean BMD across ethnic groups. Now you have used SLR to model BMD as a function of a continuous predictor. Compare these two approaches: what kinds of questions does each method answer? What does regression give you that ANOVA does not? When would you prefer one over the other?]

In Homework 1, one-way ANOVA was used to compare mean BMD across ethnic groups, a categorical predictor. ANOVA answers the question "Do mean BMD values differ across ethnic groups?" It tests whether at least one group mean differs from the others, but it does not quantify a directional relationship or an effect per unit increase. In contrast, simple linear regression (SLR) models BMD as a function of a continuous predictor (calcium intake) and answers the question "How much does BMD change for each one-unit increase in calcium intake?" SLR provides a slope that quantifies the change in BMD per mg/day, a confidence interval for that effect, a prediction equation, and the proportion of variance explained (R²). Compared with ANOVA, regression therefore adds a quantitative per-unit estimate with a direction (positive or negative), predictive capability (e.g., predicted BMD at 1,000 mg/day), and an easier extension to multivariable adjustment. ANOVA is preferable when the predictor is categorical with no natural ordering or numeric scale (e.g., ethnicity or treatment group) and the goal is to compare group means. Regression is preferable when the predictor is continuous, when the goal is to quantify the magnitude and direction of an association or to make predictions, or when the analysis will adjust for confounders through multiple regression.

C. R Programming Growth (4 points)

[What was the most challenging part of this assignment from a programming perspective? How did you work through it? What R skill do you feel more confident about after completing this homework?]

The most challenging part of this assignment from a programming perspective was making sure the analytic dataset was created correctly before running the regression model. A second challenge was extracting and interpreting the model output from summary(), confint(), and predict(). Reviewing the documentation and breaking the output into components (coefficients, R², RSE, intervals) helped clarify the interpretation. After completing this homework, I feel more confident in fitting simple linear regression models in R, generating predictions with confidence and prediction intervals, and translating statistical output into clear interpretations.

End of Homework 2