Assignment 2: Simple Linear Regression

Part 0: Data Preparation

Load Required Packages
Import bmd.csv

# Import the dataset — update the path if needed
bmd <- read.csv('bmd.csv', header = T)
# Quick check
glimpse(bmd)

## Rows: 2,898
## Columns: 14
## $ SEQN     <int> 93705, 93708, 93709, 93711, 93713, 93714, 93715, 93716, 93721…
## $ RIAGENDR <int> 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ RIDAGEYR <int> 66, 66, 75, 56, 67, 54, 71, 61, 60, 60, 64, 67, 70, 53, 57, 7…
## $ RIDRETH1 <int> 4, 5, 4, 5, 3, 4, 5, 5, 1, 3, 3, 1, 5, 4, 2, 3, 2, 4, 4, 3, 3…
## $ BMXBMI   <dbl> 31.7, 23.7, 38.9, 21.3, 23.5, 39.9, 22.5, 30.7, 35.9, 23.8, 2…
## $ smoker   <int> 2, 3, 1, 3, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 1, 1…
## $ totmet   <int> 240, 120, 720, 840, 360, NA, 6320, 2400, NA, NA, 1680, 240, 4…
## $ metcat   <int> 0, 0, 1, 1, 0, NA, 2, 2, NA, NA, 2, 0, 0, 0, 1, NA, 0, NA, 1,…
## $ DXXOFBMD <dbl> 1.058, 0.801, 0.880, 0.851, 0.778, 0.994, 0.952, 1.121, NA, 0…
## $ tbmdcat  <int> 0, 1, 0, 1, 1, 0, 0, 0, NA, 1, 0, 0, 1, 0, 0, 1, NA, NA, 0, N…
## $ calcium  <dbl> 503.5, 473.5, NA, 1248.5, 660.5, 776.0, 452.0, 853.5, 929.0, …
## $ vitd     <dbl> 1.85, 5.85, NA, 3.85, 2.35, 5.65, 3.75, 4.45, 6.05, 6.45, 3.3…
## $ DSQTVD   <dbl> 20.557, 25.000, NA, 25.000, NA, NA, NA, 100.000, 50.000, 46.6…
## $ DSQTCALC <dbl> 211.67, 820.00, NA, 35.00, 13.33, NA, 26.67, 1066.67, 35.00, …

Recode RIDRETH1 as a labeled factor (Mexican American, Other Hispanic, Non- Hispanic White, Non-Hispanic Black, Other)
Recode RIAGENDR as a labeled factor (Male / Female)
Recode smoker as a labeled factor (Current / Past / Never)

# Recode RIDRETH1 as a labeled factor
bmd <- bmd %>%
  mutate(
  ##Recode RIDRETH1 to V_ETH
  V_ETH = factor(RIDRETH1, 
                 levels = c(1,2,3,4,5),
                 labels = c("Mexican American", 
                            "Other Hispanic",
                            "Non-Hispanic White",
                            "Non-Hispanic Black",
                            "Other")
                ),
  ##Recode RIAGENDR to V_GEN
  V_GEN = factor(RIAGENDR, 
                 levels = c(1,2),
                 labels = c("Male",
                            "Female")
                ),
  ##Recode smoker to V_SMK
  V_SMK = factor(smoker,
                 levels = c(1,2,3),
                 labels = c("Current","Past","Never"))
)

#QA - Print top 3 rows, output for new var
print (bmd[1:3, (ncol(bmd)-2):ncol(bmd)])

##                V_ETH  V_GEN   V_SMK
## 1 Non-Hispanic Black Female    Past
## 2              Other Female   Never
## 3 Non-Hispanic Black Female Current

Report the total sample size (N) and the number of missing values for DXXOFBMD and calcium
Create an analytic dataset excluding observations with missing values on DXXOFBMD or calcium. Report the final analytic N.

# Store the total size of the dataset
M_completesamplesize <- nrow(bmd)

#a_bmd is new df, not missing any values in Calcium or BMD
a_bmd <- bmd[(!is.na(bmd$calcium) & !is.na(bmd$DXXOFBMD)),]

#Store total size of nonmissing dataset
M_total <- nrow(a_bmd)

Size of Complete Dataset: 2898
Number of Rows Excluded: 769 rows, 26.54%
Working Dataset size: n = 2129

Part 1: Exploratory Visualization

Research Question: Is there a linear association between dietary calcium intake (calcium, mg/day) and total femur bone mineral density (DXXOFBMD, g/cm²)?

# Create a scatterplot with a fitted regression line
ggplot(a_bmd, aes(x = calcium, y = DXXOFBMD)) +
  geom_point(alpha = 0.3, color = "orchid") +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(
    title   = "Dietary Calcium and Total Femur Bone Mineral Density",
    x       = "Dietary Calcium Intake (mg/day)",
    y       = "Bone Mineral Density (g/cm²)"
  ) +
  theme_classic()

Interpretation: The scatter plot indicates that there is some positive linear association, with the majority of observations clustered below 2000 mg/day of dietary calcium. This linear relationship appears to be relatively weak, with outliers in terms of Dietary Calcium appearing to be potentially influential towards the high-calcium observations.

Part 2: Simple Linear Regression

Step 1: Fit the model

# Fit the simple linear regression model
m_bmd_slr <- lm(DXXOFBMD ~ calcium, data = a_bmd)

summary(m_bmd_slr)

## 
## Call:
## lm(formula = DXXOFBMD ~ calcium, data = a_bmd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55653 -0.10570 -0.00561  0.10719  0.62624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.992e-01  7.192e-03 125.037  < 2e-16 ***
## calcium     3.079e-05  7.453e-06   4.131 3.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 2127 degrees of freedom
## Multiple R-squared:  0.007959,   Adjusted R-squared:  0.007493 
## F-statistic: 17.07 on 1 and 2127 DF,  p-value: 3.751e-05

# Display the full model summary
tidy(m_bmd_slr, conf.int = TRUE) %>%
  mutate(
    ##Then P value gets pval format
    p.value = format.pval(p.value)
    ) %>%
  kable(
    caption = "Simple Linear Regression: DXXOFBMD ~ Calcium (NHANES 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
    
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: DXXOFBMD ~ Calcium (NHANES 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.8992037	0.0071915	125.036678	< 2.22e-16	0.8851006	0.9133069
calcium	0.0000308	0.0000075	4.131029	3.7512e-05	0.0000162	0.0000454

b0 <- m_bmd_slr$coefficients[1]
b1 <- m_bmd_slr$coefficients[2]

Step 2: Interpret the Coefficients

A. Intercept (β₀):

The intercept (β₀ = 0.899) represents the predicted bone density in g/cm² for a daily intake of 0 mg of calcium. In this context, it is meaningful, as someone can have 0 mg of calcium intake daily, and the expected bone density, while low, is above 0, so it is possible.

B. Slope (β₁):

The slope (β₁ = 3.1^{-5}) represents an estimated 3.1^{-5} g/cm² increase in total femur bone density for each additional 1 mg/day increase in calcium intake. This slope is positive, with a relatively large effect size, given that a typical adult has over 1000 mg/day of calcium. More meaningfully, this can be interpreted as an increase of 0.03g/cm² BMD for each 1000 mg calcium/day.

Step 3: Statistical Inference

##Extract things I'll need from model object to table
s_bmd_slr <- data.frame(cbind(
                summary(m_bmd_slr)$coefficients,
                confint(m_bmd_slr)
                   ))
##extract individual variables to call in interpretation
t <- s_bmd_slr$t.value[2]
p_val <- s_bmd_slr$Pr...t..[2]
df <- m_bmd_slr$df.residual
cf_l <- s_bmd_slr$X2.5..[2]
cf_u <- s_bmd_slr$X97.5..[2]

Hypotheses

The null hypothesis states that there is no linear relationship between daily calcium intake and total femur bone density.

\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between X and Y)}\] The alternate hypothesis states that there is linear relationship between daily calcium intake and total femur bone density. \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]

The t-statistic for this model is 4.13, with 2127 degrees of freedom. The p-value is 3.7512e-05. Because the p-value (3.7512e-05) is less than the alpha (0.05), we reject the null hypothesis.

This means that we can accept the alternate hypothesis that there is a linear association between daily calcium intake and total femur bone density.

Interpret the 95% confidence interval for β₁:

We are 95% confident that the true increase in total femur bone mineral density (g/cm²) for each 1mg/day of additional calcium intake lies around 3.1^{-5}, between the range of Lower CI: 1.6^{-5} to Upper CI:4.5^{-5}.

Step 4: Model Fit -R² and Residual Standard Error

##Extract things I'll need from model object to table
r2 <- summary(m_bmd_slr)$r.squared
print(r2)

## [1] 0.007959364

##does this really work?
rse <- sd(summary(m_bmd_slr)$residuals)

mean <- mean(a_bmd$calcium)
print(rse)

## [1] 0.1582003

R² (coefficient of determination): The R² of the association between dietary calcium intake and BMD is 0.008, indicating that about 0.796% of variance in BMD is explained by dietary calcium intake. Based on this R², the model does not fit the data that cleanly, as substantial variance in observations has not been explained. This indicates that other predictors are likely impacting BMD as much as or more than dietary calcium intake, and may be important to incorporate into future models.

Residual Standard Error (RSE):

The RSE for this model is 0.158 BMD in g/cm², this tells us that the estimated typical prediction error of the model is quite large. For the mean value of calcium intake in mg/day seen in our dataset (848), the predicted BMD would be 0.026 g/cm², just as an example of a value for calcium that is expected. The estimated typical prediction error of this value is larger than this value - 6.1x as large - indicating that this model has very low predictive power.

Part 3: Prediction

#range of reasonable calcium values
min <- min(a_bmd$calcium)
max <- max(a_bmd$calcium)
# Create a new data frame with the target predictor value
test_1k <- data.frame(calcium = 1000)

# 95% confidence interval for the mean response at calcium = 100
ci_pred <- predict(m_bmd_slr, newdata = test_1k, interval = "confidence") %>%
   as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

# 95% confidence interval for the mean response at calcium = 100
pi_pred <- predict(m_bmd_slr, newdata = test_1k, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results <- bind_cols(test_1k, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
    col.names = c("Calcium Intake (mg/day)", "Fitted BMD (g/cm²)", "CI Lower (g/cm²)", "CI Upper (g/cm²)", "PI Lower (g/cm²)", "PI Upper (g/cm²)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age
		95% CI for Mean		95% PI for Individual
Calcium Intake (mg/day)	Fitted BMD (g/cm²)	CI Lower (g/cm²)	CI Upper (g/cm²)	PI Lower (g/cm²)	PI Upper (g/cm²)
1000	0.93	0.92	0.94	0.62	1.24

Interpetation: At 1,000 mg/day of dietary calcium, the predicted BMD is 0.93g/cm². The 95% confidence interval ranges from 0.92 to 0.94, and gives a range of plausible values for the population mean BMD at 1,000 mg/day of dietary calcium. The 95% prediction interval ranges from 0.62 to 1.24, and reflects the range for a single new observation. It is always wider than the confidence interval, as it accounts for both uncertainty in the predictor and the variability around the mean seen by individuals - and as we have seen, relatively little individual variance is accounted for by the model.

1,000 mg/day is a meaningful value to predict at given the data, as the mean mg of daily calcium intake in the dataset is 848, and values in the dataset range from 55 to 5199.

Part 4: Reflection

The regression model told us that there is a slight positive calcium-BMD relationship, but that the association is loose and explains very little. We saw some indicators of this in our correlation matrix for last assignment, but it is still surprising. Because this analysis is entirely cross-sectional, temporality is very unclear: dietary calcium in this data reflects dietary habits in the past 30 days, while bone mineral density is a product of years or a lifetime of dietary. This data can be further confounded, along the question of temporality, with confounding by indication – people who have higher dietary calcium levels may be specifically taking supplements or eating differently because they already have low bone mineral density, not to mention that other things like Vitamin D levels enable calcium metabolism.

ANOVA is useful to identify if there is a significant difference in mean of a continuous variable across a categorical variable. Additional post-hoc tests can give insight into the extent to which the categorical variable explains variance in the continuous variable, which categorical variables differ, and directionality. SLR is useful, alongside correlation, in measuring the association between two continuous variables. It is particularly useful because SLR can offer insight into both the magnitude of the association (for each unit of change in the predictor, how much change is expected in the outcome), and the goodness of fit.

ANOVA can help us identify categorical variables that may affect the mean – sex, or education. While SLR allows us to dig into multiple continuous variables – like the various dietary intake habits and how they affect continuous variables of health outcomes, like BMD.

As with last assignment, I ran into some trouble initially accessing specific items from stored objects. The models in particular are lists with a variety of different data structures inside of them - more lists, integers, dataframes, atomic vectors. My solution to this is trial and error in a simplified call, or a print statement in the console. But over the course of each assignment, I’ve found that I’m much quicker to identify what a structure is and how to interact with it, so it requires less trial and error, and winds up being less of a headache.