Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:
In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.
glimpse(brfss_slr) # Examine the first few rows head(brfss_slr, n = 10)
str(brfss_slr)
dim(brfss_slr)
names(brfss_slr)
brfss_slr %>%
select(bmi, phys_days) %>%
summary() %>%
kable(
caption = "Table 1a. Summary Statistics: Body Mass Index and Physical Health Days",
col.names = c("Statistic", "BMI (kg/m²)", "Days of Poor Physical Health")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left"
)| Statistic | BMI (kg/m²) | Days of Poor Physical Health |
|---|---|---|
| Min. :14.63 | Min. : 1.00 | |
| 1st Qu.:24.32 | 1st Qu.: 2.00 | |
| Median :27.89 | Median : 6.00 | |
| Mean :29.18 | Mean :11.66 | |
| 3rd Qu.:32.89 | 3rd Qu.:20.00 | |
| Max. :59.60 | Max. :30.00 |
##comprehensive descriptive table
brfss_slr %>%
select(bmi, phys_days, age, sex, education, gen_health_num, sleep_hrs) %>% # Add more variables
tbl_summary(
label = list(
bmi ~ "Body Mass Index (kg/m²)",
phys_days ~ "Days of Poor Physical Health (Past 30 Days)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education Level",
gen_health_num ~ "Gen Health Number",
sleep_hrs ~ "Number of hours of sleep"
),
statistic = list(
all_continuous() ~ "{mean} ({sd}) | {median} [{p25}, {p75}]",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no" # Don't show missing data row
) %>%
add_n() %>%
bold_labels() %>%
modify_header(label ~ "**Variable**") %>%
modify_caption("**Table 1. Descriptive Statistics of Study Sample (BRFSS 2020, n = 3,000)**") %>%
modify_footnote(all_stat_cols() ~ "Mean (SD) | Median [IQR] for continuous; n (%) for categorical")| Variable | N | N = 3,0001 |
|---|---|---|
| Body Mass Index (kg/m²) | 3,000 | 29.2 (7.0) | 27.9 [24.3, 32.9] |
| Days of Poor Physical Health (Past 30 Days) | 3,000 | 11.7 (11.2) | 6.0 [2.0, 20.0] |
| Age (years) | 3,000 | 55.5 (17.4) | 58.0 [43.0, 70.0] |
| Sex | 3,000 | |
| Female | 1,701 (57%) | |
| Male | 1,299 (43%) | |
| Education Level | 3,000 | |
| < High school | 237 (7.9%) | |
| High school graduate | 796 (27%) | |
| Some college | 937 (31%) | |
| College graduate | 1,030 (34%) | |
| Gen Health Number | 3,000 | |
| 1 | 266 (8.9%) | |
| 2 | 756 (25%) | |
| 3 | 941 (31%) | |
| 4 | 692 (23%) | |
| 5 | 345 (12%) | |
| Number of hours of sleep | 3,000 | 6.9 (1.7) | 7.0 [6.0, 8.0] |
| 1 Mean (SD) | Median [IQR] for continuous; n (%) for categorical | ||
# Plot
p_hist_phys <- ggplot(brfss_slr, aes(x = phys_days)) +
geom_histogram(binwidth = 1, fill = "darkblue", color = "white", alpha = 0.7) +
labs(
title = "Distribution of Physically Unhealthy Days (Past 30 Days)",
subtitle = "BRFSS 2020 (n = 3,000)",
x = "Number of Days",
y = "Frequency"
) +
theme_minimal(base_size = 12)
ggplotly(p_hist_phys)Days of Poor Physical Health
Before fitting any model, always visualize the bivariate relationship.
p_scatter <- ggplot(brfss_slr, aes(x = bmi, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "BMI vs. phys_days (BRFSS 2020)",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "BMI (kg/m²)",
y = "phys_days"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)BMI vs. phys_days
Questions:
phys_days?
Of bmi? What do you notice about the distribution of
phys_days?The mean of phys_days is 11.7 days and a standard deviation is 11.2 days. THe mean (11.7) is much higher than median (6 days) which suggest it has a right skew. The histogram confirms the distribution as highly right‑skewed, with many respondents reporting 0 days and a second peak at 30 days.
The scatter plot shows a weak positive association between BMI and poor physical health days. The linear trend increases slightly with BMI, but the non‑linear smoother indicates curve, suggesting the relationship is not strictly linear. Several high‑BMI and high‑phys_days values appear as outliers.
We use ordinary least squares (OLS) to find the best-fit line that minimizes the squared deviations of observed values from predicted values.
Key outputs: Intercept: Expected unhealthy days when BMI = 0 Slope: Change in unhealthy days for each 1-unit increase in BMI Standard Error: Precision of the estimate (smaller = more precise) t-statistic & p-value: Tests whether the slope is significantly different from zero 95% CI: Range likely to contain the true population slope
# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_slr)
# Summary output
summary(model_slr)##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
Fitted regression equation:
\[\widehat{\text{Phys_days}} = 7.423 + 0.1451 \times \text{BMI}\]
For every 1-unit increase in BMI, the model predicts that number of physically unhealthy days increases by 0.1451 days, on average.
###Visualizing Fitted Values and Residuals
# Augment dataset with fitted values and residuals
augmented <- augment(model_slr)
# Show a sample of fitted values and residuals
augmented %>%
select(phys_days, bmi, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Observed Phys_days (Y)", "BMI (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Observed Phys_days (Y) | BMI (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 5 | 26.58 | 11.280 | -6.280 |
| 30 | 33.47 | 12.280 | 17.720 |
| 30 | 35.15 | 12.524 | 17.476 |
| 4 | 30.42 | 11.838 | -7.838 |
| 3 | 22.67 | 10.713 | -7.713 |
| 30 | 30.11 | 11.793 | 18.207 |
| 2 | 35.43 | 12.565 | -10.565 |
| 3 | 31.58 | 12.006 | -9.006 |
| 2 | 28.13 | 11.505 | -9.505 |
| 1 | 34.01 | 12.359 | -11.359 |
Questions:
The fitted regression equation is:
\(\hat{Y} = b_0 + b_1 BMI\).
where hat{Y} is the predicted number of poor physical health days. b_0 is the intercept (7.42285) and, b_1 is the slope (0.14513).
The slope is 0.145 days per BMI unit. For each 1‑unit increase in BMI, the model predicts an average increase of 0.145 poor physical health days in the past month. SImply, someone with a BMI 10 days higher would be predicted to have about 1.45 more unhealthy days.
The intercept is 7.42, which is the predicted number of poor physical health days when BMI is 0 but that is not physiologically possible since everyone does have some BMI.
Our null hypothesis is that there is no association between BMI and phys_days and alternative hypothesis is that there is an association. Based on the test statistic, t-value for BMI is 5.013 and p-value is 5.66e-07 and because the p-value <<0.05, we reject the null hypothesis with a significant evidence of an association between BMI and poor physical health days.
However, the effect size is extremely small (R2 = 0.008), it explains less than 1% of the variation in phys_days.
# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample <- augmented %>% slice_sample(n = 80)
p_resid <- ggplot(resid_sample, aes(x = bmi, y = phys_days)) +
geom_segment(aes(xend = bmi, yend = .fitted),
color = "tomato", alpha = 0.5, linewidth = 0.5) +
geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
labs(
title = "Residuals Illustrated on the Regression Line",
subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
x = "BMI (kg/m²)",
y = "Number of bad physical health days"
) +
theme_minimal(base_size = 13)
p_residVisualizing Residuals on the Regression Line
# (a) Display the ANOVA table
# ANOVA decomposition
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: Physically Unhealthy Days ~ BMI",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
n <- nrow(brfss_slr)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
Units = c("", "", "kg/m²")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Quantity | Value | Units |
|---|---|---|
| SS Residual | 370411.740 | |
| MSE (σ̂²) |
123.55
|
# Extract R-squared from model
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R² | 0.0083 |
| Adjusted R² | 0.008 |
| Variance Explained | 0.83% |
Questions:
From the ANOVA table,
SSResidual (SSE) = 370411.74, df= 2998 SSRegression (SSR) = = 3105.37, df= 1 SSTotal (SST) = SSR + SSE = 373517.11, df= 2999
F- statistic: 25.134
R2 = 0.008 About 0.83% of the variation in poor physical health days is explained by BMI in this sample.
BMI explains almost none of the variability in poor physical health days, even though the association is statistically significant. Other health conditions like Mental health, stress, and social factors, Access to care, lifestyle, occupation, environment may explain remaining.
# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
# Compute CI and PI at specific age values
new_bmi <- data.frame(bmi= c(25))
ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI",
col.names = c("bmi", "Fitted phys_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))| bmi | Fitted phys_days | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 11.05 | 10.59 | 11.51 | -10.75 | 32.85 |
above tables. 95% PI for the individual (32.85, -10.75)
# Generate CI and PI across the full age range
bmi_grid <- data.frame(bmi= seq(min(brfss_slr$bmi), max(brfss_slr$bmi),, length.out = 200))
ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(bmi_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_slr, aes(x = bmi, y = phys_days),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: Phys_days ~ BMI",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "BMI (kg/m²)",
y = "Phys_days",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_piRegression Line with 95% Confidence and Prediction Intervals
Questions:
For someone with a BMI of 25, the estimated mean number of poor physical health days is estimated to be about 11.05 days, and we are 95% confident that the true population mean lies between 10.59 and 11.51 days.
If a specific new person has a BMI of 25, the 95% prediction interval for their number of poor physical health days is between 10.75 and 32.85 days with 95% confidence.
A prediction interval is wider because of the uncertainty in estimating the mean (same as the CI) and the person‑to‑person variability around that mean. CI only show the first one.
We would use Confidence interval (CI) when we have to estimate the average response for a group - population mean and we use PI when we have to predict the outcome for a single new individual.
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)Interpretation:
Residuals vs Fitted: The residuals do not form a tight, horizontal band. There is a curved pattern, wuggesting non-linear relationship. The spread of residuals increases as fitted values increase (heteroscedasticity).
Normal Q-Q plot: Residuals are not normally distributed. With n = 3,000, this is less critical for inference, but the model is not capturing the underlying structure well.
Scale- Location Plot: The red line slopes upward and residual spread increases with fitted values.
Cooks Distance: There are no influential observations that threaten the stability of the model. Outliers exist, but they are not influential enough to distort the regression line.
p_resid_x <- ggplot(augmented, aes(x = bmi, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. BMI",
subtitle = "Should show no pattern — random scatter around zero",
x = "BMI (kg/m2)",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_xResiduals vs. BMI — Checking Linearity
p_hist <- ggplot(augmented, aes(x = .resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = mean(augmented$.resid),
sd = sd(augmented$.resid)),
color = "black", linetype = "dashed", linewidth = 1) +
labs(
title = "Distribution of Residuals",
subtitle = "Red = kernel density | Black dashed = normal distribution",
x = "Residuals",
y = "Density"
) +
theme_minimal(base_size = 13)
p_histDistribution of Residuals
# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qqNormal Q-Q Plot of Residuals
# Cook's distance
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooksCook’s Distance: Identifying Influential Observations
Questions:
The plot shows that the smooth curve bends upward at low fitted values, dips in the middle, and rises again at higher fitted values. This indicates the true relationship between BMI and phys_days is curved,and not straight. The vertical spread of residuals also increases as fitted values increase.
phys_days?Residuals are not normal as the Q‑Q plot shows that Points deviating from the line in both tails with a subtle S‑shape (Heavy tails and a more peaked center than expected under normality). The residuals are not normally distributed.
The Cook’s distance plot shows: - 136 observations above the 4/n threshold - No extreme spikes—just many moderately influential points - Influence spread across the dataset rather than driven by a single outlier.
The LINE assumptions are: - Linearity → violated - Independence → okay-ish? (given BRFSS sampling) - Normality → violated - Equal variance → violated
Importantly, The relationship between BMI and phys_days is not linear. And the variance of phys_days increases with BMI. The model is statistically significant but not appropriate practically. —
Now fit a second SLR model using age as the
predictor of phys_days instead of BMI.
# (a) Fit SLR: phys_days ~ age
### Visualizing the Relationship
### Before fitting any model, always visualize the bivariate relationship.
####{r scatter-age-phys_days, fig.cap="age vs. phys_days"}
p_scatter <- ggplot(brfss_slr, aes(x = age, y = phys_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Age vs. phys_days (BRFSS 2020)",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "Age",
y = "phys_days"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)# (a) Fit SLR: phys_days ~ age
model_slr_age <- lm(phys_days ~ age, data = brfss_slr)
# Display results
tidy_age <- tidy(model_slr_age, conf.int = TRUE)
tidy_age %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Physically Unhealthy Days ~ Age",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3702 | 0.6661 | 6.5611 | 0 | 3.0642 | 5.6762 |
| age | 0.1313 | 0.0114 | 11.4675 | 0 | 0.1088 | 0.1537 |
b1_bmi <- round(coef(model_slr)["bmi"], 4)
b1_age <- round(coef(model_slr_age)["age"], 4)
r2_bmi <- round(summary(model_slr)$r.squared, 4)
r2_age <- round(summary(model_slr_age)$r.squared, 4)
p_bmi <- round(summary(model_slr)$coefficients["bmi", "Pr(>|t|)"], 5)
p_age <- round(summary(model_slr_age)$coefficients["age", "Pr(>|t|)"], 5)
# Create comparison table
comparison_table <- tibble(
Metric = c("Slope (b₁)", "Standard Error", "t-statistic", "p-value", "R²", "Direction"),
BMI_Model = c(
b1_bmi,
round(summary(model_slr)$coefficients["bmi", "Std. Error"], 4),
round(summary(model_slr)$coefficients["bmi", "t value"], 3),
p_bmi,
r2_bmi,
ifelse(b1_bmi > 0, "Positive", "Negative")
),
Age_Model = c(
b1_age,
round(summary(model_slr_age)$coefficients["age", "Std. Error"], 4),
round(summary(model_slr_age)$coefficients["age", "t value"], 3),
p_age,
r2_age,
ifelse(b1_age > 0, "Positive", "Negative")
)
)
comparison_table %>%
kable(
caption = "Comparing BMI vs. Age as Predictors of Unhealthy Days",
align = "lrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Metric | BMI_Model | Age_Model |
|---|---|---|
| Slope (b₁) | 0.1451 | 0.1313 |
| Standard Error | 0.0289 | 0.0114 |
| t-statistic | 5.013 | 11.467 |
| p-value | 0 | 0 |
| R² | 0.0083 | 0.042 |
| Direction | Positive | Positive |
p_age_scatter <- ggplot(brfss_slr, aes(x = age, y = phys_days)) +
geom_point(alpha = 0.15, color = "purple", size = 1) +
geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
labs(
title = "Physically Unhealthy Days vs. Age",
x = "Age (years)",
y = "Number of Days of physically unhealthy"
) +
theme_minimal(base_size = 12)
ggplotly(p_age_scatter)
**Questions:**
a) How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?
Both predictors age and BMI show a positive association with poor physical health days, but the strength of the association differs.
1. Both slopes are positive.
2. BMI slope is 0.145 days per BMI unit whereas Age slope is 0.131 days per year. But if we look at t-statistics, BMI: t-value is 5.01 but for age, it is 11.47 which shows that age has more than double the standardized effect size of BMI.
Both predictors are statistically significant (p < 0.001) but age has more stable association.
b) Compare the $R^2$ values of the two models. Which predictor explains more variability in `phys_days`?
The R² values for both the model are:
1. BMI model: 0.0083 (0.83% variance explained)
2. Age model: 0.042 (4.2% variance explained)
Age explains five times more variance in phys_days than BMI. But even 4.2% is still very small. Age is a better predictor than BMI, but both predictors explain only a small fraction of the variability in poor physical health days.
c) Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?
Age is a stronger predictor than BMI as they show statistically significant positive associations. However, the outcome (phys_days) is highly influenced by many other factors (chronic illness, disability, stress, access to care, lifestyle, etc.).
As a result, simple linear regression captures only a tiny portion of the true variation.
Limitations of using simple linear regression for this outcome based on the diagnostics and residual plots
- Nonlinearity: The relationship is curved, not straight.
- Heteroscedasticity: Variance increases with fitted values.
- Non-normal residuals: phys_days is a skewed count variable.
- Influential points: Many moderately influential observations.
- Low R²: Predictors explain very little variance.
These issues arise because phys_days is a bounded variable (0–30) and right‑skewed.
---
# Session Info
``` r
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_2.5.0 ggeffects_2.3.2 broom_1.0.11 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 here_1.0.2 haven_2.5.5
## [9] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [13] purrr_1.2.1 readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
## [17] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 htmlwidgets_1.6.4
## [5] insight_1.4.5 lattice_0.22-7 tzdb_0.5.0 crosstalk_1.2.2
## [9] vctrs_0.7.1 tools_4.5.2 generics_0.1.4 pkgconfig_2.0.3
## [13] Matrix_1.7-4 data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1
## [17] gt_1.3.0 lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
## [21] textshaping_1.0.4 litedown_0.9 htmltools_0.5.9 sass_0.4.10
## [25] yaml_2.3.12 lazyeval_0.2.2 pillar_1.11.1 jquerylib_0.1.4
## [29] cachem_1.1.0 nlme_3.1-168 commonmark_2.0.0 tidyselect_1.2.1
## [33] digest_0.6.39 stringi_1.8.7 splines_4.5.2 labeling_0.4.3
## [37] rprojroot_2.1.1 fastmap_1.2.0 grid_4.5.2 cli_3.6.5
## [41] magrittr_2.0.4 cards_0.7.1 withr_3.0.2 scales_1.4.0
## [45] backports_1.5.0 timechange_0.3.0 rmarkdown_2.30 httr_1.4.7
## [49] otel_0.2.0 hms_1.1.4 evaluate_1.0.5 viridisLite_0.4.2
## [53] mgcv_1.9-4 markdown_2.0 rlang_1.1.7 glue_1.8.0
## [57] xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0 jsonlite_2.0.0
## [61] R6_2.6.1 systemfonts_1.3.1 fs_1.6.6