Introduction

Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:

  • Quantify the linear relationship between a continuous outcome and a single predictor
  • Predict values of an outcome given a predictor value
  • Test hypotheses about whether a predictor is associated with an outcome
  • Lay the groundwork for multiple regression, which controls for confounding

In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.


Setup and Data

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)

Loading BRFSS 2020 Data

We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi

###load brfss_subset_2020.rds 
brfss_slr <- read_rds('C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_slr_2020.rds')

Inspect the data

glimpse(brfss_slr) # Examine the first few rows head(brfss_slr, n = 10)

View data structure

str(brfss_slr)

Dimensions: rows (observations) and columns (variables)

dim(brfss_slr)

What variables do we have?

names(brfss_slr)

Descriptive Statistics

brfss_slr %>%
  select(bmi, phys_days) %>%
  summary() %>%
  kable(
    caption = "Table 1a. Summary Statistics: Body Mass Index and Physical Health Days",
    col.names = c("Statistic", "BMI (kg/m²)", "Days of Poor Physical Health")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "left"
  )
Table 1a. Summary Statistics: Body Mass Index and Physical Health Days
Statistic BMI (kg/m²) Days of Poor Physical Health
Min. :14.63 Min. : 1.00
1st Qu.:24.32 1st Qu.: 2.00
Median :27.89 Median : 6.00
Mean :29.18 Mean :11.66
3rd Qu.:32.89 3rd Qu.:20.00
Max. :59.60 Max. :30.00
##comprehensive descriptive table

brfss_slr %>%
  select(bmi, phys_days, age, sex, education, gen_health_num, sleep_hrs) %>%  # Add more variables
  tbl_summary(
    label = list(
      bmi ~ "Body Mass Index (kg/m²)",
      phys_days ~ "Days of Poor Physical Health (Past 30 Days)",
      age ~ "Age (years)",
      sex ~ "Sex",
      education ~ "Education Level",
      gen_health_num ~ "Gen Health Number",
      sleep_hrs ~ "Number of hours of sleep"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}) | {median} [{p25}, {p75}]",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"  # Don't show missing data row
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_header(label ~ "**Variable**") %>%
  modify_caption("**Table 1. Descriptive Statistics of Study Sample (BRFSS 2020, n = 3,000)**") %>%
  modify_footnote(all_stat_cols() ~ "Mean (SD) | Median [IQR] for continuous; n (%) for categorical")
Table 1. Descriptive Statistics of Study Sample (BRFSS 2020, n = 3,000)
Variable N N = 3,0001
Body Mass Index (kg/m²) 3,000 29.2 (7.0) | 27.9 [24.3, 32.9]
Days of Poor Physical Health (Past 30 Days) 3,000 11.7 (11.2) | 6.0 [2.0, 20.0]
Age (years) 3,000 55.5 (17.4) | 58.0 [43.0, 70.0]
Sex 3,000
    Female
1,701 (57%)
    Male
1,299 (43%)
Education Level 3,000
    < High school
237 (7.9%)
    High school graduate
796 (27%)
    Some college
937 (31%)
    College graduate
1,030 (34%)
Gen Health Number 3,000
    1
266 (8.9%)
    2
756 (25%)
    3
941 (31%)
    4
692 (23%)
    5
345 (12%)
Number of hours of sleep 3,000 6.9 (1.7) | 7.0 [6.0, 8.0]
1 Mean (SD) | Median [IQR] for continuous; n (%) for categorical

(b) Create a histogram of phys_days — describe the distribution

# Plot
p_hist_phys <- ggplot(brfss_slr, aes(x = phys_days)) +
      geom_histogram(binwidth = 1, fill = "darkblue", color = "white", alpha = 0.7) +
  labs(
    title = "Distribution of Physically Unhealthy Days (Past 30 Days)",
    subtitle = "BRFSS 2020 (n = 3,000)",
    x = "Number of Days",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p_hist_phys)

Days of Poor Physical Health

(c) Create a scatter plot of phys_days (Y) vs bmi (X)

Visualizing the Relationship

Before fitting any model, always visualize the bivariate relationship.

p_scatter <- ggplot(brfss_slr, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "BMI vs. phys_days (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "phys_days"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

BMI vs. phys_days

Questions:

  1. What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days?

The mean of phys_days is 11.7 days and a standard deviation is 11.2 days. THe mean (11.7) is much higher than median (6 days) which suggest it has a right skew. The histogram confirms the distribution as highly right‑skewed, with many respondents reporting 0 days and a second peak at 30 days.

  1. Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers?

The scatter plot shows a weak positive association between BMI and poor physical health days. The linear trend increases slightly with BMI, but the non‑linear smoother indicates curve, suggesting the relationship is not strictly linear. Several high‑BMI and high‑phys_days values appear as outliers.

Task 2: Fit and Interpret the SLR Model (20 points)

We use ordinary least squares (OLS) to find the best-fit line that minimizes the squared deviations of observed values from predicted values.

Key outputs: Intercept: Expected unhealthy days when BMI = 0 Slope: Change in unhealthy days for each 1-unit increase in BMI Standard Error: Precision of the estimate (smaller = more precise) t-statistic & p-value: Tests whether the slope is significantly different from zero 95% CI: Range likely to contain the true population slope

# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_slr)

# Summary output
summary(model_slr)
## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_slr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.808  -9.160  -5.623   8.943  20.453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.42285    0.86881   8.544  < 2e-16 ***
## bmi          0.14513    0.02895   5.013 5.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared:  0.008314,   Adjusted R-squared:  0.007983 
## F-statistic: 25.13 on 1 and 2998 DF,  p-value: 5.659e-07

(b) Display a tidy coefficient table with 95% CIs

# Tidy coefficient table

tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 7.4228 0.8688 8.5437 0 5.7193 9.1264
bmi 0.1451 0.0289 5.0134 0 0.0884 0.2019

(c) Extract and report: slope, intercept, t-statistic, p-value

b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)

Fitted regression equation:

\[\widehat{\text{Phys_days}} = 7.423 + 0.1451 \times \text{BMI}\]

For every 1-unit increase in BMI, the model predicts that number of physically unhealthy days increases by 0.1451 days, on average.

###Visualizing Fitted Values and Residuals

# Augment dataset with fitted values and residuals
augmented <- augment(model_slr)

# Show a sample of fitted values and residuals
augmented %>%
  select(phys_days, bmi, .fitted, .resid) %>%
  slice_head(n = 10) %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "First 10 Observations: Observed, Fitted, and Residual Values",
    col.names = c("Observed Phys_days (Y)", "BMI (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
First 10 Observations: Observed, Fitted, and Residual Values
Observed Phys_days (Y) BMI (X) Fitted (Ŷ) Residual (e = Y − Ŷ)
5 26.58 11.280 -6.280
30 33.47 12.280 17.720
30 35.15 12.524 17.476
4 30.42 11.838 -7.838
3 22.67 10.713 -7.713
30 30.11 11.793 18.207
2 35.43 12.565 -10.565
3 31.58 12.006 -9.006
2 28.13 11.505 -9.505
1 34.01 12.359 -11.359

Questions:

  1. Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\).

The fitted regression equation is:

\(\hat{Y} = b_0 + b_1 BMI\).

where hat{Y} is the predicted number of poor physical health days. b_0 is the intercept (7.42285) and, b_1 is the slope (0.14513).

  1. Interpret the slope (\(b_1\)) in context — what does it mean in plain English?

The slope is 0.145 days per BMI unit. For each 1‑unit increase in BMI, the model predicts an average increase of 0.145 poor physical health days in the past month. SImply, someone with a BMI 10 days higher would be predicted to have about 1.45 more unhealthy days.

  1. Is the intercept (\(b_0\)) interpretable in this context? Why or why not?

The intercept is 7.42, which is the predicted number of poor physical health days when BMI is 0 but that is not physiologically possible since everyone does have some BMI.

  1. Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.

Our null hypothesis is that there is no association between BMI and phys_days and alternative hypothesis is that there is an association. Based on the test statistic, t-value for BMI is 5.013 and p-value is 5.66e-07 and because the p-value <<0.05, we reject the null hypothesis with a significant evidence of an association between BMI and poor physical health days.

However, the effect size is extremely small (R2 = 0.008), it explains less than 1% of the variation in phys_days.

# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample <- augmented %>% slice_sample(n = 80)

p_resid <- ggplot(resid_sample, aes(x = bmi, y = phys_days)) +
  geom_segment(aes(xend = bmi, yend = .fitted),
               color = "tomato", alpha = 0.5, linewidth = 0.5) +
  geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
  geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
  labs(
    title = "Residuals Illustrated on the Regression Line",
    subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
    x = "BMI (kg/m²)",
    y = "Number of bad physical health days"
  ) +
  theme_minimal(base_size = 13)

p_resid
Visualizing Residuals on the Regression Line

Visualizing Residuals on the Regression Line


Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
# ANOVA decomposition
anova_slr <- anova(model_slr)

anova_slr %>%
  kable(
    caption = "ANOVA Table: Physically Unhealthy Days ~ BMI",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
ANOVA Table: Physically Unhealthy Days ~ BMI
Source Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 3105.365 3105.365 25.134 0
Residuals 2998 370411.743 123.553 NA NA

(b) Compute and report SSTotal, SSRegression, and SSResidual

n <- nrow(brfss_slr)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
  Units    = c("", "", "kg/m²")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Error Estimates
Quantity Value Units
SS Residual 370411.740
MSE (σ̂²)
   123.55

(c) Compute R² two ways: from the model object and from the SS decomposition

# Extract R-squared from model
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
R² and Adjusted R²
Metric Value
0.0083
Adjusted R² 0.008
Variance Explained 0.83%

Questions:

  1. Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic.

From the ANOVA table,

SSResidual (SSE) = 370411.74, df= 2998 SSRegression (SSR) = = 3105.37, df= 1 SSTotal (SST) = SSR + SSE = 373517.11, df= 2999

F- statistic: 25.134

  1. What is the \(R^2\) value? Interpret it in plain English.

R2 = 0.008 About 0.83% of the variation in poor physical health days is explained by BMI in this sample.

  1. What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation?

BMI explains almost none of the variability in poor physical health days, even though the association is statistically significant. Other health conditions like Mental health, stress, and social factors, Access to care, lifestyle, occupation, environment may explain remaining.


Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25

# Compute CI and PI at specific age values
new_bmi <- data.frame(bmi= c(25))

ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI",
    col.names = c("bmi", "Fitted phys_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))
Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI
95% CI for Mean
95% PI for Individual
bmi Fitted phys_days CI Lower CI Upper PI Lower PI Upper
25 11.05 10.59 11.51 -10.75 32.85

(b) Calculate the 95% prediction interval for a person with BMI = 25

above tables. 95% PI for the individual (32.85, -10.75)

(c) Plot the regression line with both the CI band and PI band

# Generate CI and PI across the full age range
bmi_grid <- data.frame(bmi= seq(min(brfss_slr$bmi), max(brfss_slr$bmi),, length.out = 200))

ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_slr, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: Phys_days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Phys_days",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi
Regression Line with 95% Confidence and Prediction Intervals

Regression Line with 95% Confidence and Prediction Intervals

Questions:

  1. For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

For someone with a BMI of 25, the estimated mean number of poor physical health days is estimated to be about 11.05 days, and we are 95% confident that the true population mean lies between 10.59 and 11.51 days.

  1. If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days?

If a specific new person has a BMI of 25, the 95% prediction interval for their number of poor physical health days is between 10.75 and 32.85 days with 95% confidence.

  1. Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice?

A prediction interval is wider because of the uncertainty in estimating the mean (same as the CI) and the person‑to‑person variability around that mean. CI only show the first one.

We would use Confidence interval (CI) when we have to estimate the average response for a group - population mean and we use PI when we have to predict the outcome for a single new individual.


Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())

par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

par(mfrow = c(1, 1))

Interpretation:

Residuals vs Fitted: The residuals do not form a tight, horizontal band. There is a curved pattern, wuggesting non-linear relationship. The spread of residuals increases as fitted values increase (heteroscedasticity).

Normal Q-Q plot: Residuals are not normally distributed. With n = 3,000, this is less critical for inference, but the model is not capturing the underlying structure well.

Scale- Location Plot: The red line slopes upward and residual spread increases with fitted values.

Cooks Distance: There are no influential observations that threaten the stability of the model. Outliers exist, but they are not influential enough to distort the regression line.

(b) Create a residuals vs. fitted plot using ggplot

p_resid_x <- ggplot(augmented, aes(x = bmi, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. BMI",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "BMI (kg/m2)",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_x
Residuals vs. BMI — Checking Linearity

Residuals vs. BMI — Checking Linearity

(c) Create a normal Q-Q plot of residuals using ggplot

p_hist <- ggplot(augmented, aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "red", linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = mean(augmented$.resid),
                            sd = sd(augmented$.resid)),
                color = "black", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Distribution of Residuals",
    subtitle = "Red = kernel density | Black dashed = normal distribution",
    x = "Residuals",
    y = "Density"
  ) +
  theme_minimal(base_size = 13)

p_hist
Distribution of Residuals

Distribution of Residuals

# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq
Normal Q-Q Plot of Residuals

Normal Q-Q Plot of Residuals

(d) Create a Cook’s distance plot

# Cook's distance
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_slr),
    influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / n)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks
Cook's Distance: Identifying Influential Observations

Cook’s Distance: Identifying Influential Observations

Questions:

  1. Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see.

The plot shows that the smooth curve bends upward at low fitted values, dips in the middle, and rises again at higher fitted values. This indicates the true relationship between BMI and phys_days is curved,and not straight. The vertical spread of residuals also increases as fitted values increase.

  1. Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days?

Residuals are not normal as the Q‑Q plot shows that Points deviating from the line in both tails with a subtle S‑shape (Heavy tails and a more peaked center than expected under normality). The residuals are not normally distributed.

  1. Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them?

The Cook’s distance plot shows: - 136 observations above the 4/n threshold - No extreme spikes—just many moderately influential points - Influence spread across the dataset rather than driven by a single outlier.

  1. Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.)

The LINE assumptions are: - Linearity → violated - Independence → okay-ish? (given BRFSS sampling) - Normality → violated - Equal variance → violated

Importantly, The relationship between BMI and phys_days is not linear. And the variance of phys_days increases with BMI. The model is statistically significant but not appropriate practically. —

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age

### Visualizing the Relationship

### Before fitting any model, always visualize the bivariate relationship.

####{r scatter-age-phys_days, fig.cap="age vs. phys_days"}
p_scatter <- ggplot(brfss_slr, aes(x = age, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Age vs. phys_days (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "Age",
    y = "phys_days"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)
# (a) Fit SLR: phys_days ~ age
model_slr_age <- lm(phys_days ~ age, data = brfss_slr)

# Display results
tidy_age <- tidy(model_slr_age, conf.int = TRUE)

tidy_age %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Physically Unhealthy Days ~ Age",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", 
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Simple Linear Regression: Physically Unhealthy Days ~ Age
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 4.3702 0.6661 6.5611 0 3.0642 5.6762
age 0.1313 0.0114 11.4675 0 0.1088 0.1537

(b) Display results and compare to the BMI model

b1_bmi <- round(coef(model_slr)["bmi"], 4)
b1_age <- round(coef(model_slr_age)["age"], 4)

r2_bmi <- round(summary(model_slr)$r.squared, 4)
r2_age <- round(summary(model_slr_age)$r.squared, 4)

p_bmi <- round(summary(model_slr)$coefficients["bmi", "Pr(>|t|)"], 5)
p_age <- round(summary(model_slr_age)$coefficients["age", "Pr(>|t|)"], 5)

# Create comparison table

comparison_table <- tibble(
  Metric = c("Slope (b₁)", "Standard Error", "t-statistic", "p-value", "R²", "Direction"),
  BMI_Model = c(
    b1_bmi,
    round(summary(model_slr)$coefficients["bmi", "Std. Error"], 4),
    round(summary(model_slr)$coefficients["bmi", "t value"], 3),
    p_bmi,
    r2_bmi,
    ifelse(b1_bmi > 0, "Positive", "Negative")
  ),
  Age_Model = c(
    b1_age,
    round(summary(model_slr_age)$coefficients["age", "Std. Error"], 4),
    round(summary(model_slr_age)$coefficients["age", "t value"], 3),
    p_age,
    r2_age,
    ifelse(b1_age > 0, "Positive", "Negative")
  )
)

comparison_table %>%
  kable(
    caption = "Comparing BMI vs. Age as Predictors of Unhealthy Days",
    align = "lrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparing BMI vs. Age as Predictors of Unhealthy Days
Metric BMI_Model Age_Model
Slope (b₁) 0.1451 0.1313
Standard Error 0.0289 0.0114
t-statistic 5.013 11.467
p-value 0 0
0.0083 0.042
Direction Positive Positive

visualizing both models

p_age_scatter <- ggplot(brfss_slr, aes(x = age, y = phys_days)) +
  geom_point(alpha = 0.15, color = "purple", size = 1) +
  geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
  labs(
    title = "Physically Unhealthy Days vs. Age",
    x = "Age (years)",
    y = "Number of Days of physically unhealthy"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p_age_scatter)

(c) Which predictor has the stronger association? Compare R² values.


**Questions:**

a) How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?

Both predictors age and BMI show a positive association with poor physical health days, but the strength of the association differs. 
1. Both slopes are positive.
2. BMI slope is  0.145 days per BMI unit whereas Age slope is 0.131 days per year. But if we look at t-statistics, BMI: t-value is 5.01 but for age, it is 11.47 which shows that age has more than double the standardized effect size of BMI.

Both predictors are statistically significant (p < 0.001) but age has more stable association.

b) Compare the $R^2$ values of the two models. Which predictor explains more variability in `phys_days`?

The R² values for both the model are: 
1. BMI model: 0.0083 (0.83% variance explained)
2. Age model: 0.042 (4.2% variance explained)

Age explains five times more variance in phys_days than BMI. But even 4.2% is still very small. Age is a better predictor than BMI, but both predictors explain only a small fraction of the variability in poor physical health days.

c) Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?

Age is a stronger predictor than BMI as they show statistically significant positive associations. However, the outcome (phys_days) is highly influenced by many other factors (chronic illness, disability, stress, access to care, lifestyle, etc.).
As a result, simple linear regression captures only a tiny portion of the true variation.

Limitations of using simple linear regression for this outcome based on the diagnostics and residual plots 

- Nonlinearity: The relationship is curved, not straight.
- Heteroscedasticity: Variance increases with fitted values.
- Non-normal residuals: phys_days is a skewed count variable.
- Influential points: Many moderately influential observations.
- Low R²: Predictors explain very little variance.

These issues arise because phys_days is a bounded variable (0–30) and right‑skewed. 

---

# Session Info


``` r
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_2.5.0  ggeffects_2.3.2  broom_1.0.11     plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       here_1.0.2       haven_2.5.5     
##  [9] lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0    dplyr_1.2.0     
## [13] purrr_1.2.1      readr_2.1.6      tidyr_1.3.2      tibble_3.3.1    
## [17] ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       htmlwidgets_1.6.4 
##  [5] insight_1.4.5      lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2   
##  [9] vctrs_0.7.1        tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3   
## [13] Matrix_1.7-4       data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1          
## [17] gt_1.3.0           lifecycle_1.0.5    compiler_4.5.2     farver_2.1.2      
## [21] textshaping_1.0.4  litedown_0.9       htmltools_0.5.9    sass_0.4.10       
## [25] yaml_2.3.12        lazyeval_0.2.2     pillar_1.11.1      jquerylib_0.1.4   
## [29] cachem_1.1.0       nlme_3.1-168       commonmark_2.0.0   tidyselect_1.2.1  
## [33] digest_0.6.39      stringi_1.8.7      splines_4.5.2      labeling_0.4.3    
## [37] rprojroot_2.1.1    fastmap_1.2.0      grid_4.5.2         cli_3.6.5         
## [41] magrittr_2.0.4     cards_0.7.1        withr_3.0.2        scales_1.4.0      
## [45] backports_1.5.0    timechange_0.3.0   rmarkdown_2.30     httr_1.4.7        
## [49] otel_0.2.0         hms_1.1.4          evaluate_1.0.5     viridisLite_0.4.2 
## [53] mgcv_1.9-4         markdown_2.0       rlang_1.1.7        glue_1.8.0        
## [57] xml2_1.5.2         svglite_2.2.2      rstudioapi_0.18.0  jsonlite_2.0.0    
## [61] R6_2.6.1           systemfonts_1.3.1  fs_1.6.6