Introduction

Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:

Quantify the linear relationship between a continuous outcome and a single predictor
Predict values of an outcome given a predictor value
Test hypotheses about whether a predictor is associated with an outcome
Lay the groundwork for multiple regression, which controls for confounding

In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.

Setup and Data

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)

Loading BRFSS 2020 Data

We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.

# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/LLCP2020.XPT"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_slr <- brfss_full %>%
  select(bmi5, age80, sex, educag, genhlth, physhlth, sleptim1)

# Recode variables
brfss_slr <- brfss_slr %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    education = factor(case_when(
      educag == 1 ~ "< High school",
      educag == 2 ~ "High school graduate",
      educag == 3 ~ "Some college",
      educag == 4 ~ "College graduate"
    ), levels = c("< High school", "High school graduate",
                  "Some college", "College graduate")),
    gen_health_num = ifelse(genhlth  %in% 1:5,  genhlth,  NA_real_),
    sleep_hrs      = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_),
    phys_days      = ifelse(physhlth %in% 0:30, physhlth, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_slr <- brfss_slr %>%
  select(bmi, age, sex, education, gen_health_num, sleep_hrs, phys_days) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_slr, here::here(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/brfss_slr_2020.rds"
))

tibble(
  Metric = c("Observations", "Variables"),
  Value  = c(nrow(brfss_slr), ncol(brfss_slr))
) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	3000
Variables	7

Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)

# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/LLCP2020.XPT"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab, here::here("C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/brfss_lab_2020.rds"))

brfss_lab <- readRDS(
  here::here("C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/brfss_lab_2020.rds")
)

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  tbl_summary(
    label = list(
      bmi ~ "BMI (kg/m²)",
      age ~ "Age (years)",
      sex ~ "Sex",
      phys_days ~ "Poor Physical health (days)",
      sleep_hrs ~ "Sleep (hours/night)"
          ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")

**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**
Characteristic	N	N = 3,000¹
BMI (kg/m²)	3,000	29.5 (6.9)
Age (years)	3,000	55.9 (17.1)
Sex	3,000
Female		1,672 (56%)
Male		1,328 (44%)
Poor Physical health (days)	3,000	12.0 (11.2)
Sleep (hours/night)	3,000	6.9 (1.7)
¹ Mean (SD); n (%)

# (b) Create a histogram of phys_days — describe the distribution
p_scatter <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "BMI vs. Poor Physical health days (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y= "Poor Physical health (days)"
  ) +
  theme_minimal()

ggplotly(p_scatter)

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Poor Physical health days vs. BMI (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "Poor Physical health (days)"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days? The mean and SD of Phys_days = 12.0 (11.2) and the mean and SD of BMI = 29.5 (6.9)
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? Based on the scatter plot the LOESS smoother follows data without assuming linearity. Departures suggest non-linearity —

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_lab <- lm(phys_days ~ bmi, data = brfss_lab)

# Summary output
summary(model_lab)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07

# (b) Display a tidy coefficient table with 95% CIs
tidy(model_lab, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: BMI ~ poor physical health (days) (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: BMI ~ poor physical health (days) (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.5193	0.8978	8.3751	0	5.7589	9.2796
bmi	0.1531	0.0297	5.1584	0	0.0949	0.2112

# (c) Extract and report: slope, intercept, t-statistic, p-value
b0 <- round(coef(model_lab)[1], 3)
b1 <- round(coef(model_lab)[2], 4)

Questions:

Write the fitted regression equation in the form $ Days of Phys activity$ = 7.5193 + 0.1531 X BMI
Interpret the slope ($b_1$) in context — what does it mean in plain English? Slope(b1= 0.1531): For each 1-$kg/m^2$ increase in BMI, days of physical activity is estimated to increase by 0.1531 days, on average, holding all else constant (though there are no other variables in this simple model).
Is the intercept ($b_0$) interpretable in this context? Why or why not? Intercept ($b_0$)=:7.5193. The estimated mean days of physical activity when BMI=0. This is a mathematical artifact and intercept is not directly interpretable in this context, but is necessary to anchor the line (A BMI of 0 is not a realistic or meaningful value for adults).
Is the association statistically significant at $\alpha = 0.05$? State the null hypothesis, test statistic, and p-value. This association is statistically significant at $\alpha = 0.05$ Null hypothesis: There is no linear association between Days of Poor Physical health and BMI. t-statistic = 5.1584 and p-value <0.05. We reject Null hypothesis.

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_lab <- anova(model_lab)

anova_lab %>%
  kable(
    caption = "ANOVA Table: Poor Physical health days ~ BMI",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: Poor Physical health days ~ BMI
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3314.969	3314.969	26.609	0
Residuals	2998	373487.391	124.579	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
# Mean of response variable
y_bar <- mean(brfss_lab$phys_days)

# Total Sum of Squares
SST <- sum((brfss_lab$phys_days - y_bar)^2)

# Regression Sum of Squares (explained variation)
SSR <- sum((fitted(model_lab) - y_bar)^2)

# Residual Sum of Squares (unexplained variation)
SSE <- sum(residuals(model_lab)^2)

# Display results
SS_table <- data.frame(
  SSTotal = round(SST, 3),
  SSRegression = round(SSR, 3),
  SSResidual = round(SSE, 3)
)

SS_table %>%
  kable(
    caption = "SSTotal, SSRegression, and SSResidual",
    align = "c"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

SSTotal, SSRegression, and SSResidual
SSTotal	SSRegression	SSResidual
376802.4	3314.969	373487.4

#Mean Squared Error (MSE) and σ^
n <- nrow(brfss_lab)
SSResidual <- SSE
mse <- SSResidual / (n - 2)
sigma_hat <- sqrt(mse)

tibble(
  Quantity = c("SSResidual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(SSResidual, 2), round(mse, 3), round(sigma_hat, 3)),
  Units    = c("", "", "days")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value	Units
SSResidual	373487.390
MSE (σ̂²)	124.57

# (c) Compute R² two ways: from the model object and from the SS decomposition

# Extract R-squared from model
r_sq <- summary(model_lab)$r.squared
adj_r_sq <- summary(model_lab)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0088
Adjusted R²	0.0085
Variance Explained	0.88%

Questions:

Fill in the ANOVA table components: $SS_{Total}$, $SS_{Regression}$, $SS_{Residual}$, $df$, and $F$-statistic. $SS_{Total}$=$SS_{Regression}$+ $SS_{Residual}$= 3314.969+373487.4= 376802.4 $df$_total= n - 1 = 3000 - 1 = 2999 $df$_regression= 1 $df$_residuals = n-2 = 3000 - 2 = 2998 $F$-value = 26.609
What is the $R^2$ value? Interpret it in plain English. $R^2$ = 0.0088. This means that 0.88% of the outcome variance is explained by BMI, the rest >99% explained by other factors.
What does this $R^2$ tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? This means that 0.88% of the variance in days of poor physical health is explained by BMI, the rest >99% explained by other factors.

library(broom)

augmented <- augment(model_lab)

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = 25)
# Confidence interval for the mean response
ci_pred <- predict(model_lab, newdata = new_bmi, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
# Prediction interval for an individual
pi_pred <- predict(model_lab, newdata = new_bmi, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr)
  
#combine results
bmi_results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

# (b) Calculate the 95% prediction interval for a person with BMI = 25
bmi_results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI",
    col.names = c("BMI", "Fitted", "CI Lower", "CI Upper", "fit", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 1, "95% CI for Mean" = 3, "95% PI for Individual" = 3))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI
	95% CI for Mean			95% PI for Individual
BMI	Fitted	CI Lower	CI Upper	fit	PI Lower	PI Upper
25	11.35	10.87	11.82	11.35	-10.54	33.24

# (c) Plot the regression line with both the CI band and PI band

bmi_grid <- data.frame(bmi = 25, length.out = 200)

ci_band <- predict(model_lab, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_lab, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)
p_ci_pi<- ggplot() +
  geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: BMI ~ Poor Physical health days",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Poor Physical health (days)",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)
p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? For a person with BMI=25 the estimated mean of poor physical health days = 11.35 and 95% CI (10.87; 11.82)
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? If a specific new person has BMI of 25 their estimated days of poor physical health will be somewhere in between 95% PI (-10.54 and 33.24)
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? It is always wider than the confidence interval because it accounts for both the uncertainty in E(Y) and the individual variability around the mean.

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_lab, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

# (b) Create a residuals vs. fitted plot using ggplot
p_resid_x <- ggplot(augmented, aes(x = bmi, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. BMI",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "BMI (kg/m^2)",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_x

# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq

# (d) Create a Cook's distance plot
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_lab),
    influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / n)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. We observe a slightly downwards directed red line, so we can conclude that there is linearity. But we do not observe a random scatter with no pattern, therefore we can conclude that there is evidence of heteroscedasticity.
Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? the QQ plot shows an ‘S’ shaped pattern, therefore we can conclude that residuals are not approximately normally distributed. Days of poor physical health departures from normality, because of as skewness or heavier tails than expected. There are maybe skeweness or influential outliers in days of poor physical health days (there are unusually low or high numbers of days with poor physical health, which affects the residual distribution.).
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? There are many influential observations, too many to count. We need to examine if those are entry errors (like negative values of poor physical health days, it should be at least=0), we can remove them. But with valid influential outliers we need to run the sensitivity analysis, to check whether sample is big enough for robustness of regression analysis conclusion.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) Overall, LINE assumptions do not appear to be met. Normality and homoscedasticity do not seem to be met, and may be most problematic for this model.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_lab1 <- lm(phys_days ~ age, data = brfss_lab)

# (b) Display results and compare to the BMI model
summary(model_lab)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07

summary(model_lab1)

## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.354  -8.911  -4.432  10.992  22.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.33587    0.68259   6.352 2.45e-10 ***
## age          0.13773    0.01168  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared:  0.04431,    Adjusted R-squared:  0.04399 
## F-statistic:   139 on 1 and 2998 DF,  p-value: < 2.2e-16

bmi_table <- tidy(model_lab)
age_table <- tidy(model_lab1)
comparison <- rbind(
  data.frame(Model = "BMI Model", bmi_table),
  data.frame(Model = "Age Model", age_table)
)
comparison

##       Model        term  estimate  std.error statistic      p.value
## 1 BMI Model (Intercept) 7.5192652 0.89780706  8.375146 8.356564e-17
## 2 BMI Model         bmi 0.1530557 0.02967099  5.158431 2.652358e-07
## 3 Age Model (Intercept) 4.3358682 0.68259322  6.352053 2.446729e-10
## 4 Age Model         age 0.1377270 0.01168233 11.789343 2.162208e-31

# (c) Which predictor has the stronger association? Compare R² values.
r_sq <- summary(model_lab1)$r.squared
adj_r_sq <- summary(model_lab1)$adj.r.squared

summary(model_lab)$r.squared

## [1] 0.008797634

summary(model_lab1)$r.sqaured

## NULL

model_comparison <- data.frame(
  Predictor = c("BMI","Age"),
  R_squared = c(
    summary(model_lab)$r.squared,
    summary(model_lab1)$r.squared
  )
)

model_comparison

##   Predictor   R_squared
## 1       BMI 0.008797634
## 2       Age 0.044306381

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Both age and BMI predictors to poor physical health days have positive direction (both coefficients being positive). The association with BMI has slightly larger magnitude, but they both have practically same strength. Both are statistically significant at the 0.05 level, probably because of the large sample.
Compare the $R^2$ values of the two models. Which predictor explains more variability in phys_days? While $r^2$ of BMI is 0.0088 the $r^2$ is much higher = 0.0443. Meaning that Age explain more variability of the poor physical health days (4.43%) than BMI (0.88%).
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?

Based on these two simple models, both predictors of poor physical health days are statistically significant at the 0.05 level, which may be partly due to the large sample size. Both predictors have positive direction,meaning that poor physical health days tend to increase as age or BMI increases. But both explain not large portion of variance in the outcome, indicating that other factors explain variability in the outcome more (96% at best).

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_2.5.0  ggeffects_2.3.2  broom_1.0.11     plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       here_1.0.2       haven_2.5.5     
##  [9] lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0    dplyr_1.1.4     
## [13] purrr_1.2.1      readr_2.1.6      tidyr_1.3.2      tibble_3.3.1    
## [17] ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] insight_1.4.4      lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2   
##  [9] vctrs_0.7.1        tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3   
## [13] Matrix_1.7-4       data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1          
## [17] gt_1.3.0           lifecycle_1.0.5    compiler_4.5.2     farver_2.1.2      
## [21] textshaping_1.0.4  janitor_2.2.1      snakecase_0.11.1   litedown_0.9      
## [25] htmltools_0.5.9    sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2    
## [29] pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0       nlme_3.1-168      
## [33] commonmark_2.0.0   tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7     
## [37] labeling_0.4.3     splines_4.5.2      rprojroot_2.1.1    fastmap_1.2.0     
## [41] grid_4.5.2         cli_3.6.5          magrittr_2.0.4     cards_0.7.1       
## [45] withr_3.0.2        scales_1.4.0       backports_1.5.0    timechange_0.3.0  
## [49] rmarkdown_2.30     httr_1.4.7         otel_0.2.0         hms_1.1.4         
## [53] evaluate_1.0.5     viridisLite_0.4.2  mgcv_1.9-3         markdown_2.0      
## [57] rlang_1.1.7        glue_1.8.0         xml2_1.5.2         svglite_2.2.2     
## [61] rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1 
## [65] fs_1.6.6

Simple Linear Regression

EPI 553 - Advanced Epidemiologic Methods

Aigerim Azamatova

February 24 & 26, 2026

Introduction

Setup and Data

Loading BRFSS 2020 Data

Part 2: Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info