knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 10,
  fig.height = 6
)

Introduction

Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to:

Quantify the linear relationship between a continuous outcome and a single predictor
Predict values of an outcome given a predictor value
Test hypotheses about whether a predictor is associated with an outcome
Lay the groundwork for multiple regression, which controls for confounding

In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.

Setup and Data

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)

Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)

# Load raw BRFSS 2020 data
file.exists("/Users/morganwheat/Desktop/R/LLCP2020.XPT copy")

## [1] TRUE

list.files("/Users/morganwheat/Desktop/R")

## [1] "LLCP2020.XPT "      "LLCP2020.XPT  copy" "LLCP2020.XPT copy"

dir.exists("/Users/morganwheat/Desktop/R")

## [1] TRUE

brfss_full <- read_xpt(
  "/Users/morganwheat/Desktop/R/LLCP2020.XPT  copy"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab_new <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab_new <- brfss_lab_new %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab_new <- brfss_lab_new %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab_new, here::here(
    "/Users/morganwheat/Desktop/R/LLCP2020.XPT copy"
))

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
tibble(
  Metric = c("Observations", "Variables"),
  Value  = c(nrow(brfss_lab_new), ncol(brfss_lab_new))
) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	3000
Variables	5

brfss_lab_new %>%
  select(phys_days, bmi) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Key Continuous Variables
	phys_days	bmi
	Min. : 1.00	Min. :14.91
	1st Qu.: 2.00	1st Qu.:24.41
	Median : 7.00	Median :28.33
	Mean :12.03	Mean :29.47
	3rd Qu.:21.00	3rd Qu.:33.47
	Max. :30.00	Max. :58.24

brfss_lab_new %>%
  select(phys_days, bmi) %>%
  tbl_summary(
    label = list(
      phys_days ~ "Poor Physical Health (days)",
      bmi ~ "BMI (kg/m²)"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")

**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**
Characteristic	N	N = 3,000¹
Poor Physical Health (days)	3,000	12.0 (11.2)
BMI (kg/m²)	3,000	29.5 (6.9)
¹ Mean (SD)

# (b) Create a histogram of phys_days — describe the distribution
p_hist <- ggplot(brfss_lab_new, aes(x = phys_days)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "red", linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = mean(brfss_lab_new$phys_days),
                            sd = sd(brfss_lab_new$phys_days)),
                color = "black", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Distribution of Number of Poor Physical Health Days in the Past 30 Days",
    subtitle = "Red = kernel density | Black dashed = normal distribution",
    x = "Poor Physical Health (days)",
    y = "Density"
  ) +
  theme_minimal(base_size = 13)

p_hist

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_lab_new, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "BMI vs. Poor Physical Health Days (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health (Days)",
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days?

The mean of ‘phys_days’ is 12.03 and the mean of ‘bmi’ is 29.46. The standard deviation of ‘phys-days’ is 11.2 and the standard deviation for BMI is 6.9. According to the histogram, the distribution appears to be strongly skewed to the right and does not follow a normal distribution.

Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers?

Based on the scatter plot, the LOESS smoother (orange) does not track the linear fit (red) very closely as there is departure towards the ends, meaning a linear model is not entirely reasonable.There are obvious outliers.

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
new_model_slr <- lm(phys_days ~ bmi, data = brfss_lab_new)
summary(new_model_slr)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07

# (b) Display a tidy coefficient table with 95% CIs
tidy(new_model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression:Poor Physical Health ~ BMI (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression:Poor Physical Health ~ BMI (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.5193	0.8978	8.3751	0	5.7589	9.2796
bmi	0.1531	0.0297	5.1584	0	0.0949	0.2112

# (c) Extract and report: slope, intercept, t-statistic, p-value
# Extract slope test statistics
slope_test <- tidy(new_model_slr, conf.int = TRUE) %>%
   filter(term == "bmi")

tibble(
  Quantity = c("Slope (b1)", "t-statistic", "p-value"),
  Value = c(
    round(slope_test$estimate, 4),
    round(slope_test$statistic, 3),
    format.pval(slope_test$p.value, digits = 3)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b1)	0.1531
t-statistic	5.158
p-value	2.65e-07

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\).

\(\hat{Poor Physical Health Days} = 7.5193 + 0.1531 X\)BMI.

Interpret the slope (\(b_1\)) in context — what does it mean in plain English?

Slope (b_1 = 0.1531): For each 1 unit increase in BMI, the predicted number of Poor Physical health Days is estimated to increase by 0.1531 days, on average holding else constant (though there are no other variables in this simple model.)

Is the intercept (\(b_0\)) interpretable in this context? Why or why not?

Intercept (b_0 = 7.5193): The intercept represents the estimated mean number of Poor Physical Health Days when BMI = 0. Since a BMI of 0 is not a realistic or possible value for any individual, the intercept is not practically interpretative in this context.

Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.

The association is statistically significant since p < 0.05, you can reject the null hypothesis. The null hypothesis (H₀): β₁ = 0) is that there is no linear relationship between BMI and Poor Physical Health Days. The the test statistic is (t = 5.158) and the p-value is (p = 2.65e-07).

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
# ANOVA decomposition
new_anova_slr <- anova(new_model_slr)

new_anova_slr %>%
  kable(
    caption = "ANOVA Table: Poor Physical Health ~ BMI",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: Poor Physical Health ~ BMI
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3314.969	3314.969	26.609	0
Residuals	2998	373487.391	124.579	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
anova_table <- anova(new_model_slr)

SSregression <- anova_table["bmi", "Sum Sq"]
SSresidual <- anova_table["Residuals", "Sum Sq"]
SStotal <- SSregression + SSresidual

tibble(
  Quantity = c("SS Regression", "SS Residual", "SS Total"),
  Value    = c(round(SSregression, 2), round(SSresidual, 2), round(SStotal, 2)),
) %>%
  kable(caption = "SS Total, SS Regression, and SS Residual Results") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

SS Total, SS Regression, and SS Residual Results
Quantity	Value
SS Regression	3314.97
SS Residual	373487.39
SS Total	376802.36

# (c) Compute R² two ways: from the model object and from the SS decomposition
# Extract R-squared from model
r_sq <- summary(new_model_slr)$r.squared
adj_r_sq <- summary(new_model_slr)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0088
Adjusted R²	0.0085
Variance Explained	0.88%

Questions:

Fill in the ANOVA table components: \(SS_{Total}\) = 376802.36, \(SS_{Regression}\) = 3314.97, \(SS_{Residual}\) = 373487.39, \(df\) = 1, and \(F\)-statistic = 26.609.
What is the \(R^2\) value? Interpret it in plain English.

The R-squared value is 0.0088. The proportion of total variability in Poor Physical Health Days explained by the linear regression in BMI. In plain English, BMI explains 0.88% of the variability in the number of Poor Physical Health Days.

What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation?

Since the R-squared is close to zero, this means that BMI explains little to none of the variability in Poor Physical Health Days. Unmeasured factors outside of BMI such as pre-existing medical conditions and poor working environments may explain the remaining variation.

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
# Compute 95% CI for a person with BMI = 25
new_bmi_CI <- data.frame(bmi = 25)

ci_pred <- predict(new_model_slr, newdata = new_bmi_CI, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

results_table <- bind_cols(new_bmi_CI, ci_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Value and 95% Confidence Interval for BMI = 25",
    col.names = c("BMI", "Fitted BMI", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2))

Fitted Value and 95% Confidence Interval for BMI = 25
		95% CI for Mean
BMI	Fitted BMI	CI Lower	CI Upper
25	11.35	10.87	11.82

# (b) Calculate the 95% prediction interval for a person with BMI = 25
# Compute 95% PI for a person with BMI = 25
new_bmi_PI <- data.frame(bmi = 25)

pi_pred <- predict(new_model_slr, newdata = new_bmi_PI, interval = "prediction") %>%
  as.data.frame() %>%
  rename(Fitted = fit, PI_Lower = lwr, PI_Upper = upr)

results_table <- bind_cols(new_bmi_PI, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Value and 95% Prediction Interval for BMI = 25",
    col.names = c("BMI", "Fitted BMI", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% PI for Individual" = 2))

Fitted Value and 95% Prediction Interval for BMI = 25
		95% PI for Individual
BMI	Fitted BMI	PI Lower	PI Upper
25	11.35	-10.54	33.24

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(18, 80, length.out = 200))

ci_band <- predict(new_model_slr, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(new_model_slr, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_lab_new, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health (days)",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

For someone with a BMI of 25, the estimated mean number of poor physical health days is 11.35 days with a 95% CI [10.87,11.82].

If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days?

If a specific new person has a BMI of 25, the 95% prediction interval for their number of poor physical health days is [-10.54, 33.24]

Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice?

The prediction interval is wider than the confidence interval because it accounts for both the uncertainty in E(Y) and the individual variability around the mean. If you want to estimate the average number of poor physical health days for all individuals in the population with a BMI of 25, use the confidence interval. If you want to predict the number of poor physical health days of a specific individual with a BMI of 25, use the prediction interval.

In summary, use the confidence interval when you want a estimate of the average outcome for a population, and use a prediction interval when you want to predict the outcome for a specific individual.

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(new_model_slr, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

# (b) Create a residuals vs. fitted plot using ggplot
# Augment dataset with fitted values and residuals
augmented <- augment(new_model_slr)

p_resid_x <- ggplot(augmented, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. Fitted",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "Fitted",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_x

# (c) Create a normal Q-Q plot of residuals using ggplot
# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq

# (d) Create a Cook's distance plot
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(new_model_slr),
    influential = ifelse(cooks_d > 4 / n(), "Potentially influential", "Not influential")
  )

# Count influential observations
n_influential <- sum(augmented$cooks_d > 4 / nrow(augmented))

# Cook's distance plot
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / nrow(augmented), linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see.

There is evidence of nonlinearity as the points are not all randomly scattered around zero. However, since the spread of residuals does not appear to form a “cone” shape, it indicates that there is no heteroscedasticity.

Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days?

The residuals are not approximately normal.The residuals show an S-shaped pattern since there is a heavy upper and heavy lower tail suggesting a non-normal distribution for “phys_days”.

Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them?

Yes, there here are 133 potentially influential observations (Cook’s D > 4/n). If these influential observations were not determined to be a result of data entry errors, I would then run a sensitivity analysis and apply transformations as necessary.

Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.)

Overall, the LINE assumptions do not appear to be fully met. The linearity assumption may be most problematic for this model since the outcome variable which is poor physical health days, is a countable variable rather than a continuous one.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
# Fit simple linear regression: Poor Physical Health ~ Age
newer_model_slr <- lm(phys_days ~ age, data = brfss_lab_new)

# Summary output
summary(newer_model_slr)

## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.354  -8.911  -4.432  10.992  22.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.33587    0.68259   6.352 2.45e-10 ***
## age          0.13773    0.01168  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared:  0.04431,    Adjusted R-squared:  0.04399 
## F-statistic:   139 on 1 and 2998 DF,  p-value: < 2.2e-16

# (b) Display results and compare to the BMI model
# Tidy coefficient table
tidy(newer_model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3359	0.6826	6.3521	0	2.9975	5.6743
age	0.1377	0.0117	11.7893	0	0.1148	0.1606

# Create scatterplot
p_scatter <- ggplot(brfss_lab_new, aes(x = age, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Poor Physical Health Days vs. Age (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "Age (years)",
    y = "Poor Physical Health (days)"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

# (c) Which predictor has the stronger association? Compare R² values.
r_sq <- summary(newer_model_slr)$r.squared
adj_r_sq <- summary(newer_model_slr)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0443
Adjusted R²	0.044
Variance Explained	4.43%

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?

The association between age and poor physical health days is stronger compared to the BMI association. Both predictors have statistically significant positive associations with poor physical health days (p < 0.001). BMI explains 0.88% of the variance in poor physical health days, while age explains 4.43%. In terms of direction, both relationships are positive. For every 1-unit increase in BMI, poor physical health days increase by an estimated 0.1513 days, while each additional year of age increases poor physical health days by an estimated 0.1377 days.

Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days?

Age explains more variability in ‘phys_days’ than BMI. This is because the association between age and poor physical health days has a r-squared value of 0.0443 (4.43% variance explained), compared to the r-squared value of 0.0088 (0.88% of the variance explained) for the BMI association.

Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome?

Based on these two simple models, my overall conclusion is that age appears to be a better predictor of poor physical health compared to BMI. However, there are likely many external factors that influence the variability of poor physical health days than these two predictors. One limitation of simple linear regression is that it only considers one predictor at a time and does not account for potential confounding variables. Additionally, poor physical health days is a discrete as it is a countable variable, which may violate some assumptions of linear regression.

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_2.5.0  ggeffects_2.3.2  broom_1.0.11     plotly_4.12.0   
##  [5] kableExtra_1.4.0 knitr_1.51       here_1.0.2       haven_2.5.5     
##  [9] lubridate_1.9.4  forcats_1.0.1    stringr_1.6.0    dplyr_1.2.0     
## [13] purrr_1.2.1      readr_2.1.5      tidyr_1.3.2      tibble_3.3.0    
## [17] ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] insight_1.4.5      lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2   
##  [9] vctrs_0.7.1        tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3   
## [13] Matrix_1.7-4       data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1          
## [17] gt_1.3.0           lifecycle_1.0.5    compiler_4.5.2     farver_2.1.2      
## [21] textshaping_1.0.4  janitor_2.2.1      snakecase_0.11.1   litedown_0.9      
## [25] htmltools_0.5.9    sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2    
## [29] pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0       nlme_3.1-168      
## [33] commonmark_2.0.0   tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7     
## [37] splines_4.5.2      labeling_0.4.3     rprojroot_2.1.1    fastmap_1.2.0     
## [41] grid_4.5.2         cli_3.6.5          magrittr_2.0.4     cards_0.7.1       
## [45] withr_3.0.2        scales_1.4.0       backports_1.5.0    timechange_0.3.0  
## [49] rmarkdown_2.30     httr_1.4.7         otel_0.2.0         hms_1.1.3         
## [53] evaluate_1.0.5     viridisLite_0.4.2  mgcv_1.9-3         markdown_2.0      
## [57] rlang_1.1.7        glue_1.8.0         xml2_1.5.2         svglite_2.2.2     
## [61] rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1 
## [65] fs_1.6.6

Simple Linear Regression

EPI 553 - Advanced Epidemiologic Methods

Morgan Wheat

February 24 & 26, 2026

Introduction

Setup and Data

Part 2: Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info