Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup and Data

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
library(plotly)
library (gtsummary)

# Load BRFSS 2020 analytic dataset
brfss_slr <- read_rds("/Users/nataliasmall/Downloads/EPI 553/brfss_slr_2020.rds") %>%
  janitor::clean_names()

# Display dataset dimensions
names(brfss_slr)

## [1] "bmi"            "age"            "sex"            "education"     
## [5] "gen_health_num" "sleep_hrs"      "phys_days"

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
summ_table <- brfss_slr %>%
  select(bmi, phys_days) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: phys_days and bmi") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

summ_table

Descriptive Statistics: phys_days and bmi
	bmi	phys_days
	Min. :14.63	Min. : 1.00
	1st Qu.:24.32	1st Qu.: 2.00
	Median :27.89	Median : 6.00
	Mean :29.18	Mean :11.66
	3rd Qu.:32.89	3rd Qu.:20.00
	Max. :59.60	Max. :30.00

# Descriptive statistics with SD and Mean
brfss_slr %>%
  select(bmi, phys_days) %>%
  tbl_summary(
    label = list(
      bmi ~ "BMI (kg/m²)",
      phys_days ~ "Poor Physical Health Days"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")

**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**
Characteristic	N	N = 3,000¹
BMI (kg/m²)	3,000	29.2 (7.0)
Poor Physical Health Days	3,000	11.7 (11.2)
¹ Mean (SD)

# (b) Create a histogram of phys_days — describe the distribution
p_hist <- ggplot(brfss_slr, aes(x = phys_days)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Poor Physical Health Days",
    subtitle = "Number of Poor Physical Health Days",
    x = "Poor Physical Health Days",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_slr, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Poor Physical Health Days vs. BMI (BRFSS 2020)",
    subtitle = "Red = Linear fit | Blue Dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days?

Mean phys_days: 11.7, SD: 11.2
Mean BMI: 29.2 kg/m², SD: 7.0 kg/m²
Distribution of phys_days appears to be rightly skewed, indicating non-normal distribution. There are two distinctive peaks in distribution, the first at 30 phys_days and the second at 2 phys_days.

Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? Based on the scatter plot, since the LOESS smoother (blue dashed) closely tracks the linear fit (red), a linear relationship between BMI and poor physical health is reasonable. There aren’t any obvious outliers.

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_slr <- lm(phys_days ~ bmi, data = brfss_slr)
summary(model_slr)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_slr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.808  -9.160  -5.623   8.943  20.453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.42285    0.86881   8.544  < 2e-16 ***
## bmi          0.14513    0.02895   5.013 5.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared:  0.008314,   Adjusted R-squared:  0.007983 
## F-statistic: 25.13 on 1 and 2998 DF,  p-value: 5.659e-07

# (b) Display a tidy coefficient table with 95% CIs
tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.4228	0.8688	8.5437	0	5.7193	9.1264
bmi	0.1451	0.0289	5.0134	0	0.0884	0.2019

# (c) Extract and report: slope, intercept, t-statistic, p-value
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "bmi")

n <- nrow(brfss_slr)

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    n - 2,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0): From SLR model, phys_days ~ bmi") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0): From SLR model, phys_days ~ bmi
Quantity	Value
Slope (b₁)	0.1451
SE(b₁)	0.0289
t-statistic	5.013
Degrees of freedom	2998
p-value	5.66e-07
95% CI Lower	0.0884
95% CI Upper	0.2019

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\).

phys_days = 7.4228 + 0.1451 × BMI

Interpret the slope (\(b_1\)) in context — what does it mean in plain English? Slope (𝑏1 = 0.1451): For each 1-unit increase in BMI, number of poor physical health days is estimated to increase by 0.1451, on average, holding all else constant (though there are no other variables in this simple model).

Plain English: For every unit increase in BMI, it is assumed that people will report about a 0.15 increase in number of poor physical health days.

Is the intercept (\(b_0\)) interpretable in this context? Why or why not? Intercept (𝑏0 = 7.4228): The estimated mean phys_days when bmi = 0. This is a mathematical artifact — no living human being can have a bmi = 0, so no amount of physical health days would be reasonable. The intercept is not directly interpretable in this context, but is necessary to anchor the line.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value.

null hypothesis: 𝐻0:𝛽1=0 (no linear relationship between phys_days (Y) and bmi (X))
test statistic: 5.013
p-value: 5.66e-07 (< 0.05)
since p-value = 5.66e-07 (< 0.05), we reject null hypothesis. It is statistically significant that BMI is positively associated with number of poor physical health days among U.S. adults.

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_slr <- anova(model_slr)

anova_slr %>%
  kable(
    caption = "ANOVA Table: phys_days ~ bmi",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: phys_days ~ bmi
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3105.365	3105.365	25.134	0
Residuals	2998	370411.743	123.553	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
# Augment dataset
augmented <- augment(model_slr)

n <- nrow(brfss_slr)
ss_total <- sum((brfss_slr$phys_days - mean(brfss_slr$phys_days))^2)
ss_reg   <- sum((augmented$.fitted - mean(brfss_slr$phys_days))^2)
ss_resid <- sum(augmented$.resid^2)

tibble(
  Quantity = c("SSTotal", "SSRegression", "SSResidual"),
  Value    = c(round(ss_total, 2), round(ss_reg, 3), round(ss_resid, 3)),
) %>%
  kable(caption = "ANOVA Decomposition") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

ANOVA Decomposition
Quantity	Value
SSTotal	373517.110
SSRegression	3105.365
SSResidual	370411.743

# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq     <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
ss_r_sq  <- ss_reg / ss_total

tibble(
  Metric = c("R²", "Adjusted R²", "SS decomposition R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    round(ss_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "Model R², Adjusted Model R², and SS decomposition R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model R², Adjusted Model R², and SS decomposition R²
Metric	Value
R²	0.0083
Adjusted R²	0.008
SS decomposition R²	0.0083
Variance Explained	0.83%

Questions:

Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic.

SSTotal: 373517.110
SSRegression: 3105.365
SSResidual: 370411.743
BMI df: 1
Residual df: 2998
Total df: 2999
F-Statistic: 25.134

What is the \(R^2\) value? Interpret it in plain English.

R²: 0.0083 BMI explains 0.83% of variability in the number of poor physical health days among US adults.

What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? This tells us that BMI is a weak predictor of poor physical health days. While statistically significant, practical significance is very low, remaining variation may be better explained by other social/health factors.

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(25))

ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals for BMI = 25",
    col.names = c("BMI", "Fitted phys_days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals for BMI = 25
		95% CI for Mean		95% PI for Individual
BMI	Fitted phys_days	CI Lower	CI Upper	PI Lower	PI Upper
25	11.05	10.59	11.51	-10.75	32.85

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(min(brfss_slr$bmi), max(brfss_slr$bmi), length.out = 200))

ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_slr, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: Poor Physical Health Days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

Mean: 11.05
95% CI: [10.59, 11.51]

If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days?

95% PI: [-10.75, 32.85] If a specific new person has a BMI of 25, the 95% prediction interval for their number of poor physical health days is -10.75 to 32.85 days.

Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? The prediction interval (PI) is wider than the confidence interval (CI) because it accounts for both the uncertainty in the mean and the individual variability. You would use a CI when you need a range of plausible values for the population, and a PI for a range for a single new observation.

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

# (b) Create a residuals vs. fitted plot using ggplot
augmented <- augment(model_slr)

p_resid_fit <- ggplot(augmented, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. Fitted phys_days",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "Fitted Poor Physical Health Days",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_fit

# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq

# (d) Create a Cook's distance plot
n <- nrow(brfss_slr)

augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_slr),
    influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / n)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. There is evidence of non linearity. The LOESS smooth line clearly shows a curve indicating a non-linear relationship between BMI and poor physical health days.
Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? Points deviate from the line, especially in the upper and lower tails, creating a S-shape curve significantly non-normality. Departures from normality in this context suggest that the distribution of number of poor physical health days is highly skewed.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Yes, there are 136 potentially influential observations. You could transform them using Log or square root. You could also assess the 136 data points to determine if they are valid or errors made during entry.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) No, LINE assumptions do not appear to be met. Normality and linearity assumptions are violated. Normality and Linearity assumptions may be the most problematic for this model since the outcome variable is bounded and highly skewed.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_slr)
summary(model_age)

## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_slr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.872  -8.803  -4.733   9.460  23.267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.37023    0.66608   6.561 6.27e-11 ***
## age          0.13127    0.01145  11.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared:  0.04202,    Adjusted R-squared:  0.0417 
## F-statistic: 131.5 on 1 and 2998 DF,  p-value: < 2.2e-16

# (b) Display results and compare to the BMI model
# b.1 display results
tidy(model_age, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: Poor Physical Health Days ~ Age (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3702	0.6661	6.5611	0	3.0642	5.6762
age	0.1313	0.0114	11.4675	0	0.1088	0.1537

# b.2 compare to the BMI model
r2_bmi <- summary(model_slr)$r.squared
r2_age <- summary(model_age)$r.squared

comp_tbl <- tibble(
  Model = c("BMI model", "Age model"),
  Predictor = c("BMI", "Age"),
  R_squared = round(c(r2_bmi, r2_age), 4),
  Variance_Explained = paste0(round(c(r2_bmi, r2_age) * 100, 2), "%")
) %>%
  kable(caption = "Comparison: Age Model vs BMI Model") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

comp_tbl

Comparison: Age Model vs BMI Model
Model	Predictor	R_squared	Variance_Explained
BMI model	BMI	0.0083	0.83%
Age model	Age	0.0420	4.2%

# (c) Which predictor has the stronger association? Compare R² values.
r2_age <- summary(model_age)$r.squared
r2_age

## [1] 0.04202056

r2_bmi <- summary(model_slr)$r.squared
r2_bmi

## [1] 0.008313849

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Both age and BMI have positive and statistical significant relationships with poor physical health days (p < 0.001). But, Age has a stronger magnitude and statistical significance. Age also has a slope of 0.1313. Age had a much higher F-Statistic, 131.5, than BMI, 25.134. Also, age has a higher R² value, 0.0420, than BMI, 0.0083.
Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days?

BMI model R²: 0.0083
Age model R²: 0.0420
BMI model Variance Explained: 0.83%
Age model Variance Explained: 4.2% Age as a predictor explains more variability in number of poor physical health days.

Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? Overall, while both predictors are statistically significant, age is a stronger predictor of number of poor physical health days, compared to BMI. The limitations of using simple linear regression for this outcome is that in using cross-sectional survey data like BRFSS, observations from the same household or geographic cluster may not be fully independent. Also there are limattaiosn due to violations of Normality and Linearity because of the bounded nature of phys_days.

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gtsummary_2.5.0  plotly_4.12.0    broom_1.0.8      kableExtra_1.4.0
##  [5] knitr_1.50       here_1.0.2       haven_2.5.4      lubridate_1.9.4 
##  [9] forcats_1.0.0    stringr_1.5.1    dplyr_1.2.0      purrr_1.0.4     
## [13] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_4.0.2   
## [17] tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2    vctrs_0.7.1       
##  [9] tools_4.5.2        generics_0.1.4     pkgconfig_2.0.3    Matrix_1.7-4      
## [13] data.table_1.17.4  RColorBrewer_1.1-3 S7_0.2.1           gt_1.3.0          
## [17] lifecycle_1.0.5    compiler_4.5.2     farver_2.1.2       textshaping_1.0.1 
## [21] janitor_2.2.1      snakecase_0.11.1   litedown_0.9       htmltools_0.5.8.1 
## [25] sass_0.4.10        yaml_2.3.10        lazyeval_0.2.2     pillar_1.10.2     
## [29] jquerylib_0.1.4    cachem_1.1.0       nlme_3.1-168       commonmark_2.0.0  
## [33] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7      splines_4.5.2     
## [37] labeling_0.4.3     rprojroot_2.1.1    fastmap_1.2.0      grid_4.5.2        
## [41] cli_3.6.5          magrittr_2.0.3     cards_0.7.1        withr_3.0.2       
## [45] scales_1.4.0       backports_1.5.0    timechange_0.3.0   rmarkdown_2.29    
## [49] httr_1.4.7         hms_1.1.3          evaluate_1.0.3     viridisLite_0.4.2 
## [53] mgcv_1.9-3         markdown_2.0       rlang_1.1.7        glue_1.8.0        
## [57] xml2_1.3.8         svglite_2.2.2      rstudioapi_0.17.1  jsonlite_2.0.0    
## [61] R6_2.6.1           systemfonts_1.3.1  fs_1.6.6

Simple Linear Regression

EPI 553 - Advanced Epidemiologic Methods

Natalia Small

February 24 & 26, 2026

Lab Activity

Overview

Setup and Data

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Session Info