Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(plotly)

# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_2020.XPT"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab, here::here(
  "C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_2020.rds"
))

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi

brfss_lab %>%
  select(bmi, phys_days) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Key Continuous Variables
	bmi	phys_days
	Min. :14.91	Min. : 1.00
	1st Qu.:24.41	1st Qu.: 2.00
	Median :28.33	Median : 7.00
	Mean :29.47	Mean :12.03
	3rd Qu.:33.47	3rd Qu.:21.00
	Max. :58.24	Max. :30.00

brfss_lab %>%
  select(bmi, phys_days) %>%
  tbl_summary(
    label = list(
      bmi ~ "BMI (kg/m^2)",
      phys_days ~ "Number of Days Physical Health Was Not Good"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("Table 1. Descriptive Statistics (BRFSS 2020, n = 3000)")

Table 1. Descriptive Statistics (BRFSS 2020, n = 3000)
Characteristic	N	N = 3,000¹
BMI (kg/m^2)	3,000	29.5 (6.9)
Number of Days Physical Health Was Not Good	3,000	12.0 (11.2)
¹ Mean (SD)

# (b) Create a histogram of phys_days — describe the distribution

hist(brfss_lab$phys_days, main = "Bad Physical Days Distribution",
     xlab = "Number of Bad Physical days", ylab = "Frequency", col = "lightblue")

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)

p_scatter <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "BMI vs. Phys_days (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "Phys_days",
    y = "BMI (kg/m^2)"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_scatter)

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days? The mean for BMI is 29.5kg/m^2 and the standard deviation for BMI is 6.9 kg/m^2. The mean for phys_days is 12.0 days and the standard deviation is 11.2 days. The distribution of phys_days has a standard deviation and mean that are very close together. This indicates that there is substantial variability.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? The relationship between BMI and phys_days is positive but weak, and characterized by substantial variability and some outliers. The dashed line falls very close to the red diagonal line but there is deviation at the tails.

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi

model_slr <- lm(phys_days ~ bmi, data = brfss_lab)

# Summary output
summary(model_slr)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07

# (b) Display a tidy coefficient table with 95% CIs

tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.5193	0.8978	8.3751	0	5.7589	9.2796
bmi	0.1531	0.0297	5.1584	0	0.0949	0.2112

# (c) Extract and report: slope, intercept, t-statistic, p-value

slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term =="bmi")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    3000-2,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b₁)	0.1531
SE(b₁)	0.0297
t-statistic	5.158
Degrees of freedom	2998
p-value	2.65e-07
95% CI Lower	0.0949
95% CI Upper	0.2112

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\).

{Y}= 7.52 + 0.153X

Interpret the slope (\(b_1\)) in context — what does it mean in plain English? The slope estimate is 0.153 which means that for each one unit increase in BMI, individuals report about 0.15 additional days of poor physical health on average.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? The intercept is 7.52 which represents the predicted number of poor physical health days for an individual with BMI of zero which makes this realistically impossible. This intercept is not interpretable in a meaningful way and has no real world meaning.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. Null hypotheses:There is no association between BMI and the number of days physical health was not good. The p-value is 2.65e-07 which is less than 0.05 so we reject the null hypothesis. The relationship between BMI and the number of days physical health was not good is statistically significant. —

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table

anova_slr <- anova(model_slr)

anova_slr %>%
  kable(
    caption = "ANOVA Table: BMI ~ Phys_days",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: BMI ~ Phys_days
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3314.969	3314.969	26.609	0
Residuals	2998	373487.391	124.579	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
#SS Total:376,802.36
#SS Regression: 3314.969
#SS Residual: 373,487.391
  
# (c) Compute R² two ways: from the model object and from the SS decomposition

r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0088
Adjusted R²	0.0085
Variance Explained	0.88%

r_pearson <- cor(brfss_lab$phys_days, brfss_lab$bmi)
tibble(
  Quantity   = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
  Value      = c(
    round(r_pearson, 4),
    round(r_pearson^2, 4),
    round(r_sq, 4),
    as.character(round(r_pearson^2, 4) == round(r_sq, 4))
  )
) %>%
  kable(caption = "Pearson r vs. R² from Model") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Pearson r vs. R² from Model
Quantity	Value
Pearson r	0.0938
r² (from Pearson)	0.0088
R² (from model)	0.0088
r² = R²?	TRUE

Questions:

Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic.

SS regression = 3314.969 SS residual = 373,487.391 SS total = 376,802.360

What is the \(R^2\) value? Interpret it in plain English. The R^2 value of 0.0088 means that BMI explains about 0.88% of the variability in the number of days physical health was not good.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? BMI alone is not a strong predictor of poor physical health days. There is a 99.12% of variation that is unexplained. This could be due to other factors like lifestyle factors, chronic diseases, mental health, age and gender, etc.

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25

new_bmi <- data.frame(bmi = 25)

ci_pred <- predict(model_slr, newdata = new_bmi, interval = "confidence", level= 0.95) %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_pred <- predict(model_slr, newdata = new_bmi, interval = "prediction", level=0.95) %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI",
    col.names = c("BMI", "Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI
		95% CI for Mean		95% PI for Individual
BMI	Fitted BMI	CI Lower	CI Upper	PI Lower	PI Upper
25	11.35	10.87	11.82	-10.54	33.24

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(length.out = 200))

ci_band <- predict(model_slr, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_slr, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: Phys Days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m^2)",
    y = "Physical Activity Days",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? The estimated mean number of poor physical health days for someone with a BMI of 25 is 11.35 days. The 95% CI tells us the true mean for all people with BMI of 25 is likely between 10.87 and 11.87.
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? For the prediction interval we see that there is a much wider interval, ranging from -10.54 to 33.24, showing us that the individuals experiences vary widely, even though negative days are not possible in reality.
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? The prediction interval is much wider than the confidence interval because PI estimates the likely outcome for a single individual while CI estimates the mean outcome for the group. You want eo use CI whne you want to know the average outcome for a popualtion with a certain characteristic. You want to use PI when you want to predict an individuals outcome.

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())

par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

# (b) Create a residuals vs. fitted plot using ggplot

augmented <- augment(model_slr)

p_resid_f <- ggplot(augmented, aes(x = bmi, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. BMI",
    x = "BMI",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_f

# (c) Create a normal Q-Q plot of residuals using ggplot

p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq

# (d) Create a Cook's distance plot

augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_slr),
    influential = ifelse(cooks_d > 4 / 3000, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / 3000)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / 3000, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. The orange line dips below zero for higher BMI values and slightly rises for lower BMI values, suggesting some non linearity, meaning the relationship is not perfectly linear. There is evidence of heteroscedasticity becasue in the plot, the residuals appear more spread out at higher BMI vales compared to lower BMI leaves, meaning the variability increases as BMI increases.
Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? The points deviate significantly from the red line at both tails. This suggests that phys_days has a non normal distribution with heavier tails than a normal distribution. This could indicate that there are outliers or extreme values in phys_days.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? Yes there are influential observations. There are 133 potentially influential observations. To account for potential influential observations, I would look to see if there were any data entry errors that could be contributing to the influential observations. I would also refit the model with and without the influential observations to see if there is a change.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) Using the LINE mnemonic: L-linearity: The residual vs. fitted plot shows nonliterary. I-independence: cannot be checked N- normality of errors: the Q-Q plots shows deviations from the line in both tails, not normally distributed. E- equal variance: the residuals vs. fitted plot shows increasing spread of residuals at higher BMI values, indicating heteroscedasticity. Overall, the LINE assumptions are not met. Linearity, normality, and equal variance are not met and indicates that a different model might be a better fit for this type of data.

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age

model_lab_age <- lm(phys_days ~ age, data = brfss_lab)


# (b) Display results and compare to the BMI model

tidy(model_lab_age, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_age ~ Age (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_age ~ Age (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3359	0.6826	6.3521	0	2.9975	5.6743
age	0.1377	0.0117	11.7893	0	0.1148	0.1606

# (c) Which predictor has the stronger association? Compare R^2 values.

tibble(
  Model       = c("Linear:Phys_days ~ BMI", "Linear: Phys_days ~ Age"),
  R_squared   = c(
    round(summary(model_slr)$r.squared, 4),
    round(summary(model_lab_age)$r.squared, 4)
  ),
  Adj_R2      = c(
    round(summary(model_slr)$adj.r.squared, 4),
    round(summary(model_lab_age)$adj.r.squared, 4)
  ),
  AIC         = c(round(AIC(model_slr), 1), round(AIC(model_lab_age), 1))
) %>%
  kable(
    caption = "Model Comparison",
    col.names = c("Model", "R^2", "Adj. R^2", "AIC")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which.min(c(AIC(model_slr), AIC(model_lab_age))),
           bold = TRUE, background = "#d4edda")

Model Comparison
Model	R^2	Adj. R^2	AIC
Linear:Phys_days ~ BMI	0.0088	0.0085	22992.4
Linear: Phys_days ~ Age	0.0443	0.0440	22883.0

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Both age and BMI show a positive association with phys_days. The magnitude of BMI (0.1531) is slightly larger than age (0.1377), meaning that one unit increase in BMI is associated with about 0.15 additional phys_days, while each additional year of age is associated with about 0.14 additional days. Age and BMI are both statistical significant with p-values lower than 0.05.
Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days? The R^2 value for age (0.0443) is larger than the R^2 value for BMI (0.0088). This shows that age explains more variability in phys_days than BMI, even though both values are relatively small.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? Overall, age and BMI are statistically significant predictors of phys_days, however age appears to have a stronger predictor in terms of power. Due to phys_days being a discrete, the normality and equal variance assumptions of linear regression might be violated. A different modeling approach may be more appropriate.

Session Info

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotly_4.12.0    gtsummary_2.5.0  broom_1.0.11     kableExtra_1.4.0
##  [5] knitr_1.51       here_1.0.2       haven_2.5.5      lubridate_1.9.4 
##  [9] forcats_1.0.1    stringr_1.6.0    dplyr_1.1.4      purrr_1.2.1     
## [13] readr_2.1.6      tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.0   
## [17] tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2    vctrs_0.6.5       
##  [9] tools_4.5.1        generics_0.1.4     pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.0           gt_1.3.0          
## [17] lifecycle_1.0.5    compiler_4.5.1     farver_2.1.2       textshaping_1.0.4 
## [21] janitor_2.2.1      snakecase_0.11.1   litedown_0.9       htmltools_0.5.9   
## [25] sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2     pillar_1.11.1     
## [29] jquerylib_0.1.4    cachem_1.1.0       nlme_3.1-168       commonmark_2.0.0  
## [33] tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7      labeling_0.4.3    
## [37] splines_4.5.1      rprojroot_2.1.1    fastmap_1.2.0      grid_4.5.1        
## [41] cli_3.6.5          magrittr_2.0.4     cards_0.7.1        withr_3.0.2       
## [45] scales_1.4.0       backports_1.5.0    timechange_0.3.0   rmarkdown_2.30    
## [49] httr_1.4.7         otel_0.2.0         hms_1.1.4          evaluate_1.0.5    
## [53] viridisLite_0.4.2  mgcv_1.9-3         markdown_2.0       rlang_1.1.6       
## [57] glue_1.8.0         xml2_1.5.1         svglite_2.2.2      rstudioapi_0.18.0 
## [61] jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1  fs_1.6.6

Simple Linear Regression

EPI 553 - Advanced Epidemiologic Methods

Grace Beal

February 28, 2026

Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Session Info