Introduction

In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.

Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:

Adjust for confounding — estimate the association between an exposure and outcome after accounting for other covariates
Improve prediction — use multiple pieces of information to better forecast an outcome
Model effect modification — examine whether the effect of one variable changes across levels of another
Gain precision — by explaining additional variance in the outcome, we can reduce uncertainty in our estimates

In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.

Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
library(haven)
library(janitor)
library(plotly)
library(GGally)
library(here)
library(lmtest)
library(corrplot)

# Load dataset
lab_rds_path <- "~/R/site-library/Epi553/code/brfss_slr_2020_lab.rds"

if (file.exists(lab_rds_path)) {
  brfss_lab <- read_rds(lab_rds_path)
} else {
  brfss_lab <- read_rds(raw_rds_path) %>%
    clean_names()

  set.seed(553)
  brfss_lab <- brfss_lab %>%
    select(bmi, age, sex, phys_days, sleep_hrs) %>%
    filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
    drop_na() %>%
    slice_sample(n = 3000)

  saveRDS(brfss_lab, lab_rds_path)
}

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
brfss_lab %>%
  select(phys_days, bmi) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: phys_days and bmi") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: phys_days and bmi
	phys_days	bmi
	Min. : 1.00	Min. :14.63
	1st Qu.: 2.00	1st Qu.:24.32
	Median : 6.00	Median :27.89
	Mean :11.66	Mean :29.18
	3rd Qu.:20.00	3rd Qu.:32.89
	Max. :30.00	Max. :59.60

# (b) Create a histogram of phys_days — describe the distribution
ggplot(brfss_lab, aes(x = phys_days)) +
  geom_histogram(bins = 31, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title = "Distribution of Poor Physical Health Days (Past 30 Days)",
    subtitle = "BRFSS 2020, n = 3,000",
    x = "Number of Poor Physical Health Days",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "darkgreen", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "purple", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Poor Physical Health Days vs. BMI (BRFSS 2020)",
    subtitle = "Red = Linear fit | Blue dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days (past 30)"
  ) +
  theme_minimal(base_size = 13)

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days? #The distribution of poor physical health days is strongly right-skewed, with a large spike at 0 days and another concentration at 30 days, indicating a ceiling effect.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? #The scatterplot shows a weak positive linear trend between BMI and poor physical health days. —

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_lab <- lm(phys_days ~ bmi, data = brfss_lab)

summary(model_lab)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.808  -9.160  -5.623   8.943  20.453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.42285    0.86881   8.544  < 2e-16 ***
## bmi          0.14513    0.02895   5.013 5.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared:  0.008314,   Adjusted R-squared:  0.007983 
## F-statistic: 25.13 on 1 and 2998 DF,  p-value: 5.659e-07

# (b) Display a tidy coefficient table with 95% CIs
tidy(model_lab, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: phys_days ~ bmi (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.4228	0.8688	8.5437	0	5.7193	9.1264
bmi	0.1451	0.0289	5.0134	0	0.0884	0.2019

# (c) Extract and report: slope, intercept, t-statistic, p-value

b0_lab <- round(coef(model_lab)[1], 3)
b1_lab <- round(coef(model_lab)[2], 4)

slope_info <- tidy(model_lab, conf.int = TRUE) %>%
  filter(term == "bmi")

tibble(
  Quantity   = c("Intercept (b₀)", "Slope (b₁)", "t-statistic", "p-value",
                 "95% CI Lower", "95% CI Upper"),
  Value      = c(
    round(slope_info$estimate[1], 4),  
    round(b1_lab, 4),
    round(slope_info$statistic, 3),
    format.pval(slope_info$p.value, digits = 3),
    round(slope_info$conf.low, 4),
    round(slope_info$conf.high, 4)
  )
) %>%
  kable(caption = "Key Model Statistics: phys_days ~ bmi") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Key Model Statistics: phys_days ~ bmi
Quantity	Value
Intercept (b₀)	0.1451
Slope (b₁)	0.1451
t-statistic	5.013
p-value	5.66e-07
95% CI Lower	0.0884
95% CI Upper	0.2019

Questions:

Write the fitted regression equation in the form $\hat{Y} = b_0 + b_1 X$.

phys_days^=7.4229+0.1451(BMI)

Interpret the slope ($b_1$) in context — what does it mean in plain English? #For each 1-unit increase in BMI, the mean number of poor physical health days increases by approximately 0.145 days, on average.
Is the intercept ($b_0$) interpretable in this context? Why or why not? #The intercept represents the predicted number of poor physical health days when BMI equals 0. Because a BMI of 0 is not realistic, the intercept is not practically interpretable.
Is the association statistically significant at $\alpha = 0.05$? State the null hypothesis, test statistic, and p-value.

null hypothesis= 0 t=5.013 R^2 = 0.008314. Since p < 0.05, we reject the null hypothesis and conclude that BMI is statistically significantly associated with poor physical health day

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_lab <- anova(model_lab)

anova_lab %>%
  kable(
    caption = "ANOVA Table: phys_days ~ bmi",
    digits = 3,
    col.names = c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: phys_days ~ bmi
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3105.365	3105.365	25.134	0
Residuals	2998	370411.743	123.553	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
augmented_lab <- augment(model_lab)

ss_total    <- sum((brfss_lab$phys_days - mean(brfss_lab$phys_days))^2)
ss_resid    <- sum(augmented_lab$.resid^2)
ss_reg      <- ss_total - ss_resid

n_lab <- nrow(brfss_lab)

tibble(
  Component  = c("SS Total (SST)", "SS Regression (SSReg)", "SS Residual (SSE)"),
  Formula    = c("Σ(Yᵢ − Ȳ)²", "Σ(Ŷᵢ − Ȳ)²", "Σ(Yᵢ − Ŷᵢ)²"),
  Value      = c(round(ss_total, 2), round(ss_reg, 2), round(ss_resid, 2)),
  df         = c(n_lab - 1, 1, n_lab - 2)
) %>%
  kable(caption = "ANOVA Decomposition: Manual Calculation") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Decomposition: Manual Calculation
Component	Formula	Value	df
SS Total (SST)	Σ(Yᵢ − Ȳ)²	373517.11	2999
SS Regression (SSReg)	Σ(Ŷᵢ − Ȳ)²	3105.36	1
SS Residual (SSE)	Σ(Yᵢ − Ŷᵢ)²	370411.74	2998

# (c) Compute R² two ways: from the model object and from the SS decomposition
r2_model <- summary(model_lab)$r.squared
r2_manual <- ss_reg / ss_total

tibble(
  Method  = c("From model object (summary()$r.squared)", "From SS decomposition (SSReg / SSTotal)"),
  R2      = c(round(r2_model, 6), round(r2_manual, 6)),
  Match   = c("—", as.character(round(r2_model, 6) == round(r2_manual, 6)))
) %>%
  kable(caption = "R² Computed Two Ways") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² Computed Two Ways
Method	R2	Match
From model object (summary()$r.squared)	0.008314	—
From SS decomposition (SSReg / SSTotal)	0.008314	TRUE

Questions:

Fill in the ANOVA table components: $SS_{Total}$, $SS_{Regression}$, $SS_{Residual}$, $df$, and $F$-statistic. #The F-test indicates that the regression model is statistically significant.
What is the $R^2$ value? Interpret it in plain English. # the BMI explains very little of the differences in poor physical health days among individuals over 99% of the variation is due to other factors not included in this simple model.
What does this $R^2$ tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? # the r^2 value of 0.0083 indicates that BMI alone explains less than 1% of the variation in poor physical health days. —

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi25 <- data.frame(bmi = 25)

ci_bmi25 <- predict(model_lab, newdata = new_bmi25, interval = "confidence") %>%
  as.data.frame()

ci_bmi25 %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "Fitted Value and 95% CI for Mean phys_days at BMI = 25",
    col.names = c("Fitted (Mean Estimate)", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Fitted Value and 95% CI for Mean phys_days at BMI = 25
Fitted (Mean Estimate)	CI Lower	CI Upper
11.051	10.588	11.514

# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_bmi25 <- predict(model_lab, newdata = new_bmi25, interval = "prediction") %>%
  as.data.frame()

pi_bmi25 %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "95% Prediction Interval for an Individual with BMI = 25",
    col.names = c("Fitted (Point Estimate)", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

95% Prediction Interval for an Individual with BMI = 25
Fitted (Point Estimate)	PI Lower	PI Upper
11.051	-10.749	32.851

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(15, 60, length.out = 300))

ci_band_lab <- predict(model_lab, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band_lab <- predict(model_lab, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

ggplot() +
  geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band_lab, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.30) +
  geom_ribbon(data = ci_band_lab, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.45) +
  geom_line(data = ci_band_lab, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: phys_days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual",
    x = "BMI (kg/m²)",
    y = "Poor Physical Health Days (past 30)",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean?

#This interval represents uncertainty around the estimated mean for individuals with BMI = 25. b) If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? #This interval is much wider because it reflects variability in individual outcomes, not just uncertainty in the mean.

Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? #The confidence interval shows the uncertainty in estimating the average response, whereas the prediction interval accounts for both uncertainty in the mean and individual-level variability. —

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_lab, which = 1:4,
     col = adjustcolor("blue3", 0.4),
     pch = 19, cex = 0.6)

par(mfrow = c(1, 1))

# (b) Create a residuals vs. fitted plot using ggplot
augmented_lab <- augmented_lab %>%
  mutate(obs_num = row_number())

ggplot(augmented_lab, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "green", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. Fitted Values",
    subtitle = "Red line = zero | Orange = LOESS smoother",
    x = "Fitted Values (Predicted phys_days)",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

# (c) Create a normal Q-Q plot of residuals using ggplot
ggplot(augmented_lab, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

# (d) Create a Cook's distance plot
augmented_lab <- augmented_lab %>%
  mutate(
    cooks_d     = cooks.distance(model_lab),
    influential = ifelse(cooks_d > 4 / n_lab, "Potentially influential", "Not influential")
  )

n_influential_lab <- sum(augmented_lab$cooks_d > 4 / n_lab)

ggplot(augmented_lab, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / n_lab, linetype = "dashed",
             color = "black", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential_lab, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see.

#The LOESS smoother shows mild curvature, suggesting some nonlinearity in the relationship. There is also evidence of unequal spread across fitted values, indicating potential heteroscedasticity.

Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? #The Q-Q plot shows substantial deviation from the reference line, particularly in the tails, indicating that residuals are not normally distributed.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? #There are 136 observations exceeding the Cook’s Distance threshold of 4/n. While these observations may influence the model, they should not be removed without further investigation or sensitivity analysis.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) #The independence assumption likely holds. However, the most problematic assumption is normality, given that poor physical health days is a bounded count variable, which is not well approximated by a normal distribution

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_age <- lm(phys_days ~ age, data = brfss_lab)

summary(model_age)

## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.872  -8.803  -4.733   9.460  23.267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.37023    0.66608   6.561 6.27e-11 ***
## age          0.13127    0.01145  11.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 2998 degrees of freedom
## Multiple R-squared:  0.04202,    Adjusted R-squared:  0.0417 
## F-statistic: 131.5 on 1 and 2998 DF,  p-value: < 2.2e-16

# (b) Display results and compare to the BMI model
tidy(model_age, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "SLR: phys_days ~ Age",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

SLR: phys_days ~ Age
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3702	0.6661	6.5611	0	3.0642	5.6762
age	0.1313	0.0114	11.4675	0	0.1088	0.1537

# (c) Which predictor has the stronger association? Compare R² values.
r2_bmi <- summary(model_lab)$r.squared
r2_age <- summary(model_age)$r.squared

tibble(
  Model       = c("phys_days ~ BMI", "phys_days ~ Age"),
  R_squared   = c(round(r2_bmi, 4), round(r2_age, 4)),
  Adj_R2      = c(round(summary(model_lab)$adj.r.squared, 4),
                  round(summary(model_age)$adj.r.squared, 4)),
  AIC         = c(round(AIC(model_lab), 1), round(AIC(model_age), 1))
) %>%
  kable(
    caption = "Model Comparison: BMI vs. Age as Predictor of phys_days",
    col.names = c("Model", "R²", "Adj. R²", "AIC")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Comparison: BMI vs. Age as Predictor of phys_days
Model	R²	Adj. R²	AIC
phys_days ~ BMI	0.0083	0.0080	22967.6
phys_days ~ Age	0.0420	0.0417	22863.9

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance?

#Both BMI and age show positive associations with poor physical health days. However, age has a stronger association (R² = 0.0420) compared to BMI (R² = 0.0083), and a much larger F-statistic, indicating stronger statistical evidence.

Compare the $R^2$ values of the two models. Which predictor explains more variability in phys_days? #Age explains approximately 4.2% of the variability in poor physical health days, while BMI explains only 0.83%.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? #Both age and BMI are statistically significant predictors of poor physical health days, but neither explains much variability when modeled alone. The outcome variable is bounded and non-normally distributed, violating key linear regression assumptions.

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.

Session Info

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] corrplot_0.92    lmtest_0.9-40    zoo_1.8-12       here_1.0.1      
##  [5] GGally_2.2.1     plotly_4.10.4    janitor_2.2.0    haven_2.5.4     
##  [9] ggeffects_1.5.0  gtsummary_1.7.2  kableExtra_1.4.0 knitr_1.45      
## [13] broom_1.0.5      lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1   
## [17] dplyr_1.1.4      purrr_1.0.2      readr_2.1.5      tidyr_1.3.1     
## [21] tibble_3.2.1     ggplot2_3.5.0    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0     viridisLite_0.4.2    farver_2.1.1        
##  [4] fastmap_1.1.1        lazyeval_0.2.2       broom.helpers_1.14.0
##  [7] digest_0.6.34        timechange_0.3.0     lifecycle_1.0.4     
## [10] ellipsis_0.3.2       magrittr_2.0.3       compiler_4.3.2      
## [13] rlang_1.1.3          sass_0.4.8           tools_4.3.2         
## [16] utf8_1.2.4           yaml_2.3.8           gt_0.10.1           
## [19] data.table_1.15.99   labeling_0.4.3       htmlwidgets_1.6.4   
## [22] plyr_1.8.9           xml2_1.3.6           RColorBrewer_1.1-3  
## [25] withr_3.0.0          grid_4.3.2           fansi_1.0.6         
## [28] colorspace_2.1-0     scales_1.3.0         insight_0.19.8      
## [31] cli_3.6.2            rmarkdown_2.25       generics_0.1.3      
## [34] rstudioapi_0.15.0    httr_1.4.7           tzdb_0.4.0          
## [37] cachem_1.0.8         splines_4.3.2        vctrs_0.6.5         
## [40] Matrix_1.6-1.1       jsonlite_1.8.8       hms_1.1.3           
## [43] systemfonts_1.0.5    jquerylib_0.1.4      glue_1.7.0          
## [46] ggstats_0.5.1        stringi_1.8.3        gtable_0.3.4        
## [49] munsell_0.5.0        pillar_1.9.0         htmltools_0.5.7     
## [52] R6_2.5.1             rprojroot_2.0.4      evaluate_0.23       
## [55] lattice_0.21-9       highr_0.10           backports_1.4.1     
## [58] snakecase_0.11.1     bslib_0.6.1          Rcpp_1.0.12         
## [61] svglite_2.1.3        nlme_3.1-163         mgcv_1.9-0          
## [64] xfun_0.42            pkgconfig_2.0.3

Multiple Linear Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Karo Ekor

March 3, 5, and 10, 2026

Introduction

Part 2: Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

phys_days^=7.4229+0.1451(BMI)

null hypothesis= 0 t=5.013 R^2 = 0.008314. Since p < 0.05, we reject the null hypothesis and conclude that BMI is statistically significantly associated with poor physical health day

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info

Multiple Linear Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Karo Ekor

March 3, 5, and 10, 2026

Introduction

Part 2: Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

phys_days^​=7.4229+0.1451(BMI)

null hypothesis= 0 t=5.013 R^2 = 0.008314. Since p < 0.05, we reject the null hypothesis and conclude that BMI is statistically significantly associated with poor physical health day

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info

phys_days^=7.4229+0.1451(BMI)