Part 2: Lab Activity

Overview

In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.

Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?

Setup Instructions

Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.

# Load packages
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)

## Warning: package 'haven' was built under R version 4.5.2

library(here)

## Warning: package 'here' was built under R version 4.5.2

## here() starts at C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553

library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(broom)

## Warning: package 'broom' was built under R version 4.5.2

library(dplyr)
library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

# Load raw BRFSS 2020 data
brfss_full <- read_xpt(
  "C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/LLCP2020XPT/LLCP2020.XPT"
) %>%
  janitor::clean_names()

# Select variables of interest
brfss_lab <- brfss_full %>%
  select(bmi5, age80, sex, physhlth, sleptim1)

# Recode variables
brfss_lab <- brfss_lab %>%
  mutate(
    bmi       = bmi5 / 100,
    age       = age80,
    sex       = factor(ifelse(sex == 1, "Male", "Female")),
    phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_),
    sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_)
  )

# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_lab <- brfss_lab %>%
  select(bmi, age, sex, phys_days, sleep_hrs) %>%
  filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
  drop_na() %>%
  slice_sample(n = 3000)

# Save analytic dataset
saveRDS(brfss_lab, here::here(
  "C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_lab_2020.rds"
))

Task 1: Explore the Variables (15 points)

# (a) Create a summary table of phys_days and bmi
brfss_lab %>%
  select(bmi, phys_days) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Key Continuous Variables
	bmi	phys_days
	Min. :14.91	Min. : 1.00
	1st Qu.:24.41	1st Qu.: 2.00
	Median :28.33	Median : 7.00
	Mean :29.47	Mean :12.03
	3rd Qu.:33.47	3rd Qu.:21.00
	Max. :58.24	Max. :30.00

#Descriptive Statistics 
brfss_lab %>%
  select(bmi, phys_days) %>%
  tbl_summary(
    label = list(
      bmi ~ "BMI (kg/m²)",
      phys_days ~ "Number of Days Physical Health was not good"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")

**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**
Characteristic	N	N = 3,000¹
BMI (kg/m²)	3,000	29.5 (6.9)
Number of Days Physical Health was not good	3,000	12.0 (11.2)
¹ Mean (SD)

# (b) Create a histogram of phys_days — describe the distribution
hist(brfss_lab$phys_days, main = "Bad Physical Days Distribution", 
     xlab = "Number of Bad Physical Days", ylab = "Frequency" ,col = "lightblue")

# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
p_scatter <- ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
  geom_smooth(method = "loess", color = "blue", linewidth = 1,
              linetype = "dashed", se = FALSE) +
  labs(
    title = "Number of Bad Physical Days vs. BMI (BRFSS 2020)",
    subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
    x = "BMI (kg/m²)",
    y = "Number of Bad Physical Days"
  ) +
  theme_minimal(base_size = 13)

print(p_scatter)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Questions:

What is the mean and standard deviation of phys_days? Of bmi? What do you notice about the distribution of phys_days? The mean of BMI is 29.47 and the mean of phys_days is 12.03. The standard deviation of BMI is 6.9 and the standard deviation of phys_days is 11.2. The distribution of phys_days is mainly right-skewed, but there is a large frequency of people that responded 30 days for the number of bad physical health days.
Based on the scatter plot, does the relationship between BMI and poor physical health days appear to be linear? Are there any obvious outliers? It does not appear linear, and there are many outliers at the 30 day mark. —

Task 2: Fit and Interpret the SLR Model (20 points)

# (a) Fit the SLR model: phys_days ~ bmi
model_lab <- lm(phys_days ~ bmi, data = brfss_lab)

# (b) Display a tidy coefficient table with 95% CIs
tidy(model_lab, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Bad Physical Health Days ~ BMI (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: Bad Physical Health Days ~ BMI (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	7.5193	0.8978	8.3751	0	5.7589	9.2796
bmi	0.1531	0.0297	5.1584	0	0.0949	0.2112

# (c) Extract and report: slope, intercept, t-statistic, p-value
summary(model_lab)

## 
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.412  -9.294  -5.379   9.803  20.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51927    0.89781   8.375  < 2e-16 ***
## bmi          0.15306    0.02967   5.158 2.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.16 on 2998 degrees of freedom
## Multiple R-squared:  0.008798,   Adjusted R-squared:  0.008467 
## F-statistic: 26.61 on 1 and 2998 DF,  p-value: 2.652e-07

slope_test <- tidy(model_lab, conf.int = TRUE) %>% filter(term == "bmi")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    2998,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b₁)	0.1531
SE(b₁)	0.0297
t-statistic	5.158
Degrees of freedom	2998
p-value	2.65e-07
95% CI Lower	0.0949
95% CI Upper	0.2112

Questions:

Write the fitted regression equation in the form \(\hat{Y} = b_0 + b_1 X\). = 0.0297 + 0.1531(BMI)
Interpret the slope (\(b_1\)) in context — what does it mean in plain English? For every one unit increase in BMI, the number of bad physical days increases by 0.1531 days.
Is the intercept (\(b_0\)) interpretable in this context? Why or why not? It is not interpretable because you can’t have a BMI of 0.
Is the association statistically significant at \(\alpha = 0.05\)? State the null hypothesis, test statistic, and p-value. The association is statistically significant. Null= There is no linear relationship between BMI and number of bad physical health days. Test statistic= 5.158. p-value= 2.65e-07 —

Task 3: ANOVA Decomposition and R² (15 points)

# (a) Display the ANOVA table
anova_lab <- anova(model_lab)

anova_lab %>%
  kable(
    caption = "ANOVA Table: Number of Bad Physical Health Days ~ BMI",
    digits = 3,
    col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

ANOVA Table: Number of Bad Physical Health Days ~ BMI
Source	Df	Sum Sq	Mean Sq	F value	Pr(>F)
bmi	1	3314.969	3314.969	26.609	0
Residuals	2998	373487.391	124.579	NA	NA

# (b) Compute and report SSTotal, SSRegression, and SSResidual
#See answer below

# (c) Compute R² two ways: from the model object and from the SS decomposition
r_sq <- summary(model_lab)$r.squared
adj_r_sq <- summary(model_lab)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq, 4),
    round(adj_r_sq, 4),
    paste0(round(r_sq * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0088
Adjusted R²	0.0085
Variance Explained	0.88%

r_pearson <- cor(brfss_lab$bmi, brfss_lab$phys_days)
tibble(
  Quantity   = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
  Value      = c(
    round(r_pearson, 4),
    round(r_pearson^2, 4),
    round(r_sq, 4),
    as.character(round(r_pearson^2, 4) == round(r_sq, 4))
  )
) %>%
  kable(caption = "Pearson r vs. R² from Model") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Pearson r vs. R² from Model
Quantity	Value
Pearson r	0.0938
r² (from Pearson)	0.0088
R² (from model)	0.0088
r² = R²?	TRUE

Questions:

Fill in the ANOVA table components: \(SS_{Total}\), \(SS_{Regression}\), \(SS_{Residual}\), \(df\), and \(F\)-statistic. \(SS_{Total}\) is 376802.36, the \(SS_{Regression}\) is 3314.969, and the \(SS_{Residual}\) is 373487.391. The degrees of freedom for \(SS_{Total}\) is 2999, the degrees of freedom for \(SS_{Regression}\) is 1 and the degrees of freedom for \(SS_{Residual}\) is 2998. The F- statistic is 26.609.
What is the \(R^2\) value? Interpret it in plain English. The \(R^2\) value is 0.0088, which is the amount of variation in the dependent variable that can be explained by the independent variable.
What does this \(R^2\) tell you about how well BMI alone explains variation in poor physical health days? What might explain the remaining variation? The \(R^2\) of 0.0088 tells us that only 0.88% of the variation in number of bad physical health days can be explained by BMI. Therefore, BMI has a very small effect on bad physical health days. The explaining variation may be explained by other variables, such as age, sex, and more. —

Task 4: Confidence and Prediction Intervals (20 points)

# (a) Calculate the fitted BMI value and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi = c(25))

ci_pred <- predict(model_lab, newdata = new_bmi, interval = "confidence") %>%
  as.data.frame() %>%
  rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)

# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_pred <- predict(model_lab, newdata = new_bmi, interval = "prediction") %>%
  as.data.frame() %>%
  rename(PI_Lower = lwr, PI_Upper = upr) %>%
  select(-fit)

results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

results_table %>%
  kable(
    caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI",
    col.names = c("BMI", "Fitted Bad Physical Health Days", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))

Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI
		95% CI for Mean		95% PI for Individual
BMI	Fitted Bad Physical Health Days	CI Lower	CI Upper	PI Lower	PI Upper
25	11.35	10.87	11.82	-10.54	33.24

# (c) Plot the regression line with both the CI band and PI band
bmi_grid <- data.frame(bmi = seq(length.out = 200))

ci_band <- predict(model_lab, newdata = bmi_grid, interval = "confidence") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

pi_band <- predict(model_lab, newdata = bmi_grid, interval = "prediction") %>%
  as.data.frame() %>%
  bind_cols(bmi_grid)

p_ci_pi <- ggplot() +
  geom_point(data = brfss_lab, aes(x = bmi, y = phys_days),
             alpha = 0.10, color = "steelblue", size = 1) +
  geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "lightblue", alpha = 0.3) +
  geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
              fill = "steelblue", alpha = 0.4) +
  geom_line(data = ci_band, aes(x = bmi, y = fit),
            color = "red", linewidth = 1.2) +
  labs(
    title = "Simple Linear Regression: Bad Physical Health Days ~ BMI",
    subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
    x = "BMI (kg/m²)",
    y = "Number of Bad Physical Health Days",
    caption = "BRFSS 2020, n = 3,000"
  ) +
  theme_minimal(base_size = 13)

p_ci_pi

Questions:

For someone with a BMI of 25, what is the estimated mean number of poor physical health days? What is the 95% confidence interval for this mean? The estimated mean number of poor physical health days for a person with a BMI of 25 is 11.35 days. The 95% confidence interval is [10.87,11.82].
If a specific new person has a BMI of 25, what is the 95% prediction interval for their number of poor physical health days? The 95% prediction interval for their number of poor physical health days is [-10.54,33.24].
Explain in your own words why the prediction interval is wider than the confidence interval. When would you use each one in practice? The prediction interval is wider than the confidence interval, because a specific value for a specific person is harder to predict than a mean value for a group of people. When you want to predict what a group’s mean value would be, you use confidence interval, but when you want to predict a specific value for a specific person, you use prediction interval. —

Task 5: Residual Diagnostics (20 points)

# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_lab, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

# (b) Create a residuals vs. fitted plot using ggplot
augmented <- augment(model_lab)

p_resid_x <- ggplot(augmented, aes(x = bmi, y = .resid)) +
  geom_point(alpha = 0.15, color = "steelblue", size = 1) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
  labs(
    title = "Residuals vs. BMI",
    subtitle = "Should show no pattern — random scatter around zero",
    x = "BMI (kg/m²)",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

p_resid_x

## `geom_smooth()` using formula = 'y ~ x'

# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
  stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
  stat_qq_line(color = "red", linewidth = 1) +
  labs(
    title = "Normal Q-Q Plot of Residuals",
    subtitle = "Points should lie on the red line if residuals are normally distributed",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal(base_size = 13)

p_qq

# (d) Create a Cook's distance plot
augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(model_lab),
    influential = ifelse(cooks_d > 4 / 3000, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 4 / 3000)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 4 / 3000, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 4/n threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

Questions:

Examine the Residuals vs. Fitted plot. Is there evidence of nonlinearity or heteroscedasticity? Describe what you see. The red line is horizontal, so there is no evidence of nonlinearity. The points are fairly randomly scatter, so there is no evidence of heteroscedasticity.
Examine the Q-Q plot. Are the residuals approximately normal? What do departures from normality in this context suggest about the distribution of phys_days? The residuals do not follow the normality line very well, meaning the number of bad physical health days is not very normally distributed.
Are there any influential observations (Cook’s D > 4/n)? How many? What would you do about them? There are many influential observations. There are 133 potentially influential observations. I would remove the potentially influential observations.
Overall, do the LINE assumptions appear to be met? Which assumption(s) may be most problematic for this model, and why? (Hint: think about the nature of the outcome variable.) Most of the LINE assumptions appear to be met. The assumption that is most problematic are normality, which may be because “number of bad days of physical health” is a pretty subjective variable, so it may not follow normality. The highly influential observations may also be problematic. —

Task 6: Testing a Different Predictor (10 points)

Now fit a second SLR model using age as the predictor of phys_days instead of BMI.

# (a) Fit SLR: phys_days ~ age
model_lab_age <- lm(phys_days ~ age, data = brfss_lab)

# (b) Display results and compare to the BMI model
tidy(model_lab_age, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Bad Physical Health Days ~ Age (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: Bad Physical Health Days ~ Age (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.3359	0.6826	6.3521	0	2.9975	5.6743
age	0.1377	0.0117	11.7893	0	0.1148	0.1606

# (c) Extract and report: slope, intercept, t-statistic, p-value
summary(model_lab_age)

## 
## Call:
## lm(formula = phys_days ~ age, data = brfss_lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.354  -8.911  -4.432  10.992  22.910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.33587    0.68259   6.352 2.45e-10 ***
## age          0.13773    0.01168  11.789  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 2998 degrees of freedom
## Multiple R-squared:  0.04431,    Adjusted R-squared:  0.04399 
## F-statistic:   139 on 1 and 2998 DF,  p-value: < 2.2e-16

slope_test <- tidy(model_lab_age, conf.int = TRUE) %>% filter(term == "age")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    2998,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b₁)	0.1377
SE(b₁)	0.0117
t-statistic	11.789
Degrees of freedom	2998
p-value	<2e-16
95% CI Lower	0.1148
95% CI Upper	0.1606

# (c) Which predictor has the stronger association? Compare R² values.
r_sq_age <- summary(model_lab_age)$r.squared
adj_r_sq_age <- summary(model_lab_age)$adj.r.squared

tibble(
  Metric = c("R²", "Adjusted R²", "Variance Explained"),
  Value  = c(
    round(r_sq_age, 4),
    round(adj_r_sq_age, 4),
    paste0(round(r_sq_age * 100, 2), "%")
  )
) %>%
  kable(caption = "R² and Adjusted R²") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

R² and Adjusted R²
Metric	Value
R²	0.0443
Adjusted R²	0.044
Variance Explained	4.43%

tibble(
  Model       = c("Linear: Bad Physical Days ~ BMI", "Linear: Bad Physical Days ~  Age"),
  R_squared   = c(
    round(summary(model_lab)$r.squared, 4),
    round(summary(model_lab_age)$r.squared, 4)
  ),
  Adj_R2      = c(
    round(summary(model_lab)$adj.r.squared, 4),
    round(summary(model_lab_age)$adj.r.squared, 4)
  ),
  AIC         = c(round(AIC(model_lab), 1), round(AIC(model_lab_age), 1))
) %>%
  kable(
    caption = "Model Comparison",
    col.names = c("Model", "R²", "Adj. R²", "AIC")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which.min(c(AIC(model_lab), AIC(model_lab_age))),
           bold = TRUE, background = "#d4edda")

Model Comparison
Model	R²	Adj. R²	AIC
Linear: Bad Physical Days ~ BMI	0.0088	0.0085	22992.4
Linear: Bad Physical Days ~ Age	0.0443	0.0440	22883.0

Questions:

How does the association between age and poor physical health days compare to the BMI association in terms of direction, magnitude, and statistical significance? Both have positive associations and are statistically significant, but BMI has a stronger association to bad physical health days than age does.
Compare the \(R^2\) values of the two models. Which predictor explains more variability in phys_days? Age explains 4.43% of the variation of bad physical health days, whereas BMI only explains 0.88% of the variation in bad physical health days. Age explains more of the variability.
Based on these two simple models, what is your overall conclusion about predictors of poor physical health days? What are the limitations of using simple linear regression for this outcome? The predictors of poor physical outcome don’t explain much of the variation within poor physical outcome. The limitations of using simple linear regression are that you are unable to account for potential confounding or effect modification, which may diminish or increase the effect that certain predictors have on the outcome. —

Submission Instructions

Submit your completed .Rmd file and the RPubs link to your knitted HTML document.

Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.

Due: Before the next class session.

Lab06_SimpleLinearRegression_AbigailMiller

Abigail Miller

2026-02-26

Part 2: Lab Activity

Overview

Setup Instructions

Task 1: Explore the Variables (15 points)

Task 2: Fit and Interpret the SLR Model (20 points)

Task 3: ANOVA Decomposition and R² (15 points)

Task 4: Confidence and Prediction Intervals (20 points)

Task 5: Residual Diagnostics (20 points)

Task 6: Testing a Different Predictor (10 points)

Submission Instructions

Session Info