---
title: "Simple Linear Regression"
subtitle: "EPI 553 - Advanced Epidemiologic Methods"
author: "Muntasir Masum"
date: "February 24 & 26, 2026"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    theme: flatly
    highlight: tango
    code_folding: show
---
Simple Linear Regression (SLR) is one of the most fundamental and widely used tools in epidemiology and public health research. It allows us to quantify the association between a continuous outcome and a single predictor, test whether that association is statistically significant, and predict outcome values at given levels of the predictor.
In epidemiology, we frequently use SLR to model continuous outcomes such as blood pressure, BMI, cholesterol levels, or hospital length of stay as a function of age, exposure levels, or other continuous predictors.
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2020 data throughout this lecture. The BRFSS is a large-scale, nationally representative telephone survey conducted by the CDC that collects data on health behaviors, chronic conditions, and preventive service use among U.S. adults.
# Load raw BRFSS 2020 data
brfss_full <- read_xpt("C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/LLCP2020XPT/LLCP2020.XPT") %>%
janitor::clean_names()
# Select variables of interest
brfss_slr <- brfss_full %>%
select(bmi5, age80, sex, educag, genhlth, physhlth, sleptim1)
# Recode variables
brfss_slr <- brfss_slr %>%
mutate(
bmi = bmi5 / 100,
age = age80,
sex = factor(ifelse(sex == 1, "Male", "Female")),
education = factor(case_when(
educag == 1 ~ "< High school",
educag == 2 ~ "High school graduate",
educag == 3 ~ "Some college",
educag == 4 ~ "College graduate"
), levels = c("< High school", "High school graduate",
"Some college", "College graduate")),
gen_health_num = ifelse(genhlth %in% 1:5, genhlth, NA_real_),
sleep_hrs = ifelse(sleptim1 %in% 1:24, sleptim1, NA_real_),
phys_days = ifelse(physhlth %in% 0:30, physhlth, NA_real_)
)
# Select recoded variables, apply filters, drop missing, take sample
set.seed(553)
brfss_slr <- brfss_slr %>%
select(bmi, age, sex, education, gen_health_num, sleep_hrs, phys_days) %>%
filter(bmi > 14.5, bmi < 60, age >= 18, age <= 80) %>%
drop_na() %>%
slice_sample(n = 3000)
# Save analytic dataset
saveRDS(brfss_slr, "C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/Lab 6/new_data.rds")
tibble(
Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_slr), ncol(brfss_slr))
) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Observations | 3000 |
| Variables | 7 |
brfss_slr %>%
select(bmi, age, sleep_hrs, phys_days) %>%
summary() %>%
kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| bmi | age | sleep_hrs | phys_days |
|---|---|---|---|
| Min. :14.63 | Min. :18.00 | Min. : 1.000 | Min. : 1.00 |
| 1st Qu.:24.32 | 1st Qu.:43.00 | 1st Qu.: 6.000 | 1st Qu.: 2.00 |
| Median :27.89 | Median :58.00 | Median : 7.000 | Median : 6.00 |
| Mean :29.18 | Mean :55.52 | Mean : 6.915 | Mean :11.66 |
| 3rd Qu.:32.89 | 3rd Qu.:70.00 | 3rd Qu.: 8.000 | 3rd Qu.:20.00 |
| Max. :59.60 | Max. :80.00 | Max. :20.000 | Max. :30.00 |
brfss_slr %>%
select(bmi, age, sleep_hrs, sex, education) %>%
tbl_summary(
label = list(
bmi ~ "BMI (kg/m²)",
age ~ "Age (years)",
sleep_hrs ~ "Sleep (hours/night)",
sex ~ "Sex",
education ~ "Education"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics (BRFSS 2020, n = 3,000)**")
| Characteristic | N | N = 3,000¹ |
|---|---|---|
| BMI (kg/m²) | 3,000 | 29.2 (7.0) |
| Age (years) | 3,000 | 55.5 (17.4) |
| Sleep (hours/night) | 3,000 | 6.9 (1.7) |
| Sex | 3,000 | |
| Female | | 1,701 (57%) |
| Male | | 1,299 (43%) |
| Education | 3,000 | |
| < High school | | 237 (7.9%) |
| High school graduate | | 796 (27%) |
| Some college | | 937 (31%) |
| College graduate | | 1,030 (34%) |

¹ Mean (SD); n (%)
Simple linear regression models the mean of a continuous outcome \(Y\) as a linear function of a single predictor \(X\):
\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n\]
Where:
| Symbol | Name | Meaning |
|---|---|---|
| \(Y_i\) | Response / Outcome | Observed value for subject \(i\) (e.g., BMI) |
| \(X_i\) | Predictor / Covariate | Observed predictor for subject \(i\) (e.g., age) |
| \(\beta_0\) | Intercept | Expected value of \(Y\) when \(X = 0\) |
| \(\beta_1\) | Slope | Expected change in \(Y\) for a 1-unit increase in \(X\) |
| \(\varepsilon_i\) | Error term | Random deviation of \(Y_i\) from the regression line |
The population regression line (also called the true or theoretical regression line) describes the expected (mean) value of \(Y\) at each value of \(X\):
\[E(Y \mid X) = \mu_{Y|X} = \beta_0 + \beta_1 X\]
| Population | Sample | |
|---|---|---|
| Line | \(\beta_0 + \beta_1 X\) | \(\hat{y} = b_0 + b_1 X\) |
| Intercept | \(\beta_0\) (parameter) | \(b_0\) or \(\hat{\beta}_0\) (estimate) |
| Slope | \(\beta_1\) (parameter) | \(b_1\) or \(\hat{\beta}_1\) (estimate) |
| Error | \(\varepsilon_i\) | \(e_i = Y_i - \hat{Y}_i\) (residual) |
We use our sample to estimate the population parameters. The estimates \(b_0\) and \(b_1\) define the fitted regression line.
Before fitting any model, always visualize the bivariate relationship.
p_scatter <- ggplot(brfss_slr, aes(x = age, y = bmi)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "BMI vs. Age (BRFSS 2020)",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "Age (years)",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)
BMI vs. Age — BRFSS 2020
Interpretation tip: The LOESS smoother (orange) follows the data without assuming linearity. When it closely tracks the linear fit (red), a linear model is reasonable. Departures suggest nonlinearity.
A useful mnemonic is LINE:
| Letter | Assumption | Description |
|---|---|---|
| L | Linearity | The relationship between \(X\) and \(E(Y)\) is linear |
| I | Independence | Observations are independent of one another |
| N | Normality | Errors \(\varepsilon_i\) are normally distributed |
| E | Equal variance | Errors have constant variance (homoscedasticity): \(\text{Var}(\varepsilon_i) = \sigma^2\) |
Formally, we assume:
\[\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)\]
This means that for any value of \(X\), the distribution of \(Y\) is:
\[Y \mid X \sim N(\beta_0 + \beta_1 X, \; \sigma^2)\]
Note on independence: In cross-sectional survey data like BRFSS, observations from the same household or geographic cluster may not be fully independent. We acknowledge this limitation but proceed with the standard SLR framework for pedagogical purposes.
We estimate \(\beta_0\) and \(\beta_1\) by finding the values \(b_0\) and \(b_1\) that minimize the residual sum of squares, \(SS_{Residual}\):
\[SS_{Residual} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - b_0 - b_1 X_i)^2\]
This is called the Ordinary Least Squares (OLS) criterion. Minimizing \(SS_{Residual}\) yields the closed-form solutions:
\[b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}\]
\[b_0 = \bar{Y} - b_1 \bar{X}\]
where \(\bar{X}\) and \(\bar{Y}\) are the sample means of \(X\) and \(Y\).
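To make these formulas concrete, here is a minimal sketch (assuming the analytic dataset brfss_slr loaded above) that computes the estimates by hand; the result can be checked against coef(lm(bmi ~ age, data = brfss_slr)).
# Hand-compute the OLS slope and intercept from S_XY and S_XX (sketch)
x <- brfss_slr$age
y <- brfss_slr$bmi
s_xy <- sum((x - mean(x)) * (y - mean(y)))   # S_XY
s_xx <- sum((x - mean(x))^2)                 # S_XX
b1_manual <- s_xy / s_xx                     # slope estimate b1
b0_manual <- mean(y) - b1_manual * mean(x)   # intercept estimate b0
c(b0 = b0_manual, b1 = b1_manual)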
Gauss-Markov Theorem: Under the LINE assumptions, OLS estimators are the Best Linear Unbiased Estimators (BLUE) — they have the smallest variance among all linear unbiased estimators.
# Fit simple linear regression: BMI ~ Age
model_slr <- lm(bmi ~ age, data = brfss_slr)
# Summary output
summary(model_slr)
##
## Call:
## lm(formula = bmi ~ age, data = brfss_slr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.633 -4.883 -1.325 3.688 30.340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.528231 0.427507 69.071 <2e-16 ***
## age -0.006238 0.007347 -0.849 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.012 on 2998 degrees of freedom
## Multiple R-squared: 0.0002404, Adjusted R-squared: -9.312e-05
## F-statistic: 0.7208 on 1 and 2998 DF, p-value: 0.396
# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: BMI ~ Age (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 29.5282 | 0.4275 | 69.0708 | 0.000 | 28.6900 | 30.3665 |
| age | -0.0062 | 0.0073 | -0.8490 | 0.396 | -0.0206 | 0.0082 |
b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)
Fitted regression equation:
\[\widehat{\text{BMI}} = 29.528 - 0.0062 \times \text{Age}\]
Intercept (\(b_0 = 29.528\)): The estimated mean BMI when age = 0. This is a mathematical artifact — a newborn does not have an adult BMI. The intercept is not directly interpretable in this context, but is necessary to anchor the line.
Slope (\(b_1 = -0.0062\)): For each 1-year increase in age, BMI is estimated to decrease by 0.0062 kg/m², on average (there are no other variables to hold constant in this simple model). Note that this slope is not statistically significant (p = 0.396).
Practical significance vs. statistical significance: Even a small slope can be highly statistically significant with a large sample. Always consider whether the magnitude is meaningful in the real world.
# Augment dataset with fitted values and residuals
augmented <- augment(model_slr)
# Show a sample of fitted values and residuals
augmented %>%
select(bmi, age, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Observed BMI (Y)", "Age (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Observed BMI (Y) | Age (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 26.58 | 67 | 29.110 | -2.530 |
| 33.47 | 38 | 29.291 | 4.179 |
| 35.15 | 78 | 29.042 | 6.108 |
| 30.42 | 65 | 29.123 | 1.297 |
| 22.67 | 55 | 29.185 | -6.515 |
| 30.11 | 80 | 29.029 | 1.081 |
| 35.43 | 34 | 29.316 | 6.114 |
| 31.58 | 71 | 29.085 | 2.495 |
| 28.13 | 55 | 29.185 | -1.055 |
| 34.01 | 62 | 29.141 | 4.869 |
The fitted value (\(\hat{Y}\)) and the observed BMI (\(Y\)) are different because a person's age does not perfectly determine their exact BMI.
Fitted Value (\(\hat{Y}\)): This is the predicted average BMI for anyone of a specific age, based on the regression line calculated by the model. For example, the model calculates that the average 67-year-old has a BMI of 29.110.
Observed BMI (\(Y\)): This is the actual, real-world BMI of a specific individual in the dataset.
The Residual (\(e = Y - \hat{Y}\)): The difference between the two is the residual. It is the sample estimate of the error term \(\varepsilon_i\), the random deviation of \(Y_i\) from the regression line.
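As a quick check, the first row of the table above can be reproduced by hand; a sketch, using the observed values (BMI 26.58, age 67) from that row:
# Reproduce the first row's fitted value and residual by hand (sketch)
y_hat_67 <- coef(model_slr)[1] + coef(model_slr)[2] * 67   # fitted BMI at age 67 (approx. 29.110)
26.58 - y_hat_67                                           # residual = observed - fitted (approx. -2.530)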
# Select a random sample of 80 points to illustrate residuals
set.seed(42)
resid_sample <- augmented %>% slice_sample(n = 80)
p_resid <- ggplot(resid_sample, aes(x = age, y = bmi)) +
geom_segment(aes(xend = age, yend = .fitted),
color = "tomato", alpha = 0.5, linewidth = 0.5) +
geom_point(color = "steelblue", size = 1.8, alpha = 0.8) +
geom_line(aes(y = .fitted), color = "black", linewidth = 1.1) +
labs(
title = "Residuals Illustrated on the Regression Line",
subtitle = "Red segments = residuals (Y − Ŷ); Black line = fitted regression line",
x = "Age (years)",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
p_resid
Visualizing Residuals on the Regression Line
The total variability in \(Y\) can be decomposed into two parts:
\[\underbrace{SS_{Total}}_{Total\ variability} = \underbrace{SS_{Regression}}_{Explained\ by\ X} + \underbrace{SS_{Residual}}_{Unexplained}\]
Where:
\[SS_{Total} = \sum(Y_i - \bar{Y})^2 \qquad (df = n-1)\] \[SS_{Regression} = \sum(\hat{Y}_i - \bar{Y})^2 \qquad (df = 1)\] \[SS_{Residual} = \sum(Y_i - \hat{Y}_i)^2 \qquad (df = n-2)\]
In statistics, degrees of freedom refer to the number of independent pieces of information that are free to vary when calculating an estimate. You can think of it as the amount of “data budget” you have. Every time you estimate a parameter (like a mean or a slope), you spend one degree of freedom.
Why it’s \(n - 1\): You have \(n\) total observations. However, to calculate this variance, you first had to calculate the sample mean. Because the sum of deviations from the mean must always equal zero, knowing \(n-1\) of the deviations automatically tells you the last one. You “spent” 1 degree of freedom calculating the mean, leaving you with \(n - 1\).
Why it’s \(1\): In Simple Linear Regression, you are using exactly one predictor variable (e.g., Age) to explain the outcome (e.g., BMI). Because you are only estimating one slope parameter (\(\beta_1\)) to capture this relationship, the regression model has 1 degree of freedom. (Note: In multiple regression, this would be equal to \(k\), the number of predictor variables).
Why it’s \(n - 2\): To calculate the predicted values (\(\hat{Y}\)) and find the residuals, your model had to estimate two parameters from the data: the intercept (\(\beta_0\)) and the slope (\(\beta_1\)). Since you “spent” 2 pieces of information to create the regression line, you are left with \(n - 2\) degrees of freedom for the errors.
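A small sketch of this degrees-of-freedom bookkeeping for our data:
# Degrees-of-freedom bookkeeping for SLR (sketch)
n_obs <- nrow(brfss_slr)
df_total <- n_obs - 1   # one df spent estimating the mean of Y
df_model <- 1           # one slope parameter in SLR
df_resid <- n_obs - 2   # two dfs spent estimating b0 and b1
c(total = df_total, model = df_model, residual = df_resid)   # total = model + residual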
# ANOVA decomposition
anova_slr <- anova(model_slr)
anova_slr %>%
kable(
caption = "ANOVA Table: BMI ~ Age",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| age | 1 | 35.438 | 35.438 | 0.721 | 0.396 |
| Residuals | 2998 | 147400.214 | 49.166 | NA | NA |
The Mean Squared Error estimates the variance of the error term:
\[MSE = \frac{SS_{Residual}}{n - 2} = \hat{\sigma}^2\]
The Residual Standard Error \(\hat{\sigma} = \sqrt{MSE}\) is in the same units as \(Y\) and tells us the typical prediction error of the model.
n <- nrow(brfss_slr)
ss_resid <- sum(augmented$.resid^2)
mse <- ss_resid / (n - 2)
sigma_hat <- sqrt(mse)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid, 2), round(mse, 3), round(sigma_hat, 3)),
Units = c("", "", "kg/m²")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value | Units |
|---|---|---|
| SS Residual | 147400.21 | |
| MSE (σ̂²) | 49.166 | |
| Residual Std. Error (σ̂) | 7.012 | kg/m² |
Interpretation: On average, our model's predictions are off by about 7.01 kg/m² of BMI.
\(R^2\) measures the proportion of total variability in \(Y\) explained by the linear regression on \(X\):
\[R^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}\]
\(R^2\) ranges from 0 to 1: a value of 0 means the predictor explains none of the variability in \(Y\), while a value of 1 means the regression line fits the data perfectly.
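Before extracting \(R^2\) from the model object, a sketch computing it directly from the sums of squares:
# R² from the sums of squares, by hand (sketch)
ss_total_check <- sum((brfss_slr$bmi - mean(brfss_slr$bmi))^2)
ss_resid_check <- sum(residuals(model_slr)^2)
1 - ss_resid_check / ss_total_check   # should match summary(model_slr)$r.squared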
# Extract R-squared from model
r_sq <- summary(model_slr)$r.squared
adj_r_sq <- summary(model_slr)$adj.r.squared
tibble(
Metric = c("R²", "Adjusted R²", "Variance Explained"),
Value = c(
round(r_sq, 4),
round(adj_r_sq, 4),
paste0(round(r_sq * 100, 2), "%")
)
) %>%
kable(caption = "R² and Adjusted R²") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| R² | 2e-04 |
| Adjusted R² | -1e-04 |
| Variance Explained | 0.02% |
For simple linear regression:
\[R^2 = r^2\]
where \(r\) is the Pearson correlation coefficient between \(X\) and \(Y\).
r_pearson <- cor(brfss_slr$age, brfss_slr$bmi)
tibble(
Quantity = c("Pearson r", "r² (from Pearson)", "R² (from model)", "r² = R²?"),
Value = c(
round(r_pearson, 4),
round(r_pearson^2, 4),
round(r_sq, 4),
as.character(round(r_pearson^2, 4) == round(r_sq, 4))
)
) %>%
kable(caption = "Pearson r vs. R² from Model") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value |
|---|---|
| Pearson r | -0.0155 |
| r² (from Pearson) | 2e-04 |
| R² (from model) | 2e-04 |
| r² = R²? | TRUE |
Important caveat: A low \(R^2\) does not mean the regression is useless. In epidemiology, outcomes are influenced by many unmeasured factors, so \(R^2\) values of 0.05–0.20 can still yield scientifically meaningful and statistically significant estimates.
The most important hypothesis test in SLR is:
\[H_0: \beta_1 = 0 \quad \text{(no linear relationship between X and Y)}\] \[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]
Test statistic:
\[t = \frac{b_1 - 0}{SE(b_1)} \sim t_{n-2} \quad \text{under } H_0\]
Where:
\[SE(b_1) = \frac{\hat{\sigma}}{\sqrt{\sum(X_i - \bar{X})^2}} = \frac{\hat{\sigma}}{\sqrt{S_{XX}}}\]
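A sketch that hand-computes \(SE(b_1)\) and the t-statistic from these formulas, for comparison with the summary output:
# Hand-compute SE(b1) and the t-statistic (sketch)
sigma_check <- summary(model_slr)$sigma                # residual standard error
s_xx <- sum((brfss_slr$age - mean(brfss_slr$age))^2)   # S_XX
se_b1 <- sigma_check / sqrt(s_xx)
t_stat <- coef(model_slr)["age"] / se_b1
c(se_b1 = se_b1, t = unname(t_stat))                   # compare with summary(model_slr)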
# Extract slope test statistics
slope_test <- tidy(model_slr, conf.int = TRUE) %>% filter(term == "age")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
n - 2,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value |
|---|---|
| Slope (b₁) | -0.0062 |
| SE(b₁) | 0.0073 |
| t-statistic | -0.849 |
| Degrees of freedom | 2998 |
| p-value | 0.396 |
| 95% CI Lower | -0.0206 |
| 95% CI Upper | 0.0082 |
Decision: With p = 0.396, we fail to reject \(H_0\) at the \(\alpha = 0.05\) level. There is no statistically significant evidence of a linear association between age and BMI in this sample.
The F-test evaluates whether the overall model (i.e., all predictors together) explains a statistically significant portion of the variability in \(Y\). For simple linear regression with one predictor, the F-test is equivalent to the t-test for the slope (\(F = t^2\)).
\[F = \frac{MS_{Regression}}{MS_{Residual}} \sim F_{1,\, n-2} \quad \text{under } H_0\]
The numerator (\(MS_{Regression}\)): the amount of variability in the data that your model successfully explains.
The denominator (\(MS_{Residual}\)): the leftover, unexplained variability (the errors).
When your model does a good job of predicting the outcome, the explained variance goes up and the unexplained variance goes down, making the F-statistic larger. What a higher F-statistic means for your results:
Lower p-value: A larger F-statistic pushes the p-value closer to zero.
Statistical Significance: If the p-value drops below your alpha level (usually 0.05), you can reject the null hypothesis. It gives you confidence that your overall model actually has some predictive power, rather than just capturing random noise.
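As a quick sketch, the F-ratio can also be computed by hand from the ANOVA mean squares and compared with the model summary:
# Compute F directly from the ANOVA decomposition (sketch)
a_tab <- anova(model_slr)
f_manual <- a_tab$`Mean Sq`[1] / a_tab$`Mean Sq`[2]   # MS_Regression / MS_Residual
c(F_manual = f_manual, F_summary = unname(summary(model_slr)$fstatistic[1]))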
f_stat <- summary(model_slr)$fstatistic
f_value <- f_stat[1]
df1 <- f_stat[2]
df2 <- f_stat[3]
p_f <- pf(f_value, df1, df2, lower.tail = FALSE)
tibble(
Quantity = c("F-statistic", "df (numerator)", "df (denominator)",
"p-value", "Verification: t²", "Verification: F"),
Value = c(
round(f_value, 3),
df1,
df2,
format.pval(p_f, digits = 3),
round(slope_test$statistic^2, 3),
round(f_value, 3)
)
) %>%
kable(caption = "F-Test for Overall Model Significance") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value |
|---|---|
| F-statistic | 0.721 |
| df (numerator) | 1 |
| df (denominator) | 2998 |
| p-value | 0.396 |
| Verification: t² | 0.721 |
| Verification: F | 0.721 |
A 95% CI for \(\beta_1\) is:
\[b_1 \pm t_{n-2, \, 0.025} \times SE(b_1)\]
t_crit <- qt(0.975, df = n - 2)
ci_lower <- slope_test$estimate - t_crit * slope_test$std.error
ci_upper <- slope_test$estimate + t_crit * slope_test$std.error
tibble(
Bound = c("95% CI Lower", "95% CI Upper"),
Value = c(round(ci_lower, 4), round(ci_upper, 4)),
Units = c("kg/m² per year", "kg/m² per year")
) %>%
kable(caption = "95% Confidence Interval for β₁ (manually computed)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Bound | Value | Units |
|---|---|---|
| 95% CI Lower | -0.0206 | kg/m² per year |
| 95% CI Upper | 0.0082 | kg/m² per year |
A confidence interval for the mean response \(E(Y \mid X = x^*)\) gives a range of plausible values for the population mean of \(Y\) at a specific value \(x^*\):
\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE(\hat{Y}^*)\]
Where:
\[SE(\hat{Y}^*) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]
A prediction interval gives a range for a single new observation \(Y^*_{new}\) at \(X = x^*\). It is always wider than the confidence interval because it accounts for both the uncertainty in \(E(Y)\) and the individual variability around the mean:
\[\hat{Y}^* \pm t_{n-2, \, \alpha/2} \times SE_{pred}\]
Where:
\[SE_{pred} = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{S_{XX}}}\]
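To connect these formulas to predict(), a sketch that hand-computes the 95% CI and PI at age 45; the results should match the table below:
# Hand-compute the 95% CI and PI at x* = 45 (sketch)
x_star <- 45
x <- brfss_slr$age
n_obs <- nrow(brfss_slr)
sigma_check <- summary(model_slr)$sigma
s_xx <- sum((x - mean(x))^2)
y_hat_star <- unname(predict(model_slr, newdata = data.frame(age = x_star)))
se_mean <- sigma_check * sqrt(1 / n_obs + (x_star - mean(x))^2 / s_xx)
se_pred <- sigma_check * sqrt(1 + 1 / n_obs + (x_star - mean(x))^2 / s_xx)
t_crit <- qt(0.975, df = n_obs - 2)
c(CI = y_hat_star + c(-1, 1) * t_crit * se_mean,   # matches interval = "confidence"
  PI = y_hat_star + c(-1, 1) * t_crit * se_pred)   # matches interval = "prediction"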
# Compute CI and PI at specific age values
new_ages <- data.frame(age = c(25, 35, 45, 55, 65, 75))
ci_pred <- predict(model_slr, newdata = new_ages, interval = "confidence") %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
pi_pred <- predict(model_slr, newdata = new_ages, interval = "prediction") %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
results_table <- bind_cols(new_ages, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by Age",
col.names = c("Age", "Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))
| Age | Fitted BMI | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 29.37 | 28.87 | 29.88 | 15.61 | 43.13 |
| 35 | 29.31 | 28.92 | 29.70 | 15.56 | 43.06 |
| 45 | 29.25 | 28.95 | 29.54 | 15.50 | 43.00 |
| 55 | 29.19 | 28.93 | 29.44 | 15.43 | 42.94 |
| 65 | 29.12 | 28.84 | 29.41 | 15.37 | 42.87 |
| 75 | 29.06 | 28.68 | 29.44 | 15.31 | 42.81 |
# Generate CI and PI across the full age range
age_grid <- data.frame(age = seq(18, 80, length.out = 200))
ci_band <- predict(model_slr, newdata = age_grid, interval = "confidence") %>%
as.data.frame() %>%
bind_cols(age_grid)
pi_band <- predict(model_slr, newdata = age_grid, interval = "prediction") %>%
as.data.frame() %>%
bind_cols(age_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_slr, aes(x = age, y = bmi),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = age, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = age, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = age, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: BMI ~ Age",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "Age (years)",
y = "BMI (kg/m²)",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_pi
Regression Line with 95% Confidence and Prediction Intervals
Key distinction: If you want to estimate the average BMI for all 45-year-olds in the population, use the confidence interval. If you want to predict the BMI of a specific new 45-year-old patient, use the prediction interval.
Fitting a regression model is not enough — we must verify that the LINE assumptions are reasonably met. We do this through residual diagnostics.
par(mfrow = c(2, 2))
plot(model_slr, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)
Standard Regression Diagnostic Plots
par(mfrow = c(1, 1))
Interpreting each plot:
1. Residuals vs. Fitted: Checks linearity and equal variance. We want a horizontal red line and random scatter with no pattern. A “fan shape” (spread increasing with fitted values) indicates heteroscedasticity.
2. Normal Q-Q Plot: Checks normality of residuals. Points should fall approximately along the 45° reference line. Heavy tails or S-curves suggest non-normality.
3. Scale-Location (Spread-Location): Another check for equal variance (homoscedasticity). The square root of standardized residuals is plotted against fitted values. A flat line indicates constant variance.
4. Residuals vs. Leverage: Identifies influential observations using Cook’s distance. Points in the upper or lower right corner (beyond the dashed lines) have high influence.
The Residuals vs. Predictor plot is a visual diagnostic tool used to check if your data meets the core assumptions of Simple Linear Regression—specifically Linearity and Equal Variance (Homoscedasticity).
p_resid_x <- ggplot(augmented, aes(x = age, y = .resid)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", color = "orange", se = FALSE, linewidth = 1) +
labs(
title = "Residuals vs. Age",
subtitle = "Should show no pattern — random scatter around zero",
x = "Age (years)",
y = "Residuals"
) +
theme_minimal(base_size = 13)
p_resid_x
Residuals vs. Age — Checking Linearity
Y-axis (Residuals): The errors from your model. This is how far off each person’s actual BMI was from the model’s predicted BMI.
X-axis (Age): Your predictor variable.
(Note: In Simple Linear Regression with only one predictor, this plot looks identical in shape to the “Residuals vs. Fitted” plot).
The Red Line: This is a flat, horizontal line at exactly zero. If the model predicted a person’s BMI perfectly, their point would land exactly on this line.
The Orange Line (LOESS Smoother): This is a moving average of the residuals. It helps your eye track the overall trend of the errors across different ages.
The orange line stays fairly close to the red line, though there is a very slight curve downwards at the extreme ends of the age range. The vertical spread of the points looks relatively consistent across the ages. Overall, it doesn’t show any severe violations of linearity or equal variance, though there is a lot of random scatter (which aligns with the very low \(R^2\) value we saw earlier).
p_hist <- ggplot(augmented, aes(x = .resid)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = mean(augmented$.resid),
sd = sd(augmented$.resid)),
color = "black", linetype = "dashed", linewidth = 1) +
labs(
title = "Distribution of Residuals",
subtitle = "Red = kernel density | Black dashed = normal distribution",
x = "Residuals",
y = "Density"
) +
theme_minimal(base_size = 13)
p_hist
Distribution of Residuals
# ggplot version of QQ plot
p_qq <- ggplot(augmented, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq
Normal Q-Q Plot of Residuals
# Cook's distance
augmented <- augmented %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_slr),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(augmented$cooks_d > 4 / n)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooks
Cook’s Distance: Identifying Influential Observations
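To see which data points drive the flagged values, a sketch listing the five largest Cook's distances:
# Inspect the observations with the largest Cook's distance (sketch)
augmented %>%
  arrange(desc(cooks_d)) %>%
  select(age, bmi, .fitted, .resid, cooks_d) %>%
  slice_head(n = 5)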
To reinforce the concepts, let’s fit a second SLR model examining the association between hours of sleep and BMI.
p_sleep <- ggplot(brfss_slr, aes(x = sleep_hrs, y = bmi)) +
geom_jitter(alpha = 0.15, color = "purple", width = 0.15, height = 0) +
geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
labs(
title = "BMI vs. Nightly Sleep Hours (BRFSS 2020)",
x = "Average Hours of Sleep per Night",
y = "BMI (kg/m²)"
) +
theme_minimal(base_size = 13)
p_sleep
BMI vs. Sleep Hours
model_sleep <- lm(bmi ~ sleep_hrs, data = brfss_slr)
tidy(model_sleep, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "SLR: BMI ~ Hours of Sleep per Night",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 30.7419 | 0.534 | 57.5683 | 0.0000 | 29.6948 | 31.7890 |
| sleep_hrs | -0.2256 | 0.075 | -3.0087 | 0.0026 | -0.3726 | -0.0786 |
b1_sleep <- coef(model_sleep)["sleep_hrs"]
r2_sleep <- summary(model_sleep)$r.squared
tibble(
Metric = c("Slope (b₁)", "R²"),
Value = c(round(b1_sleep, 4), round(r2_sleep, 4))
) %>%
kable(caption = "Sleep Model Key Statistics") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Slope (b₁) | -0.2256 |
| R² | 0.0030 |
Interpretation: Each additional hour of sleep per night is associated with a 0.2256 kg/m² lower BMI, on average (more sleep → lower BMI). The model explains only 0.3% of the variability in BMI. While statistically significant (p = 0.0026), the effect size is modest, underscoring the multifactorial nature of BMI.
par(mfrow = c(2, 2))
plot(model_sleep, which = 1:4,
col = adjustcolor("purple", 0.4), pch = 19, cex = 0.6)
par(mfrow = c(1, 1))
Our linear model estimated a negative slope for age: older adults have, on average, slightly lower BMI. But is that the full story? Cross-sectional data can show a decline at older ages due to survivorship bias — people with very high BMI may die before reaching old age, leaving a healthier-looking older sample. There may also be a genuine nonlinear pattern (BMI rises through middle age, then declines in later life).
We can test this by including an age² term in the model:
\[\widehat{\text{BMI}} = b_0 + b_1 \cdot \text{Age} + b_2 \cdot \text{Age}^2\]
This is still a linear regression model (linear in the coefficients), even though it is nonlinear in the predictor. It allows the slope to change across the range of age.
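As an aside, the squared term can also be created inline rather than as a new column; a sketch of two equivalent formulations:
# Equivalent quadratic fits without a separate age2 column (sketch)
model_quad_I <- lm(bmi ~ age + I(age^2), data = brfss_slr)            # inline squared term
model_quad_p <- lm(bmi ~ poly(age, 2, raw = TRUE), data = brfss_slr)  # raw polynomial basis
all.equal(unname(fitted(model_quad_I)), unname(fitted(model_quad_p))) # same fitted values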
# Add age-squared term
brfss_slr <- brfss_slr %>%
mutate(age2 = age^2)
# Fit quadratic model
model_quad <- lm(bmi ~ age + age2, data = brfss_slr)
tidy(model_quad, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 5))) %>%
kable(
caption = "Quadratic Model: BMI ~ Age + Age²",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 18.54178 | 1.08095 | 17.15329 | 0 | 16.42230 | 20.66125 |
| age | 0.47435 | 0.04418 | 10.73772 | 0 | 0.38773 | 0.56096 |
| age2 | -0.00464 | 0.00042 | -11.02651 | 0 | -0.00546 | -0.00381 |
# Compare linear vs. quadratic model
tibble(
Model = c("Linear: BMI ~ Age", "Quadratic: BMI ~ Age + Age²"),
R_squared = c(
round(summary(model_slr)$r.squared, 4),
round(summary(model_quad)$r.squared, 4)
),
Adj_R2 = c(
round(summary(model_slr)$adj.r.squared, 4),
round(summary(model_quad)$adj.r.squared, 4)
),
AIC = c(round(AIC(model_slr), 1), round(AIC(model_quad), 1))
) %>%
kable(
caption = "Model Comparison: Linear vs. Quadratic",
col.names = c("Model", "R²", "Adj. R²", "AIC")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which.min(c(AIC(model_slr), AIC(model_quad))),
bold = TRUE, background = "#d4edda")
| Model | R² | Adj. R² | AIC |
|---|---|---|---|
| Linear: BMI ~ Age | 0.0002 | -0.0001 | 20203.2 |
| Quadratic: BMI ~ Age + Age² | 0.0392 | 0.0386 | 20085.9 |
# Generate predicted values from both models
age_seq <- data.frame(age = seq(18, 80, length.out = 300)) %>%
mutate(age2 = age^2)
pred_linear <- predict(model_slr, newdata = age_seq)
pred_quad <- predict(model_quad, newdata = age_seq)
pred_df <- age_seq %>%
mutate(
linear = pred_linear,
quadratic = pred_quad
) %>%
pivot_longer(cols = c(linear, quadratic),
names_to = "Model", values_to = "Predicted_BMI")
ggplot() +
geom_point(data = brfss_slr, aes(x = age, y = bmi),
alpha = 0.10, color = "steelblue", size = 1) +
geom_line(data = pred_df, aes(x = age, y = Predicted_BMI, color = Model),
linewidth = 1.3) +
scale_color_manual(
values = c("linear" = "red", "quadratic" = "darkorange"),
labels = c("linear" = "Linear fit", "quadratic" = "Quadratic fit (Age + Age²)")
) +
labs(
title = "BMI vs. Age: Linear vs. Quadratic Model",
subtitle = "Does BMI rise then fall with age, or decline monotonically?",
x = "Age (years)",
y = "BMI (kg/m²)",
color = "Model"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
Linear vs. Quadratic Fit: BMI ~ Age
Interpretation: If the coefficient on Age² is negative and statistically significant, the fitted curve is an inverted-U — BMI peaks at some middle age and declines thereafter. Extract the peak using \(\text{Age}^* = -b_1 / (2 b_2)\). A positive Age² coefficient would indicate a U-shape (BMI lowest in middle age).
b1_q <- coef(model_quad)["age"]
b2_q <- coef(model_quad)["age2"]
peak_age <- -b1_q / (2 * b2_q)
tibble(
Quantity = c("b₁ (Age)", "b₂ (Age²)", "Peak / Trough Age (-b₁ / 2b₂)"),
Value = c(round(b1_q, 5), round(b2_q, 6), round(peak_age, 1))
) %>%
kable(caption = "Quadratic Model Coefficients and Implied Turning Point") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value |
|---|---|
| b₁ (Age) | 0.47435 |
| b₂ (Age²) | -0.004635 |
| Peak / Trough Age (-b₁ / 2b₂) | 51.2 |
Caution on interpretation: Even if the quadratic model fits better statistically, be cautious about causal interpretation. The cross-sectional pattern reflects cohort differences in BMI trajectories, not necessarily the aging process within any individual. Survivorship bias (heavier individuals dying earlier) can make the quadratic term appear significant in cross-sectional data.
| Quantity | Formula |
|---|---|
| Slope | \(b_1 = S_{XY} / S_{XX}\) |
| Intercept | \(b_0 = \bar{Y} - b_1 \bar{X}\) |
| SSTotal | \(\sum(Y_i - \bar{Y})^2\) |
| SSRegression | \(\sum(\hat{Y}_i - \bar{Y})^2\) |
| SSResidual | \(\sum(Y_i - \hat{Y}_i)^2\) |
| MSE | \(SS_{Residual} / (n-2)\) |
| \(R^2\) | \(SS_{Reg} / SS_{Total}\) |
| \(SE(b_1)\) | \(\hat{\sigma}/\sqrt{S_{XX}}\) |
| t-statistic | \(b_1 / SE(b_1)\) |
| 95% CI for \(\beta_1\) | \(b_1 \pm t_{n-2, 0.025} \cdot SE(b_1)\) |
In this lab, you will apply Simple Linear Regression to the BRFSS 2020 dataset using a different outcome variable: number of days of poor physical health in the past 30 days (phys_days). You will model it as a continuous outcome predicted by BMI.
Research Question: Is BMI associated with the number of days of poor physical health among U.S. adults?
Use the code below to load the data. The dataset is the same one used in the lecture — you only need to load it once.
brfss_lab <- read_rds("C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/Lab 6/new_data.rds")
# (a) Create a summary table of phys_days and bmi
brfss_lab %>%
select(bmi, phys_days) %>%
summary() %>%
kable(caption = "Descriptive Statistics: BMi and Physical activity") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| bmi | phys_days |
|---|---|
| Min. :14.63 | Min. : 1.00 |
| 1st Qu.:24.32 | 1st Qu.: 2.00 |
| Median :27.89 | Median : 6.00 |
| Mean :29.18 | Mean :11.66 |
| 3rd Qu.:32.89 | 3rd Qu.:20.00 |
| Max. :59.60 | Max. :30.00 |
brfss_lab %>%
select(bmi, phys_days) %>%
tbl_summary(
label = list(
bmi ~ "BMI (kg/m²)",
phys_days ~ "Physical activity in days"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics of BMI and Physical Activity (BRFSS 2020, n = 3,000)**")
| Characteristic | N | N = 3,000¹ |
|---|---|---|
| BMI (kg/m²) | 3,000 | 29.2 (7.0) |
| Poor physical health (days) | 3,000 | 11.7 (11.2) |

¹ Mean (SD)
# (b) Create a histogram of phys_days — describe the distribution
ggplot(brfss_lab, aes(x = phys_days)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(
title = "Histogram of Days Physically Active (Last 30 Days)",
x = "Number of Days Physically Active",
y = "Frequency"
) +
theme_minimal()
# (c) Create a scatter plot of phys_days (Y) vs bmi (X)
ggplot(brfss_lab, aes(x = bmi, y = phys_days)) +
geom_point(color = "steelblue", alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
labs(
title = "Physically Active Days vs BMI",
x = "BMI (kg/m²)",
y = "Number of Physically Active Days"
) +
theme_minimal()
Questions: What are the mean and SD of phys_days and bmi? What do you notice about the distribution of phys_days?

Mean & SD:

BMI: Mean = 29.2, SD = 7.0

Poor physical health days (phys_days): Mean = 11.7 days, SD = 11.2 days

The mean (11.7) is much higher than the median (6), and the 3rd quartile is 20 with a max of 30, so the distribution looks right-skewed: many people report few days of poor physical health, while some report very high values (close to 30).
Scatter plot interpretation: The relationship between BMI and days of poor physical health appears weak and not strongly linear. If anything, there is a slight positive trend (higher BMI → more days of poor physical health), but it does not look strong. A few extreme BMI values (near 60) could act as possible outliers.
# (a) Fit the SLR model: phys_days ~ bmi
model_lab <- lm(phys_days ~ bmi, data = brfss_lab)
# Summary output
summary(model_lab)
##
## Call:
## lm(formula = phys_days ~ bmi, data = brfss_lab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.808 -9.160 -5.623 8.943 20.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.42285 0.86881 8.544 < 2e-16 ***
## bmi 0.14513 0.02895 5.013 5.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 2998 degrees of freedom
## Multiple R-squared: 0.008314, Adjusted R-squared: 0.007983
## F-statistic: 25.13 on 1 and 2998 DF, p-value: 5.659e-07
# (b) Display a tidy coefficient table with 95% CIs
tidy(model_lab, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Physically Sctive Days ~ BMI (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.4228 | 0.8688 | 8.5437 | 0 | 5.7193 | 9.1264 |
| bmi | 0.1451 | 0.0289 | 5.0134 | 0 | 0.0884 | 0.2019 |
# (c) Extract and report: slope, intercept, t-statistic, p-value
b0 <- round(coef(model_lab)[1], 3)
b1 <- round(coef(model_lab)[2], 4)
slope_test <- tidy(model_lab, conf.int = TRUE) %>% filter(term == "bmi")
tibble(
Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
"Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
Value = c(
round(slope_test$estimate, 4),
round(slope_test$std.error, 4),
round(slope_test$statistic, 3),
n - 2,
format.pval(slope_test$p.value, digits = 3),
round(slope_test$conf.low, 4),
round(slope_test$conf.high, 4)
)
) %>%
kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value |
|---|---|
| Slope (b₁) | 0.1451 |
| SE(b₁) | 0.0289 |
| t-statistic | 5.013 |
| Degrees of freedom | 2998 |
| p-value | 5.66e-07 |
| 95% CI Lower | 0.0884 |
| 95% CI Upper | 0.2019 |
Questions:

Fitted equation: \(\widehat{\text{phys\_days}} = 7.423 + 0.145 \times \text{BMI}\)

For each 1-unit increase in BMI (kg/m²), the model predicts about 0.145 more days of poor physical health, on average (there are no other variables to hold constant in this simple model).

Is the intercept meaningful? Not really: \(b_0 = 7.423\) is the predicted number of poor physical health days when BMI = 0, which is not realistic for humans, so it has no practical meaning here.

To test whether BMI is associated with days of poor physical health, we tested the null hypothesis \(H_0: \beta_1 = 0\), meaning there is no linear relationship between BMI and days of poor physical health. The t-statistic for the slope was 5.013 with 2998 degrees of freedom, and the p-value was 5.66 × 10⁻⁷. Since this p-value is much smaller than α = 0.05, we reject the null hypothesis and conclude that there is a statistically significant association between BMI and days of poor physical health in this sample.
# (a) Display the ANOVA table
anova_lab <- anova(model_lab)
anova_lab %>%
kable(
caption = "ANOVA Table: Physycally ACtive Days ~ BMI",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| bmi | 1 | 3105.365 | 3105.365 | 25.134 | 0 |
| Residuals | 2998 | 370411.743 | 123.553 | NA | NA |
# (b) Compute and report SSTotal, SSRegression, and SSResidual
SSRegression <- anova_lab$`Sum Sq`[1]
SSResidual <- anova_lab$`Sum Sq`[2]
# Total Sum of Squares
SSTotal <- SSRegression + SSResidual
# Print results
SSRegression
## [1] 3105.365
SSResidual
## [1] 370411.7
SSTotal
## [1] 373517.1
# (c) Compute R² two ways: from the model object and from the SS decomposition
#From Model object
R2_model <- summary(model_lab)$r.squared
R2_model
## [1] 0.008313849
#From SS decomposition
R2_ss <- SSRegression / SSTotal
R2_ss
## [1] 0.008313849
Questions:
From the ANOVA table, \(SS_{Regression}\) is 3,105.365, \(SS_{Residual}\) is 370,411.743, and \(SS_{Total}\) is 373,517.1. The regression has 1 degree of freedom, the residual has 2,998 df, and the total df is 2,999. The F-statistic is 25.134, the ratio of the regression mean square (3,105.365) to the residual mean square (123.553).
The R² value is 0.0083, meaning BMI explains only about 0.83% of the variation in days of poor physical health. In plain English, BMI barely explains any of the differences in how many days of poor physical health people report. Even though the relationship is statistically significant, it is very weak in terms of practical explanation.
This low R² tells us that BMI alone does a poor job explaining variation in days of poor physical health. About 99% of the variation is due to other factors not included in the model. Age, sex, overall health status, disability, motivation, work schedule, environment, or even reporting error could explain most of the remaining variation.
# (a) Calculate the fitted value (predicted days) and 95% CI for a person with BMI = 25
new_bmi <- data.frame(bmi=25)
ci_pred <- predict(model_lab, newdata = new_bmi, interval = "confidence", level=0.95) %>%
as.data.frame() %>%
rename(Fitted = fit, CI_Lower = lwr, CI_Upper = upr)
ci_pred
## Fitted CI_Lower CI_Upper
## 1 11.05108 10.58774 11.51441
# (b) Calculate the 95% prediction interval for a person with BMI = 25
pi_pred <- predict(model_lab, newdata = new_bmi, interval = "prediction", level = 0.95) %>%
as.data.frame() %>%
rename(PI_Lower = lwr, PI_Upper = upr) %>%
select(-fit)
pi_pred
## PI_Lower PI_Upper
## 1 -10.7485 32.85066
results_table <- bind_cols(new_bmi, ci_pred, pi_pred) %>%
mutate(across(where(is.numeric), ~ round(., 2)))
results_table %>%
kable(
caption = "Fitted Values, 95% Confidence Intervals, and Prediction Intervals by BMI",
col.names = c("BMI= 25", "Fitted BMI", "CI Lower", "CI Upper", "PI Lower", "PI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
add_header_above(c(" " = 2, "95% CI for Mean" = 2, "95% PI for Individual" = 2))
| BMI | Fitted Days | CI Lower | CI Upper | PI Lower | PI Upper |
|---|---|---|---|---|---|
| 25 | 11.05 | 10.59 | 11.51 | -10.75 | 32.85 |
# (c) Plot the regression line with both the CI band and PI band
# Create a grid of BMI values across the observed range
bmi_grid <- data.frame(bmi = seq(min(brfss_lab$bmi), max(brfss_lab$bmi), length.out = 200))
ci_band <- predict(model_lab, newdata = bmi_grid, interval = "confidence", level=0.95) %>%
as.data.frame() %>%
bind_cols(bmi_grid)
pi_band <- predict(model_lab, newdata = bmi_grid, interval = "prediction", level=0.95) %>%
as.data.frame() %>%
bind_cols(bmi_grid)
p_ci_pi <- ggplot() +
geom_point(data = brfss_lab, aes(x = bmi, y = phys_days ),
alpha = 0.10, color = "steelblue", size = 1) +
geom_ribbon(data = pi_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "lightblue", alpha = 0.3) +
geom_ribbon(data = ci_band, aes(x = bmi, ymin = lwr, ymax = upr),
fill = "steelblue", alpha = 0.4) +
geom_line(data = ci_band, aes(x = bmi, y = fit),
color = "red", linewidth = 1.2) +
labs(
title = "Simple Linear Regression: Physical Activite Days ~ BMI",
subtitle = "Dark band = 95% CI for mean response | Light band = 95% PI for individual observation",
x = "BMI",
y = "Physically Active in Days",
caption = "BRFSS 2020, n = 3,000"
) +
theme_minimal(base_size = 13)
p_ci_pi
Questions:
For someone with a BMI of 25, the estimated mean number of days of poor physical health is 11.05 days. The 95% confidence interval (CI) for this mean is 10.59 to 11.51 days, meaning we are 95% confident that the true average number of poor physical health days for people with BMI = 25 falls within this range.
For a specific new individual with a BMI of 25, the 95% prediction interval (PI) is much wider: -10.75 to 32.85 days. This interval reflects where an individual person's number of poor physical health days could realistically fall; note that the negative lower bound is an impossible value, which hints that a normal-theory linear model is an awkward fit for a bounded count outcome.
The prediction interval is wider than the confidence interval because it accounts for both the uncertainty in estimating the mean and the natural person-to-person variability in days of poor physical health. The confidence interval is used when we care about the average response for a group, while the prediction interval is used when we want to predict the outcome for one specific individual.
# (a) Produce the four standard diagnostic plots (use par(mfrow = c(2,2)) and plot())
par(mfrow = c(2, 2))
plot(model_lab, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)
# (b) Create a residuals vs. fitted plot using ggplot
# Augment model to get fitted values and residuals
aug_lab <- augment(model_lab)
ggplot(aug_lab, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.2, color = "steelblue") +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 1) +
labs(
title = "Residuals vs Fitted Values",
subtitle = "Checking linearity and equal variance (homoscedasticity)",
x = "Fitted values",
y = "Residuals"
) +
theme_minimal()
# (c) Create a normal Q-Q plot of residuals using ggplot
p_qq <- ggplot(aug_lab, aes(sample = .resid)) +
stat_qq(color = "steelblue", alpha = 0.3, size = 1) +
stat_qq_line(color = "red", linewidth = 1) +
labs(
title = "Normal Q-Q Plot of Residuals",
subtitle = "Points should lie on the red line if residuals are normally distributed",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal(base_size = 13)
p_qq
# (d) Create a Cook's distance plot
aug_lab <- aug_lab %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(model_lab),
influential = ifelse(cooks_d > 4 / n, "Potentially influential", "Not influential")
)
n_influential <- sum(aug_lab$cooks_d > 4 / n)
p_cooks <- ggplot(aug_lab, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 4 / n, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Potentially influential" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = 4/n threshold | ",
n_influential, " potentially influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooks
Questions:
From the Residuals vs. Fitted plot, there is some evidence of nonlinearity. The orange LOESS curve is slightly curved rather than flat, especially at lower fitted values, suggesting the relationship may not be perfectly linear. The spread of residuals also changes a bit across fitted values, which hints at mild heteroscedasticity (non-constant variance). The striped pattern you see is likely because phys_days is a count variable (0–30), so residuals fall in horizontal bands.
Looking at the Q-Q plot, the residuals clearly deviate from the red line, especially in the tails. This suggests the residuals are not normally distributed. In this context, that makes sense because phys_days is a bounded count variable (0 to 30 days), not a continuous normal variable. The departures from normality suggest the outcome is skewed and discrete rather than bell-shaped.
For Cook's distance, there are influential observations above the 4/n threshold. Based on the plot, there are 136 potentially influential observations. However, their Cook's D values are still relatively small overall. I would not automatically remove them; instead, I would check whether they are data errors or legitimate extreme values and possibly run a sensitivity analysis to see whether results change when they are excluded.
The LINE assumptions are only partially met. Linearity is questionable, normality is clearly violated, and there is some mild heteroscedasticity. The most problematic assumption is normality, mainly because the outcome variable (phys_days) is a discrete count bounded between 0 and 30. A linear regression may not be the best model here; a Poisson regression might be more appropriate given the nature of the outcome.
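A minimal sketch of that alternative, assuming the same brfss_lab data (the model name model_pois is illustrative):
# Poisson regression for the bounded count outcome (sketch)
model_pois <- glm(phys_days ~ bmi, data = brfss_lab, family = poisson(link = "log"))
exp(coef(model_pois))   # rate ratios: multiplicative change in expected days per 1-unit BMI
Because a Poisson model assumes the variance equals the mean, and phys_days has an SD of 11.2 against a mean of 11.7 (variance far above the mean), a quasi-Poisson or negative binomial model may fit even better.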
Now fit a second SLR model using age as the predictor of phys_days instead of BMI.
# (a) Visualize the relationship: phys_days (Y) vs. age (X)
p_phys_age <- ggplot(brfss_lab, aes(x = age, y = phys_days)) +
geom_jitter(alpha = 0.15, color = "purple", width = 0.15, height = 0) +
geom_smooth(method = "lm", color = "darkred", linewidth = 1.2, se = TRUE) +
labs(
title = "Poor Physical Health Days vs. Age (BRFSS 2020)",
x = "Age (years)",
y = "Days of Poor Physical Health"
) +
theme_minimal(base_size = 13)
p_phys_age
# (b) Fit SLR: phys_days ~ age, display results, and compare to the BMI model
model_phys_age <- lm(phys_days ~ age, data = brfss_lab)
tidy(model_phys_age, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "SLR: Poor Physical Health Days ~ Age",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 28.5141 | 0.1844 | 154.6357 | 0 | 28.1525 | 28.8757 |
| phys_days | 0.0573 | 0.0114 | 5.0134 | 0 | 0.0349 | 0.0797 |
# (c) Which predictor has the stronger association? Compare R² values.
b1_phys_days <- coef(model_Phys_days)["sleep_hrs"]
r2_phys_days <- summary(model_Phys_days)$r.squared
tibble(
Metric = c("Slope (b₁)", "R²"),
Value = c(round(b1_phys_age, 4), round(r2_phys_age, 4))
) %>%
kable(caption = "Physical Activity Model Key Statistics") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Slope (b₁) | NA |
| R² | 0.0083 |
Questions:
The association between age and days of poor physical health appears to be positive, meaning that as age increases, days of poor physical health slightly increase (based on the positive slope ≈ 0.057). The magnitude is small, similar to what we saw with BMI (slope ≈ 0.145). Both associations are statistically significant (p < 0.001), but the effect sizes are very small in practical terms. So direction-wise both are positive, the magnitudes are weak, and both are statistically significant.
The R² for the BMI model was 0.0083, and the R² for this model is also about 0.0083. That means both predictors explain only about 0.83% of the variability in days of poor physical health. Neither predictor meaningfully explains the outcome; they perform almost identically and very poorly in terms of explained variance.
Based on these two simple linear regression models, neither BMI nor age alone is a strong predictor of days of poor physical health. Even though both associations are statistically significant, they explain less than 1% of the variability, which is practically negligible. A key limitation here is that phys_days is a bounded count variable (0–30 days), which violates the normality assumption and makes linear regression less appropriate. A count-based model like Poisson regression would likely be more suitable. Also, these simple models ignore other important factors (health status, disability, socioeconomic status, etc.) that probably explain much more of the variation.
Submit your completed .Rmd file and the RPubs link to your knitted HTML document. Your .Rmd must knit without errors. Make sure all code chunks produce visible output and all questions are answered in complete sentences below each code chunk.
Due: Before the next class session.
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_2.5.0 ggeffects_2.3.2 broom_1.0.11 plotly_4.12.0
## [5] kableExtra_1.4.0 knitr_1.51 here_1.0.2 haven_2.5.5
## [9] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [13] purrr_1.2.1 readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
## [17] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.9.0 htmlwidgets_1.6.4
## [5] insight_1.4.4 lattice_0.22-7 tzdb_0.5.0 crosstalk_1.2.2
## [9] vctrs_0.7.1 tools_4.5.2 generics_0.1.4 pkgconfig_2.0.3
## [13] Matrix_1.7-4 data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1
## [17] gt_1.3.0 lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
## [21] textshaping_1.0.4 janitor_2.2.1 snakecase_0.11.1 litedown_0.9
## [25] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12 lazyeval_0.2.2
## [29] pillar_1.11.1 jquerylib_0.1.4 cachem_1.1.0 nlme_3.1-168
## [33] commonmark_2.0.0 tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
## [37] labeling_0.4.3 splines_4.5.2 rprojroot_2.1.1 fastmap_1.2.0
## [41] grid_4.5.2 cli_3.6.5 magrittr_2.0.4 cards_0.7.1
## [45] withr_3.0.2 scales_1.4.0 backports_1.5.0 timechange_0.3.0
## [49] rmarkdown_2.30 httr_1.4.7 otel_0.2.0 hms_1.1.4
## [53] evaluate_1.0.5 viridisLite_0.4.2 mgcv_1.9-3 markdown_2.0
## [57] rlang_1.1.7 glue_1.8.0 xml2_1.5.2 svglite_2.2.2
## [61] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1
## [65] fs_1.6.6